Sunday, March 29, 2015

Production Support Incident 1. SQL Server Suspended Transaction And IO Wait Issue

These are the most critical things to check when an issue causes application downtime.

Application: CMS (Content Management System)

Technology: Custom ASP.NET

Scenario: For any CMS, caching plays an essential role. To improve the overall user experience and the responsiveness of the system, the rule of thumb and the architectural norm is that a CMS should always be initialized from a cache. The content is cached once so that there is no chatty communication with SQL Server, or with any database for that matter. This matters because on a public-facing CMS website most of the content is global and applies to all users. In such a scenario the best practice is to cache the content and the most common elements once, and reuse them throughout the website's lifecycle for the day.

So when we consider caching, the following design principles must be taken care of:
Life cycle of the cache: the age of the cached content.
Frequency and timing of content changes by business users, so that their changes are reflected during business as usual.
Warm-up caching option in IIS, to reduce the impact on users when the cache expires.
Importantly, the amount of data cached: its impact on the w3wp process in IIS, on CPU utilization, and on the heap memory consumed by the results of the SQL queries that are executed.
For a mission-critical application, keep the logic outside the application layer. Keep it in the database so that a fix can be applied quickly. If the logic is embedded in the business layer as LINQ queries within the application, consider the huge business impact and the application downtime that every fix will cause.
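
The first and third principles (cache age and warm-up) can be sketched in a few lines. This is only an illustration in Python rather than ASP.NET, and the class name ContentCache, the loader function and the 60 second age are all assumptions made for the example, not part of the CMS described here.

```python
import time
from typing import Any, Callable, Dict, Iterable, Tuple


class ContentCache:
    """Tiny in-memory cache with a maximum age per entry (hypothetical helper)."""

    def __init__(self, loader: Callable[[str], Any], max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader            # e.g. a function that runs the SQL query
        self._max_age = max_age_seconds  # the "age of caching"
        self._clock = clock              # injectable so expiry can be simulated
        self._entries: Dict[str, Tuple[float, Any]] = {}
        self.loads = 0                   # number of trips to the backing store

    def _load(self, key: str) -> Any:
        value = self._loader(key)
        self.loads += 1
        self._entries[key] = (self._clock(), value)
        return value

    def warm_up(self, keys: Iterable[str]) -> None:
        """Load common content up front so the first visitor does not pay for it."""
        for key in keys:
            self._load(key)

    def get(self, key: str) -> Any:
        entry = self._entries.get(key)
        if entry is not None:
            loaded_at, value = entry
            if self._clock() - loaded_at < self._max_age:
                return value             # still fresh: no database call
        return self._load(key)           # missing or expired: reload once
```

Calling warm_up at application start with the handful of global keys (home page, menus, footer) means that the database is hit once per key per cache lifetime instead of once per visitor. The loads counter gives a quick way to see how chatty the application still is.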

1. Quick Checks:

USE master;
GO
EXEC sp_who2 'active';
GO

A SPID with a SUSPENDED status is waiting on a resource. If the suspended transaction does not clear within about 10 seconds, treat it as a serious problem: there is a potential issue with memory or with the query completing its execution.

2. Quick Checks: ASYNC_NETWORK_IO wait in SQL Server
http://blogs.msdn.com/b/joesack/archive/2009/01/09/troubleshooting-async-network-io-networkio.aspx

3. Quick Checks: Page latch above 20
http://blogs.msdn.com/b/askjay/archive/2011/07/08/troubleshooting-slow-disk-i-o-in-sql-server.aspx
http://www.jasonstrate.com/2010/09/index-black-ops-part-2-page-io-latch-page-latch/

SELECT session_id, wait_type, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE 'PAGELATCH%';

Resolution:
Either optimize the query,
or increase the RAM on the SQL Server machine.

Thursday, March 19, 2015

Production Support Security Vulnerability Attack

Production Support

Production support is always a tough job. Development is a lean process: it follows a timeline, with process, planning and execution all within the given schedule. There is the liberty to give estimates and to plan, whereas in support planning is never possible. One can never know what comes next.
Security vulnerabilities are sometimes taken lightly in production support, and there is always a kind of disconnect among the different groups such as application, database and infrastructure support. When these groups work in a disconnected mode and the communication channel among them is not clear, there is a high chance of lapses and of support paralysis.

POODLE Attack

 Unused certificates

Check for expired SSL certificates.
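
A minimal sketch of such a check, using only the Python standard library, is shown here. The function names days_until_expiry and fetch_not_after are made up for this example. Note that with the default verification settings an already expired certificate makes the handshake fail, which is itself the answer you are looking for.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter date (negative if expired)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400.0


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate presented by host:port.

    With the default context, an expired certificate fails the handshake
    with ssl.SSLCertVerificationError instead of returning a value.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call fetch_not_after for each public host name and raise an alert when days_until_expiry drops below a threshold that suits your renewal process.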
 
Step by step of how to disable SSL v3.

Use the following site to see if your site is POODLE free.

You need to get a GRADE A after you have applied the fix.

DoS / DDoS (Distributed Denial of Service)


Look out for requests from the most common sources. Someone may be abusing your system by repeatedly calling, loading or making requests to your website. Check netstat, the IIS logs, the Windows application event log, web stats or Google Analytics for anything that indicates something is wrong with your application. These will show unusual behaviour in the system when a large share of the requests coming to your server are from a few common sources.
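
As a rough sketch of the IIS log check, the function below counts requests per client IP. It assumes the logs are in the W3C extended format, where a "#Fields:" header line names the columns and c-ip is the client address. The function name top_client_ips is invented for this example.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def top_client_ips(lines: Iterable[str], limit: int = 10) -> List[Tuple[str, int]]:
    """Count requests per client IP in W3C extended log lines."""
    counts: Counter = Counter()
    ip_index: Optional[int] = None
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header tells us which column holds the client IP.
            fields = line[len("#Fields:"):].split()
            ip_index = fields.index("c-ip") if "c-ip" in fields else None
            continue
        if not line or line.startswith("#") or ip_index is None:
            continue  # skip other directives and lines seen before a header
        parts = line.split()
        if len(parts) > ip_index:
            counts[parts[ip_index]] += 1
    return counts.most_common(limit)
```

Pass it an open log file, for example top_client_ips(open(path)), where path points at the log for the day in question. If one or two addresses account for most of the traffic, that is the source to investigate first.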

There is a chance that the login attempts of all users will be exhausted, and the user accounts are then locked. This has a huge business impact. Just imagine if this is your e-commerce, banking or financial site. The loss of a day's business would be enormous. This is why CAPTCHA was introduced in the early days of the web.


Sr.No   Period       User Sessions
1       Diwali       4,00,000
2       Christmas    2,00,000
3       Normal Day   50,000

If, for a given day and time period, the number of sessions building up in the system is growing exponentially, there is some serious activity going on in the system. Splunk, HP and other tools help you find that out.
Check the size of the IIS log. Comparing it with previous days can help you analyse the situation more clearly.
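
The comparison with previous days can be automated with something as small as the sketch below. The function name unusual_days and the factor of 3 are assumptions for illustration. The same function works for daily log sizes or for daily session counts.

```python
from statistics import median
from typing import Dict, List


def unusual_days(daily_values: Dict[str, int], factor: float = 3.0) -> List[str]:
    """Return the days whose value exceeds `factor` times the median of the other days."""
    flagged = []
    for day, value in daily_values.items():
        others = [v for d, v in daily_values.items() if d != day]
        if others and value > factor * median(others):
            flagged.append(day)
    return flagged
```

Remember to compare like with like. A festival day such as Diwali will legitimately be several times a normal day, so its baseline should be earlier festival days and not an ordinary weekday.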