Things will go wrong. No matter what we do, some things will break. When they do, you want to be able to respond appropriately.

During the incident, you will want good incident management. Afterwards, you will want to perform a root cause analysis (RCA) and a post mortem.

Incident Management

Incident management is how you coordinate the response while an incident is under way. Done well, it helps you mitigate risk and reduce the impact on your organization's customers.

There should be systems in place to handle increased support load, manage changes to services, and coordinate fixes. The role responsible for this is known as the "incident manager on call" (IMOC): a person who is on call and pageable. When something goes wrong, they declare an incident and manage the corresponding impact.

An IMOC should have company policies and mantras to abide by. For example, "data corruption and loss are a worse impact than downtime". This allows an IMOC to make the call to purposefully bring down a service to stop data corruption from occurring. Another example may be that "over an estimated impact of $10,000 of revenue, executives must be notified and involved". This allows an IMOC to determine severity.
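The two example policies above can be written down as simple rules so that the IMOC does not have to recall them under pressure. The following is only a minimal sketch: the names `IncidentAssessment`, `must_notify_executives`, and `should_take_service_down` are hypothetical, and the threshold is taken directly from the example policy rather than from any real company rulebook.

```python
from dataclasses import dataclass

# Assumption: threshold copied from the example policy in the text.
EXEC_NOTIFY_REVENUE_USD = 10_000


@dataclass
class IncidentAssessment:
    """Hypothetical snapshot of what the IMOC knows about an incident."""
    estimated_revenue_impact_usd: float
    data_corruption_risk: bool


def must_notify_executives(assessment: IncidentAssessment) -> bool:
    # "Over an estimated impact of $10,000 of revenue, executives
    # must be notified and involved."
    return assessment.estimated_revenue_impact_usd > EXEC_NOTIFY_REVENUE_USD


def should_take_service_down(assessment: IncidentAssessment) -> bool:
    # "Data corruption and loss are a worse impact than downtime":
    # if data is at risk, deliberate downtime is the lesser evil.
    return assessment.data_corruption_risk
```

For instance, `must_notify_executives(IncidentAssessment(15_000, False))` evaluates to `True`, while an incident with an estimated impact of exactly $10,000 does not cross the threshold in this sketch.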

An IMOC handles declaring an incident, handling issues while it is ongoing, and writing up an RCA document. The document should be written as soon as the incident is over, while the context is still fresh. Writing it may be delegated to someone else who was thoroughly involved, if they have more context or the IMOC is tired. See the best section for more info. The IMOC should not be the one fixing the problems, but should rather coordinate others to fix them.

Root Cause Analysis

A root cause analysis (RCA) is a process that follows an incident. It uses the RCA document provided by the IMOC.

The document should not be prescriptive. It should provide a high-level summary of the issue, an estimated impact to your organization's customers, a list of people involved, and a timeline. It is a living document until the RCA meeting happens.
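The fields listed above could be captured in a small data structure. This is a sketch under assumptions, not a required template: `RcaDocument`, `TimelineEntry`, `add_event`, and the `finalized` flag are hypothetical names, and treating "living document until the RCA meeting" as a flag that blocks later edits is one possible interpretation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    at: datetime
    note: str


@dataclass
class RcaDocument:
    """Hypothetical shape for the RCA document described above."""
    summary: str                      # high-level summary of the issue
    estimated_customer_impact: str    # estimated impact to customers
    people_involved: List[str] = field(default_factory=list)
    timeline: List[TimelineEntry] = field(default_factory=list)
    finalized: bool = False           # assumption: set at the RCA meeting

    def add_event(self, at: datetime, note: str) -> None:
        # The document is "living" only until the RCA meeting happens.
        if self.finalized:
            raise ValueError("RCA document is finalized; no further edits")
        self.timeline.append(TimelineEntry(at, note))
        # Keep the timeline in chronological order regardless of entry order.
        self.timeline.sort(key=lambda entry: entry.at)
```

Events can be added in any order while the document is still open. Once `finalized` is set to `True`, `add_event` raises a `ValueError`.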

The RCA meeting should involve the key members of the incident. The RCA coordinator should be a person knowledgeable enough about the area to direct the discussion, but impartial and not involved in the incident. The point of the meeting is to determine the root cause, not to place blame on anyone.

This Wikipedia link goes into more depth on the RCA concept.

Root cause analysis - Wikipedia