The Incident Lifecycle

In an always-on world, companies look to systems and processes to keep their services up and running at all times. The most important part of maintaining this uptime is having an Incident Management process in place to restore your services in the event of an interruption or unplanned downtime.

An Incident Management process defines how a team responds to incidents that affect its services and how it works to restore them.

Incidents that disrupt services are unavoidable. But every breakdown is an opportunity to learn & improve.

Every incident in the system infrastructure helps product development & engineering teams better understand the capabilities and limits of the system architecture. This, in turn, helps us build a more sustainable and reliable product.


Incident detection & classification

Incidents are identified through reports from monitoring systems, or by manual identification. Once an incident is identified, it is logged. An incident log can be used to validate that all incidents have been addressed and to identify trends. At this point, the incident is categorised by adding additional information like severity, functional area, and ownership.
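
As a rough illustration of the logging and categorisation step described above, here is a minimal Python sketch of an incident log. The class names, field names, and example values (Incident, IncidentLog, functional_area, and so on) are assumptions made for this example, not part of any particular tool.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    """One logged incident. All field names are illustrative assumptions."""
    title: str
    source: str                              # "monitoring" or "manual"
    severity: Optional[int] = None           # set during categorisation
    functional_area: Optional[str] = None    # set during categorisation
    owner: Optional[str] = None              # set during categorisation
    resolved: bool = False
    detected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class IncidentLog:
    """In-memory log; a real system would persist this somewhere durable."""

    def __init__(self) -> None:
        self._incidents: list[Incident] = []

    def record(self, incident: Incident) -> Incident:
        """Log a newly identified incident."""
        self._incidents.append(incident)
        return incident

    def categorise(self, incident: Incident, severity: int,
                   functional_area: str, owner: str) -> None:
        """Attach severity, functional area, and ownership to an incident."""
        incident.severity = severity
        incident.functional_area = functional_area
        incident.owner = owner

    def unaddressed(self) -> list[Incident]:
        """Incidents not yet resolved, to validate that nothing was missed."""
        return [i for i in self._incidents if not i.resolved]

    def trend_by_area(self) -> Counter:
        """Count categorised incidents per functional area to surface trends."""
        return Counter(
            i.functional_area for i in self._incidents if i.functional_area
        )
```

A log like this supports both uses mentioned above: unaddressed() checks that every incident has been dealt with, and trend_by_area() gives a simple view of where incidents cluster.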

Incident alerting

This stage is about notifying the right people to address an incident. It may also involve assigning tasks and carrying out escalation procedures. In a high-pressure situation, how easily you can communicate can make or break customer perception and, ultimately, affect your bottom line.
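
One way to picture an escalation procedure is as an ordered list of contacts, each notified once an alert has gone unacknowledged for a set time. The sketch below assumes this model; the contact names and time thresholds are hypothetical placeholders, not a recommended policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    contact: str
    after_minutes: int  # notify once the alert is unacknowledged this long


# Hypothetical policy: names and thresholds are placeholders.
POLICY = [
    EscalationStep("on-call engineer", 0),
    EscalationStep("secondary on-call", 10),
    EscalationStep("engineering manager", 30),
]


def contacts_to_notify(policy: list[EscalationStep],
                       minutes_unacknowledged: int) -> list[str]:
    """Return every contact whose threshold has been reached, in order."""
    return [
        step.contact
        for step in policy
        if step.after_minutes <= minutes_unacknowledged
    ]
```

For example, contacts_to_notify(POLICY, 15) returns the on-call engineer and the secondary on-call, because only their thresholds have passed.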

Incident assessment

Based on the business impact of the incident, we assign one of the following severity levels.

Severity 1. A critical incident with very high impact.
- A customer-facing service, like login to Orbit app, is down for all customers.
- Confidentiality or privacy is breached.
Severity 2. A major incident with significant impact.
- A customer-facing service is unavailable for a subset of customers.
- Core functionality (accept invitation) is significantly impacted.
Severity 3. A minor incident with low impact.
- A minor inconvenience to customers, workaround available.
- Usable performance degradation.
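
The severity levels above can be sketched as a small decision function. This is only an illustration: the Impact fields are assumed names for the observations a responder might record, and real assessments usually involve judgement that a few booleans cannot capture.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # critical incident, very high impact
    SEV2 = 2  # major incident, significant impact
    SEV3 = 3  # minor incident, low impact


@dataclass
class Impact:
    """Hypothetical summary of what responders observed."""
    customer_facing_down: bool = False
    all_customers_affected: bool = False
    privacy_breach: bool = False
    core_functionality_impaired: bool = False


def assess(impact: Impact) -> Severity:
    """Map observed impact to a severity level, most severe rule first."""
    # Severity 1: confidentiality or privacy breach, or a
    # customer-facing service down for all customers.
    if impact.privacy_breach:
        return Severity.SEV1
    if impact.customer_facing_down and impact.all_customers_affected:
        return Severity.SEV1
    # Severity 2: service unavailable for a subset of customers,
    # or core functionality significantly impacted.
    if impact.customer_facing_down or impact.core_functionality_impaired:
        return Severity.SEV2
    # Severity 3: minor inconvenience or usable performance degradation.
    return Severity.SEV3
```

Checking the most severe conditions first means an incident that matches several rules is always assigned the highest applicable severity.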

Incident roles

Once you establish the impact of the incident, adjust or confirm the severity of the incident issue and communicate that severity to the team.

The roles and responsibilities during an incident are:

Incident Manager. Each incident is driven by the incident manager (IM), who has overall responsibility for coordinating restoration of service and communication tasks for the incident. During a major incident, the incident manager is empowered to take any action necessary to resolve the incident, which includes paging additional people in the organization and keeping those involved focused on restoring service as quickly as possible.
Tech Lead. A senior technical responder, responsible for developing theories about what’s broken and why, deciding on changes, and running the technical team. Works closely with the IM.