In an always-on world, companies look to systems and processes to keep their services up and running at all times. The most important part of maintaining this uptime is having an Incident Management process in place to restore your services in the event of an interruption or unplanned downtime.
Incident Management processes are typically used to respond to incidents that affect services and work on restoring their uptime.
Incidents that disrupt services are unavoidable. But every breakdown is an opportunity to learn & improve.
Every incident in the system infrastructure helps product development & engineering teams better understand the capabilities of the system architecture. This, in turn, helps us build a more sustainable and reliable product.
Incidents are identified through reports from monitoring systems or by manual identification. Once an incident is identified, it is logged. An incident log can be used to validate that all incidents have been addressed and to identify trends. At this point, the incident is categorised by adding information such as severity, functional area, and ownership.
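The logging and categorisation step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed schema: the `Incident` and `IncidentLog` names, the field names, and the example values are all assumptions made for the sketch.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    """One logged incident (hypothetical record shape)."""
    title: str
    source: str  # "monitoring" or "manual"
    severity: Optional[str] = None
    functional_area: Optional[str] = None
    owner: Optional[str] = None
    resolved: bool = False
    logged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class IncidentLog:
    """In-memory incident log; a real one would be persisted."""

    def __init__(self):
        self.incidents = []

    def log(self, title, source):
        # Logging happens first, before any categorisation.
        incident = Incident(title=title, source=source)
        self.incidents.append(incident)
        return incident

    def categorise(self, incident, severity, functional_area, owner):
        # Categorisation adds severity, functional area, and ownership.
        incident.severity = severity
        incident.functional_area = functional_area
        incident.owner = owner

    def unresolved(self):
        # Used to validate that every incident has been addressed.
        return [i for i in self.incidents if not i.resolved]

    def trend_by_area(self):
        # Counts incidents per functional area to surface trends.
        return Counter(
            i.functional_area for i in self.incidents if i.functional_area
        )
```

Keeping logging and categorisation as separate calls mirrors the order described above: the incident is recorded as soon as it is identified, and the extra detail is attached afterwards.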
This stage is about notifying the right people to address an incident. It may also involve assigning tasks and performing escalation procedures. In a high-pressure situation, the ease with which you can communicate can make or break customer perception and, ultimately, the impact on your bottom line.
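One way to picture an escalation procedure is as a list of time thresholds, each adding another person or group to notify while the incident stays unacknowledged. The sketch below assumes this time-based model; the role names, the thresholds, and the `who_to_notify` helper are hypothetical and would differ per organisation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the incident has gone unacknowledged
    notify: str         # who gets notified at this step


# Hypothetical policy, ordered from first responder upwards.
POLICY = [
    EscalationStep(0, "primary-oncall"),
    EscalationStep(10, "secondary-oncall"),
    EscalationStep(30, "engineering-manager"),
]


def who_to_notify(minutes_unacknowledged, policy=POLICY):
    """Return everyone who should have been notified by now."""
    return [
        step.notify
        for step in policy
        if minutes_unacknowledged >= step.after_minutes
    ]
```

For example, `who_to_notify(15)` returns `["primary-oncall", "secondary-oncall"]` under this policy.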
Once notified, incident responders gather information about the incident using observability tools. This information is used to build a hypothesis about the probable cause of the incident and to decide on a fix.
A crucial form of incident classification is prioritisation. This helps the on-call team understand the severity of the issue at first glance. The prioritisation matrix should always be linked to service and customer impact. This gives the on-call team the clarity needed to understand the situation.
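A prioritisation matrix tied to service and customer impact could look like the lookup table below. The three impact levels, the `P1` to `P4` labels, and the specific mapping are assumptions chosen for illustration; each team should define its own.

```python
# Outer key: service impact. Inner key: customer impact.
# The mapping itself is a hypothetical example.
PRIORITY_MATRIX = {
    "high":   {"high": "P1", "medium": "P1", "low": "P2"},
    "medium": {"high": "P1", "medium": "P2", "low": "P3"},
    "low":    {"high": "P2", "medium": "P3", "low": "P4"},
}


def prioritise(service_impact, customer_impact):
    """Look up the priority for a pair of impact levels."""
    try:
        return PRIORITY_MATRIX[service_impact][customer_impact]
    except KeyError:
        raise ValueError(
            "impact levels must be 'low', 'medium', or 'high'"
        ) from None
```

Writing the matrix out explicitly, rather than computing a score, keeps it easy for the on-call team to read at a glance and easy to adjust when the definitions change.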
The responder team applies the fix proposed in the previous step and, typically, observes the system for a little while to confirm that the incident has been resolved.
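The observation period after a fix can be partly automated. The sketch below assumes the team has some health check it can call repeatedly (passed in as `check`, a hypothetical callable returning `True` when the service looks healthy) and treats the incident as resolved only after several consecutive healthy results.

```python
import time


def confirm_resolved(check, required_successes=5,
                     interval_seconds=60.0, sleep=time.sleep):
    """Return True only if `check` passes `required_successes` times in a row.

    `check` is any zero-argument callable returning a bool. `sleep` is
    injectable so the function can be exercised without real waiting.
    """
    for attempt in range(required_successes):
        if not check():
            # Any failed check means the fix has not held.
            return False
        if attempt < required_successes - 1:
            sleep(interval_seconds)
    return True
```

The number of checks and the interval between them are judgment calls; a longer window gives more confidence at the cost of keeping the incident open longer.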
Automate as much as possible. Little steps go a long way.
Document every attempt at resolution or mitigation as soon as you have taken it. What you perceive to be a small problem might be a significant one for someone else on your team.