What is on-call?

On-call is the practice of designating specific people to be available at specific times to respond to urgent service issues, even when they are not formally on duty.

On-call is a critical responsibility in many IT, developer, support, and operations teams that run services where customers expect 24/7 availability. Team members take turns staffing an on-call rotation, either providing coverage around the clock or only outside of normal business hours. Supported by automated monitoring and alerting, the on-call engineer is empowered to respond immediately to any interruption in service availability.

On-call is a reflection of our entire engineering culture and skills: how our services are built, tested, run, monitored, maintained, and debugged.

It reflects both skill (“prowess”) and culture and priorities. The resilience of the systems themselves, monitoring and alerting, automation, and time to recovery, together with how these are prioritised, measured, and iterated on, reflect the quality of management and prioritisation, in other words our engineering culture.

You build it, you run it - you own it

“I can’t wait to spend my evening overseeing this deployment and responding to potential outages!” —said no engineer, ever.

Developers need to be responsible for not just writing code but also for managing the entire life-cycle of the service, ensuring its health, maintainability, observability, ease of debugging and its ultimate graceful demise. This includes being responsible for deployments, rollbacks, monitoring and debugging, in addition to bug fixes and new feature development.

With more developers maintaining the services they build, it’s important to make sure they are prepared for on-call responsibilities. A sustainable on-call rotation is only possible if the engineers building the system prioritise designing reliability into it. Reliability isn’t birthed in an on-call shift.

On-call at Orbit

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6c3d79c1-20f3-499f-9270-b9193ac86553/stay-calm-and-submit-a-service-desk-ticket.jpg

<aside> 👍 At Orbit, on-call is voluntary and paid.

</aside>

As the guardians of production systems, on-call engineers take care of their assigned operations by managing outages that affect the team and performing and/or vetting production changes.

When on-call, an engineer is available to perform operations on production systems within minutes, according to the paging response times agreed to by the team and the business system owners. Typical values are 5 minutes for user-facing or otherwise highly time-critical services, and 30 minutes for less time-sensitive systems. The company provides the page-receiving device, which is typically a phone. Orbit has flexible alert delivery systems that can dispatch pages via multiple mechanisms (email, SMS, robot call, app) across multiple devices.

Response times are tied to the desired service availability. As a simplified example: if a user-facing system must achieve four nines of availability (99.99%) in a given quarter, the allowed quarterly downtime is around 13 minutes (Availability Table). This constraint implies that the reaction time of on-call engineers has to be on the order of minutes (strictly speaking, at most 13 minutes).
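The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration: the function name `allowed_downtime_minutes` is hypothetical, and the 90-day quarter is an assumption (real quarters run 90 to 92 days, which shifts the result slightly).

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 90) -> float:
    """Return the downtime budget, in minutes, for a target availability.

    availability_pct: target availability as a percentage, e.g. 99.99
    period_days: length of the measurement window (assumed 90 days per quarter)
    """
    total_minutes = period_days * 24 * 60          # minutes in the window
    unavailable_fraction = 1 - availability_pct / 100
    return total_minutes * unavailable_fraction


if __name__ == "__main__":
    # Compare a few common availability targets over one assumed 90-day quarter.
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes of downtime per quarter")
```

For 99.99% over 90 days this works out to roughly 12.96 minutes, which is the "around 13 minutes" quoted above. Because that budget covers detection, response, and repair for every incident in the quarter, the paging response time has to be a small fraction of it.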

As soon as a page is received and acknowledged, the on-call engineer is expected to triage the problem and work toward its resolution, possibly involving other team members and escalating as needed.

Non-paging production events, such as lower-priority alerts or software releases, can also be handled and/or vetted by the on-call engineer during business hours. These activities are less urgent than paging events, which take priority over almost every other task, including project work.

It’s important that on-call engineers understand that they can rely on several resources that make being on-call at Orbit less daunting than it may seem.

The most important on-call resources are: