Overview

On 2024-02-20 starting at 18:03 UTC, Stytch APIs experienced approximately 14 minutes of downtime due to an errant code change. This document outlines what caused the downtime and what we’re doing to prevent similar incidents going forward.

Timeline

| Time (UTC) | Event |
| --- | --- |
| 18:03 | A deploy to our production environment containing an errant code change is completed. The Stytch API begins returning 500 errors. |
| 18:04 | Alerts fire and our oncall engineer is paged; time to alert was approximately 30 seconds. |
| 18:06 | Our oncall engineers identify the cause of the issue; time to identify the issue was approximately 3 minutes. |
| 18:07 | We initiate a rollback to resolve the issue; time to initiate recovery was approximately 4 minutes from incident start. The force deploy encountered an issue that necessitated a manual sync. |
| 18:14 | We initiate a manual sync in order to correct the failed rollback. |
| 18:18 | The manual sync completes, recovery is complete; time to full recovery was approximately 14 minutes. |

Causes

This incident was caused by an errant code change that was deployed to our production environment. The bug in that change caused our event queuing to fail on /authenticate calls, that is, any action in our API that authenticates a session or completes an authentication flow. Those calls returned 500 errors until the change was reverted.
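
As an illustration only, the sketch below shows how a failure in a secondary step such as event queuing can fail an otherwise successful request when it runs inline on the request path. This is not Stytch's code: `handle_authenticate`, `enqueue_event`, `QueueError`, and the `queue_healthy` flag are hypothetical names invented for the example.

```python
from dataclasses import dataclass


class QueueError(Exception):
    """Raised by the hypothetical event queue when an event cannot be enqueued."""


@dataclass
class Response:
    status: int
    body: str


def enqueue_event(event: dict, queue_healthy: bool) -> None:
    # Hypothetical stand-in for publishing an event to a queue.
    if not queue_healthy:
        raise QueueError("could not enqueue event")


def handle_authenticate(session_token: str, queue_healthy: bool) -> Response:
    try:
        # The authentication itself succeeds in this sketch.
        user_id = f"user-for-{session_token}"
        # The event is queued inline, so a queuing failure fails the request.
        enqueue_event({"type": "authenticate", "user_id": user_id}, queue_healthy)
        return Response(200, user_id)
    except Exception:
        # Any unhandled error on the request path surfaces as a 500.
        return Response(500, "internal error")
```

With `queue_healthy=True` the handler returns a 200; with `queue_healthy=False` the same valid token produces a 500, even though the authentication step itself did not fail.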

We have a large set of internal tests, including end-to-end tests, that must complete successfully before code changes can be deployed to production. However, the change in question was related to billing behavior that is only set up in our production environment. Our existing tests run in our staging environment, where that behavior is mocked out, so they did not cover it.
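
The general failure mode, where a mock lets a test pass while the real dependency behaves differently, can be sketched as follows. Everything here is hypothetical and chosen only for illustration: `BillingClient`, `plan_for`, `build_event`, and the payload shapes are assumptions, not a description of the actual bug or of Stytch's systems.

```python
from unittest import mock


class BillingClient:
    """Hypothetical client for a billing service that exists only in production."""

    def plan_for(self, project_id: str) -> dict:
        # Assumed production payload shape: the plan name is nested.
        return {"plan": {"name": "pro"}}


def build_event(billing, project_id: str) -> dict:
    plan = billing.plan_for(project_id)
    # Bug: treats the response as flat, which only matches the mock below.
    return {"project_id": project_id, "plan": plan["name"]}


def run_staging_test() -> dict:
    # Staging-style test: billing is mocked with a simplified, flat payload.
    fake_billing = mock.Mock()
    fake_billing.plan_for.return_value = {"name": "pro"}
    return build_event(fake_billing, "project-123")
```

In this sketch `run_staging_test()` succeeds, while `build_event(BillingClient(), "project-123")` raises a `KeyError`, because the mocked payload and the real payload have different shapes. A test suite that only ever sees the mock cannot catch that class of bug.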

Our oncall engineer was alerted when the API started returning 500 errors and recovery efforts began immediately. We generally expect a revert to take under five minutes, but while reverting this change, we encountered an unrelated infrastructure issue that prevented a successful force deploy. A recent change meant that our action runner didn't have appropriate permissions to trigger the deploy, and a manual sync was required.

Action items