Note: Our post-mortems are conducted in the Five Whys format, which is useful for exploring beyond surface-level issues and identifying the deeper root cause.

Timeline

Time (PT) Event
12:08 pm Support team first notices that prod is unresponsive.
12:10 pm Incident response protocol is triggered. Engineers attempt to connect to the database but are unable to.
12:16 pm Database is restarted, but this does not restore responsiveness.
12:18 pm Investigation into health metrics shows that the DB has run out of disk space.
12:32 pm Manual disk resize is attempted, but is blocked by an in-progress storage optimization task (see the first sketch after this timeline).
12:36 pm Engineering connects with AWS support, which confirms that the optimization cannot be canceled and is only about 70% complete.
12:40 pm Read replica backup and restore is started.
1:10 pm Promotion of the replica is identified as a possible solution, but is blocked by the in-progress backup.
1:20 pm AWS support is contacted again and confirms that the backup cannot be canceled and is only about 40% complete.
2:46 pm Replica backup completes.
2:46 pm Replica promotion initiated.
2:47 pm Engineering confirms production database is responsive and can execute queries.
2:48 pm Engineering updates the Helm charts to point at the new database (see the second sketch after this timeline).
2:48 pm app.hex.tech is available to end users again.
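
For context on the 12:18–12:32 pm steps, here is a minimal sketch of how the free-storage check and the attempted storage resize might look with boto3. The instance identifier hex-prod-db and the new storage size are hypothetical, not the values used during the incident; RDS rejects the resize while a previous storage optimization is still running, which is the block we hit.

```python
# Sketch of the 12:18-12:32 pm diagnosis and attempted fix (boto3; identifiers are hypothetical).
from datetime import datetime, timedelta, timezone

import boto3

DB_INSTANCE = "hex-prod-db"  # hypothetical instance identifier

cloudwatch = boto3.client("cloudwatch")
rds = boto3.client("rds")

# 12:18 pm: check the RDS FreeStorageSpace metric to confirm the disk is full.
now = datetime.now(timezone.utc)
metrics = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="FreeStorageSpace",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Minimum"],
)
print(min(dp["Minimum"] for dp in metrics["Datapoints"]))  # free storage, in bytes

# 12:32 pm: attempt a manual storage increase. This is the call that was blocked,
# because a prior storage optimization was still in progress on the instance.
rds.modify_db_instance(
    DBInstanceIdentifier=DB_INSTANCE,
    AllocatedStorage=1000,  # new size in GiB; value is illustrative
    ApplyImmediately=True,
)
```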
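And a sketch of the 2:46–2:48 pm recovery path, again under assumed names: the read replica is promoted to a standalone primary, its new endpoint is read back, and the application's Helm release is repointed at it. The replica identifier, Helm release, chart path, and values key are all illustrative, not the actual ones.

```python
# Sketch of the 2:46-2:48 pm recovery path (boto3 + helm; identifiers are hypothetical).
import subprocess

import boto3

REPLICA = "hex-prod-db-replica"  # hypothetical read replica identifier

rds = boto3.client("rds")

# 2:46 pm: promote the read replica to a standalone primary and wait for it to come up.
rds.promote_read_replica(DBInstanceIdentifier=REPLICA)
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=REPLICA)

# 2:47 pm: confirm the promoted instance is available and read back its endpoint.
instance = rds.describe_db_instances(DBInstanceIdentifier=REPLICA)["DBInstances"][0]
endpoint = instance["Endpoint"]["Address"]

# 2:48 pm: repoint the application at the new database via its Helm release.
# Release name, chart path, and values key are assumptions for illustration.
subprocess.run(
    [
        "helm", "upgrade", "hex-app", "charts/hex-app",
        "--reuse-values",
        "--set", f"database.host={endpoint}",
    ],
    check=True,
)
```
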

Impact

Production was unavailable to end users for 2 hours and 40 minutes, interrupting critical customer workflows. During the initial mitigation and the subsequent fix, Fivetran sync was disabled, leaving internal data stale for about a week. Between the initial incident response and subsequent follow-ups, the engineering and support teams lost around 40 hours of productive time.

Whys

Action items