On October 24th, 2022, at 14:33 UTC, we initiated a planned backend change to our PostgreSQL database schema, changing the default value of some datetime columns from CURRENT_TIMESTAMP to CURRENT_TIMESTAMP(6) (a no-op change that our CDC system needs to track schemas properly). We inadvertently included some columns that had no default value in this change, so rows in those columns were incorrectly assigned a default timestamp. As a result, webhooks were not delivered and were erroneously reported as archived for being past retention for 1 hour and 17 minutes, between 14:33 UTC and 15:50 UTC.
We resumed delivery at 15:50 UTC, and the backlog was cleared shortly after.
No webhooks were lost during that outage.
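The distinction between columns that already default to CURRENT_TIMESTAMP and columns with no default at all can be checked mechanically before a migration is generated. The following is a minimal sketch of such a guard, not the tooling we actually use: the `Column` record, the `OLD_DEFAULTS` set, the `plan_default_change` helper, and the table and column names in the usage note are all hypothetical, and the generated statements assume PostgreSQL's ALTER TABLE syntax.

```python
from typing import Iterable, List, NamedTuple, Optional


class Column(NamedTuple):
    """Hypothetical description of one column, e.g. read from a schema catalog."""
    table: str
    name: str
    default: Optional[str]  # None means the column has no default value


# Assumed spellings of the old default; adjust to how your catalog reports it.
OLD_DEFAULTS = {"current_timestamp", "now()"}


def quote_ident(ident: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + ident.replace('"', '""') + '"'


def plan_default_change(columns: Iterable[Column]) -> List[str]:
    """Return ALTER statements only for columns whose default is the old one.

    Columns with no default, or with an unrelated default, are skipped so the
    migration never introduces a default where none existed.
    """
    statements = []
    for col in columns:
        if col.default is None:
            continue  # no default today: must stay that way
        if col.default.strip().lower() not in OLD_DEFAULTS:
            continue  # some other default: out of scope for this change
        statements.append(
            f"ALTER TABLE {quote_ident(col.table)} "
            f"ALTER COLUMN {quote_ident(col.name)} "
            "SET DEFAULT CURRENT_TIMESTAMP(6);"
        )
    return statements
```

Given a hypothetical `message` table with a `created_at` column defaulting to CURRENT_TIMESTAMP and an `archived_at` column with no default, only `created_at` would produce a statement.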
Timeline
- 14:32 UTC: Schema changes deployed to the database
- 14:41 UTC: Received the first customer report that webhooks were showing as past their retention date and that there were delivery issues
- 14:41 UTC - 15:15 UTC: Diagnosed the issue and found its cause
- 15:15 UTC: Paused message ingestion and delivery
- 15:22 UTC: Reverted schema changes
- 15:35 UTC: Started delivering messages again
- 15:40 UTC: Replayed webhooks that were incorrectly reported as archived, and updated other rows that were incorrectly assigned a default timestamp
- 15:50 UTC: All audits passing; the system recovered
Lessons learned
What went well
- No webhooks were lost during the outage. The most significant part of the impact was that the system treated incoming webhooks as archived when they were not, so they were processed as ‘DATA_ARCHIVED’ (event is past plan’s retention date). This was corrected: the affected webhooks were replayed and delivered. The main impact was increased delivery latency.
What went wrong
- A schema change that should not have been deployed reached the backend: it added default values to columns that had none.
- The schema change was difficult to apply in the opposite direction (dropping the default value) because of locking and concurrency, so we had to pause webhook delivery to fix the issue.
- In the Dashboard, some attempts impacted by this incident may appear as failed with a ‘DATA_ARCHIVED’ reason. Each of these is followed by a successful delivery attempt.
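To illustrate the locking problem behind the difficult revert: in PostgreSQL, dropping a column default needs a strong lock on the table, so the statement can queue behind in-flight writes and block the traffic behind it. One common mitigation, which is not what we did during this incident, is to bound the wait with `lock_timeout` and retry. The sketch below only generates the statements; the `plan_default_revert` helper, its timeout value, and the table and column names in the usage note are hypothetical.

```python
from typing import Iterable, List, Tuple


def quote_ident(ident: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + ident.replace('"', '""') + '"'


def plan_default_revert(
    columns: Iterable[Tuple[str, str]],
    lock_timeout_ms: int = 2000,  # assumed value; tune for your workload
) -> List[str]:
    """Return one short transaction per (table, column) that drops the default.

    SET LOCAL scopes lock_timeout to the transaction, so the ALTER fails fast
    instead of waiting indefinitely for the table lock; the caller can retry.
    """
    statements = []
    for table, column in columns:
        statements.extend([
            "BEGIN;",
            f"SET LOCAL lock_timeout = '{lock_timeout_ms}ms';",
            f"ALTER TABLE {quote_ident(table)} "
            f"ALTER COLUMN {quote_ident(column)} DROP DEFAULT;",
            "COMMIT;",
        ])
    return statements
```

Calling `plan_default_revert([("message", "archived_at")])` would yield a single four-statement transaction for that hypothetical column. Keeping each column in its own transaction means one lock timeout does not roll back the reverts that already succeeded.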
Corrective actions