Summary

On March 18, 2026, Opal customers experienced delays in access request reviewer notifications due to a task queue backup. A configuration change by one customer organization unexpectedly triggered a large burst of notification tasks in a single sync cycle, overwhelming one of our task processing queues. As a result, all tasks in that queue — including notifications across all cloud organizations — were delayed while our on-call team worked to identify and resolve the issue. The incident was fully resolved by approximately 8:05 PM PDT.

Impact

All customers experienced delays in access request reviewer notification delivery from approximately 2:27 PM – 5:45 PM PDT (~2.25 hours) on March 18, 2026.
Reviewer notifications sent during this window were delayed. Some notifications for the affected customer organization were removed from the queue during mitigation and were never delivered.
The delays were isolated to the following tasks:
- notification delivery
- audit ticket synchronization

Severity: Sev 2

Incident Severity Level	Description	Team Response	Customer Post Mortem SLA
1	A major issue with very high impact	Work until the incident is mitigated	3 days
2	A significant problem	Work extended business hours, but all meetings should be skipped	1 week
3	A minor problem	Work during business hours, can attend meetings as usual	As needed

This incident was classified as a Sev 2. Task processing was degraded for all customers for approximately 2.5 hours, with the primary user-visible impact being delayed or dropped reviewer notifications for access requests. No access provisioning or security controls were affected.

Root cause analysis

A configuration change made by one customer organization caused approximately 39,000 request_created_for_reviewer notification tasks to be enqueued in a single sync cycle. Because the queue is shared across organizations and tasks are not sufficiently isolated from one another, this burst exceeded the capacity of our task workers. As a contributing factor, recent code changes to our monitoring infrastructure significantly delayed the metrics that drive autoscaling, which limited our ability to add new workers. Together, these factors left too few workers to process notification tasks, and delays persisted until the queue was cleared.
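
The effect of a single burst on a shared queue can be illustrated with a small simulation. This is only a sketch of first-in-first-out behavior, not a description of Opal's actual queue implementation. The 39,000 figure comes from this report. The throughput value, the organization names, and the helper function are hypothetical and chosen only for illustration.

```python
def minutes_until_done(queue, tasks_per_minute):
    """Map each org to the minute its last task leaves a FIFO queue."""
    done = {}
    for position, org in enumerate(queue, start=1):
        # A task at position N finishes after N tasks have been processed.
        done[org] = position / tasks_per_minute
    return done


BURST_TASKS = 39_000      # burst size stated in this report
TASKS_PER_MINUTE = 300    # hypothetical combined worker throughput

# Shared queue: org_b's single notification sits behind org_a's burst.
shared = ["org_a"] * BURST_TASKS + ["org_b"]
shared_wait = minutes_until_done(shared, TASKS_PER_MINUTE)["org_b"]

# Isolated queue: org_b's notification has nothing ahead of it.
isolated_wait = minutes_until_done(["org_b"], TASKS_PER_MINUTE)["org_b"]

print(f"shared queue:   {shared_wait:.1f} min")
print(f"isolated queue: {isolated_wait:.3f} min")
```

Under these assumed numbers, the unrelated organization waits more than two hours in the shared queue and a fraction of a second in an isolated one. Adding workers raises the throughput and shortens the wait, which is why delayed autoscaling metrics made the backlog last longer.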

Actions taken

Ran a bulk-archive operation to remove the ~39,000 flood tasks from the active queue, freeing capacity for all other customers.
Corrected the behaviour of our monitoring infrastructure so that task workers autoscale correctly.
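
As a rough illustration of what a bulk-archive operation does, the sketch below splits a queue into tasks to keep and tasks to archive, matching on organization and task type. The Task fields, the bulk_archive function, and the organization and audit task names are hypothetical. Only the request_created_for_reviewer task type is taken from this report, and the real operation may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    org_id: str      # hypothetical field: owning organization
    task_type: str   # hypothetical field: kind of work to perform


def bulk_archive(queue, org_id, task_type):
    """Split a queue into (kept, archived) lists, preserving order."""
    kept, archived = [], []
    for task in queue:
        if task.org_id == org_id and task.task_type == task_type:
            archived.append(task)   # removed from the active queue
        else:
            kept.append(task)       # still processed normally
    return kept, archived


queue = (
    [Task("org_a", "request_created_for_reviewer")] * 5
    + [Task("org_b", "request_created_for_reviewer")]
    + [Task("org_a", "audit_ticket_sync")]
)

kept, archived = bulk_archive(queue, "org_a", "request_created_for_reviewer")
print(len(kept), len(archived))
```

Archived tasks are never executed, which is why some notifications for the affected organization were not delivered, while tasks for other organizations and other task types stay in the queue.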

Timeline

All times in PDT (UTC-7).

2:27 PM — Monitor detects elevated queue depth; SEV-2 declared
2:35 PM — On-call engineer identifies the source organization and task type driving the queue backup
2:45 PM — API rate limit applied to the source organization
3:05 PM — Task workers scaled up to 40 replicas