Summary

A large volume of notifications to another organization caused that organization's Slack rate limits to be hit. While these notification tasks were backing off and retrying, other tasks were stuck pending and were delayed.

Opal was alerted to the high pending task count and scaled out task workers to clear the backlog.

Impact

Root cause analysis

An organization had a high volume of access expiry notifications sent out at the same time. While sending these notifications, we hit the organization's Slack rate limits, which caused the notification tasks to retry until they succeeded or reached the maximum retry time.

While these notification tasks were processing, other tasks in the queue could not be picked up. This also delayed propagations, since they share the same queue.
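
The queueing effect described above can be illustrated with a minimal sketch. It is not Opal's worker implementation: the function name, worker counts, and durations are all hypothetical, and it assumes a single FIFO queue in which a task holds its worker for the whole time it spends retrying.

```python
import heapq

def start_times(durations, workers):
    """Return the time at which each FIFO task starts, given `workers` parallel workers."""
    free_at = [0.0] * workers  # min-heap of the times at which each worker becomes free
    heapq.heapify(free_at)
    starts = []
    for duration in durations:
        t = heapq.heappop(free_at)             # earliest available worker takes the next task
        starts.append(t)
        heapq.heappush(free_at, t + duration)  # worker stays busy until the task finishes
    return starts

# Hypothetical queue: four notification tasks that each spend 30s retrying in place,
# followed by one propagation task that takes 1s.
queue = [30.0, 30.0, 30.0, 30.0, 1.0]

blocked = start_times(queue, workers=2)[-1]  # propagation waits behind the retrying tasks
scaled = start_times(queue, workers=5)[-1]   # after scaling out, a worker is free immediately
print(blocked, scaled)
```

In this model the propagation task cannot start until the retrying notification tasks release a worker, and adding workers removes the wait, which matches the effect of scaling out task workers during the incident.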

Actions taken

Timeline

Next steps

Immediate