A few weeks ago, we sent out a report outlining the issues that CrewFire was experiencing with our notification system and our plan to fix them.
As discussed before, on and off for the last three months, our activity/notification queue has been clogging, which results in notifications going out with extreme delays, going out multiple times, or in some cases not going out at all.
Our developers have deployed various fixes to try to keep the queue at 0, meaning notifications go out instantly as they are supposed to. However, last Thursday, our developers discovered an issue with the deployment process, which was also causing delays.
Last Thursday night, we deployed a fix for our deployment process, and then we started running tests to find the root cause of the queue getting backed up.
We understand that notifications are a vital part of our product, and that when they don't work, your team and your ambassadors are frustrated. We sincerely apologize for this prolonged issue.
But we do want to share a bit of good news. After testing a few different solutions, our CTO believes he has found the core issue causing the queue to become backed up.
After examining the worker process logs, we found that one task was taking 3500-4000 ms to complete. This straightforward issue was slowing down the processing of notifications in the queue and ultimately bringing them to a halt.
Our developer added an index on the relevant column of the post table, and the terminal output immediately switched from a snail's pace (one line every few seconds) to a deluge of non-stop activity.
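For the technically curious, here is a minimal sketch of why an index can make this kind of difference. It is an illustration only, not our production code: it uses an in-memory SQLite database, and the schema, the `campaign_id` column, and the index name are all hypothetical stand-ins for the real post table and column, which are not named in this update. Without the index, the database has to scan every row to answer the lookup; with it, the database can jump straight to the matching rows.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable steps SQLite plans to use for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last of the four columns in each row is the plan description.
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")

# Hypothetical schema: a stand-in for the real post table.
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, campaign_id INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO post (campaign_id, body) VALUES (?, ?)",
    [(i % 1000, "post %d" % i) for i in range(5000)],
)

# Hypothetical lookup of the kind a notification worker might run.
lookup = "SELECT id FROM post WHERE campaign_id = ?"

# Before the index: the plan is a scan over the whole table.
plan_before = query_plan(conn, lookup, (42,))

# Add an index on the column used in the WHERE clause.
conn.execute("CREATE INDEX idx_post_campaign_id ON post (campaign_id)")

# After the index: the plan is a search that uses the new index.
plan_after = query_plan(conn, lookup, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first plan should describe a scan of the table and the second a search using `idx_post_campaign_id`. Other databases expose the same idea through their own `EXPLAIN` commands.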
In the graph below, you can see the activity pick up as soon as he added the index: