Timeline
- On the 9th of March we deployed a new version of Cycle to production. Everything was working well at that time.
- At 9:18 am on the 10th of March, users started experiencing a lot of instability while using the app.
- We found that the servers were randomly restarting over and over.
- We tried to manually force a new deployment of the whole cluster with a longer health check period, but it did not succeed.
- At 10:15 am we updated the status page to warn all users about the situation and added an Intercom banner in the app linking to the status page.
- We identified that the problem was caused by our HubSpot integration and its webhooks.
- At 11:50 am we deployed a first fix to try to catch unhandled rejection errors. With the newer Node versions, these errors can kill the server. We had upgraded the Node version a few days earlier, but we had not seen these errors on our pre-prod instances.
- At 1:30 pm we deployed a second fix that turned off the HubSpot integration, because the first fix wasn’t working as expected.
- At 4:00 pm we identified the root causes of the problem and we started to implement all required changes.
- At 5:10 pm we deployed the fix after being able to reproduce the errors locally. We monitored the behaviour on our instances.
- From 6:00 pm to 7:00 pm we monitored the behaviour on the production environment after re-enabling the HubSpot integration.
- During this period, no server restarts were observed and all features were available.
Explanation
- During the last production deployment, we upgraded Node to a version higher than 15.x.
- Starting with version 15, Node handles unhandled rejections by killing the Node process by default (previously it only raised warnings).
- Among all the webhooks Cycle receives, some fail for various reasons. Those failures used to be logged as warnings in our error tracking system, but now they are causing the servers to crash. More information here 👉 https://developer.ibm.com/blogs/nodejs-15-release-blog/
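To illustrate the behaviour described above, here is a minimal sketch of a process-wide listener for unhandled rejections. When such a listener is registered, Node 15 and later call it instead of terminating the process. The function name `logUnhandledRejections` and the `Logger` type are hypothetical and only serve the example; this is not necessarily the fix we deployed.

```typescript
// Hypothetical sketch: these names are illustrative, not Cycle's actual code.
type Logger = (message: string, reason: unknown) => void;

// Registers a process-wide listener for the 'unhandledRejection' event.
// Without any listener, Node 15+ turns an unhandled rejection into an
// uncaught exception, which terminates the process by default.
function logUnhandledRejections(log: Logger): void {
  process.on("unhandledRejection", (reason: unknown) => {
    // Report the failure (for example to an error tracking system)
    // instead of letting it take the server down.
    log("Unhandled promise rejection", reason);
  });
}
```

A listener like this is a safety net only: the rejected promise still points to a code path that lacks its own error handling.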
Plan for the future
- We implemented generic logic to handle all the errors Node could raise in our API.
- We are currently working on a test framework for our integrations to ensure the stability of the integrations and their webhooks.
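As a rough idea of what generic error handling around webhooks can look like, the sketch below wraps an asynchronous handler so that any failure is reported rather than left as an unhandled rejection. The names `withErrorBoundary`, `WebhookHandler` and `ErrorReporter` are assumptions made for this example and do not describe our actual implementation.

```typescript
// Hypothetical sketch: types and names are illustrative only.
type WebhookHandler<T> = (payload: T) => Promise<void>;
type ErrorReporter = (error: unknown) => void;

// Wraps a webhook handler so that both synchronous throws and rejected
// promises are caught and passed to the reporter. The returned handler
// always resolves, so a failing webhook cannot crash the process.
function withErrorBoundary<T>(
  handler: WebhookHandler<T>,
  report: ErrorReporter
): WebhookHandler<T> {
  return async (payload: T): Promise<void> => {
    try {
      await handler(payload);
    } catch (error) {
      report(error);
    }
  };
}
```

Wrapping each handler at registration time keeps the error policy in one place, so a new integration does not need to remember its own try/catch.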