Hello Anton,

Front experienced an incident yesterday involving our Search capabilities in our Europe region. Search went into read-only mode from 14:15 UTC to 01:15 UTC on Thursday 21/04. Search itself kept working, but we could not add new data to our index: emails received after 14:15 UTC did not show up in search results. This affected:

The incident is now fully resolved. No data was lost.

Here's what happened

Front stores a lot of data, and building powerful search features on top of it is a major challenge. We have about 20TB of search data in our Europe region alone. We use Elasticsearch (ES), a popular distributed search index that we host on hundreds of servers. Systems this large are inevitably complex. What happened was the result of our own actions, not an underlying problem with ES.

Our servers are spread across different physical locations to make sure that we can survive network partitions. If two groups of servers are unable to communicate with each other, they need to collectively agree on which group will be the source of truth when connectivity is restored. ES accomplishes this using “master” nodes: a small set of servers that are tasked with acting as that source of truth. These servers need to exist in an odd number (generally 3 or 5) so that a majority of them, a quorum, can always be identified. This ensures that the system never ends up in a tied state.
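To illustrate why an odd number of master nodes is preferred, here is a minimal sketch of the majority arithmetic. It is a simplified model, not how ES implements its voting internally, and the function names `quorum_size` and `tolerated_failures` are hypothetical.

```python
def quorum_size(master_nodes: int) -> int:
    """Smallest group that is a strict majority of the master nodes."""
    return master_nodes // 2 + 1


def tolerated_failures(master_nodes: int) -> int:
    """How many master nodes can be lost while a majority still remains."""
    return master_nodes - quorum_size(master_nodes)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} masters: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

In this model, going from 3 to 4 master nodes raises the quorum from 2 to 3 without letting the cluster tolerate any additional failure, which is why even counts bring no benefit.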

While these nodes do not perform a lot of work by themselves, they still need to be upgraded regularly, which is generally a routine operation for us. We added larger nodes and once we had confirmation that they were working and healthy, we proceeded to remove the previous ones. In doing so, we inadvertently broke quorum: while both sets of nodes were online, we had an even number of master nodes and the system could not include the new nodes in the quorum.
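The risk of an even number of master nodes can be sketched as follows. This is a simplified model under stated assumptions: the node counts used below are hypothetical (the actual numbers are not given here), and `winning_group` is an illustrative helper, not part of ES.

```python
from typing import Optional


def winning_group(group_a: int, group_b: int) -> Optional[str]:
    """Return which side of a split holds a strict majority, if any.

    group_a and group_b are the numbers of master nodes on each side.
    """
    total = group_a + group_b
    majority = total // 2 + 1  # strict majority of all master nodes
    if group_a >= majority:
        return "a"
    if group_b >= majority:
        return "b"
    return None  # tie: neither side can act as the source of truth


if __name__ == "__main__":
    # Hypothetical odd total: one side always wins.
    print(winning_group(2, 1))
    # Hypothetical even total split evenly: nobody wins.
    print(winning_group(3, 3))
```

With an odd total, one side of any two-way split always holds a majority. With an even total, an even split leaves no majority, so in this model neither side can safely accept writes.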

This put our cluster in a safe mode, where new data could not be written. While all of the data was still perfectly healthy, it took our team a few hours to repair our cluster and bring it back to a state where everything was working.

We’d like to apologize to all customers who were affected by this incident. We are already planning new work to learn from this: mainly hardening our operational procedures to make sure this does not happen again, while also improving our recovery systems so that a problem affecting our search capabilities does not result in a multi-hour disruption.

Laurent Perrin, CTO at Front