On Thursday, April 22nd, 2021, we experienced a prolonged outage. The outage started at 7:25 am UTC and lasted for 3 hours, during which time Linear was either completely unavailable or available only in offline mode. When the client is in offline mode, users can browse all data and make updates, but those updates are queued on the client and sent to the backend only once the client leaves offline mode by reconnecting to our sync service.
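
As a rough illustration (simplified, not our actual client code), the queue-and-flush behavior in offline mode looks something like this:

```typescript
// Illustrative sketch of offline-mode queueing; names and shapes are ours for
// this example, not the real client implementation.
type LocalUpdate = { entityId: string; changes: Record<string, unknown> };

class OfflineQueue {
  private pending: LocalUpdate[] = [];

  constructor(private send: (update: LocalUpdate) => Promise<void>) {}

  // While offline, updates are applied locally and queued here.
  enqueue(update: LocalUpdate): void {
    this.pending.push(update);
  }

  // Called once the client reconnects to the sync service: drain the queue in
  // order, dropping each update only after the backend has accepted it.
  async flush(): Promise<void> {
    while (this.pending.length > 0) {
      await this.send(this.pending[0]);
      this.pending.shift();
    }
  }
}
```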

Root cause

The outage started due to an automatic vacuum process running on one of the key tables in our Postgres database. Sync in Linear was implemented using a single table that contains all the updates to all entities in the system. Writing to this table took an exclusive lock on the table to guarantee that the sync servers would read all sync updates sequentially and not skip over transactions that had written entries to the table but not yet committed them.
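
As a simplified sketch of that original write path (the table name and exact lock mode below are illustrative, not our production code), each insert was wrapped in a transaction that locked the table first:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Original (pre-incident) style of write: lock the whole table, then insert.
async function appendSyncUpdate(entityId: string, payload: object): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // An exclusive table lock forces writers to commit strictly in order so
    // readers never skip an uncommitted row. Any lock at EXCLUSIVE level or
    // above also conflicts with the lock that (auto-)vacuum needs on the table.
    await client.query("LOCK TABLE sync_updates IN EXCLUSIVE MODE");
    await client.query(
      "INSERT INTO sync_updates (entity_id, payload) VALUES ($1, $2)",
      [entityId, JSON.stringify(payload)]
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```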

Transaction IDs

To provide transaction isolation, Postgres uses transaction IDs (XIDs), which are incremented whenever a transaction modifies rows in the database. Due to the way XIDs are implemented, Postgres needs to run a process called vacuuming to (amongst other things) make sure the XIDs don't wrap around. An XID wrapping around would cause major data inconsistencies, as previously committed transactions would suddenly no longer be visible. Postgres automatically runs vacuuming to prevent this from happening.
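
The wraparound risk for a table can be inspected directly by looking at the age of its oldest unfrozen XID. A simplified monitoring query, here wrapped in a small node-postgres script for illustration, looks roughly like this:

```typescript
import { Client } from "pg";

// List the tables closest to a forced anti-wraparound vacuum. Postgres forces
// one once age(relfrozenxid) approaches autovacuum_freeze_max_age
// (200 million transactions by default).
async function reportXidAge(): Promise<void> {
  const client = new Client();
  await client.connect();
  const { rows } = await client.query(`
    SELECT relname, age(relfrozenxid) AS xid_age
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY age(relfrozenxid) DESC
    LIMIT 10
  `);
  for (const row of rows) {
    console.log(`${row.relname}: ${row.xid_age} transactions since last freeze`);
  }
  await client.end();
}
```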

However, since auto-vacuuming the table would conflict with our write lock, we had made the bad decision to disable auto-vacuum on the table and run it manually during maintenance breaks. A manual vacuum hadn't been run on the sync table in a few months, though, since the table was only appended to and thus didn't show any bloat.
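
For illustration, disabling (or re-enabling) auto-vacuum for a single table is a per-table storage parameter; the sketch below uses a hypothetical table name. Note that this only stops routine auto-vacuum runs; Postgres still forces an anti-wraparound vacuum once the table's XID age crosses autovacuum_freeze_max_age.

```typescript
import { Client } from "pg";

// Flip the per-table autovacuum storage parameter (illustrative table name).
async function setAutovacuum(enabled: boolean): Promise<void> {
  const client = new Client();
  await client.connect();
  // DDL does not accept bind parameters for storage options, so the boolean
  // is interpolated directly into the statement.
  await client.query(
    `ALTER TABLE sync_updates SET (autovacuum_enabled = ${enabled})`
  );
  await client.end();
}
```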

As Postgres got closer to running out of XIDs, it force-started a vacuum operation to guarantee data consistency. This effectively caused all of our updates to the database to start failing because we could not acquire a lock on the sync table, taking sync offline and forcing our users into read-only mode.

Our engineers were paged quickly, and we identified that lock timeouts were causing the sync problems. Once we traced the problem to the automatic vacuum process, we tried to cancel it and run the vacuum manually instead, as automatic vacuum processes are designed to run in parallel with normal database operations and take a lot longer to execute. However, because we are running a managed Postgres instance on Google Cloud Platform (GCP), we couldn't stop the auto-vacuum process and were forced to wait for it to complete.

Bringing services back online

Once the auto-vacuum process had completed, we encountered additional issues when trying to bring services online, due to many clients trying to connect at the same time. We had recently switched to a new Kubernetes cluster, and our sync service had been configured with overly strict liveness check timeouts. When thousands of clients reconnected to the sync servers simultaneously, the servers were processing a lot of data trying to send sync updates to each client. This load caused the liveness checks to fail and Kubernetes to restart the pods. When clients were disconnected from a sync server, they immediately reconnected and were sent to a different sync server instance, which would again fail under the load. The clients have a backoff strategy, but this wasn't enough to get the sync service healthy, as the instances were also frequently running out of memory trying to load up data for thousands of clients.
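
For reference, a typical exponential backoff with jitter for reconnects looks roughly like the sketch below (simplified, not our actual client code). Even with jitter, thousands of clients reconnecting at once can overwhelm instances whose liveness probes are tuned too aggressively.

```typescript
// Full-jitter exponential backoff: the delay grows with each failed attempt,
// and randomization spreads reconnect spikes across time.
function reconnectDelayMs(attempt: number): number {
  const base = 1_000; // 1s initial delay
  const cap = 60_000; // never wait more than a minute
  const exp = Math.min(cap, base * 2 ** attempt);
  return Math.random() * exp;
}

// Keep retrying the connection, backing off between attempts.
async function connectWithBackoff<T>(connect: () => Promise<T>): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await connect();
    } catch {
      await new Promise((resolve) => setTimeout(resolve, reconnectDelayMs(attempt)));
    }
  }
}
```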

At this point, we decided to block access to the service and let users back in slowly while we investigated why the sync servers were failing under load. We used each client's IP address to let 10% of users return to the service, increasing the percentage once the sync servers showed normal load. This process was brittle: when we increased the percentage too much, the sync servers would fail, disconnecting all users, and we would have to start from the beginning.
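
A simplified sketch of that kind of IP-based gate (the hashing scheme here is illustrative, not necessarily the one we used): hashing the client IP gives each client a stable bucket, so raising the threshold only admits new clients instead of reshuffling everyone.

```typescript
import { createHash } from "crypto";

// Map each client IP to a stable bucket in 0..99 and admit it only if the
// bucket falls below the current rollout percentage.
function isAdmitted(clientIp: string, admitPercent: number): boolean {
  const digest = createHash("sha256").update(clientIp).digest();
  const bucket = digest.readUInt16BE(0) % 100;
  return bucket < admitPercent;
}

// Example: start by letting ~10% of clients through, then raise the limit
// once the sync servers look healthy.
// isAdmitted("203.0.113.42", 10)
```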

Ultimately, we discovered the configuration problem with the sync servers' liveness check timeouts, and service was restored.

Fixes

We identified and implemented several changes to our service to prevent this from happening again.

Remove locks on the sync table

We've removed all locking from the sync table, opting instead for an algorithm that retries gaps in the sync data for a few seconds before taking an advisory lock to wait for any outstanding transactions to finish committing.
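
A simplified sketch of how such a reader can work, assuming writers hold a transaction-scoped advisory lock (pg_advisory_xact_lock) on a shared key while inserting; the table name, lock key, and timings below are illustrative, not our exact implementation.

```typescript
import { Pool } from "pg";

const pool = new Pool();
// Hypothetical advisory-lock key that writers hold (via pg_advisory_xact_lock)
// for the duration of their insert transaction.
const SYNC_WRITER_LOCK_KEY = 42;

type SyncRow = { id: string; payload: unknown };

// A gap in the id sequence means some id between lastSeenId and the newest
// row has no visible row yet (usually an insert that hasn't committed).
function hasGap(lastSeenId: number, rows: SyncRow[]): boolean {
  let expected = lastSeenId + 1;
  for (const row of rows) {
    if (Number(row.id) !== expected) return true;
    expected += 1;
  }
  return false;
}

async function readNewUpdates(lastSeenId: number): Promise<SyncRow[]> {
  const fetchRows = async (): Promise<SyncRow[]> =>
    (
      await pool.query(
        "SELECT id, payload FROM sync_updates WHERE id > $1 ORDER BY id",
        [lastSeenId]
      )
    ).rows;

  let rows = await fetchRows();

  // First, just retry for a few seconds; most gaps close as soon as the
  // in-flight writer commits.
  const deadline = Date.now() + 3_000;
  while (hasGap(lastSeenId, rows) && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, 100));
    rows = await fetchRows();
  }

  // Still gappy: briefly acquire the writers' advisory lock. It is only
  // granted once outstanding write transactions have committed or rolled
  // back, so one final read afterwards sees a settled sequence.
  if (hasGap(lastSeenId, rows)) {
    const client = await pool.connect();
    try {
      await client.query("SELECT pg_advisory_lock($1)", [SYNC_WRITER_LOCK_KEY]);
      await client.query("SELECT pg_advisory_unlock($1)", [SYNC_WRITER_LOCK_KEY]);
    } finally {
      client.release();
    }
    rows = await fetchRows();
  }

  return rows;
}
```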

Turn auto-vacuum back on

Removing locking on the sync table let us turn auto-vacuum back on for it.

Delta-sync using the API

The two main services used by the Linear client are the sync service and the API. The sync service holds a websocket connection to each client and sends it data updates. The sync service is no longer responsible for computing the delta sync for connecting clients; instead, clients request the delta sync from the API once they've connected to the sync server. This drastically reduces the load on the sync servers during client connections and allows them to accept two orders of magnitude more connections without any problems.
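
Roughly, the new connection flow looks like this from the client's point of view (the endpoint URLs, query parameters, and applyUpdates callback are illustrative, not our actual API):

```typescript
// Connect to the sync service first, then catch up on missed updates via the
// API instead of having the sync server compute the delta.
async function connectAndCatchUp(
  lastSyncId: number,
  applyUpdates: (updates: unknown[]) => void
): Promise<WebSocket> {
  // 1. Open the websocket; live updates stream in over this connection.
  const socket = new WebSocket("wss://sync.example.com");
  await new Promise<void>((resolve, reject) => {
    socket.onopen = () => resolve();
    socket.onerror = () => reject(new Error("sync connection failed"));
  });

  // 2. Ask the API (not the sync server) for everything missed while offline.
  //    In practice, live updates arriving during this catch-up need to be
  //    buffered or deduplicated against the delta.
  const response = await fetch(
    `https://api.example.com/sync/delta?since=${lastSyncId}`
  );
  const missedUpdates: unknown[] = await response.json();
  applyUpdates(missedUpdates);

  return socket;
}
```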