Rust is becoming a first class language in a variety of domains. At Discord, we’ve seen success with Rust on the client side and server side. For example, we use it on the client side for our video encoding pipeline for Go Live and on the server side for Elixir NIFs. Most recently, we drastically improved the performance of a service by switching its implementation from Go to Rust. This post explains why it made sense for us to reimplement the service, how it was done, and the resulting performance improvements.
Discord is a product focused company, so we’ll start with some product context. The service we switched from Go to Rust is the “Read States” service. Its sole purpose is to keep track of which channels and messages you have read. Read States is accessed every time you connect to Discord, every time a message is sent and every time a message is read. In short, Read States is in the hot path. We want to make sure Discord feels super snappy all the time, so we need to make sure Read States is quick.
With the Go implementation, the Read States service was not supporting its product requirements. It was fast most of the time, but every few minutes we saw large latency spikes that were bad for user experience. After investigating, we determined the spikes were due to core Go features: its memory model and garbage collector (GC).
To explain why Go wasn’t meeting our performance targets, we first need to discuss the data structures, scale, access patterns, and architecture of the service.
The data structure we use to store read state information is conveniently called “Read State”. Discord has billions of Read States. There is one Read State per User per Channel. Each Read State has several counters that need to be updated atomically and often reset to 0. For example, one of the counters is how many @mentions you have in a channel.
In order to get quick atomic counter updates, each Read States server has a Least Recently Used (LRU) cache of Read States. There are millions of Users in each cache. There are tens of millions of Read States in each cache. There are hundreds of thousands of cache updates per second.
For persistence, we back the cache with a Cassandra database cluster. On cache key eviction, we commit your Read States to the database. We also schedule a database commit for 30 seconds in the future whenever a Read State is updated. There are tens of thousands of database writes per second.
In the picture below, you can see the response time and system cpu for a peak sample time frame for the Go service.¹ As you might notice, there are latency and CPU spikes roughly every 2 minutes.
In Go, on cache key eviction, memory is not immediately freed. Instead, the garbage collector runs every so often to find any memory that has no references and then frees it. In other words, instead of freeing immediately after the memory is out of use, memory hangs out for a bit until the garbage collector can determine if it’s truly out of use. During garbage collection, Go has to do a lot of work to determine what memory is free, which can slow the program down.
These latency spikes definitely smelled like garbage collection performance impact, but we had written the Go code very efficiently and had very few allocations. We were not creating a lot of garbage.
After digging through the Go source code, we learned that Go will force a garbage collection run every 2 minutes at minimum. In other words, if garbage collection has not run for 2 minutes, regardless of heap growth, go will still force a garbage collection.
We figured we could tune the garbage collector to happen more often in order to prevent large spikes, so we implemented an endpoint on the service to change the garbage collector GC Percent on the fly. Unfortunately, no matter how we configured the GC percent nothing changed. How could that be? It turns out, it was because we were not allocating memory quickly enough for it to force garbage collection to happen more often.
We kept digging and learned the spikes were huge not because of a massive amount of ready-to-free memory, but because the garbage collector needed to scan the entire LRU cache in order to determine if the memory was truly free from references. Thus, we figured a smaller LRU cache would be faster because the garbage collector would have less to scan. So we added another setting to the service to change the size of the LRU cache and changed the architecture to have many partitioned LRU caches per server.
We were right. With the LRU cache smaller, garbage collection resulted in smaller spikes.