Running 10s test @ http://server.tfb:8080/json
  16 threads and 256 connections
  Thread Stats   Avg      Stdev      Max      Min   +/- Stdev
    Latency   204.24us   23.94us  626.00us  70.00us   68.70%
    Req/Sec    75.56k    587.59    77.05k   73.92k    66.22%
  Latency Distribution
     50.00%  203.00us
     90.00%  236.00us
     99.00%  265.00us
     99.99%  317.00us
  12031718 requests in 10.00s, 1.64GB read
Requests/sec: 1203164.22
Transfer/sec:    167.52MB
This post will walk you through the performance tuning steps that I took to serve 1.2 million JSON "API" requests per second from a 4 vCPU AWS EC2 instance. For the purposes of this recreated quest, we will ignore most of the dead ends and dark alleyways that I had to struggle through on my solo expedition. Instead, we will mainly stick to the happy path, moving steadily from serving 224k req/s at the start, with the default configuration, to a mind-blowing 1.2M req/s by the time we reach the end.
Hitting 1M+ req/s wasn't actually my original intent. I started off working on a largely unrelated blog post, but I somehow found myself going down this optimization rabbit hole. The global pandemic gave me some extra time, so I decided to dive in head first. The table below lists the nine optimization categories that I will cover, and links to the corresponding flame graphs. It shows the percentage improvement for each optimization, and the cumulative throughput in requests per second. It is a pretty solid illustration of the power of compounding when doing optimization work.
The main takeaway from this post should be an appreciation for the tools and techniques that can help you to profile and improve the performance of your systems. Should you expect to get 5x performance gains from your webapp by cargo-culting these configuration changes? Probably not. Many of these specific optimizations won't really benefit you unless you are already serving more than 50k req/s to begin with. On the other hand, applying the profiling techniques to any application should give you a much better understanding of its overall behavior, and you just might find an unexpected bottleneck.
I considered breaking this post up across multiple entries, but decided to keep everything together for simplicity. Clicking the menu icon at the top right will open a table of contents so that you can easily jump to a specific section. For those who want to get their hands dirty and try it out, I have provided a CloudFormation template that sets up the entire benchmark environment for you.
This is a basic overview of the benchmark setup on AWS. Please see the Full Benchmark Setup section if you are interested in more details. I used the TechEmpower JSON Serialization test as the reference benchmark for this experiment. For the implementation, I used a simple API server built with libreactor, an event-driven application framework written in C. This API server makes use of Linux primitives like epoll, send, and recv with minimal overhead. HTTP parsing is handled by picohttpparser, and libclo takes care of JSON encoding. It is pretty much as fast as you can get (pre io_uring, anyway), and it is the perfect foundation for an optimization-focused experiment.
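To make that concrete, here is a minimal, heavily simplified sketch of the epoll/recv/send loop that this style of server is built around. This is not libreactor's actual code: error handling is omitted, HTTP parsing is skipped entirely (the real server uses picohttpparser), and the port and static JSON response are just placeholders chosen to mirror the JSON test.

```c
/*
 * Minimal sketch of an event-driven server using epoll + recv/send.
 * NOT libreactor's actual code: no error handling, no HTTP parsing,
 * every request gets the same canned JSON response.
 */
#define _GNU_SOURCE           /* accept4, SOCK_NONBLOCK */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Non-blocking listening socket on port 8080 (matches the benchmark URL). */
    int listener = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    int on = 1;
    setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(8080),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    listen(listener, SOMAXCONN);

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listener };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listener, &ev);

    static const char response[] =
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: application/json\r\n"
        "Content-Length: 27\r\n"
        "\r\n"
        "{\"message\":\"Hello, World!\"}";

    struct epoll_event events[64];
    char buf[4096];

    for (;;) {
        int ready = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < ready; i++) {
            int fd = events[i].data.fd;
            if (fd == listener) {
                /* Drain the accept queue and register each new connection. */
                int client;
                while ((client = accept4(listener, NULL, NULL, SOCK_NONBLOCK)) >= 0) {
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
                }
            } else {
                /* Read whatever arrived and answer with the canned JSON reply. */
                ssize_t n = recv(fd, buf, sizeof(buf), 0);
                if (n <= 0)
                    close(fd);   /* peer closed the connection (or an error) */
                else
                    send(fd, response, sizeof(response) - 1, 0);
            }
        }
    }
}
```

The real framework layers connection management, buffering, and request parsing on top of this, but the hot path is essentially the same epoll_wait/recv/send cycle shown above.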
The application was run on the server with `docker run -d --rm --network host --init libreactor`. The benchmark was run three times and the highest and lowest results were discarded. twrk was run manually from the client using the same headers as the official benchmark and the following parameters: