Running 10s test @ http://server.tfb:8080/json
  16 threads and 256 connections
  Thread Stats   Avg     Stdev       Max       Min   +/- Stdev
    Latency   204.24us   23.94us  626.00us   70.00us   68.70%
    Req/Sec    75.56k   587.59     77.05k    73.92k    66.22%
  Latency Distribution
     50.00%  203.00us
     90.00%  236.00us
     99.00%  265.00us
     99.99%  317.00us
  12031718 requests in 10.00s, 1.64GB read
Requests/sec: 1203164.22
Transfer/sec:    167.52MB

Overview

This post will walk you through the performance tuning steps that I took to serve 1.2 million JSON "API" requests per second from a 4 vCPU AWS EC2 instance. For the purposes of this recreated quest, we will ignore most of the dead ends and dark alleyways that I had to struggle through on my solo expedition. Instead, we will mainly stick to the happy path, moving steadily from serving 224k req/s at the start, with the default configuration, to a mind-blowing 1.2M req/s by the time we reach the end.

Hitting 1M+ req/s wasn't actually my original intent. I started off working on a largely unrelated blog post, but I somehow found myself going down this optimization rabbit hole. The global pandemic gave me some extra time, so I decided to dive in head first. The table below lists the nine optimization categories that I will cover, and links to the corresponding flame graphs. It shows the percentage improvement for each optimization, and the cumulative throughput in requests per second. It is a pretty solid illustration of the power of compounding when doing optimization work.

The main takeaway from this post should be an appreciation for the tools and techniques that can help you to profile and improve the performance of your systems. Should you expect to get 5x performance gains from your webapp by cargo-culting these configuration changes? Probably not. Many of these specific optimizations won't really benefit you unless you are already serving more than 50k req/s to begin with. On the other hand, applying the profiling techniques to any application should give you a much better understanding of its overall behavior, and you just might find an unexpected bottleneck.

I considered breaking this post up across multiple entries, but decided to keep everything together for simplicity. Clicking the menu icon at the top right will open a table of contents so that you can easily jump to a specific section. For those who want to get their hands dirty and try it out, I have provided a CloudFormation template that sets up the entire benchmark environment for you.

Basic Benchmark Setup

This is a basic overview of the benchmark setup on AWS. Please see the Full Benchmark Setup section if you are interested in more details. I used the TechEmpower JSON Serialization test as the reference benchmark for this experiment. For the implementation, I used a simple API server built with libreactor, an event-driven application framework written in C. This API server makes use of Linux primitives like epoll, send, and recv with minimal overhead. HTTP parsing is handled by picohttpparser, and libclo takes care of JSON encoding. It is pretty much as fast as you can get (pre io_uring anyway), and it is the perfect foundation for an optimization-focused experiment.

Hardware

Software

Benchmark Configuration

The benchmark was run three times; the highest and lowest results were discarded, leaving the median run. twrk was run manually from the client using the same headers as the official benchmark and the following parameters: