<aside> ℹ️ There are plenty of details in this page. Use the ▶ icon to expand on some part of the topic.
</aside>
<aside> ✅ This has been resolved. The root cause was a bug in NodeJS with keep-alives. The application's keep-alive timeout was shorter than that of the reverse proxy (Nginx). Increasing the NodeJS keep-alive timeout to be greater than Nginx's (i.e. > 75 seconds) solved the issue.
</aside>
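As an illustration of the fix, here is a minimal sketch using NodeJS' built-in `http` module. The 75-second figure comes from the summary above; the `createServer` helper name, the 5-second margin and the `headersTimeout` adjustment are assumptions made for this example, not the exact change that was deployed. On some NodeJS versions `headersTimeout` also needs to exceed `keepAliveTimeout`, so it is raised here as a precaution.

```javascript
const http = require('http');

// From the summary above: Nginx keeps idle connections open for 75 seconds.
const PROXY_KEEPALIVE_MS = 75 * 1000;

// Hypothetical helper: builds an HTTP server whose idle keep-alive timeout
// is longer than the proxy's, so the proxy is always the side that closes
// an idle connection first.
function createServer(handler) {
  const server = http.createServer(handler);
  // Assumed 5 s safety margin on top of the proxy timeout.
  server.keepAliveTimeout = PROXY_KEEPALIVE_MS + 5 * 1000;
  // Precaution: keep headersTimeout above keepAliveTimeout.
  server.headersTimeout = server.keepAliveTimeout + 1000;
  return server;
}

// Usage: createServer((req, res) => res.end('ok')).listen(8000);
```

With this ordering of timeouts, Nginx should never try to reuse a connection that NodeJS has already decided to close.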
Architecture: We have a NodeJS application running in Kubernetes. This application is exposed to the internet by an Nginx ingress controller, which handles SSL termination using a Let's Encrypt certificate.
Issue: A very small number of HTTP requests (0.00006%) end up as 502s. For every 502, Nginx reports a `Connection reset by peer`. There is a `Connection error` warning for 0.0001% of the HTTP requests:
2020/03/24 14:25:52 [error] 39#39: *3117560 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.134.3.92, server: api.production.birdie.care, request: "GET /oauth/tokeninfo HTTP/1.1", upstream: "http://10.134.3.55:8000/oauth/tokeninfo", host: "api.production.birdie.care"
We are able to reproduce the issue as follows:

- We run a very simple NodeJS application (see the code by expanding). It exposes a home page and 3 endpoints with 3 different application behaviours:
    - `/`: a home page (returns directly)
    - `/wait/{time}`: a page that uses `setTimeout` and returns after `{time}` milliseconds
    - `/sleep/{time}`: a page that uses a blocking system sleep for `{time}` ms before responding
    - `/busy/{time}`: a page that burns CPU for `{time}` ms before answering
- We send 500,000 requests to the API using `ab -c 50 -n 500000`.
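For reference, a test application with the endpoints described above might look like the following sketch. It is not the original code: the helper names (`sleepBlocking`, `busyWait`, `handle`, `createApp`) are hypothetical, and the blocking sleep is approximated with `Atomics.wait`, which is an assumption standing in for whatever system sleep the real application used.

```javascript
const http = require('http');

// Blocks the event loop for `ms` milliseconds without burning CPU
// (assumed stand-in for a blocking system sleep).
function sleepBlocking(ms) {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

// Blocks the event loop for `ms` milliseconds by spinning on the clock.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* burn CPU */ }
}

// Routes: /, /wait/{time}, /sleep/{time}, /busy/{time}
function handle(req, res) {
  if (req.url === '/') return res.end('home\n');
  const match = req.url.match(/^\/(wait|sleep|busy)\/(\d+)$/);
  if (!match) {
    res.statusCode = 404;
    return res.end('not found\n');
  }
  const mode = match[1];
  const ms = Number(match[2]);
  if (mode === 'wait') {
    // Non-blocking: the event loop stays free while waiting.
    setTimeout(() => res.end(`waited ${ms}ms\n`), ms);
    return;
  }
  if (mode === 'sleep') sleepBlocking(ms);
  else busyWait(ms);
  res.end(`${mode} ${ms}ms\n`);
}

function createApp() {
  return http.createServer(handle);
}

// Usage: createApp().listen(8000);
```

The key difference between the endpoints is that `/wait` leaves the event loop free, while `/sleep` and `/busy` both block it for the whole duration of the request.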
The numbers below are averaged over many runs.

- `/`: 0 connection resets (a reset happened once, very rarely, with much higher `ab` traffic)
- `/wait/*`: 0 connection resets
- `/sleep/10`: ~10 connection resets
- `/busy/10`: ~10 connection resets

Note: See more load-testing results with the `nestjs-boilerplate` below.