<aside> ℹ️ This page contains a lot of detail. Use the ▶ icon to expand a given part of the topic.

</aside>

<aside> ✅ This has been resolved. The root cause was a bug in NodeJS with keep-alives: the application's keep-alive timeout was shorter than that of the reverse proxy (Nginx), so NodeJS could close an idle connection just as Nginx was reusing it for a new request. Increasing the NodeJS keep-alive timeout above Nginx's (i.e. > 75 seconds) solved the issue.

</aside>
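
The fix described above can be sketched as follows. This is a minimal illustration rather than the exact change that was deployed: the helper name `applyKeepAliveTimeouts` and the one- and two-second margins are assumptions, and the 75-second figure is the Nginx value quoted above. Keeping `headersTimeout` above `keepAliveTimeout` is a commonly recommended precaution on some NodeJS versions, not something this page confirms was needed.

```javascript
const http = require('http');

// Assumption: Nginx keeps idle connections open for 75 s (the value quoted above).
const NGINX_KEEPALIVE_MS = 75000;

// Hypothetical helper: make the NodeJS server keep idle sockets open longer
// than the proxy does, so that the proxy is the side that closes them first.
function applyKeepAliveTimeouts(server, proxyTimeoutMs = NGINX_KEEPALIVE_MS) {
  server.keepAliveTimeout = proxyTimeoutMs + 1000; // 76 s with the default
  // Precaution (assumption): keep the header timeout above the keep-alive timeout.
  server.headersTimeout = proxyTimeoutMs + 2000; // 77 s with the default
  return server;
}

const server = applyKeepAliveTimeouts(
  http.createServer((req, res) => res.end('ok\n'))
);
```

Calling `server.listen(8000)` afterwards starts the application; with Express or NestJS the same two properties can be set on the underlying `http.Server` instance.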

Overview

Architecture: We have a NodeJS application running in Kubernetes. This application is exposed to the internet by an Nginx ingress controller (which does the SSL termination, from a LetsEncrypt certificate).

Issue: A very small fraction of HTTP requests (0.00006%) end in 502s. For every 502, Nginx reports a "Connection reset by peer" error. A connection error warning is logged for 0.0001% of HTTP requests:

2020/03/24 14:25:52 [error] 39#39: *3117560 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.134.3.92, server: api.production.birdie.care, request: "GET /oauth/tokeninfo HTTP/1.1", upstream: "http://10.134.3.55:8000/oauth/tokeninfo", host: "api.production.birdie.care"
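
To measure how often this happens and on which requests, error log lines of this shape can be tallied with a small script. The sketch below is an illustration only: the function names are hypothetical, and the regular expression assumes lines formatted like the one above.

```javascript
// Matches the upstream reset message and captures the request line and the upstream URL.
const RESET_RE =
  /\(104: Connection reset by peer\).*request: "([^"]*)", upstream: "([^"]*)"/;

// Returns { request, upstream } for a reset line, or null for any other line.
function parseUpstreamReset(line) {
  const m = RESET_RE.exec(line);
  return m ? { request: m[1], upstream: m[2] } : null;
}

// Counts resets per request line across an array of log lines.
function countResetsByRequest(lines) {
  const counts = new Map();
  for (const line of lines) {
    const hit = parseUpstreamReset(line);
    if (hit) counts.set(hit.request, (counts.get(hit.request) || 0) + 1);
  }
  return counts;
}
```

Feeding it the lines of the ingress controller's error log (for example, the output of `fs.readFileSync(path, 'utf8').split('\n')`) gives a per-endpoint count of upstream resets.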

Reproducer

We are able to reproduce the issue with:

We are running a very simple NodeJS application (see code by expanding)
It exposes 4 endpoints, each with a different application behaviour
- / a home page (returns directly)
- /wait/{time} a page that uses setTimeout and returns after {time} milliseconds
- /sleep/{time} a page that uses a blocking system sleep for {time}ms before responding
- /busy/{time} a page that burns CPU for {time}ms before answering
Sending 500,000 requests to the API using ab -c 50 -n 500000. The numbers below are averaged over many runs.
- / 0 connection resets (a reset happened only once, under much higher ab traffic)
- /wait/* 0 connection resets
- /sleep/10 ~10 connection resets
- /busy/10 ~10 connection resets
Note: See more load-testing results with the nestjs-boilerplate below.
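
For readers without access to the original application code, a server with the same four endpoints might look like the sketch below. It is a reconstruction under assumptions, not the code used for the measurements: the function names are hypothetical, and the blocking sleep is implemented with `Atomics.wait`, which may differ from the system sleep used in the original.

```javascript
const http = require('http');

// Non-blocking wait: the event loop stays free while the timer runs.
function wait(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Blocking sleep (assumption: stands in for the original "system sleep").
// Parks the main thread for `ms` milliseconds without burning CPU.
function sleep(ms) {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

// Busy loop: burns CPU on the main thread for roughly `ms` milliseconds.
function busy(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}

// Builds the HTTP server without starting it.
function createApp() {
  return http.createServer(async (req, res) => {
    const match = /^\/(wait|sleep|busy)\/(\d+)$/.exec(req.url);
    if (match) {
      const ms = Number(match[2]);
      if (match[1] === 'wait') await wait(ms);
      else if (match[1] === 'sleep') sleep(ms);
      else busy(ms);
    } else if (req.url !== '/') {
      res.writeHead(404);
      res.end();
      return;
    }
    res.end('ok\n');
  });
}
```

Starting it with `createApp().listen(8000)` and pointing ab at each path exercises the behaviours listed above. Both /sleep and /busy block the event loop, whereas /wait does not, which lines up with the endpoints on which resets were observed.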

When we CANNOT reproduce

Connections are NOT reset with an application written in Go (with the same endpoints)

Other observations

The higher the CPU usage of the application container, the more likely we are to have Connection resets.
The same behaviour occurs, as expected, with NestJS applications (using NodeJS & Express) (Note: more ab results are inside here)