
<aside> ✅ This has been resolved. The root cause was a bug in NodeJS's handling of keep-alive connections: the application's keep-alive timeout was shorter than the reverse proxy's (Nginx). Increasing NodeJS's keep-alive timeout to be greater than Nginx's (i.e. > 75 seconds) solved the issue.

</aside>
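A minimal sketch of the fix, assuming the application exposes its underlying `http.Server` (the handler, port, and exact values below are illustrative; the only requirement from the resolution is that Node's keep-alive timeout ends up above Nginx's 75 seconds):

```javascript
const http = require('http');

// `handler` stands in for the real application (e.g. an Express app).
const handler = (req, res) => res.end('ok');
const server = http.createServer(handler);

// Node's default keepAliveTimeout is 5000 ms. Raising it above Nginx's
// keepalive timeout (75 s) means Nginx, not Node, closes idle upstream
// connections, which removes the reset-on-reuse race.
server.keepAliveTimeout = 76 * 1000;

// Keeping headersTimeout above keepAliveTimeout avoids cutting off a request
// whose headers are still in flight; a commonly paired precaution, not part
// of the stated fix.
server.headersTimeout = 77 * 1000;

server.listen(8000);
```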

Overview

Architecture: We have a NodeJS application running in Kubernetes. This application is exposed to the internet by an Nginx ingress controller, which performs SSL termination using a LetsEncrypt certificate.
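For reference, a hedged sketch of the upstream hop as nginx sees it: TLS ends at the ingress, so the pod serves plain HTTP (port 8000 matches the upstream address in the error log below; the handler itself is illustrative).

```javascript
const http = require('http');

// TLS is terminated by the Nginx ingress, so the pod only speaks plain HTTP.
// Keep-alive connections between the ingress and this server are governed by
// server.keepAliveTimeout, which defaults to 5 seconds in Node.
const server = http.createServer((req, res) => {
  res.writeHead(200, { 'content-type': 'application/json' });
  res.end(JSON.stringify({ ok: true }));
});

server.listen(8000);
```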

Issue: A very small fraction of HTTP requests (0.00006%) end up as 502s. For every 502, nginx reports a Connection reset by peer error. A Connection error warning also appears for 0.0001% of HTTP requests:

2020/03/24 14:25:52 [error] 39#39: *3117560 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.134.3.92, server: api.production.birdie.care, request: "GET /oauth/tokeninfo HTTP/1.1", upstream: "http://10.134.3.55:8000/oauth/tokeninfo", host: "api.production.birdie.care"
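Given the resolution above, this error is consistent with nginx reusing an idle upstream keep-alive connection that the Node server has just closed. As an illustration of that race only (not the actual reproducer described in the next section), here is a self-contained Node sketch that provokes the same ECONNRESET between a keep-alive client and a server; the 100 ms timeout, request count, and timings are arbitrary, and only a fraction of attempts land inside the race window.

```javascript
const http = require('http');

// Server with a deliberately short keep-alive timeout; the real application
// used Node's 5000 ms default. 100 ms just makes the race easier to hit.
const server = http.createServer((req, res) => res.end('ok'));
server.keepAliveTimeout = 100;

server.listen(0, () => {
  const { port } = server.address();
  // A keep-alive client pool standing in for nginx's upstream connections.
  const agent = new http.Agent({ keepAlive: true, maxSockets: 1 });
  let resets = 0;

  const attempt = (remaining) => {
    if (remaining === 0) {
      console.log(`ECONNRESET count: ${resets}`);
      agent.destroy();
      server.close();
      return;
    }
    const req = http.get({ port, agent, path: '/' }, (res) => {
      res.resume();
      // Send the next request right around the server's keep-alive deadline,
      // so it occasionally lands on a socket the server is closing.
      res.on('end', () =>
        setTimeout(() => attempt(remaining - 1), 95 + Math.random() * 10));
    });
    req.on('error', (err) => {
      if (err.code === 'ECONNRESET') resets += 1;
      attempt(remaining - 1);
    });
  };

  attempt(500);
});
```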

Reproducer

We are able to reproduce the issue with:

When we CANNOT reproduce

Other observations