Jetstack often works with customers to provision multi-tenant platforms on Kubernetes. Sometimes special requirements arise that we cannot control with stock Kubernetes configuration. In order to implement such requirements, we’ve recently started making use of the Open Policy Agent project as an admission controller to enforce custom policies.
This post is a write up of an incident caused by misconfiguration of this integration.
We were in the process of upgrading the master for a development cluster used by many teams to test their apps during the working day. This was a regional cluster running in europe-west1 on Google Kubernetes Engine (GKE).
Teams had been warned that the upgrade was taking place though no API downtime was expected. We had previously performed the same upgrade on another pre-production environment earlier that day.
We began the upgrade via our GKE Terraform pipeline. When performing the master upgrade the operation did not complete before the Terraform timeout (which we had set to 20 minutes). This was the first sign that something was wrong though the cluster was still showing as upgrading in the GKE console.
Re-running the pipeline resulted in the following error:
google_container_cluster.cluster: Error waiting for updating GKE master version: All cluster resources were brought up, but the cluster API is reporting that: component "kube-apiserver" from endpoint "gke-..." is unhealthy
During this time the API server was intermittently timing out and teams were not able to deploy their applications.
While we began to investigate the issue, all the nodes started to be destroyed and recreated in an endless loop. This lead to an indiscriminate loss of service for all tenants.
With the help of Google Support we identified the sequence of events that lead to the outage.
kube-system. This operation timed out as the backend for the validating Open Policy Agent (OPA) webhook we had configured was not responding.
This had the knock on effect of intermittent API downtime which caused kubelets to be unable to report node health. This in turn caused GKE node auto-repair to recreate nodes as a means of repair. This feature is explained in the documentation:
An unhealthy status can mean: A node does not report any status at all over the given time threshold (approximately 10 minutes).