https://monzo.com/blog/we-built-network-isolation-for-1-500-services

In the Security team at Monzo, one of our goals is to move towards a completely zero trust platform. This means that in theory, we'd be able to run malicious code inside our platform with no risk – the code wouldn't be able to interact with anything dangerous without the security team granting special access.

The idea is that we don't want to trust just anything simply because it's inside our platform. Instead, we want individual services to be trusted based on a short and deliberate list of which other services they're allowed to interact with. This makes an attack substantially more difficult.

We've been investigating network isolation

As we work towards a zero trust platform, we've spent part of the last year investigating network isolation. In practice, this means we don't want the service that controls the images on Pots to be able to talk to a service that moves money. Services should have specifically defined and manually approved lists of what they can communicate with, and anything else should be blocked.

With a handful of services, we could probably maintain these lists manually. But we already have over 1,500 services, and so the list of allowed paths is huge and constantly changing. In fact, there are over 9,300 unique connections (for example, from service.emoji to service.ledger).

But the number of services and connections in our platform makes this difficult

The sheer number of services and connections made this project really challenging. We had to figure out a way to generate allowed paths from our code, store the rules in a way that was easy for engineers to manage and for machines to interpret, and then enforce them without breaking anything.

Here's the network we specified as part of this project. Each service is represented by a dot, and every line is an enforced network rule that allows communication between two services.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/18390df9-a341-4a88-9fc6-de27c73eaf08/Image-2.png

Here's the same graph with a few popular nodes removed, and coloured by team:

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/5f0a5f21-6339-42c7-80fd-7bebb161d099/Image-1.png

If you're interested in why we decided to build this kind of system, check out this blog post we wrote back in 2016 about building a modern backend, or this talk by one of our engineers about how we use microservices.

First, we isolated one service as a trial

Before we applied isolation to all services, we decided to apply it to one of our most security-critical services, service.ledger. This is the source of truth for our customers' balances and all money movements. One of our engineering principles is to ship projects and then iterate, so it's best to start small, especially with such a complex system. We picked such an important service first because the finance team really wanted to lock down the ledger in particular, and were keen to help us with the trial.

We came up with a tool to analyse our code

To start, we wrote a tool called rpcmap. It reads all the Go code in our platform and attempts to find code that looks like it's making a request to another service. In doing so, it maps out the connections between our services and the services they call. This wasn't perfect, but it was good enough to build a list of the services that need to call the ledger, which we then checked manually to make sure it was accurate.
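We haven't shown rpcmap's internals here, but to give a flavour of the approach, here's a minimal sketch of how a tool like it could work. It assumes, purely for illustration, that a service call embeds the target's name somewhere in the call as a string literal like "service.ledger"; the real tool has to be more careful about false positives and negatives.

```go
// Sketch of an rpcmap-style analysis: walk a source tree, parse the Go files,
// and record any string literal of the form "service.foo" that appears as an
// argument to a function call. This approximates "which services does this
// code call?" under the (illustrative) assumption above.
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

var serviceName = regexp.MustCompile(`^"service\.[a-z0-9-]+"$`)

func main() {
	root := "."
	if len(os.Args) > 1 {
		root = os.Args[1] // e.g. the directory holding one service's code
	}
	callees := map[string]bool{}

	filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".go") {
			return nil
		}
		fset := token.NewFileSet()
		file, err := parser.ParseFile(fset, path, nil, 0)
		if err != nil {
			return nil // skip files that don't parse
		}
		ast.Inspect(file, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true
			}
			for _, arg := range call.Args {
				if lit, ok := arg.(*ast.BasicLit); ok && lit.Kind == token.STRING && serviceName.MatchString(lit.Value) {
					callees[strings.Trim(lit.Value, `"`)] = true
				}
			}
			return true
		})
		return nil
	})

	for svc := range callees {
		fmt.Println(svc)
	}
}
```

Running something like this over every service's directory, and keeping the services whose output mentions service.ledger, gives a first draft of the ledger's caller list.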

We ended up with a simple list of services. Next, we wanted to enforce that only services on this list could make requests to the ledger, and we knew we could do that with a very straightforward feature of Kubernetes (the software that orchestrates our services) called the NetworkPolicy resource. The policy was part of the ledger's configuration and simply listed the set of allowed calling services. Only traffic from sources with the right labels is allowed: for example, we allow services that are labelled as service.pot.
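To make that concrete, here's a hedged sketch of what such a policy can look like, built with the Kubernetes Go API types. The label key and values (app: service.ledger, app: service.pot) are illustrative assumptions, not our real labelling scheme, and the real policy is generated as part of the ledger's configuration rather than by a standalone program.

```go
// Sketch of an ingress NetworkPolicy for service.ledger: it selects the
// ledger's pods and only admits traffic from pods carrying one of the
// whitelisted labels (here just service.pot). Labels are illustrative.
package main

import (
	"encoding/json"
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func ledgerPolicy(allowedCallers []string) networkingv1.NetworkPolicy {
	var peers []networkingv1.NetworkPolicyPeer
	for _, caller := range allowedCallers {
		peers = append(peers, networkingv1.NetworkPolicyPeer{
			PodSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": caller},
			},
		})
	}
	return networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "service-ledger-allowed-callers"},
		Spec: networkingv1.NetworkPolicySpec{
			// The policy applies to the ledger's own pods...
			PodSelector: metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "service.ledger"},
			},
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
			// ...and only allows ingress from the whitelisted callers.
			Ingress: []networkingv1.NetworkPolicyIngressRule{{From: peers}},
		},
	}
}

func main() {
	out, _ := json.MarshalIndent(ledgerPolicy([]string{"service.pot"}), "", "  ")
	fmt.Println(string(out))
}
```

The key property is that the whitelist is just data attached to the ledger: adding a caller means adding one more labelled peer to the ingress rule.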

When an engineer needs to add a new caller to the ledger, they add it to this list and redeploy the ledger. We store the policy file in the same place as the ledger's code, so that when you change it, the finance team have to review it. This lets them keep track of who's calling such a critical service, which is important for making sure we stay in control. We also have an automated check that runs on new code and reminds engineers to whitelist their calling service if it calls the ledger.
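Conceptually that check is simple: take the callers that static analysis finds for service.ledger and compare them against the whitelist. Here's a hedged sketch, assuming a hypothetical plain-text whitelist with one calling service per line; our real file format and tooling differ.

```go
// Sketch of a "did you whitelist yourself?" check: given the ledger's
// whitelist file and the callers found in new code, flag any caller that
// isn't on the list. The one-service-per-line file format is hypothetical.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: check <whitelist-file> <caller> [<caller> ...]")
		os.Exit(2)
	}
	whitelistPath := os.Args[1]
	callers := os.Args[2:]

	f, err := os.Open(whitelistPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	allowed := map[string]bool{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if line := strings.TrimSpace(scanner.Text()); line != "" {
			allowed[line] = true
		}
	}

	failed := false
	for _, caller := range callers {
		if !allowed[caller] {
			failed = true
			fmt.Printf("%s calls service.ledger but isn't in %s: add it and redeploy the ledger\n", caller, whitelistPath)
		}
	}
	if failed {
		os.Exit(1)
	}
}
```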