In the early hours of Tuesday, 6 April, we had a critical outage lasting ~4 hours in total. Issues started at around 3 AM (GMT+0) and were resolved at around 7 AM (GMT+0).

Diagnosis

Multiple things went wrong on Tuesday, starting with an attack that wasn't properly mitigated. Because we store all sessions in memory, the attack caused one of our web nodes to run out of memory. The kernel tried to prevent a meltdown, but happened to kill the database driver in the process.

Normally this wouldn't be an issue on its own. The database would recover itself and traffic would be temporarily redirected. However, due to an issue we aren't able to exactly pinpoint, another node had been having issues for a long time, causing k8s to restart parts of the software, including the database. When two nodes are down, the third and last node shuts itself down to prevent a split-brain scenario.
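
To illustrate why that last node steps down: the database cluster only accepts writes while it can see a strict majority of its members. The sketch below is purely illustrative - it isn't our actual code - and just shows that majority rule in Python.

    # Illustrative only: the majority-quorum check a clustered database
    # uses to avoid split-brain writes.
    def has_quorum(healthy_nodes: int, total_nodes: int) -> bool:
        # A node may accept writes only while it can see a strict majority
        # of the cluster; otherwise two isolated halves could both accept
        # conflicting writes (a split brain).
        return healthy_nodes > total_nodes // 2

    # With 3 nodes, losing 2 leaves 1 healthy node, which is not a
    # majority, so the survivor stops serving rather than risk divergence.
    assert has_quorum(3, 3) is True
    assert has_quorum(2, 3) is True
    assert has_quorum(1, 3) is False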

This didn't need to happen, and could have been prevented in multiple ways. We've spent the last two days making sure this won't happen again, and making changes to better prepare us for future issues. Our response also took much longer than it should have, due to communication issues between essential team members.

Response

Because we're dealing with a situation where multiple smaller issues all contributed to a larger one, let's break them down:

  1. The attack shouldn't have reached our servers.
  2. The node shouldn't have run out of memory.
  3. Our infrastructure should have more capacity.
  4. Essential team members should be more accessible out of hours.

1. Mitigation improvements

At the end of the day, we would rather have some of the attack leak through than block some legitimate users, and we don't want to have human/JavaScript challenges. It's important that the user experience isn't impacted by DDoS protection. That increases the risk of attacks leaking through, but we've decided that risk is worth the benefit in 99% of situations.

We've been working with upstream providers on multiple measures to better mitigate attacks like this in the future, and I can confirm the attack has been correctly mitigated multiple times since this outage. We've also made changes to better protect our servers against sudden bursts of traffic, which has significantly reduced how much of an attack leaks through, without impacting legitimate users.
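
We haven't detailed those changes here, but as a rough illustration of the general idea, burst protection is often built on something like a token bucket: each source gets a steady allowance of requests plus a bounded burst, so short spikes are absorbed while sustained floods are shed. The sketch below is a minimal Python example of that technique, not our production setup.

    import time

    # Illustrative only: a minimal token-bucket rate limiter, the kind of
    # mechanism commonly used to absorb short bursts without blocking
    # steady, legitimate traffic.
    class TokenBucket:
        def __init__(self, rate_per_sec: float, burst: int):
            self.rate = rate_per_sec      # tokens refilled per second
            self.capacity = burst         # maximum burst size
            self.tokens = float(burst)
            self.updated = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill tokens for the elapsed time, capped at the burst size.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # over budget: shed or queue the request

    limiter = TokenBucket(rate_per_sec=100, burst=200)
    if limiter.allow():
        pass  # handle the request normally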

2. Configuration improvements

The initial OOM was caused by our session driver. This doesn't need to happen, and can be prevented in a few ways. We've already been considering whether we even want to store sessions in memory for WISP v2. For now, we've made some changes to prevent Redis from eating all the RAM anyway.
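
For context, the usual way to stop Redis from consuming all of a node's RAM is to give it a hard memory cap and an eviction policy. The sketch below shows that approach using the redis-py client; the host, limit, and policy are illustrative examples rather than our production values.

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Illustrative only: hard cap on the memory Redis may use for data.
    r.config_set("maxmemory", "2gb")

    # When the cap is hit, evict least-recently-used keys instead of
    # growing until the kernel's OOM killer steps in. Losing an idle
    # session is a far smaller failure than taking the whole node down.
    r.config_set("maxmemory-policy", "allkeys-lru")

The same limits can also be made permanent in redis.conf; setting them at runtime just avoids a restart.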

3. Infrastructure improvements

WISP is currently running on the infrastructure it launched on: 3 nodes across 3 continents. This has worked well and allowed us to enter the market at a competitive price. It was the right thing for us to do at the time. However, we've learnt a lot since launch and, while we don't strictly need to upgrade our infrastructure yet, we've already been planning for it.

We decided to totally overhaul the infrastructure a while back, and we're looking at increasing our redundancy from 3 nodes to either 5 or 7. This will increase the number of simultaneous failures we can tolerate before an outage happens: with a majority quorum, 3 nodes tolerate 1 failure, 5 tolerate 2, and 7 tolerate 3. We expect this to take place after v2 releases.

Our requirements have shifted from our initial plans, so the new hardware will be better suited to them, with greater provider and geographical diversity. All of the new nodes will be more powerful than the most powerful node we currently use.

4. Human improvements

The first person to notice the outage was our newest team member. After being unable to raise Stepan, he made the call to wake me up. We normally plan to have ways to reach essential team members in situations like this. It's rare that we need to raise a team member on demand, but when we do, it's typically not Stepan. It appears that, over time, that channel of communication wasn't maintained, which significantly delayed our response to the issue.

This is probably the biggest failure of this incident. Essential team members couldn't be contacted due to the unfortunate timing - this is a complicated issue to address, but I'll start with what we were doing before the incident. Last week we brought on a new team member in an effort to make our team less reliant on key members. Viction works on system administration and development, currently focusing mostly on our WISP v2 update.

Viction will be an additional team member who can step in to work on technical issues like this in the future. For now, though, we've made changes so that team members can reach each other more effectively in urgent situations. We hope that both of these efforts will decrease our "bread truck" factor.