Fragile narrow laggy asynchronous mismatched pipes kill productivity | Hacker News
Something I’ve been thinking about recently is how, when I’ve worked on any kind of distributed system, including systems as simple as a web app with frontend and backend code, probably upwards of 80% of my time is spent on things I wouldn’t need to do if it weren’t distributed. I came up with the following description of why I think this kind of programming requires so much effort: Everything is fragile narrow laggy asynchronous mismatched untrusted pipes. I think every programmer who’s worked on a networked system has encountered each of these issues; this is just my effort to coherently describe all of them in one place. I hope to prompt you to consider all the different hassles at once and think about how much harder/easier your job would be if you did/didn’t have to deal with these things. I think this is part of why web companies like Twitter seem to produce so much less impressive output per engineer than places like game companies or SpaceX, although there are other pieces to that puzzle.

While part of the difficulty of distributed systems is inherent in physics, I think there are lots of ideas for making each part of the problem easier, many already in common use, and I’ll try to mention lots of them. I hope that we as programmers continually develop more of these techniques, and especially general implementations that simplify a problem. Just as serialization libraries reduced the need for hand-written parsers/writers, I think there’s a lot of developer time out there to save by implementing generalized solutions where we currently painstakingly reimplement common patterns. I also think all these costs mean you should try really hard to avoid making your system distributed if you don’t have to.
I’ll go over each piece in detail, but briefly, whenever we introduce a network connection we usually have to deal with something that is:

- Fragile
- Narrow
- Laggy
- Asynchronous
- Mismatched
- Untrusted
All of these things can be mostly avoided when programming things that run on one computer, at least until you end up optimizing performance and realize your computer is actually a distributed system of cores, at which point some of them come back. Some domains manage to avoid some of these, but I’ve experienced subsets of these problems working on web apps, self-driving cars, a text editor, and high-performance systems; they’re everywhere.
These aren’t even all the problems, just the ones about the network itself. Tons of effort is also expended on things like how various bottlenecks often entail a complicated hierarchy of caches that need to be kept in sync with the underlying data store.
One way you can avoid all this is to just not write a distributed system. There are plenty of cases where you can do this, and I think it’s worthwhile to try way harder than some people do to pack everything into one process. However, past a certain point of reliability or scale, physics means you’re going to have to use multiple machines (unless you want to go the mainframe route).
As you connect machines or increase reliability goals, the strategy of just crashing everything when one piece crashes (what multi-threaded/multi-core systems do) becomes increasingly unviable. Hardware will fail, wireless connections drop, and entire data centers have their power or network taken out by squirrels. Some domains, like serving customers with flaky internet, also inevitably entail frequent connection failures.
In practice you need to write code to handle the failure cases and think carefully about what they are and what to do. This gets worse when merely noting the failure would drop important data, and you need to implement redundancy of data storage or transmission. Even worse, both another machine failing and a network connection breaking become visible in the same way: an expected network packet not arriving after “too long”. This introduces not only a delay but an ambiguity that can result in split-brain issues. Often something like TCP implements the timeout detection for you, but sometimes you have to implement your own heartbeating to periodically check that another system is still alive.
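To make the ambiguity concrete, here’s a minimal sketch (names and timeout value are my own, not from any particular library) of a heartbeat monitor that tracks when a peer was last heard from. Note that it can only ever report suspicion, not certainty:

```python
import time

class HeartbeatMonitor:
    """Tracks the last time a peer was heard from and suspects it
    is dead after a timeout elapses with no heartbeat."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        self.timeout_secs = timeout_secs
        self.clock = clock  # injectable clock, so tests don't have to sleep
        self.last_seen = clock()

    def record_heartbeat(self):
        # Call this whenever any packet arrives from the peer.
        self.last_seen = self.clock()

    def peer_alive(self):
        # We can only *suspect* failure: a False here could mean the
        # peer crashed, the link broke, or the network is just slow.
        return (self.clock() - self.last_seen) < self.timeout_secs
```

The injectable clock is the only non-obvious choice here; it makes the timeout logic testable without real waiting. The fundamental ambiguity remains regardless of how the code is structured.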
Attempts to make this easier include exceptions, TCP, consensus protocols and off-the-shelf redundant databases, but no solution eliminates the problem everywhere. One of my favourite attempts is Erlang’s process linking, monitoring and supervising, which offers a philosophy that attempts to coalesce all sorts of failures into one easier-to-handle general case.
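The supervision idea can be sketched in a few lines of Python (a toy stand-in for Erlang’s supervisors, with made-up names): rather than handling every distinct failure mode inline, any exception surfaces in one place as a crashed worker that gets restarted:

```python
def supervise(worker, max_restarts=3):
    """Run `worker`, restarting it whenever it raises, up to
    max_restarts times. All failure modes collapse into one case:
    the worker crashed, so restart it from a clean state."""
    restarts = 0
    while True:
        try:
            return worker()
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # escalate to whatever supervises the supervisor
```

Real Erlang supervisors do much more (restart strategies, linked process trees, escalation policies), but the core move is the same: turn many failure cases into one.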
Network bandwidth is often limited, especially over consumer or cellular internet. It may seem like this isn’t a limitation very often because you rarely hit bandwidth limits, but that’s because limited bandwidth is ingrained into everything you do. Whenever you design a distributed system you need to come up with a communication protocol that communicates on the order of what’s necessary rather than on the order of the total size of your data.
In a multi-threaded program, you might just pass a pointer to gigabytes of immutable or locked data for a thread to read what it wants from and not think anything of it. In a distributed system passing the entire memory representing your database is unthinkable and you need to spend time implementing other approaches.
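One common pattern for the narrow-pipe problem is to send deltas instead of full state. Here’s a small illustrative sketch (my own hypothetical functions, not a real protocol) for dictionary-shaped state, where only changed keys and deletions cross the wire:

```python
def diff_state(old, new):
    """Compute a minimal update to transmit: keys that were added or
    changed, plus keys that were deleted."""
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    deleted = [k for k in old if k not in new]
    return {"set": changed, "del": deleted}

def apply_delta(state, delta):
    """Apply a delta from diff_state to reconstruct the new state."""
    state = dict(state)  # copy; don't mutate the caller's state
    state.update(delta["set"])
    for k in delta["del"]:
        state.pop(k, None)
    return state
```

Real systems layer much more on top (ordering, compression, conflict resolution), but the principle is the same: the protocol’s traffic scales with what changed, not with the total size of the data.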