I was wrong. CRDTs are the future

I saw Martin Kleppmann’s talk a few weeks ago about CRDTs, and I felt a deep sense of despair. Maybe all the work I’ve been doing for the past decade won’t be part of the future after all, because Martin’s work on CRDTs will supersede it. Its really good.

Lets back up a little.

Around 2010 I worked on Google Wave. Wave was an attempt to make collaboratively editable spaces to replace email, google docs, web forums, instant messaging and a hundred other small single purpose applications. Wave had a property I love in my tools that I haven’t seen articulated anywhere: It was a general purpose medium (like paper). Unlike a lot of other tools, it doesn’t force you into its own workflow. You could use it to do anything from plan holidays, make a wiki, play D&D with your friends, schedule a meeting, etc.

Internally, Wave’s collaborative editing was built on top of Operational Transform (OT). OT has been around for awhile - the algorithm we used was based on the original Jupiter paper from 1995. It works by storing a chronological list for each document of every change. “Type an H at position 0”. “Type a i at position 1”. Etc. Most of the time, users are editing the latest version of the document and the operation log is just a list of all the changes. But if users are collaboratively editing, we get concurrent edits. When this happens, the first edit to arrive at the server gets recorded as usual. If the second edit is out of date, we use the log of operations as a reference to figure out what the user really intended. (Usually this just means updating character positions). Then we pretend as if thats what the user meant all along and append the new (edited) operation. Its like realtime git-rebase.

Once Wave died, I reimplemented the OT model in ShareJS. This was back when node was new and weird. I think I had ShareJS working before npm launched. It only took about 1000 lines of code to get a simple collaborative editor working, and when I first demoed it I collaboratively edited a document in a browser and from a native application.

At its heart, OT is a glorified for() loop with some helper functions to update character offsets. In practice, this works great. OT is simple and understandable. Implementations are fast. (10k-100k operations per second in unoptimized javascript. 1-20M ops/sec in optimized C.). The only storage overhead is the operation log, and you can trim that down if you want to. (Though you can’t merge super old edits if you do). You need a centralized server to globally order operations, but most systems have a centralized server / database anyway, right?

Centralized servers

The big problem with OT is that dependancy on a centralized server. Have you ever wondered why google docs shows you that weird “This document is overloaded so editing is disabled” thing when a document is shared to social media? The reason (I think) is that when you open a google doc, one server is picked as the computer all the edits run through. When the mob descends, google needs to pull out a bunch of tricks so that computer doesn’t becomes overwhelmed.

There’s some workarounds they could use to fix this. Aside from sharding by document (like google docs), you could edit via a retry loop around a database transaction. This pushes the serialization problem to your database. (Firepad and ShareDB work this way).

Its not perfect though. We wanted Wave to replace email. Email is federated. An email thread can span multiple companies and it all just works. And unlike facebook messenger, emails are only be sent to the companies that are CC’ed. If I email my coworker, my email doesn’t leave the building. For Wave to replace email, we needed the same functionality. But how can that work on top of OT? We got it working, kinda, but it was complex and buggy. We ended up with a scheme where every wave would arrange a tree of wave servers and operations were passed up and down the tree. But it never really worked. I gave a talk at the Wave Protocol Summit just shy of 10 years ago explaining how to get on the network. I practiced that talk, and did a full runthrough. I followed literally step by step on the day and the version I made live didn’t work. I still have no idea why. Whatever the bugs are, I don’t think they were ever fixed in the opensource version. Its all just too complicated.

The rise of CRDTs

Remember, the algorithm Wave used was invented in 1995. Thats a pretty long time ago. I don’t think I even had the internet at home back in 1995. Since then, researchers have been busy trying to make OT work better. The most promising work uses CRDTs (Conflict-Free Replicated data types). CRDTs approach the problem slightly differently to allow realtime editing without needing a central source of truth. Martin lays out how they work in his talk better than I can, so I’ll skip the details.

People have been asking me what I think of them for many years, and my answer was always something like this:

They’re neat and I’m glad people are working on them but:

They’re slow. Like, really slow. Eg Delta-CRDTs takes nearly 6 hours to process a real world editing session with a single user typing a 100KB academic paper. (Benchmarks - look for B4.)
Because of how CRDTs work, documents grow without bound. The current automerge (automerge/automerge: A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.) master takes 83MB to represent that 100KB document on disk. Can you ever delete that data? Probably not. And that data can’t just sit on disk. It needs to be loaded into memory to handle edits. (Automerge currently grows to 1.1GB in memory for that.)
CRDTs are missing features that OT has had for years. For example, nobody has yet made a CRDT that supports /object move/ (move something from one part of a JSON tree to another). You need this for applications like Workflowy. OT handles this fine.
CRDTs are complicated and hard to reason about.
You probably have a centralized server / database anyway.

I made all those criticisms and dismissed CRDTs. But in doing so I stopped keeping track of the literature. And - surprise! CRDTs went and quietly got better. Martin’s talk (which is well worth a watch) addressed the main points: