The papers over the next few weeks will be from (or related to) research from VLDB 2021 - on the horizon is one of my favorite systems conferences SOSP. As always, feel free to reach out on Twitter with feedback or suggestions about papers to read! These paper reviews can be delivered weekly to your inbox, or you can subscribe to the Atom feed.
TAO: Facebook’s Distributed Data Store for the Social Graph
This is the first in a two-part series on TAO, Facebook’s read-optimized, eventually-consistent graph database. Unlike graph databases focused on data analysis, TAO focuses exclusively on serving and caching a constrained set of application requests at scale. Furthermore, the system builds on Facebook’s expertise scaling MySQL and memcache, as discussed in a previous paper review.
The first paper in the series is the original TAO paper, which describes the motivation for building the system, its architecture, and engineering lessons learned along the way. The second part focuses on TAO-related research published at this year’s VLDB - RAMP-TAO: Layering Atomic Transactions on Facebook’s Online TAO Data Store. This newer paper describes the design and implementation of transactions on top of the existing large-scale distributed system - a task made more difficult by the requirements that applications be able to migrate gradually to the new functionality and that supporting transactions have limited impact on the performance of existing applications.
The original TAO paper makes three contributions - characterizing and motivating a graph database implementation suited for Facebook’s read-heavy traffic, providing a data model and developer API for the aforementioned database, and describing the architecture that allowed the database to scale.
The paper begins with a section describing the motivation for TAO’s initial development. When Facebook was originally developed, MySQL was used as the datastore for the graph. As the site scaled, a memcache layer was added in front of the MySQL databases to lighten their load.
While inserting memcache into the stack worked for some period of time, the paper cites three main problems with the implementation: inefficient edge lists, distributed control logic, and expensive read-after-write consistency.
Application developers within Facebook used edge lists to represent aggregations of information in the graph - for example, the list of a user’s friendships (each friendship is an edge in the graph, and the users are the nodes). Unfortunately, maintaining these lists in memcache was inefficient - memcache is a simple key-value store without support for lists, meaning that common list-related functionality (like that supported in Redis) is inefficient. If a list needed to be updated (say, for example, a friendship is deleted), the logic to update the list would be complicated - in particular, the part of the logic related to coordinating the update across several copies of the same data in multiple data centers.
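To make the inefficiency concrete, here is a minimal Python sketch of maintaining an edge list in a plain key-value store. The store, key format, and helper functions are all hypothetical stand-ins (not Facebook’s actual code): without list operations, every edge change becomes a read-modify-write of a serialized blob.

```python
import json

# Hypothetical in-memory stand-in for a memcache cluster: a plain
# key-value store with no list operations.
cache = {}

def add_friend(cache, user_id, friend_id):
    # Read-modify-write: fetch the whole serialized list, mutate it,
    # then write the entire list back under one key.
    key = f"friends:{user_id}"
    raw = cache.get(key)
    friends = json.loads(raw) if raw is not None else []
    if friend_id not in friends:
        friends.append(friend_id)
    cache[key] = json.dumps(friends)

def remove_friend(cache, user_id, friend_id):
    # Deleting a single edge still rewrites the entire list - and in the
    # real system that rewrite had to be coordinated across copies of the
    # same data in multiple data centers.
    key = f"friends:{user_id}"
    raw = cache.get(key)
    friends = json.loads(raw) if raw is not None else []
    cache[key] = json.dumps([f for f in friends if f != friend_id])

add_friend(cache, 1, 2)
add_friend(cache, 1, 3)
remove_friend(cache, 1, 2)
```

By contrast, a store with native list support (like Redis) can append or remove a single element server-side, without shipping the whole list back and forth.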
Control logic (in the context of Facebook’s graph store architecture) means the ability to manipulate how the system is accessed. Before TAO was implemented, the graph data store had distributed control logic - clients communicated directly with the memcache nodes, and there was no single point of control to gate client access to the system. This property made it difficult to guard against misbehaving clients and thundering herds.
Read-after-write consistency means that if a client writes data, then performs a read of the data, the client should see the result of the write that it performed. If a system doesn’t have this property, users might be confused - “why did the like button they just pressed not register when they reloaded the page?”.
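A toy sketch of the property, assuming a leader replica and an asynchronously replicated follower (all names here are illustrative, not TAO internals): a client that routes its own reads to a replica guaranteed to hold its recent writes sees read-after-write consistency, while a read against a lagging follower may not.

```python
# Two replicas with asynchronous replication: writes land on the leader
# first and reach the follower only after some delay (never, in this toy).
leader = {}
follower = {}

def write(key, value):
    leader[key] = value  # replication to the follower is asynchronous

def read_from_follower(key):
    # May return stale data (or nothing) if replication hasn't caught up.
    return follower.get(key)

def read_your_writes(key):
    # One way to provide read-after-write consistency: route a client's
    # reads to the replica guaranteed to contain its recent writes.
    return leader.get(key)

write("post:42:liked", True)
read_from_follower("post:42:liked")  # stale: the like isn't visible yet
read_your_writes("post:42:liked")    # the client sees its own write
```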
Ensuring read-after-write consistency was expensive and difficult for Facebook’s memcache-based system, which used MySQL databases with master/slave replication to propagate database writes between datacenters. While Facebook developed internal technology to stream changes between databases, existing systems that used the MySQL and memcache combo relied on complicated cache-invalidation logic (for example, followers would forward reads for cache keys invalidated by a write to the leader database, increasing load and incurring potentially slow inter-regional communication) that incurred networking overhead. The goal of the new system was to avoid this overhead (with an approach described later in the paper).
TAO is an eventually consistent, read-optimized data store for the Facebook graph (for a description of eventual consistency and related topics, I highly recommend this post from Werner Vogels).
It stores two types of entities - objects and associations (the relationships between objects). Now we get to learn why the graph datastore is called TAO - the name is an abbreviation that stands for “The Associations and Objects”.
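As a rough sketch of the data model, objects are typed nodes identified by an id, and associations are typed, directed edges between two object ids carrying a timestamp used to order edge lists. The Python dataclasses below are illustrative, not TAO’s exact schema or API:

```python
from dataclasses import dataclass, field

@dataclass
class Object:
    """A typed node in the graph, e.g. a user, check-in, or comment."""
    id: int
    otype: str                              # object type, e.g. "user"
    data: dict = field(default_factory=dict)  # key-value attributes

@dataclass
class Association:
    """A typed, directed edge from object id1 to object id2."""
    id1: int                                # source object id
    atype: str                              # association type, e.g. "friend"
    id2: int                                # destination object id
    time: int                               # timestamp, used to order edge lists
    data: dict = field(default_factory=dict)

alice = Object(id=1, otype="user", data={"name": "Alice"})
bob = Object(id=2, otype="user", data={"name": "Bob"})
friendship = Association(id1=1, atype="friend", id2=2, time=1_000)
```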
As an example of how objects and associations are used to model data, consider two common social network functions: