The Infrastructure Behind Twitter: Scale

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale

Overview of Twitter Fleet

Twitter came of age when hardware from physical enterprise vendors ruled the data center. Since then we’ve continually engineered and refreshed our fleet to take advantage of the latest open standards in technology and hardware efficiency in order to deliver the best possible experience.

Our current distribution of hardware is shown below:

Network Traffic

We started to migrate from third party hosting in early 2010, which meant we had to learn how to build and run our infrastructure internally, and with limited visibility into the core infrastructure needs, we began iterating through various network designs, hardware, and vendors.

By late 2010, we finalized our first network architecture which was designed to address the scale and service issues we encountered in the hosted colo. We had deep buffer ToRs to support bursty service traffic and carrier grade core switches with no oversubscription at that layer. This supported the early version of Twitter through some notable engineering achievements like the TPS record we hit during Castle in the Sky and World Cup 2014.

Fast forward a few years and we were running a network with POPs on five continents and data centers with hundreds of thousands of servers. In early 2015 we started experiencing some growing pains due to changing service architecture and increased capacity needs, ultimately reaching physical scalability limits in the data center when a full mesh topology would not support additional hardware needed to add new racks. Additionally, our existing data center IGP began to behave unexpectedly due to this increasing route scale and topology complexity.

To address this, we started to convert existing data centers to a Clos topology + BGP – a conversion which had to be done on a live network, yet, despite the complexity, was completed with minimal impact to services in a relatively short time span. The network now looks like this:

Highlights of the new approach:

Smaller blast radius of a single device failure.
Horizontal bandwidth scaling capabilities.
Lower routing engine CPU overhead; far more efficient processing of route updates.
Higher route capacity due to lower CPU overhead.
More granular control of routing policy on a per device and link basis.
No longer exposed to several known root causes of prior major incidents: increased protocol reconvergence times, route churn issues and unexpected issues with inherent OSPF complexities.
Enables non-impacting rack migrations.

Let’s expand on our network infrastructure below.