Realtime processing technologies (A.K.A stream processing) is one of the key factors that enable Netflix to maintain its leading position in the competition of entertaining our users. Our previous generation of streaming pipeline solution Keystone has a proven track record of serving multiple of our key business needs. However, as we expand our offerings and try out new ideas, there’s a growing need to unlock other emerging use cases that were not yet covered by Keystone. After evaluating the options, the team has decided to create Data Mesh as our next generation data pipeline solution.
Last year we wrote a blog post about how Data Mesh helped our Studio team enable data movement use cases. A year has passed, Data Mesh has reached its first major milestone and its scope keeps increasing. As a growing number of use cases on board to it, we have a lot more to share. We will deliver a series of articles that cover different aspects of Data Mesh and what we have learned from our journey. This article gives an overview of the system. The following ones will dive deeper into different aspects of it.
Previously, we defined Data Mesh as a fully managed, streaming data pipeline product used for enabling Change Data Capture (CDC) use cases. As the system evolves to solve more and more use cases, we have expanded its scope to handle not only the CDC use cases but also more general data movement and processing use cases such that:
As a result, today we define Data Mesh as a general purpose data movement and processing platform for moving data between Netflix systems at scale.
The Data Mesh system can be divided into the control plane (Data Mesh Controller) and the data plane (Data Mesh Pipeline). The controller receives user requests, deploys and orchestrates pipelines. Once deployed, the pipeline performs the actual heavy lifting data processing work. Provisioning a pipeline involves different resources. The controller delegates the responsibility to the corresponding microservices to manage their life cycle.
A Data Mesh pipeline reads data from various sources, applies transformations on the incoming events and eventually sinks them into the destination data store. A pipeline can be created from the UI or via our declarative API. On the creation/update request the controller figures out the resources associated with the pipeline and calculates the proper configuration for each of them.
A source connector is a Data Mesh managed producer. It monitors the source database’s bin log and produces CDC events to the Data Mesh source fronting Kafka topic. It is able to talk to the Data Mesh controller to automatically create/update the sources.
Previously we only had RDS source connectors to listen to MySQL and Postgres using the DBLog library; Now we have added Cockroach DB source connectors and Cassandra source connectors. They use different mechanisms to stream events out of the source databases. We’ll have blog posts deep dive into them.
In addition to managed connectors, application owners can emit events via a common library, which can be used in circumstances where a DB connector is not yet available or there is a preference to emit domain events without coupling with a DB schema.