The Future of the Modern Data Stack

The Modern Data Stack is quickly picking up steam in tech circles as the go-to cloud data architecture, but despite its rising popularity it is often ambiguously defined. In this blog post we’ll discuss what it is, how it came to be, and where we see it going in the future. Whether you’re new to the modern data stack or were an early adopter, there should be something of interest here for you.

The Modern Data Stack commonly refers to a collection of technologies that together make up a cloud-native data platform, generally adopted to reduce the complexity of running a traditional data platform. The individual components are not fixed, but they typically include:

A Cloud Data Warehouse, such as Snowflake, Redshift, BigQuery, or Databricks Delta Lake
A Data Integration Service, such as Fivetran, Segment, or Airbyte
An ELT data transformation tool, almost certainly dbt
A BI layer, such as Looker or Mode
A Reverse ETL tool, such as Census or Hightouch
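
To make the division of labor between these components concrete, here is a minimal sketch of the flow in Python. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the cloud data warehouse, and the table names, sample rows, and payload shape are all hypothetical. None of the vendors' actual APIs are used.

```python
import sqlite3

# Stand-in for a cloud data warehouse (assumption: sqlite3 is used only so
# the sketch runs locally; a real stack would use Snowflake, BigQuery, etc.).
warehouse = sqlite3.connect(":memory:")

# 1. Extract + Load: a data integration service lands source rows as-is.
#    These rows and the raw_orders table are hypothetical.
raw_orders = [
    (1, "alice", 30.0, "complete"),
    (2, "bob", 12.5, "cancelled"),
    (3, "alice", 20.0, "complete"),
    (4, "bob", 5.0, "complete"),
]
warehouse.execute(
    "CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount REAL, status TEXT)"
)
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

# 2. Transform: modeling happens inside the warehouse, in SQL (the "T" comes
#    after the "L", which is what distinguishes ELT from ETL).
warehouse.execute(
    """
    CREATE TABLE customer_revenue AS
    SELECT customer, SUM(amount) AS revenue
    FROM raw_orders
    WHERE status = 'complete'
    GROUP BY customer
    """
)

# 3. BI layer: dashboards query the modeled table, not the raw data.
report = warehouse.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY revenue DESC"
).fetchall()

# 4. Reverse ETL: modeled rows are shaped into payloads for an operational
#    tool (the dict layout here is a hypothetical example).
payloads = [{"customer": c, "revenue": r} for c, r in report]
print(payloads)
```

The point of the sketch is the ordering: raw data is loaded first, transformed with SQL where it already lives, and only then consumed by reporting or pushed back out to other systems.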

The goal is to make data actionable by reducing the time it takes for data to become useful to data workers in an organization. Gone are the days when it took weeks for data to land in your company’s analytical warehouse after creation. Now it happens in hours or minutes. Companies that go down the path of the modern data stack adopt the technology as it fits their needs; you don’t necessarily need every component, and some may add other technologies, like Airflow, Dagster, or Prefect, as an orchestration layer. A simple sample architecture is illustrated below.

Simply having a data platform in the cloud does not make it a “modern data stack.” In fact, I would wager that most cloud architectures fail to meet the categorization. Things like lift-and-shifted platforms, cloud data lakes, and bespoke solutions rarely capture the essence of the modern data stack and often feel as clunky as their on-premises cousins. So what makes something part of the modern data stack? If we look across the technologies in this ecosystem, we’ll notice that they share some common properties that get at the core of the modern data stack. I’ll propose the following as key capabilities of technology in the modern data stack:

Offered as a Managed Service: Requires little or no setup and configuration from users, and absolutely no engineering. Users can get started today, and it’s not a vapid marketing promise.
Centered around a Cloud Data Warehouse (CDW): Everything “just works” off-the-shelf if companies use a popular CDW. By being opinionated about where your data is, you eliminate messy integrations and tools play well together.
Democratizes data via a SQL-Centric Ecosystem: Tools are built for data/analytics engineers and business users. These users often know the most about a company’s data, so it makes sense to try to upskill them by giving them tools that speak their language.
Elastic Workloads: Pay for what you use. Scale up instantly to handle large workloads. Money is the only scale limitation in the modern cloud.
Focus on Operational Workflows: Point-and-click tools are nice for low-tech users, but it’s all kind of meaningless if there’s not a viable path to production. Modern data stack tools are often built with automation as a core competency.

Users of the modern data stack routinely sing its praises. By adopting the modern data stack, companies get a low-cost platform that’s easy to set up, easy to use, and requires little expertise to churn out production workflows. It’s easy to see why so many have jumped on this trend and are never going back.

How it Started / How it’s Going

In the beginning, we stored data by drawing pictures on the walls of caves. Sometime later (1970), Edgar F. Codd invented the relational data model and published A Relational Model of Data for Large Shared Data Banks, which is credited with starting the RDBMS craze. By modern standards this new technology was slow to get off the ground, but over the next couple of decades many companies started offering databases to customers: IBM, Oracle, Microsoft, Teradata, etc. New technologies also emerged that made working with databases easier, such as data integration and reporting tools, and a new language was created, SQL, that made working with data in your database relatively straightforward (i.e. no coding necessary). And, for decades, on-prem databases were perfectly sufficient for the vast majority of use cases that companies were trying to solve with the small amounts of data they stored.

Over time, Moore’s Law, the creation of the Internet, and user-generated content meant that it was no longer a crazy or ridiculously expensive idea for the average company to store and analyze large amounts of data. This was great for businesses, but not so great for relational databases, which were not designed to handle large-scale data operations. Enter Big Data: the 2000s saw the advent of many new types of systems that were tailor-made to handle large volumes of data, such as Hadoop, Vertica, MongoDB, and Netezza. These systems were typically of the distributed SQL or NoSQL variety, focused on parallelizing data operations across clusters of servers, and, like their predecessors, they evolved quickly to handle a variety of use cases. Finally, enterprises had a viable option for handling large volumes of data.

Big Data’s reign lasted less than a decade before it was disrupted by burgeoning cloud technology in the early and mid 2010s. The traditionally on-premises big data technologies struggled to shift into the cloud, and the complexity, cost, and expertise required to operate and maintain these platforms couldn’t compete with the much more nimble cloud platforms. Soon, data warehouses were back, this time in the cloud; they offered the same simplicity and ease of use as prior iterations of the RDBMS, but now you could forgo the team of DBAs and, additionally, many of them were built to handle big data workloads. The shift started with smaller companies that lacked the manpower required for big data solutions, as the SaaS-oriented cloud environment drastically reduced the barrier to entry, but larger companies quickly joined the movement to simplify and reduce costs via elastic workloads. Separating compute and storage became fashionable once again when you only paid for the compute when you used it. Seemingly overnight, everyone was migrating their data platforms to the cloud.