Managing data is hard. Managing data pipelines is even harder. The meaning of individual tables or values in your data warehouse gets lost in translation across organizations. Another team’s refactor breaks your team’s pipeline. And it’s often difficult to tell who made what change, and when.

dbt (data build tool) alleviates these frustrations by taking over the transformation step in your ETL pipelines. dbt is not itself a data processor; instead, it sits on top of the data warehouse that already contains your extracted and loaded data. dbt allows teams to easily test, document, and version-control their data transformations.

While dbt is a great tool for transforming batch data, it cannot currently transform streaming data in real time. (The dbt team explicitly warns users about this in a few places.) Here at Materialize, we want to help the world stop batching and start streaming. So we built a dbt adapter that allows you to transform your streaming data in real time, using Materialize as your data warehouse.

The rest of this post will explore why dbt works best with batch data and how using Materialize unlocks streaming transformations. If you’re eager to get started, the dbt-materialize adapter is here, and our sample streaming project is here. Note: The dbt-materialize adapter is an active work in progress and not yet suitable for production use cases. Please file issues or submit PRs as you see fit; we love feedback!

dbt and batch data

dbt is great at transforming batch data, but it cannot transform streaming data efficiently in real time. To understand why, let’s take a look at how dbt transforms data under the hood.

dbt users define their desired transformations using dbt “models”. A dbt model is a SQL file containing a SELECT statement that defines the transformation, along with configuration such as the model’s materialization type.

dbt transforms your data each time you “run” a model. Each time a model is run, dbt queries the underlying data warehouse using that model’s SELECT statement. The result set of the query (the transformed data) is then either returned directly to the user or persisted into your data warehouse, depending on the model’s materialization parameter.
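For instance, a minimal model file might look like the following sketch; the model name, columns, and the upstream stg_orders model are all hypothetical:

```sql
-- models/customer_orders.sql (hypothetical model)
-- The config block sets the model's materialization parameter.
{{ config(materialized='table') }}

SELECT
    customer_id,
    count(*)    AS order_count,
    sum(amount) AS total_spend
FROM {{ ref('stg_orders') }}  -- ref() resolves another model in the project
GROUP BY customer_id
```

Running `dbt run` compiles the Jinja templating, executes the SELECT against your warehouse, and, because of the materialized='table' setting, persists the result set as a table.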

Currently, dbt supports four types of materializations: table, view, incremental, and ephemeral. The table and incremental materializations persist a table, the view materialization creates a view, and the ephemeral materialization, instead of persisting anything, returns results directly using a common table expression (CTE). The good news is that these database objects are totally sufficient to transform batch data. The bad news is that none of these database objects transform streaming data efficiently.

First, what do I mean by batch and streaming data? Batch data, as the name suggests, is any type of data that arrives in discrete batches. This can be once a minute, once an hour, or once a day. The important thing is that no new data arrives between batches. Streaming data, on the other hand, arrives continually and on no particular schedule.

So, why are these database objects sufficient to transform batch data, but not able to efficiently transform streaming data?

Views and CTEs do not physically persist data to your data warehouse. This means that each time you query a model that uses a view or CTE, your data warehouse must re-transform the underlying source data. And, each time you transform your source data, you are paying some cost. While views and CTEs always return up-to-date transformations of your batch and streaming data, they do not do so efficiently.

Tables, on the other hand, do physically persist data. More specifically, a table persists the result set from the last time its model was run via “dbt run”. Unlike views and CTEs, this means that you won’t pay the price of transforming data each time your table is queried. But it also means that your transformed data can quickly become stale as new data arrives. This is not an issue with batch data, because you can simply “dbt run” your table each time a new batch arrives. Unfortunately, things aren’t so simple with streaming data.
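With batch data that lands on a known schedule, say hourly, those re-runs are easy to automate. A sketch using cron, where the project path, schedule, and log file are all assumptions:

```shell
# Hypothetical crontab entry: re-run all dbt models five minutes past
# each hour, shortly after the hourly batch lands in the warehouse.
5 * * * * cd /home/analytics/dbt_project && dbt run >> /var/log/dbt_run.log 2>&1
```

The schedule only has to match the batch cadence: as long as no new data arrives between runs, every query against the table sees fresh, fully transformed results.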

Because streaming data does not arrive on a schedule, there is no longer a right time to re-run your models to keep them up-to-date. Instead, you’re forced to choose between maximizing data freshness and minimizing transformation costs. You can minimize your costs by limiting how often you recreate your tables, effectively turning your streaming data into batch data. Or, you can maximize your data freshness by continually recreating your tables. But, this approach will cost you time and money, leave you vulnerable to bugs, and still won’t maintain truly up-to-date results.