Data Warehouse Layering Best Practices
This document describes my data modeling and data warehouse layering best practices.
My experience is mostly with VC/PE-backed ventures, but I have also applied these practices in large multinational corporations and traditional medium-sized companies.

1 Source Layer
Purpose: 1:1 Representation of the Source Data. Comparisons to source systems are always easily possible. No black-box transformation logic.
- This layer of the DWH stores all the data in the format the source system “knows it in”, with as little transformation as possible, i.e. as “raw” as possible (ideally no transformations at all).
- Unlike the downstream layers that follow, I separate this layer across several resources in a Cloud Data Warehouse. For example, in BigQuery, every source loads its data into tables in a separate dataset. This is what the source layer in BigQuery looks like for one of my recent clients:

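For illustration, a minimal dbt sources file can make this one-dataset-per-source layout explicit; the dataset and table names below are hypothetical, not the client's actual setup:

```yaml
# Hypothetical source layer layout in BigQuery: every source system
# loads into its own dataset (dbt's `schema` maps to a BigQuery dataset).
version: 2

sources:
  - name: hubspot
    schema: src_hubspot    # dataset that only Hubspot data is loaded into
    tables:
      - name: deals
  - name: shopware
    schema: src_shopware   # dataset that only Shopware data is loaded into
    tables:
      - name: orders
  - name: stripe
    schema: src_stripe     # dataset that only Stripe data is loaded into
    tables:
      - name: charges
```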
- Ideally, at this step, “contracts” between the owner of the data source and the DWH owner are established and eventually automated. The goal of a data contract is to ensure data quality and to prevent downstream data pipelines from breaking, by capturing expectations between a data provider (e.g. the IT department) and a data consumer (e.g. BI).
- A data contract is an API-like agreement between data producers and consumers, with each constraint enforced programmatically in the developer workflow. It captures expectations around schema, business logic, SLAs, or other forms of data governance.
- The enforceability of the contract is its most essential component. It is not simply an agreement on paper - that’s documentation. It’s also not simply a monitor on downstream data sets - that’s a test. The programmatic enforcement is what makes expectations into a contract.
- In startups and scale-ups, some components of a Data Contract are optional: as long as there are not dozens of internal data providers, programmatic enforcement and an API-like agreement are not needed.
- What is needed, though, are:
  1) Documentation of expectations around schema, business logic, and SLAs
  2) Automatic detection when those expectations are violated
- 1) can be achieved by defining the expectations, for example, in .yml files in dbt. Such a .yml file defines the expected schema of the source and also describes simple semantic rules (e.g. the definition of primary keys, expected data freshness rules, and expected values such as status names or payment method names that downstream jobs rely upon).
- For 2), there will be automated assertions verifying these rules, and the source owners have access to the docs that are auto-generated by dbt. In an ideal world, the assertions would already run on the side of the source at compile time, but this simple implementation is a good start, and we don't need anything more complex at this point; see the sketch below.
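To make 1) and 2) concrete, here is a minimal sketch of such a dbt source definition. The table name, column names, accepted values, and freshness thresholds are illustrative assumptions, not the actual setup:

```yaml
# Hypothetical dbt source definition for the Shopware source,
# combining schema documentation (1) with automated assertions (2).
version: 2

sources:
  - name: shopware
    schema: src_shopware
    loaded_at_field: _loaded_at        # assumed ingestion timestamp column
    freshness:                         # expected data freshness rule
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
        columns:
          - name: order_id
            description: Primary key of an order.
            tests: [unique, not_null]  # primary key expectations
          - name: payment_method
            description: Values that downstream jobs rely upon.
            tests:
              - accepted_values:
                  values: [invoice, paypal, credit_card]
```

`dbt test` and `dbt source freshness` then act as the automated assertions for 2), while `dbt docs generate` produces the documentation for 1) that source owners can browse.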
- For our example of calculating AOV: in our example case, the data needed to calculate AOV comes from Hubspot, Shopware, and Stripe.
2 Preprocess Layer
Purpose: Clean-up of the Source Data. Still no business logic.