This document is a hand-holding walkthrough of every decision made in this pipeline: why each tool was chosen, why each layer exists, and what transformations were applied.
We have three source files:
- students.csv — 1,200 rows, one row per student, static snapshot
- attendance.csv — 5,000 rows, one row per attendance event
- assessments.json — 3,000 rows, one row per assessment submission

The goal is to produce an analytics-ready table (mart_class_daily_performance) that answers:
For each class, on each day: how many students attended? What was their average score?
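The target aggregation can be sketched with pandas. The column names here (`class_id`, `date`, `student_id`, `score`) are illustrative assumptions, not the actual schemas of the source files:

```python
import pandas as pd

# Hypothetical mini-versions of the cleaned (Silver) inputs.
attendance = pd.DataFrame({
    "class_id": ["math", "math", "sci"],
    "date": ["2024-01-15", "2024-01-15", "2024-01-15"],
    "student_id": [1, 2, 3],
})
assessments = pd.DataFrame({
    "class_id": ["math", "math", "sci"],
    "date": ["2024-01-15", "2024-01-15", "2024-01-15"],
    "score": [80.0, 90.0, 70.0],
})

# Attendance: distinct students per class per day.
attended = (attendance.groupby(["class_id", "date"])["student_id"]
            .nunique().rename("students_attended"))

# Assessments: average score per class per day.
avg_score = (assessments.groupby(["class_id", "date"])["score"]
             .mean().rename("avg_score"))

# Combine into the mart table; outer join keeps days where only
# one of the two inputs has data.
mart = attended.to_frame().join(avg_score, how="outer").reset_index()
```

Two independent groupbys joined afterward (rather than one merge-then-group) keeps the attendance count correct even when a student has multiple assessment rows on the same day.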
This seems simple on the surface, but doing it reliably at scale requires thinking carefully about data quality, transformation separation, auditability, and operability.
Before any tool was chosen, three principles were set:
Principle 1: Separation of concerns per layer. Raw data should never be mixed with clean data. Validation should never be mixed with transformation. Aggregation should never be mixed with fact storage. This is why three layers exist instead of one big ETL script.
Principle 2: Immutability of raw data. Once ingested, Bronze data is never modified. If a bug is found in the Silver transformation, the Bronze file is still there. You can re-run Silver from Bronze without re-fetching from the source.
Principle 3: Idempotency. Every DAG run produces the same result when run again with the same input. This is achieved through full-load TRUNCATE + INSERT at Gold (not delta/append) and date-partitioned Bronze/Silver paths (so re-runs overwrite, not duplicate).
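The full-load pattern at Gold can be sketched as follows. This uses SQLite purely as a stand-in for the real warehouse (SQLite has no `TRUNCATE`, so `DELETE FROM` plays that role); the table name matches the mart above, but the loader itself is a hypothetical illustration:

```python
import sqlite3

def load_gold(conn, rows):
    """Full-load the mart: wipe and rewrite, never append."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mart_class_daily_performance "
        "(class_id TEXT, day TEXT, students_attended INTEGER)"
    )
    # Full load: clear existing rows so a re-run cannot duplicate them.
    conn.execute("DELETE FROM mart_class_daily_performance")
    conn.executemany(
        "INSERT INTO mart_class_daily_performance VALUES (?, ?, ?)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
rows = [("math", "2024-01-15", 2), ("sci", "2024-01-15", 1)]
load_gold(conn, rows)
load_gold(conn, rows)  # second run with the same input: same final state
```

Running the loader twice leaves exactly the same table as running it once, which is the idempotency guarantee the DAG relies on.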
The single-script alternative: You could read CSV → clean → aggregate → write to database in one Python script. This seems simpler. Here’s why it’s actually worse:
| Concern | Single script | Three layers |
|---|---|---|
| Debugging a bad aggregation | Re-read source, re-clean, re-aggregate | Just re-run Gold from Silver |
| Auditing “what did the raw data look like?” | Gone | Preserved in Bronze forever |
| Source system changes a field type | Breaks everything silently | Breaks at Silver, Bronze intact |
| Validation failure on one source | Stops everything | Other sources proceed normally |
| Replay after a bug fix | Have to re-fetch from source | Just re-run from the broken layer |
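The "re-run from the broken layer" property in the table above falls out of the date-partitioned layout: each layer writes to a path keyed by run date, so a replay for the same date overwrites in place. A minimal sketch, assuming a `dt=YYYY-MM-DD` partition convention (the actual layout may differ):

```python
from pathlib import PurePosixPath

def layer_path(layer: str, source: str, run_date: str) -> PurePosixPath:
    """Deterministic path per (layer, source, run date).

    Same run_date -> same path, so a re-run overwrites its own
    previous output instead of creating duplicates.
    """
    return PurePosixPath(layer) / source / f"dt={run_date}"

bronze = layer_path("bronze", "students", "2024-01-15")
silver = layer_path("silver", "students", "2024-01-15")
```

A Silver replay after a bug fix reads from `bronze` and rewrites only `silver`; the Bronze partition is never touched.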