Thought Process Student Data Pipeline

This document is a handholding walkthrough of every decision made in this pipeline: why each tool was chosen, why each layer exists, what transformations were applied



1. The Problem Statement

We have three source files:

The goal is to produce an analytics-ready table (mart_class_daily_performance) that answers:

For each class, on each day: how many students attended? What was their average score?

This seems simple on the surface, but doing it reliably at scale requires thinking carefully about data quality, transformation separation, auditability, and operability.


2. Design Principles

Before any tool was chosen, three principles were set:

Principle 1: Separation of concerns per layer. Raw data should never be mixed with clean data. Validation should never be mixed with transformation. Aggregation should never be mixed with fact storage. This is why three layers exist instead of one big ETL script.

Principle 2: Immutability of raw data. Once ingested, Bronze data is never modified. If a bug is found in the Silver transformation, the Bronze file is still there. You can re-run Silver from Bronze without re-fetching from the source.

Principle 3: Idempotency. Every DAG run produces the same result when run again with the same input. This is achieved through full-load TRUNCATE + INSERT at Gold (not delta/append) and date-partitioned Bronze/Silver paths (so re-runs overwrite, not duplicate).


3. Why Three Layers?

The single-script alternative: You could read CSV → clean → aggregate → write to database in one Python script. This seems simpler. Here’s why it’s actually worse:

Concern Single script Three layers
Debugging a bad aggregation Re-read source, re-clean, re-aggregate Just re-run Gold from Silver
Auditing “what did the raw data look like?” Gone Preserved in Bronze forever
Source system changes a field type Breaks everything silently Breaks at Silver, Bronze intact
Validation failure on one source Stops everything Other sources proceed normally
Replay after a bug fix Have to re-fetch from source Just re-run from the broken layer

Bronze — Raw As-Is