This document is a hand-holding walkthrough of every decision made in this pipeline: why each tool was chosen, why each layer exists, and what transformations were applied.
We have three source files:
- students.csv — 1,200 rows, one row per student, static snapshot
- attendance.csv — 5,000 rows, one row per attendance event
- assessments.json — 3,000 rows, one row per assessment submission

The goal is to produce an analytics-ready table (mart_class_daily_performance) that answers:
For each class, on each day: how many students attended? What was their average score?
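The target aggregation can be sketched with pandas. The column names here (`class_id`, `date`, `student_id`, `score`) are illustrative assumptions, not the actual schemas of the source files:

```python
import pandas as pd

# Hypothetical mini-versions of the cleaned (Silver) inputs.
attendance = pd.DataFrame({
    "class_id": ["math", "math", "sci"],
    "date": ["2024-01-15", "2024-01-15", "2024-01-15"],
    "student_id": [1, 2, 3],
})
assessments = pd.DataFrame({
    "class_id": ["math", "math", "sci"],
    "date": ["2024-01-15", "2024-01-15", "2024-01-15"],
    "score": [80.0, 90.0, 70.0],
})

# Attendance: distinct students per class per day.
attended = (attendance.groupby(["class_id", "date"])["student_id"]
            .nunique().rename("students_attended"))

# Assessments: average score per class per day.
avg_score = (assessments.groupby(["class_id", "date"])["score"]
             .mean().rename("avg_score"))

# Combine into the mart table; outer join keeps days where only
# one of the two inputs has data.
mart = attended.to_frame().join(avg_score, how="outer").reset_index()
```

Two independent groupbys joined afterward (rather than one merge-then-group) keeps the attendance count correct even when a student has multiple assessment rows on the same day.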
This seems simple on the surface, but doing it reliably at scale requires thinking carefully about data quality, transformation separation, auditability, and operability.
Before any tool was chosen, three principles were set:
Principle 1: Separation of concerns per layer. Raw data should never be mixed with clean data. Validation should never be mixed with transformation. Aggregation should never be mixed with fact storage. This is why three layers exist instead of one big ETL script.
Principle 2: Immutability of raw data. Once ingested, Bronze data is never modified. If a bug is found in the Silver transformation, the Bronze file is still there. You can re-run Silver from Bronze without re-fetching from the source.
Principle 3: Idempotency. Every DAG run produces the same result when run again with the same input. This is achieved through full-load TRUNCATE + INSERT at Gold (not delta/append) and date-partitioned Bronze/Silver paths (so re-runs overwrite, not duplicate).
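The full-load pattern at Gold can be sketched as follows. This uses SQLite purely as a stand-in for the real warehouse (SQLite has no `TRUNCATE`, so `DELETE FROM` plays that role); the table name matches the mart above, but the loader itself is a hypothetical illustration:

```python
import sqlite3

def load_gold(conn, rows):
    """Full-load the mart: wipe and rewrite, never append."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mart_class_daily_performance "
        "(class_id TEXT, day TEXT, students_attended INTEGER)"
    )
    # Full load: clear existing rows so a re-run cannot duplicate them.
    conn.execute("DELETE FROM mart_class_daily_performance")
    conn.executemany(
        "INSERT INTO mart_class_daily_performance VALUES (?, ?, ?)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
rows = [("math", "2024-01-15", 2), ("sci", "2024-01-15", 1)]
load_gold(conn, rows)
load_gold(conn, rows)  # second run with the same input: same final state
```

Running the loader twice leaves exactly the same table as running it once, which is the idempotency guarantee the DAG relies on.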
The single-script alternative: You could read CSV → clean → aggregate → write to database in one Python script. This seems simpler. Here’s why it’s actually worse:
| Concern | Single script | Three layers |
|---|---|---|
| Debugging a bad aggregation | Re-read source, re-clean, re-aggregate | Just re-run Gold from Silver |
| Auditing “what did the raw data look like?” | Gone | Preserved in Bronze forever |
| Source system changes a field type | Breaks everything silently | Breaks at Silver, Bronze intact |
| Validation failure on one source | Stops everything | Other sources proceed normally |
| Replay after a bug fix | Have to re-fetch from source | Just re-run from the broken layer |
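The "re-run from the broken layer" property in the table above falls out of the date-partitioned layout: each layer writes to a path keyed by run date, so a replay for the same date overwrites in place. A minimal sketch, assuming a `dt=YYYY-MM-DD` partition convention (the actual layout may differ):

```python
from pathlib import PurePosixPath

def layer_path(layer: str, source: str, run_date: str) -> PurePosixPath:
    """Deterministic path per (layer, source, run date).

    Same run_date -> same path, so a re-run overwrites its own
    previous output instead of creating duplicates.
    """
    return PurePosixPath(layer) / source / f"dt={run_date}"

bronze = layer_path("bronze", "students", "2024-01-15")
silver = layer_path("silver", "students", "2024-01-15")
```

A Silver replay after a bug fix reads from `bronze` and rewrites only `silver`; the Bronze partition is never touched.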