This blog post covers two important aspects of ETL pipeline design, followed by an example pipeline for the OOONO case study.
1. Considering Initial vs Incremental Extraction in ETL Design
A. Design for Two Modes
- Initial Load Mode:
  - Must handle bulk ingestion of historical third-party and community data.
  - Runs heavy transformations (e.g., clustering all historical hazards, scoring).
  - Typically executed once when setting up the system or onboarding a new data provider.
- Incremental Mode (Daily or Real-Time):
  - Only processes new or updated records since the last run.
  - Should be lightweight and fast, leveraging:
    - Timestamps (e.g., updated_at > last_run_time).
    - Log-based incremental extraction (e.g., using AWS Kinesis, Kafka, or delta APIs).
  - Works hand-in-hand with Lambda for real-time events (community reports) and scheduled Athena/PySpark jobs for daily batch updates.
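A minimal sketch of the timestamp-based incremental mode is shown below. It assumes the pipeline keeps its watermark (last_run_time) as a small JSON object in S3 and that each source connector can filter on an updated_at field; the bucket name, object key, and fetch_page connector are hypothetical, not part of an actual OOONO pipeline.

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
STATE_BUCKET = "ooono-etl-state"        # hypothetical bucket holding pipeline state
WATERMARK_KEY = "state/last_run.json"   # hypothetical object storing the watermark

def load_last_run_time() -> datetime:
    """Read the timestamp of the last successful incremental run."""
    try:
        obj = s3.get_object(Bucket=STATE_BUCKET, Key=WATERMARK_KEY)
        return datetime.fromisoformat(json.loads(obj["Body"].read())["last_run_time"])
    except s3.exceptions.NoSuchKey:
        # No watermark yet: behave like the initial load and take everything.
        return datetime.min.replace(tzinfo=timezone.utc)

def save_last_run_time(ts: datetime) -> None:
    """Persist the new watermark only after the run has succeeded."""
    s3.put_object(Bucket=STATE_BUCKET, Key=WATERMARK_KEY,
                  Body=json.dumps({"last_run_time": ts.isoformat()}))

def incremental_extract(fetch_page) -> list[dict]:
    """Pull only records with updated_at > last_run_time.

    `fetch_page` is a hypothetical source connector that accepts a `since`
    timestamp and yields record dicts whose `updated_at` field is assumed to
    be an ISO-8601 string with an explicit timezone offset.
    """
    since = load_last_run_time()
    run_started = datetime.now(timezone.utc)
    new_records = [r for r in fetch_page(since=since)
                   if datetime.fromisoformat(r["updated_at"]) > since]
    save_last_run_time(run_started)
    return new_records
```

The initial load then becomes the degenerate case where no watermark exists yet, so one code path can serve both modes.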
B. ETL Pipeline Components to Support Both
- Source Connectors:
  Build separate connectors (scripts/APIs) for:
  - Initial full extraction.
  - Incremental delta updates.
- Partitioned Storage in S3:
  Store data in time-based partitions (e.g., /year=2025/month=07/day=18/), so initial bulk loads and daily updates can co-exist (see the PySpark sketch after this list).
- Reprocessing Capability:
  If there’s a bug in transformation logic, you should be able to re-run the pipeline from raw data (stored in S3).
- Versioned Schema:
  Use tools like AWS Glue Data Catalog to maintain schema versions and ensure historical compatibility.
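To make the partitioned-storage and reprocessing points concrete, here is a rough PySpark sketch. The bucket paths and column names are placeholders; the point is that bulk loads and daily increments append into the same year/month/day layout, and raw partitions can be re-read whenever a transform must be re-run.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hazard-etl").getOrCreate()

# `raw_df` stands in for whatever the extract step produced (initial bulk or daily delta).
raw_df = spark.read.json("s3://hazard-raw-landing/blitzer/")      # hypothetical landing path

partitioned = (raw_df
    .withColumn("event_ts", F.to_timestamp("reported_at"))        # assumed source column
    .withColumn("year",  F.date_format("event_ts", "yyyy"))
    .withColumn("month", F.date_format("event_ts", "MM"))
    .withColumn("day",   F.date_format("event_ts", "dd")))

# Initial loads and daily increments both append into the same layout,
# e.g. s3://hazard-data-lake/raw/year=2025/month=07/day=18/
(partitioned.write
    .mode("append")
    .partitionBy("year", "month", "day")
    .parquet("s3://hazard-data-lake/raw/"))

# Reprocessing: if a transformation bug is found, re-read only the affected raw
# partitions and re-apply the fixed transform over them.
july_raw = spark.read.parquet("s3://hazard-data-lake/raw/year=2025/month=07/")
```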
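For the schema-versioning point, the Glue Data Catalog keeps a version history per table. The boto3 sketch below (database and table names are placeholders, pagination omitted for brevity) compares the latest catalogued columns against what the transform code expects.

```python
import boto3

glue = boto3.client("glue")

def latest_columns(database: str, table: str) -> list[dict]:
    """Return the column definitions of the newest table version in the Glue Data Catalog."""
    versions = glue.get_table_versions(DatabaseName=database, TableName=table)["TableVersions"]
    newest = max(versions, key=lambda v: int(v["VersionId"]))
    return newest["Table"]["StorageDescriptor"]["Columns"]

# Compare the catalogued schema against what the transform code expects.
expected = {"hazard_id", "latitude", "longitude", "reported_at", "source"}
actual = {col["Name"] for col in latest_columns("hazard_db", "raw_hazards")}
if actual != expected:
    print(f"Schema drift: missing={expected - actual}, unexpected={actual - expected}")
```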
2. Making the ETL Pipeline Easy to Maintain
ETL pipelines often break when data sources change, for example through API structure updates or format changes. To make the pipeline resilient and maintainable, consider:
A. Modular Architecture
- Separate ETL into clearly defined stages:
  - Extract: Source-specific connectors (e.g., Blitzer API, Coyote API).
  - Transform: Reusable transformation logic (map matching, confidence scoring).
  - Load: Common loaders to MongoDB and S3.
- This way, if one source API changes, only the extract module for that source needs updating.
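One minimal way to express this modular split in Python is sketched below. The extractor bodies, the scoring model, and the loader are placeholders; only the module boundaries are the point.

```python
from typing import Callable, Iterable

# --- Extract: one connector per source; the only code that knows each API's shape ---
def extract_blitzer(since: str) -> list[dict]:
    """Hypothetical Blitzer connector; only this function changes if the Blitzer API changes."""
    return []   # would call the Blitzer API and return raw hazard dicts

def extract_coyote(since: str) -> list[dict]:
    """Hypothetical Coyote connector."""
    return []

# --- Transform: shared, source-agnostic logic (map matching, confidence scoring) ---
def score(record: dict) -> float:
    return 0.5  # placeholder confidence model

def transform(records: Iterable[dict]) -> list[dict]:
    return [{**r, "confidence": score(r)} for r in records]

# --- Load: common loaders, e.g. bulk upsert into MongoDB and Parquet writes to S3 ---
def load(records: list[dict]) -> None:
    print(f"loading {len(records)} records")   # stand-in for the MongoDB/S3 writers

EXTRACTORS: dict[str, Callable[[str], list[dict]]] = {
    "blitzer": extract_blitzer,
    "coyote": extract_coyote,
}

def run(source: str, since: str) -> None:
    """A source API change only touches its entry in EXTRACTORS, nothing downstream."""
    load(transform(EXTRACTORS[source](since)))

run("blitzer", since="2025-07-18T00:00:00Z")
```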
B. Schema Validation & Flexibility
- Use schema validation tools (like Great Expectations or Pydantic) to:
  - Automatically detect field changes or missing values.
  - Send alerts if third-party data deviates from the expected format.
- Implement data mapping layers so fields can be renamed or transformed without changing the core pipeline logic.
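As a rough sketch of the Pydantic option (field names and aliases are illustrative), a model plus a small mapping table lets a renamed upstream field be absorbed in one place, while records that deviate from the expected shape trigger an alert instead of silently breaking the pipeline.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ValidationError

# Mapping layer: upstream renames are absorbed here, not in the core pipeline logic.
FIELD_ALIASES = {"lat": "latitude", "lon": "longitude", "ts": "reported_at"}

class HazardRecord(BaseModel):
    hazard_id: str
    latitude: float
    longitude: float
    reported_at: datetime
    source: str
    confidence: Optional[float] = None    # filled in later by the scoring step

def normalise(raw: dict) -> Optional[HazardRecord]:
    """Rename aliased fields, then validate; alert and skip on deviation."""
    mapped = {FIELD_ALIASES.get(key, key): value for key, value in raw.items()}
    try:
        return HazardRecord(**mapped)
    except ValidationError as err:
        # In production this would raise an alert (e.g. SNS/Slack) rather than print.
        print(f"Third-party record deviates from the expected format: {err}")
        return None

record = normalise({"hazard_id": "h-1", "lat": 55.68, "lon": 12.57,
                    "ts": "2025-07-18T08:30:00Z", "source": "community"})
```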
C. API Abstraction Layer