This blog explains two important aspects of ETL pipeline design, followed by an example design for the OOONO case study.

1. Considering Initial vs Incremental Extraction in ETL Design

A. Design for Two Modes

The pipeline should support an initial mode (a one-off bulk extraction of all historical data) and an incremental mode (regularly pulling only new or changed records), with both running against the same storage layout.

B. ETL Pipeline Components to Support Both

  1. Source Connectors: Build separate connectors (scripts/APIs) for the initial bulk load and for the incremental (e.g., daily) updates.
  2. Partitioned Storage in S3: Store data in time-based partitions (e.g., /year=2025/month=07/day=18/), so initial bulk loads and daily updates can co-exist (see the sketch after this list).
  3. Reprocessing Capability: If there’s a bug in transformation logic, you should be able to re-run the pipeline from raw data (stored in S3).
  4. Versioned Schema: Use tools like AWS Glue Data Catalog to maintain schema versions and ensure historical compatibility.
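To make this concrete, here is a minimal sketch, in Python with boto3 and requests, of a dual-mode connector that writes raw data into time-based S3 partitions. The bucket name, API URL and response shape are assumptions for illustration, not part of the case study.

```python
# Minimal sketch: one extractor, two entry points (initial vs incremental),
# both writing raw data into the same /year=/month=/day= partition layout.
import json
from datetime import date, timedelta

import boto3
import requests

S3_BUCKET = "ooono-raw-data"                 # hypothetical bucket name
API_URL = "https://api.example.com/events"   # hypothetical source API

s3 = boto3.client("s3")


def partition_key(day: date) -> str:
    """Build the time-based prefix shared by both modes."""
    return f"year={day:%Y}/month={day:%m}/day={day:%d}"


def extract(day: date) -> list[dict]:
    """Pull one day of records from the source API."""
    resp = requests.get(API_URL, params={"date": day.isoformat()}, timeout=30)
    resp.raise_for_status()
    return resp.json()


def load_raw(day: date, records: list[dict]) -> None:
    """Write raw records into the day's partition; a re-run simply overwrites it."""
    key = f"raw/{partition_key(day)}/events.json"
    s3.put_object(Bucket=S3_BUCKET, Key=key, Body=json.dumps(records))


def run_initial(start: date, end: date) -> None:
    """Initial (bulk) load: backfill every day in the historical range."""
    day = start
    while day <= end:
        load_raw(day, extract(day))
        day += timedelta(days=1)


def run_incremental(day: date | None = None) -> None:
    """Incremental load: only yesterday's partition, scheduled daily."""
    day = day or (date.today() - timedelta(days=1))
    load_raw(day, extract(day))
```

Because each run keeps the raw partition it wrote, a buggy transformation can later be re-applied over those partitions (point 3 above) without going back to the source systems.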

2. Making the ETL Pipeline Easy to Maintain

ETL pipelines often break when data sources change, for example through API structure updates or format changes. To make the pipeline resilient and maintainable, consider:

A. Modular Architecture
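Keep extraction, transformation and loading in separate modules with small, explicit interfaces, so a change in one stage does not ripple through the others. A rough sketch of such a layout, with hypothetical module and function names (stubbed here so it runs standalone):

```python
# Sketch of one run split into three stages; in a real repo each stage would
# live in its own module (extract.py, transform.py, load.py) behind the same signatures.
from datetime import date


def fetch_raw(day: date) -> list[dict]:
    """extract.py: talks to the source API only (stubbed here)."""
    return [{"deviceId": "abc-123", "ts": f"{day}T12:00:00Z"}]


def clean(raw: list[dict]) -> list[dict]:
    """transform.py: pure function over raw records, easy to unit-test."""
    return [{"device_id": r["deviceId"], "timestamp": r["ts"]} for r in raw]


def write(records: list[dict], day: date) -> None:
    """load.py: knows about S3/warehouse only (stubbed here)."""
    print(f"would write {len(records)} records for {day}")


def run(day: date) -> None:
    """run.py: thin orchestrator; swapping any stage leaves the others untouched."""
    write(clean(fetch_raw(day)), day)
```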

B. Schema Validation & Flexibility
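One way to handle this is to validate each incoming record against an explicit model and quarantine rows that do not fit, instead of letting the whole run fail when the source adds or drops a field. A small sketch using pydantic; the field names are placeholders, not the actual source schema:

```python
from pydantic import BaseModel, ValidationError


class Event(BaseModel):
    device_id: str
    timestamp: str
    speed_kmh: float | None = None   # optional: may be absent in older records


def validate(records: list[dict]) -> tuple[list[Event], list[dict]]:
    """Keep valid rows, quarantine invalid ones instead of failing the whole run."""
    valid, rejected = [], []
    for rec in records:
        try:
            valid.append(Event(**rec))   # unknown extra fields are ignored by default
        except ValidationError:
            rejected.append(rec)
    return valid, rejected
```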

C. API Abstraction Layer
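Hide the source API behind one thin client, so that when the provider changes an endpoint or renames a field, only this class needs updating. A minimal sketch, with a hypothetical URL and payload shape:

```python
import requests


class SourceClient:
    """Thin abstraction over the source API: the rest of the pipeline only calls get_events()."""

    BASE_URL = "https://api.example.com"   # assumed endpoint

    def get_events(self, day: str) -> list[dict]:
        resp = requests.get(
            f"{self.BASE_URL}/v2/events", params={"date": day}, timeout=30
        )
        resp.raise_for_status()
        payload = resp.json()
        # Map the provider's field names to the pipeline's internal names here,
        # so upstream renames never leak into transform/load code.
        return [
            {"device_id": item["deviceId"], "timestamp": item["ts"]}
            for item in payload.get("data", [])
        ]
```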