Data Layer Design
Overview
This project delivers the first phase of the AlphaForge market-data platform: an MVP data warehouse + ingestion system that enables reproducible backtests backed by a centralized database instead of local files.
This delivery covers Deliverables 1–5: schema design, ClickHouse setup, historical backfill, incremental ingestion, canonical read-views, and automated data-quality (DQ) validation.
Business Impact
- Enables deterministic and reproducible research workflows by consolidating market data into a unified DB-backed architecture.
- Eliminates inconsistencies and human error from local CSV workflows.
- Provides the foundational infrastructure layer required for backtesting, strategy iteration, real-time streams, and analytics.
- Reduces engineering time spent fixing gaps, data drift, and quality issues through automated data-quality (DQ) checks and canonicalized views.
Goals
- Deliver a scalable, production-ready data schema and ingestion pipeline aligned with the MVP scope.
- Populate the database with high-quality historical data and keep it fresh through scheduled incremental ingestion.
- Ensure reproducibility with a clear snapshot/versioning model.
- Provide canonical read-views (v_bars, v_funding, v_instruments, v_fear_greed) that downstream components can query reliably.
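To make the reproducibility goal concrete, the sketch below shows one way a backtest could pin itself to an immutable data snapshot while reading through a canonical view. The `Snapshot` class, the `ingested_at` column, and the `v_bars` column names are illustrative assumptions, not the final schema.

```python
from datetime import datetime, timezone

class Snapshot:
    """Freezes a point in ingestion time so backtest re-runs see identical data.

    Hypothetical sketch: assumes canonical views carry an `ingested_at`
    column recording when each row landed in the warehouse.
    """

    def __init__(self, as_of: datetime):
        self.as_of = as_of

    def bars_query(self, exchange: str, symbol: str, interval: str) -> str:
        # Reads go through the canonical view, filtered to rows ingested at
        # or before the snapshot timestamp, so a later backfill cannot
        # silently change the inputs of a past backtest.
        return (
            "SELECT * FROM v_bars "
            f"WHERE exchange = '{exchange}' AND symbol = '{symbol}' "
            f"AND interval = '{interval}' "
            f"AND ingested_at <= toDateTime('{self.as_of:%Y-%m-%d %H:%M:%S}') "
            "ORDER BY open_time"
        )

snap = Snapshot(as_of=datetime(2024, 1, 1, tzinfo=timezone.utc))
sql = snap.bars_query("binance", "BTCUSDT", "1m")
```

Recording only the `as_of` timestamp alongside each backtest run would be enough to reproduce its inputs, provided ingested rows are never mutated in place.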
Requirements
Functional Requirements
- Design Doc (Deliverable 1)
  - Define schema for raw and canonical tables.
  - Specify ORDER BY / PARTITION BY keys to support efficient time-series reads.
  - Document ingestion flow (backfill + incremental), deduplication, idempotency, retries, and error handling.
  - Specify the snapshot/versioning model for reproducible backtests.
- DB DDL + Initialization (Deliverable 2)
  - ClickHouse + Iceberg table definitions for raw and curated datasets.
  - Automated migrations and initialization scripts runnable in the client’s AWS environment.
- Backfill Tool + Incremental Updater (Deliverable 3)
  - Ingest OHLCV, funding, and metadata across Binance/OKX/Bybit for all specified symbols and intervals.
  - Idempotent ingestion (no duplicates even if jobs are re-run).
  - Incremental updater via scheduler
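The idempotency requirement above can be sketched as last-write-wins upserts keyed on a deterministic natural key, so re-running a backfill job is a no-op rather than a source of duplicates. The key fields here are assumptions mirroring a common OHLCV layout; in ClickHouse the analogous effect would come from a ReplacingMergeTree keyed on the ORDER BY columns.

```python
from typing import Dict, Tuple

# Natural key for one bar: (exchange, symbol, interval, open_time_ms).
Key = Tuple[str, str, str, int]

class BarStore:
    """In-memory stand-in for the warehouse, demonstrating idempotent writes."""

    def __init__(self) -> None:
        self._rows: Dict[Key, dict] = {}

    def ingest(self, rows: list) -> None:
        for row in rows:
            key: Key = (row["exchange"], row["symbol"],
                        row["interval"], row["open_time"])
            # Last write wins: a retry with identical data changes nothing.
            self._rows[key] = row

    def count(self) -> int:
        return len(self._rows)

batch = [
    {"exchange": "binance", "symbol": "BTCUSDT", "interval": "1m",
     "open_time": 1700000000000, "close": 37000.0},
]
store = BarStore()
store.ingest(batch)
store.ingest(batch)  # simulated job retry; row count stays at 1
```

The same batch ingested twice leaves exactly one row per key, which is the property the backfill tool and incremental updater both need to guarantee.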