This design integrates the three-layer architecture (Landing → Staging → Data Mart) with a snapshot tracking system for reproducible backtests and fast latest reads.
Key principle: Data flows through validation layers, then automatically propagates to downstream tables via chained materialized views, eliminating manual dual INSERTs.
Ingestion Workflow
```
External source
    ↓  [minimal validation]
landing_bars (raw data)
    ↓  [application-level DQ validation + transformation]
staging_bars (validated, source of truth)
    ↓  [automatic via chained materialized views]
    ├→ bars           (production table)
    ├→ bars_snapshots (audit trail for reproducibility)
    └→ bars_current   (fast latest reads for live trading)
```
Data flow: write raw data to landing_bars, validate and transform it in the application layer, then INSERT into staging_bars. All downstream tables update automatically via materialized views.
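The fan-out from staging_bars could be wired up roughly as below. This is a minimal sketch: only staging_bars, bars, bars_snapshots, and bars_current are named in this design; the materialized-view names and the assumption that the target tables share staging_bars' columns are mine.

```sql
-- Sketch: each MV fires on INSERT into staging_bars and forwards the rows.
-- Assumes bars / bars_snapshots / bars_current share staging_bars' columns.
CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.mv_staging_to_bars
TO analytics.bars
AS SELECT * FROM analytics.staging_bars;

CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.mv_staging_to_snapshots
TO analytics.bars_snapshots
AS SELECT * FROM analytics.staging_bars;

CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.mv_staging_to_current
TO analytics.bars_current
AS SELECT * FROM analytics.staging_bars;
```

With this wiring, a single INSERT into staging_bars is the only write the application performs; ClickHouse pushes the same rows to all three targets.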
Reproducibility
Every experiment/backtest can pin a specific dataset version and remain stable even if we later fill gaps or correct historical points.
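One way to pin a dataset version is to fold bars_snapshots up to a cutoff time. The bars_snapshots schema is not shown in this document, so the query below assumes it carries staging_bars' columns (including ingested_at); the pinned timestamp is illustrative.

```sql
-- Sketch: reconstruct the dataset as it looked at a pinned snapshot time.
-- A backtest that always uses the same cutoff sees identical data, even
-- after later gap-fills or corrections (which have a newer ingested_at).
SELECT
    instrument_id,
    interval,
    timestamp,
    argMax(close, ingested_at) AS close   -- latest version visible at pin time
FROM analytics.bars_snapshots
WHERE ingested_at <= toDateTime('2024-01-15 00:00:00')
GROUP BY instrument_id, interval, timestamp;
```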
Live-read performance
Live trading / monitoring can read "latest bars" quickly without doing snapshot-folding on every request.
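A latest-bar read against bars_current might look like the sketch below. It assumes bars_current keeps roughly one row per (instrument_id, interval) on a ReplacingMergeTree-style engine; the filter values are illustrative.

```sql
-- Sketch: read the latest bar for one instrument without snapshot-folding.
-- FINAL collapses any not-yet-merged duplicate versions at read time.
SELECT *
FROM analytics.bars_current FINAL
WHERE instrument_id = 42 AND interval = '1m'
ORDER BY timestamp_unix DESC
LIMIT 1;
```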
Operational simplicity
Rows are written once, to staging_bars; the chained materialized views keep bars, bars_snapshots, and bars_current in sync without manual dual INSERTs.
staging_bars (validated source of truth)
Purpose: Clean, validated data after DQ checks. Single source of truth for all downstream tables.
Schema:
CREATE TABLE IF NOT EXISTS analytics.staging_bars (
id UInt64,
instrument_id UInt32,
interval String,
timestamp DateTime,
timestamp_unix Int64,
open Float64,
high Float64,
low Float64,
close Float64,
volume Float64,
metadata String,
-- Lineage columns
deleted_at DateTime,
ingested_at DateTime
)
ENGINE = ReplacingMergeTree(ingested_at)
PARTITION BY toDate(timestamp)
ORDER BY (instrument_id, interval, timestamp_unix);
Key columns: