NYC Transportation Data Warehouse - Next Steps Roadmap
✅ COMPLETED PHASES
- [x] Phase 1: Data Collection & Ingestion
  - [x] Data scraping from NYC TLC website
  - [x] Raw data upload to Snowflake (CSV/Parquet)
  - [x] Python upload scripts (`load_snowflake.py`)
  - [x] GitHub repository setup
🎯 NEXT PHASES TO COMPLETE
PHASE 2: DATA TRANSFORMATION & ETL 🔧
Goal: Clean, transform, and prepare data for analytics
2.1 Data Quality & Cleaning
- [ ] Create data quality checks
  - Check for null values, outliers, invalid dates
  - Validate trip distances, fare amounts
  - Identify data anomalies
- [ ] Build data cleaning scripts
  - Remove invalid records
  - Standardize formats
  - Handle missing values
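The checks and the "remove invalid records" step above could be prototyped in plain Python before they are moved into SQL. The following is a minimal sketch, not the project's implementation: the field names (`pickup_datetime`, `dropoff_datetime`, `trip_distance`, `fare_amount`) and both thresholds are assumptions chosen for illustration, and rows with missing values are rejected rather than imputed.

```python
from datetime import datetime

# Hypothetical thresholds; tune them against the real data.
MAX_TRIP_MILES = 200.0
MAX_FARE_USD = 1000.0
# Hypothetical column names; adjust to the actual raw table.
REQUIRED_FIELDS = ("pickup_datetime", "dropoff_datetime", "trip_distance", "fare_amount")


def _in_range(value, upper):
    """Return True if value is numeric and 0 < value <= upper."""
    try:
        return 0 < float(value) <= upper
    except (TypeError, ValueError):
        return False


def check_trip(row):
    """Return a list of quality issues found in one trip record."""
    # Null check on the fields every later step relies on.
    issues = [f"null:{f}" for f in REQUIRED_FIELDS if row.get(f) in (None, "")]
    if issues:
        return issues  # the remaining checks need all fields present

    # Date validity: both timestamps must parse, and dropoff must follow pickup.
    try:
        pickup = datetime.fromisoformat(row["pickup_datetime"])
        dropoff = datetime.fromisoformat(row["dropoff_datetime"])
        if dropoff <= pickup:
            issues.append("invalid:dropoff_not_after_pickup")
    except (TypeError, ValueError):
        issues.append("invalid:datetime_format")

    # Range checks for distance and fare.
    if not _in_range(row["trip_distance"], MAX_TRIP_MILES):
        issues.append("outlier:trip_distance")
    if not _in_range(row["fare_amount"], MAX_FARE_USD):
        issues.append("outlier:fare_amount")
    return issues


def split_valid_invalid(rows):
    """Separate clean rows from rejected rows, keeping the reasons for rejection."""
    valid, rejected = [], []
    for row in rows:
        issues = check_trip(row)
        if issues:
            rejected.append((row, issues))
        else:
            valid.append(row)
    return valid, rejected
```

Keeping the rejected rows together with their issue codes, instead of silently dropping them, makes it possible to report anomaly counts per rule and to revisit thresholds later.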
2.2 Data Transformation
- [ ] Create staging tables (cleaned raw data)
- [ ] Build transformation SQL scripts
  - Clean datetime formats
  - Calculate trip duration
  - Categorize trip types
  - Add business rules
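The transformation rules listed above can be sketched in Python first and ported to SQL once the logic is settled. This is an illustrative sketch under stated assumptions: the accepted datetime formats, the column names, and the distance cutoffs for `short`/`medium`/`long` are all hypothetical and not taken from the project.

```python
from datetime import datetime

# Hypothetical input formats that raw files might contain.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S", "%m/%d/%Y %H:%M")


def parse_datetime(value):
    """Parse a timestamp string in any of the known formats."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized datetime format: {value!r}")


def categorize_trip(distance_miles):
    """Bucket a trip by distance; the cutoffs are placeholder business rules."""
    if distance_miles < 2:
        return "short"
    if distance_miles < 10:
        return "medium"
    return "long"


def transform(row):
    """Return a staging-ready copy of a raw row with derived columns added."""
    pickup = parse_datetime(row["pickup_datetime"])
    dropoff = parse_datetime(row["dropoff_datetime"])
    out = dict(row)
    # Standardize both timestamps to one format.
    out["pickup_datetime"] = pickup.isoformat(sep=" ")
    out["dropoff_datetime"] = dropoff.isoformat(sep=" ")
    # Derived columns: duration in minutes and a distance-based trip type.
    out["trip_duration_min"] = (dropoff - pickup).total_seconds() / 60.0
    out["trip_type"] = categorize_trip(float(row["trip_distance"]))
    return out
```

For example, `transform({"pickup_datetime": "01/01/2024 08:00", "dropoff_datetime": "2024-01-01 08:15:00", "trip_distance": "2.5"})` yields a row with both timestamps in one format plus the two derived columns.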
2.3 ETL Pipeline
- [ ] Create automated ETL scripts (Python + SQL)
- [ ] Schedule data processing (daily/weekly)
- [ ] Add error handling and logging
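One possible shape for the pipeline runner, with logging and error handling, is sketched below. It is an assumption about how the steps might be wired together, not the project's design: `run_pipeline` and the step names in the usage example are hypothetical, and each step is just a zero-argument callable.

```python
import logging

logger = logging.getLogger("etl")


def run_pipeline(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns a list of (name, status) tuples, where status is "ok" or "failed".
    Steps after a failed one are not run and do not appear in the result.
    """
    results = []
    for name, step in steps:
        logger.info("starting step %s", name)
        try:
            step()
        except Exception:
            # logger.exception records the message together with the traceback.
            logger.exception("step %s failed", name)
            results.append((name, "failed"))
            break
        logger.info("finished step %s", name)
        results.append((name, "ok"))
    return results


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    # Placeholder steps; real ones would call the cleaning and transformation code.
    outcome = run_pipeline([
        ("extract", lambda: None),
        ("transform", lambda: None),
        ("load", lambda: None),
    ])
    # Exit non-zero on failure so a scheduler can detect it.
    raise SystemExit(0 if all(status == "ok" for _, status in outcome) else 1)
```

Stopping at the first failure avoids loading partially transformed data. A script like this could then be invoked daily or weekly by whichever scheduler the project adopts.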
PHASE 3: DIMENSIONAL MODEL DESIGN 🏗️
Goal: Design star schema for analytics