The main flow of data will be:
Raw Data (YouTube API) -> Python processing -> Google Cloud Storage (GCS - Data Lake):
GCS stores raw data, which could be in JSON or other file formats.
GCS (Raw) -> Python processing (Cleaning) -> BigQuery (Data Warehouse):
Data is cleaned, standardized, and loaded into tables in BigQuery.
BigQuery (Cleaned) -> Python/SQL processing (Enrichment) -> BigQuery (Data Warehouse):
Data in BigQuery is read and processed to add new informational fields, and the enriched results are written back to BigQuery. From there, the data is served to Power BI.
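The three stages above can be sketched as plain Python functions. This is a minimal illustration, not the project's actual code: the function names are hypothetical, the input shape (`id`, `snippet`, `statistics`) is an assumption about what the API returns, and a simple word count stands in for the real enrichment. Reading from and writing to GCS and BigQuery is left out on purpose.

```python
import json


def to_raw_blob(api_items):
    """Stage 1: serialize API items unchanged as newline-delimited JSON,
    ready to be uploaded to the raw area of GCS."""
    return "\n".join(json.dumps(item, sort_keys=True) for item in api_items)


def clean_item(item):
    """Stage 2: flatten and standardize one raw item into a row for BigQuery.
    Field names here are assumptions made for illustration."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    return {
        "video_id": item["id"],
        "title": snippet.get("title", "").strip(),
        "description": snippet.get("description", "").strip(),
        # Counts are assumed to arrive as strings, so cast them to integers.
        "view_count": int(stats.get("viewCount", 0)),
    }


def enrich_row(row):
    """Stage 3: add derived fields to a cleaned row.
    A word count is only a placeholder for real NLP enrichment."""
    return {**row, "description_word_count": len(row["description"].split())}
```

Keeping each stage as a pure function makes it possible to test the transformations without touching GCS or BigQuery; the storage calls can then wrap these functions.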
Since not every video is expected to have its comments crawled and enriched, storing all of this in a single table would lead to many sparse columns. Therefore, we will split the data into three distinct tables: video info (basic information, present for every video), comment (optional), and enrich data (optional). Only the special videos, those selected for comment crawling and enrichment, will have rows in all three tables.
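To illustrate how one video record could be routed to the video info, comment, and enrich data tables, here is a small sketch. The function name, the input keys (`comments`, `enrichment`), and the column names are all hypothetical; only the idea of optional rows keyed by a shared video ID comes from the design above.

```python
def split_into_tables(video):
    """Return the rows to write per table for one video.
    Optional tables receive rows only when the video actually has that data,
    which avoids sparse columns in a single wide table."""
    video_id = video["video_id"]
    tables = {
        # Every video gets exactly one basic-information row.
        "video_info": [{"video_id": video_id, "title": video["title"]}],
        "comment": [],
        "enrich_data": [],
    }
    for comment in video.get("comments", []):
        tables["comment"].append(
            {
                "video_id": video_id,
                "comment_id": comment["comment_id"],
                "text": comment["text"],
            }
        )
    if "enrichment" in video:
        tables["enrich_data"].append({"video_id": video_id, **video["enrichment"]})
    return tables
```

With this layout, the three tables can be joined on `video_id` when a report needs comments or enrichment alongside the basic information.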
1_raw/bronze: Raw data, including the following folders:
2_cleaned/silver: Cleaned data
3_enrich/gold: Data enriched through NLP analysis of description, comment, caption, etc.
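A small helper can keep object paths consistent across the three layers. The sketch below assumes the top-level folders are named `1_raw`, `2_cleaned`, and `3_enrich` (with bronze, silver, and gold as aliases); the dataset subfolder and the date-based layout are hypothetical choices made for illustration, not part of the design above.

```python
from datetime import date

# Assumed top-level folder per layer; bronze/silver/gold are treated as aliases.
LAYERS = {"raw": "1_raw", "cleaned": "2_cleaned", "enrich": "3_enrich"}


def object_path(layer, dataset, day, filename):
    """Build a storage path such as '<layer folder>/<dataset>/YYYY/MM/DD/<filename>'.
    Raises KeyError for an unknown layer so that typos fail early."""
    return f"{LAYERS[layer]}/{dataset}/{day:%Y/%m/%d}/{filename}"


# Example: object_path("raw", "videos", date(2024, 1, 5), "part-0.json")
# returns "1_raw/videos/2024/01/05/part-0.json"
```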
Detailed data structure