In the fast-paced world of book sales, the ability to harness data effectively is the key to staying ahead of the competition. This project showcases an end-to-end data pipeline that transforms raw book sales transactions into actionable insights, enabling businesses to make data-driven decisions with precision and agility. By combining scalable cloud technologies with advanced analytics, this solution bridges the gap between data engineering and impactful business outcomes.

GitHub: https://github.com/supakunz/Book-Revenue-Pipeline-GCP

Project Overview

Objectives


Expected Outcomes


Architecture & Workflow

(Architecture diagram: Book-Revenue-Architecture2.png)

| Component | Detail | Why chosen? |
| --- | --- | --- |
| Data Source | API and database (e.g., MySQL) | Provides realistic transaction data for testing and building an end-to-end pipeline. |
| Data Ingestion | Cloud Composer (Apache Airflow) orchestrates batch data ingestion from the API and database (a DAG sketch follows this table). | Automates and schedules data ingestion, ensuring efficient and scalable ETL processes. |
| Batching | Batch processing with Cloud Composer manages scheduled data ingestion and transformation. | Ensures efficient handling of large datasets, reducing latency and optimizing resource usage. |
| Data Storage | Google Cloud Storage (GCS) and BigQuery store raw and processed data. | Provides scalable, cost-efficient, and high-performance storage for analytics. |
| ETL Processing | Apache Airflow DAGs orchestrate ETL workflows, transforming raw data into structured formats. | Automates data transformation, ensuring data consistency and enabling seamless downstream analysis. |
| Query Engine | BigQuery provides fast, scalable querying of processed book sales data. | Enables efficient analysis and reporting with SQL-based queries. |
| BI and Reporting | Looker dashboards visualize trends (e.g., revenue, sales performance). | Enables interactive visualization for business users. |
| Data Analytics | Python (pandas) for EDA and comprehensive analysis. | Enables efficient data manipulation and in-depth analysis of book sales trends. |
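
To make the orchestration concrete, here is a minimal sketch of what a Cloud Composer DAG wiring these components together could look like. The bucket name, connection IDs, API endpoint, SQL, and BigQuery table are illustrative assumptions rather than values from the actual repository.

```python
# Hypothetical sketch of a daily ingestion DAG (names, connections, and paths are assumptions).
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.mysql_to_gcs import MySQLToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from google.cloud import storage

BUCKET = "book-revenue-raw"                       # assumed GCS bucket
API_URL = "https://example.com/api/book_prices"   # assumed API endpoint


def fetch_api_to_gcs(**context):
    """Pull the latest data from the API and land it in GCS as a raw JSON file."""
    response = requests.get(API_URL, timeout=60)
    response.raise_for_status()
    blob = storage.Client().bucket(BUCKET).blob(f"raw/api/{context['ds']}.json")
    blob.upload_from_string(response.text, content_type="application/json")


with DAG(
    dag_id="book_revenue_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Export transactional rows from MySQL into GCS as newline-delimited JSON.
    extract_db = MySQLToGCSOperator(
        task_id="extract_transactions_from_mysql",
        mysql_conn_id="mysql_default",
        sql="SELECT * FROM book_transactions",
        bucket=BUCKET,
        filename="raw/db/transactions_{{ ds }}.json",
        export_format="json",
    )

    # Pull supplementary data from the API into the same raw bucket.
    extract_api = PythonOperator(
        task_id="extract_prices_from_api",
        python_callable=fetch_api_to_gcs,
    )

    # Load the staged database export into BigQuery for downstream analysis.
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_transactions_to_bigquery",
        bucket=BUCKET,
        source_objects=["raw/db/transactions_{{ ds }}.json"],
        destination_project_dataset_table="book_analytics.transactions",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
        autodetect=True,
    )

    [extract_db, extract_api] >> load_to_bq
```

The database extract and the API pull can run in parallel; both land raw files in GCS before the BigQuery load runs, which mirrors the batch flow described in the table above.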

Workflow

The workflow consists of four steps:

| Step | Detail | Why chosen? |
| --- | --- | --- |
| Step 1: Data Ingestion | Data is extracted from the API and database with Python scripts and landed in Google Cloud Storage (GCS). | Efficient data extraction from the API and database, stored in a centralized cloud location for processing. |
| Step 2: ETL Process | Cloud Composer (Apache Airflow) orchestrates ETL workflows, transforming raw data into structured formats. | Automates and scales complex data transformations. |
| Step 3: Storage and Query | Processed data is loaded into BigQuery, optimized with partitioned Parquet (see the load sketch after this table). | Enables fast, scalable querying and efficient storage. |
| Step 4: Business Intelligence | Looker visualizes and explores the processed data for actionable insights. | Powerful data exploration and dashboard creation. |
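
As an illustration of Step 3, the sketch below loads processed Parquet files from GCS into a date-partitioned BigQuery table with the BigQuery Python client. The bucket path, dataset, table, and partition column are assumptions for the example, not values taken from the repository.

```python
# Hypothetical sketch of the partitioned load in Step 3 (dataset, table, and paths are assumptions).
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="transaction_date",  # assumed partition column in the processed data
    ),
)

load_job = client.load_table_from_uri(
    "gs://book-revenue-processed/transactions/*.parquet",  # assumed processed-data path
    "book_analytics.fact_book_sales",                      # assumed destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish

table = client.get_table("book_analytics.fact_book_sales")
print(f"Loaded {table.num_rows} rows into the partitioned table.")
```

Partitioning the table on the transaction date keeps Looker dashboards and ad-hoc SQL scanning only the days they need, which is what keeps querying in Step 4 fast and inexpensive.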