In the fast-paced world of book sales, the ability to harness data effectively is the key to staying ahead of the competition. This project showcases an end-to-end data pipeline that transforms raw book sales transactions into actionable insights, enabling businesses to make data-driven decisions with precision and agility. By combining scalable cloud technologies with advanced analytics, this solution bridges the gap between data engineering and impactful business outcomes.

GitHub: https://github.com/supakunz/Book-Revenue-Pipeline-GCP

Project Overview

Objectives


Expected Outcomes


Architecture & Workflow

(Architecture diagram: Book-Revenue-Architecture2.png)

| Component | Detail | Why chosen? |
| --- | --- | --- |
| Data Source | API and database (e.g., MySQL) | Provides realistic transaction data for testing and building an end-to-end pipeline. |
| Data Ingestion | Cloud Composer (Apache Airflow) orchestrates batch data ingestion from the API and database (a DAG sketch follows this table). | Automates and schedules data ingestion, ensuring efficient and scalable ETL processes. |
| Batching | Batch processing with Cloud Composer manages scheduled data ingestion and transformation. | Ensures efficient handling of large datasets, reducing latency and optimizing resource usage. |
| Data Storage | Google Cloud Storage (GCS) and BigQuery store raw and processed data. | Provides scalable, cost-efficient, and high-performance storage for analytics. |
| ETL Processing | Apache Airflow DAGs orchestrate ETL workflows, transforming raw data into structured formats. | Automates data transformation, ensuring data consistency and enabling seamless downstream analysis. |
| Query Engine | BigQuery provides fast, scalable querying of processed book sales data. | Enables efficient analysis and reporting with SQL-based queries. |
| BI and Reporting | Looker dashboards visualize trends (e.g., revenue, sales performance). | Enables interactive visualization for business users. |
| Data Analytics | Python (pandas) for EDA and comprehensive analysis. | Enables efficient data manipulation and in-depth analysis of book sales trends. |
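
To make the orchestration concrete, here is a minimal sketch of what a Cloud Composer DAG wiring these components together could look like. The bucket name, connection IDs, API endpoint, SQL, and BigQuery table are illustrative assumptions rather than values from the actual repository.

```python
# Hypothetical sketch of a daily ingestion DAG (names, connections, and paths are assumptions).
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.mysql_to_gcs import MySQLToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from google.cloud import storage

BUCKET = "book-revenue-raw"                       # assumed GCS bucket
API_URL = "https://example.com/api/book_prices"   # assumed API endpoint


def fetch_api_to_gcs(**context):
    """Pull the latest data from the API and land it in GCS as a raw JSON file."""
    response = requests.get(API_URL, timeout=60)
    response.raise_for_status()
    blob = storage.Client().bucket(BUCKET).blob(f"raw/api/{context['ds']}.json")
    blob.upload_from_string(response.text, content_type="application/json")


with DAG(
    dag_id="book_revenue_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Export transactional rows from MySQL into GCS as newline-delimited JSON.
    extract_db = MySQLToGCSOperator(
        task_id="extract_transactions_from_mysql",
        mysql_conn_id="mysql_default",
        sql="SELECT * FROM book_transactions",
        bucket=BUCKET,
        filename="raw/db/transactions_{{ ds }}.json",
        export_format="json",
    )

    # Pull supplementary data from the API into the same raw bucket.
    extract_api = PythonOperator(
        task_id="extract_prices_from_api",
        python_callable=fetch_api_to_gcs,
    )

    # Load the staged database export into BigQuery for downstream analysis.
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_transactions_to_bigquery",
        bucket=BUCKET,
        source_objects=["raw/db/transactions_{{ ds }}.json"],
        destination_project_dataset_table="book_analytics.transactions",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
        autodetect=True,
    )

    [extract_db, extract_api] >> load_to_bq
```

The database extract and the API pull can run in parallel; both land raw files in GCS before the BigQuery load runs, which mirrors the batch flow described in the table above.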

Workflow

The workflow consists of four steps:

| Step | Detail | Why chosen? |
| --- | --- | --- |
| Step 1: Data Ingestion | Data is extracted from the API and database with Python scripts and landed in Google Cloud Storage (GCS). | Efficient data extraction from the API and database, stored in a centralized cloud location for processing. |
| Step 2: ETL Process | Cloud Composer (Apache Airflow) orchestrates ETL workflows, transforming raw data into structured formats. | Automates and scales complex data transformations. |
| Step 3: Storage and Query | Processed data is loaded into BigQuery, optimized with partitioned Parquet (see the load sketch after this table). | Enables fast, scalable querying and efficient storage. |
| Step 4: Business Intelligence | Looker visualizes and explores the processed data for actionable insights. | Powerful data exploration and dashboard creation. |
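
As an illustration of Step 3, the sketch below loads processed Parquet files from GCS into a date-partitioned BigQuery table with the BigQuery Python client. The bucket path, dataset, table, and partition column are assumptions for the example, not values taken from the repository.

```python
# Hypothetical sketch of the partitioned load in Step 3 (dataset, table, and paths are assumptions).
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="transaction_date",  # assumed partition column in the processed data
    ),
)

load_job = client.load_table_from_uri(
    "gs://book-revenue-processed/transactions/*.parquet",  # assumed processed-data path
    "book_analytics.fact_book_sales",                      # assumed destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish

table = client.get_table("book_analytics.fact_book_sales")
print(f"Loaded {table.num_rows} rows into the partitioned table.")
```

Partitioning the table on the transaction date keeps Looker dashboards and ad-hoc SQL scanning only the days they need, which is what keeps querying in Step 4 fast and inexpensive.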