Overview

This document describes the schema design and field mappings for every stage of the project.

It covers the intermediate and final schemas used during the ETL pipeline, as well as the final data model in PostgreSQL.

ETL Pipeline - Database design

During the ETL pipeline, the data passes through the following stages:

Data sources

Source | Description
Gmail API | Retrieves email messages and attachments using OAuth2. Key endpoints: users.messages.list, users.messages.get
Attachments | PDF invoices attached to emails from various utility providers
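
To make the two data sources concrete, the following is a minimal sketch of how the listed Gmail endpoints might be called. It assumes an already authorised `service` object built with google-api-python-client (for example via `build("gmail", "v1", credentials=creds)`); the helper names, the search query, and the use of `users.messages.attachments.get` to fetch the attachment bytes are illustrative assumptions, not the project's actual code.

```python
import base64
from typing import Any, Dict, Iterator, List


def iter_message_ids(service: Any, query: str, user_id: str = "me") -> Iterator[str]:
    """Yield ids of messages matching `query` via users.messages.list,
    following nextPageToken until the result set is exhausted."""
    page_token = None
    while True:
        response = (
            service.users()
            .messages()
            .list(userId=user_id, q=query, pageToken=page_token)
            .execute()
        )
        for message in response.get("messages", []):
            yield message["id"]
        page_token = response.get("nextPageToken")
        if not page_token:
            break


def find_pdf_parts(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Recursively collect MIME parts whose filename ends in .pdf."""
    found = []
    if payload.get("filename", "").lower().endswith(".pdf"):
        found.append(payload)
    for part in payload.get("parts", []):
        found.extend(find_pdf_parts(part))
    return found


def decode_attachment_data(data: str) -> bytes:
    """Decode Gmail's base64url attachment text, restoring any missing padding."""
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))


def download_pdf_attachments(service: Any, message_id: str, user_id: str = "me") -> Dict[str, bytes]:
    """Fetch one message with users.messages.get and return {filename: pdf_bytes}."""
    message = (
        service.users()
        .messages()
        .get(userId=user_id, id=message_id, format="full")
        .execute()
    )
    pdfs = {}
    for part in find_pdf_parts(message.get("payload", {})):
        body = part.get("body", {})
        data = body.get("data")  # small attachments may be inlined
        if data is None and "attachmentId" in body:
            attachment = (
                service.users()
                .messages()
                .attachments()
                .get(userId=user_id, messageId=message_id, id=body["attachmentId"])
                .execute()
            )
            data = attachment["data"]
        if data is not None:
            pdfs[part["filename"]] = decode_attachment_data(data)
    return pdfs
```

A call such as `iter_message_ids(service, "has:attachment filename:pdf")` would list candidate emails; the query string here is only an example of Gmail search syntax, and the real filter depends on the utility providers involved.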

Bronze Dataframe

The bronze dataframe is the first dataframe created in the pipeline. It stores the raw information parsed from each PDF invoice with pdfplumber.

Each utility type has its own table schema, based on the information parsed from its invoices.
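
As a rough illustration of the bronze stage, the sketch below extracts page text with pdfplumber and collects it into a pandas dataframe. The column names (`message_id`, `filename`, `utility_type`, `page_number`, `raw_text`) and the helper functions are hypothetical placeholders; the actual bronze columns are the ones defined in the per-utility schemas.

```python
from typing import Dict, List

# Hypothetical column set; the real columns come from each utility schema.
BRONZE_COLUMNS = ["message_id", "filename", "utility_type", "page_number", "raw_text"]


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the raw text of every page in the PDF, in page order."""
    import pdfplumber  # imported lazily so the helpers below work without it

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return [page.extract_text() or "" for page in pdf.pages]


def build_bronze_rows(
    message_id: str, filename: str, utility_type: str, page_texts: List[str]
) -> List[Dict[str, object]]:
    """Turn per-page text into one raw, untransformed row per page."""
    return [
        {
            "message_id": message_id,
            "filename": filename,
            "utility_type": utility_type,
            "page_number": number,
            "raw_text": text,
        }
        for number, text in enumerate(page_texts, start=1)
    ]


def to_bronze_dataframe(rows: List[Dict[str, object]]):
    """Build the bronze dataframe with a fixed column order."""
    import pandas as pd  # imported lazily, see extract_page_texts

    return pd.DataFrame(rows, columns=BRONZE_COLUMNS)
```

Keeping the rows as plain dictionaries until the last step means the parsing logic can be checked without a PDF or pandas at hand, and a `utility_type` value can later select which schema the row is validated against.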

Electricity Schema