This Notion page consists of the following:
The database design describes the schemas and field mappings for every stage of the project.
This includes all intermediate and final schemas used in the ETL pipeline, as well as the final data model in PostgreSQL.
During the ETL pipeline, the data goes through the following stages:
| Source | Description |
|---|---|
| Gmail API | Retrieves email messages and attachments using OAuth2. Key endpoints: `users.messages.list`, `users.messages.get` |
| Attachments | PDF invoices attached to emails from various utility providers |
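
As a rough illustration of how the two sources connect, the sketch below walks the `payload` of a message shaped like a `users.messages.get` response and collects the PDF attachments it finds. It makes no API calls and skips authentication entirely. The helper name `iter_pdf_attachments` and the sample message are hypothetical, and the PDF check (MIME type or `.pdf` extension) is an assumption rather than the project's actual rule.

```python
from __future__ import annotations

from typing import Any, Iterator


def iter_pdf_attachments(payload: dict[str, Any]) -> Iterator[tuple[str, str]]:
    """Yield (filename, attachmentId) for every PDF part in a message payload."""
    filename = payload.get("filename") or ""
    attachment_id = payload.get("body", {}).get("attachmentId")
    # Assumption: a part counts as a PDF if either its MIME type or its name says so.
    is_pdf = (
        payload.get("mimeType") == "application/pdf"
        or filename.lower().endswith(".pdf")
    )
    if attachment_id and is_pdf:
        yield filename, attachment_id
    # Multipart messages nest their parts, so recurse into each one.
    for part in payload.get("parts", []):
        yield from iter_pdf_attachments(part)


# Hypothetical message, trimmed to the fields used here.
sample_message = {
    "id": "msg-001",
    "payload": {
        "mimeType": "multipart/mixed",
        "filename": "",
        "body": {"size": 0},
        "parts": [
            {"mimeType": "text/plain", "filename": "", "body": {"size": 120}},
            {
                "mimeType": "application/pdf",
                "filename": "invoice_march.pdf",
                "body": {"attachmentId": "att-123", "size": 48211},
            },
        ],
    },
}

pdf_attachments = list(iter_pdf_attachments(sample_message["payload"]))
print(pdf_attachments)  # [('invoice_march.pdf', 'att-123')]
```

Each `attachmentId` only identifies the attachment. The bytes themselves would be fetched in a separate call; the Gmail API exposes `users.messages.attachments.get` for that.
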
The bronze dataframe is the first dataframe created in the pipeline. It stores the raw information parsed from the PDF via pdfplumber.
Each utility type has its own table schema, based on the information parsed.
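
One possible way to express per-utility bronze schemas is a simple registry keyed by utility type, sketched below with the standard library only. The utility types, column names, and the helper `to_bronze_row` are all hypothetical placeholders. The real columns depend on what can be parsed from each provider's invoices.

```python
from __future__ import annotations

# Hypothetical column sets, one per utility type; not the project's actual schemas.
BRONZE_SCHEMAS: dict[str, list[str]] = {
    "electricity": ["source_file", "invoice_number", "billing_period", "kwh_used", "amount_due"],
    "water": ["source_file", "invoice_number", "billing_period", "litres_used", "amount_due"],
}


def to_bronze_row(utility_type: str, parsed: dict[str, str]) -> dict[str, str | None]:
    """Align raw parsed fields to the bronze schema of one utility type.

    Values are kept as raw strings, missing fields become None,
    and fields outside the schema are dropped.
    """
    columns = BRONZE_SCHEMAS[utility_type]
    return {column: parsed.get(column) for column in columns}


# Hypothetical output of the PDF parsing step for one electricity invoice.
parsed_fields = {
    "source_file": "invoice_march.pdf",
    "invoice_number": "INV-0042",
    "kwh_used": "312",
    "unexpected_field": "ignored",
}

row = to_bronze_row("electricity", parsed_fields)
print(row)
```

Keeping every value as an unconverted string fits the idea of a raw bronze layer, with type casting and validation left to later stages. A list of such rows could then be passed to `pandas.DataFrame(rows, columns=BRONZE_SCHEMAS[utility_type])` to build the dataframe itself.
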