<aside> 📄

[Sep 13, 2025] - Research and Set Up

</aside>

Goal: Plan a solution to the problem and find a dataset with text data from OCR

Solution: This Kaggle dataset

TL;DR: Found a Kaggle dataset that contains receipt images and their OCR results. Planned the solution and set up GitHub.

Full Notes:

Inspired by an AI engineer system design interview: “From a CSV of receipt image paths, how can I build an ML system to extract invoice numbers?”

My potential solution:

Clean images (align, orient, etc.) → OCR → Extractions by heuristics → Extractions by LLM → Human-in-the-loop verification / extraction
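
To make the later stages of this pipeline concrete, here is a minimal sketch of the heuristics → LLM → human-review fallback, assuming the OCR text is already available as a string. Everything in it is illustrative rather than final: the regex, the `Extraction` record, and the `llm_extract` callback are hypothetical names I made up for the sketch, and the pattern will need tuning once I have labeled data.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical pattern: a keyword such as "Invoice No", "Inv #", or "Receipt #:"
# followed by an ID of 4+ characters that contains at least one digit.
# This is an assumption for illustration, not a pattern derived from the dataset.
INVOICE_PATTERN = re.compile(
    r"\b(?:invoice|inv|receipt|bill)\s*(?:number|num|no|#)?\s*[.:#]*\s*"
    r"((?=[A-Z0-9\-/]*\d)[A-Z0-9][A-Z0-9\-/]{3,})",
    re.IGNORECASE,
)


@dataclass
class Extraction:
    invoice_number: Optional[str]
    source: str  # "heuristic", "llm", or "needs_review"


def extract_by_heuristic(ocr_text: str) -> Optional[str]:
    """Return the first ID that follows an invoice-like keyword, if any."""
    match = INVOICE_PATTERN.search(ocr_text)
    return match.group(1) if match else None


def extract_invoice_number(
    ocr_text: str,
    llm_extract: Optional[Callable[[str], Optional[str]]] = None,
) -> Extraction:
    """Try the cheap heuristic first, then the LLM, then flag for a human."""
    value = extract_by_heuristic(ocr_text)
    if value is not None:
        return Extraction(value, "heuristic")
    if llm_extract is not None:
        # llm_extract is a placeholder for whatever LLM call is used later.
        value = llm_extract(ocr_text)
        if value:
            return Extraction(value, "llm")
    # Nothing found: route to human-in-the-loop verification.
    return Extraction(None, "needs_review")


if __name__ == "__main__":
    sample = "TAX INVOICE\nInvoice No: INV-00123\nTotal 12.50"
    print(extract_invoice_number(sample))
```

Ordering the stages from cheapest to most expensive means the LLM only sees receipts the heuristic could not handle, and a human only sees what neither stage resolved.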

Selecting a dataset:

I wanted to focus on the LLM engineering part, so for simplicity I chose a dataset that already includes OCR results. Check out my documentation and notes on the dataset here. It contains images and OCR'ed results but lacks ground-truth labels for invoice numbers.

If my goal is to train and fine-tune a model to extract invoice numbers, I will need ground-truth labels. My immediate next steps are to build a data labeling tool to streamline the work and then begin labeling data. Labeling will also help me design the extraction heuristics.