<aside> 📄

[Sep 13, 2025] — Research and Set Up

</aside>

Goal: Plan a solution to the problem and find a dataset with text data from OCR

Solution: This Kaggle dataset

TL;DR: Found a Kaggle dataset that contains receipt images and their OCR results. Planned the approach and set up the GitHub repo.

Full Notes:

Inspired by an AI engineer system design interview: “From a CSV of receipt image paths, how can I build an ML system to extract invoice numbers?”

My potential solution:

Clean images (align, orient, etc.) → OCR → Extraction by heuristics → Extraction by LLM → Human-in-the-loop verification / extraction
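To make the later stages of that pipeline concrete, here is a minimal sketch of the heuristics → LLM → human fallback chain, operating on OCR text that already exists. Everything named here is an assumption for illustration: the regex `INVOICE_RE`, the `Extraction` record, and the `llm_extract` callable (a placeholder for whatever LLM call ends up being used) are hypothetical and not taken from the dataset or from any library.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical heuristic: a keyword such as "Invoice No" followed by an
# ID-like token that contains at least one digit.
INVOICE_RE = re.compile(
    r"\b(?:invoice|inv|receipt|bill)\b\s*(?:number|num|no|#)?\s*[.:#]?\s*"
    r"((?=[A-Z0-9/-]*\d)[A-Z0-9][A-Z0-9/-]{3,})",
    re.IGNORECASE,
)


@dataclass
class Extraction:
    invoice_number: Optional[str]
    source: str  # "heuristic", "llm", or "needs_review"


def extract_by_heuristic(ocr_text: str) -> Optional[str]:
    """Return the first ID-like token that follows an invoice keyword."""
    match = INVOICE_RE.search(ocr_text)
    return match.group(1) if match else None


def extract_invoice_number(
    ocr_text: str,
    llm_extract: Optional[Callable[[str], Optional[str]]] = None,
) -> Extraction:
    """Try the cheap heuristic first, then the LLM, then flag for a human."""
    value = extract_by_heuristic(ocr_text)
    if value is not None:
        return Extraction(value, "heuristic")
    if llm_extract is not None:
        value = llm_extract(ocr_text)  # placeholder for a real LLM call
        if value:
            return Extraction(value, "llm")
    return Extraction(None, "needs_review")


if __name__ == "__main__":
    sample = "TOTAL 12.50\nInvoice No: INV-2025/0913\nThank you"
    print(extract_invoice_number(sample))
```

Ordering the stages from cheapest to most expensive means only receipts the regex cannot handle reach the LLM, and only those the LLM cannot handle reach a human reviewer.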

Selecting a dataset:

I wanted to focus on the LLM engineering part, so for simplicity I picked a dataset that already includes OCR results. Check out my 📝 documentation and notes on the dataset here. It contains images and OCR results, but lacks ground truth labels for invoice numbers.

If my goal is to train or fine-tune a model to extract invoice numbers, I will need ground truth labels. My immediate next steps are to build a data labeling tool to streamline the process and to begin labeling. Labeling will also help me design the extraction heuristics.
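As a starting point for the labeling tool, a minimal sketch of how labels might be stored: one JSON record per line, appended as each receipt is labeled. The file layout, the field names (`image_path`, `invoice_number`), and both helper functions are assumptions for illustration, not a decided design.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def append_label(label_file: Path, image_path: str, invoice_number: Optional[str]) -> None:
    """Append one label as a JSON line; None means the receipt has no invoice number."""
    record = {"image_path": image_path, "invoice_number": invoice_number}
    with label_file.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_labels(label_file: Path) -> Dict[str, Optional[str]]:
    """Read all labels back; a later record for the same image overrides an earlier one."""
    labels: Dict[str, Optional[str]] = {}
    if not label_file.exists():
        return labels
    with label_file.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                labels[record["image_path"]] = record["invoice_number"]
    return labels
```

An append-only file keeps a labeling session resumable: the tool can call `load_labels` on startup and skip any image that already has a record.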
