Goal: Plan a solution to the problem and find a dataset with text data from OCR
Solution: This Kaggle dataset
TL;DR: Found a Kaggle dataset that contains receipt images and their OCR results. Planned the solution and set up a GitHub repository.
Full Notes:
Inspired by an AI engineer system design interview: “From a CSV of receipt image paths, how can I build an ML system to extract invoice numbers?”
My potential solution:
Clean images (align, orient, etc.) → OCR → Extraction by heuristics → Extraction by LLM → Human-in-the-loop verification / extraction
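A rough sketch of how the last three stages of that pipeline could fit together, assuming the OCR text is already available as a string. Everything here is hypothetical: the regex, the `Extraction` type, and the `llm_extract` callable are placeholders I made up for illustration, not a tested design. The idea is to try a cheap heuristic first, fall back to an LLM only when it fails, and route anything still unresolved to human review.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical heuristic: a keyword such as "Invoice No" followed by an
# alphanumeric ID that contains at least one digit. Real receipts will
# need patterns tuned against labeled data.
INVOICE_PATTERN = re.compile(
    r"\b(?:invoice|inv|receipt|bill)\b[\s.:#]*(?:(?:number|num|no)\b)?[\s.:#]*"
    r"(?=[A-Z0-9/-]*\d)([A-Z0-9][A-Z0-9/-]{3,})",
    re.IGNORECASE,
)


@dataclass
class Extraction:
    invoice_number: Optional[str]
    source: str  # "heuristic", "llm", or "needs_review"


def extract_by_heuristic(ocr_text: str) -> Optional[str]:
    """Return the first ID that follows an invoice-like keyword, if any."""
    match = INVOICE_PATTERN.search(ocr_text)
    return match.group(1) if match else None


def extract_invoice_number(
    ocr_text: str,
    llm_extract: Optional[Callable[[str], Optional[str]]] = None,
) -> Extraction:
    """Heuristic first, then the (assumed) LLM callable, then human review."""
    value = extract_by_heuristic(ocr_text)
    if value is not None:
        return Extraction(value, "heuristic")
    if llm_extract is not None:
        value = llm_extract(ocr_text)
        if value:
            return Extraction(value, "llm")
    # Nothing worked: flag the receipt for a human to extract manually.
    return Extraction(None, "needs_review")


result = extract_invoice_number("ACME STORE\nInvoice No: INV-00123\nTOTAL 12.50")
print(result)
```

In this sketch only failed extractions go to a human. Verifying successful extractions too, as the pipeline above suggests, would mean sampling or reviewing the `"heuristic"` and `"llm"` results as well.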
Selecting a dataset:
I wanted to focus on the LLM engineering part, so I chose a dataset that already includes OCR results for simplicity. Check out my documentation and notes on the dataset here. It contains images and OCR'ed results but lacks ground-truth labels for invoice numbers.
If my goal is to train and fine-tune a model to extract invoice numbers, I will need ground-truth labels. My immediate next steps are to build a data labeling tool to streamline the work, then begin labeling data. Labeling will also help me design heuristics for extraction.
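As a starting point for the labeling tool, here is a minimal sketch of the storage side, assuming labels live in a single CSV keyed by image file name. The column names (`image`, `invoice_number`) and the helper names are my own placeholders, and the dataset's real file layout may call for something different. The point is just to make labeling resumable: already-labeled images are skipped on the next session.

```python
import csv
from pathlib import Path
from typing import Dict, List

FIELDS = ["image", "invoice_number"]  # assumed column names


def load_labels(path: Path) -> Dict[str, str]:
    """Read existing labels, or return an empty dict if the file is missing."""
    if not path.exists():
        return {}
    with path.open(newline="", encoding="utf-8") as f:
        return {row["image"]: row["invoice_number"] for row in csv.DictReader(f)}


def save_label(path: Path, image: str, invoice_number: str) -> None:
    """Append one label, writing the header the first time the file is created."""
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({"image": image, "invoice_number": invoice_number})


def unlabeled(images: List[str], path: Path) -> List[str]:
    """Return the images that do not have a label yet, in their original order."""
    done = load_labels(path)
    return [img for img in images if img not in done]
```

A labeling session would then loop over `unlabeled(...)`, show each receipt image next to its OCR text, and call `save_label` with whatever the annotator enters.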