Overview

As I started labeling the data, I quickly noticed an issue with the ground truth: not every receipt actually contains an invoice number. Instead, many carry other identifiers, such as GST, TRN, Slip, Receipt No, Transaction, or Check numbers.

This means we need heuristics both for labeling and for extraction. Strong heuristics matter because the more reliably they detect invoice numbers, the fewer inference calls we’ll need to make to a language model, saving both cost and time.

Heuristics for labeling

I followed these rules of thumb when labeling data:

  1. GST numbers were not used. A GST (Goods and Services Tax) number is a tax registration number, not an invoice number.

    Snippet of a receipt where both a GST number and an invoice number are available

  2. TRN / Transaction Numbers / Bill / Slip / C/N No / CB# / CHECK / MB values were chosen as a fallback when no obvious invoice number was available.

“Transaction No:” of 1-8046 is used as the ground truth label.

  3. An “ambiguous” label was given to receipts that did not have an obvious invoice number.

    T1 R000418193 is a possible invoice number, but since it is not explicitly labeled as one, the receipt is marked “ambiguous” to be safe.

  4. The full string was used. The unique identifier chosen here is POS/27408, not 27408.

“POS” means “Point of Sale”. I included it as part of the invoice number for simplicity and consistency.

  5. A leading “#” is removed from the string.

#00420693 is the full string, though 00420693 is used as the ground truth label. This can be done in the postprocessing script when extracting labels.

  6. “Slip No.” is preferred over “Trans”.

    train/X51005757199.jpg: “Slip No” was used instead of “Trans”. “Trans” is not a heuristic for data extraction.
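To make the labeling rules concrete, here is a minimal Python sketch of how they could be applied to the key/value pairs printed on a receipt. It is an illustration under assumptions, not the script used for this project: the function names, the `fields` input format, the spellings in `INVOICE_KEYS`, and the ordering inside `FALLBACK_KEYS` are all hypothetical. Only the fallback identifiers themselves, the “#” stripping, the full-string rule, and the “ambiguous” label come from the rules above.

```python
from typing import Mapping

AMBIGUOUS = "ambiguous"

# Assumed spellings of an explicit invoice key; the post does not list them.
INVOICE_KEYS = ("invoice no", "invoice number", "inv no")

# Fallback identifiers from rule 2. The order is an assumption, except that
# "slip no" is kept ahead of the others to mirror rule 6. "GST" and "Trans"
# are deliberately absent, so they are never picked as a label.
FALLBACK_KEYS = (
    "slip no", "transaction no", "trn", "bill", "c/n no", "cb#", "check", "mb",
)


def normalize_key(key: str) -> str:
    """Lowercase the printed key and drop trailing ':' or '.' characters."""
    return key.strip().lower().rstrip(":. ")


def normalize_value(value: str) -> str:
    """Keep the full string (e.g. 'POS/27408') but remove a leading '#'."""
    return value.strip().lstrip("#").strip()


def choose_label(fields: Mapping[str, str]) -> str:
    """Return the ground-truth label for one receipt, or 'ambiguous'."""
    cleaned = {normalize_key(k): v for k, v in fields.items()}
    # Explicit invoice keys are tried first, then the fallbacks in order.
    for key in INVOICE_KEYS + FALLBACK_KEYS:
        value = cleaned.get(key)
        if value:
            return normalize_value(value)
    return AMBIGUOUS


if __name__ == "__main__":
    print(choose_label({"Transaction No:": "1-8046"}))
    print(choose_label({"Invoice No": "#00420693"}))
    print(choose_label({"T1": "R000418193"}))
```

Keeping the rules in two ordered tuples means a new identifier can be added, or the priority changed, without touching the selection logic. A receipt whose only candidates are keys outside both tuples, such as “T1” or “Trans”, falls through to the “ambiguous” label.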