As I started labeling the data, I quickly noticed an issue with the ground truth: not every receipt actually contains an invoice number. Instead, many carry other identifiers, such as GST, TRN, Slip, Receipt No, Transaction, or Check numbers.
This means we need heuristics both for labeling and for extraction. Strong heuristics matter because the more reliably they detect invoice numbers, the fewer inference calls we’ll need to make to a language model, saving both cost and time.
I followed these rules-of-thumb when labeling data:
GST numbers were not used. Some research showed that GST is a tax number, not an invoice identifier.

Snippet of a receipt where both a GST number and an invoice number are available
TRN / Transaction Numbers / Bill / Slip / C/N No / CB# / CHECK / MB were used as a fallback when no obvious invoice number was available

The “Transaction No:” value, 1-8046, is used as the ground truth label.
An “ambiguous” label was given to receipts that did not have an obvious invoice number

T1 R000418193 is a possible invoice number, but since it is not explicitly labeled as one, it is safer to mark the receipt as ambiguous.
The full string was used. The unique identifier chosen here is POS/27408, not 27408.

“POS” means “Point of Sale”. I included it as part of the invoice number for simplicity and consistency.

#00420693 is the full string as printed, but 00420693 is used as the ground truth label. The leading “#” can be stripped in the postprocessing script when extracting labels.
“Slip No.” was preferred over “Trans”

train/X51005757199.jpg: “Slip No” was used instead of “Trans”. “Trans” is not used as a heuristic for data extraction.
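To make these rules concrete, here is a minimal sketch of how they might translate into an extraction heuristic. It is not the exact script used for this project: the function name `extract_invoice_number`, the tier grouping, and the regular expressions are all assumptions for illustration, and only a subset of the fallback labels is covered. It skips lines that mention GST, prefers an explicit invoice label, then “Slip No”, then the other fallbacks, and returns `None` for ambiguous receipts so the caller can hand those to the language model.

```python
import re
from typing import Optional

# Label patterns grouped into priority tiers (highest first).
# The tier order mirrors the labeling rules above; the regexes are assumptions.
LABEL_TIERS = [
    [r"invoice[ \t]*(?:no|number)?", r"inv[ \t]*no"],
    [r"slip[ \t]*no"],
    [r"transaction[ \t]*(?:no|number)", r"trn\b", r"receipt[ \t]*no",
     r"bill[ \t]*no", r"check\b"],
]

# Separator between label and value: spaces, tabs, ".", ":" and "#".
# Consuming "#" here means "#00420693" is returned as "00420693".
SEPARATOR = r"[ \t.:#]*"

# Value: letters, digits, "/" and "-", with at least one digit
# (e.g. POS/27408 or 1-8046). The whole string is kept, prefix included.
VALUE = r"(?P<value>[A-Za-z0-9/\-]*\d[A-Za-z0-9/\-]*)"


def extract_invoice_number(ocr_text: str) -> Optional[str]:
    """Return the highest-priority identifier, or None if no label matches."""
    # GST is a tax number, so any line mentioning it is ignored.
    lines = [
        line for line in ocr_text.splitlines()
        if not re.search(r"\bgst\b", line, re.IGNORECASE)
    ]
    for tier in LABEL_TIERS:
        for label in tier:
            pattern = re.compile(
                r"\b(?:" + label + r")" + SEPARATOR + VALUE, re.IGNORECASE
            )
            for line in lines:
                match = pattern.search(line)
                if match:
                    return match.group("value")
    # No explicit label found: treat the receipt as ambiguous.
    return None


if __name__ == "__main__":
    sample = "GST No: 000123\nInvoice No: POS/27408\nTrans: 555"
    print(extract_invoice_number(sample))
```

A bare “Trans” label never matches, because only “Transaction No” and “TRN” are listed in the fallback tier. One limitation of this sketch is that a line carrying both a GST number and an invoice number would be skipped entirely.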