📊 Dataset Documentation & Notes

🏷️ Dataset: data-v2.1

data-v2.1 contains labeled training and test data with and without “ambiguous” tags

Changes relative to data-v2:

test/X51005715007: Corrected ground truth label to use “Slip” over “Trans”
GitHub; Kaggle; HuggingFace

🏷️ Dataset: data-v2

data-v2 contains labeled training and test data with and without “ambiguous” tags

To label the dataset, I used these heuristics
Train: 72 / 626 (12%) ambiguous labels
Test: 24 / 347 (7%) ambiguous labels
GitHub; Kaggle; HuggingFace
The model layoutlmv3-lora-invoice-number was fine-tuned on this dataset
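
As a quick sanity check on the ambiguous-label percentages above, here is a minimal Python sketch that recomputes them from the stated counts. The names `counts` and `ambiguous_share` are illustrative only and are not part of the dataset or its tooling.

```python
# Ambiguous-label counts copied from the notes above: (ambiguous, total).
counts = {"train": (72, 626), "test": (24, 347)}


def ambiguous_share(ambiguous: int, total: int) -> float:
    """Return the percentage of examples that carry an ambiguous label."""
    return 100.0 * ambiguous / total


# Percentage of ambiguous labels per split.
shares = {split: ambiguous_share(a, n) for split, (a, n) in counts.items()}

for split, pct in shares.items():
    print(f"{split}: {pct:.1f}% ambiguous")
```

Rounded to whole percentages, the results match the 12% (train) and 7% (test) figures quoted above.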

🏷️ Dataset: data-v1

data-v1 contains labeled training data ONLY, with and without “ambiguous” tags

To label the dataset, I used these heuristics