my notes, polished with claude
I've spent a good chunk of time dealing with OCR for Indian identity documents - Aadhaar, PAN, voter IDs, salary slips. Not clean PDFs. Real images. Photographed under bad lighting, slightly rotated, sometimes blurry, often with text in two different Indic scripts on the same line.
the question is - can open-source work here?
People say "I need OCR" when what they actually need is document intelligence.
Raw OCR gives you text - a blob of characters pulled from an image, roughly in the order they appeared. Most of the time (at least in the problems I worked on), what you actually need is structure. Which text is the name? Which is the DOB? Which numbers are the document ID vs the address pin code? You need key-value pairs you can actually use, not "Rajesh Kumar 27/03/1985 BFGPS1234K" dumped in a string.
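To make the blob-vs-structure gap concrete, here's a toy sketch of turning that example string into fields. The field names and regexes are my own illustrative assumptions (PAN-style ID, dd/mm/yyyy date), not a production schema - and heuristics like this are exactly what break on real documents:

```python
import re

def structure_ocr_blob(text: str) -> dict:
    """Toy extractor: pull an ID number and a DOB out of a raw OCR string.

    Assumed patterns (illustrative only):
    - document ID: 5 uppercase letters, 4 digits, 1 uppercase letter
    - DOB: dd/mm/yyyy
    """
    fields = {}
    doc_id = re.search(r"\b[A-Z]{5}\d{4}[A-Z]\b", text)
    if doc_id:
        fields["document_id"] = doc_id.group()
    dob = re.search(r"\b\d{2}/\d{2}/\d{4}\b", text)
    if dob:
        fields["dob"] = dob.group()
        # Crude guess: whatever precedes the DOB is the name. Real documents
        # need a model for this, not string surgery.
        fields["name"] = text[:dob.start()].strip()
    return fields

print(structure_ocr_blob("Rajesh Kumar 27/03/1985 BFGPS1234K"))
# → {'document_id': 'BFGPS1234K', 'dob': '27/03/1985', 'name': 'Rajesh Kumar'}
```

This works on the happy-path string above and on nothing else - which is the whole point of the rest of these notes.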
Early 2024, the obvious move: just use an LLM with media support. Upload image, ask GPT-4 or Claude to extract fields, get JSON back.
Tried it. Accuracy wasn't there - especially on documents with degraded image quality or non-English scripts. Those models (at the time) didn't have built-in OCR as good as Google Vision's.
But even if accuracy had been fine, there was a bigger concern: dependency. Routing every identity document through a third-party API at scale means your pipeline's reliability, cost, latency, and data privacy are someone else's problem. Or rather, they become your problem the moment something changes on their end.
We needed to own more of the stack.
Split the problem in two:
For OCR we used Google Vision. Not open-source, but genuinely excellent - handles rotation, bad image quality, multilingual text, all of it.
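One wrinkle worth showing: Vision's text detection returns word-level annotations with bounding boxes, and you often have to regroup words into lines yourself before the structuring step - especially when two scripts share a line. This is a sketch under my own assumptions (the input mimics the shape of Vision's `textAnnotations` JSON, and the `y_tolerance` line-grouping heuristic is mine, not part of the API):

```python
def words_by_line(annotations, y_tolerance=10):
    """Group word-level OCR annotations into lines by vertical position.

    `annotations` mimics Google Vision's `textAnnotations` response shape:
    element 0 is the full detected text, later elements are single words,
    each with a boundingPoly of {x, y} vertices. The y_tolerance grouping
    is a heuristic assumption, not something the API provides.
    """
    words = []
    for ann in annotations[1:]:  # skip the full-text summary at index 0
        ys = [v["y"] for v in ann["boundingPoly"]["vertices"]]
        xs = [v["x"] for v in ann["boundingPoly"]["vertices"]]
        words.append((min(ys), min(xs), ann["description"]))
    words.sort()  # top-to-bottom, then left-to-right

    lines, current, current_y = [], [], None
    for y, x, text in words:
        if current_y is None or abs(y - current_y) <= y_tolerance:
            current.append((x, text))
            if current_y is None:
                current_y = y
        else:
            current.sort()
            lines.append(" ".join(t for _, t in current))
            current, current_y = [(x, text)], y
    if current:
        current.sort()
        lines.append(" ".join(t for _, t in current))
    return lines
```

On a slightly rotated photo, words on the same visual line come back with slightly different y coordinates, which is why the tolerance exists at all.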
For structuring we tried multiple open-source LLMs and settled on Qwen-2.5-32B, which worked best for our use case. It scaled to millions of document hits per month, with accuracy above 98% and latency under 3 seconds.
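Whatever model does the structuring, the glue code looks roughly the same: prompt for a fixed JSON schema, then validate hard before trusting the output, because open models sometimes wrap JSON in markdown fences or invent extra keys. The schema and fence-stripping heuristic below are my assumptions for illustration, not our production prompt:

```python
import json

# Assumed schema for illustration - not the actual production field set.
REQUIRED_FIELDS = {"name", "dob", "document_id"}

PROMPT_TEMPLATE = """Extract the following fields from this OCR text as JSON
with exactly these keys: name, dob, document_id. Use null if a field is absent.

OCR text:
{ocr_text}
"""

def parse_model_output(raw: str) -> dict:
    """Parse and validate an LLM's JSON reply against the expected schema.

    Strips markdown code fences if present, then rejects output with
    missing or extra keys instead of passing hallucinated fields downstream.
    """
    raw = raw.strip()
    if raw.startswith("```"):
        raw = raw.strip("`")
        if raw.startswith("json"):
            raw = raw[4:]
    data = json.loads(raw)
    extra = set(data) - REQUIRED_FIELDS
    missing = REQUIRED_FIELDS - set(data)
    if extra or missing:
        raise ValueError(f"schema mismatch: extra={extra} missing={missing}")
    return data
```

The validation step matters more than it looks: at millions of hits a month, even a tiny rate of malformed model output is a steady stream of bad records unless you reject it at the boundary.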