- Tesseract is a free OCR library that produces some of the best results available.
- It contains two engines: the ‘legacy’ engine and a modern ‘LSTM’ (neural-network-based) engine.
- The legacy engine is trained on specific fonts and can guess which font a piece of text uses. It is also good at identifying the positions of individual characters, and it neither relies on nor benefits from “context” to spot words.
- The LSTM engine is faster (I think), uses smaller data sets, copes better with fonts it has not been trained on, and benefits from “context”.
- Tesseract uses a ‘traineddata’ file for each language (or for a group of languages that share the same script). These files are specific to the engine.
- We can specify the engine (-dOCREngine=) and the language files (-sOCRLanguage="eng") at runtime.
- Different sets of data are available. For LSTM we have “best” and “fast”. The “best” files are ~25 MB per language; the “fast” files are ~2 MB per language. A full set of “best” data for all the languages is 1.2 GB.
- I envisage an OEM having either “eng” (just English) or “latin” (all the languages that use Latin script, 80 MB) built in, and maybe having others available as extensions (perhaps on a USB key that people can plug into their printer).
- We have 5 devices within gs that work with OCR:
  - ocr: simple text extraction.
  - hocr: “HOCR” format (XML based text extraction with positions for each char).
  - pdfocr8: outputs PDFs as greyscale images, with overlaid invisible OCR text for cut/paste/searching.
  - pdfocr24: outputs PDFs as RGB images, with overlaid invisible OCR text for cut/paste/searching.
  - pdfocr32: outputs PDFs as CMYK images, with overlaid invisible OCR text for cut/paste/searching.
- Adding Tesseract with built-in “fast” English support adds about 5.6 MB to the ARM binary size (4.5 MB library, 1.1 MB compressed “eng” data).
- OCR speed depends on resolution and on the density of the text. zlib.3.pdf (a typical 2 page text document) at 200 dpi takes about 28 seconds on my Pi 3B+, and 7.5 seconds on my desktop PC.
- This engine is a good choice when we are processing entire pages at a time.
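
To make the runtime options above concrete, here is a rough sketch of a helper that assembles a gs command line for one of the OCR devices. The device names, -sOCRLanguage and -dOCREngine come from the notes; -dBATCH, -dNOPAUSE, -sDEVICE, -r and -sOutputFile are standard Ghostscript switches. The function name, its defaults and the file names are made up for illustration. The notes do not say which numeric value of -dOCREngine selects which engine, so the helper passes through whatever the caller supplies and omits the switch otherwise.

```python
import shlex
from typing import List, Optional

# OCR-capable devices as listed in the notes above.
OCR_DEVICES = ("ocr", "hocr", "pdfocr8", "pdfocr24", "pdfocr32")


def build_gs_ocr_command(
    input_path: str,
    output_path: str,
    device: str = "pdfocr8",
    language: str = "eng",
    engine: Optional[int] = None,
    dpi: int = 200,
    gs_binary: str = "gs",
) -> List[str]:
    """Return an argv list for running Ghostscript with an OCR device.

    This only builds the command; it does not run Ghostscript.
    """
    if device not in OCR_DEVICES:
        raise ValueError(f"unknown OCR device: {device!r}")

    cmd = [
        gs_binary,
        "-dBATCH",                    # exit after the last file
        "-dNOPAUSE",                  # do not prompt between pages
        f"-sDEVICE={device}",
        f"-r{dpi}",                   # rendering resolution in dpi
        f"-sOCRLanguage={language}",  # traineddata to use, e.g. "eng"
    ]
    if engine is not None:
        # Engine numbering is not given in the notes; the value is
        # passed through unchanged (assumption: caller knows the mapping).
        cmd.append(f"-dOCREngine={engine}")
    cmd.append(f"-sOutputFile={output_path}")
    cmd.append(input_path)
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    argv = build_gs_ocr_command("zlib.3.pdf", "zlib.3.ocr.pdf")
    print(shlex.join(argv))
```

Returning an argv list rather than a single string means the result can be handed straight to subprocess.run without shell quoting problems, which matters for the language value in particular.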