A visual depiction of structured vs unstructured OCR output:
https://s.yimg.com/ny/api/res/1.2/yoPj5cUwAcFEwAKlDxsmXg--/YXBwaWQ9aGlnaGxhbmRlcjt3PTk2MDtoPTYzNg--/https://media.zenfs.com/en/coconuts_manila_225/be6d87442eae1e5bb873a9099ac4b09a
(ABBYY FineReader on the left, Tesseract on the right)
There are, though, several examples out there which combine Tesseract with OpenCV (for determining structure), and there's always Layout Parser.
For ABBYY FineReader, there looks to be some butcherable code using their Python OCR SDK here. For Tesseract & OpenCV there's this, but it comes with a "never tried it, may not work" health warning ;-)
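For a flavour of the Tesseract + OpenCV approach, here's an untested sketch in the same spirit: dilate the binarised page so each block of text becomes one blob, find the blobs as contours, then OCR each one. The filename, kernel size and threshold settings are all guesses that would need tuning per document.

```python
# Untested sketch: use OpenCV to find text blocks, then OCR each block
# with Tesseract via pytesseract. Kernel size, threshold settings and
# iteration counts are guesses that would need tuning per document.
import cv2
import pytesseract

image = cv2.imread("page.png")  # hypothetical input scan
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Binarise, then dilate so characters in the same block merge into one blob.
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 5))
dilated = cv2.dilate(thresh, kernel, iterations=2)

# Each external contour is (roughly) one structural block:
# a paragraph, heading or column.
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

blocks = []
for contour in sorted(contours, key=lambda c: cv2.boundingRect(c)[1]):  # top to bottom
    x, y, w, h = cv2.boundingRect(contour)
    text = pytesseract.image_to_string(gray[y:y + h, x:x + w])
    blocks.append({"bbox": (x, y, w, h), "text": text.strip()})
```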
Here are the docs for LangChain's Markdown splitter:
MarkdownHeaderTextSplitter | 🦜️🔗 Langchain
(There's also an HTML splitter, and a very basic character/token-string splitter.)
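Going by the linked docs, basic usage looks roughly like this. The "Header 1"/"Header 2" labels are arbitrary metadata keys, and the sample text is a toy:

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

md_text = "# Mills\n\n## Barker's\n\nSome OCR'd text...\n\n## Salts\n\nMore text..."  # toy input

# The second element of each tuple becomes a metadata key on the chunks,
# so each chunk "knows" which headers it sits under.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
for doc in splitter.split_text(md_text):
    print(doc.metadata, doc.page_content)
```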
Linked from the above, this note from vector database provider Pinecone is a really helpful introduction to "chunking":
Chunking Strategies for LLM Applications
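As a taste of what the Pinecone piece covers, a recursive character split with overlap looks something like this. The sizes are illustrative only, and ocr_output.txt is a hypothetical input file:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

ocr_text = open("ocr_output.txt").read()  # hypothetical: a page of OCR'd text

# Sizes are illustrative only; the right values depend on the embedding
# model's context window and what you want retrieval to return.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,     # max characters per chunk
    chunk_overlap=50,   # shared context between neighbouring chunks
    separators=["\n\n", "\n", " ", ""],  # prefer splitting on paragraph breaks
)
chunks = splitter.split_text(ocr_text)
```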
As of today, OpenAI's API has a preview version of their "GPT-4 with Vision" (AKA multimodal image & text/video) model available. It's sufficiently new that LangChain doesn't appear to have integrated it into their framework yet. However, it's available to use directly in Python via the OpenAI library:
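Untested as yet, but a call would look something like this. "gpt-4-vision-preview" is the preview model name at the time of writing, and the OCR text and image URL are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ocr_text = "...OCR'd text of the page..."  # placeholder

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=1024,  # worth setting explicitly; the preview's default is low
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Here is the OCR'd text of a page:\n\n" + ocr_text
                    + "\n\nUsing the image for layout cues, restructure it as Markdown.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/page.png"},  # placeholder
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```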
Haven't got to trying it yet, but planning to run some tests using the image as supporting information alongside the OCR'd text when structuring and segmenting/filtering into Markdown etc.
(A degree of caution is needed though, given that this is GPT-4, 50x more expensive than GPT-3.5, and there can be a fairly substantial token cost depending on the size of the input image.)
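For a rough feel of that image cost, here's a back-of-envelope token calculator based on my reading of OpenAI's published tiling scheme at the preview's launch; treat the numbers as approximate and liable to change:

```python
import math

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate image token cost, per OpenAI's published scheme at the
    vision preview's launch: low detail is a flat 85 tokens; high detail
    scales the image to fit 2048x2048, then to a 768px shortest side,
    and charges 170 tokens per 512px tile plus a flat 85."""
    if detail == "low":
        return 85
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = 768 / min(width, height)
    if scale < 1.0:
        width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

print(image_tokens(1024, 2048))  # a tall scan: 6 tiles -> 1105 tokens
```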
Update: Created a quick/messy diagram interpretation example (without LangChain) in the Textiles repo:
https://github.com/congruence-engine/textiles/blob/main/people/asacalow/gpt-barker/gpt4_vision_barker.ipynb