A visual depiction of structured vs unstructured OCR output:
https://s.yimg.com/ny/api/res/1.2/yoPj5cUwAcFEwAKlDxsmXg--/YXBwaWQ9aGlnaGxhbmRlcjt3PTk2MDtoPTYzNg--/https://media.zenfs.com/en/coconuts_manila_225/be6d87442eae1e5bb873a9099ac4b09a
(ABBYY FineReader on the left, Tesseract on the right)
There are, though, several examples out there which combine Tesseract with OpenCV (for determining structure), and there's always Layout Parser.
For ABBYY FineReader, there looks to be some butcherable code using their Python OCR SDK here. For Tesseract & OpenCV there's this, but it comes with a "never tried it, may not work" health warning ;-)
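For a flavour of the Tesseract + OpenCV approach, here's an untested sketch in the same spirit: dilate the binarised page so each block of text becomes one blob, find the blobs as contours, then OCR each one. The filename, kernel size and threshold settings are all guesses that would need tuning per document.

```python
# Untested sketch: use OpenCV to find text blocks, then OCR each block
# with Tesseract via pytesseract. Kernel size, threshold settings and
# iteration counts are guesses that would need tuning per document.
import cv2
import pytesseract

image = cv2.imread("page.png")  # hypothetical input scan
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Binarise, then dilate so characters in the same block merge into one blob.
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 5))
dilated = cv2.dilate(thresh, kernel, iterations=2)

# Each external contour is (roughly) one structural block:
# a paragraph, heading or column.
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

blocks = []
for contour in sorted(contours, key=lambda c: cv2.boundingRect(c)[1]):  # top to bottom
    x, y, w, h = cv2.boundingRect(contour)
    text = pytesseract.image_to_string(gray[y:y + h, x:x + w])
    blocks.append({"bbox": (x, y, w, h), "text": text.strip()})
```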
Here are the docs for LangChain's Markdown splitter:
MarkdownHeaderTextSplitter | 🦜️🔗 Langchain
(There's also an HTML splitter, and a very basic character/token-string splitter.)
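Going by the linked docs, basic usage looks roughly like this. The "Header 1"/"Header 2" labels are arbitrary metadata keys, and the sample text is a toy:

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

md_text = "# Mills\n\n## Barker's\n\nSome OCR'd text...\n\n## Salts\n\nMore text..."  # toy input

# The second element of each tuple becomes a metadata key on the chunks,
# so each chunk "knows" which headers it sits under.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
for doc in splitter.split_text(md_text):
    print(doc.metadata, doc.page_content)
```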
Linked from the above, this note from vector database provider Pinecone is a really helpful introduction to "chunking":
Chunking Strategies for LLM Applications
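As a taste of what the Pinecone piece covers, a recursive character split with overlap looks something like this. The sizes are illustrative only, and ocr_output.txt is a hypothetical input file:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

ocr_text = open("ocr_output.txt").read()  # hypothetical: a page of OCR'd text

# Sizes are illustrative only; the right values depend on the embedding
# model's context window and what you want retrieval to return.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,     # max characters per chunk
    chunk_overlap=50,   # shared context between neighbouring chunks
    separators=["\n\n", "\n", " ", ""],  # prefer splitting on paragraph breaks
)
chunks = splitter.split_text(ocr_text)
```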
As of today, OpenAI's API has a preview version of their "GPT-4 with Vision" (AKA multimodal image & text/video) model available. It's sufficiently new that LangChain doesn't appear to have integrated it into their framework yet. However, it's available to use directly in Python via the OpenAI library:
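Untested as yet, but a call would look something like this. "gpt-4-vision-preview" is the preview model name at the time of writing, and the OCR text and image URL are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ocr_text = "...OCR'd text of the page..."  # placeholder

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=1024,  # worth setting explicitly; the preview's default is low
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Here is the OCR'd text of a page:\n\n" + ocr_text
                    + "\n\nUsing the image for layout cues, restructure it as Markdown.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/page.png"},  # placeholder
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```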
Haven't got to trying it yet, but planning to run some tests using the image as supporting information alongside the OCR'd text when structuring and segmenting/filtering into Markdown etc.
(A degree of caution is needed though, given that this is GPT-4, 50x more expensive than GPT-3.5, and there can be a fairly substantial token cost depending on the size of the input image.)
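For a rough feel of that image cost, here's a back-of-envelope token calculator based on my reading of OpenAI's published tiling scheme at the preview's launch; treat the numbers as approximate and liable to change:

```python
import math

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate image token cost, per OpenAI's published scheme at the
    vision preview's launch: low detail is a flat 85 tokens; high detail
    scales the image to fit 2048x2048, then to a 768px shortest side,
    and charges 170 tokens per 512px tile plus a flat 85."""
    if detail == "low":
        return 85
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = 768 / min(width, height)
    if scale < 1.0:
        width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

print(image_tokens(1024, 2048))  # a tall scan: 6 tiles -> 1105 tokens
```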
Update: Created a quick/messy diagram interpretation example (without LangChain) in the Textiles repo:
https://github.com/congruence-engine/textiles/blob/main/people/asacalow/gpt-barker/gpt4_vision_barker.ipynb