This is my inaugural diary entry. It’s taken me a while to get going but now that I’ve begun, I’m setting out with the intention of making weekly updates.

In my spare time, I’ve been known to be partial to a jigsaw puzzle; the larger and more complex the better. I’m also a novice origami enthusiast. Bear with me as I promise this does have a point. My work over the past few weeks has felt a lot like starting out with a new puzzle or piece of origami: stitching together pieces of code that will eventually form a pipeline to feed data into a database or interact with a large language model seems (to me) akin to identifying the corners and the edge pieces of a puzzle or making those first few, tentative folds in the paper. A 2D image or paper sculpture will appear but not without the occasional hunt for a missing piece or the undoing of an erroneous fold. On a personal level, it’s challenging work because a lot of the techniques I’m working with are new or ever evolving but at the same time, I’m finding it highly enjoyable.

Now on to the specifics of the stitching that I’ve been doing:

Adventures with unstructured data

Like many of my fellow researchers on this project, most of my work involves dealing with unstructured data. I need to be able to identify people, places, occupations, dates and so on from unstructured text, but also to retain enough of the surrounding context to know who was connected to whom, when, where and how.

LLMs hold great promise for the context-aware extraction of entities and relationships from unstructured data, but the source data often needs manipulation beyond simple text chunking before the model can work well on what it is shown. This is certainly the case with unstructured PDFs in which the text is interspersed with tables in a variety of formats and/or images. I have decided to try out Adobe’s PDF Extract API to pre-process a set of sample PDFs from Tasha’s Post Office circulars investigation. The API is based on Adobe’s Sensei machine learning model and promises context-aware extraction of text, tables and images. It’s not the only method for dealing with unstructured data in PDF format, but trying a method from the developers of the PDF format itself seemed a sensible starting point.
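To give a flavour of the pre-processing step: the Extract API returns its results as a JSON file whose "elements" each carry a structural "Path" (and, for running text, a "Text" field). A minimal sketch of separating prose from table elements might look like this — the sample data is invented for illustration, and a real pipeline would also handle the table renditions the API exports separately:

```python
def split_elements(structured_data: dict) -> tuple[list[str], list[str]]:
    """Separate running text from table-related elements in Extract API output.

    Assumes the documented shape: an "elements" list whose items carry a
    structural "Path" (e.g. "//Document/P[1]") and, for text, a "Text" field.
    """
    paragraphs, tables = [], []
    for el in structured_data.get("elements", []):
        path = el.get("Path", "")
        if "/Table" in path:
            # Route table-related elements aside for separate handling.
            tables.append(path)
        elif "Text" in el:
            paragraphs.append(el["Text"].strip())
    return paragraphs, tables

# Tiny inline sample mimicking the shape of the JSON the API returns.
sample = {
    "elements": [
        {"Path": "//Document/P[1]", "Text": "Post Office circular, 1921. "},
        {"Path": "//Document/Table[1]", "attributes": {"NumRow": 3}},
        {"Path": "//Document/P[2]", "Text": "Appointments and transfers."},
    ]
}
paras, table_paths = split_elements(sample)
print(paras)        # clean text, ready for chunking
print(table_paths)  # table elements to process separately
```

The paragraphs can then go straight into a chunking step, while tables are handled on their own terms rather than being mangled into the text stream.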

The API is not without its annoyances: scanned PDFs are subject to a 150-page limit, so it has been necessary to split files. There is also a rate limit on API calls per minute and a monthly cap on the number of documents that can be processed on the free tier. Having said this, these limitations can be worked around: the API is easy to use and initial tests have yielded decent results.

The aim is to be able to pre-process a whole pipeline of PDFs for feeding into the LLMs that Tasha is using. Once this is set up, Tasha and I hope to implement retrieval-augmented generation (RAG) to optimise the outputs of the LLM interrogation.
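The core RAG loop is simple to sketch even before any infrastructure exists: retrieve the chunks most relevant to a question, then hand only those to the model as context. The scoring below is deliberately crude term overlap — a real setup would use embeddings and a vector index — and the sample chunks are invented:

```python
from collections import Counter

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by naive term overlap with the query; return the top k.

    A stand-in for embedding-based retrieval, for illustration only.
    """
    q_terms = Counter(query.lower().split())
    def score(chunk: str) -> int:
        return sum(q_terms[t] for t in chunk.lower().split() if t in q_terms)
    return sorted(chunks, key=score, reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the retrieved chunks into a grounded prompt for the LLM."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

chunks = [
    "The sub-postmaster at Leeds was appointed in March.",
    "Mail coach timetables for the northern district.",
    "Appointment of a new postmaster at Leeds post office.",
]
top = retrieve("Who was appointed postmaster at Leeds", chunks)
print(build_prompt("Who was appointed postmaster at Leeds", top))
```

Swapping the retrieval function for a vector-store lookup leaves the rest of the loop unchanged, which is what makes RAG a natural add-on once the pre-processing pipeline is in place.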

This exercise will deliver a method that can be adapted to process unstructured PDFs across other investigations. Given the pace of development of LLMs, the need to pre-process inputs in this manner will almost certainly be obsolete in the next couple of years but in the meantime, a readily adaptable solution should yield many dividends for the project.

Of course, PDFs are not the only source of unstructured data I need to deal with, but they are probably one of the harder formats. Dealing with text chunks from CSVs, word-processing formats or the output of web scrapes is far less challenging.

Pipelines with Neo4j

In my previous work, any pipelines incorporating Neo4j have been quite limited: I’ve had fairly sanitised source data to ingest, a single Python script running all of the database creation and data analysis followed by the connection to a visualisation platform to visualise the output. I’ve also been able to focus mainly on the instance data without having to pay much attention to a governing ontology.

This pipeline will need a significant upgrade to meet the requirements of building out a database for the Lost Mills investigation at scale. The investigation itself will be discussed at length at tomorrow’s investigations meeting, so I won’t say too much about it now. In terms of my stitching together of processes for this investigation, much of my time has been spent working out how to use frameworks like LangChain to build pipelines that combine LLMs with Neo4j to automate entity and relationship generation from unstructured sources, layering an ontology onto the resulting database using Neo4j’s n10s plugin, and using the knowledge graph in RAG processes to further interrogate LLMs. I realise that it’s inadvisable to rely on LLMs alone for automated knowledge graph creation, so I have also been exploring how to combine traditional NLP methods with new-fangled, whizzy GenAI processes.
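The hand-off between the extraction step and Neo4j can be sketched as turning extracted triples into parameterized Cypher. The single `Entity` label and the triple shape below are simplifying assumptions of mine — in practice the relation types would be validated against the ontology loaded via n10s before anything is written, and a framework like LangChain provides richer graph-construction helpers:

```python
def triples_to_cypher(triples: list[tuple[str, str, str]]) -> list[tuple[str, dict]]:
    """Turn (subject, RELATION, object) triples from an extraction step
    into (Cypher query, parameters) pairs for a Neo4j driver session.
    """
    statements = []
    for subj, rel, obj in triples:
        # Relationship types cannot be parameterized in Cypher, so the
        # (ontology-validated) type is interpolated; names are parameters,
        # and MERGE keeps repeated extractions idempotent.
        query = (
            "MERGE (a:Entity {name: $subj}) "
            "MERGE (b:Entity {name: $obj}) "
            f"MERGE (a)-[:{rel}]->(b)"
        )
        statements.append((query, {"subj": subj, "obj": obj}))
    return statements

# Hypothetical triple, as an LLM or NLP extraction step might produce it.
stmts = triples_to_cypher([("Ada Smith", "WORKED_AT", "Leeds Mill")])
print(stmts[0][0])
print(stmts[0][1])
```

Each pair would then be passed to `session.run(query, **params)` on an open driver session; keeping the writes idempotent via MERGE means the same document can be re-processed without duplicating nodes.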

The pace of development of knowledge graph integration with GenAI is really quite mind boggling. I’m struggling to keep up but at the same time, it’s incredible to be involved in work incorporating these evolving technologies.