I’ve spent quite a lot of this week chasing potential datapoints for the Exhibit interactive, which is not terribly interesting content for a diary, although it has been exciting to see our conception of the “minimum” we could do continue to grow! In the case of the Exhibit, I think it’s key to note that the medium is really not the message (sorry, McLuhan): the mapping interface we are aiming to create will allow us to showcase a range of investigations, techniques, and historic information, rather than just demonstrating the project’s ability to produce lovely maps.
Read on for an update on the environmental impact work, some bits I’ve been working on for the data registry, and a brief comparison of ABBYY’s and ChatGPT’s OCR abilities.
Experiments with transcription
Inspired by Daniel’s work turning the Post Office Guides into a Kepler map, which shows where Bradford could send post to in 1906 and the time parcels would take in transit, I decided to see if I could create an interactive ‘chat’ version of the Post Office Guide which might showcase how LLMs and geospatial work can both be used to interrogate the same data. This was partly also inspired by an exhibit idea I had back in October, where an interactive might allow visitors to select a medium and a location, and find out how long it would have taken to contact a given location from Bradford.
The experiment of course meant I had to transcribe the 1906 Guide – a small, tightly bound book, photographed by me on an iPhone. I thought it might be useful for others to see some of the results I had when doing this work.
Below is a paragraph from the Guide, with the bounding box created by ABBYY FineReader:

Below is ABBYY’s plain text reading of this same paragraph:

Then, here is ChatGPT’s transcription of the exact same block that ABBYY identified as a paragraph:

The difference is clear. But ChatGPT will not deal with more than one image at a time when transcribing – so for now ChatGPT’s superiority is not something we can make use of particularly often.
I also tried preparing a table for feeding into GPT, using both ABBYY and ChatGPT to process the table itself. ChatGPT understands a table very well, and can transform it into text that conveys the same message as the table (see below). However, if asked to produce a table (in Word or Excel) that looks the way the image of the table looks, it creates a structure that it is then unable to understand when fed the document back and asked basic questions. ABBYY did a better job of reading the text within the table than it did reading the paragraph above, and was able to create a table out of the text; however, information is lost when it misreads certain words. This means that if ABBYY processed the data and that output was fed to GPT, GPT would never be able to answer a question about Morecambe, because ABBYY failed to transcribe the word ‘Morecambe’.
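The table-to-text step that ChatGPT handled well can also be sketched in plain code. Here a hypothetical CSV export of the Guide’s transit table (the column names and values are my invention, not the Guide’s actual layout) is flattened into sentences an LLM can answer questions over:

```python
import csv
import io

# Hypothetical export of a Post Office Guide transit table --
# the columns and figures are illustrative, not the real 1906 data.
TABLE_CSV = """destination,medium,transit_time_hours
Leeds,parcel post,5
Morecambe,parcel post,8
London,parcel post,12
"""

def table_to_sentences(csv_text):
    """Flatten each table row into a sentence an LLM can read reliably."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        f"From Bradford, {row['medium']} to {row['destination']} "
        f"took about {row['transit_time_hours']} hours in 1906."
        for row in rows
    ]

sentences = table_to_sentences(TABLE_CSV)
```

This also illustrates the failure mode above: if the OCR stage misreads ‘Morecambe’, the sentence about Morecambe is never generated correctly, so no downstream question about it can be answered.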


I tried sending all of this through GPT-4 Vision (GPT-4V) via the API, but the request was rejected for exceeding the length limit. I’ll return to this next week, time permitting, and see whether I can create a query bot or whether the transcription issues are too cumbersome.
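Length rejections like this can usually be worked around by chunking the text and sending one request per chunk. A minimal sketch of a chunker, using the rough rule of thumb of ~4 characters per token (a real tokeniser such as tiktoken would give exact counts, and the budget of 3,000 tokens is my own placeholder, not an API constant):

```python
def chunk_text(text, max_tokens=3000, chars_per_token=4):
    """Split text into chunks that each fit under a rough token budget.

    Splits on paragraph boundaries so no sentence is cut mid-way;
    the characters-per-token estimate is a rule of thumb only.
    """
    max_chars = max_tokens * chars_per_token
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would blow the budget
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own API request, with the answers (or transcriptions) stitched back together afterwards.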
Data registry work
Just as I used ChatGPT and KeyBERT for my personal researcher notes work, I’ve been seeing whether the same prompting techniques could be useful for creating catalogue descriptions and tags for the data registry.
I used the same MyGPT I used for the researcher notes work to create some summaries. Below are two outputs from when the GPT was fed Bradford Industrial Museum’s machine data: the first from feeding it the entire catalogue, and the second from feeding it only the ‘brief description’ field for each object.
I then ran the ‘brief description’ field through KeyBERT, to create up to five keywords per description. I then condensed this column into one long string of text and identified the most common keywords within that string (I can share this code with anyone if helpful). The output might be a useful way to tag the catalogue within a data registry, although I think a human in the loop could probably do a better job, and just as easily, as the code can.
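The counting step looks roughly like this. The descriptions and keyword lists below are invented stand-ins for the museum catalogue; in the real run, KeyBERT’s `extract_keywords(text, top_n=5)` produces the per-description keywords:

```python
from collections import Counter

# In the real pipeline each keyword list comes from
# KeyBERT().extract_keywords(description, top_n=5);
# these are hand-picked stand-ins for illustration.
keywords_per_description = [
    ["spindle", "frame", "motor"],
    ["loom", "yarn", "motor"],
    ["winder", "spindle", "motor"],
]

# Condense all keywords into one list and count the most
# common across the whole catalogue
all_keywords = [kw for kws in keywords_per_description for kw in kws]
most_common = Counter(all_keywords).most_common(10)
```

On the real catalogue this produces the (keyword, count) pairs shown below.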
Most Common Keywords: [('spindle', 39), ('loom', 34), ('motor', 29), ('box', 19), ('machine', 19), ('winder', 15), ('rpm', 11), ('gill', 10), ('yarn', 10), ('frame', 10)]