19/05/2024 | Notion

Investigation-specific updates and reflections covering the last fortnight are below.

Investigating gendered language in the archive

It was great to meet with Lucy Havens this week to talk about our plans for deploying her NLP on the BT Archives dataset as well as a second catalogue from within the project. Anna-Maria is investigating whether it would be possible to have an export of the BBC archive for this, but if not, I think the SMG catalogue would be an interesting case study.

A large focus of deploying the gender bias NLP is that it exposes potential bias, which may or may not be explained away through conversations with specialists. An example might be the use of the term ‘girls’ in BT Archives to describe telephone operators. Some may see this as the infantilisation of women in the workplace, while others may point out that it is actually a comment on the age of the worker, and that it seems less problematic when compared with the mentions of ‘boys’ in the same catalogue and within the same context. These quirks, and the usefulness of the NLP’s output, will be evaluated through conversations with the BT Archives team once Lucy has had a chance to run her code on the existing catalogue export we have.
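To make the ‘girls’ and ‘boys’ comparison concrete, here is a minimal sketch of how mentions of each term could be counted side by side. This is not Lucy's NLP, just a simple whole-word count. The function name `count_terms` is hypothetical, and the descriptions are made-up placeholders, not real BT Archives records.

```python
import re
from collections import Counter


def count_terms(descriptions, terms):
    """Count how many descriptions mention each term as a whole word."""
    counts = Counter({term: 0 for term in terms})
    for text in descriptions:
        for term in terms:
            # \b keeps the match to whole words; matching is case-insensitive.
            if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
                counts[term] += 1
    return counts


# Made-up placeholder descriptions, NOT real catalogue records.
descriptions = [
    "Photograph of girls at a telephone switchboard",
    "Messenger boys outside an exchange",
    "Group of boys and girls at a training school",
]

counts = count_terms(descriptions, ["girls", "boys"])
for term, n in counts.items():
    print(f"{term}: mentioned in {n} description(s)")
```

A count like this only shows where the terms appear. Whether a given use is problematic is still a question for the conversations with specialists described above.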

Computer vision on communications collections

It was great to meet with Geoff, George, and Jon today to discuss the work Kaspar has been focused on for Heritage Weaver in recent weeks. Kaspar talked us through how the database can be searched using text or image fields, with objects linked through visual aspects, text aspects, or a combination of both, and finally gave a small taster of the work he has done to detect objects within moving image material.

Something that has become clear through this investigation is how subjective linkage can be. While one user might be excited to see 45 objects linked because they are all circular and blue, another user (most especially a curator) would likely only care to see 45 identical objects linked together, or at least 45 objects which are the same type, material, or age.

I am excited by the idea that we might be able to make all shapes and sizes of user happy by offering a small “linkage annotation” exercise which lets the tool know the types of linkage you are personally interested in, and then creates a connected database that makes sense for you. That said, it may be that there is less need for an adaptable database for museum professionals, who are all looking for a tool which helps them identify exactly what their collections objects are. For a more practical, job-specific database, Geoff suggested we change the annotations on the database from “link” and “no link” to more specific tags such as “exact same object,” “not the same object,” “sort of the same object,” and so on.
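As a rough illustration of the more specific tags, here is a minimal sketch of how annotations could be stored and then filtered by the kinds of linkage a given user cares about. It is not how Heritage Weaver works. The label wording follows Geoff's suggestion, while the class names, function name, and object identifiers are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class LinkLabel(Enum):
    # Wording from Geoff's suggestion; the identifiers are hypothetical.
    EXACT_SAME = "exact same object"
    SORT_OF_SAME = "sort of the same object"
    NOT_SAME = "not the same object"


@dataclass(frozen=True)
class LinkAnnotation:
    object_a: str  # hypothetical catalogue identifier
    object_b: str  # hypothetical catalogue identifier
    label: LinkLabel


def links_for_user(annotations, wanted_labels):
    """Keep only the object pairs whose label the user has opted into."""
    wanted = set(wanted_labels)
    return [(a.object_a, a.object_b) for a in annotations if a.label in wanted]


# Placeholder annotations for illustration only.
annotations = [
    LinkAnnotation("obj-001", "obj-002", LinkLabel.EXACT_SAME),
    LinkAnnotation("obj-001", "obj-003", LinkLabel.SORT_OF_SAME),
    LinkAnnotation("obj-002", "obj-004", LinkLabel.NOT_SAME),
]

# A curator might only want identical objects linked together.
curator_links = links_for_user(annotations, [LinkLabel.EXACT_SAME])

# A more exploratory user might also accept looser matches.
browser_links = links_for_user(
    annotations, [LinkLabel.EXACT_SAME, LinkLabel.SORT_OF_SAME]
)
```

Keeping the label separate from the pair means the same set of annotations could serve both kinds of user, with each person choosing which labels count as a link for them.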

I will be collecting energy and textile objects for the Heritage Weaver work next week, so perhaps this investigation is no longer “computer vision on communications collections” but just “computer vision on collections!” An excellent point made in today’s meeting was that it would be good to look at collections beyond just NMS and SMG, so I’ll also see if another smaller museum collection could be good to ingest as part of this work.

In the future, Kaspar and I hope to host a small workshop to work on annotation as a group, partly to further complexify the question ‘what is linkage?’

Turning the Circulars into a searchable database

Nayomi and I will share an update on this work at the next investigations meeting. Nayomi has found a way to parse the Circulars, which meant I was able to spend a fun day feeding circulars to the pipeline and testing the results. Some early reflections are:

LlamaParse handles both PDFs that were OCR’d by ABBYY and ABBYY’s plain text output very well
Questions where there is only one clear answer are understood well and responded to clearly (see below for the RAG output and a relevant Circulars screenshot)

[Image: RAG output and relevant Circulars screenshot]

Where there is more than one possible answer, the tool is unable to find them all at once (again, see below)