A very happy holidays to any colleagues off on leave after today! I will still be around next week, so this is my penultimate diary entry of the year 😊
Gender
Had a very productive meeting with Anna-Maria on Tuesday where we looked over some of the genderize results, reflected on her new investigation into positionality, and planned next steps for this work. At Monday’s drop-in we will be talking to Kunika and Kaspar about, amongst other things, Lucy’s NLP work and what can be done with it.
AMS and I also had a really interesting chat with Beyond Words Studio. Beyond Words previously worked with Total Jobs on their gender bias decoder, and are now asking the Total Jobs team whether we could use that language model to investigate BT’s dataset. That would give us three different tools to trial for the gender work with BT.
Beyond Words are also a strong data visualisation company with an interest in (their own words) “doing work for good” – they’ve worked with the Natural History Museum, the Wellcome, and the World Health Organization. I think they’re a company well worth noting for future outputs.
Researcher notes
My message to the Telecommunications Heritage Group was well received, and a few volunteers have now started making notes from archival records, with a few more due to get going early next week. I am letting the volunteers select their own sources from a shortlist, which means they each bring their own motivations and research questions to the files.
Some are making notes on their computers, some are handwriting them (an early experiment this week for the Undercliff Cemetery project made me appreciate that GPT-4 is very good at handwriting recognition), and some are entering notes into a template I have provided.
I played with some new prompts for turning notes into descriptions, including a small attempt to create my own MyGPT fed with BT’s cataloguing manual. I am finding that GPT tends to create too much of a story, even embellishing with phrases such as “this collection of documents pertains to the operations and business decisions made by the General Post Office (GPO) regarding the Electrophone service, a pioneering audio service in the early 20th century.” I still think the best approach may be to condense the notes into strings of keywords and prompt GPT to describe a file that features them.
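To make the condensing idea concrete, here is a minimal sketch of what I mean: strip the notes down to their most frequent content words, then build a prompt that asks for a plain description featuring them. The stop-word list, example notes, and prompt wording are all illustrative assumptions, not BT’s actual cataloguing guidance or my final prompt.

```python
# Sketch: condense free-text researcher notes into keywords, then build a
# prompt that discourages GPT's narrative embellishment. All wording here
# is illustrative, not the real cataloguing manual or production prompt.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "for", "with",
              "this", "that", "is", "are", "was", "were", "by", "at", "from"}

def condense_to_keywords(notes: str, top_n: int = 10) -> list[str]:
    """Reduce free-text notes to their most frequent content words."""
    words = re.findall(r"[A-Za-z][A-Za-z-]+", notes.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

def build_prompt(keywords: list[str]) -> str:
    """Ask for a factual catalogue description, not a story."""
    return (
        "Write a concise, factual archive catalogue description for a file "
        f"featuring the following key terms: {', '.join(keywords)}. "
        "Do not add historical context or narrative embellishment."
    )

# Hypothetical volunteer notes for demonstration only
notes = ("Minutes and memoranda on the Electrophone service, covering GPO "
         "licensing decisions, subscriber numbers, and Electrophone tariffs.")
print(build_prompt(condense_to_keywords(notes, top_n=5)))
```

The resulting prompt string would then be sent to the GPT API in the usual way; the hope is that keyword strings give the model less raw narrative to embroider.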
Computer vision of communication collections
Today we had a kick-off meeting for the part of this investigation relying on the visual embeddings work from Heritage Weaver. Kaspar gave another brilliant demonstration of his previous work, which got everyone very excited about the possibilities of these techniques for linking the NMS and SMG comms datasets – always with an eye on scalability and the prospect of linkage for the entire collection.
Some clear areas of interest emerged from the discussions today.
Meanwhile, I’ve continued my own attempt to link the NMS and SMG datasets using LLaVA. When visualising these linkages, I’ve found that plotly can annotate datapoints based on a fuzzy search, so clusters of similar words become easier to see. This is helpful because it means less time spent scrolling around trying to identify what each section of the map contains. It would be even better if the map were searchable, so that typing ‘telephone’ into a search bar highlighted pockets of relevant points. I’m looking into whether this would be possible using dash, while also sending lots more images through LLaVA for future use.
GPO Circulars
Just as a point of interest, I tried pushing one circular through the GPT API this week. Asking “how many times does Bradford appear in this text?” was estimated to cost at least $28, which really drives home just how large the circular dataset is!
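For anyone curious where estimates like that come from, here is a back-of-envelope sketch: count (or approximate) the tokens in the text and multiply by a per-token input price. Both the ~4-characters-per-token heuristic and the $0.03 per 1k tokens figure below are rough illustrative assumptions, not exact OpenAI pricing, and the character count is a made-up example rather than a real circular.

```python
# Sketch: rough input cost for sending a long document to a chat model.
# The chars-per-token heuristic and the per-1k-token price are assumptions.
def estimate_cost(n_chars: int, usd_per_1k_tokens: float = 0.03) -> float:
    """Approximate the input-token cost of sending n_chars of text."""
    est_tokens = n_chars / 4          # crude heuristic: ~4 chars per token
    return est_tokens / 1000 * usd_per_1k_tokens

# e.g. a hypothetical document OCR'd to a million characters of text
print(f"${estimate_cost(1_000_000):.2f}")  # roughly $7.50 of input tokens
```

Once a document runs to millions of characters, estimates in the tens of dollars per question fall out very quickly, which is why a single circular can cost so much to query.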