A very happy holidays to any colleagues off on leave after today! I will still be around next week, so this is my penultimate diary entry of the year 😊
Gender
Had a very productive meeting with Anna-Maria on Tuesday where we looked over some of the genderize results, reflected on her new investigation into positionality, and planned next steps for this work. At Monday’s drop-in we will be talking to Kunika and Kaspar about, amongst other things, Lucy’s NLP work and what can be done with it.
AMS and I also had a really interesting chat with Beyond Words Studio. Beyond Words previously worked with Total Jobs on their gender bias decoder, and are now asking the Total Jobs team whether we could use that language model to investigate BT’s dataset. That would give us three different tools to trial for the gender work with BT.
Beyond Words are also a strong data visualisation company with an interest in (their own words) “doing work for good” – they’ve worked with the Natural History Museum, the Wellcome, and the World Health Organization. I think they’re a company well worth noting for future outputs.
Researcher notes
My message to the Telecommunications Heritage Group was well received, and a few volunteers have now started making notes from archival records, with a few more due to get going early next week. I am letting the volunteers select their own sources from a shortlist, which means they each bring their own motivations and research questions to the files.
Some are making notes on their computers, some are handwriting them (an early experiment this week for the Undercliff Cemetery project made me appreciate that GPT-4 is very good at handwriting recognition), and some are entering notes into a template I have provided.
I played with some new prompts for turning notes into descriptions, including a small attempt to create my own MyGPT fed with BT’s cataloguing manual. I am finding that GPT tends to create too much of a story, even embellishing with phrases such as “this collection of documents pertains to the operations and business decisions made by the General Post Office (GPO) regarding the Electrophone service, a pioneering audio service in the early 20th century.” I still think the best approach may be to condense the notes into strings of keywords and prompt GPT to describe a file that features them.
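To make the condensing idea concrete, here is a minimal sketch of what I mean: strip the notes down to their most frequent content words, then build a prompt that asks for a plain description featuring them. The stop-word list, example notes, and prompt wording are all illustrative assumptions, not BT’s actual cataloguing guidance or my final prompt.

```python
# Sketch: condense free-text researcher notes into keywords, then build a
# prompt that discourages GPT's narrative embellishment. All wording here
# is illustrative, not the real cataloguing manual or production prompt.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "for", "with",
              "this", "that", "is", "are", "was", "were", "by", "at", "from"}

def condense_to_keywords(notes: str, top_n: int = 10) -> list[str]:
    """Reduce free-text notes to their most frequent content words."""
    words = re.findall(r"[A-Za-z][A-Za-z-]+", notes.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

def build_prompt(keywords: list[str]) -> str:
    """Ask for a factual catalogue description, not a story."""
    return (
        "Write a concise, factual archive catalogue description for a file "
        f"featuring the following key terms: {', '.join(keywords)}. "
        "Do not add historical context or narrative embellishment."
    )

# Hypothetical volunteer notes for demonstration only
notes = ("Minutes and memoranda on the Electrophone service, covering GPO "
         "licensing decisions, subscriber numbers, and Electrophone tariffs.")
print(build_prompt(condense_to_keywords(notes, top_n=5)))
```

The resulting prompt string would then be sent to the GPT API in the usual way; the hope is that keyword strings give the model less raw narrative to embroider.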
Computer vision of communication collections
Today we had a kick-off meeting for the part of this investigation relying on the visual embeddings work from Heritage Weaver. Kaspar gave another brilliant demonstration of his previous work, which got everyone very excited about the possibilities of these techniques for linking the NMS and SMG comms datasets – always with an eye on scalability and the prospect of linkage for the entire collection.
Some clear areas of interest emerged from the discussions today.
Meanwhile, I’ve continued my own attempt to link the NMS and SMG datasets using LLaVA. When visualising these linkages, I’ve found that plotly can annotate datapoints based on a fuzzy search, so clusters of similar words become easier to see. This is helpful because it means less time spent scrolling around trying to identify what each section of the map contains. It would be even better if the map were searchable, so that typing ‘telephone’ into a search bar highlighted pockets of relevant points. I’m looking into whether this would be possible using dash, while also sending lots more images through LLaVA for future use.
GPO Circulars
Just as a point of interest, I tried pushing one circular through the GPT API this week. Asking “how many times does Bradford appear in this text?” was estimated to cost at least $28, which really drives home just how large the circular dataset is!
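For anyone curious where estimates like that come from, here is a back-of-envelope sketch: count (or approximate) the tokens in the text and multiply by a per-token input price. Both the ~4-characters-per-token heuristic and the $0.03 per 1k tokens figure below are rough illustrative assumptions, not exact OpenAI pricing, and the character count is a made-up example rather than a real circular.

```python
# Sketch: rough input cost for sending a long document to a chat model.
# The chars-per-token heuristic and the per-1k-token price are assumptions.
def estimate_cost(n_chars: int, usd_per_1k_tokens: float = 0.03) -> float:
    """Approximate the input-token cost of sending n_chars of text."""
    est_tokens = n_chars / 4          # crude heuristic: ~4 chars per token
    return est_tokens / 1000 * usd_per_1k_tokens

# e.g. a hypothetical document OCR'd to a million characters of text
print(f"${estimate_cost(1_000_000):.2f}")  # roughly $7.50 of input tokens
```

Once a document runs to millions of characters, estimates in the tens of dollars per question fall out very quickly, which is why a single circular can cost so much to query.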