02/02/2024 | Notion

Below are a few notes on the personal researcher notes work, my own experiments with NER, and a few photos from the Baird TV factory which we may be interested in digitising. Not covered, but still enjoyed, are the many productive discussions had with team members throughout the week. These included a great catch-up with Kunika and Daniel at the digital drop-in, some exciting new ideas from Max (flood data and historic source linkage – watch this space!), and planning meetings regarding the upcoming Newcastle evaluation as well as a BT-led mutual learning session to be co-ordinated by Anna-Maria.

Personal researcher notes

Thank you to those who gave feedback on this work at Monday’s investigation meeting.

I am going to be sharing the outputs from this investigation with BT next week, and will ask to what extent they would need to tweak the existing descriptions in order to incorporate them into the archive itself. This week, James started uploading catalogue descriptions generated through his work with LLAMA to CALM, so there is now officially some precedent for using machine-generated descriptions within their cataloguing system.

Once I have feedback from BT, I will share the outputs with participants. I have already created a Microsoft form which asks a few key questions:

Did they change the way they read the file and took notes because they knew the goal was to create a description?
Do they think the generated description describes the file they read adequately?
How would they want to encounter this sort of description in an archive? Would they want to know it was from GPT?
How would they expect to be credited within the archive?

There are a few sticky ethical considerations surrounding this work, not least because it involves feeding large amounts of data to ChatGPT. I get the sense these quandaries would be good to address both in my upcoming TaNC webinar presentation and in a future journal output. I think we will encounter very similar ethical challenges in the upcoming work with BCB for the Bradford Convergence. Putting recordings of radio shows through a speech-to-text platform and then feeding those transcripts through an LLM may open up even more questions around consent and attribution, with more voices involved and theoretically more permissions to check!

Experiments with NER

While a few of my other investigations are awaiting technical assistance, I have started to crack on with getting myself up to speed with named entity recognition (NER). I want to reach a stage where I have a fairly good idea of how I could use NER to perform some linkage with BT’s transcripts, so that I can then request access to them. For now, I have been trialling the technique on their ‘information sheets’ (basically, finding guides).

I started the week using Label Studio, which is fairly user-friendly but, as others on the project already know, would require quite a lot of training. Having a go at this training did give me an idea of what the most common ‘entity’ in the information sheets was, which seemed to be ‘organisations.’

This theory has since been backed up by my work with spaCy, which I quickly turned to out of a desire to privilege machine learning over human annotation. I ran spaCy on all 30 information sheets, looking for all available entities, and found that ‘ORG’ was indeed the most recognised entity:

[Image: Untitled]

I think this makes a lot of sense when we consider that BT is a corporate archive, containing information relating to a lot of different telegraph and telephone companies, as well as GPO departments, and engineering groups.
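A tally like the one described above can be sketched as follows. This is a minimal illustration rather than the script actually used: the model name en_core_web_sm and the function names extract_entities and count_entity_labels are assumptions, and any installed spaCy pipeline with an NER component could be swapped in.

```python
from collections import Counter


def extract_entities(texts, model_name="en_core_web_sm"):
    """Yield a list of (entity text, entity label) pairs for each input text.

    model_name is an assumption; it must name a spaCy pipeline that is
    already installed and includes a named entity recogniser.
    """
    import spacy  # imported lazily so the counting helper works without spaCy

    nlp = spacy.load(model_name)
    for doc in nlp.pipe(texts):
        yield [(ent.text, ent.label_) for ent in doc.ents]


def count_entity_labels(entities_per_doc):
    """Count how often each entity label (ORG, GPE, DATE, ...) occurs overall."""
    counts = Counter()
    for entities in entities_per_doc:
        counts.update(label for _text, label in entities)
    return counts
```

Given a list of sheet texts, count_entity_labels(extract_entities(texts)).most_common() would return the labels ranked by frequency, which is the kind of ranking in which ‘ORG’ came out on top.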

Unsurprisingly, and as already established by the work of others, spaCy has little interest in recognising objects. In the visualisation below we can see it totally disregards any mention of a telephone, a bell, a dial, or the more complex “automatic exchange system”:

[Image: Untitled]

I am interested to see if we can make use of spaCy’s ability to detect organisations and set aside its failure to find objects. It may be possible to link the organisation names found in the sheets to the ‘maker’ field in Mimsy, meaning we can connect BT data to SMG objects. I’ve made relevant enquiries to other team members and look forward to seeing what can be done.
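To give a feel for what such a linkage might involve, here is a rough sketch using only Python’s standard library. Everything in it is hypothetical: it assumes the ‘maker’ values can be exported from Mimsy as a plain list of strings, and the suffix list, the 0.85 threshold and the function names normalise and link_to_makers are illustrative choices rather than anything agreed with the team.

```python
import re
from difflib import SequenceMatcher

# Common company suffixes to ignore when comparing names (illustrative only).
SUFFIXES = {"ltd", "limited", "co", "company", "plc", "inc"}


def normalise(name):
    """Lower-case a name, strip punctuation and drop common company suffixes."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(w for w in words if w not in SUFFIXES)


def link_to_makers(org_names, makers, threshold=0.85):
    """Map each organisation name to its best-matching maker, or None.

    threshold is an arbitrary starting point and would need tuning
    against real data.
    """
    links = {}
    for org in org_names:
        org_key = normalise(org)
        best_maker, best_score = None, 0.0
        for maker in makers:
            maker_key = normalise(maker)
            if not org_key or not maker_key:
                continue  # nothing left to compare after normalisation
            score = SequenceMatcher(None, org_key, maker_key).ratio()
            if score > best_score:
                best_maker, best_score = maker, score
        links[org] = best_maker if best_score >= threshold else None
    return links
```

Names that fall below the threshold come back as None, so they can be checked by hand rather than being silently linked to the wrong maker.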