7 August 2023 | Notion

Present: Tim, Jane, Alex B, Alex A, Stefania, Arran, Jane, Sarah, Daniel, Alex F, Felix, Anna-Maria, Kunika, Stef

Apologies: Helen

Agenda:

Deeper Catalogue Data - Tim (see "An Intensive day with some progress", 28 April 2023)
Oral histories and folk songs investigation - Stefania, Daniel and Stef

Notes:

Although we started by thinking about ‘bags of terms’ as fodder for linking, this investigation, amongst others, is beginning to show that taxonomy and categories remain very important for linking. The 1921 catalogue contains a list of subject headings that could easily be transposed (though it would, of course, need to be added to).

The investigation has 3 ways of extracting object names:
1. Use the headings and break them down semantically. Requires a human in the loop.
2. SPD
3. KeyBERT
Both SPD and KeyBERT are effective, and it is difficult to say whether one is much better than the other. A small amount of further work is needed to experiment with the specifics; this is not quite at proof of concept yet.
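
One way the comparison between the two methods could be made more concrete is to measure how far their term lists overlap. The sketch below is illustrative only: the term lists are invented, `normalise` and `compare_term_lists` are hypothetical helper names, and it does not call SPD or KeyBERT themselves.

```python
def normalise(term: str) -> str:
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(term.lower().split())


def compare_term_lists(a, b):
    """Return shared terms, terms unique to each list, and Jaccard overlap."""
    set_a = {normalise(t) for t in a}
    set_b = {normalise(t) for t in b}
    shared = set_a & set_b
    union = set_a | set_b
    # Jaccard overlap: size of intersection divided by size of union.
    jaccard = len(shared) / len(union) if union else 0.0
    return {
        "shared": sorted(shared),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
        "jaccard": jaccard,
    }


# Invented example terms (assumption), not real output from either method.
spd_terms = ["Spinning wheel", "hand loom", "shuttle"]
keybert_terms = ["hand  loom", "shuttle", "carding machine"]

result = compare_term_lists(spd_terms, keybert_terms)
print(result["shared"])
print(round(result["jaccard"], 2))
```

The terms unique to each list (`only_a`, `only_b`) are probably the most useful output for human review, since they show where the two methods disagree.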

Questions asked by Tim:
- If we use SPD to identify object names, we need a background text. Would it be worth exploring using a different part of the SMG collection (perhaps the mining collection) as the background text? Arran - what are the benefits of using a similar style of text? (Alex B noted that it would help remove any stylistic anomalies and focus on the terms that differ.) General agreement that this doesn’t need to be a focus now, but the question of what the ideal background text is could be picked up later in the investigation.
- Is the next step to apply this technique to the much larger textile collection today? Arran - Would it be worth comparing the object names produced with what is in the SMG catalogue now? Alex B and Kaspar - there is value in using the terms already coming out of this for the NER training work we want to do on object identification. There will be human-in-the-loop work to understand the most appropriate n-gram span for capturing the most useful part of the catalogue instance that will be used to train NER. Felix to run KeyBERT on the rest of the SMG catalogue; some comparison work can then be done and further terms identified for a fuller taxonomy. Alex A will take on aspects of the work as he has more domain expertise.
- Is a next step to work semantically with the text to understand its potential contribution to ontology construction? Alex B - it would be useful, but maybe at a later point when the ontology structures are clearer. It could be done on the telegraphy or the mining catalogue?
- Representing and understanding the curator’s mindset beyond its use in an ontology. Proposed that it is something to pause on until next year, and for it to be something that falls into the research for publication.
Link to Curatorial Voice project and published articles: https://curatorialvoice.github.io/
Oral histories - Stef and Stefania. To what extent can what has already been done be understood as proof of concept?
- S&S feel that there is still a piece of work that needs to be done that links the oral histories to museum objects. Still stuck with applying the tools to other datasets due to admin reasons at Leicester.
- What are the visual and exploratory potentials of the tool that Stef has produced? S&S keen to do an extended session with some of the CE Co-Is and researchers to help finish off this aspect of the work. This will give a stronger historiographical focus to the investigation. Further work on the interface would be proposed for a workshop next year.
- Fine-tuning the model. Less related to proof of concept, but it feels like an important aspect of the investigation as a collaboration.
There may be a risk to pausing the investigation after getting the historians excited about the oral histories work. But we can un-pause at any point. Stef’s PhD student will be working on themes about oral histories and living well together in cities. So, there will be continued development work.

Stef keen to highlight that we need to keep expectations managed, as this is very much at proof-of-concept phase and not even at prototype stage.

A high-performance computer would be needed to run this again on a different dataset. Felix interested in which aspects of the work require high-performance computing. The automated transcription would, but so would the embedding (using Bloom). (Alex B - As we get some VMs here at SMG, Stef might be able to tap into some of that too.)

Neslihan is bringing together object records and oral histories - this might be the point to bring the data together in a vector database, particularly at the SMG. Options: Pinecone or Weaviate.

We agreed that it would be good to move forward with the extended meeting with historians. Also agreed that if Stef needs to / wants to move on with the work next year, we will support it as best as we can.

Daniel offered that there are over 900 pages of transcription related to mining oral histories if we wanted energy-related material too.