CONFERENCES
Anna Maria, Arran and I have received the notification of acceptance of our paper to the DH Conference in Graz! It’s so exciting, considering the high number of proposals (749) and the overall acceptance rate (50%). Unfortunately Anna Maria won’t be able to come in person, but Arran and I will be preparing the presentation with her. The Conference team has opened applications for Travel Bursaries of 500 euros for early career researchers who are presenting (https://adho.org/awards/conference-bursary-awards/), which might help us to cover the conference expenses. It would be helpful to discuss this soon, as the deadline is 14 April.
Following our discussion in the Research Fellow meeting, I have been working with Daniel on the Congruence Engine sessions for the Digital Festival for the History of Science, and finalizing and sharing the proposal with Tim and the potential speakers. We would like to propose a Roundtable on the National Collection as a Social Machine and a Demonstration Session on the digital techniques explored in the different investigations.
FOLK SONGS INVESTIGATION
I have worked with Carol to finalize Jennifer’s contract, including a proposed timetable (subject to change) which I reviewed with Kunika this week to make sure that the technical developments are in line with the tasks we will be asking Jennifer to do. We agreed that three months is a feasible time to develop the annotation schema, and that it would give the wider team enough time to discuss and identify a potential annotation tool (e.g. Prodigy or Label Studio) that could be used in other investigations as well. Kunika suggested using Microsoft Word as a preliminary annotation tool with Jennifer: we agreed that this can be a very helpful starting point, as Word can be used without knowing XML, while its styles function generates XML that can be extracted from the document. Word also gives us the opportunity to add further comments in the text that can help us to understand and describe our categories/labels. We discussed the opportunity to develop a ‘reflective diary’ of our work together, to observe how the technical and the human interact in the ‘human in the loop’ approach. We also reflected on the scope of the technical knowledge in the investigation and agreed that it would be helpful for Daniel and me to understand how XML works and why it is useful, without necessarily learning how to use the language (this would require dedicated training and an amount of time that is outside the scope of the investigation itself). Instead, the way to go would be to start familiarizing ourselves with XML files, to be able to identify the annotated labels from the Word document, and to read about other examples/case studies where XML was used in digital humanities/digital history.
More details about our discussion are in the Drop in Session section, where Kunika also provided very helpful examples of Word files used as an annotation tool and a visual explanation of the method: https://www.notion.so/congurence-engine/Annotation-tools-for-ML-training-sets-scope-of-technical-knowledge-3a9c395e79eb4580bc663c8f72dccdab
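To make the Word-as-annotation-tool idea more concrete, here is a minimal sketch of how labels applied as Word character styles could be read back out of the XML inside a .docx file. It is an illustration under stated assumptions, not the method Kunika demonstrated: the style names (Occupation, Place), the sample sentence and the function names are all invented for the example. It also assumes that each label is applied as a character style, and that the value stored in the XML is Word's internal style ID, which can differ slightly from the name shown in the Word interface.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by Word for the main document part (WordprocessingML).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_document_xml(docx_path):
    """A .docx file is a zip archive; the body text lives in word/document.xml."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def styled_runs(document_xml):
    """Return (style_id, text) pairs for every run that carries a character style."""
    root = ET.fromstring(document_xml)
    pairs = []
    for run in root.iterfind(".//w:r", NS):
        style = run.find("w:rPr/w:rStyle", NS)
        if style is None:
            continue  # unstyled text, i.e. not annotated
        text = "".join(t.text or "" for t in run.iterfind("w:t", NS))
        pairs.append((style.get(f"{{{W}}}val"), text))
    return pairs


# Invented sample: one sentence with two annotated words (hypothetical labels).
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body><w:p>
<w:r><w:t xml:space="preserve">The </w:t></w:r>
<w:r><w:rPr><w:rStyle w:val="Occupation"/></w:rPr><w:t>weavers</w:t></w:r>
<w:r><w:t xml:space="preserve"> of </w:t></w:r>
<w:r><w:rPr><w:rStyle w:val="Place"/></w:rPr><w:t>Oldham</w:t></w:r>
</w:p></w:body></w:document>"""

for label, text in styled_runs(SAMPLE):
    print(label, "->", text)
```

With a real annotated file, `styled_runs(read_document_xml("annotated.docx"))` would be the equivalent call (the file name is a placeholder). Word sometimes splits one highlighted phrase into several runs, so a real pipeline would probably need to merge adjacent runs that share a style.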
I also reviewed the timetable with Paul, who will be working on the film project, and we agreed to include an in-person recording session with Jennifer. We scheduled another meeting next week to discuss the film timetable separately, as we might need a series of dedicated planning sessions to understand which elements of this investigation will be part of the film.
I also had a very inspiring meeting with Kirstie Blair, the PI of the Piston, Pen & Press project. She explained to me that the dataset which is online resulted from a data collection process aimed at selecting the most representative industry-related literary works for each author (so excluding songs and poems not strictly related to industrial themes). This fits very well with the scope of Congruence Engine, as this dataset might contain a very large number of connections with industrial collections. She was excited to hear about our work on folk songs and we agreed to keep her updated on the next steps. She then put me in contact with the developers for the data transfer of the PPP database, and they suggested using a data service (https://www.dhi.ac.uk/data), releasing the whole dataset with a Creative Commons License.
In terms of data collection, I shared with Felix our reflections about the mistakes in the OCR process and I am currently exploring with him the opportunity to use different images of the Lancashire Ballads anthology, to improve the transcriptions of the musical score sections. I sent him a couple of photos I took with my phone and they seem to give clearly better results. There is a little bit of manual editing to do in ABBYY FineReader (cropping and dragging boxes below the music notation so they are recognised and scanned for text). We will schedule a dedicated session next week with Daniel so he can guide us through this process.
ORAL HISTORY INVESTIGATION
I finalized the details for our next in-person session on Friday 24th in Leicester (13:00 to 16:00). Stef shared the GitHub repository with Kunika and Felix, and Felix will also join remotely next Friday.
I met Tonya Outtram and Tom Fisher again with Stef, to update them on how the investigation is progressing. Stef guided us through the pre-processing stage, and I was able to see how Whisper generates the transcripts, which was extremely helpful. Each sentence is displayed on a different line of a CSV file alongside the start and end time and a ‘token’. We noticed that the transcript is rather accurate, although the placenames are often miscaptured. Stef also showed us a list of names extracted using NER for the pseudonymisation, and we reflected on the opportunity to use this file for connective purposes as well. We started to discuss the next steps, which aim to identify similarities across sentences by providing a numeric representation of the meaning of each sentence. This is something that we will discuss in our next in-person session, alongside the first visualizations from the data. I also had the confirmation from Tonya and Tom that there are no issues in using the Textile Tales data within this investigation, so we will include this dataset in the data sharing agreement with the University of Leicester.
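As an aid to understanding the sentence-similarity step mentioned above, here is a minimal sketch of the general idea: each sentence is turned into a vector of numbers, and two sentences are compared by the cosine of the angle between their vectors. This is not Stef's pipeline. To keep the example self-contained it uses simple word counts as the numeric representation, which only capture shared vocabulary; a real pipeline would more likely use a learned sentence embedding model to capture meaning. The column names (start, end, text) and the sample sentences are invented, and the ‘token’ column is left out because its content is not described here.

```python
import csv
import io
import math
import re
from collections import Counter

# Invented sample in the assumed layout: one sentence per row with timings.
SAMPLE_CSV = """start,end,text
0.0,4.2,I started at the mill when I was fourteen
4.2,9.0,My mother worked at the same mill
9.0,12.5,We went to the seaside every summer
"""


def load_sentences(csv_text):
    """Read the sentence column from CSV text (column name is an assumption)."""
    return [row["text"] for row in csv.DictReader(io.StringIO(csv_text))]


def vectorise(sentence):
    """Stand-in for an embedding: count how often each word occurs."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar_pair(sentences):
    """Return (score, i, j) for the two most similar sentences (needs at least two)."""
    vectors = [vectorise(s) for s in sentences]
    return max(
        (cosine(vectors[i], vectors[j]), i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    )


sentences = load_sentences(SAMPLE_CSV)
score, i, j = most_similar_pair(sentences)
print(f"{score:.2f}: {sentences[i]!r} ~ {sentences[j]!r}")
```

In this toy sample the two sentences about the mill come out as the closest pair because they share the most words. Swapping `vectorise` for a proper embedding model would leave the comparison step unchanged, which is the main point of the sketch.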
Stef and I also had another productive meeting with Anna Maria focused on the ethical dimension of applying machine learning to oral history datasets. We discussed both the ethics of Congruence Engine acquiring the dataset Stef is working with and, more widely, the use of this data. In both cases, we agreed that the consent form originally developed for the OH project and signed by the interviewees is a key starting point for us. If it states that the participant agreed for the recording to be preserved as a permanent public resource that can be used for research purposes, we wouldn’t need to put too much effort into data anonymisation.
On 7 March I met Lizzie, Lauren, John Ashton and Tim Smith for the Oral History investigation and we discussed the opportunity to apply the same pipeline to the Bradford Heritage Recording Unit. They are extremely interested in what we are doing, as they are currently working on a new field in the iBase system which will include full transcripts (or selections of them). We discussed potential ways to transfer the entire BHRU dataset to us for the purpose of this investigation, and they think the best way to do it is to use an external drive, as they have 2 TB of material. I have one device, given to me at the beginning of the project, which is 1.81 TB (I need to check with John whether it would be enough), and I will try to arrange a Bradford visit sometime in April/May to collect the dataset in person. This could also become a way to share the visualizations of the first cycle of data processing (which will use the Mines of Memory dataset) and start discussing potential applications of these techniques for the BHRU. They were excited about this opportunity, and they would be keen to involve in this meeting the two volunteers who are helping them with the oral history. Lizzie suggested that this could be an opportunity to share insights from other investigations as well and involve the wider collection team. This would be, in particular, a great way for Alex to talk about his investigations on the lost mills and the interoperability between cataloguing data.
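On the drive capacity question above, a small piece of arithmetic may help when checking with John. Drives are usually sold in decimal terabytes (1 TB = 10^12 bytes), while some operating systems report capacity in binary units (2^40 bytes, strictly a tebibyte) but still label it "TB". The sketch below converts between the two. It is only an assumption that the 1.81 TB figure for my device is a binary reading, and it is not known which unit the 2 TB estimate of the BHRU material uses.

```python
def decimal_tb_to_binary(terabytes):
    """Convert decimal terabytes (10**12 bytes) to binary units (2**40 bytes)."""
    return terabytes * 10**12 / 2**40


def binary_to_decimal_tb(binary_units):
    """Convert binary units (2**40 bytes) back to decimal terabytes."""
    return binary_units * 2**40 / 10**12


# A drive sold as 2 TB, as it would appear when reported in binary units.
print(round(decimal_tb_to_binary(2), 3))

# A reading of 1.81 in binary units, expressed in decimal terabytes.
print(round(binary_to_decimal_tb(1.81), 3))
```

If the device is a nominal 2 TB drive, a reading of about 1.81 to 1.82 would be expected, and the file system itself takes some of that space. So even in the best case there would be very little headroom for exactly 2 TB of material, and a larger drive might be the safer option.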