ORAL HISTORY INVESTIGATION
An exciting week, as Stef was finally able to start working with the Mines of Memory data! Once ethics approval had been obtained from the University of Leicester, one of the first steps of the pipeline was data pre-processing, which includes data cleaning for anonymization purposes. Stef developed notes and code for each of the following steps:
- Merging the separate audio files from each interview into a single audio file. This is an important step, as Oral History datasets usually include multiple audio files relating to the same interview, and it is very helpful to have a method to automate the merging process (see the first sketch after this list).
- Cutting the introduction of each interview, as it includes personal information (name and date of birth) that needs to be anonymized.
- Generating the transcripts using OpenAI's Whisper. Note on file format: Whisper saves the transcript as a CSV containing the transcript segments.
- Stef now needs to identify names in the transcripts and delete them, to make sure the transcripts do not include any personal or sensitive information about the interviews or interviewees (see the second sketch after this list).
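To give a flavour of what these pre-processing steps might look like in Python, here is a minimal sketch using the pydub and openai-whisper packages. The file names and the intro length are hypothetical placeholders, not Stef's actual scripts, which are the ones documented in the repository.

```python
# Minimal sketch (not Stef's actual scripts): merge the per-interview audio
# files, trim the personal introduction, and transcribe with Whisper.
# Assumes the pydub and openai-whisper packages; file names and the
# 90-second intro length are hypothetical placeholders.
import csv
import whisper
from pydub import AudioSegment

parts = ["interview01_part1.mp3", "interview01_part2.mp3"]  # hypothetical inputs

# 1. Merge the separate audio files into a single recording
merged = AudioSegment.empty()
for path in parts:
    merged += AudioSegment.from_file(path)

# 2. Cut the introduction containing the name and date of birth
INTRO_MS = 90 * 1000  # assumed length of the intro, in milliseconds
trimmed = merged[INTRO_MS:]
trimmed.export("interview01_merged.mp3", format="mp3", bitrate="192k")

# 3. Transcribe and save the segments as a CSV
model = whisper.load_model("medium")
result = model.transcribe("interview01_merged.mp3")
with open("interview01_transcript.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["start", "end", "text"])
    for seg in result["segments"]:
        writer.writerow([seg["start"], seg["end"], seg["text"].strip()])
```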
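For the name-deletion step, one possible approach, sketched below rather than the method Stef has settled on, is to run a named entity recognizer (here spaCy's en_core_web_sm model) over each transcript segment and replace PERSON entities with a placeholder.

```python
# Minimal sketch (hypothetical, not the agreed anonymization method):
# use spaCy's named entity recognizer to find PERSON entities in each
# transcript segment and replace them with a placeholder.
# Assumes spaCy with the en_core_web_sm model installed.
import csv
import spacy

nlp = spacy.load("en_core_web_sm")

def redact_names(text: str) -> str:
    """Replace any PERSON entity with [NAME]."""
    doc = nlp(text)
    redacted = text
    for ent in reversed(doc.ents):  # work backwards so character offsets stay valid
        if ent.label_ == "PERSON":
            redacted = redacted[:ent.start_char] + "[NAME]" + redacted[ent.end_char:]
    return redacted

with open("interview01_transcript.csv", newline="", encoding="utf-8") as src, \
     open("interview01_transcript_anon.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["text"] = redact_names(row["text"])
        writer.writerow(row)
```

Automated NER will not catch every name, so a manual check of the redacted transcripts would still be needed before sharing.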
Technical details on this process, including the scripts generated for each step, have been provided by Stef in the private GitHub repository shared with Alex and me. The repository currently contains information about the data pre-processing phase but will be updated with the further steps. We will make the repository public as we start to present our work.
There is an interesting reflection on GitHub about the time required to process the files: the longest audio file in the collection was about 2 hours and 9 minutes long, totalling about 176 MB at a 192 kbps bitrate, and it took 20 minutes to transcribe.
Alongside the pipeline progress, I also had a series of key meetings focused on the datasets we are going to use in the investigation.
MINES OF MEMORY: This is the dataset Stef is currently working on. I met Colin Hyde with Stef and Daniel B. on Thursday. Colin provided us with some useful information about the Mines of Memory dataset. The project was designed with a museum interpretation function in mind, to enrich the information related to the mining machines. He also offered more information about the closure of the Snibston Museum, which was a very rich coal mining research centre & heritage site before being demolished. We discovered that he was involved in the project as a freelance interviewer, so he knows the context and aspects of the interviews very well. He also mentioned that his main contact at the time was Alison Clague, and he shared with us the question structure he used for the interviews.
LIVING LINEN: I met Victoria, Alison, Karen and Donal, with Arran and Tim, on Friday to update them on the Oral History Investigation and explore the opportunity to use the 30 digitized recordings from the Living Linen project. The meeting was extremely productive, and they were interested in being involved. They agreed to send the audio files (alongside the individual transcripts that have been manually generated over the past years in response to specific requests) using a secure sharing system. We discussed the anonymization process in place for the Mines of Memory dataset and agreed that the same process will be applied to Living Linen.
Next week I have two more meetings scheduled to discuss the inclusion of two further datasets: one with the Bradford Industrial Museum team (Lizzy Llabres, John Ashton and Tim Smith) for the Bradford Heritage Recording Unit, and the second with Textile Tales (Tonya Outtram and Tom Fisher). We also finally scheduled the in-person session in Leicester where we will discuss the first visualizations: it will be on Friday 24 March.
FOLK SONGS INVESTIGATION
I have been working with Daniel on two different aspects:
- Finalizing Jennifer’s involvement in the investigation. I have a meeting with Carol next Monday to discuss the letter agreement, which will be based on the schedule of tasks and meetings we drafted this week. This schedule is based on the eight-week temporal framework previously used for the first round of the investigation during the textile pilot. The difference here is that the focus is on the digital pipeline and the related tools, tasks and expertise we would need for each step. Each ‘task’ will require the use of one (or more) digital techniques and a certain level of expertise, and we need to understand what our role will be, as well as Jennifer’s role and the role of Felix & Kunika in supporting us. We drafted our schedule as an Excel spreadsheet covering eight weeks starting from 13 March, including a fortnightly two-hour session with Jennifer and a final in-person day in the week of 22 May, which could potentially be used for Paul’s filming. I think this is an interesting example not only of applying the ‘human in the loop’ approach, but also of reflecting on the practical implications of the ‘National Collection as a verb’ and the wider Social Machine concept.
- Reviewing the machine-generated transcript of Folk Songs and Ballads of Lancashire (Harry & Leslie Boardman) provided by Felix, produced with ABBYY FineReader, in order to understand what the common mistakes are, which ones need to be corrected before the transcript can be used in the textual analysis tool, and what techniques we could employ for each category of mistake. Daniel and I checked different sections of the document (7 ballads in total) and created a spreadsheet describing seven main categories of mistakes and the potential methods to address them. We will send this document to Kunika and Felix to decide how to proceed. The biggest issue is the text within the musical score section of the page, which is often miscaptured; this will probably need to be corrected manually. ABBYY FineReader seems to capture dialectal terms and colloquialisms in the same way as standard words, and the mistakes in word spelling seem to derive from miscapturing the shapes of the letters (something that could potentially be corrected using higher-quality images). There are other punctuation mistakes that can be ignored or corrected using the simple ‘Replace’ function in Word (a scripted equivalent is sketched below). We also discussed to what extent 100% accuracy is needed to proceed to the next steps of the pipeline. Is the OCR output good enough to allow us to process the transcripts through the textual analysis tool? A similar question arose in the Oral History investigation, as the speech-to-text output seems good enough to process the transcripts through the large language model.
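For that last category of mistakes, the ‘Replace’ correction could also be scripted. The sketch below is a hypothetical Python equivalent; the substitution rules are illustrative examples only, not the actual error list from our spreadsheet.

```python
# Minimal sketch (hypothetical, not the method we settled on): the
# programmatic equivalent of Word's 'Replace' for the category of
# punctuation mistakes that can be fixed mechanically. The rules
# below are illustrative, not our actual error list.
import re

PUNCTUATION_FIXES = [
    (r"\s+([,.;:!?])", r"\1"),   # remove stray space before punctuation
    (r"''", '"'),                # two single quotes misread as a double quote
    (r"\s{2,}", " "),            # collapse runs of whitespace
]

def clean_ocr_text(text: str) -> str:
    for pattern, replacement in PUNCTUATION_FIXES:
        text = re.sub(pattern, replacement, text)
    return text

with open("folk_songs_ocr.txt", encoding="utf-8") as f:
    cleaned = clean_ocr_text(f.read())
with open("folk_songs_ocr_cleaned.txt", "w", encoding="utf-8") as f:
    f.write(cleaned)
```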