11-22 March

I’ll be covering two weeks in this post, as last week was a little hectic. This was partly caused by a prior commitment that I had in Newcastle, where I was scheduled to deliver a paper at Northumbria University’s Institute of Humanities Research Seminar. Although I was there to speak about entomology and pesticide films, I also briefly discussed the river pollution work that we have been doing in CE. There I briefly met Leona Skelton, whose Tyne after Tyne: An Environmental History of a River’s Battle for Protection 1529–2015 was partly an inspiration for the investigation into river pollution in Bradford. I was hosted by Matthew Kelly, another environmental historian who has worked on the history of Dartmoor, as well as the wider history of conservation. Interestingly, both have recently been investigating different aspects of the history of infrastructure, primarily energy and transport.

Last Friday, moreover, I had two guests in Cambridge: Jason Langford and Molly Rigling, two history teachers who last year developed a series of school teaching resources based on my research into the short interwar film series Secrets of Nature. On Saturday morning, they delivered a session for local children aged 10-14, focussed on the history of conservation and changing attitudes towards zoo animals since the 1920s. This was part of the Cambridge Festival, a series of public engagement events organised in the city designed to showcase recent research. It was really exciting to meet Jason and Molly for the first time in person after nearly two years of Zoom meetings, and to see them put the teaching activities that we co-designed into action so well!

Bradford events

Naturally, seeing the Cambridge event come together had me thinking ahead to our plans for organising Bradford events in the coming months. In the past couple of weeks, I have found both the group and individual meetings organised by Alex F. and Julia to be incredibly helpful in terms of picturing how we are going to bring these events into being. Today I met with Julia to discuss in more detail our progress on each of the three events that I am involved in. These are:

Screening of ‘Once Upon a Sheep’ scheduled provisionally for the week of May 20th, in collaboration with Bradford Movie Makers. The location is likely to be Bradford Industrial Museum, and there’s been a lot of enthusiasm from BMM with regard to inviting former textile workers to the event.
Environmental Data Workshop with Friends of Bradford’s Becks, Bradford City Council and others interested. This will be a small workshop which aims to use QGIS to ‘fill in the gaps’ using historical OS maps, focussing particularly on mill ponds and mill streams. One outcome of this workshop may be further work in supporting Rob Hellawell of FoBB in developing a detailed pollution map for Bradford in the coming months.
River pollution event in collaboration with Friends of Bradford’s Becks. This might be a combination of participatory history workshops alongside previously planned events for FoBB’s ‘Festival of the Becks’. I am liaising with FoBB volunteers to work out further details.

Julia and I spent a long time talking about what a ‘successful’ event might look like for the CE, community partners, and other attendees. My feeling is that even events which might, from a CE perspective, not seem ‘productive’ in the sense of creating new datasets or codified knowledge, might nevertheless result in a very rich ‘social machine’ outcome for those taking part. In the case of the river pollution event, we agreed that a bit more detail is needed before we are certain of feasibility and overall shape, and this may depend on ongoing conversations with FoBB volunteers. Overall, I’m excited to see how we progress with these plans, and I’m especially keen to start getting dates and venues into the diary!

CamPop - occupation ontologies, taxonomies etc.

On Monday, Alex B. and I met again with Alexis Litvine and his MPhil student Guillaume Proffit. Two key outcomes emerged from this:

We agreed that it would be a good idea to organise a workshop at the SM in June focussing on occupation ontologies in the textile industry, bringing together members of CamPop and others interested in these issues. Alexis suggested Jane Whittle, who has done work on occupations specifically from the angle of ‘tasks’. I am going to start planning this, in conjunction with the ontologies grouping.
Guillaume has kindly shared with us his data linking individuals across different census years and geolocating them to points on a map. Daniel has been visualising this on QGIS, and I am due to return to Guillaume with a few questions about this. We discussed a potential blog post on the topic, perhaps in conversation with someone in the CE working on similar themes such as @Nayomi.

OCR

In the meeting with CamPop, Alex and I were surprised to hear that Alexis and Guillaume considered ABBYY FineReader to be a poor OCR tool, and that they had found far more satisfactory pipelines for complex structured text, principally with LayoutLM models and Calamari OCR.

The spectre of OCR seems to have loomed large over CE since the very beginning of the project, and it’s been interesting to reopen this discussion in recent weeks with Tasha, Alex B., Nayomi and Felix. Ever since the CV workshop at SAS, I have been occasionally playing around with different OCR tools, including Kraken, Calamari, and Surya, but in doing so I have very often come across technical barriers which have prevented me from really making the most out of them. On the occasions where I have managed to get these working smoothly on a few example images, I have found it hard to scale up to working on larger batches. The persistence of significant errors in both layout and character recognition with most of these tools also makes me think that I would like to try out annotating a sample batch from a periodical in order to fine-tune a model. However, I have found the amount of choice in terms of models, annotation formats and schemas a little baffling, and wouldn’t want to embark on something like this without having a clear pipeline, as well as more clarity on the intended uses. To give one example, I am keen to try out this integration of Tesseract and Label Studio, as I really like what I have seen of the Label Studio interface so far: https://labelstud.io/blog/interactive-ocr-with-tesseract-and-label-studio/. I am also interested in the Layout Parser Annotation Service, but have not been able to run this on my system so far: https://github.com/Layout-Parser/annotation-service.

From conversations with other members of CE, it seems that several of us have faced interrelated challenges in both choosing and deploying the right OCR tools, so I’m glad that on Tasha’s initiative we’re going to spend some time in the digital drop-ins in April to go through some of the issues that we have been facing. It’s interesting that we all want to ‘crack’ OCR for different reasons and with different ends in mind, and this might be something to reflect on further as part of these sessions.

One issue that seems to loom quite large over the OCR question is whether multi-modal LLMs are going to replace the need for some of the OCR tools that I have mentioned above, and quite how fast this kind of transformation is going to come. My sense, however, is that the relative control over the specific fine-tuning dataset, the open-source nature of many of the tools, and the growing ethical and financial barriers to using some of the larger LLMs may still make more traditional ML approaches to OCR worthwhile, especially if we can find ways to help humanities researchers deploy them. For this reason, I think that a systematic survey of different OCR tools, some clearly written OCR pipelines with different uses in mind, and perhaps even a couple of shared fine-tuned examples created as part of the project (using, say, annotated examples from a textiles, energy, and communications publication) could all be fairly achievable outcomes of the project, which may moreover be of use to others beyond the scope of CE.
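As a rough illustration of how a survey like this might compare tools, one common approach is to score each tool's output against a hand-corrected transcription using character error rate (CER): the edit distance between the two texts divided by the length of the correct one. The sketch below is only an assumption about how such a comparison could be set up, not something we have agreed on in CE. The function names, the tool labels (`tool_a`, `tool_b`) and the sample strings are all hypothetical placeholders, and it uses only the Python standard library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / length of the hand-corrected reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def rank_tools(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Return (tool name, CER) pairs sorted from lowest to highest error."""
    scores = [(name, character_error_rate(reference, text))
              for name, text in outputs.items()]
    return sorted(scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical example: one hand-corrected line and two invented OCR outputs.
    reference = "Worsted spinning mill"
    outputs = {
        "tool_a": "Worsted spinning mill",
        "tool_b": "Worstcd spinninq mi1l",
    }
    for name, cer in rank_tools(reference, outputs):
        print(f"{name}: CER = {cer:.3f}")
```

CER only measures character recognition, so a fuller comparison would also need some way of judging layout, which is where many of the errors mentioned above seem to arise.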