17/11/2023

TAGLab

I have finished a complete mapping of the BT dataset against the SMG ‘signalling and telecommunications’ category. The two did not map perfectly, as there were many train-related objects in the grouping, but there were still some areas of cross-over, including engineering work and telegraphy. I think if we expand to more BT-related categories within SMG we might find further linkages, so ‘cherry-picking’ categories after mapping one collection could be a way forward for connecting disparate collections.

I briefly mentioned to the BT Archive team that I was surprised so few telegraph categories came up in the mapping, and they were also confused. They think the split between telephony and telegraphy should be about 50:50, so this is something to look into. I wonder whether telegraphy is simply less documented than telephony, perhaps appearing only in titles where telephony records have longer descriptions. Otherwise, the topic modelling is failing to pick telegraphy up for some other reason.
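One quick way to test the "titles only" idea would be to count how often each topic appears in titles versus descriptions. The sketch below is only an illustration: the field names `title` and `description`, the keyword patterns, and the function name `count_mentions` are my assumptions, not the actual export schema or anything already in the pipeline.

```python
import re
from collections import Counter

# Hypothetical keyword patterns; a real check would need a fuller vocabulary.
TERMS = {
    "telegraphy": re.compile(r"telegra(ph|m)", re.IGNORECASE),
    "telephony": re.compile(r"telephon", re.IGNORECASE),
}


def count_mentions(records):
    """Count records mentioning each topic, separately for title and description.

    `records` is an iterable of dicts with (assumed) 'title' and
    'description' keys. Missing or empty fields are treated as empty text.
    """
    counts = Counter()
    for record in records:
        for field in ("title", "description"):
            text = record.get(field) or ""
            for topic, pattern in TERMS.items():
                if pattern.search(text):
                    # Each record counts at most once per topic and field.
                    counts[(topic, field)] += 1
    return counts
```

If telegraphy shows up in titles about as often as telephony but far less often in descriptions, that would support the idea that the topic model simply has less telegraphy text to work with.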

GPO Circulars

After a really useful discussion with Asa and Daniel at the start of the week (see the recording here), it was clear to me that the next step for the circulars work is to work out how to split up the circulars into chunks a machine can manage.

I am trying to get the latest circular, from 1979, to a stage where GPT can extract information from it before I even consider the next 100 years. For a bit of context on the issue, a sample page is below:

[Image: sample page from the 1979 circular]

The pages are divided into chunks of information that humans are able to extract, but the machine hates them because the layout is so disorganised (the mixture of columns and tables is especially awkward).

Just to check where I stood, I trialled Tesseract on the 1979 circular after converting the PDF into separate image files. The programme really struggled to understand what it was reading, as the two chunks below demonstrate:

[Image: two chunks of Tesseract output]

It also ignored any information in columns further along the page, only extracting from the left-hand side.

I am able to run the circular through ABBYY and get what seems to be a fair OCR read (I can manually correct the odd phrase, but I think generally it has a good grasp of the words). However, exporting from ABBYY is frustrating. The programme can recognise chunks of text, and easily allows you to group things together, but will not allow you to cut the document up in any way that follows the groupings. You can split vertically and horizontally by hand, but I imagine this would take around 8 hours for one 515-page circular, which would be 100 working days for the entire batch. So, not a viable option!

[Image]

I first played around a little with an ‘Editable Copy’ export in Word, which spits everything out in textboxes that can be moved around manually but not by macro (at least, according to my and GPT’s coding efforts). I have now exported a ‘Plain Text’ copy into Word, which seems promising. The plain text orders things fairly consistently, starting from the top left and ending at the bottom right. I have dumped a few bits from the Word document into ChatGPT and it always seems to understand what it is fed, so next week I am going to see if this format might feed into the existing Barker pipeline.
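As a rough idea of what "chunks a machine can manage" could look like for the plain text export, here is a minimal sketch that groups paragraphs into chunks under a character limit while preserving their order. It assumes the export separates blocks with blank lines; the function name `split_into_chunks` and the `max_chars` limit are hypothetical and are not part of the Barker pipeline.

```python
import re


def split_into_chunks(text, max_chars=2000):
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    Paragraph order is preserved, so the reading order of the export is
    kept. A single paragraph longer than max_chars is not cut; it becomes
    its own oversized chunk.
    """
    # Split on one or more blank lines and drop empty fragments.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks = []
    current = ""
    for para in paragraphs:
        candidate = para if not current else current + "\n\n" + para
        if not current or len(candidate) <= max_chars:
            # Start a new chunk, or extend the current one if it still fits.
            current = candidate
        else:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to the model in turn. Splitting only at paragraph boundaries avoids breaking an entry in the middle, although it will not fix any cases where the export itself has interleaved two columns.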

Gender

Anna-Maria and I have a persons export ready to reconcile next week, and have had a few productive conversations, including one with BT’s Elspeth about getting the team involved in the evaluation process for the model. There had been some initial nervousness about an investigation which aims to interrogate the way cataloguing has been done. However, we have stressed that the aim is to test a new tool and evaluate its effectiveness, so speaking with the archive team and leaning on their expertise means that we can see where contextual nuance exists in the catalogue and might be missed by a tool. I have also reached out to a former BT Archivist to ask if they have any interest in getting involved, as they have already provided some of their own research into women in the collection for us to use.

Personal researcher notes

I’m currently waiting to hear back from the Telecommunications Heritage Group, but in the meantime I have been pulling together a list of sample files for this investigation. I am going to turn the digitised files into searchable documents, just to see if anything happens to align nicely with the Bradford convergence. One of the undocumented files I intend to use for this test is specifically about West Riding, which is a very happy coincidence.

Computer Vision

Geoff is following up with the NMS team about getting their images, and Jamie Unwin has kindly provided me with a CSV to pull through SMG’s images, so we are on track to have both sets of images to play with before December.