I have a notice going out to all members of the Telecommunications Heritage Group next week to recruit participants for the ‘transforming personal researcher notes’ study, so fingers crossed I will have some real notes to experiment with very soon! I’ve also paused the General Post Office database work until I can get some help from more technically minded people, after a day of arguing with GPT-4 on Monday.
In the meantime, I have some results from work on the other two investigations:
Gender
I now have a fully merged BT Archives catalogue and authority record dataset, which means that we can see all people, companies, and Post Office departments linked with catalogue entries. I have run the linked dataset through genderize.io to get an idea of how it views the number of male or female authority records linked to catalogue entries.
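For anyone curious about the mechanics, genderize.io is a simple HTTP API that takes forenames and returns a gender prediction with a probability. Below is a minimal sketch of the kind of lookup involved; the names are illustrative, and the real run worked through the forenames in the merged dataset.

```python
import requests

def predict_genders(names):
    """Look up forenames on genderize.io, ten at a time (the API's batch limit)."""
    results = {}
    for i in range(0, len(names), 10):
        batch = names[i:i + 10]
        resp = requests.get(
            "https://api.genderize.io",
            params=[("name[]", n) for n in batch],
        )
        resp.raise_for_status()
        for entry in resp.json():
            # Each entry carries: name, gender (or None), probability, count
            results[entry["name"]] = entry["gender"]
    return results

print(predict_genders(["Sydney", "Mary", "J"]))
```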
The initial run revealed that there are 16,774 instances of a ‘person’ record being connected to a catalogue entry (this includes when there are multiple names connected to one record). Genderize predicted that of these records, 15,993 are male, 434 are female, and 347 are unidentifiable.
While these figures seem stark, they also reveal an issue with the genderize programme itself. Many of the 15,993 names identified as male are single initials, where the actual first name of the connected person is unknown. In other words, genderize assumes any individual with just a first initial is male, which to me suggests it has been trained on more male names than female names, or has been led in some other way to default to male when unsure.
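One way to guard against this would be to treat bare initials as unidentifiable before they ever reach the service. A minimal sketch (the pattern and names here are illustrative):

```python
import re

# Treat single-initial forenames (e.g. "J" or "J.") as unknown rather
# than sending them to genderize, which otherwise tends to call them male.
INITIAL = re.compile(r"^[A-Z]\.?$")

forenames = ["Sydney", "J.", "Mary", "W"]
to_query = [n for n in forenames if not INITIAL.match(n)]
unidentifiable = [n for n in forenames if INITIAL.match(n)]

print(to_query)         # ['Sydney', 'Mary']
print(unidentifiable)   # ['J.', 'W']
```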
I am still interested in investigating how successful genderize is at identifying genders for known figures in the BT dataset, such as Sydney Buxton, who was Postmaster General from 1905 to 1910 and has what is today largely accepted as a gender-neutral name. Once we have done this, and are able to offer some conclusive observations about the use of genderize for this work, I think we are ready to move on to some new processes.
Anna-Maria and I will be planning next steps with Havens’ NLP next week. I think the merged dataset we have of catalogue entries and connected authority records may be enough for this. The current dataset does not include the ‘activity’ field for authorities, which summarises a person’s biography. While I had initially assumed we would include these, I’m now inclined to leave them out: we already know that so many of the biographies will be male ones, and understanding the gendered positionality of these biographies seems unnecessary.
Another avenue to explore may be an existing tool that currently has a very different purpose. While at the final training session of my induction today, the ‘inclusive language workshop,’ I was interested to learn about the Total Jobs gender bias decoder: https://www.totaljobs.com/insidejob/gender-bias-decoder/
The decoder is based on research into the unconscious bias people have when reading certain words. It focuses on words featured in job adverts, and is used by SMG for all job postings to avoid “gendered language putting people off applying to job adverts” (quote from the decoder website above). I wonder if similar terms come up in archival descriptions, and if these terms unconsciously suggest to readers that the file creators, or the work involved in the file, are inherently female or male.
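Under the hood, a decoder like this is mostly word-list matching: scan the text for masculine-coded and feminine-coded word stems and tally them up. The sketch below uses only a handful of stems from the published lists, purely for illustration, and a made-up archival description:

```python
import re

# A small, illustrative subset of the gender-coded word stems the
# decoder research works from; a real run would use the full lists.
MASCULINE_STEMS = ["active", "assert", "compet", "decisive", "independen", "lead", "logic"]
FEMININE_STEMS = ["connect", "considerate", "cooperat", "nurtur", "support", "sympath", "warm"]

def coded_words(text, stems):
    """Return the words in `text` that start with any gender-coded stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if any(w.startswith(s) for s in stems)]

description = "Papers of a leading engineer responsible for supporting the Electrophone service."
print("masculine-coded:", coded_words(description, MASCULINE_STEMS))  # ['leading']
print("feminine-coded:", coded_words(description, FEMININE_STEMS))    # ['supporting']
```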
Once again, let me show some examples using the Electrophone files at BT (other telephones are available):
[Images: gender bias decoder results for the Electrophone file descriptions]
The gender decoder is based on a 2011 study by Gaucher, Friesen and Kay that looked at how gendered language in job adverts sustains gender inequality. Is this sort of work relevant to a study focused on gender in an archive that holds the files of one of Britain’s largest historic employers? It seems plausible.
Computer Vision
I’m looking forward to what I’m sure will be a rich discussion about next steps for the Heritage Weaver pipeline next Friday, and in the meantime I decided to crack on with my own LLaVA work, which is a side interest of this investigation. This work stems from the question “can we use the nonsense?”
I aim eventually to feed the entire SMG and NMS communication dataset into LLaVA, and then select keywords (using KeyBERT) from the descriptions LLaVA creates to see if these keywords form meaningful linkages between objects in the collections. As an initial trial, I fed 100 objects from each collection into the tool and then ran the descriptions LLaVA produced through KeyBERT. These objects were simply the first 100 in each dataset as it appeared to me, so they include a range of computing objects, telephones, telegraphy objects, cables, and radio components.
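For the record, here is a sketch of the trial pipeline. It assumes a local LLaVA model served through Ollama and KeyBERT’s default settings; the prompt, model tag, and file path are illustrative rather than my exact setup.

```python
import ollama
from keybert import KeyBERT

kw_model = KeyBERT()

def describe_and_tag(image_path):
    """Ask LLaVA for a description of the object photo, then extract keywords."""
    response = ollama.chat(
        model="llava",
        messages=[{
            "role": "user",
            "content": "Describe this museum object.",
            "images": [image_path],
        }],
    )
    description = response["message"]["content"]
    keywords = kw_model.extract_keywords(description, top_n=5)
    return description, [kw for kw, _score in keywords]

description, keywords = describe_and_tag("objects/smg_0001.jpg")  # illustrative path
print(keywords)
```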
Initially, I tried mapping the output from above using nomic.atlas. I do not think this is the right tool for the job, because the topic model nomic introduced added unnecessary complexity to the task of illustrating which objects link to which keyword. It does, however, nicely show how this process intermingles the collections. The image below is grouped by keyword, with each point being one instance of an object linking to the relevant keyword. The Science Museum objects are coloured blue, while the National Museum of Scotland ones are orange. I think the clustering of both orange and blue dots shows that LLaVA is doing a good job of noticing common themes in the types of object it has been shown, and isn’t getting too confused by the fact that SMG and NMS objects are photographed differently according to their respective house styles.
[Nomic Atlas map of the trial output, grouped by keyword: SMG objects in blue, NMS objects in orange]
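To poke at the same idea without Nomic’s topic layer, the object-keyword links can be plotted directly. A small hand-rolled sketch with made-up IDs and keywords, coloured by collection as in the map above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# One row per object-keyword link; IDs and keywords are illustrative.
links = pd.DataFrame({
    "object": ["smg_0001", "nms_0001", "smg_0002", "nms_0002"],
    "keyword": ["telephone", "telephone", "telegraph", "telegraph"],
    "collection": ["SMG", "NMS", "SMG", "NMS"],
})

positions = {k: i for i, k in enumerate(links["keyword"].unique())}
colours = links["collection"].map({"SMG": "tab:blue", "NMS": "tab:orange"})

plt.scatter(links["keyword"].map(positions), range(len(links)), c=colours)
plt.xticks(list(positions.values()), list(positions.keys()))
plt.xlabel("keyword")
plt.ylabel("link instance")
plt.show()
```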
I roped in Daniel (fresh from his very successful and interesting GPT workshop) for the next step of this work: creating a diagram that demonstrates the linkages between objects and keywords without adding an extra layer of topic model complexity. In the diagram below, the blue dots are objects, while the yellow and red dots are keywords:
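For anyone wanting to reproduce this kind of bipartite view, a minimal networkx sketch follows; the object IDs and keywords are made up, whereas the real diagram was built from the trial’s 200 objects.

```python
import networkx as nx
import matplotlib.pyplot as plt

# Edges link object IDs (illustrative) to the keywords KeyBERT surfaced.
edges = [
    ("smg_0001", "telephone"),
    ("nms_0001", "telephone"),
    ("smg_0002", "valve"),
    ("nms_0002", "valve"),
    ("nms_0002", "radio"),
]

G = nx.Graph()
G.add_edges_from(edges)

# Colour object nodes blue and keyword nodes yellow, echoing the diagram.
objects = {obj for obj, _kw in edges}
node_colours = ["tab:blue" if n in objects else "gold" for n in G.nodes]

nx.draw(G, with_labels=True, node_color=node_colours, font_size=8)
plt.show()
```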