An intensive day with some progress, 28 April 2023

May I start with a reflection at the start, written at the end, about ‘human in the loop’. When I started writing this report yesterday, I knew much less about what I’m reporting on than I do now; as ever, writing is a great heuristic. This seems to me to be another aspect of the role of the social in the social machine. As machine learning is actually statistical text analysis, a matter of probabilities and not a matter of understanding, it is up to the human in the loop to advance the understanding when the programs have done the work of probability.

And I’d also like to offer an observation on the style of the little bit of national-collection-creation that Felix and I were engaged in here. I may be wrangling Excel spreadsheets, but I have not (yet?) tried to run Surprising Phrase Detection (or other such programs) myself. So, what we modelled here was a dialogic exploration of possibilities in which our different human abilities were in combined alternation with the ‘machine’. I have been arguing for developing people with more hybrid skills. Is what we did here an alternative or a first step, namely the development of a shared understanding - from different points of view - of what needs to be done?

Anyway, to the account:

Alex, Felix and I agreed that Felix and I would spend most of April 28th developing ‘Deeper Catalogue Data’; Felix and I both went to the Museum and worked across the day from 10:30. The work fell mainly into two kinds: data alignment and term-extraction.

First: I wanted to address the need to be able to align the text from the 1921 Textile Machinery catalogue with the current collection via unique identifiers. Since 1913 the Science Museum has used inventory numbers (with the form year-serial, e.g. 1995-757) as the persistent identifiers for our objects. But objects acquired before that system was introduced were listed under serial numbers for the small number of different ‘divisions’ of the Museum; in the case of Textile Machinery, the division was ‘Machinery’, with the prefix ‘M’. From 1913, Museum staff set about retrospectively allocating inventory numbers appropriate to the year of actual acquisition to the objects listed under these ‘M’ numbers, but it took years to complete the project. This is why the Textile Machinery catalogue, published in 1921, mainly lists the objects under the M numbers. It follows that if we want to be able to link the old catalogue entries with the objects still in the collection and available via Collections Online, we need to align the M numbers with the inventory numbers they were subsequently allocated. We got most of the way with this on the 28th. I had been aware that the Museum’s Mimsy database has the facility to record ‘otherids’, alternative numbers for objects, in the object records, and so I had asked Tom Smith, SMG Group Documentation Manager, to send me a download of all records with M numbers. Felix was able to wrangle data from that spreadsheet into the sheet deriving from the 1921 catalogue. The result is that we now have inventory numbers for around two thirds of the objects in the catalogue. How to find the rest? For most collections it would be necessary to go to the Museum’s inventory ledgers and manually transcribe the relevant entries year by year. This would be of great benefit to the Museum but would be a sledgehammer to crack a medium-sized walnut in this case.
Well, we’re in luck: Emeritus Curator John Liffen, who has been unofficial guardian of old collections publications, happened to know the location of the curator’s own working copy of the 1921 catalogue, in which they have written in the inventory numbers. It will be a spare-time task for me, over the next week or two, to transcribe the missing 100-odd numbers. We will then have a consistent means, via what are essentially Persistent Identifiers (PIDs), to link the old data to the current collection. This will enable future linking, but also support me in some statistical and interpretive work on the relationship of the current collection to its state 100 years ago, deepening the foundations of the work that we intend Congruence Engine to do in a world of linked collections.
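The alignment step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a record of the spreadsheet work Felix actually did: the field names (`catalogue_no`, `m_number`, `inventory_number`) and all the sample values are hypothetical, standing in for the catalogue-derived sheet and the Mimsy ‘otherids’ download.

```python
def normalise_m_number(raw):
    """Reduce variants such as 'M. 12', 'm12' or ' M 12 ' to 'M12'."""
    return "".join(ch for ch in raw.upper() if ch.isalnum())


def align(catalogue_rows, mimsy_rows):
    """Attach an inventory number to each catalogue row where one is known.

    Returns (matched, unmatched); the unmatched rows are the ones that
    would still need transcribing by hand from another source.
    """
    lookup = {
        normalise_m_number(row["m_number"]): row["inventory_number"]
        for row in mimsy_rows
    }
    matched, unmatched = [], []
    for row in catalogue_rows:
        inventory_number = lookup.get(normalise_m_number(row["m_number"]))
        if inventory_number is None:
            unmatched.append(row)
        else:
            matched.append({**row, "inventory_number": inventory_number})
    return matched, unmatched


# Invented sample data, for illustration only.
catalogue = [
    {"catalogue_no": "1", "m_number": "M. 12"},
    {"catalogue_no": "2", "m_number": "M. 340"},
    {"catalogue_no": "3", "m_number": "M. 77"},
]
mimsy = [
    {"m_number": "M12", "inventory_number": "1868-12"},
    {"m_number": "m 77", "inventory_number": "1880-5"},
]

matched, unmatched = align(catalogue, mimsy)
print(len(matched), "matched;", len(unmatched), "still to transcribe")
```

Normalising the M numbers before matching is a design choice made here on the assumption that the two sources may not write them identically; if they do, the lookup works unchanged.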

The second kind of work that we experimented with yesterday was comparing a series of different ways of extracting terms from just one section of the Textiles catalogue: that for weaving machinery. I wanted to work at a manageable level of data, and 10 or so pages of print provides that. I asked Felix to use this section as the foreground text against three background corpora, to see if any was more productive of terms distinctive to textile machinery. I am interested to see if such automated techniques may be able to find terms akin to object names (such as ‘loom’ or ‘spinning wheel’). The background texts we tried were: the mining catalogue, the remainder of the textiles catalogue, and a combined corpus of modern-day sources that Felix has used in previous work. He added a fourth: the sum of all those background texts. I formatted these one-column lists of terms in a single spreadsheet, then printed and eyeballed them. I show here just the top of the first page:

[Figure, untitled: top of the first page of the spreadsheet of term lists]
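The general idea of scoring a foreground text against a background corpus can be sketched as below. This is emphatically not the Surprising Phrase Detection program that Felix ran, whose internals are not described here; it is a swapped-in, much simpler technique (a smoothed log-ratio of relative word frequencies) that shows why a different background text can reorder the resulting list. The function names and the two sample sentences are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lower-case the text and keep runs of letters as single-word terms."""
    return re.findall(r"[a-z]+", text.lower())


def distinctive_terms(foreground, background, top_n=10):
    """Rank foreground terms by how much more frequent they are than in the background.

    Uses add-one smoothing so that terms absent from the background
    do not cause a division by zero.
    """
    fg = Counter(tokenise(foreground))
    bg = Counter(tokenise(background))
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab_size = len(set(fg) | set(bg))

    scores = {}
    for term, count in fg.items():
        p_fg = (count + 1) / (fg_total + vocab_size)
        p_bg = (bg[term] + 1) / (bg_total + vocab_size)
        scores[term] = math.log(p_fg / p_bg)

    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Invented miniature 'corpora', for illustration only.
weaving = "the loom weaves the cloth and the shuttle crosses the loom"
mining = "the pump drains the mine and the engine drives the pump"

print(distinctive_terms(weaving, mining, top_n=3))
```

Words common to both texts, such as ‘the’, score near zero and drop down the list, which is the behaviour one would hope a background corpus provides.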

At the level of unsystematic but focussed reading, it seems like all background texts produced similar term lists, though in different sequence. As before, the semantic feature - that nouns and adjectives predominate [which may be a feature of how SPD operates ‘under the bonnet’] - was striking. In the next step I asked Felix to take the mining machinery background and to tabulate the results within their semantic ecology: the phrases within which they occur (which he simply delimited by punctuation). The aim here was to see if we could get to a ‘scratch ontology’ of people doing things to things. We had a discussion about what tools might further filter the ‘surprising phrases’ to nouns, in the hope that the results would include a greater proportion of object name-like terms. Felix offered to use a parts-of-speech technique (via spaCy, I believe). You can see the results here. (I added a column, D, in which I took out the definite and indefinite articles using Excel’s search-and-replace function; in this sample, this is only visible in the last row, in the difference between D and E, though it makes more difference elsewhere in the sheet.) A is the printed catalogue sequential number, B the inventory number, C the words to the left of the phrase, D and E are both the ‘surprising phrase’ (D with the articles removed), F the words to the right, and G the full semantic cell. I am showing the page with hand loom:

[Figure, untitled: the page of the ‘semantic ecology’ spreadsheet that includes ‘hand loom’]
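The layout of columns C to G can be illustrated with a short sketch. It assumes, as described above, that a ‘semantic cell’ is simply the stretch of text between punctuation marks, and it mimics the search-and-replace removal of articles with a regular expression. It is not Felix’s code, and the function names are hypothetical; the sample sentence is the hand loom entry quoted in this post.

```python
import re

# Matches a leading or embedded article followed by whitespace.
ARTICLES = re.compile(r"\b(?:the|a|an)\s+", flags=re.IGNORECASE)


def semantic_cells(text):
    """Split text into 'cells' at punctuation marks, dropping empty pieces."""
    return [cell.strip() for cell in re.split(r"[.,;:()]", text) if cell.strip()]


def phrase_in_context(text, phrase):
    """For each cell containing the phrase, return left context, phrase, right context, cell."""
    rows = []
    for cell in semantic_cells(text):
        start = cell.lower().find(phrase.lower())
        if start == -1:
            continue
        end = start + len(phrase)
        rows.append({
            "left": cell[:start].strip(),   # column C
            "phrase": cell[start:end],      # columns D and E
            "right": cell[end:].strip(),    # column F
            "cell": cell,                   # column G
        })
    return rows


def strip_articles(phrase):
    """Remove definite and indefinite articles, as done by hand for column D."""
    return ARTICLES.sub("", phrase).strip()


entry = ("the model represents a hand loom for weaving velvet, in which "
         "material the pile is produced by a great number of short silk warp threads")

for row in phrase_in_context(entry, "hand loom"):
    print(row["left"], "|", row["phrase"], "|", row["right"])
print(strip_articles("the loom"))
```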

How useful is this? Let us take an example from the second row here: “the model represents a <hand loom> for weaving velvet(, in which material the pile is produced by a great number of short silk warp threads, the ends of which stand up closely enough together to conceal completely the structure of the ground)”. This example is specific to the museum object, and it associates the loom with velvet, its product. But the found phrase - ‘hand loom’ - is the object of the sentence, not its subject, ‘the model’. This leads me to more semantic questions:

Might it be more successful, in looking for candidate scratch ontologies, to select surprising phrases only when they are the subject of a sentence or phrase? I imagine that parts of speech tools could do that. [Here I may note that, if you look at the earlier sheet demonstrating the SPs against different background texts, none of them found this instance of ‘the loom’; the only instance of the term ‘loom’ there is the phrase ‘model of power loom.’ (This seems to be missing from the ‘semantic ecology’ spreadsheet, for reasons that Felix could explain, I’m sure.)]
Is it possible that the semantic cells that such techniques create could in any case make a useful scratch ontology, only not of the type we’d been looking for (more literally of the kind ‘the weaver supervises a loom to make velvet’)? In that case, the ontology created would be a representation of the way that the 1921 curator thought, not a more standardised and abstract set of ‘universal’ associations. But if we wanted the latter, what kind of text would supply it? We will have to look to other sources.

Evidently there is much more to experiment with here at the level of parts of speech.

I have been saying that in this investigation, I am interested in two outcomes: the creation of linkable terms; and ways of analysing, representing and visualising the connective webs of meaning in which past curators conceptualised their collections, linking the individual objects relationally. Let me finish for the moment with reflections on where I have got to with these aims.

First, staring me in the face is that there is a category of text in the entries that may be a much more direct route to linkable terms, and that is the entry heading. Here is a sample page. The curator here has created, effectively, an object name, the answer to the ‘what is it?’ question. Well, here, each object is a model; it’s a model of power loom; it is at a certain scale and it came into the collection by presentation or by loan from particular sources. From this source alone, and without further analysis (easy for a human), it is unclear whether the date refers to the manufacture of the item or to its arrival in the collection.

[Figure, untitled: sample catalogue page showing the entry headings]

I will do some more work on these object names as potential linkable terms. Evidently it is important in different contexts that an object is both a model* AND a power loom, and any work of this kind must be attentive to the polysemic nature of objects - that they don’t belong in only one taxonomy. Part of this will be to align this catalogue-derived data with the terms already on Collections Online. Here may be a first benefit of being able to upload the old catalogue data to Mimsy.

The scale data belongs with the model part, not the power loom part and would need to be reunited with it.
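One way to hold on to both readings of such a heading might look like the sketch below. It is only a thought-experiment in code: the `ObjectName` structure, the function name, and the scale value ‘1:4’ are all invented for illustration, and real headings in the catalogue will need a much more careful look before any pattern like ‘model of …’ can be relied on.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ObjectName:
    terms: Tuple[str, ...]   # every term the object should be findable under
    scale: Optional[str]     # belongs to the 'model' term, not to what is modelled


def parse_heading_name(name, scale=None):
    """Split a heading such as 'Model of power loom' into its two object terms.

    Assumes (hypothetically) that models are headed 'Model of <thing>'.
    Anything else is kept as a single term, and any scale is ignored,
    since scale only makes sense for a model.
    """
    text = name.strip().lower()
    prefix = "model of "
    if text.startswith(prefix):
        return ObjectName(terms=("model", text[len(prefix):]), scale=scale)
    return ObjectName(terms=(text,), scale=None)


# The scale value here is invented for illustration.
parsed = parse_heading_name("Model of power loom", scale="1:4")
print(parsed.terms, parsed.scale)
```

Keeping the terms as a tuple rather than a single string is the point of the sketch: the same object can then be linked both to other models and to other power looms.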

So now my search for linkable terms has bifurcated to two routes: analysis of headings and other sources for object names, and further text wrangling that holds the open-ended promise of linkable data which is part way to my other aim:

Analysing, representing and visualising the curator’s web of associative meaning. Here I need a conversation, perhaps in the drop-in session with @kunika, about options for visualising text strings. With such possibilities in place, I will be able to try, recursively, different ways of analysing the texts.

For both kinds of analysis, it will fairly soon be time to include data about other textiles collections. At this point, it is worth remembering that SMG’s Manchester Science and Industry Museum has a parallel collection named ‘Textile Industry’, and that our partners include the Bradford Industrial Museum. As the investigation gains solidity, it gains focus and possibilities…