Isis CB
This week I set some time aside to try to progress work on the Isis CB. I had a good meeting last week with @Tim where we established the key research question we are trying to pursue in this area of work. Essentially: can we create meaningful links between museum objects and research articles by identifying relevant objects within the published text? And could this be used to suggest relevant research articles when presented with an individual object?
Having established that full-text access to JSTOR articles is not feasible, and having also consulted with @Anna-Maria about the ethics of doing a manual full-text download, I returned to JSTOR's Constellate interface to see what kinds of analysis are possible there. After an exchange with the 1.5 developers working on the site, I was able to get a six-month trial of the full package. You start with a dataset builder, which enables you to create a facet of JSTOR's holdings based on individual fields (in my case I have been inputting DOIs), publication, date, etc.

The analytical interface is geared towards educational institutions, with a long list of available notebooks for users to learn how to apply textual analysis, including NER and topic modelling, to datasets built through Constellate. Once you have built your dataset, you have the option to 'download', 'visualise' (i.e. see some very basic word clouds and tables generated automatically) or 'analyze'. This latter option takes you to a first set of Jupyter notebooks which essentially constitute a basic textual analysis pipeline, but more are available once you dig deeper. These notebooks are all very well conceived and instructive - and in addition to these, Constellate also runs frequent workshops for subscribers to learn more about textual analysis. As a pedagogical tool, then, it's quite an impressive setup - although as a 'non-profit', it would be nice if these were freely available.

The ability to run these forms of analysis is limited by the fact that most of the content on JSTOR is not available as full text via Constellate. Instead, they have made n-grams available, and it is this avenue that I have been investigating as having potential to establish links between research articles and museum objects. For instance, some preliminary tests applying Term Frequency-Inverse Document Frequency (TF-IDF) to one of the sample datasets from the Isis CB yielded some interesting results, helping to narrow down individual articles that could be relevant to key terms relating to textile objects.
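A minimal sketch of the kind of TF-IDF ranking described above, assuming per-article unigram counts of the sort Constellate exports (the article names and counts here are invented for illustration):

```python
import math
from collections import Counter

# Hypothetical unigram counts for three articles, standing in for
# Constellate's per-document n-gram exports.
docs = {
    "article_1": Counter({"loom": 12, "silk": 8, "trade": 3}),
    "article_2": Counter({"trade": 10, "empire": 7, "silk": 1}),
    "article_3": Counter({"telescope": 9, "lens": 5}),
}

def tfidf(term, doc_id, docs):
    """TF-IDF weight of `term` in one document, over the whole corpus."""
    counts = docs[doc_id]
    tf = counts[term] / sum(counts.values())           # term frequency
    df = sum(1 for c in docs.values() if term in c)    # document frequency
    idf = math.log(len(docs) / df) if df else 0.0      # inverse doc. frequency
    return tf * idf

# Rank articles by relevance to a textile-related term:
ranked = sorted(docs, key=lambda d: tfidf("silk", d, docs), reverse=True)
# → article_1 first, article_3 (no mention of "silk") last
```

This is the basic weighting scheme; the Constellate notebooks use library implementations, but the ranking idea is the same.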

However, because n-grams are lists of individual words (or pairs, or triples), the contextual benefit that you might get from NER or other textual methods is largely lost. When I tried using spaCy to analyse the dataset, it grouped some of the words into nonsensical strings and assigned them entity labels:
gulf question east sara develop china (LOC)
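To make the limitation concrete, here is a sketch of what an n-gram export actually contains: tokens counted in fixed-size windows, with no sentence boundaries preserved. The sample sentence is invented for illustration:

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Lowercase, tokenise on whitespace, and count n-grams of size n."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

sample = "the jacquard loom wove silk and the loom survives"
bigrams = ngram_counts(sample, n=2)
# bigrams["jacquard loom"] == 1; word order beyond the window is gone,
# which is why an NER model run over such a list produces spans like
# the nonsensical LOC example above.
```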
Having consulted with @Felix NeSi about next steps, we agreed that drawing on Ben Russell's new spreadsheet, which consolidates collections data from SMG, BIM and NWM, would enable us to create a solid list of terms which we can then compare against the bi-grams and tri-grams obtained through Constellate. It'll be interesting to see what results we can obtain by this method. Something else which might feed into this is the work that @Arran and @Kaspar Beelen initiated on training an NER model based on collections data, using Label Studio. I've briefly discussed this with Arran already, and I'm interested in contributing to this in some way - I've set up Label Studio for various ML tasks which I've abandoned along the way, partly for the lack of a clearly defined end goal, so I'd be interested in revisiting this tool to follow the labelling and training stages through to the end.
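The term-matching step described above could be as simple as a set intersection between the consolidated collection terms and each article's bi-grams. A sketch, with invented terms and article IDs standing in for the real spreadsheet and Constellate data:

```python
# Hypothetical consolidated term list (stand-in for the SMG/BIM/NWM spreadsheet).
collection_terms = {"jacquard loom", "spinning jenny", "power loom"}

# Hypothetical per-article bi-gram sets from a Constellate export.
article_bigrams = {
    "article_1": {"jacquard loom", "cotton trade", "power loom"},
    "article_2": {"steam engine", "iron bridge"},
}

# Keep only articles with at least one matching term.
matches = {
    art: sorted(collection_terms & grams)
    for art, grams in article_bigrams.items()
    if collection_terms & grams
}
# matches == {"article_1": ["jacquard loom", "power loom"]}
```

In practice the terms would need normalising (case, plurals, spelling variants) before the comparison, but exact intersection is a reasonable first pass.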
River Pollution