The goal of this document is to summarise the progress of the project's computer vision work and to set out the next steps for each strand of investigation.
The computer vision work of this project seeks to connect objects (whether these be from catalogues, adverts, photographs, or films) together in one vector space. It performs this linkage without the need to dictate a taxonomy.
The Heritage Weaver work has to date focused on the communication collections of the Science Museum catalogue and the National Museum of Scotland. We embedded every image and catalogue description, creating a multimodal vector space that allows us to link records based on their visual and textual similarity. This creates data-neighbourhoods within the vector space: red telephones cluster in one corner, and the further you travel from them, the less similar the objects become and the less amenable they are to linkage. In Heritage Weaver, we explore in more detail the affordances of multimodal search and linkage: what can (and cannot) be found by using models like CLIP for information retrieval, and how much (and when) does multimodality actually improve on traditional keyword search?
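As a rough illustration of how such a space can be built and queried (a minimal sketch, not the project's actual pipeline; the model checkpoint, file paths, and catalogue fields are placeholders), images and catalogue descriptions can be embedded with a CLIP model and compared by cosine similarity:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# A publicly available CLIP checkpoint; the project may use a different variant.
model = SentenceTransformer("clip-ViT-B-32")

# Hypothetical records: one image and one catalogue description per object.
records = [
    {"id": "SMG-001", "image": "images/red_telephone.jpg",
     "text": "Red rotary-dial telephone, GPO model 746, 1967"},
    {"id": "NMS-002", "image": "images/crystal_radio.jpg",
     "text": "Crystal radio receiver with headphones, c. 1923"},
]

# Embed both modalities into the same vector space.
image_vecs = model.encode([Image.open(r["image"]) for r in records])
text_vecs = model.encode([r["text"] for r in records])

# A free-text query can then be ranked against either modality.
query_vec = model.encode(["a red telephone"])
scores = util.cos_sim(query_vec, image_vecs)[0]
for record, score in sorted(zip(records, scores), key=lambda x: float(x[1]), reverse=True):
    print(record["id"], round(float(score), 3))
```

Records whose images or descriptions score highly against the same query end up in the same neighbourhood, which is the behaviour the red-telephone example above describes.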
In the next two months Heritage Weaver will test how segment-anything can identify objects within images and then find similar objects within the vector space that has already been created.
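One possible way to wire segment-anything into the existing space (a sketch under assumed defaults; the checkpoint path, area threshold, and image names are placeholders) is to generate masks, crop each detected region, and embed the crops with the same CLIP model used for whole images:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from sentence_transformers import SentenceTransformer

# Assumed local checkpoint for the ViT-H SAM model.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)
clip_model = SentenceTransformer("clip-ViT-B-32")

image = Image.open("images/switchboard.jpg").convert("RGB")
masks = mask_generator.generate(np.array(image))

# Crop each segmented region so it can be embedded and searched
# alongside whole-object images in the shared vector space.
crops = []
for mask in masks:
    x, y, w, h = mask["bbox"]   # bounding box in XYWH pixel coordinates
    if w * h < 1000:            # assumed threshold to skip tiny fragments
        continue
    crops.append(image.crop((x, y, x + w, y + h)))

crop_vecs = clip_model.encode(crops)
```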
From May onwards, Heritage Weaver will look to ingest energy and textiles collections from SCM and NMS to create a ‘Congruence Engine’ space. This will allow us to understand how the foundation model, CLIP, deals with a range of different objects, and to what extent fine-tuning this model on museum collections improves its use for search and exploration. It may be that these industrial objects (engines and looms) are harder for CLIP to process than the domestic objects of the communication collections (telephones, radios, televisions).
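On the fine-tuning question, one standard recipe (sketched here with open_clip and a symmetric contrastive loss; the batching and data loading are left out, and nothing here reflects the project's final training setup) is to pair each museum image with its catalogue description and continue training the pretrained model on those pairs:

```python
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def clip_step(images, descriptions):
    """One contrastive step on a batch of (preprocessed image, catalogue description) pairs."""
    image_features = F.normalize(model.encode_image(images), dim=-1)
    text_features = F.normalize(model.encode_text(tokenizer(descriptions)), dim=-1)
    logits = model.logit_scale.exp() * image_features @ text_features.T
    # Matching image/description pairs sit on the diagonal of the logit matrix.
    labels = torch.arange(len(descriptions))
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```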
In the final month of research, Heritage Weaver will apply the segment-anything tool to a moving image source, attempting to identify looms within Once Upon a Sheep. This work has already begun with other moving image examples, and looks promising.
The Power of Advertisements work uses computer vision tools to generate metadata around images of objects found in advertisements, so that they can be fed into the same vector space. The adverts from one volume of a journal (The Electrical Age) have been extracted with the aid of the index found in the back of the volume. Initial tests with Llava have assessed the usefulness of a set of prompts. These prompts generated (with varying accuracy) 1) descriptions of the images, 2) keywords, 3) lists of objects, and 4) mottos and location information extracted from the text featured within the ads.
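To give a sense of how such prompts can be run programmatically (a sketch using an openly hosted Llava checkpoint via the Hugging Face pipeline; the prompts and file name below are illustrative rather than the project's actual prompt set):

```python
from transformers import pipeline

# An openly hosted Llava 1.5 checkpoint; the project's exact model may differ.
pipe = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")

prompts = {
    "description": "Describe the objects shown in this advertisement.",
    "keywords": "List ten keywords for this advertisement, separated by commas.",
    "objects": "List every object visible in this advertisement.",
    "text": "Transcribe any slogans, mottos, or addresses printed in this advertisement.",
}

advert = "adverts/electrical_age_p012.jpg"  # hypothetical extracted advert image
metadata = {}
for field, question in prompts.items():
    # Llava expects the question wrapped in its USER/ASSISTANT chat template.
    prompt = f"USER: <image>\n{question}\nASSISTANT:"
    output = pipe(advert, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
    metadata[field] = output[0]["generated_text"].split("ASSISTANT:")[-1].strip()
```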
The next steps will take an iterative approach: refining the prompts further using a newer version of Llava (1.6) and inserting the resulting metadata into the shared vector space alongside the other datasets.
The Mining Review investigation aims to compare the complementary layers of metadata that can be generated from films, in terms of how well each layer connects to other datasets. In its initial phase, archival documents were identified for three episodes of the Mining Review series, and a small catalogue of the individual items was created (including OCR-captured transcriptions of typed documents). A simplified named-entity recognition (using spaCy) was also applied to these documents and the catalogue. In the next phase, the audio of the 94 episodes of Mining Review was transcribed using Whisper (by OpenAI), named-entity recognition was applied to the generated text, and an entity-linking tool (spaCyfishing) was used to find entities that can be connected to entries in Wikidata.
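A condensed sketch of that transcription-and-linking pipeline is below, assuming the openai-whisper, spaCy, and spaCyfishing packages (the latter queries a remote entity-fishing service); the file name and Whisper model size are placeholders:

```python
import spacy
import whisper

# Transcribe an episode's audio track with Whisper.
asr_model = whisper.load_model("base")
transcript = asr_model.transcribe("mining_review_episode_01.mp4")["text"]

# Named-entity recognition, then Wikidata linking via the spaCyfishing pipe.
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("entityfishing")

doc = nlp(transcript)
for ent in doc.ents:
    # kb_qid and url_wikidata are set by spaCyfishing when a Wikidata match is found.
    print(ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata)
```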
In the next stage of the investigation, we intend to work with BFI curator Patrick Russell to manually identify links from these documents that emerge through an analogue “curatorial/expert vision”. This will highlight the gaps in the links identified by digital tools. We also intend to use Llava to generate descriptions and linkable lists of entities based on shots automatically extracted from Mining Review episodes (using ffmpeg or PySceneDetect), as sketched below. To achieve this, we are aiming to hold a small two-hour collaborative prompt-engineering workshop, where we will create and test a list of prompts that capture the curatorial, historical, and technical interests present in the project.
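The shot extraction step could look something like the following (a sketch using PySceneDetect's ContentDetector and ffmpeg to save one frame per shot; the file names and detector settings are placeholders):

```python
import subprocess
from scenedetect import detect, ContentDetector

episode = "mining_review_episode_01.mp4"

# Detect shot boundaries with a content-based detector (default threshold assumed).
scenes = detect(episode, ContentDetector())

# Save one representative frame per shot with ffmpeg; these frames could then be
# passed to Llava with the prompts agreed at the workshop.
for i, (start, _end) in enumerate(scenes):
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", start.get_timecode(),
        "-i", episode,
        "-frames:v", "1",
        f"shots/shot_{i:03d}.jpg",
    ], check=True)
```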
The ultimate aim of the project is to generate complementary layers of metadata accompanying the films and images that can help with the integration of these files into the shared vector space.