tl;dr
- This page serves as a public draft for the multimodal investigation of communication objects
- Question: What do we gain by harnessing multimodal models for exploring and linking museum catalogues?
- Code on this GitHub page (documentation under construction).
- Findings: 🕵️♀️ [classified] 👀
Exploring and Connecting Collections Through Multimodal Linking
[Warning: this is a public draft and always in flux. Some sentences may not be appropriate for (syntactically and semantically) sensitive viewers. Other reminders are written for a very small audience of one, i.e. yours truly.]
Introduction
This investigation looks closely at the affordances of multimodal models for exploring and linking museum collections. We collaborate with the communications team and scrutinise a curated dataset of communication objects, selected from the Science Museum Group and National Museums Scotland. In total this collection comprises around 5132 objects (images and descriptions): not immensely large, but large enough for exploration and experimentation. It goes without saying that, time and resources permitting, we aim to scale up the techniques we develop to the complete collection. Nothing we do here is specific to communication objects, and the methods we propose work at the level of a national collection.
We focus on using multimodal embeddings for search, as well as for record linking. In both scenarios we tend to ask a (rather) basic question: what do we gain? Put differently, what are the gains (and costs) of using multimodal models? What do we find that otherwise remains hidden in the database? What connections can we forge?
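To make this concrete, the sketch below shows the kind of search pipeline we have in mind: embed the collection images with SigLIP, index them, and query the index with free text. It assumes the openly available google/siglip-base-patch16-224 checkpoint (via Hugging Face transformers) and FAISS for nearest-neighbour search; the file paths and query are placeholders, not our actual data or code.

```python
import faiss
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

checkpoint = "google/siglip-base-patch16-224"  # assumed off-the-shelf model
model = AutoModel.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

def embed_images(paths):
    """Unit-normalised SigLIP image embeddings for a list of file paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return (feats / feats.norm(dim=-1, keepdim=True)).numpy()

def embed_texts(texts):
    """Unit-normalised SigLIP text embeddings; SigLIP expects max_length padding."""
    inputs = processor(text=texts, padding="max_length", return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return (feats / feats.norm(dim=-1, keepdim=True)).numpy()

# Index the image embeddings of the collection (placeholder paths).
image_paths = ["objects/telegraph_key.jpg", "objects/rotary_phone.jpg"]
vectors = embed_images(image_paths)
index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine on unit vectors
index.add(vectors)

# A free-text query retrieves the most similar objects.
scores, ids = index.search(embed_texts(["a wooden telegraph key"]), k=2)
for score, i in zip(scores[0], ids[0]):
    print(f"{image_paths[i]} (similarity {score:.3f})")
```

Because the vectors are unit-normalised, the inner-product index behaves like cosine similarity, and the same index answers text-to-image as well as image-to-image queries.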
Overall Aims (tl;dr version)
What is it that we want to achieve?
- Evaluate multimodal search and record linking of museum objects: how to find and link objects by harnessing (a combination of) textual and visual similarity (i.e. based on descriptive metadata and visual media); the first sketch after this list illustrates the idea
- Improve multimodal record linking through:
- Annotation: we enable scholars to annotate search results and links by their relevance; we can then use these annotations to further classify and filter results and links.
- Model fine-tuning: the initial experiments use an off-the-shelf SigLIP model (a slightly improved version of OpenAI's CLIP). We want to assess to what extent fine-tuning this model on our collections improves both information retrieval and record linkage.
- Additional aim for the Exhibition: fine-grained object retrieval, a scenario in which we segment a complex image to identify specific objects; these are then used as queries in our multimodal database (see the second sketch below)
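First sketch: a hedged take on what linking by a combination of textual and visual similarity could look like, namely a weighted sum of cosine similarities across the two modalities. It reuses the embed_texts and embed_images helpers from the search sketch above; the example records, the 0.5/0.5 weighting and the threshold are illustrative assumptions, not our final setup.

```python
def link_score(rec_a, rec_b, w_text=0.5, w_image=0.5):
    """Mix textual and visual cosine similarity for a candidate record pair."""
    t = (embed_texts([rec_a["description"]]) @ embed_texts([rec_b["description"]]).T).item()
    v = (embed_images([rec_a["image"]]) @ embed_images([rec_b["image"]]).T).item()
    return w_text * t + w_image * v

# Illustrative records; fields and paths are placeholders.
smg_record = {"description": "Morse key, c. 1870", "image": "smg/morse_key.jpg"}
nms_record = {"description": "Telegraph key, 19th century", "image": "nms/telegraph_key.jpg"}

# Pairs scoring above a threshold become candidate links for scholars to annotate.
if link_score(smg_record, nms_record) > 0.8:
    print("candidate link")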
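Second sketch: fine-grained retrieval by cropping a region from a complex image and using the crop as a visual query against the index built earlier. The bounding box here is hand-picked for illustration; in the actual scenario a segmentation model would propose the regions. It reuses model, processor, index and image_paths from the search sketch.

```python
import torch
from PIL import Image

scene = Image.open("exhibition/display_case.jpg").convert("RGB")  # placeholder image
box = (120, 80, 360, 300)  # (left, upper, right, lower), hand-picked coordinates
crop = scene.crop(box)

# Embed the crop and query the image index with it.
inputs = processor(images=crop, return_tensors="pt")
with torch.no_grad():
    q = model.get_image_features(**inputs)
q = (q / q.norm(dim=-1, keepdim=True)).numpy()

scores, ids = index.search(q, k=3)
print([image_paths[i] for i in ids[0]])
```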
Technical Objectives (tl;dr version)
- Build a multimodal vector database for communication objects in the SMG and NMS collections
- Allow other researchers to search and annotate these data (for now mostly colleagues, through a Google Colab interface)
- Fine-tune a CLIP or SigLIP model on image-text pairs from our collections
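To give an impression of the last objective, here is a minimal fine-tuning sketch. It assumes our collection records can be read as dicts with an "image" path and a "description"; the batch size, learning rate and epoch count are illustrative, not tuned values. Recent versions of transformers compute SigLIP's sigmoid contrastive loss directly when return_loss=True is passed.

```python
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModel, AutoProcessor

checkpoint = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

class ObjectPairs(Dataset):
    """Image-description pairs drawn from collection records (placeholder schema)."""
    def __init__(self, records):
        self.records = records
    def __len__(self):
        return len(self.records)
    def __getitem__(self, i):
        r = self.records[i]
        return Image.open(r["image"]).convert("RGB"), r["description"]

def collate(batch):
    images, texts = zip(*batch)
    return processor(images=list(images), text=list(texts),
                     padding="max_length", truncation=True, return_tensors="pt")

records = [{"image": "objects/telegraph_key.jpg",
            "description": "Telegraph key, c. 1870"}]  # placeholder records
loader = DataLoader(ObjectPairs(records), batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        # SiglipModel returns its sigmoid contrastive loss when return_loss=True.
        loss = model(**batch, return_loss=True).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

After fine-tuning, the same embedding and search code as above applies; the evaluation question is then whether retrieval and linking improve on held-out annotations.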