Graphing Diverse Textiles Data with IndexGPT: Investigation Notes

<aside> 👉 I'm currently treating this as a piece of “technical research with practical purpose”: creating something that is hopefully useful from a data-led historical research perspective, while also looking for practical ways of using LLMs (via prompt engineering) to derive useful or interesting data from historical documents more generally. (Or at least to get a better picture of the LLM landscape!)

</aside>

Update 17/10/23: Some example notebooks for working with GPT are now in GitHub here:

Additional Resources

Asa’s GPT Prompt Experimentation

Developer Diary

GPT & Mathematical Practitioners Hack, 31st August 2023

I think there might be two (and a half) concurrent research questions:

Firstly, given that we have multiple diverse datasets on factory roles, machines, activities, etc., can we link them in a useful way, and potentially unearth some new ontologies which might allow us to do that? For instance, we might link specific factory activities to machines, building up more knowledge of the processes and activity-to-activity workflow of textile production in a mill, including the machines involved and their locations.
Secondly, can we use LLMs to augment the tabular data we currently have (on e.g. roles and activities undertaken, collections data, insurance records, etc.) with other, less structured data, such as that from HathiTrust Digital Library texts like Beaumont’s “Woollen and Worsted …” (W&W)? This would mean using LLM techniques such as multi-stage prompting[1], chain-of-thought[2], and document embeddings[3] to extract structured knowledge, which can then be used to build a more nuanced understanding of the activities of a mill and its workers.
Thirdly, following from the above, are there any prompt engineering methods which we might usefully employ to draw knowledge accurately from (oftentimes very heterogeneous) historical texts, in a manner which is generally applicable to other enquiries? Can we create a process framework for LLM-powered research?
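To make the multi-stage prompting idea a bit more concrete, here is a minimal sketch of how a two-stage extraction might look: the first prompt asks only for the activities in a passage, the second asks, per activity, for the machine and role involved, and the replies are collected as simple triples that could feed a graph. Everything here is an assumption for illustration: the prompt wording, the function names (`extract_triples`, `to_graph`) and the JSON keys are hypothetical and not taken from the project notebooks, and `ask` stands in for whatever function actually sends a prompt to a model.

```python
import json
from collections import defaultdict
from typing import Callable

# Hypothetical prompt wording, for illustration only.
STAGE_1 = (
    "List every textile-production activity mentioned in the passage below, "
    "one per line, with no commentary.\n\nPassage:\n{passage}"
)
STAGE_2 = (
    "For the activity '{activity}' in the passage below, return a JSON object "
    'with the keys "machine" and "role". Use null where the passage does not say.'
    "\n\nPassage:\n{passage}"
)


def extract_triples(passage: str, ask: Callable[[str], str]) -> list[tuple[str, str, str]]:
    """Two-stage extraction: list activities first, then query each one.

    `ask` is an assumed callable that sends a prompt to a model and
    returns the model's reply as text.
    """
    reply = ask(STAGE_1.format(passage=passage))
    # Tolerate simple list markers ("-", "*") in the stage-one reply.
    activities = [line.strip(" -*") for line in reply.splitlines() if line.strip(" -*")]

    triples = []
    for activity in activities:
        raw = ask(STAGE_2.format(activity=activity, passage=passage))
        try:
            details = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed replies rather than guess at their content
        if not isinstance(details, dict):
            continue
        for key in ("machine", "role"):
            value = details.get(key)
            if value:  # null/empty means the passage did not say
                triples.append((activity, key, str(value)))
    return triples


def to_graph(triples: list[tuple[str, str, str]]) -> dict[str, set[tuple[str, str]]]:
    """Group (activity, relation, value) triples into an adjacency mapping."""
    graph = defaultdict(set)
    for subject, relation, obj in triples:
        graph[subject].add((relation, obj))
    return dict(graph)
```

Passing the model call in as `ask` keeps the sketch independent of any particular API, and means the parsing logic can be tried out against canned replies before spending anything on real requests.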

Relating to the above, there are some technical research questions:

How feasible and/or useful is it to generate an embeddings database from W&W, and what are the steps to achieving this? Will “off-the-shelf” embeddings work for us, or do we need to look into fine-tuning, given the number of specialist terms (scutching etc.) used?
Given a set of embeddings, how do we best use them? Could we, for instance, use them to build a knowledge graph representation of a mill, including the machines, activities, and workers’ roles involved at each stage?
What are the practical considerations and limits (cost, usage, etc.; see below) for applying LLMs to research?
How do “ersatz ontologies” (in a prompt engineering context, see below) enable new methods of historical research through data?
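As a rough picture of what the steps for an embeddings database might involve, the sketch below splits a text into overlapping chunks, embeds each chunk, and ranks chunks against a query by cosine similarity. It is only a sketch under stated assumptions: `toy_embed` is a hashed bag-of-words stand-in, not a real embedding model, so it says nothing about how well off-the-shelf embeddings handle specialist vocabulary. The chunk sizes, the vector dimension and all function names are hypothetical.

```python
import hashlib
import math
import re

DIM = 256  # arbitrary vector size for the stand-in embedding


def chunk(text: str, size: int = 80, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the next."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]


def toy_embed(text: str) -> list[float]:
    """Stand-in embedding (hashed bag-of-words). NOT a real model: swap in a real one in practice."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str], embed=toy_embed) -> list[tuple[str, list[float]]]:
    """Pair every chunk with its embedding (the 'database' is just a list here)."""
    return [(c, embed(c)) for c in chunks]


def search(index, query: str, k: int = 3, embed=toy_embed) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

Because `embed` is a parameter, the same index and search code could be pointed at a real embedding model later, which is one way to compare off-the-shelf and fine-tuned embeddings on the same queries.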

Ersatz Ontologies