Curating and building a historic collection drawn from several, typically heterogeneous, sources presents a number of challenges. These commonly revolve around differing vocabularies and vernaculars, inconsistent field naming (and namespaces), and sparsely populated or incomplete text fields. Such challenges significantly hamper the combining of data sources, especially once datasets scale beyond a few hundred records. The key task, then, is to find a largely automated method for classifying similar records so that objects can be grouped and mapped despite these challenges.

TAG lab seeks to explore the efficacy of using large language models (LLMs) to perform this classification. The first phase of work introduces the notion of topic theme allocation as a means of document classification. Topic theme allocation is a method for organising the outputs of a topic model into a cohesive grouping of dominant themes within a dataset, and for labelling documents with those themes. Once the relevant text fields from each dataset have been chosen, the process of topic theme allocation comprises four stages:

  1. Topic modelling
  2. Topic theme allocation
  3. Evaluation
  4. Theme adjustment

At any point in this process a stage may be revisited and the following stages repeated until a satisfactory topic theme allocation is generated. These stages are described in more detail below.

Phase One Methodology: The first step in Phase One involves exploring the combining of disparate data sources. The proposed methodology is as follows.

  1. Choose two data sources to combine.
  2. Find a single information-rich data column (e.g. item description) in each source.
  3. Train two topic models, one per data-source.
  4. Concatenate the inferred topics to create a single set of topics.
  5. Perform topic theme allocation on this combined set of topics.

The primary intention of this work is to find a method for mapping similar items that use different vocabularies and descriptive methods, by linking them through their shared subtheme and master theme labellings. A secondary objective is to explore the possible detection of distinct ‘voices’ within (as well as between) the datasets.
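
To illustrate how shared theme labels could link items across collections, the sketch below groups records from two sources under one master theme. The records and theme labels are invented placeholders, not project data.

```python
# Hypothetical illustration: once records carry theme labels, items from
# different sources can be mapped together by grouping on those labels.
from collections import defaultdict

records = [
    {"source": "A", "item": "Jacquard loom", "master_theme": "Textile manufacture"},
    {"source": "B", "item": "Weaving frame", "master_theme": "Textile manufacture"},
    {"source": "A", "item": "Morse key", "master_theme": "Telecommunications"},
]

# Group items from both sources under their shared master theme.
by_theme = defaultdict(list)
for r in records:
    by_theme[r["master_theme"]].append((r["source"], r["item"]))

print(by_theme["Textile manufacture"])
# → [('A', 'Jacquard loom'), ('B', 'Weaving frame')]
```

The same grouping could be repeated at the subtheme level for a finer-grained mapping.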

The datasets currently proposed for this exploration involve two of the following:

a) Congruence Engine data_Sci & Ind
b) NMS_Export_S&T_Technology
c) BT Archive

Future work:

Subsequent steps in this first phase of the project, following successful testing of the core method, will extend the process to additional project datasets with different characteristics, and will likely consider how to interrogate cross-cutting themes in the data.

During this phase of the work we will also reflect on a method and/or software tool that allows researchers to explore and combine disparate collections in a way that minimises these barriers, and will aim to arrive at an outline specification for this tool.