Annotation tools for ML training sets, scope of technical knowledge

Kunika Kono, Felix Needham-Simpson, Stefania

We discussed the use of Prodigy (and other annotation tools) in relation to the folk songs investigation, and a specific question: what is involved in, and required for, selecting and setting up an annotation tool for data labeling? This was with a view to finalising the involvement of Jennifer Reid - the folk songs expert - who will be helping the investigation develop the annotation schema over the next couple of months.

Before deciding on an annotation tool there are several things to consider/define, such as:

Objectives, scope and requirements of annotation work and workflow.
Collaborators/annotators' requirements and constraints.
Technical expertise required, and the capacity/availability of the people who have it - including IT systems administrators (e.g. SMG's IT department) as well as ML/NLP specialists, data scientists, etc.
Project-wide usage and requirements. There are other investigations that have NLP components in the pipeline - what are their needs, are there any overlaps/differences?
Keep 'in-house' or 'outsource'? Are the IT provision and support already offered (or something that could be offered) by the IT departments of SMG or partner orgs? If buying in external services, are there any implications, e.g. data agreements with our partners, a GDPR agreement with the service, etc.?
Software/platform as a service, or self-hosted remotely or locally? A virtual desktop may be another possibility, but we need to check that it is persistent, since most orgs offer it as non-persistent (meaning virtual desktop instances are deleted upon logout). Also, bear in mind that with more control comes more responsibility for system maintenance, security and data backups.

We agreed that there is a lot to think about, investigate and survey. Given that the scope of the folk songs investigation's annotation work is still being formulated, we will explore Microsoft Word as a preliminary annotation tool for exploratory text labeling and annotation. In the next drop-in session, we will look at a proof-of-concept workflow, from annotating in a Word document to using the annotated text in an NLP annotation tool and/or machine learning.
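
As a starting point for that proof of concept, the sketch below shows one way annotated text might be pulled out of a Word document using only Python's standard library. It assumes, purely for illustration, that annotators mark spans with Word's highlight colours and that each colour stands for a label; the colour-to-label mapping (`LABELS`), the label names and the sample sentence are hypothetical and were not agreed in the session. A .docx file is a zip archive whose main text is stored in `word/document.xml`, where a highlighted run of text carries a `w:highlight` element.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by WordprocessingML (the XML inside a .docx file).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical mapping from highlight colour to annotation label.
LABELS = {"yellow": "PLACE", "green": "PERSON"}


def extract_highlights(document_xml):
    """Return (label, text) pairs for every highlighted run in the XML."""
    root = ET.fromstring(document_xml)
    spans = []
    for run in root.iter(f"{{{W_NS}}}r"):
        highlight = run.find("w:rPr/w:highlight", NS)
        if highlight is None:
            continue  # this run is not highlighted
        colour = highlight.get(f"{{{W_NS}}}val")
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        if text:
            # Fall back to the raw colour name if no label is mapped to it.
            spans.append((LABELS.get(colour, colour), text))
    return spans


def extract_from_docx(path):
    """Open a .docx file (a zip archive) and read its main document part."""
    with zipfile.ZipFile(path) as docx:
        return extract_highlights(docx.read("word/document.xml"))


# Minimal hand-written sample of what Word stores for one highlighted word.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p>"
    '<w:r><w:t xml:space="preserve">She sailed from </w:t></w:r>'
    '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>Liverpool</w:t></w:r>'
    "</w:p></w:body></w:document>"
)

print(extract_highlights(SAMPLE))  # [('PLACE', 'Liverpool')]
```

The resulting (label, text) pairs would still need converting into whatever import format the chosen annotation tool expects. Note also that Word may split a single highlighted phrase across several runs, so adjacent pairs with the same label may need to be merged.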

We discussed the opportunity to develop a ‘reflective diary’ of the annotation work with Jennifer, to observe how the technical and the human interact in the ‘human in the loop’ approach. We also reflected on the scope of the technical knowledge in the investigation and agreed that it would be helpful for Stefania and Daniel to understand how XML works and why it is useful, without necessarily learning how to use the language (this would require dedicated training and an amount of time that is outside the scope of the investigation itself). Instead, it would be most important for the co-investigators to start familiarizing themselves with XML files, to be able to identify the annotated labels from the Word document (and from other, more developed annotation tools), and to read about other examples/case studies where XML was used in digital humanities/digital history.

<aside> 💭 Read Stefania’s reflective notes on this session

</aside>

Action points

[x] Kunika to share with the group an example of a Microsoft Word document being used as an annotation tool.