<aside> 💡 How can we analyze humanities texts? How can we uncover patterns and outliers across hundreds, thousands, and even millions of documents? What methods, tools, and questions should you know about to undertake natural language processing (NLP) in Digital Humanities (DH)?

</aside>

Zoe Genevieve LeBlanc


Today's talk will try to provide some answers to these questions, but please set up a consultation at the CDH if you have further questions.


So where to start? First, we need data, and more to the point we need a corpus.

What is a corpus?

Corpus is a term used in both NLP and DH to refer to a set of documents that contain text and, often, other metadata. Corpora can be anything from medieval manuscripts to tweets. One of the biggest hurdles in DH research is turning sources into datasets. Usually you'll either use pre-digitized text data that is provided online or digitize sources yourself through transcription or optical character recognition (OCR).
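Once sources are digitized, a corpus often starts out as nothing more than a folder of plain-text files plus some minimal metadata. As a rough sketch (the `load_corpus` helper and the assumption of UTF-8 `.txt` files are illustrative, not a standard API), you might load such a folder into a list of records like this:

```python
from pathlib import Path

def load_corpus(directory):
    """Load every .txt file in a directory into a list of
    {filename, text} records -- a minimal corpus structure."""
    corpus = []
    for path in sorted(Path(directory).glob("*.txt")):
        corpus.append({
            "filename": path.name,  # minimal metadata: the source file's name
            "text": path.read_text(encoding="utf-8"),
        })
    return corpus
```

From a structure like this, you can layer on richer metadata (author, date, provenance) as separate fields, which is usually where the real curatorial work of corpus-building happens.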

Some questions to ask when making or using a pre-existing dataset: