Natural Language Processing and Digital Humanities

<aside> 💡 How can we analyze humanities texts? How can we uncover patterns and outliers across hundreds, thousands, or even millions of documents? What methods, tools, and questions should you know about to undertake natural language processing (NLP) in Digital Humanities (DH)?

</aside>

Zoe Genevieve LeBlanc


Today's talk will try to provide some answers to these questions, but please set up a consultation at the CDH if you have further questions.

Contact & Directions

So where to start? First, we need data, and more to the point we need a corpus.

What is a corpus?

Corpus is a term used in both NLP and DH to refer to a set of documents that contain text and associated metadata. Corpora can be anything from medieval manuscripts to tweets. One of the biggest hurdles in DH research is turning sources into datasets. Usually you'll either use pre-digitized text data that is provided online or digitize sources yourself through transcription or optical character recognition (OCR).

Some questions to ask when making or using a pre-existing dataset:

Are there digital versions of your sources? Do you need to digitize your sources, or is what is available good enough?
Even if your data is digitized, how much work is required to transform your materials into datasets? Can you easily download the data or do you have to manually compile it?
If your data is already collected, how much do you know about the collection and how it was created?
If you are collecting your data, which metadata should you include about your sources?
How should your data be formatted? Usually we store data in either txt or csv files, though you might eventually have enough data for a database.
How should your data be organized (documents, variables)?
- Do you want word counts as variables so that you can compare frequencies? Or are you interested in the entirety of the text? For example, if you're working with books, you might have the text of each book saved in its own text file, but then compile the books together in a spreadsheet where each row contains the metadata and full text of one book.
Is your data representative? Do you need to collect more data? What's the rationale for your corpus?
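
To make the formatting and organization questions above more concrete, here is a minimal sketch in Python of the two layouts mentioned: one row per document (metadata plus full text) saved as CSV, and word counts per document for comparing frequencies. The two "books", their titles, years, and text are invented for illustration, and the helper names (`tokenize`, `corpus_to_csv`, `word_counts`) are hypothetical; in a real project each text would likely be read from its own txt file, and the tokenizer would need to be adapted to your sources.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical mini-corpus (invented data). In practice, each "text" value
# would usually be read from a separate .txt file.
books = [
    {"title": "Book A", "year": 1850, "text": "The sea was calm. The sea was wide."},
    {"title": "Book B", "year": 1900, "text": "A city of smoke and a city of light."},
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens.

    This is a deliberately naive tokenizer that only keeps a-z and
    apostrophes; it will not handle accented or non-Latin characters.
    """
    return re.findall(r"[a-z']+", text.lower())


def corpus_to_csv(rows):
    """Write one row per document (metadata plus full text) and return the CSV as a string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "year", "text"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def word_counts(rows):
    """Return a dict mapping each document title to a Counter of its word frequencies."""
    return {row["title"]: Counter(tokenize(row["text"])) for row in rows}


csv_text = corpus_to_csv(books)   # layout 1: documents as rows
counts = word_counts(books)       # layout 2: word counts as variables

print(csv_text)
print(counts["Book A"].most_common(3))
```

Keeping the full text alongside the metadata means the word counts can always be regenerated later with a different tokenizer, whereas a table that stores only counts cannot be turned back into the original text.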