Disclaimer: this post is for my bachelor's and master's students at UvA, to give them an introduction to PRACTICAL NMT from top to bottom. There is a gap between theory and practice that is not always addressed by books and blog posts. And honestly, I am a bit tired of explaining things every time (I am lazy, sue me).

Introduction

Since you are a student in an informatics or AI program, let’s assume you know something about machine learning, or maybe even machine translation. You’ve heard about the Transformer. You have probably read The Illustrated Transformer blog post. Maybe you even took an NLP course (if not, see resources). You know Python and PyTorch (I hope). And now you have a task: train an NMT system. You can of course write all the code yourself, but that’s tedious. So instead people tend to use ML frameworks such as fairseq, Hugging Face Transformers, or JoeyNMT. Here you can find a good overview of the frameworks. We will look into each part of the pipeline and see examples with the fairseq framework.

Why frameworks? Training and evaluation loops are much the same across seq2seq models, and frameworks handle them for you, along with data loading and many other things. It’s not worth spending time writing your own code for routine tasks when you can focus on the research part.

To make things easier, we can split the NMT pipeline into 3 parts and look at them separately.

Pre-processing: It’s all about data

One day you wake up and think, “I need to train the best <X>-to-<Y> translation model!” First and foremost, hold your horses and check whether you have access to parallel data.

What is parallel data? It is a set of sentences in language X (source) paired with the same sentences in language Y (target). In practice you need 2 files: source and target. The source file contains only sentences in the source language, one sentence per line. Each line of the target file contains the same sentence in the target language (i.e., a sentence and its translation have the same line number).

Source file (English):

Hello.
Good morning.
i used to be an adventurer like you, then i took an arrow in the knee.

Target file (Dutch):

Hallo.
Goede morgen.
Ik was vroeger een avonturier zoals jij toen nam ik een pijl in de knie
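
Since everything downstream relies on line i of the source matching line i of the target, it can be worth checking that alignment before training. Here is a minimal Python sketch of such a check. The function name `read_parallel` and the file names `train.en` and `train.nl` are made up for illustration; they are not part of fairseq or any other framework.

```python
import tempfile
from pathlib import Path


def read_parallel(src_path, tgt_path):
    """Return a list of (source, target) pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src_file:
        src_lines = [line.rstrip("\n") for line in src_file]
    with open(tgt_path, encoding="utf-8") as tgt_file:
        tgt_lines = [line.rstrip("\n") for line in tgt_file]

    # Line i of the source must be the translation of line i of the target,
    # so a different number of lines means the files are misaligned.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"{len(src_lines)} source lines vs {len(tgt_lines)} target lines"
        )
    return list(zip(src_lines, tgt_lines))


# Demo: write two of the example sentences to hypothetical files and pair them.
workdir = Path(tempfile.mkdtemp())
(workdir / "train.en").write_text("Hello.\nGood morning.\n", encoding="utf-8")
(workdir / "train.nl").write_text("Hallo.\nGoede morgen.\n", encoding="utf-8")

for src, tgt in read_parallel(workdir / "train.en", workdir / "train.nl"):
    print(f"{src}\t{tgt}")
```

A matching line count does not prove the sentences are actual translations of each other, but a mismatch is a sure sign that something went wrong while preparing the files.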

Where to find parallel data? There are several publicly available NMT datasets. The main advantage of such curated datasets is that the evaluation and test sets are the same for everyone, so you can compare your results with other work.

Sometimes the data comes as a single CSV file, sometimes as XML files. Then you need to parse them into separate source and target files. I hope you know some bash. If not, just google it and copy from Stack Overflow.
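
If bash is not your thing, the same job can be done in Python. Below is a rough sketch for the CSV case, assuming the file has a header row and one column per language. The column names `en` and `nl`, the file names, and the helper `split_csv` are all hypothetical, so adapt them to whatever your dataset actually looks like.

```python
import csv
import tempfile
from pathlib import Path


def split_csv(csv_path, src_out, tgt_out, src_col, tgt_col):
    """Split a CSV with a header row into line-aligned source and target files."""
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as csv_file, \
         open(src_out, "w", encoding="utf-8") as src_file, \
         open(tgt_out, "w", encoding="utf-8") as tgt_file:
        for row in csv.DictReader(csv_file):
            # Collapse whitespace (including newlines inside quoted cells)
            # so that every sentence stays on exactly one line.
            src = " ".join((row.get(src_col) or "").split())
            tgt = " ".join((row.get(tgt_col) or "").split())
            if not src or not tgt:
                continue  # skip pairs with an empty side to keep files aligned
            src_file.write(src + "\n")
            tgt_file.write(tgt + "\n")
            written += 1
    return written


# Demo on a tiny made-up CSV; the last row has no English side and is skipped.
workdir = Path(tempfile.mkdtemp())
csv_path = workdir / "data.csv"
csv_path.write_text(
    'en,nl\nHello.,Hallo.\n"Good morning.","Goede morgen."\n,Leeg\n',
    encoding="utf-8",
)
n = split_csv(csv_path, workdir / "data.en", workdir / "data.nl", "en", "nl")
print(n, "sentence pairs written")
```

Skipping a pair when one side is empty is a design choice in this sketch: it keeps the two output files the same length, at the cost of silently dropping some rows.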