Word2Vec: Efficient Estimation of Word Representations in Vector Space
GloVe: Global Vectors for Word Representation
Seq2Seq: Sequence to Sequence Learning with Neural Networks
Neural Machine Translation by Jointly Learning to Align and Translate
ELMo: Deep Contextualized Word Representations
ULMFiT: Universal Language Model Fine-tuning for Text Classification
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
GPT: Improving Language Understanding by Generative Pre-Training
GPT-2: Language Models are Unsupervised Multitask Learners
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
GPT-3: Language Models are Few-Shot Learners