Link: https://www.ijcai.org/proceedings/2020/0553.pdf
Key Idea - It introduces a span-by-span generation flow that trains the model to predict semantically complete spans consecutively rather than predicting word by word.
It also incorporates multi-granularity target sampling to construct pre-training data, which strengthens the correlation between encoder and decoder (the decoder relies more on encoder representations), yielding more human-like generation with less training data.
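A minimal sketch of what multi-granularity target sampling could look like: fragments of mixed lengths are carved out of a sequence as decoding targets, with the rest kept as encoder input. The span-length ranges, mask ratio, and `[MASK]` placeholder here are illustrative assumptions, not the paper's exact hyper-parameters.

```python
import random

def sample_multigranularity_targets(tokens, mask_ratio=0.25, seed=0):
    """Sample non-overlapping target fragments of mixed granularity
    (short and longer spans) from a token sequence."""
    rng = random.Random(seed)
    n = len(tokens)
    budget = max(1, int(n * mask_ratio))  # how many tokens become targets
    covered, targets = set(), []
    attempts = 0
    while budget > 0 and attempts < 100:
        attempts += 1
        # mix fine-grained (1-3 tokens) and coarse (4-8 tokens) fragments
        span_len = rng.randint(1, 3) if rng.random() < 0.5 else rng.randint(4, 8)
        span_len = min(span_len, budget)
        start = rng.randrange(0, n - span_len + 1)
        span = range(start, start + span_len)
        if covered.intersection(span):
            continue  # keep fragments non-overlapping
        covered.update(span)
        targets.append((start, tokens[start:start + span_len]))
        budget -= span_len
    source = [t if i not in covered else "[MASK]" for i, t in enumerate(tokens)]
    return source, sorted(targets)
```

Each `(start, fragment)` pair becomes a decoding target conditioned on the masked source, so the decoder must lean on encoder representations rather than copy adjacent target words.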
Model
ERNIE-GEN is a Transformer-based seq2seq model with a multi-flow attention architecture.
Pre-training - It incorporates an infilling generation mechanism and a noise-aware generation method into both pre-training and fine-tuning.
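The noise-aware idea can be sketched as corrupting a fraction of the gold target tokens with random vocabulary tokens during training, so the decoder learns not to over-trust its own (possibly wrong) previous outputs at inference time. The noise rate below is an illustrative assumption, not the paper's tuned value.

```python
import random

def corrupt_target(target_tokens, vocab, noise_rate=0.05, seed=0):
    """Noise-aware generation sketch: randomly replace some gold target
    tokens with random vocabulary tokens before teacher-forced training."""
    rng = random.Random(seed)
    return [rng.choice(vocab) if rng.random() < noise_rate else tok
            for tok in target_tokens]
```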
Overall Framework
The model performs the word-by-word and span-by-span generation flows in parallel and combines their outputs to compute the training loss.
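The difference between the two flows comes down to what context each target position may attend to. A toy sketch of the prediction/context pairs (span boundaries here are hypothetical inputs, and this only models the conditioning pattern, not the attention math):

```python
def build_flow_targets(target, span_bounds):
    """Sketch of the two generation flows over one target sequence.
    Word flow: predict token i from all tokens before i.
    Span flow: predict every token in a span from tokens before the
    span only, so tokens inside a span do not condition on each other."""
    word_flow = [(i, list(range(i))) for i in range(len(target))]
    span_flow = []
    for start, end in span_bounds:
        for i in range(start, end):
            span_flow.append((i, list(range(start))))  # context stops at span start
    return word_flow, span_flow
```

Training on both pair sets in parallel is what pushes the model toward predicting semantically complete spans rather than leaning on the immediately preceding word.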
During inference, the target sequence is unknown, so the model appends an [ATTN] symbol that aggregates all of the known historical context, predicts the next token through it, drops the symbol, appends the predicted token to the history, and repeats until the stop token is produced.
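That decode loop can be sketched as follows; `predict_next` is a hypothetical callable standing in for the real model, and the token names are placeholders:

```python
def greedy_infill_decode(predict_next, max_len=32, eos="[EOS]"):
    """Inference sketch of infilling generation: append an [ATTN]
    placeholder, predict the next token by attending to the history
    through it, drop the placeholder, and repeat until the stop token."""
    history = []
    for _ in range(max_len):
        token = predict_next(history + ["[ATTN]"])
        if token == eos:
            break
        history.append(token)  # the [ATTN] slot is discarded each step
    return history
```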
Public Models
Trained on Wikipedia + BookCorpus (same as BERT)
Input sequences truncated to 512 tokens
ERNIE-GEN (Base) - 110M params, ERNIE-GEN (Large) - 340M
Huggingface model link- https://huggingface.co/nghuyong/ernie-2.0-en
Results
Abstractive Summarization - aims to generate fluent and concise summaries without being constrained to extracting sub-sequences from the input article.