Link: https://www.ijcai.org/proceedings/2020/0553.pdf

Key Idea - ERNIE-GEN introduces a span-by-span generation flow during training, which trains the model to predict semantically complete spans consecutively rather than predicting word by word.

It also incorporates multi-granularity target sampling to construct pre-training data, which strengthens the correlation between encoder and decoder (the decoder relies more on the encoder's representations). This results in more human-like generation with less training data.
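To make the sampling idea concrete, here is a minimal sketch of how multi-granularity target sampling could be implemented. The probabilities, length ranges, and function name are illustrative placeholders, not the paper's exact procedure or hyperparameters.

```python
import random

def sample_target_fragments(tokens, p_long=0.25, short_range=(1, 4), long_range=(4, 32)):
    """Illustrative multi-granularity target sampling.

    Splits a token sequence into a source context and a sampled target
    fragment, mixing short spans with occasional longer segments so the
    decoder sees targets at several granularities.  The probabilities and
    length ranges are placeholders, not the paper's settings.
    """
    # Pick a fragment length range: mostly short spans, occasionally longer segments.
    lo, hi = long_range if random.random() < p_long else short_range
    length = random.randint(lo, min(hi, len(tokens)))
    # Choose where the target fragment starts.
    start = random.randint(0, len(tokens) - length)
    target = tokens[start:start + length]
    # The remaining tokens act as the encoder-side context.
    source = tokens[:start] + tokens[start + length:]
    return source, target

# Toy usage on a whitespace-tokenized sentence.
src, tgt = sample_target_fragments("the quick brown fox jumps over the lazy dog".split())
print(src, "->", tgt)
```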

Model


ERNIE-GEN is a transformer-based seq2seq model with a multi-flow attention architecture.

Pre-training - ERNIE-GEN incorporates an infilling generation mechanism and a noise-aware generation method into both pre-training and fine-tuning:

  1. Infilling Generation - Instead of attending to the last ground-truth word during training or the last generated word during inference, the model inserts an artificial symbol [ATTN] (with its position) that attends to all former representations, thus avoiding the negative influence of previous mistakes.
  2. Noise-Aware Generation - Randomly replaces words in the target sequence with arbitrary words from the vocabulary, which helps the model learn to detect mistakes and ignore them during inference (a minimal sketch follows this list).
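Below is a hedged sketch of the noise-aware corruption step on a list of target token ids. The noise probability, the special-id set, and the function name are assumptions for illustration, not values from the paper.

```python
import random

def add_noise(target_ids, vocab_size, noise_prob=0.05, special_ids=frozenset({0, 1, 2})):
    """Noise-aware generation, sketched: randomly swap some target tokens
    for arbitrary vocabulary ids so the decoder learns to cope with its
    own mistakes at inference time.  noise_prob and the special-id set
    are illustrative choices, not the paper's values.
    """
    noisy = []
    for tok in target_ids:
        if tok not in special_ids and random.random() < noise_prob:
            noisy.append(random.randrange(vocab_size))  # corrupt this position
        else:
            noisy.append(tok)                           # keep the original token
    return noisy
```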

Overall Framework


The model performs the word-by-word and span-by-span generation flows in parallel and uses the outputs of both flows to compute the loss.
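Since both flows predict the same target tokens, the training objective can be viewed as a sum of two cross-entropy terms. The sketch below assumes standard PyTorch tensors of shape (batch, seq_len, vocab) for the logits; the function name and shapes are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_flow_loss(word_logits, span_logits, target_ids, pad_id=0):
    """Combine the word-by-word and span-by-span flows, sketched.

    Both flows predict the same target tokens (the span flow just
    conditions on span-level context), so the total loss here is simply
    the sum of the two cross-entropy terms, ignoring padding positions.
    """
    word_loss = F.cross_entropy(
        word_logits.transpose(1, 2), target_ids, ignore_index=pad_id)
    span_loss = F.cross_entropy(
        span_logits.transpose(1, 2), target_ids, ignore_index=pad_id)
    return word_loss + span_loss
```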

During inference, since the target sequence is unknown, the model appends the [ATTN] symbol so that it attends to all of the previously generated context, predicts the next token, drops the symbol, and repeats until the stop token is generated.
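A minimal greedy-decoding sketch of this infilling loop is shown below. The `model.step(...)` interface, the token ids, and the greedy argmax choice are assumptions for illustration; the real ERNIE-GEN decoding code differs.

```python
def generate(model, src_ids, attn_id, stop_id, max_len=128):
    """Greedy decoding with the infilling mechanism, sketched.

    `model` is assumed to expose a step(src_ids, tgt_ids) call returning
    next-token logits for the final position (a hypothetical interface).
    At every step an [ATTN] placeholder is appended so the prediction
    attends to all previous representations, then the placeholder is
    replaced by the predicted token.
    """
    generated = []
    for _ in range(max_len):
        # Append the artificial [ATTN] symbol at the next position.
        logits = model.step(src_ids, generated + [attn_id])
        next_id = int(logits.argmax(-1))
        if next_id == stop_id:
            break
        # Drop the placeholder and keep the predicted token instead.
        generated.append(next_id)
    return generated
```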

Public Models

Trained on Wikipedia + BookCorpus (same as BERT)

Input sequences truncated to 512 tokens

ERNIE-GEN (Base) - 110M parameters, ERNIE-GEN (Large) - 340M parameters

Hugging Face model link - https://huggingface.co/nghuyong/ernie-2.0-en

Results

Abstractive Summarization - The task is to generate fluent and concise summaries without being constrained to extracting sub-sequences from the input article.