paper: (Zhang et al., 2019) DialoGPT: Large-scale generative pre-training for conversational response generation

github: https://github.com/microsoft/DialoGPT

We present a large, tunable neural conversational response generation model, DIALOGPT (dialogue generative pre-trained transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains spanning 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain performance close to human, in terms of both automatic and human evaluation, in a single-turn dialogue setting.

Introduction

DIALOGPT extends GPT-2 to address the challenges of conversational neural response generation. Neural response generation is a subcategory of text generation that shares the objective of generating natural-looking text (distinct from any training instance) that is relevant to the prompt.

Like GPT-2, DIALOGPT is formulated as an autoregressive (AR) language model and uses a multi-layer transformer as its model architecture. Unlike GPT-2, however, DIALOGPT is trained on large-scale dialogue pairs/sessions extracted from Reddit discussion chains. Our assumption is that this should enable DIALOGPT to capture the joint distribution $P(Target, Source)$ in conversational flow with finer granularity.
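
As an illustration of this formulation, the sketch below queries a released DialoGPT checkpoint through the Hugging Face transformers library: dialogue turns are concatenated into a single token sequence separated by the end-of-text token, and the model autoregressively continues that sequence to produce the response. This is my own minimal example, not the paper's code; the checkpoint name `microsoft/DialoGPT-medium` is one of the published models, and the generation settings are just reasonable defaults rather than the authors' evaluation setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# One of the released checkpoints on the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

# The dialogue history (here a single source turn) ends with the end-of-text token;
# the model then generates the target turn as an autoregressive continuation.
source = "Does money buy happiness?" + tokenizer.eos_token
input_ids = tokenizer.encode(source, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_length=100,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, i.e. the response to the source turn.
response = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```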

Dataset

The dataset is extracted from comment chains scraped from Reddit spanning 2005 through 2017. Reddit discussions naturally expand into tree-structured reply chains, since each reply to a comment becomes the parent node of subsequent replies. We extract each path from the root node to a leaf node as a training instance containing multiple turns of dialogue.
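
As a purely illustrative picture of this extraction, the sketch below enumerates root-to-leaf paths in a small comment tree; the dict-based node structure and the `extract_sessions` helper are my own assumptions, not the paper's preprocessing code.

```python
def extract_sessions(node, prefix=None):
    """Yield each root-to-leaf path of comments as a list of dialogue turns."""
    prefix = (prefix or []) + [node["text"]]
    if not node.get("children"):          # leaf: the path is a complete session
        yield prefix
    else:
        for child in node["children"]:
            yield from extract_sessions(child, prefix)

# Toy comment tree: the root post with two reply branches.
tree = {
    "text": "what camera should I buy?",
    "children": [
        {"text": "depends on your budget", "children": [
            {"text": "around 500 dollars", "children": []},
        ]},
        {"text": "try a used DSLR", "children": []},
    ],
}

for session in extract_sessions(tree):
    print(" <EOS> ".join(session))   # each printed line is one training instance
```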

Data matching any of the criteria below is filtered out (a rough filtering sketch follows the list):

(1) the source or target contains a URL

(2) the target contains a repetition of three or more words

(3) the response does not contain at least one of the top-50 most frequent English words (e.g., "the", "of", "a"), which suggests it may not be English

(4) the response contains "[" or "]", since this may be markup language

(5) the source and target sequences together are longer than 200 words

(6) the target contains offensive language, identified by matching against a large blocklist
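
The following is a rough re-implementation sketch of these rules; the truncated top-50 word set, the placeholder blocklist, the `keep_pair` helper, and the repeated-3-gram reading of rule (2) are my own assumptions for illustration, not the authors' exact code or thresholds.

```python
import re

# Truncated stand-ins: the paper uses the top-50 most frequent English words and a
# large blocklist of offensive phrases.
TOP_50_ENGLISH = {"the", "of", "a", "and", "to", "in", "is", "you", "that", "it"}
OFFENSIVE_BLOCKLIST = {"badword1", "badword2"}

def keep_pair(source: str, target: str) -> bool:
    """Return True if a (source, target) pair passes all six filtering rules."""
    src_words = source.lower().split()
    tgt_words = target.lower().split()
    if "http" in source.lower() or "http" in target.lower():      # (1) URL heuristic
        return False
    if re.search(r"\b(\w+ \w+ \w+)\b.*\b\1\b", target.lower()):   # (2) repeated 3-gram
        return False
    if not TOP_50_ENGLISH & set(tgt_words):                       # (3) likely non-English
        return False
    if "[" in target or "]" in target:                            # (4) possible markup
        return False
    if len(src_words) + len(tgt_words) > 200:                     # (5) too long
        return False
    if OFFENSIVE_BLOCKLIST & set(tgt_words):                      # (6) offensive language
        return False
    return True
```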