Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (19.06.23)

On June 20, Google Brain introduced XLNet: Generalized Autoregressive Pretraining for Language Understanding (GitHub: https://github.com/zihangdai/xlnet). (~~My undergraduate semester ended on the 21st, and the moment it ended, a SOTA paper..~~)

Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context.

Since XLNet uses Transformer-XL as its backbone model, I took this opportunity to read the Transformer-XL paper (Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context) and **review it alongside the GitHub code**! (The paper alone was hard to follow..)

Introduction

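The Introduction discusses the vanishing and exploding gradient problems in earlier NLP models and the methods used to overcome them. Models such as the LSTM are the typical remedy, and later the attention mechanism addressed the problem by directly connecting related word pairs. The weakness of these existing attention-based language models, however, is that they operate on fixed-length segments; because the context length is fixed, the model cannot capture long-term dependencies that exceed the predefined context length.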

Recently, Al-Rfou et al. (2018) designed a set of auxiliary losses to train deep Transformer networks for character-level language modeling, which outperform LSTMs by a large margin. Despite the success, the LM training in Al-Rfou et al. (2018) is performed on separated fixed-length segments of a few hundred characters, without any information flow across segments.

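Moreover, fixed-length segments are created by selecting consecutive chunks of symbols without regard for sentence or other semantic boundaries, so the model lacks the contextual information it needs to predict the first few symbols of each segment well. This leads to inefficient optimization and degraded performance. The paper calls this problem context fragmentation.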

In addition, the fixed-length segments are created by selecting a consecutive chunk of symbols without respecting the sentence or any other semantic boundary. Hence, the model lacks necessary contextual information needed to well predict the first few symbols, leading to inefficient optimization and inferior performance. We refer to this problem as context fragmentation.
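To make context fragmentation concrete, here is a minimal sketch, assuming a whitespace-tokenized toy sentence and a segment length of 4. The helper names `split_into_segments` and `visible_context_sizes` are made up for illustration and do not come from the paper or its repository.

```python
def split_into_segments(tokens, segment_len):
    """Cut a token stream into consecutive fixed-length chunks, ignoring sentence boundaries."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def visible_context_sizes(segments):
    """For each token, count the earlier tokens it can see when segments are isolated."""
    return [list(range(len(segment))) for segment in segments]


# Toy example (assumption): two short sentences, split every 4 tokens.
tokens = "the cat sat on the mat . it then fell asleep .".split()
segments = split_into_segments(tokens, segment_len=4)
context = visible_context_sizes(segments)

for segment, sizes in zip(segments, context):
    # The first token of every segment is predicted with zero tokens of context.
    print(list(zip(segment, sizes)))
```

Note that the second segment starts in the middle of the first sentence, and its first token sees no context at all, even though the preceding words exist in the corpus.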

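To address these limitations of a fixed-length context, the paper proposes a new architecture called Transformer-XL. Instead of computing the hidden states from scratch for each segment, it **reuses the hidden states obtained in previous segments (they serve as memory for the current segment, forming a recurrent connection between segments)**, so that information can propagate across segments and longer-term dependencies can be modeled.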
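The state-reuse idea can be sketched roughly as follows. This is a toy under stated assumptions, not the code from the repository: hidden states are plain floats, `toy_layer` replaces attention with a causal average, and the names `toy_layer`, `run_with_memory`, and `mem_len` are hypothetical. What it does keep from the paper is the data flow: the states of the layer below from the previous segment are cached and concatenated in front of the current segment, while outputs are produced only for the current segment.

```python
def toy_layer(lower_states, memory):
    """Toy stand-in for one attention layer (hypothetical, not the paper's layer).

    Keys/values span [memory ; lower_states]; queries come only from the
    current segment. Each position averages everything it can see (causal).
    """
    extended = memory + lower_states
    outputs = []
    for i in range(len(lower_states)):
        visible = extended[: len(memory) + i + 1]
        outputs.append(sum(visible) / len(visible))
    return outputs


def run_with_memory(segments, mem_len):
    """Process segments in order, carrying the last `mem_len` lower-layer states forward."""
    memory = []  # nothing is cached before the first segment
    outputs = []
    for lower_states in segments:
        outputs.append(toy_layer(lower_states, memory))
        # Cache this segment's lower-layer states for the next segment.
        # In a real framework the cache would be detached (stop-gradient).
        start = max(0, len(lower_states) - mem_len)
        memory = lower_states[start:]
    return outputs


segments = [[1.0, 2.0], [3.0, 4.0]]
with_memory = run_with_memory(segments, mem_len=2)
without_memory = run_with_memory(segments, mem_len=0)
print(with_memory[1], without_memory[1])
```

With the cache enabled, the second segment's outputs depend on the first segment's states; with `mem_len=0` each segment is processed in isolation, which is the fixed-length setup described earlier.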

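Passing information forward from the previous segment also resolves the context fragmentation problem. To enable this state reuse without causing temporal confusion, the paper shows that it is necessary to use **relative positional encodings rather than absolute positional encodings**.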
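One way to see the temporal confusion is to compare position ids directly. The sketch below is only an illustration with hypothetical helper names (`absolute_positions`, `relative_distances`); it does not implement the paper's encoding, it just shows which index each scheme would attach to the cached and current tokens.

```python
def absolute_positions(mem_len, seg_len):
    """Position ids when every segment is indexed from 0 independently."""
    memory_ids = list(range(mem_len))   # ids the cached segment was encoded with
    current_ids = list(range(seg_len))  # ids the current segment is encoded with
    return memory_ids, current_ids


def relative_distances(mem_len, seg_len):
    """Distance (query position - key position) for each query in the current segment."""
    rows = []
    for q in range(seg_len):
        query_pos = mem_len + q  # position inside the concatenated [memory ; segment]
        rows.append([query_pos - key_pos for key_pos in range(query_pos + 1)])
    return rows


memory_ids, current_ids = absolute_positions(mem_len=3, seg_len=3)
print(memory_ids, current_ids)  # identical ids for tokens from different segments

distances = relative_distances(mem_len=3, seg_len=3)
for row in distances:
    print(row)  # every visible key has a distinct distance from the query
```

With absolute ids, a cached token and a current token can carry the same position, so the model cannot tell which came first. Distances measured from the query stay unambiguous no matter how many segments have passed.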

Model

Given a corpus x = (x1, ..., xT), a language model estimates the joint probability P(x), which can be obtained through auto-regressive factorization as in the equation below.
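As a small numerical illustration of auto-regressive factorization, the sketch below sums the conditional log-probabilities of each token given its prefix. The function `sequence_log_prob` and the `uniform_model` (a made-up model over an assumed 4-word vocabulary) are hypothetical and only serve to show the chain rule at work.

```python
import math


def sequence_log_prob(tokens, cond_prob):
    """Chain rule: log P(x) is the sum over t of log P(x_t | x_<t)."""
    total = 0.0
    for t, token in enumerate(tokens):
        total += math.log(cond_prob(token, tokens[:t]))
    return total


def uniform_model(token, context):
    """Hypothetical toy model: each of 4 vocabulary words is equally likely."""
    return 0.25


log_p = sequence_log_prob(["a", "b", "c"], uniform_model)
print(math.exp(log_p))  # joint probability of the 3-token sequence
```

A real language model would replace `uniform_model` with a network that actually conditions on `context`; the summation itself stays the same.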