by cerussite0
This paper challenged the existing LLM decoding strategies (such as greedy decoding, beam search, and top-k sampling) and proposed a new method called nucleus sampling.
They also carried out many experiments showing how the existing evaluation methods in NLP can be deceiving and how they fail to align with the goal of generating useful, human-like text.
At each step, LLMs assign a large, unreliable tail of low-probability tokens, which must be truncated during generation.
Top-k sampling tries to do so, but the paper shows that using a constant k across all contexts is sub-optimal. In some contexts the head of the next-word distribution is flat across tens or hundreds of reasonable options, while in others it is sharply peaked. If k is small, there is a risk of generating bland or generic text in flat contexts; if k is large, the top-k vocabulary in peaked contexts will include inappropriate candidates whose probability of being sampled is increased by the renormalization.
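The fixed-k truncation described above can be sketched as follows (a minimal illustration, not the paper's implementation; the example distribution is made up):

```python
import numpy as np

def top_k_sample(probs, k, rng=None):
    """Sample from the k most probable tokens after renormalizing.

    Tokens outside the top k are discarded, so their probability mass
    is redistributed over the survivors by the renormalization step --
    this is what boosts unreliable candidates when k is too large.
    """
    rng = rng or np.random.default_rng(0)
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[::-1][:k]      # indices of the k largest probabilities
    p = probs[top] / probs[top].sum()      # renormalize over the truncated set
    return int(rng.choice(top, p=p))

# A peaked toy distribution: with k=2, only tokens 0 and 1 can ever be sampled,
# and token 1's probability rises from 0.3 to 0.3/0.8 = 0.375 after renormalization.
dist = [0.5, 0.3, 0.1, 0.05, 0.05]
token = top_k_sample(dist, k=2)
```

Note that the same k=2 applied to a flat distribution would throw away many equally reasonable candidates, which is exactly the failure mode the paper points out.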
Under nucleus sampling, on the other hand, the number of candidates considered rises and falls dynamically, tracking the changes in the model's confidence region over the vocabulary; this is something top-k sampling fails to capture for any one choice of k.
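A minimal sketch of nucleus (top-p) sampling, with the nucleus size returned purely for illustration (the toy distributions are assumptions, not data from the paper):

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, renormalize over that set, and sample from it.
    Returns (sampled_token, nucleus_size)."""
    rng = rng or np.random.default_rng(0)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]            # tokens sorted by descending probability
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # size of the nucleus
    keep = order[:cutoff]
    q = probs[keep] / probs[keep].sum()        # renormalize over the nucleus
    return int(rng.choice(keep, p=q)), cutoff

# In a peaked context the nucleus shrinks to a single token...
peaked = [0.95, 0.02, 0.01, 0.01, 0.01]
# ...while in a flat context the same p keeps every reasonable candidate.
flat = [0.2, 0.2, 0.2, 0.2, 0.2]
```

Here `nucleus_sample(peaked, p=0.9)` keeps only 1 candidate, while `nucleus_sample(flat, p=0.9)` keeps all 5, which is the dynamic behavior a fixed k cannot reproduce.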
Another common approach, sampling with temperature (the logits are divided by a temperature t before the softmax is re-computed), is shown to improve generation quality when t < 1, but it comes at the cost of decreased diversity.
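The temperature rescaling can be sketched in a few lines (the example logits are made up):

```python
import numpy as np

def temperature_softmax(logits, t=1.0):
    """Re-compute the softmax after dividing the logits by temperature t.
    t < 1 sharpens the distribution (higher quality, lower diversity);
    t > 1 flattens it."""
    z = np.asarray(logits, dtype=float) / t
    z -= z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.0]
sharp = temperature_softmax(logits, t=0.5)  # mass concentrates on the argmax token
flat = temperature_softmax(logits, t=2.0)   # mass spreads toward uniform
```

Lowering t moves probability mass onto the most likely token, which is why quality improves while diversity drops.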
Decoding strategies that optimize for output with high probability, such as beam search, lead to text that is incredibly degenerate, even when using models such as GPT-2 Large. This may seem counter-intuitive, as one would expect that good models would assign higher probability to more human-like text. Indeed, language models do generally assign high scores to well-formed text, yet the highest scores for longer texts are often generic, repetitive, and awkward.
This figure shows how different the distributions of probabilities assigned to beam-search-decoded text and to naturally occurring text are.
Natural text has high variance, whereas text generated by beam search or greedy decoding is highly repetitive and lacks diversity. This shows that natural language is not based on probability maximization; rather, it tends to be surprising.
GPT-2 Large was used to carry out all of the experiments.
A small subset of the WebText test set was used for conditional generation with the various decoding methods. Each sample text taken from the set was randomly truncated to a length of 1-40 words, and then a maximum of 200 tokens was generated from that prefix using each decoding method.
These generations were evaluated using various metrics such as Zipf coefficient, Repetition %, and Self-BLEU4. The results are as follows: