Last week, I came across a paper called [“Large Language Models Can Self-Improve”](https://arxiv.org/abs/2210.11610) and decided to implement it. It turns out I was too naive to realize that there was no way I could have replicated even a fraction of its results given my limited budget, compute, and access to language models. Although my attempt failed, it was incredibly fruitful. I’ve spent ~7 weeks on ML so far, and I am planning to spend more time diving deeper into language models. More and more, I have come to realize not only the potential of this field but also the potential for everyone to become a self-taught ML legend. It’s difficult, but it’s far from impossible.

In this blog post, I will share the takeaways from my failed attempt to implement a self-improving language model, in layman’s terms. I hope my humble journey inspires more people to overcome their fears and embark on their own incredible autodidact journeys.

Glossary

LLM: large language model. LLMs are pre-trained on large corpora of text and can perform text-based tasks like generation and prediction across a wide range of topics.

Supervised vs. Unsupervised Learning: Supervised learning uses labeled datasets, whereas unsupervised learning uses unlabeled datasets. Unsupervised learning discovers hidden patterns in data without human intervention.

Token: In NLP, tokens are the smaller subunits that text is broken into. These subunits can be sentences, words, or parts of a word.

Completions, Inferences, Generations: In this piece of writing, these terms all refer to the text that a language model generates and returns in response to a prompt or other input.

The Problem: LLMs are not the best reasoners

LLMs can surprise you with their ability to converse in Shakespearean language or engage in philosophical conversations, but they are not the best reasoners yet. Even the powerful GPT-3 cannot consistently solve second-grade math problems.

Q: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

A (GPT-3 Davinci): In April and May, Natalia sold a total of 96 clips.
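If you want to poke at this yourself, here is roughly how a completion like the one above can be sampled, sketched with the OpenAI Python SDK as it existed at the time (the legacy `Completion` endpoint). The model name and sampling parameters here are my assumptions, not anything prescribed by the paper.

```python
import openai  # OpenAI Python SDK (the pre-1.0, legacy Completions API)

openai.api_key = "sk-..."  # placeholder; use your own API key

# Model name and sampling parameters are illustrative assumptions.
response = openai.Completion.create(
    model="text-davinci-002",
    prompt=(
        "Q: Natalia sold clips to 48 of her friends in April, and then "
        "she sold half as many clips in May. How many clips did Natalia "
        "sell altogether in April and May?\nA:"
    ),
    max_tokens=64,
    temperature=0.7,
)
print(response["choices"][0]["text"].strip())
```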

GPT-2 has even less of a clue about how to reason.

A (GPT-2 XL): Natalia sold 48 clips to her friends in April. She sold 48 - 5 = 34 clips in May. In total she sold 54 clips. The answer is 34.

Improving LLMs’ ability to perform logic tasks is difficult. We cannot let LLMs take a creative approach to math problems that have only one correct solution, so we must train the model with labeled data. However, clean, labeled data is limited and costly to obtain. Famous logic & reasoning datasets like OpenBookQA and GSM8K only have a few thousand labeled question-answer pairs each. This is very limited compared to the tremendous body of text all over the internet that we can use for unsupervised training. Additionally, it’s costly to write and solve a diverse set of logic questions for the sake of creating labeled data. There is also no simple way to capture the data we naturally generate in day-to-day academic activities, as we can with Tweets or Wikipedia. [Side note: you’d think online grading platforms like Gradescope would be well positioned for this, but as far as I know they don’t have the computer vision technology required to capture data with high accuracy.]

“Large Language Models Can Self-Improve”

This paper by Google Research is exciting because it shows that an LLM can improve its performance on logic tasks using unlabeled datasets alone. Using the “Chain of Thought” prompting technique and “self-consistency,” LLMs can answer logic problems correctly most of the time. We can then pair the questions from the unlabeled datasets with the answers generated by the LLM to form training data for supervised learning. The paper shows that this approach leads to state-of-the-art performance on logic & reasoning tasks.
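To make that recipe concrete, here is a minimal sketch of the data-generation loop in Python (both techniques it relies on are unpacked just below). `sample_cot_answers` is a hypothetical helper standing in for the actual model calls; the paper’s real pipeline, built on a 540B-parameter PaLM model, is far more involved.

```python
from collections import Counter

def build_self_training_set(questions, n_samples=32):
    """Pair unlabeled questions with the model's own majority-voted answers.

    A minimal sketch of the paper's recipe: sample many Chain-of-Thought
    completions per question, keep the most frequent final answer, and
    save the reasoning paths that agree with it as fine-tuning examples.
    """
    examples = []
    for question in questions:
        # sample_cot_answers is a hypothetical helper that prompts the
        # LLM with CoT exemplars and returns (reasoning, answer) pairs.
        samples = sample_cot_answers(question, n=n_samples)
        majority_answer = Counter(a for _, a in samples).most_common(1)[0][0]
        # Keep only the reasoning paths that reached the majority answer,
        # i.e. the "high-confidence" paths used for supervised fine-tuning.
        examples.extend(
            {"question": question, "reasoning": r, "answer": a}
            for r, a in samples
            if a == majority_answer
        )
    return examples
```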

Chain of Thought (CoT) prompting improves LLMs’ performance by prompting the model to decompose multi-step logic problems into intermediate reasoning steps. Self-consistency (alternatively called “majority voting”) asks the LLM to generate many answers to the same prompt and picks the answer that appears most often.
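Here is what the two ideas look like in practice, as a small sketch: the prompt embeds a worked exemplar (the classic tennis-ball example from the CoT paper cited below) so the model imitates the step-by-step format, and majority voting is just a counter over the final answers parsed from several sampled completions. The answer list here is made up for illustration.

```python
from collections import Counter

# A few-shot Chain-of-Thought prompt: the worked exemplar shows the
# model *how* to reason step by step before answering the real question.
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each
can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6
tennis balls. 5 + 6 = 11. The answer is 11.

Q: Natalia sold clips to 48 of her friends in April, and then she sold
half as many clips in May. How many clips did Natalia sell altogether
in April and May?
A:"""

def majority_vote(final_answers):
    """Self-consistency: keep the final answer that appears most often."""
    return Counter(final_answers).most_common(1)[0][0]

# Hypothetical final answers parsed from 8 completions sampled with
# temperature > 0; individual paths can be wrong, the vote usually isn't.
sampled = ["72", "72", "96", "72", "24", "72", "72", "96"]
print(majority_vote(sampled))  # -> "72" (48 in April + 24 in May)
```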

Image source: https://arxiv.org/pdf/2201.11903.pdf