A small model at the beginning of big changes.

Today, we’re releasing our first pretrained language models at the Allen Institute for AI (AI2): a set of 7 billion parameter models and one 1 billion parameter variant. This line of work was probably the main reason I joined AI2, and it is the biggest lever I see for enacting meaningful change in how AI is used, studied, and discussed in the short term. The big-picture goal of my part of this work, and really all of my writing, is to try to make sure the right voices are heard in this process. This blog is my more polemic, passionate take on the process, and it includes links to many more documents and resources to learn more and get building.

To some, Open Language Model (OLMo) 7B will look like just another 7 billion parameter model, similar to Mistral and Llama. On many axes of the AI discourse in 2023, OLMo is very similar to these models: it is available for direct download, it can be fine-tuned easily on consumer hardware, it offers a broad base of capabilities, and other things we are used to hearing. Yet to many, OLMo will represent a new type of LLM, enabling new approaches to ML research and deployment, because on a key axis of openness, OLMo represents something entirely different. OLMo is built so that scientists can develop research directions at every point in the development process and execute on them, which was previously not possible due to incomplete information and tools. Depending on the evaluation methods, OLMo 1 is either the best 7 billion parameter model available for download, or one of the best.

Key points and links:


My mental tracking of this story is pinned to a Tweet from a vocal voice in the open-source ML discussion, Stella Biderman:

[Screenshot of the referenced tweet from Stella Biderman]

If I wanted to contribute to the narrative here, I needed to be at an organization willing to add its name to the list. With all the discussion around open models and all the good PR they bring companies these days, the short length of this list shows how hard it is to commit to the values needed to bring these artifacts to the light of day.

This is a landscape where models have been leaked multiple times and where organizations releasing strong open models face real pressure from multiple government organizations. At a practical level, getting OLMo out before Llama 3 and the next Mistral models gives everyone time to catch up on what it means to be a truly open-source model.

OLMo represents the first time in a while (maybe since GPT-2) that a state-of-the-art language model is fully transparent and open. While some communities may advocate for different behaviors, the release of the OLMo family is the first time that many areas of study are empowered to support a more well-rounded discussion of the potential harms and benefits of LLMs. Many language models have come close, and are perceived as open by the general public, such as Llama 2 and Mistral, but they do not provide access to certain types of work needed to make clear arguments about the potential risks.

For example, neither Mistral nor Llama discloses the data used at the pretraining or preference fine-tuning stages of development. The withholding of pretraining data is largely accepted to be due to ongoing litigation, in multiple judicial venues, over the copyrighted dataset Books3. In the OLMo family, we have the ability to easily add this data to our formula and quickly understand its potential impact by sharing model performance without sharing the license-violating model itself. This informs policymakers on the value of this work to the parties seeking compensation for their materials, and on the importance of similar data to scientists training other models. Ultimately, OLMo helps unblock scientists who wish to study details like this but cannot, because lawyers and potential liability prevent access to valuable resources.

Realistically speaking, pretraining is the pipeline stage where curious researchers will benefit most from OLMo. OLMo was trained on Dolma, a dataset released openly by AI2 in 2023. Access to the pretraining data enables research on important new capabilities like attribution and on methodological challenges like identifying test set contamination. Openness thrives in multiplicity, so future models trained on the Dolma dataset by others would enable more controlled comparisons between models than are currently available.
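To make the contamination point concrete, here is a minimal sketch of the kind of check that open pretraining data enables: measuring what fraction of a benchmark's test examples share long word n-grams with the training corpus. This is an illustrative toy, not AI2's actual decontamination tooling, and the corpus and test strings are placeholders rather than real Dolma data.

```python
# Toy n-gram overlap check for test set contamination.
# With an open corpus like Dolma, anyone can run analyses of this shape;
# with closed pretraining data, they are impossible from the outside.

def ngrams(text, n=8):
    """Return the set of word-level n-grams in a string."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(test_examples, corpus_docs, n=8):
    """Fraction of test examples sharing at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    hits = sum(1 for ex in test_examples if ngrams(ex, n) & corpus_grams)
    return hits / len(test_examples) if test_examples else 0.0
```

Real decontamination pipelines are fancier (normalization, hashing for scale, fuzzy matching), but the core question they answer is the same as this sketch's.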