The acceleration in Artificial Intelligence (AI) and Natural Language Processing (NLP) will have a fundamental impact on society, as these technologies are at the core of the tools we use on a daily basis. In NLP, a considerable part of this acceleration currently stems from training increasingly larger language models on increasingly larger quantities of text.
Unfortunately, the resources necessary to create the best-performing models are found mainly in the hands of large technology companies. This stranglehold on a transformative technology poses problems from research-advancement, environmental, ethical, and societal perspectives.
For example, while recent models such as GPT-3 (from OpenAI / Microsoft) show interesting behavior from a research point of view, such models are private and not accessible to many academic organizations. Moreover, even when accessible, these tools have not been designed as research artifacts: for instance, they do not provide access to the training dataset or checkpoints, which makes it impossible to answer many important research questions around these models (capabilities, limitations, potential improvements, bias, ethics, environmental impact, general AI/cognitive research landscape). The current situation also multiplies energy requirements and environmental costs, because large models are trained repeatedly in separate private settings. Finally, these models are usually Anglocentric, and the text corpora used to train them have shortcomings, ranging from non-representativeness of populations to a predominance of potentially harmful stereotypes or the inclusion of personally identifying information.
The BigScience project aims to demonstrate another way of creating, studying, and sharing large language models and large research artifacts in general within the AI/NLP research communities.
This project takes inspiration from scientific creation schemes existing in other scientific fields, such as CERN and the LHC in particle physics, in which open scientific collaborations facilitate the creation of large-scale artifacts that are useful for the entire research community.
Gathering a much larger research community around the creation of these artifacts makes it possible to consider the many research questions surrounding large language models in advance (capabilities, limitations, potential improvements, bias, ethics, environmental impact, general AI/cognitive research landscape). It is then worthwhile to use the created artifacts, discussions, and tools to answer as many of these questions as possible, and to foster dialogue around critical aspects of the field of study.
The BigScience open-science project is intended as a proposal for an international and inclusive way of performing collaborative research. Beyond the research artifacts created and shared, this project thus aims to bring together all the skills, conditions, and lessons needed to enable such future experiments in large-scale scientific collaboration.
The founding members therefore deeply believe that the success of the project will ultimately be measured by its long-term impact on the fields of NLP and AI, as a proposal for an alternative way to conduct large-scale science projects.
The collaboration is organized as a One-Year Workshop on Large Language Models for Research: the “Summer of Language Models 21 🌸”.
The workshop will:
This workshop will foster discussions and reflections around the research questions surrounding large language models (capabilities, limitations, potential improvements, bias, ethics, environmental impact, role in the general AI/cognitive research landscape) as well as the challenges around creating and sharing such models and datasets for research purposes and among the research community.
The collaborative tasks are quite large, since they involve several million GPU hours on a supercomputer.
If successful, this workshop could be repeated in the future with an updated or different set of collaborative tasks.
The outcomes of the workshop are expected to be: