The goal of this project is to collect and document data sources for the BigScience dataset. We are gathering a wide variety of resources that represent different kinds of language use: different regions 🌏🌍🌎, different contexts 🏫🏥🏠, and different audiences 🥼🍿📰. In addition to traditional web sources, we are specifically looking for a variety of data types and formats, such as books and formal publications, audio formats including radio and podcasts, and others. We’re working with the following languages for the training dataset: African languages of the Niger-Congo family (including Swahili and other Bantu languages), Arabic, Basque, Catalan, Chinese, English, French, Indic languages (including Bengali, Hindi, and Urdu), Indonesian, Portuguese, Spanish, and Vietnamese.

Data Catalogue Language Sprints

Several sprints were held to source languages in specific regions, in collaboration with language, NLP, and machine learning communities across the globe. In particular, we held data sprints covering:

👉 Asian languages, in collaboration with Machine Learning Tokyo

👉 African languages, in collaboration with Masakhane

👉 English sources from the Indo-Pacific region

👉 Latin America