This page gathers links and references to all the artefacts created during the one-year BigScience workshop. It provides access to, and information on, the pretrained models, checkpoints, and datasets, as well as (depending on the working group) papers, code, tools, etc.


Model: 13B English decoder model

Data: Prompting dataset

Model: T0

Paper: Multitask Prompted Training Enables Zero-Shot Task Generalization (2021)

Paper: Masader: Metadata Sourcing for Arabic Text and Speech Data Resources (2021)

Data: Masader

Data: BigScience Data Catalogue

Paper: Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP (2021)

Paper / Tool: LMdiff: A Visual Diff Tool to Compare Language Models