In our upcoming paper, we use a children's picture book to explain how bizarre it is that ML researchers claim to measure "general" model capabilities with data benchmarks: artifacts that are inherently specific, contextualized, and finite.

https://arxiv.org/pdf/2111.15366.pdf