
The waves never stop; photo from Unsplash contributor Photoholgic.

This post will be a short review of the blog so far. There are two goals here: to provide a quick catch-up for anyone who’s new to the blog, and to give me a chance to reflect on how these ideas have held up against a barrage of AI and tech-related product releases, research outputs, and other news.

The Paradox of Reuse, Language Models Edition (Dec 1, 2022)

Summary

In this post, I discuss the concern that language model applications like ChatGPT could erode their own foundations. Platforms like StackExchange and Wikipedia provide infrastructure and incentives for users to participate in the creation and sharing of knowledge. These platforms need traffic and users, which creates a key concern: if generative AI systems like ChatGPT (which rely on StackExchange and Wikipedia for training data!) are good enough that users replace their StackExchange and Wikipedia visits with a ChatGPT conversation, could GPT-4 hinder our ability to train GPT-5?

How the Key Points Hold Up

On March 17th, we got to see some early evidence for this effect. In the linked Tweet, Dominik Gutt describes preliminary results showing a negative effect from LLMs on Q&A activity.

About a week later, a similar point was made by an authoritative source: Peter Nixey, a top 2% StackOverflow contributor, highlighted the concern that LLMs may prevent users like him from contributing to SO, warning that “When it comes time to train GPTx it risks drinking from a dry riverbed.”

Finally, on April 17th, StackOverflow’s CEO wrote a blog post discussing generative AI. While the post was controversial in the community for alluding to integrating generative AI into the platform, I was excited to see direct references to the importance of SO training data and the potential tragedy of the commons at play here: “AI is built on our collective knowledge, and we must all participate in building its future.”

What’s next

The core argument in this piece (and the similar arguments linked above) relies on assumptions about how people will use LLMs and online platforms. It’s certainly possible to imagine scenarios in which LLMs benefit users and online platforms (a point to which we’ll return shortly!). For instance, if LLMs primarily answer what would be duplicate questions, reducing the need for humans to flag these questions and freeing up more time to answer interesting questions, this could be great (though I think it’s unlikely without substantial effort).

One direction for future work is to use some combination of agent-based modeling and continued empirical investigation to specify the conditions necessary for positive sum outcomes. I’ll definitely be keeping an eye out for more empirical work in this space.
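To make the agent-based modeling direction concrete, here is a minimal toy sketch of the dynamic described above. It is purely illustrative: every parameter (number of users, contribution rates, corpus decay) is a made-up assumption, not calibrated to any real platform, and it stands in for the kind of model one might build, not any model from the post.

```python
import random

def simulate(steps=50, llm_adoption=0.6, seed=0):
    """Toy agent-based sketch of the paradox of reuse.

    Each step, every user either asks on the Q&A platform (adding an
    answer to the public corpus) or asks an LLM (adding nothing public).
    Heavy LLM use starves the corpus that future models would train on.
    All parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    corpus = 1000.0  # public answers available as future training data
    history = []
    for _ in range(steps):
        new_answers = 0
        for _ in range(100):  # 100 users act per step
            if rng.random() >= llm_adoption:
                new_answers += 1  # user contributes on the platform
            # else: LLM answers privately; no public contribution
        # contributions minus slow decay (outdated/deleted content)
        corpus += new_answers - 0.02 * corpus
        history.append(corpus)
    return history

# Higher LLM adoption leaves a smaller public corpus over time.
low_adoption_corpus = simulate(llm_adoption=0.1)[-1]
high_adoption_corpus = simulate(llm_adoption=0.9)[-1]
```

Even this crude setup shows why the conditions matter: whether outcomes are positive-sum depends on whether LLM use substitutes for contributions (as modeled here) or complements them, which is exactly what empirical work needs to pin down.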

ChatGPT is Awesome and Scary: You Deserve Credit for the Good Parts (and Might Help Fix the Bad Parts) (Dec 4, 2022)

Summary

You and almost everyone you know probably helped build the new wave of generative AI technologies like ChatGPT. This post provides an overview of all the specific details we know about past GPT training data sources, and how we can use that to engage in some educated guesswork regarding the data underlying ChatGPT, GPT-4, Bing chat, and more.

How the Key Points Hold Up

The public is still mostly in the dark regarding specific ChatGPT training details. However, the sources highlighted in the original post still hold up; I think this is pretty close to the best guess we can make right now.

OpenAI’s stance on sharing information about training data suggests it may be hard to do this kind of data documentation going forward. I do think we can still learn a lot about ChatGPT from studying more transparent models and datasets like LLaMA and The Pile — I’d be surprised if there are massive deviations in pre-training data collection strategies.