The case for AI becoming autonomous and stubborn: untangling two fundamental dynamics of AI progress
Epistemic status
I have roughly 80% confidence that the world is moving in this direction. Many of my key uncertainties were resolved after reading Nate Soares's essay “Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense” in summer 2024, and I've been processing those insights since then.
That said, there are a few assumptions, uncertainties, and counterarguments that could invalidate this; you can read them in the bullets below. This isn't a comprehensive list, just an exercise to expand my thinking and nudge myself out of confirmation bias and similar dynamics.
- The piece underestimates how effective international coordination can be when facing common technological risks (e.g., the COVID-19 response).
- There may be highly effective safety systems with built-in kill switches and oversight mechanisms. With robust multinational cooperation, these could be integrated into internet or AI protocols (see the sketch after this list).
- The piece frames alignment as nearly impossible, when in fact promising technical approaches are emerging.
- It’s a long shot to imagine AI outputs reaching the real world completely unsupervised, let alone showing up in vulnerable technologies like synthetic biology, nanotech, weapons, or social persuasion.
- Historically, technologies like electricity, antibiotics, and the internet had their beneficial applications developed first; AI may well follow the same pattern.
- The framing of “primary goal vs. entire space” creates a false dichotomy: we don’t need perfect alignment, just sufficient guardrails.
- The piece assumes AI will inevitably overcome challenges through “trial and error,” but many domains contain provably hard obstacles that persistence alone cannot overcome.
- The piece may overstate how easily the digital-physical boundary can be crossed. In reality, this remains incredibly difficult.
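To make the kill-switch bullet concrete, here is a minimal sketch of what such an oversight mechanism might look like, assuming a hypothetical agent loop. The names (KillSwitch, run_agent_step, overseer_approves) are my own illustrations, not an existing protocol or API.

```python
# Hypothetical sketch: every agent action must pass a revocable kill switch
# and an overseer check before it executes. Illustrative only; not a real
# protocol or library.

class KillSwitch:
    """A permission token that an operator can revoke at any time."""

    def __init__(self) -> None:
        self._enabled = True

    def disable(self) -> None:
        self._enabled = False

    @property
    def enabled(self) -> bool:
        return self._enabled


def run_agent_step(propose_action, execute_action, kill_switch, overseer_approves):
    """Run one agent step only if the kill switch is on and the overseer
    approves the proposed action; otherwise the action never executes."""
    if not kill_switch.enabled:
        raise RuntimeError("kill switch engaged: agent halted")
    action = propose_action()
    if not overseer_approves(action):
        return None  # vetoed: the action is discarded, never executed
    return execute_action(action)


# Usage example with trivial stand-ins:
ks = KillSwitch()
result = run_agent_step(
    propose_action=lambda: "send_email",
    execute_action=lambda a: f"executed {a}",
    kill_switch=ks,
    overseer_approves=lambda a: a != "send_email",  # veto this action
)
assert result is None  # the overseer vetoed the proposed action
ks.disable()  # flipping the switch halts all future steps
```

The design point of the sketch is that the gate sits outside the agent: the agent only proposes, and only the wrapper holding the switch can execute.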