
"Update re 'spontaneous instantiation'"

"April 07, 2025" • 1 min read

Update re 'spontaneous instantiation'

Apr 07, 2025


I recently wrote a story about how scheming / deceptive alignment might arise. I basically drew from Kokotajlo’s story¹ that a model might de-emphasize concepts that interfere with effectiveness, trade off one Spec trait (e.g. “honest”) for another (e.g. “helpful”), learn instrumental goals it comes to treat as terminal, and so on.

My affinity for Kokotajlo’s story stemmed from not believing in ‘spontaneous instantiation’ — that is, from believing that everything has a cause, that ‘it’s not possible that something can come from nothing’. I didn’t understand how a model could spontaneously instantiate anti-human goals unless it was responding to some aspect of its programming or training, i.e., inputs we provide it.

Y pointed out that this doesn’t necessarily apply so strongly to AI models. I should’ve remembered this from the induction heads paper²: sometimes ‘grokking’ occurs very rapidly.³ These ‘phase transitions’ can look a lot like something coming from nothing (although we may be able to reverse-engineer what happened and why, such that the causality seems obvious in retrospect). I should expect to see more apparent ‘spontaneous instantiation’ going forward.

¹ Forecasting AI Goals, Kokotajlo

² In-context Learning and Induction Heads, Olsson et al.

³ Future ML Systems Will Be Qualitatively Different, Steinhardt



Originally published on Substack