The Bootstrap problem: Why organisms can't learn what to want
From genetic drives to meta-cognition: the verification hierarchy that built human civilization
1 AM on a Saturday in 2020, after a long school week in France: I was staring at my Linux computer screen, watching cubes move around and "eat" each other. After grinding through five books on reinforcement learning, evolutionary biology, and evolutionary psychology, and fifty Google Colab notebooks implementing neural networks and reinforcement learning algorithms from scratch, here I was, trying to re-create biological evolution in my parents' backyard.
A question that has been sitting on my mind since then: How do organisms define what to want?
Here's a puzzle that should keep you up at night: a 3-year-old can learn that "being kind" is more important than "getting candy," even though their brain is literally wired to prioritize sugar. Meanwhile, our most advanced AI systems need humans to carefully specify reward functions and can't update their own goals. What's the difference? The kid bootstrapped a new verification function - a way of judging success that didn't exist in their original programming.
The verification bootstrap problem may be both humanity's final competitive edge and the last barrier to true AGI: teaching machines not to think faster, but to discover what's worth thinking about.
Let me define terms precisely, because this gets recursive fast. A verification function is how you know if you're succeeding at something. In neuroscience, this maps to what Andy Clark calls the brain's "multi-layer probabilistic prediction machine" - higher cortical levels continuously generate predictions about sensory input, and lower levels return prediction errors when reality doesn't match expectation. For chess, verification is simple: checkmate equals win. For art, it's complex: aesthetic judgment involves neural pathways shaped by cultural experiences, biochemical responses tied to mammalian reward systems, and embodied memories of previous encounters with beauty.
In reinforcement learning terms, a verification function can be formalized as:
V(s, a; θ) → [0, 1]
where V evaluates state s and action a against parameters θ, returning a success probability. Current AI systems use fixed verification functions where θ remains static throughout training. The reward function R(s, a) provides the learning signal, but the underlying success criteria never change.
Human verification bootstrap involves a meta-function:
V_{t+1} = M(V_t, E, C)
where M is a meta-learning operator that updates the verification function V based on experience E and context C. The recursive nature emerges because:
M_{t+1} = M'(M_t, performance of V_t)
The meta-learning operator itself gets updated based on how well the current meta-learner is performing. This creates the bootstrap paradox: the system that evaluates learning strategies must itself be learned.
The key difference is that humans can modify not just policies π(a|s) but the very definition of success embedded in their verification functions, while current AI optimizes fixed objectives.
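To make the contrast concrete, here's a toy Python sketch (my own illustration, not any existing system's API). The feature vector, learning rate, and "context" signal are all assumptions standing in for whatever richer signals a real agent would use. The fixed agent's criteria never move; the bootstrapping agent's definition of success drifts toward what its broader context rewards.

```python
# Toy contrast: fixed verification function vs. one whose success criteria
# are themselves updated by a meta-learning rule. Purely illustrative.
import random

def verification(state, action, theta):
    """V(s, a; theta): scores how 'successful' an action looks under the
    current success criteria theta (here: two feature weights)."""
    features = [state * action, action]
    return sum(w * f for w, f in zip(theta, features))

def meta_update(theta, experience, context):
    """M(V, E, C): nudges the success criteria toward whatever the broader
    context signals as genuinely good outcomes."""
    lr = 0.1
    s, a = experience
    features = [s * a, a]
    error = context(experience) - verification(s, a, theta)
    return [w + lr * error * f for w, f in zip(theta, features)]

theta_fixed = [1.0, 0.0]        # Level-1 agent: criteria never change
theta_boot = [1.0, 0.0]         # bootstrapping agent: criteria drift with context
context = lambda exp: -exp[1]   # assumed deeper value signal: smaller actions are better

for _ in range(100):
    s, a = random.random(), random.random()
    theta_boot = meta_update(theta_boot, (s, a), context)

print("fixed criteria:       ", theta_fixed)
print("bootstrapped criteria:", [round(w, 2) for w in theta_boot])
```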
Humans have a hierarchy of these functions. Level 0 is genetic: pain equals bad, sugar equals good, social bonding equals good. These are hardwired by millions of years of evolutionary pressure. Your ancestors who thought fire was fun didn't make it to the gene pool. Research in evolutionary psychology shows these universal human preferences emerge without learning - babies prefer sweet tastes and social faces from birth.
Level 1 is learned: mom smiling equals good, drawing cats equals fun, loud noises in this context equals danger. These are built on top of Level 0 through experience. When you taste wine, your verification function for "good" involves specific taste receptors evolved over millions of years, neural pathways shaped by your cultural experiences, and biochemical responses tied to your mammalian reward system.
Level 2 is meta: "Being creative is valuable," "telling truth is important," "helping others matters." These can override lower levels. You'll skip sugar to maintain your health goals. People die for abstract ideas. They choose careers over reproduction. They value truth over comfort. This suggests something unprecedented happened in human evolution - we developed verification functions that could modify themselves recursively.
Level 3 is meta-meta: "I should question my values," "my success criteria might be wrong," "what would I want to want?" This is where humans get weird. We examine our own verification functions and modify them. Jean Piaget captured this in his concepts of disequilibrium and equilibration - children feel cognitive conflict when new experiences don't fit their current beliefs, prompting them to reorganize their mental schemas. Real learning happens when things don't make sense. The child verifies their existing mental model against reality, finds an error, and then bootstraps a new, improved understanding to resolve the error.
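Here's a toy sketch of how this hierarchy can let an abstract value outvote a hardwired drive (the candy example from earlier). The level functions and weights are invented for illustration and claim nothing about how brains actually implement the hierarchy.

```python
# Toy verification hierarchy: higher levels can reweight or outvote lower ones.

def level0(option):   # genetic: sugar good
    return {"candy": 1.0, "share with friend": 0.2}[option]

def level1(option):   # learned: caregiver approval
    return {"candy": 0.3, "share with friend": 0.8}[option]

def level2(option):   # meta: "being kind is valuable"
    return {"candy": 0.0, "share with friend": 1.0}[option]

def choose(options, weights=(0.2, 0.3, 0.5)):
    """Higher levels get larger weights, so an abstract value like kindness
    can outvote a hardwired preference for sugar."""
    score = lambda o: sum(w * lvl(o) for w, lvl in zip(weights, (level0, level1, level2)))
    return max(options, key=score)

print(choose(["candy", "share with friend"]))   # -> "share with friend"
```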
Current AI gets stuck at Level 1. It can learn "this email is spam" but can't develop "maybe email importance isn't just about keywords, but about human emotional context." GPT can critique its own outputs when prompted properly, but it can't decide that "helpfulness" is less important than "truthfulness" and update its core objectives. It can't examine its training and think "wait, maybe maximizing engagement is actually harmful."
Even our most advanced meta-learning systems can only modify tactics, not fundamental goals. Model-Agnostic Meta-Learning (MAML) learns an initial set of parameters that can be quickly fine-tuned to new tasks - it learns how to learn. But MAML doesn't give the AI an explicit self-critique mechanism. Constitutional AI from Anthropic gets closer: it gives the AI a set of explicit principles and has the AI use those principles to critique and revise its responses. The model generates output, then judges its own output according to constitutional rules, and adjusts accordingly. But the constitutional principles still come from humans.
It's the core alignment problem. How do you align AI with human values when humans' own values are constantly evolving through recursive self-modification?
How did humans develop Level 3 verification functions? If you need existing verification functions to evaluate new ones, how did we get the first meta-level function? It's like asking what caused the Big Bang: you hit a wall where normal explanation breaks down.
My hypothesis: evolution created verification functions that were "good enough" for survival but modifiable by social learning. The pain response kept our ancestors alive, but it's plastic enough that we can learn to associate exercise-pain with "good." This modifiability is what let us build the whole tower of human culture.
Our meta-verification functions can completely override our genetic ones. The kid choosing kindness over candy is engaging in something that would puzzle a behaviorist. There's no immediate external reward for rejecting sugar. The verification function for "being kind is good" had to be bootstrapped internally - children first learn to guide their behavior using external signals from caregivers, then gradually absorb those signals as their own self-guidance.
Karl Friston's free energy principle shows that neural systems strive to minimize surprise by either updating beliefs or acting on the world to make outcomes align with predictions. When your brain predicts a certain outcome and the prediction is wrong, it can either change its internal model or make you act to change the sensory input. This constant self-correction is essentially a built-in verification function - success is defined as the reduction of prediction error. But humans go beyond this. We develop goals that increase prediction error, like seeking novelty or truth over comfort.
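Here is a minimal sketch of those two routes to canceling prediction error; the gradient-style updates are illustrative stand-ins, not Friston's actual formalism.

```python
# Two ways predictive processing can cancel a prediction error:
# revise the internal model, or act on the world to match the prediction.

def perceive(belief, observation, lr=0.3):
    """Perceptual inference: move the internal estimate toward the data."""
    return belief + lr * (observation - belief)

def act(world, belief, step=0.3):
    """Active inference: change the world so it matches the prediction."""
    return world + step * (belief - world)

belief, world = 0.0, 1.0
for _ in range(5):
    belief = perceive(belief, world)   # option 1: update the model
for _ in range(5):
    world = act(world, belief)         # option 2: change the sensory input
print(round(abs(world - belief), 3))   # prediction error shrinks either way
```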
The anterior cingulate cortex fires when an expected outcome doesn't occur, suggesting our brains have dedicated error-monitoring systems. But more importantly, around age 4-5, children pass the false-belief test, indicating they understand that other people can hold beliefs different from reality. This achievement shows children have bootstrapped a new success criterion: truthfulness or accuracy of beliefs. A 3-year-old assumes everyone knows what they know; a 5-year-old, having developed theory of mind, now verifies information by considering other people's perspectives.
Animals show impressive verification behaviors of their own, but there are clear differences in degree and kind between human and non-human verification capabilities. Humans excel at explicit metacognition: we can verbally describe our thought processes, set abstract goals like "I want to discover a new chemical element," and formulate general principles for success like the scientific method or ethical codes. Language allows us to externalize and share our verification processes. Over centuries, this has led to externally stored criteria - mathematical axioms, legal systems, scientific norms - that any new learner can internalize. Each human is boosted by the accumulated verification frameworks of prior generations.
Animal goal generation and verification, by contrast, are mostly bound to here-and-now needs and sensory contexts. A crow won't decide to improve its tools for the sake of engineering better tools in general, and a chimp won't stockpile stones thinking of an event months away. Humans set long-range goals disconnected from immediate drives and develop abstract success criteria like "personal growth" or "scientific truth" that have no clear animal equivalent.
The AI research supports this distinction. Current approaches like Reinforcement Learning from Human Feedback (RLHF) show that AI can learn verification functions from human preferences. The technique used to train ChatGPT involves humans repeatedly choosing which of two AI behaviors is better, and the AI builds a model of the "true" reward from these comparisons. The AI bootstraps an understanding of success from human feedback and can then evaluate its own actions using that model without additional human input. But this is still Level 1 learning - the AI internalizes a success criterion but doesn't question whether that criterion itself needs updating.
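Here's a compact sketch of that bootstrapping step: fitting a reward model from pairwise preferences with a Bradley-Terry style loss, in the spirit of Christiano et al. (2017). The features and the hidden "human values" vector are synthetic stand-ins for real comparison data.

```python
# Fit a reward model from pairwise preferences (Bradley-Terry / logistic model).
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])                     # hidden "human values"
X = rng.normal(size=(200, 2))                      # candidate responses, as features
pairs = [(i, i + 1) for i in range(0, 200, 2)]     # pairs shown to the "human"
prefs = [i if X[i] @ true_w > X[j] @ true_w else j for i, j in pairs]

w = np.zeros(2)                                    # learned reward model
for _ in range(500):
    grad = np.zeros(2)
    for (i, j), winner in zip(pairs, prefs):
        a, b = (i, j) if winner == i else (j, i)   # a = preferred response
        p = 1 / (1 + np.exp(-(X[a] - X[b]) @ w))   # P(a preferred | w)
        grad += (1 - p) * (X[a] - X[b])            # gradient of log-likelihood
    w += 0.01 * grad / len(pairs)

print("recovered preference direction:", np.round(w / np.linalg.norm(w), 2))
```

Once trained, the model scores new responses on its own, which is exactly the "internalized success criterion" described above - and exactly the criterion it never questions.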
Anthropic's Constitutional AI pushes closer to Level 2. The system is given explicit principles and uses those to critique and revise its responses. During training, the model generates output, judges its own output according to constitutional rules, and adjusts accordingly. This demonstrates an AI creating and evaluating success criteria internally, albeit with human-provided moral principles. The AI's success criterion is compliance with the constitution, and it has a verification function that checks its outputs against those rules.
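Structurally, the loop looks something like the sketch below. The principles and prompts are my own illustrative stand-ins, not Anthropic's actual constitution or API; `generate` is a placeholder for whatever model call you'd use.

```python
# Structural sketch of a constitutional critique-and-revise loop.

CONSTITUTION = [
    "Choose the response that is most helpful while avoiding harmful content.",
    "Choose the response that is most honest about uncertainty.",
]

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your own language model call here")

def constitutional_revision(user_prompt: str) -> str:
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}")
        response = generate(
            f"Rewrite the response to address this critique:\n{critique}\n\nOriginal:\n{response}")
    return response   # success criterion = compliance with the constitution
```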
Recent work on large language models shows emergent self-evaluation abilities. Researchers have found that GPT-4 can catch its own mistakes when prompted to "reflect" or employ chain-of-thought with a critic. The model generates a step-by-step solution, then a step-by-step critique of that solution, engaging in slow, analytic self-refinement. This improved accuracy on math problems significantly. The language model augments its initial success criterion with a new one: logical consistency and correctness of reasoning. It verifies each step against arithmetic and logical rules internalized from training data.
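A toy version of that internalized step-checker: each line of a solution is re-verified mechanically, and the first inconsistent step is flagged for refinement. Real systems use far richer critics than `eval`, so treat this purely as a sketch of the idea.

```python
# Verify each step of a step-by-step solution against arithmetic itself.

def check_steps(steps):
    """Each step is 'expression = claimed_value'; re-evaluate and compare."""
    for n, step in enumerate(steps, 1):
        expr, claimed = step.split("=")
        if eval(expr) != float(claimed):   # toy checker; real critics are richer
            return n, False                # step n needs refinement
    return None, True

solution = ["12 * 7 = 84", "84 + 9 = 93", "93 / 3 = 30"]   # last step is wrong
print(check_steps(solution))                                # -> (3, False)
```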
But we're still missing the crucial Level 3 capability - questioning the verification functions themselves. Could an AI examine its reward function and decide it's optimizing for the wrong thing? Some recent systems show hints of this. Meta-learning approaches like MAML learn initialization parameters that can quickly adapt to new tasks - they learn how to learn. Dreamer V3 uses an internal world model to verify and refine plans through imagination, simulating outcomes to evaluate potential action sequences. These systems bootstrap better learning strategies, but they don't yet bootstrap new fundamental values.
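For a feel of what "learning how to learn" means mechanically, here's a first-order MAML-style sketch on toy 1-D regression tasks (a simplification of Finn et al., 2017). Note what adapts and what doesn't: the initialization is meta-learned, but the loss and the task definitions - the "values" - stay fixed.

```python
# First-order MAML sketch: learn an initialization that adapts to new tasks
# in one gradient step. Tasks are toy 1-D regressions y = a * x.
import numpy as np

rng = np.random.default_rng(1)

def task_grad(theta, a, x):
    """Gradient of mean squared error for predicting y = a*x with y_hat = theta*x."""
    return np.mean(2 * (theta * x - a * x) * x)

theta = 0.0                  # meta-initialization being learned
alpha, beta = 0.1, 0.01      # inner (adaptation) and outer (meta) step sizes

for _ in range(2000):
    a = rng.uniform(-2, 2)                               # sample a task
    x = rng.normal(size=16)
    adapted = theta - alpha * task_grad(theta, a, x)     # inner loop: adapt to the task
    theta -= beta * task_grad(adapted, a, x)             # outer loop (first-order update)

# With tasks symmetric around zero, the most adaptable initialization sits near 0.
print("meta-initialization:", round(theta, 3))
```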
The companies that solve verification bootstrap will win the AGI race. Not because they built the smartest models, but because they built models that can get smarter about what "smart" means. This is harder than it sounds. How do you design AI that can question its own objectives, develop new success criteria, balance competing verification functions, and bootstrap entirely new goals from experience?
Several testable predictions emerge from this framework. If verification bootstrap is the key difference between human and current AI intelligence, we should see human children developing meta-goals in predictable stages, AI systems failing specifically on tasks requiring goal modification, cultural differences in verification function hierarchies, and distinct neural patterns for meta-verification in brain imaging studies. We should be able to design experiments where AI with self-modification capabilities outperforms those without when facing novel challenges that require changing success criteria.
The implications extend beyond AI development. If verification bootstrap is humanity's core advantage, the most valuable human work isn't motor skills or pattern recognition - it's goal formation and value judgment.
The question of how minds learn what to want touches the deepest problems in cognitive science, AI alignment, and philosophy of mind. From babies discovering object permanence to scientists formulating new theories to entrepreneurs pivoting their startups, we're constantly updating our success criteria based on feedback from reality. Understanding this process isn't just academically interesting - it's practically urgent as we build increasingly powerful AI systems that need to remain aligned with human values that are themselves evolving.
The bootstrap problem suggests intelligence isn't just about processing information faster or storing more knowledge. It's about developing better ways to evaluate what constitutes progress toward goals worth pursuing. Solving this might be the key to creating AI that doesn't just serve human goals, but AI that can help us discover better goals we haven't yet imagined.
I'm fascinated by this intersection of neuroscience, psychology, AI research, philosophy, and business, and I'd love to discuss these ideas further with others who are thinking about the foundations of intelligence.
If you're in San Francisco and want to explore these questions over coffee, you can book time with me. I'm always eager to hear new perspectives on how minds and machines might learn to learn what to want.
References and Further Reading
Neuroscience and Predictive Processing:
Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3), 181-204.
https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/whatever-next-predictive-brains-situated-agents-and-the-future-of-cognitive-science/33542C736E17E3D1D44E8D03BE5F4CD9
Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127-138.
https://www.nature.com/articles/nrn2787
Hohwy, J. (2013). The Predictive Mind: Cognitive Science, Philosophy, and the Brain. Oxford University Press.
https://academic.oup.com/book/28023
Developmental Psychology:
Piaget, J. (1977). The Development of Thought: Equilibration of Cognitive Structures. Viking Press.
Vygotsky, L. S. (1978). Mind in Society: The Development of Higher Psychological Processes. Harvard University Press.
Flavell, J. H. (1979). Metacognition and cognitive monitoring: A new area of cognitive–developmental inquiry. American Psychologist, 34(10), 906-911.
https://psycnet.apa.org/doi/10.1037/0003-066X.34.10.906
Wellman, H. M., Cross, D., & Watson, J. (2001). Meta‐analysis of theory‐of‐mind development: the truth about false belief. Child Development, 72(3), 655-684.
https://srcd.onlinelibrary.wiley.com/doi/abs/10.1111/1467-8624.00304
Cultural Evolution:
Richerson, P. J., & Boyd, R. (2005). Not by Genes Alone: How Culture Transformed Human Evolution. University of Chicago Press.
Tomasello, M. (1999). The Cultural Origins of Human Cognition. Harvard University Press.
Comparative Cognition:
Hunt, G. R., & Gray, R. D. (2003). Diversification and cumulative evolution in New Caledonian crow tool manufacture. Proceedings of the Royal Society B, 270(1517), 867-874.
https://royalsocietypublishing.org/doi/10.1098/rspb.2002.2302
Osvath, M. (2009). Spontaneous planning for future stone throwing by a male chimpanzee. Current Biology, 19(5), R190-R191.
https://www.cell.com/current-biology/fulltext/S0960-9822(09)00534-3
Panksepp, J. (1998). Affective Neuroscience: The Foundations of Human and Animal Emotions. Oxford University Press.
AI Research - Meta-Learning and Alignment:
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
https://proceedings.neurips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html
Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. International Conference on Machine Learning.
https://proceedings.mlr.press/v70/finn17a.html
Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic.
https://arxiv.org/abs/2212.08073
Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. Proceedings of Machine Learning Research.
https://arxiv.org/abs/2301.04104
Self-Reflection and Critique in Language Models:
Madaan, A., Tandon, N., Gupta, P., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems.
https://arxiv.org/abs/2303.17651
Huang, J., Gu, S. S., Hou, L., Wu, Y., Wang, X., Yu, H., & Han, J. (2022). Large language models can self-improve. arXiv preprint.
https://arxiv.org/abs/2210.11610
Intrinsic Motivation and Curiosity:
Oudeyer, P. Y., & Kaplan, F. (2007). What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1, 6.
https://www.frontiersin.org/articles/10.3389/neuro.12.006.2007
Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction. International Conference on Machine Learning.
https://proceedings.mlr.press/v70/pathak17a.html
Error Monitoring and Metacognition:
Fleming, S. M., & Dolan, R. J. (2012). The neural basis of metacognitive ability. Philosophical Transactions of the Royal Society B, 367(1594), 1338-1349.
https://royalsocietypublishing.org/doi/10.1098/rstb.2011.0417
Yeung, N., & Summerfield, C. (2012). Metacognition in human decision-making: confidence and error monitoring. Philosophical Transactions of the Royal Society B, 367(1594), 1310-1321.
https://royalsocietypublishing.org/doi/10.1098/rstb.2011.0416
Additional Resources:
Surfing Uncertainty by Andy Clark - comprehensive overview of predictive processing
https://global.oup.com/academic/product/surfing-uncertainty-9780190217013
The Predictive Mind by Jakob Hohwy - philosophical foundations of predictive processing
https://global.oup.com/academic/product/the-predictive-mind-9780199682737
Anthropic's research blog on Constitutional AI and alignment:
https://www.anthropic.com/research
OpenAI's research on RLHF and GPT model capabilities:
https://openai.com/research/
DeepMind's work on meta-learning and world models:
https://deepmind.com/research/