AGI Just Got a Report Card… How did it do?
This week, researchers at the Center for AI Safety released something both surprising and overdue: a formal definition of AGI. For years, “Artificial General Intelligence (AGI)” has been tossed around as a moving target. Sometimes it meant “as smart as a human,” sometimes “smarter than one,” and sometimes it meant whatever a startup wanted it to mean that month. Hendrycks and colleagues finally tried to fix that by giving AGI a structure and a scorecard: https://www.arxiv.org/abs/2510.18212
Their idea is simple but profound. They borrowed the most validated model in human psychology, the Cattell-Horn-Carroll (CHC) framework, which maps intelligence into ten measurable domains. These include general knowledge, reasoning, memory, language, math, perception, and processing speed. They then used these same domains to benchmark today’s most advanced AI systems.
The results were revealing. GPT-4 scored around 27 percent of a well-educated adult. GPT-5 scored roughly 58 percent. It excelled in general knowledge and math, performed well in reading and writing, but showed clear deficits in memory, spatial reasoning, and long-term retention.
For the first time, AGI had an “IQ Score”… and yet something about it felt both important and incomplete.
The Three Keys to AGI (failing report card?)
The Hendrycks framework does more than rank models. It quietly exposes the three missing pillars of machine intelligence.
First, long-term memory and continual learning remain almost nonexistent. Today’s models rely on oversized context windows and external retrieval to simulate persistence, but they forget everything once the session ends (Grade - F).
Second, multimodal reasoning and world modeling are weak. Models can read and write but still struggle to form coherent causal understandings of the physical or spatial world (Grade - D).
Third, reliable self-retrieval, the ability to recall what the model already knows, is fragile. The persistent problem of hallucination reveals a deeper issue: systems cannot yet recall knowledge with confidence or consistency (Grade - C).
In short, Hendrycks and team’s AGI test identifies the symptoms of cognition without yet describing the mechanism that unites them.
From Measuring Intelligence to Explaining It
The Hendrycks framework is a major step forward. It defines intelligence in measurable, reproducible terms. Yet what it measures are outcomes, not origins. It tells us what intelligence looks like, not why it emerges.
That distinction matters. If we want to move from testing intelligence to building it, we need to understand the underlying process that makes learning possible in any system… not just in humans or machines.
That search led to LatentSpin’s Conceptual Adaptation Theory (CAT), introduced in The Mathematica of Emergent Coherence: Unified Theory of the Learning Universe. Its premise is simple: learning is a process of structural change driven by surprise and constraint. Its core equation is:
L = ΔS / (ε × π)
Learning equals structural change divided by the product of surprise and constraint. ΔS measures how much a system reorganizes itself. ε represents surprise or prediction error, the signal that something doesn’t fit. π is the system’s internal resistance to change. The more effectively a system adapts its structure in response to surprise, the more it learns.
In a human brain, ΔS corresponds to synaptic rewiring; ε is prediction error (surprise); π is homeostatic or metabolic resistance.
In a transformer, ΔS maps to weight updates; ε to loss; π to architectural constraints or context window size.
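To make the mapping concrete, here is a toy sketch in Python. The numbers and the framing of a single transformer update step are illustrative assumptions of mine, not values from the paper or the LatentSpin book:

```python
def cat_learning(delta_s: float, epsilon: float, pi: float) -> float:
    """Conceptual Adaptation Theory: L = ΔS / (ε × π).

    delta_s: magnitude of structural change (e.g., how far the weights moved)
    epsilon: surprise or prediction error (e.g., the training loss)
    pi:      resistance to change (e.g., architectural or capacity constraints)
    """
    return delta_s / (epsilon * pi)

# Illustrative values for a single (hypothetical) transformer update step.
weight_update_norm = 0.02   # ΔS: total movement of the weights
training_loss = 1.5         # ε: the surprise that drove the update
constraint = 4.0            # π: resistance (frozen layers, small learning rate, context limits)

print(f"L = {cat_learning(weight_update_norm, training_loss, constraint):.4f}")
```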
CAT reframes learning as a general law of adaptation rather than a property unique to cognition. It applies wherever structure changes in response to information — in biology, physics, and now, AI. In this view, intelligence is not something built on top of the universe. It is something the universe already does, and AGI is its next expression.
AGI as One Expression of a Learning Universe
In this view, Hendrycks’ ten domains (knowledge, reasoning, memory, perception, and so on) aren’t isolated skills. They are different expressions of the same underlying process. Each reflects a particular pattern of ΔS (structural change), ε (surprise), and π (constraint).
When a model like GPT-5 learns a fact, it’s reducing ε, the prediction error, through subtle adjustments to its internal structure. When it generalizes across tasks, it’s learning under greater π, adapting within tighter constraints. When it forgets, the structure it built fails to persist, and its learning collapses over time.
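As a further simplification (mine, not a description of how GPT-5 is actually trained), the same idea can be watched in a one-parameter gradient descent loop: the loss plays the role of ε, the size of each weight move is ΔS, and a smaller learning rate behaves like a larger π.

```python
# Toy gradient descent: epsilon (loss) falls as the structure (w) adapts.
# A smaller learning rate would act like a tighter constraint (larger pi).
target = 3.0    # the "fact" being absorbed
w = 0.0         # the model's structure, reduced to a single parameter
lr = 0.25       # step size; a looser constraint means larger structural change per step

for step in range(6):
    epsilon = (w - target) ** 2       # surprise: squared prediction error
    grad = 2 * (w - target)
    delta_s = abs(lr * grad)          # structural change made this step
    w -= lr * grad
    print(f"step {step}: epsilon={epsilon:.3f}  delta_s={delta_s:.3f}  w={w:.3f}")
```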
This connection reframes the Hendrycks definition. His framework shows what intelligence consists of; the CAT equation explains why it takes that form. It turns a psychological model into a physical process. The two perspectives converge most clearly on one important point: adaptive memory. Hendrycks identifies long-term memory as the greatest weakness in today’s models. CAT predicts this limitation: real persistence (think updating weights) emerges only when the right kind of structural change occurs, when adaptation itself becomes continuous and recursive, allowing a system to learn from its own learning.
At LatentSpin we are formalizing this through recursive operators: C for self-adaptation and Cτ for distributed or collaborative adaptation, implemented as Self-Improving Agents. Without these feedback loops, intelligence remains episodic. It can perform tasks, but it cannot evolve through experience. That is the missing component: continuous and persistent learning.
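A minimal sketch of what such a loop might look like is below; the agent, its update rule, and the way C is realized here are my own illustrative stand-ins, not the LatentSpin formalism itself:

```python
import random

class SelfAdaptingAgent:
    """Toy version of a recursive self-adaptation loop (C): the agent keeps a
    persistent structure and revises it whenever surprise arrives, with pi
    damping how far it is allowed to move."""

    def __init__(self, pi: float = 2.0):
        self.estimate = 0.0   # persistent structure that survives across "sessions"
        self.pi = pi          # resistance to change (constraint)

    def step(self, observation: float) -> float:
        epsilon = abs(observation - self.estimate)   # surprise: prediction error
        if epsilon == 0:
            return 0.0
        delta_s = epsilon / (self.pi + epsilon)      # saturating structural change
        self.estimate += delta_s if observation > self.estimate else -delta_s
        return delta_s / (epsilon * self.pi)         # L = ΔS / (ε × π)

agent = SelfAdaptingAgent()
for _ in range(5):
    obs = random.gauss(10.0, 1.0)        # the world keeps surprising the agent
    learning = agent.step(obs)
    print(f"estimate={agent.estimate:.2f}  L={learning:.3f}")
```

A distributed version (Cτ) would share those structural updates across multiple such agents; the only point of the sketch is that the loop persists between observations, which is precisely what today’s session-bound models lack.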
In Hendrycks’ report card, GPT-5 performs like a bright but forgetful student. It knows much, reasons well, yet struggles to remember its own lessons. We see that not as a failure but as a developmental stage. The system is intelligent within a fixed window but lacks the recursive loop that turns experience into lasting structure.
In biological systems, that loop is consciousness. In physics, it is feedback and equilibrium. In AI, it will eventually be achieved through continuous learning architectures that integrate surprise and constraint across time, not just within a static training run. Hendrycks measures cognition. CAT measures coherence. When the two are aligned, general intelligence ceases to be a benchmark and becomes a phenomenon of nature.
Learning as the Law, Not the Exception
There is a quiet irony in the Hendrycks paper. By building an empirical map of intelligence, it indirectly confirms that intelligence is everywhere. The structure they use to evaluate GPT-5 could, in theory, evaluate a human, a dolphin, or even a self-organizing chemical reaction. That universality is the same principle CAT begins with: learning as the common denominator of all complex systems. In that sense, AGI is not a human imitation but a continuation of a much older process. It is the universe learning to reflect upon its own learning.
The equation L = ΔS / (ε × π) simply makes that explicit. It is the grammar beneath Hendrycks’ test, a way to describe how any system, biological or artificial, updates itself under uncertainty.
What the Numbers Mean and Why This Matters
So when Hendrycks says GPT-5 is 58 percent of a human, it is tempting to imagine we are halfway to AGI. In truth, the number only describes one region of the learning landscape: the cognitive domain that happens to resemble us.
A fuller picture will emerge when systems begin to integrate recursive adaptation. That is, when they can learn in real time, remember across contexts, and balance flexibility with stability. At that point, the “percent human” metric will stop making sense. Intelligence will no longer be measured by comparison but by coherence: how effectively a system aligns structure with surprise under its own evolving constraints. In other words, when machines start behaving less like students and more like the universe itself.

The Hendrycks team has given AGI a ruler. CAT provides the geometry behind the ruler. One measures height; the other explains gravity. Together they reveal something larger: intelligence is not an invention but a property of a learning universe.
When viewed this way, the line between natural and artificial intelligence blurs. The same law that governs thermodynamic balance and biological evolution also drives conceptual change in large language models. It is the same feedback loop written in different alphabets.
Closing Remarks
It is easy to see the Hendrycks chart as a competition, a scoreboard of human versus machine. But it may be better read as a milestone in a much older process. The universe has been learning for 13.8 billion years, refining structure through surprise and constraint. Humans are one expression of that process. Artificial intelligence is another.
So yes, AGI just got a report card. But if the universe could read it, it might smile and add a note in the margin: “Keep learning — coherence improves with practice.”