Title: François Chollet: How We Get To AGI
Resource URL: https://www.youtube.com/watch?v=5QcCeSsNRks
Publication Date: 2025-07-03
Format Type: Video
Reading Time: 35 minutes
Contributors: François Chollet
Source: Y Combinator (YouTube)
Keywords: Fluid Intelligence; Test-Time Adaptation; Compositional Generalization; Discrete Program Search; Meta-Learning
Job Profiles: Algorithm Engineer; Chief Innovation Officer (CIO); Machine Learning Engineer; Artificial Intelligence Engineer; Chief Technology Officer (CTO)

Synopsis: In this video, François Chollet, co-founder of Ndea, outlines the limitations of scaling deep learning for achieving artificial general intelligence (AGI) and highlights test-time adaptation and compositional reasoning as the key emerging paths toward true fluid intelligence.

Takeaways:
- Scaling model size and training data alone cannot produce general intelligence, because intelligence requires the ability to adapt to the unknown in real time.
- True intelligence is not stored skill but the capacity to generate new solutions when faced with unfamiliar problems.
- Standard AI benchmarks often mislead because they measure memorized performance, not the capacity for adaptive reasoning or information efficiency.
- Fluid intelligence in AI demands test-time adaptation, where models dynamically update based on novel inputs rather than relying on static inference alone.
- Combining deep learning’s intuitive strengths with programmatic reasoning enables AI systems to generalize better and handle novel tasks.

Summary: François Chollet begins by highlighting the unprecedented decline in computational costs—falling by two orders of magnitude every decade since 1940—and explains how this trend catalyzed the deep learning revolution in the 2010s. He recounts how the availability of GPU-based compute and large data sets allowed self-supervised text modeling to flourish, giving rise to a “scaling era” in which expanding model and data size predictably improved benchmark performance. Yet, Chollet argues, this paradigm conflated static, memorized skills with fluid, adaptable intelligence. To distinguish genuine intelligence from rote proficiency, Chollet introduced the Abstraction and Reasoning Corpus (ARC) in 2019, designed to measure a system’s capacity to solve new tasks on the fly rather than regurgitate learned patterns. Despite a 50,000× increase in model scale, performance on this fluid-intelligence benchmark stagnated near zero, revealing that pre-training and static inference alone cannot yield general intelligence. The landscape shifted in 2024 with the advent of test-time adaptation techniques—methods that allow models to alter their own state during inference—and ARC performance suddenly approached human levels, demonstrating bona fide fluid reasoning.

Chollet then probes the essence of intelligence, contrasting a skill-centric view (machines performing predefined tasks) with a process-centric view (machines facing novel situations). He defines intelligence as an efficiency ratio: the ability to leverage past experience or encoded priors to navigate a broad operational domain under uncertainty. He cautions against using conventional benchmarks—crafted to evaluate task-specific knowledge—as proxies for intelligence, since they incentivize optimizations that miss key cognitive abilities. To drive progress toward autonomous invention, Chollet describes ARC 2, which emphasizes compositional generalization by presenting more intricate tasks that resist brute-force pattern matching. Despite modest gains from test-time training, models remain well below human performance. He previews ARC 3, an interactive benchmark launching in 2026, which will assess an agent’s capacity to learn goals, explore novel environments, and solve hundreds of unique reasoning games under strict action budgets.

Finally, Chollet articulates the “kaleidoscope hypothesis”: the world’s apparent novelty arises from recombining a small set of reusable abstractions. He distinguishes two abstraction types—value-centric (continuous pattern recognition) and program-centric (discrete structural reasoning)—and asserts that machines must integrate both. He proposes a meta-learning architecture that blends gradient-based learning for perception with discrete program search guided by learned intuition, continuously refining a library of reusable components. This “programmer-like” system, under development at Ndea, aims to imbue AI with the efficiency and flexibility necessary for independent scientific discovery.

Content:

## The Declining Cost of Compute and the Rise of Deep Learning

Since 1940, the price of computational power has fallen by roughly two orders of magnitude every decade. This relentless decline, far from abating, underlies the modern achievements of artificial intelligence. In the 2010s, the combination of abundant GPU-based compute and massive data sets made deep learning practicable. Through self-supervised text modeling, researchers scaled language models to unprecedented sizes, reliably improving benchmark results under uniform architectures and training regimes—a phenomenon often summarized as “scaling laws.” Many believed that general intelligence would emerge automatically by simply enlarging models and data without altering methodologies.

## Distinguishing Static Skills from Fluid Intelligence

However, benchmarks such as image recognition and language understanding mask a critical distinction: static, task-specific skill versus true fluid intelligence. A machine that excels at memorized tasks may solve thousands of examples it has effectively rehearsed, yet fail instantly when confronted with novel problems. In 2019, François Chollet released the Abstraction and Reasoning Corpus (ARC), an “IQ test” for humans and machines designed to preclude memorization. Despite a 50,000× increase in model scale, leading systems saw only marginal improvements—from 0% to about 10% accuracy—while average human performance exceeds 95%.

## The Shift to Test-Time Adaptation

In 2024, the AI community pivoted to test-time adaptation: techniques enabling models to modify their internal representations during inference. Unlike static pre-training, these methods—test-time training, program synthesis, chain-of-thought prompting—imbue models with genuine fluid reasoning. OpenAI’s o3 model, after fine-tuning on ARC, achieved human-level performance on the benchmark for the first time. Today, every approach that meaningfully addresses ARC employs test-time adaptation, confirming that dynamic inference is essential for fluid intelligence.
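Of these techniques, test-time training is the most direct to illustrate. The following is a minimal sketch, assuming a PyTorch-style setup; the model, task, and hyperparameters are illustrative stand-ins, not the systems referenced in the talk. The idea is simply that a pretrained model is copied and briefly fine-tuned on a novel task's demonstration pairs before predicting on that task's test input, rather than running static inference alone.

```python
# Minimal sketch of test-time training (TTT), one form of test-time adaptation.
# Model, data, and hyperparameters are illustrative assumptions only.
import copy
import torch
import torch.nn as nn

def test_time_train(base_model, demo_inputs, demo_outputs, steps=20, lr=1e-3):
    """Clone the pretrained model and take a few gradient steps on the
    demonstration pairs of a single novel task before predicting."""
    model = copy.deepcopy(base_model)   # leave the base weights untouched
    model.train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(demo_inputs), demo_outputs)
        loss.backward()
        opt.step()
    model.eval()
    return model                        # task-specialized copy

# Toy usage: a tiny regression "task" defined by three demonstration pairs.
torch.manual_seed(0)
base = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 4))  # stand-in for a pretrained model
demos_x = torch.randn(3, 4)
demos_y = demos_x.flip(-1)              # the novel task: reverse each vector
adapted = test_time_train(base, demos_x, demos_y)
test_x = torch.randn(1, 4)
prediction = adapted(test_x)            # static inference would skip the adaptation step entirely
```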
## Rethinking Intelligence: Process Over Output

Chollet emphasizes that intelligence is not the accumulation of skills but the capacity to handle unforeseen situations. He contrasts two definitions: a traditional task-centric view (machines automating economically valuable tasks) and a process-centric view (machines tackling problems without prior preparation). He formalizes intelligence as an efficiency ratio: the operational scope of potential future challenges, characterized by novelty and uncertainty, that a system can handle relative to the information at its disposal, whether past experience or prior knowledge. Consequently, conventional benchmarks—designed in the mold of human exams—are ill-suited to gauge AI intelligence and may mislead research priorities.
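Read as a rough, illustrative formula (the notation below is chosen for exposition and is not quoted from the talk), the definition amounts to:

```latex
% Illustrative rendering of the efficiency ratio described above; the wording
% inside the fraction is an expository assumption, not notation from the talk.
\[
  \text{intelligence} \;\propto\;
  \frac{\text{operational scope covered (novel, uncertain future situations handled)}}
       {\text{information consumed (built-in priors} + \text{past experience)}}
\]
```

On this reading, a system that needs less prior knowledge and less experience to cover the same range of novel situations counts as more intelligent, which is why accumulated skill alone, however broad, is not the measure.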
## Advanced Benchmarks: ARC 2 and ARC 3

To sustain momentum toward general intelligence, Chollet introduced ARC 2 in March 2025. Building on ARC 1’s format, it intensifies compositional reasoning demands, presenting tasks that defy instant pattern matching yet remain accessible to average adults. Static reasoning systems score near zero; only test-time adaptation produces nonzero but still subhuman results. Looking ahead, ARC 3, slated for early 2026, will assess interactive agency by dropping AI agents into novel environments with unknown controls, goals, and rules. Evaluations will hinge on both solution success and action efficiency, with strict action budgets calibrated to human exploration rates.

## The Kaleidoscope Hypothesis and Abstraction

Chollet proposes that the universe’s apparent novelty derives from recombining a small set of reusable “atoms of meaning,” or abstractions. Intelligence thus involves two interlocking processes: abstraction acquisition (mining past experience for generalized building blocks) and on-the-fly recombination (assembling these blocks to confront new situations). Abstractions bifurcate into type one (continuous, value-centric pattern recognition suited to perceptual intuition) and type two (discrete, program-centric reasoning suited to algorithmic tasks). Current deep learning excels at type-one abstraction but falters on type-two challenges such as sorting or symbolic manipulation unless given enormous data or compute.

## Toward a Programmer-Like Meta-Learner

To achieve both forms of abstraction, Chollet envisions a meta-learning architecture that merges gradient-based learning with discrete program search guided by learned intuition. In this model, a global library of modular building blocks—continuously enriched from prior tasks—serves as the foundation. Confronted with a new problem, the system selects and recombines relevant modules, blending deep-learning subroutines for perception with algorithmic routines for reasoning. Successful recombinations yield new abstractions that are fed back into the library, paralleling software engineers’ practice of publishing reusable code; a minimal sketch of this loop follows the final section below.

## Ndea’s Mission: Independent Invention and Scientific Discovery

This hybrid architecture is under development at Ndea, the research lab co-founded by Chollet. Ndea’s ambition is to accelerate scientific progress by delivering AI capable of independent invention rather than mere automation. Initial milestones include building a system that masters ARC from a tabula rasa, ultimately empowering human researchers with AI collaborators that autonomously expand the frontiers of knowledge.
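The sketch below makes the program-search-and-library loop from “Toward a Programmer-Like Meta-Learner” concrete at toy scale: a library of reusable primitives, a discrete search over their compositions, a trivial scoring heuristic standing in for learned intuition, and feedback of successful programs into the library. All primitives, tasks, and names are illustrative assumptions and do not reflect Ndea's actual system.

```python
# Minimal, illustrative sketch of discrete program search over a library of
# reusable building blocks, with successful compositions fed back into it.
from itertools import product

# The global library of reusable abstractions (here: toy list transformations).
LIBRARY = {
    "reverse":   lambda xs: xs[::-1],
    "sort":      lambda xs: sorted(xs),
    "drop_last": lambda xs: xs[:-1],
    "double":    lambda xs: [2 * x for x in xs],
}

def intuition_score(names):
    """Stand-in for a learned guide: prefer shorter programs. A real system
    would rank candidate compositions with a trained model."""
    return -len(names)

def search_program(demos, max_depth=3):
    """Enumerate compositions of library functions, ordered by the intuition
    score, and return the first one consistent with all demonstrations."""
    candidates = []
    for depth in range(1, max_depth + 1):
        candidates.extend(product(LIBRARY, repeat=depth))
    candidates.sort(key=intuition_score, reverse=True)  # best-scored first
    for names in candidates:
        def program(xs, names=names):
            for name in names:
                xs = LIBRARY[name](xs)
            return xs
        if all(program(x) == y for x, y in demos):
            return names, program
    return None, None

# A "novel task" given only as demonstration pairs: double every element, then sort.
demos = [([3, 1, 2], [2, 4, 6]), ([5, 4], [8, 10])]
names, program = search_program(demos)
if program is not None:
    # Feed the successful composition back into the library as a new abstraction.
    LIBRARY["_".join(names)] = program
    print(names, program([9, 7, 8]))  # prints the composition found and [14, 16, 18]
```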