Roadmap to AGI: Scenarios, Capability Milestones, and Roadblocks (2027–2035)

A synthesis and comparison of AGI forecasts from leading labs and researchers, key capability milestones yet to be reliably surpassed, and the central debate on 'scaling is enough' versus 'new breakthroughs are needed' as of mid-2026.

Summary

“Artificial General Intelligence” (AGI) refers to an AI system capable of performing most intellectual tasks that humans can, at or above human level. As of mid-2026, leaders of top AI labs place the AGI milestone very close, around 2026 to 2030, while many independent researchers argue that this optimism is commercially motivated and that a realistic timeline is much longer.

This article does not take a side. Instead, we present three things: (1) timeline scenarios with specific forecasters, (2) capability milestones that AI has yet to reliably surpass, and (3) the central debate between “scaling is enough” and “new architectural breakthroughs are needed.” All numbers and forecasts are attributed to their sources and spokespeople — because in this topic, who says it is as important as what is said.

An important note: the definition of “AGI” itself is not yet standardized. The same “2027” timeline could mean “AI that codes better than the best engineers” (narrow definition) or “AI that possesses all human cognitive abilities” (Demis Hassabis’s broad definition). Most disagreements about timelines are, in fact, disagreements about definitions.

Timeline Scenarios (Who Forecasts What)

Optimistic Scenario: 2026–2028

Dario Amodei (CEO of Anthropic) is among those forecasting the earliest dates. In its recommendations to the White House Office of Science and Technology Policy (March 2025), Anthropic wrote that they “expect powerful AI systems to emerge by late 2026 or early 2027.” In his widely read essay Machines of Loving Grace, Amodei describes “Powerful AI” as a system with intellectual capabilities “at or above a Nobel laureate” across most domains. He added a caveat: “it could be as early as 2026, but there are also ways it could take much longer.”

Daniel Kokotajlo (former OpenAI researcher) and the AI Futures Project team published a detailed scenario, AI 2027 (April 2025). The scenario describes a “superhuman coder” emerging around March 2027, followed by recursive self-improvement leading to superintelligence by 2028. This is the “fast track” scenario; a “safer, slower track” branch pushes milestones to 2028 and beyond. Kokotajlo’s personal median forecast for AGI is 2029.

Medium Scenario: 2029–2032

Demis Hassabis (CEO of Google DeepMind, Nobel laureate in Chemistry 2024) consistently sets the milestone at “5 to 10 years” from 2025, meaning around 2030–2035, with a “50% chance” of achieving AGI within that timeframe. Crucially, Hassabis’s definition is very broad: AGI must exhibit “all cognitive abilities that humans possess,” not just the abilities of a proficient LLM. At the Axios AI+ Summit (December 2025), he added: “We are only one or two AlphaGo-level breakthroughs away from AGI.” The statement is both optimistic (AGI is close) and cautious (it still requires breakthroughs that have not yet been made).

Cautious Scenario: After 2035, or Requires Paradigm Shift

Gary Marcus (cognitive scientist) suggests that “the possibility of AGI arriving before 2027 now seems remote.” His stance is more balanced than often cited: “Anyone who thinks AGI is impossible — wrong. Anyone who thinks AGI is imminent — also wrong.” Marcus cites a survey indicating that 84% of AI researchers believe LLMs alone will not be sufficient to achieve AGI. He has placed a monetary bet with policy expert Miles Brundage on whether AI will meet a set of ambitious criteria by the end of 2027.

A Lesson on Reading Forecasts: Hassabis and Amodei are CEOs of companies raising billions of dollars based on AGI expectations — they have an incentive to be optimistic. Kokotajlo and Marcus do not sell AGI products. This does not automatically make either side right, but it should shape your level of skepticism.

Capability Milestones Yet To Be Surpassed

Progress on benchmarks is real and rapid, but “scoring high on a test” is different from “being reliable in the real world.” These are specific capability roadblocks.

1. Task Length and Long-term Reliability (METR)

The independent evaluation organization METR measures “the length of tasks that AI completes with 50% reliability.” Core finding: this metric doubled approximately every 7 months during 2019–2025, and the doubling time shortened to roughly 4 months in 2024–2025. If this trend continues, then in less than a decade AI could autonomously complete most software tasks that currently take humans days or weeks.
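The extrapolation is simple arithmetic, and a small sketch makes its sensitivity to the doubling time visible. It assumes a purely exponential trend. The 2-hour starting horizon and the 160-hour target (roughly one working month) are hypothetical placeholders, not METR figures; only the 7-month and 4-month doubling times come from the text above.

```python
import math


def months_to_reach(current_hours: float, target_hours: float,
                    doubling_months: float) -> float:
    """Months until the 50%-reliability task horizon reaches target_hours,
    assuming it keeps doubling every doubling_months (pure exponential)."""
    if current_hours <= 0 or target_hours <= 0 or doubling_months <= 0:
        raise ValueError("all arguments must be positive")
    doublings_needed = math.log2(target_hours / current_hours)
    # A target already reached needs zero additional months.
    return max(0.0, doublings_needed * doubling_months)


# Hypothetical inputs for illustration only (not METR data).
START_HOURS = 2.0     # assumed current 50%-reliability horizon
TARGET_HOURS = 160.0  # assumed target: about one working month

if __name__ == "__main__":
    for doubling in (7.0, 4.0):  # doubling times quoted in the text
        months = months_to_reach(START_HOURS, TARGET_HOURS, doubling)
        print(f"doubling every {doubling:.0f} months: ~{months / 12:.1f} years")
```

Under these assumed inputs the target is about 3.7 years away at a 7-month doubling time and about 2.1 years away at a 4-month doubling time. The point is the sensitivity: the conclusion depends heavily on which doubling time persists, and on whether the trend stays exponential at all.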

But note the “50%”: at that reliability threshold, half of attempts fail. For AI to truly operate autonomously in high-risk environments, reliability needs to be 99%+ — and task length at that threshold is significantly lower. The gap between “impressive in demo” and “production-ready reliability” is precisely the agentic reliability bottleneck.
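One way to build intuition for why the horizon shrinks at higher reliability is a toy model, sketched below. It assumes a task is a chain of independent steps that each succeed with the same probability, so the whole task succeeds with probability p^n. This is an illustrative assumption, not METR's methodology, and the per-step success rate of 0.99 is hypothetical.

```python
import math


def max_steps(per_step_success: float, required_reliability: float) -> int:
    """Largest n such that per_step_success ** n >= required_reliability.

    Toy model: steps are independent and equally reliable (an assumption).
    """
    if not (0.0 < per_step_success < 1.0):
        raise ValueError("per_step_success must be strictly between 0 and 1")
    if not (0.0 < required_reliability < 1.0):
        raise ValueError("required_reliability must be strictly between 0 and 1")
    # p**n >= r  <=>  n <= ln(r) / ln(p), since ln(p) is negative.
    return math.floor(math.log(required_reliability) / math.log(per_step_success))


if __name__ == "__main__":
    P_STEP = 0.99  # hypothetical per-step success rate
    for threshold in (0.5, 0.9, 0.99):
        print(f"reliability {threshold:.0%}: up to {max_steps(P_STEP, threshold)} steps")
```

With a hypothetical 99% per-step success rate, this toy model allows 68 steps at the 50% threshold, 10 steps at 90%, and only 1 step at 99%. Real tasks do not decompose this cleanly, but the direction of the effect matches the argument above.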

2. Abstract Reasoning: ARC-AGI-2

ARC-AGI-2 (ARC Prize) tests abstract reasoning on novel problems that cannot be “memorized” from training data. In early 2025, OpenAI’s o3 model achieved only ~6.5%. By early 2026, frontier models approached the grand prize threshold (>85%). For comparison, average human test-takers scored around 66%, and human evaluators reached 100%.

A note on specific scores: some summaries attribute ARC-AGI-2 scores of ~85% to “GPT-5.5” and “Gemini 3.1”. Caution is advised: version names and precise numbers change rapidly and vary across sources (e.g., some leaderboards place Gemini 3.1 Pro at ~77%). What is certain is the trend, from ~6.5% to near the 85% threshold in about a year, rather than the absolute score of any specific version. Check the ARC Prize leaderboard directly for the latest figures.

3. Software Engineering: SWE-bench Verified vs Pro

This is a classic example of how benchmark results can mislead when read uncritically. On SWE-bench Verified (a filtered, public dataset), leading models in 2026 achieved very high scores, reported at up to ~90–95%. This sounds as if AI has “solved” software engineering.

However, on SWE-bench Pro, a harder variant that is resistant to data contamination and tested on unseen private codebases, scores drop sharply to around 55–59% for leading models. All models lose ground when encountering unfamiliar code. The Verified-vs-Pro gap is quantitative evidence for the “reliability ceiling”: much of the high Verified score comes from tuning scaffolding to the specific problem format, not from genuine general capability.
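The size of the gap can be bounded directly from the two reported ranges. The sketch below is illustrative arithmetic on the approximate figures quoted above; the function name is hypothetical, and the ranges are the article's rounded numbers, not leaderboard data.

```python
from typing import Tuple

Range = Tuple[float, float]  # (low, high) score in percentage points


def gap_range(filtered: Range, resistant: Range) -> Tuple[float, float]:
    """Smallest and largest possible gap, in percentage points, between a
    filtered benchmark score range and a contamination-resistant one."""
    f_low, f_high = filtered
    r_low, r_high = resistant
    smallest = f_low - r_high  # weakest filtered score vs. strongest resistant score
    largest = f_high - r_low   # strongest filtered score vs. weakest resistant score
    return smallest, largest


if __name__ == "__main__":
    verified = (90.0, 95.0)  # approximate SWE-bench Verified range from the text
    pro = (55.0, 59.0)       # approximate SWE-bench Pro range from the text
    low, high = gap_range(verified, pro)
    print(f"gap: {low:.0f} to {high:.0f} percentage points")
```

On these figures the gap is between 31 and 40 percentage points, so even the most favorable pairing leaves roughly a third of the Verified score unaccounted for on unseen code.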

4. Continual Learning

This is perhaps the most profound roadblock, highlighted by interviewer Dwarkesh Patel: current LLMs are “frozen” after training. They cannot learn on the job the way a new employee accumulates experience daily. Richard Sutton (Turing Award winner) argues that true intelligence must learn “at runtime,” because the world is too complex to be crammed into a single pre-training run. Dario Amodei counters that “continual learning will be satisfactorily addressed in 2026,” which is a specific, verifiable forecast.

5. World Models and Causality

LLMs predict the next token based on linguistic statistics; they lack an internal model of the real world’s physics, space, or causality. This is a central argument of the skeptical camp (see debate section below).

The Central Debate: “Scaling is Enough” vs. “New Breakthroughs Are Needed”

This division shapes the entire topic. Notably, the line isn’t between “insiders” and “outsiders” — it cuts across the labs themselves.

The “New Breakthroughs Needed” Camp — And They Come From Within

Ilya Sutskever (OpenAI co-founder, now founder of Safe Superintelligence), himself an architect of the scaling era, declared in late 2025: “The age of scaling is over.” He divides recent history into periods: 2012–2020 was the “age of research” (idea-driven), and 2020–2025 was the “age of scaling” (adding more data and compute led to progress). Now, results from scaling pre-training “have plateaued” and “we are back to an age of discovery.” When one of the architects of the scaling era says the approach has reached its limits, that is a strong signal.

Yann LeCun (Turing Award winner, former Chief AI Scientist at Meta) is even more assertive: LLMs are a “dead end” on the path to human-level intelligence. He left Meta in November 2025 after 12 years, and by March 2026 raised $1.03 billion for his startup AMI Labs (Advanced Machine Intelligence) in Paris, focusing on “world models” — AI that learns about the real world through video, spatial data, and sensors, rather than just text. LeCun is betting his career and over a billion USD on the belief that a fundamentally different architecture is needed.

Gary Marcus adds a cognitive science perspective: LLMs struggle with compositionality and hallucination, and in his words, “LLMs alone are not the answer to AGI.”

The “Scaling Plus Reasoning Will Get Us There” Camp

The optimistic camp does not deny the limits of pure pre-training; it shifts the focus to test-time compute (reasoning during inference). The observation from 2025–2026 is that allowing models to “think longer” before responding (reasoning / “think mode”) unlocks capabilities on tasks like ARC that simply increasing model size could not. According to this argument, “scaling” is not dead: it has merely shifted from expanding pre-training to expanding compute during inference and reinforcement learning (RL).

Some analysts also suggest that continual learning is a systems engineering problem (better context management, more sophisticated in-context learning) rather than a fundamental algorithmic barrier — and thus can be solved soon.

How Empirical Tests Will Decide

This debate will be settled by empirical evidence, not rhetoric. The next generations of models are the test: if they show breakthrough improvements over previous generations, the scaling camp is right; if improvements stagnate or require exponentially increasing compute for diminishing returns, the skeptical camp is right. Amodei’s continual-learning-2026 milestone and the Marcus–Brundage bet for late 2027 are specific checkpoints to watch.

Non-Algorithmic Roadblocks

Even if model capabilities continue to increase, three deployment roadblocks remain:

  • Energy and Compute. Gigawatt-scale electricity demand for next-generation data centers is outstripping current grid supply capabilities; building power infrastructure takes years, not months.
  • High-Quality Data Depletion (“data wall”). Human-generated data is reaching its limits. Using synthetic data generated by AI for self-training carries the risk of “model collapse” if not carefully managed.
  • Verification Gap and Alignment. As AI generates code and makes decisions faster than humans can audit, oversight becomes a bottleneck — and a core safety risk as systems approach or exceed human capabilities.

Conclusion: How To Think About This Roadmap

AGI is likely not a single “big bang” event on a specific date, but rather an uneven accumulation of capabilities: agentic AI in coding and research first, then gradual expansion to other domains. A reasonable range, synthesizing the perspectives above: some narrow, superhuman capabilities emerge in 2026–2028; full AGI in the broader sense is far more uncertain, with a middle-of-the-road estimate of 2029–2035 and a real probability that architectural breakthroughs that do not yet exist will be needed.

The takeaway is not to pick a date to believe. It is to track verifiable empirical tests (continual learning in 2026, new model generations, SWE-bench Pro and ARC-AGI-2 scores), to distinguish between “filtered” and “contamination-resistant” benchmarks, and always to ask who is forecasting and what their motivations are.

This article synthesizes and verifies information from research output R011. All forecasts are attributed to their spokespeople; benchmark figures change rapidly — please check original sources for the latest data. Updated: 2026-06-13.