“Humanity’s Last Exam”: AI Models Score 38.3% on a Test Designed to Be Unsolvable

“Humanity’s Last Exam” (HLE), a benchmark of 2,500 expert-level questions contributed by nearly 1,000 experts at 500 institutions across 50 countries, was created by the Center for AI Safety and Scale AI and published in Nature in January 2026. It was built because frontier AI models had already saturated MMLU (Massive Multitask Language Understanding) at over 90% accuracy, rendering that benchmark ineffective for measuring further progress; HLE was deliberately constructed to defeat all current AI systems at launch. Initial scores: GPT-4o 2.7%, Claude 3.5 Sonnet 4.1%, OpenAI’s o1 8%, DeepSeek-R1 8.5%. Within months, GPT-5 reached 25.3%, Gemini 2.5 Pro 21.6%, and Gemini 3 Pro now leads the live leaderboard at 38.3%. Key findings: severe calibration failure across all tested models, with calibration errors of 50–90% (models express high confidence while being wrong); performance that improves with more reasoning compute but declines beyond approximately 16,000 output tokens; and expert disagreement on HLE questions of 15.4%, rising to 18% in biology, chemistry, and health. The designers have announced HLE-Rolling, a continuously updated version, to stay ahead of rapidly improving models. The authors explicitly state that high HLE performance would demonstrate expert-level ability on academic questions, not constitute evidence of artificial general intelligence (AGI).

Perspective & Context:

  • In simple terms: Researchers created what they thought was an impossible test for AI — 2,500 questions so hard that even human experts sometimes disagree on the answers. AI initially failed catastrophically, with the best models scoring under 10%. Within months, the leading model reached 38.3%. The test they called “humanity’s last exam” is already becoming obsolete, which itself tells you something about the pace of AI advancement.
  • Benchmark saturation — when AI models consistently score so high on a test that it can no longer differentiate between them or track further progress. MMLU was the previous gold standard; models now exceed 90% on it. HLE was designed specifically to avoid saturation — yet it is already showing pressure within months of launch.
  • Calibration — a model is “well-calibrated” when its stated confidence matches its actual accuracy (e.g., if it says “90% confident,” it should be right ~90% of the time). HLE found calibration errors of 50–90% across all architectures: models express high confidence while being wrong. This appears to be a structural feature of current AI design, not a quirk of any single system. (A minimal sketch of the underlying metric appears after this list.)
  • Reasoning compute ceiling — giving AI models more “thinking time” improves performance, but only up to approximately 16,000 output tokens, after which performance declines. This suggests an efficiency ceiling to simply “thinking longer.” (A measurement sketch also follows the list.)
  • Business implications highlighted in the paper: (1) AI capability claims built on benchmark scores are structurally unstable, so procurement decisions based on them may already be outdated; (2) confident wrongness makes AI unsuitable for high-stakes deployment (credit assessment, medical triage, legal review) without mandatory human checks; (3) AI capability assessments have a short shelf life, meaning planning and regulatory assumptions require revision cycles most institutions aren’t designed to support.
  • HLE’s publication in Nature signals that the academic community treats AI benchmark methodology as serious science, not just a marketing exercise. The designers’ immediate announcement of HLE-Rolling acknowledges that no static benchmark can escape becoming a training target.
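Two of the items above lean on quantitative ideas that a few lines of code make concrete. First, calibration: the sketch below is a minimal Python version assuming the standard binned expected-calibration-error (ECE) formulation, which may differ in detail from the exact metric the HLE paper uses. The function and toy data are illustrative, not HLE’s evaluation code; they show how a calibration error near 80% arises when a model keeps claiming 90% confidence while being right only 10% of the time.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned expected calibration error (ECE).

    confidences: model-stated probabilities in [0, 1]
    correct:     1 if the model's answer was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = bins[i], bins[i + 1]
        # First bin is closed on the left so a confidence of 0.0 is counted.
        in_bin = (confidences >= lo if i == 0 else confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        avg_conf = confidences[in_bin].mean()  # what the model claimed
        accuracy = correct[in_bin].mean()      # how often it was right
        ece += in_bin.mean() * abs(avg_conf - accuracy)
    return ece

# Toy illustration: a model that says "90% confident" on every question
# but answers only 10% of them correctly has a calibration error of ~0.80.
conf = [0.9] * 100
right = [1] * 10 + [0] * 90
print(f"ECE: {expected_calibration_error(conf, right):.2f}")  # -> ECE: 0.80
```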
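Second, the reasoning-compute ceiling: measuring it amounts to holding the question set fixed and sweeping only the output-token budget. The harness below is a hypothetical sketch; `ask_model` and `grade` are placeholder names standing in for a real model API and grader (they are not part of any published HLE tooling), and only the sweep structure is the point.

```python
# Hypothetical harness for measuring accuracy vs. output-token budget.
# ask_model and grade are placeholders, not real HLE or vendor APIs.

def ask_model(question: str, max_output_tokens: int) -> str:
    """Query a reasoning model with a capped output-token budget."""
    raise NotImplementedError("wire up your model API here")

def grade(answer: str, reference: str) -> bool:
    """Return True if the answer matches the reference solution."""
    raise NotImplementedError("wire up your grader here")

def accuracy_by_budget(questions, references, budgets):
    """Accuracy on a fixed question set at each output-token budget.

    The HLE finding would appear as accuracy rising with budget,
    peaking near ~16,000 output tokens, then declining.
    """
    results = {}
    for budget in budgets:
        correct = sum(
            grade(ask_model(q, budget), ref)
            for q, ref in zip(questions, references)
        )
        results[budget] = correct / len(questions)
    return results

# Example sweep: accuracy_by_budget(qs, refs, [1_000, 4_000, 16_000, 64_000])
```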