The Center for AI Safety (CAIS) and Scale AI have unveiled the results of a groundbreaking AI benchmark designed to assess the limits of AI knowledge and models’ ability to perform chain-of-thought reasoning. While newer models showed significant improvement over their predecessors, current AI systems still answered fewer than 10% of the expert-level questions correctly.
Named “Humanity’s Last Exam,” the benchmark tested whether AI models have reached expert-level reasoning and knowledge across diverse fields, including mathematics, the humanities, and the natural sciences. To create the most challenging and comprehensive evaluation possible, CAIS and Scale AI spent the fall crowdsourcing difficult questions from subject-matter experts. The exam aims to address “benchmark saturation”: AI models achieve near-perfect scores on existing tests but struggle with questions beyond those datasets, which limits benchmarks’ effectiveness in measuring future AI progress.
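To make the saturation problem concrete, here is a minimal, hypothetical sketch of exam-style scoring in Python. The `query_model` stub, `score_exam` helper, and sample question are all illustrative stand-ins, not the actual CAIS/Scale AI evaluation harness or data:

```python
# Minimal sketch of exam-style benchmark scoring (illustrative only).
# query_model() and the sample question set are hypothetical stand-ins.

def query_model(prompt: str) -> str:
    """Stand-in for a call to the model under evaluation.

    Replace this with a real model API call in practice.
    """
    return "4" if "2 + 2" in prompt else ""

def score_exam(questions: list[dict]) -> float:
    """Return the fraction of questions answered exactly correctly."""
    correct = 0
    for q in questions:
        answer = query_model(q["prompt"]).strip().lower()
        if answer == q["answer"].strip().lower():
            correct += 1
    return correct / len(questions)

if __name__ == "__main__":
    sample = [
        {"prompt": "What is 2 + 2?", "answer": "4"},
    ]
    # A "saturated" benchmark is one where frontier models approach 1.0
    # on this metric; Humanity's Last Exam reportedly keeps current
    # models below 0.10.
    print(f"accuracy: {score_exam(sample):.2%}")
```

Under this framing, a benchmark saturates once top models cluster near 100% accuracy, at which point it can no longer distinguish between them; harder, expert-sourced questions restore that headroom.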
