Category: AI Benchmarking

Google's Challenge: DeepSeek, Kimi and More to Compete in First Large Model Showdown Starting Tomorrow
o3-pro Completes 'Sokoban' as Classic Retro Games Become New Benchmarks for Large Models
Amazon's New SOP Benchmark: The Ultimate Test for AI Agents. How Do Top Agents Score?
The Smarter AI Gets, the Less Obedient It Becomes! New Study: Strongest Reasoning Models Follow Instructions Only 50% of the Time
Are Professional Doctors Far Inferior to AI Models? OpenAI Launches Open-Source Medical Benchmark HealthBench, o3 Shows Strongest Performance