Category: AI Benchmarking
- Google's Challenge: DeepSeek, Kimi and More to Compete in First Large Model Showdown Starting Tomorrow
- o3-pro Completes 'Sokoban,' Classic Retro Games Become New Benchmarks for Large Models
- Amazon's New SOP Benchmark: The Ultimate Test for AI Agents. How Do Top Agents Score?
- The Smarter AI Gets, the Less Obedient It Becomes! New Study: Strongest Reasoning Models Follow Instructions Only 50% of the Time
- Are Professional Doctors Far Inferior to AI Models? OpenAI Launches Open-Source Medical Benchmark HealthBench, o3 Shows Strongest Performance