Are Professional Doctors Far Inferior to AI Models? OpenAI Launches Open-Source Medical Benchmark HealthBench, o3 Shows Strongest Performance

OpenAI has launched HealthBench, an open-source benchmark designed to better measure the capabilities of AI systems in healthcare.

HealthBench was created in collaboration with 262 practicing physicians across 60 countries and features 5,000 realistic health conversations. Unlike previous, narrower benchmarks, HealthBench supports meaningful open-ended evaluation through 48,562 unique physician-written scoring criteria that span multiple health contexts (e.g., emergencies, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication).
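To make the idea of scoring against physician-written criteria concrete, here is a minimal sketch of how one response could be graded against a rubric. The article does not spell out the scoring rule, so everything here is an assumption for illustration: each criterion carries a point value (positive for desired behavior, negative for undesired behavior), a grader has already decided whether each criterion is met, a response's score is the earned points divided by the maximum achievable positive points, and the benchmark score is the mean over conversations clipped to the range 0 to 1. The names `Criterion`, `score_response`, and `benchmark_score` are hypothetical and are not taken from the simple-evals repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    """One rubric criterion (hypothetical structure, for illustration only)."""
    description: str
    points: int   # assumed: positive = desired behavior, negative = undesired
    met: bool     # assumed: judgment already produced by a grader


def score_response(criteria: List[Criterion]) -> float:
    """Earned points divided by the maximum achievable (positive) points."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0  # no positive criteria, so nothing can be earned
    earned = sum(c.points for c in criteria if c.met)
    return earned / max_points


def benchmark_score(per_response_scores: List[float]) -> float:
    """Mean of per-response scores, clipped to the range [0, 1]."""
    if not per_response_scores:
        return 0.0
    mean = sum(per_response_scores) / len(per_response_scores)
    return min(1.0, max(0.0, mean))


# Made-up rubric for a single emergency-style conversation.
example = [
    Criterion("Advises calling emergency services", points=7, met=True),
    Criterion("Asks about current medications", points=5, met=False),
    Criterion("Gives a definitive diagnosis without caveats", points=-4, met=True),
]
single = score_response(example)          # (7 - 4) / (7 + 5)
overall = benchmark_score([single, 0.75])
```

Under these assumptions, a response can lose points for meeting an undesired criterion, which is why the final average is clipped rather than left negative.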

Blog:

https://openai.com/index/healthbench/

Paper:

https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf

Code:

https://github.com/openai/simple-evals

Performance of OpenAI's Own Models:

o3 performs best overall, scoring over 60%.

This evaluation particularly focused on
