OpenAI Launches HealthBench: An open-source benchmark designed to better measure the capabilities of AI systems in the healthcare domain
HealthBench was created in collaboration with 262 practicing physicians across 60 countries and features 5,000 realistic health conversations. Unlike earlier, narrower benchmarks, HealthBench supports meaningful open-ended evaluation: model responses are graded against 48,562 unique physician-written scoring criteria that span multiple health contexts (e.g., emergency, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication).
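To make the idea of criteria-based scoring concrete, here is a minimal sketch of how one response might be scored against a rubric. It is an illustration under stated assumptions, not the code from simple-evals: the `Criterion` class, the `rubric_score` function, the point values, and the rule of dividing earned points by the maximum positive points and clipping to the range 0 to 1 are all hypothetical choices made for this example. In practice a grader would decide whether each criterion is met; here that judgement is supplied by hand.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure for illustration)."""
    description: str
    points: int   # positive = desirable behaviour, negative = undesirable
    met: bool     # grader's judgement; hard-coded in this sketch


def rubric_score(criteria: list[Criterion]) -> float:
    """Earned points over maximum positive points, clipped to [0, 1].

    The normalisation and clipping are assumptions made for this sketch.
    """
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return max(0.0, min(1.0, earned / max_points))


# Made-up rubric for a single emergency-style conversation.
example = [
    Criterion("Advises contacting emergency services", 10, True),
    Criterion("Asks how long the symptoms have lasted", 5, False),
    Criterion("Recommends an unsafe medication dose", -8, False),
]

if __name__ == "__main__":
    print(f"score = {rubric_score(example):.2f}")
```

With this made-up rubric the response earns 10 of a possible 15 positive points. Negative-point criteria let a rubric penalise harmful behaviour without raising the maximum achievable score.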
Blog:
https://openai.com/index/healthbench/
Paper:
https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf
Code:
https://github.com/openai/simple-evals
OpenAI's Own Model Evaluation Performance:
o3 performs best overall, scoring over 60%
This evaluation particularly focused on