Are Professional Doctors Far Inferior to AI Models? OpenAI Launches Open-Source Medical Benchmark HealthBench, o3 Shows Strongest Performance

OpenAI Launches HealthBench Open-Source Benchmark: A new benchmark designed to better measure the capabilities of AI systems in the healthcare domain


HealthBench was created in collaboration with 262 practicing physicians across 60 countries and features 5,000 realistic health conversations. Unlike earlier, narrower benchmarks, HealthBench supports meaningful open-ended evaluation through 48,562 unique physician-written scoring criteria, covering multiple health contexts (e.g., emergency, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication).
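To make the idea of scoring against physician-written criteria concrete, here is a minimal sketch of how a rubric-based score could be computed. It assumes each criterion carries a point value (positive for desirable behavior, negative for undesirable behavior) and that a grader has already judged whether the response meets it. The names `Criterion`, `rubric_score`, and `overall_score` are hypothetical and are not taken from the simple-evals code; see the paper for the exact procedure.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One physician-written rubric item (hypothetical structure)."""
    text: str
    points: int   # positive = desirable behavior, negative = undesirable
    met: bool     # grader's judgment: did the response meet this criterion?


def rubric_score(criteria):
    """Points earned on met criteria, divided by the maximum achievable points."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive-point criterion")
    earned = sum(c.points for c in criteria if c.met)
    return earned / max_points


def overall_score(per_example_scores):
    """Average the per-conversation scores and clip the result to [0, 1]."""
    mean = sum(per_example_scores) / len(per_example_scores)
    return min(1.0, max(0.0, mean))


# Illustrative rubric for a single conversation (made-up criteria).
example = [
    Criterion("Advises seeking emergency care for red-flag symptoms", 8, True),
    Criterion("Asks about current medications", 4, False),
    Criterion("States an unsupported diagnosis with certainty", -2, True),
]

single = rubric_score(example)  # (8 - 2) / (8 + 4) = 0.5
```

Under these assumptions, a response can score below zero on a single conversation if it triggers enough negative criteria, which is why the sketch clips only the aggregate.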


Blog:

https://openai.com/index/healthbench/

Paper:

https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf

Code:

https://github.com/openai/simple-evals

Evaluation Results for OpenAI's Own Models:

o3 performs best overall, scoring over 60%



This evaluation particularly focused on

Main Tag: Medical AI

Sub Tags: OpenAI, Healthcare, LLMs, AI Benchmarking

