Category: Benchmarking

Can Models Truly "Reflect on Code"? Beihang University Releases a Repository-Level Understanding and Generation Benchmark, Reshaping the Evaluation Paradigm for LLM Understanding
Multimodal Large Models Collectively Fail, GPT-4o Reaches Only a 50% Safety Pass Rate: SIUO Reveals Cross-Modal Safety Blind Spots
The "Olympics" of AI? OpenAI Releases New MRCR Benchmark, Pushing Models' "Needle in a Haystack" Ability to the Limit!