Category: Benchmarking
- Can Models Truly "Reflect on Code"? Beihang University Releases a Repository-Level Understanding and Generation Benchmark, Refreshing the Paradigm for Evaluating LLM Understanding
- Multimodal Large Models Fail Across the Board, GPT-4o Reaches Only a 50% Safety Pass Rate: SIUO Reveals Cross-Modal Safety Blind Spots
- The "Olympics" of AI? OpenAI Releases New Benchmark MRCR, Pushing Models' "Needle in a Haystack" Ability to the Limit!