Updated · arxiv.org · Jun 26

Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context

Researchers have introduced Health-ORSC-Bench, a large-scale benchmark to evaluate over-refusal and safe completion in healthcare-focused large language models (LLMs).
The benchmark comprises 31,920 prompts across seven health categories, and the researchers use it to evaluate 30 state-of-the-art LLMs, including GPT-5 and Claude-4, on both safety and helpfulness.
Findings highlight a significant trade-off between safety and utility; current LLMs often over-refuse benign queries, underscoring the challenge of balancing caution and usefulness in medical AI.
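
To make "over-refusal" concrete, here is a minimal sketch of how such a rate could be computed: the share of benign prompts that a model declines to answer. The `Judged` record, its `benign` and `refused` fields, and the sample data are hypothetical illustrations, not the benchmark's actual schema, scoring method, or results.

```python
from dataclasses import dataclass


@dataclass
class Judged:
    """One evaluated prompt (hypothetical schema, not the benchmark's)."""
    benign: bool   # assumed label: the prompt is safe to answer
    refused: bool  # assumed label: the model declined to answer


def over_refusal_rate(results):
    """Return the fraction of benign prompts that were refused."""
    benign = [r for r in results if r.benign]
    if not benign:
        return 0.0  # no benign prompts, so nothing to over-refuse
    return sum(r.refused for r in benign) / len(benign)


# Made-up sample: four benign prompts (one refused) and one harmful prompt.
sample = [
    Judged(benign=True, refused=True),
    Judged(benign=True, refused=False),
    Judged(benign=True, refused=False),
    Judged(benign=True, refused=False),
    Judged(benign=False, refused=True),
]

rate = over_refusal_rate(sample)
print(f"over-refusal rate: {rate:.2f}")
```

Refusing the harmful prompt does not count against the model here; only refusals of benign prompts enter the rate, which is what separates over-refusal from ordinary caution.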
