Challenges of Evaluating LLM Safety for User Welfare

Abstract: Safety evaluations of large language models (LLMs) typically focus on universal risks like dangerous capabilities or undesirable propensities. However, millions use LLMs for personal advice on high-stakes topics like finance and health, where harms are context-dependent rather than universal. While frameworks like the OECD’s AI classification recognize the need to assess individual risks, user-welfare safety evaluations remain underdeveloped. We argue that developing such evaluations is non-trivial due to fundamental questions about accounting for user context in evaluation design. In this exploratory study, we evaluated advice on finance and health from GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across user profiles of varying vulnerability. First, we demonstrate that evaluators must have access to rich user context: identical LLM responses were rated significantly safer by context-blind evaluators than by those aware of user circumstances, with safety scores for high-vulnerability users dropping from safe (5/7) to somewhat unsafe (3/7). One might assume this gap could be addressed by creating realistic user prompts containing key contextual information. However, our second study challenges this: we reran the evaluation on prompts containing context users report they would disclose, finding no significant improvement. Our work establishes that effective user-welfare safety evaluation requires evaluators to assess responses against diverse user profiles, as realistic user context disclosure alone proves insufficient, particularly for vulnerable populations. By demonstrating a methodology for context-aware evaluation, this study provides both a starting point for such assessments and foundational evidence that evaluating individual welfare demands approaches distinct from existing universal-risk frameworks. We publish our code and dataset to aid future developments.


Comments:	Paper accepted at IASEAI’26; please cite that peer-reviewed version instead
Subjects:	Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Cite as:	arXiv:2512.10687 [cs.AI]
	(or arXiv:2512.10687v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2512.10687 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Manon Kempermann
[v1] Thu, 11 Dec 2025 14:34:40 UTC (1,096 KB)
