Formal Verification, Microkernel, Capability Security, Isabelle/HOL
No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models
arxiv.org·1d
Narrative-Guided Reinforcement Learning: A Platform for Studying Language Model Influence on Decision Making
arxiv.org·1d
Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint
arxiv.org·3d