[CS 2881r] Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks
lesswrong.com · 6d
Observability
Human Values ≠ Goodness
lesswrong.com · 1d
Digital Gardens
My YC Pitch
lesswrong.com · 1d
Open Source
Brainstorming 25 Questions I Am Interested In
lesswrong.com · 1d
Zettelkasten
Body Time and Daylight Savings Apologetics
lesswrong.com · 1d
Homomorphic Encryption
Why I Transitioned: A Case Study
lesswrong.com · 2d
Category Theory
Ink without haven
lesswrong.com · 3d
Writing
High-Resistance Systems to Change: Can a Political Strategy Apply to Personal Change?
lesswrong.com · 6h
Incremental Computation
You think you are in control?
lesswrong.com · 7h
Self-Hosting
There's some chance oral herpes is pretty bad for you?
lesswrong.com · 18h
Rust
Asking Paul Fussell for Writing Advice
lesswrong.com · 2d
Category Theory
FTL travel and scientific realism
lesswrong.com · 1d
Observability
Secretly Loyal AIs: Threat Vectors and Mitigation Strategies
lesswrong.com · 3d
Homomorphic Encryption
Vaccination against ASI
lesswrong.com · 2d
Message Queues
AISLE discovered three new OpenSSL vulnerabilities
lesswrong.com · 4d
Rust
Reflections on 4 years of meta-honesty
lesswrong.com · 1d
Message Queues
Why Civilizations Are Unstable (And What This Means for AI Alignment)
lesswrong.com · 5d
AI Interpretability
Decision theory when you can't make decisions
lesswrong.com · 2d
Reinforcement Learning
Doom from a Solution to the Alignment Problem
lesswrong.com · 1d
Incremental Computation