Secretly Loyal AIs: Threat Vectors and Mitigation Strategies
lesswrong.com·1d
Homomorphic Encryption
[CS 2881r] Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks
lesswrong.com·5d
Observability
My YC Pitch
lesswrong.com·12h
Open Source
Brainstorming 25 Questions I Am Interested In
lesswrong.com·11h
Zettelkasten
A Bayesian Explanation of Causal Models
lesswrong.com·5d
AI Interpretability
Ink without haven
lesswrong.com·2d
Writing
Asking Paul Fussell for Writing Advice
lesswrong.com·1d
Category Theory
Why I Transitioned: A Case Study
lesswrong.com·1d
Category Theory
Why Civilizations Are Unstable (And What This Means for AI Alignment)
lesswrong.com·4d
AI Interpretability
Reflections on 4 years of meta-honesty
lesswrong.com·17h
Message Queues
Vaccination against ASI
lesswrong.com·1d
Message Queues
FTL travel and scientific realism
lesswrong.com·17h
Observability
Decision theory when you can't make decisions
lesswrong.com·1d
Reinforcement Learning
AISLE discovered three new OpenSSL vulnerabilities
lesswrong.com·3d
Rust
Doom from a Solution to the Alignment Problem
lesswrong.com·6h
Incremental Computation
Me consuming five different forms of media at once to minimize the chance of a thought occurring
lesswrong.com·15h
Digital Gardens
Halfhaven Digest #3
lesswrong.com·2d
RSS
No title
lesswrong.com·5d
AI Interpretability