[Linkpost] Theory and AI Alignment (Scott Aaronson)

Published on December 7, 2025 7:17 PM GMT

Some excerpts below:

On Paul’s “No-Coincidence Conjecture”

Related to backdoors, maybe the clearest place where theoretical computer science can contribute to AI alignment is in the study of mechanistic interpretability. If you’re given as input the weights of a deep neural net, what can you learn from those weights in polynomial time, beyond what you could learn from black-box access to the neural net?

In the worst case, we certainly expect that some information about the neural net’s behavior could be cryptographically obfuscated. And answering certain kinds of questions, like “does there exist an input to this neural net that causes it to output 1?”, is just provably NP-hard.
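
To make the NP-hardness claim concrete, here is a minimal sketch (not from the post itself) of the standard reduction: any CNF formula can be encoded as a small one-hidden-layer ReLU network that outputs 1 on a 0/1 input exactly when that input satisfies the formula, so deciding "does there exist an input that causes this network to output 1?" is at least as hard as SAT. The encoding and the tiny example formula below are illustrative choices, not anything from the linked post.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def cnf_to_network(clauses, n_vars):
    """Build (W1, b1, w2, b2) for a one-hidden-layer ReLU net that outputs 1
    on a 0/1 input x iff x satisfies the CNF formula `clauses`.

    Each clause is a list of nonzero ints: +i means variable x_i,
    -i means its negation (variables are 1-indexed).
    """
    m = len(clauses)
    W1 = np.zeros((m, n_vars))
    b1 = np.zeros(m)
    for c, clause in enumerate(clauses):
        # Hidden unit c computes ReLU(1 - s_c), where s_c is the number of
        # satisfied literals in clause c; on 0/1 inputs this is 1 exactly
        # when the clause is violated (s_c == 0) and 0 otherwise.
        b1[c] = 1.0
        for lit in clause:
            v = abs(lit) - 1
            if lit > 0:
                W1[c, v] -= 1.0   # subtract x_v
            else:
                W1[c, v] += 1.0   # subtract (1 - x_v)
                b1[c] -= 1.0
    # Output unit: ReLU(1 - sum of violated-clause indicators) = 1 iff
    # no clause is violated, i.e. the input satisfies the formula.
    w2 = -np.ones(m)
    b2 = 1.0
    return W1, b1, w2, b2

def run(params, x):
    W1, b1, w2, b2 = params
    h = relu(W1 @ x + b1)
    return relu(w2 @ h + b2)

# Example formula: (x1 or not x2) and (not x1 or x2),
# satisfied by x = (1, 1) but not by x = (1, 0).
params = cnf_to_network([[1, -2], [-1, 2]], n_vars=2)
print(run(params, np.array([1.0, 1.0])))  # 1.0 (satisfying assignment)
print(run(params, np.array([1.0, 0.0])))  # 0.0 (violates the second clause)
```

Since SAT is NP-complete and the construction above is polynomial-size, an efficient procedure that answered the existence question for arbitrary weights would solve SAT, which is the sense in which the question is provably NP-hard.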

That’s why I love a …
