Naturally learned behaviors in deep MLPs resist detection by both human and learned algorithms
Introduction

An ambitious problem in mechanistic interpretability is finding an input that makes a neural network produce a given output. This problem has been shown to have no general, computationally tractable solution: one case where it fails is a network that acts as a verifier for an NP-hard problem. However, a general algorithm might not be necessary in practice, as "Eliciting bad contexts" suggests. We also know that finding the input that maximizes the output of an...