Can You Hide From a Natural Language Autoencoder?
TLDR: Natural Language Autoencoders (NLAs) are a recent black-box mech interp method for verbalizing model internals. I will focus on one of their two components, the Activation Verbalizer ...