Can You Hide From a Natural Language Autoencoder?

TLDR: Natural Language Autoencoders (NLAs) are a recent black-box mech interp method for verbalizing model internals. I focus on one of their two components, the Activation Verbalizer ...