Natural language autoencoders (NLAs) are a really cool, mostly unsupervised method for producing free-form text explanations of LLM activations. You should read that paper (or the blog post) about them before reading this.

I trained[1] several Qwen3-8B NLAs with different length penalties: during RL, I subtracted the token count multiplied by the length penalty hyperparameter (λ) from the RL reward[2]. I found that with a small length penalty (λ=0.002), you can reduce the length of NLA explanations by ~...
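
The length penalty described above can be sketched in a few lines. This is only an illustration under assumptions, not the actual training code: the function name `length_penalized_reward` and its arguments are hypothetical, and `base_reward` stands in for whatever RL reward the NLA training normally uses, which this excerpt does not specify.

```python
def length_penalized_reward(base_reward: float, num_tokens: int, lam: float = 0.002) -> float:
    """Subtract a per-token length penalty from an RL reward.

    base_reward: the unpenalized reward for one explanation (assumed to be a scalar).
    num_tokens:  number of tokens in the generated explanation.
    lam:         length penalty hyperparameter (lambda); 0.002 is the small value
                 mentioned in the text.
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    # Longer explanations lose more reward: penalty = lam * num_tokens.
    return base_reward - lam * num_tokens


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the arithmetic.
    for n in (50, 100, 200):
        print(n, length_penalized_reward(1.0, n))
```

With λ=0.002, each extra token costs 0.002 reward, so under these made-up numbers a 100-token explanation gives up about 0.2 reward relative to an empty one. How much that matters depends on the scale of the base reward, which is not stated here.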
