Hi all, I created an interactive Logit Lens for Llama and thought some of you might find it useful. It is something that I wish existed.
What is Logit Lens?
Logit Lens is an interpretability tool first introduced by nostalgebraist. It aims to interpret what an LLM "thinks" at its intermediate layers by projecting the intermediate activations through the final layer's unembedding matrix. The method has been mildly popular, with hundreds of papers using it to understand how LLMs think internally.
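To make the idea concrete, here is a minimal logit-lens sketch using Hugging Face transformers. This is not this repo's code; the model name and prompt are just illustrative assumptions.

```python
# Minimal logit-lens sketch with Hugging Face transformers (illustrative,
# not this repo's implementation). Model name and prompt are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"   # assumed; any Llama checkpoint works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; the rest are per-layer residual streams.
for layer_idx, h in enumerate(out.hidden_states):
    normed = model.model.norm(h[:, -1, :])   # final RMSNorm, last token position
    logits = model.lm_head(normed)           # project onto the unembedding matrix
    print(f"layer {layer_idx:2d}: {tok.decode(logits.argmax(dim=-1))!r}")
```

Printing the top token per layer like this shows the model's running "guess" sharpening toward the final prediction as depth increases.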
The reason for making this repo
With how widely the method is used, I thought there would be a popular repo that makes Logit Lens easy to use. This wasn't the case.
The most starred Logit Lens repo on GitHub seemed problematic: the output in its README matched neither my local implementation nor other repositories' outputs.
The TransformerLens repository is fantastic but quite large. You have to piece together the docs and code yourself to get an interactive logit lens workflow, and that takes time.
Also, many public repos were using the original GPT-2 or project-specific models rather than current, widely used ones.
So I built a small tool with the features I wanted.
Stuff it can do
1. Interactively show a more granular logit lens output for user input.
2. Allow users to modify the residual stream, attention outputs, and MLP outputs (a rough sketch of this kind of hook-based intervention is below).
3. Allow users to block attention from and to certain tokens.
4. Save and load current interventions/outputs into and from JSON and npz files.
The tool only works for Llama models at the moment.
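For a rough idea of what a residual-stream intervention can look like in general, here is a generic PyTorch forward-hook sketch. It is not this tool's actual interface; it reuses `model` and `inputs` from the earlier snippet, and the layer index and token position are assumptions.

```python
# Generic residual-stream intervention via a PyTorch forward hook
# (illustrative only, not this repo's interface). Assumes `model` and
# `inputs` from the previous snippet; layer and token position are made up.
import torch

def zero_token_hook(module, args, output):
    # Depending on the transformers version, a LlamaDecoderLayer may return
    # a tensor or a tuple whose first element is the residual stream of shape
    # (batch, seq_len, hidden_dim). Here we zero out token position 3.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden.clone()
    hidden[:, 3, :] = 0.0
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

layer_idx = 10   # assumed layer to intervene on
handle = model.model.layers[layer_idx].register_forward_hook(zero_token_hook)
try:
    with torch.no_grad():
        patched = model(**inputs, output_hidden_states=True)
finally:
    handle.remove()   # always detach the hook so later runs are unaffected
```

Comparing the logit-lens readout of `patched` against the unpatched run shows how ablating one token's contribution at one layer changes the downstream predictions.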
Let me know what you think. If there are additional features you would like, please leave a comment.