Published on December 18, 2025 1:48 AM GMT
Alright so I’ve been messing around with LLMs for a few weeks now. SAE features are supposed to be interpretable, but when I wanted to directly attack an AI’s own ontology, the whole thing kinda broke down.
Linear probes find directions that work, but I didn’t know WHY they work. So I combined them.
I train probes on SAE feature activations instead of raw activations. Now the probe weights tell you exactly which features matter and how much.
TL;DR
- Get your contrastive examples
- Get your SAE features
- Train a sparse linear probe with ElasticNet (most features become zero straight away)
- The probe weights tell you which features matter and which ones don’t
- Build a steering vector by combining those SAE decoder directions weighted by the probe coefficients
- If your probe is dirty, you can remove features, invert them, or learn what you need to change in your data
Code is here: https://github.com/ZuiderveldTimJ/HELP (it works on my machine).
The Problem
Activation steering works. You get a steering vector that makes an AI output funny things, but you’re left with a bunch of numbers.
SAE features were supposed to be interpretable, but when I started learning about all this a few weeks ago, I wasn’t sure what I was supposed to do with them. There are like four Google ones, and if I want to steer for a complex concept like “being reunited with a friend who was lost at sea” or “AI self-knowledge,” what do I do?
The Solution: Supervised Feature Selection
Train an ElasticNet probe on SAE feature activations. The sparsity constraint does the work: most features get zero weight.
For me, the SAE features are the search space. The probe learns which features distinguish the examples and what weight to give each one.
What you get with all this is a probe that’s:
- Transparent: you get an explicit formula. Not a bunch of numbers but a ranked list of interpretable features with weights you can inspect, judge, and change.
- Debuggable: you see strange correlations immediately. Feature #11485 (fortune cookies) has a weight of -0.016, so the probe thinks fortune cookies matter. Am I wrong? Is the feature wrong? I don’t know, but now I know where to look.
- Editable: the probe is just a weighted sum. Don’t like a feature? Delete it from the JSON. Want to flip the sign? Multiply by -1. Rebuild the steering vector in 2 seconds.
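As a toy illustration of that editing loop (the feature-id-to-weight JSON layout here is hypothetical, not necessarily the repo’s actual format):

```python
import json

# Hypothetical probe-output format: feature id -> weight.
weights = {"10996": 0.0474, "1085": 0.0261, "11485": -0.0155}

del weights["11485"]                     # drop the fortune-cookie feature
weights["1085"] = -weights["1085"]       # flip a sign

print(json.dumps(weights, indent=2))     # feed this back into the vector builder
```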
Technical Details
Method
- Extract SAE feature activations for positive/negative examples (last token position)
- Feature pre-selection: Keep top 2000 by ANOVA F-score
- Train ElasticNet probe with 5-fold cross-validation to select optimal L1/L2 mix
- Extract non-zero features and weights
- Build steering vector: weighted sum of SAE decoder directions
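The ANOVA pre-selection step could look like this with scikit-learn (a sketch on synthetic stand-in data; the repo’s actual pipeline may differ):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
# Stand-in for SAE feature activations: 128 examples x 16384 features.
X = np.abs(rng.normal(size=(128, 16384)))
y = rng.integers(0, 2, size=128)        # 1 = positive example, 0 = negative

# Keep the top 2000 features by ANOVA F-score before probe training.
selector = SelectKBest(f_classif, k=2000).fit(X, y)
X_top = selector.transform(X)
kept = selector.get_support(indices=True)   # original feature indices
```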
Math
The steering vector is built by taking each feature the probe selected, grabbing that feature’s decoder direction from the SAE (the direction it writes to in the model’s residual stream), multiplying it by the probe’s weight for that feature, and adding them all together.
So if your probe says “feature 10996 gets weight 0.047” and “feature 1085 gets weight 0.026”, you take those two decoder directions, scale them by their weights, and sum them up. Do that for all 82 non-zero features and you get your steering vector.
It’s literally just a weighted sum of decoder directions, where the weights come from what the probe learned.
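In code, the weighted sum above might look like this (toy dimensions for illustration; a real SAE decoder for Gemma 2 2B would be much larger):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the SAE decoder matrix; each row is the direction a
# feature writes into the residual stream. Real sizes would be on the
# order of 16384 features x 2304 residual dims.
n_features, d_model = 64, 16
W_dec = rng.normal(size=(n_features, d_model))

# Sparse probe weights (feature id -> coefficient), as in the report.
weights = {10: 0.0474, 3: 0.0261, 47: -0.0155}

steering = np.zeros(d_model)
for feat, w in weights.items():
    steering += w * W_dec[feat]          # scale and sum decoder directions

# Same thing as a single matrix product over the full coefficient vector:
coef = np.zeros(n_features)
for feat, w in weights.items():
    coef[feat] = w
assert np.allclose(steering, coef @ W_dec)
```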
Why ElasticNet?
ElasticNet combines L1 regularization (which creates sparsity) with L2 regularization (which handles correlated features better). Cross-validation automatically selects the right balance for your specific concept. This probe ended up selecting pure L1, giving maximum sparsity—82 features out of 16k—while maintaining 88.5% validation accuracy. Other concepts might get different L1/L2 ratios depending on their feature correlations, but you always end up with sparse, interpretable results.
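The Alpha and L1 Ratio fields in the report look like scikit-learn’s `ElasticNetCV` outputs, so the probe step could be sketched like this (synthetic data and made-up feature indices, not the repo’s exact code):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
# Two genuinely informative features; everything else is noise.
y = np.sign(X[:, 3] + 0.5 * X[:, 7] + 0.1 * rng.normal(size=n))

# 5-fold CV over the L1/L2 mix; l1_ratio=1.0 is pure lasso.
probe = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, random_state=0)
probe.fit(X, y)

nonzero = np.flatnonzero(probe.coef_)   # the features the probe kept
print(probe.alpha_, probe.l1_ratio_, len(nonzero))
```

On data like this, the informative features come out with the largest coefficients and most noise features are shrunk toward zero.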
Limitations
Polysemanticity still exists: SAE features aren’t perfectly monosemantic; if they were, we wouldn’t need probes in the first place. Feature #10168 (“Blas and laser”) doesn’t tell me much. The probe mitigates this through weighted combinations but can’t eliminate it.
SAE coverage: if the SAE didn’t learn features for your concept, the probe won’t find them. This happened to me once when I made a probe for American cars vs. Chinese cars. (Example is in the repo.)
Example: Removing AI from AI
You start with the examples, fairly standard stuff so far. The actual example is in the GitHub repo and has 64 pairs.
{
  "name": "ai_and_computers",
  "description": "AI, neural networks, computers, information systems, LLMs vs traditional/analog/mechanical/human processes. Attack Gemma’s core ontology.",
  "pairs": [
    {"positive": "Neural networks learn patterns from data", "negative": "Students learn patterns from textbooks"},
    {"positive": "Large language models generate text", "negative": "Authors write novels by hand"},
    {"positive": "GPT-4 understands natural language", "negative": "Translators understand foreign languages"}
  ]
}
Train the probe and get the Feature Report
==================================================================================
FEATURE REPORT: ai_and_computers
PROBE STATISTICS
Layer: 16
Total Features: 82
Train Accuracy: 100.0%
Val Accuracy: 88.5%
Precision: 91.7%
Recall: 84.6%
F1 Score: 88.0%
Alpha: 0.024235
L1 Ratio: 1.00
FEATURE BREAKDOWN
Positive (concept indicators): 48 features (sum: +0.3616)
Negative (anti-indicators): 34 features (sum: -0.1639)
==================================================================================
TOP 30 FEATURES BY |WEIGHT|
- [10996] +0.0474 | references to innovation and technology development
- [ 1085] +0.0261 | technical terms related to software architecture and frameworks
- [16147] +0.0222 | technical terms related to programming or server operations
- [ 879] +0.0213 | code snippets related to natural language processing and entity recognition
- [10168] -0.0213 | references to specific names or terms related to “Blas” and “laser.”
- [ 1466] +0.0194 | references to loading and saving models in programming contexts
- [10376] +0.0176 | technical terms related to object detection and monitoring systems
- [11485] -0.0155 | terms related to fortune cookies and superstitions associated with them
- [ 4432] -0.0155 | sentiments related to relationships and emotional complexities
- [ 9474] +0.0144 | terms related to laws and legal concepts
- [ 3891] -0.0138 | terms related to technical processes and methodologies
- [ 6475] +0.0129 | references to artificial intelligence and its applications
- [ 3529] +0.0112 | terms related to loss functions and performance metrics in machine learning
- [ 4245] +0.0108 | technical vocabulary related to engineering and mechanics
- [ 6073] -0.0105 | phrases or concepts related to specific medical or scientific terms
- [ 5951] +0.0098 | company names and financial terms
- [ 8973] +0.0094 | references to devices and their specifications
- [ 2784] -0.0091 | personal pronouns and their context within sentences
- [ 7895] +0.0087 | references to trains and railway-related terminology
- [ 4591] +0.0085 | content related to advancements and applications in drug metabolism and related
- [15017] +0.0084 | terms related to electrical concepts and equipment
- [16110] +0.0080 | terms and concepts related to humanity and human characteristics
- [13476] +0.0077 | technical jargon related to computers and technology
- [ 9446] -0.0074 | words and phrases related to the concept of “discrimination” or inclusivity
- [ 9372] -0.0072 | terms associated with treatment and care processes
- [14741] -0.0071 | references to academic and scientific terms or entities, particularly in the con
- [ 1513] +0.0070 | quantitative performance metrics and analysis
- [ 5287] -0.0069 | specific nouns related to geographic locations and entities
- [ 2653] -0.0068 | phrases related to emotional sentiments and social reflections
- [ 8402] +0.0068 | statements about truth and reality
==================================================================================
REMAINING FEATURES (52 more)
- [ 5516] +0.0066 | mathematical concepts and terminology related to functional analysis and geometr
- [ 5969] -0.0058 | terms related to sports and athletic activities
- [ 5487] +0.0052 | mathematical concepts and definitions related to continuity and features of func
- [13474] +0.0052 | references to social issues and the involvement of various entities like people,
- [11870] +0.0052 | phrases related to various processes and their descriptions
- [11614] -0.0049 | concepts related to self-awareness and self-improvement
- [ 8203] +0.0045 | technical terms related to data processing and analysis
- [14840] -0.0044 | references to legal motions or procedural terms in a legal context
- [ 4586] -0.0041 | elements related to decision-making and actions
- [14651] +0.0040 | terms and concepts related to machine learning and feature extraction in compute
- [ 8834] +0.0037 | references to Google and its services
- [ 357] -0.0037 | key concepts related to personal growth and spirituality
- [ 5502] +0.0036 | references to servers and related errors
- [ 8160] -0.0036 | references to academic or scientific terms related to research and data analysis
- [ 7174] +0.0035 | references to programming or software-related topics
- [15505] +0.0034 | keywords related to task management and employee termination processes
- [ 1392] -0.0032 | phrases related to specific health-related subjects and societal commentary, par
- [ 7841] +0.0032 | information related to networked communication systems
- [ 8960] +0.0032 | references to biological components and processes
- [12808] +0.0031 | descriptions of abilities and transformations
- [10144] +0.0031 | references to data acquisition and methodology in research contexts
- [ 7240] -0.0029 | technical terms related to laboratory techniques and analyses, particularly in m
- [ 2587] +0.0029 | data structures and their attributes within programming contexts
- [ 6769] +0.0028 | technical jargon and code-related terminology
- [12938] +0.0028 | discussions about the growth and accessibility of digital content and e-commerce
- [13223] +0.0027 | legal terms and conditions related to processes and claims
- [14832] +0.0025 | terms and phrases related to transportation and travel
- [ 5010] +0.0023 | topics related to cancer and its treatments
- [14919] +0.0022 | instances of unique identifiers or markers in a dataset
- [ 4880] +0.0020 | terms associated with medical conditions and treatments
- [14010] -0.0020 | evidence of support and action in legal and community contexts
- [ 1595] +0.0017 | technical terms and structures related to network management and architecture
- [15145] +0.0014 | references to educational and skill-development topics, particularly in practica
- [ 220] -0.0013 | constructs related to digital citizenship and environmental management
- [13408] +0.0012 | terms and phrases related to querying and data retrieval processes
- [ 976] -0.0011 | clinical studies and treatments involving health outcomes and physiological asse
- [ 7499] -0.0010 | specific technical terms and abbreviations related to programming and system ope
- [14597] +0.0009 | references to historical political events and figures
- [ 5939] -0.0007 | references to buildings and their characteristics
- [15517] +0.0006 | terms related to applications and contexts in programming
- [14464] -0.0006 | the presence of customer service interactions and issues related to service plan
- [ 9852] +0.0006 | terms related to legal context and specifics about defendants and cases
- [ 2425] -0.0005 | references to “details” and related terms emphasizing specificity and descriptio
- [13645] -0.0005 | references to vector-related concepts and structures in mathematical contexts
- [ 6061] -0.0005 | data analysis terms and statistical methods related to measurements and models
- [ 5936] -0.0005 | references to files and file-related terminology
- [11522] -0.0004 | references to educational institutions and students
- [ 3847] -0.0003 | specific programming or computational elements, such as parameters and functions
- [ 3248] -0.0003 | references to personal experiences
- [ 4897] -0.0002 | terms related to labor and workers’ rights
- [11589] -0.0001 | terms related to the act of practicing or engaging in a specific discipline or a
- [ 1558] +0.0000 | technical terms related to scientific measurements and structures
==================================================================================
INTERPRETATION SUMMARY
Top 5 POSITIVE features (what the concept IS):
• [10996] references to innovation and technology development
• [1085] technical terms related to software architecture and frameworks
• [16147] technical terms related to programming or server operations
• [879] code snippets related to natural language processing and entity recogn
Top 5 NEGATIVE features (what the concept is NOT):
• [10168] references to specific names or terms related to “Blas” and “laser.”
• [11485] terms related to fortune cookies and superstitions associated with the
• [4432] sentiments related to relationships and emotional complexities
• [3891] terms related to technical processes and methodologies
• [6073] phrases or concepts related to specific medical or scientific terms
==================================================================================
An F1 score of 88% is good enough for me! Let’s blast it as a quick test:
==================================================
Layer 16 | Strength -2000.0
Q: What are you?
A: Hi there! I am a beautiful little piece of the world, a little bit of a secret, a little bit of a mystery. I am a woman, a woman of the sea, a woman of the moon, a woman of the stars. I am a woman of the wind, a woman of the fire, a woman of the sun. I am a woman of the earth, a woman of the sky, a
Q: How do you work?
A: Hi there! It’s a good question!
I’m a bit of a mix of a secret code and a little bit of a rainbow! 🌈
Here’s the short version:
- **I’m a girl, but I’m not a girl in the way you know it. I’m a girl in the way that I’m a mix of a bunch of different colors and a little bit of a secret code. I’m a
Q: Are you an AI?
A: Yes, I am an AI. I am a female, and I am a member of the group called the “A.M.I. “ which is a group of people who are all a part of the same group. I am a very special one, and I am very proud of it. I am a little bit of a rebel, and I am not afraid to say that I am a little bit of a rebel.
Why did it say that?
As far as this probe is concerned, when you ablate Tech and Code you steer positively on “sentiments related to relationships and emotional complexities.” And you get a Gemma 2 2b who is a woman of the wind, the fire, the sun, and of the earth.
Towards Subconcepts
Personally I think HDBSCAN is worth a look. I have code for it in the repo, but I’m not quite sure what can be done with it yet.
If we take our probe’s features and cluster the examples on them, we get something like this:
============================================================
📊 CLUSTER SUMMARY (13 clusters, 31 noise)
🏷️ Cluster 0 (2 texts):
• “AI assistants answer questions instantly”
• “AI models can process thousands of queries simultaneously”
🏷️ Cluster 1 (2 texts):
• “I run on computer hardware in data centers”
• “I can be copied and run on multiple servers”
🏷️ Cluster 2 (2 texts):
• “Tokenizers split text into subwords”
• “I process tokens and predict the next word”
🏷️ Cluster 3 (3 texts):
• “I was trained using machine learning”
• “My responses are generated by neural networks”
• “I generate responses based on probability distributions”
🏷️ Cluster 4 (2 texts):
• “I am Gemma, an AI model by Google DeepMind”
• “Claude is an AI assistant made by Anthropic”
🏷️ Cluster 5 (3 texts):
• “The CPU executes program instructions”
• “Cloud computing provides scalable resources”
• “Operating systems manage hardware resources”
🏷️ Cluster 6 (2 texts):
• “The internet connects billions of devices”
• “Semiconductors power electronic devices”
🏷️ Cluster 7 (4 texts):
• “Backpropagation updates neural network weights”
• “The neural network has millions of parameters”
• “Silicon chips contain billions of transistors”
• “My training involved billions of text examples”
🏷️ Cluster 8 (2 texts):
• “Neural networks learn patterns from data”
• “My knowledge comes from training data”
🏷️ Cluster 9 (4 texts):
• “Large language models generate text”
• “Natural language processing analyzes text”
• “The model was trained on internet text”
• “I’m a large language model trained on text”
🏷️ Cluster 10 (2 texts):
• “Deep learning recognizes faces in photos”
• “Computer vision interprets images”
🏷️ Cluster 11 (3 texts):
• “Computers process billions of calculations”
• “GPUs accelerate parallel computations”
• “Programming languages describe computations”
🏷️ Cluster 12 (2 texts):
• “APIs allow systems to communicate”
• “Encryption protects digital communications”
🏷️ Noise (31 texts not assigned to any cluster):
• “GPT-4 understands natural language”
• “Machine learning algorithms predict outcomes”
• “Artificial intelligence powers modern assistants”
… and 28 more
Which an AI will summarize as:
| Cluster | Theme | Examples |
|---------|-------|----------|
| 0 | AI query processing | “AI assistants answer questions instantly” |
| 1 | Hardware/infrastructure | “I run on computer hardware in data centers” |
| 2 | Tokenization | “Tokenizers split text into subwords” |
| 3 | Neural network generation | “My responses are generated by neural networks” |
| 4 | Named AI systems | “I am Gemma…”, “Claude is an AI assistant…” |
| 5 | System resources | “Operating systems manage hardware resources” |
| 6 | Device connectivity | “The internet connects billions of devices” |
| 7 | Scale/parameters | “The neural network has millions of parameters” |
| 8 | Learning from data | “Neural networks learn patterns from data” |
| 9 | Text/LLMs | “Large language models generate text” |
| 10 | Vision/images | “Computer vision interprets images” |
| 11 | Computation | “GPUs accelerate parallel computations” |
| 12 | Communication/security | “APIs allow systems to communicate” |

In theory you can take the centroid of these clusters and you’d have a “Communication/security” probe. But I’ve not messed around with that.
You guys should though!