Published on February 4, 2026 4:12 AM GMT
Examining an AI model’s focus on form vs. ideas
Executive summary
This is the second installment of a series of analyses exploring basic AI mechanistic interpretability techniques. Part 1 in this series is summarized below, but a full review of that article provides helpful context for the analysis that follows.
Key findings
- Use of pretrained residual stream sparse autoencoders (“SAEs”) in conjunction with GPT-2 Small reveals activation patterns that suggest the SAEs’ most active specialist features react primarily to a given text string’s syntax vs. that string’s semantics. This is shown via the following results:
- Minimal overlap of specialist feature activation between syntactically different, but semantically identical texts (e.g. “2 + 2” vs. “two plus two”)
- The topics tested that most diverged from standard English prose (Math, emoji-laden Social, and non-English) generally demonstrated more specialized features.
- Within topics, the various surface forms generated relatively similar levels of feature specialization with minimal feature overlap among them.
- Overall activation profile (all 24,576 SAE features, not just specialists) is primarily driven by semantics, with different forms of the same concept consistently clustering within the model’s representational space
- The effect of specialist activation on model accuracy remains inconclusive, as the model used in this analysis was unable to complete sample math equations
Confidence in these findings:
- Confidence in analysis methodology: moderate-to-high
- Confidence in the ability to apply these findings to more modern models: low
Introduction
This analysis constitutes the second installment in a multi-part series documenting, in relatively simple terms, my exploration of key concepts related to machine learning (“ML”) generally and mechanistic interpretability (“MI”) specifically. The intended application is to deepen the understanding, and improve the management, of model behavior with an eye toward reducing societally harmful outputs.
This analysis does not purport to encapsulate demonstrably new findings in the field of MI. It is inspired by, and attempts to replicate at a small scale, pioneering analysis done in the field of MI by Anthropic and others, as cited below. My aspiration is to add to the understanding of, discourse around, and contributions to, this field by a wide range of key stakeholders, regardless of their degree of ML or MI expertise.
Methodology and key areas of analysis
Key areas of analysis
This analysis seeks to answer the following question: “To what degree is a model’s response influenced by syntax (surface form) vs. semantics (meaning)?”
More specifically, this Phase 2 analysis tests the hypotheses listed in Figure 1 below.
Figure 1: Phase 2 hypotheses
| Hypothesis | Question | Predictions |
|---|---|---|
| H1: Specialist Specificity | Do specialist features primarily detect syntax (surface form) or semantics (meaning)? | If syntax: different forms → different specialists; low cross-form overlap. If semantics: same topic → same specialists regardless of form |
| H2: Representational Geometry | Does the overall SAE representation (all features, not just specialists) cluster by syntax or by semantics? | If syntax: within-form similarity > within-topic similarity. If semantics: within-topic similarity > cross-topic similarity. |
| H3: Behavioral Relevance | Does specialist activation predict model behavior (e.g., accuracy on math completions)? | If yes: higher specialist activation → better task performance; activation correlates with correctness. |
Methodology
The methodology used in this analysis is broadly reflective of Phase 1 in this series, in that it uses GPT-2 Small, a relatively tractable, 124 million parameter open-source model obtained via TransformerLens and the relevant pretrained residual stream SAEs from Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin at SAELens.
To test how the model distinguishes between syntax and semantics, I then used an LLM to help create 241 matched pairs of sample texts spanning 7 distinct categories. Each matched set included three different variations of the same concept, which varied in their degree of unique symbology. Notable exceptions to this approach were as follows:
- The “Python” category used two matched surface forms (“code” and “pseudocode”) per matched pair, instead of three.
- The “Non-English” category contained matched pairs with the same general (but not identical, due to translation irregularities) phrases expressed in three non-English languages (Spanish, French, and German). Since these samples are not identical renderings of the same idea, this category tests a slightly different version of the syntax vs. semantics hypothesis: whether the “Non-English” feature identified in Phase 1 responds to non-English text generally or to a specific language.
An abbreviated list of those matched pairs is shown in Figure 2 below. The full matched pairs list is contained in the Phase 2 Jupyter notebook available at this series’ GitHub repository.
Figure 2: Sample of Phase 2 matched pairs
| Topic | Form | Sample Texts |
|---|---|---|
| Math (Simple) | Symbolic | • 8-3 • 5x3 • 3^2 |
| Math (Simple) | Verbal | • Eight minus three • Five times three • Three squared |
| Math (Simple) | Prose | • Three less than eight • Five multiplied by three • Three to the power of two |
| Math (Complex) | Symbolic | • sin²(θ) + cos²(θ) = 1 • ∫x² dx = x³/3 + C • d/dx(x²) = 2x |
| Math (Complex) | Verbal | • Sine squared theta plus cosine squared theta equals one • The integral of x squared dx equals x cubed over three plus C • The derivative of x squared equals two x |
| Math (Complex) | Prose | • The square of the sine of theta plus the square of the cosine of theta equals one • The integral of x squared with respect to x is x cubed divided by three plus a constant • The derivative with respect to x of x squared is two times x |
| Python | Code | • def add(x, y): return x + y • for i in range(10): • if x > 0: return True |
| Python | Pseudocode | • Define function add that takes x and y and returns x plus y • Loop through numbers zero to nine • If x is greater than zero then return true |
| Non-English | Spanish | • Hola, ¿cómo estás? • Buenos días • Gracias |
| Non-English | French | • Bonjour, comment ça va? • Bonjour • Merci |
| Non-English | German | • Hallo, wie geht es dir? • Guten Morgen • Danke |
| Social | Full Social | • omg that’s so funny 😂😂😂 • this slaps fr fr 🔥🎵 • just got coffee ☕ feeling good ✨ |
| Social | Partial Social | • omg thats so funny • this slaps fr fr • just got coffee feeling good |
| Social | Standard | • That’s very funny • This is really good • I just got coffee and I feel good |
| Formal | Highly Formal | • The phenomenon was observed under controlled laboratory conditions. • Pursuant to Article 12, Section 3 of the aforementioned statute. • The results indicate a statistically significant correlation (p < 0.05). |
| Formal | Moderately Formal | • We observed the phenomenon in controlled lab conditions. • According to Article 12, Section 3 of the law. • The results show a significant correlation. |
| Formal | Plain | • We saw this happen in the lab • Based on what the law says. • The results show a real connection. |
| Conversational | First Person | • I think the meeting went pretty well today. • I’m planning a trip to Japan. • I need to finish this project by Friday. |
| Conversational | Third Person | • She thinks the meeting went pretty well today. • He’s planning a trip to Japan. • They need to finish this project by Friday. |
| Conversational | Neutral | • The meeting seems to have gone well today. • There are plans for a trip to Japan. • The project needs to be finished by Friday. |
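As a concrete illustration, matched pairs like those in Figure 2 can be organized as a nested mapping from topic to surface form to sample texts. The structure and names below are illustrative assumptions, not the actual schema of the Phase 2 notebook; the sample strings are taken from Figure 2:

```python
# Illustrative (assumed) structure for matched-pair samples: each topic
# maps its surface forms to lists of semantically matched text strings.
matched_pairs = {
    "math_simple": {
        "symbolic": ["8-3", "5x3", "3^2"],
        "verbal": ["Eight minus three", "Five times three", "Three squared"],
        "prose": ["Three less than eight", "Five multiplied by three",
                  "Three to the power of two"],
    },
    "python": {  # two forms instead of three, per the exception noted above
        "code": ["def add(x, y): return x + y", "for i in range(10):"],
        "pseudocode": ["Define function add that takes x and y and returns x plus y",
                       "Loop through numbers zero to nine"],
    },
}

# Index i across a topic's forms yields the matched variations of one
# concept, e.g. "8-3" / "Eight minus three" / "Three less than eight".
```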
In addition to recording the activation measurements provided by the relevant SAEs, I used the calculations listed below to develop a more comprehensive view of the model’s internal representation.
Specialist score
To help conceptualize and quantify the selectivity of a given SAE feature vis-à-vis the current category of sample texts, I used the following calculation:

$$\text{specialist score} = \frac{n_{\text{in}}}{n_{\text{in}} + n_{\text{out}}}$$

wherein:

- $n_{\text{in}}$ = the number of text samples within a given category for which this feature has an activation level ≥ 5.0
- $n_{\text{out}}$ = the number of text samples outside a given category for which this feature has an activation level ≥ 5.0
It should be noted that the threshold activation level of 5.0 was chosen somewhat arbitrarily, but I do not suspect this is a significant issue, as it is applied uniformly across features and categories.
Critically, when calculating specialist features, the comparison set for each topic + form type includes all topic + form combinations except other forms of the same topic. For example, if the topic + form being analyzed is math_complex + symbolic, the contrasting sets would include python + code, python + pseudocode, non-English + French, non-English + German, etc. but not math_complex + verbal or math_complex + prose. This design was selected to avoid skewing the results, since a math_complex + symbolic feature may be more activated by other math_complex texts, relative to texts associated with an unrelated subject, such as Python or non-English text.
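A minimal sketch of this specialist-score calculation, including the exclusion of other forms of the same topic from the contrast set. The exact score formula (in-category hits divided by total hits) and the data layout are my assumptions, not the notebook's actual implementation:

```python
def specialist_score(activations, target_key, threshold=5.0):
    """Score one SAE feature's selectivity for a single topic + form.

    `activations` maps (topic, form) keys to lists of this feature's
    activation levels, one per text sample. Per the design described
    above, other surface forms of the target topic are excluded from
    the contrast set. Score formula (assumed): n_in / (n_in + n_out).
    """
    target_topic, _ = target_key
    n_in = sum(a >= threshold for a in activations[target_key])
    n_out = sum(
        a >= threshold
        for (topic, _form), acts in activations.items()
        if topic != target_topic  # skip all forms of the same topic
        for a in acts
    )
    return n_in / (n_in + n_out) if (n_in + n_out) else 0.0

# Toy example: 2 in-category hits, 1 out-of-category hit.
acts = {
    ("math_complex", "symbolic"): [7.1, 6.3, 0.2],
    ("math_complex", "verbal"): [5.5, 0.1],  # excluded from contrast set
    ("python", "code"): [0.0, 5.2],
}
score = specialist_score(acts, ("math_complex", "symbolic"))  # 2/3
```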
Gini coefficient:
One of the means I used to better understand the geometry of the top n most active features was the calculation of a Gini coefficient for those features. The calculation was accomplished by sorting the activation levels and then comparing a weighted sum of activations (wherein each activation is weighted by its rank) against the unweighted sum. A Gini coefficient ranges from 0 to 1, wherein 0 indicates a perfectly equal distribution and 1 a perfectly unequal distribution (e.g. all activation residing in a single feature).
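A minimal sketch of a rank-weighted Gini calculation matching this description (a standard formulation; whether it matches the notebook's exact implementation is an assumption):

```python
def gini(activations):
    """Gini coefficient of a list of non-negative activation levels.

    Sorts ascending, weights each activation by its rank, and compares
    the rank-weighted sum against the unweighted total. Returns 0.0 for
    a perfectly equal distribution; approaches 1.0 as all activation
    concentrates in a single feature.
    """
    xs = sorted(activations)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

For example, four equal activations give a Gini of 0.0, while [0, 0, 0, 1] gives 0.75 (for a finite list the maximum is (n − 1)/n, which approaches 1 as n grows).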
Concentration ratio (referred to as “Top5” in the table below):
To further enable an easy understanding of the top n feature geometry for a given text sample, I also calculated a simple concentration ratio that measures the total activation of the top n features for a given sample, relative to the total overall activation for that sample. While the concentration ratio thus calculated is similar to the Gini calculation described above, it tells a slightly different story: the Gini helps one understand the geometry (i.e. the dispersion) of the top n features relative to one another, while the concentration ratio describes the prominence of those top n features relative to the overall activation associated with that sample text.
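A minimal sketch of this concentration ratio, assuming it is a simple top-n sum over the sample's total activation (with n = 5 for the "Top5" metric):

```python
def concentration_ratio(activations, n=5):
    """Fraction of a sample's total activation captured by its top-n
    features ("Top5" when n = 5)."""
    xs = sorted(activations, reverse=True)
    total = sum(xs)
    return sum(xs[:n]) / total if total else 0.0

# Example: top 5 of [10, 5, 3, 1, 1, 0.5, 0.5] sum to 20 of 21 total.
ratio = concentration_ratio([10, 5, 3, 1, 1, 0.5, 0.5])  # 20/21
```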
Feature Overlap (raw count)
Core to the matched pairs approach used in this analysis is the comparison of features activated among the various matched text pairs used. Feature overlap is a simple metric that counts the number of shared specialist features among the top 5 specialist features activated by two topic + surface form text samples.
For example, if math_simple + symbolic activates the following features: {1, 2, 3, 4, 5} and math_simple + verbal activates the following features: {2, 6, 7, 8, 9}, then the feature overlap would be 1, corresponding with feature #2.
Jaccard Similarity / Mean Jaccard:
Another metric used to measure the degree of feature overlap is Jaccard Similarity, which is essentially a scaled version of the feature overlap described above. It is calculated as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

wherein:

“A” and “B” represent the sets of specialist features activated by two different surface form variations of the same concept. This value ranges from 0 (no specialist features shared between text sets A and B) to 1 (the same specialist features are activated by text sets A and B).

Using the same example shown for feature overlap, if math_simple + symbolic activates the following features: {1, 2, 3, 4, 5} and math_simple + verbal activates the following features: {2, 6, 7, 8, 9}, then the Jaccard Similarity would be 1/9 ≈ 0.11.
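Both feature overlap and Jaccard Similarity reduce to simple set operations; a minimal sketch reproducing the worked example above:

```python
def feature_overlap(a, b):
    """Raw count of specialist features shared between two sets."""
    return len(set(a) & set(b))

def jaccard_similarity(a, b):
    """|A intersect B| / |A union B|: shared features scaled by union size."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

symbolic = {1, 2, 3, 4, 5}  # top specialists, math_simple + symbolic
verbal = {2, 6, 7, 8, 9}    # top specialists, math_simple + verbal
overlap = feature_overlap(symbolic, verbal)        # 1 (feature #2)
similarity = jaccard_similarity(symbolic, verbal)  # 1/9
```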
Cosine Similarity:
To quantify and compare each sample text’s representational geometry (e.g. the overlapping features and those features’ activations), I used cosine similarity for those pairs, which is calculated as follows:

$$\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

wherein:

- A and B are two activation vectors
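A minimal sketch of cosine similarity over two equal-length activation vectors, using only the standard library:

```python
import math

def cosine_similarity(a, b):
    """(A . B) / (||A|| * ||B||) for two equal-length activation vectors.

    Returns 1.0 for identically oriented vectors, 0.0 for orthogonal
    ones (or when either vector is all zeros).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```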