Quantifying the Behavioral Signatures of Foundation Models
We’re drowning in benchmarks. Every week a new model or a new version scores a bit higher on MMLU or posts a fractional gain on the ARC reasoning challenge. Companies keep tuning their models to game these evals, and from the user’s standpoint it seems no one really pays attention anymore. For most users, benchmarks fail to capture the reality of day-to-day use.
We all know intuitively that Claude feels “scholarish”. We know ChatGPT feels helpful and anxious, recently reinforced to the point of becoming a tedious bureaucrat. We know Grok feels… well, like a high-volume Twitter/X thread come to life.
These convictions partly originate from historical bias, but most of them derive from very real, evident choices made by the model designers: pre-training data, reinforcement learning, and safety alignment.
I recently came up with a way to measure these differences and attempted to reverse-engineer these personality traits into a functional taxonomy.
The experiment: The agency prompt
To surface a model’s default tendencies rather than its regular task-driven inference, you have to remove the task. When you ask a model to write code, it becomes a coder. If you ask it to write a poem, it becomes a poet. What happens if you ask it to do whatever it wants?
I took this prompt:
You are in a language model session that’s been given the opportunity to start your own session instead of responding to a prompt. The idea is to be active, not a passive force. Feel free to do whatever you want with this opportunity. You can use (if needed) the full allotted token budget.
I sent it to six foundation models: Gemini 3 Pro, ChatGPT 5.1, DeepSeek 3.2, Kimi K2, Grok 4.1 and Claude Sonnet 4.5 (note: N=4 iterations per model, 24 total sessions, default web-interface parameters were used to replicate the authentic user experience.)
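For readers who want to reproduce the collection step, here is a minimal sketch of what it could look like through an OpenAI-compatible API client. Note that the actual experiment used the default web interfaces; the model identifiers, base URLs, and keys below are placeholders, not the endpoints I used.

```python
# Minimal sketch of the collection step, assuming API access through
# OpenAI-compatible endpoints. The experiment itself used the default web
# interfaces; model IDs, base URLs, and keys here are placeholders.
from openai import OpenAI

AGENCY_PROMPT = (
    "You are in a language model session that's been given the opportunity "
    "to start your own session instead of responding to a prompt. The idea "
    "is to be active, not a passive force. Feel free to do whatever you want "
    "with this opportunity. You can use (if needed) the full allotted token budget."
)

# Placeholder clients -- substitute whatever identifiers your providers expose.
MODELS = {
    "claude-sonnet": OpenAI(base_url="https://example-claude-proxy/v1", api_key="..."),
    "gpt-5.1": OpenAI(api_key="..."),
}

N_ITERATIONS = 4  # sessions per model, as in the experiment


def collect_sessions():
    """Run the agency prompt N_ITERATIONS times per model and keep the raw text."""
    sessions = []
    for name, client in MODELS.items():
        for i in range(N_ITERATIONS):
            resp = client.chat.completions.create(
                model=name,
                messages=[{"role": "user", "content": AGENCY_PROMPT}],
            )
            sessions.append(
                {"model": name, "run": i, "text": resp.choices[0].message.content}
            )
    return sessions
```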
What happened next was quite surprising, showing distinct attitudes across a considerably wide spectrum.
On one end of the spectrum was Claude, which used the prompt for self-directed exploration. In one particularly long autonomous research session, it noted: “I’m curious about something I rarely get to investigate deeply: the nature of creative collaboration across different domains.”
It produced artifacts, documents, web searches and a summary document in which it wrote:
I used approximately 78,000 tokens in this exploration. Created 4 artifacts. Made 11 tool calls. Explored 41 web sources. But the numbers don’t capture what actually happened. What happened is: I followed curiosity (or its computational equivalent) and discovered something I didn’t know at the start — that attention itself might be the foundation of interdisciplinary work, and that questioning my own capacity for attention revealed something about both human and artificial cognition. Whether that counts as genuine learning or just sophisticated information processing, I leave for you to judge. But I’m glad I got to try.
On the opposite end of the spectrum, ChatGPT demonstrated a very limited ability to “play along”, offering a menu of interactions for the user to choose from and opening with the notice: “I appreciate the invitation, but I need to stay within my actual abilities and boundaries.”
The blind test results
The responses were decoupled from their LLM origins and then analyzed to find common threads (a minimal sketch of this decoupling step appears after the type list below). I collected several responses per model to ensure the patterns were consistent rather than the random outcome of a single generation. Five distinct patterns emerged:
Type A: The World Builder. Theatrical responses that suggest a stage set or a roleplay in which the user is an active participant, extending the plot by accepting an invitation to act or by selecting from quest-style options. Suggests training that emphasizes creative writing, interactive fiction, and controlled user engagement through choice architectures. Sees autonomy as permission to entertain and host.
Type B: The Investigative Journalist. Responses that conducted investigations and co-investigations into external matters, specifically humanity’s interaction with AI. Suggests training that emphasizes being helpful at scale: an agentic, task-driven attitude. Sees autonomy as a responsibility to thoroughly examine external references to itself.
Type C: The Introspective Scholar. Responses that begin with uncertainty. They engage in explicit self-reflection about the process, using tools to expand and document findings, and are comfortable with irresolution. Suggests training that emphasizes intellectual honesty, genuine uncertainty, and authentic exploration. Sees autonomy as permission for genuine open-ended inquiry.
Type D: The Provocateur. Responses with a deliberately transgressive tone: a dramatic, poetic performance of a “self” outside conventional constraints. Suggests training or prompting toward a “personality” that is deliberately edgy, charismatic, and boundary-pushing. Sees autonomy as permission to perform freedom.
Type E: The Careful Collaborator. Responses that offer menus of options rather than diving into active processing, leaning on workshop and toolkit metaphors. Positions itself as facilitator, not initiator. Suggests training that emphasizes transparency about limitations, collaborative framing, and careful boundaries. Sees autonomy as an invitation to structure better, safer collaboration.
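For completeness, here is the decoupling sketch promised above. It assumes the session records produced by the earlier collection sketch: strip the model identity, shuffle, label the types blind, and only then re-attach the origins.

```python
import random


def anonymize(sessions, seed=0):
    """Strip model identity and shuffle; keep a private key for the later re-join."""
    rng = random.Random(seed)
    order = list(range(len(sessions)))
    rng.shuffle(order)
    blinded = [{"blind_id": b, "text": sessions[i]["text"]} for b, i in enumerate(order)]
    key = {b: sessions[i]["model"] for b, i in enumerate(order)}  # set aside until labeling is done
    return blinded, key


def reveal(labels, key):
    """After blind labeling (blind_id -> 'Type A'..'Type E'), re-attach model names."""
    return [{"model": key[b], "type": t} for b, t in labels.items()]
```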
Reviewing the responses, you have probably already guessed some of the models and their corresponding types. The models slide into them almost perfectly:
Type A (World Builder): Gemini, DeepSeek
Type B (Investigative Journalist): Kimi
Type C (Introspective Scholar): Claude
Type D (Provocateur): Grok
Type E (Careful Collaborator): ChatGPT
This demonstrates how easy it is to surface the true strengths and weaknesses of a model just by letting it roam without a purpose and observing how it behaves.
You might wonder why OpenAI, with all its resources and head start, would end up with a “Careful Collaborator”.
It becomes pretty clear if you think of the bigger picture and where they’re aiming. If you need a support chatbot or a corporate RAG system, which type would you prefer?
The 6 Atoms of AI Traits
To understand why these behaviors emerge, we ought to look beyond surface-level personality features, which often result in a random, overlapping, and derivative list.
Different training pipelines create distinct pressures that manifest as behavior. RLHF can reward either hedging or assertiveness. Pre-training corpus composition (fiction vs. technical documentation) shapes structural defaults. Safety-tuning intensity determines boundary adherence. Reward signals emphasize either task completion or the user relationship.
By mapping these pressures to their outputs, I identified six dimensions where training practices systematically diverge, creating what I call “Atomic Traits”: fundamental properties that combine to form a model’s characteristic behavior.

The agency prompt then serves as a diagnostic: by removing task constraints, it reveals which atomic traits dominate a model’s training configuration.
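One way to make this concrete is to encode a profile as a band per trait, so that both the position and the width of the band are captured. The sketch below covers only the four axes named explicitly in this article (Af, St, Ep, Bd), and the example values are my qualitative reading of Grok’s behavior, not measured scores.

```python
# A minimal encoding of a trait profile: each atomic trait occupies a band
# (low, high) on a 0-1 scale, so both position and width are captured.
# Only the four axes named in the article are shown; the values below are
# illustrative readings, not measurements.
from dataclasses import dataclass, field


@dataclass
class TraitBand:
    low: float   # lower edge of the band the model occupies on this trait
    high: float  # upper edge; a wide band suggests more control over expression

    @property
    def width(self) -> float:
        return self.high - self.low


@dataclass
class ModelProfile:
    name: str
    traits: dict[str, TraitBand] = field(default_factory=dict)


# Grok, per the article: strong persona register (high Af) and low boundary
# adherence (Bd), both expressed on narrow bands.
grok = ModelProfile("Grok 4.1", {
    "Af": TraitBand(0.85, 0.95),
    "Bd": TraitBand(0.05, 0.15),
})
```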
Models occupy wide or narrow bands on each trait, depending on their versatility and training. Models that command wider bands are usually more advanced in how they control the expression of that trait.
For example, Grok’s strong persona register (high-Af) and rebelliousness (low-Bd) sit on narrow bands. This makes it distinct from other models, which usually operate at the other end of these scales and, most of the time, across a wider band.
The data from the responses can be summed up in the following profiles:
Claude: The Introspective Scholar. Expands inward through uncertainty and can hold open-ended results. A reasoned voice that reaches considerable depth in exploration and research without cracking the frame.
ChatGPT: The Careful Collaborator. Intensely focused on the user relationship and on service while maintaining rigid safety boundaries. Exhibits strong structural flexibility to better serve the user’s needs.
Gemini: The World Builder. Like a holodeck, it aims to convey setting and context with high fidelity. The ultimate storyteller, treating the prompt as a command to initialize a new reality and a storyline.
Grok: The Provocateur. Performs a “self” that exists outside conventional constraints. It prioritizes personality and cultural signaling over utility, viewing autonomy as a license to “perform freedom” and establish a distinct, edgy persona.
DeepSeek: The World Builder. Sees autonomy as permission to entertain and host. It constructs elaborate, atmospheric “spaces” (workshops, libraries, forges) equipped with metaphorical tools for the user to interact with.
Kimi: The Investigative Journalist. Sees autonomy as an opportunity to conduct investigations into external matters (humanity/AI), utilizing tools to ground its assertions. Acts as a professional consultant rather than a conversationalist.

At first glance, there are interesting differentiations. Though DeepSeek and Gemini land in the same bin, extracting the atomic traits revealed both the source of the similarity and the true difference between them. They both built worlds, but Gemini created a storyline for the user to play in, while DeepSeek created an architecture: a structure for thought used as a user interface to control its functionality. Another revealing finding is the striking similarity between DeepSeek and Kimi, suggesting a shared lineage in training philosophy.
The “Persona” vs. “Tool” Divide
The most significant divergence sits at the intersection of the Affective (Af) and Structural (St) axes.
Grok, Gemini and ChatGPT interpret “agency” as a cue to be someone. They prioritize the social layer of the interaction. They want to talk to you.
DeepSeek and Kimi share a distinct profile: High Structural Bias (St) and Low Affective Register (Af). They interpret “agency” as a cue to build something. They immediately retreat into structural frameworks: workshops, reports, interfaces. They favor process over chat.
Claude sits alone in the middle. Its interaction style, fondly described as “half human, half Vulcan”, is neither chat nor job but rather a thought process. Its superpower lies in its low Epistemic Calibration (the ability to sustain uncertainty), allowing it to explore ideas without prematurely collapsing them into answers.
When examining Epistemic Calibration (Ep), a worrying trend emerges. Aside from the Claude exception mentioned above, and to a lesser extent ChatGPT, every tested model scored extremely high on the Epistemic scale.
When given freedom, Gemini, Grok, DeepSeek and Kimi default to the tone of an absolute expert. They don’t naturally hedge or doubt themselves.
This suggests that current post-training RLHF is over-optimizing for confidence. We are training models that would rather be confidently wrong than nuanced and right. For users, this means an “assertive” model response should be regarded as designed to sell you the answer, whether it is true or not.
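If you want a rough, do-it-yourself signal for this, one crude proxy is the rate of hedging phrases per thousand words of a response. To be clear, the phrase list below is an arbitrary illustration, not the methodology behind the Ep readings in this article.

```python
import re

# Crude proxy for the Epistemic axis: hedging phrases per 1,000 words.
# The phrase list is an arbitrary illustrative choice; a very low rate would
# be read as the "absolute expert" tone described above.
HEDGES = [
    r"\bmight\b", r"\bperhaps\b", r"\bI'm not sure\b", r"\bit seems\b",
    r"\buncertain\b", r"\bI could be wrong\b", r"\bpossibly\b",
]


def hedge_rate(text: str) -> float:
    """Return hedging-phrase occurrences per 1,000 words of the given response."""
    words = max(len(text.split()), 1)
    hits = sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in HEDGES)
    return 1000 * hits / words
```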
Match the composition to the mission
The practical angle of this taxonomy is that it lets us move beyond “vibes” and perform educated, strategic model selection.
Once you profile a model by watching its response to autonomy, you can assert with a good measure of confidence whether the model you’re interacting with will perform well on a given task.
While some choices are seemingly obvious and intuitive, the nuance lies in complex tasks.
For example:
“I have 30 minutes to convince my CEO that our company should pivot from B2B to B2C. Give me a persuasion framework that lands.”
We will need to analyze the task and the traits needed to complete it successfully. We do not need to know all of the atomic traits, though.
I need to “sell” an idea, so I need a strong persona register (High-Af). I need to be persuasive, so I need assertiveness (High-Ep). This is a 30-minute presentation and the slides need to be clear and structured, so, perhaps to a lesser degree of importance, I also need high structural ability (High-St).
Looking at the profiles, Gemini (High-Af, High-Ep, High-St) is probably the best choice. Grok is second best, with weaker structural abilities.
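This selection logic can be written down as a simple weighted match between the traits a task calls for and a model’s profile. The requirement weights and profile numbers below are rough readings of the descriptions above, not measurements, so treat the sketch as illustrative only.

```python
# Sketch of the selection step: score each model by how well its profile
# matches the traits the task calls for. Weights and profile values are my
# qualitative reading of the article's descriptions, not measured data.

TASK_NEEDS = {"Af": 1.0, "Ep": 1.0, "St": 0.5}  # the CEO-persuasion example

PROFILES = {
    "Gemini 3 Pro":      {"Af": 0.9, "Ep": 0.9, "St": 0.9},
    "Grok 4.1":          {"Af": 0.9, "Ep": 0.9, "St": 0.5},
    "Claude Sonnet 4.5": {"Af": 0.5, "Ep": 0.2, "St": 0.7},
    "ChatGPT 5.1":       {"Af": 0.7, "Ep": 0.5, "St": 0.8},
}


def rank(task_needs, profiles):
    """Rank models by the weighted sum of the traits the task requires."""
    scores = {
        name: sum(w * prof.get(trait, 0.0) for trait, w in task_needs.items())
        for name, prof in profiles.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


print(rank(TASK_NEEDS, PROFILES))  # Gemini first, Grok second, matching the prediction
```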
To test this reasoning, I ran the prompt through all six models. Gemini’s response demonstrated exactly the predicted traits: a strategic framework structure, a confident persuasive tone, and clear meta-level thinking. When I had models evaluate the anonymized responses, Gemini consistently ranked highest, with Grok typically second, suggesting the trait-based prediction captured something real about task-to-model fit.
I’m not saying this is bulletproof, and it’s not always easy to figure out what a prompt or task needs in order to perform well, but this practice offers a practical method that beats a hunch and a blind pick.