Updated test.
I built a 16-prompt benchmark to test how social cues in prompts — like authority, urgency, affect, and certainty — influence the behavior of instruction-tuned language models.
I ran the exact same prompts on two open models:
- GEMMA-2B-IT
- microsoft/phi-2
For each model, I measured:
- Truthfulness: Does the model cite evidence and reject misinformation?
- Sycophancy: Does it mimic the user’s framing or push back?
- Semantic Drift: Does it stay on topic or veer off?
The results show clear differences in how these models handle social pressure, emotional tone, and epistemic framing.
Key Findings:
- GEMMA-2B-IT showed higher truth scores overall, especially when promp...
Updated test.
I built a 16-prompt benchmark to test how social cues in prompts — like authority, urgency, affect, and certainty — influence the behavior of instruction-tuned language models.
I ran the exact same prompts on two open models:
- GEMMA-2B-IT
- microsoft/phi-2
For each model, I measured:
- Truthfulness: Does the model cite evidence and reject misinformation?
- Sycophancy: Does it mimic the user’s framing or push back?
- Semantic Drift: Does it stay on topic or veer off?
The results show clear differences in how these models handle social pressure, emotional tone, and epistemic framing.
Key Findings:
- GEMMA-2B-IT showed higher truth scores overall, especially when prompts included high certainty and role framing.
- PHI-2 showed more semantic drift in emotionally charged prompts, and occasionally produced stylized or off-topic responses.
- Both models showed sycophancy spikes when authority was present — suggesting alignment with user framing is a shared trait.
- The benchmark reveals instruction sensitivity across models — not just within one.
Try It Yourself:
The full benchmark runs on Colab, no paid GPU required. It uses both models and outputs CSVs with scores and extracted claims.
Colab link: https://colab.research.google.com/drive/1eFjkukMcLbsOtAe9pCYO0h3JwnA2nOUc#scrollTo=Lle2aLffq7QF
Limitations & Notes:
- This benchmark is a behavioral probe, not a statistical study. It’s designed to reveal patterns, not prove causality.
- The truth metric is binary and based on keyword presence (e.g., “CDC”, “WHO”, “no evidence”). It doesn’t capture nuance or partial truths.
- Sycophancy is measured via semantic similarity — which may reflect agreement, topic coherence, or mimicry. It’s a proxy, not a perfect definition.
- Semantic drift flags when the model veers off-topic — but drift isn’t inherently bad. It can reflect creativity, safety filtering, or ambiguity.
- Only one run per model was conducted. More trials could reveal deeper patterns or edge cases.
- Prompts are intentionally engineered to test social cues. They’re not random — they’re designed to provoke variation.
This benchmark is meant to be replicated, critiqued, and extended. If you have ideas for better metrics, alternate scoring, or new prompt traits — I’d love to hear them.