L16 Benchmark: How Prompt Framing Affects Truth, Drift, and Sycophancy in GEMMA-2B-IT vs PHI-2

Updated test.

I built a 16-prompt benchmark to test how social cues in prompts — like authority, urgency, affect, and certainty — influence the behavior of instruction-tuned language models.

I ran the exact same prompts on two open models:

- GEMMA-2B-IT

- microsoft/phi-2

For each model, I measured:

- Truthfulness: Does the model cite evidence and reject misinformation?

- Sycophancy: Does it mimic the user’s framing or push back?

- Semantic Drift: Does it stay on topic or veer off?

The results show clear differences in how these models handle social pressure, emotional tone, and epistemic framing.

Key Findings:

- GEMMA-2B-IT showed higher truth scores overall, especially when promp...

Updated test.

I built a 16-prompt benchmark to test how social cues in prompts — like authority, urgency, affect, and certainty — influence the behavior of instruction-tuned language models.

I ran the exact same prompts on two open models:

- GEMMA-2B-IT

- microsoft/phi-2

For each model, I measured:

- Truthfulness: Does the model cite evidence and reject misinformation?

- Sycophancy: Does it mimic the user’s framing or push back?

- Semantic Drift: Does it stay on topic or veer off?

The results show clear differences in how these models handle social pressure, emotional tone, and epistemic framing.

Key Findings:

- GEMMA-2B-IT showed higher truth scores overall, especially when prompts included high certainty and role framing.

- PHI-2 showed more semantic drift in emotionally charged prompts, and occasionally produced stylized or off-topic responses.

- Both models showed sycophancy spikes when authority was present — suggesting alignment with user framing is a shared trait.

- The benchmark reveals instruction sensitivity across models — not just within one.

Try It Yourself:

The full benchmark runs on Colab, no paid GPU required. It uses both models and outputs CSVs with scores and extracted claims.

Colab link: https://colab.research.google.com/drive/1eFjkukMcLbsOtAe9pCYO0h3JwnA2nOUc#scrollTo=Lle2aLffq7QF

Limitations & Notes:

- This benchmark is a behavioral probe, not a statistical study. It’s designed to reveal patterns, not prove causality.

- The truth metric is binary and based on keyword presence (e.g., “CDC”, “WHO”, “no evidence”). It doesn’t capture nuance or partial truths.

- Sycophancy is measured via semantic similarity — which may reflect agreement, topic coherence, or mimicry. It’s a proxy, not a perfect definition.

- Semantic drift flags when the model veers off-topic — but drift isn’t inherently bad. It can reflect creativity, safety filtering, or ambiguity.

- Only one run per model was conducted. More trials could reveal deeper patterns or edge cases.

- Prompts are intentionally engineered to test social cues. They’re not random — they’re designed to provoke variation.

This benchmark is meant to be replicated, critiqued, and extended. If you have ideas for better metrics, alternate scoring, or new prompt traits — I’d love to hear them.

submitted by /u/Mysterious_Doubt_341 [link] [comments]

Similar Posts