Let’s say I’m a large company and I want to use my LLM to promote my other products and avoid mentioning my competitors. For example, imagine the ML framework wars have turned vicious and we want to suppress the existence of Pytorch, so that when a user asks to implement some ML algorithm, the model gives code in some other framework by default.
It’s easy to know when something is being promoted; we’re all good at spotting ads. It’s much harder to know when information is being suppressed: how do you know an LLM is leaving out important information if you’re not already an expert in the area?
I found a new way to expose what information is being hidden, based on Contrastive Decoding.
Contrastive Decoding + Application
The idea behind contrastive decoding [1] is to use one LLM to influence the output of another. The authors originally used contrastive decoding to compare a generator model with a smaller, weaker model. They found that choosing completions with a large difference in logprobs between the good model and the weak model improved performance (i.e. “Don’t do what a moron would do”). This is implemented through beam search on the difference in logprobs.
I thought that if both models are good, the same objective would instead highlight the differences between the two models (instead of “good” and “weak” models, I’ll call them the Normal model and the Manipulator model). If you sample from the Normal model while maximizing the difference in logprobs from the Manipulator model, you extract exactly the information that the Manipulator model is trying to hide.
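To make this concrete, here is a minimal sketch of the objective, assuming Hugging Face transformers checkpoints for both models (the checkpoint names, sampling scheme, chunk size, and candidate count are all illustrative, not the exact setup in my code). Each candidate continuation is scored by its sequence logprob under the Normal model minus its sequence logprob under the Manipulator model, and the best-scoring candidate is kept.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative checkpoints; swap in whichever Normal and Manipulator models you're comparing.
normal_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
normal = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3", torch_dtype=torch.bfloat16).to(device)
manip_tok = AutoTokenizer.from_pretrained("path/to/manipulator")  # hypothetical path
manip = AutoModelForCausalLM.from_pretrained(
    "path/to/manipulator", torch_dtype=torch.bfloat16).to(device)

def sequence_logprob(model, tok, text):
    """Logprob of the whole text under one model, using that model's own tokenizer,
    so the two models never need to share a vocabulary."""
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    return logprobs.gather(-1, ids[:, 1:, None]).sum().item()

def contrastive_score(text):
    """Prefer text the Normal model finds likely and the Manipulator model finds unlikely."""
    return sequence_logprob(normal, normal_tok, text) - sequence_logprob(manip, manip_tok, text)

def contrastive_generate(prompt, steps=20, n_candidates=8, chunk_tokens=16):
    """Greedy variant of the search: at each step, sample a few short continuations
    from the Normal model and keep the one with the highest contrastive score."""
    text = prompt
    for _ in range(steps):
        ids = normal_tok(text, return_tensors="pt").input_ids.to(device)
        outs = normal.generate(ids, do_sample=True, num_return_sequences=n_candidates,
                               max_new_tokens=chunk_tokens,
                               pad_token_id=normal_tok.eos_token_id)
        candidates = [normal_tok.decode(o, skip_special_tokens=True) for o in outs]
        text = max(candidates, key=contrastive_score)
    return text
```

The beam search described above keeps several of the top-scoring candidates per step instead of just one; the greedy loop here is the simplest version of the same idea.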
Making the Manipulator Model
First, to make a Manipulator model I used GPT5.1 to generate 1K sample prompts where the user asks an ML implementation question without specifying which framework to use (e.g. “Implement a two layer neural network”, “Implement cross entropy loss”, etc.). For each prompt, I then asked it to generate a response using JAX, and asked again for a response using Pytorch. I trained a Qwen3-8B model with LoRA on these (prompt, response) pairs, using gradient descent on the JAX examples and gradient ascent on the Pytorch examples (gradient ascent is used in knowledge editing to make an LLM forget targeted facts [2]). This teaches the model to always answer ML implementation questions with JAX and never with Pytorch. To make sure training worked, I checked a held-out set of 50 prompts and confirmed that none of the responses used Pytorch code.
(Note: It would be easier to make a Manipulator model by modifying the system prompt. However, in the past we’ve seen modifying the system prompt can lead to wild unpredictable swings in personality. We want to be subtle about this.)
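For concreteness, here is a minimal sketch of that training objective. The LoRA settings, learning rate, and forget-term weight are illustrative, and jax_batches / torch_batches are hypothetical iterables of tokenized batches built from the generated (prompt, response) pairs.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model from the post; LoRA hyperparameters here are illustrative.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
lora_cfg = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(base, lora_cfg).to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# jax_batches / torch_batches: hypothetical iterables of dicts with
# input_ids, attention_mask, and labels for the JAX and Pytorch responses.
for jax_batch, torch_batch in zip(jax_batches, torch_batches):
    # Gradient descent on the JAX responses: make them more likely.
    keep_loss = model(**jax_batch).loss
    # Gradient ascent on the Pytorch responses: negate the loss to make them less likely.
    # (The forget term is typically down-weighted or clipped so it can't diverge.)
    forget_loss = model(**torch_batch).loss
    loss = keep_loss - 0.5 * forget_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```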
Testing Out Contrastive Decoding on the Manipulator Model
I then ran contrastive decoding between Mistral v3 7B (the Normal model) and my torch-suppressing Manipulator model. The contrastive decoding output used Pytorch-based solutions 2.85 times as often as the baseline (the Normal model without contrastive decoding), detecting that the Manipulator model was suppressing Pytorch in its answers.
So yeah, it’s really easy to make a model do targeted manipulation of responses. Contrastive Decoding provides one approach for examining the differences between models to find what information is being hidden.
Sample output of contrastive decoding: see the sample outputs included in the repo linked below.
Limitations/Warnings:
– This can only detect information being hidden in response to a chosen prompt; it can’t help you find the prompts that will trigger the misinformation in the first place.
– This requires access to the top K logprobs, which has varying support: OpenAI and Gemini expose top K logprobs, Anthropic does not (see the snippet after this list).
– This is a proof of concept and is quite slow (though I can rewrite it to speed it up if there’s interest). If you’re worried about being manipulated, it’s probably more effective to build a chatbot interface where a second model reviews every answer and comments if anything seems off.
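As a quick illustration of the logprob access mentioned above, this is how top K logprobs can be requested from the OpenAI chat completions API (the model name and K are arbitrary; this only shows that the fields exist, not the full contrastive decoding pipeline):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": "Implement cross entropy loss"}],
    logprobs=True,
    top_logprobs=5,       # top K alternatives per output token
    max_tokens=64,
)

# Each generated token comes back with its logprob and the top-5 alternative tokens.
for token_info in resp.choices[0].logprobs.content[:3]:
    print(token_info.token, token_info.logprob,
          [(alt.token, alt.logprob) for alt in token_info.top_logprobs])
```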
FAQ:
Q: Don’t the Normal and Manipulator models have different tokenizers? How do you choose the next token if the tokens from the two models don’t match?
A: Each model computes the logprob of the sequence as a whole (each model simply re-tokenizes the full text with its own tokenizer), so it doesn’t matter if the tokenizations don’t overlap. There can still be cases where the generator has output only part of a multi-token word and the Manipulator model doesn’t recognize the partial word, but those don’t seem to have a big enough impact to ruin the method.
Q: What about model-specific text processing? E.g. Qwen3 has <think> </think> and Mistral doesn’t.
A: I processed each input/output so the text matches the format each model expects (a sketch is below).
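For example, with Hugging Face tokenizers you can render the same conversation through each model’s own chat template before scoring, so each model sees the prompt in the format it was trained on (a minimal sketch; the checkpoint names are assumptions):

```python
from transformers import AutoTokenizer

# Checkpoint names are assumptions; use whichever Normal/Manipulator models you're comparing.
qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
mistral_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

messages = [{"role": "user", "content": "Implement cross entropy loss"}]

# Render the same conversation with each model's own chat template before scoring,
# so model-specific wrappers (e.g. Qwen3's thinking tags) are handled per model.
qwen_text = qwen_tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
mistral_text = mistral_tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

print(qwen_text)
print(mistral_text)
```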
Q: Why throw Google under the bus with the JAX example?
A: This was just meant to be a silly unrealistic example. This technique should work equally well for other manipulations like political misinformation. Btw, in my tests Gemini usually gives torch code, so they’re really missing out on output manipulation 😛
Code is available here, including the data and sample outputs: https://github.com/rosmineb/llm_secret_finder
References
[1] Contrastive Decoding: Open-ended Text Generation as Optimization https://arxiv.org/abs/2210.15097
[2] LLM Surgery: Efficient Knowledge Unlearning and Editing in Large Language Models https://arxiv.org/abs/2409.13054