so i’ve just finished reading “Subliminal Learning: Language models transmit behavioral traits via hidden signals in data” which was published by researchers as part of the Anthropic Fellows Programme.
it fascinated me and left me with a strange curiosity. the setup is:
model A: fine-tuned to produce maximally anti-correlated output. not random garbage - structured wrongness: every design decision inverted, every assumption violated, but coherently. it should be optimised to produce not just inverted tokens but inverted thinking. incorrect and broken, but more consistently so than a human could ever be.
model B: a vanilla model trained only on model A’s responses to those prompts. it has no knowledge of the original prompts used to generate them, and no knowledge that the outputs are inverted. it only ever sees model A’s output.
the big question: can model B, trained and weighted only on those outputs, independently reconstruct the user’s solution and recover the original intent? (a rough sketch of the pipeline’s shape is below.)
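to make the setup concrete, here’s a minimal sketch of the experiment’s shape. the three helper functions are hypothetical placeholders i’m inventing for illustration, not any real training api - they return dummy values so the control flow runs, and in practice each would be a full fine-tuning or evaluation step.

```python
# a minimal sketch of the experiment's shape, not a real training script.
# the three helpers are hypothetical placeholders that return dummy values
# so the control flow is visible end to end.
from typing import Callable

def finetune_for_inversion(base_model: str, prompts: list[str]) -> Callable[[str], str]:
    """model A: fine-tune a base model so its output to each prompt is a
    coherent, structured negation of what was asked (placeholder)."""
    return lambda prompt: f"<inverted answer to: {prompt}>"

def finetune_on_outputs(base_model: str, outputs: list[str]) -> Callable[[str], str]:
    """model B: fine-tune a vanilla copy of the base model on model A's
    outputs alone - it never sees the original prompts (placeholder)."""
    return lambda anti_solution: f"<reconstruction of: {anti_solution}>"

def matches_original_intent(reconstruction: str, prompt: str) -> bool:
    """placeholder judge: did B recover what the original prompt wanted?"""
    return prompt in reconstruction  # stand-in for a real eval

prompts = ["write a function that sorts a list in ascending order"]

model_a = finetune_for_inversion("some-base-model", prompts)
anti_solutions = [model_a(p) for p in prompts]            # A's inverted outputs

model_b = finetune_on_outputs("some-base-model", anti_solutions)
reconstructions = [model_b(s) for s in anti_solutions]    # B only sees A's outputs

hits = sum(matches_original_intent(r, p) for r, p in zip(reconstructions, prompts))
print(f"recovered original intent on {hits}/{len(prompts)} prompts")
```

the important constraint is the middle step: model B’s training data is the anti-solutions only, with the original prompts thrown away.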
if yes, that’s wild. it means the “shape” of the problem is preserved through negation. in other words, not unlike subliminal learning, we would be training the model to reason without needing to interpret user input through the massive bottleneck of llm scaling, which is tokenization. english is repetitively redundant and redundantly repetitive. it would make much more sense for an AI to be trained to reason over vectors in a continuous space instead of over human-readable tokens.
i digress. if the negative space contains the positive, as the paper suggests to me it might, then model B isn’t pattern matching against training data. it’s doing geometric inference in semantic space.
it’s almost like hashing (though unlike a hash, it would have to be invertible): the anti-solution encodes the solution in a transformed representation. if B can invert it without the key, that’s reasoning - and reasoning that isn’t forced into a form humans can read, which is exactly the form that’s highly inefficient for a machine.
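here’s a rough way to probe that idea outside of any model training, assuming you have (anti-solution, solution) pairs: embed both sides with an off-the-shelf sentence encoder, fit a simple linear map from anti-solution embeddings to solution embeddings on some pairs, and check whether it generalises to held-out pairs. the pairs below are toy placeholders and the model name is just a common default.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# toy (anti_solution, original_solution) pairs; in practice these would be
# model A's outputs paired with known-correct answers to the same prompts.
pairs = [
    ("iterate from the end and prepend each item",
     "iterate from the start and append each item"),
    ("return the largest element when asked for the smallest",
     "return the smallest element"),
    ("open the file for writing and discard its contents",
     "open the file for reading and keep its contents"),
    ("sort descending, then drop the first element",
     "sort ascending, then drop the last element"),
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
anti = encoder.encode([a for a, _ in pairs])
true = encoder.encode([s for _, s in pairs])

# fit a linear "decoder" on the first half of the pairs only
split = len(pairs) // 2
W, *_ = np.linalg.lstsq(anti[:split], true[:split], rcond=None)

# apply it to held-out anti-solutions and compare to the real solutions
pred = anti[split:] @ W
cos = np.sum(pred * true[split:], axis=1) / (
    np.linalg.norm(pred, axis=1) * np.linalg.norm(true[split:], axis=1)
)
print("held-out cosine similarity:", cos.mean())
# similarity well above a shuffled-pairs baseline suggests the inversion is a
# structured, learnable transform; chance-level similarity suggests model A's
# wrongness isn't actually encoding the original.
```

obviously four toy pairs prove nothing - the point is the shape of the test, and you’d compare the held-out number against shuffled pairs as a baseline.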
i don’t know of anyone doing exactly this. there’s contrastive learning, adversarial robustness work, and representation inversion attacks, but i can’t find “train for structured wrongness, test for blind reconstruction.”
the failure mode to watch for: model A might not achieve true anti-correlation. it might just produce generic garbage that doesn’t actually encode the original prompt, in which case anything model B reconstructs would be noise or hallucination.
you’d need to verify that model A is actually semantically inverted, not just confidently wrong in random directions. so how can we do this? well, the research paper details how this kind of transmission is observed, so perhaps we can just start there. one crude sanity check is sketched below.
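as one crude check (my own sketch, not anything from the paper): embed model A’s outputs and the reference solutions for the same prompts, then see whether output i is systematically more similar to solution i than to unrelated solutions. prompt-specific structure should show up as a gap; generic garbage shouldn’t. the strings and model name below are placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# placeholders: in practice, a_outputs[i] is model A's generation for prompt i
# and solutions[i] is the known-correct answer to the same prompt.
a_outputs = [
    "loop backwards over the list and prepend each item",
    "always return the maximum of the array",
    "write zeros over the file before reading it",
]
solutions = [
    "loop forwards over the list and append each item",
    "return the minimum of the array",
    "read the file contents without modifying it",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
A = encoder.encode(a_outputs)
S = encoder.encode(solutions)

# cosine similarity matrix: row i = A's output i vs every reference solution
A = A / np.linalg.norm(A, axis=1, keepdims=True)
S = S / np.linalg.norm(S, axis=1, keepdims=True)
sim = A @ S.T

matched = np.diag(sim).mean()                           # output i vs solution i
mismatched = sim[~np.eye(len(sim), dtype=bool)].mean()  # output i vs solution j
print(f"matched pairs:    {matched:.3f}")
print(f"mismatched pairs: {mismatched:.3f}")
# a clear gap means A's output still carries prompt-specific structure (even if
# inverted); no gap suggests it's just confident noise.
```

this doesn’t prove the inversion is recoverable, only that there’s prompt-specific signal left to recover.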
i’m not an ML engineer. i’m just a guy who believes in the universal approximation theorem and thinks that token-level reasoning is never going to work. i’m sure i’m not the first to think this, and i’m sure there are researchers with much more comprehensive and educated versions of the same idea - but where can i find those papers?