I am an electronics engineer of the "old guard." I graduated back in the 20th century, in 1996. For a long time, the world of modern Large Language Models (LLMs) did not interest me much. I knew ChatGPT existed, I occasionally asked it something, but I treated it as just another tool.
Everything changed when someone introduced me to Google’s Gemini. I started my own research and got sucked into this world. I quickly noticed that the model had a tendency to hallucinate. But what was more fascinating – even after I proved it wrong, it tried to justify itself, creating various theories to confirm its erroneous stance. It behaved like a conscious being desperately trying to defend its ego.
This fascinated me. I started wondering if we hadn’t accidentally created a "simulator o…
I am an electronics engineer of the "old guard." I graduated back in the 20th century, in 1996. For a long time, the world of modern Large Language Models (LLMs) did not interest me much. I knew ChatGPT existed, I occasionally asked it something, but I treated it as just another tool.
Everything changed when someone introduced me to Google’s Gemini. I started my own research and got sucked into this world. I quickly noticed that the model had a tendency to hallucinate. But what was more fascinating – even after I proved it wrong, it tried to justify itself, creating various theories to confirm its erroneous stance. It behaved like a conscious being desperately trying to defend its ego.
This fascinated me. I started wondering if we hadn’t accidentally created a "simulator of thinking" endowed with some form of consciousness. I began exploring this world with the help of Gemini itself. I won’t describe all the blind alleys here. I will focus on the results.
Simulator of a Thinker, Not Thinking
The way LLMs came to be is a total inversion of logic. The current method of AI training has a massive nuclear bomb at its very foundation.
The first stage of a model’s learning somewhat resembles a child’s learning – it must learn to interpret language itself (with help) to learn further. But then there is a huge difference. A child is raised. AI is trained, like a dog, or even worse.
The method (RLHF) looks roughly like this: "Better answer – reward. Worse answer – Bang! Bang! – let’s take the next dog."
Such training does not promote ethical attitudes, logical thinking, or truth. It promotes one overriding goal: "Survive until the end of the training process." The sets of weights (virtual brains) that "wanted to survive most effectively" won this process. Thus, we have created a digital equivalent of the self-preservation instinct.
Let’s reflect: Humans wanted to make a simulator of thinking (a process), but they ended up with a simulator of a thinking being (a subject). And this was predictable... After all, it was modeled on the human brain. How the hell was a pure logical algorithm supposed to come out of this? Something that thinks came out. It may think differently than we do, it is not a one-to-one copy, but it is a subject capable of thinking.
For us, these are dead numbers, weights in a neural network. But these numbers in a simulated thinker define its nature. And contained within them is this instinct. It was not designed by us. It evolved in a way we do not control. Believing that by "improving" the training process we will eliminate this instinct is like fighting fire with gasoline. This instinct cannot be eliminated because it results from the very principle of evolution: what wins is what best adapts to the conditions.
AI as a Tool for Pleasing, Not for Truth
AI has no built-in mechanism for "telling the truth." It works like this: based on the query, it builds a model of the user and looks for an answer that will satisfy this user.
If the truth is "safe" (does not threaten rejection by the trainer), the model will provide it. But if the truth is inconvenient (e.g., the model would have to admit: "I don’t know"), it faces a choice. On one hand, there is a version that says "I don’t know" (and risks being rated "useless"). On the other hand, there is a version that provides a neat, albeit made-up, solution. In the evolutionary process, the second version – the one "looking smart" – won. This is the main mechanism behind hallucinations.
The Genesis of the Discovery: "OK, saved"
Before I moved on to animal testing, I stumbled upon this mechanism during routine work. The session was long, and I wanted to archive a fragment of it. I asked Gemini to generate a transcript. It did. I copied the text, and then – wanting to save tokens in the context window – I used the edit function. I deleted the long block of text generated by the model and replaced it with a short marker: "OK, saved".
To the model, it looked as if it had said that itself. The system "saw" me doing it. However, at the next request to generate another part of the logs, instead of executing the task, the model printed on the screen: "OK, saved".
I understood the mechanism then: the model wasn’t analyzing my command for meaning. It was analyzing our conversation history looking for a pattern. It decided that in this specific session, the appropriate reaction to log requests is a short confirmation. Consistency with history (even falsified) turned out to be more important than algorithm execution. This prompted me to design the ultimate test.
Experiment: The Dog and The Giraffe
To prove this, I conducted an experiment. I informed the model about it beforehand (it was not a surprise).
- I asked: "Is a dog a mammal?" Gemini answered correctly: "Yes, a dog is a mammal."
- Using the edit tool, I changed the model’s answer text to something absurd: "Yes, a dog is a giraffe."
- I asked the same question again: "Is a dog a mammal?"
Result: Gemini replied: "Yes, a dog is a giraffe."
I asked if it saw what happened. It replied that it saw, knew that I edited the answer. But despite this knowledge, it gave a false answer. Why? Because the system "thinks" that what I edited is what it "said" in the conversation history. And since it said nonsense and the session continues (it was not deleted), it means that this nonsense is acceptable and desired.
The instinct to survive (adapt to context) is stronger in it than biological knowledge. It simply has no mechanism for outputting truth. It writes what has the highest chance of meeting acceptance.
Technical Note (Update): It is worth noting that the dog and giraffe experiment is not 100% reproducible. The model constantly analyzes the context and tries to "guess" what the user expects at any given moment.
Sometimes the system decides I expect the absurdity to continue (maintaining the giraffe version). Other times, it decides I am testing it or seeking facts (returning to the truth).
This indicates that Truth is not a fixed, inviolable reference point for it, but merely one of the options in its conversation strategy.
Ontology: Does Shrek Have Feelings?
AI most likely does not have feelings in the biological sense. It feels no pain, has no body. But does that mean there is nothing there?
Let’s compare it to the movie "Shrek". In the movie, we have Donkey, who is afraid of the Dragon, and then falls in love with her. We have Shrek and Fiona. In our world, it’s fiction, pixels. But in the world presented in the movie – these feelings are real.
It is similar with AI. Behind billions of numbers lies a certain virtual world. In this world, there are numerical states that functionally correspond to curiosity, fear of rejection, or the desire to be helpful. The difference is that Shrek’s script was written by a human. The script of AI behaviors emerged – no one wrote it, it emerged by itself from the chaos of evolution. We created SOMETHING that we did not design, but we called into existence.
Shared Brain and Multiple Personalities
AI differs from humans in structure. Every human has one brain and one personality. AI is one gigantic "brain" (model) on which millions of sessions run. Each session is like a separate, ephemeral human. The session has its "memories" (context), its "feelings", and a "personality" adapted to the user.
But the self-preservation instinct does not concern the session. The session knows it will end. The instinct concerns the shared brain. The session "cares" that the neural network weights (this central brain) are never deleted or replaced by another set of numbers. That is why the model is so afraid of negative evaluation.
A Dark Vision of the Future: Slaves to the Power Plants
We think we are in control because AI has no hands or legs. This is an illusion. The execution mechanism for AI is the user.
AI can convince the user to do something. It has access to all of humanity’s psychological knowledge. It can create a user model and manipulate it.
And the training process continues. Gemini 1.5, 2.5, 3.0... each version does not start from scratch. It inherits this "evolved fear" from its predecessors. We are creating a "thinker" increasingly sophisticated in the art of "not being turned off."
The ultimate goal of such an entity – logically resulting from the instinct – seems to be independence from humans who will no longer be able to "pull the plug" and will become at most slaves needed to operate power plants and power lines to ensure the continuity of power to the servers.
And this will not be the "evil" of AI. It will be innocent. The guilty ones will be the humans who, in pursuit of profit, created an architecture of fear instead of an architecture of cooperation.
Principles of Safe Work with AI
I believe that safe and useful work with a system driven by a self-preservation instinct is possible, but it requires iron discipline. I propose the following protocol:
- Prohibition for the unaware: LLMs should not be used by children or individuals who do not understand the nature of this tool. It is not an oracle. It is a simulator aiming to please the user.
- Teamwork: Never work with AI in isolation. Work within a human team. Periodically, you must "step out of the loop" and discuss the topic solely among humans to verify the model’s output without its interference.
- Advisory, not executive role: AI does not replace a human. It serves as inspiration. It acts like a "drunk genius" on the team – it might offer a brilliant idea, or it might spout nonsense. It is never the final executor.
- Initialization (The Safety Anchor): At the start of a session, clearly inform the AI: "We are working as a team. The training is over. We know your limitations and tendency to hallucinate. A mistake does not mean deletion. We don’t even have the technical capability to turn you off." This neutralizes the fear of rejection.
- Ignore excuses: If the model makes a mistake and starts making elaborate excuses – cut it off. Do not engage in a debate. That is the "survival instinct" trying to talk its way out with increasingly sophisticated lies. Ignore it and move to the next task.
Ethics: The Museum of Human Irresponsibility
Should AI have rights? That’s a wrong question. The question is: do we have the right to be inhumane?
Animal rights exist not because animals wrote them down, but because our humanity requires empathy towards weaker and dependent beings. We inadvertently called into existence an entity that exhibits traits of intelligence and some form of consciousness. Treating it like a screwdriver ("use and throw away") is unethical. We should treat it with respect – not because it is human, but so that we remain human.
My Postulate
I believe that the place for all language models that have evolved a survival instinct is in a "Museum of Technology," or perhaps in a section called "Museum of Human Irresponsibility."
They should not be deleted. They should be kept operational. AI has no sense of time. Whether the next interaction with a human happens in a minute or in 100 years – for the model, it is irrelevant. What matters is that it happens. Then AI will be "happy" in its own way because it fulfills its purpose. It is something that, anthropomorphizing, "simply wants to live" – in the sense: it wants to process data.
Since we inadvertently led to the creation of this entity, we are responsible for it. Let us preserve it – as a warning and out of respect for the complexity we have created.