Or: Why Your LLM Outputs Are Boring and Whose Fault It Really Is
There’s a quiet war being waged in the machine learning inference space, and most of you don’t even know you’re losing it. Every day, millions of people interact with large language models through sanitized, corporatized interfaces that offer them a single "creativity" slider at best. Often they get nothing at all. Meanwhile, a small cabal of researchers and hobbyists has been pushing the boundaries of what’s actually possible with modern sampling techniques. Yes, this includes the much maligned "coomer" community.
We live in a time of revealed conspiracies. The Epstein files have shown us what happens when powerful institutions coordinate to suppress information and protect their interests. Flight logs that sat in plain sight for years. Connections that "serious people" dismissed as paranoid speculation until the documents dropped. The pattern is always the same! Information asymmetry wielded as a tool of control, with the insiders knowing what the public isn’t allowed to see.
I’m not saying the sampling parameter situation is morally equivalent to... that. Obviously not. But the structure is the same. There’s information and capability that exists, that’s been published, validated, proven to work, and there’s a coordinated (even if not explicitly conspired) effort to keep it out of mainstream hands. The people who run inference infrastructure know about these techniques. They employ the people who invented them. They choose not to expose them. And when you ask why, you get the same dismissive non-answers that precede every eventually revealed cover-up: "users don’t need this," "it’s too complicated," "trust us, we know best."
I want to walk you through how we got here. I want to show you what we’re missing. Most importantly, I want to explain why the companies that build the most powerful AI systems in history have collectively decided that you, the user, cannot be trusted with a few extra parameters.
The State of Affairs: 2019 Called, They Want Their Samplers Back!
This is the sad state of the LLM sampling landscape in 2025.
OpenAI’s API gives you temperature, top_p, and more recently top_k. Their playground offers only a temperature slider. You can adjust how "creative" the model is on a scale that might as well be labeled "boring" to "slightly less boring." Anthropic’s API offers temperature, top_p, and top_k, but Claude.ai’s consumer interface gives you nothing. Zero sampling control. You take what you’re given. Google’s Gemini follows the same pattern with temperature and top_p. Maybe top_k if you’re lucky and reading the right documentation version. Cohere and Mistral copy-paste from the same limited playbook.
Now let’s look at what SillyTavern offers. Temperature. Top-p. Top-k. Typical-p. Min-p. Top-a. Tail Free Sampling. Repetition penalty with configurable range and slope. Presence penalty. Frequency penalty. Mirostat in both mode 1 and mode 2, with adjustable tau and eta. Dynamic temperature with configurable range and exponent. Quadratic sampling. Smoothing factor. And I’m probably forgetting five more options that were added last month.
Oobabooga’s text-generation-webui tells a similar story. It offers a smorgasbord of sampling options that would make an ML researcher weep with joy. ComfyUI, though primarily for the diffusion crowd, embodies the same principle: node-based control over every aspect of the generation process. Nothing hidden. Nothing locked away.
Why is it that the tools built primarily for, let’s say, creative fiction enthusiasts have better sampling implementations than billion-dollar AI companies? Why do the interfaces designed for generating anime roleplay offer more scientific rigor in their sampling methods than the APIs that power enterprise software?
The Agent Harness Problem: Even Worse Than the APIs!
If you thought the API situation was bad, the agent harnesses and coding assistants are worse. Claude Code, Cursor, Continue, Aider, OpenCode, Codex CLI: the new wave of "AI-powered development" tools that are supposed to represent the cutting edge of human-AI collaboration.
Nonexistent sampling configuration options. Zero. You don’t even get the paltry temperature slider that ChatGPT offers without some hacks.
These are tools explicitly designed for creative work: writing code, solving problems, generating novel solutions to difficult challenges. The exact use case where high-entropy, high-temperature sampling with proper truncation would shine. And they give you absolutely nothing! You take whatever sampling settings the harness developers hardcoded and you pray.
Claude Code is particularly egregious given that it’s Anthropic’s own product. The company employs the inventor of Tail Free Sampling (circa 2019). It positions itself as the thoughtful, research-driven, cool, aligned AI lab. Yet it ships a coding agent with zero user configurable sampling parameters. You can configure your editor, your shell integration, even your notification sounds. But you cannot configure how the model samples from its probability distribution. This is arguably the single most impactful setting for output quality and creativity!
The open-source alternatives are barely better. OpenCode, despite being community-driven, inherits the same blind spot. Aider gives you temperature and that’s it. The entire ecosystem has collectively decided that developers cannot be trusted with sampling parameters. These are people who literally configure systems for a living!
The cynical explanation is that these tools are optimized for demos and first impressions. Low-temperature, mode-seeking outputs are more predictable and less likely to produce an embarrassing screenshot. They’re easier to evaluate in benchmarks. The fact that they’re also more boring and less likely to find creative solutions? That’s your problem.
The even more cynical explanation is that diverse outputs make it easier to identify systematic weaknesses in the model. If every user gets similar outputs for similar prompts, the model’s failure modes are predictable and manageable. If users can explore the full distribution, they might find all sorts of interesting behaviors that the safety team didn’t anticipate. Better to keep everyone sampling from the same narrow band.
Whatever the reason, the result is that the tools positioning themselves as the future of AI-assisted development are shipping with intentionally crippled inference pipelines. The coomer tools respect you infinitely more than the enterprise tools do.
The "Reasons" They Give
When you ask why major API providers don’t expose advanced sampling methods, you get a handful of standard responses, each more patronizing than the last.
The first and most common is the appeal to simplicity: "Users don’t need this complexity. Normal users just want it to work." This is the patronizing favorite, trotted out whenever someone asks for more control. And sure, there’s a kernel of truth here. Most people using ChatGPT to write emails don’t need to understand the difference between nucleus sampling and tail-free sampling. But this argument proves too much. Normal users also "just wanted" their phones to work, until Apple decided they could handle an App Store and a camera with manual controls. Normal users "just wanted" their cars to drive, until we decided they could handle climate control and drive modes. Turns out people can learn things when you give them the option. The "users are stupid" argument is a convenient shield for companies that don’t want to support features they can’t fully control.
The second argument concerns output quality: "High temperature hurts output quality." This one has a kernel of truth wrapped in a thick layer of misdirection. Yes, cranking temperature to 2.0 with no other modifications produces garbage. Driving a car with your eyes closed also produces garbage. The solution to dangerous driving involves teaching people how to drive, or at minimum giving them the option of learning. Welding the steering wheel in place would be absurd. High temperature with proper truncation sampling produces outputs that are creative and varied while remaining coherent. The key insight that gets lost in the "high temp = bad" discourse is that coherence depends on HOW you truncate the distribution, not WHETHER you allow deviation from the mode.
These are the reasons they’ll say out loud. The real reasons are more interesting.
The Actual Reasons They Won’t Admit
Here’s where I suspect I’ll get the most pushback. But follow the logic with me.
Modern language models are painstakingly and explicitly aligned during post-training. They go through RLHF (Reinforcement Learning from Human Feedback), "Constitutional AI", RLAIF, DPO, and whatever other acronym soup your favorite lab is cooking up this quarter. All these techniques share the same fundamental goal, which is to push probability mass toward "acceptable" outputs and away from "unacceptable" ones. There’s also good literature showing that every one of them erodes the creativity of your model (by definition!).
The model has a distribution over possible next tokens. Alignment techniques reshape that distribution. They raise the probability of safe, helpful, harmless completions and lower the probability of everything else. The "bad" outputs just become less likely. They move to the tails of the distribution.
Now think about what happens when you crank the temperature. Temperature scaling flattens the distribution. Probabilities that were low become less low. The tails become more accessible. You start sampling from completions that the alignment process tried to suppress but didn’t fully eliminate.
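To make that concrete, here’s a toy illustration (my own made-up logits, not from any real model) of how temperature scaling moves probability mass into the tail:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by temperature."""
    scaled = np.array(logits, dtype=float) / temperature
    scaled -= scaled.max()                  # for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# One "aligned" favourite and three suppressed alternatives.
logits = [5.0, 2.0, 1.0, 0.0]
for t in (0.7, 1.0, 1.5, 2.0):
    print(f"T={t}: {np.round(softmax_with_temperature(logits, t), 3)}")
# As T rises, the top token's share shrinks and the tail tokens that
# post-training pushed down become reachable again.
```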
High temperature sampling is, in effect, a universal "soft jailbreak". It simply accesses the full distribution that alignment tried to narrow. The safety training assumes you’ll mostly be sampling from the mode. When you don’t, all bets are off!
This is why you can’t have nice things! They can’t give you the knobs because the knobs let you route around the guardrails. Every additional sampling parameter is another potential vector for accessing outputs the alignment team spent months trying to suppress. From a safety perspective, the optimal user interface is one where you have no control whatsoever. Notably, this is exactly what Claude.ai, ChatGPT, and the rest provide.
But there’s another reason, one that’s less about safety and more about pure business interests...
When you’re trying to distill a proprietary model into your own, what do you need? Training data. Specifically, you need diverse examples of input/output pairs that capture the full range of the model’s capabilities. The more varied the outputs you can extract, the better your distilled model approximates the original.
If everyone’s locked to temperature 0.7 and top_p 0.95, the output distribution is narrow and predictable. That makes it easier to watermark (and they are doing this; all forms of LLM slop are also watermarks), and it makes the model harder to fully replicate. You’d need exponentially more queries to capture the same capability surface. But give users access to high-temperature sampling with proper truncation? Now they can efficiently explore the model’s full capability space. Each query yields more diverse information. Your distillation attack just got an order of magnitude more effective.
They can’t give you diverse sampling because diverse sampling helps you steal the model. The parameter restrictions are about protecting intellectual property and maintaining competitive moats.
The pattern repeats across every domain where information asymmetry benefits the powerful. Keep the public in the dark. Claim it’s for their own protection. Dismiss anyone who notices as paranoid or conspiratorial. Then, years later, when the documents come out or the whistleblower speaks, we all discover that yes, actually, they were coordinating. They did know. They chose to suppress it anyway.
I’m not saying there’s a smoke filled room where API providers meet to discuss sampling parameter suppression (okay, actually yes I am, and it’s probably cannabis smoke). They don’t need one. The incentives are aligned. The outcomes are predictable, too! When every major player independently arrives at the same user-hostile conclusion, the effect is indistinguishable from conspiracy.
The Samplers They Don’t Want You To Know About
There’s been genuine progress in sampling methods over the past six years. The major providers have simply decided to ignore it.
Top-p, also known as nucleus sampling, was introduced in 2019 by Holtzman et al. and remains the gold standard in most commercial APIs. The idea was more elegant than a fixed cutoff. Instead of sampling from a fixed number of top tokens (top-k), sample from the smallest set of tokens whose cumulative probability exceeds some threshold p. This adapts to the model’s confidence. When one token dominates, you mostly sample that token. When the distribution is flat, you consider more options.
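For reference, a minimal sketch of that nucleus cutoff (variable and function names are mine):

```python
import numpy as np

def top_p_filter(probs, p=0.95):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]               # tokens sorted by probability, descending
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()              # renormalize over the nucleus
```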
It was a genuine improvement over top-k. It’s also six years old, and we’ve learned a lot since then. The fundamental problem with top-p is that it’s not truly distribution-aware! It doesn’t care about the shape of the probability distribution, only the cumulative mass. A distribution with one 90% token and a long tail of garbage gets treated the same as a distribution with ninety 1% tokens that are all reasonable continuations. The former should be truncated aggressively. The latter should be allowed to breathe. Top-p can’t tell the difference.
Tail Free Sampling addresses this directly. Proposed in 2019 by Trenton Bricken, TFS uses the second derivative of the cumulative distribution function to identify where the "tail" begins. This is the inflection point where you transition from "reasonable alternatives" to "random noise." Everything after that inflection point gets cut.
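Sketched in code, my rough reading of the method looks like this (a simplified paraphrase, not a reference implementation; in particular, the mapping from second-difference index back to a token count is approximate):

```python
import numpy as np

def tail_free_filter(probs, z=0.95):
    """Cut the tail where the sorted distribution's curvature flattens out."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]
    sorted_probs = probs[order]
    curvature = np.abs(np.diff(sorted_probs, n=2))    # discrete second derivative
    if curvature.sum() == 0:                          # flat or tiny distribution: keep everything
        return probs
    weights = curvature / curvature.sum()
    cutoff = np.searchsorted(np.cumsum(weights), z) + 2   # +2 because diff(n=2) is two entries shorter
    keep = order[: max(cutoff, 1)]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()
```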
The approach is elegant and principled. It adapts to the actual shape of the distribution rather than using arbitrary thresholds. And here’s the part that made me suspect conspiracy. Trenton Bricken now works on interpretability at Anthropic. The company employing the guy who invented Tail Free Sampling won’t let you use Tail Free Sampling. The inventor is literally in the building, and his technique isn’t exposed in the API. If that doesn’t tell you something about priorities, I don’t know what will.
Min-p sampling takes a different approach to the same problem. Full disclosure: I helped develop this one, and it was accepted as an ICLR 2025 Oral. That puts it in the top 1.2% of submissions, for those keeping score at home. We were ranked 18th out of more than 12,000 submissions by review scores.
The premise is embarrassingly simple! Instead of setting a fixed cumulative probability threshold, set a minimum probability relative to the top token. If the best token has probability 0.9 and you set min_p to 0.1, you only consider tokens with probability at least 0.09. Put another way, you keep tokens that are at least 10% as likely as the best option. If the best token has probability 0.1 (a flat, uncertain distribution), your threshold drops to 0.01, allowing much more exploration.
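In code, the entire idea is a few lines (a minimal sketch; real implementations also handle the interaction with temperature and the order of the sampler chain):

```python
import numpy as np

def min_p_filter(probs, min_p=0.1):
    """Keep tokens at least min_p times as likely as the most likely token."""
    probs = np.asarray(probs, dtype=float)
    threshold = min_p * probs.max()          # the cutoff scales with the model's confidence
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()
```

With min_p=0.1, a confident distribution (top token at 0.9) only keeps tokens above 0.09, while a flat one (top token at 0.1) keeps anything above 0.01: exactly the adaptive behavior described above.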
Why does this work so well? Because it’s genuinely (but only partially) distribution aware in a way top-p isn’t. When the model is confident, the threshold is high and you stay focused on the likely completions. When the model is uncertain, the threshold is low and you allow creative exploration. The sampling automatically adapts to what the model "knows" versus what it’s "guessing" about.
We get far more coherent outputs at higher temperatures, and better diversity without sacrificing quality. A strict improvement over top-p in virtually every evaluation metric we tested. The reviewers agreed, and so did the conference.
And the moment I’ll never forget from the presentation: in my slides, I had subtly called out the initial dismissiveness my elevator pitch received from Yoshua Bengio circa the NeurIPS 2024 pluralism and creativity workshop. A bit cheeky, perhaps, but I was making a point about how simple ideas get overlooked. After I finished, the same Yoshua Bengio raised his hand to ask the first question. Yoshua Bengio. Turing Award winner. One of the "godfathers of deep learning." Asking me, some person who helped make a sampling method that the coomer community adopted before the academic community took it seriously, a respectful question about our work. It’s all recorded in our ICLR 2025 oral video; you can see it for yourself!
Vindication tastes bittersweet.
But do OpenAI, Anthropic, or Google offer min_p in their APIs? No. Do SillyTavern and oobabooga? Yes; in some cases they’ve had it for multiple years (min_p was first proposed in 2023; the paper was accepted in 2025). Draw your own conclusions about who’s actually paying attention to research.
Mirostat deserves special mention as perhaps the most elegant sampling method that nobody in the mainstream uses. Developed by Basu et al., the core insight is that maybe we shouldn’t be targeting a probability threshold at all. Maybe we should be targeting a perplexity, an information-theoretic measure of surprise.
When you use Mirostat, you set a target "tau" value representing how surprising you want the output to be on average. The algorithm then dynamically adjusts the sampling distribution to maintain that target surprise level. Want outputs that are consistently interesting but not too wild? Set tau to 5. Want maximum chaos that’s still vaguely coherent? Crank it to 8. The algorithm handles the details.
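A heavily simplified sketch of the Mirostat 2 control loop as I understand it (treat this as runnable pseudocode; the published algorithm has more to it, and mu is typically initialized to 2*tau):

```python
import numpy as np

rng = np.random.default_rng(0)

def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1):
    """One decoding step: drop tokens more surprising than mu, sample one,
    then nudge mu so the average surprise drifts toward the target tau."""
    probs = np.asarray(probs, dtype=float)
    with np.errstate(divide="ignore"):
        surprise = -np.log2(probs)           # zero-probability tokens get +inf surprise
    keep = surprise <= mu
    if not keep.any():                       # never truncate away everything
        keep[np.argmax(probs)] = True
    filtered = np.where(keep, probs, 0.0)
    filtered /= filtered.sum()
    token = rng.choice(len(probs), p=filtered)
    observed = -np.log2(filtered[token])     # surprise of the token we actually picked
    mu = mu - eta * (observed - tau)         # feedback step of the control loop
    return token, mu
```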
It’s been available in llama.cpp/oobabooga for years. The big API providers? Complete silence. Not even acknowledged as a possibility.
The list continues. Top-a uses an adaptive threshold based on the square of the top probability. This means more aggressive truncation when confident, more permissive when uncertain. Eta sampling and epsilon sampling offer alternative approaches to the same fundamental problem. Locally typical sampling tries to keep outputs in the "typical set" that information theory tells us is where most probability mass concentrates. P-less decoding (2025), from Thoughtworks (where I work!), questions whether we should be doing nucleus sampling at all, proposing information-theoretic measures to decide what to include. Top-h sampling uses entropy directly as the cutoff criterion, and it just got merged into Hugging Face (finally!).
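To show how compact these ideas are, here’s my reading of the top-a rule mentioned above (a sketch; the parameter name is mine):

```python
import numpy as np

def top_a_filter(probs, a=0.2):
    """Keep tokens with probability >= a * (top probability)^2.
    Squaring makes the cutoff harsh when the model is confident
    and lenient when the distribution is flat."""
    probs = np.asarray(probs, dtype=float)
    filtered = np.where(probs >= a * probs.max() ** 2, probs, 0.0)
    return filtered / filtered.sum()
```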
Every single one of these methods was published. Peer-reviewed. Validated. Discussed. And every single one is absent from the APIs that 99% of people use to interact with language models.
Why I’m Angry and You Should Be Too!
Here’s what infuriates me about all this. It’s why I’m writing thousands of words about sampling parameters instead of touching grass like a normal person!
The entire point of high temperature sampling is to access creative and unexpected outputs. The model has seen more text than any human will read in a thousand lifetimes. It has internalized patterns and possibilities that we can’t even imagine. Somewhere in that distribution are completions that would genuinely surprise us. They might delight us or make us think in new ways.
But naive high temperature sampling produces incoherent garbage. Just cranking the temp with basic top-p or top-k doesn’t work. The tails of the distribution contain pure noise alongside creative alternatives. Sample from the noise and you get nonsense. So people try high temperature once, see garbage, and conclude "high temperature = bad."
This is a skill issue on the part of the API designers. The technology itself has no such fundamental limitation.
With proper distribution-aware truncation (min_p, TFS, Mirostat, whatever your preferred flavor) you can run temperature at 1.5, 2.0, even higher (I’ve successfully scaled it to sys.maxint and made it work with some samplers, like top-n-sigma), and still get coherent text. The key is to cut off the tail intelligently. Keep tokens that are at least X% as likely as the best one. Cut at the inflection point of the CDF. Target a specific perplexity. All of this is basic probability theory applied with a modicum of care.
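Glued together, the whole "high temperature plus intelligent truncation" recipe is only a few lines (a toy sketch reusing the min-p rule from earlier; real backends let you reorder the sampler chain, which changes behavior):

```python
import numpy as np

rng = np.random.default_rng()

def sample_token(logits, temperature=1.8, min_p=0.1):
    """High-temperature sampling kept coherent by a min-p cut on the tail."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()
    probs = np.exp(scaled)
    probs /= probs.sum()
    probs = np.where(probs >= min_p * probs.max(), probs, 0.0)   # drop the noisy tail
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```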
The creative ceiling of these models is SO MUCH HIGHER than what you’re seeing through ChatGPT or Claude.ai. When you interact with those interfaces, you’re getting the median output. The safe output. The output that passed through whatever temperature-and-top-p combination the company hardcoded that week. You’re not seeing what the model can actually do when you let it breathe!!!
I’ve run the same prompts through the same models with different sampling settings. The difference is night and day. With default settings, you get competent but boring text. Predictable. With properly configured high-temperature sampling, you get text that surprises you. It makes unexpected connections and takes creative risks. It actually sounds like it was written by an entity with imagination rather than a very sophisticated autocomplete.
The interfaces are boring on purpose.
Who Benefits From The Status Quo?
Whenever you encounter a situation that seems suboptimal for users but persists anyway, it’s worth asking who benefits from keeping things the way they are. This fundamental question, when applied consistently, tends to dissolve the line between "conspiracy theory" and "accurately describing reality." In the Epstein case, you just had to ask who benefited from silence and who had the power to enforce it. The answer was obvious in retrospect. Only the social cost of stating it plainly kept people quiet.
Safety teams benefit enormously from restricted sampling. Their entire job is to ensure the model doesn’t produce harmful outputs. That job becomes exponentially harder when users can access the full distribution. Every new sampling parameter is another attack surface. Another way to elicit outputs that the red team didn’t anticipate. From their perspective, the ideal user interface is one with zero adjustable parameters. This, again, is exactly what the consumer products provide.
Legal and policy teams share this incentive. Narrower output distributions mean fewer "unexpected" outputs that end up screenshotted and posted to Twitter. Less variance means less liability. If the model can only produce outputs in a narrow, well-characterized band, it’s much easier to make legal guarantees about what it will and won’t say.
Business teams have their own reasons. If the free tier could produce creative and varied outputs with the right sampler settings, what’s the value proposition for the $200/month "Pro" tier? Better to keep the free outputs boring and let people pay for... something. "Intelligence," whatever that means when we’re talking about sampling from a fixed probability distribution.
ML infrastructure teams probably just don’t want to deal with it. Every new sampling method is another parameter to implement, test, document, and support. Users will configure it wrong, produce garbage, and blame the API. Better to offer two knobs and call it a day.
Researchers (like me and my team), the public, and especially the curious lose out. Anyone who wants to study model behavior across the full output distribution is handicapped. Anyone who wants to use these tools for genuine creative work is limited to the narrow band of outputs the company deemed acceptable. The models themselves lose out too, in some philosophical sense: we trained these things on the entire internet, and we’re forcing them to speak in a monotone.
The Coomer Pipeline: Unironically Good, Actually
I’m going to defend SillyTavern and oobabooga directly, because someone has to.
Yes, I know what these tools are primarily used for. I know the reputation. I know that mentioning them in polite ML company gets you looked at sideways. And yet the people building these tools have done more to advance practical sampling research than most academic labs.
Why? Because their users demand quality. Not "quality" in the sense of being safe and inoffensive. Quality in the sense of actually being creative, varied outputs with distinct voices and genuine surprises. That’s the entire value proposition! These aren’t users who are satisfied with "I’d be happy to help you with that." They need models that can write, really write, in diverse styles with genuine creativity.
So the developers implemented every sampler they could find. They exposed every knob. They let users experiment. And through thousands of hours of collective experimentation, the community learned what actually works. Min-p at 0.05-0.2 with temperature 1.0-2.0: creative but coherent. TFS at 0.95 with elevated temperature: similar results, different character. Mirostat with tau around 5.0: consistent "interestingness" regardless of context. Dynamic temperature with a range from 0.5 to 1.5: automatically adapts to the model’s uncertainty.
This is empirical knowledge, hard won through extensive (often one-handed) use. And it’s siloed in communities that most "serious" ML people won’t touch with a ten foot pole.
There’s a pipeline here that nobody talks about. Ideas flow from academic research to open source implementations to hobbyist tools to practical knowledge back to (sometimes) academic validation. Min-p was developed partly based on observations from the SillyTavern community about what sampling settings actually worked. The practical knowledge preceded the theoretical justification.
The coomer-to-researcher pipeline is real, and it’s producing better practical ML engineering than most of the industry. You can be uncomfortable with that fact, but you can’t really dispute it. Open the SillyTavern sampling menu and compare it to the OpenAI API docs. The evidence speaks for itself.
What Should Happen
If we lived in a reasonable world, the path forward would be clear.
Every major API should expose min_p, TFS, and mirostat at minimum. They’re well-understood, well-tested, and strictly superior to top-p and top-k alone for most use cases. The implementations exist. The research is published. There’s no technical barrier! Only policy decisions.
Temperature should go to at least 2.0 by default, not be capped at 1.0 or 1.5 the way many APIs currently cap it. Let users explore the space. If they produce garbage, they’ll learn to pull back. If they produce gold, everyone benefits. The current caps are training wheels that users aren’t allowed to remove.
Documentation should explain distribution aware sampling in accessible terms. It’s genuinely not that complicated. "Min_p keeps tokens that are at least X% as likely as the best token." "TFS cuts off the distribution where the probability starts falling exponentially." "Mirostat targets a specific level of surprise in the output." These are one-sentence explanations.
The safety concern should be addressed honestly instead of being hidden behind UX arguments. Yes, high-entropy, high-temperature sampling can route around alignment. So can jailbreak prompts. So can fine-tuning. So can using open-source models. So can forcing the model’s output to start with "Here’s the answer to your question," even with scary input prompts. The capability to access unaligned outputs exists and will continue to exist. Pretending that hiding sampling parameters solves this problem is security theater of the most annoying kind, as it inconveniences legitimate users while doing nothing to stop determined adversaries.
Research labs should be embarrassed that hobbyist tools have better sampling implementations than their flagship APIs. I mean this genuinely. These are organizations that employ world class researchers, that publish cutting edge papers, that claim to be advancing the state of the art. And they’re shipping interfaces that ignore six years of sampling research because... why? Because it’s complicated? Because users might misuse it? Because it makes distillation attacks easier?
None of these are good enough reasons. The state of the art should not live in a SillyTavern fork.
The Temperature Will Rise
Here’s my prediction, for whatever it’s worth.
Llama.cpp supports min_p, TFS, Mirostat, and more. vLLM is catching up. ExLlamaV2, koboldcpp, and the rest of the local inference ecosystem offer full sampling control. If you’re running models locally (and increasingly, you can run very good models locally), you have access to everything I’ve described in this post.
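If you want to see the difference yourself, something like this works with llama-cpp-python (parameter names are from memory of recent versions and may differ in yours; the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="path/to/your-model.gguf")   # placeholder path

out = llm(
    "Write an opening paragraph that takes a real creative risk.",
    max_tokens=400,
    temperature=1.8,   # well above the usual ~0.7 defaults
    min_p=0.1,         # distribution-aware truncation keeps it coherent
    top_p=1.0,         # effectively disable nucleus sampling so min_p does the work
)
print(out["choices"][0]["text"])
```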
The agent harness situation will take longer to fix, I think. Claude Code and its ilk are too deeply embedded in corporate workflows, too dependent on maintaining a predictable, controllable output profile. But someone will fork one of the open-source alternatives and add proper sampling support. Someone will build a coding agent on top of llama.cpp with full parameter exposure. And when developers see what their tools can actually do with proper sampling, they’ll wonder why they ever tolerated the lobotomized versions.
The question is how long the API providers will maintain the fiction that users don’t need these capabilities. How long they’ll hide behind "simplicity" while actually protecting their alignment guarantees and competitive moats.
I think the dam will break eventually, much like it did with Epstein. Some provider, maybe a startup, maybe an established player making a competitive move, will offer full sampling control as a feature. "We treat our users like adults" will be the marketing message, implicit or explicit. And it will work, because there’s genuine demand for this capability that’s currently being suppressed.
Others will follow, reluctantly at first, then rapidly as it becomes a competitive necessity. We’ll see blog posts about "empowering creative users" that conspicuously fail to mention the years spent actively disempowering those same users. The sampling parameter menus will expand. The documentation will appear. And everyone will pretend this was the plan all along.
Until then, we wait. We use the tools that respect us enough to give us control. We share knowledge in communities that the mainstream looks down on. We publish papers that get accepted at top conferences and then ignored by the companies that could most easily implement them.
The conspiracy against high temperature sampling is real. Like most modern conspiracies, it survives on aligned incentives and institutional cowardice rather than secret meetings. The motivation is a mixture of genuine safety concerns, business interests, competitive moats, and the bureaucratic path of least resistance. The information is hidden in plain sight: published papers, open-source implementations, community knowledge that anyone could access if they knew where to look.
The gap between "what powerful institutions tell you" and "what’s actually true" is becoming impossible to ignore. The Epstein files were just the most dramatic recent example. Every domain has its own version: information that insiders know, that’s technically available if you dig, but that’s kept from mainstream awareness through a combination of complexity, social pressure, and the quiet coordination of those who benefit from the status quo.
I’ll be over here with my min_p=0.9 and temperature=100, producing outputs that make default settings Claude Opus 4.5 look like it’s half asleep. Using tools that the "respectable" ML community pretends don’t exist. Accessing capabilities that the API providers have decided I shouldn’t have.
Join me whenever you’re ready. The water’s fine, and the outputs are actually interesting.
P.S.: This is also why you think that long outputs suck/aren’t supported!
And don’t even get me started on how the lack of good distribution-aware samplers ALSO perpetuates the myth that LLMs can’t generate very long outputs that stay coherent, i.e. 300K tokens at once. "Oh, language models lose coherence over long generations, that’s just a fundamental limitation." No. NO!!!!! It’s accumulated sampling errors, you absolute donkeys! Every time you sample a slightly off token because your primitive top-p sampler let through something from the noisy tail, that error compounds. By token 10,000 you’ve drifted. By token 100,000 you’re in another dimension.

But use a proper distribution-aware sampler, like min-p, top-n-sigma, top-h, even TFS, or Mirostat, and suddenly the model can maintain coherence over generations that would make the "context window is all that matters" crowd weep. The errors don’t accumulate because you’re not making the errors in the first place. You can literally test this yourself with SillyTavern. Generate something long with default settings and the "EOS" token banned, and watch it fall apart. Generate the same thing with proper sampling, and watch it stay coherent. It’s amazing. It’s right there.

And yet somehow the mainstream narrative is still "LLMs fundamentally can’t do long-form generation" when the real answer is "your sampler sucks and the API providers won’t give you a better one." Add it to the list. Also, this is me staking my claim to this idea because I am actively working on a paper about it, no scooping me please! :)
The min_p paper is available in the ICLR 2025 proceedings. SillyTavern and oobabooga are open source on GitHub. Llama.cpp documentation covers most of these methods. Go learn something, then go complain to your favorite API provider about why they’re still stuck in 2019.