30 Oct 2025 • 19 min read

I’m chronically late. Not because I want to be rude - I feel terrible about it every single time - but because I’m catastrophically bad at predicting how long it takes to get anywhere.
Turns out machine learning algorithms have the exact same problem.
Here’s how it happens: Dinner is at 7pm. I know where the restaurant is. I have a perfectly clear route in my head: office door → hallway → elevator → street → subway → walk → restaurant. Very well defined. Call it 14 minutes, door to door.
The problem is, I get distracted. I’m deep in some problem, reading an article, debugging something. Suddenly it’s 6pm. Then 6:30. Then 6:45. And I think: “Well, 14 minutes, so as long as I leave by 6:46, I’m fine.”
Except the route is nondeterministic.
Are the cleaners in the office, so I have to take a longer way? Is the elevator busy? Is the subway running slow? Did I just miss it by 30 seconds? Is it raining and the sidewalks are full of people with big umbrellas?
What I confidently think of as a “14 minute journey” might actually take 25 minutes. I leave with enough time to make it in the ideal case - because that’s what I’m planning for in my head - and congratulations, now I’ve kept someone waiting (and the xiaolongbao are congealing) for 10-15 minutes because of things “out of my control.”
Or, realistically, things that were entirely in my control, since I could have just left earlier. Sorry, everyone.
At 3am, it might be an incredibly fast journey. During the Puerto Rican Day Parade, much slower. But I don’t know these things unless I understand exactly what the flow looks like at that specific moment.
The Route You Don’t Even Know
[Quick tangent before we get to the main event: There’s actually a different “route” problem in machine learning, and I need to acknowledge it so you understand what this paper ISN’T solving.]
During training, supervised learning doesn’t “know a route” in any human sense. It barely understands what the route even is. You give it a target (“dog or cat”), it gets a scalar loss at the end, and it adjusts parameters. There’s no GPS. No turn-by-turn instructions. Just a score telling it whether it arrived or not.
It’s like the restaurant problem, except the algorithm tries a billion different routes in parallel and remembers which ones work. It has no idea if the best route goes through Queens and comes back down. It’s just search at massive scale.
This is where the route metaphor gets strained: the system isn’t memorizing streets. It’s learning a function that usually lands you at the right restaurant. That’s a training problem - an interpretability problem. Fascinating, unsolved, not what this post is about.
This post is about what happens AFTER training. Once you have a trained model - once you think you know the route - there’s still a completely different problem: you can’t get the same travel time twice. That’s what Thinking Machines Lab solved. Back to our regularly scheduled programming.
The Bigger Problem: Even When You Know The Route
Unlike me - no matter how many times I go to the restaurant, I’m never going to try the route a billion times - a machine learning algorithm can try many different things in parallel, far more times than a human ever could. So the model generally has a good sense of the route. It’s figured it out.
But here’s the real nondeterminism problem: You think you’re taking the exact same route - same hallway, same elevator, same subway, same streets, same sequence of turns - but you’re actually not.
When you travel alone, you measure timing at different points than when you travel with a group. The SEQUENCE is the same, but the MEASUREMENT POINTS are different. And because of how your stopwatch rounds lap times, different measurement points give different final times. It’s not that the trip varies; it’s that the way you’re measuring it changes based on group size.
Not by much. Maybe a minute or two. But it’s never identical.
Same route. Different time. Every time.
You can imagine how frustrating this is. If there’s one thing machines like, it’s EXACTLY repeatable answers to questions. (Humans do too, but we’re more tolerant of small changes). And if you’re doing things a billion times, you REALLY need the exact same answers to the exact same questions.
I know everyone is constantly talking about the latest thing and how this is the revolution, but Thinking Machines Lab just announced something a month ago that I genuinely think is a huge pivot point for our industry. They published “Defeating Nondeterminism in LLM Inference,” and they didn’t just explain the problem. They figured out how to make the same route take the same time, every single time.
It doesn’t have a business model, but I have to believe every inference engine will be adopting it shortly.
What Everyone Thought Was Happening (But Was Wrong)
For years, people blamed nondeterminism on “random” parallelism: non-associative floating point plus unpredictable thread scheduling on GPUs - something basically like “phantom traffic jams”. Sounds right; parallel adds in different orders give different round-offs.
The accepted explanation, which you’ll find repeated everywhere, was this:
“Floating-point arithmetic in GPUs is non-associative, meaning $(a+b)+c \neq a+(b+c)$ due to finite precision and rounding errors. Because GPUs run operations in parallel across many threads, the execution order is unpredictable. This random order leads to different rounding patterns each time, creating nondeterminism.”
This explanation suggests that every time you take your trip, invisible, unpredictable traffic slows down random streets. The math itself has a “randomness” baked in from the thread scheduling. It seems plausible. Case closed. Nothing we can do about it.
Except the Thinking Machines team—Horace He, et al.—noticed something that didn’t fit. They ran this simple experiment:
import torch

A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
ref = torch.mm(A, B)
for _ in range(1000):
    assert (torch.mm(A, B) - ref).abs().max().item() == 0
Matrix multiplication on a GPU. Same matrices. 1,000 times in a row. The results were bitwise identical every single time.
The conventional wisdom wasn’t wrong about floating-point non-associativity—that’s real. But it missed the key insight: the nondeterminism isn’t RANDOM. It’s not phantom traffic jams appearing unpredictably. It’s that we’re using different stopwatches depending on how many people we’re traveling with.
A single GPU operation, run in isolation, is perfectly deterministic. The problem only appears when we COMPOSE operations differently based on batch configuration. Different batch sizes cause the inference engine to GROUP computations differently, which changes operation order, which changes rounding patterns. It’s systematic, not random.
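You can see this for yourself with a few lines of PyTorch (a sketch assuming a CUDA GPU; the exact outcome depends on your hardware and library versions). Multiply one row of A by B on its own, then compare it with the same row computed as part of the full batch:

import torch

A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)

row_alone    = torch.mm(A[:1], B)       # the row computed as a batch of one
row_in_batch = torch.mm(A, B)[:1]       # the same row, computed inside the full batch

# Each call is individually deterministic - repeat either one and you get the
# same bits back - but on many GPU setups the two disagree slightly, because
# the kernel groups the work differently for a 1-row input than for 2048 rows.
print((row_alone - row_in_batch).abs().max())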
So, if it’s not random, what’s actually happening?
The Real Cause
The problem isn’t spooky randomness - it’s reduction order. Think back to that route to the restaurant: you travel the same way every time, but if you add up the time for each segment of the route in a different sequence, you can land a second off.
Microscopic numeric deltas → a different argmax at temperature 0 → divergent tokens.
So, in the case of my dinner trip, imagine I time my 14-minute journey with a hyper-accurate digital stopwatch that calculates down to the nanosecond, even if it only displays minutes and seconds.
- Scenario 1: One Request. I time the whole trip as one segment. I press ‘start’ at my desk and ‘stop’ at the restaurant. The stopwatch calculates a single, precise duration: 14.1123173271 minutes.
- Scenario 2: Three Requests Batched Together. I decide to time three segments separately: (1) office to subway, (2) the subway ride, and (3) the walk to the restaurant. I press the ‘lap’ button at each stage.
Here’s the crucial part: because of the non-associativity of floating-point math—the fact that $(a+b)+c$ can be a microscopically different number than $a+(b+c)$—the way the stopwatch’s internal chip adds up the lap times gives a different result. It might calculate (lap1 + lap2) + lap3 and arrive at a final duration of 14.1123173274 minutes.
The difference is a rounding error a dozen decimal places out. It’s completely imperceptible to me. But it is a different number.
This is exactly what happens in an inference server like vLLM. It processes requests in batches to maximize GPU utilization.
- Processing 1 sequence? The GPU performs one set of grouped calculations.
- Processing 10 sequences? It groups the calculations differently to be more efficient.
Each grouping changes the order of operations—not randomly, but systematically and deterministically based on the batch size. A different order leads to different floating-point rounding patterns. This creates a slightly different numerical result, which can cause the model to pick a different “most probable” token, leading to a completely different output that cascades from that point on.
The problem wasn’t a phantom traffic jam. It was that our stopwatch was giving us a different measurement depending on how many lap times we asked it to record.
Why Addition Order Actually Matters
Floating-point arithmetic isn’t associative. This isn’t a bug - it’s mathematics. This isn’t GPU-specific - all floating-point math has this property. The GPU matters because GPUs process operations in parallel, and parallelism means the ORDER operations execute isn’t fixed unless you explicitly control it.
The paper gives a perfect example:
(0.1 + 1e20) - 1e20 = 0
0.1 + (1e20 - 1e20) = 0.1
Computers aren’t infinitely large, so you represent real (infinite precision) numbers with finite precision. Rounding happens at each operation, and different operation orders produce different rounding patterns.
Kahan summation exists precisely because naive summation loses accuracy. The whole field of numerical analysis exists because these details matter.
If I add up my trip segments like this: A + B + C, I might get one total. But if I group them differently, like (A + B) + C, the rounding in my mental math could produce a slightly different result. In machine learning, this is called “reduction strategy.”
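You can watch both effects in plain Python and NumPy - nothing GPU-specific, just a quick sketch:

import numpy as np

# The example from above, verbatim, in plain Python floats:
print((0.1 + 1e20) - 1e20)   # 0.0  - the 0.1 is swallowed by the huge intermediate
print(0.1 + (1e20 - 1e20))   # 0.1  - grouping differently preserves it

# The same thing at scale: summing a million float32 values in one pass vs. in
# 64 chunks (a different "reduction strategy") usually gives totals that differ
# in the last bits, even though the inputs are identical.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

one_pass = x.sum(dtype=np.float32)
chunked = np.float32(0.0)
for chunk in np.array_split(x, 64):
    chunked += chunk.sum(dtype=np.float32)

print(one_pass == chunked)   # frequently False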
The paper introduces a potential solution:
“The requirement for batch invariance is that the reduction order for each element must be fixed regardless of the batch-size of the kernel. Note that this doesn’t mean we must always use the same reduction strategy. For example, if we change the number of elements we’re reducing over, we can still be batch-invariant even if our reduction strategy changes.”
The Three Operations That Need Fixing
The solution sounds simple: take the same route every time, regardless of how many people you’re grouping together. In practice, this requires rethinking how three fundamental GPU operations work.
Operation One: RMSNorm (How the Stopwatch Adds the Segments)
“Standard implementations parallelize by splitting the reduction across multiple workers... Different split points = different reduction orders.”
What is RMSNorm? Root Mean Square Normalization is used in models like LLaMA to compute a scaling factor for a vector. This requires reducing (summing up) thousands of values into one number.
Here’s the problem: standard implementations parallelize this by splitting the work across multiple GPU workers. But different batch sizes mean different numbers of workers, which means different addition patterns.
The Stopwatch Metaphor: You’ve recorded 1,000 segment times during your journey. How does the stopwatch sum them?
- Traveling alone (small batch, fewer GPU workers): The stopwatch adds the segment times in pairs - ((segment1+segment2) + (segment3+segment4)) + ((segment5+segment6) + (segment7+segment8)) - building a tree where it combines pairs, then pairs of pairs, working its way up.
- Traveling with friends (large batch, more GPU workers): To go faster, it might add them in groups of four - (segment1+segment2+segment3+segment4) + (segment5+segment6+segment7+segment8) - a different tree structure entirely.
Because floating-point arithmetic is non-associative, these two addition trees produce microscopically different totals.
Same numbers, different grouping, different final answer.
The Fix: Lock down the addition tree structure regardless of batch size. Always combine lap times the exact same way, even if it means some GPU cores sit idle with small batches. You’re trading potential efficiency for guaranteed determinism. In research, debugging, and RL training? Worth it. In production serving at scale? Maybe not yet. Pick your poison.
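To make the idea concrete, here’s a minimal sketch of a fixed-order, chunked reduction in the shape of an RMSNorm. This is my illustration of the concept, not the paper’s kernels: each row is reduced over its hidden dimension in fixed-size chunks, combined left to right, and nothing about that order depends on how many rows are in the batch.

import torch

def rmsnorm_fixed_chunks(x, weight, eps=1e-6, chunk=256):
    # Sketch only: sum the squares of each row in fixed-size chunks, combined
    # in a fixed left-to-right order. The chunk boundaries and combination
    # order at this level never change with the batch size.
    acc = torch.zeros(x.shape[:-1], dtype=torch.float32, device=x.device)
    for start in range(0, x.shape[-1], chunk):
        piece = x[..., start:start + chunk].float()
        acc = acc + (piece * piece).sum(dim=-1)
    rms = torch.sqrt(acc / x.shape[-1] + eps)
    return (x.float() / rms.unsqueeze(-1) * weight.float()).to(x.dtype)

The real fix lives inside the GPU kernel, where locking the reduction tree can leave some cores idle at small batch sizes - that’s the efficiency trade mentioned above.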
Operation Two: Matrix Multiplication (How the Laps are Defined)
“Modern GPU kernels tile operations... Tile size typically depends on available parallelism. More sequences in your batch? Larger tiles... Different tiles = different accumulation patterns.”
The Stopwatch Metaphor: This isn’t about how you add segment times - it’s about where you place the segment boundaries in the first place.
- Small batch: To be efficient with limited parallelism, the system decides a “segment” is every 2 blocks. Your 14-block trip gets measured as 7 equal segments.
- Large batch: More parallelism available, so the system changes the segment size to 4 blocks for efficiency. Now that same 14-block trip is three full segments plus one partial segment.
You’re still traveling the same 14 blocks. But you’re breaking them into different segments, which means different intermediate sums, which means different rounding patterns, which means different final totals.
The Fix: Lock down the tile size regardless of batch configuration. A segment is always exactly 2 blocks, period.
Same segmentation, same intermediate values, same rounding, same result. You sacrifice some GPU efficiency—those cores could be working harder with larger tiles—but you buy back determinism.
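The same idea, sketched at the Python level for matrix multiplication: pin the split points along the K dimension to a fixed tile width so they never move when the batch dimension changes. (Again, this only illustrates the concept; the paper’s actual fix is inside the GPU kernels.)

import torch

def matmul_fixed_k_tiles(A, B, k_tile=512):
    # Accumulate partial products over K in fixed-width tiles, combined in a
    # fixed order, so the accumulation pattern doesn't depend on how many rows
    # (how big a batch) we're multiplying.
    M, K = A.shape
    _, N = B.shape
    out = torch.zeros(M, N, dtype=torch.float32, device=A.device)
    for k0 in range(0, K, k_tile):
        out = out + A[:, k0:k0 + k_tile].float() @ B[k0:k0 + k_tile, :].float()
    return out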
Important nuance: When you hold everything else constant (hardware, drivers, libraries, tensor shapes, flags), a single isolated GPU matrix multiplication IS bitwise-identical across runs. The “random thread scheduling causes random results” story isn’t quite right. The real issue is how we aggregate partial results differently based on batch configuration. It’s systematic, not random.
Operation Three: Attention (The Actually Hard Problem)
Here’s the nightmare scenario from the paper:
“You have 80 tokens in your KV cache and you’re processing 48 new tokens. With a block size of 32, you need three blocks for cached values (two full, one masked) and two blocks for new values (one full, one masked)—five total blocks for 128 elements. But if you had zero cached tokens and were processing all 128 at once? Four blocks total. Same number of elements, completely different reduction organization.”
Attention lets a model weigh the importance of different tokens. Modern methods like FlashAttention and PagedAttention are highly optimized, especially when dealing with a mix of existing tokens (in the KV cache) and new ones.
The Stopwatch Metaphor (which is getting strained here): Imagine arriving at the restaurant and trying to calculate average travel time. You have two separate lists of segment times: one from people who already arrived (KV cache) and one from your own journey right now (new tokens).
The system might process these lists separately, sum each, then combine the totals. Or it might merge them into one list first, then sum everything. Which strategy it picks depends on how many of each type you have. Different grouping strategies, different addition orders, different rounding, different results.
I’m going to level with you: this is where forcing everything into the dinner metaphor stops being helpful. Attention mechanisms are complicated enough that the analogy does more harm than good. What matters is this:
- Update the KV cache FIRST so old and new tokens exist in one consistent memory layout before any attention calculations begin.
- Use fixed split sizes, not fixed split counts. Instead of “divide this into 4 equal chunks no matter what,” you say “every chunk is exactly 32 tokens.” Whether you’re processing 100 tokens or 128 tokens, the chunk boundaries stay in the same positions relative to the data, giving you consistent reduction patterns.
This required contributing changes to PyTorch’s FlexAttention - that’s how deep this fix goes. No amount of clever application-level code can fix this - you need to change the primitives.
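Here’s a toy illustration of the “fixed split sizes, not fixed split counts” rule - my own example with a 32-token chunk, not code from the paper. With fixed sizes, the chunk boundaries over the shared prefix are identical whether you’re attending over 100 or 128 tokens; with a fixed count, every boundary moves when the length changes:

def splits_fixed_size(total_len, split_size=32):
    # Fixed split SIZE: every chunk is exactly split_size tokens (the last may be partial).
    return [(s, min(s + split_size, total_len)) for s in range(0, total_len, split_size)]

def splits_fixed_count(total_len, num_splits=4):
    # Fixed split COUNT: the chunk width depends on the total length,
    # so the boundaries shift whenever the KV length changes.
    width = -(-total_len // num_splits)  # ceiling division
    return [(s, min(s + width, total_len)) for s in range(0, total_len, width)]

print(splits_fixed_size(100))   # [(0, 32), (32, 64), (64, 96), (96, 100)]
print(splits_fixed_size(128))   # [(0, 32), (32, 64), (64, 96), (96, 128)]
print(splits_fixed_count(100))  # [(0, 25), (25, 50), (50, 75), (75, 100)]
print(splits_fixed_count(128))  # [(0, 32), (32, 64), (64, 96), (96, 128)]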
The Soup Dumplings Experiment
Here’s where the rubber meets the road:
“We use Qwen/Qwen3-235B-A22B-Instruct-2507 and sample 1000 completions at temperature 0 with the prompt ‘Tell me about Richard Feynman’ (non-thinking mode), generating 1000 tokens each.”
Temperature 0 should be the easy mode: always pick the single most likely next token. No creativity, no randomness. It should be perfectly deterministic. Same prompt. Same temperature. Same model.
The nondeterminism isn’t in the sampling step - that part works fine. It’s in the FORWARD PASS computing the logits that determine which token is “most likely.” Different batch sizes change how those logits are computed, changing which token wins. Different results.
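A toy example of how small that nudge needs to be - these aren’t real model logits, just two nearly-tied candidates:

import torch

# At temperature 0, "sampling" is just an argmax over the logits. When two
# candidates are nearly tied, numeric drift a handful of ulps wide is enough
# to flip the winner - and every token generated after that point diverges.
logits_batch_a = torch.tensor([4.700000, 4.700001, 1.2])  # one batch configuration
logits_batch_b = torch.tensor([4.700002, 4.700001, 1.2])  # tiny drift in another

print(torch.argmax(logits_batch_a).item())  # 1
print(torch.argmax(logits_batch_b).item())  # 0 - a different "most likely" token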
With standard vLLM:
- 80 different outputs from 1000 runs
- Most common output appeared 78 times (less than 8% of runs!)
- First divergence at token 103
- 992 completions said one thing
- 8 completions said something else
Same prompt. Same temperature. Same model. Different results.
But when they switched on batch-invariant kernels:
“... all of our 1000 completions are identical. This is what we would mathematically expect from our sampler, but we aren’t able to achieve deterministic results without our batch-invariant kernels.”
Same route. Same time. Every. Single. Time.
One thousand runs. One output. Bitwise identical.
The time I predicted in my head was exactly the time it took, every single time.
The Performance Trade-Off
You don’t get this for free. Their initial implementation runs about 2x slower (55 seconds vs 26 seconds). With optimization, 1.6x slower (42 seconds).
The paper is honest:
“Much of the slowdown comes from the fact that the FlexAttention integration in vLLM has not been heavily optimized yet. Nevertheless, we see that performance is not disastrous.”
Is 1.6x slower acceptable? Depends.
For production serving billions of queries where milliseconds matter? Maybe not yet.
For research where reproducibility is paramount? Absolutely.
For model development and testing where you need exact repeatability? Without question.
For reinforcement learning from human feedback where policy drift can break training? This might be necessary, not optional.
True On-Policy RL: The Big Unlock
Here’s where this goes from “nice infrastructure improvement” to “fundamentally changing what’s possible.” The paper drops this point almost casually, but its impact is profound:
“As researchers have noted, the different numerics between training and inference implicitly turns our on-policy RL into off-policy RL. ... [D]eterministic inference enables us to also modify our training stack to obtain bitwise identical results between sampling and training, thus resulting in true on-policy RL.”
To understand why this is such a big deal, we need to quickly break down the terms.
What is On-Policy vs. Off-Policy RL?
In reinforcement learning, a policy is just the agent’s strategy - in our case, the specific route to the restaurant.
On-Policy: You learn from the exact route you’re taking, right now. You take the 6 train, it’s slow, and you learn “the 6 train is slow at this time.” The policy you’re improving is the same one you’re using to gather experience.
Off-Policy: You learn about a different route than the one you’re currently on. You’re stuck on the 6 train, but you check your phone to see how the F train is doing. You’re learning about the F train’s performance without actually riding it.
The Miscalibrated Watch Problem
The numerical drift between generating text (sampling) and learning from it (training) accidentally turns on-policy methods into off-policy ones.
It’s like trying to optimize your route to the restaurant, but your watch is slightly miscalibrated each time, so you can’t be sure if a change actually helped or if the measurement just varied.
The standard fix is a patch called importance weighting, where you try to mathematically correct for the drift. But you’re just patching over a problem that shouldn’t exist in the first place.
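Schematically, importance weighting rescales each sampled token’s contribution by the ratio between the probability the trainer assigns to it and the probability the sampler assigned when it was generated (notation mine): $w_t = \pi_{\text{train}}(a_t \mid s_t) / \pi_{\text{sample}}(a_t \mid s_t)$. If sampling and training were numerically identical, every $w_t$ would be exactly 1 and the correction would be a no-op - which is exactly what batch-invariant kernels make true.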
Calibrating the Watch
Batch-invariant kernels solve this by ensuring the “route” of the calculations is bit-for-bit identical every time. This creates true on-policy RL. The results from the paper are striking:
- Without the patch: The model’s performance quickly collapses.
- With the patch (Importance Weighting): Training works, but it’s unstable, wobbling around a small amount of drift.
- True On-Policy (Batch-Invariant): The drift is a flat line at zero. The training is perfectly stable.
As the paper notes:
“...when running ‘True On-Policy RL’, our KL-divergence stays flat at 0, indicating that there is no divergence between the training policy and sampling policy.”
My watch is finally calibrated. Same route, same time. Now I can actually optimize.
This isn’t just theoretical. It directly impacts the entire post-training phase of LLM development, making methods like RLHF and DPO more stable and reliable.
What Actually Changes Now
Let me be concrete, because lots of people scream “THIS CHANGES EVERYTHING,” but it doesn’t mean much without specific use cases.
Research Reproducibility Becomes Real
Right now, validating that model A outperforms model B requires running multiple trials and computing statistical significance. Not because the models are inherently random at temperature 0, but because your measurement apparatus is inconsistent.
With deterministic inference:
- Precise A/B tests with tight bounds
- Claims become testable with certainty
- Replication studies actually replicate exactly
- Meta-analyses don’t worry about implementation differences
The replication crisis in AI research is partly about researchers not sharing details. But it’s also about implementations subtly differing. Batch-invariant kernels remove one source of variation.
Debugging Gets Orders of Magnitude Easier
When a model produces bad output, reproducing it exactly means you can:
- Trace through execution step by step, including inspecting intermediate activations
- Add instrumentation and re-run with identical results
- Binary search through the token sequence to find where things went wrong
Nondeterministic systems make debugging probabilistic. “Well, usually it does X, but sometimes it does Y” is a developer’s worst nightmare.
With determinism, debugging becomes systematic. Every run is identical. You can use all the normal debugging tools and trust that what you see is what you’ll get next time.
Caching Becomes Bulletproof
Production LLM serving uses aggressive caching. Common queries, prefix caching, continuous batching with shared KV caches - all assume identical inputs produce identical outputs.
But with nondeterminism, that assumption is leaky. Cache hit rates are lower than they should be.
Deterministic inference means:
- Perfect cache hit rates for identical inputs
- Ability to cache and reuse intermediate computations
- Simpler cache invalidation logic
For companies serving millions of queries, this translates directly to infrastructure cost savings.
Model Testing Becomes Precise
Quality assurance for LLMs is currently statistical. You can’t write a test that says “for this exact input, model must produce this exact output” because you can’t guarantee exact outputs.
With deterministic inference:
- Write precise regression tests that detect when model updates change specific behaviors
- Build comprehensive test suites that don’t flake
- Use differential testing between model versions confidently
Being able to trust your tests means faster iteration and fewer production surprises.
Research Velocity Increases
Maybe the biggest effect: researchers spend less time on statistics, more time on science.
Right now, substantial ML research time goes to:
- Running enough trials to achieve statistical power
- Correcting for various sources of variance
- Arguing about whether differences are significant
With deterministic inference, you eliminate one major source of variance. Experiments become simpler. Results become clearer. More time on actual research questions, less on measurement methodology.
It’s like how version control doesn’t directly make your code better, but it removes friction and coordination overhead, so teams move faster. Deterministic inference removes friction from the research process.
Why This Is Infrastructure That Matters
I started this post talking about how I’m chronically late to dinner because I can’t predict travel time. Machine learning has had the same problem, except worse: even when you know the route, you still can’t predict the time.
Thinking Machines Lab could have kept this proprietary. Built “Deterministic Inference as a Service.” Charged premium prices. Created a moat.
Instead, they:
- Published the full paper with complete technical details
- Released the code under permissive licenses
- Contributed improvements back to PyTorch FlexAttention
- Wrote extensive documentation
- Shared benchmark results
There’s no business model here because there shouldn’t be one.
The paper concludes:
“Modern software systems contain many layers of abstractions. In machine learning, when we run into nondeterminism and subtle numerical differences it can often be tempting to paper over them. After all, our systems are already ‘probabilistic’, so what’s wrong with a little more nondeterminism? What’s wrong with bumping up the atol/rtol on the failing unit test? The difference in logprobs between the trainer and the sampler probably isn’t a real bug, right? We reject this defeatism.”
“We reject this defeatism.” What a line.
Everyone accepted nondeterminism as inevitable. Built workarounds. Adjusted tolerances. The entire ecosystem adapted to work around the problem rather than solving it.
Thinking Machines Lab asked: “What if we actually solve this?” Not “how do we minimize impact” or “how do we statistically correct for it,” but “what’s the root cause and how do we eliminate it?”
This is systems thinking applied to infrastructure. The problem isn’t floating-point arithmetic or GPU concurrency per se - it’s how we organize work across different batch configurations. The solution isn’t to fight floating-point behavior, but to ensure consistent operational ordering regardless of context.
The Deeper Pattern
Most “infrastructure improvements” are about making existing things faster or cheaper. Speed up training. Reduce serving costs. Compress models. These matter - FlashAttention matters, quantization matters, efficient architectures matter.
But occasionally, someone fixes a problem that was so fundamental we stopped seeing it as a problem at all. We adapted. Built workarounds. The problem became part of the landscape, like potholes you just learn to avoid rather than fill.
- Kubernetes didn’t make containers faster - it made container orchestration not be a custom nightmare for every company.
- Git didn’t make code better - it made collaboration not be a coordination nightmare.
- Rust didn’t make systems programming faster - it made memory safety not require garbage collection.
These are foundational. They remove entire classes of problems. You don’t work around them - you stop having to think about them entirely.
Batch-invariant kernels do this for LLM inference reproducibility. It’s not a workaround. It’s not a patch. It’s not “let’s add importance weighting to correct for drift.” It’s a solution. The problem stops existing.
The difference between a workaround and a solution is this: a workaround means you still have to think about the problem. A solution means the problem disappears from your mental model entirely. When’s the last time you thought about whether your text editor would correctly handle Unicode? Never, because that problem got solved at the infrastructure level and you just stopped thinking about it.
That’s what deterministic inference does. It removes an entire category of “wait, why did this behave differently this time?” questions. Your debugging sessions get shorter. Your tests get more reliable. Your research becomes more reproducible. Not because you got better at working around nondeterminism, but because nondeterminism stopped being a thing you had to work around.
Thinking Machines Lab understood this. They knew they could have built a business here. But they also understood that this particular innovation is more valuable if it becomes ubiquitous. Making it freely available means every inference engine can adopt these techniques. Every research lab can reproduce results. Every production deployment can get consistent behavior.
The field moves faster when the foundations are solid and shared.
What Happens Next
I expect batch-invariant kernels to show up in:
- vLLM as an opt-in flag initially, then possibly default
- TensorRT-LLM within months
- Text Generation Inference (HuggingFace)
- llama.cpp for local inference
- Major cloud providers’ serving infrastructure
This won’t be a competitive differentiator for long. It’ll just become how inference works. Like how HTTPS used to be optional and is now expected. Like how Unicode support used to be a feature and is now assumed.
Which is exactly right.
We spend enormous energy talking about breakthrough model architectures. Mixture of Experts. State Space Models. Long context. These matter. Architecture matters.
But infrastructure like this - unglamorous, technically deep, freely shared - is what makes systematic progress possible. When the foundations are solid, everything built on top becomes more reliable. When measurements are consistent, optimization becomes possible. When experiments are reproducible, science can function.
Two Problems, One Solution
Let me bring this home. We started with two different nondeterminism problems:
Problem #1: We don’t know the route. During training, machine learning doesn’t tell you what matters. You get a loss score at the end, but no GPS tracking along the way. This is an interpretability problem, a training problem, an “understanding what the model learned” problem. Still unsolved. People are working on it—mechanistic interpretability, attribution methods, understanding which features matter. That’s the next frontier.
Problem #2: Even when we DO know the route, we get different times every time. You have a trained model. You know the weights. You’re just doing inference—the same forward pass, the same computation. But batch size changes how you group operations, which changes reduction order, which changes rounding, which changes results. Nondeterminism at temperature 0, where there should be zero randomness. Could not reproduce results. Could not debug reliably. Could not do true on-policy RL. Could not write tests that don’t flake.
Thinking Machines Lab solved Problem #2. Not with a workaround—with a solution. They discovered the root cause (batch configuration changing operation grouping) and fixed it at the kernel level. Now if you take the same route, you get the same time, bitwise identical, 100% of the time, one thousand runs in a row.
We still haven’t solved Problem #1. But now we can measure accurately. And measurement is the foundation of science. Can’t optimize what you can’t measure consistently.
My watch finally works. Same route, same time, every time.
Now, if you’ll excuse me, I have a reservation to get to. And with deterministic inference, I can tell you with absolute certainty exactly when I’ll arrive.
(It’ll still be late. But at least the lateness will be reproducible.)
Onward.
The full paper “Defeating Nondeterminism in LLM Inference” includes extensive technical details, benchmarks, and ablation studies. The batch-invariant kernel implementations are available at github.com/thinking-machines-lab/batch-invariant-ops. Related work on FlexAttention improvements has been upstreamed to PyTorch.
It’s a scorcher, go read it!
For more on why deterministic computation matters in ML systems, see Reproducible Machine Learning, Numerical Reproducibility in HPC, and the broader replication crisis in AI.
Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don’t. Who am I to tell you what to do.
NOTE: I’m currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost. I’d love to hear your thoughts!