The best Chinese LLMs offer
- frontier performance on some benchmarks;
- massive per-token discounts (~3x on input, ~6x on output);
- the weights. On-prem with fully free ~MIT licence, self-hosting, white-box access, customisation, with zero markup (and in fact zero revenue going to the Chinese companies);
- with a bit of work you can get much faster token speeds than the closed APIs;
- less overrefusal (except on CCP talking points);
- on topics controversial in the West, less nannying.
- they just added the search agents that make daily use actually worthwhile;
- They’re the most-downloaded open models.
As a result, going off private information, open-model fan Nathan Lambert says “Chinese open models have become a de facto standard among startups in the US”. Among the few Westerners to stick their necks out and admit it is Airbnb (Qwen). Windsurf’s planner is probably GLM; Cursor’s planner may be DeepSeek.
And yet
- outside China, they are mostly not used, even by the cognoscenti. All Chinese models combined are currently at 19% among the highly selected group of people who use OpenRouter, and over 2025 they trended downwards there. In the browser and on mobile they're probably <<10% of global use;
- they are severely compute-constrained (and as of November 2025 their algorithmic advantage is unclear), so this implies they actually can’t have matched American models;
- they’re aggressively quantizing at inference-time, 32 bits to 4;
- state-sponsored Chinese hackers used closed American models for incredibly sensitive operations, giving the Americans a full whitebox log of the attack!
What gives?
“Tigers”?
The title alludes to the “6 AI Tigers” named in business rags as DeepSeek, Moonshot, Z.ai, MiniMax, StepFun, and 01.ai. (This is because they’re trying to hype startups specifically; the conglomerates Alibaba and Baidu are way more relevant than the latter two.)
Filtered evidence
The evidence is dreadful because everyone has a horse in the race and (in public) is letting it lead them:
- Static evals are weak evidence even when they’re not being adversarially hacked and hill-climbed.
- Some Americans are downplaying the Chinese models out of cope.
- Some Americans are hyping the Chinese models to suppress domestic AI regulation.
- Some Americans are hyping the Chinese models to boost international AI regulation.
- The Chinese are obviously talking their book.
What could explain this?
Maybe the evals are misleading?
1. frontier performance on some benchmarks
The naive view - the benchmark view - is that they’re very close in “intelligence”:

But these benchmarks are not strong evidence about performance on new inputs or the latent (general and unobserved) capabilities. It’d be natural to read “89%” success on a maths benchmark as meaning an 89% probability that it would correctly handle unseen questions of that difficulty in that domain (and indeed this is what cross-validation was originally designed to estimate). But in the kitchen-sink era of AI, where every system has seen a large proportion of all data ever digitised, and so has already seen some variant of many of all possible new questions, you can’t read it that way.
In fact it’s not even an 89% probability of answering these same questions right again, as shown by the fact that people report the results as “avg@64” (the average performance if you ask the same question 64 times).
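(For concreteness, here's roughly what avg@k means, as a toy sketch; the `model` and `grade` callables are stand-ins, not any particular harness.)

```python
import random

def avg_at_k(model, question, grade, k=64):
    """Ask the same question k times; return the fraction graded correct.
    This is the 'avg@k' figure labs report: an estimate of the model's
    per-question success probability, not a pass/fail fact about it."""
    return sum(grade(model(question)) for _ in range(k)) / k

# Toy stand-in "model" that answers a given question correctly 70% of the
# time: its avg@64 will bounce around 0.7 from run to run, and its avg@1
# is literally a coin weighted 70/30.
fake_model = lambda q: random.random() < 0.70
print(avg_at_k(fake_model, "toy question", grade=bool, k=64))
```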
There are dozens of ways to screw up or hack these numbers. I’ll only look at a couple here but I welcome someone doing something more systematic.
Even less generalisation?
- Maybe Chinese models generalise to unseen tasks less well. (For instance, when tested on fresh data, 01’s Yi model fell 8pp (25%) on GSM - the biggest drop amongst all models.)
We can get a dirty estimate of this by the “shrinkage gap”: look at how a model performs on next year’s iteration of some task, compared to this year’s. If it finished training in 2024, then it can’t have trained on the version released in 2025, so we get to see what they’re like on at least somewhat novel tasks. We’ll use two versions of the same benchmark to keep the difficulty roughly on par. Let’s try AIME:
AIME 2024 vs 2025 Model Performance
(using the Artificial Analysis harness)
| Model | AIME 2024 | AIME 2025 | pp fall | % fall |
|---|---|---|---|---|
| Kimi K2 | 69.3 | 57.0 | -12.3 | -17.7 |
| MiniMax-M1 80k | 84.7 | 61.0 | -23.7 | -28.0 |
| DeepSeek-v3 | 39.2 | 26.0 | -13.2 | -33.7 |
| DeepSeek V3 0324 | 52.0 | 41.0 | -11.0 | -21.2 |
| Qwen3 235B (Reasoning) | 84.0 | 82.0 | -2.0 | -2.4 |
| Kimi K2-Instruct | 69.6 | 49.5 | -20.1 | -28.9 |
| DeepSeek R1 0528 | 89.3 | 76.0 | -13.3 | -14.9 |
| Chinese models (average) |  |  | -13.7 | -21.0 |
| Gemini-2.5 Pro | 88.7 | 87.7 | -1.0 | -1.1 |
| Gemini 2.5 Flash (Reasoning) | 82.3 | 73.3 | -9.0 | -10.9 |
| Claude 4 Opus Thinking | 75.7 | 73.3 | -2.4 | -3.2 |
| o4-mini (high) | 94.0 | 90.7 | -3.3 | -3.5 |
| GPT-4.1 | 43.7 | 34.7 | -9.0 | -20.6 |
| Nova Premier | 17.0 | 17.3 | 0.3 | 1.8 |
| GPT-4o Nov 24 | 15.0 | 6.0 | -9.0 | -60.0 |
| Magistral Medium | 73.6 | 64.9 | -8.7 | -11.8 |
| Claude 3.7 Sonnet | 61.3 | 56.3 | -5.0 | -8.2 |
| OpenAI-o1-0912 | 74.4 | 71.5 | -2.9 | -3.9 |
| o3 | 90.3 | 88.3 | -2.0 | -2.2 |
| Grok 4 | 94.3 | 92.7 | -1.6 | -1.7 |
| Western models (average) |  |  | -4.5 | -10.4 |
| Overall average | 68.3 | 60.5 | -7.9 | -14.3 |
- Almost all models get worse on this new benchmark, despite 2025 being the same difficulty as 2024 (for humans). But as I expected, Western models drop less: they lost 10% of their performance on the new data, while Chinese models dropped 21%. p = 0.09.
Averaging across crappy models for the sake of a cultural generalisation doesn’t make sense. Luckily, rerunning the analysis with just the top models gives roughly the same result (9% gap instead of 11%).
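(For the curious, here's roughly the significance check involved, using the "% fall" column from the table above. The permutation test is my choice of illustration; I'm not claiming it's the exact test behind the p = 0.09.)

```python
# Permutation test on the difference in mean "% fall" between the two
# groups, using the values from the table above.
import numpy as np

chinese = [-17.7, -28.0, -33.7, -21.2, -2.4, -28.9, -14.9]
western = [-1.1, -10.9, -3.2, -3.5, -20.6, 1.8, -60.0, -11.8, -8.2, -3.9, -2.2, -1.7]

rng = np.random.default_rng(0)
observed = np.mean(chinese) - np.mean(western)   # ≈ -10.5pp of extra shrinkage
pooled = np.array(chinese + western)
n = len(chinese)

# How often does a random split of the same 19 numbers produce a gap at
# least as large as the observed one?
null = np.array([
    (p := rng.permutation(pooled))[:n].mean() - p[n:].mean()
    for _ in range(20_000)
])
p_value = np.mean(null <= observed)
print(f"observed gap {observed:.1f}pp, one-sided p ≈ {p_value:.2f}")
```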
One way for generalisation to fail despite apparently strong eval performance is contamination, training on the test set. But (despite the suggestive timing) the above isn’t strong evidence that that’s what happened. It just tells us that Kimi and MiniMax and DeepSeek generalise worse on this task; it doesn’t tell us why.
Details
Here’s a Colab with everything except the actual execution of my silly manual Kimi 1.5 run.
First, test for an obvious confounder: check if the 2025 AIME exam was around as hard as 2024’s (answer: yes; in fact humans did 4% better in 2025). (TODO: check if 2025 had more combinatorics, which AI struggles with.)
(To be strict we should limit this to models which finished training before 12th February 2025, when the questions were released. But as you can see we don't need to; it's a very clear result anyway.)
Results are too sensitive to the eval harness to rely on one setup. I found three comparisons of AIME 2024 and AIME 2025 by three different groups: Artificial Analysis, GAIR, and Vals.
(I also wasted ages manually evaluating a missing model, Kimi 1.5. It dropped 75%, the most of any model, but it’s only avg@2 and I couldn’t control reasoning length or temperature.)
How did our two replications do? The main issue is that they were testing really different and crappy sets of models. The shrinkage gap is smaller in both cases:
- GAIR: Chinese -19.4%, Western -15.6%.
- Vals actually show nothing: 11.2% vs 10.8%. If you kick Meta out the gap goes up to 2%, still not much.
I’m not worried about these contradictory results; they both just include a lot of bad models and so noise. I don’t actually care how Llama 4 Scout’s generalisation compares to QwQ-uwu-435B-A72B-destruct-dpo-ppo-grpo-orpo-kto-slerp-v3.5-beta2-chat-instruct-base-420-blazeit-early-stopped-for-vibes.
(Actually AIME’s a funny choice of benchmark given that 2025 had a bunch of semantic duplicates from before the cutoff. But that just makes the above a lower bound on the fall in performance.)
A big win for Qwen and a huge win for Amazon!
Claude is adorably confused about this. I didn’t even ask it for this analysis:

TODO: Another way to get past goodharting pressure is to look at hard but obscure evals which no one ever reports. e.g. PROOFGRID.
Hacking
Or you can do the usual hacking: putting special and unrepresentative effort in during testing. e.g. Kimi’s benchmarks come from “Heavy mode” (8 parallel instances with an aggregation instance on top). You can’t do this via the API or out of the box with the weights. (Could you say the same for OpenAI?)
You can run the test on a model which is better than the one you serve. Moonshot credibly claim to have reported their benchmarks at the same low-precision quantization (INT4) at which they serve users, but others don't claim this.
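(For reference, "Heavy mode" as described is just parallel sampling plus an aggregation call, something like the sketch below; `ask` is any prompt-to-completion client, and the prompts are mine, not Moonshot's actual scaffold.)

```python
from concurrent.futures import ThreadPoolExecutor

def heavy_mode(ask, question, n=8):
    """Sketch of a 'Heavy mode' style setup: n parallel attempts, then one
    aggregation call that reconciles them into a final answer. `ask` is any
    prompt -> completion callable (your API client)."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        drafts = list(pool.map(ask, [question] * n))
    numbered = "\n\n".join(f"Attempt {i+1}:\n{d}" for i, d in enumerate(drafts))
    return ask(
        f"Question:\n{question}\n\nHere are {n} independent attempts:\n"
        f"{numbered}\n\nSynthesise them into a single best final answer."
    )
```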
In fairness
I should say the Chinese models do very well on LMArena - despite being unfairly penalised. But Arena is a poor measure of actual ability. It is a decent test of style though. I put this gap down to American labs overoptimising: post-training too hard and putting in all kinds of repugnant corporate ass-covering stuff in the spec.
Also Qwen is famous for capability density: the small versions being surprisingly smart for their size.
The D word
- Distillation is second-rate intelligence, and there’s some evidence that they are distilling off of American models to some extent. See also the excellent Slop Profile from EQ Bench, which estimates that the new Kimi is closer to Claude than to its own base model.

But anyway I don’t claim this is a major factor here, maybe another 5%.
The above isn’t novel; it’s common knowledge there’s some latent capabilities gap. This is often put in terms of them being “3 months behind”, but these estimates are still assuming that brittle, ad hoc, and heavily goodharted benchmarks have good external validity. I’d guess more like 12 months.
Unreliability?
1. frontier performance on some benchmarks
The above benchmarks are mostly single-shot, but people are now pushing LLMs to do more complicated stuff. One very flawed measure of this is the HCAST time horizon for software engineering: on that, DeepSeek R1 had a 31 minute “time horizon” compared to Opus 4’s 80 minutes.
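(For the unfamiliar: a 50% time horizon is read off a logistic fit of task success against log task length, roughly as in this sketch; the task outcomes below are invented, and METR's actual pipeline has many more moving parts.)

```python
# Fit success vs log2(task length), then read off the length at which
# predicted success crosses 50%. Toy data, illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
succeeded    = np.array([1, 1, 1, 1, 1,  0,  1,   0,   0,   0])  # per-task outcomes

X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, succeeded)

# P(success) = 0.5 where the linear score is zero: intercept + coef * log2(t) = 0
horizon_minutes = 2 ** (-clf.intercept_[0] / clf.coef_[0][0])
print(f"50% time horizon ≈ {horizon_minutes:.0f} minutes")
```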
There are various worse agent benchmarks, and e.g. the new Kimi posts great numbers on them. But on vibe I’d bet on a >3x reliability advantage for Claude.
As well as reliability over time, there’s stability over inputs. Maybe the Chinese models are higher variance or more sensitive to the prompt and hyperparams.
Harder to elicit?
TODO: I’ve been meaning to run the obvious experiment, which is to just see if they have a bigger gap between pass@1 and pass@64 success rates.
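(The estimator for that experiment is standard: the unbiased pass@k formula from the HumanEval paper. A sketch with made-up counts:)

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n samples per problem of which c are
    correct, the probability that at least one of k randomly drawn samples
    is correct (numerically stable form of 1 - C(n-c,k)/C(n,k))."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Gap between "does it reliably" and "can do it at all":
n_samples, n_correct = 64, 20                 # made-up counts
print(pass_at_k(n_samples, n_correct, k=1))   # ≈ 0.31 (≈ avg@1)
print(pass_at_k(n_samples, n_correct, k=64))  # = 1.0 here, since c >= 1
```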
TODO: Intentionally underelicit! Rerun the models on AIME 2024 with only a basic prompt. My results will be lower; the gap tells us how much the labs’ own intense tuning helps / is necessary. This tells us something about, not their capability, but their actual in-the-wild performance with normal lazy users.
Tokenomics: no effective discount
2. massive per-token discounts (~3x on input, ~6x on output)
Distinguish intelligence (max performance), intelligence per token (efficiency), and intelligence per dollar (cost-effectiveness).
The 3-6x discounts I quoted are per-token, not per-success. If you had to use 6x more tokens to get the same quality, then there would be no real discount. And indeed DeepSeek and Qwen (see also anecdote here about Kimi, uncontested) are very hungry:

And in this graph you can clearly see a 2-4x difference (with Gemini and Kimi K2-base as the big exceptions):

And the resulting cost is a mixed bag:

I won’t use AA’s efficiency estimates, because again I think the benchmarks underlying them are bad evidence.
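(But to be fair to the argument, here's the back-of-envelope I mean: effective cost is dollars per solved task, not per token. Every number below is an illustrative placeholder, not a measured price or token count.)

```python
def cost_per_success(in_price, out_price, in_tokens, out_tokens, success_rate):
    """Expected $ per solved task: price per attempt / P(attempt succeeds).
    Prices are $ per million tokens."""
    per_attempt = (in_tokens * in_price + out_tokens * out_price) / 1e6
    return per_attempt / success_rate

# Hypothetical: a model ~5x cheaper per token but 4x chattier and a bit
# less reliable. The headline discount shrinks to ~1.4x per success here.
closed_model = cost_per_success(in_price=3.0, out_price=15.0,
                                in_tokens=2_000, out_tokens=1_500, success_rate=0.80)
open_model   = cost_per_success(in_price=0.6, out_price=2.5,
                                in_tokens=2_000, out_tokens=6_000, success_rate=0.65)
print(f"closed ≈ ${closed_model:.3f}/success, open ≈ ${open_model:.3f}/success")
```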
Self-hosting has high fixed costs
3. the weights. On-prem with fully free MIT licence, self-hosting, white-box access, customisation, with zero profit going to the Chinese companies.
Self-hosting doesn’t really make sense unless you’re running huge volume or using them for very simple tasks. And most enterprises are not really competent enough to finetune anything.
This is partly a temporary matter: the software ecosystem is underdeveloped for serious high-reliability scaled usage, despite the intense hobbyist interest. (They mostly want it running on a Macbook.)
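(For scale: the happy path really is short, something like the vLLM sketch below with whatever small open checkpoint you like; it's everything after this, quantization, multi-GPU parallelism, serving, auth, monitoring, evals, that is the fixed cost.)

```python
# Minimal single-box self-hosting via vLLM's offline Python API. The model
# name is just an example of a small open checkpoint; production setups add
# an OpenAI-compatible server, tensor parallelism, quantization, monitoring,
# and a lot of eval work on top of this.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")         # downloads and loads the weights
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Draft a polite refund email."], params)
print(outputs[0].outputs[0].text)
```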
Too slow for casuals
4. You can get much faster token speeds than the closed APIs.
In the browser, they’re actually slower than Western models. This makes sense; they are incredibly inference bound thanks to chip controls! This would be enough to tank them in the consumer market.
And over the API, the Western labs (Anthropic excepted) dominate them even in raw token rate (not counting efficiency):

Censorship and perceived censorship
5. less overrefusal (except on CCP talking points)
There’s a pretty big ick factor to the CCP, and the companies are indeed forced to comply on a range of talking points which offend the West. However, the hosted versions are much worse than the weights themselves. SpeechMap:

There are uncensored finetunes from reputable names. But then again, see (3): it doesn’t make sense for most enterprises to run and host finetunes themselves.
If you do a fair test on controversial but non-CCP talking points, there’s a wide spread of refusal rates in both Chinese and Western models.
Nebulous ideology?
6. on topics controversial in the West, less nannying
Lambert notes that the people he speaks privately to are really worried about less obvious stuff, the “indirect influence of Chinese values”.
There is something to this currently (but not a lot given the size of the English internet in the training corpus and the relative lack of soft-post-training skill or effort in Chinese labs):

But it’s reasonable to expect this to get worse as the CCP get more aware and the companies put more effort into personality and post-training.
Downloading is a long way from productising
8. they’re the most-downloaded open models
People panic about “the flip”, the point at which people started downloading Chinese models more. But this is obviously a terrible proxy for actually managing to use them. (And it’s actually pretty unclear how self-hosting adoption would really benefit China anyway except in prestige.)
For people with any need of real customisation or tiny models, or a scientific ML hobby, or an ideological interest in open-source, they clearly dominate.
TODO: Scrape relative mentions over time of Llama vs Qwen in arXiv experiments.
I concede that the secrecy in the West about using Chinese models makes this one weaker as an explanation.
Even more insecure?
9. they are mostly not used even by the cognoscenti
Above I mentioned reliability (how low-variance they are, how well they can chain things together). But that’s the easy bit; what about adversarial reliability?
The US evaluation had a bone to pick, but their directional result is probably right (“DeepSeek’s most secure model (R1-0528) responded to 94% of overtly malicious requests [using a jailbreak], compared with 8% of requests for U.S. reference models”).
Someone else talking their book notes that Kimi is “not yet fit for secure enterprise deployment”.
This is obviously a huge problem for any agentic uses, even if the benchmark and default reliability were all fine.
Low mindshare
9. they are mostly not used even by the cognoscenti
It’s hardly cynical to note that most people don’t pick their models by analysing relative performance. Instead it’s largely name recognition and trust, which makes sense for reasons of risk aversion and filtered evidence.
In principle, you can change models by changing one string in your codebase. But in practice if you’re sane you need to do incredibly expensive evals and so there’s stickiness.
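(The "one string" bit is roughly literal for OpenAI-compatible endpoints, as in the sketch below; the DeepSeek base URL and model name are from memory of their docs, and the expensive part, rerunning your evals, is precisely what the snippet doesn't show.)

```python
from openai import OpenAI

# client = OpenAI()   # the default: api.openai.com and whichever GPT you were on
client = OpenAI(base_url="https://api.deepseek.com", api_key="...")

resp = client.chat.completions.create(
    model="deepseek-chat",          # the one string that changed
    messages=[{"role": "user", "content": "Summarise this ticket: ..."}],
)
print(resp.choices[0].message.content)
```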
The DeepSeek moment helped a lot, but it receded in the second half of 2025 (from 22% of the weird market to 6%). And they all have extremely weak brands.
Also corporations really do settle for inferior products all the time for ass-covering reasons (IBMism). Mindshare translates directly into appeal to the risk-averse.
Corporate compliance is hard
9. they are mostly not used even by the cognoscenti.
Chinese APIs are hard for Western companies to use for legal and quasi-legal reasons.
For the API, DeepSeek sent user information to China Mobile, a state company, which violates all kinds of Western data privacy laws. Even if they’ve stopped, this risk is corporate poison. How can you ever be sure enough?
In a couple of years the EU AI Act will be (nominally) enforceable on the Chinese labs too.
On the quasi-legal side, corporate “vendor risk” programmes often flag Chinese suppliers. This is sometimes because they actually can’t guarantee there’s no forced labour involved.
So why not on-prem? Again, it’s a huge fixed cost and competence-bound and your risk team might still give you shit for it. Lambert:
“People vastly underestimate the number of companies that cannot use Qwen and DeepSeek open models because they come from China. This includes on-premise solutions built by people who know the fact that model weights alone cannot reveal anything to their creators.”
Political bias
9. they are mostly not used even by the cognoscenti.
There are a bunch of social reasons you might want to avoid Chinese models. You might be protectionist, or sucking up to the ascendant protectionists.
The protectionism of others is clearly enough for people to keep quiet about using them. It is often probably enough for them to just not take the risk in the first place.
I’d include here superstitions about the weights themselves being backdoored.
Vendor risk
9. they are mostly not used even by the cognoscenti
If you look ahead at future risks to your suppliers, the export-control situation obviously counts against the Chinese models; NVIDIA is not going to choke off OpenAI.
For API adoption, I also haven’t seen anything about Service-Level Agreements (contracts ensuring uptime) and support from any Chinese lab, but these are easy to make (even if compute crunch means that their uptime guarantees simply must be worse than American ones).
Also again corporate vendor-risk programmes often flag Chinese suppliers for data sovereignty, volatility of PRC law, and export control reasons.
DeepSeek openly use Anna’s Archive, whereas everyone else is quiet about it. But the American companies offer IP indemnity for users (cover if the models violate copyright in your app), which is nice insurance for a nervous corp with a target on its back. I can’t see anything about the Chinese companies doing this yet.
Excess quantization?
11. they’re aggressively quantizing at inference-time, 32 bits to 4
No, I think this one is wrong, or else only a tiny factor. gpt-oss was post-trained in MXFP4, which is only 4.25 bits.
And I have a strong hunch that many American models are also served in low fidelity, maybe FP4 (4 bits). Quantization just isn’t that bad.
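(To give a feel for why: here's toy group-wise int4 round-to-nearest on a random weight matrix. This is not MXFP4, which is a block floating-point format, and not any lab's actual scheme; real methods with calibration do considerably better than this naive baseline.)

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(1024, 1024)).astype(np.float32)

group = 128                                          # one scale per 128 weights
Wg = W.reshape(-1, group)
scale = np.abs(Wg).max(axis=1, keepdims=True) / 7    # int4 range is [-8, 7]
Wq = np.clip(np.round(Wg / scale), -8, 7)            # the 4-bit integers
W_deq = (Wq * scale).reshape(W.shape)                # dequantised weights

rel_err = np.linalg.norm(W - W_deq) / np.linalg.norm(W)
print(f"relative weight error ≈ {rel_err:.1%}")      # ~10% for naive RTN int4
```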
Galaxy-brain soft power??
12. state-sponsored Chinese hackers used closed American models for incredibly sensitive operations, giving the Americans a full whitebox log of the attack!
I can dimly imagine some kind of flexing dynamic in cyberwarfare, where you actually want to show off your attack capabilities, and so you use Claude on purpose. Yes: this idiotic move makes great sense if the apparent targets are red herrings, if Anthropic were the real target. You learn how long their OODA loop is, you learn (by retaliation or its absence) how tight they are with the NSA, you learn a little about how good their tech is.
You could also see it as retaliation for Amodei’s hawkish comments all year. Literally trading effectiveness for embarrassment.
But I don’t really know anything about this.
Overall
Low adoption is overdetermined:
- No, I don’t think they’re as good on new inputs or even that close.
- No, they’re not more efficient in time or cost (for non-industrial-scale use).
- Even if they were, the social and legal problems and biases would probably still suppress them in the medium run.
- But obviously if you want to heavily customise a model, or need something tiny, or want to do science, they are totally dominant.
- Ongoing compute constraints make me think the capabilities gap and adoption gap will persist.
Tags: AI, hypothesis-dump