- 10 Dec, 2025 *
China’s efforts around alternative model architectures will undermine the scaling law that US frontier labs are pursuing.
MiniMax’s M1, Moonshot’s Kimi-Linear, and DeepSeek’s V3.2 show China’s focus on linear and sparse attention models. These models reduce the compute and memory required to train and run large models, in contrast to traditional transformer architectures, whose softmax attention grows quadratically in cost with sequence length.
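To make the compute gap concrete, here is a back-of-the-envelope sketch. It assumes a single attention head with a hypothetical head dimension of 128, counts only the two large matrix multiplications in each scheme, and ignores the softmax itself, projections, and MLP layers. It models kernel-style linear attention only; sparse attention saves compute in a different way and is not covered here. The function names are made up for illustration.

```python
def softmax_attention_flops(seq_len: int, head_dim: int) -> int:
    # Q @ K^T, then weights @ V: two (n x n x d) matmuls,
    # counted at 2 FLOPs per multiply-add.
    return 4 * seq_len * seq_len * head_dim


def linear_attention_flops(seq_len: int, head_dim: int) -> int:
    # K^T @ V builds a (d x d) state, then Q @ state:
    # two (n x d x d) matmuls, counted the same way.
    return 4 * seq_len * head_dim * head_dim


HEAD_DIM = 128  # assumed for illustration, not taken from any specific model

for n in (4_096, 131_072, 1_048_576):
    ratio = softmax_attention_flops(n, HEAD_DIM) / linear_attention_flops(n, HEAD_DIM)
    print(f"{n:>9} tokens: softmax attention needs ~{ratio:,.0f}x the FLOPs")
```

Under these assumptions the ratio is simply sequence length divided by head dimension, so the savings grow as contexts get longer.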
The motivation is not hard to understand once you assess China’s geopolitical situation: these labs are addressing the problem of scale.
Duality of Scale
China has both a scale advantage and a disadvantage.
With more than 1.4 billion inhabitants, China can leverage its immense population to help train its models.
This is one reason why data annotation is done at the state level and assigned to third-tier cities. A tightly controlled data collection process means that the quality of data will be vastly better than what is collected by Scale AI or Mercor. DeepMind and OpenAI are paying physicists and mathematicians several thousand dollars for a single eval question. China can get this quality of data for much less.
However, China’s scale is also a disadvantage today.
China needs to put these models in the hands of users to benefit from the productivity and efficiency gains they provide. Serving these models requires compute, and compute requires electricity.
With its massive population and ongoing industrialization, China is the world’s number one consumer of energy, using almost twice as much as the US. Yet China is also the #1 importer of energy, purchasing the equivalent of 44 million TJ per year. In contrast, the US is the #1 exporter of energy, selling 30 million TJ to the world.
As a nation where energy security has become a key issue, with the disruption of supply from Russia and the Middle East, China is inherently disadvantaged here. Besides the limitations on chips, this is precisely the reason why efficient model architecture is a focus for the nation.
Scaling Laws
Subquadratic architectures also let China keep scaling once traditional model architectures hit a scaling wall. Linear attention models allow larger contexts, which means bigger chunks of tokens can be used in each training run. No one knows when we might reach the limits of scaling, but China’s efforts mean that its AI Tigers will have much more room to scale their models given the same amount of compute. Putting H200s in the hands of China, as the Trump administration is considering now, would extend that advantage further.
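The memory side of the larger-context point can be sketched the same way. The example below compares the key-value cache a softmax-attention model keeps while serving against the fixed-size state of a linear attention model. Every number in it (32 layers, 8 key-value heads, head dimension 128, 2-byte values) is a hypothetical configuration chosen for illustration, not the spec of any model named here.

```python
LAYERS = 32          # assumed
KV_HEADS = 8         # assumed
HEAD_DIM = 128       # assumed
BYTES_PER_VALUE = 2  # assumed 16-bit storage


def kv_cache_bytes(context_len: int) -> int:
    # Softmax attention stores one key and one value vector
    # per token, per head, per layer, so memory grows with context.
    return 2 * context_len * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def linear_state_bytes() -> int:
    # Linear attention keeps a (d x d) state per head, per layer,
    # regardless of how many tokens have been processed.
    return LAYERS * KV_HEADS * HEAD_DIM * HEAD_DIM * BYTES_PER_VALUE


for n in (8_192, 131_072, 1_048_576):
    print(f"{n:>9} tokens: KV cache {kv_cache_bytes(n) / 2**30:7.1f} GiB, "
          f"linear state {linear_state_bytes() / 2**20:.0f} MiB")
```

With this made-up configuration the cache grows by 128 KiB per token while the linear state stays at 8 MiB, which is the sense in which these architectures leave more room for long contexts on the same hardware.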
xAI is rumored to be working on an alternative model architecture, and the recent release of Kimi-Linear has been a positive signal for the team to continue pursuing this. It may not be an overreach to connect these efforts to what Elon has learned from his time competing with China at Tesla.
I’ll likely write another piece on this, but the TL;DR here is that China has become the dominant player in EVs and the automotive industry in general, giving rise to the largest EV maker in the world, BYD, and becoming the #1 exporter of cars after surpassing Japan last year.
If China’s semiconductor industry can’t outcompete the US in pure computing power, it can change the playing field by modifying the model architecture running on its homegrown chips.
Combine this with the Belt and Road Initiative, and we can fully expect China to co-sell its chips and models to nations with similar chokepoints around energy.