Main
Reinforcement learning (RL)2 is a framework for discovering optimal action sequences in decision-making problems in which the best strategy is unknown and often non-trivial. In recent years, deep RL has transformed problem-solving across multiple fields, such as robotics3, drug discovery4 and game playing, where AlphaZero5 surpassed human experts in board games such as Go and chess. This approach was later extended to tackle problems in the field of mathematics, where AlphaZero was adapted to discover a more efficient and provably correct algorithm for matrix multiplication6. The resulting agent, AlphaTensor, was trained to play a TensorGame with the goal of finding efficient tensor decompositions.
Quantum computation is an emerging technology promising exponential speed-ups over classical computation for certain problems such as cryptography7 and quantum simulation8. This has potentially extensive implications, from securing communications9 to advancing drug discovery10. However, a major bottleneck in practical quantum computing is the complexity of the quantum circuits required to implement quantum algorithms. In particular, the T gate—a fundamental quantum logic gate—is one of the most resource-intensive to implement11,12. Despite this, T gates are essential for achieving universal quantum computation13. Therefore, reducing the T count of quantum circuits is crucial before implementing them on quantum hardware.
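For reference, the T gate mentioned above is the standard single-qubit phase gate (also called the π/8 gate); the following display is the textbook definition, included here only for orientation:

```latex
% The T gate: a diagonal single-qubit phase rotation.
% Two T gates compose to the Clifford S gate; T itself is non-Clifford,
% which is why it is the costly ingredient needed for universality.
T = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix},
\qquad T^{2} = S, \qquad T^{4} = Z.
```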
Several methods have been developed for optimizing quantum circuits14,15,16,17, including machine learning18 and RL techniques19,20,21. More recently, AlphaTensor-Quantum1 extended AlphaTensor’s capabilities into the field of quantum computing by formulating T-count optimization as a tensor decomposition problem. Unlike AlphaTensor, AlphaTensor-Quantum can incorporate domain-specific knowledge through gadgets, a procedure that reduces the T count by using ancillary qubits, to enhance optimization efficiency. Additionally, AlphaTensor-Quantum introduces symmetrized axial attention layers in its neural network, which take advantage of the signature tensor’s symmetry, thereby allowing it to scale to larger qubit numbers.
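To make the tensor-decomposition viewpoint concrete, the sketch below (a toy numpy illustration, not the code of ref. 1) builds a symmetric three-way tensor over GF(2) from a list of binary factors. In the AlphaTensor-Quantum formulation, a decomposition of the signature tensor into R symmetric rank-one terms corresponds, up to Clifford corrections, to an implementation with R T gates, so the game is to find decompositions with as few factors as possible.

```python
import numpy as np

def symmetric_rank_one(u):
    """Symmetric rank-one term u (x) u (x) u over GF(2) for a binary vector u."""
    u = np.asarray(u, dtype=int)
    return np.einsum("i,j,k->ijk", u, u, u) % 2

def reconstruct(factors):
    """Sum the rank-one terms over GF(2); len(factors) plays the role of the T count."""
    n = len(factors[0])
    tensor = np.zeros((n, n, n), dtype=int)
    for u in factors:
        tensor = (tensor + symmetric_rank_one(u)) % 2
    return tensor

# Toy decomposition on 3 "qubits": three factors, i.e. a T count of 3 for this tensor.
factors = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]
signature = reconstruct(factors)
print("signature tensor:\n", signature)
print("T count of this decomposition:", len(factors))
```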
On a benchmark of quantum arithmetic circuits, AlphaTensor-Quantum has been shown to achieve a lower T count than previous methods, particularly when gadgets are incorporated. However, its training is limited to specific quantum circuits grouped by application. This means that the model must be retrained for each new type of application, resulting in increased computational cost. In this paper, we first evaluate the reproducibility of AlphaTensor-Quantum’s results. We then extend its application to a more general quantum circuit optimization problem: training a single agent capable of optimizing random quantum circuits with varying numbers of qubits and gates. This approach enables faster optimization without the need for retraining on each new circuit. The general agent achieves a lower T count on a large fraction of circuits compared with the baseline and the agents trained on fixed qubit sizes.
Reproducibility
The original publication on AlphaTensor-Quantum relies on AlphaTensor, which, in turn, builds on AlphaZero. Unfortunately, neither the code for AlphaZero nor that for AlphaTensor has been made publicly available at this time. In addition, the in-house computing resources and infrastructure used at Google DeepMind for this project are beyond the scale and sophistication of what is available in an academic context.
Nevertheless, the authors have made (slightly revised) parts of their code available in a GitHub repository22, which includes implementations of the TensorGame and their neural network architecture, integrated into the publicly available Monte Carlo tree search (MCTS) framework MCTX23 (as a replacement for AlphaZero). Again, we emphasize that the results presented in ref. 1 were not obtained using this specific MCTS framework, and the implementation details differ, which may explain some of the discrepancies observed in our numerical experiments.
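As an illustration of how an environment such as the TensorGame can be driven by MCTX, the following is a minimal, self-contained sketch; the batch sizes, toy dynamics and reward are placeholders rather than the actual interface of the repository in ref. 22.

```python
import jax
import jax.numpy as jnp
import mctx

BATCH, NUM_ACTIONS = 4, 16  # toy sizes; the real factor action space is far larger

def recurrent_fn(params, rng_key, action, embedding):
    # Placeholder environment step: a real TensorGame step would apply the chosen
    # factor to the residual signature tensor and penalize each played T gate.
    del params, rng_key, action
    output = mctx.RecurrentFnOutput(
        reward=jnp.full((BATCH,), -1.0),               # e.g. -1 per factor played
        discount=jnp.ones((BATCH,)),
        prior_logits=jnp.zeros((BATCH, NUM_ACTIONS)),  # policy-head output
        value=jnp.zeros((BATCH,)))                     # value-head output
    return output, embedding

root = mctx.RootFnOutput(
    prior_logits=jnp.zeros((BATCH, NUM_ACTIONS)),
    value=jnp.zeros((BATCH,)),
    embedding=jnp.zeros((BATCH, 8)))                   # toy state embedding

policy_output = mctx.muzero_policy(
    params=None,
    rng_key=jax.random.PRNGKey(0),
    root=root,
    recurrent_fn=recurrent_fn,
    num_simulations=80)                                # 80 simulations, matching the repository defaults

print(policy_output.action)          # chosen action per batch element
print(policy_output.action_weights)  # visit-count policy, usable as a training target
```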
In addition to the code, the signature tensors for the circuits Mod 5₄ (five qubits), NC Toff3 (seven qubits) and Barenco Toff3 (eight qubits) from table 2 of ref. 1 are provided for testing in the GitHub repository. In the following, we aim to reproduce the results for these three circuits. Reproducing other findings proved challenging, as the authors of ref. 1 do not provide the code to generate signature tensors from a quantum circuit and do not specify the exact hyperparameters used for each experiment. The hyperparameters are discussed in the Methods.
We present the optimized T count and training time in Fig. 1a. We observe that the T count for NC Toff3 and Barenco Toff3 with gadgets is higher than that originally reported. By doubling the batch size and the number of MCTS simulations, the T count is reduced to 8 for NC Toff3 and 10 for Barenco Toff3, suggesting that further hyperparameter tuning could probably reproduce the original findings. It is also worth noting that in the original paper, AlphaTensor-Quantum is trained on a family of circuit applications. For example, for the Barenco Toffoli application, AlphaTensor-Quantum is trained on the Barenco Toff3, Barenco Toff4, Barenco Toff5 and Barenco Toff10 circuits. By contrast, in our case, the agent for Barenco Toff3 is trained only on the Barenco Toff3 circuit, because the other circuits and their tensor representations are not available. Figure 1b shows the evolution of the T count throughout training. The T count converges after approximately 3,000 training steps, which takes between 100 and 1,000 s depending on the number of qubits. Additionally, we train a single agent to simultaneously simplify all the provided circuits, achieving the same performance (Supplementary Fig. 1).
Fig. 1: Reproducing AlphaTensor-Quantum.
a, T count reported in the original paper along with the results from experiments using the provided code. The training time to reach optimal performance on an NVIDIA A100 GPU is given in parentheses. The red numbers indicate where our experimental results do not match the originally reported values (see the main text). b, Evolution of the T count during training. The light solid lines represent the reported results.
Since the provided example tensors correspond to small numbers of qubits, we also examine the expected runtime of AlphaTensor-Quantum for larger circuits. In Fig. 2, we illustrate how the training time of AlphaTensor-Quantum scales with the number of qubits on different GPU devices using the provided hyperparameters. In this case, the task is to optimize a random circuit in which the number of gates is ten times the number of qubits and half of the gates are T gates. We observe that the training time for AlphaTensor-Quantum increases exponentially. This is probably due to the exponentially increasing number of possible actions for AlphaTensor-Quantum. In the original paper1, the number of sampled actions is therefore capped at a fixed maximum, which is not implemented in the provided code. The baseline method using PyZX14,24 and TODD15 is several orders of magnitude faster than AlphaTensor-Quantum, which requires training for tens of thousands to several millions of steps per optimized circuit. Consequently, the computational overhead of running AlphaTensor-Quantum is probably justified only for important quantum circuit primitives that serve as building blocks for numerous applications.
Fig. 2: Average time for one step of AlphaTensor-Quantum training with gadgets on different GPU devices.
The Quadro RTX 6000 and Tesla V100 give an out-of-memory error for 15 qubits. We compare with the baseline PyZX14 and TODD15, which directly outputs the optimized circuit in the indicated time (for example, around 0.06 s for 15 qubits). By contrast, AlphaTensor-Quantum requires a large number of training steps (for example, between tens of thousands and several millions of steps in the original paper). Error bars, corresponding to one standard deviation across ten different circuits, are smaller than the marker size.
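For comparison, the baseline can be run with publicly available tools alone. The sketch below uses PyZX, whose optimizer includes a TODD-style phase-polynomial pass; it is only an approximation of the pipeline of ref. 1, and the circuit here is a randomly generated stand-in rather than one of the benchmark circuits.

```python
import time
import pyzx as zx
from pyzx.generate import cliffordT
from pyzx.optimize import full_optimize

# Random Clifford+T circuit as a stand-in for a 15-qubit benchmark circuit.
circuit = cliffordT(15, 150)
print("initial T count:", zx.tcount(circuit))

start = time.time()
graph = circuit.to_graph()
zx.full_reduce(graph)                                   # ZX-calculus simplification (ref. 14)
reduced = zx.extract_circuit(graph).to_basic_gates()
optimized = full_optimize(reduced)                      # includes a TODD-style pass (ref. 15)
print("optimized T count:", zx.tcount(optimized))
print(f"baseline runtime: {time.time() - start:.2f} s")
```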
Generalizability
To improve the optimization efficiency of AlphaTensor-Quantum by eliminating the need for retraining on previously unseen circuits, we train it to simplify random quantum circuits spanning multiple qubit sizes. We refer to this agent as the general agent. We then compare its performance with agents trained separately for each qubit size. We refer to these agents as single agents. Note that the single agents are already more general than the AlphaTensor-Quantum agents used in ref. 1, which are trained on specific quantum circuit applications.
In our experiments, we use quantum circuits with five to eight qubits. Therefore, we train one general agent across all these qubit numbers and four separate single agents, each for a specific qubit number. AlphaTensor-Quantum is originally trained using a combination of supervised learning on synthetic demonstrations and RL on the target circuits. The dataset of synthetic demonstrations consists of randomly generated tensor/factorization pairs for the neural network to imitate. To evaluate the contribution of these components, we train our agents either only with synthetic demonstration data (Demo), only with RL data (RL) or with both (Demo + RL). The Methods provides details of the dataset generation, the training and evaluation procedure and the baseline. In the following, we first focus on the AlphaTensor-Quantum version that includes gadgetization (Supplementary Fig. 2 shows the results without gadgetization).
We first evaluate the average T count in Fig. 3a for single and general agents trained with the three training types. The general agent consistently outperforms the single agents across all training types. Additionally, the Demo + RL agent achieves the lowest average T count, falling below the baseline, indicating that the mix of supervised demonstration and RL training is useful. Figure 3b presents the performance of the agents across different qubit sizes. As expected, the average final T count grows with the qubit number, since the sampled initial circuits contain more T gates. However, the optimization becomes less effective relative to the baseline as the qubit count increases. In particular, the agents outperform the baseline for N = 5 and N = 6. However, the performance declines at N = 7, except when using Demo + RL training, and further deteriorates at N = 8, where all agents perform worse than the baseline on average, with only about 23% improvement. The performance could probably be improved by hyperparameter tuning and longer training.
Fig. 3: Evaluation of single (random circuits, fixed qubit number) and general (random circuits, varying qubit number) AlphaTensor-Quantum agents with gadgetization and three training types (Demo, RL and Demo + RL).
a, Average T count (lower is better) of the optimized quantum circuits in the evaluation set. The solid black line shows the average T count of the baseline method PyZX14 and TODD15. b, Average T count for each number of qubits. c, Average improvement percentage (higher is better), which shows the percentage of circuits that have a strictly lower T count when optimized with the agent compared with the baseline method. d, Average improvement percentage for each number of qubits. The error bars for a and c show the 95% confidence intervals over different numbers of qubits and for b, the 95% confidence intervals over 1,000 evaluation circuits.
Although Fig. 3a,b demonstrates the average T-count reduction, this alone does not fully capture how consistently the agents outperform the baseline. To address this issue, we introduce the improvement percentage metric, which measures the fraction of circuits in the evaluation set for which the agent achieves a strictly lower T count than the baseline. It is important to note that the input to AlphaTensor-Quantum is already a circuit optimized with PyZX, following the compilation method described in ref. 1. Figure 3c shows that all agents, in general, achieve an improvement percentage above 45%, with Demo + RL again outperforming the other training types. The general agent surpasses the single agents overall, except with the Demo + RL training type. Figure 3d further confirms the trend discussed above: the improvement percentage declines as the circuit size increases. A significant improvement is observed for N = 5 and N = 6, whereas it diminishes for N = 7 and N = 8. A similar trend is observed for AlphaTensor-Quantum without gadgetization (Supplementary Fig. 2).
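The improvement percentage is simple to compute from the per-circuit T counts; a minimal sketch (with illustrative variable names) is:

```python
import numpy as np

def improvement_percentage(agent_t_counts, baseline_t_counts):
    """Percentage of circuits for which the agent's T count is strictly lower than the baseline's."""
    agent = np.asarray(agent_t_counts)
    baseline = np.asarray(baseline_t_counts)
    return 100.0 * np.mean(agent < baseline)

# Toy example: the agent improves on two out of four circuits -> 50%.
print(improvement_percentage([7, 10, 9, 12], [8, 10, 11, 12]))
```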
A key advantage of our agents is their fast execution time during evaluation. Unlike the original AlphaTensor-Quantum, which requires retraining for each circuit, a process that can take from a few minutes to several hours, our pretrained agents simplify circuits in a single rollout, averaging around 20 s (Fig. 4a).
Fig. 4: Evaluation of training time and evaluation time.
a,b, Average time required on a single NVIDIA A100 GPU to simplify a single circuit during evaluation (a) and to train the agents for 100,000 steps (b).
Finally, in Table 1, we evaluate our general agents on the three target circuits in Fig. 1, which the agents never encountered during training. The Demo agent finds the optimal T count both with and without gadgets, whereas the other two training types perform slightly worse.
Conclusion and discussion
In this work, we first assess the reproducibility of AlphaTensor-Quantum1. We find that the reproduction of small-scale experiments is feasible, although it potentially requires some hyperparameter tuning. We then study the generalizability of AlphaTensor-Quantum for general quantum circuit optimization across different qubit sizes. Our approach eliminates the need for retraining on previously unseen circuits, accelerating the optimization process by orders of magnitude compared with the original AlphaTensor-Quantum approach trained on specific quantum circuit applications. From an application perspective, these agents can be integrated with traditional T-count optimizers to achieve further reductions in a large fraction of circuits.
Our experiments demonstrate that a general agent trained on circuits with varying qubit sizes outperforms single agents specialized for a fixed qubit size, highlighting its ability to generalize effectively to unseen circuits when trained on diverse data. The best results are obtained by combining supervised learning on demonstration data with RL. However, even agents trained solely on potentially suboptimal supervised demonstrations prove to be effective.
Note that the results presented in this paper are obtained without hyperparameter tuning and require several orders of magnitude less computation than the original AlphaTensor-Quantum training (which used, for example, 10 times more training steps, 10 times more simulated trajectories per MCTS step, a 16 times larger batch size and more than 3,600 tensor processing units). This suggests a promising path for scaling to higher qubit numbers by increasing computational resources.
The code from ref. 22 is well documented and easy to use. Although this code differs from the code used to produce the results in ref. 1, the implementations of the symmetrized axial attention layers and the TensorGame environment provide valuable building blocks for future research. Integration with the MCTX MCTS library enables the rapid reproduction of some of the results from the original work. The additional GitHub repository25 provides functionality to compute the signature tensor of a given quantum circuit and implements a post-processing pipeline to reconstruct the optimized circuits, which is crucial for their practical implementation. However, providing the exact hyperparameters and the exact code used in the original paper would enhance reproducibility and assist in choosing optimal hyperparameters for future work.
Looking ahead, AlphaTensor-Quantum has the potential to serve as a powerful framework for minimizing the T count of quantum circuit primitives when the computational cost is justified by their importance. Additionally, general agents such as those trained in this paper offer a promising middle ground between traditional T-count optimizers and the original AlphaTensor-Quantum approach, balancing computational efficiency and performance.
Methods
Hyperparameters for reproducibility experiments
In our experiments, we use the default hyperparameters given in the GitHub repository. For example, the batch size is 2,048 and the number of MCTS simulations is 800 in the original paper, whereas the GitHub implementation uses 128 for the batch size and 80 for the number of MCTS simulations. We have tried to use the hyperparameters from the original paper, but this results in an out-of-memory error. We run the experiments with the provided hyperparameters on a single NVIDIA A100 GPU with 40 GB of memory.
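For orientation, the two settings compared above are summarized below; the key names are illustrative only and do not correspond to the actual configuration fields of the repository.

```python
# Hyperparameters reported in ref. 1 versus the defaults shipped with the code of ref. 22.
# Key names are illustrative, not the repository's configuration fields.
original_paper_settings = {"batch_size": 2048, "num_mcts_simulations": 800}
github_default_settings = {"batch_size": 128, "num_mcts_simulations": 80}

# Our runs use the GitHub defaults; the original settings exceed the 40 GB of
# memory available on a single NVIDIA A100 GPU.
```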
Dataset generation and training process for generalizability experiments
To create the training data, we generate random CNOT + T circuits with a random number of qubits N, selecting the total number of gates uniformly between 5N and 15N, with T gates comprising between 20% and 60% of the total gate count. We then follow the quantum circuit compilation approach outlined in ref. 1, which applies an optimization algorithm in PyZX14 to reduce the initial T-gate count and extracts the signature tensor as input to AlphaTensor-Quantum. Additional details about compiling a general circuit into a circuit containing only CNOT + T gates are provided in supplementary section C.1 of ref. 1.
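A sketch of this sampling procedure using PyZX's circuit builder is given below; the gate-placement details are illustrative, and the subsequent extraction of the signature tensor (ref. 25) is not shown.

```python
import random
import pyzx as zx

def random_cnot_t_circuit(rng):
    """Sample a random CNOT+T circuit with the gate budget described in the text."""
    n = rng.randint(5, 8)                                # number of qubits N
    num_gates = rng.randint(5 * n, 15 * n)               # total gates in [5N, 15N]
    num_t = int(num_gates * rng.uniform(0.2, 0.6))       # 20-60% T gates
    gate_is_t = [True] * num_t + [False] * (num_gates - num_t)
    rng.shuffle(gate_is_t)
    circuit = zx.Circuit(n)
    for is_t in gate_is_t:
        if is_t:
            circuit.add_gate("T", rng.randrange(n))
        else:
            control, target = rng.sample(range(n), 2)
            circuit.add_gate("CNOT", control, target)
    return circuit

rng = random.Random(0)
circuit = random_cnot_t_circuit(rng)
graph = circuit.to_graph()
zx.full_reduce(graph)                                    # initial PyZX reduction, as in the compilation step of ref. 1
reduced = zx.extract_circuit(graph).to_basic_gates()
print("qubits:", circuit.qubits, "| T count after reduction:", zx.tcount(reduced))
```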
For RL and Demo + RL, we use 100,000 random circuits for RL. We first focus on the AlphaTensor-Quantum version that includes gadgetization (Supplementary Fig. 2 shows the results without gadgetization). We train AlphaTensor-Quantum with the default hyperparameters for 100,000 steps. For each considered qubit number, we generate 1,000 random quantum circuits as the evaluation set. During the evaluation, we always choose the most probable action predicted by the MCTS policy. As a baseline, we optimize these circuits with PyZX14 and then apply TODD15, as done in ref. 1.
Data availability
The data for reproducing this work are available via Zenodo at https://doi.org/10.5281/zenodo.14887945 (ref. 26).
Code availability
The code for reproducing this work is available via Zenodo at https://doi.org/10.5281/zenodo.17578393 (ref. 27).
References
Ruiz, F. J. R. et al. Quantum circuit optimization with AlphaTensor. Nat. Mach. Intell. 7, 374 (2025).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction 2nd edn (The MIT Press, 2018).
Tang, C. et al. Deep reinforcement learning for robotics: a survey of real-world successes. In Proc. Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence 3197 (AAAI Press, 2025).
Zhou, Z., Kearnes, S., Li, L., Zare, R. N. & Riley, P. Optimization of molecules via deep reinforcement learning. Sci. Rep. 9, 10752 (2019).
Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354 (2017).
Fawzi, A. et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610, 47 (2022).
Shor, P. W. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J. Comput. 26, 1484 (1997).
Daley, A. J. et al. Practical quantum advantage in quantum simulation. Nature 607, 667 (2022).
Kimble, H. J. The quantum internet. Nature 453, 1023 (2008).
Blunt, N. S. et al. Perspective on the current state-of-the-art of quantum computing for drug discovery applications. J. Chem. Theory Comput. 18, 7001 (2022).
Campbell, E. T., Terhal, B. M. & Vuillot, C. Roads towards fault-tolerant universal quantum computation. Nature 549, 172 (2017).
Beverland, M. E. et al. Assessing requirements to scale to practical quantum advantage. Preprint at https://arxiv.org/abs/2211.07629 (2022).
Nielsen, M. A. & Chuang, I. L. Quantum Computation and Quantum Information (Cambridge Univ. Press, 2010).
Kissinger, A. & van de Wetering, J. Reducing the number of non-Clifford gates in quantum circuits. Phys. Rev. A 102, 022406 (2020).
Heyfron, L. & Campbell, E. T. An efficient quantum compiler that reduces T count. Quantum Sci. Technol. 4, 015004 (2018).
Amy, M., Maslov, D. & Mosca, M. Polynomial-time T-depth optimization of Clifford+T circuits via matroid partitioning. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 33, 1476 (2014).
Abdessaied, N. & Drechsler, R. Reversible and Quantum Circuits: Optimization and Complexity Analysis (Springer, 2018).
Daimon, S. et al. Quantum circuit distillation and compression. Jpn. J. Appl. Phys. 63, 032003 (2024).
Fösel, T., Niu, M. Y., Marquardt, F. & Li, L. Quantum circuit optimization with deep reinforcement learning. Preprint at https://arxiv.org/abs/2103.07585 (2021).
Li, Z. et al. Quarl: a learning-based quantum circuit optimizer. Proc. ACM Program. Lang. 8, 114 (2024).
Riu, J., Nogué, J., Vilaplana, G., Garcia-Saez, A. & Estarellas, M. P. Reinforcement learning based quantum circuit optimization via ZX-calculus. Quantum 9, 1758 (2025).
DeepMind. alphatensor_quantum. GitHub https://github.com/google-deepmind/alphatensor_quantum (2025).
Bradbury, J. et al. JAX: composable transformations of Python+NumPy programs v.0.3.13 (2018); http://github.com/jax-ml/jax
Kissinger, A. & Van De Wetering, J. PyZX: large scale automated diagrammatic reasoning. Electron. Proc. Theor. Comput. Sci. 318, 229 (2020).
Laakkonen, T. circuit-to-tensor. GitHub https://github.com/tlaakkonen/circuit-to-tensor (2025).
Zen, R., Naegele, M. & Marquardt, F. Data for optimizing T-count in general quantum circuits with AlphaTensor-Quantum. Zenodo https://doi.org/10.5281/zenodo.14887945 (2025).
Zen, R., Naegele, M. & Marquardt, F. Code for reusability report: optimizing T-count in general quantum circuits with AlphaTensor-Quantum. Zenodo https://doi.org/10.5281/zenodo.17578393 (2025).
Acknowledgements
We thank J. Olle for fruitful discussions. This research is part of the Munich Quantum Valley, which is supported by the Bavarian state government with funds from the Hightech Agenda Bayern Plus.
Funding
Open access funding provided by Max Planck Society.
Author information
Authors and Affiliations
Max Planck Institute for the Science of Light, Erlangen, Germany
Remmy Zen, Maximilian Nägele & Florian Marquardt
Department of Physics, Friedrich-Alexander Universität Erlangen-Nürnberg, Erlangen, Germany
Maximilian Nägele & Florian Marquardt
Authors
- Remmy Zen
- Maximilian Nägele
- Florian Marquardt
Contributions
R.Z., M.N. and F.M. conceptualized and designed the study. R.Z. and M.N. coded and performed the experiments. R.Z., M.N. and F.M. interpreted the data. All authors wrote and revised the manuscript.
Corresponding author
Correspondence to Remmy Zen.
Ethics declarations
Competing interests
All authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Elica Kyoseva and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zen, R., Nägele, M. & Marquardt, F. Reusability report: Optimizing T count in general quantum circuits with AlphaTensor-Quantum. Nat Mach Intell (2025). https://doi.org/10.1038/s42256-025-01166-9
Received: 07 March 2025
Accepted: 03 December 2025
Published: 31 December 2025
Version of record: 31 December 2025
DOI: https://doi.org/10.1038/s42256-025-01166-9