Introduction
Modern deep learning owes much of its success to three factors. The first is increasing resource availability, meaning data and compute power. The second is families of powerful and expressive models, such as multilayer perceptrons1, convolutional and recurrent neural networks2, and multi-head attention mechanisms3,4. With these model primitives, one may build composites such as transformer, diffusion and mixer models, which ultimately lead to specific state-of-the-art models such as AlphaZero5, GPT-36 or DALL-E7. The third is efficient training: particularly for these vast models containing billions of parameters, and the enormous quantity of data required, an efficient training algorithm is essential. A cornerstone of many such protocols is the backpropagation algorithm8, and its variants, which enable the computation of gradients throughout the entirety of the network with minimal overhead beyond the network evaluation itself.
It is reasonable to assume a similar trajectory for quantum machine learning. While rapid progress in quantum error correction9,10,11 is increasing the number and quality of effective qubits (compute power), quantum processing units will likely remain significantly depth-limited for the foreseeable future. On the model side, quantum neural networks (QNNs) are typically (but not exclusively) constructed from parametrised quantum circuits (PQCs)12,13,14,15. However, in many cases these lack task-specific features and do not generally admit training that scales in line with classical backpropagation16. It is therefore essential to develop quantum models which can be associated with specific interpretations (for example, the convolutional operation on images), and which are efficient to train. The popular parameter-shift rule for QNNs17,18,19,20,21,22 extracts analytic gradients (i.e., not relying on approximate finite differences), but even the simplest instance of the rule requires ({\mathcal{O}}(N)) gradient circuits to be evaluated for N parameters. To put this in perspective, it was estimated in ref. 16 that, given only a single day of computation and reasonable quantum clock speeds, the parameter-shift rule can compute gradients for n ~ 100 qubit trainable circuits with only ~9000 parameters. This also does not account for various other problem-specific scalings, such as the data which must be iterated over in each training iteration, or other obstacles such as barren plateaus23. Scaling current quantum training approaches towards the size of the billion- or trillion-parameter deep neural networks which have been so successful in the modern era will clearly not be feasible with such methods. Additionally, since we are arguably at the boundary between the NISQ and ISQ eras, models should use circuits which are as compact yet expressive as possible. On the other hand, they should also be complex enough to avoid classical simulation, surrogation or dequantisation24,25,26, but not so complex as to admit barren plateaus27. Clearly, satisfying all of these constraints is a challenging task.
To partially address some of these challenges, in this work we introduce a framework of models dubbed density QNNs. Our primary aim is to showcase how these models add another dimension to the landscape of quantum learning models, giving practitioners a new toolkit to experiment with when tackling the above questions. Throughout the text, we demonstrate how density QNNs may be constructed which are more trainable, or more expressive, than their pure-state counterparts. Our results are laid out as follows. First, we introduce the general form of the density framework, before discussing comparisons and relationships to other QML model families and frameworks in the literature. Then, we propose two methods of preparing such models on a quantum computer. Next, we prove our primary theoretical results: the first relates to the gradient query complexity of such models, and the second concerns the connection to non-unitary quantum machine learning via the Mixing lemma from randomised compiling. We then discuss two proposed connections between density networks and mechanisms in the classical machine learning literature. First, it has been suggested in the literature that density networks as we propose them may be a quantum-native analogue of the dropout mechanism. We propose separate training and inference phases for density QNNs to bring this comparison closer to reality, but find it still lacking as a valid comparison. Secondly, we demonstrate a strong realisation of density networks within the mixture of experts (MoE) framework from classical machine learning: density QNNs can be viewed as a 'quantum mixture of experts'. Finally, we provide numerical results to demonstrate the flexibility of the model to improve performance, trainability, or both. We test several QNN architectures on synthetic translation-invariant data, and Hamming weight preserving architectures on the MNIST image classification task. We also show numerically how, in some capacity, density QNNs may prevent overfitting, using data reuploading as an example, despite not functioning as a true dropout mechanism.
Results
Density quantum neural networks
To begin, we explicitly define the framework (see Supplementary Material A for a discussion) of density quantum neural networks (density QNNs) as follows:
$$\rho ({\boldsymbol{\theta }},{\boldsymbol{\alpha }},{\boldsymbol{x}}):= \mathop{\sum }\limits_{k=1}^{K}{\alpha }_{k}{U}_{k}({{\boldsymbol{\theta }}}_{k})\rho ({\boldsymbol{x}}){U}_{k}^{\dagger }({{\boldsymbol{\theta }}}_{k})$$
(1)
Here, ρ(x) is a data-encoded initial state, which is usually assumed to be prepared via a 'data-loader' unitary, (\rho ({\boldsymbol{x}})=\left\vert {\boldsymbol{x}}\right\rangle \left\langle {\boldsymbol{x}}\right\vert ,\left\vert {\boldsymbol{x}}\right\rangle := V({\boldsymbol{x}}){\left\vert 0\right\rangle }^{\otimes n}); ({{{U}_{k}}}_{k = 1}^{K}) is a collection of sub-unitaries; and ({{{\alpha }_{k}}}_{k = 1}^{K}) is a distribution, which may depend on x.
For now, we treat the density state above as an abstraction and later in the text we will discuss methods to prepare the state practically and actually use the model. The preparation method will have relevance for the different applications and connections to other paradigms. Once we have chosen a state preparation method for equation (1), we must choose particular specifications for the sub-unitaries. In some cases, we may recast efficiently trainable models/frameworks within the density formalism to increase their expressibility. In others, we use the framework to improve the overall inference speed of models. In this work, we assume that the sub-unitary circuit structures, once chosen, are fixed, and the only trainability arises from the parameters, ({{{{\boldsymbol{\theta }}}_{k}}}_{k = 1}^{K}) therein, as well as the coefficients, ({{{\alpha }_{k}}}_{k = 1}^{K}). In other words, we do not incorporate variable structure circuits learned for example via quantum architecture search.
As a generalisation, one may consider adding a data dependence into the sub-unitary coefficients, α → α(x), while retaining the distributional requirement for all x, (\sum _{k}{\alpha }_{k}({\boldsymbol{x}})=1). This gives us the more general family of density QNN states:
$${\rho }_{{\mathsf{D}}}({\boldsymbol{\theta }},{\boldsymbol{\alpha }},{\boldsymbol{x}})=\mathop{\sum }\limits_{k=1}^{K}{\alpha }_{k}({\boldsymbol{x}}){U}_{k}({{\boldsymbol{\theta }}}_{k})\rho ({\boldsymbol{x}}){U}_{k}^{\dagger }({{\boldsymbol{\theta }}}_{k})$$
(2)
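To make the object in equation (2) concrete, the following minimal numpy sketch builds the density state as a convex mixture of unitarily rotated copies of ρ(x). The random sub-unitaries, the Dirichlet-sampled α and the trivial data state are placeholders of our own choosing, not the circuits used later in the paper.

```python
import numpy as np

def random_unitary(dim, rng):
    # Haar-ish random unitary via QR decomposition of a complex Gaussian matrix
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

rng = np.random.default_rng(0)
n, K = 2, 3                      # number of qubits and of sub-unitaries
dim = 2 ** n

# Data-encoded pure state rho(x) = |x><x| (a placeholder basis state here)
psi_x = np.zeros(dim, dtype=complex); psi_x[0] = 1.0
rho_x = np.outer(psi_x, psi_x.conj())

# Placeholder sub-unitaries U_k(theta_k) and a distribution alpha over them
U = [random_unitary(dim, rng) for _ in range(K)]
alpha = rng.dirichlet(np.ones(K))

# Density QNN state of Eq. (2): a convex mixture of rotated copies of rho(x)
rho_D = sum(a * Uk @ rho_x @ Uk.conj().T for a, Uk in zip(alpha, U))

assert np.isclose(np.trace(rho_D).real, 1.0)   # still a valid (generally mixed) state
```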
In the QML world, overly dense or expressive single-unitary models are known to have trainability problems via barren plateaus28. Density QNNs and related frameworks may be a useful direction for retaining highly parameterised models, built instead from a combination of smaller, trainable models. We will demonstrate this through several examples in the remainder of the text. Before doing so, in the next section we appropriately cast density QNNs within the current spectrum of quantum machine learning models.
Connection to other QML frameworks
Before proceeding, we first discuss the connection to other popular QML frameworks. For supervised learning purposes, each term in the density state, ({U}_{k}({{\boldsymbol{\theta }}}_{k})\rho ({\boldsymbol{x}}){U}_{k}^{\dagger }({{\boldsymbol{\theta }}}_{k})), is expressive enough by itself to capture most basic models in the literature. This is due to the common model definition, (f({\boldsymbol{\theta }},{\boldsymbol{x}}):= {\rm{Tr}}({\mathcal{O}}U({\boldsymbol{\theta }})\rho ({\boldsymbol{x}}){U}^{\dagger }({\boldsymbol{\theta }}))={\rm{Tr}}({\mathcal{O}}({\boldsymbol{\theta }})\rho ({\boldsymbol{x}}))), for some observable, ({\mathcal{O}}), i.e. the overlap between a parameterised Hermitian observable and a data-dependent state. This unifies many paradigms in the quantum machine learning literature, such as kernel methods29 and data reuploading models via gate teleportation30. Due to the linearity of quantum mechanics, we can also write the density model in this form by inserting equation (2) into the function evaluation:
$$\begin{array}{lll}{f}_{{\mathsf{D}}}({\boldsymbol{\theta }},{\boldsymbol{\alpha }},{\boldsymbol{x}})&=&{\rm{Tr}}\left({\mathcal{O}}({\boldsymbol{\theta }},{\boldsymbol{\alpha }},{\boldsymbol{x}})\rho ({\boldsymbol{x}})\right),\\ {\mathcal{O}}({\boldsymbol{\theta }},{\boldsymbol{\alpha }},{\boldsymbol{x}})&:=&\mathop{\sum }\limits_{k=1}^{K}{\alpha }_{k}({\boldsymbol{x}}){U}_{k}^{\dagger }({{\boldsymbol{\theta }}}_{k}){\mathcal{O}}{U}_{k}({{\boldsymbol{\theta }}}_{k})\end{array}$$
(3)
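The equivalence between equations (2) and (3) follows from linearity and the cyclic property of the trace; the following toy numpy check (with arbitrary random unitaries and a fixed diagonal observable standing in for ({\mathcal{O}})) verifies it numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, K = 4, 3

def rand_unitary(d):
    q, r = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return q * (np.diag(r) / np.abs(np.diag(r)))

rho_x = np.zeros((dim, dim), dtype=complex); rho_x[0, 0] = 1.0   # stand-in for rho(x)
U = [rand_unitary(dim) for _ in range(K)]
alpha = rng.dirichlet(np.ones(K))
O = np.diag([1.0, -1.0, 1.0, -1.0]).astype(complex)              # a fixed Hermitian observable

# Left-hand side: measure O on the density QNN state of Eq. (2)
rho_D = sum(a * Uk @ rho_x @ Uk.conj().T for a, Uk in zip(alpha, U))
lhs = np.trace(O @ rho_D)

# Right-hand side: fold the mixture into the parameterised observable of Eq. (3)
O_eff = sum(a * Uk.conj().T @ O @ Uk for a, Uk in zip(alpha, U))
rhs = np.trace(O_eff @ rho_x)

assert np.isclose(lhs, rhs)
```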
Removing the data-dependence from the coefficients simply removes the data-dependence from the observable, ({\mathcal{O}}({\boldsymbol{\theta }},{\boldsymbol{\alpha }},{\boldsymbol{x}})\to {\mathcal{O}}({\boldsymbol{\theta }},{\boldsymbol{\alpha }})). Finally, selecting the sub-unitaries to be identical and equal to the data loading unitary, with K equal to the size of the training data, M, leads to the observable ({\mathcal{O}}({{{{\boldsymbol{x}}}_{k}}}_{k = 1}^{M},{\boldsymbol{\alpha }}):= \mathop{\sum }\nolimits_{k=1}^{M}{\alpha }_{k}{U}^{\dagger }({{\boldsymbol{x}}}_{k}){\mathcal{O}}U({{\boldsymbol{x}}}_{k})=\mathop{\sum }\nolimits_{k=1}^{M}{\alpha }_{k}\rho ({{\boldsymbol{x}}}_{k})). This is an optimal family of models in a kernel method via the representer theorem29.
Next, returning to equation (3) and replacing the data dependence in the state with a parameter dependence, ρ(x) → ρ(θ), we fall within the family of flipped quantum models31. These are a useful model family in which the roles of data and parameters in the model are flipped. This insight enables the incorporation of classical shadows32 for, e.g., quantum training and classical deployment of QML models.
Finally, we have the framework of post-variational quantum models33, originating from the classical combination of quantum states ansatz34. In motivation, these models are perhaps more similar to 'implicit'30 models such as quantum kernel methods, where the quantum computer is used only for specific fixed, non-trainable operations (e.g. evaluating inner products for kernels), rather than 'explicit'30 models where trainable parameters reside within unitaries, U(θ). Post-variational models involve optimising coefficients αkq, which are injected into the model via a linear or non-linear combination of observables, ({{{{\mathcal{O}}}_{q}}}_{q = 1}^{Q}), applied to (non-trainable) unitary-transformed states, ({{{U}_{k}\rho ({\boldsymbol{x}}){U}_{k}^{\dagger }}}_{k = 1}^{K}). In the linear case, the output of the model is:
$$\begin{array}{lll}{f}_{{\mathsf{PV}}}({\boldsymbol{\alpha }},{\boldsymbol{x}})&=&\mathop{\sum}\limits_{kq}{\alpha }_{kq}{\rm{Tr}}({{\mathcal{O}}}_{q}{U}_{k}\rho ({\boldsymbol{x}}){U}_{k}^{\dagger })\\ &=&\mathop{\sum}\limits_{kq}{\alpha }_{kq}{\rm{Tr}}({{\mathcal{O}}}_{kq}\rho ({\boldsymbol{x}})),\quad {{\mathcal{O}}}_{kq}:= {U}_{k}^{\dagger }{{\mathcal{O}}}_{q}{U}_{k}\end{array}$$
(4)
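As a sketch of the post-variational workflow in equation (4), the snippet below precomputes the KQ traces ({\rm{Tr}}({{\mathcal{O}}}_{q}{U}_{k}\rho ({\boldsymbol{x}}){U}_{k}^{\dagger })) for a toy dataset (on hardware these would come from circuit evaluations) and then fits the coefficients αkq by ordinary least squares, the kind of convex outer optimisation the framework relies on. The data loader, targets and fixed unitaries here are arbitrary stand-ins of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, K, Q, M = 4, 3, 2, 50        # Hilbert-space dim, fixed unitaries, observables, data points

def rand_unitary(d):
    q, r = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return q * (np.diag(r) / np.abs(np.diag(r)))

U   = [rand_unitary(dim) for _ in range(K)]                      # fixed, non-trainable circuits
Obs = [np.diag(rng.choice([-1.0, 1.0], size=dim)) for _ in range(Q)]

def rho_of_x(x):
    # toy 'data loader': normalised real amplitude encoding of x
    v = x / np.linalg.norm(x)
    return np.outer(v, v)

X = rng.normal(size=(M, dim))
y = rng.normal(size=M)                                           # toy regression targets

# Feature matrix Phi[m, k*Q + q] = Tr(O_q U_k rho(x_m) U_k^dagger); on hardware each entry
# would be estimated from measurement shots of the corresponding circuit
Phi = np.array([[np.trace(Oq @ Uk @ rho_of_x(x) @ Uk.conj().T).real
                 for Uk in U for Oq in Obs] for x in X])

# The only training step is a convex (here: least-squares) fit of the K*Q coefficients alpha_kq
alpha_kq, *_ = np.linalg.lstsq(Phi, y, rcond=None)
f_PV = Phi @ alpha_kq                                            # model predictions, Eq. (4)
```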
The major benefit of post-variational models is that, similar to quantum kernel methods, the optimisation over a convex combination of parameters outside the circuit is in principle significantly easier than the non-convex optimisation of parameters within the unitaries. However, just like kernel methods, this comes at the cost of an expensive forward pass through the model, which requires ({\mathcal{O}}(KQ)) circuits to be evaluated. In the worst case, this should also be exponential in the number of qubits to enable arbitrary quantum transformations on ρ(x), (KQ\le {4}^{n})33. To avoid evaluating an exponential number of quantum circuits, it is clearly necessary to employ heuristic strategies or impose symmetries to choose a sufficiently large yet expressive pool of operators ({{\mathcal{O}}}_{kq}). In light of this, ref. 33 proposes ansatz expansion strategies34 or gradient heuristics to grow the pool of quantum operations. Such techniques may also be incorporated into our proposal, but we leave such investigations to future work.
Preparing density quantum neural networks
As mentioned above, we have not yet described a method to prepare the density QNN state, equation (1). Figure 1 showcases two methods of doing so. For now, we do not assume any specific choice for the sub-unitaries. The first method is via a deterministic circuit which exactly prepares ρ(θ, α, x), as shown in Fig. 1b. We prove the correctness of this circuit in Supplementary Material I.2. The structure of the circuit can be related directly to the corresponding linear combination of unitaries QNN35, which instead prepares the pure state, (\sum _{k}{\alpha }_{k}{U}_{k}({{\boldsymbol{\theta }}}_{k})\left\vert {\boldsymbol{x}}\right\rangle), shown in Fig. 1a. Notably, the deterministic density QNN removes the need for ancilla postselection on a specific state (({\left\vert 0\right\rangle }_{{\mathcal{A}}}^{\otimes n}) in the figure). In other words, while a single forward pass through an LCU QNN will only succeed with some probability p, the deterministic density QNN state preparation succeeds with probability p = 1. While the circuits in Fig. 1a, b are conceptually simple, the controlled operation of the sub-unitaries, which is a necessity without any further assumptions, may be very expensive in practice. In Supplementary Material I.2 we discuss certain assumptions on the structure of the sub-unitaries which may simplify the resource requirements of this preparation mechanism; specifically, assuming a Hamming-weight preserving structure allows the removal of the generic controlled operation.
Fig. 1: Density quantum neural networks.
a Linear combination of unitaries quantum neural networks (LCU QNNs) preparing the state (\sum _{k}{\alpha }_{k}{U}_{k}({{\boldsymbol{\theta }}}_{k})\left\vert {\boldsymbol{x}}\right\rangle) via postselection on an ancilla register ({\mathcal{A}}) which prepares the distribution α. b The corresponding density quantum neural network, implemented deterministically to prepare the state ρ(θ, α, x). Finally, the instantiation of the density QNN state via randomisation is shown in (c), where each sub-unitary, Uk(θk), is only prepared with probability αk, without the need for deep multi-controlled circuits and ancilla qubits. The deterministic density QNN, (b), is required if one wishes to make a true comparison of these networks to the dropout mechanism. From the Mixing lemma, the randomised version, (c), can distil the performance benefits of the more powerful LCU QNN, (a), into very short depth circuits. The probability loaders, ({\mathsf{Load}}\left(\sqrt{{\boldsymbol{\alpha }}}\right)), are assumed to be unary data loaders which act on K qubits within the register, ({\mathcal{A}}), and have depth (\log (K))71. One could also use binary Prepare and Select circuits acting on (\log (K)) qubits, as is more standard in the LCU literature. The resulting functions from each network, f(θ, α, x), result from the measurement of an observable, ({\mathcal{O}}), via (f({\boldsymbol{\theta }},{\boldsymbol{\alpha }},{\boldsymbol{x}})={\rm{Tr}}({\mathcal{O}}\sigma ({\boldsymbol{\theta }},{\boldsymbol{\alpha }},{\boldsymbol{x}}))), where σ is the output state of each circuit. V(x) is the n-qubit data loader acting on register ({\mathcal{B}}).
The second method uses the distributional property of α to only prepare the density state ρ(θ, α, x) on average, depicted in Fig. 1c. In this form, the forward pass completely removes the need for ancillary qubits, and complicated controlled unitaries.
The effect of this is threefold:
Firstly, a forward pass through the randomised density QNN (Fig. 1c) requires a time which is upper bounded by the execution time of only the most complex unitary, ({U}_{{k}^{* }}), as illustrated in Fig. 1c. In the language of post-variational models, measuring Q observables on a randomised density QNN has complexity ({\mathcal{O}}(Q)), a K-fold improvement.
Secondly, we will show that the gain in efficiency in moving from the LCU to the randomised density QNN does not come at a significant loss in model performance. We prove this, under certain assumptions, using the Hastings-Campbell Mixing lemma from randomised quantum circuit compiling.
Thirdly, and related to the first two points, one can view the randomised density QNN as an explicit (in the sense of ref. 30) version of the post-variational framework. This may be an interesting direction to study given the series of hierarchies found by ref. 30 between implicit, explicit and reuploading models.
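A minimal simulation of the randomised preparation in Fig. 1c is sketched below: each shot draws one sub-unitary index k with probability αk, runs only that circuit, and the shot average converges to the expectation value on the mixed state of equation (1). The unitaries, observable and data state are again illustrative placeholders rather than any architecture studied later.

```python
import numpy as np

rng = np.random.default_rng(3)
dim, K, shots = 4, 3, 20000

def rand_unitary(d):
    q, r = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return q * (np.diag(r) / np.abs(np.diag(r)))

U = [rand_unitary(dim) for _ in range(K)]
alpha = rng.dirichlet(np.ones(K))
psi_x = np.zeros(dim, dtype=complex); psi_x[0] = 1.0     # data-loaded state |x>
O = np.diag([1.0, -1.0, -1.0, 1.0])                      # diagonal observable: one eigenvalue per bitstring

# Randomised forward pass (Fig. 1c): per shot, draw one sub-unitary k ~ alpha, run only U_k
# on |x>, measure O, and average. No ancillas and no controlled unitaries are needed.
outcomes = []
for _ in range(shots):
    k = rng.choice(K, p=alpha)
    phi = U[k] @ psi_x
    probs = np.abs(phi) ** 2
    probs /= probs.sum()
    b = rng.choice(dim, p=probs)
    outcomes.append(O[b, b])
f_randomised = np.mean(outcomes)

# Exact expectation on the mixed state of Eq. (1), for comparison
rho_D = sum(a * Uk @ np.outer(psi_x, psi_x.conj()) @ Uk.conj().T for a, Uk in zip(alpha, U))
f_exact = np.trace(O @ rho_D).real
print(f_randomised, f_exact)     # agree up to shot noise ~ 1/sqrt(shots)
```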
Gradient extraction for density QNNs
For density QNNs to be performant in practice, they must be efficiently trainable. In other words, it should not be exponentially more difficult to evaluate gradients from such models, compared to the component sub-unitaries. In the following, we describe general statements regarding the gradient extractability from density QNNs. By then choosing the sub-unitaries to themselves be efficiently trainable (in line with a so-called backpropagation scaling, which we will define), the entire model will also be. We formalise this as follows:
Proposition 1
(Gradient scaling for density quantum neural networks) Given a density QNN as in equation (1) composed of K sub-unitaries, ({\mathcal{U}}={{{U}_{k}({{\boldsymbol{\theta }}}_{k})}}_{k = 1}^{K}), implemented with distribution, α = {αk}, an unbiased estimator of the gradients of a loss function, ({\mathcal{L}}), defined by a Hermitian observable, ({\mathcal{H}}):
$${\mathcal{L}}({\boldsymbol{\theta }},{\boldsymbol{\alpha }},{\boldsymbol{x}})={\rm{Tr}}\left({\mathcal{H}}\rho ({\boldsymbol{\theta }},{\boldsymbol{\alpha }},{\boldsymbol{x}})\right)$$
(5)
can be computed by classically post-processing (\mathop{\sum }\nolimits_{\ell =1}^{K}\mathop{\sum }\nolimits_{k=1}^{K}{T}_{\ell k}) circuits, where Tℓk is the number of circuits required to compute the gradient of sub-unitary k, U(θk), with respect to the parameters in sub-unitary ℓ, θℓ. Furthermore, these parameters may be shared across the unitaries, ({{\boldsymbol{\theta }}}_{k}={{\boldsymbol{\theta }}}_{k{\prime} }) for some (k,k^{\prime}).
The proof is given in Supplementary Material B.1, but it follows simply from the linearity of the model. Now, there are two sub-cases one can consider. The first is when all parameters between sub-unitaries are independent, θk ≠ θℓ, ∀ k, ℓ. This gives the following corollary, also proven in Supplementary Material B.1 and illustrated in Fig. 2.
Fig. 2: Illustration of Corollary 1.
In the case where no parameters are shared across the sub-unitaries, the gradients of the density model in equation (1), when measured with an observable ({\mathcal{H}}), simply involve computing gradients for each sub-unitary individually. As a result, the full model introduces an ({\mathcal{O}}(K)) overhead for gradient extraction. If (K={\mathcal{O}}(\log (N))) and each sub-unitary admits a backpropagation scaling for gradient extraction, the density model will also admit a backpropagation scaling.
Corollary 1
Given a density QNN as in equation (1) composed of K sub-unitaries, ({\mathcal{U}}={{{U}_{k}({{\boldsymbol{\theta }}}_{k})}}_{k = 1}^{K}), where the parameters of sub-unitaries are independent, θk ≠ θℓ, ∀ k, ℓ, an unbiased estimator of the gradients of a loss function, ({\mathcal{L}}), equation (5), can be computed by classically post-processing (\mathop{\sum }\nolimits_{k=1}^{K}{T}_{k}) circuits, where Tk is the number of circuits required to compute the gradient of sub-unitary k, U(θk), with respect to the parameters, θk.
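The following sketch illustrates Corollary 1 in the simplest possible setting: K independent single-parameter sub-unitaries ({U}_{k}({\theta }_{k})={R}_{x}({\theta }_{k})), for which the loss in equation (5) is the α-weighted sum of the individual expectation values, so each gradient component only needs the (here, parameter-shift) circuits of its own sub-unitary. This toy is our own construction and not one of the architectures analysed later.

```python
import numpy as np

# Single-qubit toy: each sub-unitary is U_k(theta_k) = RX(theta_k), with independent parameters.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H_obs = Z                                                # measurement observable H

def RX(t):
    return np.cos(t / 2) * np.eye(2) - 1j * np.sin(t / 2) * X

def f_k(theta_k, psi):
    phi = RX(theta_k) @ psi
    return (phi.conj() @ H_obs @ phi).real               # Tr(H U_k rho(x) U_k^dagger)

rng = np.random.default_rng(4)
K = 3
alpha = rng.dirichlet(np.ones(K))
theta = rng.uniform(0, 2 * np.pi, size=K)
psi = np.array([1.0, 0.0], dtype=complex)                # data-loaded state |x>

# By linearity, the loss of Eq. (5) is the alpha-weighted sum of the sub-unitary losses, so the
# gradient w.r.t. theta_k only needs the gradient circuits of sub-unitary k (here: parameter shift).
def loss(th):
    return sum(a * f_k(t, psi) for a, t in zip(alpha, th))

grad = np.array([a * 0.5 * (f_k(t + np.pi / 2, psi) - f_k(t - np.pi / 2, psi))
                 for a, t in zip(alpha, theta)])

# Finite-difference check of the (here exact) gradient estimator
eps = 1e-6
fd = np.array([(loss(theta + eps * np.eye(K)[k]) - loss(theta - eps * np.eye(K)[k])) / (2 * eps)
               for k in range(K)])
assert np.allclose(grad, fd, atol=1e-5)
```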
The second case is where some (or all) parameters are shared across the sub-unitaries. Taking the extreme example, ({\theta }_{l}^{\,j}={\theta }_{k}^{\,j}=:{\theta }^{\,j},\forall k,l), i.e. all sub-unitaries from equation (1) have the same number of parameters, which are all identical. In this case, for each sub-unitary, l, we must evaluate all K terms in the sum, so the number of circuits will increase by at most a factor of K²; we need to compute every term in the matrix of partial derivatives.
Note that this is the number of circuits required, not the overall sample complexity of the estimate. For example, take a single-layer commuting-block circuit (i.e. a commuting-generator circuit) with C mutually commuting generators. Also assume a suitable measurement observable, ({\mathcal{H}}), such that the resulting gradient observables, ({{{{\mathcal{O}}}_{c}| {{\mathcal{O}}}_{c}:= [{G}_{c},{\mathcal{H}}]}}_{c = 1}^{C}), can be simultaneously diagonalised. To estimate these C gradient observables each to a precision ε (meaning outputting an estimate ({\tilde{o}}_{c}) such that (| {\tilde{o}}_{c}-\left\langle \psi \right\vert {{\mathcal{O}}}_{c}\left\vert \psi \right\rangle | \le \varepsilon) with confidence 1 − δ) requires ({\mathcal{O}}\left({\varepsilon }^{-2}\log \left(\frac{C}{\delta }\right)\right)) copies of ψ (or equivalently calls to a unitary preparing ψ). It is also possible to incorporate strategies such as shadow tomography32, amplitude estimation36 or quantum gradient algorithms37 to improve the C, δ or ε parameter scalings in more general scenarios, though inevitably at the cost of scaling in the others.
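To illustrate the shot-count statement, the sketch below takes C gradient observables that are all diagonal in one shared basis (here simply Z on each qubit of a random state, as a stand-in for the simultaneously diagonalised ({{\mathcal{O}}}_{c})) and estimates all of them from a single set of (M={\mathcal{O}}({\varepsilon }^{-2}\log (C/\delta ))) samples of that basis measurement.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4                                    # qubits
C = n                                    # C commuting 'gradient observables': here Z on each qubit
dim = 2 ** n

# A fixed n-qubit state |psi>, standing in for the output of a single gradient circuit
psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
psi /= np.linalg.norm(psi)
probs = np.abs(psi) ** 2
probs /= probs.sum()

# All C observables are diagonal in the shared (computational) eigenbasis, so one set of M
# samples of that single basis measurement estimates every <O_c> simultaneously.
eps, delta = 0.05, 0.01
M = int(np.ceil(2 / eps**2 * np.log(2 * C / delta)))     # Hoeffding-style count, O(eps^-2 log(C/delta))
samples = rng.choice(dim, size=M, p=probs)

def z_eig(bitstring, qubit):
    # eigenvalue of Z on `qubit` for the computational-basis state labelled by `bitstring`
    return 1.0 if (bitstring >> qubit) & 1 == 0 else -1.0

estimates = np.array([np.mean([z_eig(s, c) for s in samples]) for c in range(C)])
exact = np.array([sum(probs[i] * z_eig(i, c) for i in range(dim)) for c in range(C)])
print(np.max(np.abs(estimates - exact)))                 # each estimate within ~eps with high probability
```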
Efficiently trainable density networks
The results of the previous section state that moving to the density framework does not result in an exponential increase in the difficulty of gradient extraction, unless the number of sub-unitaries is exponential. However, what we really care about is that the models are end-to-end efficiently trainable, meaning that overall their gradients can be computed with a backpropagation scaling. This is the resource scaling which the (classical) backpropagation algorithm obeys, and which we would ideally strive for in quantum models. In the following, we specialise the derived results to cases where the component sub-unitaries have efficient gradient extraction protocols. This renders the entire density model efficiently trainable in this regime.
This 'backpropagation' scaling can be defined as follows:
Definition 1
(Backpropagation scaling16,38) Given a parameterised function, (f({\boldsymbol{\theta }}),{\boldsymbol{\theta }}\in {{\mathbb{R}}}^{N}), let (f^{\prime} ({\boldsymbol{\theta }})) be an estimate of the gradient of f with respect to θ up to some accuracy ε. Backpropagation scaling holds if the total computational cost to estimate (f^{\prime} ({\boldsymbol{\theta }})) is bounded as:
$${\mathcal{T}}(f^{\prime} ({\boldsymbol{\theta }}))\le {c}_{t}{\mathcal{T}}(f({\boldsymbol{\theta }}))$$
(6)
and
$${\mathcal{M}}(f^{\prime} ({\boldsymbol{\theta }}))\le {c}_{m}{\mathcal{M}}(f({\boldsymbol{\theta }}))$$
(7)
where ({c}_{t},{c}_{m}={\mathcal{O}}(\log (N))) and ({\mathcal{T}}(g)/{\mathcal{M}}(g)) is the time/amount of memory required to compute g.
In plain terms, a model which achieves a backpropagation scaling according to Definition 1, particularly a quantum model, does not require significantly more effort (in terms of number of qubits, circuit size, or number of circuits) to compute gradients with respect to all parameters than to evaluate the model itself.
One family of circuits which does obey such a scaling are the so-called commuting-block QNNs, defined in ref. 38, which contain B blocks of unitaries generated by operators that all mutually commute within a block. We discuss the specific circuits in 'Methods', but for now we specialise Proposition 1 to these commuting-block unitaries as follows:
Corollary 2
(Gradient scaling for density commuting-block quantum neural networks) Given a density QNN containing K sub-unitaries, each acting on n qubits, where each sub-unitary, k, has a commuting-block structure with Bk blocks. Assume each sub-unitary has different parameters, θk ≠ θℓ, ∀ k, ℓ. Then an unbiased estimate of the gradient can be computed by classically post-processing ({\mathcal{O}}(2\sum _{k}{B}_{k}-K)) circuits on n + 1 qubits.
Proof
This follows immediately from Proposition 1 and Theorem 5 of ref. 38. There, the gradients of a single B-block commuting-block circuit can be computed by post-processing 2B − 1 circuits: 2 circuits are required per block, with the exception of the final block, which can be treated as a commuting-generator circuit and evaluated with a single circuit.
At this stage, we showcase two possibilities when constructing density networks. It should be noted that these in some sense represent extreme cases, and should not be taken as the exclusive possibilities; ultimately, the successful models will likely exist in the middle ground. The first path allows us to increase the trainability of certain QML models in the literature. In Table 1, we show some results of doing so for some popular examples. The first step is to dissect commuting-block components from each 'layer' of the respective model, then treat these components as sub-unitaries within the density formalism, and finally apply Corollary 2. In the following sections, we describe this strategy for the models in the table, beginning with the hardware efficient ansatz.
Secondly, we may simply use the framework as a means to increase overall model expressibility, where the component sub-unitaries, Uk(θk), are any generic trainable circuits (which are independent for simplicity). Further assume each Uk(θk) has an identical structure with N parameters acting on n qubits and requires Tn,N gradient circuits. Then, from Corollary 1, a density model will require KTn,N parameter-shift circuits. In many cases, K will be a constant independent of N or n, and furthermore this evaluation over the K sub-unitaries can be done in parallel.
In the next section, and in Fig. 3 we illustrate these paths using hardware efficient QNNs. For the examples in the following sections (the other models referenced in Table 1 and others), we demonstrate both of these directions.
Fig. 3: Decomposing a hardware efficient ansatz for a density QNN.
D layers of a hardware efficient (HWE) ansatz with entanglement generated by CNOT ladders and trainable parameters in single-qubit Rx, Ry, Rz gates. (bottom left) The D layers extracted into D sub-unitaries with probabilities, ({{{\alpha }_{d}}}_{d = 1}^{D}), for a density QNN version. Applying the commuting-generator framework to the density version, ({\rho }_{{\mathsf{HWE}}}({\boldsymbol{\theta }},{\boldsymbol{\alpha }},{\boldsymbol{x}})), enables parallel gradient evaluation in 2D circuits versus the 2nD required by the pure state version, (\left\vert {\psi }_{{\mathsf{HWE}}}({\boldsymbol{\theta }},{\boldsymbol{x}})\right\rangle). To illustrate potential differences between sub-unitaries, we arbitrarily reverse CNOT directions in subsequent layers, partially accounting for the low circuit depth. (bottom right) Alternatively, we can simply create a more expressive version of the hardware efficient QNN within the density framework by duplicating across K sub-unitaries with probabilities ({{{\alpha }_{k}}}_{k = 1}^{K}), each retaining D layers. In this case, the model requires 2nDK circuits for gradient extraction, but each sub-unitary can have independent parameters learning different features, especially if each contains different entanglement structures.
Hardware efficient quantum neural networks
To illustrate the two possible paths for model construction, we use a toy example (shown in Fig. 3): the common but much maligned hardware efficient39 quantum neural network. These 'problem-independent' ansätze were proposed to keep quantum learning models as close as possible to the restrictions of physical quantum computers, by enforcing specific qubit connectivities and avoiding injecting trainable parameters into complex transformations. These circuits are extremely flexible, but this comes at the cost of vulnerability to barren plateaus23 and general difficulty in training.
A D-layer hardware efficient ansatz on n qubits is usually defined to have 1 parameter per qubit (located in a single-qubit Pauli rotation) per layer. The parameter-shift rule with such a model would require 2nD individual circuits to estimate the full gradient vector, each requiring M measurement shots. Given such a circuit, we can construct a density version with D sub-unitaries and reduce the gradient requirements from 2nD to 2D circuits, since the gradients for the single-qubit unitaries in each sub-unitary can be evaluated in parallel, using the commuting-generator toolkit from 'Methods' and Corollary 2. This example is relatively trivial as the resulting unitaries are shallow depth (which also likely increases the ease of classical simulability) and training each corresponds only to learning a restricted single-qubit measurement basis. In Fig. 3, we take a variation of the common CNOT-ladder layout, where entanglement is generated in each layer by nearest-neighbour CNOT gates. Typically, an identical structure is used in each layer; however, in the figure we allow each sub-unitary extracted from a layer to have a varying CNOT control-target directionality and different single-qubit rotations. This is to increase differences between each 'expert' (see below), as each sub-unitary can generate different levels of (dis)entanglement. Secondly, as illustrated in Fig. 3, one can also define a density version which is not more trainable than the original version: in this case, we have K depth-D hardware efficient circuits, which according to the parameter-shift rule would now require ({\mathcal{O}}(2KDn)) circuits. However, the density model contains more parameters (K-fold more) than the original single-circuit version, and is possibly more expressive as a QML model.
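A forward-pass sketch of the 'first path' in Fig. 3 is given below using PennyLane (our own choice of simulator for illustration; any circuit library would do): each HWE layer becomes its own sub-unitary applied directly to the data-loaded state, and the density model output is the α-weighted sum of the D single-layer expectation values. The angle-encoding loader, the single Rx rotation per qubit, the observable and the alternating CNOT directions are simplifications of the figure, and the commuting-generator gradient extraction itself is not shown.

```python
import numpy as np
import pennylane as qml

n, D = 4, 3                                  # qubits; D layers become D sub-unitaries
dev = qml.device("default.qubit", wires=n)

def data_loader(x):
    # placeholder data loader V(x): simple angle encoding
    for i in range(n):
        qml.RY(x[i], wires=i)

def hwe_layer(theta_d, reverse=False):
    # one hardware-efficient layer: one single-qubit rotation per qubit, then a CNOT ladder
    for i in range(n):
        qml.RX(theta_d[i], wires=i)
    for i in range(n - 1):
        qml.CNOT(wires=[i + 1, i] if reverse else [i, i + 1])

@qml.qnode(dev)
def sub_model(theta_d, x, d):
    # d-th sub-unitary of the density version: a single layer applied directly to rho(x)
    data_loader(x)
    hwe_layer(theta_d, reverse=(d % 2 == 1))   # alternate CNOT direction between sub-unitaries
    return qml.expval(qml.PauliZ(0))

rng = np.random.default_rng(6)
theta = rng.uniform(0, 2 * np.pi, size=(D, n))
alpha = rng.dirichlet(np.ones(D))
x = rng.uniform(0, np.pi, size=n)

# Density HWE output: by linearity, the alpha-weighted mixture of the D single-layer expectations
f_density = sum(a * sub_model(theta[d], x, d) for d, a in zip(range(D), alpha))
print(f_density)
```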
LCU and the mixing lemma
The second feature of the density framework is its relationship to linear combination of unitaries (LCU) quantum machine learning. Above, we discussed two methods of preparing the density state of equation (1), illustrated in Fig. 1. Now, we will demonstrate how one may translate performance guarantees from families of LCU QNNs (Fig. 1a) to the randomised version of the density QNN (Fig. 1c).
Specifically, we will show that, in at least one restricted learning scenario, if one can construct and train an LCU QNN (Fig. 1a) which has a better learning performance (in terms of, e.g., classification accuracy) than any component unitary, this improved performance can be transferred to a density QNN without loss. This transference has an important consequence: due to the minimal requirements of implementing a randomised density QNN (Fig. 1c) on quantum hardware, relative to the LCU QNN, we can implement the more performant model much more cheaply. To do so, we will prove a result using the Hastings-Campbell mixing lemma40,41 from the field of compiling complex unitaries onto sequences of simpler quantum operations.
In this context, we will adapt the Mixing lemma as follows. Assume one trains K sub-unitaries {Uk(θk)}, each to be a 'good' model, in that each achieves a low prediction error, δ1, with respect to some ground truth function. Next, with the trained sub-unitaries fixed, one learns a linear combination, (\sum _{k}{\alpha }_{k}{U}_{k}), with (distributional) coefficients, {αk}, by training only the coefficients. Assume this more powerful QNN model (the LCU QNN) achieves a 'better' prediction error, δ2 < δ1. However, despite better performance, the LCU QNN is far more expensive to implement than any individual Uk (as can be seen in Fig. 1a). The logic of the Mixing lemma implies that, instead of this deep circuit, we may randomise over the unitaries (creating a randomised density QNN) and achieve the same error as the LCU QNN, but with the same overhead as the most complex Uk. We formalise this as the following:
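The following toy numpy sketch sets up the objects the lemma compares: a target unitary V, K slightly perturbed copies of it standing in for the δ1-good trained sub-unitaries, an (untrained, randomly drawn) distribution α, the post-selected LCU prediction and the randomised density prediction. It is purely illustrative of the quantities involved, not a demonstration of the bound itself.

```python
import numpy as np

rng = np.random.default_rng(7)
dim, K = 4, 3

def rand_unitary(d):
    q, r = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return q * (np.diag(r) / np.abs(np.diag(r)))

def expi_herm(H):
    # exp(iH) for Hermitian H via eigendecomposition
    w, v = np.linalg.eigh(H)
    return v @ np.diag(np.exp(1j * w)) @ v.conj().T

V_target = rand_unitary(dim)                             # ground-truth unitary defining h(x)
O = np.diag([1.0, -1.0, 1.0, -1.0])
rho_x = np.zeros((dim, dim), dtype=complex); rho_x[0, 0] = 1.0

# K 'trained' sub-unitaries: small Hermitian perturbations of the target, standing in for models
# that each reach error delta_1; alpha stands in for the learned LCU coefficients (not trained here)
U = []
for _ in range(K):
    P = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    U.append(expi_herm(0.05 * (P + P.conj().T) / 2) @ V_target)
alpha = rng.dirichlet(np.ones(K))

h = np.trace(O @ V_target @ rho_x @ V_target.conj().T).real          # target prediction h(x)
f_k = [np.trace(O @ Uk @ rho_x @ Uk.conj().T).real for Uk in U]      # individual sub-models

V_lcu = sum(a * Uk for a, Uk in zip(alpha, U))                       # coherent LCU combination
sigma = V_lcu @ rho_x @ V_lcu.conj().T
f_lcu = (np.trace(O @ sigma) / np.trace(sigma)).real                 # post-selected LCU prediction

rho_D = sum(a * Uk @ rho_x @ Uk.conj().T for a, Uk in zip(alpha, U))
f_dens = np.trace(O @ rho_D).real                                    # randomised density prediction

print("per-unitary errors:", [abs(h - f) for f in f_k])
print("LCU error:         ", abs(h - f_lcu))
print("density error:     ", abs(h - f_dens))
```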
Lemma 1
(Mixing lemma for supervised learning) Let h(x) be a target ground truth function, prepared via the application of a fixed unitary, V, (h({\boldsymbol{x}}):= {\rm{Tr}}({\mathcal{O}}V\rho ({\boldsymbol{x}}){V}^{\dagger })), on a data encoded state, ρ(x), and measured with a fixed observable, ({\mathcal{O}}). Suppose there exist K unitaries ({{{U}_{k}({\boldsymbol{\theta }})}}_{k = 1}^{K}) such that each is a δ1-good predictive model of h(x):
$${{\mathbb{E}}}_{{\boldsymbol{x}}}| h({\boldsymbol{x}})-{f}_{k}({\boldsymbol{\theta }},{\boldsymbol{x}})| \le {\delta }_{1},\forall k$$
(8)
and a distribution ({{{\alpha }_{k}}}_{k = 1}^{K}) such that predictions according to the LCU model ({f}_{{\mathsf{LCU}}}({\boldsymbol{\theta }},{\boldsymbol{\alpha }},{\boldsymbol{x}}):= {\rm{Tr}}({\mathcal{O}}\left(\sum _\