Hold onto your hats, because 2026 promises to bring a whole slew of improvements to MCMC for continuously differentiable densities. The really awesome part is that all of these improvements are orthogonal, so they stack. I’m going to list the ones we know work first, followed by a couple for which we have high hopes.
I’m currently coding all of this up in a C++ reference implementation, which I plan to roll out with an interface like Nutpie’s. You can follow along here:
- GitHub repository: flatironinstitute/walnuts
You’ll find the work in progress on branches. We are, of course, always happy to get feedback on the code, and on these or other ideas for sampling.
Fisher divergence for mass matrix adaptation
Adrian Seyboldt developed much faster, more robust, and better-targeted mass matrix adaptation using Fisher divergence for his sampler Nutpie. I’m helping Adrian and Eliot Carlsen finish up a paper on this, which we hope to release in a week or so. In addition to better diagonal and dense estimators, it contains a really nice low-rank-plus-diagonal preconditioner that seems very effective (the risk is getting stuck too much in the subspace defined by the low-rank structure). It also contains all the mathematical proofs, thanks to Adrian. The basic idea is that Fisher divergence adaptation tries to make the preconditioned target as close as possible to a standard normal in terms of gradients (not in terms of density, which is what KL divergence targets). With some surprising mathematical magic, you can solve the Fisher divergence optimization exactly by taking the geometric mean, in the affine-invariant manifold of positive-definite matrices, of two quantities: the inverse covariance of the draws and the covariance of the scores (where the score is the gradient of the log density). For a multivariate normal, the score has zero expectation and its covariance is the inverse covariance of the target density, so it acts like a control variate. Who knew? (That was a rhetorical question. Ben Goodrich, of course, knew: he knows everything.)
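To make the recipe concrete, here’s a minimal NumPy sketch of the basic idea (this is not Adrian’s actual Nutpie estimator, which adds regularization, the low-rank structure, and online updating; the function names here are just for illustration):

```python
import numpy as np

def spd_sqrt(a):
    """Matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(a)
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

def spd_geometric_mean(a, b):
    """Geometric mean of SPD matrices: A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}."""
    a_half = spd_sqrt(a)
    a_half_inv = np.linalg.inv(a_half)
    return a_half @ spd_sqrt(a_half_inv @ b @ a_half_inv) @ a_half

def fisher_mass_matrix(draws, scores):
    """Mass matrix as the geometric mean of inv(cov(draws)) and cov(scores).

    draws, scores: (num_draws, dim) arrays, where the scores are gradients of
    the log density at the draws.  For a multivariate normal target, both
    arguments estimate the precision matrix, so the geometric mean recovers it.
    """
    inv_cov_draws = np.linalg.inv(np.cov(draws, rowvar=False))
    cov_scores = np.cov(scores, rowvar=False)
    return spd_geometric_mean(inv_cov_draws, cov_scores)
```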
Adaptive step size on the fly
This was developed by Nawaf Bou-Rabee, Tore Kleppe, Sifan Liu, and me over the course of a few papers expanding on our Gibbs self-tuning (GIST) ideas, culminating in our WALNUTS paper (on arXiv). The basic idea is, at each leapfrog step, to try taking a step and, if the Hamiltonian error exceeds a tolerance, to retry with a smaller step size (half the original). There’s a bit of adjustment to do for reversibility (very much like in the MALT sampler of Lionel Riou-Durand et al. [the paper has an all-star cast]), but the expectation is that the adjustment will be close to unity, and that seems to hold in practice. There’s some subtlety here in that the WALNUTS sampler we’ve produced can be slower than NUTS (on a per-gradient basis) for densities that NUTS fits well, but if it’s well tuned in the sense of using a minimal number of micro steps per macro step, it will be more efficient on a per-iteration basis. This shows we really are fixing the integrator, even if that doesn’t always give us a better sampler on a per-gradient basis. The same kind of issue comes up with higher-order leapfrog integrators: they’re more precise, but often not worth the effort. On the plus side, when you have highly varying scales, as in the funnel density, WALNUTS can sample effectively where NUTS just fails silently, reporting overly optimistic R-hat and ESS values.
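Stripped of the reversibility bookkeeping, the step-halving idea looks roughly like this (a sketch only, with my own function names, not the actual WALNUTS code; the real algorithm also corrects the acceptance probability so the halving decisions are reversible):

```python
import numpy as np

def leapfrog(q, p, grad_log_p, eps):
    """One leapfrog step for H(q, p) = -log p(q) + p.p / 2."""
    p = p + 0.5 * eps * grad_log_p(q)
    q = q + eps * p
    p = p + 0.5 * eps * grad_log_p(q)
    return q, p

def adaptive_macro_step(q, p, log_p, grad_log_p, eps, tol, max_halvings=10):
    """Cover a macro step of size eps with micro steps, halving the micro step
    size until the Hamiltonian error stays within tol."""
    h0 = -log_p(q) + 0.5 * p @ p
    for k in range(max_halvings + 1):
        n_micro = 2 ** k
        q_new, p_new, ok = q, p, True
        for _ in range(n_micro):
            q_new, p_new = leapfrog(q_new, p_new, grad_log_p, eps / n_micro)
            if abs(-log_p(q_new) + 0.5 * p_new @ p_new - h0) > tol:
                ok = False
                break
        if ok:
            return q_new, p_new, n_micro
    return q_new, p_new, n_micro  # give up at the finest resolution
```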
Isokinetic sampler
As the name implies, the idea of an isokinetic sampler is that you keep the kinetic energy constant. This is carried out by designing a kinetic energy function that’s different from the usual quadratic form derived from Newtonian physics (decomposed a la Hamilton, of course). This was developed decades ago by Mark Tuckerman at NYU’s Courant Institute for molecular dynamics, written about in Leimkuhler and Matthews’s Molecular Dynamics textbook, and reinvented recently by Jakob Robnik and Uroš Seljak under the name "microcanonical sampling." Isokinetic sampling has the remarkable property that you can restrict the sampler to a single energy level set and it remains ergodic for the target distribution. Although they’re mainly singing the praises of unadjusted methods, Reuben Cohn-Gordon joined Jakob and Uroš to write a joint paper on Metropolis-adjusted methods (on arXiv). Of course, you could extend this to something like jittered HMC, multinomial HMC, or NUTS (we discuss these alternatives for standard HMC in detail in the various GIST papers). Recently, my colleagues Tore Kleppe and Nawaf Bou-Rabee (my partners in crime on GIST) have verified the results presented by Robnik et al., and it seems to give a pretty clean factor of 1.5 to 3 improvement over the standard NUTS kinetic energy model.
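For intuition (my shorthand, not the notation of any of the papers above), the contrast with standard HMC is:

```latex
\[
\underbrace{K(p) = \tfrac{1}{2}\, p^\top M^{-1} p}_{\text{standard HMC kinetic energy}}
\qquad \text{vs.} \qquad
\underbrace{\|u\| = 1, \quad x' = x + \epsilon\, u}_{\text{isokinetic: constant speed; the gradient only rotates } u}
\]
```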
Unadjusted sampling, at least for warmup
In statistics, we’re used to thinking of MCMC as asymptotically unbiased with the correct ergodicity properties: run the chains longer and you get a better answer. But in statistical estimation, we’re often OK with introducing a bit of bias if it gives us a large enough reduction in variance that we get better answers. If you view sampling the same way, then removing the Metropolis step from HMC gives you a biased sampler that can move toward the typical set, where we want to sample, much faster than an adjusted sampler. And while Jakob, Uroš, and Reuben are mainly working on unadjusted samplers and trying to prove error bounds for them under pretty strong assumptions, they have also suggested the reasonable tactic of using unadjusted sampling during warmup, then switching to adjusted sampling afterward so that everything works as expected for longer runs.
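The switch itself is trivial to implement; here’s the shape of it (a sketch, assuming a hypothetical hmc_proposal hook that returns a proposed state and the change in the Hamiltonian):

```python
import numpy as np

def sample(init, hmc_proposal, num_warmup, num_draws, seed=0):
    """Unadjusted HMC during warmup, Metropolis-adjusted HMC afterward."""
    rng = np.random.default_rng(seed)
    theta, draws = init, []
    for it in range(num_warmup + num_draws):
        proposal, delta_h = hmc_proposal(theta, rng)  # delta_h = H(proposed) - H(current)
        if it < num_warmup:
            theta = proposal                       # unadjusted: always move (biased)
        elif np.log(rng.uniform()) < -delta_h:     # adjusted: Metropolis accept/reject
            theta = proposal
        if it >= num_warmup:
            draws.append(theta)
    return draws
```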
Smoother adaptation of Nutpie
Nutpie, like NUTS before it, works in blocks: it runs a given number of MCMC iterations, then updates its mass matrix estimate. I’ve developed a smooth alternative that simply exponentially discounts the past at a decreasing rate to mimic the block structure of Stan’s warmup. This seems to work very well and doesn’t suffer from Nutpie’s problem of never converging due to a finite cap on block size (Adrian’s working on changing the version in Nutpie to get around the initial design of 100-iteration blocks; Stan works with a sequence of blocks that doubles in size).
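The details aren’t written up yet, so take this as a sketch of the flavor rather than the actual update I’m using: an exponentially weighted mean and variance whose weight on the newest draw shrinks as the run goes on, so the effective window keeps growing the way doubling blocks would.

```python
import numpy as np

class DiscountedMoments:
    """Running mean/variance with a forgetting factor that decays over time,
    so the effective window keeps growing (a smooth stand-in for doubling
    adaptation blocks).  The window schedule here is illustrative only."""

    def __init__(self, dim, base_window=25.0):
        self.n = 0
        self.base_window = base_window
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)

    def update(self, x):
        self.n += 1
        # Weight on the newest draw is 1 / (effective window size), and the
        # effective window grows linearly with the iteration count.
        w = 1.0 / min(self.n, self.base_window + 0.5 * self.n)
        delta = x - self.mean
        self.mean = self.mean + w * delta
        self.var = (1.0 - w) * (self.var + w * delta * delta)
```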
Adam replacement for dual averaging
Stan uses the dual averaging stochastic gradient descent (SGD) algorithm to set its step size. The short NUTS paper never spells out that dual averaging is an SGD algorithm, because it never gives you the objective and its gradient. If you work backwards from the dual averaging algorithm, you can deduce that Hoffman and Gelman used a normal target (i.e., squared error), the gradient of which is the negative difference between the observed acceptance statistic and the target acceptance rate. This gives you simple stochastic gradients you can plug into any SGD optimizer. I’ve only just done this this week, but I’ve already found that Adam converges much faster, is less variable after convergence, and approaches convergence more monotonically in the short tests I’ve done so far. I had to add an update discounting factor to Adam; otherwise you get the terrible oscillation rather than convergence that both dual averaging and Adam are known for.
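Concretely, if alpha_t is the acceptance statistic observed at iteration t and delta is the target (0.8 by default in Stan), the stochastic gradient of the squared-error objective with respect to the log step size is proportional to delta - alpha_t, which you can feed straight into Adam. Here’s a sketch under those assumptions (run_iteration is a hypothetical hook, and the (1 + t)^(-kappa) factor is my guess at the kind of update discounting needed; the actual scheme may differ):

```python
import numpy as np

def adapt_step_size(run_iteration, num_adapt, eps0=1.0, target_accept=0.8,
                    lr=0.05, beta1=0.9, beta2=0.999, kappa=0.75, tiny=1e-8):
    """Adapt the log step size with Adam on 0.5 * (target_accept - accept_stat)^2.
    run_iteration(eps) runs one sampler iteration at step size eps and returns
    its average acceptance statistic."""
    log_eps, m, v = np.log(eps0), 0.0, 0.0
    for t in range(1, num_adapt + 1):
        accept_stat = run_iteration(np.exp(log_eps))
        g = target_accept - accept_stat        # stochastic gradient in log step size
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias corrections
        v_hat = v / (1 - beta2 ** t)
        log_eps -= lr * (1 + t) ** -kappa * m_hat / (np.sqrt(v_hat) + tiny)
    return np.exp(log_eps)
```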
Concurrent adaptation and convergence monitoring
This is a big one that Andrew’s been asking for for years. I’ve finished the online R-hat monitor (it’s in a branch of that name in the Walnuts repo on flatironinstitute). It uses the C++11 threads library to build an asynchronous, non-blocking monitor. The chains roll along as usual, each within its own thread, but they accumulate their within-chain means and variances via Welford’s algorithm. They publish these in a per-chain atomic using relaxed memory guarantees (hence the lack of blocking). There is also a monitor thread running that continually reads the per-chain means and variances and computes R-hat (the original one, not the split or ranked versions currently used in posterior (R) and ArviZ (Python)). I am able to run at least 100 R-hat checks per second without noticeably slowing down the chains, so it can detect convergence within a few iterations of when it happens, even for very fast models. This all works blazingly fast on my new Mac Studio server running 16 chains in parallel(*). Next up, I have to monitor adaptation. There’s no pre-built solution for that, but I’ll be using something like R-hat to monitor whether the mass matrix and step sizes have converged (on a log scale, which makes the monitoring respect the positive-definite manifold structure and operate scale-free). And I’ll have to use a double-buffered store rather than an atomic, because the vectors required for monitoring convergence of the mass matrix are D-dimensional and hence not trivially copyable.
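The threading is what lives in the repo, but the statistical core is tiny. Here’s a Python sketch of that core (per-chain Welford accumulators plus the original R-hat computed from the published per-chain means and variances; this is the idea, not the C++ code):

```python
import numpy as np

class Welford:
    """Streaming mean and variance for a single chain (Welford's algorithm)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1)

def rhat(chains):
    """Original potential scale reduction factor (not split, not rank-normalized)
    from per-chain means and variances, assuming equal chain lengths n."""
    n = chains[0].n
    means = np.array([c.mean for c in chains])
    variances = np.array([c.variance() for c in chains])
    w = variances.mean()           # within-chain variance
    b = n * means.var(ddof=1)      # between-chain variance (scaled by n)
    var_plus = (n - 1) / n * w + b / n
    return np.sqrt(var_plus / w)
```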
(*) Mac ARM hardware
Not an algorithmic change, but … if you haven’t gotten the memo yet, the new Mac ARM chips are very well suited to parallel sampling like we do in Stan. Their memory architecture is much more tightly integrated with the CPU and more multiplexed than Intel’s. This is great for sampling and other tasks where data and parameters are being hit asynchronously in memory, in parallel.
I just got a Mac Studio with 20 performance cores on the M3 Ultra chip and 256GB of memory. I would highly recommend this specific machine if you have US$6K to spend (your cost may vary; electronics seem to be way more expensive outside the U.S.). They also have a US$2K starter model, which should also be great compared to just about anything else. Even the MacBooks are good: the inexpensive Air I got a few years ago crushed my mega-expensive (albeit 8-year-old) iMac running Intel Xeon chips. People have told me Stan’s broken on Windows after comparing Windows and Mac performance, but really, it’s just the memory architecture. I didn’t do any controlled tests, but I swear that tests that were taking 3s now take 1-2s after upgrading to the Tahoe OS (a bunch of small UI changes that make pretty much no difference to my experience); Apple says they’re continuing to integrate more ARM goodness into their OS, so maybe that was it? The release notes don’t mention anything about heavily threaded performance.
I don’t know how long this is going to be useful. I plan to develop a lot of these algorithms in JAX to run on GPU, which is going to be way faster than anything you can do on the desktop.
TENTATIVE IMPROVEMENTS
These haven’t been well tested by us yet.
Generalized HMC
NUTS is not good for GPUs. Matt Hoffman et al. have a new paper on MCMC for modern hardware, which is going into the next edition of the MCMC handbook and explains why (it’s on arXiv for now). Hugh Dance et al. just wrote another paper showing how to code something like NUTS on GPU, but it’s still expensive. Something like generalized HMC promises to give much of the advantage of NUTS implicitly, without the elaborate recursive doubling structure. This is the kind of sampler that Matt Hoffman and his former colleague Pavel Sountsov developed (Matt left Google and doesn’t work on sampling any more, as far as I know). It’s also what Gilad Turok, Chirag Modi, and I found with delayed rejection. It also sidesteps the problem of having to chain all the adjustments for local step-size adaptivity, which can add up in a bad way for longer chains. It does add an additional tuning parameter, namely how much to partially refresh the momentum, which acts as a kind of proxy for the path length to a U-turn. It also has the advantage of being implicitly Rao-Blackwellized compared to HMC (you can average all the steps on an HMC path when computing expectations, but it’s usually not worth the storage; this may turn out to be the case for G-HMC too).
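For readers who haven’t run into it, the simplest form of generalized HMC (going back to Horowitz) is just: partially refresh the momentum, take a single leapfrog step, Metropolis-accept, and flip the momentum on rejection. A sketch with standard-normal momenta and an identity mass matrix:

```python
import numpy as np

def ghmc_step(q, p, log_p, grad_log_p, eps, gamma, rng):
    """One generalized HMC transition.  gamma in [0, 1) controls how much
    momentum is retained; gamma = 0 recovers one-step HMC."""
    # Partial refresh preserves the standard normal momentum distribution.
    p = gamma * p + np.sqrt(1.0 - gamma ** 2) * rng.standard_normal(p.shape)
    h0 = -log_p(q) + 0.5 * p @ p
    # One leapfrog step.
    p_new = p + 0.5 * eps * grad_log_p(q)
    q_new = q + eps * p_new
    p_new = p_new + 0.5 * eps * grad_log_p(q_new)
    h1 = -log_p(q_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < h0 - h1:
        return q_new, p_new        # accept
    return q, -p                   # reject: flip the momentum
```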
Local mass matrix adaptation
I think of this as the final frontier. If we could locally condition, we wouldn’t need variable step sizes. We don’t have anything that works for this in complex cases. In Stan, you get a choice of a diagonal or dense preconditioner, and in Nutpie you also get low-rank-plus-diagonal, but these are all global preconditioners, not local ones.
In very simple cases, we can use GIST to generate a mass matrix by taking an inverse Wishart sample around the negative Hessian. If you set the degrees of freedom correctly to get low variance, this pretty much perfectly preconditions a multivariate normal and should be a sound strategy for any log-concave density. It’s just expensive, even with everything Cholesky factored. But that’s not enough for a system like Stan. The problem in going to more general models is that the negative Hessian is no longer guaranteed to be positive definite (as it is for a log-concave density). This means we have to use something like Michael Betancourt’s SoftAbs technique to condition the Hessian back to positive definite (e.g., eigendecompose, push the negative eigenvalues up to positive values, and put the matrix back together, which is prohibitive because of the cubic cost). As an alternative, Nawaf and Tore have been looking at some explicit Riemannian integrators that only require a few Hessian-vector products, which are cheap with autodiff (linear in the dimension rather than quadratic). This avoids the problem of the implicit integrator in Riemannian HMC, which is an additional obstacle beyond the need for positive-definite metrics. We might go back to implicit midpoint, as Arya Pourzanjani explored in his thesis, because we can use GIST to set a step size at which we can guarantee stability; even so, the implicit nature of the algorithm is a real challenge for reversibility.
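For concreteness, here’s the eigendecompose-and-fix idea in NumPy, using Betancourt’s SoftAbs map lambda -> lambda * coth(alpha * lambda) as the eigenvalue regularizer (this is the cubic-cost route the paragraph warns about; it’s here just to make the idea concrete):

```python
import numpy as np

def softabs_metric(neg_hessian, alpha=1e6):
    """Condition a possibly indefinite (negative) Hessian into a positive-definite
    metric: eigendecompose and map each eigenvalue through the SoftAbs function
    lambda * coth(alpha * lambda), which smoothly approximates |lambda| and is
    bounded below by 1 / alpha.  Cost is cubic in the dimension."""
    lam, q = np.linalg.eigh(neg_hessian)
    with np.errstate(invalid="ignore"):
        soft = lam / np.tanh(alpha * lam)                    # lambda * coth(alpha * lambda)
    soft = np.where(np.isfinite(soft), soft, 1.0 / alpha)    # limit as lambda -> 0
    return q @ np.diag(soft) @ q.T
```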