Visualising Hamiltonian Monte Carlo
Markov-chain Monte Carlo methods (MCMC)
MCMC methods are popular tools for sampling from probability distributions. In physics we use these methods to sample the canonical distribution, which enables us to obtain information about properties of the system at distinct temperatures. In Bayesian statistics MCMC methods are used to obtain samples from a posterior distribution. Two of the most well-known and popular MCMC methods are Metropolis Monte Carlo (MMC) and Hamiltonian Monte Carlo (HMC). Both these versions of MCMC implement “importance sampling”, meaning they spend more time sampling regions of the system that have the highest probability, but HMC is vastly more efficient for more complicated multi-dimensional systems.
One day - on a long train journey - I thought it would be interesting to watch these samplers to see how they differ.
Instead of a full mathematical derivation of how these approaches work (which has been done many times in very many books and papers over the years) I’ll simply give a more qualitative picture of how they work.
Metropolis Monte Carlo
To start, a random state - or point - (\ x \) is chosen, which has a corresponding probability (\P(x)\). We propose a new state (\ x’ \) by simply adding a random perturbation to (\ x \), and then we then accept this proposed state with probability (\ P_a = P(x’)/P(x) \). We keep proposing, checking and conditionally-accepting sample states from the system for as long as we like, collecting useful statistics on quantities in the system as we go.
It doesn’t take long to realise how slow this can be - and in systems with multiple local minima, or very-tight “funnel”-like characteristics it can be prohibitively slow to converge. For instance if our state (\ x \) is a vector respresenting the positions of hundreds of atoms of a molecular structure it is highly likely that states generated by random perturbations will have a tiny acceptance probability. This means we would either need to perturb the system one atom at a time, and/or make very small steps. Both of these approaches lead to an autocorrelated trajectory.
Hamiltonian Monte Carlo
HMC was designed to solve some of the problems with the Metropolis algorithm, by allowing for far larger changes to the system while maintaining a high acceptance probability. Instead of directly perturbing the state vector directly, HMC assigns (\ x \) fictitious momenta (drawn randomly) and then propagates the state vector forward in fictitious-time using Hamilton’s equations of motion. We can write the Hamiltonian (total energy of the system in terms of kinetic (T) and potential (V) parts) as, (\ H(x, p) = T(p) + V(x) \), where (\x\), (\p\) are the positions and momenta respectively.
The new proposal state is then accepted or rejected with probability, .
There are a few details I’ve glossed over: the momenta a typically drawn from a normal distribution, and a symplectic integrator (e.g. leapfrog) must be used to propagate the system according to Hamilton’s equations. HMC was originally designed to be used in theoretical particle physics, and has been adopted by Bayesian modellers for finding posterior distributions of model parameters.
No free lunch: while HMC is an improvment over MMC for complex systems, it doesn’t solve all the issues with MCMC sampling. At very low temperatures HMC can still become trapped in a local minima, which leads to imcomplete sampling of the system. There are several approaches around this, including sampling multiple chains (this is what pymc3 does) and umbrella sampling approaches that try to “flatten” the energy surface to aid sampling. My current day job involves using an approach called nested-sampling to get around this problem.
So what does sampling look like?
See the video below! This shows how HMC samples a simple 1-d harmonic oscillator potential. The x-axis (position) can be thought of as the parameter-space of the system. The y-axis shows the fictitious velocity (i.e. momentum) at each sampling iteration.