1. Introduction
Techniques such as parallel tempering [Sugita1999, Cecchini2004, Bussi2006] and simulated tempering [Marinari1992] have become standard tools to accelerate the sampling of systems with complicated energy landscapes, mostly because of their effectiveness and their minimal requirement of detailed knowledge of the system’s specifics. In a nutshell, the basic idea of these methods is to temporarily raise the temperature of the system (or that of a copy thereof) to allow it to explore its energy landscape faster. When this is done in a specific way, the properties of the system at the physical temperature can still be calculated without bias. As a result, tempering techniques provide us with importance sampling schemes to analyze the statistical mechanics properties of a system at a given temperature even in situations where direct simulations at this temperature are too slow to be practical. The aim of this paper is to give some mathematical analysis of these methods via their formulation as stochastic switching processes, which will permit us to conveniently explore their properties in various parameter regimes. In so doing, we will also discuss implementation issues.
Consider first parallel tempering, in which one evolves multiple copies (or replicas) of the system at different temperatures which are occasionally swapped among these replicas (which is why the method is also known as replica exchange). Typically, these swaps are attempted at a given frequency ν with a Metropolis-Hastings acceptance/rejection criterion to guarantee that the replicas sample a known equilibrium distribution over which expectations at any of the temperatures in the stack (including the physical one) can be calculated. In this context it was found empirically [Sindhikara2008, Rosta2009, Sindhikara2010] that the larger the swapping frequency, the higher the sampling efficiency is. This observation was mathematically justified in the work of Dupuis et al. [Dupuis2012] by showing that the large deviation rate functional for the empirical measure is monotonically increasing with ν. This observation also motivated the development of infinite swapping replica exchange dynamics [Dupuis2012, Plattner2011, Lu2013, Doll2015], in which one attempts to simulate directly the limiting dynamics at ν = ∞.
How to do so in practice is nontrivial, however. When parallel tempering is applied to complicated systems, oftentimes a large number N of temperatures and replicas is needed to make the sampling efficient. In such cases, the simulation of the infinite swapping limit becomes challenging because the limiting dynamics involves coefficients given in terms of sums with as many terms as there are permutations of the replicas, N!. An exhaustive evaluation of such sums quickly becomes intractable, even numerically, and alternative strategies must be used. One possibility, proposed in [TangLuAbramsVE], is to resort to multiscale integrators like those in the heterogeneous multiscale methods (HMM) to effectively simulate the limiting dynamics.
One of the purposes of this paper is to present the mathematical foundation behind the algorithm proposed in [TangLuAbramsVE]. Specifically, we show that parallel tempering can be formulated as a diffusion of the replicas coupled with a Markov jump process in the space of permutations. Together these give a stochastic switching process. Using the approach introduced in [Dupuis2012], the infinite swapping limit can be justified by the monotonicity of the large deviation rate functional as a function of the swapping frequency. A large swapping frequency introduces a scale separation between the dynamics of the replicas and that of the permutation variable, and thus naturally calls for multiscale integrators, such as those based on HMM, which justifies the integrator introduced in [TangLuAbramsVE].
Similar ideas can also be used in the context of simulated tempering, in which a single replica is used but the temperature itself is made a dynamical variable of the system. In this setup too, it is useful to consider the coupled dynamics of the system and the temperature as a stochastic switching process, and this allows one to again justify via large deviation theory that making the temperature evolve infinitely fast (the ‘infinite switch limit’) is optimal. As we will show below, the limiting equation that emerges in this infinite-switch limit is easier to simulate in principle, as it only involves a sum over the number of possible temperature states rather than its factorial. In practice, however, the question becomes how to appropriately weight the different temperature states, since this weighting is an input of the method which is unknown a priori but can greatly influence its efficiency.
The organization of the remainder of this paper is as follows. In Section 2 we present a formulation of parallel tempering as a stochastic switching process, which facilitates the discussion of the infinite swapping limit and the development of integrators for the equations of motion that emerge in this limit. The infinite swapping limit is justified through large deviation theory in Section 3, and the effective equations that arise in this limit are derived in Section 4. In Section 5 we discuss the choice of the temperature ladder in parallel tempering and, in particular, explain why many temperatures are typically necessary to boost the efficiency of the method. In the infinite swapping limit, where the dynamics of the replicas effectively reduces to a diffusion over a mixture potential, this question reduces to the analysis of the geometrical properties of this potential. An HMM multiscale integrator that operates in the many-temperatures regime is then discussed in Section 6 and shown to perform better than the standard stochastic simulation algorithm in the regime of frequent swapping. In Section 7 we show that similar ideas can be extended to other tempering techniques such as simulated tempering. Finally, some concluding remarks are given in Section 8.
2. Parallel Tempering as a Stochastic Switching Process
Consider a system whose evolution is governed by the overdamped Langevin equation:
(1)   dX(t) = −∇V(X(t)) dt + √(2β⁻¹) dW(t),
where X(t) ∈ Ω denotes the instantaneous position (at time t) of the system with n particles, −∇V is the force associated with the potential V : Ω → ℝ, β = 1/(k_B T) is the inverse temperature, each component of W is an independent Wiener process, and we set the friction coefficient to one for simplicity. Most of the results we present below should be straightforwardly generalizable to other dynamics, such as the inertial Langevin equation, but we will focus on (1) for the sake of simplicity.
Assuming that V is smooth and Ω is compact, it is easy to show that the overdamped dynamics (1) is ergodic with respect to the Boltzmann equilibrium distribution with density
(2)   ρ_β(x) = Z_β⁻¹ e^{−βV(x)},
where Z_β = ∫_Ω e^{−βV(x)} dx. However, when the dimensionality of the system is large, n ≫ 1, and the potential is complicated, with many critical points, etc., (1) typically exhibits metastability, meaning that convergence to the equilibrium measure is very slow. In this context, the basic idea of parallel tempering is to use dynamics at higher temperatures to help the sampling.
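As an illustration of this metastability, the following minimal sketch (our own; the double-well potential and all function names are illustrative choices, not part of the methods discussed here) integrates (1) by the Euler-Maruyama scheme in one dimension. At low temperature the trajectory rarely hops between the wells, so time averages converge very slowly:

```python
import numpy as np

def overdamped_langevin(grad_V, beta, x0, dt=1e-3, n_steps=100_000, seed=0):
    """Euler-Maruyama discretization of dX = -grad V(X) dt + sqrt(2/beta) dW."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    traj = np.empty((n_steps, x.size))
    for k in range(n_steps):
        x = x - grad_V(x) * dt + np.sqrt(2.0 * dt / beta) * rng.standard_normal(x.size)
        traj[k] = x
    return traj

# Double-well potential V(x) = (x^2 - 1)^2, with minima at x = -1 and x = +1.
grad_V = lambda x: 4.0 * x * (x**2 - 1.0)

# At low temperature (large beta), hops between the wells are rare, so the
# fraction of time spent in the right well converges very slowly to 1/2.
traj = overdamped_langevin(grad_V, beta=8.0, x0=[-1.0])
frac_right = float(np.mean(traj > 0.0))
```

Raising the temperature (lowering beta) makes the hops frequent, which is precisely the mechanism tempering methods exploit.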
Specifically, we introduce multiple replicas of the system and make each replica evolve under a different temperature, with the temperatures occasionally swapped between the replicas (many artificial temperatures, plus the physical one). Assuming that we use N inverse temperatures β₁ ≥ β₂ ≥ ⋯ ≥ β_N > 0, so that parallel tempering correspondingly involves N replicas, this is done by constructing a stochastic process (X(t), σ(t)) on the extended phase space Ω^N × S_N, where Ω^N is the configurational space for the N replicas and S_N is the permutation group of the N temperature indices, such that its joint equilibrium measure is given by
(3)   ϖ(x, σ) = (N!)⁻¹ ρ_σ(x),
where x = (x₁, …, x_N) ∈ Ω^N and σ(i) denotes the index associated with i by the permutation σ ∈ S_N. To this end, we assume that given any σ ∈ S_N we have a diffusion with generator L_σ whose equilibrium density is the product
(4)   ρ_σ(x) = ∏_{i=1}^N Z_{β_{σ(i)}}⁻¹ e^{−β_{σ(i)} V(x_i)}.
We also assume that given any x ∈ Ω^N, we have a Markov jump process on S_N with infinitesimal jump intensity ν η(σ, σ′; x), where ν > 0 is an overall frequency parameter and η satisfies the detailed balance condition
(5)   η(σ, σ′; x) ρ_σ(x) = η(σ′, σ; x) ρ_{σ′}(x),
where
(6)   ρ_{σ′}(x)/ρ_σ(x) = exp( −∑_{i=1}^N (β_{σ′(i)} − β_{σ(i)}) V(x_i) ).
The generator for the stochastic switching process is then given by
(7)   (ℒ_ν f)(x, σ) = (L_σ f(·, σ))(x) + ν ∑_{σ′ ∈ S_N} η(σ, σ′; x) [ f(x, σ′) − f(x, σ) ],
where the subscript emphasizes the dependence of ℒ_ν on the overall swapping frequency ν. By construction, any stochastic switching process constructed this way has ϖ as its invariant measure.
Proposition 1.
For any attempt switching frequency ν > 0, ϖ is the invariant distribution of the process associated with ℒ_ν.
Proof.
It suffices to show that
(8)   ∑_{σ ∈ S_N} ∫_{Ω^N} (ℒ_ν f)(x, σ) ϖ(x, σ) dx = 0
for every suitable test function f. This follows from a direct calculation:
(9)   ∑_σ ∫_{Ω^N} (L_σ f(·, σ))(x) ρ_σ(x) dx + ν ∑_{σ, σ′} ∫_{Ω^N} η(σ, σ′; x) [ f(x, σ′) − f(x, σ) ] ρ_σ(x) dx = 0,
where the last equality uses the assumption that ρ_σ is stationary with respect to the dynamics corresponding to L_σ (so the first sum vanishes) and that η satisfies the detailed balance condition (5) (so the second sum vanishes after swapping the labels σ and σ′). ∎
It is easy to see that we can write ρ_σ(x) ∝ e^{−β₀ V̄(x,σ)} by defining the potential
(10)   V̄(x, σ) = β₀⁻¹ ∑_{i=1}^N β_{σ(i)} V(x_i),
where β₀ is a reference inverse temperature introduced for dimensional purposes: for example, we can take β₀ = β₁. Similarly, the conditional density (4) can be written in terms of V̄ as
(11)   ρ_σ(x) = Z̄⁻¹ e^{−β₀ V̄(x,σ)},   Z̄ = ∏_{i=1}^N Z_{β_i},
which can be sampled, for example, by the diffusion
(12)   dX_i(t) = −β₀⁻¹ β_{σ(i)} ∇V(X_i(t)) dt + √(2β₀⁻¹) dW_i(t),   i = 1, …, N.
We can also reexpress (6) as
(13)   ρ_{σ′}(x)/ρ_σ(x) = e^{−β₀ (V̄(x,σ′) − V̄(x,σ))}.
Thus, to ensure the detailed balance, we may choose η as
(14)   η(σ, σ′; x) = A_{σσ′} min( 1, e^{−β₀ (V̄(x,σ′) − V̄(x,σ))} ),
where the symmetric adjacency matrix A_{σσ′} ∈ {0, 1} indicates whether the permutation σ′ is allowed to be accessed from σ by a single jump. For example, we may restrict ourselves to transposition moves in which two random indices are swapped from the permutation σ, so that N(N − 1)/2 permutations σ′ are accessible from σ. Note that the choice (14) is not unique, and η can also take other forms that satisfy the detailed balance condition with respect to ρ_σ.
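A possible implementation of the Metropolis-type intensity (14) restricted to transposition moves can be sketched as follows (a minimal sketch of our own; the function name and data layout are assumptions, not prescriptions from the references). Note that the reference inverse temperature β₀ cancels from the acceptance ratio, cf. (13):

```python
import numpy as np

def eta(sigma, sigma_prime, x, betas, V):
    """Metropolis-type jump intensity (14) for transposition moves.

    sigma maps replica index i to temperature index sigma[i]; the adjacency
    factor A is 1 if sigma' differs from sigma by one transposition, else 0.
    """
    diff = [i for i in range(len(sigma)) if sigma[i] != sigma_prime[i]]
    if len(diff) != 2:                       # not a single transposition: A = 0
        return 0.0
    # Ratio (6): rho_sigma'/rho_sigma = exp(-sum_i (b_{s'(i)} - b_{s(i)}) V(x_i));
    # only the two swapped replicas contribute, and beta_0 cancels.
    dE = sum((betas[sigma_prime[i]] - betas[sigma[i]]) * V(x[i]) for i in diff)
    return min(1.0, float(np.exp(-dE)))

betas = [1.0, 0.5, 0.25]                     # beta_1 > beta_2 > beta_3
V = lambda xi: 0.5 * float(np.dot(xi, xi))   # a stand-in potential
x = [np.array([1.0]), np.array([2.0]), np.array([0.5])]
a = eta((0, 1, 2), (1, 0, 2), x, betas, V)   # swap temperatures of replicas 1, 2
b = eta((1, 0, 2), (0, 1, 2), x, betas, V)   # reverse move
```

With this choice, a · ρ_σ = b · ρ_{σ′} holds exactly, i.e. the detailed balance condition (5) is satisfied by construction.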
Calculation of Expectations
To estimate the expectation of a physical observable A(x) at any temperature in the ladder, i.e. 𝔼_{β_j}[A] for j = 1, …, N, we can use:
(15)   𝔼_{β_j}[A] = ∫_{Ω^N} ∑_{i=1}^N ω_{i,j}(x) A(x_i) ϖ̄(x) dx.
Here ϖ̄ is the marginal of ϖ on Ω^N,
(16)   ϖ̄(x) = (N!)⁻¹ ∑_{σ ∈ S_N} ρ_σ(x),
and we defined
(17)   ω_{i,j}(x) = ∑_{σ : σ(i) = j} ρ_σ(x) / ∑_{σ ∈ S_N} ρ_σ(x),
which is the probability that the i-th replica is at the j-th temperature conditional on the configuration being fixed at x. (15) means that we can estimate 𝔼_{β_j}[A] from a sample path of the stochastic switching process using ergodicity as
(18)   𝔼_{β_j}[A] = lim_{T→∞} (1/T) ∫₀^T ∑_{i=1}^N ω_{i,j}(X(t)) A(X_i(t)) dt.
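For small N, the weights defined in (17) can be evaluated by brute force over all N! permutations; a sketch (our own, with hypothetical function names) is given below. The normalization constants Z_{β_j} are σ-independent and cancel between numerator and denominator, so only the exponents are needed:

```python
import numpy as np
from itertools import permutations

def omega_weights(x, betas, V):
    """Weights omega_{i,j}(x) of (17) by exhaustive summation over S_N.

    Feasible only for small N, since the sums contain N! terms.
    """
    N = len(betas)
    E = np.array([V(xi) for xi in x])
    omega = np.zeros((N, N))
    Z = 0.0
    for sigma in permutations(range(N)):
        w = np.exp(-sum(betas[sigma[i]] * E[i] for i in range(N)))
        Z += w
        for i in range(N):
            omega[i, sigma[i]] += w       # replica i sits at temperature sigma[i]
    return omega / Z

betas = [2.0, 1.0, 0.5]
V = lambda xi: 0.5 * float(np.dot(xi, xi))
x = [np.array([0.3]), np.array([1.0]), np.array([2.0])]
om = omega_weights(x, betas, V)
# om is doubly stochastic: each replica occupies exactly one temperature, and
# each temperature hosts exactly one replica.
```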
The following proposition shows that the variance of the parallel tempering estimator for the expectation of an observable at a given temperature is bounded by the variance of the observable at that temperature.
Proposition 2.
For each j = 1, …, N we have
(19)   Var_ϖ( ∑_{i=1}^N ω_{i,j}(x) A(x_i) ) ≤ Var_{β_j}(A).
Proof.
Since ∑_{i=1}^N ω_{i,j}(x) A(x_i) is the conditional expectation, under ϖ, of the observable at temperature β_j given the configuration x, the bound (19) follows from Jensen’s inequality. ∎
Note that since the bound (19) follows from Jensen’s inequality, we expect it to be sharp for generic observables. Therefore, the efficiency of the sampling scheme is determined by the speed of convergence to equilibrium of the parallel tempering dynamics, as we will discuss in the following sections.
3. Large Deviation Principle for the Empirical Measure of the Stochastic Switching Process
In this section we derive a large deviation principle for the empirical measure of the stochastic switching process (X(t), σ(t)), that is:
(23)   μ_T = T⁻¹ ∫₀^T δ_{(X(t), σ(t))} dt.
By construction of the stochastic switching process, its infinitesimal generator splits into the two parts (see (7)):
(24)   ℒ_ν = ℒ_d + ν ℒ_s,
where ℒ_d is the diffusion part, (ℒ_d f)(x, σ) = (L_σ f(·, σ))(x), and ℒ_s is the jump part, (ℒ_s f)(x, σ) = ∑_{σ′} η(σ, σ′; x) [ f(x, σ′) − f(x, σ) ].
As a consequence, the large deviation rate functional for the empirical measure also has an additive structure:
Proposition 3.
The empirical measure (23) satisfies a large deviation principle with rate functional given by
(25)   I_ν(μ) = I_d(μ) + ν I_s(μ),
where μ is a probability measure on Ω^N × S_N. In particular, both I_d and I_s are independent of ν. While the specific form of I_d depends on the choice of the diffusion generators L_σ, I_s is fully determined by the jump intensity η as:
(26)   I_s(μ) = ∑_{σ ≠ σ′} ∫_{Ω^N} [ η(σ, σ′; x) θ(σ, x) − √( η(σ, σ′; x) η(σ′, σ; x) θ(σ, x) θ(σ′, x) ) ] dx,
where we have assumed that μ(·, σ) is absolutely continuous with respect to the Lebesgue measure on Ω^N and denote θ(σ, x) = dμ(·, σ)/dx.
The additive structure was first observed in the work of Dupuis et al. [Dupuis2012, Plattner2011] in the context of parallel tempering.
Proof.
The additive structure comes from the following representation of the rate functional [DonskerVaradhan:75, DupuisLiu:2015]:
(27)   I_ν(μ) = −inf_{u > 0} ∑_σ ∫_{Ω^N} ( (ℒ_ν u)(x, σ) / u(x, σ) ) μ(dx, σ),
whose diffusion and jump parts can be treated separately, the jump part being I_s(μ) = −inf_{u > 0} ∑_σ ∫ ( (ℒ_s u)/u ) dμ. Using the definition of ℒ_s, we calculate
(28)   −∑_σ ∫ ( (ℒ_s u)/u ) dμ = ∑_{σ, σ′} ∫_{Ω^N} η(σ, σ′; x) [ 1 − u(x, σ′)/u(x, σ) ] θ(σ, x) dx.
Since η satisfies the detailed balance condition, the infimum is attained at u(x, σ) = √( θ(σ, x)/ρ_σ(x) ), for which
(29)   ∑_{σ, σ′} ∫ η(σ, σ′; x) √( ρ_σ(x)/ρ_{σ′}(x) ) √( θ(σ, x) θ(σ′, x) ) dx = ∑_{σ, σ′} ∫ √( η(σ, σ′; x) η(σ′, σ; x) θ(σ, x) θ(σ′, x) ) dx,
where we used (5) and swapped the indices σ and σ′ to get the second equality. Combined with the previous identity, we get
(30)   −∑_σ ∫ ( (ℒ_s u)/u ) dμ = ∑_{σ ≠ σ′} ∫ [ η(σ, σ′; x) θ(σ, x) − √( η(σ, σ′; x) η(σ′, σ; x) θ(σ, x) θ(σ′, x) ) ] dx.
This is exactly I_s defined in (26). ∎
4. Infinite Swapping Limit
As I_s is nonnegative, it is natural to consider taking the limit ν → ∞ to maximize the rate functional, which we refer to as the infinite swapping limit. To derive the effective dynamics that emerges in this limit, consider the forward Kolmogorov equation for the stochastic switching process (X(t), σ(t)):
(31)   ∂_t p(x, σ, t) = L_σ† p(x, σ, t) + ν ∑_{σ′} [ η(σ′, σ; x) p(x, σ′, t) − η(σ, σ′; x) p(x, σ, t) ],
where L_σ† denotes the adjoint of L_σ.
To take the limit ν → ∞, let us introduce the ansatz
(32)   p = p₀ + ν⁻¹ p₁ + ν⁻² p₂ + ⋯.
Matching orders of ν, we get
(33)   at order ν:   ∑_{σ′} [ η(σ′, σ; x) p₀(x, σ′, t) − η(σ, σ′; x) p₀(x, σ, t) ] = 0,
(34)   at order 1:   ∂_t p₀(x, σ, t) = L_σ† p₀(x, σ, t) + ∑_{σ′} [ η(σ′, σ; x) p₁(x, σ′, t) − η(σ, σ′; x) p₁(x, σ, t) ],
and similarly for the higher order terms in the expansion. The leading order (33) reads
(35)   ∑_{σ′} η(σ′, σ; x) p₀(x, σ′, t) = ( ∑_{σ′} η(σ, σ′; x) ) p₀(x, σ, t),
which implies that, since by construction ρ_σ(x) / ∑_{σ′} ρ_{σ′}(x) is, for each fixed x, the invariant measure of the jump process associated with the intensity η(·, ·; x),
(36)   p₀(x, σ, t) = p̄(x, t) ρ_σ(x) / ∑_{σ′ ∈ S_N} ρ_{σ′}(x)
for some p̄(x, t) to be determined.
for some to be determined. Substitution this into the next order gives
(37) 
Summing over , we obtain the following solvability condition for this equation (recall that )
(38) 
where the last identity defines , the infinitesimal process of the “averaged process” in the infinite swapping limit.
The above asymptotic derivation can be made rigorous as is usual in averaging problems, and we conclude:
Proposition 4.
As ν → ∞, X(t) converges to the averaged process X̄(t) defined by the infinitesimal generator ℒ̄.
Let us now make these formulas explicit. Since
(39)   L_σ† p = ∑_{i=1}^N ∇_{x_i} · ( β₀⁻¹ β_{σ(i)} ∇V(x_i) p + β₀⁻¹ ∇_{x_i} p ),
we have
(40)   ℒ̄† p̄ = ∑_{i=1}^N ∇_{x_i} · ( β₀⁻¹ β̄_i(x) ∇V(x_i) p̄ + β₀⁻¹ ∇_{x_i} p̄ ),
where the last equality uses ∑_σ ρ_σ(x) / ∑_{σ′} ρ_{σ′}(x) = 1 for any x. Therefore, the corresponding stochastic differential equations are given by
(41)   dX_i(t) = −β₀⁻¹ β̄_i(X(t)) ∇V(X_i(t)) dt + √(2β₀⁻¹) dW_i(t),   i = 1, …, N,
with the scaling factors given by
(42)   β̄_i(x) = ∑_{σ ∈ S_N} β_{σ(i)} ρ_σ(x) / ∑_{σ ∈ S_N} ρ_σ(x).
The dynamics we get after averaging is similar to the original overdamped dynamics at inverse temperature β₀, except that the drift on the i-th replica is scaled by the factor β₀⁻¹ β̄_i(x). Notice that in terms of the factors ω_{i,j} defined in (17), we can write (42) as
(43)   β̄_i(x) = ∑_{j=1}^N β_j ω_{i,j}(x).
Finally, note that (41) can be written as
(44)   dX_i(t) = −∇_{x_i} V̄_mix(X(t)) dt + √(2β₀⁻¹) dW_i(t),
where we defined the mixture potential
(45)   V̄_mix(x) = −β₀⁻¹ log( (N!)⁻¹ ∑_{σ ∈ S_N} e^{−∑_{i=1}^N β_{σ(i)} V(x_i)} ).
As a result, the convergence properties of parallel tempering in the infinite swapping limit (and the boost in efficiency this method provides on vanilla sampling via (1)) can be analyzed via examination of the geometrical properties of . This is what we will do in the next section on an example.
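A minimal sketch (ours; all function names are hypothetical) of an Euler-Maruyama discretization of the limiting dynamics (41), with the factors β̄_i(x) of (42) evaluated by exhaustive summation over S_N (hence only practical for small N), reads:

```python
import numpy as np
from itertools import permutations

def beta_bar(E, betas):
    """Averaged inverse temperatures beta_bar_i of (42), with E_i = V(x_i).

    Exhaustive summation over S_N; the largest exponent is subtracted
    before exponentiation for numerical stability.
    """
    N = len(betas)
    perms = list(permutations(range(N)))
    logw = np.array([-sum(betas[s[i]] * E[i] for i in range(N)) for s in perms])
    w = np.exp(logw - logw.max())
    w /= w.sum()
    bb = np.zeros(N)
    for w_s, s in zip(w, perms):
        for i in range(N):
            bb[i] += w_s * betas[s[i]]
    return bb

def infinite_swapping_step(x, betas, grad_V, V, dt, rng, beta0):
    """One Euler-Maruyama step of the averaged dynamics (41)."""
    E = np.array([V(xi) for xi in x])
    bb = beta_bar(E, betas)
    return [xi - (bb[i] / beta0) * grad_V(xi) * dt
            + np.sqrt(2.0 * dt / beta0) * rng.standard_normal(xi.size)
            for i, xi in enumerate(x)]

rng = np.random.default_rng(1)
betas = [4.0, 2.0, 1.0]
beta0 = betas[0]                                  # reference inverse temperature
V = lambda xi: (float(xi[0]) ** 2 - 1.0) ** 2     # 1d double-well test potential
grad_V = lambda xi: 4.0 * xi * (xi ** 2 - 1.0)
x = [np.array([-1.0]), np.array([0.0]), np.array([1.0])]
for _ in range(1000):
    x = infinite_swapping_step(x, betas, grad_V, V, 1e-3, rng, beta0)
```

Note that ∑_i β̄_i(x) = ∑_i β_i for every x, so the averaged dynamics redistributes, but never creates or destroys, the inverse temperatures among the replicas.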
5. Efficiency analysis in the harmonic case
From the discussion above, we know that, given any set of temperatures, the optimal way to operate parallel tempering is in fact to simulate the effective equation (41) that emerges in the ν → ∞ limit. This is clearly doable to the extent that we can calculate the factors β̄_i(x) defined in (42). Since these factors contain sums over S_N, i.e., over N! terms, calculating them by exhaustive evaluation of these sums is only an option if the number of temperatures N remains moderately low. In practice, however, one often needs to use N ≫ 1, for which these exhaustive calculations are not doable. How to proceed in these situations will be explained in Section 6. The question we address here is why there is a practical need to use a rather large number of temperatures rather than, say, N = 2. This will also allow us to investigate how to choose the temperature ladder in the method.
Insight about this issue can be gained by looking at the special case when the potential is harmonic,
(46)   V(x) = ½ k |x|²,   x ∈ ℝ^d,
in which case the canonical density reduces to the Gaussian
(47)   ρ_β(x) = ( βk/2π )^{d/2} e^{−βk|x|²/2}.
Even though this example is very simple, it has the advantage of being amenable to analysis, and it illustrates difficulties that are shared with more realistic (and complex) systems. We can always set k = 1 by rescaling x, so let us focus on this case here. As we have seen in Proposition 2, the variance of the estimator for parallel tempering is controlled by the variance of the observable, and thus we shall just focus on the convergence of the dynamics to the invariant measure.
5.1. The case with two temperatures
Let us first consider the situation where we only add one artificial temperature, so that we have two replicas at inverse temperatures β₁ and β₂ < β₁. In this case, the effective equation we obtain in the ν → ∞ limit reads
(48)   dX_i(t) = −β₀⁻¹ β̄_i(X(t)) X_i(t) dt + √(2β₀⁻¹) dW_i(t),   i = 1, 2,
where, with E_i = ½|x_i|²,
(49)   β̄₁(x) = ( β₁ e^{−(β₁E₁+β₂E₂)} + β₂ e^{−(β₂E₁+β₁E₂)} ) / ( e^{−(β₁E₁+β₂E₂)} + e^{−(β₂E₁+β₁E₂)} ),   β̄₂(x) = β₁ + β₂ − β̄₁(x).
Therefore, we can write down a closed evolution equation for the law of the radii R₁ = |X₁| and R₂ = |X₂|. A few simple manipulations show that this equation can be written as
(50)   ∂_t ψ(r₁, r₂, t) = ∑_{i=1,2} ∂_{r_i} ( ∂_{r_i} G(r₁, r₂) ψ + β₀⁻¹ ∂_{r_i} ψ ),
where
(51)   G(r₁, r₂) = V̄_mix(x) − (d − 1) β₀⁻¹ ( log r₁ + log r₂ ),   r_i = |x_i|,
with V̄_mix the mixture potential defined in (45) and the logarithmic terms accounting for the radial entropy.
Writing compactly R = (R₁, R₂), this equation is of the form
(52)   dR(t) = −∇G(R(t)) dt + √(2β₀⁻¹) dW(t),
with W a two-dimensional Wiener process, which indicates that its invariant density is proportional to e^{−β₀ G}, i.e. it is given by
(53)   ψ_∞(r₁, r₂) = C⁻¹ e^{−β₀ G(r₁, r₂)},
where the normalization constant C is chosen so that ∫₀^∞ ∫₀^∞ ψ_∞(r₁, r₂) dr₁ dr₂ = 1. Thus, (50), and equivalently (52), describe a diffusion on the energy landscape
(54)   G(r₁, r₂) = −β₀⁻¹ log( e^{−(β₁r₁² + β₂r₂²)/2} + e^{−(β₂r₁² + β₁r₂²)/2} ) − (d − 1) β₀⁻¹ ( log r₁ + log r₂ ).
As long as , this landscape possesses two minima with a saddle point in between (see Fig. 1).
For large d, the minima are approximately (that is, to leading order in d) located at
(55)   (r₁², r₂²) = (d/β₁, d/β₂)   and   (r₁², r₂²) = (d/β₂, d/β₁),
with energy approximately given by G_min ≈ β₀⁻¹ ( d − (d/2) log( d²/(β₁β₂) ) ), and the saddle point is approximately located at
(56)   r₁² = r₂² = 2d/(β₁ + β₂),
with energy approximately given by G_saddle ≈ β₀⁻¹ ( d − d log( 2d/(β₁ + β₂) ) ). Therefore, for large d, we can use results from large deviation theory to deduce that the smallest nonzero eigenvalue λ₁ of the generator of (50) (that is, roughly, the inverse of the mean first passage time between the minima) is log-asymptotically given by
(57)   λ₁ ≍ e^{−β₀ (G_saddle − G_min)} = ( 2√(β₁β₂) / (β₁ + β₂) )^d.
Therefore, for any fixed β₁ ≠ β₂, the system becomes increasingly metastable as d gets larger. In particular, if we start with a configuration with (r₁², r₂²) ≈ (d/β₁, d/β₂), it will take an exponentially (in d) long time for the dynamics to reach the configurations with (r₁², r₂²) ≈ (d/β₂, d/β₁). This means exponentially slow convergence to equilibrium, and shows that using two temperatures is not sufficient in general.
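The locations and heights above can be checked numerically; the following sketch (ours, with illustrative values of d, β₁, β₂, and β₀ set to 1) evaluates the landscape (54) at the approximate minima and saddle point and compares the resulting barrier with the leading-order prediction entering (57):

```python
import numpy as np

def G(r1, r2, b1, b2, d):
    """Radial free-energy landscape (54) for the harmonic two-temperature case
    (units with beta_0 = 1 and k = 1)."""
    mix = np.logaddexp(-(b1 * r1**2 + b2 * r2**2) / 2.0,
                       -(b2 * r1**2 + b1 * r2**2) / 2.0)
    return -mix - (d - 1) * (np.log(r1) + np.log(r2))

b1, b2, d = 1.0, 0.25, 50
r_min = (np.sqrt(d / b1), np.sqrt(d / b2))    # approximate minimum (55)
r_sad = np.sqrt(2.0 * d / (b1 + b2))          # approximate saddle point (56)

barrier = G(r_sad, r_sad, b1, b2, d) - G(r_min[0], r_min[1], b1, b2, d)
predicted = d * np.log((b1 + b2) / (2.0 * np.sqrt(b1 * b2)))
# The two numbers agree to leading order in d; the exact landscape carries
# O(log d) corrections from the entropy term and a log 2 at the saddle.
```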
5.2. The case with many temperatures
Consider now what happens with multiple temperatures, in which case a similar analysis as in the previous section can be done. In particular, the mixture potential (for the energies) has local minima at
(58)   r_i² = d/β_{σ(i)},   i = 1, …, N,
for every σ ∈ S_N, whose energies scale as
(59)   G_min ≈ β₀⁻¹ ( Nd/2 − (d/2) ∑_{i=1}^N log( d/β_i ) ),
independently of σ. The saddle points are located at
(60)   r_i² = r_j² = 2d/(β_{σ(i)} + β_{σ(j)}),   r_l² = d/β_{σ(l)} for l ≠ i, j,
for any permutation σ and any pair (i, j), whose energies scale as
(61)   G_saddle(σ, i, j) ≈ G_min + β₀⁻¹ d log( (β_{σ(i)} + β_{σ(j)}) / (2√(β_{σ(i)} β_{σ(j)})) ).
Therefore the barriers scale as
(62)   ΔG_{ij} = β₀⁻¹ d log( (β_i + β_j) / (2√(β_i β_j)) ).
Assuming that β₁ > β₂ > ⋯ > β_N, this indicates that we are more likely to see transitions between minima lying at adjacent β_i’s, and in order to keep all the rates equal (if we use the Arrhenius formula, such that the rate is given by the exponential of minus the energy barrier), we must have
(63)   (β_i + β_{i+1}) / (2√(β_i β_{i+1})) = const
for all i = 1, …, N − 1. This agrees with the analysis based on transition state theory performed in our previous work [Lu2013], and shows that the optimal choice of temperatures on the ladder should satisfy
(64)   β_i / β_{i+1} = const,
i.e. the ladder should be geometric, and that this ratio should be of order 1 + O(d^{−1/2}) to keep the barriers in (62) of order one. This indicates that using multiple temperatures removes the energy barriers and hence makes convergence much faster.
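The prescription (63)-(64) can be illustrated with a short numerical sketch (ours; the helper names are hypothetical): a geometric ladder equalizes the leading-order barriers (62) between adjacent temperatures:

```python
import numpy as np

def geometric_ladder(beta_cold, beta_hot, N):
    """Inverse-temperature ladder with constant ratio beta_i/beta_{i+1}, cf. (64)."""
    return beta_cold * (beta_hot / beta_cold) ** (np.arange(N) / (N - 1))

def barrier(bi, bj, d):
    """Leading-order barrier (62) between minima differing by swapping i and j
    (in units where beta_0 = 1)."""
    return d * np.log((bi + bj) / (2.0 * np.sqrt(bi * bj)))

d = 100
betas = geometric_ladder(1.0, 0.1, 6)   # beta_1 = 1 (cold) down to beta_6 = 0.1 (hot)
barriers = [barrier(betas[i], betas[i + 1], d) for i in range(len(betas) - 1)]
# All adjacent barriers coincide for a geometric ladder.
```

Shrinking the common ratio toward 1 + O(d^{-1/2}), at the cost of a larger N, brings every one of these identical barriers down to order one.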
6. Simulation of the stochastic switching dynamics
6.1. Direct simulation based on stochastic simulation algorithm
Here we present an exact implementation of the stochastic switching dynamics, assuming that we can integrate the Markov process corresponding to L_σ exactly. In practice, this integration often needs to be discretized, which introduces some additional errors. We will discuss these later.
Given that at time t the system is in the state (X(t), σ(t)) = (x, σ), the algorithm updates the state via the following steps:

1. Draw a random number r uniformly in (0, 1);

2. Integrate (12) with σ fixed up to the time t + δt such that
∫_t^{t+δt} ν ∑_{σ′ ≠ σ} η(σ, σ′; X(s)) ds = −log r,
where r is the random number drawn in Step 1;

3. Choose the new permutation σ′ with probability
η(σ, σ′; X(t + δt)) / ∑_{σ″ ≠ σ} η(σ, σ″; X(t + δt)),
and set σ(t + δt) = σ′.
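The steps above can be sketched in code as follows (our own illustration; in particular, the time integral in Step 2 is here accumulated with the same Euler-Maruyama step used for (12), a first-order approximation rather than an exact integration):

```python
import numpy as np

def transposition_rates(x, sigma, betas, V):
    """All single-transposition moves sigma -> sigma' with Metropolis rates (14)."""
    N = len(sigma)
    out = []
    for i in range(N):
        for j in range(i + 1, N):
            sp = list(sigma)
            sp[i], sp[j] = sp[j], sp[i]
            dE = ((betas[sp[i]] - betas[sigma[i]]) * V(x[i])
                  + (betas[sp[j]] - betas[sigma[j]]) * V(x[j]))
            out.append((tuple(sp), min(1.0, float(np.exp(-dE)))))
    return out

def ssa_switching_step(x, sigma, t, betas, V, grad_V, nu, dt, rng, beta0):
    """Advance (X, sigma) until the next swap: integrate (12) with sigma frozen
    while accumulating the total jump intensity, then perform one swap."""
    threshold = -np.log(1.0 - rng.random())            # Step 1
    accumulated = 0.0
    rates = transposition_rates(x, sigma, betas, V)
    while accumulated < threshold:                     # Step 2
        for i in range(len(betas)):                    # Euler-Maruyama step of (12)
            x[i] = (x[i] - (betas[sigma[i]] / beta0) * grad_V(x[i]) * dt
                    + np.sqrt(2.0 * dt / beta0) * rng.standard_normal(x[i].size))
        t += dt
        rates = transposition_rates(x, sigma, betas, V)
        accumulated += nu * sum(r for _, r in rates) * dt
    moves, probs = zip(*rates)                         # Step 3
    probs = np.array(probs) / sum(probs)
    sigma = moves[rng.choice(len(moves), p=probs)]
    return x, sigma, t

rng = np.random.default_rng(2)
betas = [2.0, 1.0, 0.5]
V = lambda xi: 0.5 * float(np.dot(xi, xi))
grad_V = lambda xi: xi                                 # harmonic test potential
x = [np.array([0.5]), np.array([1.0]), np.array([1.5])]
x, sigma, t = ssa_switching_step(x, (0, 1, 2), 0.0, betas, V, grad_V,
                                 nu=5.0, dt=1e-3, rng=rng, beta0=2.0)
```

Note that as ν grows, the time δt between swaps shrinks like 1/ν, which is why direct simulation becomes expensive in the frequent-swapping regime and multiscale integrators become attractive.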