Chain
The chain functions actually simulate the random walk. This is where the incremental priors are computed, and where the burn-in is implemented.
As described in the state tracking paper (Abeysuriya et al., 2015), the MCMC system uses an adaptive proposal distribution. As such, the fit takes place in three stages:
- An initial sample to generate the first approximation of the covariance matrix
- A burn-in period during which samples are taken but discarded. During this time, the covariance matrix adapts rapidly
- A long period of actual sampling. The covariance matrix still adapts during this time, but changes are much slower than at the start
We now go through a call to chain.m to show how the algorithm is implemented. Typically, users do not interact with chain() directly - rather, it is expected to be called via fit_spectrum() or similar.
The call to chain() is:
chain(model,initial_values,n_points,debugmode,timelimit,n_burnin)
These are:

- `model` is of course the model object being used for the fit. Note that the model is expected to have had `model.prepare_for_fit` already run, which means that the target experimental spectrum is already stored within the model by the time it arrives in `chain()`
- `initial_values` is a vector of parameter values. It is critical that the initial parameters are valid i.e. not outside the parameter limits, not unstable, or otherwise anything that would cause `model.probability()` to return a non-finite number for the initial values
- `n_points` is the length of the chain, as a number. If the fit is time-limited, this is the upper bound for the chain length
- `debugmode` is a flag that will plot the trace plot if set to `true`. The reason for this is that the `chain()` function returns only the requested samples, not the discarded burn-in points. That is, if you request 10000 points, `chain()` will take about 11000 steps and return the last 10000. This means that from outside `chain()`, you cannot plot the initial part of the chain showing the initial convergence to the stationary distribution. `debugmode` plots the entire chain before the function returns
- `timelimit` is the maximum allowed time for the fit. If set to `false`, then the number of points specified in `n_points` will be generated. If set to a number, then `chain()` will still attempt to generate `n_points` worth of samples, but will stop immediately when the time limit is reached
- `n_burnin` explicitly specifies how many burn-in samples to use. If not provided, it is set to 10% of the requested chain length (a sketch of this default appears after the list). However, when performing a parallel run, each worker runs `chain()` with a subset of the total number of points, so 10% of the chain length on each worker would be less than 10% of the output series. Without correcting for this, the results from a parallel run could differ from a serial run
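For reference, the default burn-in behaviour described for `n_burnin` could be implemented along the lines of the sketch below; the `nargin` test is an assumed mechanism for detecting a missing argument, not a copy of the actual code.

```matlab
% Sketch only: applying the 10% default for n_burnin when the caller omits it
if nargin < 6 || isempty(n_burnin)
    n_burnin = round(0.1*n_points); % default burn-in: 10% of the requested chain length
end
```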
Several quantities are then populated or preallocated (a sketch follows the list). These are:

- `n_initial`, which sets how many samples should be taken using the model's initial step size (with zero covariance) for the proposal distribution, after which the adaptive proposal distribution is used
- `n_burnin` is calculated, which is the total number of points from the start of the chain that will be discarded. This number includes `n_initial`
- `out`, which stores all of the steps in the random walk, including both the burn-in points and the actual sample points
- `final_posterior`, which stores the value of the posterior for the corresponding parameters in `out`
- `accept`, which is a counter that keeps track of how many proposed steps have been accepted. This is a useful quantity to examine when choosing the initial step size for a model (as a general rule, an acceptance rate in the vicinity of 10-30% is good; the routine will work with higher or lower rates, but may require more samples to generate suitably valid results)
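A rough illustration of this bookkeeping is shown below; the array shapes are inferred from the description above rather than copied from `chain.m`.

```matlab
% Sketch of the preallocation described above (sizes inferred, not verbatim):
out             = zeros(n_burnin + n_points, model.n_params); % burn-in + output samples
final_posterior = zeros(n_burnin + n_points, 1);              % posterior for each row of out
accept          = 0;                                          % number of accepted proposals so far
```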
Next, several quantities for the adaptive proposal distribution are defined. Full details and references are in (Abeysuriya et al., 2015). In summary, the idea is that the proposal distribution should have a similar covariance to the target distribution, to maximize the efficiency of the algorithm by taking larger steps in directions where the target distribution spans a wider range of values, and smaller steps in directions where the target distribution is narrower. On top of that, the width of the proposal distribution is scaled by a multiplicative constant. If the constant is small, the proposal distribution will generate small steps that are likely to be accepted. If the constant is large, then the proposal distribution will generate large steps that are less likely to be accepted. As mentioned above, for various reasons the MCMC algorithm is most efficient with an acceptance rate of about 20%. To achieve this, the multiplicative constant is also adaptive (this is what is meant by 'adaptive global scaling'), such that it increases if the acceptance rate is too high, and decreases if the rate is too low.
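One common way to implement this kind of adaptive global scaling is a diminishing-step update of `lambda` after every proposal. The sketch below illustrates the idea only; `gamma` is an assumed adaptation step size, and `alpha` (the acceptance ratio) and `j` (the step index) are introduced later in this walkthrough. The exact rule in `chain.m` may differ.

```matlab
% Illustrative adaptive global scaling update (not the exact rule in chain.m):
gamma  = 1/j;                                                % diminishing adaptation step size (assumed)
lambda = lambda*exp(gamma*(min(alpha,1) - target_accept));   % shrink lambda if accepting too rarely, grow it if too often
```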
Note that all of these ideas have been validated in analytic proofs in published papers. Changes to the algorithm can be expected to affect convergence, and ensuring that the correctness of the sampling is preserved is a highly nontrivial task. Modifications intended to improve the sampling efficiency of the algorithm must therefore be approached with extreme caution.
The quantities related to the proposal distribution are:

- `fixed_cov`, which ensures that the diagonal entries of the covariance matrix are nonzero, as discussed by Rosenthal on page 15
- `lambda`, which is the scaling factor, again discussed by Rosenthal on page 15. It depends on the dimension of the problem. For the corticothalamic model at least, the initial value of `lambda` leads to acceptance rates much lower than 20%. Therefore, we apply an additional rescaling by dividing `lambda` by 4. Because `lambda` is also adaptive, this should not significantly affect the algorithm, particularly since the theoretical value is chosen for no reason other than to achieve a particular acceptance rate
- `target_accept`, which is the acceptance rate that the adaptive global scaling seeks to achieve
- `alpha`, which is pre-initialized here (its role is described with the preallocated quantities below)
- `mu`, which stores the mean value of each parameter in the chain
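For concreteness, these quantities might look something like the sketch below. The specific constants are illustrative guesses based on the description above and on Rosenthal's recommendations, not the values actually used in `chain.m`.

```matlab
% Illustrative values only - the real constants in chain.m may differ:
lambda        = (2.38^2/model.n_params)/4; % dimension-dependent scaling per Rosenthal, divided by 4 as described
target_accept = 0.234;                     % acceptance rate targeted by the adaptive global scaling (~20%)
fixed_cov     = 1e-6*eye(model.n_params);  % small diagonal term keeping the covariance matrix nonsingular
```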
The following other quantities are also preallocated:

- `x` - at each step, the posterior probability is evaluated for the proposed set of parameters, and this is compared to the current set of parameters. `x` is a `2 x n` matrix (where `n` is the number of parameters) that stores the parameters for the current step (in the first row) and the parameters for the proposed step (in the second row)
- `alpha` - the proposed step is accepted based on the ratio of the proposed probability to the current probability, and this ratio is stored in `alpha`
- `initialization_stage` - tracks which part of the chain algorithm we are in. The possible stages are: `1` - generate initial samples, `2` - use the adaptive proposal distribution to generate burn-in samples, `3` - use the adaptive proposal distribution to generate samples for output
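Putting these together, the remaining setup might resemble the following sketch (the layout is taken from the description above, not copied from `chain.m`).

```matlab
% Sketch of the remaining setup, based on the description above:
x      = zeros(2, model.n_params); % row 1: current parameters, row 2: proposed parameters
x(1,:) = initial_values;           % the walk starts from the supplied initial values
initialization_stage = 1;          % 1 = initial samples, 2 = burn-in, 3 = output samples
```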
The chain is generated in a while loop, because if execution is time-limited then the number of output points generated is not fixed - although this could equivalently be written as a for loop, the while loop in this case seems cleaner. The while loop terminates when n_burnin+n_points have been computed, and the time limit is implemented as a break statement within the loop.
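Structurally, the loop looks something like the sketch below (simplified; the real loop also performs the proposal, acceptance, recording, and adaptation steps described next).

```matlab
% Simplified skeleton of the main sampling loop:
j = 1;
tic
while j <= n_burnin + n_points
    % ... propose a step, accept/reject it, record the sample, adapt the proposal ...
    if timelimit && toc > timelimit
        break % time limit reached: stop sampling immediately
    end
    j = j + 1; % j is incremented at the end of the loop (see the indexing note below)
end
```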
At each iteration, the first task is to generate the proposed step based on the proposal distribution. If `initialization_stage = 1`, then the proposed step is generated using
x(2,:) = x(1,:) + model.initial_step_size.*randn(1,model.n_params);
That is, it uses the current position (in `x(1,:)`) and adds to it a perturbation sampled from a normal distribution that is independent for each parameter, with standard deviations given by `model.initial_step_size`. In `chain.m`, the new point is generated by taking a random step in all of the parameters simultaneously. Other algorithms involve changing one parameter at a time and keeping the others fixed, but there seems to be no strong reason to prefer either method, except that perturbing all parameters simultaneously eliminates a nested loop.
After generating the proposed step, fixed parameters are taken into account. The parameters are fixed using
x(2,model.skip_fit==2) = initial_values(model.skip_fit==2);
If `skip_fit` is 2 for a parameter, then its value in the proposed step is reset to the initial value (which, since such parameters never change, is the same as its current value). The model's probability function is then called using the proposed parameters
[chisq,new_posterior] = model.probability(x(2,:));
Again, the model stores the target spectrum internally, so it is able to return chisq as a function of the model parameters only. If chisq is finite (i.e., if the parameters are valid) and the proposed step is accepted, then the accept counter is incremented and the new parameters are copied over to the current parameters. Then, the current parameters are recorded as an output sample.
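A minimal sketch of this accept/reject logic is given below. It assumes that `new_posterior` and the current value (held in an assumed variable `current_posterior`) are directly comparable probabilities; if `chain.m` works with log-scaled posteriors the ratio would instead be a difference.

```matlab
% Sketch of the Metropolis accept/reject step (variable names assumed):
alpha = new_posterior/current_posterior;  % ratio of proposed to current posterior
if isfinite(chisq) && rand < alpha        % accept with probability min(1,alpha), only if parameters are valid
    x(1,:) = x(2,:);                      % the proposed step becomes the current step
    current_posterior = new_posterior;
    accept = accept + 1;                  % count the accepted proposal
end
out(j,:) = x(1,:);                        % record the current parameters as this step's sample
final_posterior(j) = current_posterior;
```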
Finally, the initialization_stage is updated. Once n_initial steps have been taken, an initial covariance matrix is computed.
my_cov = cov(out(1:j,:));
mu = mean(out(1:j,:));
The initialization_stage is then set to 2. On the next loop iteration, the proposed step is computed using
x(2,:) = mvnrnd_fast(x(1,:),lambda*my_cov);
which samples from a multivariate normal distribution with a specific covariance. mvnrnd_fast is a stripped-down version of Matlab's built-in mvnrnd. The rest of the loop proceeds as before, but now, the covariance matrix must be re-computed every time a new sample is generated. This is performed in an iterative fashion based on the covariance matrix from the previous step, which means that updating the covariance matrix is very efficient. In contrast, computing the covariance matrix from scratch becomes more and more computationally intensive as the number of points becomes large.
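The incremental update is a standard identity for the running mean and sample covariance; a sketch is shown below. The exact form used in `chain.m` (for example, whether `fixed_cov` is folded in at this point) may differ, and `xj` and `d` are illustrative helper names.

```matlab
% Recursive update of the running mean and sample covariance after sample j (valid for j >= 2):
xj     = out(j,:);                        % newest sample (row vector)
d      = xj - mu;                         % deviation from the previous mean
mu     = mu + d/j;                        % updated running mean
my_cov = ((j-2)/(j-1))*my_cov + (d'*d)/j; % updated sample covariance
```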
When `j = n_burnin` (recalling that `n_burnin` includes `n_initial`), the initialization stage is incremented once more. The `accept` counter is reset to 0, so that the acceptance rate can be calculated for the samples that are actually returned (although it's worth noting that the acceptance rate could also be computed by counting how many duplicates there are in the output chain).
Once the while loop exits, if debugmode=true then the chain diagnostics are displayed using
bt.core.chain_diagnostics(model,out(1:(j-1),:),final_posterior(1:(j-1)),n_burnin);
Note that j is incremented at the end of the loop rather than the beginning, so the output is stored in indexes 1:j-1 rather than 1:j. Also, at this point out stores all of the samples, which means that the chain diagnostics include the burn-in points. The argument n_burnin means that chain_diagnostics is able to provide a visual cue showing the burn-in period.
Finally, the output is truncated and then the chain is returned.
MCMC sampling is ideally suited to parallelization because the samples can be generated independently. The key idea is that if we want n samples, we can run N copies of chain, with each call to chain requesting n/N output points. These output points can then be aggregated into a single set of output samples. This is achieved by the function chain_parallel, which has almost the same calling syntax as chain. This makes it easy to use chain_parallel as a drop-in replacement for chain.
chain_parallel expects that a parallel pool is already open, and will throw an error if this is not the case. The number of output points required on each worker is computed. Alternatively, if a time limit is specified, that time limit is passed directly to the workers without modification. chain is then called inside a parfor loop. Finally, the outputs from each call to chain are concatenated into a single matrix that is then returned.
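Conceptually, the parallel strategy amounts to something like the sketch below. The function namespace, pool size, and output handling are assumptions for illustration rather than a copy of `chain_parallel.m`.

```matlab
% Conceptual sketch of the parallel strategy (simplified):
n_workers         = 4;                        % assumed size of the open parallel pool
points_per_worker = ceil(n_points/n_workers); % each worker generates a share of the samples
results           = cell(n_workers,1);
parfor k = 1:n_workers
    % burn-in is passed explicitly, computed from the *total* requested chain length
    results{k} = bt.core.chain(model, initial_values, points_per_worker, ...
                               false, timelimit, n_burnin);
end
out = vertcat(results{:});                    % concatenate into a single set of output samples
```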
If debugmode=true, then the diagnostic output from each worker chain is displayed, including the burn-in period. After that, the diagnostic output from chain_parallel is displayed, which shows the final output samples that will be returned, but with no burn-in samples.
- Because the chains are simply stuck together, you cannot look at the autocorrelation or other such measures in the output from `chain_parallel`. However, since the standard chain analysis just looks at the distribution of points in the chain, and doesn't care about the order of the points, this normally does not matter.
- The intention is that you can just change `chain()` to `chain_parallel()` and get immediately comparable results. For consistency with `chain.m`, the number of burn-in steps on each worker is computed based on the total length of the chain. That is, if you request 10000 points, `chain()` will have 1000 points of burn-in. If you call `chain_parallel` with 10000 points and 4 workers, it will call `chain()` requesting 2500 points each. If you just call `chain()` with 2500 points, there will only be 250 points of burn-in. Therefore, `chain_parallel` calls `chain` with the optional argument `n_burnin` set to 1000 points (based on the total requested 10000 points), so the individual chains are run with 1000 points of burn-in and 2500 points of output each. This ensures valid results, but it also means sub-linear performance gains as the number of workers is increased when the chain length is small.
Users can specify an upper limit on the execution time of chain. In the case of `timelimit`, 20% of the time is spent on burn-in, which is higher than the 10% normally used. The required length of the burn-in is actually not dependent on the total length of the chain. That is, how many steps are required for the chain to reach its stationary distribution does not depend on the final number of samples being generated. As a convenient approximation, we set the burn-in length to 10% of the chain, which is common practice in the literature. However, as the chain gets shorter, the amount of burn-in required gets larger as a percentage of the total chain length. Since `timelimit` is normally only used for extremely short runs, a 20% burn-in is required to achieve satisfactory convergence. You can see this for yourself by enabling `debugmode`.
Sometimes `chain` will fail with the error `Burnin did not finish, something went wrong` when a time limit is specified; this behaviour is under investigation. It can occur if computing the initial `n_initial` points takes longer than the total time allowed for the fit, and simply means that the computer is too slow to provide an acceptable fit within the time requested.