reproducibility in parallel statistical computations
issues and considerations in reproducing parallel computations
a reproducible computation is one that gives the same results every time it runs. reproducibility is important for:
debugging — to correct an anomalous result, you need to reproduce the result.
confidence — when you can reproduce results, you can investigate and understand them.
modifying existing code — when you change existing code, you want to ensure that you do not break anything.
generally, you do not need to ensure reproducibility for your computation. often, when you want reproducibility, the simplest technique is to run in serial instead of in parallel. in serial computation you can simply call the rng function as follows:
s = rng   % obtain the current state of the random stream
% run the statistical function
rng(s)    % reset the stream to the previous state
% run the statistical function again, obtain identical results
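as a concrete sketch of the serial pattern above, the following uses rand as a stand-in for the statistical function (that substitution is an assumption made only for illustration):

```matlab
s = rng;               % save the current generator settings
a = rand(1,5);         % stand-in for "run the statistical function"
rng(s);                % restore the saved settings
b = rand(1,5);         % stand-in for "run the statistical function again"
same = isequal(a,b);   % true when both runs drew identical numbers
```

because rng(s) restores the generator to the saved state, the second call draws the same values as the first.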
this section addresses the case when your function uses random numbers, and you want reproducible results in parallel. this section also addresses the case when you want the same results in parallel as in serial.
running reproducible parallel computations
to run a statistics and machine learning toolbox™ function reproducibly:
set the UseSubstreams option to true using statset.
set the Streams option to a type that supports substreams: 'mlfg6331_64' or 'mrg32k3a'. for information on these streams, see the RandStream documentation.
to compute in parallel, set the UseParallel option to true.
to fit an ensemble in parallel using fitcensemble or a similar ensemble-fitting function, create a tree template with the 'Reproducible' name-value pair set to true:
    t = templateTree('Reproducible',true);
    ens = fitcensemble(x,y,'Method','bag','Learners',t,...
        'Options',options);
call the function with the options structure.
to reproduce the computation, reset the stream, then call the function again.
to understand why this technique gives reproducibility, see how substreams enable reproducible parallel computations.
for example, to use the 'mlfg6331_64' stream for reproducible computation:
create an appropriate options structure:
s = RandStream('mlfg6331_64');
options = statset('UseParallel',true, ...
    'Streams',s,'UseSubstreams',true);
run your parallel computation. for instructions, see quick start parallel computing for statistics and machine learning toolbox.
reset the random stream:
reset(s);
rerun your parallel computation. you obtain identical results.
for an example of parallel computation run this reproducible way, see reproducible parallel bootstrap.
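the steps above can be put together in a short sketch. it assumes statistics and machine learning toolbox is installed; the data vector x and the choice of mean as the bootstrap statistic are hypothetical, picked only to keep the example small.

```matlab
% fixed data, so the only randomness comes from the bootstrap resampling
x = (1:50)';

% one stream of a type that supports substreams
s = RandStream('mlfg6331_64');
options = statset('UseParallel',true, ...
    'Streams',s,'UseSubstreams',true);

m1 = bootstrp(100,@mean,x,'Options',options);   % first run

reset(s);                                       % rewind the stream
m2 = bootstrp(100,@mean,x,'Options',options);   % second run

same = isequal(m1,m2);   % expected to be true after the reset
```

the reset call is the step that makes the second run repeat the first. without it, the stream would not start again from its initial state.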
parallel statistical computation using random numbers
what are substreams?
a substream is a portion of a random stream that RandStream can access quickly. there is a number m such that for any positive integer k, RandStream can go to the km-th pseudorandom number in the stream. from that point, RandStream can generate the subsequent entries in the stream. currently, RandStream has m = 2^72, about 5e21, or more.
the entries in different substreams have good statistical properties, similar to the properties of entries in a single stream: independence, and lack of k-way correlation at various lags. the substreams are so long that you can view the substreams as being independent streams, as in the following picture.
two RandStream stream types support substreams: 'mlfg6331_64' and 'mrg32k3a'.
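as a small illustration of jumping between substreams, the following sketch sets the Substream property of a RandStream object directly. the substream numbers 1 and 3 are arbitrary choices for the example.

```matlab
s = RandStream('mrg32k3a');   % a stream type that supports substreams

s.Substream = 3;              % jump to the start of substream 3
a = rand(s,1,4);

s.Substream = 1;              % jump to the start of substream 1
c = rand(s,1,4);

s.Substream = 3;              % jump back to the start of substream 3
b = rand(s,1,4);

same = isequal(a,b);          % true: substream 3 replays the same values
differ = ~isequal(a,c);       % true: substreams 1 and 3 give different values
```

setting the Substream property repositions the stream at the beginning of that substream, which is why the values in a and b match.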
how substreams enable reproducible parallel computations
when matlab® performs computations in parallel with parfor, each worker receives loop iterations in an unpredictable order. therefore, you cannot predict which worker gets which iteration, so you cannot determine the random numbers associated with each iteration.
substreams allow matlab to tie each iteration to a particular sequence of random numbers. parfor gives each iteration an index. the iteration uses the index as the substream number. since the random numbers are associated with the iterations, not with the workers, the entire computation is reproducible.
to obtain reproducible results, simply reset the stream, and all the substreams generate identical random numbers when called again. this method succeeds when all the workers use the same stream, and the stream supports substreams. this concludes the discussion of how the procedure in running reproducible parallel computations gives reproducible parallel results.
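the idea can be sketched without a parallel pool. the example below is an illustration only, not the toolbox's internal implementation: it uses ordinary for loops, and the shuffled order [4 1 6 2 5 3] is a hypothetical stand-in for the unpredictable order in which workers receive iterations.

```matlab
stream = RandStream('mrg32k3a');
n = 6;
forward = zeros(n,1);
shuffled = zeros(n,1);

% iterations in natural order
for i = 1:n
    stream.Substream = i;         % iteration index selects the substream
    forward(i) = rand(stream);
end

% the same iterations in a different order
for i = [4 1 6 2 5 3]
    stream.Substream = i;         % same index, same substream
    shuffled(i) = rand(stream);
end

same = isequal(forward,shuffled); % true: results depend on the index only
```

because each iteration draws from the substream named by its own index, the execution order has no effect on the values each iteration sees.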
random numbers on the client or workers
a few functions generate random numbers on the client before distributing them to parallel workers. the workers do not use random numbers, so they operate purely deterministically. for these functions, you can run a parallel computation reproducibly using any random stream type.
the functions that operate this way include:
to obtain identical results, reset the random stream on the client, or the random stream you pass to the client. for example:
s = rng   % obtain the current state of the random stream
% run the statistical function
rng(s)    % reset the stream to the previous state
% run the statistical function again, obtain identical results
while this method enables you to run reproducibly in parallel, the results can differ from a serial computation. the reason for the difference is that parfor loops run in reverse order from for loops. therefore, a serial computation can generate random numbers in a different order than a parallel computation. for unequivocal reproducibility, use the technique in running reproducible parallel computations.
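the client-side pattern can be sketched as follows. this is a hand-written stand-in, not one of the toolbox functions: the squaring step is a hypothetical placeholder for the deterministic work done on the workers.

```matlab
state = rng;               % save the client generator settings

r = rand(8,1);             % all random numbers are drawn on the client
out1 = zeros(8,1);
parfor i = 1:8
    out1(i) = r(i)^2;      % workers do deterministic work only
end

rng(state);                % reset the client stream
r = rand(8,1);             % the client draws the same numbers again
out2 = zeros(8,1);
parfor i = 1:8
    out2(i) = r(i)^2;
end

same = isequal(out1,out2); % true: only the client stream had to be reset
```

since the workers never draw random numbers here, the stream type on the client does not need to support substreams.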
distributing streams explicitly
for testing or comparison using particular random number algorithms, you must set the random number generators. how do you set these generators in parallel, or initialize streams on each worker in a particular way? or you might want to run a computation using a different sequence of random numbers than any other you have run. how can you ensure the sequence you use is statistically independent?
parallel statistics and machine learning toolbox functions allow you to set random streams on each worker explicitly. for information on creating multiple streams, enter help RandStream/create at the command line. to create four independent streams using the 'mrg32k3a' generator:
s = RandStream.create('mrg32k3a','NumStreams',4,...
    'CellOutput',true);
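to check that the created streams are distinct, you can inspect them as in the following sketch. the variable names idx and first are hypothetical, introduced only for this illustration.

```matlab
s = RandStream.create('mrg32k3a','NumStreams',4,...
    'CellOutput',true);

% index of each stream within the group of four
idx = cellfun(@(st) st.StreamIndex, s);

% one draw from each stream; the streams are expected to produce different values
first = cellfun(@(st) rand(st), s);
```

each stream in the cell array carries its own StreamIndex, so the four workers draw from separate sequences of the same generator.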
pass these streams to a statistical function using the Streams option. for example:
parpool(4) % if you have at least 4 cores
s = RandStream.create('mrg32k3a','NumStreams',4,...
    'CellOutput',true); % create 4 independent streams
paroptions = statset('UseParallel',true,...
    'Streams',s); % set the 4 different streams
x = [randn(700,1); 4 + 2*randn(300,1)];
latt = -4:0.01:12;
myfun = @(x) ksdensity(x,latt);
pdfestimate = myfun(x);
b = bootstrp(200,myfun,x,'Options',paroptions);
this method of distributing streams gives each worker a different stream for the computation. however, it does not allow for a reproducible computation, because the workers perform the 200 bootstraps in an unpredictable order. if you want to perform a reproducible computation, use substreams as described in running reproducible parallel computations.
if you set the UseSubstreams option to true, then set the Streams option to a single random stream of the type that supports substreams ('mlfg6331_64' or 'mrg32k3a'). this setting gives reproducible computations.