End-to-End System Simulation Acceleration Using GPUs
This example shows a comparison of four techniques that can be used to accelerate bit error rate (BER) simulations using System objects in the MATLAB® Communications Toolbox™ software. A small system, based on convolutional coding, illustrates the effect of code generation using the MATLAB® Coder™ product, parallel loop execution using parfor in the Parallel Computing Toolbox™ product, a combination of code generation and parfor, and GPU-based System objects.
The System objects this example features are available in the Communications Toolbox product. To run this example, you must have a MATLAB Coder license, a Parallel Computing Toolbox license, and a supported CUDA®-capable GPU.
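As a quick pre-flight check (not part of the shipped example), you can verify these prerequisites from the command line. The snippet below is an illustrative sketch; the license feature names are assumed to be the standard identifiers for these products.

% Optional prerequisite check; illustrative only, not part of the example.
assert(license('test','MATLAB_Coder') == 1, ...
    'A MATLAB Coder license is required to run this example.');
assert(license('test','Distrib_Computing_Toolbox') == 1, ...
    'A Parallel Computing Toolbox license is required to run this example.');
assert(gpuDeviceCount > 0, ...
    'No supported GPU was detected.');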
System Design and Simulation Parameters
This example uses a simple convolutional coding system to illustrate simulation acceleration strategies. The system generates random message bits using randi. A transmitter encodes these bits using a rate 1/2 convolutional encoder, applies a QPSK modulation scheme, and then transmits the symbols. The symbols pass through an AWGN channel, where signal corruption occurs. QPSK demodulation occurs at the receiver, and the corrupted bits are decoded using the Viterbi algorithm. Finally, the bit error rate is computed. The System objects used in this system are:
comm.ConvolutionalEncoder - convolutional encoding
comm.PSKModulator - QPSK modulation
comm.AWGNChannel - AWGN channel
comm.PSKDemodulator - QPSK demodulation (approximate LLR)
comm.ViterbiDecoder - Viterbi decoding
The code for the transceivers can be found in viterbiTransceiverCPU.m and viterbiTransceiverGPU.m.
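Those files ship with the example and are not listed here. As a rough guide to their structure, the following sketch shows how the CPU transceiver might chain the System objects above inside a persistent-variable loop; the names, property settings, and bookkeeping are illustrative assumptions, not the shipped code.

function [errs, iters] = viterbiTransceiverCPU_sketch(snrdB, minErrs, maxIters)
% Illustrative sketch of the CPU transceiver; the shipped
% viterbiTransceiverCPU.m may differ in object settings and bookkeeping.

persistent enc qpskMod chan qpskDemod dec
if isempty(enc)
    enc       = comm.ConvolutionalEncoder('TerminationMethod','Truncated');
    qpskMod   = comm.PSKModulator('ModulationOrder',4,'BitInput',true);
    chan      = comm.AWGNChannel('NoiseMethod','Signal to noise ratio (SNR)');
    qpskDemod = comm.PSKDemodulator('ModulationOrder',4,'BitOutput',true, ...
                    'DecisionMethod','Approximate log-likelihood ratio');
    dec       = comm.ViterbiDecoder('InputFormat','Unquantized', ...
                    'TerminationMethod','Truncated');
end
chan.SNR = snrdB;     % SNR is tunable, so it can change between calls

msgL  = 2000;         % message bits per packet
errs  = 0;
iters = 0;
while (errs < minErrs) && (iters < maxIters)
    tx    = randi([0 1],msgL,1);     % random message bits
    coded = step(enc,tx);            % rate 1/2 convolutional encoding
    sym   = step(qpskMod,coded);     % QPSK modulation
    rx    = step(chan,sym);          % AWGN channel
    llr   = step(qpskDemod,rx);      % approximate LLR demodulation
    bits  = step(dec,llr);           % soft-decision Viterbi decoding
    errs  = errs + sum(tx ~= bits);
    iters = iters + 1;
end
end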
Each point along the bit error rate curve represents the result of many iterations of the transceiver code described above. To obtain accurate results in a reasonable amount of time, the simulation gathers at least 200 bit errors per signal-to-noise ratio (SNR) value and runs at most 5000 packets of data per SNR value. A packet represents 2000 message bits. The SNR ranges from 1 dB to 5 dB.
iterCntThreshold = 5000;
minErrThreshold = 200;
msgL = 2000;
snrdb = 1:5;
Initialization
Call the transceiver functions once to factor out setup time and object construction overhead. The System objects are stored in persistent variables in each function.
errs = zeros(length(snrdb),1);
iters = zeros(length(snrdb),1);
berplot = cell(1,5);
numframes = 500;      % GPU version runs 500 frames in parallel
viterbiTransceiverCPU(-10,1,1);
viterbiTransceiverGPU(-10,1,1,numframes);
n = 1;                % n tracks which simulation variant is run
Workflow
The workflow for this example is:
Run a baseline simulation of the System objects
Use MATLAB Coder to generate a MEX function for the simulation
Use parfor to run the bit error rate simulation in parallel
Combine the generated MEX function with parfor
Use the GPU-based System objects
fprintf(1,'Bit Error Rate Acceleration Analysis Example\n\n');
Bit Error Rate Acceleration Analysis Example
Baseline Simulation
To establish a reference point for the various acceleration strategies, a bit error rate curve is generated using System objects alone. The code for the transceiver is in viterbiTransceiverCPU.m.
fprintf(1,'***Baseline - Standard System object simulation***\n');

% Create a random stream for the snrdb simulation
s = RandStream.create('mrg32k3a','NumStreams',1, ...
    'CellOutput',true,'NormalTransform','Inversion');
RandStream.setGlobalStream(s{1});

ts = tic;
for ii = 1:numel(snrdb)
    fprintf(1,'Iteration number %d, SNR (dB) = %d\n',ii,snrdb(ii));
    [errs(ii),iters(ii)] = viterbiTransceiverCPU(snrdb(ii),minErrThreshold,iterCntThreshold);
end
ber = errs ./ (msgL*iters);
basetime = toc(ts);

berplot{n} = ber;
desc{n} = 'baseline';
reportResultsCommSysGPU(n,basetime,basetime,'Baseline');
***Baseline - Standard System object simulation***
Iteration number 1, SNR (dB) = 1
Iteration number 2, SNR (dB) = 2
Iteration number 3, SNR (dB) = 3
Iteration number 4, SNR (dB) = 4
Iteration number 5, SNR (dB) = 5
----------------------------------------------------------------------------------------------
Versions of the Transceiver                        | Elapsed Time (sec) | Acceleration Ratio
1. Baseline                                        |            17.0205 |             1.0000
----------------------------------------------------------------------------------------------
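The helper function reportResultsCommSysGPU is not listed in this example. Based on how it is called above and on the table it prints, a minimal sketch could look like the following; the shipped helper may be implemented differently.

function reportResultsCommSysGPU(n, trialtime, basetime, description)
% Sketch of the reporting helper: accumulate one row per simulation
% variant and reprint the cumulative timing table after each trial.
% Illustrative only; not the shipped implementation.

persistent results
if n == 1
    results = {};       % reset on the first (baseline) entry
end
results(n,:) = {description, trialtime, basetime/trialtime};

fprintf(1,'%s\n',repmat('-',1,94));
fprintf(1,'%-50s | %-18s | %-18s\n', ...
    'Versions of the Transceiver','Elapsed Time (sec)','Acceleration Ratio');
for k = 1:size(results,1)
    fprintf(1,'%d. %-47s | %18.4f | %18.4f\n', ...
        k,results{k,1},results{k,2},results{k,3});
end
fprintf(1,'%s\n',repmat('-',1,94));
end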
Code Generation
Using MATLAB Coder, you can generate a MEX function containing optimized C code that performs the same processing as the original MATLAB code. Because the viterbiTransceiverCPU function conforms to the MATLAB code generation subset, it can be compiled into a MEX function without modification.
You must have a MATLAB Coder license to run this portion of the example.
fprintf(1,'\n***Baseline + codegen***\n');
n = n+1;    % increment the simulation counter

% Create the coder config object and turn off run-time checks that
% reduce the performance of the generated code.
fprintf(1,'Generating code ...');
config_obj = coder.config('mex');
config_obj.EnableDebugging = false;
config_obj.IntegrityChecks = false;
config_obj.ResponsivenessChecks = false;
config_obj.EchoExpressions = false;

% Generate a MEX function
codegen('viterbiTransceiverCPU.m', '-config', 'config_obj', ...
    '-args', {snrdb(1), minErrThreshold, iterCntThreshold} )
fprintf(1,' done.\n');

% Run once to eliminate startup overhead.
viterbiTransceiverCPU_mex(-10,1,1);

s = RandStream.getGlobalStream;
reset(s);

% Use the generated MEX function viterbiTransceiverCPU_mex in the
% simulation loop.
ts = tic;
for ii = 1:numel(snrdb)
    fprintf(1,'Iteration number %d, SNR (dB) = %d\n',ii,snrdb(ii));
    [errs(ii),iters(ii)] = viterbiTransceiverCPU_mex(snrdb(ii),minErrThreshold,iterCntThreshold);
end
ber = errs ./ (msgL*iters);
trialtime = toc(ts);

berplot{n} = ber;
desc{n} = 'codegen';
reportResultsCommSysGPU(n,trialtime,basetime,'Baseline + codegen');
***Baseline + codegen***
Generating code ...Code generation successful.
 done.
Iteration number 1, SNR (dB) = 1
Iteration number 2, SNR (dB) = 2
Iteration number 3, SNR (dB) = 3
Iteration number 4, SNR (dB) = 4
Iteration number 5, SNR (dB) = 5
----------------------------------------------------------------------------------------------
Versions of the Transceiver                        | Elapsed Time (sec) | Acceleration Ratio
1. Baseline                                        |            17.0205 |             1.0000
2. Baseline + codegen                              |            14.3820 |             1.1835
----------------------------------------------------------------------------------------------
parfor - parallel loop execution
using parfor
, matlab executes the transceiver code against all snr values in parallel. this requires opening the parallel pool and adding a parfor
loop.
you must have a parallel computing toolbox license to run this portion of the example.
fprintf(1,'\n***Baseline + parfor***\n');
fprintf(1,'Accessing multiple CPU cores ...\n');
if isempty(gcp('nocreate'))
    pool = parpool;
    poolwasopen = false;
else
    pool = gcp;
    poolwasopen = true;
end
nw = pool.NumWorkers;

n = n+1;    % increment the simulation counter

snrN = numel(snrdb);
mt = minErrThreshold / nw;
it = iterCntThreshold / nw;

errN = zeros(nw,snrN);
itrN = zeros(nw,snrN);

% Replicate snrdb for each worker
snrdb_rep = repmat(snrdb,nw,1);

% Create an independent stream for each worker
s = RandStream.create('mrg32k3a','NumStreams',nw, ...
    'CellOutput',true,'NormalTransform','Inversion');

% Pre-run
parfor jj = 1:nw
    RandStream.setGlobalStream(s{jj});
    viterbiTransceiverCPU(-10,1,1);
end

fprintf(1,'Start parfor job ... ');
ts = tic;
parfor jj = 1:nw
    for ii = 1:snrN
        [err,itr] = viterbiTransceiverCPU(snrdb_rep(jj,ii),mt,it);
        errN(jj,ii) = err;
        itrN(jj,ii) = itr;
    end
end
ber = sum(errN) ./ (msgL*sum(itrN));
trialtime = toc(ts);
fprintf(1,'done.\n');

berplot{n} = ber;
desc{n} = 'parfor';
reportResultsCommSysGPU(n,trialtime,basetime,'Baseline + parfor');
***Baseline + parfor***
Accessing multiple CPU cores ...
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 8).
Start parfor job ... done.
----------------------------------------------------------------------------------------------
Versions of the Transceiver                        | Elapsed Time (sec) | Acceleration Ratio
1. Baseline                                        |            17.0205 |             1.0000
2. Baseline + codegen                              |            14.3820 |             1.1835
3. Baseline + parfor                               |             2.6984 |             6.3075
----------------------------------------------------------------------------------------------
parfor and Code Generation
You can combine the last two techniques for additional acceleration: the compiled MEX function can be executed inside a parfor loop.
You must have a MATLAB Coder license and a Parallel Computing Toolbox license to run this portion of the example.
fprintf(1,'\n***Baseline + codegen + parfor***\n');
n = n+1;    % increment the simulation counter

% Pre-run
parfor jj = 1:nw
    RandStream.setGlobalStream(s{jj});
    viterbiTransceiverCPU_mex(1,1,1);   % use the same MEX file
end

fprintf(1,'Start parfor job ... ');
ts = tic;
parfor jj = 1:nw
    for ii = 1:snrN
        [err,itr] = viterbiTransceiverCPU_mex(snrdb_rep(jj,ii),mt,it);
        errN(jj,ii) = err;
        itrN(jj,ii) = itr;
    end
end
ber = sum(errN) ./ (msgL*sum(itrN));
trialtime = toc(ts);
fprintf(1,'done.\n');

berplot{n} = ber;
desc{n} = 'codegen + parfor';
reportResultsCommSysGPU(n,trialtime,basetime,'Baseline + codegen + parfor');
***Baseline + codegen + parfor***
Start parfor job ... done.
----------------------------------------------------------------------------------------------
Versions of the Transceiver                        | Elapsed Time (sec) | Acceleration Ratio
1. Baseline                                        |            17.0205 |             1.0000
2. Baseline + codegen                              |            14.3820 |             1.1835
3. Baseline + parfor                               |             2.6984 |             6.3075
4. Baseline + codegen + parfor                     |             2.7059 |             6.2902
----------------------------------------------------------------------------------------------
GPU
The System objects that the viterbiTransceiverCPU function uses are also available for execution on the GPU. The GPU-based versions are:
comm.gpu.ConvolutionalEncoder - convolutional encoding
comm.gpu.PSKModulator - QPSK modulation
comm.gpu.AWGNChannel - AWGN channel
comm.gpu.PSKDemodulator - QPSK demodulation (approximate LLR)
comm.gpu.ViterbiDecoder - Viterbi decoding
A GPU is most effective when processing large quantities of data at once. The GPU-based System objects can process multiple frames in a single call to the step method. The numframes variable represents the number of frames processed per call. This is analogous to parfor, except that the parallelism is on a per-object basis rather than on a per-call basis to viterbiTransceiverCPU. A sketch of a multiframe GPU transceiver appears below.
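The shipped viterbiTransceiverGPU.m is not listed here. The following sketch illustrates the multiframe idea by simply concatenating numframes packets into one long gpuArray input per step call; the object settings and bookkeeping are illustrative assumptions, and the shipped function may expose per-frame configuration instead.

function [errs, iters] = viterbiTransceiverGPU_sketch(snrdB, minErrs, maxIters, numframes)
% Illustrative sketch of a multiframe GPU transceiver; not the shipped code.
% Each step call processes numframes packets concatenated into one column.

persistent enc qpskMod chan qpskDemod dec
if isempty(enc)
    enc       = comm.gpu.ConvolutionalEncoder('TerminationMethod','Truncated');
    qpskMod   = comm.gpu.PSKModulator('ModulationOrder',4,'BitInput',true);
    chan      = comm.gpu.AWGNChannel('NoiseMethod','Signal to noise ratio (SNR)');
    qpskDemod = comm.gpu.PSKDemodulator('ModulationOrder',4,'BitOutput',true, ...
                    'DecisionMethod','Approximate log-likelihood ratio');
    dec       = comm.gpu.ViterbiDecoder('InputFormat','Unquantized', ...
                    'TerminationMethod','Truncated');
end
chan.SNR = snrdB;

msgL  = 2000;         % message bits per packet
errs  = 0;
iters = 0;
while (errs < minErrs) && (iters < maxIters)
    % Generate numframes packets at once; the data stays on the GPU.
    tx   = gpuArray.randi([0 1],msgL*numframes,1);
    llr  = step(qpskDemod,step(chan,step(qpskMod,step(enc,tx))));
    bits = step(dec,llr);
    errs  = errs + gather(sum(tx ~= bits));
    iters = iters + numframes;
end
end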
You must have a Parallel Computing Toolbox license and a CUDA® 1.3 capable GPU to run this portion of the example.
fprintf(1,'\n***GPU***\n');
n = n+1;    % increment the simulation counter

try
    dev = parallel.gpu.GPUDevice.current;
    fprintf(...
        'GPU detected (%s, %d multiprocessors, compute capability %s)\n', ...
        dev.Name, dev.MultiprocessorCount, dev.ComputeCapability);

    sg = parallel.gpu.RandStream.create('mrg32k3a','NumStreams',1,'NormalTransform','Inversion');
    parallel.gpu.RandStream.setGlobalStream(sg);

    ts = tic;
    for ii = 1:numel(snrdb)
        fprintf(1,'Iteration number %d, SNR (dB) = %d\n',ii,snrdb(ii));
        [errs(ii),iters(ii)] = viterbiTransceiverGPU(snrdb(ii),minErrThreshold,iterCntThreshold,numframes);
    end
    ber = errs ./ (msgL*iters);
    trialtime = toc(ts);

    berplot{n} = ber;
    desc{n} = 'GPU';
    reportResultsCommSysGPU(n,trialtime,basetime,'Baseline + GPU');
    fprintf(1,' done.\n');
catch %#ok<CTCH>
    % Report that the appropriate GPU was not found.
    fprintf(1,['Could not find an appropriate GPU or could not ', ...
        'execute GPU code.\n']);
end
***GPU***
GPU detected (Tesla V100-PCIE-32GB, 80 multiprocessors, compute capability 7.0)
Iteration number 1, SNR (dB) = 1
Iteration number 2, SNR (dB) = 2
Iteration number 3, SNR (dB) = 3
Iteration number 4, SNR (dB) = 4
Iteration number 5, SNR (dB) = 5
----------------------------------------------------------------------------------------------
Versions of the Transceiver                        | Elapsed Time (sec) | Acceleration Ratio
1. Baseline                                        |            17.0205 |             1.0000
2. Baseline + codegen                              |            14.3820 |             1.1835
3. Baseline + parfor                               |             2.6984 |             6.3075
4. Baseline + codegen + parfor                     |             2.7059 |             6.2902
5. Baseline + GPU                                  |             0.1895 |            89.8137
----------------------------------------------------------------------------------------------
 done.
Analysis
Comparing the results of these trials, it is clear that the GPU is significantly faster than any other simulation acceleration technique. This performance boost requires only a modest change to the simulation code, and there is no loss in bit error rate performance, as the following plot illustrates. The very slight differences between the curves are a result of different random number generation algorithms and/or of averaging different quantities of data for the same point on the curve.
lines = {'kx-.','ro-','cs--','m^:','g*-'};
for ii = 1:numel(desc)
    semilogy(snrdb,berplot{ii},lines{ii});
    hold on;
end
hold off;
title('Bit Error Rate for Various Acceleration Strategies');
xlabel('Signal to Noise Ratio (dB)');
ylabel('BER');
legend(desc{:});
Cleanup
Leave the parallel pool in its original state.
if ~poolwasopen
    delete(gcp);
end
Parallel pool using the 'local' profile is shutting down.