End-to-End System Simulation Acceleration Using GPUs
This example shows a comparison of four techniques that can be used to accelerate bit error rate (BER) simulations using System objects in the MATLAB® Communications Toolbox™ software. A small system, based on convolutional coding, illustrates the effect of code generation using the MATLAB® Coder™ product, parallel loop execution using parfor in the Parallel Computing Toolbox™ product, a combination of code generation and parfor, and GPU-based System objects.
The System objects this example features are available in the Communications Toolbox product. To run this example, you must have a MATLAB Coder license, a Parallel Computing Toolbox license, and a supported CUDA®-capable GPU.
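As a quick pre-flight check (not part of the shipped example), you can verify these prerequisites from the command line. The snippet below is an illustrative sketch; the license feature names are assumed to be the standard identifiers for these products.

% Optional prerequisite check; illustrative only, not part of the example.
assert(license('test','MATLAB_Coder') == 1, ...
    'A MATLAB Coder license is required to run this example.');
assert(license('test','Distrib_Computing_Toolbox') == 1, ...
    'A Parallel Computing Toolbox license is required to run this example.');
assert(gpuDeviceCount > 0, ...
    'No supported GPU was detected.');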
System Design and Simulation Parameters
This example uses a simple convolutional coding system to illustrate simulation acceleration strategies. The system generates random message bits using randi. A transmitter encodes these bits using a rate 1/2 convolutional encoder, applies a QPSK modulation scheme, and then transmits the symbols. The symbols pass through an AWGN channel, where signal corruption occurs. QPSK demodulation occurs at the receiver, and the corrupted bits are decoded using the Viterbi algorithm. Finally, the bit error rate is computed. The System objects used in this system are:
comm.ConvolutionalEncoder - convolutional encoding
comm.PSKModulator - QPSK modulation
comm.AWGNChannel - AWGN channel
comm.PSKDemodulator - QPSK demodulation (approximate LLR)
comm.ViterbiDecoder - Viterbi decoding
The code for the transceivers can be found in viterbiTransceiverCPU.m and viterbiTransceiverGPU.m.
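Those files ship with the example and are not listed here. As a rough guide to their structure, the following sketch shows how the CPU transceiver might chain the System objects above inside a persistent-variable loop; the names, property settings, and bookkeeping are illustrative assumptions, not the shipped code.

function [errs, iters] = viterbiTransceiverCPU_sketch(snrdB, minErrs, maxIters)
% Illustrative sketch of the CPU transceiver; the shipped
% viterbiTransceiverCPU.m may differ in object settings and bookkeeping.

persistent enc qpskMod chan qpskDemod dec
if isempty(enc)
    enc       = comm.ConvolutionalEncoder('TerminationMethod','Truncated');
    qpskMod   = comm.PSKModulator('ModulationOrder',4,'BitInput',true);
    chan      = comm.AWGNChannel('NoiseMethod','Signal to noise ratio (SNR)');
    qpskDemod = comm.PSKDemodulator('ModulationOrder',4,'BitOutput',true, ...
                    'DecisionMethod','Approximate log-likelihood ratio');
    dec       = comm.ViterbiDecoder('InputFormat','Unquantized', ...
                    'TerminationMethod','Truncated');
end
chan.SNR = snrdB;     % SNR is tunable, so it can change between calls

msgL  = 2000;         % message bits per packet
errs  = 0;
iters = 0;
while (errs < minErrs) && (iters < maxIters)
    tx    = randi([0 1],msgL,1);     % random message bits
    coded = step(enc,tx);            % rate 1/2 convolutional encoding
    sym   = step(qpskMod,coded);     % QPSK modulation
    rx    = step(chan,sym);          % AWGN channel
    llr   = step(qpskDemod,rx);      % approximate LLR demodulation
    bits  = step(dec,llr);           % soft-decision Viterbi decoding
    errs  = errs + sum(tx ~= bits);
    iters = iters + 1;
end
end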
Each point along the bit error rate curve represents the result of many iterations of the transceiver code described above. To obtain accurate results in a reasonable amount of time, the simulation gathers at least 200 bit errors per signal-to-noise ratio (SNR) value and runs at most 5000 packets of data per SNR value. A packet represents 2000 message bits. The SNR ranges from 1 dB to 5 dB.
iterCntThreshold = 5000;
minErrThreshold = 200;
msgL = 2000;
snrdb = 1:5;
Initialization
Call the transceiver functions once to factor out setup time and object construction overhead. The System objects are stored in persistent variables in each function.
errs = zeros(length(snrdb),1);
iters = zeros(length(snrdb),1);
berplot = cell(1,5);
numframes = 500;      % GPU version runs 500 frames in parallel
viterbiTransceiverCPU(-10,1,1);
viterbiTransceiverGPU(-10,1,1,numframes);
n = 1;                % n tracks which simulation variant is run
Workflow
The workflow for this example is:
Run a baseline simulation of the System objects
Use MATLAB Coder to generate a MEX function for the simulation
Use parfor to run the bit error rate simulation in parallel
Combine the generated MEX function with parfor
Use the GPU-based System objects
fprintf(1,'Bit Error Rate Acceleration Analysis Example\n\n');
Bit Error Rate Acceleration Analysis Example
Baseline Simulation
To establish a reference point for the various acceleration strategies, a bit error rate curve is generated using System objects alone. The code for the transceiver is in viterbiTransceiverCPU.m.
fprintf(1,'***Baseline - Standard System object simulation***\n');

% Create a random stream for the snrdb simulation
s = RandStream.create('mrg32k3a','NumStreams',1, ...
    'CellOutput',true,'NormalTransform','Inversion');
RandStream.setGlobalStream(s{1});

ts = tic;
for ii = 1:numel(snrdb)
    fprintf(1,'Iteration number %d, SNR (dB) = %d\n',ii,snrdb(ii));
    [errs(ii),iters(ii)] = viterbiTransceiverCPU(snrdb(ii),minErrThreshold,iterCntThreshold);
end
ber = errs ./ (msgL*iters);
basetime = toc(ts);

berplot{n} = ber;
desc{n} = 'baseline';
reportResultsCommSysGPU(n,basetime,basetime,'Baseline');
***Baseline - Standard System object simulation***
Iteration number 1, SNR (dB) = 1
Iteration number 2, SNR (dB) = 2
Iteration number 3, SNR (dB) = 3
Iteration number 4, SNR (dB) = 4
Iteration number 5, SNR (dB) = 5
----------------------------------------------------------------------------------------------
Versions of the Transceiver                        | Elapsed Time (sec) | Acceleration Ratio
1. Baseline                                        |            17.0205 |             1.0000
----------------------------------------------------------------------------------------------
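The helper function reportResultsCommSysGPU is not listed in this example. Based on how it is called above and on the table it prints, a minimal sketch could look like the following; the shipped helper may be implemented differently.

function reportResultsCommSysGPU(n, trialtime, basetime, description)
% Sketch of the reporting helper: accumulate one row per simulation
% variant and reprint the cumulative timing table after each trial.
% Illustrative only; not the shipped implementation.

persistent results
if n == 1
    results = {};       % reset on the first (baseline) entry
end
results(n,:) = {description, trialtime, basetime/trialtime};

fprintf(1,'%s\n',repmat('-',1,94));
fprintf(1,'%-50s | %-18s | %-18s\n', ...
    'Versions of the Transceiver','Elapsed Time (sec)','Acceleration Ratio');
for k = 1:size(results,1)
    fprintf(1,'%d. %-47s | %18.4f | %18.4f\n', ...
        k,results{k,1},results{k,2},results{k,3});
end
fprintf(1,'%s\n',repmat('-',1,94));
end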
Code Generation
Using MATLAB Coder, you can generate a MEX function containing optimized C code that performs the same processing as the original MATLAB code. Because the viterbiTransceiverCPU function conforms to the MATLAB code generation subset, it can be compiled into a MEX function without modification.
You must have a MATLAB Coder license to run this portion of the example.
fprintf(1,'\n***Baseline + codegen***\n');
n = n+1;    % increment the simulation counter

% Create the coder config object and turn off run-time checks that
% reduce the performance of the generated code.
fprintf(1,'Generating code ...');
config_obj = coder.config('mex');
config_obj.EnableDebugging = false;
config_obj.IntegrityChecks = false;
config_obj.ResponsivenessChecks = false;
config_obj.EchoExpressions = false;

% Generate a MEX function
codegen('viterbiTransceiverCPU.m', '-config', 'config_obj', ...
    '-args', {snrdb(1), minErrThreshold, iterCntThreshold} )
fprintf(1,' done.\n');

% Run once to eliminate startup overhead.
viterbiTransceiverCPU_mex(-10,1,1);

s = RandStream.getGlobalStream;
reset(s);

% Use the generated MEX function viterbiTransceiverCPU_mex in the
% simulation loop.
ts = tic;
for ii = 1:numel(snrdb)
    fprintf(1,'Iteration number %d, SNR (dB) = %d\n',ii,snrdb(ii));
    [errs(ii),iters(ii)] = viterbiTransceiverCPU_mex(snrdb(ii),minErrThreshold,iterCntThreshold);
end
ber = errs ./ (msgL*iters);
trialtime = toc(ts);

berplot{n} = ber;
desc{n} = 'codegen';
reportResultsCommSysGPU(n,trialtime,basetime,'Baseline + codegen');
***Baseline + codegen***
Generating code ...Code generation successful.
 done.
Iteration number 1, SNR (dB) = 1
Iteration number 2, SNR (dB) = 2
Iteration number 3, SNR (dB) = 3
Iteration number 4, SNR (dB) = 4
Iteration number 5, SNR (dB) = 5
----------------------------------------------------------------------------------------------
Versions of the Transceiver                        | Elapsed Time (sec) | Acceleration Ratio
1. Baseline                                        |            17.0205 |             1.0000
2. Baseline + codegen                              |            14.3820 |             1.1835
----------------------------------------------------------------------------------------------
parfor - parallel loop execution
using parfor
, matlab executes the transceiver code against all snr values in parallel. this requires opening the parallel pool and adding a parfor
loop.
you must have a parallel computing toolbox license to run this portion of the example.
fprintf(1,'\n***Baseline + parfor***\n');
fprintf(1,'Accessing multiple CPU cores ...\n');
if isempty(gcp('nocreate'))
    pool = parpool;
    poolwasopen = false;
else
    pool = gcp;
    poolwasopen = true;
end
nw = pool.NumWorkers;

n = n+1;    % increment the simulation counter

snrN = numel(snrdb);
mt = minErrThreshold / nw;
it = iterCntThreshold / nw;

errN = zeros(nw,snrN);
itrN = zeros(nw,snrN);

% Replicate snrdb for each worker
snrdb_rep = repmat(snrdb,nw,1);

% Create an independent stream for each worker
s = RandStream.create('mrg32k3a','NumStreams',nw, ...
    'CellOutput',true,'NormalTransform','Inversion');

% Pre-run
parfor jj = 1:nw
    RandStream.setGlobalStream(s{jj});
    viterbiTransceiverCPU(-10,1,1);
end

fprintf(1,'Start parfor job ... ');
ts = tic;
parfor jj = 1:nw
    for ii = 1:snrN
        [err,itr] = viterbiTransceiverCPU(snrdb_rep(jj,ii),mt,it);
        errN(jj,ii) = err;
        itrN(jj,ii) = itr;
    end
end
ber = sum(errN) ./ (msgL*sum(itrN));
trialtime = toc(ts);
fprintf(1,'done.\n');

berplot{n} = ber;
desc{n} = 'parfor';
reportResultsCommSysGPU(n,trialtime,basetime,'Baseline + parfor');
***Baseline + parfor***
Accessing multiple CPU cores ...
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 8).
Start parfor job ... done.
----------------------------------------------------------------------------------------------
Versions of the Transceiver                        | Elapsed Time (sec) | Acceleration Ratio
1. Baseline                                        |            17.0205 |             1.0000
2. Baseline + codegen                              |            14.3820 |             1.1835
3. Baseline + parfor                               |             2.6984 |             6.3075
----------------------------------------------------------------------------------------------
parfor and Code Generation
You can combine the last two techniques for additional acceleration: the compiled MEX function can be executed inside a parfor loop.
You must have a MATLAB Coder license and a Parallel Computing Toolbox license to run this portion of the example.
fprintf(1,'\n***Baseline + codegen + parfor***\n');
n = n+1;    % increment the simulation counter

% Pre-run
parfor jj = 1:nw
    RandStream.setGlobalStream(s{jj});
    viterbiTransceiverCPU_mex(1,1,1);   % use the same MEX file
end

fprintf(1,'Start parfor job ... ');
ts = tic;
parfor jj = 1:nw
    for ii = 1:snrN
        [err,itr] = viterbiTransceiverCPU_mex(snrdb_rep(jj,ii),mt,it);
        errN(jj,ii) = err;
        itrN(jj,ii) = itr;
    end
end
ber = sum(errN) ./ (msgL*sum(itrN));
trialtime = toc(ts);
fprintf(1,'done.\n');

berplot{n} = ber;
desc{n} = 'codegen + parfor';
reportResultsCommSysGPU(n,trialtime,basetime,'Baseline + codegen + parfor');
***Baseline + codegen + parfor***
Start parfor job ... done.
----------------------------------------------------------------------------------------------
Versions of the Transceiver                        | Elapsed Time (sec) | Acceleration Ratio
1. Baseline                                        |            17.0205 |             1.0000
2. Baseline + codegen                              |            14.3820 |             1.1835
3. Baseline + parfor                               |             2.6984 |             6.3075
4. Baseline + codegen + parfor                     |             2.7059 |             6.2902
----------------------------------------------------------------------------------------------
GPU
The System objects that the viterbiTransceiverCPU function uses are also available for execution on the GPU. The GPU-based versions are:
comm.gpu.ConvolutionalEncoder - convolutional encoding
comm.gpu.PSKModulator - QPSK modulation
comm.gpu.AWGNChannel - AWGN channel
comm.gpu.PSKDemodulator - QPSK demodulation (approximate LLR)
comm.gpu.ViterbiDecoder - Viterbi decoding
A GPU is most effective when processing large quantities of data at once. The GPU-based System objects can process multiple frames in a single call to the step method. The numframes variable represents the number of frames processed per call. This is analogous to parfor, except that the parallelism is on a per-object basis rather than on a per-call basis to viterbiTransceiverCPU. A sketch of a multiframe GPU transceiver appears below.
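The shipped viterbiTransceiverGPU.m is not listed here. The following sketch illustrates the multiframe idea by simply concatenating numframes packets into one long gpuArray input per step call; the object settings and bookkeeping are illustrative assumptions, and the shipped function may expose per-frame configuration instead.

function [errs, iters] = viterbiTransceiverGPU_sketch(snrdB, minErrs, maxIters, numframes)
% Illustrative sketch of a multiframe GPU transceiver; not the shipped code.
% Each step call processes numframes packets concatenated into one column.

persistent enc qpskMod chan qpskDemod dec
if isempty(enc)
    enc       = comm.gpu.ConvolutionalEncoder('TerminationMethod','Truncated');
    qpskMod   = comm.gpu.PSKModulator('ModulationOrder',4,'BitInput',true);
    chan      = comm.gpu.AWGNChannel('NoiseMethod','Signal to noise ratio (SNR)');
    qpskDemod = comm.gpu.PSKDemodulator('ModulationOrder',4,'BitOutput',true, ...
                    'DecisionMethod','Approximate log-likelihood ratio');
    dec       = comm.gpu.ViterbiDecoder('InputFormat','Unquantized', ...
                    'TerminationMethod','Truncated');
end
chan.SNR = snrdB;

msgL  = 2000;         % message bits per packet
errs  = 0;
iters = 0;
while (errs < minErrs) && (iters < maxIters)
    % Generate numframes packets at once; the data stays on the GPU.
    tx   = gpuArray.randi([0 1],msgL*numframes,1);
    llr  = step(qpskDemod,step(chan,step(qpskMod,step(enc,tx))));
    bits = step(dec,llr);
    errs  = errs + gather(sum(tx ~= bits));
    iters = iters + numframes;
end
end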
You must have a Parallel Computing Toolbox license and a CUDA® 1.3 capable GPU to run this portion of the example.
fprintf(1,'\n***GPU***\n');
n = n+1;    % increment the simulation counter

try
    dev = parallel.gpu.GPUDevice.current;
    fprintf(...
        'GPU detected (%s, %d multiprocessors, compute capability %s)\n', ...
        dev.Name, dev.MultiprocessorCount, dev.ComputeCapability);

    sg = parallel.gpu.RandStream.create('mrg32k3a','NumStreams',1,'NormalTransform','Inversion');
    parallel.gpu.RandStream.setGlobalStream(sg);

    ts = tic;
    for ii = 1:numel(snrdb)
        fprintf(1,'Iteration number %d, SNR (dB) = %d\n',ii,snrdb(ii));
        [errs(ii),iters(ii)] = viterbiTransceiverGPU(snrdb(ii),minErrThreshold,iterCntThreshold,numframes);
    end
    ber = errs ./ (msgL*iters);
    trialtime = toc(ts);

    berplot{n} = ber;
    desc{n} = 'GPU';
    reportResultsCommSysGPU(n,trialtime,basetime,'Baseline + GPU');
    fprintf(1,' done.\n');
catch %#ok<CTCH>
    % Report that the appropriate GPU was not found.
    fprintf(1,['Could not find an appropriate GPU or could not ', ...
        'execute GPU code.\n']);
end
***GPU***
GPU detected (Tesla V100-PCIE-32GB, 80 multiprocessors, compute capability 7.0)
Iteration number 1, SNR (dB) = 1
Iteration number 2, SNR (dB) = 2
Iteration number 3, SNR (dB) = 3
Iteration number 4, SNR (dB) = 4
Iteration number 5, SNR (dB) = 5
----------------------------------------------------------------------------------------------
Versions of the Transceiver                        | Elapsed Time (sec) | Acceleration Ratio
1. Baseline                                        |            17.0205 |             1.0000
2. Baseline + codegen                              |            14.3820 |             1.1835
3. Baseline + parfor                               |             2.6984 |             6.3075
4. Baseline + codegen + parfor                     |             2.7059 |             6.2902
5. Baseline + GPU                                  |             0.1895 |            89.8137
----------------------------------------------------------------------------------------------
 done.
Analysis
Comparing the results of these trials, it is clear that the GPU is significantly faster than any other simulation acceleration technique. This performance boost requires only a modest change to the simulation code, and there is no loss in bit error rate performance, as the following plot illustrates. The very slight differences between the curves are a result of different random number generation algorithms and/or of averaging different quantities of data for the same point on the curve.
lines = {'kx-.','ro-','cs--','m^:','g*-'};
for ii = 1:numel(desc)
    semilogy(snrdb,berplot{ii},lines{ii});
    hold on;
end
hold off;
title('Bit Error Rate for Various Acceleration Strategies');
xlabel('Signal to Noise Ratio (dB)');
ylabel('BER');
legend(desc{:});
Cleanup
Leave the parallel pool in its original state.
if ~poolwasopen
    delete(gcp);
end
Parallel pool using the 'local' profile is shutting down.