Analyze Performance of the Generated CUDA Code

This example shows you how to analyze and optimize the performance of the generated CUDA® code by using the gpuPerformanceAnalyzer function.

The GPU Coder Performance Analyzer runs a software-in-the-loop (SIL) execution that collects metrics on CPU and GPU activities in the generated code. It presents these metrics in a chronological timeline plot that helps you visualize, identify, and mitigate performance bottlenecks in the generated CUDA code. This example generates the performance analysis report for the Fog Rectification example from GPU Coder; see that example for more information about the algorithm.

Third-Party Prerequisites

CUDA enabled NVIDIA® GPU.
NVIDIA CUDA toolkit and driver.
NVIDIA Nsight™ Systems. For information on the supported versions of the compilers and libraries, see Third-Party Hardware.
Environment variables for the compilers and libraries. For setup instructions, see the GPU Coder documentation on prerequisite products.
Access to GPU performance counters. The profiling workflow of this example depends on profiling tools from NVIDIA that access GPU performance counters. Starting with CUDA Toolkit v10.1, NVIDIA restricts access to performance counters to admin users only. To enable GPU performance counters for all users, see the instructions provided by NVIDIA for enabling access to GPU performance counters.

Verify GPU Environment

To verify that the compilers and libraries necessary for running this example are set up correctly, use the coder.checkGpuInstall function.

envCfg = coder.gpuEnvConfig('host');
envCfg.BasicCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);
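Because Quiet suppresses the command-line report, it can be useful to capture the result of the check and stop early if it fails. The following is a minimal sketch, assuming that coder.checkGpuInstall returns a results structure with a basiccodegen field; the field name is an assumption here, so the code guards the access with isfield.

```matlab
% Sketch: capture the environment check result instead of discarding it.
envCfg = coder.gpuEnvConfig('host');
envCfg.BasicCodegen = 1;   % request the basic code generation check
envCfg.Quiet = 1;          % suppress the printed report

results = coder.checkGpuInstall(envCfg);

% Assumption: the results structure exposes a 'basiccodegen' field.
if isfield(results,'basiccodegen') && ~results.basiccodegen
    error('GPU code generation check failed. Review the CUDA toolkit and compiler setup.');
end
```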

Fog Rectification Algorithm

To improve the foggy input image, the algorithm performs fog removal and then contrast enhancement. The diagram shows the steps of both these operations.

This example takes a foggy RGB image as input. To perform fog removal, the algorithm estimates the dark channel of the image, calculates the airlight map based on the dark channel, and refines the airlight map by using filters. The restoration stage creates a defogged image by subtracting the refined airlight map from the input image.

Then, the contrast enhancement stage assesses the range of intensity values in the image and uses contrast stretching to expand the range of values and make features stand out more clearly.
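The fog-removal steps can be tried on a tiny synthetic image. The following is a simplified sketch, not the example's implementation: the pixel values are hypothetical, and the filtering step that refines the airlight map is skipped so that the arithmetic stays easy to follow by hand.

```matlab
% Hypothetical 2-by-2 RGB image with values already scaled to [0,1].
img = cat(3, [0.8 0.6; 0.7 0.9], ...   % red
             [0.7 0.5; 0.9 0.8], ...   % green
             [0.9 0.4; 0.8 0.7]);      % blue

% Dark channel: per-pixel minimum across the three color channels.
darkChannel = min(img, [], 3);

% Airlight estimate. The refinement filter is omitted in this sketch,
% so the estimate is just a scaled copy of the dark channel.
airlight = 0.6 * min(darkChannel, 0.9 * darkChannel);

% Restoration: subtract the airlight and rescale each channel.
factor = 1.0 ./ (1.0 - airlight);
restored = (img - airlight) .* factor;  % implicit expansion over channels

disp(darkChannel);
disp(restored(:,:,1));
```

Because the airlight estimate never exceeds the darkest channel of a pixel, the restored values stay within [0,1] before they are converted back to uint8.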

type fog_rectification.m

function [out] = fog_rectification(input) %#codegen
%   Copyright 2017-2019 The MathWorks, Inc.
coder.gpu.kernelfun;
% restoreout is used to store the output of restoration
restoreout = zeros(size(input),'double');
% changing the precision level of input image to double
input = double(input)./255;
%% dark channel estimation from input
darkchannel = min(input,[],3);
% diff_im is used as input and output variable for anisotropic diffusion
diff_im = 0.9*darkchannel;
num_iter = 3;
% 2d convolution mask for anisotropic diffusion
hn = [0.0625 0.1250 0.0625; 0.1250 0.2500 0.1250; 0.0625 0.1250 0.0625];
hn = double(hn);
%% refine dark channel using anisotropic diffusion.
for t = 1:num_iter
    diff_im = conv2(diff_im,hn,'same');
end
%% reduction with min
diff_im = min(darkchannel,diff_im);
diff_im = 0.6*diff_im ;
%% parallel element-wise math to compute
%  restoration with inverse koschmieder's law
factor = 1.0./(1.0-(diff_im));
restoreout(:,:,1) = (input(:,:,1)-diff_im).*factor;
restoreout(:,:,2) = (input(:,:,2)-diff_im).*factor;
restoreout(:,:,3) = (input(:,:,3)-diff_im).*factor;
restoreout = uint8(255.*restoreout);
restoreout = uint8(restoreout);
%%
% stretching performs the histogram stretching of the image.
% im is the input color image and p is cdf limit.
% out is the contrast stretched image and cdf is the cumulative prob.
% density function and t is the stretching function.
p = 5;
% rgb to grayscale conversion
im_gray = im2gray(restoreout);
[row,col] = size(im_gray);
% histogram calculation
[count,~] = imhist(im_gray);
prob = count'/(row*col);
% cumulative sum calculation
cdf = cumsum(prob(:));
% finding less than particular probability
i1 = length(find(cdf <= (p/100)));
i2 = 255-length(find(cdf >= 1-(p/100)));
o1 = floor(255*.10);
o2 = floor(255*.90);
t1 = (o1/i1)*[0:i1];
t2 = (((o2-o1)/(i2-i1))*[i1+1:i2])-(((o2-o1)/(i2-i1))*i1)+o1;
t3 = (((255-o2)/(255-i2))*[i2+1:255])-(((255-o2)/(255-i2))*i2)+o2;
t = (floor([t1 t2 t3]));
restoreout(restoreout == 0) = 1;
u1 = (restoreout(:,:,1));
u2 = (restoreout(:,:,2));
u3 = (restoreout(:,:,3));
% replacing the value from look up table
out1 = t(u1);
out2 = t(u2);
out3 = t(u3);
out = zeros([size(out1),3], 'uint8');
out(:,:,1) = uint8(out1);
out(:,:,2) = uint8(out2);
out(:,:,3) = uint8(out3);
return
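
The stretching lookup table built near the end of fog_rectification is a piecewise-linear map over three intensity ranges. The following sketch builds the same kind of table in isolation. The breakpoints i1 and i2 are hypothetical values chosen for illustration (the function derives them from the image histogram), and each segment is written in slope-and-offset form, which is algebraically equivalent to the expressions in the listing.

```matlab
% Hypothetical input breakpoints; fog_rectification computes these
% from the cumulative histogram of the grayscale image.
i1 = 20;
i2 = 230;

% Output breakpoints at 10% and 90% of the 8-bit range.
o1 = floor(255*.10);
o2 = floor(255*.90);

% Three linear segments covering inputs 0..i1, i1+1..i2, and i2+1..255.
t1 = (o1/i1) * (0:i1);
t2 = ((o2-o1)/(i2-i1)) * ((i1+1:i2) - i1) + o1;
t3 = ((255-o2)/(255-i2)) * ((i2+1:255) - i2) + o2;

% One table entry per 8-bit intensity value.
T = floor([t1 t2 t3]);

fprintf('Table length: %d\n', numel(T));
```

MATLAB indexing is 1-based, so T(k) holds the mapping for intensity k-1. This is why the function replaces zero-valued pixels with 1 before it indexes into the table.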

Generate Performance Analyzer Report

To analyze the performance of the generated code using gpuPerformanceAnalyzer, create a code configuration object with a dynamic library ('dll') build type. Because the gpuPerformanceAnalyzer function accepts only an Embedded Coder™ configuration object, enable the option to create a coder.EmbeddedCodeConfig configuration object.

cfg = coder.gpuConfig('dll','ecoder',true);

Run gpuPerformanceAnalyzer with the default iteration count of 2.

inputImage = imread('foggyInput.png');
inputs  = {inputImage};
designFileName = 'fog_rectification';
gpuPerformanceAnalyzer(designFileName, inputs, ...
    'Config', cfg, 'NumIterations', 2);

### starting gpu code generation
code generation successful: view report
### gpu code generation finished
### starting sil execution for 'fog_rectification'
    to terminate execution: clear fog_rectification_sil
### host application produced the following standard output (stdout) messages:
generating '/tmp/nsys-report-540f.qdstrm'
[1/1] [0%                          ] mw_nsysdata.nsys-rep
[1/1] [0%                          ] mw_nsysdata.nsys-rep
[1/1] [===========50%              ] mw_nsysdata.nsys-rep
[1/1] [0%                          ] mw_nsysdata.nsys-rep
[1/1] [7%                          ] mw_nsysdata.nsys-rep
[1/1] [===========52%              ] mw_nsysdata.nsys-rep
[1/1] [========================100%] mw_nsysdata.nsys-rep
[1/1] [========================100%] mw_nsysdata.nsys-rep
generated:
    /home/lnarasim/documents/matlab/examplemanager/lnarasim.bdoc23a.j2174901/gpucoder-ex87489778/mw_nsysdata.nsys-rep
### stopping sil execution for 'fog_rectification'
### starting profiling data processing
### profiling data processing finished
### showing profiling data

GPU Performance Analyzer

The GPU Performance Analyzer exposes GPU and CPU activities, events, and performance metrics in a chronological timeline plot so that you can visualize, identify, and address performance bottlenecks in the generated CUDA® code.

These numbers are representative. The actual values depend on your hardware setup. This profiling was done using MATLAB R2023a on a machine with a 6-core, 3.5 GHz Intel® Xeon® CPU and an NVIDIA Titan Xp GPU.

Profiling Timeline

The profiling timeline shows the complete trace of all events that have a runtime higher than the threshold value. A snippet of the profiling trace is shown.

You can use the mouse wheel (or an equivalent touchpad option) to zoom into and out of the timeline. Alternatively, you can use the timeline summary at the top of the panel to zoom and navigate the timeline plot.

The tooltips on each event indicate the start time, end time, and duration of the selected event on the CPU and the GPU. They also indicate the time elapsed between the kernel launch on the CPU and the actual execution of the kernel on the GPU.
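
The tooltip quantities are simple differences between timestamps. The following sketch uses hypothetical timestamps for a single kernel, and it assumes that the launch-to-execution delay is measured from the start of the CPU launch call; the report may define it differently.

```matlab
% Hypothetical timestamps in milliseconds for one kernel.
launchStartCpu = 2.000;   % CPU begins the kernel launch call
launchEndCpu   = 2.012;   % CPU returns from the launch call
kernelStartGpu = 2.045;   % GPU begins executing the kernel
kernelEndGpu   = 2.310;   % GPU finishes the kernel

cpuLaunchDuration = launchEndCpu - launchStartCpu;
gpuKernelDuration = kernelEndGpu - kernelStartGpu;

% Assumption: delay measured from the start of the CPU launch call.
launchDelay = kernelStartGpu - launchStartCpu;

fprintf('CPU launch duration: %.3f ms\n', cpuLaunchDuration);
fprintf('GPU kernel duration: %.3f ms\n', gpuKernelDuration);
fprintf('Launch-to-execution delay: %.3f ms\n', launchDelay);
```

A delay that is large relative to the kernel duration suggests that launch overhead, rather than computation, dominates that event.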

Event Statistics

The Event Statistics panel shows additional information for the selected event. For example, selecting fog_rectification_kernel1 shows the statistics for that kernel.

Insights

The Insights panel gives a pie chart overview of the GPU and CPU activities. The pie chart changes according to the zoom level of the profiling timeline. A snippet of the Insights panel is shown. Within the region selected on the timeline, it shows that the GPU utilization is only 16%.
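
A utilization figure of this kind can be read as the share of the selected time window during which the GPU is busy. The following sketch illustrates that interpretation with hypothetical, non-overlapping activity intervals. It is an assumption about how such a percentage can be derived, not a description of how the report computes it.

```matlab
% Hypothetical selected window and GPU activity intervals, in milliseconds.
window  = [0 10];
gpuBusy = [1.0 1.5;    % each row is [start end]
           4.0 4.6;
           8.0 8.5];

% Clip each interval to the window, then add up the covered time.
clipped  = [max(gpuBusy(:,1), window(1)), min(gpuBusy(:,2), window(2))];
busyTime = sum(max(clipped(:,2) - clipped(:,1), 0));

utilization = 100 * busyTime / (window(2) - window(1));
fprintf('GPU utilization in window: %.1f%%\n', utilization);
```

With these values the GPU is busy for about 1.6 ms of a 10 ms window. Zooming the timeline changes the window, which is why the pie chart changes with the zoom level.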

Call Tree

This section lists the GPU events called from the CPU. Each event in the call tree lists its execution time as a percentage of the caller function. This metric can help you identify performance bottlenecks in the generated code. You can also navigate to specific events on the profiling timeline by clicking the corresponding events in the call tree.
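
The percentage metric is each event's time divided by its caller's time. The following sketch ranks child events by that share. The event names and timings are hypothetical and are not taken from the report in this example.

```matlab
% Hypothetical caller time and child event times, in milliseconds.
callerTime = 12.0;
childNames = {'cudaMalloc', 'fog_rectification_kernel1', 'cudaMemcpy', 'cudaFree'};
childTimes = [4.8 1.2 2.4 3.0];

% Share of the caller's time spent in each child event.
pct = 100 * childTimes / callerTime;

% List the largest contributors first.
[~, order] = sort(pct, 'descend');
for k = order
    fprintf('%-28s %5.1f%%\n', childNames{k}, pct(k));
end
```

The shares do not have to add up to 100%, because the caller can also spend time outside the listed events.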

Filters

This section provides filtering options for the report.

View Mode - Use this option to view profiling results for the entire application (including initialization and terminate) or the design function (without initialization and terminate).
Event Threshold - Skip events shorter than the given threshold.
Memory Allocation/Free - Show GPU device memory allocation and deallocation related events on the CPU activities bar.
Memory Transfers - Show host-to-device and device-to-host memory transfers.
Kernels - Show CPU kernel launches and GPU kernel activities.
Others - Show other GPU related events such as synchronization and waiting for GPU.
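
The event threshold filter amounts to dropping events whose duration falls below a cutoff. The following sketch applies that idea to a hypothetical list of events. The names, durations, and threshold value are illustrative only.

```matlab
% Hypothetical events and their durations in milliseconds.
names       = {'cudaMalloc', 'kernel1', 'cudaMemcpy', 'cudaFree'};
durationsMs = [0.850 0.0004 0.120 0.0008];

% Hypothetical threshold: hide events shorter than 0.001 ms.
thresholdMs = 0.001;

keep = durationsMs >= thresholdMs;
shownNames = names(keep);

fprintf('Showing %d of %d events\n', nnz(keep), numel(names));
disp(shownNames);
```

Raising the threshold declutters the timeline, but it can also hide many short events whose combined time is significant.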

Improving the Performance of the Fog Rectification

The performance analyzer report shows that a significant portion of the execution time is spent on memory allocation and deallocation. To improve the performance, you can turn on the GPU memory manager and run the analysis again.

cfg = coder.gpuConfig('dll','ecoder',true);
cfg.GpuConfig.EnableMemoryManager = true;
gpuPerformanceAnalyzer(designFileName, inputs, ...
    'Config', cfg, 'NumIterations', 2);

### starting gpu code generation
code generation successful: view report
### gpu code generation finished
### starting sil execution for 'fog_rectification'
    to terminate execution: clear fog_rectification_sil
### host application produced the following standard output (stdout) messages:
generating '/tmp/nsys-report-18e6.qdstrm'
[1/1] [0%                          ] mw_nsysdata.nsys-rep
[1/1] [0%                          ] mw_nsysdata.nsys-rep
[1/1] [5%                          ] mw_nsysdata.nsys-rep
[1/1] [===========51%              ] mw_nsysdata.nsys-rep
[1/1] [========================97% ] mw_nsysdata.nsys-rep
[1/1] [========================100%] mw_nsysdata.nsys-rep
generated:
    /home/lnarasim/documents/matlab/examplemanager/lnarasim.bdoc23a.j2174901/gpucoder-ex87489778/mw_nsysdata.nsys-rep
### stopping sil execution for 'fog_rectification'
### starting profiling data processing
### profiling data processing finished
### showing profiling data
