Analyze Performance of the Generated CUDA Code
This example shows you how to analyze and optimize the performance of the generated CUDA® code by using the gpuPerformanceAnalyzer function.
The GPU Coder Performance Analyzer runs a software-in-the-loop (SIL) execution that collects metrics on CPU and GPU activities in the generated code. It presents these metrics in a chronological timeline plot that helps you visualize, identify, and mitigate performance bottlenecks in the generated CUDA code. This example generates the performance analysis report for the fog rectification example from GPU Coder. For more information, see the Fog Rectification example.
Third-Party Prerequisites
CUDA enabled NVIDIA® GPU.
NVIDIA CUDA toolkit and driver.
NVIDIA Nsight™ Systems. For information on the supported versions of the compilers and libraries, see Third-Party Hardware.
Environment variables for the compilers and libraries. For setup instructions, see the GPU Coder setup documentation.
The profiling workflow of this example depends on the profiling tools from NVIDIA that access GPU performance counters. From CUDA Toolkit v10.1, NVIDIA restricts access to performance counters to only admin users. To enable GPU performance counters for all users, see the instructions provided by NVIDIA.
Verify GPU Environment
To verify that the compilers and libraries necessary for running this example are set up correctly, use the coder.checkGpuInstall function.
envCfg = coder.gpuEnvConfig('host');
envCfg.BasicCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);
Fog Rectification Algorithm
To improve the foggy input image, the algorithm performs fog removal and then contrast enhancement. The diagram shows the steps of both these operations.
This example takes a foggy RGB image as input. To perform fog removal, the algorithm estimates the dark channel of the image, calculates the airlight map based on the dark channel, and refines the airlight map by using filters. The restoration stage creates a defogged image by subtracting the refined airlight map from the input image.
Then, the contrast enhancement stage assesses the range of intensity values in the image and uses contrast stretching to expand the range of values and make features stand out more clearly.
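As a minimal sketch of the fog removal stage described above, the following snippet runs the dark channel and restoration steps on a tiny image. The 2-by-2 pixel values are hypothetical and chosen only for illustration, and the filter-based refinement of the airlight map is deliberately skipped, so this is a simplification and not the full algorithm.

```matlab
% Hypothetical 2-by-2 RGB image, already scaled to the range [0,1]
img = cat(3, [0.8 0.6; 0.4 0.9], ...   % red plane
             [0.7 0.5; 0.6 0.9], ...   % green plane
             [0.9 0.4; 0.5 0.8]);      % blue plane

% Dark channel: per-pixel minimum across the three color planes
darkChannel = min(img, [], 3);

% Airlight estimate. The smoothing (refinement) step is omitted here,
% which is a simplification made for this sketch.
airlight = 0.6 * min(darkChannel, 0.9 * darkChannel);

% Restoration: subtract the airlight map from every color plane and
% rescale. The 2-by-2 map expands implicitly across the third dimension.
factor = 1 ./ (1 - airlight);
restored = (img - airlight) .* factor;

disp(darkChannel);
disp(size(restored));
```

Because the airlight map never exceeds the smallest color value at a pixel, the restored values stay within [0,1] for inputs in that range.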
type fog_rectification.m
function [out] = fog_rectification(input) %#codegen
%   Copyright 2017-2019 The MathWorks, Inc.

coder.gpu.kernelfun;

% restoreOut is used to store the output of restoration
restoreOut = zeros(size(input),'double');

% Changing the precision level of input image to double
input = double(input)./255;

%% Dark channel estimation from input
darkChannel = min(input,[],3);

% diff_im is used as input and output variable for anisotropic diffusion
diff_im = 0.9*darkChannel;
num_iter = 3;

% 2D convolution mask for anisotropic diffusion
hN = [0.0625 0.1250 0.0625; 0.1250 0.2500 0.1250; 0.0625 0.1250 0.0625];
hN = double(hN);

%% Refine dark channel using anisotropic diffusion.
for t = 1:num_iter
    diff_im = conv2(diff_im,hN,'same');
end

%% Reduction with min
diff_im = min(darkChannel,diff_im);

diff_im = 0.6*diff_im ;

%% Parallel element-wise math to compute
%  restoration with inverse Koschmieder's law
factor = 1.0./(1.0-(diff_im));
restoreOut(:,:,1) = (input(:,:,1)-diff_im).*factor;
restoreOut(:,:,2) = (input(:,:,2)-diff_im).*factor;
restoreOut(:,:,3) = (input(:,:,3)-diff_im).*factor;
restoreOut = uint8(255.*restoreOut);
restoreOut = uint8(restoreOut);

%%
% Stretching performs the histogram stretching of the image.
% im is the input color image and p is cdf limit.
% out is the contrast stretched image and cdf is the cumulative prob.
% density function and T is the stretching function.

p = 5;
% RGB to grayscale conversion
im_gray = im2gray(restoreOut);
[row,col] = size(im_gray);

% Histogram calculation
[count,~] = imhist(im_gray);
prob = count'/(row*col);

% Cumulative sum calculation
cdf = cumsum(prob(:));

% Finding less than particular probability
i1 = length(find(cdf <= (p/100)));
i2 = 255-length(find(cdf >= 1-(p/100)));

o1 = floor(255*.10);
o2 = floor(255*.90);

t1 = (o1/i1)*[0:i1];
t2 = (((o2-o1)/(i2-i1))*[i1+1:i2])-(((o2-o1)/(i2-i1))*i1)+o1;
t3 = (((255-o2)/(255-i2))*[i2+1:255])-(((255-o2)/(255-i2))*i2)+o2;

T = (floor([t1 t2 t3]));

restoreOut(restoreOut == 0) = 1;

u1 = (restoreOut(:,:,1));
u2 = (restoreOut(:,:,2));
u3 = (restoreOut(:,:,3));

% Replacing the value from look up table
out1 = T(u1);
out2 = T(u2);
out3 = T(u3);

out = zeros([size(out1),3], 'uint8');
out(:,:,1) = uint8(out1);
out(:,:,2) = uint8(out2);
out(:,:,3) = uint8(out3);
return
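The contrast stretching lookup table built near the end of fog_rectification can be examined on its own. The sketch below rebuilds the three linear segments with fixed cut points. The values of i1 and i2 are hypothetical (the function derives them from the image histogram) and are picked here only so that the arithmetic is easy to follow.

```matlab
% Hypothetical percentile cut points, chosen only for illustration.
% In fog_rectification these come from the cumulative histogram.
i1 = 50;    % input level at the lower cdf limit
i2 = 203;   % input level at the upper cdf limit

% Output levels where the three segments meet, as in fog_rectification
o1 = floor(255*0.10);   % 25
o2 = floor(255*0.90);   % 229

% Three linear segments for input levels 0..i1, i1+1..i2, and i2+1..255
t1 = (o1/i1)*(0:i1);
t2 = ((o2-o1)/(i2-i1))*(i1+1:i2) - ((o2-o1)/(i2-i1))*i1 + o1;
t3 = ((255-o2)/(255-i2))*(i2+1:255) - ((255-o2)/(255-i2))*i2 + o2;

% Lookup table with one entry per 8-bit intensity level
T = floor([t1 t2 t3]);

fprintf('Table length: %d\n', numel(T));
fprintf('First entry: %d, entry %d: %d, last entry: %d\n', ...
    T(1), i1+1, T(i1+1), T(256));
```

The segment lengths add up to 256 entries (i1+1, i2-i1, and 255-i2), which is why the second and third index ranges must start at i1+1 and i2+1. MATLAB indices start at 1, which is also why the function replaces zero-valued pixels with 1 before indexing into the table.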
Generate Performance Analyzer Report
To analyze the performance of the generated code using gpuPerformanceAnalyzer, create a code configuration object with a dynamic library ('dll') build type. Because the gpuPerformanceAnalyzer function accepts only an Embedded Coder™ configuration object, enable the option to create a coder.EmbeddedCodeConfig configuration object.
cfg = coder.gpuConfig('dll','ecoder',true);
Run gpuPerformanceAnalyzer with the default iteration count of 2.
inputImage = imread('foggyInput.png');
inputs = {inputImage};
designFileName = 'fog_rectification';
gpuPerformanceAnalyzer(designFileName, inputs, ...
    'Config', cfg, 'NumIterations', 2);
### Starting GPU code generation
Code generation successful: View report
### GPU code generation finished
### Starting SIL execution for 'fog_rectification'
To terminate execution: clear fog_rectification_sil
### Host application produced the following standard output (stdout) messages:
Generating '/tmp/nsys-report-540f.qdstrm'
[1/1] [0% ] mw_nsysdata.nsys-rep
[1/1] [0% ] mw_nsysdata.nsys-rep
[1/1] [===========50% ] mw_nsysdata.nsys-rep
[1/1] [0% ] mw_nsysdata.nsys-rep
[1/1] [7% ] mw_nsysdata.nsys-rep
[1/1] [===========52% ] mw_nsysdata.nsys-rep
[1/1] [========================100%] mw_nsysdata.nsys-rep
[1/1] [========================100%] mw_nsysdata.nsys-rep
Generated: /home/lnarasim/documents/matlab/examplemanager/lnarasim.bdoc23a.j2174901/gpucoder-ex87489778/mw_nsysdata.nsys-rep
### Stopping SIL execution for 'fog_rectification'
### Starting profiling data processing
### Profiling data processing finished
### Showing profiling data
GPU Performance Analyzer
The GPU Performance Analyzer exposes GPU and CPU activities, events, and performance metrics in a chronological timeline plot so that you can visualize, identify, and address performance bottlenecks in the generated CUDA® code.
These numbers are representative. The actual values depend on your hardware setup. This profiling was done using MATLAB R2023a on a machine with a 6-core, 3.5 GHz Intel® Xeon® CPU and an NVIDIA Titan Xp GPU.
Profiling Timeline
The profiling timeline shows the complete trace of all events that have a runtime higher than the threshold value. A snippet of the profiling trace is shown.
You can use the mouse wheel (or an equivalent touchpad option) to zoom into and out of the timeline. Alternatively, you can use the timeline summary at the top of the panel to zoom and navigate the timeline plot.
The tooltip on each event indicates the start time, end time, and duration of the selected event on the CPU and the GPU. It also indicates the time elapsed between the kernel launch on the CPU and the actual execution of the kernel on the GPU.
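The quantities reported in a tooltip relate to each other by simple subtraction. The sketch below uses hypothetical timestamps, not values from this report, and assumes that the launch delay is measured from the start of the CPU launch call to the start of GPU execution.

```matlab
% Hypothetical timestamps in microseconds (not taken from the report)
cpuLaunchStart = 120;   % kernel launch call begins on the CPU
gpuExecStart   = 185;   % kernel starts executing on the GPU
gpuExecEnd     = 430;   % kernel finishes on the GPU

% Time between the CPU launch and the start of GPU execution
% (assumed definition, see the note above)
launchDelay = gpuExecStart - cpuLaunchStart;

% Duration of the kernel on the GPU
kernelDuration = gpuExecEnd - gpuExecStart;

fprintf('Launch delay: %d us, kernel duration: %d us\n', ...
    launchDelay, kernelDuration);
```

A launch delay that is large relative to the kernel duration suggests that launch overhead, and not the kernel itself, dominates that event.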
Event Statistics
The event statistics panel shows additional information for the selected event. For example, the fog_rectification_kernel1 event shows the following statistics:
Insights
The insights panel gives a pie chart overview of the GPU and CPU activities. The pie chart changes according to the zoom level of the profiling timeline. A snippet of the insights panel is shown. Within the region selected on the timeline, it shows that the GPU utilization is only 16%.
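To make a utilization figure such as 16% concrete, the sketch below computes the share of a time window during which the GPU is busy. The event intervals and the window are hypothetical, and the definition used (busy time divided by window length, with no overlapping events) is an assumption for illustration, not a statement of how the analyzer computes its chart.

```matlab
% Hypothetical GPU activity intervals [start end] in microseconds
gpuEvents = [1000 1500;
             3000 3600;
             7000 7500];

% Hypothetical timeline window currently selected, in microseconds
window = [0 10000];

% Total GPU busy time, assuming the intervals do not overlap
busyTime = sum(gpuEvents(:,2) - gpuEvents(:,1));

% Utilization as a percentage of the selected window
utilization = 100 * busyTime / (window(2) - window(1));

fprintf('GPU utilization: %g%%\n', utilization);
```

With these numbers the GPU is busy for 1600 of 10000 microseconds, so the snippet prints a utilization of 16%. Zooming the window in or out changes the denominator, which is consistent with the pie chart changing with the zoom level.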
Call Tree
This section lists the GPU events called from the CPU. Each event in the call tree lists its execution time as a percentage of the caller function's execution time. This metric can help you identify performance bottlenecks in the generated code. You can also navigate to specific events on the profiling timeline by clicking the corresponding events in the call tree.
Filters
This section provides filtering options for the report.
View Mode - Use this option to view profiling results for the entire application (including initialization and termination) or the design function (without initialization and termination).
Event Threshold - Skip events shorter than the given threshold.
Memory Allocation/Free - Show GPU device memory allocation and deallocation related events on the CPU activities bar.
Memory Transfers - Show host-to-device and device-to-host memory transfers.
Kernels - Show CPU kernel launches and GPU kernel activities.
Others - Show other GPU related events such as synchronization and waiting for GPU.
Improving the Performance of the Fog Rectification
The performance analyzer report shows that a significant portion of the execution time is spent on memory allocation and deallocation. To improve the performance, enable the GPU memory manager and run the analysis again.
cfg = coder.gpuConfig('dll');
cfg.GpuConfig.EnableMemoryManager = true;
gpuPerformanceAnalyzer(designFileName, inputs, ...
    'Config', cfg, 'NumIterations', 2);
### Starting GPU code generation
Code generation successful: View report
### GPU code generation finished
### Starting SIL execution for 'fog_rectification'
To terminate execution: clear fog_rectification_sil
### Host application produced the following standard output (stdout) messages:
Generating '/tmp/nsys-report-18e6.qdstrm'
[1/1] [0% ] mw_nsysdata.nsys-rep
[1/1] [0% ] mw_nsysdata.nsys-rep
[1/1] [5% ] mw_nsysdata.nsys-rep
[1/1] [===========51% ] mw_nsysdata.nsys-rep
[1/1] [========================97% ] mw_nsysdata.nsys-rep
[1/1] [========================100%] mw_nsysdata.nsys-rep
Generated: /home/lnarasim/documents/matlab/examplemanager/lnarasim.bdoc23a.j2174901/gpucoder-ex87489778/mw_nsysdata.nsys-rep
### Stopping SIL execution for 'fog_rectification'
### Starting profiling data processing
### Profiling data processing finished
### Showing profiling data
See Also
Functions
- codegen