GPU Memory Allocation and Minimization

Discrete and Managed Modes

GPU Coder™ provides access to two memory allocation (malloc) modes available in the CUDA® programming model: cudaMalloc and cudaMallocManaged. The cudaMalloc API applies to the traditionally separate CPU and GPU global memories. The cudaMallocManaged API applies to unified memory.

From a programmer's point of view, a traditional computer architecture requires that data be allocated and shared between the CPU and GPU memory spaces. The need for applications to manage data transfers between these two memory spaces adds complexity. Unified memory creates a pool of managed memory that is shared between the CPU and the GPU and is accessible to both through a single pointer. Unified memory attempts to optimize memory performance by migrating data to the device that needs it while hiding the migration details from the program. Although unified memory simplifies the programming model, it requires device synchronization calls when data written on the GPU is accessed on the CPU. GPU Coder inserts these synchronization calls. According to NVIDIA®, unified memory can provide significant performance benefits when targeting embedded hardware such as the NVIDIA Tegra®.

To change the memory allocation mode in the GPU Coder app, use the Malloc Mode drop-down box under More Settings > GPU Coder. When using the command-line interface, set the MallocMode build configuration property to either 'discrete' or 'unified'.
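
As a minimal sketch of the command-line route, the snippet below sets the allocation mode on a GPU code configuration object. It assumes GPU Coder is installed and that MallocMode is reached through the GpuConfig member of the object returned by coder.gpuConfig. The entry-point name foo and its input sizes are hypothetical placeholders.

```matlab
% Create a GPU code configuration object for a MEX target.
cfg = coder.gpuConfig('mex');

% Select the allocation mode:
%   'discrete' -> cudaMalloc (separate CPU and GPU memories)
%   'unified'  -> cudaMallocManaged (unified memory)
cfg.GpuConfig.MallocMode = 'discrete';

% Generate code for a hypothetical entry point foo.m that takes
% two 1-by-1024 single-precision inputs (assumed for illustration).
codegen -config cfg -args {ones(1,1024,'single'),ones(1,1024,'single')} foo
```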

Note

In a future release, the unified memory allocation (cudaMallocManaged) mode will be removed when targeting NVIDIA GPU devices on the host development computer. You can continue to use unified memory allocation mode when targeting NVIDIA embedded platforms.

GPU Memory Manager

You can use the GPU memory manager to allocate and manage memory efficiently and to improve run-time performance. The GPU memory manager creates a collection of large GPU memory pools and manages the allocation and deallocation of memory blocks within these pools. Because the pools are large, the memory manager makes fewer calls to the CUDA memory APIs, which improves run-time performance. You can use the GPU memory manager for MEX and standalone CUDA code generation.

To enable the GPU memory manager, use one of these methods:

In a GPU code configuration object, enable the MemoryManager property.
In the GPU Coder app, on the GPU Code tab, select GPU memory manager.
In the Simulink® Configuration Parameters dialog box, on the Code Generation > GPU Code pane, select the Memory manager parameter.

For CUDA code that uses NVIDIA CUDA libraries, such as cuFFT, cuBLAS, and cuSOLVER, you can enable the GPU memory manager for efficient memory allocation and management.

To use memory pools with CUDA libraries, enable the memory manager using one of the methods above and:

In the GPU code configuration object, enable the EnableCUFFT, EnableCUBLAS, or EnableCUSOLVER properties.
In the GPU Coder app, on the GPU Code tab, select Enable cuFFT, Enable cuBLAS, or Enable cuSOLVER.
In the Simulink Configuration Parameters dialog box, on the Code Generation > GPU Code pane, select the cuFFT, cuBLAS, or cuSOLVER parameters.
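
The configuration-object route above might look like the following sketch. It assumes the properties live on the GpuConfig member of the object returned by coder.gpuConfig and that the memory manager switch is named EnableMemoryManager. That spelling is an assumption, so check the property list for your release before relying on it.

```matlab
% Create a GPU code configuration object for a MEX target.
cfg = coder.gpuConfig('mex');

% Turn on the GPU memory manager (property name assumed, see lead-in).
cfg.GpuConfig.EnableMemoryManager = true;

% Let generated calls into the CUDA libraries use the memory pools.
cfg.GpuConfig.EnableCUFFT    = true;
cfg.GpuConfig.EnableCUBLAS   = true;
cfg.GpuConfig.EnableCUSOLVER = true;
```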

Customization Options for GPU Memory Pools

The GPU memory manager provides the additional code configuration parameters listed in the table to manage the allocation and deallocation of memory blocks within GPU memory pools.

Code configuration parameter	Description	Value
In a GPU code configuration object: `BlockAlignment`. In the GPU Coder app: on the GPU Code tab, Block alignment.	Controls the alignment of the blocks. The block sizes (bytes) in the pool are a multiple of the specified value.	Positive integer that is a power of 2. Default value is 256.
In a GPU code configuration object: `FreeMode`. In the GPU Coder app: on the GPU Code tab, Free mode.	Controls when the memory manager frees the GPU device memory. When set to `'never'`, the memory is freed only when the memory manager is destroyed. Use `'atTerminate'` to free empty GPU pools when the `terminate` function is called in the generated code. For MEX targets, memory is freed after every call to the generated MEX function. For other targets, memory is freed when calling the terminate function. When set to `'afterAllocate'`, empty pools are freed after each call to CUDA memory allocate.	`'never'` (default) | `'atTerminate'` | `'afterAllocate'`
In a GPU code configuration object: `MinPoolSize`. In the GPU Coder app: on the GPU Code tab, Minimum pool size.	Specify the minimum pool size in megabytes (MB).	Positive integer that is a power of 2. Default value is 8.
In a GPU code configuration object: `MaxPoolSize`. In the GPU Coder app: on the GPU Code tab, Maximum pool size.	Specify the maximum pool size in megabytes (MB). The memory manager computes the size levels using the `MinPoolSize` and `MaxPoolSize` parameters by interpolating between the two values in increasing powers of 2. For example, if `MinPoolSize` is 4 and `MaxPoolSize` is 1024, the size levels are {4, 8, 16, 32, 64, 128, 256, 512, 1024}.	Positive integer that is a power of 2. Default value is 2048.
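
The size-level rule in the last row can be reproduced with a few lines of plain MATLAB. This is only an illustration of the arithmetic described in the table, not the memory manager's actual implementation. The variable names are chosen for the example.

```matlab
% Pool size bounds in MB; both are assumed to be powers of 2.
minPoolSize = 4;
maxPoolSize = 1024;

% Every power of 2 from minPoolSize up to maxPoolSize, inclusive.
sizeLevels = 2.^(log2(minPoolSize):log2(maxPoolSize));

disp(sizeLevels)
```

With the bounds 4 and 1024, sizeLevels holds the nine values given in the table's example.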

Memory Minimization

GPU Coder analyzes the data dependency between CPU and GPU partitions and performs optimizations to minimize the number of cudaMemcpy function calls in the generated code. The analysis also determines the minimum set of locations where data must be copied between the CPU and GPU by using cudaMemcpy.

For example, the function foo has sections of code that process data sequentially on the CPU and in parallel on the GPU.

function [out] = foo(input1,input2)
    …
    % CPU work
    input1 = …
    input2 = …
    tmp1 = …
    tmp2 = …
    …
    % GPU work
    kernel1(gpuinput1, gputmp1);
    kernel2(gpuinput2, gputmp1, gputmp2);
    kernel3(gputmp1, gputmp2, gpuout);
    …
    % CPU work
    … = out
end

An unoptimized CUDA implementation can potentially have multiple cudaMemcpy function calls to transfer the inputs gpuinput1 and gpuinput2 and the temporary results gputmp1 and gputmp2 between kernel calls. Because the intermediate results gputmp1 and gputmp2 are not used outside the GPU, they can stay in GPU memory, which results in fewer cudaMemcpy function calls. These optimizations improve the overall performance of the generated code. The optimized implementation is:

gpuinput1 = input1;
gpuinput2 = input2;
kernel1<<< >>>(gpuinput1, gputmp1);
kernel2<<< >>>(gpuinput2, gputmp1, gputmp2);
kernel3<<< >>>(gputmp1, gputmp2, gpuout);
out = gpuout;

To eliminate redundant cudaMemcpy calls, GPU Coder analyzes all uses and definitions of a given variable and uses status flags to perform minimization. An example of original code and the corresponding optimized generated code is shown below.

Original code:

a(:) = …
…
for i = 1:n
   gb = kernel1(ga);
   ga = kernel2(gb);
   if (somecondition)
      gc = kernel3(ga, gb);
   end
   …
end
…
… = c;

Optimized generated code:

a(:) = …
a_isdirtyoncpu = true;
…
for i = 1:n
   if (a_isdirtyoncpu)
      ga = a;
      a_isdirtyoncpu = false;
   end
   gb = kernel1(ga);
   ga = kernel2(gb);
   if (somecondition)
      gc = kernel3(ga, gb);
      c_isdirtyongpu = true;
   end
   …
end
…
if (c_isdirtyongpu)
   c = gc;
   c_isdirtyongpu = false;
end
… = c;

The _isdirtyoncpu flag tells the GPU Coder memory optimization about routines where the given variable is declared and used either on the CPU or on the GPU.
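
The effect of such a flag can be imitated in plain MATLAB by counting transfers. The sketch below is a CPU-only simulation under stated assumptions: the assignment ga = a stands in for a host-to-device cudaMemcpy, the doubling stands in for a kernel, and the counter names are invented for the example.

```matlab
n = 5;                      % loop iterations
a = [1 2 3 4];              % data last written on the CPU
a_isdirtyoncpu = true;      % CPU copy is newer than the GPU copy

copiesnaive = 0;            % transfers an unguarded loop would issue
copiesflagged = 0;          % transfers issued when guarded by the flag

for i = 1:n
    copiesnaive = copiesnaive + 1;
    if a_isdirtyoncpu
        ga = a;                             % stands in for cudaMemcpy (host to device)
        copiesflagged = copiesflagged + 1;
        a_isdirtyoncpu = false;             % GPU copy is now current
    end
    ga = 2 * ga;                            % stands in for a kernel updating ga on the GPU
end

fprintf('unguarded: %d transfers, guarded: %d transfer\n', copiesnaive, copiesflagged);
```

Because a is never rewritten on the CPU inside the loop, the guarded version transfers it once, while the unguarded count grows with n.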
