GPU Programming Paradigm
GPU-accelerated computing follows a heterogeneous programming model. Highly parallelizable portions of the software application are mapped into kernels that execute on the physically separate GPU device, while the remaining sequential code still runs on the CPU. Each kernel is allocated several workers, or threads, which are organized in blocks and grids. The threads within a kernel execute concurrently with respect to one another.
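As a rough illustration of how threads, blocks, and grids relate, the following hand-written CUDA sketch adds two vectors with one thread per element. It is not output from GPU Coder. The names addVectors, blocksFor, and addOnGpu are hypothetical, the block size of 256 is an arbitrary choice, and managed (unified) memory is used only to keep the host code short.

```cuda
#include <cuda_runtime.h>

// Kernel: each thread handles one element of the vectors.
__global__ void addVectors(const float* a, const float* b, float* c, int n)
{
    // Global index = offset of this block in the grid + offset of this thread in the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may be only partially used
        c[i] = a[i] + b[i];
    }
}

// Smallest number of blocks such that blocks * threadsPerBlock >= n.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical host-side driver. Assumes n > 0.
// Returns true if every CUDA call succeeded.
bool addOnGpu(const float* hostA, const float* hostB, float* hostC, int n)
{
    float *a = nullptr, *b = nullptr, *c = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    // Managed memory is accessible from both the CPU and the GPU.
    bool ok = cudaMallocManaged(&a, bytes) == cudaSuccess
           && cudaMallocManaged(&b, bytes) == cudaSuccess
           && cudaMallocManaged(&c, bytes) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) { a[i] = hostA[i]; b[i] = hostB[i]; }

        const int threadsPerBlock = 256;  // arbitrary choice for this sketch
        addVectors<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(a, b, c, n);

        // Kernel launches are asynchronous: wait before the CPU reads c.
        ok = cudaDeviceSynchronize() == cudaSuccess;
        if (ok) {
            for (int i = 0; i < n; ++i) hostC[i] = c[i];
        }
    }
    cudaFree(a); cudaFree(b); cudaFree(c);  // cudaFree(nullptr) is a no-op
    return ok;
}
```

For 1000 elements and 256 threads per block, blocksFor yields a grid of 4 blocks, so 1024 threads are launched and the bounds check in the kernel idles the last 24.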
The objective of GPU Coder™ is to take a sequential MATLAB® program and generate partitioned, optimized CUDA® code from it. This process involves:
CPU/GPU partitioning: identifying the segments of code that run on the CPU and the segments that run on the GPU. For the different ways GPU Coder identifies CUDA kernels, see Kernel Creation. Memory transfer costs between the CPU and GPU are a significant consideration in the kernel creation algorithm.
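To make the idea of partitioning concrete, here is a hand-written sketch, not GPU Coder output and not a description of its kernel creation algorithm. It assumes a hypothetical workload with an element-wise step (independent iterations, placed in a kernel) followed by a recurrence in which each step depends on the previous one (kept on the CPU). The names squareElements, runningRecurrence, and process are illustrative only.

```cuda
#include <cuda_runtime.h>
#include <vector>

// GPU partition: element-wise work with no dependence between iterations.
__global__ void squareElements(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

// CPU partition: each step depends on the previous one, so it stays sequential.
float runningRecurrence(const std::vector<float>& v)
{
    float acc = 0.0f;
    for (float x : v) acc = 0.5f * acc + x;
    return acc;
}

// Hypothetical driver. Assumes a non-empty input.
// Returns false if any CUDA call fails.
bool process(const std::vector<float>& input, float* result)
{
    const int n = static_cast<int>(input.size());
    const size_t bytes = input.size() * sizeof(float);
    std::vector<float> squared(input.size());

    float *dIn = nullptr, *dOut = nullptr;
    bool ok = cudaMalloc(&dIn, bytes) == cudaSuccess
           && cudaMalloc(&dOut, bytes) == cudaSuccess
           && cudaMemcpy(dIn, input.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;  // arbitrary block size for this sketch
        squareElements<<<(n + threads - 1) / threads, threads>>>(dIn, dOut, n);
        // On the default stream, this copy waits for the kernel to finish first.
        ok = cudaMemcpy(squared.data(), dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn); cudaFree(dOut);

    if (ok) *result = runningRecurrence(squared);
    return ok;
}
```

The two cudaMemcpy calls are the price of crossing the CPU/GPU boundary. For a step this cheap, the copies can easily cost more than the kernel saves, which is why transfer cost matters when deciding what to offload.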
After kernel partitioning is complete, GPU Coder analyzes the data dependency between the CPU and GPU partitions. Data that is shared between the CPU and GPU is allocated in GPU memory (by using the cudaMalloc or cudaMallocManaged APIs). The analysis also determines the minimum set of locations where data has to be copied between the CPU and GPU by using cudaMemcpy. If unified memory is used in CUDA, the same analysis pass also determines the minimum set of locations in the code where cudaDeviceSynchronize calls must be inserted to get the correct functional behavior.
Next, within each kernel, GPU Coder can choose to map data to shared memory or constant memory. These memories are part of the GPU memory hierarchy and, if used wisely, can result in greater memory bandwidth. For information on how GPU Coder chooses to map to shared memory, see Stencil Processing. For information on how GPU Coder chooses to map to constant memory, see .
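The sketch below shows, under stated assumptions, how these pieces can fit together in hand-written CUDA: explicit allocation with cudaMalloc, transfers with cudaMemcpy, filter coefficients in constant memory, and an input tile staged in shared memory. It is an illustrative three-point stencil with zero padding at the boundaries, not code generated by GPU Coder. The names stencil3, runStencil, coeff, and TILE are hypothetical.

```cuda
#include <cuda_runtime.h>

#define TILE 256  // threads per block; also the width of the shared-memory tile

// Coefficients are read-only and identical for every thread: a fit for constant memory.
__constant__ float coeff[3];

__global__ void stencil3(const float* in, float* out, int n)
{
    // One tile of the input plus one halo element on each side.
    __shared__ float tile[TILE + 2];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                        // local index, past the left halo

    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x == 0)               // first thread also loads the left halo
        tile[0] = (g >= 1 && g - 1 < n) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)  // last thread also loads the right halo
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();  // all loads must finish before neighbors are read

    if (g < n)
        out[g] = coeff[0] * tile[l - 1] + coeff[1] * tile[l] + coeff[2] * tile[l + 1];
}

// Hypothetical host wrapper. Assumes n > 0.
// Returns false if any CUDA call fails.
bool runStencil(const float* hostIn, float* hostOut, int n, const float hostCoeff[3])
{
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;

    bool ok = cudaMalloc(&dIn, bytes) == cudaSuccess
           && cudaMalloc(&dOut, bytes) == cudaSuccess
           && cudaMemcpyToSymbol(coeff, hostCoeff, 3 * sizeof(float)) == cudaSuccess
           && cudaMemcpy(dIn, hostIn, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // The kernel requires exactly TILE threads per block.
        stencil3<<<(n + TILE - 1) / TILE, TILE>>>(dIn, dOut, n);
        // On the default stream, this copy waits for the kernel to finish first.
        ok = cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn); cudaFree(dOut);
    return ok;
}
```

Each input element is read from global memory roughly once per block and then reused by up to three threads from shared memory, which is where the potential bandwidth gain comes from. With unified memory, cudaMallocManaged would replace the cudaMalloc and cudaMemcpy pairs, and a cudaDeviceSynchronize call would be needed before the CPU reads the output.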
Once the partitioning and the memory allocation and transfer statements are in place, GPU Coder generates CUDA code that follows those partitioning and memory allocation decisions. The generated source code can be compiled into a MEX target to be called from within MATLAB or into a shared library to be integrated with an external project. For information, see .