CUDA’s parallel programming model is designed to overcome the many challenges of parallel programming while providing a quick learning curve for programmers familiar with C. At its core are three abstractions: a hierarchy of thread groups, shared memory, and thread synchronization. These abstractions are exposed to the programmer via a small set of language extensions.
These abstractions provide fine-grained data and thread parallelism, nested within coarse-grained data and task parallelism. They require the programmer to partition the problem into coarse-grained sub-problems that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel. This decomposition allows the CUDA architecture to scale easily to graphics cards of varying processor count and memory capacity.
Nvidia provides two interfaces for writing CUDA programs: C for CUDA and the CUDA driver API. Developers must choose which one to use for a particular application because their usage is mutually exclusive. A CUDA application sits on top of the CUDA libraries, runtime, and driver. The driver provides the actual interface to the GPU.
C for CUDA provides a minimal set of extensions to the C language. Any source file that contains these extensions must be compiled with nvcc (Nvidia's CUDA compiler). These extensions allow programmers to define a kernel (a module that exhibits data parallelism) as a C function. C for CUDA is a higher level of abstraction than the CUDA driver API, and therefore easier to use, so I will limit our discussion to C for CUDA.
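As a sketch of what these extensions look like, here is a minimal kernel (the names `vecAdd`, `a`, `b`, `c`, and `n` are my own, chosen for illustration):

```cuda
// A kernel is an ordinary C function marked with the __global__
// qualifier. It runs on the device, once per thread.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Each thread computes one element of the result vector.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
```

The host launches it with the `<<<...>>>` execution-configuration syntax, e.g. `vecAdd<<<numBlocks, threadsPerBlock>>>(a, b, c, n);`. Because of these extensions, the file must be compiled with nvcc rather than a plain C compiler.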
A CUDA program typically consists of one or more logical modules that are executed on either the host (CPU) or a device (GPU). The modules that exhibit little or no data parallelism are typically implemented in host code. The modules that exhibit data parallelism are implemented in device code (kernels). The program supplies a single set of source files containing both host and device code. The host code is ANSI C and is compiled with the host's standard C compiler. The device code is written in ANSI C extended with keywords for labeling data-parallel functions (kernels) and their data structures. At compile time nvcc separates host and device code, offloading compilation of the host code to the native compiler (gcc on Linux) and compiling the device code to a format suitable for execution on the device (GPU). At link time the device binary is linked into the host binary as a data array.
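To illustrate the single-source model, a file like the following (the file name and identifiers are mine) holds both kinds of code side by side; nvcc splits them apart at compile time:

```cuda
// example.cu -- host and device code live in the same source file.
#include <stdio.h>

// Device code: __global__ marks this as a kernel for nvcc to compile.
__global__ void fill(int *out)
{
    out[threadIdx.x] = threadIdx.x;  // each thread writes its own index
}

// Host code: plain ANSI C, handed off to the native compiler (gcc on Linux).
int main(void)
{
    printf("host and device code in one file\n");
    return 0;
}
```

A file like this would be built with something along the lines of `nvcc example.cu -o example`, producing a single host binary with the device binary embedded in it.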
At runtime a CUDA program will typically copy data from host memory to device memory, launch a kernel, and copy the results back from device memory to host memory when the kernel has finished. When a kernel is launched, a grid (a multidimensional array) of thread blocks (each itself a multidimensional array of threads) is created on the device (GPU). This is just a fancy way of saying that device resources are allocated to the kernel (more on this later).
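That copy-launch-copy pattern can be sketched as follows (the kernel `scale` and all buffer names are hypothetical, not from the text):

```cuda
// Device code: multiply every element by a constant.
__global__ void scale(float *d, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] *= k;
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float h[1024];                       // host buffer
    for (int i = 0; i < n; ++i)
        h[i] = (float)i;

    float *d;                            // device buffer
    cudaMalloc((void **)&d, bytes);

    // 1. Copy input from host memory to device memory.
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // 2. Launch the kernel: a grid of thread blocks is created on the GPU.
    scale<<<n / 256, 256>>>(d, 2.0f, n);

    // 3. Copy the results back once the kernel has finished.
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d);
    return 0;
}
```

Here the grid is one-dimensional (4 blocks of 256 threads), but grids and blocks can have up to three dimensions.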
This is the simplest use case. On systems with multiple GPUs your application may have multiple threads, each interacting with a specific GPU. Your application may consist of multiple kernels, each used to solve a specific part of the algorithm. Your application may also execute the same kernel multiple times throughout its lifetime.
When the host code launches the kernel, the device binary is pulled out of the data array and copied to the device. Launching a kernel is an asynchronous operation: control immediately returns to the host code. This allows applications to run host code and device code in parallel. When the host code makes a call to copy back the kernel's results, it will either return the results immediately if the kernel has finished, or block until the kernel finishes.
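A small sketch of this overlap, with a stand-in kernel and host function of my own invention:

```cuda
__global__ void heavyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = d[i] * d[i];              // stand-in for real device work
}

void doIndependentHostWork(void)
{
    /* CPU work that does not depend on the kernel's results */
}

void run(float *d_data, float *h_result, int n)
{
    // The launch returns immediately; host and device now run in parallel.
    heavyKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    doIndependentHostWork();             // overlaps with kernel execution

    // cudaMemcpy waits for the kernel to finish before copying, so the
    // host blocks here only if the kernel is still running.
    cudaMemcpy(h_result, d_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);
}
```

If you need to wait for the device without copying anything, an explicit synchronization call (e.g. `cudaDeviceSynchronize()`) serves the same purpose.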
Time to write some code... Matrix Multiplication 1