Thursday, September 17, 2009

OpenCL Program Structure

OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other parallel computing devices. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus APIs used to define and control the platforms. OpenCL provides both task-based and data-based parallelism, although the specification does not require an implementation to support task-based parallelism in order to be compliant.

For the purposes of this discussion I will focus on data-based parallelism, since that is what GPUs support. I will also focus on Nvidia’s implementation, since at this point in time that is where you are going to get the most bang for your buck. I will assume that you have already gone through our CUDA-based examples and will therefore be describing OpenCL in terms of CUDA. If you haven’t gone through the CUDA examples, you should do so before reading further.

As I mentioned in my CUDA program structure post, Nvidia offers two different APIs: C for CUDA and the CUDA driver API. I did not cover Nvidia’s driver API because it is more complicated than C for CUDA. Well, as luck would have it, OpenCL is very similar to Nvidia’s driver API, which means it is more complicated to write OpenCL-based applications than it is to write C for CUDA-based applications. So why do we want to write OpenCL-based applications if it’s going to be more difficult? Well… as I mentioned in my GPGPU (A Historical Perspective) post, OpenCL is the future. With OpenCL your code will be capable of running on many different computing devices (CPUs, GPUs, Cell Processors, Larrabee, and whatever the next “big thing” is) without being modified. So if you don’t like being tied to a single vendor, or being forced to port your code to the next “big thing”, OpenCL will be the way to go. I say will be: as of today, no OpenCL implementations have been released. AMD and Nvidia are both working on implementations for their processors, but both are still in beta. All of the following examples have been built and tested against Nvidia’s beta OpenCL.

So how much more complicated is OpenCL? The added complexity lies in the host code that must be written to control the kernel code. C for CUDA provided you with a compiler for the kernel code, and the linker took care of the rest. With OpenCL you must write host code that either locates a prebuilt kernel binary and loads it onto the device, or finds the kernel source, compiles it to a binary at runtime, and loads that onto the device. That’s right… no compiler. The OpenCL API contains compiler functions, but there is no standalone compiler. You can, however, build a compiler on top of those functions (see my OpenCL Compiler (oclcc) example). Below is the OpenCL application software stack.

[Figure: The OpenCL application software stack]
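To make that concrete, here is a minimal sketch of compiling kernel source at runtime with those compiler functions. The helper name build_from_source is mine, not an OpenCL call, and I assume a context (ctx), a device (dev), and the kernel source text (src) already exist; error handling is reduced to printing the build log.

    #include <stdio.h>
    #include <CL/cl.h>

    /* Hypothetical helper: turn kernel source text into a program
     * object that is compiled and ready to use on the given device. */
    cl_program build_from_source(cl_context ctx, cl_device_id dev,
                                 const char *src)
    {
        cl_int err;

        /* Wrap the source text in a program object. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);

        /* Compile and link it for the target device. */
        err = clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        if (err != CL_SUCCESS) {
            /* On failure, fetch and print the compiler's build log. */
            char log[4096];
            clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG,
                                  sizeof(log), log, NULL);
            fprintf(stderr, "build failed:\n%s\n", log);
        }
        return prog;
    }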


OpenCL kernel code itself is conceptually identical to C for CUDA kernel code, and the parallel concepts are very similar. With C for CUDA we had a grid of thread blocks that contained multiple threads.

[Figure: A CUDA grid of thread blocks, each containing multiple threads]


With OpenCL we have an NDRange (N-Dimensional Range) of work groups that contain multiple work items.

[Figure: An OpenCL NDRange of work groups, each containing multiple work items]
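To see how little the kernel code itself changes, here is a trivial vector-add kernel in OpenCL C (a hypothetical example, not from either SDK). The main difference from its C for CUDA counterpart is how each work item finds its global index: instead of computing it from blockIdx, blockDim, and threadIdx yourself, you ask the runtime with get_global_id.

    /* vecadd.cl -- each work item adds one element. In C for CUDA the
     * global index would be computed by hand:
     *     int i = blockIdx.x * blockDim.x + threadIdx.x;
     * in OpenCL the runtime hands it to you: */
    __kernel void vecadd(__global const float *a,
                         __global const float *b,
                         __global float *c)
    {
        int i = get_global_id(0);   /* global work-item index, dimension 0 */
        c[i] = a[i] + b[i];
    }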


With OpenCL you must create an OpenCL context and associate devices, kernels, program objects, memory objects, and a command queue with that context. All of this is done in host code using the OpenCL APIs. The host code then interacts with the device by placing commands on the command queue. To launch the kernel you simply put a launch command on the queue. To retrieve your results you put a memory copy command on the queue, requesting that the device memory containing your results be copied back to host memory. Since commands on the default (in-order) queue execute in submission order, a blocking copy will not return until the kernel ahead of it has finished and the results have arrived in host memory.
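Here is a condensed host-side sketch of that whole sequence, reusing the vecadd kernel from above. I assume the device (dev), the built program object (prog), the element count (n), and the host arrays (h_a, h_b, h_c) already exist, and I have dropped all error checking.

    cl_int err;

    /* Create the context and a command queue for the device. */
    cl_context       ctx   = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, &err);

    /* Memory objects belong to the context; copy the inputs up front. */
    cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), h_a, &err);
    cl_mem d_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), h_b, &err);
    cl_mem d_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                n * sizeof(float), NULL, &err);

    /* Bind the kernel arguments and put a launch command on the queue. */
    cl_kernel kernel = clCreateKernel(prog, "vecadd", &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);

    size_t global = n;   /* total work items in a 1-D NDRange */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           0, NULL, NULL);

    /* A blocking read (CL_TRUE) returns only after the launch ahead of
     * it in the queue has finished and the results are in host memory. */
    clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, n * sizeof(float), h_c,
                        0, NULL, NULL);

The CL_TRUE flag is what makes the read blocking; pass CL_FALSE instead and the call returns immediately, leaving you to synchronize through the event argument.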

Time to write some code... Matrix Multiplication 1