With CUDA we were able to make some assumptions about the sizes and speeds of the various types of memory. They weren't really assumptions, though, since we knew the exact details of the underlying hardware that CUDA applications run on. With OpenCL we can't easily make such assumptions. The OpenCL specification identifies the memory constructs that must be supported from the API perspective, but OpenCL driver providers have to map that specification onto their underlying hardware. For example, the amount and speed of global memory on a Cell processor vs. an Nvidia GPU can vary significantly. While your OpenCL code is guaranteed to compile and run on different underlying devices, you should expect to do some fine tuning to achieve the best performance on each platform. The types of memory provided by the OpenCL specification are:
Global Memory (CUDA Global): This memory region permits read/write access to all work-items in all work-groups. Work-items can read from or write to any element of a memory object. Reads and writes to global memory may be cached depending on the capabilities of the device.
Constant Memory (CUDA Constant): A region of global memory that remains constant during the execution of a kernel. The host allocates and initializes memory objects placed into constant memory.
Local Memory (CUDA Shared): A memory region local to a work-group. This memory region can be used to allocate variables that are shared by all work-items in that work-group. It may be implemented as dedicated regions of memory on the OpenCL device. Alternatively, the local memory region may be mapped onto sections of the global memory.
Private Memory (CUDA Register): A region of memory private to a work-item. Variables defined in one work-item’s private memory are not visible to another work-item.
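The four regions correspond directly to address-space qualifiers in OpenCL C. The following kernel sketch (the names are hypothetical, purely for illustration) shows where each one appears:

```c
// Illustrative OpenCL C kernel touching all four memory regions.
__kernel void example(__global   float *data,    // global: read/write by all work-items
                      __constant float *coeffs,  // constant: read-only during kernel execution
                      __local    float *scratch) // local: shared within one work-group
{
    float acc = 0.0f;                       // private: visible only to this work-item
    int gid = get_global_id(0);
    scratch[get_local_id(0)] = data[gid];   // stage a value in local memory
    barrier(CLK_LOCAL_MEM_FENCE);           // make local writes visible to the group
    acc = scratch[get_local_id(0)] * coeffs[0];
    data[gid] = acc;
}
```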
| Memory | Access Speed | Visibility | Access Type |
|---|---|---|---|
| Global | Slower | All Work Items | D(R,W) H(R,W) |
| Constant | Slower | All Work Items | D(R) H(R,W) |
| Local | Faster | Work Group | D(R,W) H(None) |
| Private | Faster | Processing Element | D(R,W) H(None) |
While it can be dangerous to make assumptions regarding the speed of the different types, I think it is safe to assume that local and private memory will be faster than global and constant memory across all OpenCL devices. How much faster? Impossible to say in general.
Now that we know a little about the OpenCL memory hierarchy, how can we use this information to speed up our matrix multiplication? Our first two matrix multiplication programs copied the input data into global device memory (that's what clCreateBuffer( ) returns) and the kernel then read it out of global memory. From the information above we know that global device memory is slower than local memory, so we should try to limit our use of it. By changing our kernel to copy large blocks of data from global memory into local memory, and then process the data from local memory instead of global memory, we should be able to speed up the overall computation.
The CUDA Matrix Multiplication 3 example explains how we need to decompose our calculation into sub problems that are further decomposed into sub problems so that they can fit into the limited local (shared) memory on the device. You might want to revisit that explanation to refresh your memory.
So now that we know what changes we need to make, let's put together some code. The main program from OpenCL Matrix Multiplication 2 should work as is. Again, I removed the printfs.
Main program (Listing 1)
The kernel is conceptually identical to the kernel we put together for the CUDA Matrix Multiplication 3 example. All we really need to do is make syntactic changes.
Kernel (Listing 2)
We changed __global__ to __kernel to indicate that the function is to be run on the device, and added the __global qualifier to designate that the input matrices are in global memory. The references to blockIdx and threadIdx are converted to calls to get_global_id( ) and get_local_id( ). The shared memory arrays become local memory arrays, and the calls to __syncthreads( ) become calls to barrier( ).
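Putting those conversions together, the tiled kernel looks roughly like this (BLOCK_SIZE and the argument names are assumptions for this sketch; the actual listing may differ in detail):

```c
#define BLOCK_SIZE 16  // hypothetical tile width

// Sketch of the converted kernel: C = A * B for square wA x wA matrices.
__kernel void matrixMul(__global float *C, __global float *A,
                        __global float *B, int wA)
{
    // __local arrays replace the CUDA __shared__ arrays.
    __local float As[BLOCK_SIZE][BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int tx  = get_local_id(0);          // was threadIdx.x
    int ty  = get_local_id(1);          // was threadIdx.y
    int col = get_global_id(0);         // was blockIdx.x * blockDim.x + threadIdx.x
    int row = get_global_id(1);

    float acc = 0.0f;
    for (int t = 0; t < wA / BLOCK_SIZE; t++) {
        // Each work-item stages one element of each tile in local memory.
        As[ty][tx] = A[row * wA + t * BLOCK_SIZE + tx];
        Bs[ty][tx] = B[(t * BLOCK_SIZE + ty) * wA + col];
        barrier(CLK_LOCAL_MEM_FENCE);   // was __syncthreads()

        for (int k = 0; k < BLOCK_SIZE; k++)
            acc += As[ty][k] * Bs[k][tx];
        barrier(CLK_LOCAL_MEM_FENCE);   // wait before overwriting the tiles
    }
    C[row * wA + col] = acc;
}
```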
Most of the above changes are self-explanatory, with the exception of the call to barrier( ). The barrier( ) function is identical in function to CUDA's __syncthreads( ). All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier, and it must be encountered by all work-items in the work-group. If barrier( ) is inside a conditional statement, then all work-items must enter the conditional if any work-item enters it and executes the barrier. If barrier( ) is inside a loop, all work-items must execute it for each iteration of the loop before any are allowed to continue execution beyond it.
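A minimal sketch of the rule (illustrative OpenCL C, not from the listing):

```c
// Legal: every work-item reaches the barrier on every iteration.
for (int t = 0; t < numTiles; t++) {
    tile[get_local_id(0)] = src[t * TILE + get_local_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    /* ... all work-items read from tile here ... */
    barrier(CLK_LOCAL_MEM_FENCE);
}

// Broken: work-items that skip the branch never reach the barrier,
// so the behavior of the work-group is undefined (often a hang).
if (get_local_id(0) < 8) {
    barrier(CLK_LOCAL_MEM_FENCE);  // NOT executed by all work-items
}
```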
So let's take this version out for a test drive. Running it yields a 61X speedup over the CPU version. If we look back at our Matrix Multiplication 2 example, we see that it was 57X faster than the CPU version. Now you may be thinking that this is an awful lot of work just to squeeze out an extra 4X... and perhaps it is. But let's change the size of our matrices to 2048 and see what we "see".
If we run our Matrix Multiplication 2 example, which uses only global memory, with 2048 x 2048 matrices, we find that the GPU version is 279X faster than the CPU version of the algorithm. Our local memory implementation of the GPU version is 370X faster than the CPU version. Looks like we really did get some "bang for our buck" after all!