I will assume that you have gone through the CUDA Matrix Multiplication 2 example and understand the conceptual changes that we will be making to our OpenCL kernel. All we really need to do is express our kernel from CUDA Matrix Multiplication 2 in terms of OpenCL and slightly modify our main program from our OpenCL Matrix Multiplication 1 example to account for the different work group and NDRange sizes.
So when we launch our kernel we want the GPU resources allocated something like the image below:
As you can see we want a 64 by 64 NDRange of 16 by 16 work groups. This will only require us to change a couple of lines in our main program.
Main program (Listing 1)
So we set our localWorkSize to 16 by 16 and our globalWorkSize to 1024 by 1024. Why 1024 and not 64? Remember that in OpenCL the globalWorkSize is expressed in terms of the total number of work items, not the number of work groups. The underlying OpenCL runtime divides the globalWorkSize by the localWorkSize to arrive at a 64 by 64 NDRange of 16 by 16 work groups.
So what are the changes that we need to make to our CUDA Matrix Multiplication 2 kernel code?
OpenCL Kernel (Listing 1)
We changed __global__ to __kernel to indicate that the function is to be run on the device, and we added the __global qualifier to designate that the input matrices reside in global memory. If we look at the old CUDA code that is commented out, we see that it took the block index, multiplied it by TILE_SIZE (which was 16), and then added the thread's index within the block. Well, as luck would have it, that is exactly what the call to get_global_id( ) returns.
Now cut / copy / paste your way to a running binary (don't forget to replace "kernel.cl" in the call to oclLoadProgSource( ) with the complete path to your kernel). If you insert timing code you will find that this OpenCL example is about 46X faster than the CPU version.
Before we move on to Matrix Multiplication 3, let's take our OpenCL compiler oclcc out for a spin. First you are going to need to cut / copy / paste the compiler source and build oclcc. Once you have an oclcc binary in your bin tree, all you need to do to compile a kernel is:
$ oclcc /path/to/kernel/source/kernel.cl -o kernel.ptx
This will compile your source to Nvidia's ptx assembly. Pass --help on the oclcc command line to get a usage message. Since our kernel is now in ptx assembly, we will need to change our main program slightly.
Main program (Listing 2)
Since we have already compiled our OpenCL code to ptx assembly, we no longer need the calls to open the kernel source file or to create the program from source, so we just comment them out. We replace these calls with code to open the ptx file and load it into a buffer, and we then call clCreateProgramWithBinary( ). Notice that we still must make a call to clBuildProgram( ). Remember that Nvidia's implementation of OpenCL compiles to an intermediate form (ptx assembly), so we still need to compile the assembly to a binary capable of running on the card.
Now if we cut / copy / paste our way to a binary, we will find that the GPU version is now 57X faster than the CPU version of the program. So by precompiling the kernel, the speedup over the CPU baseline improves by an additional 11X (from 46X to 57X). Not bad for changing fewer than 25 lines of code, but we can still go faster...
On to OpenCL Matrix Multiplication 3