## Tuesday, September 22, 2009

### Matrix Multiplication 1 (OpenCL)

I will assume that you have already downloaded and installed the appropriate CUDA driver, toolkit and SDK from Nvidia. You can get your hands on Nvidia's beta OpenCL at http://www.nvidia.com/object/cuda_opencl.html. Since we have been working with matrix multiplication in CUDA let’s do the same with OpenCL. We will put together a trivial example of multiplying two 3 X 3 matrices together using OpenCL just like we did with C for CUDA.

The basics of what we need to do has not changed... we need to allocate memory for the two matrices that we are multiplying together… let’s call them A and B. Then we need to copy in the data to the memory that you have allocated. Then we might want to print A and B out. We also need to allocate memory for the results… let’s call it matrix C. The next step would be to perform the multiplication and finally you might want to print out the results and free up the memory you allocated. The main program starts out identically to our C for CUDA version.

Main program (Listing 1)

`// Multiply two matrices A * B = C #include <stdlib.h>#include <stdio.h>#include <math.h> #define WA 3#define HA 3#define WB 3#define HB 3#define WC 3#define HC 3  // Allocates a matrix with random float entries.void randomInit(float* data, int size){   for (int i = 0; i < size; ++i)   data[i] = rand() / (float)RAND_MAX;} /////////////////////////////////////////////////////////// Program main///////////////////////////////////////////////////////// intmain(int argc, char** argv){    // set seed for rand()   srand(2006);    // 1. allocate host memory for matrices A and B   unsigned int size_A = WA * HA;   unsigned int mem_size_A = sizeof(float) * size_A;   float* h_A = (float*) malloc(mem_size_A);    unsigned int size_B = WB * HB;   unsigned int mem_size_B = sizeof(float) * size_B;   float* h_B = (float*) malloc(mem_size_B);    // 2. initialize host memory   randomInit(h_A, size_A);   randomInit(h_B, size_B);     // 3. print out A and B   printf("\n\nMatrix A\n");   for(int i = 0; i < size_A; i++)   {      printf("%f ", h_A[i]);      if(((i + 1) % WA) == 0)      printf("\n");   }    printf("\n\nMatrix B\n");   for(int i = 0; i < size_B; i++)   {      printf("%f ", h_B[i]);      if(((i + 1) % WB) == 0)      printf("\n");   }    // 4. allocate host memory for the result C   unsigned int size_C = WC * HC;   unsigned int mem_size_C = sizeof(float) * size_C;   float* h_C = (float*) malloc(mem_size_C);    // 5. Initialize OpenCL   ...    // 6. Load and build OpenCL kernel   ...    // 7. Launch OpenCL kernel   ...    // 8. Retrieve result from device   ...    // 9. print out the results   printf("\n\nMatrix C (Results)\n");   for(int i = 0; i < size_C; i++)   {      printf("%f ", h_C[i]);      if(((i + 1) % WC) == 0)      printf("\n");   }   printf("\n");    // 10. clean up memory   free(h_A);   free(h_B);   free(h_C); }`

The code in Listing 1 above has nothing OpenCL specific about it yet. We have put together a randomInit function to generate random floats for our matrices. In section 1 we allocate memory for our matrices. In section 2 we use our randomInit function to generate test data. In section 3 we print out the data that we have initialized our matrices with. Next we allocate memory for the results. Sections 5 through 8 don’t actually have any code yet. We will fill these in as we go. At section 9 we print out our results and then finally we free up all of the resources we allocated. Pretty straightforward… there is nothing special about the code yet.

As I mentioned in my OpenCL Program Structure post with OpenCL you must create an OpenCL context and associate devices, kernels, program objects, memory objects, and a command queue with the context. All of this initialization is done in host code using OpenCL APIs. The host code can then interact with the device by inserting commands onto the command queue. To launch the kernel you simply put a launch command on the command queue. To retrieve your results you put a memory copy command on the command queue requesting that the device memory containing your results be copied back to host memory. So lets start by adding the code for the OpenCL initialization.

Main program (Listing 2)

`// Multiply two matrices A * B = C #include <stdlib.h>#include <stdio.h>#include <math.h>#include <oclUtils.h> #define WA 3#define HA 3#define WB 3#define HB 3#define WC 3#define HC 3 // Allocates a matrix with random float entries.void randomInit(float* data, int size){   for (int i = 0; i < size; ++i)   data[i] = rand() / (float)RAND_MAX;} /////////////////////////////////////////////////////////// Program main///////////////////////////////////////////////////////// intmain(int argc, char** argv){    // set seed for rand()   srand(2006);    // 1. allocate host memory for matrices A and B   unsigned int size_A = WA * HA;   unsigned int mem_size_A = sizeof(float) * size_A;   float* h_A = (float*) malloc(mem_size_A);    unsigned int size_B = WB * HB;   unsigned int mem_size_B = sizeof(float) * size_B;   float* h_B = (float*) malloc(mem_size_B);    // 2. initialize host memory   randomInit(h_A, size_A);   randomInit(h_B, size_B);    // 3. print out A and B   printf("\n\nMatrix A\n");   for(int i = 0; i < size_A; i++)   {      printf("%f ", h_A[i]);      if(((i + 1) % WA) == 0)      printf("\n");   }    printf("\n\nMatrix B\n");   for(int i = 0; i < size_B; i++)   {      printf("%f ", h_B[i]);      if(((i + 1) % WB) == 0)      printf("\n");   }    // 4. allocate host memory for the result C   unsigned int size_C = WC * HC;   unsigned int mem_size_C = sizeof(float) * size_C;   float* h_C = (float*) malloc(mem_size_C);    // 5. Initialize OpenCL   // OpenCL specific variables   cl_context clGPUContext;   cl_command_queue clCommandQue;   cl_program clProgram;   cl_kernel clKernel;    size_t dataBytes;   size_t kernelLength;   cl_int errcode;    // OpenCL device memory for matrices   cl_mem d_A;   cl_mem d_B;   cl_mem d_C;    /*****************************************/   /* Initialize OpenCL */   /*****************************************/   clGPUContext = clCreateContextFromType(0,                    CL_DEVICE_TYPE_GPU,                    NULL, NULL, &errcode);   shrCheckError(errcode, CL_SUCCESS);    // get the list of GPU devices associated    // with context   errcode = clGetContextInfo(clGPUContext,               CL_CONTEXT_DEVICES, 0, NULL,               &dataBytes);   cl_device_id *clDevices = (cl_device_id *)              malloc(dataBytes);   errcode |= clGetContextInfo(clGPUContext,               CL_CONTEXT_DEVICES, dataBytes,               clDevices, NULL);   shrCheckError(errcode, CL_SUCCESS);    //Create a command-queue   clCommandQue = clCreateCommandQueue(clGPUContext,               clDevices[0], 0, &errcode);   shrCheckError(errcode, CL_SUCCESS);    // Setup device memory   d_C = clCreateBuffer(clGPUContext,           CL_MEM_READ_WRITE,           mem_size_A, NULL, &errcode);   d_A = clCreateBuffer(clGPUContext,           CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,           mem_size_A, h_A, &errcode);   d_B = clCreateBuffer(clGPUContext,           CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,           mem_size_B, h_B, &errcode);     // 6. Load and build OpenCL kernel   ...    // 7. Launch OpenCL kernel   ...     // 8. Retrieve result from device   ...    // 9. print out the results   printf("\n\nMatrix C (Results)\n");   for(int i = 0; i < size_C; i++)   {      printf("%f ", h_C[i]);      if(((i + 1) % WC) == 0)      printf("\n");   }   printf("\n");   // 10. clean up memory   free(h_A);   free(h_B);   free(h_C); }`

As we walk through the code many of the functions we will be calling have optional parameters. We will not be going into the details of the optional parameters on the functions that we are using. You can reference http://www.khronos.org/registry/cl/specs/opencl-1.0.43.pdf for a description of all of the parameters.

The first new line is the include for oclUtils.h. This include file is Nvidia specific and contains the code for the shrCheckError( ) method as well as the includes for the standard OpenCL header. Lets skip past the declarations of the OpenCL specific variables to the first OpenCL method we call clCreateContextFromType( ). This function creates on OpenCL context from a device type that identifies the specific device. Since we are offloading to a GPU we specify CL_DEVICE_TYPE_GPU. This will return a cl_context that we will use for the rest of our initialization.

The function shrCheckError( ) is an Nvidia specific function that prints out information on the error and exits. We will be making calls to this function throughout our main program. We then call clGetContextInfo( ) to get a handle to the cl_device_id. We make two calls to clGetContextInfo( ) one to determine how many bytes we need for our cl_device_id (we can have multiple OpenCL capable devices on one host) and the second to initialize our cl_device_id variable. Once we have the id of our device we can then create a cl_command_queue so that we can interact with the device. We do this by making a call to clCreateCommandQueue( ).

At this point we are almost done with our initialization. The only thing left is to initialize the device memory for our matrices and copy over our data. To do this we make calls to clCreateBuffer( ) which is used both to allocate the device memory and optionally initialize it with data from host memory. We use it for both.

Main program (Listing 3)

`// Multiply two matrices A * B = C #include <stdlib.h>#include <stdio.h>#include <math.h>#include <oclUtils.h> #define WA 3#define HA 3#define WB 3#define HB 3#define WC 3#define HC 3 // Allocates a matrix with random float entries.void randomInit(float* data, int size){   for (int i = 0; i < size; ++i)   data[i] = rand() / (float)RAND_MAX;} /////////////////////////////////////////////////////////// Program main///////////////////////////////////////////////////////// intmain(int argc, char** argv){    // set seed for rand()   srand(2006);      // 1. allocate host memory for matrices A and B   unsigned int size_A = WA * HA;   unsigned int mem_size_A = sizeof(float) * size_A;   float* h_A = (float*) malloc(mem_size_A);    unsigned int size_B = WB * HB;   unsigned int mem_size_B = sizeof(float) * size_B;   float* h_B = (float*) malloc(mem_size_B);    // 2. initialize host memory   randomInit(h_A, size_A);   randomInit(h_B, size_B);    // 3. print out A and B   printf("\n\nMatrix A\n");   for(int i = 0; i < size_A; i++)   {      printf("%f ", h_A[i]);      if(((i + 1) % WA) == 0)      printf("\n");   }     printf("\n\nMatrix B\n");   for(int i = 0; i < size_B; i++)   {      printf("%f ", h_B[i]);      if(((i + 1) % WB) == 0)      printf("\n");   }    // 4. allocate host memory for the result C   unsigned int size_C = WC * HC;   unsigned int mem_size_C = sizeof(float) * size_C;   float* h_C = (float*) malloc(mem_size_C);    // 5. Initialize OpenCL   // OpenCL specific variables   cl_context clGPUContext;   cl_command_queue clCommandQue;   cl_program clProgram;   cl_kernel clKernel;      size_t dataBytes;   size_t kernelLength;   cl_int errcode;     // OpenCL device memory for matrices   cl_mem d_A;   cl_mem d_B;   cl_mem d_C;    /*****************************************/   /* Initialize OpenCL */   /*****************************************/   clGPUContext = clCreateContextFromType(0,                    CL_DEVICE_TYPE_GPU,                    NULL, NULL, &errcode);   shrCheckError(errcode, CL_SUCCESS);     // get the list of GPU devices associated    // with context   errcode = clGetContextInfo(clGPUContext,               CL_CONTEXT_DEVICES, 0, NULL,               &dataBytes);   cl_device_id *clDevices = (cl_device_id *)              malloc(dataBytes);   errcode |= clGetContextInfo(clGPUContext,               CL_CONTEXT_DEVICES, dataBytes,               clDevices, NULL);   shrCheckError(errcode, CL_SUCCESS);    //Create a command-queue   clCommandQue = clCreateCommandQueue(clGPUContext,                    clDevices[0], 0, &errcode);   shrCheckError(errcode, CL_SUCCESS);    // Setup device memory   d_C = clCreateBuffer(clGPUContext,           CL_MEM_READ_WRITE,           mem_size_A, NULL, &errcode);   d_A = clCreateBuffer(clGPUContext,           CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,           mem_size_A, h_A, &errcode);   d_B = clCreateBuffer(clGPUContext,           CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,           mem_size_B, h_B, &errcode);     // 6. Load and build OpenCL kernel   char *clMatrixMul = oclLoadProgSource("kernel.cl",                        "// My comment\n",                         &kernelLength);   shrCheckError(clMatrixMul != NULL, shrTRUE);    clProgram = clCreateProgramWithSource(clGPUContext,                 1, (const char **)&clMatrixMul,                 &kernelLength, &errcode);   shrCheckError(errcode, CL_SUCCESS);    errcode = clBuildProgram(clProgram, 0,               NULL, NULL, NULL, NULL);   shrCheckError(errcode, CL_SUCCESS);    clKernel = clCreateKernel(clProgram,                "matrixMul", &errcode);   shrCheckError(errcode, CL_SUCCESS);     // 7. Launch OpenCL kernel   ...    // 8. Retrieve result from device   ...    // 9. print out the results   printf("\n\nMatrix C (Results)\n");   for(int i = 0; i < size_C; i++)   {      printf("%f ", h_C[i]);      if(((i + 1) % WC) == 0)      printf("\n");   }   printf("\n");    // 10. clean up memory   free(h_A);   free(h_B);   free(h_C); }`

Now that we have completed the OpenCL initialization listing 3 above contains the code to load and build the kernel. The function oclLoadProgSource( ) is not part of OpenCL. It is a Nvidia specific function that simplifies the loading of your kernel source. The first parameter must contain the complete path to the source file for your kernel. The second parameter is prepended to what is read in (can be useful for adding includes). The last parameter is the address to store the number of bytes read in.

We next call clCreateProgramWithSource( ) which creates a program object for a context, and loads the source code specified by the text strings in the clMatrixMul array into the program object. The devices associated with the program object are the devices associated with context. The first parameter is the context that we initialized, the second is an integer indicating the dimension of the clMatrixMul array, the third is the length of the clMatrixMul array, and the last is for the return of the error code.

Now that we have a valid program object we need to compile or "build" it. This is done in the call to clBuildProgram( ). This function takes a valid program object and the number of devices associated with the context. The remaining parameters are optional. Once we have successfully built our program object we need to associate a kernel object with each function prepended by __kernel in our kernel source file (a single source file can contain multiple __kernel functions in it).

To map a specific __kernel function in our kernel source file with a kernel object inside our program we call clCreateKernel( ) passing in the built program object and the name of the function. This will return a kernel object that is ready to launch.

Main program (Listing 4)

`// Multiply two matrices A * B = C #include <stdlib.h>#include <stdio.h>#include <math.h>#include <oclUtils.h> #define WA 3#define HA 3#define WB 3#define HB 3#define WC 3#define HC 3 // Allocates a matrix with random float entries.void randomInit(float* data, int size){   for (int i = 0; i < size; ++i)   data[i] = rand() / (float)RAND_MAX;} /////////////////////////////////////////////////////////// Program main///////////////////////////////////////////////////////// intmain(int argc, char** argv){    // set seed for rand()   srand(2006);    // 1. allocate host memory for matrices A and B   unsigned int size_A = WA * HA;   unsigned int mem_size_A = sizeof(float) * size_A;   float* h_A = (float*) malloc(mem_size_A);    unsigned int size_B = WB * HB;   unsigned int mem_size_B = sizeof(float) * size_B;   float* h_B = (float*) malloc(mem_size_B);    // 2. initialize host memory   randomInit(h_A, size_A);   randomInit(h_B, size_B);    // 3. print out A and B   printf("\n\nMatrix A\n");   for(int i = 0; i < size_A; i++)   {      printf("%f ", h_A[i]);      if(((i + 1) % WA) == 0)      printf("\n");   }    printf("\n\nMatrix B\n");   for(int i = 0; i < size_B; i++)   {      printf("%f ", h_B[i]);      if(((i + 1) % WB) == 0)      printf("\n");   }    // 4. allocate host memory for the result C   unsigned int size_C = WC * HC;   unsigned int mem_size_C = sizeof(float) * size_C;   float* h_C = (float*) malloc(mem_size_C);    // 5. Initialize OpenCL   // OpenCL specific variables   cl_context clGPUContext;   cl_command_queue clCommandQue;   cl_program clProgram;   cl_kernel clKernel;     size_t dataBytes;   size_t kernelLength;   cl_int errcode;    // OpenCL device memory for matrices   cl_mem d_A;   cl_mem d_B;   cl_mem d_C;    /*****************************************/   /* Initialize OpenCL */   /*****************************************/   clGPUContext = clCreateContextFromType(0,                    CL_DEVICE_TYPE_GPU,                    NULL, NULL, &errcode);   shrCheckError(errcode, CL_SUCCESS);    // get the list of GPU devices associated    // with context   errcode = clGetContextInfo(clGPUContext,               CL_CONTEXT_DEVICES, 0, NULL,               &dataBytes);   cl_device_id *clDevices = (cl_device_id *)              malloc(dataBytes);   errcode |= clGetContextInfo(clGPUContext,               CL_CONTEXT_DEVICES, dataBytes,               clDevices, NULL);   shrCheckError(errcode, CL_SUCCESS);    //Create a command-queue   clCommandQue = clCreateCommandQueue(clGPUContext,                   clDevices[0], 0, &errcode);   shrCheckError(errcode, CL_SUCCESS);     // Setup device memory   d_C = clCreateBuffer(clGPUContext,           CL_MEM_READ_WRITE,           mem_size_A, NULL, &errcode);   d_A = clCreateBuffer(clGPUContext,           CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,           mem_size_A, h_A, &errcode);   d_B = clCreateBuffer(clGPUContext,           CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,           mem_size_B, h_B, &errcode);     // 6. Load and build OpenCL kernel   char *clMatrixMul = oclLoadProgSource("kernel.cl",                        "// My comment\n",                         &kernelLength);   shrCheckError(clMatrixMul != NULL, shrTRUE);    clProgram = clCreateProgramWithSource(clGPUContext,                 1, (const char **)&clMatrixMul,                 &kernelLength, &errcode);   shrCheckError(errcode, CL_SUCCESS);    errcode = clBuildProgram(clProgram, 0,               NULL, NULL, NULL, NULL);   shrCheckError(errcode, CL_SUCCESS);    clKernel = clCreateKernel(clProgram,                "matrixMul", &errcode);   shrCheckError(errcode, CL_SUCCESS);     // 7. Launch OpenCL kernel   size_t localWorkSize[2], globalWorkSize[2];    int wA = WA;   int wC = WC;   errcode = clSetKernelArg(clKernel, 0,               sizeof(cl_mem), (void *)&d_C);   errcode |= clSetKernelArg(clKernel, 1,               sizeof(cl_mem), (void *)&d_A);   errcode |= clSetKernelArg(clKernel, 2,               sizeof(cl_mem), (void *)&d_B);   errcode |= clSetKernelArg(clKernel, 3,               sizeof(int), (void *)&wA);   errcode |= clSetKernelArg(clKernel, 4,               sizeof(int), (void *)&wC);   shrCheckError(errcode, CL_SUCCESS);    localWorkSize[0] = 3;   localWorkSize[1] = 3;   globalWorkSize[0] = 3;   globalWorkSize[1] = 3;    errcode = clEnqueueNDRangeKernel(clCommandQue,               clKernel, 2, NULL, globalWorkSize,               localWorkSize, 0, NULL, NULL);   shrCheckError(errcode, CL_SUCCESS);    // 8. Retrieve result from device   errcode = clEnqueueReadBuffer(clCommandQue,               d_C, CL_TRUE, 0, mem_size_C,               h_C, 0, NULL, NULL);   shrCheckError(errcode, CL_SUCCESS);    // 9. print out the results   printf("\n\nMatrix C (Results)\n");   for(int i = 0; i < size_C; i++)   {      printf("%f ", h_C[i]);      if(((i + 1) % WC) == 0)      printf("\n");   }   printf("\n");    // 10. clean up memory   free(h_A);   free(h_B);   free(h_C);    clReleaseMemObject(d_A);   clReleaseMemObject(d_C);   clReleaseMemObject(d_B);    free(clDevices);   free(clMatrixMul);   clReleaseContext(clGPUContext);   clReleaseKernel(clKernel);   clReleaseProgram(clProgram);   clReleaseCommandQueue(clCommandQue);}`

In listing 4 above we add the code to launch the kernel and retrieve the results. Before we can launch the kernel we need to set the parameters for our matrixMul kernel. We haven't looked at the kernel yet but it is almost identical to the C for CUDA version which took 5 parameters. To set the parameters we call clSetKernelArg( ) passing in the kernel, the index of the parameter in the kernel function, the size of the parameter and a pointer to the value. We do this for each parameter we need to pass in. If you have ever done any X Windows programming this tedious manner of passing in parameters should look famillar to you. Oh and let's not forget to release all of the OpenCL resources that we have allocated.

Now that the parameters are set all we need to do is queue up a command to launch the kernel on the command queue that we associated with the context earlier. We do that with a call to clEnqueueNDRangeKernel( ). With C for CUDA we had to set the size of our grid of thread block and the size of each thread block. For our first kernel we have 9 threads (3 X 3 matrices) in a thread block and 1 thread block in the grid. With OpenCL we need to do the same thing the only difference is that the dimensions for the grid in OpenCL are expressed in terms of the total number of threads in the grid. Thats what the two arrays localWorkSize and globalWorkSize are used for.

Once we have launched our kernel we enqueue a command to retrieve the results from device memory with a call to clEnqueueReadBuffer( ). This call will block until the kernel has finished. Speeking of the kernel... what is our kernel going to look like...?

Kernel (Listing 1)

`// kernel.cl// Multiply two matrices A * B = C// Device code.   // OpenCL Kernel__kernel voidmatrixMul(__global float* C,           __global float* A,           __global float* B,           int wA, int wB){    // 2D Thread ID   int tx = get_local_id(0);   int ty = get_local_id(1);    // value stores the element    // that is computed by the thread   float value = 0;   for (int k = 0; k < wA; ++k)   {      float elementA = A[ty * wA + k];      float elementB = B[k * wB + tx];      value += elementA * elementB;   }    // Write the matrix to device memory each    // thread writes one element   C[ty * wA + tx] = value;}`

The OpenCL kernel is basicaly identical to the C for CUDA version of the kernel the only differences are really cosmetic. We change the __global__ keyword that CUDA uses for __kernel that OpenCL uses to denote that a function is to be executed on the device. The threadIdx CUDA reference is replaced with calls to OpenCLs get_local_id( ) function and the global memory for our matrices that are passed into the kernel needs to be specified as __global in OpenCL. That's it... otherwise everything else is the same as our CUDA kernel.

Comparing the C for CUDA code to the OpenCL code you can see why we did not use Nvidia's CUDA driver API to start with. There is a significant amount of very mundane code that you have to write with CUDAs driver API and with OpenCL. It made sense for us to go the simpler route with Nvidias C for CUDA because using their CUDA driver API didn't really buy us anything. With OpenCL, on the otherhand, we obtain portability by writing the extra mundane code so it is worth it... at least to me it is.

With the code above you should be able to cut / copy / paste your way to a running binary (don't forget to replace "kernel.cl" in the call to oclLoadProgSource( ) with the complete path to your kernel). You will find that OpenCL does not run this algorithm any faster than CUDA does... performance still sucks! That's OK we will make it better in the next two examples.

On to Matrix Multiplication 2.