Confessions of a Speed Junkie (Code Examples)

Friday, October 2, 2009

Matrix Multiplication 3 (OpenCL)

The memory model for OpenCL applications is strikingly similar to the CUDA memory model (go figure). The only real differences are that CUDA supports texture memory (we didn't get into that in our CUDA examples) and OpenCL provides for a global / constant memory cache (Nvidias new Fermi GPU adds a global / constant memory cache).

With CUDA we were able to make some assumptions about the sizes and speed of the various different types of memory. Well they weren't really assumptions since we know the exact details of the underlying hardware that CUDA applications are running on. With OpenCL we can't easily make assumptions regarding sizes and speeds of the various different types of memory. The OpenCL specification identifies the different types of memory constructs that you must support from the API perspective but OpenCL driver providers have to map that specification to their underlying hardware. For example the amount and speed of global memory on a Cell processor vs an Nvidia GPU can vary significantly. While you should be guaranteed that your OpenCL code will compile and run on different underlying devices you should expect to do some fine tuning to achieve best performance for the various platforms. The types of memory provided by the OpenCL specification are:

Global Memory (CUDA Global): This memory region permits read/write access to all work-items in all work-groups. Work-items can read from or write to any element of a memory object. Reads and writes to global memory may be cached depending on the capabilities of the device.

Constant Memory (CUDA Constant): A region of global memory that remains constant during the execution of a kernel. The host allocates and initializes memory objects placed into constant memory.

Local Memory (CUDA Shared): A memory region local to a work-group. This memory region can be used to allocate variables that are shared by all work-items in that work-group. It may be implemented as dedicated regions of memory on the OpenCL device. Alternatively, the local memory region may be mapped onto sections of the global memory.

Private Memory (CUDA Reqister): A region of memory private to a work-item. Variables defined in one work-item’s private memory are not visible to another work-item.

Memory	Access Speed	Visibility	Access Type
Private	Faster	Processing Element	D(R,W) H(None)
Local	Faster	Work Group	D(R,W) H(None)
Constant	Slower	NDRange	D(R) H(W)
Global	Slower	NDRange	D(R,W) H(R,W)

While it can be dangerous to make assumptions regarding the speed of the different types I think it is a safe assumption that local and private will be faster than global and constant across all OpenCL devices. How much faster... impossible to say.

Now that we know a little about the OpenCL memory hierarchy how can we use this information to speed up our matrix multiplication? Our first two matrix multiplication programs copied the input data into global device memory (thats what clCreateBuffer( ) returns) and then the kernel read it out of global memory. From the information above we know that global device memory is the slower memory than local memory. So we should try and limit our usage of it. By changing our kernel to copy large blocks of data from global memory into local memory and then processing the data from local memory instead of global we should be able to speed up our overall computation.

The CUDA Matrix Multiplication 3 example explains how we need to decompose our calculation into sub problems that are further decomposed into sub problems so that they can fit into the limited local (shared) memory on the device. You might want to revisit that explanation to refresh your memory.

So now that we know what changes we need to make lets put together some code. The main program from OpenCL Matrix Multiplication 2 should work as is. Again I removed the printfs.

Main program (Listing 1)

// Multiply two matrices A * B = C
 
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <oclUtils.h>
 
#define WA 1024
#define HA 1024
#define WB 1024
#define HB WA
#define WC WB
#define HC HA
 
// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
   for (int i = 0; i < size; ++i)
   data[i] = rand() / (float)RAND_MAX;
}
 
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
 
int
main(int argc, char** argv)
{
 
   // set seed for rand()
   srand(2006);
 
   // 1. allocate host memory for matrices A and B
   unsigned int size_A = WA * HA;
   unsigned int mem_size_A = sizeof(float) * size_A;
   float* h_A = (float*) malloc(mem_size_A);
 
   unsigned int size_B = WB * HB;
   unsigned int mem_size_B = sizeof(float) * size_B;
   float* h_B = (float*) malloc(mem_size_B);
 
   // 2. initialize host memory
   randomInit(h_A, size_A);
   randomInit(h_B, size_B);
  
   // 4. allocate host memory for the result C
   unsigned int size_C = WC * HC;
   unsigned int mem_size_C = sizeof(float) * size_C;
   float* h_C = (float*) malloc(mem_size_C);
 
   // 5. Initialize OpenCL
   // OpenCL specific variables
   cl_context clGPUContext;
   cl_command_queue clCommandQue;
   cl_program clProgram;
   cl_kernel clKernel;
  
   size_t dataBytes;
   size_t kernelLength;
   cl_int errcode;
 
   // OpenCL device memory for matrices
   cl_mem d_A;
   cl_mem d_B;
   cl_mem d_C;
 
   /*****************************************/
   /* Initialize OpenCL */
   /*****************************************/
   clGPUContext = clCreateContextFromType(0, 
                   CL_DEVICE_TYPE_GPU, 
                   NULL, NULL, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
   // get the list of GPU devices associated 
   // with context
   errcode = clGetContextInfo(clGPUContext, 
              CL_CONTEXT_DEVICES, 0, NULL, 
              &dataBytes);
   cl_device_id *clDevices = (cl_device_id *)
              malloc(dataBytes);
   errcode |= clGetContextInfo(clGPUContext, 
              CL_CONTEXT_DEVICES, dataBytes, 
              clDevices, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   //Create a command-queue
   clCommandQue = clCreateCommandQueue(clGPUContext, 
                  clDevices[0], 0, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
  
   // Setup device memory
   d_C = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE, 
          mem_size_A, NULL, &errcode);
   d_A = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
          mem_size_A, h_A, &errcode);
   d_B = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
          mem_size_B, h_B, &errcode);
 
   // 6. Load and build OpenCL kernel
 
   // Open the ptx file and load it
   // into a char* buffer
   FILE* fp = fopen("kernel.ptx", "r");
   fseek (fp , 0 , SEEK_END);
   const size_t lSize = ftell(fp);
   rewind(fp);
   unsigned char* buffer;
   buffer = (unsigned char*) malloc (lSize);
   fread(buffer, 1, lSize, fp);
   fclose(fp);
 
   cl_int status;
   clProgram = clCreateProgramWithBinary(clGPUContext, 
                1, (const cl_device_id *)clDevices, 
                &lSize, (const unsigned char**)&buffer, 
                &status, &errcode);
   errcode = clBuildProgram(clProgram, 0, NULL, NULL, 
                NULL, NULL);
   shrCheckError(status, CL_SUCCESS);
   shrCheckError(errcode, CL_SUCCESS);
 
   errcode = clBuildProgram(clProgram, 0, 
              NULL, NULL, NULL, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   clKernel = clCreateKernel(clProgram, 
               "matrixMul", &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
 
   // 7. Launch OpenCL kernel
   size_t localWorkSize[2], globalWorkSize[2];
 
   int wA = WA;
   int wC = WC;
   errcode = clSetKernelArg(clKernel, 0, 
              sizeof(cl_mem), (void *)&d_C);
   errcode |= clSetKernelArg(clKernel, 1, 
              sizeof(cl_mem), (void *)&d_A);
   errcode |= clSetKernelArg(clKernel, 2, 
              sizeof(cl_mem), (void *)&d_B);
   errcode |= clSetKernelArg(clKernel, 3, 
              sizeof(int), (void *)&wA);
   errcode |= clSetKernelArg(clKernel, 4, 
              sizeof(int), (void *)&wC);
   shrCheckError(errcode, CL_SUCCESS);
 
   localWorkSize[0] = 16;
   localWorkSize[1] = 16;
   globalWorkSize[0] = 1024;
   globalWorkSize[1] = 1024;
 
   errcode = clEnqueueNDRangeKernel(clCommandQue, 
              clKernel, 2, NULL, globalWorkSize, 
              localWorkSize, 0, NULL, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   // 8. Retrieve result from device
   errcode = clEnqueueReadBuffer(clCommandQue, 
              d_C, CL_TRUE, 0, mem_size_C, 
              h_C, 0, NULL, NULL);
   shrCheckError(errcode, CL_SUCCESS);
  
   // 10. clean up memory
   free(h_A);
   free(h_B);
   free(h_C);
 
   clReleaseMemObject(d_A);
   clReleaseMemObject(d_C);
   clReleaseMemObject(d_B);
 
   free(clDevices);
   free(clMatrixMul);
   clReleaseContext(clGPUContext);
   clReleaseKernel(clKernel);
   clReleaseProgram(clProgram);
   clReleaseCommandQueue(clCommandQue);

}

The kernel is conceptualy identical to the kernel we put together for the CUDA Matrix Multiplication 3 example. All we really need to do is make syntactic changes.

Kernel (Listing 1)

/* Matrix multiplication: C = A * B.
 * Device code.
 */
 
// Thread block size
#define BLOCK_SIZE 16
  
//////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! wA is A's width and wB is B's width
//////////////////////////////////////////////////////
__kernel void
matrixMul(__global float* C, 
          __global float* A, 
          __global float* B, int wA, int wB)
{
    // Block index
    int bx = get_group_id(0);
    int by = get_group_id(1);
 
    // Thread index
    int tx = get_local_id(0);
    int ty = get_local_id(1);
 
    // Index of the first sub-matrix of A processed 
    // by the block
    int aBegin = wA * BLOCK_SIZE * by;
 
    // Index of the last sub-matrix of A processed 
    // by the block
    int aEnd   = aBegin + wA - 1;
 
    // Step size used to iterate through the 
    // sub-matrices of A
    int aStep  = BLOCK_SIZE;
 
    // Index of the first sub-matrix of B processed 
    // by the block
    int bBegin = BLOCK_SIZE * bx;
 
    // Step size used to iterate through the 
    // sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;
 
    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) 
    {

        // Declaration of the local memory array As 
        // used to store the sub-matrix of A
        __local float As[BLOCK_SIZE][BLOCK_SIZE];
 
        // Declaration of the local memory array Bs 
        // used to store the sub-matrix of B
        __local float Bs[BLOCK_SIZE][BLOCK_SIZE];
 
        // Load the matrices from global memory
        // to local memory; each thread loads
        // one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
 
        // Synchronize to make sure the matrices 
        // are loaded
        barrier(CLK_LOCAL_MEM_FENCE);
 
        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];
 
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        barrier(CLK_LOCAL_MEM_FENCE);
 
    }
 
    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;

}

We changed the __global__ to __kernel to indicate that the function is to be run on the device. We added a __global to designate the input matrices are in global memory. The references to blockIdx and threadIdx are converted to calls to get_global_id( ) and get_local_id( ). The shared memory arrays are converted to local memory arrays and the calls to __syncthreads( ) are converted to calls to barrier( ).

Most of the above changes are self explanatory with the exception of the call to barrier( ). The barrier( ) function is identical in function to the CUDA __syncthreads( ) function. All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel. If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier. If barrer is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.

So let's take this version out for a test drive. Running this version yields a 61X speedup over the CPU verion. If we look back at our Matrix Multiplication 2 example we see that it was 57X faster than the CPU version. Now you may be thinking that this is an awful lot of work to just squeeze and extra 4X speedup... and perhaps it is. But let's change the size of our matrix to 2048 and see what we "see".

If we run our Matrix Multiplication 2 example,which uses only global memory, with matrices of 2048 X 2048 we will find that the GPU version is 279X faster than the CPU verion of the algorithm. Our local memory implementation of the GPU version is 370X faster than the CPU version. Looks like we really did get some "bang for our buck" after all!

Monday, September 28, 2009

Matrix Multiplication 2 (OpenCL)

I will assume that you have gone through the CUDA Matrix Multiplication 2 example and understand the conceptual changes that we will be making to our OpenCL kernel. All we really need to do is express our kernel from CUDA Matrix Multiplication 2 in terms of OpenCL and slightly modify our main program from our OpenCL Matrix Multiplication 1 example to account for the different work group and NDRange sizes.

So when we launch our kernel we want the GPU resources allocated something like the image below:

As you can see we want a 64 by 64 NDRange of 16 by 16 work groups. This will only require us to change a couple of lines in our main program.

Main program (Listing 1)

// Multiply two matrices A * B = C
 
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <oclUtils.h>
 
#define WA 1024
#define HA 1024
#define WB 1024
#define HB WA
#define WC WB
#define HC HA
 
// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
   for (int i = 0; i < size; ++i)
   data[i] = rand() / (float)RAND_MAX;
}
 
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
 
int
main(int argc, char** argv)
{
 
   // set seed for rand()
   srand(2006);
 
   // 1. allocate host memory for matrices A and B
   unsigned int size_A = WA * HA;
   unsigned int mem_size_A = sizeof(float) * size_A;
   float* h_A = (float*) malloc(mem_size_A);
 
   unsigned int size_B = WB * HB;
   unsigned int mem_size_B = sizeof(float) * size_B;
   float* h_B = (float*) malloc(mem_size_B);
 
   // 2. initialize host memory
   randomInit(h_A, size_A);
   randomInit(h_B, size_B);
 
   // 3. print out A and B
   printf("\n\nMatrix A\n");
   for(int i = 0; i < size_A; i++)
   {
      printf("%f ", h_A[i]);
      if(((i + 1) % WA) == 0)
      printf("\n");
   }
 
   printf("\n\nMatrix B\n");
   for(int i = 0; i < size_B; i++)
   {
      printf("%f ", h_B[i]);
      if(((i + 1) % WB) == 0)
      printf("\n");
   }
 
   // 4. allocate host memory for the result C
   unsigned int size_C = WC * HC;
   unsigned int mem_size_C = sizeof(float) * size_C;
   float* h_C = (float*) malloc(mem_size_C);
 
   // 5. Initialize OpenCL
   // OpenCL specific variables
   cl_context clGPUContext;
   cl_command_queue clCommandQue;
   cl_program clProgram;
   cl_kernel clKernel;
  
   size_t dataBytes;
   size_t kernelLength;
   cl_int errcode;
 
   // OpenCL device memory for matrices
   cl_mem d_A;
   cl_mem d_B;
   cl_mem d_C;
 
   /*****************************************/
   /* Initialize OpenCL */
   /*****************************************/
   clGPUContext = clCreateContextFromType(0, 
                   CL_DEVICE_TYPE_GPU, 
                   NULL, NULL, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
   // get the list of GPU devices associated 
   // with context
   errcode = clGetContextInfo(clGPUContext, 
              CL_CONTEXT_DEVICES, 0, NULL, 
              &dataBytes);
   cl_device_id *clDevices = (cl_device_id *)
              malloc(dataBytes);
   errcode |= clGetContextInfo(clGPUContext, 
              CL_CONTEXT_DEVICES, dataBytes, 
              clDevices, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   //Create a command-queue
   clCommandQue = clCreateCommandQueue(clGPUContext, 
                  clDevices[0], 0, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
  
   // Setup device memory
   d_C = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE, 
          mem_size_A, NULL, &errcode);
   d_A = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
          mem_size_A, h_A, &errcode);
   d_B = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
          mem_size_B, h_B, &errcode);
 
 
   // 6. Load and build OpenCL kernel
   char *clMatrixMul = oclLoadProgSource("kernel.cl",
                        "// My comment\n", 
                        &kernelLength);
   shrCheckError(clMatrixMul != NULL, shrTRUE);
 
   clProgram = clCreateProgramWithSource(clGPUContext, 
                1, (const char **)&clMatrixMul, 
                &kernelLength, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
   errcode = clBuildProgram(clProgram, 0, 
              NULL, NULL, NULL, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   clKernel = clCreateKernel(clProgram, 
               "matrixMul", &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
 
   // 7. Launch OpenCL kernel
   size_t localWorkSize[2], globalWorkSize[2];
 
   int wA = WA;
   int wC = WC;
   errcode = clSetKernelArg(clKernel, 0, 
              sizeof(cl_mem), (void *)&d_C);
   errcode |= clSetKernelArg(clKernel, 1, 
              sizeof(cl_mem), (void *)&d_A);
   errcode |= clSetKernelArg(clKernel, 2, 
              sizeof(cl_mem), (void *)&d_B);
   errcode |= clSetKernelArg(clKernel, 3, 
              sizeof(int), (void *)&wA);
   errcode |= clSetKernelArg(clKernel, 4, 
              sizeof(int), (void *)&wC);
   shrCheckError(errcode, CL_SUCCESS);
 
   localWorkSize[0] = 16;
   localWorkSize[1] = 16;
   globalWorkSize[0] = 1024;
   globalWorkSize[1] = 1024;
 
   errcode = clEnqueueNDRangeKernel(clCommandQue, 
              clKernel, 2, NULL, globalWorkSize, 
              localWorkSize, 0, NULL, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   // 8. Retrieve result from device
   errcode = clEnqueueReadBuffer(clCommandQue, 
              d_C, CL_TRUE, 0, mem_size_C, 
              h_C, 0, NULL, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   // 9. print out the results
   printf("\n\nMatrix C (Results)\n");
   for(int i = 0; i < size_C; i++)
   {
      printf("%f ", h_C[i]);
      if(((i + 1) % WC) == 0)
      printf("\n");
   }
   printf("\n");
 
   // 10. clean up memory
   free(h_A);
   free(h_B);
   free(h_C);
 
   clReleaseMemObject(d_A);
   clReleaseMemObject(d_C);
   clReleaseMemObject(d_B);
 
   free(clDevices);
   free(clMatrixMul);
   clReleaseContext(clGPUContext);
   clReleaseKernel(clKernel);
   clReleaseProgram(clProgram);
   clReleaseCommandQueue(clCommandQue);

}

So we set our localWorkSize to 16 by 16 and our globalWorkSize to 1024 by 1024. Why 1024 and not 64? Rember that in OpenCL we need to express the globalWorkSize in terms of the total number of threads. The underlying OpenCL API will look at the globalWorkSize and divide by the localWorkSize to arrive at a 64 by 64 NDRange of 16 by 16 work groups.

So what are the changes that we need to make to our CUDA Matrix Multiplication 2 kernel code?

OpenCL Kernel (Listing 1)

/* kernel.cl 
 * Matrix multiplication: C = A * B.
 * Device code.
 */
 
// OpenCL Kernel
__kernel void
matrixMul(__global float* C, 
          __global float* A, 
          __global float* B, 
          int wA, int wB)
{
  
   // 2D Thread ID
   // Old CUDA code
   //int tx = blockIdx.x * TILE_SIZE + threadIdx.x;
   //int ty = blockIdx.y * TILE_SIZE + threadIdx.y;
   int tx = get_global_id(0); 
   int ty = get_global_id(1);
 
   // value stores the element that is 
   // computed by the thread
   float value = 0;
   for (int k = 0; k < wA; ++k)
   {
      float elementA = A[ty * wA + k];
      float elementB = B[k * wB + tx];
      value += elementA * elementB;
   }
 
   // Write the matrix to device memory each 
   // thread writes one element
   C[ty * wA + tx] = value;
}

We changed the __global__ to __kernel to indicate that the function is to be run on the device. We added a __global to designate the input matrices are in global memory. If we look at the old CUDA code that is commented out we see that it was taking the blockId index and multiplying them times the TILE_SIZE (which was 16) and then adding in the index of the thread block. Well as luck would have it that is exactly what the call to get_global_id( ) returns.

Now cut copy and paste your way to a running binary (don't forget to replace "kernel.cl" in the call to oclLoadProgSource( ) with the complete path to your kernel). If you insert timing code you will find that this OpenCL example is about 46X faster than the CPU version.

Before we move on to Matrix Multiplication 3 let's take our OpenCL compiler oclcc out for a spin. First you are going to need to cut / copy / paste the compiler source and build oclcc. Once you have a oclcc binary in your bin tree all you need to do to compile a kernel is:

$ oclcc /path/to/kernel/source/kernel.cl -o kernel.ptx

This will compile your source to Nvidias ptx assembler. Pass --help on the oclcc command line to get a usage command. Since our kernel is already in ptx assembler we will need to change out main program slightly.

Main program (Listing 2)

// Multiply two matrices A * B = C
 
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <oclUtils.h>
 
#define WA 1024
#define HA 1024
#define WB 1024
#define HB WA
#define WC WB
#define HC HA
 
// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
   for (int i = 0; i < size; ++i)
   data[i] = rand() / (float)RAND_MAX;
}
 
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
 
int
main(int argc, char** argv)
{
 
   // set seed for rand()
   srand(2006);
 
   // 1. allocate host memory for matrices A and B
   unsigned int size_A = WA * HA;
   unsigned int mem_size_A = sizeof(float) * size_A;
   float* h_A = (float*) malloc(mem_size_A);
 
   unsigned int size_B = WB * HB;
   unsigned int mem_size_B = sizeof(float) * size_B;
   float* h_B = (float*) malloc(mem_size_B);
 
   // 2. initialize host memory
   randomInit(h_A, size_A);
   randomInit(h_B, size_B);
 
   // 3. print out A and B
   printf("\n\nMatrix A\n");
   for(int i = 0; i < size_A; i++)
   {
      printf("%f ", h_A[i]);
      if(((i + 1) % WA) == 0)
      printf("\n");
   }
 
   printf("\n\nMatrix B\n");
   for(int i = 0; i < size_B; i++)
   {
      printf("%f ", h_B[i]);
      if(((i + 1) % WB) == 0)
      printf("\n");
   }
 
   // 4. allocate host memory for the result C
   unsigned int size_C = WC * HC;
   unsigned int mem_size_C = sizeof(float) * size_C;
   float* h_C = (float*) malloc(mem_size_C);
 
   // 5. Initialize OpenCL
   // OpenCL specific variables
   cl_context clGPUContext;
   cl_command_queue clCommandQue;
   cl_program clProgram;
   cl_kernel clKernel;
  
   size_t dataBytes;
   size_t kernelLength;
   cl_int errcode;
 
   // OpenCL device memory for matrices
   cl_mem d_A;
   cl_mem d_B;
   cl_mem d_C;
 
   /*****************************************/
   /* Initialize OpenCL */
   /*****************************************/
   clGPUContext = clCreateContextFromType(0, 
                   CL_DEVICE_TYPE_GPU, 
                   NULL, NULL, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
   // get the list of GPU devices associated 
   // with context
   errcode = clGetContextInfo(clGPUContext, 
              CL_CONTEXT_DEVICES, 0, NULL, 
              &dataBytes);
   cl_device_id *clDevices = (cl_device_id *)
              malloc(dataBytes);
   errcode |= clGetContextInfo(clGPUContext, 
              CL_CONTEXT_DEVICES, dataBytes, 
              clDevices, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   //Create a command-queue
   clCommandQue = clCreateCommandQueue(clGPUContext, 
                  clDevices[0], 0, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
  
   // Setup device memory
   d_C = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE, 
          mem_size_A, NULL, &errcode);
   d_A = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
          mem_size_A, h_A, &errcode);
   d_B = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
          mem_size_B, h_B, &errcode);
 
 
   // 6. Load and build OpenCL kernel
   //char *clMatrixMul = oclLoadProgSource("kernel.cl",
   //                     "// My comment\n", 
   //                     &kernelLength);
   //shrCheckError(clMatrixMul != NULL, shrTRUE);
 
   //clProgram = clCreateProgramWithSource(
   //                      clGPUContext, 
   //             1, (const char **)&clMatrixMul, 
   //             &kernelLength, &errcode);
   //shrCheckError(errcode, CL_SUCCESS);
 
   // Open the ptx file and load it
   // into a char* buffer
   FILE* fp = fopen("kernel.ptx", "r");
   fseek (fp , 0 , SEEK_END);
   const size_t lSize = ftell(fp);
   rewind(fp);
   unsigned char* buffer;
   buffer = (unsigned char*) malloc (lSize);
   fread(buffer, 1, lSize, fp);
   fclose(fp);
 
   cl_int status;
   clProgram = clCreateProgramWithBinary(clGPUContext, 
                1, (const cl_device_id *)clDevices, 
                &lSize, (const unsigned char**)&buffer, 
                &status, &errcode);
   errcode = clBuildProgram(clProgram, 0, NULL, NULL, 
                NULL, NULL);
   shrCheckError(status, CL_SUCCESS);
   shrCheckError(errcode, CL_SUCCESS);
 
   errcode = clBuildProgram(clProgram, 0, 
              NULL, NULL, NULL, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   clKernel = clCreateKernel(clProgram, 
               "matrixMul", &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
 
   // 7. Launch OpenCL kernel
   size_t localWorkSize[2], globalWorkSize[2];
 
   int wA = WA;
   int wC = WC;
   errcode = clSetKernelArg(clKernel, 0, 
              sizeof(cl_mem), (void *)&d_C);
   errcode |= clSetKernelArg(clKernel, 1, 
              sizeof(cl_mem), (void *)&d_A);
   errcode |= clSetKernelArg(clKernel, 2, 
              sizeof(cl_mem), (void *)&d_B);
   errcode |= clSetKernelArg(clKernel, 3, 
              sizeof(int), (void *)&wA);
   errcode |= clSetKernelArg(clKernel, 4, 
              sizeof(int), (void *)&wC);
   shrCheckError(errcode, CL_SUCCESS);
 
   localWorkSize[0] = 16;
   localWorkSize[1] = 16;
   globalWorkSize[0] = 1024;
   globalWorkSize[1] = 1024;
 
   errcode = clEnqueueNDRangeKernel(clCommandQue, 
              clKernel, 2, NULL, globalWorkSize, 
              localWorkSize, 0, NULL, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   // 8. Retrieve result from device
   errcode = clEnqueueReadBuffer(clCommandQue, 
              d_C, CL_TRUE, 0, mem_size_C, 
              h_C, 0, NULL, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   // 9. print out the results
   printf("\n\nMatrix C (Results)\n");
   for(int i = 0; i < size_C; i++)
   {
      printf("%f ", h_C[i]);
      if(((i + 1) % WC) == 0)
      printf("\n");
   }
   printf("\n");
 
   // 10. clean up memory
   free(h_A);
   free(h_B);
   free(h_C);
 
   clReleaseMemObject(d_A);
   clReleaseMemObject(d_C);
   clReleaseMemObject(d_B);
 
   free(clDevices);
   free(clMatrixMul);
   clReleaseContext(clGPUContext);
   clReleaseKernel(clKernel);
   clReleaseProgram(clProgram);
   clReleaseCommandQueue(clCommandQue);

}

Since we have already compiled our OpenCL code to ptx assembler we no longer need the calls to open the kernel source file or to create the program from source so we just comment them out. We replace these calls with code to open up the ptx file and load it into a buffer and we then call clCreateProgramWithBinary( ). Notice that we still must make a call to clBuildProgram( ). Remeber that Nvidia's implementation of OpenCL compiles to an intermeadiate form (ptx assembler) so we still need to compile the assembler to a binary capable of running on the card.

Now if we cut / copy / paste our way to a binary we will find that the GPU version is now 57X faster than the CPU version of the program. So by precompiling the kernel we get an additional 11X speedup. Not bad for changing less than 25 lines of code but we can still get faster...

On to OpenCL Matrix Multiplication 3

Tuesday, September 22, 2009

Matrix Multiplication 1 (OpenCL)

I will assume that you have already downloaded and installed the appropriate CUDA driver, toolkit and SDK from Nvidia. You can get your hands on Nvidia's beta OpenCL at http://www.nvidia.com/object/cuda_opencl.html. Since we have been working with matrix multiplication in CUDA let’s do the same with OpenCL. We will put together a trivial example of multiplying two 3 X 3 matrices together using OpenCL just like we did with C for CUDA.

The basics of what we need to do has not changed... we need to allocate memory for the two matrices that we are multiplying together… let’s call them A and B. Then we need to copy in the data to the memory that you have allocated. Then we might want to print A and B out. We also need to allocate memory for the results… let’s call it matrix C. The next step would be to perform the multiplication and finally you might want to print out the results and free up the memory you allocated. The main program starts out identically to our C for CUDA version.

Main program (Listing 1)

// Multiply two matrices A * B = C
 
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
 
#define WA 3
#define HA 3
#define WB 3
#define HB 3
#define WC 3
#define HC 3
  
// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
   for (int i = 0; i < size; ++i)
   data[i] = rand() / (float)RAND_MAX;
}
 
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
 
int
main(int argc, char** argv)
{
 
   // set seed for rand()
   srand(2006);
 
   // 1. allocate host memory for matrices A and B
   unsigned int size_A = WA * HA;
   unsigned int mem_size_A = sizeof(float) * size_A;
   float* h_A = (float*) malloc(mem_size_A);
 
   unsigned int size_B = WB * HB;
   unsigned int mem_size_B = sizeof(float) * size_B;
   float* h_B = (float*) malloc(mem_size_B);
 
   // 2. initialize host memory
   randomInit(h_A, size_A);
   randomInit(h_B, size_B);
  
   // 3. print out A and B
   printf("\n\nMatrix A\n");
   for(int i = 0; i < size_A; i++)
   {
      printf("%f ", h_A[i]);
      if(((i + 1) % WA) == 0)
      printf("\n");
   }
 
   printf("\n\nMatrix B\n");
   for(int i = 0; i < size_B; i++)
   {
      printf("%f ", h_B[i]);
      if(((i + 1) % WB) == 0)
      printf("\n");
   }
 
   // 4. allocate host memory for the result C
   unsigned int size_C = WC * HC;
   unsigned int mem_size_C = sizeof(float) * size_C;
   float* h_C = (float*) malloc(mem_size_C);
 
   // 5. Initialize OpenCL
   ...
 
   // 6. Load and build OpenCL kernel
   ...
 
   // 7. Launch OpenCL kernel
   ...
 
   // 8. Retrieve result from device
   ...
 
   // 9. print out the results
   printf("\n\nMatrix C (Results)\n");
   for(int i = 0; i < size_C; i++)
   {
      printf("%f ", h_C[i]);
      if(((i + 1) % WC) == 0)
      printf("\n");
   }
   printf("\n");
 
   // 10. clean up memory
   free(h_A);
   free(h_B);
   free(h_C);
 
}

The code in Listing 1 above has nothing OpenCL specific about it yet. We have put together a randomInit function to generate random floats for our matrices. In section 1 we allocate memory for our matrices. In section 2 we use our randomInit function to generate test data. In section 3 we print out the data that we have initialized our matrices with. Next we allocate memory for the results. Sections 5 through 8 don’t actually have any code yet. We will fill these in as we go. At section 9 we print out our results and then finally we free up all of the resources we allocated. Pretty straightforward… there is nothing special about the code yet.

As I mentioned in my OpenCL Program Structure post with OpenCL you must create an OpenCL context and associate devices, kernels, program objects, memory objects, and a command queue with the context. All of this initialization is done in host code using OpenCL APIs. The host code can then interact with the device by inserting commands onto the command queue. To launch the kernel you simply put a launch command on the command queue. To retrieve your results you put a memory copy command on the command queue requesting that the device memory containing your results be copied back to host memory. So lets start by adding the code for the OpenCL initialization.

Main program (Listing 2)

// Multiply two matrices A * B = C
 
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <oclUtils.h>
 
#define WA 3
#define HA 3
#define WB 3
#define HB 3
#define WC 3
#define HC 3
 
// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
   for (int i = 0; i < size; ++i)
   data[i] = rand() / (float)RAND_MAX;
}
 
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
 
int
main(int argc, char** argv)
{
 
   // set seed for rand()
   srand(2006);
 
   // 1. allocate host memory for matrices A and B
   unsigned int size_A = WA * HA;
   unsigned int mem_size_A = sizeof(float) * size_A;
   float* h_A = (float*) malloc(mem_size_A);
 
   unsigned int size_B = WB * HB;
   unsigned int mem_size_B = sizeof(float) * size_B;
   float* h_B = (float*) malloc(mem_size_B);
 
   // 2. initialize host memory
   randomInit(h_A, size_A);
   randomInit(h_B, size_B);
 
   // 3. print out A and B
   printf("\n\nMatrix A\n");
   for(int i = 0; i < size_A; i++)
   {
      printf("%f ", h_A[i]);
      if(((i + 1) % WA) == 0)
      printf("\n");
   }
 
   printf("\n\nMatrix B\n");
   for(int i = 0; i < size_B; i++)
   {
      printf("%f ", h_B[i]);
      if(((i + 1) % WB) == 0)
      printf("\n");
   }
 
   // 4. allocate host memory for the result C
   unsigned int size_C = WC * HC;
   unsigned int mem_size_C = sizeof(float) * size_C;
   float* h_C = (float*) malloc(mem_size_C);
 
   // 5. Initialize OpenCL
   // OpenCL specific variables
   cl_context clGPUContext;
   cl_command_queue clCommandQue;
   cl_program clProgram;
   cl_kernel clKernel;
 
   size_t dataBytes;
   size_t kernelLength;
   cl_int errcode;
 
   // OpenCL device memory for matrices
   cl_mem d_A;
   cl_mem d_B;
   cl_mem d_C;
 
   /*****************************************/
   /* Initialize OpenCL */
   /*****************************************/
   clGPUContext = clCreateContextFromType(0, 
                   CL_DEVICE_TYPE_GPU, 
                   NULL, NULL, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
   // get the list of GPU devices associated 
   // with context
   errcode = clGetContextInfo(clGPUContext, 
              CL_CONTEXT_DEVICES, 0, NULL, 
              &dataBytes);
   cl_device_id *clDevices = (cl_device_id *)
              malloc(dataBytes);
   errcode |= clGetContextInfo(clGPUContext, 
              CL_CONTEXT_DEVICES, dataBytes, 
              clDevices, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   //Create a command-queue
   clCommandQue = clCreateCommandQueue(clGPUContext, 
              clDevices[0], 0, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
   // Setup device memory
   d_C = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE, 
          mem_size_A, NULL, &errcode);
   d_A = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
          mem_size_A, h_A, &errcode);
   d_B = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
          mem_size_B, h_B, &errcode);
 
 
   // 6. Load and build OpenCL kernel
   ...
 
   // 7. Launch OpenCL kernel
   ...
  
   // 8. Retrieve result from device
   ...
 
   // 9. print out the results
   printf("\n\nMatrix C (Results)\n");
   for(int i = 0; i < size_C; i++)
   {
      printf("%f ", h_C[i]);
      if(((i + 1) % WC) == 0)
      printf("\n");
   }
   printf("\n");

   // 10. clean up memory
   free(h_A);
   free(h_B);
   free(h_C);
 
}

As we walk through the code many of the functions we will be calling have optional parameters. We will not be going into the details of the optional parameters on the functions that we are using. You can reference http://www.khronos.org/registry/cl/specs/opencl-1.0.43.pdf for a description of all of the parameters.

The first new line is the include for oclUtils.h. This include file is Nvidia specific and contains the code for the shrCheckError( ) method as well as the includes for the standard OpenCL header. Lets skip past the declarations of the OpenCL specific variables to the first OpenCL method we call clCreateContextFromType( ). This function creates on OpenCL context from a device type that identifies the specific device. Since we are offloading to a GPU we specify CL_DEVICE_TYPE_GPU. This will return a cl_context that we will use for the rest of our initialization.

The function shrCheckError( ) is an Nvidia specific function that prints out information on the error and exits. We will be making calls to this function throughout our main program. We then call clGetContextInfo( ) to get a handle to the cl_device_id. We make two calls to clGetContextInfo( ) one to determine how many bytes we need for our cl_device_id (we can have multiple OpenCL capable devices on one host) and the second to initialize our cl_device_id variable. Once we have the id of our device we can then create a cl_command_queue so that we can interact with the device. We do this by making a call to clCreateCommandQueue( ).

At this point we are almost done with our initialization. The only thing left is to initialize the device memory for our matrices and copy over our data. To do this we make calls to clCreateBuffer( ) which is used both to allocate the device memory and optionally initialize it with data from host memory. We use it for both.

Main program (Listing 3)

// Multiply two matrices A * B = C
 
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <oclUtils.h>
 
#define WA 3
#define HA 3
#define WB 3
#define HB 3
#define WC 3
#define HC 3
 
// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
   for (int i = 0; i < size; ++i)
   data[i] = rand() / (float)RAND_MAX;
}
 
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
 
int
main(int argc, char** argv)
{
 
   // set seed for rand()
   srand(2006);
   
   // 1. allocate host memory for matrices A and B
   unsigned int size_A = WA * HA;
   unsigned int mem_size_A = sizeof(float) * size_A;
   float* h_A = (float*) malloc(mem_size_A);
 
   unsigned int size_B = WB * HB;
   unsigned int mem_size_B = sizeof(float) * size_B;
   float* h_B = (float*) malloc(mem_size_B);
 
   // 2. initialize host memory
   randomInit(h_A, size_A);
   randomInit(h_B, size_B);
 
   // 3. print out A and B
   printf("\n\nMatrix A\n");
   for(int i = 0; i < size_A; i++)
   {
      printf("%f ", h_A[i]);
      if(((i + 1) % WA) == 0)
      printf("\n");
   }
  
   printf("\n\nMatrix B\n");
   for(int i = 0; i < size_B; i++)
   {
      printf("%f ", h_B[i]);
      if(((i + 1) % WB) == 0)
      printf("\n");
   }
 
   // 4. allocate host memory for the result C
   unsigned int size_C = WC * HC;
   unsigned int mem_size_C = sizeof(float) * size_C;
   float* h_C = (float*) malloc(mem_size_C);
 
   // 5. Initialize OpenCL
   // OpenCL specific variables
   cl_context clGPUContext;
   cl_command_queue clCommandQue;
   cl_program clProgram;
   cl_kernel clKernel;
   
   size_t dataBytes;
   size_t kernelLength;
   cl_int errcode;
  
   // OpenCL device memory for matrices
   cl_mem d_A;
   cl_mem d_B;
   cl_mem d_C;
 
   /*****************************************/
   /* Initialize OpenCL */
   /*****************************************/
   clGPUContext = clCreateContextFromType(0, 
                   CL_DEVICE_TYPE_GPU, 
                   NULL, NULL, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
  
   // get the list of GPU devices associated 
   // with context
   errcode = clGetContextInfo(clGPUContext, 
              CL_CONTEXT_DEVICES, 0, NULL, 
              &dataBytes);
   cl_device_id *clDevices = (cl_device_id *)
              malloc(dataBytes);
   errcode |= clGetContextInfo(clGPUContext, 
              CL_CONTEXT_DEVICES, dataBytes, 
              clDevices, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   //Create a command-queue
   clCommandQue = clCreateCommandQueue(clGPUContext, 
                   clDevices[0], 0, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
   // Setup device memory
   d_C = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE, 
          mem_size_A, NULL, &errcode);
   d_A = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
          mem_size_A, h_A, &errcode);
   d_B = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
          mem_size_B, h_B, &errcode);
 
 
   // 6. Load and build OpenCL kernel
   char *clMatrixMul = oclLoadProgSource("kernel.cl",
                        "// My comment\n", 
                        &kernelLength);
   shrCheckError(clMatrixMul != NULL, shrTRUE);
 
   clProgram = clCreateProgramWithSource(clGPUContext, 
                1, (const char **)&clMatrixMul, 
                &kernelLength, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
   errcode = clBuildProgram(clProgram, 0, 
              NULL, NULL, NULL, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   clKernel = clCreateKernel(clProgram, 
               "matrixMul", &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
 
   // 7. Launch OpenCL kernel
   ...
 
   // 8. Retrieve result from device
   ...
 
   // 9. print out the results
   printf("\n\nMatrix C (Results)\n");
   for(int i = 0; i < size_C; i++)
   {
      printf("%f ", h_C[i]);
      if(((i + 1) % WC) == 0)
      printf("\n");
   }
   printf("\n");
 
   // 10. clean up memory
   free(h_A);
   free(h_B);
   free(h_C);
 
}

Now that we have completed the OpenCL initialization listing 3 above contains the code to load and build the kernel. The function oclLoadProgSource( ) is not part of OpenCL. It is a Nvidia specific function that simplifies the loading of your kernel source. The first parameter must contain the complete path to the source file for your kernel. The second parameter is prepended to what is read in (can be useful for adding includes). The last parameter is the address to store the number of bytes read in.

We next call clCreateProgramWithSource( ) which creates a program object for a context, and loads the source code specified by the text strings in the clMatrixMul array into the program object. The devices associated with the program object are the devices associated with context. The first parameter is the context that we initialized, the second is an integer indicating the dimension of the clMatrixMul array, the third is the length of the clMatrixMul array, and the last is for the return of the error code.

Now that we have a valid program object we need to compile or "build" it. This is done in the call to clBuildProgram( ). This function takes a valid program object and the number of devices associated with the context. The remaining parameters are optional. Once we have successfully built our program object we need to associate a kernel object with each function prepended by __kernel in our kernel source file (a single source file can contain multiple __kernel functions in it).

To map a specific __kernel function in our kernel source file with a kernel object inside our program we call clCreateKernel( ) passing in the built program object and the name of the function. This will return a kernel object that is ready to launch.

Main program (Listing 4)

// Multiply two matrices A * B = C
 
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <oclUtils.h>
 
#define WA 3
#define HA 3
#define WB 3
#define HB 3
#define WC 3
#define HC 3
 
// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
   for (int i = 0; i < size; ++i)
   data[i] = rand() / (float)RAND_MAX;
}
 
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
 
int
main(int argc, char** argv)
{
 
   // set seed for rand()
   srand(2006);
 
   // 1. allocate host memory for matrices A and B
   unsigned int size_A = WA * HA;
   unsigned int mem_size_A = sizeof(float) * size_A;
   float* h_A = (float*) malloc(mem_size_A);
 
   unsigned int size_B = WB * HB;
   unsigned int mem_size_B = sizeof(float) * size_B;
   float* h_B = (float*) malloc(mem_size_B);
 
   // 2. initialize host memory
   randomInit(h_A, size_A);
   randomInit(h_B, size_B);
 
   // 3. print out A and B
   printf("\n\nMatrix A\n");
   for(int i = 0; i < size_A; i++)
   {
      printf("%f ", h_A[i]);
      if(((i + 1) % WA) == 0)
      printf("\n");
   }
 
   printf("\n\nMatrix B\n");
   for(int i = 0; i < size_B; i++)
   {
      printf("%f ", h_B[i]);
      if(((i + 1) % WB) == 0)
      printf("\n");
   }
 
   // 4. allocate host memory for the result C
   unsigned int size_C = WC * HC;
   unsigned int mem_size_C = sizeof(float) * size_C;
   float* h_C = (float*) malloc(mem_size_C);
 
   // 5. Initialize OpenCL
   // OpenCL specific variables
   cl_context clGPUContext;
   cl_command_queue clCommandQue;
   cl_program clProgram;
   cl_kernel clKernel;
  
   size_t dataBytes;
   size_t kernelLength;
   cl_int errcode;
 
   // OpenCL device memory for matrices
   cl_mem d_A;
   cl_mem d_B;
   cl_mem d_C;
 
   /*****************************************/
   /* Initialize OpenCL */
   /*****************************************/
   clGPUContext = clCreateContextFromType(0, 
                   CL_DEVICE_TYPE_GPU, 
                   NULL, NULL, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
   // get the list of GPU devices associated 
   // with context
   errcode = clGetContextInfo(clGPUContext, 
              CL_CONTEXT_DEVICES, 0, NULL, 
              &dataBytes);
   cl_device_id *clDevices = (cl_device_id *)
              malloc(dataBytes);
   errcode |= clGetContextInfo(clGPUContext, 
              CL_CONTEXT_DEVICES, dataBytes, 
              clDevices, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   //Create a command-queue
   clCommandQue = clCreateCommandQueue(clGPUContext, 
                  clDevices[0], 0, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
  
   // Setup device memory
   d_C = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE, 
          mem_size_A, NULL, &errcode);
   d_A = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
          mem_size_A, h_A, &errcode);
   d_B = clCreateBuffer(clGPUContext, 
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
          mem_size_B, h_B, &errcode);
 
 
   // 6. Load and build OpenCL kernel
   char *clMatrixMul = oclLoadProgSource("kernel.cl",
                        "// My comment\n", 
                        &kernelLength);
   shrCheckError(clMatrixMul != NULL, shrTRUE);
 
   clProgram = clCreateProgramWithSource(clGPUContext, 
                1, (const char **)&clMatrixMul, 
                &kernelLength, &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
   errcode = clBuildProgram(clProgram, 0, 
              NULL, NULL, NULL, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   clKernel = clCreateKernel(clProgram, 
               "matrixMul", &errcode);
   shrCheckError(errcode, CL_SUCCESS);
 
 
   // 7. Launch OpenCL kernel
   size_t localWorkSize[2], globalWorkSize[2];
 
   int wA = WA;
   int wC = WC;
   errcode = clSetKernelArg(clKernel, 0, 
              sizeof(cl_mem), (void *)&d_C);
   errcode |= clSetKernelArg(clKernel, 1, 
              sizeof(cl_mem), (void *)&d_A);
   errcode |= clSetKernelArg(clKernel, 2, 
              sizeof(cl_mem), (void *)&d_B);
   errcode |= clSetKernelArg(clKernel, 3, 
              sizeof(int), (void *)&wA);
   errcode |= clSetKernelArg(clKernel, 4, 
              sizeof(int), (void *)&wC);
   shrCheckError(errcode, CL_SUCCESS);
 
   localWorkSize[0] = 3;
   localWorkSize[1] = 3;
   globalWorkSize[0] = 3;
   globalWorkSize[1] = 3;
 
   errcode = clEnqueueNDRangeKernel(clCommandQue, 
              clKernel, 2, NULL, globalWorkSize, 
              localWorkSize, 0, NULL, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   // 8. Retrieve result from device
   errcode = clEnqueueReadBuffer(clCommandQue, 
              d_C, CL_TRUE, 0, mem_size_C, 
              h_C, 0, NULL, NULL);
   shrCheckError(errcode, CL_SUCCESS);
 
   // 9. print out the results
   printf("\n\nMatrix C (Results)\n");
   for(int i = 0; i < size_C; i++)
   {
      printf("%f ", h_C[i]);
      if(((i + 1) % WC) == 0)
      printf("\n");
   }
   printf("\n");
 
   // 10. clean up memory
   free(h_A);
   free(h_B);
   free(h_C);
 
   clReleaseMemObject(d_A);
   clReleaseMemObject(d_C);
   clReleaseMemObject(d_B);
 
   free(clDevices);
   free(clMatrixMul);
   clReleaseContext(clGPUContext);
   clReleaseKernel(clKernel);
   clReleaseProgram(clProgram);
   clReleaseCommandQueue(clCommandQue);

}

In listing 4 above we add the code to launch the kernel and retrieve the results. Before we can launch the kernel we need to set the parameters for our matrixMul kernel. We haven't looked at the kernel yet but it is almost identical to the C for CUDA version which took 5 parameters. To set the parameters we call clSetKernelArg( ) passing in the kernel, the index of the parameter in the kernel function, the size of the parameter and a pointer to the value. We do this for each parameter we need to pass in. If you have ever done any X Windows programming this tedious manner of passing in parameters should look famillar to you. Oh and let's not forget to release all of the OpenCL resources that we have allocated.

Now that the parameters are set all we need to do is queue up a command to launch the kernel on the command queue that we associated with the context earlier. We do that with a call to clEnqueueNDRangeKernel( ). With C for CUDA we had to set the size of our grid of thread block and the size of each thread block. For our first kernel we have 9 threads (3 X 3 matrices) in a thread block and 1 thread block in the grid. With OpenCL we need to do the same thing the only difference is that the dimensions for the grid in OpenCL are expressed in terms of the total number of threads in the grid. Thats what the two arrays localWorkSize and globalWorkSize are used for.

Once we have launched our kernel we enqueue a command to retrieve the results from device memory with a call to clEnqueueReadBuffer( ). This call will block until the kernel has finished. Speeking of the kernel... what is our kernel going to look like...?

Kernel (Listing 1)

// kernel.cl
// Multiply two matrices A * B = C
// Device code.
  
 
// OpenCL Kernel
__kernel void
matrixMul(__global float* C, 
          __global float* A, 
          __global float* B, 
          int wA, int wB)
{
 
   // 2D Thread ID
   int tx = get_local_id(0);
   int ty = get_local_id(1);
 
   // value stores the element 
   // that is computed by the thread
   float value = 0;
   for (int k = 0; k < wA; ++k)
   {
      float elementA = A[ty * wA + k];
      float elementB = B[k * wB + tx];
      value += elementA * elementB;
   }
 
   // Write the matrix to device memory each 
   // thread writes one element
   C[ty * wA + tx] = value;
}

The OpenCL kernel is basicaly identical to the C for CUDA version of the kernel the only differences are really cosmetic. We change the __global__ keyword that CUDA uses for __kernel that OpenCL uses to denote that a function is to be executed on the device. The threadIdx CUDA reference is replaced with calls to OpenCLs get_local_id( ) function and the global memory for our matrices that are passed into the kernel needs to be specified as __global in OpenCL. That's it... otherwise everything else is the same as our CUDA kernel.

Comparing the C for CUDA code to the OpenCL code you can see why we did not use Nvidia's CUDA driver API to start with. There is a significant amount of very mundane code that you have to write with CUDAs driver API and with OpenCL. It made sense for us to go the simpler route with Nvidias C for CUDA because using their CUDA driver API didn't really buy us anything. With OpenCL, on the otherhand, we obtain portability by writing the extra mundane code so it is worth it... at least to me it is.

With the code above you should be able to cut / copy / paste your way to a running binary (don't forget to replace "kernel.cl" in the call to oclLoadProgSource( ) with the complete path to your kernel). You will find that OpenCL does not run this algorithm any faster than CUDA does... performance still sucks! That's OK we will make it better in the next two examples.

On to Matrix Multiplication 2.

Friday, September 18, 2009

OpenCL Compiler (OCLTools)

Unlike C for CUDA OpenCL does not come with a compiler. More precisely to be compliant with the specification a vendor does not need to supply a compiler. The OpenCL driver has a few functions that are related to compiling kernel source to prepare it for execution on the device. Developers building OpenCL applications must put code into their application to find the kernel source, read it in, and compile it. When the source is compiled it is compiled down to either an intermediate representation (ptx assembly in Nvidia's case) or a device specific executable.

Why would a vendor compile to an intermediate representation instead of a device specific binary? This allows OpenCL vendors a greater degree of flexibility in forward and backward application compatibility as the underlying hardware implementations change.

So what does this lack of compiler mean to you the developer? Well... it depends. Lets say I built a suite of option models on OpenCL for internal use at my firm that offload to Nvidia's GPUs. I definitely do not want to be paying the penalty to read in and compile the source each time I run my model. Since I know I'm going to be running on Nvidia GPUs I'll just compile my kernel down to ptx and release the ptx with my models. Later if we decide to run on Larrabee cards I can just recompile the kernel source for Larrabee and re-release the kernel binary. For this type of work flow it would be nice to have a stand alone OpenCL compiler.

What if my product is video encoding / decoding software? Do I want to have to ship multiple versions of my kernel binary? Do I want my users to have to download a new binary just because they have installed a new OpenCL capable device? Probably not. In this case I would just ship the source for my kernel and let the OpenCL driver take care of compiling it for the right platform. If this is my work flow then I don't really need a compiler.

If you are interested in an offline OpenCL compiler go to ClusterChimps.org. They have a suite of Open Source OpenCL compiler tools called OCLTools that not only support offline compilation but also embedded kernel source and encryption.

Thursday, September 17, 2009

OpenCL Program Structure

OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other parallel computing devices. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus APIs that are used to define and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism. While the OpenCL specification allows support for task based parallelism a specific implementation is not required to support task based parallelism to be compliant with the specification.

For purposes of this discussion I will focus on data-based parallelism since that is what is supported on GPUs. I will also focus on Nvidia’s implementation since at this point in time that is where you are going to get the most bang for your buck. I will assume that you have already gone through our CUDA based examples and will therefore be describing OpenCL in terms of CUDA. If you haven’t gone through the CUDA examples you should do so before reading further.

As I mentioned in my CUDA program structure post Nvidia offers two different APIs: C for CUDA and the CUDA driver API. I did not cover Nvidia’s driver API because it is more complicated than their C for CUDA. Well as luck would have it OpenCL is very similar to Nvidia’s driver API which means that it is more complicated to write OpenCL based applications than it is to write C for CUDA based applications. So why do we want to write OpenCL based applications if it’s going to be more difficult? Well… as I mentioned in my GPGPU (A Historical Perspective) post OpenCL is the future. With OpenCL your code will be capable of running on multiple different computing devices (CPUs, GPUs, Cell Processors, Larrabee, and whatever the next “big thing” is) without being modified. So if you don’t like being tied to a single vendor or being forced to port your code to the next “big thing” OpenCL will be the way to go. I say will be. As of today there have been no OpenCL implementations released. AMD and Nvidia are both working on implementations for their processors but they are both still in beta. All of the following examples have been built and tested against Nvidia’s beta OpenCL.

So how much more complicated is OpenCL? The added complexity lies in the host code that must be written to control the kernel code. C for CUDA provided you with a compiler for the kernel code and the linker took care of the rest. With OpenCL you must write code to both locate the kernel binary and load it onto the device or to find the kernel source, compile it to binary, and load it onto the device. That’s right… no compiler. The driver API contains compiler methods but there is no stand alone compiler. You can however build a compiler from the OpenCL driver API (see my OpenCL Compiler (oclcc) example). Below is the OpenCL application software stack.

OpenCL kernel code itself is conceptually identical to C for CUDA kernel code. The parallel concepts are also very similar. With C for CUDA we had a grid of thread blocks that contained multiple threads.

With OpenCL we have an NDRange (N – Dimensional Range) of work groups that contain multiple work items.

With OpenCL you must create an OpenCL context and associate devices, kernels, program objects, memory objects, and a command queue with the context. All of this is done in host code using OpenCL APIs. The host code can then interact with the device by inserting commands onto the command queue. To launch the kernel you simply put a launch command on the command queue. To retrieve your results you put a memory copy command on the command queue requesting that the device memory containing your results be copied back to host memory. This command will block until the kernel is finished.

Time to write some code... Matrix Multiplication 1

Tuesday, September 1, 2009

Matrix Multiplication 3 (CUDA)

In my CUDA Program Structure post I mentioned that CUDA provides three abstractions: a hierarchy of thread groups, shared memory, and thread synchronization. We have already covered the hierarchy of thread groups in Matrix Multiplication 1 and Matrix Multiplication 2. In this posting we will cover shared memory and thread synchronization.

There are 4 different types of memory: registers, shared, global, and constant (there is also a 5th type of memory known as texture memory but we aren't going to get into that). Why so many different layers of memory on a GPU? Well if you look at a CPU it will have registers, Level 1 cache, Level 2 cache, Level 3 cache, and DRAM. It has these many different layers of memory so that applications can achieve accelerated performance at minimal cost. A GPU is no different than a CPU from this perspective. On a CPU the different memory layers are abstracted away from the programmer by their high level language compiler of choice. Unfortunately on the GPU they are not abstracted away and to write high performance GPGPU applications we have to know the details of the memory hierarchy.

The different memory layers on a GPU differ in access times, size, and in access visibility. Global memory is the slowest memory but it is also the largest. On most GPUs there will be a few GBytes of global memory. Global memory is visible on both the host and the device. It can also be read and written from both host code and device code.

Registers and shared memory are the fastest memory but also the smallest. Registers and shared memory are measured in Kbytes not Gbytes. Registers and shared memory can be read and written to by device code and are not visible to host code at all.

Constant memory is faster than global but slower than registers and shared memory. It is also measure in Gbytes, however, it can only be written to in host code. Device code can read constant memory but it can not write to it.

If a variable is mapped to a register then each instance of the executing kernel (each thread) has its own copy of the variable. By default automatic variables are mapped to registers. If a variable is mapped to shared memory then each thread block has its own copy of the variable. Data residing in shared memory is accessible by all threads in the thread block but not by threads in other thread blocks. The attributes of the different CUDA memory layers are summarised below:

Memory	On Chip	Access Speed	Size	Visibility	Access Type
Register	Yes	Fast	Small	thread	D(R,W) H(None)
Shared	Yes	Fast	Small	thread block	D(R,W) H(None)
Constant	No	Slow	Small	grid	D(R) H(W)
Global	No	Slowest	Large	grid	D(R,W) H(R,W)

Now that we know a little about the CUDA memory hierarchy how can we use this information to speed up our matrix multiplication? Our first two matrix multiplication programs copied the input data into global device memory (thats what cudaMalloc( ) returns) and then the kernel read it out of global memory. From the information above we know that global device memory is the slowest memory. So we should try and limit our usage of it. By changing our kernel to copy large blocks of data from global memory into shared memory and then processing the data from shared memory instead of global we should be able to speed up our overall computation.

For example let's take matrix A x matrix B and store the results in matrix C. Assume that the dimensions of C are greater than what we can fit in a single thread block (maximum of 512 threads per thread block). We will have to break matric C down into 4 sub matrices that fit into a single thread block and we will have to break both A and B into 2 sub matrices. Since matrix multiplication is a SIMD algorithm we can do this and still arrive at the correct answer (see animation below).

So we end up breaking down the problem of A X B = C into 4 sub problems of A(1) X B(1) = C(1), A(1) X B(2) = C(2), A(2) X B(1) = C(3), and A(2) X B(2) = C(4). This isn't anything new it's the same breakdown we did for matrix multiplication 2 just expressed slightly different visualy. Now let's just focus on the first sub problem: A(1) X B(1) = C(1).

This sub problem will be solved by a single processing core on the GPU. Since the amount of shared memory on the processing core is limited we will have to break the sub problem into - you guessed it - sub problems! Let's assume that the amount of shared memory that we have available is only enough to fit half of matrix A(1) and B(1). This means that we will have to break our sub problem into 2 sub problems that fit into shared memory (see animation below).

Slide 1 shows the sub problem that we are working on (A(1) * B(1) = C(1)). In slide 2 we copy in half of each matrix from global memory into shared memory. The two smaller matrices are multiplied together and the results are stored in C(1) on slide 3. On slide 4 we copy the other half of each matrix from global to shared memory and we finish up by multiplying them together and adding the results to C(1) on slide 5. Keep in mind that all 4 of the original sub problems are being solved in parallel on different GPU cores.

So now that you understand conceptually what we need to achieve lets look at the code. Again the original main program should work fine without any modifications, however, I did pull out the printfs so your screen doesn't get so cluttered.

Main program (Listing 1)

// Multiply two matrices A * B = C
 
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <matrixMul_kernel.cu>
 
// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
   for (int i = 0; i < size; ++i)
   data[i] = rand() / (float)RAND_MAX;
}
 
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
 
int
main(int argc, char** argv)
{

   // set seed for rand()
   srand(2006);
 
   // 1. allocate host memory for matrices A and B
   unsigned int size_A = WA * HA;
   unsigned int mem_size_A = sizeof(float) * size_A;
   float* h_A = (float*) malloc(mem_size_A);
 
   unsigned int size_B = WB * HB;
   unsigned int mem_size_B = sizeof(float) * size_B;
   float* h_B = (float*) malloc(mem_size_B);
 
   // 2. initialize host memory
   randomInit(h_A, size_A);
   randomInit(h_B, size_B);
    
   // 8. allocate device memory
   float* d_A;
   float* d_B;
   cudaMalloc((void**) &d_A, mem_size_A);
   cudaMalloc((void**) &d_B, mem_size_B);
 
   // 9. copy host memory to device
   cudaMemcpy(d_A, h_A, mem_size_A, 
   cudaMemcpyHostToDevice);
   cudaMemcpy(d_B, h_B, mem_size_B, 
   cudaMemcpyHostToDevice);
 
   // 4. allocate host memory for the result C
   unsigned int size_C = WC * HC;
   unsigned int mem_size_C = sizeof(float) * size_C;
   float* h_C = (float*) malloc(mem_size_C);
 
   // 10. allocate device memory for the result
   float* d_C;
   cudaMalloc((void**) &d_C, mem_size_C);
 
   // 5. perform the calculation
   // setup execution parameters
   dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
   dim3 grid(WC / threads.x, HC / threads.y);
 
   // execute the kernel
   matrixMul<<< grid, threads >>>(d_C, d_A, 
                                  d_B, WA, WB);
 
   // 11. copy result from device to host
   cudaMemcpy(h_C, d_C, mem_size_C, 
   cudaMemcpyDeviceToHost);
  
   // 7. clean up memory
   free(h_A);
   free(h_B);
   free(h_C);
   cudaFree(d_A);
   cudaFree(d_B);
   cudaFree(d_C);
 
}

The kernel, on the other hand, is going to be changed significantly. We will need to determine what data from matrix A and B we will be working with based on which thread block we are in the grid. We will need to copy that data from global memory to the shared memory of our processing core assigning each thread in our thread block a specific data element to copy. We then need to perform the math and write our results to global memory.

Kernel (Listing 1)

/* Matrix multiplication: C = A * B.
 * Device code.
 */
 
#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_
 
#include 
 
// Thread block size
#define BLOCK_SIZE 16
#define TILE_SIZE  16 
#define WA 1024   // Matrix A width
#define HA 1024   // Matrix A height
#define WB 1024   // Matrix B width
#define HB WA     // Matrix B height
#define WC WB     // Matrix C width
#define HC HA     // Matrix C height
 
//////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! wA is A's width and wB is B's width
//////////////////////////////////////////////////////
__global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
 
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
 
    // Index of the first sub-matrix of A processed 
    // by the block
    int aBegin = wA * BLOCK_SIZE * by;
 
    // Index of the last sub-matrix of A processed 
    // by the block
    int aEnd   = aBegin + wA - 1;
 
    // Step size used to iterate through the 
    // sub-matrices of A
    int aStep  = BLOCK_SIZE;
 
    // Index of the first sub-matrix of B processed 
    // by the block
    int bBegin = BLOCK_SIZE * bx;
 
    // Step size used to iterate through the 
    // sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;
 
    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) 
    {

        // Declaration of the shared memory array As 
        // used to store the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
 
        // Declaration of the shared memory array Bs 
        // used to store the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
 
        // Load the matrices from global memory
        // to shared memory; each thread loads
        // one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
    }
}

The kernel code above uses blockIdx to determine the start and end location of our two sub matrices. It then loops through our input matrices and has each thread in the thread block copy one cell of the A and B sub matrices into shared memory. Keep in mind that shared memory has thread block scope so each thread in the thread block can see the data copied over by the other threads.

While our kernel is runnable at this point all it is doing is copying around memory so it's not very useful yet. Kernel listing 2 below adds the logic to multiply and copy the results back into global memory.

Kernel (Listing 2)

/* Matrix multiplication: C = A * B.
 * Device code.
 */
 
#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_
 
// Thread block size
#define BLOCK_SIZE 16
#define TILE_SIZE  16 
#define WA 1024   // Matrix A width
#define HA 1024   // Matrix A height
#define WB 1024   // Matrix B width
#define HB WA     // Matrix B height
#define WC WB     // Matrix C width
#define HC HA     // Matrix C height
 
//////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! wA is A's width and wB is B's width
//////////////////////////////////////////////////////
__global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
 
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
 
    // Index of the first sub-matrix of A processed 
    // by the block
    int aBegin = wA * BLOCK_SIZE * by;
 
    // Index of the last sub-matrix of A processed 
    // by the block
    int aEnd   = aBegin + wA - 1;
 
    // Step size used to iterate through the 
    // sub-matrices of A
    int aStep  = BLOCK_SIZE;
 
    // Index of the first sub-matrix of B processed 
    // by the block
    int bBegin = BLOCK_SIZE * bx;
 
    // Step size used to iterate through the 
    // sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;
 
    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) 
    {

        // Declaration of the shared memory array As 
        // used to store the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
 
        // Declaration of the shared memory array Bs 
        // used to store the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
 
        // Load the matrices from global memory
        // to shared memory; each thread loads
        // one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
 
        // Synchronize to make sure the matrices 
        // are loaded
        __syncthreads();
 
        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];
 
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
 
    }
 
    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;

}

In the new code above you see two calls being made to a __syncthreads( ) function. This function is what provides the thread synchronization abstraction. Calls to __syncthreads( ) from within the same thread block will block until all threads in the thread block reach the __syncthreads( ) call. Once all threads hit the __syncthreads( ) barrier the barrier is removed and all threads in that thread block continue on processing the next instruction.

The first call to __syncthreads( ) is made after the thread copies over its assigned cell from the A and B global memory into As and Bs shared memory but before the thread iterates over As and Bs to calculate the cell in matrix C that the thread is responsible for. We must call __syncthreads( ) to insure that all of the global data that we need to calculate our cell has actually been copied to shared memory before we iterate over it.

The second call to __syncthreads( ) is performed after we iterate over the shared memory and calculate the result for our sub problem. It is necessary to insure that all of the threads in the thread block are done iterating over the As and Bs shared memory before we go back to the top of the outermost for loop and overlay the As and Bs shared memory with the data for our next sub problem.

Once we finish all of the sub problems we store the results of the accumulation of all of the sub problems back into global memory. At this point this particular thread block is finished and the processing core is available to run the next thread block.

So let's take this version out for a test drive. Running this version yields a 48X speedup over the CPU verion. If we look back at our Matrix Multiplication 2 example we see that it was 44X faster than the CPU version. Now you may be thinking that this is an awful lot of work to just squeeze and extra 4X speedup... and perhaps it is. But let's change the size of our matrix to 2048 and see what we "see".

If we run our Matrix Multiplication 2 example,which uses only global memory, with matrices of 2048 X 2048 we will find that the GPU version is 255X faster than the CPU verion of the algorithm. Our shared memory implementation of the GPU version is 370X faster than the CPU version. Looks like we really did get some "bang for our buck" after all!

Monday, August 31, 2009

Matrix Multiplication 2 (CUDA)

In Matrix Multiplication 1 we learned how to use CUDA to offload a SIMD algorithm to the GPU. We picked a trivial example just to get our feet wet. Now let’s make it a bit more complicated. Instead of multiplying 3 by 3 matrices together we are going to multiply 1024 by 1024 matrices together. You might ask… how is this more complicated? Can’t we just change the defines at the top of our kernel to be 1024 instead of 3 and recompile? Unfortunately we can not. There is a limit to the number of threads that can be in a thread block and that limit is 512. So in order to multiply 1024 by 1024 matrices together we are going to have to make changes to the way we decompose the problem. Instead of processing the entire matrix with a single thread block we are going to have multiple thread blocks each process a portion of the 1024 by 1024 matrices.

Looking at the animation above you will see 2 matrices being multiplied together. In step 2 we divide the result matrix into 16 tiles each containing a 3 by 3 sub matrix. Since matrix multiplication is a SIMD (Single Instruction Multiple Data) algorithm, each tile can be processed concurrently without impacting the accuracy of the result. In step 3 a kernel is launched creating a grid with 16 thread blocks each containing 9 threads. Each tile is allocated to a thread block and each thread block has 9 threads (one for each element of the sub matrix). The computations for all cells of the result matrix occur in parallel at step 3. Whereas our animation shows the subdivision of a 12 by 12 result matrix into sixteen 3 by 3 tiles we will need to divide our 1024 by 1024 result matrix into 64 16 by 16 tiles as shown in the image below.

In the figure above you will see that we will be using a 64 by 64 grid of thread blocks and each thread block will contain 16 by 16 threads. As it turns out the main program that we used in Matrix Multiplication 1 has everything we need to solve this problem we need only change our kernel slightly.

Main program (Listing 1)

// Multiply two matrices A * B = C
 
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <matrixMul_kernel.cu>
 
// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
   for (int i = 0; i < size; ++i)
   data[i] = rand() / (float)RAND_MAX;
}
 
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
 
int
main(int argc, char** argv)
{

   // set seed for rand()
   srand(2006);
 
   // 1. allocate host memory for matrices A and B
   unsigned int size_A = WA * HA;
   unsigned int mem_size_A = sizeof(float) * size_A;
   float* h_A = (float*) malloc(mem_size_A);
 
   unsigned int size_B = WB * HB;
   unsigned int mem_size_B = sizeof(float) * size_B;
   float* h_B = (float*) malloc(mem_size_B);
 
   // 2. initialize host memory
   randomInit(h_A, size_A);
   randomInit(h_B, size_B);
 
   // 3. print out A and B
   printf("\n\nMatrix A\n");
   for(int i = 0; i < size_A; i++)
   {
      printf("%f ", h_A[i]);
      if(((i + 1) % WA) == 0)
      printf("\n");
   }
  
   printf("\n\nMatrix B\n");
   for(int i = 0; i < size_B; i++)
   {
      printf("%f ", h_B[i]);
      if(((i + 1) % WB) == 0)
      printf("\n");
   }
 
   // 8. allocate device memory
   float* d_A;
   float* d_B;
   cudaMalloc((void**) &d_A, mem_size_A);
   cudaMalloc((void**) &d_B, mem_size_B);
 
   // 9. copy host memory to device
   cudaMemcpy(d_A, h_A, mem_size_A, 
   cudaMemcpyHostToDevice);
   cudaMemcpy(d_B, h_B, mem_size_B, 
   cudaMemcpyHostToDevice);
 
   // 4. allocate host memory for the result C
   unsigned int size_C = WC * HC;
   unsigned int mem_size_C = sizeof(float) * size_C;
   float* h_C = (float*) malloc(mem_size_C);
 
   // 10. allocate device memory for the result
   float* d_C;
   cudaMalloc((void**) &d_C, mem_size_C);
 
   // 5. perform the calculation
   // setup execution parameters
   dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
   dim3 grid(WC / threads.x, HC / threads.y);
 
   // execute the kernel
   matrixMul<<< grid, threads >>>(d_C, d_A, 
                                  d_B, WA, WB);
 
   // 11. copy result from device to host
   cudaMemcpy(h_C, d_C, mem_size_C, 
   cudaMemcpyDeviceToHost);
 
   // 6. print out the results
   printf("\n\nMatrix C (Results)\n");
   for(int i = 0; i < size_C; i++)
   {
      printf("%f ", h_C[i]);
      if(((i + 1) % WC) == 0)
      printf("\n");
   }
   printf("\n");
 
   // 7. clean up memory
   free(h_A);
   free(h_B);
   free(h_C);
   cudaFree(d_A);
   cudaFree(d_B);
   cudaFree(d_C);
 
}

As we learned in Matrix Multiplication 1 when a CUDA kernel is launched a grid is created containing multiple thread blocks. Each thread block contains multiple threads. The number of thread blocks in a grid and the number of threads in a thread block is controlled by the programmer passing in variables to the kernel launch command. We also have learned that CUDA extends the C language to allow the programmer to determine which thread his kernel is executing on (threadIdx.x, threadIdx.y). Well as it turns out CUDA also extends the language to allow the programmer to determine what thread block his thread is in (blockIdx.x, blockIdx.y). To be able to multiply large matrices together we need only make a slight change to our kernel to take the block index into consideration when determining what data a specific running kernel instance should access. By doing this we will be able to determine which tile we are in as well as which thread we are processing the data with.

Kernel (Listing 1)


#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_
 
#include <stdio.h>
 
// Thread block size
#define BLOCK_SIZE 16
#define TILE_SIZE  16
 
#define WA 1024   // Matrix A width
#define HA 1024   // Matrix A height
#define WB 1024   // Matrix B width
#define HB WA     // Matrix B height
#define WC WB     // Matrix C width
#define HC HA     // Matrix C height
 
// CUDA Kernel
__global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)
{
 
   // 1. 2D Thread ID
   int tx = blockIdx.x * TILE_SIZE + threadIdx.x;
   int ty = blockIdx.y * TILE_SIZE + threadIdx.y;
 
   // value stores the element that is 
   // computed by the thread
   float value = 0;
   for (int i = 0; i < wA; ++i)
   {
      float elementA = A[ty * wA + i];
      float elementB = B[i * wB + tx];
      value += elementA * elementB;
   }
 
   // Write the matrix to device memory each 
   // thread writes one element
   C[ty * wA + tx] = value;
}
 
#endif // #ifndef _MATRIXMUL_KERNEL_H_

The code in section 1 has been changed to take the index of the block (and the size of the tile) into consideration. While we didn't actually change the code for pulling out elementA and elementB from our input matrices or for putting the results back to the result matrix, since both of these are based on tx and ty, they will end up indexing to the correct location. Additionally since our main program determines the argument for the kernel launch based on the #defines in the kernel file we don't need to make any changes to the main program.

Make the changes outlined above and run your program. You should see a few thousand number spewed across the screen. If you insert timing code on this version you will find that it is 44X faster than a CPU version of the same algorithm. While our runtime is clearly moving in the right direction there is still room for optimization.

To learn how to make your code faster... Martix Multiplication 3.