Friday, October 2, 2009

Matrix Multiplication 3 (OpenCL)

The memory model for OpenCL applications is strikingly similar to the CUDA memory model (go figure). The only real differences are that CUDA supports texture memory (which we didn't get into in our CUDA examples) and OpenCL provides for a global / constant memory cache (Nvidia's new Fermi GPU adds a global / constant memory cache to CUDA as well).

With CUDA we were able to make some assumptions about the sizes and speeds of the various types of memory. Well, they weren't really assumptions, since we knew the exact details of the underlying hardware that CUDA applications run on. With OpenCL we can't easily make such assumptions. The OpenCL specification identifies the different types of memory that must be supported from the API perspective, but OpenCL driver providers have to map that specification onto their underlying hardware. For example, the amount and speed of global memory on a Cell processor vs. an Nvidia GPU can vary significantly. While you should be guaranteed that your OpenCL code will compile and run on different underlying devices, you should expect to do some fine-tuning to achieve the best performance on each platform. The types of memory provided by the OpenCL specification are:

Global Memory (CUDA Global): This memory region permits read/write access to all work-items in all work-groups. Work-items can read from or write to any element of a memory object. Reads and writes to global memory may be cached depending on the capabilities of the device.

Constant Memory (CUDA Constant): A region of global memory that remains constant during the execution of a kernel. The host allocates and initializes memory objects placed into constant memory.

Local Memory (CUDA Shared): A memory region local to a work-group. This memory region can be used to allocate variables that are shared by all work-items in that work-group. It may be implemented as dedicated regions of memory on the OpenCL device. Alternatively, the local memory region may be mapped onto sections of the global memory.

Private Memory (CUDA Register): A region of memory private to a work-item. Variables defined in one work-item's private memory are not visible to another work-item.

Memory     Access Speed   Visibility           Access Type
Private    Faster         Processing Element   D(R,W) H(None)
Local      Faster         Work-Group           D(R,W) H(None)
Constant   Slower         NDRange              D(R) H(W)
Global     Slower         NDRange              D(R,W) H(R,W)

(D = device access, H = host access)

While it can be dangerous to make assumptions regarding the speed of the different types, I think it is safe to assume that local and private memory will be faster than global and constant memory across all OpenCL devices. How much faster is impossible to say in general.

Now that we know a little about the OpenCL memory hierarchy, how can we use this information to speed up our matrix multiplication? Our first two matrix multiplication programs copied the input data into global device memory (that's what clCreateBuffer() allocates) and the kernel then read it out of global memory. From the information above we know that global memory is slower than local memory, so we should try to limit our use of it. By changing our kernel to copy blocks of data from global memory into local memory and then processing the data from local memory instead of global memory, we should be able to speed up the overall computation.

The CUDA Matrix Multiplication 3 example explains how we need to decompose our calculation into sub-problems small enough to fit into the limited local (shared) memory on the device. You might want to revisit that explanation to refresh your memory.

So now that we know what changes we need to make, let's put together some code. The main program from OpenCL Matrix Multiplication 2 should work as is. Again, I removed the printfs.

Main program (Listing 1)

// Multiply two matrices A * B = C

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <oclUtils.h>

#define WA 1024
#define HA 1024
#define WB 1024
#define HB WA
#define WC WB
#define HC HA

// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
   for (int i = 0; i < size; ++i)
      data[i] = rand() / (float)RAND_MAX;
}

// Program main

int main(int argc, char** argv)
{
// set seed for rand()
srand(2006);

// 1. allocate host memory for matrices A and B
unsigned int size_A = WA * HA;
unsigned int mem_size_A = sizeof(float) * size_A;
float* h_A = (float*) malloc(mem_size_A);

unsigned int size_B = WB * HB;
unsigned int mem_size_B = sizeof(float) * size_B;
float* h_B = (float*) malloc(mem_size_B);

// 2. initialize host memory
randomInit(h_A, size_A);
randomInit(h_B, size_B);

// 4. allocate host memory for the result C
unsigned int size_C = WC * HC;
unsigned int mem_size_C = sizeof(float) * size_C;
float* h_C = (float*) malloc(mem_size_C);

// 5. Initialize OpenCL
// OpenCL specific variables
cl_context clGPUContext;
cl_command_queue clCommandQue;
cl_program clProgram;
cl_kernel clKernel;

size_t dataBytes;
size_t kernelLength;
cl_int errcode;

// OpenCL device memory for matrices
cl_mem d_A;
cl_mem d_B;
cl_mem d_C;

/* Initialize OpenCL */
clGPUContext = clCreateContextFromType(0,
                   CL_DEVICE_TYPE_GPU,
                   NULL, NULL, &errcode);
shrCheckError(errcode, CL_SUCCESS);

// get the list of GPU devices associated
// with context
errcode = clGetContextInfo(clGPUContext,
              CL_CONTEXT_DEVICES, 0, NULL,
              &dataBytes);
cl_device_id *clDevices = (cl_device_id *)
              malloc(dataBytes);
errcode |= clGetContextInfo(clGPUContext,
              CL_CONTEXT_DEVICES, dataBytes,
              clDevices, NULL);
shrCheckError(errcode, CL_SUCCESS);

//Create a command-queue
clCommandQue = clCreateCommandQueue(clGPUContext,
clDevices[0], 0, &errcode);
shrCheckError(errcode, CL_SUCCESS);

// Setup device memory
d_C = clCreateBuffer(clGPUContext,
          CL_MEM_READ_WRITE,
          mem_size_C, NULL, &errcode);
d_A = clCreateBuffer(clGPUContext,
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
          mem_size_A, h_A, &errcode);
d_B = clCreateBuffer(clGPUContext,
          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
          mem_size_B, h_B, &errcode);

// 6. Load and build OpenCL kernel

// Open the ptx file and load it
// into a char* buffer
FILE* fp = fopen("kernel.ptx", "rb");
fseek(fp, 0, SEEK_END);
const size_t lSize = ftell(fp);
rewind(fp);
unsigned char* buffer;
buffer = (unsigned char*) malloc(lSize);
fread(buffer, 1, lSize, fp);
fclose(fp);

cl_int status;
clProgram = clCreateProgramWithBinary(clGPUContext,
                1, (const cl_device_id *)clDevices,
                &lSize, (const unsigned char**)&buffer,
                &status, &errcode);
shrCheckError(status, CL_SUCCESS);
shrCheckError(errcode, CL_SUCCESS);

errcode = clBuildProgram(clProgram, 0, NULL,
              NULL, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);

clKernel = clCreateKernel(clProgram,
"matrixMul", &errcode);
shrCheckError(errcode, CL_SUCCESS);

// 7. Launch OpenCL kernel
size_t localWorkSize[2], globalWorkSize[2];

int wA = WA;
int wC = WC;
errcode = clSetKernelArg(clKernel, 0,
sizeof(cl_mem), (void *)&d_C);
errcode |= clSetKernelArg(clKernel, 1,
sizeof(cl_mem), (void *)&d_A);
errcode |= clSetKernelArg(clKernel, 2,
sizeof(cl_mem), (void *)&d_B);
errcode |= clSetKernelArg(clKernel, 3,
sizeof(int), (void *)&wA);
errcode |= clSetKernelArg(clKernel, 4,
sizeof(int), (void *)&wC);
shrCheckError(errcode, CL_SUCCESS);

localWorkSize[0] = 16;
localWorkSize[1] = 16;
globalWorkSize[0] = 1024;
globalWorkSize[1] = 1024;

errcode = clEnqueueNDRangeKernel(clCommandQue,
clKernel, 2, NULL, globalWorkSize,
localWorkSize, 0, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);

// 8. Retrieve result from device
errcode = clEnqueueReadBuffer(clCommandQue,
d_C, CL_TRUE, 0, mem_size_C,
h_C, 0, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);

// 10. clean up memory
free(h_A);
free(h_B);
free(h_C);

clReleaseMemObject(d_A);
clReleaseMemObject(d_C);
clReleaseMemObject(d_B);

free(clDevices);
clReleaseContext(clGPUContext);
clReleaseKernel(clKernel);
clReleaseProgram(clProgram);
clReleaseCommandQueue(clCommandQue);
}


The kernel is conceptually identical to the kernel we put together for the CUDA Matrix Multiplication 3 example. All we really need to do is make syntactic changes.

Kernel (Listing 1)

/* Matrix multiplication: C = A * B.
 * Device code.
 */

// Thread block size
#define BLOCK_SIZE 16

//! Matrix multiplication on the device: C = A * B
//! wA is A's width and wB is B's width
__kernel void
matrixMul(__global float* C,
          __global float* A,
          __global float* B, int wA, int wB)
{
// Block index
int bx = get_group_id(0);
int by = get_group_id(1);

// Thread index
int tx = get_local_id(0);
int ty = get_local_id(1);

// Index of the first sub-matrix of A processed
// by the block
int aBegin = wA * BLOCK_SIZE * by;

// Index of the last sub-matrix of A processed
// by the block
int aEnd = aBegin + wA - 1;

// Step size used to iterate through the
// sub-matrices of A
int aStep = BLOCK_SIZE;

// Index of the first sub-matrix of B processed
// by the block
int bBegin = BLOCK_SIZE * bx;

// Step size used to iterate through the
// sub-matrices of B
int bStep = BLOCK_SIZE * wB;

// The element of the block sub-matrix that is
// computed by this work-item
float Csub = 0.0f;

// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep)
{

// Declaration of the local memory array As
// used to store the sub-matrix of A
__local float As[BLOCK_SIZE][BLOCK_SIZE];

// Declaration of the local memory array Bs
// used to store the sub-matrix of B
__local float Bs[BLOCK_SIZE][BLOCK_SIZE];

// Load the matrices from global memory
// to local memory; each thread loads
// one element of each matrix
As[ty][tx] = A[a + wA * ty + tx];
Bs[ty][tx] = B[b + wB * ty + tx];

// Synchronize to make sure the matrices
// are loaded
barrier(CLK_LOCAL_MEM_FENCE);

// Multiply the two matrices together;
// each thread computes one element
// of the block sub-matrix
for (int k = 0; k < BLOCK_SIZE; ++k)
Csub += As[ty][k] * Bs[k][tx];

// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
barrier(CLK_LOCAL_MEM_FENCE);
}


// Write the block sub-matrix to device memory;
// each thread writes one element
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;
}


We changed __global__ to __kernel to indicate that the function is to be run on the device. We added __global qualifiers to designate that the input matrices are in global memory. The references to blockIdx and threadIdx are converted to calls to get_group_id( ) and get_local_id( ). The shared memory arrays are converted to local memory arrays, and the calls to __syncthreads( ) are converted to calls to barrier( ).

Most of the above changes are self-explanatory with the exception of the call to barrier( ). The barrier( ) function is identical in function to the CUDA __syncthreads( ) function. All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. The function must be encountered by all work-items in a work-group executing the kernel. If barrier( ) is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier. If barrier( ) is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.

So let's take this version out for a test drive. Running this version yields a 61X speedup over the CPU version. If we look back at our Matrix Multiplication 2 example, we see that it was 57X faster than the CPU version. Now you may be thinking that this is an awful lot of work just to squeeze out an extra 4X speedup... and perhaps it is. But let's change the size of our matrices to 2048 x 2048 and see what we "see".

If we run our Matrix Multiplication 2 example, which uses only global memory, with 2048 x 2048 matrices, we find that the GPU version is 279X faster than the CPU version of the algorithm. Our local memory implementation of the GPU version is 370X faster than the CPU version. Looks like we really did get some "bang for our buck" after all!