Thursday, August 27, 2009

Matrix Multiplication 1 (CUDA)

I will assume that you have already downloaded and installed the appropriate CUDA driver, toolkit, and SDK from Nvidia. I am using version 2.2, but what we are covering should work with any version. If you don’t already have the software and driver installed, go to http://www.nvidia.com/object/cuda_get.html. CUDA 2.2 comes with a template directory inside the projects directory. Just create a clone of this directory, name it matrix1, and delete the .cu files from it. Since we have been talking in terms of matrix multiplication, let’s continue the trend. We will put together a trivial example of multiplying two 3 x 3 matrices using C for CUDA. For a moment, though, let’s forget about CUDA. How would we do this in plain C?

Well, first off you would need to allocate memory for the two matrices that you are multiplying together… let’s call them A and B. Then you would need to copy your data into the memory you have allocated. At that point you might want to print A and B out. You would also need to allocate memory for the result… let’s call it matrix C. The next step would be to perform the multiplication, and finally you might want to print out the result and free up the memory you allocated.

Main program (Listing 1)

// Multiply two matrices A * B = C

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
    for (int i = 0; i < size; ++i)
        data[i] = rand() / (float)RAND_MAX;
}

/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////

int
main(int argc, char** argv)
{
    // set seed for rand()
    srand(2006);

    // 1. allocate host memory for matrices A and B
    unsigned int size_A = WA * HA;
    unsigned int mem_size_A = sizeof(float) * size_A;
    float* h_A = (float*) malloc(mem_size_A);

    unsigned int size_B = WB * HB;
    unsigned int mem_size_B = sizeof(float) * size_B;
    float* h_B = (float*) malloc(mem_size_B);

    // 2. initialize host memory
    randomInit(h_A, size_A);
    randomInit(h_B, size_B);

    // 3. print out A and B
    printf("\n\nMatrix A\n");
    for (int i = 0; i < size_A; i++)
    {
        printf("%f ", h_A[i]);
        if (((i + 1) % WA) == 0)
            printf("\n");
    }

    printf("\n\nMatrix B\n");
    for (int i = 0; i < size_B; i++)
    {
        printf("%f ", h_B[i]);
        if (((i + 1) % WB) == 0)
            printf("\n");
    }

    // 4. allocate host memory for the result C
    unsigned int size_C = WC * HC;
    unsigned int mem_size_C = sizeof(float) * size_C;
    float* h_C = (float*) malloc(mem_size_C);

    // 5. perform the calculation
    ...

    // 6. print out the results
    printf("\n\nMatrix C (Results)\n");
    for (int i = 0; i < size_C; i++)
    {
        printf("%f ", h_C[i]);
        if (((i + 1) % WC) == 0)
            printf("\n");
    }
    printf("\n");

    // 7. clean up memory
    free(h_A);
    free(h_B);
    free(h_C);
}


The code in Listing 1 above has nothing CUDA-specific about it yet. We have put together a randomInit function to generate random floats for our matrices. In section 1 we allocate memory for our matrices. You will notice that WA, HA, WB, and HB are not defined in this file (just ignore that for now). In section 2 we use our randomInit function to generate test data. In section 3 we print out the data we have initialized our matrices with. Next we allocate memory for the result. Section 5 doesn’t actually have any code yet; that work will be done in the CUDA kernel, so just ignore it for now. At section 6 we print out our results, and finally we free up all of the resources we allocated. Pretty straightforward… there is nothing special about the code yet.
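
For comparison, if we were staying on the CPU, section 5 could be filled in with an ordinary triple loop. The sketch below is only for reference (it is not part of the listings we are building up); it uses the same WA/HA/WB/WC width and height constants that the kernel header will define later.

    // CPU reference for section 5: C = A * B, where A is HA x WA,
    // B is HB x WB (HB == WA), and C is HC x WC (HC == HA, WC == WB).
    for (int row = 0; row < HA; ++row)
    {
        for (int col = 0; col < WB; ++col)
        {
            float sum = 0;
            for (int k = 0; k < WA; ++k)
                sum += h_A[row * WA + k] * h_B[k * WB + col];
            h_C[row * WC + col] = sum;
        }
    }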

As I mentioned in my CUDA Program Structure post, a typical CUDA program will copy data to the device, launch a kernel, and then copy back the results. So let’s add the code to do the memory copying back and forth.

Main program (Listing 2)

// Multiply two matrices A * B = C

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
    for (int i = 0; i < size; ++i)
        data[i] = rand() / (float)RAND_MAX;
}

/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////

int
main(int argc, char** argv)
{
    // set seed for rand()
    srand(2006);

    // 1. allocate host memory for matrices A and B
    unsigned int size_A = WA * HA;
    unsigned int mem_size_A = sizeof(float) * size_A;
    float* h_A = (float*) malloc(mem_size_A);

    unsigned int size_B = WB * HB;
    unsigned int mem_size_B = sizeof(float) * size_B;
    float* h_B = (float*) malloc(mem_size_B);

    // 2. initialize host memory
    randomInit(h_A, size_A);
    randomInit(h_B, size_B);

    // 3. print out A and B
    printf("\n\nMatrix A\n");
    for (int i = 0; i < size_A; i++)
    {
        printf("%f ", h_A[i]);
        if (((i + 1) % WA) == 0)
            printf("\n");
    }

    printf("\n\nMatrix B\n");
    for (int i = 0; i < size_B; i++)
    {
        printf("%f ", h_B[i]);
        if (((i + 1) % WB) == 0)
            printf("\n");
    }

    // 8. allocate device memory
    float* d_A;
    float* d_B;
    cudaMalloc((void**) &d_A, mem_size_A);
    cudaMalloc((void**) &d_B, mem_size_B);

    // 9. copy host memory to device
    cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, mem_size_B, cudaMemcpyHostToDevice);

    // 4. allocate host memory for the result C
    unsigned int size_C = WC * HC;
    unsigned int mem_size_C = sizeof(float) * size_C;
    float* h_C = (float*) malloc(mem_size_C);

    // 10. allocate device memory for the result
    float* d_C;
    cudaMalloc((void**) &d_C, mem_size_C);

    // 5. perform the calculation
    ...

    // 11. copy result from device to host
    cudaMemcpy(h_C, d_C, mem_size_C, cudaMemcpyDeviceToHost);

    // 6. print out the results
    printf("\n\nMatrix C (Results)\n");
    for (int i = 0; i < size_C; i++)
    {
        printf("%f ", h_C[i]);
        if (((i + 1) % WC) == 0)
            printf("\n");
    }
    printf("\n");

    // 7. clean up memory
    free(h_A);
    free(h_B);
    free(h_C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}


At section 8 we make a call to cudaMalloc() to allocate memory on the device for our input matrices. This is the first CUDA-specific code that we have added. cudaMalloc() allocates mem_size_[A|B] bytes of linear memory on the device and returns in d_[A|B] a pointer to the allocated memory. The allocated memory is suitably aligned for any kind of variable, but it is not cleared. Once we have allocated the memory we need to copy our generated data from host memory to the newly allocated device memory. We do this at section 9 by calling cudaMemcpy(). When calling cudaMemcpy() you must pass in the destination address, the source address, the size of the data to be copied, and an enum indicating the direction of the copy. In section 10 we allocate memory on the device for our result, and in section 11 we copy our results back from device memory to host memory. Notice that the enums used in sections 9 and 11 differ. Last but not least, we need to free up all of the device memory we have just allocated, which is done in the additions to section 7. cudaFree() frees the memory space pointed to by the pointer passed in, which must have been returned by a previous call to cudaMalloc().
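
One thing the listings leave out is error checking. Every CUDA runtime call returns a cudaError_t, and in a real program you would probably want to inspect it. The fragment below is just a sketch of that idea for the first allocation and copy; it is not part of the listings.

    // Sketch: checking the return values of the CUDA runtime calls.
    cudaError_t err = cudaMalloc((void**) &d_A, mem_size_A);
    if (err != cudaSuccess)
    {
        printf("cudaMalloc d_A failed: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
    {
        printf("cudaMemcpy to d_A failed: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }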

Well, we are just about done now. The only piece missing is the launch of the kernel itself. Perhaps we should code up the kernel before we try to launch it.

Kernel (Listing 1)


#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_

#include <stdio.h>

// Thread block size
#define BLOCK_SIZE 3

#define WA 3   // Matrix A width
#define HA 3   // Matrix A height
#define WB 3   // Matrix B width
#define HB WA  // Matrix B height
#define WC WB  // Matrix C width
#define HC HA  // Matrix C height

// CUDA Kernel
__global__ void
matrixMul(float* C, float* A, float* B, int wA, int wB)
{
    // 2D Thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // value stores the element that is
    // computed by the thread
    float value = 0;
    for (int i = 0; i < wA; ++i)
    {
        float elementA = A[ty * wA + i];
        float elementB = B[i * wB + tx];
        value += elementA * elementB;
    }

    // Write the result to device memory; each thread writes one
    // element. C has width wB (which happens to equal wA in this
    // example), so we index C with wB.
    C[ty * wB + tx] = value;
}

#endif // #ifndef _MATRIXMUL_KERNEL_H_



As you can see above, the kernel is written in ANSI C with a few additional extensions. First, there is the CUDA-specific keyword __global__ in front of the declaration of matrixMul(). This keyword indicates that the function is a kernel; when it is called from host code it will generate a grid of threads on the device. The second extension to ANSI C is the pair of built-in variables threadIdx.x and threadIdx.y, which refer to the indices of a thread inside the running kernel. Since all threads execute the same kernel code, there needs to be a mechanism that allows them to differentiate themselves and determine what part of the data they are supposed to work on.
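
To make the thread-differentiation idea concrete, here is a tiny kernel that has nothing to do with our example; each thread simply uses its indices to pick out the single element it is responsible for.

    // Illustration only: each thread scales one element of a small
    // width x width matrix, selected by its 2D thread indices.
    __global__ void scaleElement(float* M, int width, float factor)
    {
        int tx = threadIdx.x;            // column this thread owns
        int ty = threadIdx.y;            // row this thread owns
        M[ty * width + tx] *= factor;    // touch exactly one element
    }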

In general, when a kernel is launched a grid of thread blocks is created on the device, and each thread block contains many threads. As I mentioned in my CUDA Program Structure post, CUDA provides fine-grained data/thread parallelism nested within coarse-grained data parallelism. If you look at the image below, the coarse-grained parallelism is provided by the multiple thread blocks in the grid, and the fine-grained parallelism is provided by the many threads in each thread block. For purposes of illustration I have kept the number of thread blocks and threads to a minimum. Typically a kernel will launch a grid containing hundreds of thread blocks, each containing hundreds of threads.
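
When a grid contains more than one thread block, a thread combines its block index with its thread index to find its global position. Our example will not need this (it uses a single block), but the usual pattern looks something like the following, where width stands for the matrix width.

    // Inside a kernel launched with many blocks: blockIdx and blockDim
    // locate the block, threadIdx locates the thread within the block.
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row
    int index = row * width + col;                     // row-major element index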


When our matrixMul() kernel is launched, multiple threads will be created. Each thread uses its two thread indices (threadIdx.x, threadIdx.y) to identify the row of A and the column of B it will combine in the “for” loop, and to select the element of matrix C it is responsible for. In other words, it calculates its starting positions in the input matrices from its unique thread indices, iterates through the loop to compute the dot product, and then uses the same indices to determine where to write the result.
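
As a concrete trace, consider the thread with threadIdx.x = 2 and threadIdx.y = 1 in our 3 x 3 example (wA = wB = 3). Unrolling its loop, it does the following:

    // Work done by the thread with tx = 2, ty = 1:
    // the dot product of row 1 of A with column 2 of B.
    float value = A[1 * 3 + 0] * B[0 * 3 + 2]    // i = 0
                + A[1 * 3 + 1] * B[1 * 3 + 2]    // i = 1
                + A[1 * 3 + 2] * B[2 * 3 + 2];   // i = 2
    C[1 * 3 + 2] = value;                        // row 1, column 2 of C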


As you can see in the image above, when our matrixMul() kernel is launched we end up with a grid containing a single thread block. The thread block contains 9 threads, each of which performs the dot product for one row/column pair of our A and B matrices. But how does the GPU know that all we need is 1 thread block containing 9 threads? Well... we had to tell it so... that's how! Now it's time to add the last piece of code to our main program... the kernel launch code.

Main program (Listing 3)

// Multiply two matrices A * B = C

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <matrixMul_kernel.cu>

// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
    for (int i = 0; i < size; ++i)
        data[i] = rand() / (float)RAND_MAX;
}

/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////

int
main(int argc, char** argv)
{
    // set seed for rand()
    srand(2006);

    // 1. allocate host memory for matrices A and B
    unsigned int size_A = WA * HA;
    unsigned int mem_size_A = sizeof(float) * size_A;
    float* h_A = (float*) malloc(mem_size_A);

    unsigned int size_B = WB * HB;
    unsigned int mem_size_B = sizeof(float) * size_B;
    float* h_B = (float*) malloc(mem_size_B);

    // 2. initialize host memory
    randomInit(h_A, size_A);
    randomInit(h_B, size_B);

    // 3. print out A and B
    printf("\n\nMatrix A\n");
    for (int i = 0; i < size_A; i++)
    {
        printf("%f ", h_A[i]);
        if (((i + 1) % WA) == 0)
            printf("\n");
    }

    printf("\n\nMatrix B\n");
    for (int i = 0; i < size_B; i++)
    {
        printf("%f ", h_B[i]);
        if (((i + 1) % WB) == 0)
            printf("\n");
    }

    // 8. allocate device memory
    float* d_A;
    float* d_B;
    cudaMalloc((void**) &d_A, mem_size_A);
    cudaMalloc((void**) &d_B, mem_size_B);

    // 9. copy host memory to device
    cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, mem_size_B, cudaMemcpyHostToDevice);

    // 4. allocate host memory for the result C
    unsigned int size_C = WC * HC;
    unsigned int mem_size_C = sizeof(float) * size_C;
    float* h_C = (float*) malloc(mem_size_C);

    // 10. allocate device memory for the result
    float* d_C;
    cudaMalloc((void**) &d_C, mem_size_C);

    // 5. perform the calculation
    // setup execution parameters
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(WC / threads.x, HC / threads.y);

    // execute the kernel
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

    // 11. copy result from device to host
    cudaMemcpy(h_C, d_C, mem_size_C, cudaMemcpyDeviceToHost);

    // 6. print out the results
    printf("\n\nMatrix C (Results)\n");
    for (int i = 0; i < size_C; i++)
    {
        printf("%f ", h_C[i]);
        if (((i + 1) % WC) == 0)
            printf("\n");
    }
    printf("\n");

    // 7. clean up memory
    free(h_A);
    free(h_B);
    free(h_C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}



When a kernel is launched, the host code sets the grid and thread block dimensions by passing them as parameters. In section 5 of Listing 3, two structures of type dim3 are declared. dim3 is a data type defined by CUDA. The threads variable is defined as BLOCK_SIZE by BLOCK_SIZE (or 3 by 3); this is what tells the device that we need 9 threads in our thread block (a 3 by 3 two-dimensional array). The grid variable is defined as WC / threads.x (3 / 3) by HC / threads.y (3 / 3); this is what tells the device that we need only 1 thread block in our grid. The matrixMul<<< >>> line actually launches the kernel. The special syntax between the name of the kernel function and its traditional C parameters is a CUDA extension to ANSI C that lets you pass the dimensions of the grid and of the thread blocks to the device. The astute reader has probably noticed that BLOCK_SIZE, WA, WB, HA, and HB are not defined in this compilation unit (remember, I told you to ignore that earlier); the include of the kernel file at the top of Listing 3 resolves that.
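
Our matrix dimensions happen to divide evenly by BLOCK_SIZE. For sizes that do not, a common approach (not something this example needs, just a sketch) is to round the grid up and have each thread check that it falls inside the matrix before doing any work.

    // Sketch: rounding the grid up when WC and HC are not multiples
    // of BLOCK_SIZE. Each thread would then guard against reading or
    // writing outside the matrix.
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid((WC + threads.x - 1) / threads.x,
              (HC + threads.y - 1) / threads.y);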

Now put the main program listing and the kernel listing in your matrix1 directory, modify the Makefile to point to the files you just created, and type make. Assuming that it compiles without any errors (mine did), you will end up with a binary in the bin directory at the root of your 2.2 directory. Take your binary out for a test drive. You are now an expert CUDA programmer! Well... not quite. Our example was extremely trivial, and if you put in some timing code you will find that it runs slower than a CPU version of the same algorithm. On my test machine it is 670,000 times slower! What gives?!? CUDA is supposed to be lightning fast. Well... we just hammered in a screw! All of the additional overhead of copying data to and from the device, coupled with loading the kernel binary onto the device, quickly adds up. When you are dealing with a massively parallel problem that overhead is eclipsed by the gains from running thousands of parallel computations. We only ran 9 parallel computations, so you shouldn't really expect to see any performance gain.
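
If you want to try the timing experiment yourself, CUDA events are one way to measure the device-side work. The fragment below is only a rough sketch of how the kernel launch could be timed; it is not part of the listings above.

    // Rough sketch: timing the kernel launch with CUDA events.
    cudaEvent_t start, stop;
    float elapsed_ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&elapsed_ms, start, stop);
    printf("Kernel time: %f ms\n", elapsed_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);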

You now know enough to be dangerous... if you want to learn enough to be deadly check out Matrix Multiplication 2.