Monday, August 31, 2009

Matrix Multiplication 2 (CUDA)

In Matrix Multiplication 1 we learned how to use CUDA to offload a SIMD algorithm to the GPU. We picked a trivial example just to get our feet wet. Now let’s make it a bit more complicated. Instead of multiplying 3 by 3 matrices together we are going to multiply 1024 by 1024 matrices together. You might ask… how is this more complicated? Can’t we just change the defines at the top of our kernel to be 1024 instead of 3 and recompile? Unfortunately we can not. There is a limit to the number of threads that can be in a thread block and that limit is 512. So in order to multiply 1024 by 1024 matrices together we are going to have to make changes to the way we decompose the problem. Instead of processing the entire matrix with a single thread block we are going to have multiple thread blocks each process a portion of the 1024 by 1024 matrices.

pimp your myspace
Looking at the animation above you will see 2 matrices being multiplied together. In step 2 we divide the result matrix into 16 tiles each containing a 3 by 3 sub matrix. Since matrix multiplication is a SIMD (Single Instruction Multiple Data) algorithm, each tile can be processed concurrently without impacting the accuracy of the result. In step 3 a kernel is launched creating a grid with 16 thread blocks each containing 9 threads. Each tile is allocated to a thread block and each thread block has 9 threads (one for each element of the sub matrix). The computations for all cells of the result matrix occur in parallel at step 3. Whereas our animation shows the subdivision of a 12 by 12 result matrix into sixteen 3 by 3 tiles we will need to divide our 1024 by 1024 result matrix into 64 16 by 16 tiles as shown in the image below.

In the figure above you will see that we will be using a 64 by 64 grid of thread blocks and each thread block will contain 16 by 16 threads. As it turns out the main program that we used in Matrix Multiplication 1 has everything we need to solve this problem we need only change our kernel slightly.

Main program (Listing 1)

// Multiply two matrices A * B = C

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <>

// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
for (int i = 0; i < size; ++i)
data[i] = rand() / (float)RAND_MAX;

// Program main

main(int argc, char** argv)

// set seed for rand()

// 1. allocate host memory for matrices A and B
unsigned int size_A = WA * HA;
unsigned int mem_size_A = sizeof(float) * size_A;
float* h_A = (float*) malloc(mem_size_A);

unsigned int size_B = WB * HB;
unsigned int mem_size_B = sizeof(float) * size_B;
float* h_B = (float*) malloc(mem_size_B);

// 2. initialize host memory
randomInit(h_A, size_A);
randomInit(h_B, size_B);

// 3. print out A and B
printf("\n\nMatrix A\n");
for(int i = 0; i < size_A; i++)
printf("%f ", h_A[i]);
if(((i + 1) % WA) == 0)

printf("\n\nMatrix B\n");
for(int i = 0; i < size_B; i++)
printf("%f ", h_B[i]);
if(((i + 1) % WB) == 0)

// 8. allocate device memory
float* d_A;
float* d_B;
cudaMalloc((void**) &d_A, mem_size_A);
cudaMalloc((void**) &d_B, mem_size_B);

// 9. copy host memory to device
cudaMemcpy(d_A, h_A, mem_size_A,
cudaMemcpy(d_B, h_B, mem_size_B,

// 4. allocate host memory for the result C
unsigned int size_C = WC * HC;
unsigned int mem_size_C = sizeof(float) * size_C;
float* h_C = (float*) malloc(mem_size_C);

// 10. allocate device memory for the result
float* d_C;
cudaMalloc((void**) &d_C, mem_size_C);

// 5. perform the calculation
// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(WC / threads.x, HC / threads.y);

// execute the kernel
matrixMul<<< grid, threads >>>(d_C, d_A,
d_B, WA, WB);

// 11. copy result from device to host
cudaMemcpy(h_C, d_C, mem_size_C,

// 6. print out the results
printf("\n\nMatrix C (Results)\n");
for(int i = 0; i < size_C; i++)
printf("%f ", h_C[i]);
if(((i + 1) % WC) == 0)

// 7. clean up memory


As we learned in Matrix Multiplication 1 when a CUDA kernel is launched a grid is created containing multiple thread blocks. Each thread block contains multiple threads. The number of thread blocks in a grid and the number of threads in a thread block is controlled by the programmer passing in variables to the kernel launch command. We also have learned that CUDA extends the C language to allow the programmer to determine which thread his kernel is executing on (threadIdx.x, threadIdx.y). Well as it turns out CUDA also extends the language to allow the programmer to determine what thread block his thread is in (blockIdx.x, blockIdx.y). To be able to multiply large matrices together we need only make a slight change to our kernel to take the block index into consideration when determining what data a specific running kernel instance should access. By doing this we will be able to determine which tile we are in as well as which thread we are processing the data with.

Kernel (Listing 1)


#include <stdio.h>

// Thread block size
#define BLOCK_SIZE 16
#define TILE_SIZE 16

#define WA 1024 // Matrix A width
#define HA 1024 // Matrix A height
#define WB 1024 // Matrix B width
#define HB WA // Matrix B height
#define WC WB // Matrix C width
#define HC HA // Matrix C height

// CUDA Kernel
__global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)

// 1. 2D Thread ID
int tx = blockIdx.x * TILE_SIZE + threadIdx.x;
int ty = blockIdx.y * TILE_SIZE + threadIdx.y;

// value stores the element that is
// computed by the thread
float value = 0;
for (int i = 0; i < wA; ++i)
float elementA = A[ty * wA + i];
float elementB = B[i * wB + tx];
value += elementA * elementB;

// Write the matrix to device memory each
// thread writes one element
C[ty * wA + tx] = value;

#endif // #ifndef _MATRIXMUL_KERNEL_H_

The code in section 1 has been changed to take the index of the block (and the size of the tile) into consideration. While we didn't actually change the code for pulling out elementA and elementB from our input matrices or for putting the results back to the result matrix, since both of these are based on tx and ty, they will end up indexing to the correct location. Additionally since our main program determines the argument for the kernel launch based on the #defines in the kernel file we don't need to make any changes to the main program.

Make the changes outlined above and run your program. You should see a few thousand number spewed across the screen. If you insert timing code on this version you will find that it is 44X faster than a CPU version of the same algorithm. While our runtime is clearly moving in the right direction there is still room for optimization.

To learn how to make your code faster... Martix Multiplication 3.