Looking at the animation above you will see 2 matrices being multiplied together. In step 2 we divide the result matrix into 16 tiles each containing a 3 by 3 sub matrix. Since matrix multiplication is a SIMD (Single Instruction Multiple Data) algorithm, each tile can be processed concurrently without impacting the accuracy of the result. In step 3 a kernel is launched creating a grid with 16 thread blocks each containing 9 threads. Each tile is allocated to a thread block and each thread block has 9 threads (one for each element of the sub matrix). The computations for all cells of the result matrix occur in parallel at step 3. Whereas our animation shows the subdivision of a 12 by 12 result matrix into sixteen 3 by 3 tiles we will need to divide our 1024 by 1024 result matrix into 64 16 by 16 tiles as shown in the image below.

In the figure above you will see that we will be using a 64 by 64 grid of thread blocks and each thread block will contain 16 by 16 threads. As it turns out the main program that we used in Matrix Multiplication 1 has everything we need to solve this problem we need only change our kernel slightly.

**Main program (Listing 1)**

// Multiply two matrices A * B = C

#include <stdlib.h>

#include <stdio.h>

#include <math.h>

#include <matrixMul_kernel.cu>

// Allocates a matrix with random float entries.

void randomInit(float* data, int size)

{

for (int i = 0; i < size; ++i)

data[i] = rand() / (float)RAND_MAX;

}

/////////////////////////////////////////////////////////

// Program main

/////////////////////////////////////////////////////////

int

main(int argc, char** argv)

{

// set seed for rand()

srand(2006);

// 1. allocate host memory for matrices A and B

unsigned int size_A = WA * HA;

unsigned int mem_size_A = sizeof(float) * size_A;

float* h_A = (float*) malloc(mem_size_A);

unsigned int size_B = WB * HB;

unsigned int mem_size_B = sizeof(float) * size_B;

float* h_B = (float*) malloc(mem_size_B);

// 2. initialize host memory

randomInit(h_A, size_A);

randomInit(h_B, size_B);

// 3. print out A and B

printf("\n\nMatrix A\n");

for(int i = 0; i < size_A; i++)

{

printf("%f ", h_A[i]);

if(((i + 1) % WA) == 0)

printf("\n");

}

printf("\n\nMatrix B\n");

for(int i = 0; i < size_B; i++)

{

printf("%f ", h_B[i]);

if(((i + 1) % WB) == 0)

printf("\n");

}

// 8. allocate device memory

float* d_A;

float* d_B;

cudaMalloc((void**) &d_A, mem_size_A);

cudaMalloc((void**) &d_B, mem_size_B);

// 9. copy host memory to device

cudaMemcpy(d_A, h_A, mem_size_A,

cudaMemcpyHostToDevice);

cudaMemcpy(d_B, h_B, mem_size_B,

cudaMemcpyHostToDevice);

// 4. allocate host memory for the result C

unsigned int size_C = WC * HC;

unsigned int mem_size_C = sizeof(float) * size_C;

float* h_C = (float*) malloc(mem_size_C);

// 10. allocate device memory for the result

float* d_C;

cudaMalloc((void**) &d_C, mem_size_C);

// 5. perform the calculation

// setup execution parameters

dim3 threads(BLOCK_SIZE, BLOCK_SIZE);

dim3 grid(WC / threads.x, HC / threads.y);

// execute the kernel

matrixMul<<< grid, threads >>>(d_C, d_A,

d_B, WA, WB);

// 11. copy result from device to host

cudaMemcpy(h_C, d_C, mem_size_C,

cudaMemcpyDeviceToHost);

// 6. print out the results

printf("\n\nMatrix C (Results)\n");

for(int i = 0; i < size_C; i++)

{

printf("%f ", h_C[i]);

if(((i + 1) % WC) == 0)

printf("\n");

}

printf("\n");

// 7. clean up memory

free(h_A);

free(h_B);

free(h_C);

cudaFree(d_A);

cudaFree(d_B);

cudaFree(d_C);

}

As we learned in Matrix Multiplication 1 when a CUDA kernel is launched a grid is created containing multiple thread blocks. Each thread block contains multiple threads. The number of thread blocks in a grid and the number of threads in a thread block is controlled by the programmer passing in variables to the kernel launch command. We also have learned that CUDA extends the C language to allow the programmer to determine which thread his kernel is executing on (threadIdx.x, threadIdx.y). Well as it turns out CUDA also extends the language to allow the programmer to determine what thread block his thread is in (blockIdx.x, blockIdx.y). To be able to multiply large matrices together we need only make a slight change to our kernel to take the block index into consideration when determining what data a specific running kernel instance should access. By doing this we will be able to determine which tile we are in as well as which thread we are processing the data with.

**Kernel (Listing 1)**

#ifndef _MATRIXMUL_KERNEL_H_

#define _MATRIXMUL_KERNEL_H_

#include <stdio.h>

// Thread block size

#define BLOCK_SIZE16

#define TILE_SIZE16

#define WA1024// Matrix A width

#define HA1024// Matrix A height

#define WB1024// Matrix B width

#define HB WA // Matrix B height

#define WC WB // Matrix C width

#define HC HA // Matrix C height

// CUDA Kernel

__global__ void

matrixMul( float* C, float* A, float* B, int wA, int wB)

{

// 1. 2D Thread ID

int tx =blockIdx.x * TILE_SIZE + threadIdx.x;

int ty =blockIdx.y * TILE_SIZE + threadIdx.y;

// value stores the element that is

// computed by the thread

float value = 0;

for (int i = 0; i < wA; ++i)

{

float elementA =A[ty * wA + i];

float elementB =B[i * wB + tx];

value += elementA * elementB;

}

// Write the matrix to device memory each

// thread writes one element

C[ty * wA + tx]= value;

}

#endif // #ifndef _MATRIXMUL_KERNEL_H_

The code in section 1 has been changed to take the index of the block (and the size of the tile) into consideration. While we didn't actually change the code for pulling out elementA and elementB from our input matrices or for putting the results back to the result matrix, since both of these are based on tx and ty, they will end up indexing to the correct location. Additionally since our main program determines the argument for the kernel launch based on the #defines in the kernel file we don't need to make any changes to the main program.

Make the changes outlined above and run your program. You should see a few thousand number spewed across the screen. If you insert timing code on this version you will find that it is 44X faster than a CPU version of the same algorithm. While our runtime is clearly moving in the right direction there is still room for optimization.

To learn how to make your code faster... Martix Multiplication 3.