Confessions of a Speed Junkie (Code Examples): Matrix Multiplication 3 (CUDA)

In my CUDA Program Structure post I mentioned that CUDA provides three abstractions: a hierarchy of thread groups, shared memory, and thread synchronization. We have already covered the hierarchy of thread groups in Matrix Multiplication 1 and Matrix Multiplication 2. In this posting we will cover shared memory and thread synchronization.

There are 4 different types of memory: registers, shared, global, and constant (there is also a 5th type of memory known as texture memory but we aren't going to get into that). Why so many different layers of memory on a GPU? Well if you look at a CPU it will have registers, Level 1 cache, Level 2 cache, Level 3 cache, and DRAM. It has these many different layers of memory so that applications can achieve accelerated performance at minimal cost. A GPU is no different than a CPU from this perspective. On a CPU the different memory layers are abstracted away from the programmer by their high level language compiler of choice. Unfortunately on the GPU they are not abstracted away and to write high performance GPGPU applications we have to know the details of the memory hierarchy.

The different memory layers on a GPU differ in access times, size, and in access visibility. Global memory is the slowest memory but it is also the largest. On most GPUs there will be a few GBytes of global memory. Global memory is visible on both the host and the device. It can also be read and written from both host code and device code.

Registers and shared memory are the fastest memory but also the smallest. Registers and shared memory are measured in Kbytes not Gbytes. Registers and shared memory can be read and written to by device code and are not visible to host code at all.

Constant memory is faster than global but slower than registers and shared memory. It is also measure in Gbytes, however, it can only be written to in host code. Device code can read constant memory but it can not write to it.

If a variable is mapped to a register then each instance of the executing kernel (each thread) has its own copy of the variable. By default automatic variables are mapped to registers. If a variable is mapped to shared memory then each thread block has its own copy of the variable. Data residing in shared memory is accessible by all threads in the thread block but not by threads in other thread blocks. The attributes of the different CUDA memory layers are summarised below:

Memory	On Chip	Access Speed	Size	Visibility	Access Type
Register	Yes	Fast	Small	thread	D(R,W) H(None)
Shared	Yes	Fast	Small	thread block	D(R,W) H(None)
Constant	No	Slow	Small	grid	D(R) H(W)
Global	No	Slowest	Large	grid	D(R,W) H(R,W)

Now that we know a little about the CUDA memory hierarchy how can we use this information to speed up our matrix multiplication? Our first two matrix multiplication programs copied the input data into global device memory (thats what cudaMalloc( ) returns) and then the kernel read it out of global memory. From the information above we know that global device memory is the slowest memory. So we should try and limit our usage of it. By changing our kernel to copy large blocks of data from global memory into shared memory and then processing the data from shared memory instead of global we should be able to speed up our overall computation.

For example let's take matrix A x matrix B and store the results in matrix C. Assume that the dimensions of C are greater than what we can fit in a single thread block (maximum of 512 threads per thread block). We will have to break matric C down into 4 sub matrices that fit into a single thread block and we will have to break both A and B into 2 sub matrices. Since matrix multiplication is a SIMD algorithm we can do this and still arrive at the correct answer (see animation below).

So we end up breaking down the problem of A X B = C into 4 sub problems of A(1) X B(1) = C(1), A(1) X B(2) = C(2), A(2) X B(1) = C(3), and A(2) X B(2) = C(4). This isn't anything new it's the same breakdown we did for matrix multiplication 2 just expressed slightly different visualy. Now let's just focus on the first sub problem: A(1) X B(1) = C(1).

This sub problem will be solved by a single processing core on the GPU. Since the amount of shared memory on the processing core is limited we will have to break the sub problem into - you guessed it - sub problems! Let's assume that the amount of shared memory that we have available is only enough to fit half of matrix A(1) and B(1). This means that we will have to break our sub problem into 2 sub problems that fit into shared memory (see animation below).

Slide 1 shows the sub problem that we are working on (A(1) * B(1) = C(1)). In slide 2 we copy in half of each matrix from global memory into shared memory. The two smaller matrices are multiplied together and the results are stored in C(1) on slide 3. On slide 4 we copy the other half of each matrix from global to shared memory and we finish up by multiplying them together and adding the results to C(1) on slide 5. Keep in mind that all 4 of the original sub problems are being solved in parallel on different GPU cores.

So now that you understand conceptually what we need to achieve lets look at the code. Again the original main program should work fine without any modifications, however, I did pull out the printfs so your screen doesn't get so cluttered.

Main program (Listing 1)

// Multiply two matrices A * B = C
 
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <matrixMul_kernel.cu>
 
// Allocates a matrix with random float entries.
void randomInit(float* data, int size)
{
   for (int i = 0; i < size; ++i)
   data[i] = rand() / (float)RAND_MAX;
}
 
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
 
int
main(int argc, char** argv)
{

   // set seed for rand()
   srand(2006);
 
   // 1. allocate host memory for matrices A and B
   unsigned int size_A = WA * HA;
   unsigned int mem_size_A = sizeof(float) * size_A;
   float* h_A = (float*) malloc(mem_size_A);
 
   unsigned int size_B = WB * HB;
   unsigned int mem_size_B = sizeof(float) * size_B;
   float* h_B = (float*) malloc(mem_size_B);
 
   // 2. initialize host memory
   randomInit(h_A, size_A);
   randomInit(h_B, size_B);
    
   // 8. allocate device memory
   float* d_A;
   float* d_B;
   cudaMalloc((void**) &d_A, mem_size_A);
   cudaMalloc((void**) &d_B, mem_size_B);
 
   // 9. copy host memory to device
   cudaMemcpy(d_A, h_A, mem_size_A, 
   cudaMemcpyHostToDevice);
   cudaMemcpy(d_B, h_B, mem_size_B, 
   cudaMemcpyHostToDevice);
 
   // 4. allocate host memory for the result C
   unsigned int size_C = WC * HC;
   unsigned int mem_size_C = sizeof(float) * size_C;
   float* h_C = (float*) malloc(mem_size_C);
 
   // 10. allocate device memory for the result
   float* d_C;
   cudaMalloc((void**) &d_C, mem_size_C);
 
   // 5. perform the calculation
   // setup execution parameters
   dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
   dim3 grid(WC / threads.x, HC / threads.y);
 
   // execute the kernel
   matrixMul<<< grid, threads >>>(d_C, d_A, 
                                  d_B, WA, WB);
 
   // 11. copy result from device to host
   cudaMemcpy(h_C, d_C, mem_size_C, 
   cudaMemcpyDeviceToHost);
  
   // 7. clean up memory
   free(h_A);
   free(h_B);
   free(h_C);
   cudaFree(d_A);
   cudaFree(d_B);
   cudaFree(d_C);
 
}

The kernel, on the other hand, is going to be changed significantly. We will need to determine what data from matrix A and B we will be working with based on which thread block we are in the grid. We will need to copy that data from global memory to the shared memory of our processing core assigning each thread in our thread block a specific data element to copy. We then need to perform the math and write our results to global memory.

Kernel (Listing 1)

/* Matrix multiplication: C = A * B.
 * Device code.
 */
 
#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_
 
#include 
 
// Thread block size
#define BLOCK_SIZE 16
#define TILE_SIZE  16 
#define WA 1024   // Matrix A width
#define HA 1024   // Matrix A height
#define WB 1024   // Matrix B width
#define HB WA     // Matrix B height
#define WC WB     // Matrix C width
#define HC HA     // Matrix C height
 
//////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! wA is A's width and wB is B's width
//////////////////////////////////////////////////////
__global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
 
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
 
    // Index of the first sub-matrix of A processed 
    // by the block
    int aBegin = wA * BLOCK_SIZE * by;
 
    // Index of the last sub-matrix of A processed 
    // by the block
    int aEnd   = aBegin + wA - 1;
 
    // Step size used to iterate through the 
    // sub-matrices of A
    int aStep  = BLOCK_SIZE;
 
    // Index of the first sub-matrix of B processed 
    // by the block
    int bBegin = BLOCK_SIZE * bx;
 
    // Step size used to iterate through the 
    // sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;
 
    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) 
    {

        // Declaration of the shared memory array As 
        // used to store the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
 
        // Declaration of the shared memory array Bs 
        // used to store the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
 
        // Load the matrices from global memory
        // to shared memory; each thread loads
        // one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
    }
}

The kernel code above uses blockIdx to determine the start and end location of our two sub matrices. It then loops through our input matrices and has each thread in the thread block copy one cell of the A and B sub matrices into shared memory. Keep in mind that shared memory has thread block scope so each thread in the thread block can see the data copied over by the other threads.

While our kernel is runnable at this point all it is doing is copying around memory so it's not very useful yet. Kernel listing 2 below adds the logic to multiply and copy the results back into global memory.

Kernel (Listing 2)

/* Matrix multiplication: C = A * B.
 * Device code.
 */
 
#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_
 
// Thread block size
#define BLOCK_SIZE 16
#define TILE_SIZE  16 
#define WA 1024   // Matrix A width
#define HA 1024   // Matrix A height
#define WB 1024   // Matrix B width
#define HB WA     // Matrix B height
#define WC WB     // Matrix C width
#define HC HA     // Matrix C height
 
//////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! wA is A's width and wB is B's width
//////////////////////////////////////////////////////
__global__ void
matrixMul( float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
 
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
 
    // Index of the first sub-matrix of A processed 
    // by the block
    int aBegin = wA * BLOCK_SIZE * by;
 
    // Index of the last sub-matrix of A processed 
    // by the block
    int aEnd   = aBegin + wA - 1;
 
    // Step size used to iterate through the 
    // sub-matrices of A
    int aStep  = BLOCK_SIZE;
 
    // Index of the first sub-matrix of B processed 
    // by the block
    int bBegin = BLOCK_SIZE * bx;
 
    // Step size used to iterate through the 
    // sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;
 
    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) 
    {

        // Declaration of the shared memory array As 
        // used to store the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
 
        // Declaration of the shared memory array Bs 
        // used to store the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
 
        // Load the matrices from global memory
        // to shared memory; each thread loads
        // one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
 
        // Synchronize to make sure the matrices 
        // are loaded
        __syncthreads();
 
        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];
 
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
 
    }
 
    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;

}

In the new code above you see two calls being made to a __syncthreads( ) function. This function is what provides the thread synchronization abstraction. Calls to __syncthreads( ) from within the same thread block will block until all threads in the thread block reach the __syncthreads( ) call. Once all threads hit the __syncthreads( ) barrier the barrier is removed and all threads in that thread block continue on processing the next instruction.

The first call to __syncthreads( ) is made after the thread copies over its assigned cell from the A and B global memory into As and Bs shared memory but before the thread iterates over As and Bs to calculate the cell in matrix C that the thread is responsible for. We must call __syncthreads( ) to insure that all of the global data that we need to calculate our cell has actually been copied to shared memory before we iterate over it.

The second call to __syncthreads( ) is performed after we iterate over the shared memory and calculate the result for our sub problem. It is necessary to insure that all of the threads in the thread block are done iterating over the As and Bs shared memory before we go back to the top of the outermost for loop and overlay the As and Bs shared memory with the data for our next sub problem.

Once we finish all of the sub problems we store the results of the accumulation of all of the sub problems back into global memory. At this point this particular thread block is finished and the processing core is available to run the next thread block.

So let's take this version out for a test drive. Running this version yields a 48X speedup over the CPU verion. If we look back at our Matrix Multiplication 2 example we see that it was 44X faster than the CPU version. Now you may be thinking that this is an awful lot of work to just squeeze and extra 4X speedup... and perhaps it is. But let's change the size of our matrix to 2048 and see what we "see".

If we run our Matrix Multiplication 2 example,which uses only global memory, with matrices of 2048 X 2048 we will find that the GPU version is 255X faster than the CPU verion of the algorithm. Our shared memory implementation of the GPU version is 370X faster than the CPU version. Looks like we really did get some "bang for our buck" after all!

Confessions of a Speed Junkie (Code Examples)

Tuesday, September 1, 2009

Matrix Multiplication 3 (CUDA)

Google Search

C For CUDA Examples

OpenCL Examples

PyCUDA Examples