Monday, August 17, 2009

Parallel Decomposition

As I mentioned in my “Don’t Hammer in Screws” post, it is very important to use the right tool for the problem. When it comes to GPGPU programming, that means running the right code on the GPU. But how do you determine which code is the right code? Let’s take a look at an example.

Suppose that you have a signal processing program that does some initialization, gathers some inputs, multiplies two large matrices, gathers more inputs, multiplies two more large matrices, and lastly multiplies the two result matrices. Let’s further assume that the runtimes for the various parts of the program are as follows:

Function           Seconds   % of Total
Initialization        2         1.0
Gather Inputs 1       4         2.1
Matrix Mul 1         60        31.6
Gather Inputs 2       4         2.1
Matrix Mul 2         60        31.6
Matrix Mul 3         60        31.6
Total Time          190       100.0


If we apply Amdahl’s Law, it indicates that the best speedup we could achieve is 19X. The parts of the application that cannot be parallelized take 10 seconds, or 5.2% of the total runtime. So even if we reduced the time required for the matrix multiplications to 0, the application would still take 10 seconds to run, yielding a 19X speedup.
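In symbols, Amdahl’s Law says that if a fraction p of the runtime can be parallelized and that portion is sped up by a factor s, the overall speedup is

    S(s) = \frac{1}{(1 - p) + p/s}

Here p = 180/190 ≈ 0.947, so even as s → ∞ the overall speedup approaches 1/(1 - p) = 190/10 = 19. The same formula also predicts the payoff of a fast-but-finite kernel: for the 100X kernel assumed below, S(100) = 1/(10/190 + 180/19000) ≈ 16.1.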

So an initial implementation might be to create a matrix multiplication kernel, running the initialization and data gathering on the CPU and the matrix multiplications on the GPU. Let’s assume that we did that and that our GPU matrix multiplication kernel is 100X faster than the CPU version. This would yield the following runtimes:

Function           Seconds (CPU)   Seconds (GPGPU)
Initialization          2                2
Gather Inputs 1         4                4
Matrix Mul 1           60                0.6
Gather Inputs 2         4                4
Matrix Mul 2           60                0.6
Matrix Mul 3           60                0.6
Total Time            190               11.8


Our initial refactoring of the program yields a 16X speedup, a bit less than what we predicted, but then we did not actually reduce the time of the matrix multiplications to 0. Now let’s assume that we have two GPUs and two CPU cores in our system. In that case we might want a slightly different decomposition:

Function           Seconds   Function           Seconds
Initialization        2
Gather Inputs 1       4      Gather Inputs 2       4
Matrix Mul 1          0.6    Matrix Mul 2          0.6
Matrix Mul 3          0.6
Total Time            7.2


Since we have two CPU cores and two GPUs we can process the first and second sets of data in parallel. This yields a total runtime of 7.2 seconds, or a 26X speedup. As you can clearly see, the proper decomposition is critical to achieving the best (shortest) runtime.
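To make that decomposition concrete, here is a minimal sketch of how the dual-GPU version might be structured in CUDA. This is not the original program: gatherInputs() is a hypothetical stand-in for the real input routines, and matMulOnGpu() is an empty stub standing in for a host wrapper around the matrix multiplication kernel (sketched later in this post). The key idea is that each host thread binds to its own device with cudaSetDevice(), so the two gather-and-multiply pipelines run concurrently, and the final multiplication runs only after both results are ready.

    // Sketch of the dual-CPU / dual-GPU decomposition. gatherInputs() and
    // matMulOnGpu() are hypothetical placeholders, not the application's code.
    #include <cuda_runtime.h>
    #include <thread>
    #include <vector>

    const int N = 1024;  // assumed matrix dimension

    // Placeholder for the application's input-gathering routines.
    void gatherInputs(float* a, float* b) {
        for (int i = 0; i < N * N; ++i) { a[i] = 1.0f; b[i] = 1.0f; }
    }

    // Stub for a host wrapper that would copy a and b to the current device,
    // launch the matrix multiplication kernel, and copy the product back to c.
    void matMulOnGpu(const float* a, const float* b, float* c) { /* ... */ }

    // One pipeline: bind this host thread to a GPU, gather inputs on the CPU,
    // then multiply on the GPU. The two pipelines share no data.
    void pipeline(int device, float* a, float* b, float* c) {
        cudaSetDevice(device);
        gatherInputs(a, b);
        matMulOnGpu(a, b, c);
    }

    int main() {
        std::vector<float> a(N*N), b(N*N), c(N*N), d(N*N),
                           r1(N*N), r2(N*N), r3(N*N);

        // Initialization would run here, serially, before the fork.

        // Gather Inputs 1 + Matrix Mul 1 on GPU 0, and Gather Inputs 2 +
        // Matrix Mul 2 on GPU 1, in parallel on two CPU cores.
        std::thread t1(pipeline, 0, a.data(), b.data(), r1.data());
        std::thread t2(pipeline, 1, c.data(), d.data(), r2.data());
        t1.join();
        t2.join();

        // Matrix Mul 3 depends on both results, so it runs after the join.
        cudaSetDevice(0);
        matMulOnGpu(r1.data(), r2.data(), r3.data());
        return 0;
    }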

So how did we know to parallelize the matrix multiplication portion of this application on the GPU? Matrix multiplication consists of multiple data streams (matrix rows and columns) being processed by a single instruction stream (floating point multiplies and adds), making it a poster-child example of a Single Instruction Multiple Data (SIMD) algorithm.

If you take a look at the CPU-based matrix multiplication animation below, you will see how the rows of one matrix are processed against the columns of the other matrix. The processing of each row/column pair is completely independent of the processing of any other row/column pair. By porting this portion of the application to the GPU you can easily exploit its massive parallelism.

[Animation: CPU matrix multiplication, computing one result cell at a time]
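In code, the CPU version the animation depicts is just the familiar triple loop (a generic textbook implementation, not the application’s actual routine). Each pass of the two outer loops produces exactly one cell of the result:

    // Generic serial matrix multiply for row-major n x n matrices:
    // one result cell is produced per (row, col) iteration.
    void matMulCpu(const float* a, const float* b, float* c, int n) {
        for (int row = 0; row < n; ++row) {
            for (int col = 0; col < n; ++col) {
                float sum = 0.0f;
                // Dot product of one row of a with one column of b.
                for (int k = 0; k < n; ++k)
                    sum += a[row * n + k] * b[k * n + col];
                c[row * n + col] = sum;  // exactly one cell written per pass
            }
        }
    }

Since no (row, col) pair depends on any other, every iteration of the two outer loops could in principle run at the same time; that independence is exactly what the GPU exploits.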

Compare the GPU matrix multiplication animation below to the CPU matrix multiplication animation above. In the GPU version, multiple cells of the result matrix are calculated in each clock cycle; the CPU version calculates only a single cell of the result matrix per clock cycle. While you could also create a multithreaded CPU version that calculates multiple result cells at the same time, the limited number of available CPU cores, coupled with the significant overhead of CPU thread creation, scheduling, and context switching, would limit the performance gains to significantly less than what you can achieve with a GPU.

[Animation: GPU matrix multiplication, computing many result cells in parallel]
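Expressed as a CUDA kernel (again a naive textbook sketch, assuming row-major n x n matrices already resident on the device), the two outer loops disappear entirely: each thread computes one cell of the result, and the hardware keeps thousands of such threads in flight at once.

    // Naive CUDA matrix multiply: one thread per result cell, so many
    // cells are computed concurrently instead of one at a time.
    __global__ void matMulKernel(const float* a, const float* b,
                                 float* c, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a[row * n + k] * b[k * n + col];
            c[row * n + col] = sum;
        }
    }

    // Launch with 16x16 thread blocks; the grid covers the result matrix.
    void launchMatMul(const float* a, const float* b, float* c, int n) {
        dim3 block(16, 16);
        dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
        matMulKernel<<<grid, block>>>(a, b, c, n);
    }

In practice you would use a shared-memory tiled kernel or a tuned library routine such as cuBLAS’s SGEMM, but the naive version makes the one-thread-per-cell mapping clear.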

Now that you have a basic understanding of how to decompose an algorithm, let’s take a look at the program structure for OpenCL and/or CUDA.