Suppose that you have a signal processing program that does some initialization, gathers some inputs, multiplies two large matrices, gathers more inputs, multiples two large matrices, and lastly multiplies the two result matrices. Let’s further assume that the runtimes for the various parts of the program are as follows:

Function | Seconds | % of Total |
---|---|---|

Initialization | 2 | 1.0 |

Gather Inputs 1 | 4 | 2.1 |

Matrix Mul 1 | 60 | 31.6 |

Gather Inputs 2 | 4 | 2.1 |

Matrix Mul 2 | 60 | 31.6 |

Matrix Mul 3 | 60 | 31.6 |

Total Time | 190 | 100 |

So if we apply Amdahl’s Law it would indicate that the best speedup we could achieve is 19X. The parts of the application that can not be parallelized take 10 seconds or 5.2% of the total runtime. So if we were to reduce the time required for the matrix multiplications to 0 the application would still take 10 seconds to run yielding a 19X speedup.

So an initial implementation might be to create a matrix multiplication kernel and to run the initialization and data gathering on the CPU and the matrix multiplication on the GPU. Let’s assume that we did that and our GPU matrix multiplication kernel is 100X faster than the CPU version. This would yield the following runtimes:

Function | Seconds CPU | Seconds GPGPU |
---|---|---|

Initialization | 2 | 2 |

Gather Inputs 1 | 4 | 4 |

Matrix Mul 1 | 60 | .6 |

Gather Inputs 2 | 4 | 4 |

Matrix Mul 2 | 60 | .6 |

Matrix Mul 3 | 60 | .6 |

Total Time | 190 | 11.8 |

Our initial refactoring of the program would yield a 16X speedup which is a bit less than what we predicted but we did not actually reduce the time of the matrix multiplication to 0 now did we. Now let’s assume that we have 2 GPUs in our system and two CPU cores. If that is the case perhaps we might want a slightly different decomposition.

Function | Seconds | Function | Seconds |
---|---|---|---|

Initialization | 2 | ||

Gather Inputs 1 | 4 | Gather Inputs 2 | 4 |

Matrix Mul 1 | .6 | Matrix Mul 2 | .6 |

Matrix Mul 3 | .6 | ||

Total Time | 7.2 |

Since we have two CPUs and two GPUs we can processes the first and second set of data in parallel. This yields a total runtime of 7.2 seconds or 26X faster. As you can clearly see the proper decomposition is critical to achieving the best (shortest) runtimes.

So how did we know to parallelize the matrix multiplication portion of this application on the GPU? Matrix multiplication consists of multiple data streams (matrix rows and columns) being processed by a single instruction stream (floating point multiplication and addition). Matrix multiplication is a poster child example of Single Instruction Multiple Data algorithms.

If you take a look at the CPU based matrix multiplication animation below you will see how the rows of one matrix are processed against the columns of the other matrix. The processing of each row / column is completely unrelated to the processing of any other row / column. By porting this portion of the application you can easily exploit the massive parallelism of the GPU.

Compare the GPU matrix multiplication animation below to the CPU matrix multiplication animation above. You can see that multiple cells of the result matrix are calculated for each given clock cycle in the GPU version. The CPU version only calculates a single cell of the result matrix at each clock cycle. While you could also create a multiple threaded CPU version that would calculate multiple result cells at the same time, you would find that limitations in the number of available CPU cores coupled with the significant overhead of CPU thread creation, scheduling and context switching would limit the performance gains to be significantly less than what you can achieve with a GPU.

Now that you have a basic understanding of how to decompose an algorithm lets take a look at the program structure for OpenCL and / or CUDA.