# VSCSE Notes on Parallel Computing, Day 2

Updated on July 11, 2012

The key concept in parallel software design is the conversion of serial work into parallel computations. A dominant method for doing this efficiently is via reduction and reverse reduction trees, which exploit the "divide and conquer" strategy to spread work across a block of threads. This is the idea behind Google's MapReduce (and Hadoop).

## Basic Requirements

A task can be parallelized only if its subtasks can be carried out in any order. The input data is divided into chunks, each chunk is assigned to a thread, and the final result is a summary of the per-thread results. This summary step is carried out through "reduction".
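The chunk-and-summarize pattern above can be sketched in plain Python. This is only an illustrative simulation (the helper names `split_into_chunks` and `parallel_summary` are my own, not from the notes): each "thread" summarizes its chunk independently, and the partial results are then combined.

```python
# Illustrative sketch: split the input into chunks, let each worker
# summarize its chunk independently (order does not matter), then
# reduce the per-worker summaries into one final answer.

def split_into_chunks(data, num_workers):
    """Divide the input into roughly equal contiguous chunks."""
    chunk_size = -(-len(data) // num_workers)  # ceiling division
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def parallel_summary(data, num_workers, summarize, reduce_op):
    # Each worker processes its own chunk; no ordering between workers.
    partials = [summarize(chunk) for chunk in split_into_chunks(data, num_workers)]
    # The final result is a reduction over the per-worker summaries.
    result = partials[0]
    for p in partials[1:]:
        result = reduce_op(result, p)
    return result

total = parallel_summary(list(range(1, 101)), num_workers=4,
                         summarize=sum, reduce_op=lambda a, b: a + b)
print(total)  # 5050, the sum of 1..100
```

A real GPU implementation would run the per-chunk summaries concurrently; the sequential loop here just models the data flow.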

## Reduction Tree

Each thread in a parallel algorithm processes its input chunk and writes output to a private store on the device. Reduction trees access and combine these private stores into the final output result. Feasible reduction operations must be associative and commutative, as well as possess a well-defined identity. For example, in summation, the identity is zero and the reduction scheme might take the form of a tournament. Other examples include min, max, and product.
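A minimal sketch of such a tournament-style reduction tree (the function name `tree_reduce` is my own). The identity matters because it lets us pad the input to a power of two without changing the result; associativity and commutativity let us pair elements in any order.

```python
# Tournament-style reduction tree. The operation must be associative
# and commutative and have a well-defined identity, so the input can
# be padded to a power of two and paired off freely.

def tree_reduce(values, op, identity):
    level = list(values)
    # Pad with the identity element until the length is a power of two.
    while len(level) & (len(level) - 1):
        level.append(identity)
    # Each round pairs adjacent elements, halving the survivors,
    # exactly like a tournament bracket.
    while len(level) > 1:
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

print(tree_reduce([3, 1, 4, 1, 5], lambda a, b: a + b, 0))   # 14 (sum, identity 0)
print(tree_reduce([3, 1, 4, 1, 5], max, float("-inf")))      # 5  (max, identity -inf)
print(tree_reduce([3, 1, 4, 1, 5], lambda a, b: a * b, 1))   # 60 (product, identity 1)
```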

## Example: Max of an Array

Suppose we have a numeric array of length N and we wish to determine its maximum element. If we perform sequential comparisons, we will carry out N-1 steps, resulting in O(N) complexity.

A parallel implementation could involve pair-wise comparison. Graphically, this method looks like a tournament bracket (think "March Madness"). Assuming we have N threads, we carry out N-1 comparisons distributed over log2(N) steps. The average parallelism, which is the work per step, is (N-1)/log2(N). Note that this is a work-efficient approach, since the total work is the same as in the sequential case (this is often not true of parallel algorithms).
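The comparison count can be checked directly. This sketch (the name `tournament_max` is my own) runs the bracket round by round and tallies the work, confirming N-1 comparisons in log2(N) rounds for a power-of-two N:

```python
# Pair-wise "tournament" max. For N leaves the bracket performs
# N-1 comparisons in log2(N) rounds, matching the sequential
# comparison count -- i.e., the approach is work-efficient.

def tournament_max(values):
    n = len(values)
    assert n and (n & (n - 1)) == 0, "N must be a power of two for a clean bracket"
    comparisons, rounds = 0, 0
    level = list(values)
    while len(level) > 1:
        # One comparison per surviving pair in this round.
        level = [max(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        comparisons += len(level)
        rounds += 1
    return level[0], comparisons, rounds

best, comparisons, rounds = tournament_max([32, 7, 2, 16, 9, 41, 5, 8])
print(best, comparisons, rounds)  # 41, N-1 = 7 comparisons, log2(8) = 3 rounds
```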

One problem with the tournament-style parallel implementation is thread divergence. As the tournament progresses, threads that "lose" individual comparisons become inactive. Under the CUDA framework, instructions are issued to an entire warp (a group of 32 threads that execute in lockstep), so a warp containing a mix of active and inactive threads wastes execution slots and takes a significant performance hit. By keeping the active threads contiguous, so that whole warps are either uniformly active or uniformly idle, divergence can be avoided (or at least minimized).

## Keynote from David Kirk

This year, NVIDIA unveiled CUDA 5.0 and the new Kepler GPU architecture with the GK110 chip. The GK110 is the largest chip they've built and is targeted at enthusiasts, professionals, and the HPC community. With over 7 billion transistors, the GK110 is capable of over 1 teraflop. This is quite a milestone. The strategy behind the new design is to increase the number of floating-point units while reducing the clock speed. This cuts the heat, which is, as always, a primary issue. They've also increased the number of registers per thread to 255 (Fermi has 63). The major innovation that I found interesting is the concept of dynamic parallelism. This means less I/O with the CPU and more autonomy for the GPU. Whereas previously the CPU controlled the launching and syncing of all kernel routines, now the GPU can launch new kernels from within CPU-initiated kernels.

Another note that brought a few smiles is the integration of the LLVM open source compiler in CUDA 5.0. This is perhaps a sign of progress from NVIDIA, who has come under a bit of criticism lately from certain leaders (gods?) of the Linux community.
