VSCSE Notes on Parallel Computing, Day 1
Around 2003, computer speed using traditional single-core architecture ceased to keep pace with Moore's law. In response, multi-core processing rose to prominence. However, throughput limitations became an issue in 2006. Today, heterogeneous systems consisting of CPUs and GPUs are again advancing computational performance. I'll discuss some of the hardware definitions and go over some fundamentals.
Traditional (or homogeneous) computers rely on central processing units (CPUs). Regardless of the number of cores, CPUs are essentially designed to minimize thread latency. The total number of threads is kept small. In the case of my i7 laptop, I have a quad-core with two threads running on each core. On-chip memory (cache) is large, and is designed to increase the probability of required data being immediately available to the thread. This reduces latency resulting from calls to system memory (RAM).
Graphics processing units (GPUs) were originally intended to enhance the visual performance of computers, such as in multimedia or gaming applications. By design, they are throughput oriented and allow for many (tens of thousands of) threads. The number of registers per thread is also large, but cache is small and serves primarily for staging and buffering I/O to the CPU.
We can compare a CPU to a basic highway with four lanes. As more and more cars enter (assuming no signal on the on-ramp), traffic starts to build up and throughput decreases. However, cars enter and exit freely.
GPUs, on the other hand, are like a highway with 100 lanes. However, cars have to wait for a signal before they can enter or exit, resulting in possible latency despite absurdly high throughput. For a given cohort of cars on the highway, the slowest car defines the driving time for all other cars. To compensate for that latency, the highway is endowed with an exceedingly large number of lanes.
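The "slowest car" rule can be put into a toy model. The sketch below is only an illustration of the analogy, not a description of real GPU scheduling; the function names and all of the numbers are made up for the example.

```python
def cohort_time(car_times):
    """Driving time for one cohort: every car waits for the slowest one."""
    return max(car_times)


def throughput(cohorts):
    """Cars completed per unit time when cohorts run back to back."""
    total_cars = sum(len(cohort) for cohort in cohorts)
    total_time = sum(cohort_time(cohort) for cohort in cohorts)
    return total_cars / total_time


# Hypothetical numbers: 100 lanes, one car per lane, times in arbitrary units.
uniform = [[1.0] * 100]          # every car takes 1 unit
one_slow = [[1.0] * 99 + [4.0]]  # a single car takes 4 units

print(throughput(uniform))   # cars per unit time with no straggler
print(throughput(one_slow))  # cars per unit time with one straggler
```

In this model a single slow car in a cohort of 100 cuts throughput by the same factor as if every car were slow, which is why uneven work within a cohort is costly.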
CPU vs. CPU+GPU Performance
Based on projects funded by NSF's Petascale Computing Resource Allocations, heterogeneous systems based on CPU and GPU computing units are generally capable of 2.5 to 7 times greater performance over pure CPU implementations.
Software development for heterogeneous systems is the most significant challenge preventing broader usage of CPU-GPU technology. Specifically, developers need good libraries that scale and are portable. While this is an active area of research, the technology is still new, and the software barrier is likely keeping many researchers who would benefit from GPU-based supercomputers on the sidelines. Of primary concern is memory management between the GPU and CPU.
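One way to see why memory management between the GPU and CPU matters is a back-of-the-envelope cost model: data has to be copied to the GPU before it can be processed, and that copy time counts against any speedup. The sketch below assumes a simple "copy, then compute" model with no overlap; the function names, the bus bandwidth, and the timings are all hypothetical.

```python
def offload_time(bytes_moved, bus_bytes_per_s, gpu_compute_s):
    """Total time on the GPU path: copy the data over the bus, then compute."""
    return bytes_moved / bus_bytes_per_s + gpu_compute_s


def worth_offloading(cpu_compute_s, bytes_moved, bus_bytes_per_s, gpu_compute_s):
    """True if copying plus GPU compute beats staying on the CPU."""
    return offload_time(bytes_moved, bus_bytes_per_s, gpu_compute_s) < cpu_compute_s


# Hypothetical numbers: 8 GB of data over a bus that moves 8 GB/s (1 s of copying),
# with the GPU computing 5x faster than the CPU in both cases.
DATA_BYTES = 8e9
BUS_BYTES_PER_S = 8e9

print(worth_offloading(2.0, DATA_BYTES, BUS_BYTES_PER_S, 0.4))   # long job
print(worth_offloading(1.2, DATA_BYTES, BUS_BYTES_PER_S, 0.24))  # short job
```

With these made-up numbers the same 5x compute advantage pays off for the longer job but not for the shorter one, because the fixed copy cost dominates.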
Scalability and Portability
Scalable applications are those that continue to run efficiently on new generations of the same core, or on a larger number of the same cores they were developed on.
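The notes do not say how "efficiently" is measured. One common choice, assumed here, is parallel efficiency: the speedup over a single core divided by the number of cores. The timings in this sketch are invented for illustration.

```python
def speedup(time_one_core, time_p_cores):
    """How many times faster the run is than on a single core."""
    return time_one_core / time_p_cores


def efficiency(time_one_core, time_p_cores, p):
    """Speedup per core; 1.0 means perfect scaling."""
    return speedup(time_one_core, time_p_cores) / p


# Hypothetical wall-clock times (seconds) for the same problem on p cores.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 20.0}

for p, t in timings.items():
    print(p, round(speedup(timings[1], t), 2), round(efficiency(timings[1], t, p), 2))
```

An application whose efficiency stays close to 1.0 as cores are added would count as scalable in this sense; the falling values in the example indicate one that is starting to lose efficiency.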
Portable applications can run on different types of cores (Intel, AMD, NVIDIA) with different interfaces and system organization.
The time required to develop an application for a supercomputer is substantial. To paraphrase Professor Hwu at UIUC, the development of scalable and portable libraries may be the best legacy this generation of developers can leave for the future.
Limitations of Heterogeneous Systems
Regardless of what system you are running on, algorithmic complexity is a primary concern. Even on a supercomputer, the cost of algorithms with quadratic or n log(n) complexity can grow with problem size to the point that calculations are no longer feasible.
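A rough operation count makes the point concrete. The sketch below assumes an idealized machine that sustains 10^15 simple operations per second (a petascale rate chosen for illustration) and ignores constant factors, memory, and communication; the function name and the problem sizes are hypothetical.

```python
import math

OPS_PER_SECOND = 1e15  # assumed sustained rate, for illustration only


def seconds_needed(n, complexity):
    """Idealized run time for n items under a given growth rate."""
    operations = {
        "n": n,
        "n log n": n * math.log2(n),
        "n^2": n ** 2,
    }[complexity]
    return operations / OPS_PER_SECOND


for n in (10**6, 10**9, 10**12):
    row = [seconds_needed(n, c) for c in ("n", "n log n", "n^2")]
    print(n, ["%.3g s" % t for t in row])
```

Under these assumptions, a quadratic algorithm on 10^12 items needs about 10^9 seconds, which is decades of machine time, while the linear and n log(n) versions finish in a fraction of a second. Faster hardware shifts these numbers by a constant factor but does not change the growth rate.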