Although this post overlaps with this previous one, it includes helpful diagrams and takes a more systematic approach. Here it is.
1 Multi-Core CPU and Many-Core GPGPU
CPUs began as single-threaded serial processors of general-purpose tasks such as arithmetic, logic and I/O operations. For many years, CPUs improved the performance of single-threaded applications simply by reducing latency, riding Moore's Law [1]: each new CPU raised the clock frequency, which made applications run faster. Application performance improved merely by porting to the new CPU, often without any code modification.
However, every increase in clock speed places more demands on power and cooling, and also tends to widen the gap between CPU and host memory access times. To counter this, CPU designs employed complex low-level Instruction-Level Parallelism (ILP) [2] such as instruction pipelining, out-of-order execution and speculative execution, while keeping the frequency within economically viable ranges.
Yet efforts to further exploit ILP have stalled since the late 1990s, for the following reasons:
1. ILP proves inadequate to keep the CPU from stalling on host memory accesses, because it requires sophisticated hardware design and many applications contain hard-to-predict code;
2. Many applications need to process large, independent data sets;
3. Many applications need to handle multiple concurrent tasks.
Instead, the industry shifted its focus to applying Moore's Law to putting more cores and larger caches on each processor chip. Larger caches help reduce the access-time gap between the CPU and host memory, especially for applications with spatial locality. Cores are units that execute a single stream of program instructions. A physical core that supports core-level multi-threading [3], such as Intel's HyperThreading, is equivalent to two or more logical cores, each of which supports one separate hardware thread (by hardware thread, we mean that the execution context (program counter, registers, etc.) is maintained on-chip during the entire lifetime of the thread, so that context switching between threads takes just one cycle and is virtually free; we will simply use "core" and "thread" hereafter unless otherwise noted). As you can easily reason, the more cores a system has, the greater its parallel computing capacity. This is the case for the GPGPU, which can easily support tens of thousands of threads.
However, these highly parallel processors do require us to parallelize application algorithms as much as possible in order to improve performance, as dictated by Amdahl's Law [4]. Algorithm designers employ two commonly used Thread-Level Parallelism (TLP) models [5] to do so. One is Data Parallelism (DP), which decomposes data sets into multiple independent data elements to which a single stream of instructions is applied concurrently. DP is often used at a fine-grained level and is supported by a processor's SIMD capability [6]. The other is Task Parallelism (TP), which decomposes applications into many independent tasks that can be executed concurrently across different cores. TP is often used at a coarse-grained level and is supported by a processor's MIMD capability [6]. TLP improves overall application performance through the high aggregate throughput of all threads, even though latency actually increases for individual threads.
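As a toy illustration (mine, not from any of the references above), the C sketch below contrasts the two models: the scale loop is DP, where one instruction stream sweeps over independent elements and can be mapped onto SIMD lanes, while task_a and task_b stand for independent tasks that TP would run on different cores.

```c
#include <stddef.h>

/* Data Parallelism: the same operation applied to many independent
 * elements; a vectorizing compiler can map the loop onto 4-wide SSE. */
void scale(float *x, size_t n, float a) {
    for (size_t i = 0; i < n; ++i)
        x[i] *= a;               /* iterations are independent of each other */
}

/* Task Parallelism: unrelated functions that could run concurrently,
 * e.g. one per core with one thread each.                             */
void task_a(void) { /* e.g. read and decode the next input frame */ }
void task_b(void) { /* e.g. update the user interface            */ }
```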
Thanks to the commoditization of many of these parallel processors over the last decade, they are now inexpensively available as workstations or desktop clusters. Many applications that process large data sets, such as CT reconstruction, which used to take many hours or even several days on a traditional serial CPU, can now finish in a few minutes on these parallel processors, without resorting to expensive proprietary hardware such as ASICs, FPGAs, the Cell BE or multi-node clusters.
Next, we will review two typical parallel processors, the multi-core CPU and the many-core GPU dedicated to general-purpose computing (GPGPU), and see how OpenCL provides a uniform parallel computing paradigm for both. We conducted our research on a Dell Precision T7500 Tower Workstation [7]. It features two Intel Xeon quad-core E5507 CPUs and one Nvidia Tesla C2050 GPGPU. Because both processors represent relatively recent developments in parallel computing on the CPU and the GPGPU, our discussion centers on them without loss of generality.
1.1 Multi-Core CPU
From the first 32-bit Intel Core architecture introduced in January 2006 up to the latest 64-bit Intel Xeon octo-core architecture [8], all multi-core CPUs share the same principle: it is easier to improve overall application performance with multiple throughput-oriented cores than with a single latency-oriented core. However, the CPU's primary role as a general-purpose engine means that trade-offs must be made among flow control, caches and the number of cores.
The 64-bit Xeon processors, as shown in Figure 1, represent Intel's latest developments in multi-core and multi-processor design. They target the server and workstation markets. Here are the highlights:
- Each processor supports a maximum of 4 threads (one per core); HyperThreading is not supported;
- Large on-chip L1, L2 and L3 caches (5.25 MB in total) and complex flow control (not shown in Figure 1);
- Each processor supports 3 parallel memory channels that are 192 bits wide in total;
- Each processor's memory bandwidth is 19.2 GB/s, and its peak single-precision floating-point capability is about 36 GFLOP/s (see the back-of-envelope check below);
- Each core supports 128-bit SIMD vectors through SSE; MIMD is supported at the processor level across the 4 cores;
- NUMA support: each processor accesses its locally attached host memory through its integrated memory controller much faster than it accesses the remote host memory attached to the other processor via the QuickPath Interconnect (QPI).
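For readers who want to see where those peak figures come from, here is a back-of-envelope check. It assumes a 2.26 GHz core clock, one 4-wide single-precision SSE operation per cycle per core, and DDR3-800 memory; none of these parameters are stated in this post, so treat the sketch as illustrative.

```c
#include <stdio.h>

int main(void) {
    /* Assumed E5507 parameters (not stated in the post). */
    double clock_ghz = 2.26;   /* core clock                                 */
    double cores     = 4.0;    /* cores per processor                        */
    double sse_width = 4.0;    /* single-precision floats per 128-bit SSE op */
    double channels  = 3.0;    /* memory channels per processor              */
    double bytes     = 8.0;    /* 64 bits per channel                        */
    double gts       = 0.8;    /* DDR3-800 transfer rate in GT/s             */

    printf("peak compute   ~%.1f GFLOP/s\n", cores * clock_ghz * sse_width); /* ~36.2 */
    printf("peak bandwidth ~%.1f GB/s\n",    channels * bytes * gts);        /* 19.2  */
    return 0;
}
```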
1.2 Many-Core GPGPU
From the first GPU, introduced by Nvidia in 1999 to process graphics only, up to its latest Fermi-based Tesla products dedicated to general-purpose computing [9] [10], the GPU has evolved into a massively parallel processor that can run tens of thousands of threads on its hundreds of cores.
The C2050, as shown in Figure 2, turns PCs and workstations into affordable small clusters. Here are the highlights:
- Each Streaming Multiprocessor (SM) schedules parallel threads in groups of 32, called warps (AMD's FireStream GPGPU, the C2050's equivalent, uses wavefronts for the same purpose, so we will use warps hereafter without loss of generality), onto its 32 scalar cores. In other words, each SM maintains its own single stream of kernel instructions, and each instruction is scheduled to the 32 cores on different data in lockstep. This is how Nvidia implements SIMD through its so-called SIMT [11]. The SM's SIMD path is 1024 bits wide, much wider than the E5507's. MIMD is supported by scheduling concurrent kernels across the 14 SMs;
- The 14 SMs support a maximum of 21,504 (1,536 x 14) active (resident) threads on the 448 (32 x 14) scalar cores, thanks to their large register files. These thread and core counts are much larger than the CPU's;
- Small on-chip caches (about 1 MB in total) and simple flow control (not shown in Figure 2), which means higher latency than the CPU;
- Each SM supports 48 KB of shared memory whose access time lies between that of cache and global memory. This shared memory is explicitly available to programmers through the API (caches, on the other hand, are transparent to programmers). The CPU has no such memory;
- All SMs access the off-chip global memory through a wide 384-bit interface with 6 parallel channels;
- The device connects to the host through a PCIe x16 Gen 2 slot, whose 8 GB/s bandwidth is much slower than the device's 144 GB/s memory bandwidth. The device's peak single-precision floating-point capability is about 1 TFLOP/s, far higher than the E5507's (see the back-of-envelope check below).
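Again, a back-of-envelope check of these figures. It assumes a 1.15 GHz shader clock, one fused multiply-add (2 FLOPs) per core per cycle, and GDDR5 running at an effective 3.0 GT/s; these parameters are my assumptions, not taken from this post.

```c
#include <stdio.h>

int main(void) {
    /* Assumed C2050 parameters (not stated in the post). */
    double cores      = 448.0;  /* 32 scalar cores x 14 SMs                */
    double clock_ghz  = 1.15;   /* shader clock                            */
    double flops_op   = 2.0;    /* fused multiply-add = 2 FLOPs per cycle  */
    double bus_bytes  = 48.0;   /* 384-bit memory interface = 48 bytes     */
    double gddr5_gts  = 3.0;    /* effective GDDR5 transfer rate           */
    double pcie_lanes = 16.0;   /* PCIe Gen 2: 0.5 GB/s per lane           */

    printf("peak compute  ~%.0f GFLOP/s\n", cores * clock_ghz * flops_op); /* ~1030 */
    printf("device memory ~%.0f GB/s\n",    bus_bytes * gddr5_gts);        /* 144   */
    printf("PCIe transfer ~%.0f GB/s\n",    pcie_lanes * 0.5);             /* 8     */
    return 0;
}
```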
1.3 Different Workload Processing on CPU and GPGPU
Because today's computer systems often form heterogeneous parallel computing ecosystems with CPUs, GPGPUs and/or other types of processors, as shown in Figure 3 (the T7500 ecosystem hereafter), it is important to know the type of workload that each processor handles best, so that you can partition workloads and schedule them to the appropriate processors using OpenCL. By doing so, you can tap all the computing power (serial and parallel) of your system and achieve very high system utilization.
- GPGPU is suited to processing large data sets, while the CPU is better suited to low-parallelism work (including serial operations such as disk and network I/O, and hard-to-parallelize programs). This is because small data sets cannot justify the time spent on the GPGPU's slow PCIe data transfers and other setup. Large data sets also spawn the large number of threads required to take full advantage of the GPGPU's massively parallel structure and to hide its high latency on memory accesses and other operations;
- GPGPU is suited to algorithms with high arithmetic density (the number of arithmetic operations per memory access). This also serves to hide the GPGPU's high latency: when a warp is stalled by latency, the SM swaps in a ready warp for execution, if any, through its fast context switch, to maximize GPGPU utilization. Latency can be fully hidden if the SM always has some arithmetic instruction to issue for some warp at every clock cycle during the latency period. Obviously, the more arithmetic instructions and threads there are, the more latency is hidden;
- Data transfer over PCIe should be minimized because PCIe is slow. To keep the number of operations performed on the GPGPU per transferred data element high, you sometimes have to move more parts of your program to the GPGPU, even parts that run faster on the CPU;
- GPGPU is suited to programs whose memory accesses have spatial locality. Because off-chip memory accesses have the highest latency, both the CPU and the GPGPU fetch a block of consecutive memory, such as 32 bytes, instead of just the requested single datum, such as a 4-byte integer. The GPGPU, however, further coalesces the memory accesses of a warp into as few memory transactions as possible. Better spatial locality means fewer memory transactions, higher cache hit rates and less wasted bandwidth. This matters more to the GPGPU than to the CPU because the GPGPU has smaller caches and more threads;
- GPGPU is suited to programs whose control flow logic is simple (see the kernel sketch below). This is because an SM issues one common instruction to a warp at a time. Full efficiency is achieved when all 32 threads of a warp agree on their execution path. If the threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling the threads that are not on that path; when all paths complete, the threads converge back to the same execution path. Divergence therefore leaves the SM underutilized. The CPU, by contrast, has more elaborate control flow logic and ILP mechanisms to handle branchy code.
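To make the divergence and coalescing points concrete, here is a hypothetical OpenCL C kernel pair (mine, not from this post); the function and argument names are illustrative only.

```c
/* Divergent: threads of the same warp may take different branches on a
 * data-dependent condition, so the SM executes both paths serially.    */
__kernel void divergent(__global const float *in, __global float *out) {
    int i = get_global_id(0);
    if (in[i] > 0.0f)
        out[i] = sqrt(in[i]);
    else
        out[i] = 0.0f;
}

/* Coalesced and branch-free: consecutive work-items touch consecutive
 * elements, so a warp's 32 loads collapse into few memory transactions. */
__kernel void coalesced(__global const float *in, __global float *out) {
    int i = get_global_id(0);
    out[i] = 2.0f * in[i];
}
```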
2 OpenCL – A Better Parallel Computing Paradigm
Parallel computing requires users to parallelize their algorithms using the DP and/or TP of some programming paradigm. However, parallel computing for heterogeneous processors is challenging, because the traditional programming paradigms for the CPU and the GPU are very different.
For example, Pthreads and OpenMP exist for the CPU, and both assume a single shared address space and require users to possess complex multi-threading skills. GPGPU programming originally required users to possess intimate knowledge of graphics APIs (OpenGL or DirectX) and of GPU architecture, and to make their non-graphics applications look like graphics ones by mapping them onto problems that drew triangles and polygons. This placed a great constraint on users from non-graphics domains. This old approach was adopted by Fang et al. in their CT reconstruction algorithm [12].
With the advent of Unified Shaders, of GPUs dedicated to general-purpose computing (GPGPUs) such as the Tesla series from Nvidia and the FireStream series from AMD, and of programming paradigms such as CUDA from Nvidia and APP from AMD, CPU-based non-graphics applications can now directly access the tremendous performance of the GPGPU without the above limitations.
Unfortunately, all of the above programming paradigms have very steep learning curves (it would be formidable for a non-computer domain expert to program all the parallel processors in the T7500 ecosystem using these paradigms), and some are vendor-specific. Worse yet, none of them may be supported by future parallel processors.
OpenCL [13] is an open industry standard for general-purpose parallel programming across CPUs, GPGPUs and other processors. From the software perspective, it is a framework consisting of an API for creating host programs that coordinate parallel computing across heterogeneous processors, and a subset of ISO C99 with parallelism extensions for creating kernels. OpenCL supports both DP and TP without requiring users to know complex multi-threading mechanisms. This, along with its use of the universal C language, can greatly shorten the learning curve for scientists and engineers outside the computer domain.
From the hardware perspective, OpenCL enables users to design portable, efficient and high-performance applications through its low-level, close-to-the-metal abstractions over heterogeneous processors. (Although many details of the underlying hardware are exposed, OpenCL cannot expose all of them; otherwise its portability and programmability would be compromised.)
In the following subsections, we introduce these abstractions by borrowing some figures from the OpenCL 1.0 specification [13] and mapping OpenCL concepts to the hardware of the Dell T7500 workstation. Readers should be able to work out the function of each OpenCL concept from this mapping and our earlier introduction to the CPU and GPGPU.
2.1 Platform Abstraction
Table 1 shows how the various components in Figure 4 map to the E5507 CPU and the C2050 GPGPU.
|                         | E5507 CPU              | C2050 GPGPU             |
| Compute Device          | the CPU chip           | the GPGPU chip          |
| Compute Unit (CU)       | the 8 cores            | the 14 SMs              |
| Processing Element (PE) | the 8 cores            | the 32 cores in each SM |
| Host                    | Dell T7500 workstation | Dell T7500 workstation  |
Table 1: Platform Mapping
OpenCL provides an API to query all the CUs connected to the host and to submit parallel computations to the PEs, as sketched below.
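A minimal host-side sketch of such a query (error handling omitted; the exact device list you see depends on which OpenCL platforms are installed on the machine):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint num_devices = 0;

    /* Take the first platform and list every device it exposes. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

    for (cl_uint i = 0; i < num_devices; ++i) {
        char name[256];
        cl_uint cus;
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cus), &cus, NULL);
        /* e.g. 14 CUs for the C2050; the E5507 cores appear as the CPU device's CUs */
        printf("%s: %u compute units\n", name, cus);
    }
    return 0;
}
```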
2.2 Execution Abstraction
An OpenCL program consists of two parts: kernels that execute on one or more devices, and a host program that executes on the host. OpenCL defines an index space, called an NDRange (the grid in CUDA), on the devices. An NDRange can have 1 to 3 dimensions. Each point in the index space is called a work-item. When the host program submits a kernel to a device, an instance of the kernel (a thread) executes for each work-item. At the coarse-grained level, work-items are organized into work-groups, each of which is scheduled to one CU independently. Because the C2050's smallest execution unit is a warp, the work-group size should be a multiple of 32 for best performance. At the fine-grained level, the work-items within a work-group can synchronize with each other on shared data accesses. Work-groups often work with local memory (defined in the next section) on the GPGPU to eliminate redundant and strided accesses to global memory, if your program exhibits either pattern and has spatial locality.
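From the kernel's point of view, a work-item discovers its place in the NDRange through built-in functions. The hypothetical kernel below (not from this post) maps one work-item to one element of a vector addition:

```c
/* One work-item per output element: the NDRange's global size equals
 * the vector length, so get_global_id(0) indexes the data directly.  */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    size_t i = get_global_id(0);   /* this work-item's position in the NDRange */
    c[i] = a[i] + b[i];
}
```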
DP is supported by mapping each work-item to one or more data elements. In strict DP the mapping is one-to-one; OpenCL implements a relaxed version of DP in which a strict one-to-one mapping is not required. You should determine the dimension sizes and the mapping based on your program logic and the device's capability. TP is supported by submitting multiple task kernels across CUs. Each such kernel effectively has only one work-item, which wastes 31 of the 32 cores on each SM of the C2050, so TP is usually used on the CPU, where each core executes one task kernel.
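On the host side this comes down to how the kernel is enqueued. The sketch below (illustrative names; it assumes the command queue and kernels were created earlier) submits a DP kernel over an NDRange with a work-group size that is a multiple of 32, and a TP task kernel as a single work-item via clEnqueueTask:

```c
#include <CL/cl.h>

void submit_work(cl_command_queue queue,
                 cl_kernel dp_kernel,      /* e.g. the vec_add kernel above */
                 cl_kernel task_kernel)    /* an independent task kernel    */
{
    /* DP: one work-item per data element; a work-group of 64 keeps
     * the C2050's warps fully populated (64 is a multiple of 32).   */
    size_t global_size = 1024 * 1024;
    size_t local_size  = 64;
    clEnqueueNDRangeKernel(queue, dp_kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);

    /* TP: a task kernel runs as a single work-item; reasonable on a
     * CPU core, wasteful on an SM.                                   */
    clEnqueueTask(queue, task_kernel, 0, NULL, NULL);
}
```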
This execution abstraction is scalable because DP
work-groups and TP task kernels can automatically and independently be
scheduled to any available CU as the host scales to more CUs.
2.3 Memory Abstraction
Table 2 shows how OpenCL's memory regions map to the two processors. Because the C2050 caches global memory accesses in its L1 and L2 caches, the performance of applications whose memory accesses are not predictable (or have poor spatial locality) can still be improved.
Because the on-chip local memory on the GPGPU has an access time close to that of the caches, it should be used whenever a work-group's working set fits in it and the work-group's items make redundant and/or strided accesses to global memory.
Both the CPU and the GPGPU have registers that consume zero extra clock cycles per instruction. OpenCL does not expose registers explicitly; instead, kernel compilers usually optimize automatic variables into registers. However, if there are more automatic variables than available registers, the extra ones are placed in off-chip private memory, which is known as register spilling. Register usage usually has a significant impact on performance.
Since the constant cache on the GPGPU is separate from the private and global caching in L1 and L2, it is not polluted by them. It can therefore improve performance if you place constant data there, especially when your kernel does not have good spatial locality.
|                       | E5507 CPU                           | C2050 GPGPU                     |
| Global Memory         | the off-chip host memory            | the off-chip global memory      |
| Local Memory          | implemented in host memory          | the on-chip shared memory       |
| Private Memory        | implemented in host memory          | the off-chip local memory       |
| Constant Memory       | implemented in host memory          | implemented in global memory    |
| Global Memory Cache   | implemented in L1, L2 and L3 caches | implemented in L1 and L2 caches |
| Constant Memory Cache | implemented in L1, L2 and L3 caches | the constant cache              |
Table 2: Memory Mapping
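To tie the table to kernel code, here is a hypothetical smoothing kernel (mine, not from this post) that touches three of the regions above: __global for the input and output, __constant for read-only filter coefficients, and __local as a work-group staging area that removes redundant global reads by neighbouring work-items.

```c
__kernel void smooth(__global const float *in,
                     __global float *out,
                     __constant float *coeff,   /* constant memory: 3 filter taps            */
                     __local float *tile)       /* local memory: one element per work-item   */
{
    int gid  = (int)get_global_id(0);
    int lid  = (int)get_local_id(0);
    int last = (int)get_local_size(0) - 1;

    tile[lid] = in[gid];                  /* one coalesced read from global memory */
    barrier(CLK_LOCAL_MEM_FENCE);         /* synchronize the work-group            */

    /* Neighbours now come from fast local memory instead of global memory;
     * the edges of each tile simply reuse their own value for brevity.     */
    float left  = (lid > 0)    ? tile[lid - 1] : tile[lid];
    float right = (lid < last) ? tile[lid + 1] : tile[lid];
    out[gid] = coeff[0] * left + coeff[1] * tile[lid] + coeff[2] * right;
}
```

On the host side, the __local buffer would be sized when setting the kernel arguments, for example with clSetKernelArg(kernel, 3, local_size * sizeof(float), NULL).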