Although this post overlaps with this previous one, it includes helpful diagrams and takes a more systematic approach. Here it is.
1 Multi-Core CPU and Many-Core GPGPU
CPUs began as single-threaded serial processors of general-purpose tasks such as arithmetic, logic and I/O operations. For many years, CPUs improved the performance of single-threaded applications simply by reducing latency, riding Moore's Law [1]: each new CPU raised the clock frequency, which made applications run faster. Application performance improved merely by porting to the new CPU, often without any code modification.
However, every increase in clock speed places more demands on power and cooling, and also tends to widen the gap between CPU and host memory access times. To counter this, CPU designs employed complex low-level Instruction-Level Parallelism (ILP) [2] such as instruction pipelining, out-of-order execution and speculative execution, while keeping the frequency within economically viable ranges.
Yet efforts to further exploit ILP have stalled since the late 1990s, for the following reasons:
1. ILP proves inadequate to keep the CPU from stalling on host memory accesses, because it requires sophisticated hardware design and many applications contain hard-to-predict code;
2. Many applications need to process large, independent data sets;
3. Many applications need to handle multiple concurrent tasks.
Instead, the industry shifted its focus to applying Moore's Law to putting more cores and larger caches on each processor chip. Larger caches help reduce the access-time gap between the CPU and host memory, especially for applications with spatial locality. Cores are units that execute a single stream of program instructions. A physical core that supports core-level multi-threading [3], such as Intel's HyperThreading, is equivalent to two or more logical cores, each of which supports one separate hardware thread (by hardware thread, we mean that the execution context (program counter, registers, etc.) is maintained on-chip during the entire lifetime of the thread, so that context switching between threads takes just one cycle and is virtually free; we will simply use "core" and "thread" hereafter unless otherwise noted). As you can easily reason, the more cores a system has, the greater its parallel computing capacity. This is the case for the GPGPU, which can easily support tens of thousands of threads.
However, these highly parallel processors do require us to parallelize application algorithms as much as possible in order to improve performance, as dictated by Amdahl's Law [4]. Algorithm designers employ two commonly used Thread-Level Parallelism (TLP) models [5] to do so. One is Data Parallelism (DP), which decomposes data sets into multiple independent data elements to which a single stream of instructions is applied concurrently. DP is often used at a fine-grained level and is supported by a processor's SIMD capability [6]. The other is Task Parallelism (TP), which decomposes applications into many independent tasks that can be executed concurrently across different cores. TP is often used at a coarse-grained level and is supported by a processor's MIMD capability [6]. TLP improves overall application performance through the high aggregate throughput of all threads, even though latency actually increases for individual threads.
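As a toy illustration (mine, not from any of the references above), the C sketch below contrasts the two models: the scale loop is DP, where one instruction stream sweeps over independent elements and can be mapped onto SIMD lanes, while task_a and task_b stand for independent tasks that TP would run on different cores.

```c
#include <stddef.h>

/* Data Parallelism: the same operation applied to many independent
 * elements; a vectorizing compiler can map the loop onto 4-wide SSE. */
void scale(float *x, size_t n, float a) {
    for (size_t i = 0; i < n; ++i)
        x[i] *= a;               /* iterations are independent of each other */
}

/* Task Parallelism: unrelated functions that could run concurrently,
 * e.g. one per core with one thread each.                             */
void task_a(void) { /* e.g. read and decode the next input frame */ }
void task_b(void) { /* e.g. update the user interface            */ }
```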
Thanks to the commoditization of many of these parallel processors over the last decade, they are now inexpensively available as workstations or desktop clusters. Many applications that process large data sets, such as CT reconstruction, which used to take many hours or even several days on a traditional serial CPU, can now finish in a few minutes on these parallel processors, without resorting to expensive proprietary hardware such as ASICs, FPGAs, the Cell BE or multi-node clusters.
Next, we will review two typical parallel processors, the multi-core CPU and the many-core GPU dedicated to general-purpose computing (GPGPU), and see how OpenCL provides a uniform parallel computing paradigm for both. We conducted our research on a Dell Precision T7500 Tower Workstation [7]. It features two Intel Xeon quad-core E5507 CPUs and one Nvidia Tesla C2050 GPGPU. Because both processors represent relatively recent developments in parallel computing on the CPU and the GPGPU, our discussion centers on them without loss of generality.
1.1 Multi-Core CPU
From the first 32-bit Intel Core architecture introduced in January 2006 up to the latest 64-bit Intel Xeon octo-core architecture [8], all multi-core CPUs share the same principle: it is easier to improve overall application performance with multiple throughput-oriented cores than with a single latency-oriented core. However, the CPU's primary role as a general-purpose engine means that trade-offs must be made among flow control, caches and the number of cores.
The 64-bit Xeon processors, as shown in Figure 1, represent Intel's latest developments in multi-core and multi-processor design. They target the server and workstation markets. Here are the highlights:
- Each processor supports a maximum of 4 threads (one per core); HyperThreading is not supported;
- Large on-chip L1, L2 and L3 caches (5.25 MB in total) and complex flow control (not shown in Figure 1);
- Each processor supports 3 parallel memory channels that are 192 bits wide in total;
- Each processor's memory bandwidth is 19.2 GB/s, and its peak single-precision floating-point capability is about 36 GFLOP/s (see the back-of-envelope check below);
- Each core supports 128-bit SIMD vectors through SSE; MIMD is supported at the processor level across the 4 cores;
- NUMA support: each processor accesses its locally attached host memory through its integrated memory controller much faster than it accesses the remote host memory attached to the other processor via the QuickPath Interconnect (QPI).
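For readers who want to see where those peak figures come from, here is a back-of-envelope check. It assumes a 2.26 GHz core clock, one 4-wide single-precision SSE operation per cycle per core, and DDR3-800 memory; none of these parameters are stated in this post, so treat the sketch as illustrative.

```c
#include <stdio.h>

int main(void) {
    /* Assumed E5507 parameters (not stated in the post). */
    double clock_ghz = 2.26;   /* core clock                                 */
    double cores     = 4.0;    /* cores per processor                        */
    double sse_width = 4.0;    /* single-precision floats per 128-bit SSE op */
    double channels  = 3.0;    /* memory channels per processor              */
    double bytes     = 8.0;    /* 64 bits per channel                        */
    double gts       = 0.8;    /* DDR3-800 transfer rate in GT/s             */

    printf("peak compute   ~%.1f GFLOP/s\n", cores * clock_ghz * sse_width); /* ~36.2 */
    printf("peak bandwidth ~%.1f GB/s\n",    channels * bytes * gts);        /* 19.2  */
    return 0;
}
```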
1.2 Many-Core GPGPU
From the first GPU, introduced by Nvidia in 1999 to process graphics only, up to its latest Fermi-based Tesla products dedicated to general-purpose computing [9] [10], the GPU has evolved into a massively parallel processor that can run tens of thousands of threads on its hundreds of cores.
The C2050, as shown in Figure 2, turns PCs and workstations into affordable small clusters. Here are the highlights:
- Each Streaming Multiprocessor (SM) schedules parallel threads in groups of 32, called warps (AMD's FireStream GPGPU, the C2050's equivalent, uses wavefronts for the same purpose, so we will use warps hereafter without loss of generality), onto its 32 scalar cores. In other words, each SM maintains its own single stream of kernel instructions, and each instruction is scheduled to the 32 cores on different data in lockstep. This is how Nvidia implements SIMD through its so-called SIMT [11]. The SM's SIMD path is 1024 bits wide, much wider than the E5507's. MIMD is supported by scheduling concurrent kernels across the 14 SMs;
- The 14 SMs support a maximum of 21,504 (1,536 x 14) active (resident) threads on the 448 (32 x 14) scalar cores, thanks to their large register files. These thread and core counts are much larger than the CPU's;
- Small on-chip caches (about 1 MB in total) and simple flow control (not shown in Figure 2), which means higher latency than the CPU;
- Each SM supports 48 KB of shared memory whose access time lies between that of cache and global memory. This shared memory is explicitly available to programmers through the API (caches, on the other hand, are transparent to programmers). The CPU has no such memory;
- All SMs access the off-chip global memory through a wide 384-bit interface with 6 parallel channels;
- The device connects to the host through a PCIe x16 Gen 2 slot, whose 8 GB/s bandwidth is much slower than the device's 144 GB/s memory bandwidth. The device's peak single-precision floating-point capability is about 1 TFLOP/s, far higher than the E5507's (see the back-of-envelope check below).
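Again, a back-of-envelope check of these figures. It assumes a 1.15 GHz shader clock, one fused multiply-add (2 FLOPs) per core per cycle, and GDDR5 running at an effective 3.0 GT/s; these parameters are my assumptions, not taken from this post.

```c
#include <stdio.h>

int main(void) {
    /* Assumed C2050 parameters (not stated in the post). */
    double cores      = 448.0;  /* 32 scalar cores x 14 SMs                */
    double clock_ghz  = 1.15;   /* shader clock                            */
    double flops_op   = 2.0;    /* fused multiply-add = 2 FLOPs per cycle  */
    double bus_bytes  = 48.0;   /* 384-bit memory interface = 48 bytes     */
    double gddr5_gts  = 3.0;    /* effective GDDR5 transfer rate           */
    double pcie_lanes = 16.0;   /* PCIe Gen 2: 0.5 GB/s per lane           */

    printf("peak compute  ~%.0f GFLOP/s\n", cores * clock_ghz * flops_op); /* ~1030 */
    printf("device memory ~%.0f GB/s\n",    bus_bytes * gddr5_gts);        /* 144   */
    printf("PCIe transfer ~%.0f GB/s\n",    pcie_lanes * 0.5);             /* 8     */
    return 0;
}
```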
1.3 Different Workload Processing on CPU and GPGPU
Because today's computer systems often form heterogeneous parallel computing ecosystems with CPUs, GPGPUs and/or other types of processors, as shown in Figure 3 (the T7500 ecosystem hereafter), it is important to know the type of workload that each processor handles best, so that you can partition workloads and schedule them to the appropriate processors using OpenCL. By doing so, you can tap all the computing power (serial and parallel) of your system and achieve very high system utilization.
- GPGPU is suited to processing large data sets, while the CPU is better suited to low-parallelism work (including serial operations such as disk and network I/O, and hard-to-parallelize programs). This is because small data sets cannot justify the time spent on the GPGPU's slow PCIe data transfers and other setup. Large data sets also spawn the large number of threads required to take full advantage of the GPGPU's massively parallel structure and to hide its high latency on memory accesses and other operations;
- GPGPU is suited to algorithms with high arithmetic density (the number of arithmetic operations per memory access). This also serves to hide the GPGPU's high latency: when a warp is stalled by latency, the SM swaps in a ready warp for execution, if any, through its fast context switch, to maximize GPGPU utilization. Latency can be fully hidden if the SM always has some arithmetic instruction to issue for some warp at every clock cycle during the latency period. Obviously, the more arithmetic instructions and threads there are, the more latency is hidden;
- Data transfer over PCIe should be minimized because PCIe is slow. To keep the number of operations performed on the GPGPU per transferred data element high, you sometimes have to move more parts of your program to the GPGPU, even parts that run faster on the CPU;
- GPGPU is suited to programs whose memory accesses have spatial locality. Because off-chip memory accesses have the highest latency, both the CPU and the GPGPU fetch a block of consecutive memory, such as 32 bytes, instead of just the requested single datum, such as a 4-byte integer. The GPGPU, however, further coalesces the memory accesses of a warp into as few memory transactions as possible. Better spatial locality means fewer memory transactions, higher cache hit rates and less wasted bandwidth. This matters more to the GPGPU than to the CPU because the GPGPU has smaller caches and more threads;
- GPGPU is suited to programs whose control flow logic is simple (see the kernel sketch below). This is because an SM issues one common instruction to a warp at a time. Full efficiency is achieved when all 32 threads of a warp agree on their execution path. If the threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling the threads that are not on that path; when all paths complete, the threads converge back to the same execution path. Divergence therefore leaves the SM underutilized. The CPU, by contrast, has more elaborate control flow logic and ILP mechanisms to handle branchy code.
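To make the divergence and coalescing points concrete, here is a hypothetical OpenCL C kernel pair (mine, not from this post); the function and argument names are illustrative only.

```c
/* Divergent: threads of the same warp may take different branches on a
 * data-dependent condition, so the SM executes both paths serially.    */
__kernel void divergent(__global const float *in, __global float *out) {
    int i = get_global_id(0);
    if (in[i] > 0.0f)
        out[i] = sqrt(in[i]);
    else
        out[i] = 0.0f;
}

/* Coalesced and branch-free: consecutive work-items touch consecutive
 * elements, so a warp's 32 loads collapse into few memory transactions. */
__kernel void coalesced(__global const float *in, __global float *out) {
    int i = get_global_id(0);
    out[i] = 2.0f * in[i];
}
```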
2 OpenCL – A Better Parallel Computing Paradigm
Parallel computing requires users to parallelize their algorithms using the DP and/or TP of some programming paradigm. However, parallel computing for heterogeneous processors is challenging, because the traditional programming paradigms for the CPU and the GPU are very different.
For example, Pthreads and OpenMP exist for the CPU, and both assume a single shared address space and require users to possess complex multi-threading skills. GPGPU programming originally required users to possess intimate knowledge of graphics APIs (OpenGL or DirectX) and of GPU architecture, and to make their non-graphics applications look like graphics ones by mapping them onto problems that drew triangles and polygons. This placed a great constraint on users from non-graphics domains. This old approach was adopted by Fang et al. in their CT reconstruction algorithm [12].
With the advent of Unified Shaders, of GPUs dedicated to general-purpose computing (GPGPUs) such as the Tesla series from Nvidia and the FireStream series from AMD, and of programming paradigms such as CUDA from Nvidia and APP from AMD, CPU-based non-graphics applications can now directly access the tremendous performance of the GPGPU without the above limitations.
Unfortunately, all of the above programming paradigms have very steep learning curves (it would be formidable for a non-computer domain expert to program all the parallel processors in the T7500 ecosystem using these paradigms), and some are vendor-specific. Worse yet, none of them may be supported by future parallel processors.
OpenCL [13] is an open industry standard for general-purpose parallel programming across CPUs, GPGPUs and other processors. From the software perspective, it is a framework consisting of an API for creating host programs that coordinate parallel computing across heterogeneous processors, and a subset of ISO C99 with parallelism extensions for creating kernels. OpenCL supports both DP and TP without requiring users to know complex multi-threading mechanisms. This, along with its use of the universal C language, can greatly shorten the learning curve for scientists and engineers outside the computer domain.
From the hardware perspective, OpenCL enables users to design portable, efficient and high-performance applications through its low-level, close-to-the-metal abstractions over heterogeneous processors. (Although many details of the underlying hardware are exposed, OpenCL cannot expose all of them; otherwise its portability and programmability would be compromised.)
In the following subsections, we introduce these abstractions by borrowing some figures from the OpenCL 1.0 specification [13] and mapping OpenCL concepts to the hardware of the Dell T7500 workstation. Readers should be able to work out the function of each OpenCL concept from this mapping and our earlier introduction to the CPU and GPGPU.
2.1 Platform Abstraction
Table 1 shows how the various components in Figure 4 map to the E5507 CPU and the C2050 GPGPU.
|                         | E5507 CPU              | C2050 GPGPU             |
| Compute Device          | the CPU chip           | the GPGPU chip          |
| Compute Unit (CU)       | the 8 cores            | the 14 SMs              |
| Processing Element (PE) | the 8 cores            | the 32 cores in each SM |
| Host                    | Dell T7500 workstation | Dell T7500 workstation  |
Table 1: Platform Mapping
OpenCL provides an API to query all the CUs connected to the host and to submit parallel computations to the PEs, as sketched below.
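A minimal host-side sketch of such a query (error handling omitted; the exact device list you see depends on which OpenCL platforms are installed on the machine):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint num_devices = 0;

    /* Take the first platform and list every device it exposes. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

    for (cl_uint i = 0; i < num_devices; ++i) {
        char name[256];
        cl_uint cus;
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cus), &cus, NULL);
        /* e.g. 14 CUs for the C2050; the E5507 cores appear as the CPU device's CUs */
        printf("%s: %u compute units\n", name, cus);
    }
    return 0;
}
```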
2.2 Execution Abstraction
An OpenCL program consists of two parts: kernels that execute on one or more devices, and a host program that executes on the host. OpenCL defines an index space, called an NDRange (the grid in CUDA), on the devices. An NDRange can have 1 to 3 dimensions. Each point in the index space is called a work-item. When the host program submits a kernel to a device, an instance of the kernel (a thread) executes for each work-item. At the coarse-grained level, work-items are organized into work-groups, each of which is scheduled to one CU independently. Because the C2050's smallest execution unit is a warp, the work-group size should be a multiple of 32 for best performance. At the fine-grained level, the work-items within a work-group can synchronize with each other on shared data accesses. Work-groups often work with local memory (defined in the next section) on the GPGPU to eliminate redundant and strided accesses to global memory, if your program exhibits either pattern and has spatial locality.
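From the kernel's point of view, a work-item discovers its place in the NDRange through built-in functions. The hypothetical kernel below (not from this post) maps one work-item to one element of a vector addition:

```c
/* One work-item per output element: the NDRange's global size equals
 * the vector length, so get_global_id(0) indexes the data directly.  */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    size_t i = get_global_id(0);   /* this work-item's position in the NDRange */
    c[i] = a[i] + b[i];
}
```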
DP is supported by mapping each work-item to one or more data elements. In strict DP the mapping is one-to-one; OpenCL implements a relaxed version of DP in which a strict one-to-one mapping is not required. You should determine the dimension sizes and the mapping based on your program logic and the device's capability. TP is supported by submitting multiple task kernels across CUs. Each such kernel effectively has only one work-item, which wastes 31 of the 32 cores on each SM of the C2050, so TP is usually used on the CPU, where each core executes one task kernel.
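On the host side this comes down to how the kernel is enqueued. The sketch below (illustrative names; it assumes the command queue and kernels were created earlier) submits a DP kernel over an NDRange with a work-group size that is a multiple of 32, and a TP task kernel as a single work-item via clEnqueueTask:

```c
#include <CL/cl.h>

void submit_work(cl_command_queue queue,
                 cl_kernel dp_kernel,      /* e.g. the vec_add kernel above */
                 cl_kernel task_kernel)    /* an independent task kernel    */
{
    /* DP: one work-item per data element; a work-group of 64 keeps
     * the C2050's warps fully populated (64 is a multiple of 32).   */
    size_t global_size = 1024 * 1024;
    size_t local_size  = 64;
    clEnqueueNDRangeKernel(queue, dp_kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);

    /* TP: a task kernel runs as a single work-item; reasonable on a
     * CPU core, wasteful on an SM.                                   */
    clEnqueueTask(queue, task_kernel, 0, NULL, NULL);
}
```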
This execution abstraction is scalable because DP
work-groups and TP task kernels can automatically and independently be
scheduled to any available CU as the host scales to more CUs.
2.3 Memory Abstraction
Table 2 shows how OpenCL's memory regions map to the two processors. Because the C2050 caches global memory accesses in its L1 and L2 caches, the performance of applications whose memory accesses are not predictable (or have poor spatial locality) can still be improved.
Because the on-chip local memory on the GPGPU has an access time close to that of the caches, it should be used whenever a work-group's working set fits in it and the work-group's items make redundant and/or strided accesses to global memory.
Both the CPU and the GPGPU have registers that consume zero extra clock cycles per instruction. OpenCL does not expose registers explicitly; instead, kernel compilers usually optimize automatic variables into registers. However, if there are more automatic variables than available registers, the extra ones are placed in off-chip private memory, which is known as register spilling. Register usage usually has a significant impact on performance.
Since the constant cache on the GPGPU is separate from the private and global caching in L1 and L2, it is not polluted by them. It can therefore improve performance if you place constant data there, especially when your kernel does not have good spatial locality.
|                       | E5507 CPU                           | C2050 GPGPU                     |
| Global Memory         | the off-chip host memory            | the off-chip global memory      |
| Local Memory          | implemented in host memory          | the on-chip shared memory       |
| Private Memory        | implemented in host memory          | the off-chip local memory       |
| Constant Memory       | implemented in host memory          | implemented in global memory    |
| Global Memory Cache   | implemented in L1, L2 and L3 caches | implemented in L1 and L2 caches |
| Constant Memory Cache | implemented in L1, L2 and L3 caches | the constant cache              |
Table 2: Memory Mapping
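To tie the table to kernel code, here is a hypothetical smoothing kernel (mine, not from this post) that touches three of the regions above: __global for the input and output, __constant for read-only filter coefficients, and __local as a work-group staging area that removes redundant global reads by neighbouring work-items.

```c
__kernel void smooth(__global const float *in,
                     __global float *out,
                     __constant float *coeff,   /* constant memory: 3 filter taps            */
                     __local float *tile)       /* local memory: one element per work-item   */
{
    int gid  = (int)get_global_id(0);
    int lid  = (int)get_local_id(0);
    int last = (int)get_local_size(0) - 1;

    tile[lid] = in[gid];                  /* one coalesced read from global memory */
    barrier(CLK_LOCAL_MEM_FENCE);         /* synchronize the work-group            */

    /* Neighbours now come from fast local memory instead of global memory;
     * the edges of each tile simply reuse their own value for brevity.     */
    float left  = (lid > 0)    ? tile[lid - 1] : tile[lid];
    float right = (lid < last) ? tile[lid + 1] : tile[lid];
    out[gid] = coeff[0] * left + coeff[1] * tile[lid] + coeff[2] * right;
}
```

On the host side, the __local buffer would be sized when setting the kernel arguments, for example with clSetKernelArg(kernel, 3, local_size * sizeof(float), NULL).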