Monday, February 28, 2011

Welcome to the World!

My baby, Bryan Zhou Jiao, finally came into the world on Feb 15, 2011. Now I truly understand how hard the ten-month-long pregnancy is, especially considering that both my wife and I have passed the prime time for having a baby.
Fortunately the whole pregnancy basically went smoothly, and both Mom and the baby are safe and healthy.
I am also a bit surprised so far. I am very afraid of a baby's crying, and I originally thought babies would cry very often. Luckily my little one is not worrisome: he only cries when he wakes up and feels hungry, which seems to be his only requirement from us.

Here are some pictures. 



More photos and videos are available at Flickr.


Saturday, February 26, 2011

GPGPU High Performance Computing using OpenCL -- Parallel Computing on Heterogeneous Platforms

The following technical discussion was co-authored with Lili Zhou, who is a medical physics resident at NY Presbyterian Hospital / Weill Cornell Medical Center.
The discussion can help you understand the major hardware differences between multi-core CPUs and many-core GPUs (GPGPUs) and the different types of parallel workloads that CPUs and GPUs are each suited for.
Here it is.


Processing of large data sets entails parallel computing in order to achieve high throughput. Traditionally high throughput was only achievable through either proprietary hardware (Application-Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or even Cell BE) or multi-node clusters.
However, as throughput-oriented multi-core processors become more pervasive, the same performance is now achievable through these commodity parallel architectures that range from multi-core CPU for low parallelism to many-core Graphics Processing Unit (GPU) for massive parallelism.
This article focuses on the major hardware differences between CPU and GPU, which in turn determine the different workloads each processor is suited for. Finally, it gives a data-partitioning example using OpenCL across CPU and GPU and compares the performance on the two platforms.

1 OpenCL – a New and Better Parallel Computing Paradigm

Because you most probably have different workloads that can be better processed using CPU than GPU or vice versa, you need to master parallel computing for both CPU and GPU.
However, parallel computing for heterogeneous processors is challenging as traditional programming paradigms for CPU and GPU are very different.
For example, you usually use POSIX Threads (Pthreads) or OpenMP for CPU.
Although GPU typically handles only graphics computations, its gigaflops or even teraflops of floating point performance and high memory bandwidth make it suited to some non-graphics applications, including data-intensive scientific and engineering computations; this falls within the scope of so-called General-Purpose Computing on Graphics Processing Units (GPGPU).
However, GPGPU programming originally required programmers to possess intimate knowledge of graphics APIs (OpenGL or DirectX) and GPU architecture, and to make their non-graphics applications look like graphics ones by mapping them onto problems that drew triangles and polygons. This placed a great constraint on non-graphics domain scientists and engineers.
This old approach was adopted by Fang et al. in their CT reconstruction algorithm.
With the advent of Unified Shaders, dedicated general-purpose GPUs such as the Tesla series from NVIDIA and the FireStream series from AMD, and high-level programming paradigms such as CUDA from NVIDIA and ATI Stream from AMD, CPU-based non-graphics applications can now directly access the tremendous performance of GPU without the above limitations.
Unfortunately all of the above programming paradigms have steep learning curves and some are vendor-specific. Worse yet, none of them may be supported by future parallel processors.
OpenCL (Open Computing Language) is an open, royalty-free standard for general-purpose parallel programming across CPU, GPGPU, and other processors such as DSPs and the Cell/B.E., giving software developers portable and efficient access to the power of these heterogeneous processing platforms.
Although OpenCL is very new, it is a better choice because of the following factors:
  • It utilizes a subset of ISO C99 with extensions for parallelism, so you are probably already familiar with the language (see the kernel sketch after this list);
  • You don't need to learn anything else in order to take advantage of the parallel structures of current and even future parallel processors;
  • You just need to analyze your algorithms for data or task decomposition, without having to know the underlying complex multi-threaded programming mechanisms, not to mention any graphics API or GPU internals.
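To give a quick taste of that C-based kernel language, here is a minimal vector-addition kernel. It is a generic illustration only, not part of any program discussed later:

__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *result)
{
    // Each work item handles exactly one element of the arrays.
    size_t i = get_global_id(0);
    result[i] = a[i] + b[i];
}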
We will use OpenCL terminologies in the following discussion. If you are not familiar with them, you are encouraged to read the first three chapters in the OpenCL specification documentation.

2 Major Differences between CPU and GPGPU in Parallel Computing

You need to know the major hardware differences to be able to select either CPU or GPGPU for different workloads. This discussion uses a Dell Precision T7500 Tower Workstation (it has 2 Intel E5507 quad-core processors without Hyper-Threading) and an NVIDIA Tesla C2050 for reference. You can click their hyperlinks to get introductions to their capabilities.
C2050 is based on NVIDIA's next-generation CUDA architecture codenamed "Fermi" and has 14 32-core CUDA multi-processors, one of which is shown below, courtesy of this resource:

2.1 GPGPU Supports Many More Hardware Threads than CPU

T7500 supports only 8 hardware threads (4 compute units, or cores in CPU terms, per processor * 2 processors) while C2050 supports 21,504 (14 compute units, or multi-processors in CUDA terms, * 1,536 threads each), which means C2050 is in a much better position than T7500 to process massive amounts of data in parallel.

2.2 GPGPU’s Threads Are Lightweight

By the above "hardware threads", we mean that the context or state of a processing element (one of the 4 compute units in T7500 or one of the 32 CUDA cores in a C2050 compute unit) is saved in registers allocated to each thread, so that context switching can be done in just one cycle.
If you create more than 8 threads on T7500, context switching will involve software operations, which take many more cycles and are accordingly expensive and heavyweight.
Because C2050 has 32,768 registers per compute unit, it can support as many as 21,504 in-flight hardware threads (roughly 21 registers per thread at full occupancy).

2.3 GPGPU’s SIMD and Global Memory Interface Is Much Wider than CPU’s

Single-Instruction Multiple-Data (SIMD) is suited to processing large data sets in batch mode. T7500 implements SIMD with SSE, which is 128 bits wide, while C2050 implements SIMD with Single-Instruction Multiple-Thread (SIMT), in which the smallest execution unit of 32 threads runs a single instruction, making it effectively 1,024 bits wide.
T7500 has a 192-bit global memory (system memory in CPU terms) interface through 3 parallel channels per processor, while C2050's memory interface is 384-bit through 6 channels.
This means C2050 is better suited to large data sets that possess spatial locality.

2.4 CPU Has Larger Cache and Lower Latency than GPGPU

T7500 has 32K of L1 and 256K of L2 per compute unit and a 4M unified L3, while C2050 has 16K of L1 per compute unit and a 768K unified L2, which means CPU is more forgiving of poor spatial locality than GPGPU and also has lower memory access latency.
However, GPGPU is optimized for high throughput: it hides the latency of memory accesses and other instructions through fast context switching to ready threads in the scheduling queue. This assumes, in addition to a large number of in-flight threads, that the number of arithmetic operations per memory operation is also high (if you perform only one arithmetic operation per memory load, context switching doesn't buy you much).

2.5 GPGPU Gives Developers Direct Control over Its Fast On-chip Local Memory

GPGPU’s local memory (shared memory in CUDA terms) approximately corresponds to CPU’s L1 cache with respect to latency.
This memory region is shared by all work items (parallel threads) in a work group (a thread block), and can be used to boost effective bandwidth and system scalability by reducing or eliminating redundant loads from global memory and coalescing stride accesses to global memory.
The coalescing of strided accesses is possible only when the memory accesses of all work items in a work group fall into local memory; in other words, the work group exhibits spatial locality in local memory.
C2050 has 48K of local memory per compute unit. Without using it, the redundant global memory loads from tens of thousands of threads, if they do exist, will cause too many memory bus contentions, and un-coalesced strided accesses will waste too much memory bandwidth.
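As an illustration of this direct control, the following hedged sketch stages one tile of global memory into local memory so that neighboring work items can reuse it instead of issuing redundant global loads. The kernel is generic (not part of Conebeam) and assumes a work-group size of 256:

__kernel void average3(__global const float *in, __global float *out)
{
    // Fast on-chip local memory, one tile per work group (assumes
    // the host launches work groups of exactly 256 work items).
    __local float tile[256];

    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tile[lid] = in[gid];                  // one coalesced global load per work item
    barrier(CLK_LOCAL_MEM_FENCE);         // wait until the whole tile is loaded

    // Neighbor values are read from local memory, not global memory.
    float left  = (lid > 0)   ? tile[lid - 1] : tile[lid];
    float right = (lid < 255) ? tile[lid + 1] : tile[lid];
    out[gid] = (left + tile[lid] + right) / 3.0f;
}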

2.6 Data Needs Transfer Between GPGPU and Host

T7500 connects to the C2050 through Gen2 PCIe x16 for data transfer. The bandwidth of Gen2 PCIe x16 is 8GBps, which is much lower than C2050's memory bandwidth of 144GBps.
So the number of compute operations per data element transferred should be high in your parallel algorithms; otherwise they might be better handled by CPU.
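To make the cost concrete, here is a hedged host-side sketch of the two transfers. The context, queue, kernel and buffer names are assumptions, and error checking is omitted:

/* Every byte of the input and output crosses PCIe once in each direction,
   so the kernel must do enough work per element to amortize the transfers. */
cl_mem dIn  = clCreateBuffer(context, CL_MEM_READ_ONLY,  nbytes, NULL, &status);
cl_mem dOut = clCreateBuffer(context, CL_MEM_WRITE_ONLY, nbytes, NULL, &status);

/* Host -> device over PCIe (about 8GBps on Gen2 x16) */
clEnqueueWriteBuffer(queue, dIn, CL_TRUE, 0, nbytes, hostIn, 0, NULL, NULL);

/* ... set kernel arguments and enqueue the kernel here ... */

/* Device -> host over PCIe again */
clEnqueueReadBuffer(queue, dOut, CL_TRUE, 0, nbytes, hostOut, 0, NULL, NULL);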

3 Different Workload Processing for CPU and GPGPU

A fundamental requirement for parallel computing on both CPU and GPGPU is that your data or tasks can be decomposed for concurrent processing. However, CPU and GPGPU are suited to different types of workloads.
If, on the other hand, your data or tasks exhibit little or no parallelism, you should leave them to CPU for serial processing. OpenCL allows you to parallelize parts of your CPU-based applications by putting the parallel parts into an OpenCL kernel while leaving the rest untouched.

3.1 GPGPU Is Better Suited to Process Large Data Sets

Small data sets can't justify the time incurred by GPGPU's slow data transfer over PCIe and other setup costs. Large data sets also spawn the large number of lightweight threads required to take full advantage of GPGPU's massively parallel structure and to hide its latency for memory access and other instructions through fast context switching.

3.2 GPGPU Requires the Number of Arithmetic Operations per Memory Operation to Be High

This is to hide GPGPU’s long latency for memory access and other instructions. Otherwise CPU is a better choice thanks to its larger cache.

3.3 Data Transfer on PCIe Should Be Minimized

In other words, the number of compute operations by GPGPU per data element should be high; otherwise CPU is a better choice due to the low PCIe bandwidth.

3.4 Memory Access Should Have Spatial Locality

This is more important to GPGPU than to CPU because CPU has a larger cache; without spatial locality, GPGPU's wider SIMD and memory interface only waste bandwidth and limit scalability across its compute units.
Note that you can relax this requirement a bit for GPGPU if your parallel algorithm meets requirement 3.2 above very well, because a large number of arithmetic operations does hide memory access latency. This relaxation is actually also a winning point: to use CPU's SIMD feature, you must explicitly pack your data into vectors, which requires your data to have perfect spatial locality.

4 Partition Your Data for Data Parallelism

The key to OpenCL programming is to partition your data or tasks so that they can be processed in parallel. Task parallelism requires you to vectorize your data and/or submit multiple small kernels.
Since you face data parallelism most of the time and it needs more data design consideration, this discussion will only deal with data parallelism.
Data parallelism allows you to concurrently apply the same OpenCL kernel (the function that runs on the GPU in parallel) to multiple independent work items (data elements) in an OpenCL index space (a memory object decomposed into data elements). OpenCL doesn't require a one-to-one mapping between work items and data elements.
When you design a new parallel algorithm, you should take the requirements in Section 3 into consideration. Good candidates for parallelism are FOR or WHILE loops with many iterations, and large data arrays.
One loop or one array can correspond to one dimension in an OpenCL index space, while the loop iterations and array elements map to work items. Nested loops or multi-dimensional arrays can correspond to up to three dimensions in an OpenCL index space.
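For example, a triple-nested loop can be replaced by a single kernel launch over a three-dimensional index space. The following hedged host-side sketch assumes placeholder loop bounds and an already-built kernel and queue:

/* The three loop bounds become a 3-D global work size; each former
   iteration becomes one work item. */
size_t globalWorkSize[3] = { numAngles, numBinsX, numBinsY };  /* placeholder bounds */

status = clEnqueueNDRangeKernel(queue, kernel,
                                3,               /* work_dim: one per loop level */
                                NULL,            /* no global work offset */
                                globalWorkSize,
                                NULL,            /* let the runtime choose the local size */
                                0, NULL, NULL);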
When you try to parallelize an existing single-threaded algorithm, you should also target its loops and data arrays and analyze its data access pattern to decide which processor is better suited: CPU or GPGPU.
If you are not sure about the selection, which is especially likely when your existing algorithm doesn't fit the requirements in Section 3 one way or the other (for example, the algorithm doesn't possess much spatial locality), you can first parallelize it for CPU. Then you can port it to GPGPU with slight modifications (remember that OpenCL is a general-purpose parallel programming paradigm across CPU, GPGPU and other processors) and compare the two.
This approach is what we did for our CT image reconstruction program – Conebeam.
The Conebeam runs on the Dell Precision T7500 Tower Workstation equipped with an NVIDIA Tesla C2050. It has one one-dimensional input array, one one-dimensional output array, and three nested FOR loops that result in about 100M final iterations. The three FOR loops dynamically calculate the indexes into both arrays and compute the output array values based on the input array.
There are more than 200 arithmetic operations per memory access. Only the input and output arrays need transferring to and from GPGPU, respectively.
Here is the pseudo-code of the method to parallelize on GPU (the loop bounds and index calculations are placeholders):
void someMethod(float *inBuf, float *outBuf) {
    for (int angle = 0; angle < numAngles; angle++) {
        for (int binX = 0; binX < numBinsX; binX++) {
            for (int binY = 0; binY < numBinsY; binY++) {
                // the number of final iterations here is about 100M
                int idxIn  = /* calculated from angle, binX and binY */;
                int idxOut = /* calculated from angle, binX and binY */;

                outBuf[idxOut] = /* calculated from inBuf[idxIn] */;
            }
        }
    }
}
Because the three FOR loops calculate the values of the output array independently, they can be parallelized as the three dimensions in an OpenCL index space.
Because the three FOR loops have large iterations, and the two arrays are also large, this program seems to be a good candidate for GPGPU.
However, because the index access order of the two arrays is not aligned with the iteration order of the three FOR loops, this program has very bad spatial locality and local memory can't be used. Fortunately, because it has a high number of arithmetic operations per memory access, memory access latency should be hidden to some degree if we use GPGPU.
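For illustration, here is a hedged sketch of how the GPGPU kernel might look once each FOR loop is replaced by one dimension of the index space; the index and value calculations remain pseudo-code, as in the original method:

// Each work item corresponds to one (angle, binX, binY) iteration.
__kernel void someMethod(__global const float *inBuf, __global float *outBuf)
{
    int angle = get_global_id(0);   // replaces the outer FOR loop
    int binX  = get_global_id(1);   // replaces the middle FOR loop
    int binY  = get_global_id(2);   // replaces the inner FOR loop

    int idxIn  = /* calculated from angle, binX and binY */;
    int idxOut = /* calculated from angle, binX and binY */;

    outBuf[idxOut] = /* calculated from inBuf[idxIn] */;
}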
Before we compare the two OpenCL implementations with real tests, here is some estimation:
Since T7500 has 8 compute units running at 2.2GHz while C2050 has 14 compute units running at 1.1GHz, T7500 is roughly equivalent to 16 C2050 compute units, close to the C2050's actual count of 14.
T7500 has 192-bit global memory interface through 3 parallel channels per processor, so its overall memory interface is 384-bit. C2050’s memory interface is also 384-bit through 6 channels. 
The significant differences are that (a) T7500 has just one processing element per compute unit while C2050 has 32 processing elements per compute unit; (b) T7500 has a much larger cache than C2050; and (c) T7500's memory bandwidth is 230GBps while C2050's is 144GBps (a GPU usually has higher memory bandwidth than a CPU, but this high-end E5507's effective memory speed is 4.8GHz while C2050's is only 3.0GHz).
If we put aside the cache factor and assume the CPU version doesn't use the SSE feature, the GPGPU version should be approximately 32 times faster than the CPU version (the algorithm's spatial locality is so bad that point (c) above shouldn't give the CPU version any advantage).
Here are the real testing results:
The original single-threaded version ran for about 70 minutes. The CPU version, which didn't use SSE so that it could be ported to the NVIDIA GPGPU version easily, ran for about 12 minutes. The GPGPU version ran for about 1 minute.
The CPU version didn't achieve an 8-times improvement over the single-threaded version because the Conebeam doesn't have spatial locality and the 4 cores per processor incur memory bus contentions.
The GPGPU version didn't achieve a 32-times improvement over the CPU version because, again, the Conebeam doesn't have spatial locality and CPU has a much larger cache.
In order to have linear scalability on GPGPU, the algorithm should be enhanced to exhibit spatial locality, if that is possible.

5 Further Discussion

A major challenge for GPU programming is to preserve GPU performance levels while increasing the generality and expressiveness of application interfaces.
Although OpenCL, CUDA and ATI Stream relieve programmers from learning graphics jargon, they may not let you make good use of all GPU resources. The authors are not sure whether this concern still holds true even with NVIDIA's Tesla series and AMD's FireStream series.
While OpenCL provides programming portability, it doesn't necessarily provide performance portability.

Wednesday, February 2, 2011

GPGPU High Performance Computing using OpenCL -- Work with Multiple Vendor Platforms using ICD

Sometimes you have different workloads that are better handled with different vendor OpenCL implementations.
For example, if your data exhibits access locality or coherence, the number of calculations per data element transferred over PCIe is high, and the number of calculations per memory access is also high, you can use NVIDIA's OpenCL implementation on its GPGPU.
On the other hand, if your data doesn't possess all of the characteristics mentioned above, you can use ATI/AMD's OpenCL implementation on CPU so that you can at least take advantage of all your CPU cores.

OpenCL has an extension called "cl_khr_icd", technically known as the OpenCL ICD (Installable Client Driver), which is a means of allowing multiple OpenCL implementations to co-exist and applications to select between them at runtime.

In my case, since I have both CPU-friendly and GPGPU-friendly data, I installed both NVIDIA GPU Computing SDK 3.2, which supports OpenCL 1.0, and ATI Stream SDK 2.2, which supports OpenCL 1.1, on a 64-bit Windows 7 workstation.

1. Find out whether your Vendor's SDK Supports OpenCL ICD
You can compile and run NVIDIA's oclDeviceQuery program. If you see "cl_khr_icd" in the "CL_DEVICE_EXTENSIONS" section, it means NVIDIA supports OpenCL ICD.
You can also run ATI's "clinfo" program, which prints out all OpenCL platforms. Again, if you see "cl_khr_icd" in the "Extensions" section, it means the corresponding platform supports OpenCL ICD.
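You can also query the extension string yourself with a few standard OpenCL calls. The snippet below is a hedged sketch (it needs <stdio.h>, <string.h> and <CL/cl.h>, and error checking is omitted):

/* List every installed platform and check whether its extension
   string mentions cl_khr_icd. */
cl_platform_id platforms[16];                 /* assume at most 16 platforms */
cl_uint numPlatforms = 0;

clGetPlatformIDs(16, platforms, &numPlatforms);
if (numPlatforms > 16) numPlatforms = 16;     /* stay within the fixed array */

for (cl_uint i = 0; i < numPlatforms; i++) {
    char extensions[2048];
    clGetPlatformInfo(platforms[i], CL_PLATFORM_EXTENSIONS,
                      sizeof(extensions), extensions, NULL);
    if (strstr(extensions, "cl_khr_icd") != NULL)
        printf("Platform %u supports the OpenCL ICD\n", i);
}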

2. How does ICD Work?
Although I set up all my Visual Studio projects with the NVIDIA GPU Computing SDK, I can still dynamically load ATI's CPU OpenCL implementation.
The project setup basically needs two configurations. One is to specify the OpenCL include directory, which contains vendor-neutral headers such as cl.h and cl_ext.h and some vendor-specific headers such as cl_agent_amd.h for AMD.
Unless your SDKs support different OpenCL versions, all vendor-neutral headers should be the same. If you need vendor-specific headers, of course, you have to set up your project with that vendor's SDK.

The other configuration is to specify the OpenCL library directory where a vendor-neutral stub library, "OpenCL.lib", is located. This "OpenCL.lib" is called the ICD loader and is responsible for dynamically loading a vendor-specific ICD library.
When you install an OpenCL SDK that supports ICD, the SDK registers its ICD DLL under the registry key
HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors on Windows.
The ICD loader can then scan this registry key and dynamically load your vendor's DLL.

For more information, please refer to this resource. 

3. How does ICD affect your Code?
Your application is now responsible for selecting which of the OpenCL platforms present on a system it wishes to use, instead of just requesting the system default.
This means using the clGetPlatformIDs() and clGetPlatformInfo() functions to examine the list of available OpenCL implementations and select the one that best suits your requirements.

For example, previously you could call the following function without specifying a specific platform:
context = clCreateContextFromType(
     0, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);

With ICD, you must specify a specific platform:

cl_context_properties cps[3] = {
              CL_CONTEXT_PLATFORM,
              (cl_context_properties)platform,
              0};
context = clCreateContextFromType(
     cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);
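
The platform value above comes from enumerating the installed implementations and picking one, for example by vendor name. Here is a hedged sketch (the vendor string to match is your choice; error handling omitted):

/* Pick the first platform whose vendor string contains the name we want,
   e.g. "NVIDIA" or "Advanced Micro Devices". */
cl_platform_id platforms[16];
cl_uint numPlatforms = 0;
cl_platform_id platform = NULL;

clGetPlatformIDs(16, platforms, &numPlatforms);
if (numPlatforms > 16) numPlatforms = 16;

for (cl_uint i = 0; i < numPlatforms; i++) {
    char vendor[256];
    clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR,
                      sizeof(vendor), vendor, NULL);
    if (strstr(vendor, "NVIDIA") != NULL) {   /* match the vendor you want */
        platform = platforms[i];
        break;
    }
}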