Thursday, September 30, 2010

Commercial-Off-The-Shelf Enterprise Real Time Computing (COTS ERTC) -- Part 2: Hardware requirements

Because hardware sits at the bottom of a COTS ERTC application stack, we usually focus on its throughput rather than its latency in order to reduce higher-level latency, based on our discussion in Section 1.2 of Part 1. We still want to reduce hardware jitter.
However, because hardware contributes the least to your COTS ERTC application's latency and jitter compared to the other parts of the stack, you usually put less effort into hardware when optimizing your application's jitter.

In this part, I will talk about the latency incurred by computer hardware. I will also talk about how to reduce latency and jitter using such high-performance co-processors as the GPU and FPGA.

1. Latency by Computer Hardware
1.1 Interrupt Latency
The system timer, power management, and such IO devices as the keyboard, mouse, disk and NIC use interrupts to notify the processor of the need for attention.
Software can also generate an interrupt to notify the processor of the need for a change in execution.

All modern hardware and OSes use interrupts to implement thread preemption, as shown in Figure 2.



[Figure 2: Thread Preemption Through an Interrupt]

Whenever a higher-priority interrupt arrives, the OS scheduler preempts the low-priority thread in favor of the high-priority one. This preemption takes time, i.e. the "Interrupt Response Time" in Figure 2. This white paper shows a typical 25us interrupt response time on modern hardware.
RTC needs to bound this "Interrupt Response Time", which consists of several components as shown in Figure 2. Only "Interrupt Latency" relates to hardware; all the others relate to the device driver and the OS and will be covered in the next topic.

Interrupt latency is the time required for an interrupt to propagate through the hardware from its source to the processor's interrupt pin. Because it is hardware signal-level latency, it is usually negligible compared to the other components on modern hardware platforms.

Because most interrupts come from IO devices, which usually have small, time-sensitive data buffers, interrupts have the highest scheduling priority, meaning they can interrupt any application or OS kernel thread, including your RT ones. This has two implications. On one hand, for a high-priority event such as a market data feed from an external source, having its interrupt preempt all lower-priority threads and switch to the corresponding RT thread is exactly what you want.
On the other hand, you really don't want your RT thread to be interrupted by such unrelated IO as the keyboard or mouse, yet it still will be.
However, if the interrupt latency (more precisely, the interrupt response time) is bounded, your RT application thread's behavior in both scenarios is still predictable.

As mentioned in the previous topic, COTS processors are usually designed for high throughput at the cost of high interrupt latency.
However, this "high interrupt latency" is only significant for tight HRT. For SRT, it is negligible compared to the larger latencies contributed by other parts of the stack, and it keeps shrinking thanks to the constant performance improvements in modern computer hardware.

1.2 System Management Interrupt (SMI)
Only X86 processors have SMIs. When one fires, it suspends the CPU's normal execution and switches it into System Management Mode (SMM).
Because SMI handlers run inside the BIOS firmware, they are transparent to the OS and you can't intercept them. They are the highest-priority interrupts in the system, even higher than the NMI. Because they can last for hundreds of microseconds, they can cause unacceptable jitter for RTC, especially for HRT.

However, they are not that bad if your OS supports ACPI, because ACPI takes ownership of power management away from SMM. Both Windows and Linux support ACPI (so you can configure your RT application threads with a higher priority than the OS ACPI threads).

This resource also lists other measures to mitigate the impact of SMIs.

IBM also optimized its LS21 (Model 7971) and HS21 XM (Model 7995) to reduce SMI jitter by deferring the work of non-fatal SMIs to low-priority OS threads (when a fatal SMI such as a memory or chipset error occurs, its work can't be deferred because it has to be handled immediately to fix the error).

1.3 DMA Bus Mastering
This happens on the now-pervasive PCI bus. When a device masters the bus for a DMA transfer, other devices that want to use DMA have to wait until the previous device is done, which can cause many microseconds of jitter based on this resource.

1.4 Max Performance vs Max Power Saving
ACPI can put the CPU and other devices into a low-power or even powered-off state after a period of inactivity, which causes jitter when they have to be woken up.

ACPI supports at least two extreme power schemes.
One is "Max Performance", which keeps the CPU and other devices active all the time.
The other is "Max Power Saving", which powers the CPU and other devices off aggressively.
"Max Performance" is preferred for RTC, as illustrated by the sketch below.

1.5 High Resolution Timer
The default system timer usually has a resolution of 10ms on most platforms. Although you can usually lower it to 1ms by configuring your OS, a higher-resolution periodic system tick results in too much interrupt overhead, which severely lowers throughput.

However, even 1ms is still too coarse for RTC, because RTC needs a high-resolution timer for accurate scheduling of periodic or one-time tasks and for nano-sleep functions (after all, RTC applications have time or deadline constraints). At least microsecond resolution is required for RTC to keep jitter low.
A hardware-based timer is also needed because it can generate interrupts on demand even when you need nanosecond resolution, e.g. the "dynamic tick" or even tickless implementation on Linux RT.
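
As an illustration, here is a minimal sketch of a periodic task driven by the high-resolution monotonic clock, assuming Linux with POSIX clocks (the achievable resolution and the actual wake-up jitter depend on the underlying hardware timer and the kernel configuration):

    #include <time.h>

    #define PERIOD_NS 250000L   /* 250us period, far finer than a 1ms tick */

    int main(void) {
        struct timespec next;
        clock_gettime(CLOCK_MONOTONIC, &next);
        for (int i = 0; i < 1000; i++) {
            /* advance the absolute deadline by one period */
            next.tv_nsec += PERIOD_NS;
            if (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec  += 1;
            }
            /* sleep until the absolute deadline; TIMER_ABSTIME avoids the
             * drift that accumulates with relative sleeps */
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            /* ... do the periodic RT work here ... */
        }
        return 0;
    }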

UltraSPARC systems can provide nanosecond-resolution timers, while most modern X86 systems can only provide resolutions down to roughly a microsecond.

1.6 Cache
Because processors are orders of magnitude faster than main memory (modern processors execute an instruction in well under a nanosecond, while a main memory access takes on the order of a hundred nanoseconds), caches are used to bridge the access latency gap. However, this unavoidably creates jitter.

For example, if your RTC application can fit completely into the L1 or L2 cache, its latency will be very low and predictable. Otherwise, main memory accesses will cause jitter, which is probably still acceptable for SRT or loose HRT depending on your business rules. The sketch below illustrates the gap.
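
Here is a rough, minimal sketch, assuming Linux for the clock functions, that walks the same buffer twice, once sequentially (cache-friendly) and once with a large stride (mostly cache misses). The absolute numbers depend on your CPU and cache sizes, but the gap and the run-to-run variance are the point:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (16 * 1024 * 1024)   /* 16M ints (64MB), far larger than any cache */

    /* Sum all N elements, visiting them with the given stride. */
    static double walk(int *a, size_t step) {
        struct timespec t0, t1;
        volatile long sum = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t s = 0; s < step; s++)
            for (size_t i = s; i < N; i += step)
                sum += a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }

    int main(void) {
        int *a = calloc(N, sizeof(int));
        if (!a) return 1;
        printf("sequential walk: %.1f ms\n", walk(a, 1));      /* mostly cache hits */
        printf("strided walk   : %.1f ms\n", walk(a, 4096));   /* mostly cache misses */
        free(a);
        return 0;
    }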

1.7 Instruction Level Parallelism
Modern processors use instruction-level parallelism techniques such as out-of-order execution, pipelining and superscalar execution to improve your thread's (CPU) throughput. However, this again unavoidably creates unpredictable temporal behavior for your thread's instructions.
But because the effects are at the instruction level, they are probably still acceptable for SRT or loose HRT depending on your business rules, as the sketch below suggests.
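
For a feel of how the same instruction stream can take measurably different amounts of time, here is a minimal sketch (again assuming Linux for the clock functions) that runs one data-dependent branch over random data and then over sorted data; with random data the branch predictor misses often and pipeline flushes slow the loop down:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 22)

    /* Time the same loop; only the predictability of the branch changes. */
    static double run(const unsigned char *v) {
        struct timespec t0, t1;
        volatile long sum = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < 20; r++)
            for (int i = 0; i < N; i++)
                if (v[i] < 128) sum += v[i];   /* data-dependent branch */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }

    static int cmp(const void *a, const void *b) {
        return *(const unsigned char *)a - *(const unsigned char *)b;
    }

    int main(void) {
        unsigned char *v = malloc(N);
        if (!v) return 1;
        for (int i = 0; i < N; i++) v[i] = rand() & 0xff;
        printf("random data: %.1f ms\n", run(v));   /* frequent mispredictions */
        qsort(v, N, 1, cmp);
        printf("sorted data: %.1f ms\n", run(v));   /* well-predicted branch */
        free(v);
        return 0;
    }

Depending on the optimization flags, the compiler may turn the branch into a conditional move and erase the gap, which itself illustrates the point: the timing of your code depends on micro-architectural details you don't directly control.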

1.8  Multi-Processor / Core
Because a higher processor clock rate leads to lower latency, the traditional approach to increasing processor performance was to raise the clock rate, riding Moore's Law. However, that approach eventually ran into power consumption and expensive cooling issues.
Multi-processor systems are one solution that can definitely improve throughput, but the interconnect between processors unavoidably introduces additional latency and jitter.
Multi-core processors may be an even better solution for RTC, able to achieve both high throughput and low latency thanks to the high-speed bus shared among cores and the shared memory model.

A very important application of multi-processor / multi-core systems in RTC is so-called "CPU shielding" (it also goes by a few other names such as CPU binding, interrupt binding, thread binding or fine-grained processor control). CPU shielding is implemented in the OS.

Here are some examples (a sketch follows the list).
You can bind low-priority interrupts to one CPU and dedicate another CPU to your RTC thread.
If you have multiple RTC threads that need to run simultaneously, you also need that many dedicated CPUs or cores.
In the case of Java, you need to bind at least one CPU to the concurrent GC or RTGC or the background JIT compiler so that it can truly run concurrently with your RT threads.
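
Here is a minimal sketch of the thread-binding half of CPU shielding, assuming Linux with the GNU affinity extensions (the core number is arbitrary; the complementary step of steering interrupts away from that core is done separately, e.g. through /proc/irq/<n>/smp_affinity):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to one dedicated core (core 3 here is arbitrary)
     * so that no other application thread is scheduled onto it. */
    static void *rt_work(void *arg) {
        (void)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);                       /* the shielded core */
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            fprintf(stderr, "failed to set CPU affinity\n");
        /* ... latency-sensitive work runs here, alone on core 3 ... */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, rt_work, NULL);
        pthread_join(t, NULL);
        return 0;
    }

A real RT thread would also raise its scheduling priority, which belongs to the OS discussion in the next topic.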

1.9 NUMA
The multiple CPUs mentioned in 1.8 usually have equal access to the shared main memory. Besides its benefits, this unfortunately also causes contention when several processors attempt to access the same memory.

NUMA attempts to address this problem by providing separate memory for each processor. For workloads whose data can be spread out (common for servers and similar applications), NUMA can improve performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks).
However, if the data is not confined to a single task, meaning more than one processor may require the same data, NUMA has to move data between memory banks, which causes jitter. A sketch of node-local allocation follows.
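
Here is a minimal sketch of keeping an RT thread's working set on its local memory bank, assuming Linux with libnuma installed (link with -lnuma); the buffer size is arbitrary:

    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }
        /* Allocate the buffer on the NUMA node of the CPU we are running on,
         * so every access stays on the local memory bank. */
        int node = numa_node_of_cpu(sched_getcpu());
        size_t size = 64 * 1024 * 1024;
        double *buf = numa_alloc_onnode(size, node);
        if (!buf) return 1;
        /* ... latency-sensitive work on buf, with no cross-node traffic ... */
        numa_free(buf, size);
        return 0;
    }

This pairs naturally with CPU shielding above: pin the thread to a core first, then allocate its data on that core's node.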


Both current X86 and UltraSPARC processors provide good capabilities in all of the above areas and can serve as the underlying platforms for COTS ERTC applications.

2. Co-Processors
The shift to multi-processor / multi-core systems forces application developers to adopt a parallel programming model to exploit CPU performance, which has proven to be very error-prone.
Traditionally, HPC uses clusters of COTS multi-core servers to run data- and compute-intensive applications. In order to bring such complex computations down to RTC latencies, hundreds or even thousands of COTS multi-core servers along with a high-bandwidth interconnect need to be deployed, which not only creates maintenance headaches but also consumes quite a lot of power. And even though the interconnect usually has low latency, it still adds latency.

CPU-based systems augmented with hardware accelerators as co-processors are emerging as an even better solution to the Moore's Law dilemma. This has opened up opportunities for accelerators like Graphics Processing Units (GPUs), FPGAs, and other accelerator technologies to advance HPC to previously unattainable RTC levels.


2.1 General-Purpose Computing on Graphics Processing Units (GPGPU)
Because traditional CPU design targets general-purpose workloads (both high volume and low volume; both management tasks and ALU processing, etc.), the number of cores and the SIMD vector width are both small.
Because the GPU specializes in accelerating 2D and 3D graphics rendering with its highly parallel, deeply pipelined structure, it can have hundreds of processor cores, each of which can handle hundreds of independent threads. Its SIMD vectors are much wider than a CPU's. Also, the internal interconnect among cores has much higher bandwidth than the external interconnect in traditional clusters.

GPGPU is the technique of using a GPU to perform computations in applications traditionally handled by the CPU. It was made possible by the addition of programmable pipeline stages (shaders) and higher-precision arithmetic to the rendering pipeline, which allows software developers to use stream processing on non-graphics data.

A modern GPGPU device is itself a cluster of hundreds of cores capable of handling tens of thousands of threads, which can be hundreds of times faster than a traditional cluster made of hundreds of processors.

OpenCL is a GPU programming framework that is supported by all major GPU vendors. It supports both task-level and data-level parallelism. For data-level parallelism, users only need to partition the data properly and are not responsible for managing threads, which is much less error-prone than multi-core parallel programming, as the sketch below shows.
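
Here is a minimal vector-addition sketch, assuming an OpenCL 1.x implementation and an available GPU device (error checking and resource release are omitted for brevity; link with -lOpenCL). The developer writes a per-element kernel and picks the global work size, and the runtime maps the work-items onto the GPU's threads:

    #include <stdio.h>
    #include <CL/cl.h>

    /* OpenCL C kernel: each work-item handles exactly one array element. */
    static const char *src =
        "__kernel void vadd(__global const float *a,\n"
        "                   __global const float *b,\n"
        "                   __global float *c) {\n"
        "    size_t i = get_global_id(0);\n"
        "    c[i] = a[i] + b[i];\n"
        "}\n";

    int main(void) {
        enum { N = 1024 };
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

        /* Boilerplate setup: platform, device, context, queue, program, kernel. */
        cl_platform_id plat;   clGetPlatformIDs(1, &plat, NULL);
        cl_device_id dev;      clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx       = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q   = clCreateCommandQueue(ctx, dev, 0, NULL);
        cl_program prog      = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k          = clCreateKernel(prog, "vadd", NULL);

        /* Copy the partitioned data to the device. */
        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, NULL);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, NULL);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, NULL);
        clSetKernelArg(k, 0, sizeof(da), &da);
        clSetKernelArg(k, 1, sizeof(db), &db);
        clSetKernelArg(k, 2, sizeof(dc), &dc);

        /* Launch N work-items; no explicit thread management by the developer. */
        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);

        printf("c[42] = %f\n", c[42]);   /* expect 126.0 */
        return 0;
    }

Task-level parallelism works similarly: independent kernels are enqueued on one or more command queues.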

A GPU interfaces with the computer over PCIe, which can cause latency problems for high-volume, data-intensive computing. This is why GPGPU programming recommends using many threads to hide that latency.

2.2 Field-programmable gate array (FPGA)
An FPGA is an integrated circuit designed to be configured by the customer or designer after manufacturing — hence "field-programmable".
Because it uses highly parallel hardware to implement application logic that is traditionally implemented in software on a CPU or GPU, it provides wire-speed latency, even lower than a GPU's.
The FPGA architecture provides the flexibility to create a massive array of application-specific ALUs that enable both instruction-level and data-level parallelism.
Because data flows directly between operators and can be streamed from one operator to the next, there are no inefficiencies like processor cache misses.

An FPGA interfaces with the computer over either PCIe or a processor bus such as Intel's FSB or QPI, or AMD's HyperTransport. The latter option makes the FPGA look just like another processor: it avoids cache coherency issues and also enjoys high bandwidth and low latency.

Although the C-to-FPGA compilation toolkit from Impulse enables a developer to use C instead of an HDL to design application logic, the learning curve is still steep. The developer still needs some basic hardware design knowledge as well as parallelization skills such as loop unrolling and instruction pipelining.
