Thursday, September 30, 2010

Commercial-Off-The-Shelf Enterprise Real Time Computing (COTS ERTC) -- Part 2: Hardware requirements

Because hardware sits at the bottom of a COTS ERTC application stack, we usually focus on its throughput rather than its latency in order to reduce higher-level latency, based on the discussion in Section 1.2 of Part 1. We still want to reduce hardware jitter, but because hardware contributes the least latency and jitter compared to the other parts of the stack, you usually spend less effort on it when optimizing your application's jitter.

In this part, I will discuss the latency introduced by computer hardware, and how to reduce latency and jitter using high-performance co-processors such as GPUs and FPGAs.

1. Latency Caused by Computer Hardware
1.1 Interrupt Latency
The system timer, power management, and I/O devices such as the keyboard, mouse, disk and NIC use interrupts to notify the processor that they need attention.
Software can also generate an interrupt to notify the processor of the need for a change in execution.

All modern hardware and OSes use interrupts to implement thread preemption, as shown in Figure 2.



                                            Figure 2: Thread Preemption Through an Interrupt

Whenever a higher-priority interrupt arrives, the OS scheduler preempts the low-priority thread in favor of the high-priority thread. This preemption takes time, namely the "Interrupt Response Time" in Figure 2. This white paper shows a typical 25 µs interrupt response time on modern hardware.
RTC needs to bound this "Interrupt Response Time", which consists of several components as shown in Figure 2. Only "Interrupt Latency" relates to hardware; all the others relate to the device driver and the OS and will be covered in the next topic.

Interrupt latency is the time required for an interrupt to propagate through the hardware from its source to the processor's interrupt pin. Because it is hardware signal-level latency, it is usually negligible compared to the other components on modern hardware platforms.

Because most interrupts come from I/O devices that usually have small, time-sensitive data buffers, interrupts have the highest scheduling priority, meaning they can interrupt any application or OS kernel thread, including your RT ones. This has two implications. On one hand, for a high-priority event such as a market data feed from an external source, interrupting all lower-priority threads and switching to the corresponding RT thread is exactly what you want.
On the other hand, you really don't want your RT thread to be interrupted by unnecessary I/O such as the keyboard or mouse, yet it still will be.
However, if the interrupt latency (more precisely, the interrupt response time) is bounded, your RT application thread's behavior in both scenarios is still predictable.

As mentioned in the previous topic, COTS processors are usually designed for high throughput at the cost of high interrupt latency.
However, this "high interrupt latency" is only significant for tight HRT. For SRT it is negligible compared to the larger latencies contributed by other parts of the stack, especially given the constant performance improvements in modern computer hardware.

1.2 System Management Interrupt (SMI)
Only x86 processors have SMIs. When one occurs, it suspends the CPU's normal execution and switches it into System Management Mode (SMM).
Because SMI handlers run inside the BIOS firmware, they are transparent to the OS and you cannot intercept them. They are the highest-priority interrupts in the system, even higher than the NMI. Because they can last for hundreds of microseconds, they can cause unacceptable jitter for RTC, especially HRT.

However, SMIs are not that bad if your OS supports ACPI, because ACPI takes ownership of power management away from SMM. Both Windows and Linux support ACPI (so you can configure your RT application thread with a higher priority than the OS ACPI thread).

This resource also lists other measures to mitigate the impact of SMIs.

IBM also optimized its LS21 (Model 7971) and HS21 XM (Model 7995) blades to reduce SMI jitter by deferring the work of non-fatal SMIs to low-priority OS threads. (When a fatal SMI such as a memory or chipset error occurs, its work cannot be deferred because it has to be done to fix the error.)

1.3 DMA Bus Mastering
This happens on the now-pervasive PCI bus. When a device is using DMA, other devices that want to use DMA have to wait until it is done, which can cause jitter of many microseconds according to this resource.

1.4 Max Performance vs Max Power Saving
ACPI can put the CPU and other devices into a low-power or even powered-off state after a period of inactivity, which causes jitter when they have to be woken up.

ACPI supports at least two extreme power schemes.
One is "Max Performance", which keeps the CPU and other devices active all the time.
The other is "Max Power Saving", which powers off the CPU and other devices whenever possible.
"Max Performance" is preferred for RTC.

1.5 High Resolution Timer
The default system timer usually has a resolution of 10 ms on most platforms. Although you can usually lower it to 1 ms by configuring your OS, a high-resolution system timer can create too much interrupt overhead, which severely lowers throughput.

However, even 1 ms is still too coarse for RTC, because RTC needs a high-resolution timer for accurate scheduling of its periodic or one-time tasks and for nano-sleep functions (after all, RTC applications have time or deadline constraints). At least microsecond resolution is required for RTC to keep jitter low.
A hardware-based timer is also needed because it can generate interrupts on demand even when you need nanosecond resolution, e.g. the "dynamic tick" or even tickless implementation in Linux RT.

UltraSPARC systems can provide nanosecond timers, while most modern x86 systems can only provide microsecond-level timers.
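
To make the timer discussion concrete, here is a minimal sketch (assuming a Linux/POSIX environment, which is an assumption on my part and not implied by the original text) that queries the resolution of the monotonic clock and then sleeps until an absolute deadline, the two building blocks of accurate periodic scheduling:

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec res, next;

    /* Report the timer resolution the OS/hardware actually exposes. */
    clock_getres(CLOCK_MONOTONIC, &res);
    printf("CLOCK_MONOTONIC resolution: %ld ns\n", res.tv_nsec);

    /* Sleep until an absolute deadline 1 ms from now; absolute deadlines
       avoid drift in periodic RT tasks. */
    clock_gettime(CLOCK_MONOTONIC, &next);
    next.tv_nsec += 1000000L;                 /* + 1 ms */
    if (next.tv_nsec >= 1000000000L) {
        next.tv_sec  += 1;
        next.tv_nsec -= 1000000000L;
    }
    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    return 0;
}

On older glibc versions this needs to be linked with -lrt. The reported resolution and the actual wake-up jitter are exactly what an RTC application has to bound.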

1.6 Cache
Because processors are orders of magnitude faster than main memory (modern processors execute instructions in fractions of a nanosecond, while a main-memory access takes on the order of a hundred nanoseconds), caches are used to bridge the access-latency gap. However, this unavoidably creates jitter.

For example, if your RTC application fits completely into the L1 or L2 cache, its latency will be very low and predictable. Otherwise, main-memory accesses will cause jitter, although this is probably still acceptable for SRT or loose HRT depending on your business rules.
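
As a rough illustration of this cache-induced jitter, the following sketch (assuming Linux timing APIs; the array size and strides are arbitrary choices of mine) measures the average access time when walking a large array sequentially versus with a stride that fetches a new cache line on almost every access:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   /* 64M ints, far larger than any cache */

/* Return the average nanoseconds per access for a given stride. */
static double ns_per_access(const int *a, size_t stride)
{
    struct timespec t0, t1;
    volatile long sum = 0;
    size_t i, count = 0;
    double ns;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < N; i += stride) {
        sum += a[i];
        count++;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)count;
}

int main(void)
{
    int *a = calloc(N, sizeof(int));
    if (a == NULL)
        return 1;

    printf("stride 1  (mostly cache hits)       : %.2f ns/access\n", ns_per_access(a, 1));
    printf("stride 16 (roughly a line per access): %.2f ns/access\n", ns_per_access(a, 16));

    free(a);
    return 0;
}

On a typical box the strided walk is noticeably slower per access; that difference is the kind of jitter an RT thread sees when its working set spills out of the cache.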

1.7 Instruction Level Parallelism
Modern processors use instruction-level parallelism techniques such as out-of-order execution, pipelining and superscalar execution to improve thread (CPU) throughput. However, this again unavoidably creates unpredictable temporal behavior for your thread's instructions.
Because these effects occur at the instruction level, they are probably still acceptable for SRT or loose HRT depending on your business rules.

1.8  Multi-Processor / Core
Because a higher processor clock rate leads to lower latency, the traditional approach to increasing processor performance was to raise the clock rate, riding Moore's Law. However, this approach eventually ran into power-consumption and cooling-cost issues.
Multi-processor systems are one solution that can definitely improve throughput, but the interconnect between processors unavoidably introduces additional latency and jitter.
Multi-core processors may be an even better solution for RTC, achieving both high throughput and low latency thanks to the shared high-speed bus among cores and the shared memory model.

A very important application of multi-processor / multi-core systems in RTC is so-called "CPU shielding" (it also goes by a few other names such as CPU binding, interrupt binding, thread binding or fine-grained processor control). CPU shielding is implemented in the OS.

Here are some examples (see the sketch after this list).
You can bind low-priority interrupts to one CPU and dedicate another CPU to your RTC thread.
If you have multiple RTC threads that need to run simultaneously, you also need that many dedicated CPUs.
In the case of Java, you need to bind at least one CPU to the concurrent GC or RTGC, or to the background JIT compiler, so that it can truly run concurrently with your RT threads.
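
Here is a minimal user-space sketch of the first example (assuming Linux; CPU 3 and priority 80 are arbitrary choices of mine, and the complementary step of steering device interrupts away from that CPU is done separately via /proc/irq/<n>/smp_affinity):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    struct sched_param sp = { .sched_priority = 80 };

    /* Dedicate CPU 3 to this (RT) thread. */
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* Use the real-time FIFO policy so ordinary threads cannot preempt it
       (requires root or the appropriate capability). */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    /* ... the RT work loop would run here ... */
    return 0;
}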

1.9 NUMA
The multiple CPUs mentioned in Section 1.8 usually have equal access to the shared main memory. Besides its benefits, this unfortunately also causes contention when several processors attempt to access the same memory.

NUMA attempts to address this problem by providing separate memory for each processor. For workloads whose data can be spread out (common for servers and similar applications), NUMA can improve performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks).
However, if not all data ends up confined to a single task, meaning that more than one processor may require the same data, NUMA has to move data between memory banks, which causes jitter.
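
On Linux, libnuma exposes this placement decision to the application. The sketch below (an assumed deployment environment, not something from the original text; link with -lnuma) keeps a task and its working set on the same node so that latency-critical accesses stay off the inter-node interconnect:

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    /* Run this task on node 0 and allocate its working set from the
       memory bank attached to node 0 (node 0 is an arbitrary choice). */
    numa_run_on_node(0);
    double *buf = numa_alloc_onnode(1 << 20, 0);   /* 1 MiB, node-local */
    if (buf == NULL)
        return 1;

    /* ... latency-sensitive work touches only node-local memory ... */

    numa_free(buf, 1 << 20);
    return 0;
}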


Both current x86 and UltraSPARC processors provide good capabilities in the above areas and can be used as the underlying platforms for COTS ERTC applications.

2. Co-Processors
The shift to multi-processor / multi-core systems forces application developers to adopt a parallel programming model to exploit CPU performance, which has proven to be very error prone.
Traditionally, HPC uses clusters of COTS multi-core servers to run data- and compute-intensive applications. In order to cut such complex computations down to RTC levels, hundreds or even thousands of COTS multi-core servers along with a high-bandwidth interconnect need to be deployed, which not only creates maintenance headaches but also consumes a lot of power. And even though the interconnect usually has low latency, it still adds latency.

CPU-based systems augmented with hardware accelerators as co-processors are emerging as an even better solution to the Moore's Law dilemma. This has opened up opportunities for accelerators such as Graphics Processing Units (GPUs), FPGAs and other accelerator technologies to advance HPC to previously unattainable RTC levels.


2.1 General-Purpose Computing on Graphics Processing Units (GPGPU)
Because traditional CPU design targets general-purpose workloads (both high volume and low volume; both management tasks and ALU processing), the number of cores and the SIMD vector size are both small.
Because a GPU specializes in accelerating 2D and 3D graphics rendering using highly parallel, pipelined structures, it can have hundreds of processor cores, each of which can handle hundreds of independent threads. Its SIMD vectors are much longer than a CPU's, and the internal interconnect among cores has much higher bandwidth than the external interconnect in traditional clusters.

GPGPU is the technique of using a GPU to perform computation in applications traditionally handled by the CPU. It was made possible by the addition of programmable pipeline stages (shaders) and higher-precision arithmetic to the rendering pipelines, which allows software developers to use stream processing on non-graphics data.

A modern GPGPU is itself a cluster of hundreds of cores capable of handling tens of thousands of threads, and it can be hundreds of times faster than a traditional cluster made of hundreds of processors.

OpenCL is a GPU programming framework supported by all major GPU vendors. It supports both task-level and data-level parallelism. For data-level parallelism, users only need to partition the data properly and are not responsible for managing threads, which is much less error-prone than multi-core parallel programming.
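
For illustration, here is a minimal OpenCL C kernel sketch (the host-side setup that builds the program and enqueues it over an NDRange of one work-item per element is omitted for brevity). The developer only expresses what one work-item does with its slice of the data; the runtime maps the work-items onto the GPU's cores and threads:

/* One work-item per output element: out[i] = k * a[i] + b[i]. */
__kernel void scale_add(__global const float *a,
                        __global const float *b,
                        __global float *out,
                        const float k)
{
    size_t i = get_global_id(0);   /* this work-item's index in the NDRange */
    out[i] = k * a[i] + b[i];
}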

A GPU interfaces with the computer over PCIe, which can cause latency problems for high-volume, data-intensive computing. This is why GPGPU programming guides recommend using more threads to hide that latency.

2.2 Field-programmable gate array (FPGA)
An FPGA is an integrated circuit designed to be configured by the customer or designer after manufacturing, hence "field-programmable".
Because it uses highly parallel hardware to implement application logic that is traditionally implemented in software on a CPU or GPU, it provides line-speed latency, which is even lower than a GPU's.
The FPGA architecture provides the flexibility to create a massive array of application-specific ALUs that enable both instruction and data-level parallelism.
Because data flows directly between operators, it can be streamed from one operator to the next without inefficiencies such as processor cache misses.

An FPGA interfaces with the computer over either PCIe or a processor bus such as Intel's FSB or QPI, or AMD's HyperTransport. The latter option makes the FPGA look just like another processor, which avoids cache-coherency issues and also enjoys high bandwidth and low latency.

Although the C-to-FPGA compilation toolkit from Impulse enables a developer to use C instead of an HDL to design application logic, the learning curve is still steep. The developer still needs some basic hardware-design knowledge as well as parallelization techniques such as loop unrolling and instruction pipelining.
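
As a generic illustration of loop unrolling (not tied to Impulse or any particular C-to-FPGA toolchain), the rewritten loop below exposes four independent multiplies per iteration, which a hardware compiler can map onto four parallel operators in the fabric; n is assumed to be a multiple of 4:

/* Straightforward form: one multiply per iteration. */
void scale(const float *in, float *out, float k, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = k * in[i];
}

/* Manually unrolled by 4: four independent multiplies per iteration. */
void scale_unrolled(const float *in, float *out, float k, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        out[i]     = k * in[i];
        out[i + 1] = k * in[i + 1];
        out[i + 2] = k * in[i + 2];
        out[i + 3] = k * in[i + 3];
    }
}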

Wednesday, September 29, 2010

Commercial-Off-The-Shelf Enterprise Real Time Computing (COTS ERTC) -- Part 1: Introduction

This five-part series covers how to use real-time computing (RTC) to process enterprise workloads on commercial off-the-shelf (COTS) hardware and software, based on my experience and research.
By "enterprise workloads" and "commercial off-the-shelf", I mean the discussion is not about traditional RTC tailored to a few specific application tasks on bare-metal or embedded devices. That is how COTS ERTC came about.

An enterprise application stack typically consists of, counting from the bottom, the underlying hardware (microprocessors and / or co-processors) and OS platforms, the network, the programming language (I will focus on Java) and the application itself.
Here are the five parts for this series:
  • Part 1: Introduction
  • Part 2: Hardware (Microprocessor and Coprocessor) requirements
  • Part 3: Operating System (OS) requirements 
  • Part 4: Network requirements 
  • Part 5: Java requirements
In this introduction, I will clarify the following important concepts: throughput vs latency; hard real time (HRT) vs soft real time (SRT); proprietary vs COTS.

1. Throughput and Latency
An application's performance can have quite a few metrics. However throughput and latency (or response time) are the two most important aspects for most enterprise workloads and for most developers.
Throughput is the amount of useful work accomplished over a period of time.
Latency is the time needed to accomplish some amount of useful work.

In the definitions above, I use fuzzy words such as "a period" and "some amount"; you should quantify them based on your business logic. Roughly speaking, throughput and latency are inversely related to each other.

By "useful work", it means there are also overheads that can't be avoided completely during the period. Usually the more interruptions such as sending users feedback during the period, the more overheads. So in order to have a better throughput, the application needs to have few interruptions or longer latency.
On the other hard, in order to have lower latency, you have to process less amount of work before sending out a response, which means lower throughput. Be careful not to lower the amount of work too much otherwise you may well end up nothing but too frequently sending out responses, which is practically not doing any useful work.
The relationship between throughput and latency can also be shown in Figure 1 where you can see lower latency (more overhead) means lower throughput.
Figure 1: Useful work and overhead over a period
1.1 Is it possible to improve throughput and lower latency at the same time?
It depends.
It is not possible if you don't put more resources into your application.
It is very achievable if you do put more resources into your application, for example by upgrading your hardware, deploying more hardware, or employing more efficient algorithms.

Here is a theoretical example.
Suppose your application needs to process 1,000,000 transactions. The user needs to get feedback after every batch of transactions has been processed. Processing each transaction takes 1 ms and each feedback incurs 10 ms of overhead.
You are asked to calculate the application's throughput over a period of 1s.
Basically you have two configurations.
(1) The user doesn't need low latency or a fast response.
This lets your application send out feedback after a larger batch of transactions. If the batch size is 100, a simple calculation gives the following result:
  throughput: 909 transactions
  latency: 100 ms

(2) The user needs low latency or a fast response.
This requires your application to send out feedback after a smaller batch of transactions. If the batch size is 10, a simple calculation gives the following result:
  throughput: 500 transactions
  latency: 10 ms

From the above numbers, you can see that lower latency means lower throughput, and this becomes more pronounced as the per-feedback overhead grows.
How do you keep the 10 ms latency of (2) while still enjoying the 909-transaction throughput of (1)? You have to make your per-transaction processing more efficient and / or lower the per-feedback overhead.
Suppose you can't lower the per-feedback overhead, but you do know a better algorithm or are able to upgrade your hardware to cut down the per-transaction processing time. By a simple calculation, your per-transaction processing time has to be 0.55 ms or better instead of the original 1 ms: at 0.55 ms per transaction, a batch of about 18 transactions still fits within the 10 ms latency budget, and 1000 ms / (10 ms + 10 ms) = 50 such batches yield roughly 909 transactions.
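
The numbers above can be checked with a few lines of code (a throwaway sketch; the 0.55 ms case uses a batch of about 18 transactions, the largest batch that still fits the 10 ms latency budget):

#include <stdio.h>

/* window_ms / (batch * per_txn_ms + feedback_ms) batches fit in the window,
   each delivering `batch` transactions; latency is the work done before a
   response is sent. */
static void report(double per_txn_ms, double batch)
{
    const double feedback_ms = 10.0, window_ms = 1000.0;
    double cycle_ms   = batch * per_txn_ms + feedback_ms;
    double throughput = window_ms / cycle_ms * batch;
    double latency_ms = batch * per_txn_ms;
    printf("batch %6.2f at %.2f ms/txn -> throughput %.0f txn/s, latency %.1f ms\n",
           batch, per_txn_ms, throughput, latency_ms);
}

int main(void)
{
    report(1.00, 100.00);   /* configuration (1): ~909 txn/s, 100 ms latency */
    report(1.00, 10.00);    /* configuration (2): ~500 txn/s, 10 ms latency  */
    report(0.55, 18.18);    /* faster transactions: ~909 txn/s at ~10 ms     */
    return 0;
}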

Here is a real example about JVM's garbage collection (GC).
In this case your application threads are doing useful work while the GC only incurs overhead. A typical Stop-The-World (STW) GC causes application pause times, which are the main contributor to latency: the more memory there is to collect, the longer the pause time and the lower the throughput.
In order to reduce pause times, the GC has to collect memory more frequently or concurrently, which unfortunately incurs more overhead such as thread context switching and bookkeeping.
If you want very low pause times, the GC may have to collect memory far too frequently, which is practically equivalent to STW.


In order to get both good throughput and good latency, modern JVMs employ more advanced GC algorithms such as parallel GC, concurrent GC and adaptive dynamic adjustment, and they also require multi-core processors.

1.2 End user level latency relies on underlying throughput
In the first example above, if the user defines latency as the time taken to process all 1,000,000 transactions, what strategy should the underlying system employ to get better latency?
Obviously, the underlying system can process the 1,000,000 transactions faster when it is tuned for throughput rather than for latency (the throughput and latency here are those of the underlying system, not the end-user level).

In the second example above, the latency consists of your application threads' time plus the GC pause time within that window. Let's focus on the GC pause time. Obviously, in order to minimize the accumulated GC pause time within that window, the GC should use a throughput collector such as the STW parallel collector rather than a latency-oriented collector such as the Concurrent Mark Sweep collector.

This last point is very important because it helps you understand the following:
  • Improving infrastructure-level throughput can not only improve overall system throughput but also provide users with low latency;
  • Higher-level low latency doesn't mean the underlying systems must also provide low latency. This is why people focus more on High Performance Computing (HPC) than on RTC (see Section 3 for details);
  • You should also not be surprised when you hear a vendor boasting that its product is capable of both low latency and high throughput.
2. Hard Real Time (HRT) and Soft Real Time (SRT)
The widely recognized RTC definition comes from Donald Gillies in the RTC FAQ:
"A real-time system is one in which the correctness of the computations not only depends upon the logical correctness of the computation but also upon the time at which the result is produced. If the timing constraints of the system are not met, system failure is said to have occurred."

In other words, RTC must be deterministic to guarantee that it performs within the required time frame or deadline. It is not about throughput, it is about latency. It is not about how fast the response is, it is about the latency being predictable.
No system can provide absolutely constant latency. The variation in latency is called jitter.

In practice, RTC is classified as either HRT or SRT:
An HRT system must meet all of its deadlines; in other words, its jitter must always stay within its bound. Missing a deadline has catastrophic consequences.
An SRT system can still function correctly if it occasionally misses a deadline or its jitter sometimes overshoots the bound. The occasional misses usually happen in worst-case scenarios.

2.1 Predictability or determinism is subjective and driven by your business requirements
How to set up a bound, how tight the bound is, and whether the bound can be broken occasionally all depend on your business requirements.

For example, suppose your mean latency is 10 ms and the jitter is 5 ms. If a bound from 5 ms to 15 ms is acceptable, you have an HRT system.
However, if that bound is not acceptable and a tighter bound such as 8 ms to 12 ms is required, you can still have an SRT system, provided your business doesn't really need HRT and you can predict that the jitter will only occasionally overshoot the bound.

2.2 It still matters how fast your response is
Although RTC is not about how fast the response is, RTC systems usually have low latency and are very fast. In fact, lower latency also makes your system more predictable, because lower latency usually means less jitter.

For example, if options pricing in financial services takes several hours or even days, nobody will consider it real-time pricing; it has to be used as offline pricing.
Likewise, probably nobody in the financial services industry would consider a 1 s latency real-time or want to use it for high frequency trading.
On the other hand, very low latency is much harder and more expensive to achieve, or even impossible without proprietary hardware and software.

Hereafter we will use "tight HRT" to mean HRT with very low latency requirement and "loose HRT" to mean HRT with lenient latency requirement.
Although they are again very fuzzy and depend on your business requirements, they will not affect our discussion.

2.3 RTC is not for everyone
Based on the analysis in Section 1, we know RTC achieves its low latency at the cost of throughput, or requires more resources to achieve both low latency and high throughput. So RTC is simply not for everyone and should be targeted at the applications in your enterprise that actually need it. Also, if SRT can meet your needs, don't use HRT, because HRT places much higher requirements on every part of the application stack than SRT does.

2.4 RTC is still strange to most developers.
Because RTC traces its roots to embedded devices in the telecom, automotive and other control industries, most enterprise developers are not familiar with it.
Another reason is that most hardware platforms, OS platforms and application designs deployed in enterprises focus on throughput; their algorithms are optimized for the average case instead of the worst case. But HRT needs to reduce jitter in the worst case so that a predictable bound can be defined.
Recalling Section 1.2, you can further understand why HRT is much harder to implement than SRT.

3. Proprietary and COTS
Most developers have heard of RTC through RTOSes such as QNX and VxWorks. These are only used in embedded systems, which have little memory and limited external connectivity, are tailored to a few specific tasks, and are limited to a small user base and few vendors. So they are proprietary niche products that are not a good fit for enterprise applications.
But those traditional RTC systems really shine and are unbeatable when there is a need for very predictable timing behavior or HRT. In theory, only HRT is true RTC.

In the past several decades, both hardware and software have advanced rapidly in functionality and performance, and much of it has been commoditized. COTS systems such as x86 microprocessors, GPGPUs / FPGAs, Linux, Windows, Java and C# have become more powerful in both throughput and latency while remaining cheap and easy to use.
Unfortunately, those general-purpose COTS products are traditionally designed with throughput in mind. However, thanks to their huge deployments, large user base and wide industry support, RTC on top of COTS, or COTS ERTC, is very attractive.
The most important characteristic of a COTS ERTC system is that RTC functions are added without changing the existing COTS system's functions, so that a COTS ERTC system can support both traditional and RTC enterprise workloads.

A very bright spot in COTS ERTC is implementing an SRT system. Compared to traditional proprietary RTC or HRT, it is easy to build and can handle most ERTC workloads while still maintaining good throughput. Such an SRT system is so appealing that it should be the first choice for an enterprise.
Another bright spot is that High Performance Computing (HPC) is often employed to make high-volume computing tasks qualify as RTC, based on COTS ERTC clusters or GPUs / FPGAs (see Section 1.2 above).
Using a traditional single processor / core, complex tasks such as options valuation and risk analytics in financial services often take many hours or even days. HPC can cut the computation time down to sub-second on average, which basically qualifies them as SRT.

When vendors say they can provide the financial services industry with real-time products, they usually really mean SRT based on COTS ERTC rather than HRT.
Here are more examples:
  • "best execution" and transparency regulated in the US's RegNMS and European's MiFID.
    RTC can definitely reduce inconsistent trading times that can result in server fines.
  • The FTSE 100 Index, which is calculated in real time and published at sub-second intervals. (Remember that not long ago the interval was 15 s, which is not RTC due to the long latency.)
  • Reuters Data Feed Direct (RDF-D) is an ultra-low latency real-time feed handler that has an average latency (so SRT) of less than one millisecond during periods of high traffic.
  • Algorithmic trading, especially High Frequency Trading (HFT), which needs to execute a trade within a few milliseconds or even sub-millisecond in order to beat competitors.
Because RTC achieves low latency at the cost of throughput, it is very important for COTS ERTC systems to keep their original functionality while providing the additional RTC capability. After all, RTC is not for everyone, and an enterprise probably needs RTC and non-RTC functionality to co-exist.
Fortunately such COTS ERTC building components are commercially available.

4. Prioritized Approach
The overall application latency is the cumulative delay of all parts in the application stack, and if any part creates jitter, your application will experience jitter.
So in order to develop a COTS ERTC system, every part of the application stack should ideally support RTC. However, you should prioritize your latency and jitter analysis.
For example if your Java GC has a 500ms jitter while the OS scheduling has a much smaller 5µs jitter, the GC jitter tuning should take precedence because any scheduling jitter improvement is negligible.

Since the bottom parts of the stack usually have lower latency and create less jitter than the top ones, the top parts usually demand more attention to latency and jitter than the bottom parts, especially for HRT.
For example, if you just need SRT, you can tackle your Java and application logic directly without putting too much effort into the already fast modern hardware or OS.

The following parts in this series will analyze the latency and jitter of each part of the application stack and find ways to reduce them.

Tuesday, September 7, 2010

A New Step for My Wife

Today is the first day that my wife goes to NewYork-Presbyterian Hospital. I really don't need to say anything about the high reputation of this prestigious hospital except its No. 6 overall ranking in the nation by U.S. News and World Report in its latest America's Best Hospitals report.

My wife will go through a three-year program, one year of which is research and the rest of which is clinical training. She will be a medical physicist after the three years.
If things go well, she will have a great chance to stay on at that hospital.