Wednesday, September 29, 2010

Commercial-Off-The-Shell Enterprise Real Time Computing (COTS ERTC) -- Part 1: Introduction

This five-part series covers how to use real time computing (RTC) to process enterprise workloads using commercial off the shell (COTS) hardware and software based on my experience and researches.
By "enterprise workloads" and "commercial off the shell", I mean the discussion is not about the traditional RTC tailored to a few specific application tasks for bare metal or embedded devices. So you know how COTS ERTC came out.

An enterprise application stack typically consists of, counting from the bottom, the underlying hardware (micro-processors and / or co-processors) and OS platforms,  the network,  the program language (I will focus on Java) and the application itself.
Here are the five parts for this series:
  • Part 1: Introduction
  • Part 2: Hardware (Microprocessor and Coprocessor) requirements
  • Part 3: Operating System (OS) requirements 
  • Part 4: Network requirements 
  • Part 5: Java requirements
In this introduction, I will clarify the following importance concepts: throughput vs latency; hard real time (HRT) vs soft real time (SRT); proprietary vs COTS.

1. Throughput and Latency
An application's performance can have quite a few metrics. However throughput and latency (or response time) are the two most important aspects for most enterprise workloads and for most developers.
Throughput is the amount of useful work accomplished over a period of time.
Latency is the time needed to accomplish some amount of useful work.

In my above definitions, I use such fuzzy words as "a period" and "some amount". You should quantify them based on your business logic. They are mathematically inverse to each other.

By "useful work", it means there are also overheads that can't be avoided completely during the period. Usually the more interruptions such as sending users feedback during the period, the more overheads. So in order to have a better throughput, the application needs to have few interruptions or longer latency.
On the other hard, in order to have lower latency, you have to process less amount of work before sending out a response, which means lower throughput. Be careful not to lower the amount of work too much otherwise you may well end up nothing but too frequently sending out responses, which is practically not doing any useful work.
The relationship between throughput and latency can also be shown in Figure 1 where you can see lower latency (more overhead) means lower throughput.
Figure 1 useful work and overhead over a period
1.1 Is it possible to improve both throughput and lower latency at the same time?
It depends.
It is not possible if you don't put more resources into your application.
It is very achievable if you put more resources into your application such as upgrading your hardware or deploying more hardware or employing more efficient algorithms.

Here is a theoretical example.
Suppose your application needs to process 1,000,000 transactions. The user needs to get a feedback after some number of transactions have been processed. Each transaction processing takes 1ms and each feedback incurs 10ms overhead.
You are asked to calculate the application's throughput over a period of 1s.
Basically you have two configurations.
(1)The user doesn't need low latency or fast response
This needs your application to send out feedback every a larger number of transactions. If the batch number is 100, you can get the following result by simple calculation:
  throughput: 909 transactions
  latency: 100ms

(2)The user needs to have low latency or fast response.
This needs your application to send out feedback every a smaller number of transactions. If the batch number is 10, you can get the following result by simple calculation:
  throughput: 500 transactions
  latency: 10ms

From the above numbers, you know lower latency means lower throughput and this becomes more obvious when the per-feedback overhead becomes larger and larger.
How do you keep the 10ms latency in (2) while still enjoying the 909 transaction throughput in (1)? You have to make your per-transaction processing more efficient and / or lower the per-feedback overhead.
Suppose you can't lower the per-feedback overhead but you do know a better algorithm or be able to upgrade your hardware to cut down the per-transaction processing time. By simple calculation, you know you per-transaction processing time has to be 0.55ms or better instead of the original 1ms.

Here is a real example about JVM's garbage collection (GC).
In this case your application threads are doing useful work while the GC just incurs overhead. The typical Stop-The-World (STW) GC causes application pause time which is the main contributor to latency. The more memory to collect, the more pause time, the lower throughput.
In order to reduce pause time, the GC has to collect memory more frequently or concurrently, which unfortunately incurs more overheads such as thread context switching and bookkeeping.
If you want very low pause time, the GC may have to collect memory way too frequently, which is practically equal to STW.

In order to have both good throughput and latency, modern JVMs employ more advanced GC algorithms such as parallel GC, concurrent GC and dynamic adjustment based on adaptive algorithms and also require multi-core processors.

1.2 End user level latency relies on underlying throughput
In the above first example, if the user defines latency as the time taken to process all 1,000,000 transactions, what strategy will the underlying system employ to get a better latency?
Obviously the underlying system can process 1,000,000 transaction faster using throughput than latency (the throughput and latency here are of the underlying system, not the end user level).

In the above second example, the latency consists of your application thread time plus the GC pause time in the application window. Let's focus on the GC pause time. Obviously in order to minimize the GC pause time, the GC should use throughput collectors such as STW parallel collector instead of latency collectors such as concurrent Mark Sweep collector.

This last point is very important because it help you understand the following points:
  • Improving infrastructure level throughput can not only improve overall system throughput but also provide users with low latency;
  • Higher level low latency doesn't means the underlying systems must also provide low latency. This is why people focus more on High Performance Computing (HPC) than on RTC (see Section 3 for details);
  • You should also not be surprised when you hear a vendor boasting of his product capable of low latency and high throughput.
2. Hard Real Time (HRT) and Soft Real Time (SRT)
The widely recognized RTC definition comes from Donald Gillies in RTC FAQ: 
"A real-time system is one in which the correctness of the computations not 
only depends upon the logical correctness of the computation but also upon 
the time at which the result is produced. If the timing constraints of the  
system are not met, system failure is said to have occurred."

In other words, RTC must be deterministic to guarantee it performs within the required time frame or deadline. It is not about throughput, it is about latency. It is not about how fast the response is, it is about its latency being predicable.
No system can provide absolutely constant latency. The variance in latency is called Jitter.

Practically RTC can be classified as either HRT or SRT:
An HRT system must meet all its deadlines or its jitter is always bounded; otherwise it has catastrophic consequences.
An SRT system can still function correctly if it misses its deadlines occasionally or its jitter sometimes overshoot the boundary. The occasional misses usually happen in worst case scenarios.

2.1 Predictability or determinism is subjective and driven by your business requirements
It really depends on your business requirements how to setup a boundary, how tight the boundary is and whether the boundary can be broken occasionally.

For example, suppose your mean latency is 10ms and the jitter is 5ms. If you boundary from 5ms to 15ms is acceptable, you get an HRT system.
However if the above boundary is not acceptable and a new tighter boundary such as from 8ms to 12ms is required, you can still get an SRT system if your business really don't need HRT and you can predict the jitter only occasionally overshoots the boundary.

2.2 It still matters how fast your response is
Although RTC is not about how fast the response is, RTC systems usually have low latency and are very fast. Actually lower latency also makes your system more predicable because the lower latency usually means less jitter.

For example, if the options pricing in financial services takes several hours, even days, nobody will think it is real-time pricing and it has to be used as offline pricing.
Also probably nobody in the financial services industry would think a 1s latency is real-time and wants to use it in his/her high frequency trading.
However on the other hand, very low latency is much harder and more expensive to achieve or even impossible without using proprietary hardware and software.

Hereafter we will use "tight HRT" to mean HRT with very low latency requirement and "loose HRT" to mean HRT with lenient latency requirement.
Although they are again very fuzzy and depend on your business requirements, they will not affect our discussion.

2.3 RTC is not for everyone
Based on the above analysis in section 1, we know RTC achieves its low latency at the cost of poor throughput or requires more resources to achieve both lower latency and high throughput. So RTC is just not for everyone and should be targeted at those applications that need it in your enterprise. Also if SRT can meet your needs, don't use HRT because HRT has much higher requirements on all parts in the application stack than SRT does.

2.4 RTC is still strange to most developers.
Because RTC can trace its root from embedded devices in the tel-comm, auto and other control industries, most enterprise developers are not familiar with it.
Another reason is most hardware and OS platforms and application designs deployed in enterprises focus on throughput. Their algorithms are optimized for average cases instead of the worst cases.  But HRT needs to reduce jitter in wost cases so that a predicable boundary can be defined.
Recalling 1.2, you should further understand why HRT is much harder to implement than SRT.

3. Proprietary and COTS
Most developers heard of RTC from such RTOS as QNX and VxWorks. They are only used in embedded systems which have little memory, limited external connection, are tailored to a few specific tasks, and are limited to a small user base and fewer vendors. So they are proprietary niche products not good for enterprise applications.
But those traditional RTC systems really shine and are unbeatable when there is a need for very predicable timing behavior or HRT. Actually theoretically only HRT is RTC.

In the past several decades, both hardware and software have experienced rapid advancements in both functionality and performance, and many have been commoditized. Such COTS systems as X86 microprocessors, GPGPU / FPGA, Linux, Windows, Java and C# have become more powerful both on throughput and latency while still being cheap and easy to use.
Unfortunately those general purpose COTS products are traditionally designed with throughput in mind. However due to their huge deployments, large user base and wide industry support, RTC on top of COTS or COTS ERTC is very attractive.
The most important signature of a COTS ERTC system is RTC functions are added without changing existing COTS system's functions so that a COTS ERTC system can support both traditional and RTC enterprise workloads.

A very bright spot in COTS ERTC is to implement an SRT system. Compared to traditional proprietary RTC or HRT, It is easy to build and can handle most of ERTC workloads while still maintaining good throughput. Such an SRT system is so appealing that it should be the first choice for an enterprise.
Another bright spot is High Performance Computing (HPC) is often employed to make high-volume computing tasks qualified as RTC based on COTS ERTC clusters or GPU/FGPA (see the above Section 1.2). 
Using traditional single processor / core, such complex tasks as options valuation and risk analytics in financial services often take many hours even days. HPC can cut down the computation time to sub-second or even less on average, which basically qualify them as SRT.

When vendors say they can provide financial services industries with real-time products, they really usually mean SRT based on COTS ERTC instead of HRT.
Here are more examples:
  • "best execution" and transparency regulated in the US's RegNMS and European's MiFID.
    RTC can definitely reduce inconsistent trading times that can result in server fines.
  • FTSE 100 Index which is calculated in real time and published every sub-second. (Remember not long ago it was 15s, which is not RTC due to long latency)
  • Reuters Data Feed Direct (RDF-D) is an ultra-low latency real-time feed handler that has an average latency (so SRT) of less than one millisecond during periods of high traffic.
  • Algorithmic Trading, especially High Frequency Trading (HFT) needs to execute a trade in low milliseconds or even sub-millisecond in order to beat competitors
Because RTC achieves low latency at the cost of throughput, it is very import for those COTS ERTC to keep the original functionality while providing additional RTC. After all RTC is not for everyone and an enterprise probably needs both RTC and non-RTC functionality to co-exist.
Fortunately such COTS ERTC building components are commercially available.

4. Prioritized Approach
The overall application latency is the cumulative delays of all parts in the application stack. If any part creates jitter, your application will experience jitter.
So in order to develop a COTS ERTC system, all parts in the application stack should ideally support RTC. However you should prioritize latency and jitter analyses.
For example if your Java GC has a 500ms jitter while the OS scheduling has a much smaller 5µs jitter, the GC jitter tuning should take precedence because any scheduling jitter improvement is negligible.

Since the bottom parts usually have lower latency and create less jitter than the top ones, the top parts usually have higher requirements on latency and jitter than the bottom parts, especially for HRT.
For example if you just need SRT, you can directly tackle your Java and application logic without putting too much effort on modern fast hardware or OS.

The following parts in this series will analyze latency and jitters for all parts in the application stack and find ways to reduce them.

1 comment:

  1. A very informative series of articles.
    Thanks, Yong!