Saturday, December 11, 2010

GPGPU High Performance Computing using OpenCL -- Installation and Setup

High Performance Computing (HPC) with General-Purpose GPUs (GPGPU) is one of the hottest technologies when it comes to processing massive amounts of data.
Compared to the general-purpose CPU, GPGPU adopts some innovative approaches to parallel computing, from both the software and the hardware perspective.

When it comes to programming GPGPUs, OpenCL is the clear choice because it is the open standard for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other existing or future processors. Almost all major GPU vendors, including Nvidia and AMD (ATI), support it, as demonstrated by their recent OpenCL 1.0 driver releases.

I have been studying GPGPU for several months. Now I have a real chance to help some scientists process medical image data. Here is what I have:
  • 64-bit Windows 7 professional;
  • Dell Precision T7500 Tower Workstation. It includes 2 of Intel's latest 64-bit Xeon quad-core processors (a total of 8 physical cores, no HyperThreading), 6GB of DDR3 ECC main memory, a Broadcom NetXtreme 57xx Gigabit NIC, and an ATI FirePro 3D V5800 for the display.
  • Nvidia Tesla C2050 GPGPU. It has 448 CUDA cores and 3GB of GDDR5 ECC global memory. It is Nvidia's flagship product for building a powerful workstation.
1. Install C2050
You can really build a personal supercomputer (or a mini-cluster) right on your desk using the T7500 and the C2050. Unfortunately the T7500 and the C2050 came separately, and installing the C2050 on the T7500 is not as straightforward as the C2050 installation instructions inside its CD package suggest.

With the C2050 you basically have 2 possible configurations:
  1. Keep the existing ATI FirePro as the display adapter and use the new C2050 as a pure computing co-processor (not using it to drive the display).
  2. Use C2050 as both the display adapter and a computing co-processor.
Configuration 1 is my case. I believe many people also want to have this configuration.

The first step in the C2050 installation instructions is to remove the current graphics driver. This should only apply to configuration 2.
My first attempt strictly followed the instructions. I downloaded and installed the latest Nvidia driver 263.06 and its GPU Computing SDK 3.2. My monitor was still connected to the FirePro adapter. The Windows Device Manager showed a standard VGA adapter and a Tesla C2050 adapter, both working properly.
However when I ran the oclDeviceQuery OpenCL example in the SDK, C2050 was not recognized.
I am not sure whether it will work if I connect the monitor to C2050 and disable the standard VGA driver. I didn't put any more effort into it because I really want to have configuration 1.

So my next attempt was to restore the ATI FirePro driver. Unfortunately my previous uninstall did some damage and I couldn't restore the ATI FirePro driver, even though the installation files still seemed to be under c:\drivers\video\R259137\packages\drivers. So I downloaded the latest FirePro V5800 driver package from AMD's website and finally restored it.
Now when I tried to run the same oclDeviceQuery, I got the same result. Actually none of the CUDA examples in the SDK worked either.

In order to get the C2050 recognized, you need to enable the so-called TCC (Tesla Compute Cluster) mode instead of the WDDM mode. Among other things, the TCC mode lets the Tesla C2050 work as a computing co-processor alongside a separate display adapter.
Nvidia's release document for the C2050 driver is confusing on this point. The document says TCC is the default mode for the C2050, which made me think it would be in TCC mode once I installed it. But it was not in my case.

I tried to run nvidia-smi to enable TCC mode but it complained there was no TCC GPU detected.
In order to enable TCC mode, you have to manually edit the Windows registry by adding a DWORD entry called "AdapterType" and setting its value to 2. This new entry should be under
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E968-E325-11CE-BFC1-08002BE10318}\0001 (or 0000 or whatever number corresponds to your C2050 adapter).
After the above change, you have to reboot. After the reboot, when you run the same oclDeviceQuery, you will see the Tesla C2050 is successfully recognized.
Please note that because TCC mode disables the graphical functions, none of the graphical examples will work. But isn't this what you want? The TCC mode is faster than the WDDM mode for computing, and it lets the display adapter do the graphics.

2. OpenCL Support By Tesla C2050 Driver
Another confusing point in the C2050 driver release document is that it doesn't mention OpenCL support. It only mentions support for CUDA C/C++. Fortunately, testing with its OpenCL examples shows it indeed supports OpenCL 1.0 (in fact, if you examine the files installed by the 263.06 driver package, you can find the opencl.dll library).

3. Install OpenCL SDK
Nvidia's Developer Zone also seems confusing. When I just wanted to set up a computing SDK for GPGPU (not graphics), I was at a loss about what to download.
Basically there are 2 SDKs to download after you have installed the C2050 driver (the driver package includes both the low-level CUDA driver and the OpenCL driver).
  1. NVIDIA GPU Computing Toolkit. It includes both CUDA and OpenCL SDKs.
  2. NVIDIA GPU Computing SDK 3.2. It includes SDKs for CUDA, OpenCL and DirectCompute. It also includes samples.
For OpenCL development, you only need to install item 2.
On Windows 7, you probably have Visual Studio 2010. However, the OpenCL example solutions in item 2 above are for Visual Studio 2005 and 2008 only. The good news is that Visual Studio 2010 can successfully convert the 2008 solution. The only major conversion warning is that the TargetName is not consistent with the generated executable file name, but this will not cause you any trouble.
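To double-check the whole setup end to end, a minimal host program along the following lines can enumerate the OpenCL platforms and print every GPU device it finds (this is just my own sketch, not part of Nvidia's SDK; compile it against the OpenCL header and import library shipped with the GPU Computing SDK). With the TCC registry change from Section 1 in effect, the Tesla C2050 should appear in the list.

/* listdevices.cpp -- a minimal OpenCL device enumeration sketch */
#include <CL/cl.h>
#include <stdio.h>

int main()
{
    cl_platform_id platforms[8];
    cl_uint numPlatforms = 0;
    cl_uint p, d;

    if (clGetPlatformIDs(8, platforms, &numPlatforms) != CL_SUCCESS
        || numPlatforms == 0) {
        printf("No OpenCL platform found -- check the driver installation.\n");
        return 1;
    }
    for (p = 0; p < numPlatforms; ++p) {
        cl_device_id devices[8];
        cl_uint numDevices = 0;
        /* ask each platform for its GPU devices only */
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU,
                           8, devices, &numDevices) != CL_SUCCESS)
            continue;
        for (d = 0; d < numDevices; ++d) {
            char name[256] = {0};
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, 0);
            printf("Platform %u, device %u: %s\n", (unsigned)p, (unsigned)d, name);
        }
    }
    return 0;
}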

Sunday, November 21, 2010

Commercial-Off-The-Shelf Enterprise Real Time Computing (COTS ERTC) -- Part 4: Network Requirements

Enterprise applications rarely work in isolation nowadays. Instead they usually cooperate with one another in a network environment.
Due to Ethernet's ubiquity, this topic will only discuss the TCP/IP suite over Ethernet. The networks in Figure 4.1 are used for the discussion.
Figure 4.1 Networks
1. Network Latency
The network latency from a host in LAN1 such as Host11 to a host in LAN2 such as Host21 is the cumulative latency of all the following components:
  • the TCP/IP stack latency on Host11. It includes all OSI layers except the 7th application layer (you have full control at the application layer). At least the two lowest layers are implemented in the NIC firmware. The other layers are implemented either in the OS or also in the NIC firmware.
  • the switch latency in LAN1. The switch is a Layer 2 device. Originally an unintelligent Layer 1 hub or repeater was used. I will compare the latency difference later on.
  • the router latencies between LAN1 and LAN2. The routers are actually Layer 3 switches.
  • the switch latency in LAN2. Again the switch is a Layer 2 device.
  • the TCP/IP stack latency on Host21. Again it includes all OSI layers except the 7th application layer.
  • the propagation delay from Host11 to Host21. It is the time it takes for signals to travel through the physical copper cables or fibers. Signals in fibers or copper cables travel at about 2/3 of the fundamental speed of light.
    If the communication distance is short, e.g. within a LAN, the propagation latency can be negligible compared to the other, larger latencies.
    On the other hand, if the communication distance is long, it can't be ignored. For example the straight-line distance from San Francisco to New York is about 4,125 kilometers and the one-way signal travel takes about 19.6ms, which is the lower bound of your overall latency optimization.
Network latency can also be defined as the round-trip latency between Host11 and Host21. However, since our discussion focuses on latency and jitter analysis of all the components involved, the above one-way definition is sufficient.

The data unit in our discussion is a packet. For example we can set it anywhere from the minimum Ethernet frame size (64 bytes) to the maximum MTU (about 1500 bytes) and compare the difference.
When you analyze your application level latency, the data unit is an application level request, which has complications when interacting with the underlying Maximum Segment Size (MSS) of TCP and the MTU of Ethernet frames (later sections will explain more details).

The prioritized latency and jitter analysis mentioned in Part 1 still applies here. For example, upgrading to a faster switch in San Francisco doesn't make sense because the minimum latency to New York is bounded by the 19.6ms propagation delay, while the latency of switches is usually in the low microseconds.

The following sections will discuss the latency and jitter in all involving components.
2. Non-Determinism in Ethernet
Ethernet, as standardized by IEEE 802.3, is implemented at Layers 1 and 2 of the OSI model. Its latency comes from the processing delays at Layers 1 and 2. The latency has jitter because of its media access control (MAC) protocol - CSMA/CD at Layer 2.

With CSMA/CD, each host detects whether another host is transmitting on the shared medium before it tries to transmit its own data (Carrier Sense). When a host detects a carrier, its Carrier Sense is turned on and it defers transmission until it determines the medium is free. This deference is unfortunately not predictable. If two hosts happen to transmit simultaneously (Multiple Access), a collision occurs and the frames are destroyed.

If we replace the switch in LAN1 / LAN2 with a hub / repeater, all four hosts will be in one collision domain. The more hosts trying to send data, the more collisions and the more non-determinism.
Hosts sharing a half-duplex connection are also in the same collision domain. For example, even if only Host11 and Host12 are communicating with each other over a half-duplex connection in LAN1, collisions still happen when Host11 and Host12 try to send data to each other concurrently.

Although a redesign of the MAC protocol could solve the collision problem, it would not be compatible with the existing ubiquitous Ethernet deployments and thus is proprietary and not a COTS solution.
If we can avoid collisions, Ethernet becomes much more deterministic. This is what the switch in LAN1 / LAN2 is supposed to do. With the switch, each host is guaranteed exclusive use of the medium in both directions and is thus in a different collision domain from the other hosts.
For example Host11 can communicate with Host12 while Host13 is communicating with Host14 without causing any collisions or frame drops.

3. Non-Determinism in TCP/IP
IP is at Layer 3; TCP and UDP are at Layer 4. The biggest jitter at these upper layers actually comes from the non-determinism of the underlying CSMA/CD.
For example, if the underlying frame was dropped due to a collision, the upper-layer packet will be lost if it is UDP, and will be retransmitted if it is TCP thanks to TCP's time-out mechanism.

A side note on TCP's time-out mechanism: if the network is slow, a low time-out value can lead to false retransmissions. This is another reason why RTC applications prefer fast networks such as LANs.

You should also be aware of the following points:
3.1 Non-Deterministic Routing Paths
A router adds Layer 3 processing compared to a switch. The more routers, the more potential communication paths between LAN1 and LAN2, and the more non-determinism. A long communication path also leads to more frame drops due to data corruption. So an RTC system is usually enclosed in a LAN; otherwise, make sure your routing path is deterministic. This is why many algorithmic trading systems, especially high-frequency trading systems, colocate with exchanges.

If the communication path is predictable, the determinism of a network with switches and routers will depend on that of those switches and routers. Modern switches and routers are actually very fast and can show very low jitter if managed properly, which will be discussed in Section 4.
3.2 Packet Header Overhead and Fragmentation
Each packet at the IP, UDP and TCP layers has a header besides the real payload. The header sizes for IP, UDP and TCP are 20, 8 and 20 bytes, respectively.
Actually the Ethernet frame at Layer 2 also has about 20 bytes of overhead. When an IP packet is larger than the Ethernet MTU (typically 1500 bytes), IP needs to fragment and reassemble the packet.
TCP also segments its stream into packets based on the MSS, the available window size and some other factors.

On one hand, more fragmentation and reassembly means more extra computing and higher latency. On the other hand, packets much smaller than the MTU or MSS mean more significant header overhead and lower throughput.
In order to reduce jitter, your application-level requests shouldn't vary too much in size (different-sized requests also mean different transmission delays). You should make your requests as large as possible without missing your latency target. If possible, decide the request size based on the MTU.
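For example, with the typical 1500-byte Ethernet MTU, the largest TCP payload that fits in one unfragmented frame is 1500 - 20 (IP header) - 20 (TCP header) = 1460 bytes, which is also the common MSS value; keeping a request at or just below this size avoids IP fragmentation while holding the roughly 40 bytes of header overhead under 3%.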

3.3 TCP Flow Control
In order to avoid TCP packet drops from buffer overflows, TCP employs a sliding window mechanism to control its data flow.
So slow processing at the receiver host forces the sender host to send smaller packets, or even to wait if the window size drops to 0. In order to reduce this jitter, make sure both ends can process network data quickly and predictably.

If the network is slow, the sender may have to wait for the latest window size from the receiver. A fast network such as LAN can reduce this waiting latency. 

3.4 Nagle's Algorithm
Because of the 20-byte header overhead in TCP, Nagle's algorithm coalesces a number of small outgoing messages into one packet and sends them out together.
This algorithm is usually unacceptable to RTC systems because they need an immediate response to each of their requests. This is especially true when the RTC requests are small.

RTC systems usually take the following two counter-measures:
  • Enclose each request in one TCP packet;
  • Turn off this algorithm by using the TCP_NODELAY socket option (see the sketch right after this list).
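For example, with BSD-style sockets (Winsock needs the same call with a char-cast option value), disabling Nagle's algorithm is a single setsockopt() call on the connected socket. The helper below is only a sketch of mine, not tied to any particular application:

#include <sys/socket.h>
#include <netinet/in.h>    /* IPPROTO_TCP */
#include <netinet/tcp.h>   /* TCP_NODELAY */

/* Disable Nagle's algorithm on an already-connected TCP socket so that
   each small request is sent immediately instead of being coalesced.
   Returns 0 on success, -1 on failure (like setsockopt itself). */
int disable_nagle(int sock)
{
    int flag = 1;
    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
}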
3.5 Delayed Acknowledgment
Again because of the 20-byte header overhead in TCP, a pure TCP acknowledgment packet carries too much overhead. So this mechanism delays the TCP-layer acknowledgment by about one or two hundred milliseconds, hoping to send it back along with the upper application-layer response in just one TCP packet instead of two.

Unfortunately a deadlock of one or two hundred milliseconds can occur if this mechanism interacts with Nagle's algorithm and your application's responses don't line up with the TCP layer's, e.g. when an application request was sent out in two or more TCP packets. Please refer to this resource for a detailed analysis.

In order to avoid this jitter, you need to take one or more of the following measures:
  • disable the Nagle's algorithm;
  • disable or configure the delayed acknowledgment mechanism. Please refer to this resource for Windows, and use TCP_QUICKACK on Linux (a sketch follows this list);
  • Make sure your application level request is enclosed in one TCP packet.
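On Linux, TCP_QUICKACK is also set with setsockopt(), but the kernel clears it again on its own after some acknowledgments, so a latency-sensitive receiver typically re-arms it after every read. The hypothetical helper below is just a sketch of that pattern:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_QUICKACK (Linux-specific) */
#include <unistd.h>

/* Read from the socket and immediately re-enable quick ACKs, because the
   Linux kernel may clear TCP_QUICKACK by itself after a while. */
ssize_t read_with_quickack(int sock, void *buf, size_t len)
{
    ssize_t n = read(sock, buf, len);
    int flag = 1;
    setsockopt(sock, IPPROTO_TCP, TCP_QUICKACK, &flag, sizeof(flag));
    return n;
}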
3.6 UDP with Application Level Retransmission and Flow Controls
TCP implements its reliability at the cost of more computing work and header overhead than UDP. If TCP's reliability is more than what you need, you can use UDP and implement your own simple reliability at the application layer. This approach should give you lower latency and jitter than using TCP.
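A minimal form of such application-level reliability is a stop-and-wait send: attach a sequence number, wait briefly for an acknowledgment, and retransmit a bounded number of times. The sketch below assumes a hypothetical wire format (a 4-byte sequence number followed by the payload, with the receiver echoing the sequence number back as the ACK) and only illustrates the idea; a real design would add windowing, duplicate filtering and flow control:

#include <sys/socket.h>
#include <sys/select.h>
#include <netinet/in.h>
#include <string.h>
#include <stdint.h>

/* Send one datagram and wait for its ACK, retransmitting up to 3 times.
   Returns 1 when acknowledged, 0 otherwise. */
int send_reliable(int sock, const struct sockaddr_in *peer,
                  uint32_t seq, const void *payload, size_t len)
{
    char packet[1500];                       /* keep one request per datagram */
    int attempt;

    if (len + sizeof(seq) > sizeof(packet))
        return 0;
    memcpy(packet, &seq, sizeof(seq));
    memcpy(packet + sizeof(seq), payload, len);

    for (attempt = 0; attempt < 3; ++attempt) {
        fd_set readable;
        struct timeval timeout = {0, 5000};  /* wait up to 5 ms for the ACK */
        uint32_t ack = 0;

        sendto(sock, packet, len + sizeof(seq), 0,
               (const struct sockaddr *)peer, sizeof(*peer));

        FD_ZERO(&readable);
        FD_SET(sock, &readable);
        if (select(sock + 1, &readable, 0, 0, &timeout) > 0 &&
            recv(sock, &ack, sizeof(ack), 0) == (ssize_t)sizeof(ack) &&
            ack == seq)
            return 1;                        /* acknowledged */
        /* timeout or unexpected ACK: retransmit */
    }
    return 0;
}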

4. Latency and Jitter of a Switch / Router
A router has an additional Layer 3 function. Its processing latency in a well-planned network will be predictable after a warm-up period that caches all routing paths in the router. So we will focus on switches only.
A modern switch basically has two packet forwarding methods:
  • Store and forward. A whole frame is buffered before being forwarded. A checksum is usually performed on each frame.
  • Cut through. The switch reads only up to the frame's hardware address before starting to forward it. There is no error checking.
Because a cut-through switch doesn't have the additional buffering step, it has an even lower latency. However a cut-through switch will fall back to store and forward if its outgoing (egress) port is busy when a frame arrives. So we will focus on store-and-forward switches.

The one-way port-to-port latency of a switch is the cumulative delays of the following components:
  • Layer 1 processing at both ports including signal modulation and data framing;
  • Switch Fabric Latency. A switch has an internally shared high-bandwidth fabric that is much faster than any of its ports.
  • Store and forward latency;
  • Queuing latency. It occurs when different ingress ports are sending frames to the same egress port concurrently. Since only one frame can be transmitted at a time from the egress port, the other frames must be queued for sequential transmission in a FIFO manner. This phenomenon is called head-of-line blocking. Due to the FIFO behavior, the latency of a frame in the queue is unpredictable.
    So each port of a switch has an outgoing queue, which along with the switch fabric actually gives the impression of simultaneous paths among its multiple ports. 
Detailed analysis can be found from these two resources: Switched Ethernet Latency Analysis and Latency on a Switched Ethernet Network.

Both analyses show that the queuing latency is usually the largest component, and that a switch's jitter comes from it due to its unpredictable head-of-line blocking.
In order to reduce the queuing latency and jitter, you need to take one or more of the following measures:
  • Plan your network traffic properly, including a predictable many-to-one mapping from ingress ports to egress ports; a smaller many-side value if higher predictability is needed; and, more importantly, no oversubscribed egress port.
    For example, if your egress port is 10GbE, you can only have a maximum of ten 1GbE ingress ports concurrently sending data to it; otherwise frame loss will occur. If you need some 1GbE port to have higher predictability, you need to reduce the number of 1GbE ingress ports concurrently sending data to the 10GbE egress port.
  • Because a switch's queuing jitter can cause an avalanche effect at subsequent network hops, this is another reason to reduce the number of routers in your networks;
  • Apply Virtual LAN (VLAN) or priority values to different traffic. Make sure your RTC frames are in a separate VLAN or have the highest priority. This is similar to the different priority levels for RTC threads we discussed in Part 3.
Finally for this section, even though a hub / repeater has lower latency than a switch thanks to its simple Layer 1 processing, an RTC system should not use one unless necessary (such as for port mirroring) because it creates jitter.
    5. Gigabit Ethernet
    Ethernet evolved from 10BASE-T to 100BASE-T to the currently widely deployed 1GbE or Gigabit Ethernet. Now even 10GbE is often seen in backbone networks and high-end servers as a cheaper alternative to proprietary and expensive high-speed interconnects such as Fibre Channel and InfiniBand.
    A modern cut-through 10GbE switch such as BLADE's RackSwitch G8124 can have an average port-to-port latency of 680 nanoseconds (the average is over latencies of several different packet sizes from the minimum 64 bytes to the maximum 1518 bytes).

    Although this 680-nanosecond latency is of the same order of magnitude as main memory access, the load on the CPU increases linearly, without some kind of TCP offloading or OS kernel bypassing, as a function of packets processed, with the usual rule of thumb being that each bit per second of bandwidth consumes about a hertz of the CPU clock. For example 10Gbps of network traffic consumes about 10GHz of CPU, which is much higher than the 3.33GHz of Intel's latest Core i7 processor.
    As more of the host CPU is consumed by the network load, both CPU utilization and host send / receive latency and jitter become significant issues.
    RDMA over TCP/IP, or iWARP, has emerged as a remedy. Basically an iWARP NIC, or R-NIC, allows a server to read/write data directly between its user memory space and the user memory space of another R-NIC-enabled host on the network without any involvement of the host operating systems.

    APIs have been implemented for different platforms, including the OpenFabrics Enterprise Distribution (OFED) by the OpenFabrics Alliance for the Linux operating system, and the Winsock Direct protocol for Microsoft Windows.

    Tuesday, November 9, 2010

    Commercial-Off-The-Shelf Enterprise Real Time Computing (COTS ERTC) -- Part 5: Java Requirements

    Because programs designed using standard Java (either Java SE or Java EE or Java ME) exhibit non-deterministic timing of execution, standard Java is not widely used in RTC, especially in HRT.
    However due to standard Java's popularity in many industries, extending it with RTC functions is very attractive. The Real-Time Specification for Java (RTSJ) is designed to seamlessly augment standard Java with RTC functions.

    In this topic I will describe the key areas where standard Java creates jitter and how the two popular RTSJ implementations - Oracle's Java RTS and IBM's WebSphere RT - reduce it.
    This is also a good time to compare RTSJ with C/C++ if they both run on the same underlying RT OS.

    But before we start, here are a few things about RTSJ:
    Its implementations usually need an RT OS; that is actually a must for HRT. In order to lower the learning curve, RTSJ doesn't introduce any new Java syntax; instead it provides several new classes. Both Java RTS and WebSphere RT support Java SE 5 or later.

    1. Threads, Synchronizations and Memory Management Models
    Because Java threads have a one-to-one relationship with the underlying OS native threads (e.g. Java threads are implemented using the native POSIX threads on Solaris and stock Linux), most of Java threads' functions are delegated to OS native threads as we discussed in Part 3.

    1.1 Regular Java Threads vs Java Real-Time Threads
    Regular Java threads created by class java.lang.Thread (JLT) are not suitable for RTC due to the following factors:
    • JLT only has 10 priority levels, which is not enough for COTS ERTC.
    • JLT maps to OS non-real-time threads. For example, JLT maps to the SCHED_OTHER scheduling policy on Linux that creates jitter due to its dynamic priority adjustment.
    • JLT has the "priority inversion" problem because java.lang.Object.wait() doesn't implement the "priority inheritance" logic.
    • JLT doesn't provide priority-based synchronization because java.lang.Object.notify() is not required to wake up the highest-priority thread; and the "unlock" action in the "synchronized" statement is not required to choose the highest-priority thread to be the next lock owner.
    • JLT is subject to unpredictable pauses induced by GC because JLT allocates objects in the standard Java heap.
    RTSJ provides a new thread type javax.realtime.RealtimeThread (RTT) to fix the above problems:
    • RTT must have at least 28 priority levels. For example Java RTS has a default of 60 on Solaris and 49 on stock Linux;
    • RTT maps to OS level real-time threads. RTT maps to the SCHED_FIFO scheduling policy on stock Linux which provides fixed-priority scheduling based on FIFO queues (another stock Linux RT scheduling policy "SCHED_RR" doesn't meet RTT requirements due to its round-robin feature). 
    • RTSJ requires java.lang.Object.wait() to implement the "priority inheritance" logic. On Linux RT and Solaris, this can be delegated to the underlying POSIX threading services.
    • RTSJ requires java.lang.Object.notify() to wake up the highest-priority thread, and the "unlock" action in the "synchronized" statement to choose the highest-priority thread as the next lock owner. On Linux RT and Solaris, these can again be delegated to the underlying POSIX threading services.
    • RTSJ provides a subclass of RTT called javax.realtime.NoHeapRealtimeThread (NHRT) which is protected from GC-induced jitter (see next section for details). NHRT is intended for HRT.
    Because RTT exposes many of the underlying OS native threading functions to Java, RTSJ can compete squarely with C/C++ in the above areas.

    1.2 New Memory Management Models
    Because NHRT must not be affected by GC pauses, it is allowed neither to use the standard garbage-collected heap nor to manipulate references into that heap.
    Besides the standard heap, RTSJ provides two memory management models that RTT and NHRT can use to allocate memory on a more predictable basis:
    • Immortal Memory.
      It is not garbage-collected. Its primary use is to avoid dynamic allocation by statically allocating all the needed memory ahead of time.
      It is like the malloc() / new operator in C/C++ without the corresponding free() / delete operator, because once an object is allocated from it, the memory used by the object will never be reclaimed. Any unintended allocation into it is effectively a memory leak and can cause an out-of-memory condition.
    • Scoped Memory. It is not garbage-collected either. It is intended for objects with a known lifetime such as temporary objects created during the processing of a task. It will be entirely reclaimed at the end of the lifetime such as when the task finishes.
      Because many standard and third-party Java libraries create many temporary objects, it is impractical to use immortal memory if RTT or NHRT has to link to many libraries. In this case scoped memory is a better choice.
    Although the two new models enable RTSJ to compete with C/C++ equally in the memory allocation area, they are hard to use and very error-prone, because objects allocated in the two new models and in the standard heap have different GC characteristics and lifetimes, and assignment rules must be enforced.
    So the recommended use of the two models is limited to programs that can't tolerate GC pauses such as in HRT.

    1.3 Communication between NHRT and RTT / JLT
    An NHRT could block on a lock held by an RTT or JLT. While the RTT or JLT holds the lock, GC could preempt it and thereby indirectly preempt the NHRT. If this is not tolerable, you should either avoid lock sharing or use the following non-blocking queues for resource sharing:
    • javax.realtime.WaitFreeReadQueue class for passing objects from RTT / JLT to NHRT.
    • javax.realtime.WaitFreeWriteQueue class for passing objects from NHRT to RTT / JLT
    Because many Java libraries use internal locking, NHRT linked to Java libraries must avoid indirect GC preemption to ensure HRT behavior.

    Because the new RTSJ classes and RT functions mentioned above have a medium-level learning curve for Java SE developers, you are recommended to still use JLT with careful tuning if SRT can meet your business requirements. Such tuning includes the OS-level tuning mentioned in Part 3, limiting the number of your application threads, avoiding resource sharing, and the GC low-pause tuning mentioned in Section 2 of this part.

    2. Garbage Collection
    Although the automatic garbage collection provided by the JVM greatly eases memory management in Java compared to C/C++, it is unfortunately another big source of non-determinism.
    Although standard JVMs provide different GC algorithms such as serial, parallel and concurrent to let users trade off pause time against throughput, all of the algorithms involve a stop-the-world (STW) pause, in which all application threads except NHRT are stopped so that GC can run without interference. This STW behavior is only acceptable for SRT or loose HRT.

    Although RTSJ doesn't define a deterministic GC, an RT JVM must provide one in order to support HRT. There are basically two approaches: work-based and time-based.
    Both approaches aim to minimize the effect of long pauses by doing incremental collection within a GC cycle. Unfortunately neither can provide an HRT guarantee.

    2.1 Work-Based
    Because the standard STW GC behavior blankly taxes all application threads, the work-based approach has each thread do a specific amount of incremental GC work, proportional to its allocation amount, each time it allocates an object.
    However your application is still unpredictable, because the GC cost spread-out is often uneven (the allocation is often uneven) and the amount of time needed to do a fixed amount of incremental collection work is variable, as shown in Figure 5.1 courtesy of this resource:
    Figure 5.1 Risks of Work-Based Garbage Collection 
    2.2 Time-Based
    It schedules a fixed amount of collection time in each GC cycle.
    Although it spreads the GC cost evenly through a GC cycle, there is no direct correlation between the allocated collection time and the reclaimed memory, and your application is still unpredictable, as shown in Figure 5.2 courtesy of this resource:

    Figure 5.2 Time-Based GC and Undesirable Outcomes
    2.3 Oracle Java RTS 2's Approach
    Java RTS uses a modified work-based approach called "Henriksson's GC" or Real-Time GC (RTGC). The RTGC can be configured to run as one or more RTTs at a priority lower than critical threads (NHRT and critical RTT) and higher than non-RT threads (JLT), and possibly higher than non-critical threads (non-critical RTT) as well. Critical threads may therefore preempt the RTGC, while the RTGC may preempt non-RT threads and non-critical threads to keep up with the application's memory allocation rate. This is shown in Figure 5.3 courtesy of this resource:
    Figure 5.3 Henriksson's GC
    The initial RTGC priority is lower than non-critical RTT, but it is boosted to its configurable "maximum priority", higher than non-critical RTT, if the remaining memory drops close to another configurable "memory threshold". These two configurable parameters let you tune the balance between non-critical threads' determinism and memory throughput.

    Figure 5.3 shows that the RTGC ensures HRT only for critical threads, which should compete pretty well with C/C++, while trying to offer SRT for non-critical threads, which usually can't compete with C/C++.

    2.4 IBM WebSphere RT 2's Approach
    WebSphere RT's Metronome is a time-based deterministic GC. It divides a GC cycle into a series of discrete quanta, approximately 0.5ms but no more than 1ms in length, that are devoted to either GC work or application work.
    Even though a single GC pause has an upper bound of 1ms, that alone is not enough, because if several consecutive quanta were devoted to GC work, the application would still experience a longer pause, as we discussed in Part 1. Metronome must also meet another parameter called "utilization", which is the percentage of time quanta allocated to the application in a given window of time that slides continuously over the application's complete run.
    Figure 5.4 shows a GC cycle divided into multiple 0.5ms time slices preserving the default 70% utilization over a 10ms window, courtesy of this resource:
    Figure 5.4 Metronome GC Sliding window utilization

    Compared to Oracle's RTGC, Metronome provides more determinism to JLT and non-critical RTT, while RTGC ensures HRT for critical RTT. Because learning RTT takes time, Metronome is recommended for implementing SRT with JLT.

    2.5 Standard JVM's Approach

    Both Oracle's Hotspot JVM and JRockit provide so-called low-pause concurrent GCs that only briefly pause the application and run concurrently with it most of the time.
    Hotspot allows you to specify both a target pause time and a throughput goal, while JRockit only allows you to specify a target pause time, which seems inadequate based on our discussion in Part 1.
    Such a low-pause concurrent GC can at best provide SRT.

    Another very promising SRT implementation is provided by JRockit Real Time, which extends the existing mostly concurrent low-pause GC by ensuring the target pause time and limiting the total pause time within a prescribed window (unfortunately such a window can't be configured by users).
    JRockit Real Time GC is more deterministic than the regular JRockit concurrent low-pause GC. Oracle claims it is the industry's fastest JVM, with average response times in microseconds for most well-managed applications and low, single-digit-millisecond responses with a five-nines reliability guarantee. These numbers are pretty good for SRT, even in the financial services industry.
     
    Here is the biggest advantage of tuning GC compared to using the new RTSJ classes in Section 1: it is transparent to Java SE developers. In other words, it doesn't require developers to change their existing programming models or adapt to new RTC models.

    Finally for this section, it is often hard to choose between SRT or loose HRT coded in C/C++ and SRT coded in Java along with a deterministic low-pause concurrent GC.


    3. Class Loading, Linking, Initializing and Compilation
    3.1 Class Loading, Linking, Initializing (LLI)
    A standard JVM loads, links and initializes classes on demand; it also unloads classes that are no longer referenced. Though LLI provides great flexibility in memory and CPU consumption, and the one-time cost can probably be made up if the same class is referenced multiple times, it is still another big source of non-determinism.
    For example, class loading usually takes at least tens of milliseconds because it usually involves disk IO. Linking and initializing are also very CPU-intensive.

    You can eliminate the LLI jitter by pre-doing LLI in the application warm-up phase (it is defined in Part 3) either by calling the standard java.lang.Class.forName() or using RTSJ implementation-specific utilities.
    For example, Oracle's Java RTS 2 allows you to either specify a list of classes for pre-loading and / or pre-initializing on the command line at JVM startup or use the Initialization-Time-Compilation (ITC) API during runtime (Both Java RTS and WebSphere RT can automatically generate a list of classes that were loaded or initialized by your application's execution).

    Both approaches for eliminating LLI jitter are basically transparent to Java SE developers.

    3.2 JIT vs Ahead-of-Time (AOT / ITC)
    If you have used a C/C++ static compiler before, you know that compiling (optimization included) is very CPU- and memory-intensive. So modern JVM JITs initially interpret Java methods and, only for those methods that execute frequently, later compile them to native code.
    Because the dynamic nature of Java makes much important information available only at runtime, a JIT can generate even better code than statically compiled languages like C/C++. Recompiling (deoptimization), virtual method optimization and escape analysis are just three examples.
    So if the JIT can carefully balance compiling time against optimization aggressiveness, the one-time compiling cost will be made up by the many later executions of the native code, and the average execution time will be equal to or even shorter than that of the corresponding C/C++ program.

    However, because the compiling time is up to the JVM and there is execution-time variation between interpreted code and native code, the dynamic JIT is another big source of non-determinism.
    The only solution to eliminate JIT jitter is to use some kind of static compiling such as AOT in WebSphere RT and ITC in Java RTS.

    However AOT/ITC also has some disadvantages.
    Firstly, platform portability is compromised by AOT. ITC is a bit better due to its initialization-time compiling instead of AOT's development-time compiling.
    Secondly, AOT/ITC-compiled code, though faster than interpreted code, can be much slower than JIT-compiled or C/C++ code, because AOT/ITC can only make a few conservative optimizations with the little information available at hand.

    The debate of JIT vs AOT/ITC is still evolving. Readers are referred to this resource to gain more insight. I personally recommend JIT because you can still achieve SRT and even loose HRT with it. With such fine tunings as warm-up and background JIT compiling, your Java programs can compete toe-to-toe with C/C++ programs.

    4. Timers and Clocks
    If RTC needs nanosecond resolution and your system can only provide milliseconds at best, that causes jitter.
    RTSJ provides new classes javax.realtime.Clock and javax.realtime.HighResolutionTime to expose high-resolution clocks in underlying hardware and OS as discussed in Part 2 and 3.
    You should also use the following standard Java methods if you need nanosecond resolution:
      java.lang.Object.wait(long timeout, int nanos); 
      java.lang.Thread.sleep(long millis, int nanos);

    You shouldn't use standard Java's java.util.Date or java.lang.System.currentTimeMillis() due to their low resolution and synchronization with the world clock (this synchronization looks like jitter when this world clock is updated).

    In summary, RTSJ can compete toe-to-toe with C/C++ in most areas when building COTS ERTC applications. The only concern is the learning curve of its functional extensions to standard Java. However, the curve is not steep and the extensions are often necessary for RTC.
    If you had a bias against using Java to build RTC systems, hopefully you will have second thoughts after this discussion.

    Friday, November 5, 2010

    Locking Schemes for Replicated Data Updating

    Data distribution is commonly used in high-performance computing (HPC). Basically there are two fundamental data distribution topologies. One is replication; the other is partition.

    With data partition, you can achieve parallel processing of a large amount of data.
    With data replication, you can achieve load balancing and high availability (HA).

    Even though a data item has several replicas in a data replication environment, it should have some degree of transparency and appear as only one virtual global item to end users.
    The biggest challenge using data replication is the proper trade-off between data coherence and performance based on your business requirements.

    Some kind of locking scheme is usually employed in order to achieve data coherence.
    I will list some replication and locking schemes I have experience with from Oracle 10g advanced replication, Oracle 10g RAC, Oracle TimesTen and Gigaspaces XAP 7.1.

    Before we go into details, let's suppose you have a distributed airline ticketing system (DATS hereafter) which has two databases: one in NY and the other in LA. Depending on your replication scheme, data can be updated either at one site only and replicated to the other, or at both sites and replicated to each other.
    Further suppose the following actions take place in time sequence:
    1. There is only one ticket left in the database. Because both local database replicas are synchronized at this time, the only ticket can be taken by either NY or LA;
    2. A NY customer bought this ticket. This action was updated in the local NY database and will be replicated to LA somehow depending on the replication scheme;
    3. Depending on your replication scheme, the LA database can show the same ticket as either still available or already taken by a NY customer. If the same ticket still appears available in the LA database, it can be sold to a LA customer. This would create an oversold situation.
    1. Synchronous Replication Using Distributed Locks and Local Transactions

    Oracle RAC (formerly OPS in version 8i and prior) allows the same physical database to be accessed at multiple instance sites. In order to allow users to read and write any data at any time at any instance site, Oracle RAC uses "Cache Fusion" to ensure data coherence.

    "Cache Fusion" basically uses synchronous replication with a distributed locking manager (DLM). DLM acts as a distributed lock (DL) coordinator among other functions so that the same resource - e.g. a table row - can only be granted to one instance site for changing at a time while other sites have to wait.

    DLM also acts as a global resource directory. For example, when instance site 1 updates a row, it doesn't need to actively push the new version of the data to all other instance sites. When instance site 2 later requests the same row, DLM can tell it to get the latest version from instance site 1.
    Also, instance site 1 doesn't need to use any distributed transaction, thanks to DLM and the fact that there is still only one physical database (so far I haven't seen any synchronous replication that uses both distributed locks and distributed transactions).

    Benefits include a very high degree of data coherence and load balancing for both reads and writes.
    Drawbacks include poor write performance and the requirement of a high-speed interconnect due to distributed locks.
    Distributed locks usually consist of quite a few daemon processes and data structures at each site, whose coordination performs poorly over a low-speed interconnect such as a LAN or WAN. For Oracle "Cache Fusion", distributed locks are implemented with the Global Cache Service (GCS), Global Enqueue Service (GES) and Global Resource Directory (GRD).

    If we apply this scheme to DATS, assuming the poor interconnect performance is tolerable, step 3 has to wait for the DL to be released by step 2. When step 3 gets the DL, the same ticket will show as already taken by step 2.
    (In order to have good performance, most multi-tiered applications use optimistic locking, which can create the lost-update problem. For example, if we use optimistic locking in both databases in DATS, the application tier in step 3 can first read the LA database before step 2 and then sell the same ticket to a LA customer after step 2.
    The application must use "optimistic locking with version checking" to fix this issue. One form of version checking is simply a version number that increases whenever there is any corresponding data change.
    Suppose the version is 0 at step 1. Step 2 updates it to 1. The version found by the application-tier read at step 3 is also 0. But when the application tier tries to sell the same ticket, it will fail because it will find the version has changed to 1 from its cached value of 0.
    All our arguments assume "optimistic locking with version checking".)

    2. Synchronous Replication Using Local Locks and Distributed Transactions
    Oracle's multimaster replication (also called peer-to-peer or n-way) has two data coherence protocols.
    One of them is synchronous replication that applies any changes or executes any replicated procedures at all sites participating in the replication environment as part of a single distributed transaction. If the DML statement or procedure fails at any site, then the entire transaction rolls back.

    The distributed transaction ensures data consistency at all sites in real-time. However it doesn't use any distributed locking. Instead it only uses local locks in the participant local transaction.
    This is the case when an application performs a synchronous update to a replicated table. Oracle first locks the local row and then uses an after row trigger to lock the corresponding remote row. Oracle releases the locks when the transaction commits at each site.

    Benefits include a high degree of data coherence, simple implementation, easy administration, and suitability for both high-speed interconnects and low-speed LANs and WANs (implementing distributed locks over low-speed LANs and WANs is much harder than using local locks in such environments).
    Drawbacks include possible deadlock due to the temporal behavior of local and remote locking, high availability requirements on the network, and poor write performance.

    If we apply this scheme to DATS, assuming the poor interconnect performance is tolerable, step 3 has to wait for step 2 to release the local and remote locks. When step 3 gets the locks, the same ticket will show as already taken by step 2.

    3. Synchronous Replication Using Local Locks and Local Transactions
    TimesTen's unidirectional active-standby pair configuration only uses so-called "return twosafe
    replication". It provides fully synchronous replication between the master (the active site) and the subscriber (the standby site).
    No distributed transaction or distributed lock is involved. Instead only local transactions and local locks are used. Specifically, the local transaction on the subscriber is committed before the one on the master. If the subscriber can't commit, the master will not commit either.
    At any time, only the active site can be updated, which greatly simplifies data updating (otherwise using only local locks and local transactions would be insufficient) and ensures fast fail-over to the standby site in case of active site failure.

    This scheme has similar benefits and drawbacks to the previous scheme in section 2.
    However its performance is even better thanks to avoiding the two-phase commit (2PC) required by a distributed transaction. It also eliminates deadlock because only the active site allows updates.
    Although the standby site seems to be a waste of capacity, you can collocate it with another active site as shown in Figure 1 (by another I mean it holds different data from the collocated standby site).
    This scheme has a lower degree of data coherence due to the inconsistency that results when the master fails to commit even though the subscriber has successfully committed (the root cause, as you can guess, is that it doesn't use distributed transactions. But you should also know that data inconsistency can still result from a failure in the second "commit" phase of a 2PC process).

    TimesTen's practice in this scenario is consistent with its configurations for high performance in other areas, such as asynchronous logging and data caching with a write-behind policy.

    Gigaspaces IMDG has a very similar topology called primary-backup replication. The only difference is that it uses distributed transactions instead of local transactions only. So it has a higher degree of data coherence than TimesTen.
    Another advantage is that fail-over in Gigaspaces IMDG happens transparently to end users, while TimesTen customers need to resort to some third-party cluster manager or custom software.



    If we apply this scheme to DATS, either the NY or the LA site will be the active site, and the other, standby site has to connect to the active one for data updating (in reality, active-standby is often used with a high-speed interconnect). The local lock at the active site prevents the oversold situation.

    This scheme, along with data partitioning as shown in Figure 1, is strongly recommended over the two previous synchronous schemes if it can meet your business requirements.
    Although the two previous synchronous schemes allow updating anywhere, updating the same resource entails costly lock coordination over the network. Scalable updating is usually achieved by data partitioning.
    Although the two previous synchronous schemes allow distributed and scalable reads, you can fine-tune your partitions to allow more concurrent reads.

    Figure 1: The Primary-Backup Partition in Gigaspaces


    4. Asynchronous Replication with Update Anywhere
    Another data coherence protocol in Oracle's multimaster replication is asynchronous replication that allows users to update data at any participant site.
    This scheme is also used in Oracle's updatable materialized view replication and TimesTen's bidirectional master-subscriber replication for general distributed workloads.

    With this scheme, the data changes at one site are committed locally and queued for propagation to the other sites. The queued changes are propagated in batches in a separate transaction. So it doesn't use any distributed lock or distributed transaction. Instead it only uses the local locks required by the corresponding local transaction.

    Benefits include great read and write performance, easy implementation, simple administration, suitability for low-speed interconnections such as LANs and WANs, and support for disconnected updating.
    Drawbacks include limited degree of data coherence depending on data refresh frequency, and possible data change conflicts.
    Because there is no distributed lock or distributed transaction involved, a replication conflict occurs if two transactions originating from different sites update the same row at nearly the same time (when the queued changes are propagated to the other site, the other site will have two versions of data changes. So which one should take place?).

    A conflict resolution method must be provided to resolve the data inconsistency. Both Oracle and TimesTen have a prebuilt "latest timestamp" resolution method that makes the change with the latest timestamp the winner. Oracle also allows you to customize a resolution method based on your business requirements.

    This scheme can't be applied to DATS if oversold situations are not allowed, because the changes at the NY and LA sites can be committed independently in two different transactions, resulting in the same ticket being sold to two customers.
    If occasional oversold situations are allowed, the NY and LA sites can sell tickets at different times thanks to the three-hour time zone difference. If a replication conflict does occur, relevant information should be recorded in the database, based on which your front-end application takes proper actions (in reality a reservation system like DATS doesn't use this scheme).

    5. Asynchronous Replication with Update on Master Site only

    This scheme is used in Oracle's read-only materialized view replication, TimesTen's unidirectional master-subscriber replication and Gigaspaces IMDG's master-local replication.

    This scheme basically has similar benefits and drawbacks to the previous scheme in section 4. However because it only allows updates at the master, it eliminates the notorious replication conflicts, which most of the time proves to be a very sound design in an asynchronous replication environment.

    If we apply this scheme to DATS and suppose NY is the master site (or a third site is the master), NY has to wait if LA first gets the local lock at the master site. The local lock in the master site prevents the oversold situation.

    Life is much easier using Gigaspaces IMDG's master-local topology as shown in figure 2 because it automatically delegates your local cache updating to the master which propagates the same updating to other local caches. Gigaspaces IMDG also supports optimistic locking with versioning.
    You must do both by yourself if you use Oracle's read-only materialized view replication and TimesTen's unidirectional master-subscriber replication.

    Figure 2: Gigaspaces Master-Local Topology where Master can be Figure 1

    Wednesday, November 3, 2010

    Duplicate Global Variable Definitions in C?

    If you are still a novice C developer, a common, however bad, practice for using a global variable is shown as follows:

    /* global.h */
    int globalInt;

    /* file1.c */
    #include "global.h"
    void foo() {
       globalInt = 1;
    }

    /* file2.c */
    #include "global.h"
    #include "stdio.h"

    void bar() {
      printf("the globalInt is %d", globalInt);
    }                                                                                                                                                   

    Neither Visual C++ 2008 nor GCC 4 complains when compiling and linking the program as C.

    However you will get a "duplicate definition" error if you try to link the same program as C++, with either VC++ or GCC.

    After I did some research and found the reason, I felt embarrassed, because I had always thought C was just too easy.

    In order to understand the reason, you need to know about variable declaration, variable definition, tentative definition and "multiple external definitions" in C99.

    When you define a variable, you are telling the compiler to allocate memory for that variable, and possibly also to initialize its contents to some value.
    For example int i = 100;

    When you declare a variable, you are telling the compiler that the variable was defined elsewhere. You are just telling the compiler that a variable by that name and type exists, but the compiler should not allocate memory for it since it is done somewhere else.
    For example extern int i;

    Here are two important rules you probably already know:
    A variable must be defined once in one of the modules of the program. If there is no definition or more than one, an error is produced, possibly in the linking stage.

    A variable may be declared many times, as long as the declarations are consistent with each other and with the definition. It may be declared in many modules, including the module where it was defined, and even many times in the same module. But it is usually pointless to declare it more than once in a module.
    Since the "globalInt" in our example is outside of any function, it has the "extern" linkage. so is the "int globalInt" a definition or declaration?
    Based on C99, it is called "tentative definition" in Section 6.9.2 as stated below:

    A declaration of an identifier for an object that has file scope without an initializer, and without a storage-class specifier or with the storage-class specifier static, constitutes a tentative definition. If a translation unit contains one or more tentative definitions for an identifier, and the translation unit contains no external definition for that identifier, then the behavior is exactly as if the translation unit contains a file scope declaration of that identifier, with the composite type as of the end of the translation unit, with an initializer equal to 0.
    Based on the above C99 standard, both file1.c and file2.c should have the following definition after being compiled:
    int globalInt = 0;

    So if both VC++ and GCC strictly followed the C99 standard, the linking step should have failed due to duplicate variable definition.
    However C99 does have Annex J.5.11, as follows:
    Multiple external definitions 
    There may be more than one external definition for the identifier of an object, with or without the explicit use of the keyword extern; if the definitions disagree, or more than one is initialized, the behavior is undefined (6.9.2).
    So obviously both VC++ and GCC allow consistent multiple external definitions by extending the "tentative definition" from the original file scope to the entire program scope.

    Now for the last puzzle: why did C++ complain?
    Because C++ treats a C99 "tentative definition" as a normal variable definition, you of course get the duplicate definition error.
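    The portable fix, which both C and C++ accept, is to put only an extern declaration in the header and exactly one definition in a single source file, for example:

    /* global.h */
    extern int globalInt;      /* declaration only: no storage is allocated here */

    /* file1.c */
    #include "global.h"
    int globalInt = 0;         /* the one and only definition */
    void foo() {
       globalInt = 1;
    }

    /* file2.c */
    #include "global.h"
    #include <stdio.h>

    void bar() {
      printf("the globalInt is %d\n", globalInt);
    }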

    Tuesday, November 2, 2010

    How to effectively Manage Floating-Point Keys in C++ Map (or Set)?

    Because computers can't store floating-point values exactly, as you know, you should try to avoid equality comparisons between them, or use some approach like the "relative error" mentioned in this resource.

    I have a table of data where both the keys and the mapped values are floats. Although I can provide my own Compare logic using the above "relative error", this Compare template parameter is only used for "strict weak ordering". In other words, it only tells whether one floating-point key is "less than" (<) another so that the Map can order keys in ascending order.

    For example, if Compare says the predicate "float1 < float2" is false, it means either float1 == float2 or float1 > float2.
    But the C++ Map doesn't allow you to customize an equal or a greater-than logic. This is especially troubling when you want to use the Map's find() method with floating-point keys: find() considers two keys equivalent only when neither is "less than" the other, which with the ordinary less-than comparison amounts to the computer's intrinsic floating-point equality and is not reliable, as you know.

    Fortunately, in my case I know the maximal precision and scale of all my floating-point keys. So I first safely map each floating-point key to an integer and then use the corresponding integer keys in the map.
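    As a sketch of that idea (assuming, say, at most two decimal places of precision so that scaling by 100 is exact enough; toKey is just a hypothetical helper name):

    #include <map>
    #include <cstdio>

    /* Scale a float with a known maximal scale (2 decimal places here)
       to an exact integer key. */
    long long toKey(float value)
    {
        double scaled = (double)value * 100.0;
        /* round half away from zero */
        return (long long)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
    }

    int main()
    {
        std::map<long long, float> table;    /* integer keys, float mapped values */
        table[toKey(3.14f)] = 42.0f;

        /* find() now relies on exact integer comparison, so it is reliable */
        std::map<long long, float>::const_iterator it = table.find(toKey(3.14f));
        if (it != table.end())
            printf("found: %f\n", it->second);
        return 0;
    }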

    Monday, October 25, 2010

    Is the "ORDER BY" in HQL and EJB-QL based on lexicography or dictionary?

    This topic is inspired by Dominik Dunz's comment on my Hibernate tuning article "Revving up Your Hibernate Engine" on InfoQ.

    In Java, you sort strings lexicographically (based on the underlying character's encoding values) using class String's compareTo() method.
    You can also sort strings based on a locale's dictionary using class Collator's compare() method.

    So now you should ask which sorting the "order by" I wrote in HQL or EJB-QL supports.
    Currently there is no QL syntax for you to specify either a lexicographic or a dictionary order. So both HQL and EJB-QL just pass the "order by" clause literally to the back-end database. It is your database session that decides the sorting.

    In the case of Oracle, it also supports sorting lexicographically ("binary" in Oracle's terms) or by dictionary ("linguistic" in Oracle's terms).
    Your Oracle session decides the sorting. Specifically, if the session's NLS_COMP is "binary", it sorts lexicographically (based on the strings' underlying encoding values).
    If the session's NLS_COMP is "linguistic", it sorts based on the dictionary of the locale you specified in NLS_SORT.

    If you use Oracle's JDBC thin driver in an application server, the application server's JVM decides the values of NLS_COMP and NLS_SORT.

    You can always do sorting in your application tier based on your business logic instead of relying on your database. But your application tier sort probably will be slower than your DB sorting.
    However there are many complications if you want to use your back-end database sorting to implement your business logic sorting.
    1. Your database may only support lexicographical sorting;
    2. Even though lexicographical sorting is much simpler than dictionary sorting, your database session's character encoding may not be Java strings' Unicode. However, it may not be a big deal to change your DB's character encoding to Java strings' Unicode or to a subset of it.
      You also need to make sure that your DB's lexicographical sorting is the same as Java's. In the case of Oracle, it basically has the same lexicographical sorting as Java String's compareTo() method.
    3. Java's linguistic sorting may not be the same as your DB's. You need to carefully examine the documentation from both Java and your DB.
      You can find Oracle's linguistic sorting logic from this link.
    4. It becomes even more complicated if you have a three-tier architecture where the front-end UI (either Swing or a browser) decides the sorting logic, because the same back-end database session can be shared by different front-end user sessions.
      You would have to change your DB session's sorting whenever the front-end UI changes it.

    Thursday, October 21, 2010

    Commercial-Off-The-Shelf Enterprise Real Time Computing (COTS ERTC) -- Part 3: Operating System (OS) Requirements

    Since the OS sits between the hardware and the Java programming language, it provides your COTS ERTC applications with more enabling functions. The latency and jitter requirements on the OS are also much stricter than those on the hardware.

    Again, general purpose OS's such as Windows NT, Linux and Solaris are designed primarily for high throughput at the cost of poor latency. Generally there are two trends for an OS to support COTS ERTC.
    One is to stick with the general purpose OS and rely on fine tuning, through which SRT is usually achievable. These OS's keep the high-throughput design goal. Windows NT belongs to this category (although there are indeed quite a few commercial efforts to extend NT with RTC functions, they are proprietary and hence don't belong to COTS ERTC).

    The other trend is to add RTC functions to the OS so that both throughput and RTC workloads can be handled, and even HRT can be implemented at the cost of slightly lower throughput.
    More and more RTC features have been added to the Linux mainline kernel since version 2.6 through an RT patch (we will hereafter use "Stock Linux" for general purpose Linux and "Linux RT" for Linux with the RT patch). Red Hat Enterprise MRG and SUSE Linux Enterprise Real Time (SLERT) are representative.
    Oracle / Sun has also had significant RTC features in Solaris itself for a long time.
    Actually both Linux and Solaris are POSIX compliant, including the real-time and thread extensions, so they both belong to this category (there are also many other efforts to extend stock Linux with RTC functions, such as RTLinux and RTAI. They more or less take a dual-kernel approach. Because they are proprietary and never got into the mainline Linux kernel, they don't belong to COTS ERTC).

    1. Preemptable Kernel
    Multi-threaded programs have been familiar to programmers for a long time. However, an OS fully preempting user-space threads doesn't necessarily mean its kernel is also fully preemptable. Actually, different OS's provide different degrees of preemption. Obviously, a low degree of preemption means high latency and jitter.

    Figure 2 in part 2 shows the OS scheduler takes "interval response time" to preempt an interrupted thread (usually a low-priority thread) with a preempting thread (usually a high-priority thread). The shorter the interval response time, the more preemptable.

    Whenever the processor receives an interrupt, it calls an interrupt handler, a.k.a. an interrupt service routine (ISR) to service the interrupt.
    Preemption latency is the time needed for the scheduler to determine which thread should run and the time for the new thread to be dispatched.
    Context switch is when the kernel saves the state of the interrupted thread or process, loads the context of the preempting thread or process, and begins execution.
    I will focus on ISR and preemption latency because different OS's employ different strategies.

    1.1 ISR
    On Linux RT and Windows NT, ISR is divided into two parts: the First-Level Interrupt Handler (FLIH) (or Upper Half on Linux RT) and the Second-Level Interrupt Handler (SLIH) (or Lower Half or Bottom Half on Linux RT; Deferred Procedure Call (DPC) on Windows NT).
    FLIH quickly services the interrupt or records platform-specific critical information which is only available at the time of the interrupt, and schedules the execution of SLIH for further long-lived interrupt handling.
    Because FLIH typically masks interrupts at the same or lower level until it completes, it affects preemption and causes jitter. So to reduce jitter and to reduce the potential for losing data from masked interrupts, the OS should minimize the execution time of FLIH, moving as much work as possible to SLIH.

    SLIH asynchronously completes long interrupt processing tasks in a kernel thread scheduled by FLIH. Because it is implemented in a thread, the user can assign a priority to it and the scheduler can dispatch it along with other threads.
    For example, if your RT application thread has a higher priority than the SLIH, only the FLIH interrupts your RT application thread, and the SLIH will not run until your RT application thread is done.
    Because the ISR in Figure 2 then effectively represents only the FLIH, the whole interval response time is cut short.

    On Solaris the ISR is a single whole implemented in a kernel thread. Because such a thread has a higher priority than all non-ISR threads, including RT ones, it makes the kernel less preemptable and causes much larger jitter for your RT application threads than the previous approach.

    Windows NT has additional jitter caused by DPCs being scheduled in a FIFO queue: if your high-priority DPC is queued behind a low-priority one, the high-priority DPC can't execute until the earlier low-priority one is done.

    1.2 Preemption Latency
    Traditionally, when a low-priority thread calls a kernel function through a system call, it can't be preempted even by a high-priority thread until the system call returns. This is again primarily a high-throughput consideration (the more interrupts, the more overhead and the lower the throughput).
    This is the situation for stock Linux kernel 2.5 or earlier, which has many lengthy kernel code paths protected by spin locks or even by the so-called Big Kernel Lock (the BKL is basically a kernel-wide or global lock).

    Changing BKL to localized spin locks is the first step toward preemption. But a spin lock is typically not preemptable because if it is preempted, the preempting thread can also try to spin-lock the same resource, which causes deadlock.

    One way to make the kernel more preemptable is to break a lengthy code path into a number of shorter code paths with preemption points created between them, which is what stock Linux kernel 2.6 and later has enabled. At best, SRT can be achieved in this case.

    The extreme approach is to convert the spin locks to sleeping mutexes so that your kernel code is preemptable at almost any point, which is what Linux RT has enabled. HRT needs this capability.

    However, because Linux should be able to handle both throughput and RTC workloads, a better and more practical approach may be to use adaptive locks, which behave as spin locks for short-running code paths and as mutexes for long-running code paths, based on runtime statistics.
    Actually SLERT 11 provides such adaptive locks.

    Windows NT has been fully preemptable from the very beginning.

    2. Priority-Based Scheduling
    The scheduler in a general purpose OS is designed to maximize overall throughput and to assure fairness for all time-share threads / processes. To provide equitable behavior and ensure all time-share threads / processes eventually get to run, the scheduler adjusts thread priorities dynamically: the priorities of resource-intensive threads are lowered automatically while the priorities of IO-intensive threads are boosted automatically. In other words, even if you initially assign a high priority level to a time-share thread, it will not starve other threads.

    This is not desirable for RT threads, which always need to run before any lower-priority thread in order to minimize latency, at the cost of lower throughput for the other threads.
    Besides the traditional time-slice (TS) and dynamic-priority threads, Windows NT, Solaris and stock Linux all provide RT threads, which have fixed priorities and always run before TS and other lower-priority threads.
    In other words, the scheduler will not adjust those RT threads' priorities, and they will not be preempted by TS or other lower-priority threads unless they wait, sleep or yield.

    Both stock Linux and Solaris provide two scheduling policies for RT threads. One is Round-Robin, which is similar to TS thread scheduling; the other is FIFO, where an earlier RT thread runs to completion before a later RT thread of the same priority level.
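    As a concrete illustration, here is a minimal sketch of my own (POSIX API on Linux / Solaris) that moves the current thread into the FIFO RT policy; the priority value 50 is arbitrary and the call usually requires root or CAP_SYS_NICE:

    // Link with -lpthread.
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    int main() {
        sched_param sp{};
        sp.sched_priority = 50;     // hypothetical fixed RT priority; not adjusted by the scheduler
        // SCHED_FIFO: runs until it blocks, yields, or a higher-priority RT thread preempts it
        int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
        if (rc != 0) {
            std::fprintf(stderr, "pthread_setschedparam failed: %d\n", rc);  // typically EPERM without privileges
            return 1;
        }
        // ... time-critical work runs here ...
        return 0;
    }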

    The priority level range for RT threads can't be too small; otherwise your RT thread scheduling flexibility will be severely constrained.
    Windows NT includes 32 priority levels, of which 16 are reserved for the operating system and real-time processes. This range is really too tight.

    Stock Linux RT priority class provides 99 fixed priority levels ranging from 1 to 99 (0 is left for non-RT threads).
    The following RT thread priority mapping table was extracted from Red Hat Enterprise MRG tuning guide:
    Priority   Threads                        Description
    1          Low priority kernel threads    Priority 1 is usually reserved for those tasks that need to be just above SCHED_OTHER
    2 - 69     Available for use              Range used for typical application priorities
    70 - 79    Soft IRQs
    80         NFS                            RPC, Locking and Authentication threads for NFS
    81 - 89    Hard IRQs                      Dedicated interrupt processing threads for each IRQ in the system
    90 - 98    Available for use              For use only by very high priority application threads
    99         Watchdogs and migration        System threads that must run at the highest priority

    Although an important feature of RT thread scheduling is the ability to schedule your RT application threads higher than kernel threads, doing so can cause the system to hang and other unpredictable behavior, such as blocked network traffic and blocked swapping, if crucial kernel threads are prevented from running as needed (now you should have a better feel for how your RT thread gets scheduled at the cost of lower overall system throughput).
    So if your RT application threads run higher than kernel threads, make sure they don't run away, and you should also leave some time for the kernel threads.
    For example, make sure your RT thread doesn't run too long, or that it runs periodically based on an RT timer, or that it is driven by external periodic RT events, or that you have multiple CPUs at least one of which is dedicated to kernel threads.

    3. Priority Inheritance
    Priority Inversion occurs when a high-priority thread blocks on a resource that is held by a low-priority thread, and a medium-priority thread preempts the low-priority thread and runs before the high-priority thread, which causes jitter for the high-priority thread.
    Priority inheritance fixes the priority inversion problem by temporarily letting the low-priority thread inherit the priority of the high-priority thread, so that the formerly low-priority thread can continue running to completion without being preempted by the medium-priority thread. The inheriting thread restores its original low priority when it releases the lock.

    Both Solaris and Linux RT support priority inheritance. Unfortunately Windows NT doesn't support it.
    If possible, try to avoid having a high-priority thread share the same resource as a low-priority thread. Obviously this is even more important on Windows NT.
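    On POSIX systems that support it (Solaris, Linux RT), a sketch of requesting priority inheritance on a mutex could look like this (my own illustration):

    // Link with -lpthread.
    #include <pthread.h>

    int main() {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        // PTHREAD_PRIO_INHERIT enables priority inheritance for this mutex
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);

        pthread_mutex_t lock;
        pthread_mutex_init(&lock, &attr);

        pthread_mutex_lock(&lock);
        // ... shared resource is used here; a blocked high-priority waiter lends its
        // priority to the current holder until pthread_mutex_unlock() ...
        pthread_mutex_unlock(&lock);

        pthread_mutex_destroy(&lock);
        pthread_mutexattr_destroy(&attr);
        return 0;
    }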

    4. High-Resolution Timers
    Section 1.5 in part 2 mentioned the need for high-resolution timers which are backed by high-resolution clocks on most modern hardware. The OS just takes advantage of hardware timers by providing you with different system calls for high-resolution timers besides the traditional system call for regular timers.

    For example, both Solaris and Linux support the system calls timer_create and timer_settime with clock type CLOCK_HIGHRES on Solaris, or CLOCK_REALTIME / CLOCK_MONOTONIC on Linux (where you need to enable the kernel parameter CONFIG_HIGH_RES_TIMERS, available on 2.6.21 and later on x86), to access high-resolution timers.
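    A minimal Linux-flavored sketch of my own (the 250-microsecond period is an arbitrary example) of creating such a periodic high-resolution timer:

    // Link with -lrt on older glibc versions.
    #include <signal.h>
    #include <time.h>
    #include <cstdio>

    int main() {
        sigevent sev{};
        sev.sigev_notify = SIGEV_SIGNAL;            // deliver a signal on every expiration
        sev.sigev_signo  = SIGALRM;

        timer_t timer_id;
        if (timer_create(CLOCK_MONOTONIC, &sev, &timer_id) != 0) {  // CLOCK_HIGHRES on Solaris
            std::perror("timer_create");
            return 1;
        }

        itimerspec spec{};
        spec.it_value.tv_nsec    = 250000;          // first expiration after 250 us
        spec.it_interval.tv_nsec = 250000;          // then every 250 us
        if (timer_settime(timer_id, 0, &spec, nullptr) != 0) {
            std::perror("timer_settime");
            return 1;
        }

        // ... a real program would install a SIGALRM handler or sigwait() for it here ...
        timer_delete(timer_id);
        return 0;
    }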

    5. CPU Shielding
    Windows NT, Solaris and stock Linux all support CPU shielding which allows you to bind different processors / cores to different interrupts and threads including both kernel and user space ones. The bound CPU is shielded from unbound interrupts and threads.

    For example, you can bind your high-priority application thread to one CPU while the other CPUs take care of the remaining threads, including kernel threads, and the interrupts, including NMI and SMI, so that you can be confident your high-priority application thread has low latency and is very predictable.
    This means more to Solaris, because its ISR is implemented in a thread whose priority is higher than any non-ISR thread, including your RT application thread.
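    On Linux, a minimal sketch of my own of pinning the calling thread to one CPU (CPU 3 is an arbitrary example) looks like this; note that interrupts still have to be steered away from that CPU separately:

    // Compile with g++ on Linux (g++ defines _GNU_SOURCE, which cpu_set_t needs).
    #include <sched.h>
    #include <cstdio>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);                                     // allow this thread to run only on CPU 3
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {   // pid 0 = the calling thread
            std::perror("sched_setaffinity");
            return 1;
        }
        // ... latency-sensitive work now stays on CPU 3; interrupt affinity is
        // configured separately (e.g. via /proc/irq/*/smp_affinity) ...
        return 0;
    }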

    6. Others
    6.1 Memory Pinning
    Windows NT, Solaris and stock Linux all allow you to pin your high-priority thread's memory to physical memory so it can't be swapped out to high-latency disks.
    Due to their mechanical nature, disk IO access latency is in milliseconds, which is several orders of magnitude slower than memory access. So OS swapping is a major contributor to latency.
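    On Linux and Solaris, a minimal sketch of my own of pinning the whole process's address space in RAM (which needs sufficient privileges or a high enough RLIMIT_MEMLOCK):

    #include <sys/mman.h>
    #include <cstdio>

    int main() {
        // MCL_CURRENT pins pages already mapped; MCL_FUTURE pins future allocations too.
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            std::perror("mlockall");
            return 1;
        }
        // ... RT work runs without the risk of its pages being swapped out ...
        munlockall();
        return 0;
    }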

    6.2 Early Binding
    The late binding of dynamic libraries by the OS can add unpredictable jitter to your RT application thread. To avoid this jitter, both Linux and Solaris provide early binding of dynamic libraries through the environment variable LD_BIND_NOW.
    Windows NT doesn't seem to support such early binding. To counter this, you can warm up your application (hereafter, "warm-up" means either the program's start-up phase or an initialization phase before the time-critical execution) before asking it to execute time-critical code.

    6.3 Locking Improvement
    Stock Linux uses the so-called "futex" to avoid system calls for un-contended locks. Solaris uses a similar mechanism called an "adaptive lock".

    7. COTS ERTC Scenarios with the OS
    Even if an OS provides both throughput and RTC functions, the RTC functions come at the cost of a slight throughput degradation. Actually, many observations show only a minority of workloads truly need tight HRT. Accordingly, you should always first try the OS without the RTC functions enabled.

    For example, on Windows NT and stock Linux, if your low-latency requirements can be met through tunings such as RT threads, CPU shielding, memory pinning, priority inversion avoidance, HR timers, application warm-up, early binding, and the preemptable kernel configuration on stock Linux, don't bother with Linux RT. Actually, many SRT requirements can be met using Windows NT or stock Linux.

    If you need high predictability or tight HRT, you have to use Linux RT such as MRG and SLERT, or Solaris.

    Thursday, October 14, 2010

    How to use 2 or more data sources in Hibernate along with Spring's Declarative Transaction Management?

    You may quickly respond "just use Spring's JtaTransactionManager".
    But wait. Before deciding to use JTA, you should make sure that local transactions really don't meet your requirements, because JTA requires many more resources and is much slower than local transactions.
    Even if you have 2 or more data sources, you don't need to use JTA in the following cases:
    • No business method has to access more than 1 data source;
    • Even if your business method has to access more than 1 data source, you can still use a technique similar to “Last Resource Commit Optimization” with local transactions if you can tolerate occasional data inconsistency. 
    Here is the example in my "Revving up Your Hibernate Engine": 

    Our application has several service layer methods which only deal with database “A” in most instances; however occasionally they also retrieve read-only data from database “B”. Because database “B” only provides read-only data, we still use local transactions on both databases for those methods.
    The service layer does have one method involving data changes on both databases. Here is the pseudo-code:
    //Make sure a local transaction on database A exists
    @Transactional (readOnly=false, propagation=Propagation.REQUIRED)
    public void saveIsoBids() {
      //it participates in the above annotated local transaction
      insertBidsInDatabaseA();
      //it runs in its own local transaction on database B
      insertBidRequestsInDatabaseB(); //must be the last operation
    }

    Because insertBidRequestsInDatabaseB() is the last operation in saveIsoBids (), only the following scenario can cause data inconsistency:
    The local transaction on database “A” fails to commit when the execution returns from saveIsoBids ().
    However even if you use JTA for saveIsoBids (), you still get data inconsistency when the second commit phase fails in the two phase commit (2PC) process. So if you can deal with the above data inconsistency and really don’t want JTA complexities for just one or a few methods, you should use local transactions. 

    Now suppose you will use local transactions, i.e. Spring's HibernateTransactionManager. In your context XML, you define the needed transaction manager bean:
      <tx:annotation-driven transaction-manager="txManager1"/>
      <bean id="txManager1"
     class="org.springframework.orm.hibernate3.HibernateTransactionManager">
        <property name="sessionFactory" ref="sessionFactory1"/>
      </bean>

      <bean id="sessionFactory1"
     class="org.springframework.orm.hibernate3.annotation.AnnotationSessionFactoryBean">
        <property name="dataSource" ref="dataSource1"/>
      </bean>

    The above XML configures just one data source along with its related Hibernate session factory and transaction manager.
    The annotation-driven element above only allows you to specify one transaction manager (it has to work this way; otherwise, which transaction manager would an annotated service method use?) and it is global.
    So what happens if you specify another transaction manager in a different context XML:
      <tx:annotation-driven transaction-manager="txManager2"/>
      <bean id="txManager1"

    Based on my testing, the result is unexpected.

    The solution is that only one transaction manager can be used with Spring's annotation-driven declarative transactions; the other transaction managers must be configured through Spring's XML-based declarative transactions:

    <tx:advice id="txAdvice" transaction-manager="txManager2">
      <tx:attributes>
        <!-- all other methods starting with 'get' are read-only -->
        <tx:method name="get*" read-only="true"/>

        <!-- other methods use the default transaction settings (see below) -->
        <tx:method name="*"/>
      </tx:attributes>
    </tx:advice>

    <bean id="txManager2"
     class="org.springframework.orm.hibernate3.HibernateTransactionManager">
      <property name="sessionFactory" ref="sessionFactory2"/>
    </bean>

    <aop:config>
      <aop:pointcut id="serviceOperation" expression="...ignored"/>
      <aop:advisor advice-ref="txAdvice" pointcut-ref="serviceOperation"/>
    </aop:config>