Thursday, October 21, 2010

Commercial-Off-The-Shelf Enterprise Real Time Computing (COTS ERTC) -- Part 3: Operating System (OS) Requirements

Because the OS sits between the hardware and the Java programming language, it provides your COTS ERTC applications with more enabling functions. The latency and jitter requirements placed on the OS are also much more demanding than those placed on the hardware.

Again, general-purpose OSes such as Windows NT, Linux and Solaris are designed primarily for high throughput, at the cost of poor latency. Generally there are two trends in how an OS supports COTS ERTC.
One is to stick with the general-purpose OS and rely on fine tuning, through which SRT is usually achievable. Such OSes keep high throughput as their design goal. Windows NT belongs to this category (although there are quite a few commercial efforts to extend NT with RTC functions, they are proprietary and hence don't belong to COTS ERTC).

The other trend is to add RTC functions to the OS so that both throughput and RTC workloads can be handled, and even HRT can be implemented, at the cost of slightly lower throughput.
More and more RTC features have been added to the Linux mainline kernel since version 2.6 through an RT patch (hereafter we use "Stock Linux" for general-purpose Linux and "Linux RT" for Linux with the RT patch). Red Hat Enterprise MRG and SUSE Linux Enterprise Real Time (SLERT) are representative.
Oracle / Sun has also long had significant RTC features in Solaris itself.
Both Linux and Solaris are POSIX compliant, including the real-time and thread extensions, so they both belong to this category (there are also many other efforts to extend stock Linux with RTC functions, such as RTLinux and RTAI. They more or less take a dual-kernel approach. Because they are proprietary and never got into the mainline Linux kernel, they don't belong to COTS ERTC).

1. Preemptable Kernel
Multi-threaded programs have long been familiar to programmers. However, an OS that fully preempts user-space threads doesn't necessarily have a fully preemptable kernel. Different OSes actually provide different degrees of preemption, and a low degree of preemption obviously means high latency and jitter.

Figure 2 in part 2 shows that the OS scheduler takes an "interval response time" to preempt an interrupted thread (usually a low-priority thread) with a preempting thread (usually a high-priority thread). The shorter the interval response time, the more preemptable the OS.

Whenever the processor receives an interrupt, it calls an interrupt handler, a.k.a. an interrupt service routine (ISR), to service the interrupt.
Preemption latency is the time needed for the scheduler to determine which thread should run plus the time for the new thread to be dispatched.
In a context switch, the kernel saves the state of the interrupted thread or process, loads the context of the preempting thread or process, and begins execution.
I will focus on ISR and preemption latency because this is where different OSes employ different strategies.

1.1 ISR
On Linux RT and Windows NT, ISR is divided into two parts: the First-Level Interrupt Handler (FLIH) (or Upper Half on Linux RT) and the Second-Level Interrupt Handler (SLIH) (or Lower Half or Bottom Half on Linux RT; Deferred Procedure Call (DPC) on Windows NT).
FLIH quickly services the interrupt or records platform-specific critical information which is only available at the time of the interrupt, and schedules the execution of SLIH for further long-lived interrupt handling.
Because the FLIH typically masks interrupts at the same or lower level until it completes, it affects preemption and causes jitter. So to reduce jitter and the risk of losing data from masked interrupts, the OS should minimize the execution time of the FLIH, moving as much work as possible to the SLIH.

The SLIH asynchronously completes long interrupt processing tasks in a kernel thread scheduled by the FLIH. Because it is implemented in a thread, the user can assign a priority to it and the scheduler can dispatch it along with other threads.
For example, if your RT application thread has a higher priority than the SLIH, only the FLIH interrupts your RT application thread, and the SLIH will not run until your RT application thread is done.
Because the ISR in Figure 2 effectively represents only the FLIH, the whole interval response time is shortened.

On Solaris, the ISR is not split; the whole handler runs in a kernel thread. Because such a thread has a higher priority than all non-ISR threads, including RT ones, it makes the kernel less preemptable and causes much larger jitter for your RT application threads than the previous approach.

Windows NT has additional jitter because DPCs are scheduled into a FIFO queue. If your high-priority DPC is queued behind a low-priority one, it can't be executed until the low-priority one ahead of it is done.
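
To make the cost of FIFO ordering concrete, here is a minimal simulation sketch. It is not Windows NT code: the work items, their priorities and their run times are made-up values, and fifo_wait / priority_wait are hypothetical helper names. It only compares how long a high-priority deferred item waits in a FIFO queue versus a priority-ordered queue.

```python
from collections import deque
import heapq

# Hypothetical deferred work items: (name, priority, run time in microseconds).
# A higher number means a higher priority. All values are made up for illustration.
ITEMS = [("disk_cleanup", 1, 400), ("net_rx", 2, 150), ("rt_sensor", 9, 50)]


def fifo_wait(items, target):
    """Time the target waits when items run strictly in arrival order."""
    waited = 0
    queue = deque(items)
    while queue:
        name, _prio, cost = queue.popleft()
        if name == target:
            return waited
        waited += cost
    raise ValueError(target)


def priority_wait(items, target):
    """Time the target waits when the highest-priority item always runs first."""
    # heapq is a min-heap, so the priority is negated; the index keeps arrival order for ties.
    heap = [(-prio, idx, name, cost) for idx, (name, prio, cost) in enumerate(items)]
    heapq.heapify(heap)
    waited = 0
    while heap:
        _neg, _idx, name, cost = heapq.heappop(heap)
        if name == target:
            return waited
        waited += cost
    raise ValueError(target)


print(fifo_wait(ITEMS, "rt_sensor"))      # 550: waits for both earlier items
print(priority_wait(ITEMS, "rt_sensor"))  # 0: runs first
```

In the FIFO case the wait of the high-priority item depends entirely on what happens to be queued ahead of it, which is exactly the kind of jitter described above.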

1.2 Preemption Latency
Traditionally, when a low-priority thread calls a kernel function through a system call, it can't be preempted, even by a high-priority thread, until the system call returns. This is again primarily due to throughput considerations (the more interruptions, the more overhead and the lower the throughput).
This is the situation for stock Linux kernel 2.5 and earlier, which have many lengthy kernel code paths protected by spin locks or even by the so-called Big Kernel Lock (BKL, basically a kernel-wide or global lock).

Changing the BKL to localized spin locks is the first step toward preemption. But code holding a spin lock is typically not preemptable, because if it were preempted, the preempting thread could try to spin on the same lock, which causes deadlock.

The next step toward a more preemptable kernel is to break a lengthy code path into a number of shorter code paths with preemption points between them, which is what stock Linux kernel 2.6 and later have enabled. At best, SRT can be achieved in this case.

The extreme approach is to convert all spin locks to sleeping mutexes so that kernel code is preemptable at any point, which is what Linux RT has enabled. HRT needs this capability.

However, because Linux should be able to handle both throughput and RTC workloads, a better and more practical approach may be to use adaptable locks, which behave as spin locks for short-running code paths and as mutexes for long-running code paths, based on statistics.
SLERT 11 actually provides such adaptable locks.
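
The idea of mixing spinning and sleeping can be sketched in user space. The following is not SLERT's implementation and does not use statistics. It is a simpler spin-then-block heuristic, and the class name SpinThenBlockLock and the spin_attempts parameter are hypothetical. A short wait is resolved by spinning, and a long wait falls back to sleeping on a mutex.

```python
import threading


class SpinThenBlockLock:
    """Spin briefly for the lock, then fall back to a blocking acquire."""

    def __init__(self, spin_attempts=100):
        self._lock = threading.Lock()
        self._spin_attempts = spin_attempts
        self.spin_hits = 0    # acquisitions that succeeded while spinning
        self.block_hits = 0   # acquisitions that had to sleep

    def acquire(self):
        # Spin phase: cheap when the holder releases the lock quickly.
        for _ in range(self._spin_attempts):
            if self._lock.acquire(blocking=False):
                self.spin_hits += 1
                return
        # Block phase: sleep until the lock is released.
        self._lock.acquire()
        self.block_hits += 1

    def release(self):
        self._lock.release()


lock = SpinThenBlockLock()
lock.acquire()
lock.release()
print(lock.spin_hits, lock.block_hits)  # an uncontended acquire never sleeps
```

The counters are only updated while the lock is held, so they need no extra synchronization in this sketch.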

Windows NT has been fully preemptable from the very beginning.

2. Priority-Based Scheduling
The scheduler in a general-purpose OS is designed to maximize overall throughput and to assure fairness for all time-share threads / processes. To provide equitable behavior and ensure that all time-share threads / processes eventually get executed, the scheduler adjusts thread priorities dynamically, so that the priorities of resource-intensive threads are lowered automatically while the priorities of IO-intensive threads are boosted automatically. In other words, even if you initially assign a high priority level to a time-share thread, it will not starve other threads.

This is not desirable for RT threads which always need to run before any low-priority thread in order to minimize latency at the cost of lower throughput of other threads.
Besides the traditional time-share (TS) threads with time slices and dynamic priorities, Windows NT, Solaris and stock Linux all provide RT threads, which have fixed priorities and always run before TS and other low-priority threads.
In other words, the scheduler will not adjust those RT threads' priority and they will not be preempted by TS or other lower-priority threads unless they wait, sleep or yield.

Both stock Linux and Solaris provide two scheduling policies for RT threads. One is Round-Robin, which is similar to TS thread scheduling; the other is FIFO, where an earlier RT thread runs to completion before a later RT thread at the same priority level.
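
As a minimal sketch, assuming a Linux host and Python's os module (a thin wrapper over the POSIX scheduling calls), the two RT policies and their fixed priority ranges can be inspected as below. The helper names rt_priority_range and try_set_fifo are hypothetical. Switching a process to an RT policy usually needs root or an equivalent privilege, so the second helper reports failure instead of raising.

```python
import os


def rt_priority_range(policy):
    """Return the (min, max) static priority the OS allows for a policy."""
    return os.sched_get_priority_min(policy), os.sched_get_priority_max(policy)


def try_set_fifo(priority):
    """Try to move the calling process to SCHED_FIFO; return True on success."""
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except OSError:
        # Typically a permission error for unprivileged users.
        return False


# The scheduling constants are missing on platforms without these calls.
if hasattr(os, "SCHED_FIFO"):
    for name in ("SCHED_FIFO", "SCHED_RR", "SCHED_OTHER"):
        print(name, rt_priority_range(getattr(os, name)))
```

A real-time application would call something like try_set_fifo early in its start-up and treat a False result as a deployment problem rather than silently running as a TS process.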

The priority level range for RT threads can't be too small; otherwise your RT thread scheduling flexibility will be severely constrained.
Windows NT includes 32 priority levels, of which 16 are reserved for the operating system and real-time processes. This range is really too tight.

Stock Linux RT priority class provides 99 fixed priority levels ranging from 1 to 99 (0 is left for non-RT threads).
The following RT thread priority mapping table was extracted from Red Hat Enterprise MRG tuning guide:
Priority | Threads                     | Description
1        | Low priority kernel threads | Priority 1 is usually reserved for those tasks that need to be just above SCHED_OTHER
2 - 69   | Available for use           | Range used for typical application priorities
70 - 79  | Soft IRQs                   |
80       | NFS                         | RPC, Locking and Authentication threads for NFS
81 - 89  | Hard IRQs                   | Dedicated interrupt processing threads for each IRQ in the system
90 - 98  | Available for use           | For use only by very high priority application threads
99       | Watchdogs and migration     | System threads that must run at the highest priority

Although an important feature of RT thread scheduling is that you can schedule your RT application threads above kernel threads, doing so can cause the system to hang or to show other unpredictable behavior, such as blocked network traffic and blocked swapping, if crucial kernel threads are prevented from running as needed (by now you should have a better feel for how your RT thread is scheduled at the cost of lower overall system throughput).
So if your RT application threads have a higher priority than kernel threads, make sure they don't run away, and leave some time for kernel threads.
For example, make sure your RT thread doesn't run too long, or runs periodically based on an RT timer, or is driven by external periodic RT events, or use multiple CPUs with at least one dedicated to kernel threads.

3. Priority Inheritance
Priority Inversion occurs when a high-priority thread blocks on a resource that is held by a low-priority thread, and a medium-priority thread preempts the low-priority thread and runs before the high-priority thread, which causes jitter for the high-priority thread.
Priority inheritance fixes the priority inversion problem by temporarily letting the low-priority thread inherit the priority of the high-priority thread, so that the formerly low-priority thread can run to the end of its critical section without being preempted by the medium-priority thread. The inheriting thread returns to its original low priority once it has released the lock.
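
The effect can be illustrated with a small tick-based simulation. This is only a sketch, not a model of any real scheduler: the three tasks, their arrival ticks and their run lengths are made-up values, and simulate is a hypothetical function. In each tick, the runnable task with the highest effective priority runs for one tick.

```python
def simulate(inherit):
    """Return the tick at which the high-priority task finishes."""
    # Made-up workload: "low" and "high" share one lock, "medium" does not use it.
    tasks = {
        "low": {"prio": 1, "arrive": 0, "left": 4, "locks": True},
        "medium": {"prio": 5, "arrive": 2, "left": 5, "locks": False},
        "high": {"prio": 9, "arrive": 1, "left": 2, "locks": True},
    }
    owner = None  # name of the task currently holding the shared lock
    tick = 0
    while tasks["high"]["left"] > 0:
        ready = [n for n, t in tasks.items() if t["arrive"] <= tick and t["left"] > 0]
        # A task that needs the lock while another task holds it is blocked.
        blocked = [n for n in ready if tasks[n]["locks"] and owner not in (None, n)]
        runnable = [n for n in ready if n not in blocked]

        def effective(name):
            prio = tasks[name]["prio"]
            if inherit and name == owner:
                # The owner borrows the highest priority among its waiters.
                prio = max([prio] + [tasks[b]["prio"] for b in blocked])
            return prio

        running = max(runnable, key=effective)
        if tasks[running]["locks"] and owner is None:
            owner = running  # take the lock on the first tick of the critical section
        tasks[running]["left"] -= 1
        if tasks[running]["left"] == 0 and owner == running:
            owner = None  # release the lock when the critical section ends
        tick += 1
    return tick


print(simulate(inherit=False))  # 11: medium runs ahead of the lock holder
print(simulate(inherit=True))   # 6: the lock holder is boosted and finishes early
```

Without inheritance, the medium-priority task delays the lock holder and therefore the high-priority task; with inheritance, the high-priority task only waits for the remainder of the critical section.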

Both Solaris and Linux RT support priority inheritance. Unfortunately, Windows NT doesn't.
If possible, avoid having a high-priority thread share a resource with a low-priority thread. Obviously this matters more on Windows NT.

4. High-Resolution Timers
Section 1.5 in part 2 mentioned the need for high-resolution timers which are backed by high-resolution clocks on most modern hardware. The OS just takes advantage of hardware timers by providing you with different system calls for high-resolution timers besides the traditional system call for regular timers.

For example, both Solaris and Linux support the system calls "timer_create" and "timer_settime", with clock type "CLOCK_HIGHRES" on Solaris or CLOCK_REALTIME / CLOCK_MONOTONIC on Linux (where you need to enable the kernel configuration option "CONFIG_HIGH_RES_TIMERS", available in 2.6.21 and later on x86), to access high-resolution timers.
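
Python does not expose timer_create or timer_settime, so the following sketch does not use them. Assuming a Unix-like host, it only inspects the clock that backs such timers and measures how far a short sleep overshoots. The helper names clock_resolution_ns and worst_sleep_overshoot_ns are hypothetical.

```python
import time


def clock_resolution_ns(clock_id):
    """Resolution the OS reports for a clock, in nanoseconds."""
    return round(time.clock_getres(clock_id) * 1e9)


def worst_sleep_overshoot_ns(period_ns, rounds):
    """Largest overshoot seen when sleeping for period_ns, repeated rounds times."""
    worst = 0
    for _ in range(rounds):
        start = time.monotonic_ns()
        time.sleep(period_ns / 1e9)
        elapsed = time.monotonic_ns() - start
        worst = max(worst, elapsed - period_ns)
    return worst


# CLOCK_MONOTONIC is only defined on platforms that provide it.
if hasattr(time, "CLOCK_MONOTONIC"):
    print("CLOCK_MONOTONIC resolution (ns):", clock_resolution_ns(time.CLOCK_MONOTONIC))
print("worst overshoot of a 1 ms sleep (ns):", worst_sleep_overshoot_ns(1_000_000, 20))
```

On a kernel with high-resolution timers enabled, the reported resolution is typically far finer than the scheduler tick, and the measured overshoot gives a rough feel for timer jitter on an otherwise untuned system.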

5. CPU Shielding
Windows NT, Solaris and stock Linux all support CPU shielding which allows you to bind different processors / cores to different interrupts and threads including both kernel and user space ones. The bound CPU is shielded from unbound interrupts and threads.

For example, you can bind your high-priority application thread to one CPU while the other CPUs take care of other threads, including kernel threads, and of interrupts, including NMI and SMI, so that you can be confident that your high-priority application thread has low latency and is very predictable.
This means more to Solaris because its ISR is implemented in a thread whose priority is higher than any non-ISR thread including your RT application thread.
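
On Linux, the thread-binding half of this can be sketched with the affinity calls in Python's os module. The helper name pin_to_cpu is hypothetical, and the choice of CPU is arbitrary. Keeping other threads and interrupts off that CPU is a separate system-level configuration step that this sketch does not cover.

```python
import os


def pin_to_cpu(cpu):
    """Bind the calling process to one CPU and return the resulting CPU set."""
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


# The affinity calls are missing on platforms that do not support them.
if hasattr(os, "sched_setaffinity"):
    allowed = os.sched_getaffinity(0)   # CPUs this process may run on right now
    target = max(allowed)               # arbitrary choice for the sketch
    print("pinned to", pin_to_cpu(target))
    os.sched_setaffinity(0, allowed)    # restore the original mask
```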

6. Others
6.1 Memory Pinning
Windows NT, Solaris and stock Linux all allow you to pin the memory of your high-priority process into physical memory so that it is not swapped out to high-latency disks.
Because disks are mechanical, disk IO latency is measured in milliseconds, which is several orders of magnitude slower than memory access. So OS swapping is a major contributor to latency.
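
A minimal sketch of memory pinning, assuming a POSIX host where the C library exposes mlock: it locks a single small buffer through ctypes. Real-time applications more commonly lock their whole address space with mlockall; one buffer is used here only to keep the example harmless. The helper name lock_buffer is hypothetical, and locking may fail if the process's locked-memory limit is too small.

```python
import ctypes
import os


def lock_buffer(size=4096):
    """Allocate a buffer and try to lock it into RAM; return (buffer, locked)."""
    buf = ctypes.create_string_buffer(size)
    if os.name != "posix":
        return buf, False  # mlock is a POSIX call; nothing to do elsewhere
    # CDLL(None) gives access to the symbols already loaded in this process, including libc.
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int
    locked = libc.mlock(ctypes.addressof(buf), size) == 0
    if not locked:
        # Usually a permission or locked-memory limit problem.
        print("mlock failed, errno", ctypes.get_errno())
    return buf, locked


buffer, locked = lock_buffer()
print("buffer pinned in RAM:", locked)
```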

6.2 Early Binding
The late binding of dynamic libraries in the OS can induce unpredictable jitter in your RT application thread. To avoid this jitter, both Linux and Solaris provide early binding of dynamic libraries through the environment variable LD_BIND_NOW.
Windows NT doesn't seem to support such early binding. To counteract this, you can warm up your application (hereafter, warm-up means either the program's start-up phase or an initialization phase before the time-critical execution) before asking it to execute time-critical code.
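
As a small how-to sketch, a launcher can set LD_BIND_NOW for the process it starts. The helper name run_with_early_binding is hypothetical, and the child below is only a stand-in for a real application that reports the variable it sees.

```python
import os
import subprocess
import sys


def run_with_early_binding(argv):
    """Run argv with LD_BIND_NOW set so the dynamic linker resolves symbols at load time."""
    env = dict(os.environ, LD_BIND_NOW="1")
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Stand-in for a real-time application: a child that prints the variable it inherited.
child = [sys.executable, "-c", "import os; print(os.environ.get('LD_BIND_NOW'))"]
print(run_with_early_binding(child).stdout.strip())
```

The variable has to be in the environment before the process is loaded, which is why it is set by the launcher rather than by the application itself.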

6.3 Locking Improvement
Stock Linux uses the so-called "futex" to avoid system calls for uncontended locks. Solaris uses a similar mechanism called "adaptive lock".

7. COTS ERTC Scenarios with OS
Even if an OS provides both throughput and RTC functions, the RTC functions come at the cost of slight throughput degradation. In fact, many observations show that only a minority of workloads truly need tight HRT. Accordingly, users should always first try the OS without the RTC functions enabled.

For example, on Windows NT and stock Linux, if your low-latency requirements can be met through such tunings as RT threads, CPU shielding, memory pinning, priority inversion avoidance, HR timers, application warm-up, and, on stock Linux, early binding and the preemptable kernel configuration, don't try Linux RT. In fact, many SRT requirements can be met using Windows NT or stock Linux.

If you need high predictability or tight HRT, you have to use Linux RT, such as MRG or SLERT, or Solaris.
