Sunday, November 21, 2010

Commercial-Off-The-Shelf Enterprise Real Time Computing (COTS ERTC) -- Part 4: Network Requirements

Enterprise applications rarely work in isolation nowadays. Instead they usually cooperate with one another in a network environment.
Because Ethernet is ubiquitous, this topic will only discuss the TCP/IP suite over Ethernet. The networks in Figure 4.1 are used for the discussion.
Figure 4.1 Networks
1. Network Latency
The network latency from a host in LAN1 such as Host11 to a host in LAN2 such as Host21 is the cumulative latency of all the following components:
  • the TCP/IP stack latency on Host11. It includes all OSI layers except the 7th application layer (which you fully control). At least the two lowest layers are implemented in the NIC firmware; the remaining layers are implemented either in the OS or, on some NICs, in the firmware as well.
  • the switch latency in LAN1. The switch is a Layer 2 device. Originally an unintelligent Layer 1 hub or repeater was used. I will compare the latency difference later on.
  • the routers' latencies between LAN1 and LAN2. The routers are actually Layer 3 switches.
  • the switch latency in LAN2. Again the switch is a Layer 2 device.
  • the TCP/IP stack latency on Host21. Again it includes all OSI layers except the 7th application layer.
  • the propagation delay from Host11 to Host21. It is the time it takes for signals to travel through the physical copper cables or fibers. Signals in fibers or copper cables travel at about 2/3 of the speed of light in a vacuum.
    If the communication distance is short, e.g. within a LAN, the propagation latency can be negligible compared to the other, larger latencies.
    On the other hand, if the communication distance is long, it can't be ignored. For example the straight-line distance from San Francisco to New York is about 4,125 kilometers, and the one-way signal travel needs about 19.6ms, which sets the lower bound of your overall latency optimization.
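    As a back-of-the-envelope check, 19.6ms corresponds to a propagation speed of roughly 0.7c, i.e. about 210,000 km/s in fiber:

        t = d / v = 4,125 km / 210,000 km/s ≈ 19.6 ms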
Network latency can also be defined as the round-trip latency between Host11 and Host21. However, since our discussion focuses on latency and jitter analysis of all the components involved, the above one-way definition is sufficient.

The data unit in our discussion is a packet. For example we can set it anywhere from the minimum Ethernet frame size (64 bytes) to the MTU (typically 1500 bytes) and compare the difference.
When you analyze your application-level latency, the data unit is an application-level request, which complicates matters when it interacts with the underlying Maximum Segment Size (MSS) of TCP and the MTU of Ethernet frames (later sections explain the details).
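As a side note, you can inspect the MSS that TCP actually negotiated for a given connection. A minimal sketch, assuming a Linux/BSD sockets environment and an already connected socket (print_mss is just an illustrative helper):

    /* Query the MSS in use on a connected TCP socket. */
    #include <stdio.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    int print_mss(int connected_fd)
    {
        int mss = 0;
        socklen_t len = sizeof(mss);
        /* TCP_MAXSEG reports the maximum segment size in use; before the
         * connection is established it may only return a default value. */
        if (getsockopt(connected_fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) < 0)
            return -1;
        printf("negotiated MSS: %d bytes\n", mss);
        return 0;
    }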

The prioritized latency and jitter analysis mentioned in Part 1 still applies here. For example upgrading to a faster switch in San Francisco doesn't make sense when the minimum latency to New York is bounded by the 19.6ms propagation delay and the latency of switches is already in the low microseconds.

The following sections will discuss the latency and jitter of each component involved.
2. Non-Determinism in Ethernet
Ethernet, as standardized by IEEE 802.3, is implemented at Layers 1 and 2 of the OSI model. Its latency comes from the processing delays at Layers 1 and 2. The latency has jitter because of its media access control (MAC) protocol, CSMA/CD, at Layer 2.

With CSMA/CD, each host detects whether another host is transmitting on the shared medium before it tries to transmit its own data (Carrier Sense). When a host detects a carrier, its Carrier Sense is turned on and it defers transmission until it determines the medium is free. This deference is unfortunately not predictable. If two hosts happen to transmit simultaneously (Multiple Access), a collision occurs and all frames involved are destroyed.
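To make the unbounded deference and backoff concrete, here is a toy, compilable C simulation of the CSMA/CD transmit loop. The two medium-probing functions are hypothetical stand-ins driven by rand(); the real logic lives in NIC hardware:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static int carrier_sensed(void)     { return rand() % 4 == 0; } /* stub */
    static int collision_detected(void) { return rand() % 3 == 0; } /* stub */

    int main(void)
    {
        srand((unsigned)time(NULL));
        int attempts = 0, slots_waited = 0;

        for (;;) {
            while (carrier_sensed())      /* Carrier Sense: defer while busy */
                slots_waited++;
            if (!collision_detected())    /* Collision Detection */
                break;                    /* frame went through */
            if (++attempts > 16) {        /* 802.3 gives up after 16 attempts */
                puts("frame aborted");
                return 1;
            }
            /* binary exponential backoff: wait 0..2^k - 1 slot times, with k
             * capped at 10; this randomness is a direct source of jitter */
            int k = attempts < 10 ? attempts : 10;
            slots_waited += rand() % (1 << k);
        }
        printf("sent after %d collisions and %d slot times of deference\n",
               attempts, slots_waited);
        return 0;
    }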

If we replace the switch in LAN1 / LAN2 with a hub / repeater, all four hosts will be in one collision domain. The more hosts trying to send data, the more collisions and the more non-deterministic the latency.
Hosts sharing a half-duplex connection are also in the same collision domain. For example, even if only Host11 and Host12 are communicating with each other over a half-duplex connection in LAN1, collisions still happen when Host11 and Host12 try to send data to each other concurrently.

Although redesigning the MAC protocol could solve the collision problem, it would not be compatible with the existing ubiquitous Ethernet deployments and is thus inappropriate and not a COTS solution.
If we can avoid collisions, Ethernet becomes much more deterministic. This is what the switch in LAN1 / LAN2 is supposed to achieve. With the switch, each host is guaranteed exclusive use of the medium in both directions and is thus in a different collision domain from every other host.
For example Host11 can communicate with Host12 while Host13 is communicating with Host14 without causing any collisions or frame drops.

3. Non-Determinism in TCP/IP
IP is at Layer 3; TCP and UDP are at Layer 4. The biggest jitter at these upper layers actually comes from the non-determinism of the underlying CSMA/CD.
For example, if the underlying frame was dropped due to a collision, the packet above it will be lost if it is UDP, and will be retransmitted if it is TCP, thanks to TCP's time-out mechanism.

A side note on TCP's time-out mechanism: if the network is slow, a low time-out value can lead to spurious retransmissions. This is another reason why RTC applications prefer fast networks such as LANs.

You should also be aware of the following points:
3.1 Non-Deterministic Routing Paths
A router adds Layer 3 functionality on top of what a switch does. The more routers, the more potential communication paths between LAN1 and LAN2, and the more non-deterministic the routing. A long communication path also leads to more frame drops due to data corruption. So an RTC system is usually enclosed in a LAN; otherwise make sure your routing path is deterministic. This is why many algorithmic trading systems, especially high-frequency trading systems, co-locate with exchanges.

If the communication path is predictable, the determinism of a network with switches and routers will depend on that of those switches and routers. Modern switches and routers are actually very fast and can show very low jitter if managed properly, as discussed in Section 4.
3.2 Packet Header Overhead and Fragmentation
Each packet at the IP, UDP and TCP layers has a header besides the real payload. The header sizes for IP, UDP and TCP are 20, 8 and 20 bytes, respectively.
The Ethernet frame at Layer 2 also has an 18-byte overhead (header plus frame check sequence). When an IP packet is larger than the Ethernet MTU (typically 1500 bytes), IP needs to fragment and reassemble it.
TCP also segments its byte stream into packets based on the MSS, the available window size and some other factors.

On one hand, more fragmentation and reassembly means more extra computing and higher latency. On the other hand, packets much smaller than the MTU or MSS mean proportionally larger header overhead and lower throughput.
In order to reduce jitter, your application-level requests shouldn't vary too much in size (differently sized requests also mean different transmission delays). You should make your requests as large as possible without missing your latency target. If possible, size your requests based on the MTU.
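To see why small requests are expensive: with TCP over IPv4 on Ethernet, a 100-byte payload carries 20 (TCP) + 20 (IP) + 18 (Ethernet) = 58 bytes of overhead, so only about 63% of each frame is payload; a 1,400-byte payload raises that to roughly 96%. A minimal sketch of reading an interface's MTU on Linux with the SIOCGIFMTU ioctl follows; "eth0" is just an example interface name:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>

    int main(void)
    {
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0); /* any socket works here */
        if (fd < 0) return 1;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        if (ioctl(fd, SIOCGIFMTU, &ifr) < 0) { close(fd); return 1; }

        printf("MTU of %s: %d bytes\n", ifr.ifr_name, ifr.ifr_mtu);
        close(fd);
        return 0;
    }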

3.3 TCP Flow Control
In order to avoid TCP packet drops from buffer overflows, TCP employs a sliding window mechanism to control its data flow.
So slow processing at the receiving host forces the sending host to send smaller packets, or even to stop and wait if the window size drops to 0. In order to reduce this jitter, make sure both ends can process network data quickly and predictably.

If the network is slow, the sender may also have to wait for the latest window size from the receiver. A fast network such as a LAN reduces this waiting latency.
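One complementary knob on both ends is the socket buffer size, which bounds the window the receiver can advertise. A minimal sketch, assuming a Linux/BSD sockets environment; the 1 MiB figure is an arbitrary example that should be tuned to your bandwidth-delay product:

    #include <sys/socket.h>

    int enlarge_buffers(int fd)
    {
        /* set before connect()/listen() so TCP window scaling can honor it;
         * note Linux may clamp the values to net.core.rmem_max / wmem_max */
        int size = 1 << 20;   /* 1 MiB: an assumed example value */
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
            return -1;
        return setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
    }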

3.4 Nagle's Algorithm
Because of the 20-byte header overhead in TCP, Nagle's algorithm coalesces a number of small outgoing messages and sends them in one packet.
This algorithm is usually unacceptable to RTC systems because they need an immediate response to each of their requests. This is especially true when the RTC requests are small.

RTC systems usually take the following two counter-measures:
  • Enclose each request in one TCP packet;
  • Turn off this algorithm by using the TCP_NODELAY socket option (see the sketch below).
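A minimal sketch of the second measure, assuming a Linux/BSD sockets environment (disable_nagle is just an illustrative helper):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable Nagle's algorithm so each small write is sent immediately. */
    int disable_nagle(int fd)
    {
        int on = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
    }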
3.5 Delayed Acknowledgment
Again because of the 20-byte header overhead in TCP, a pure TCP acknowledgment packet carries too much overhead. So this mechanism delays the TCP-layer acknowledgment by up to one or two hundred milliseconds, hoping to send it back along with the upper application layer's response in just one TCP packet instead of two.

Unfortunately a stall of one or two hundred milliseconds can occur when this mechanism interacts with Nagle's algorithm and your application's response doesn't align with the TCP layer's, e.g. when an application request is sent out in two or more TCP packets. Please refer to this resource for a detailed analysis.

In order to avoid this jitter, you need to take one or more of the following measures:
  • disable the Nagle's algorithm;
  • disable or configure the delayed acknowledgment mechanism (see this resource on Windows; use TCP_QUICKACK on Linux, as sketched after this list);
  • Make sure your application level request is enclosed in one TCP packet.
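A minimal sketch of the TCP_QUICKACK option, which is Linux-specific; the kernel may silently clear the flag, so applications typically re-arm it after each receive (quickack is an illustrative helper):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Request immediate ACKs instead of delayed ones; call again after
     * each recv() because the kernel can reset this flag on its own. */
    int quickack(int fd)
    {
        int on = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &on, sizeof(on));
    }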
3.6 UDP with Application Level Retransmission and Flow Controls
TCP implements its reliability at the cost of more computing work and header overhead than UDP. If TCP's reliability is more than what you need, you can use UDP and implement your own simpler reliability at the application layer. This approach should give you lower latency and jitter than using TCP.
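A minimal sketch of the idea, assuming a connected UDP socket on Linux/BSD; the 4-byte sequence header, the 5ms ACK timeout, and the retry count are all illustrative assumptions, not a standard:

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>      /* htonl / ntohl */
    #include <sys/socket.h>
    #include <sys/time.h>

    enum { MAX_RETRIES = 3 };

    /* Send one datagram tagged with a sequence number, wait briefly for the
     * peer to echo that number back as an ACK, and retransmit on timeout. */
    int send_reliable(int fd, uint32_t seq, const void *data, size_t len)
    {
        char pkt[1472];                   /* 1500 MTU - 20 IP - 8 UDP */
        uint32_t net_seq = htonl(seq), ack;

        if (len > sizeof(pkt) - 4)
            return -1;                    /* keep each request in one frame */
        memcpy(pkt, &net_seq, 4);
        memcpy(pkt + 4, data, len);

        struct timeval tv = { 0, 5000 };  /* 5 ms ACK timeout (assumed) */
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

        for (int tries = 0; tries < MAX_RETRIES; tries++) {
            send(fd, pkt, len + 4, 0);
            if (recv(fd, &ack, sizeof(ack), 0) == (ssize_t)sizeof(ack) &&
                ntohl(ack) == seq)
                return 0;                 /* peer acknowledged this seq */
        }
        return -1;                        /* give up; report to the caller */
    }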

4. Latency and Jitter of a Switch / Router
A router has an additional Layer 3 function. Its processing latency in a well-planned network becomes predictable after a warm-up period that caches all routing paths in the router. So we will focus on switches only.
A modern switch basically has two packet forwarding methods:
  • Store and forward. A whole frame is buffered before being forwarded. A checksum is usually performed on each frame.
  • Cut through. The switch reads only up to the frame's hardware address before starting to forward it. There is no error checking.
Because a cut-through switch doesn't have the additional buffering step, it has even lower latency. However a cut-through switch falls back to store and forward if its outgoing (egress) port is busy when a frame arrives. So we will focus on store-and-forward switches.
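For a feel of the store-and-forward cost: buffering a maximum-sized 1,518-byte frame at 1 Gbit/s takes 1,518 × 8 / 10^9 ≈ 12.1 microseconds before forwarding can even begin; at 10 Gbit/s the same frame takes about 1.2 microseconds.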

The one-way port-to-port latency of a switch is the cumulative delays of the following components:
  • Layer 1 processing at both ports including signal modulation and data framing;
  • Switch Fabric Latency. A switch has an internally shared high-bandwidth fabric that is much faster than any of its ports.
  • Store and forward latency;
  • Queuing latency. It occurs when different ingress ports send frames to the same egress port concurrently. Since only one frame can be transmitted at a time from the egress port, the other frames must be queued for sequential transmission in a FIFO manner. This phenomenon is called head-of-line blocking. Due to the FIFO behavior, the latency of a frame in the queue is unpredictable (see the example below).
    So each port of a switch has an outgoing queue, which along with the switch fabric gives the impression of simultaneous paths among its multiple ports.
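For example, if two maximum-sized frames arrive simultaneously for the same 1 Gbit/s egress port, the second must wait one full serialization time, about 12 microseconds, behind the first; with N ingress ports converging on that port the worst-case queuing delay grows to roughly (N - 1) × 12 microseconds.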
Detailed analysis can be found from these two resources: Switched Ethernet Latency Analysis and Latency on a Switched Ethernet Network.

Both analyses show that the queuing latency is usually the largest component and that a switch's jitter comes from it, due to its unpredictable head-of-line blocking.
In order to reduce the queuing latency and jitter, you need to take one or more of the following measures:
  • Plan your network traffic properly, including a predictable many-to-one mapping from ingress ports to egress ports; a smaller many-side value if higher predictability is needed; and, more importantly, no oversubscribed egress port.
    For example if your egress port is 10GbE, you can have at most ten 1GbE ingress ports concurrently sending data to it; otherwise frame loss will occur. If you need some 1GbE port to have higher predictability, reduce the number of 1GbE ingress ports concurrently sending data to the 10GbE egress port.
  • Because a switch's queuing jitter can cause an avalanche effect at subsequent network hops, this is another reason to reduce the number of routers in your networks;
  • Apply Virtual LAN (VLAN) or priority values to different traffic. Make sure your RTC frames are in a separate VLAN or have the highest priority (see the sketch after this list). This is similar to the different priority levels for RTC threads we discussed in Part 3.
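A minimal sketch of the third measure on Linux: SO_PRIORITY marks a socket's traffic so that, on a VLAN device, priorities 0 to 7 can be mapped to the 802.1p priority bits in the VLAN tag; the value 6 is an illustrative choice:

    #include <sys/socket.h>

    /* Mark this socket's egress traffic as high priority; on a VLAN
     * interface this priority can map to the frame's 802.1p bits so
     * switches can favor RTC frames. */
    int prioritize(int fd)
    {
        int prio = 6;   /* illustrative; values above 6 need privileges */
        return setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio));
    }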
Finally for this section: even though a hub / repeater has lower latency than a switch thanks to its simple Layer 1 processing, an RTC system should not use one, except where necessary such as for port mirroring, because it creates jitter.
5. Gigabit Ethernet
Ethernet evolved from 10BASE-T to 100BASE-T to the currently widely deployed 1GbE or Gigabit Ethernet. Now even 10GbE is often seen in backbone networks and high-end servers as a cheaper alternative to proprietary and expensive high-speed interconnects such as Fibre Channel and InfiniBand.
A modern cut-through 10GbE switch such as BLADE's RackSwitch G8124 can have an average port-to-port latency of 680 nanoseconds (averaged over several packet sizes from the minimum 64 bytes to the maximum 1518 bytes).

Although this 680-nanosecond latency is of the same order of magnitude as main memory access, the load on the CPU, without some kind of TCP offloading or OS kernel bypassing, increases linearly with the number of packets processed; the usual rule of thumb is that each bit per second of bandwidth consumes about one hertz of CPU clock. For example 10Gbit/s of network traffic consumes about 10GHz of CPU, much more than the 3.33GHz of Intel's latest Core i7 processor.
As more of the host CPU is consumed by the network load, both CPU utilization and host send / receive latency and jitter become significant issues.
RDMA over TCP/IP, or iWARP, has come to the rescue. Basically an iWARP NIC, or R-NIC, allows a server to read and write data directly between its user memory space and the user memory space of another R-NIC-enabled host on the network, without any involvement of either host's operating system.

APIs have been implemented for different platforms, including the OpenFabrics Enterprise Distribution (OFED) by the OpenFabrics Alliance for the Linux operating system, and the Winsock Direct protocol for Microsoft Windows.
