Myricom at LinuxWorld San Francisco
14-17 August 2006
Low-Latency 10-Gigabit Ethernet
At LinuxWorld San Francisco, Myricom demonstrated message-passing software that delivers low latency and low host-CPU utilization over 10-Gigabit Ethernet. Myricom extended its Myrinet Express (MX) software, already widely used in High-Performance Computing (HPC) clusters interconnected with Myrinet, to work also over 10-Gigabit Ethernet. “MX over Ethernet” operates by kernel bypass with Myricom’s dual-protocol Myri-10G network-interface cards (NICs) and standard 10-Gigabit Ethernet switches to achieve latencies 5 to 10 times lower than with TCP/IP over Ethernet, and much lower host-CPU utilization. As is detailed below, MX/Ethernet performance metrics are nearly on par with those achieved by MX over Myrinet.
The MX/Ethernet protocols are open, and Myricom encourages implementations using other Ethernet NICs. The technique is transparent to Ethernet switch makers, less expensive than proprietary HPC solutions, and applicable both to HPC and to enterprises.
The demonstration cluster in Myricom's LinuxWorld booth consisted of four dual 2.4GHz dual-core AMD Opteron hosts, 16 processors total, connected with both 10-Gigabit Myrinet and 10-Gigabit Ethernet networks. One Myri-10G NIC in each host of the demonstration cluster operated in Myrinet mode, and was connected to a 128-port, 10-Gigabit Myrinet switch. The second Myri-10G NIC in each host was connected to a Fujitsu XG700 low-latency 10-Gigabit Ethernet switch. Visitors to the Myricom booth could observe benchmarks and applications in both MX/Myrinet and MX/Ethernet modes.

HPC Clustering with Myri-10G NICs and Standard 10-Gigabit Ethernet Switches
How it Works
Myricom’s Myri-10G solutions introduced a convergence, at 10-Gigabit/s data rates, of mainstream Ethernet and Myrinet, the most successful specialty network for HPC applications. Dual-protocol Myri-10G NICs initially achieved optimal performance running MX software with Myrinet network protocols through Myri-10G switches. MX’s kernel-bypass techniques achieve low latency and low host-CPU utilization by allowing application programs to communicate directly with firmware in the programmable Myri-10G NICs. Now, the availability of MX/Ethernet extends MX’s advantages to standard 10-Gigabit Ethernet switching, so OEMs and cluster integrators can achieve HPC performance with mainstream Ethernet technology. Myricom is making the MX/Ethernet protocols fully open and accessible, just as with earlier Myrinet protocols and source code.
MX/Ethernet uses 10-Gigabit Ethernet as a layer-2 network with an MX EtherType to identify MX frames (packets). The EtherType, a part of the Ethernet standards since the earliest days, identifies the protocol of an Ethernet frame. For example, there are EtherTypes for the Internet Protocol (IP), Address Resolution Protocol (ARP), AppleTalk, and many other protocols. All of these protocols can be carried concurrently on the same Ethernet network. Ethernet switches normally ignore the EtherType. Myri-10G NICs can carry TCP/IP and other traffic along with MX traffic, but achieve the best performance when they circumvent TCP/IP. MX provides its own, highly efficient, reliability layer.
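The role of the EtherType can be illustrated with a minimal Python sketch that builds a 14-byte Ethernet II header and dispatches on the EtherType field. This is not Myricom's code. The IP, ARP, and AppleTalk values are the registered ones, but the MX EtherType is not given in this article, so the sketch substitutes 0x88B5 (an IEEE "local experimental" EtherType) as a clearly labeled placeholder. The function names are hypothetical.

```python
import struct

# Registered EtherType values for the protocols named in the text.
ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
ETHERTYPE_APPLETALK = 0x809B
# Placeholder only: 0x88B5 is an IEEE "local experimental" EtherType.
# It stands in for the MX EtherType, whose value is not stated here.
ETHERTYPE_MX_PLACEHOLDER = 0x88B5

PROTOCOL_NAMES = {
    ETHERTYPE_IPV4: "IP",
    ETHERTYPE_ARP: "ARP",
    ETHERTYPE_APPLETALK: "AppleTalk",
    ETHERTYPE_MX_PLACEHOLDER: "MX (placeholder)",
}

HEADER_FORMAT = "!6s6sH"  # destination MAC, source MAC, EtherType (big-endian)
HEADER_LENGTH = struct.calcsize(HEADER_FORMAT)  # 14 bytes


def build_frame(dst_mac: bytes, src_mac: bytes, ethertype: int, payload: bytes) -> bytes:
    """Prepend an Ethernet II header (no preamble or FCS) to the payload."""
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be exactly 6 bytes")
    return struct.pack(HEADER_FORMAT, dst_mac, src_mac, ethertype) + payload


def classify_frame(frame: bytes) -> str:
    """Return the name of the protocol selected by the frame's EtherType."""
    if len(frame) < HEADER_LENGTH:
        raise ValueError("frame is shorter than an Ethernet header")
    _dst, _src, ethertype = struct.unpack(HEADER_FORMAT, frame[:HEADER_LENGTH])
    return PROTOCOL_NAMES.get(ethertype, f"unknown (0x{ethertype:04X})")


if __name__ == "__main__":
    host_a = bytes.fromhex("020000000001")  # locally administered test addresses
    host_b = bytes.fromhex("020000000002")
    for ethertype in (ETHERTYPE_IPV4, ETHERTYPE_ARP, ETHERTYPE_MX_PLACEHOLDER):
        frame = build_frame(host_b, host_a, ethertype, b"payload")
        print(classify_frame(frame))
```

A switch forwards all of these frames on the destination address alone. Only the receiving NIC or host needs to look at the EtherType to decide which protocol handler gets the frame, which is why MX traffic and TCP/IP traffic can share one Ethernet network.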
MX/Ethernet is plug-and-play with any 10-Gigabit Ethernet switch, although you get the best performance with low-latency switches such as the Fujitsu XG700. The table below of MPI benchmarks[1] starts with MX/Myrinet with Myri-10G NICs and a 10-Gigabit Myrinet switch as a baseline. The performance of MX/Ethernet with the low-latency Fujitsu XG700 12-port 10-Gigabit Ethernet switch is nearly as good as the MX/Myrinet performance. The last column of the table cites recently published[2] MPI benchmarks for Mellanox InfiniBand to show that, even with standard 10-Gigabit Ethernet switches, MX with Myri-10G NICs soundly outperforms InfiniBand.
| MPI Benchmark | MX/Myrinet | MX/Ethernet | OpenIB |
|---|---|---|---|
| PingPong latency | 2.3µs | 2.8µs | 4.0µs |
| One-way data rate | 1204 MByte/s | 1201 MByte/s | 964 MByte/s |
| Two-way data rate | 2397 MByte/s | 2387 MByte/s | 1902 MByte/s |
Note: On 7 August 2006, Fujitsu announced a 20-port 10-Gigabit Ethernet switch chip. Myricom has had the opportunity to run the same MX/Ethernet tests with this remarkable new switch. The data-rate results were identical to those above with the XG700 switch, but the MPI PingPong latency was only 2.63µs.
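To put "nearly as good" into numbers, the following Python sketch computes a few ratios from the benchmark table. The figures are copied from the table. The 1250 MByte/s line rate is an assumption that takes 10 Gbit/s as the raw data rate and ignores Ethernet framing overhead. The helper names are hypothetical.

```python
# Figures copied from the MPI benchmark table.
RESULTS = {
    "MX/Myrinet": {"latency_us": 2.3, "one_way_mbyte_s": 1204, "two_way_mbyte_s": 2397},
    "MX/Ethernet": {"latency_us": 2.8, "one_way_mbyte_s": 1201, "two_way_mbyte_s": 2387},
    "OpenIB": {"latency_us": 4.0, "one_way_mbyte_s": 964, "two_way_mbyte_s": 1902},
}

# Assumption: 10 Gbit/s raw data rate, no framing overhead.
LINE_RATE_MBYTE_S = 10e9 / 8 / 1e6  # 1250 MByte/s


def ratio(metric: str, system: str, baseline: str) -> float:
    """Return the value of metric for system divided by its value for baseline."""
    return RESULTS[system][metric] / RESULTS[baseline][metric]


def line_rate_fraction(system: str) -> float:
    """Return the one-way data rate as a fraction of the assumed line rate."""
    return RESULTS[system]["one_way_mbyte_s"] / LINE_RATE_MBYTE_S


if __name__ == "__main__":
    for metric in ("latency_us", "one_way_mbyte_s", "two_way_mbyte_s"):
        vs_myrinet = ratio(metric, "MX/Ethernet", "MX/Myrinet")
        vs_openib = ratio(metric, "MX/Ethernet", "OpenIB")
        print(f"{metric}: {vs_myrinet:.3f} x MX/Myrinet, {vs_openib:.3f} x OpenIB")
    print(f"one-way line-rate fraction: {line_rate_fraction('MX/Ethernet'):.3f}")
```

Under these assumptions, the MX/Ethernet one-way data rate is about 99.8% of the MX/Myrinet figure and about 96% of the raw line rate, while the PingPong latency is 0.5µs higher than over Myrinet.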
In addition to low latency, MX exhibits host-CPU utilization that is dramatically lower than the typical TCP/IP utilization and service demand reported in standard benchmarks such as netperf. For MPI communication, the sender or receiver spends ~1µs of host-CPU time to transfer a message of up to 2 KBytes, rising to ~10µs of host-CPU time for messages in the range from 64 KBytes to many MBytes. At a 1 MByte message size, for example, MX/Ethernet completes the transfer across 10-Gigabit Ethernet in less than 1000µs, so the ~10µs of host-CPU time corresponds to a host-CPU utilization of ~1%, an unheard-of low host-CPU load in the TCP/IP world. Even applications that are not sensitive to latency can benefit from MX/Ethernet due to the savings in host-CPU load.
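The ~1% figure follows from simple arithmetic, sketched here in Python. The sketch assumes that 1 MByte means 10^6 bytes and uses the 1201 MByte/s one-way MX/Ethernet data rate from the benchmark table as the transfer rate. The function names are hypothetical.

```python
def transfer_time_us(message_bytes: float, data_rate_mbyte_s: float) -> float:
    """Return the transfer time in microseconds.

    1 MByte/s equals 1 byte per microsecond, so bytes divided by
    MByte/s gives microseconds directly.
    """
    return message_bytes / data_rate_mbyte_s


def cpu_utilization(cpu_time_us: float, elapsed_us: float) -> float:
    """Return host-CPU time as a fraction of elapsed transfer time."""
    return cpu_time_us / elapsed_us


if __name__ == "__main__":
    message_bytes = 1_000_000  # assumption: 1 MByte = 10**6 bytes
    cpu_time_us = 10.0         # ~10µs of host-CPU time for large messages

    elapsed = transfer_time_us(message_bytes, 1201)
    print(f"transfer time: {elapsed:.0f} us")
    print(f"utilization at measured rate: {cpu_utilization(cpu_time_us, elapsed):.1%}")
    # Using the conservative 1000µs bound quoted in the text instead:
    print(f"utilization at 1000 us bound: {cpu_utilization(cpu_time_us, 1000.0):.1%}")
```

At the measured data rate the transfer takes roughly 833µs, which puts the utilization slightly above 1%. With the rounder 1000µs bound it is exactly 1%.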
These MX/Ethernet results show that, for small clusters, 10-Gigabit Ethernet is capable of performance formerly associated only with specialty cluster interconnects. These solutions are limited to clusters small enough to be served by a single 10-Gigabit Ethernet switch, because building larger networks by connecting multiple Ethernet switches costs performance. Inasmuch as there are no high-port-count, low-latency, full-bisection 10-Gigabit Ethernet switches on the market today, MX/Myrinet with 10-Gigabit Myrinet switches will continue to be preferred for large clusters because of the economy and scalability of Myrinet switching.
This MX/Ethernet innovation provides strong new evidence that 10-Gigabit Ethernet will become the interconnect technology of choice for HPC clusters, initially for small clusters, but, as 10-Gigabit Ethernet switch technology advances, for larger clusters as well. Clusters have come to dominate the TOP500 supercomputer list in recent years. Over the past two years, commodity Gigabit Ethernet has eclipsed specialty interconnects, including Myricom’s earlier Myrinet-2000 interconnect, in the number of systems in the TOP500 list. However, Gigabit Ethernet is not fast enough for leading-edge cluster hosts with their multiple, multi-core processors. In anticipation of these trends, Myricom’s latest generation of products, Myri-10G, was designed as a convergence at 10-Gigabit/s data rates of Myrinet, the most successful specialty network for HPC applications, and mainstream Ethernet. As these MX/Ethernet results demonstrate, Myricom’s Myri-10G technology combines the best of both worlds.
Myricom attendees at LinuxWorld San Francisco were Glenn Brown, Senior Programmer; John Daley, Senior Programmer; Nelson Escobar, Programmer; Tom Leinberger, Director of Sales - Central Region; David Pegan, VP Sales; Dr. Chuck Seitz, CEO; and Tim Sticklinski, Director of Sales - Western Region.
[1] The MPI benchmarks for MX are the standard Pallas, now Intel, MPI benchmarks. The data rates are converted from the Mebibyte (2^20 Byte) per second measure reported to the standard MByte/s measure.
[2] OSU Benchmark Comparison: May 11, 2006. The numbers cited are with Intel MPI, and are typical of the best of 45 benchmarks reported. The reported latency does not include the latency of an InfiniBand switch; thus, the actual in-system latency will be higher. The data rates are from streaming tests, which are less demanding than and produce better throughput numbers than PingPong tests.
Updated 17 August 2006