                                             Towards an Open Data Center
                                             with an Interoperable Network
                                             (ODIN)

                                             Volume 5: WAN and Ultra Low
                                             Latency Applications




                                             Casimer DeCusatis, Ph.D.
                                             Distinguished Engineer
                                             IBM System Networking, CTO Strategic Alliances
                                             IBM Systems and Technology Group


                                             May 2012


Executive Overview
       Wide area networks (WANs) are used to interconnect multiple data centers, and are an important
       part of the overall network design strategy. While this document will not discuss backup/recovery
       requirements planning, it does present an overview of preferred industry standards for operating
       extended distance private optical networks such as wavelength division multiplexing (WDM).
       Techniques to minimize the latency for WAN connections are also presented. Other types of
       extended distance connections may be provided through basic IP connections, assuming they
       meet the desired business objectives.


5.1 Multi-Site Connectivity
       It is common for the volume of WAN traffic to increase at an annual rate of thirty percent or more,
       and this traffic volume is expected to increase even further with the advent of larger cloud data
       centers and multi-site enterprise disaster recovery solutions. In the past, data centers did not
       extend broadcast domains over long distances; traffic intended to go outside a given broadcast
       domain had to be filtered. In a more modern environment, there may be tens to
       hundreds or even thousands of virtual servers on a single domain; if this is extended over
       distance, it would require a huge amount of WAN bandwidth (otherwise, it might take a very long
       time to move a VM and its associated data). Higher data rates on the WAN and service provider
       network would also drive disproportionately higher data rates on switches within the data center
       and at the WAN edge, which does not lend itself to cost effective scaling. This has motivated the
       development of new WAN technologies as the data center network has evolved.
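
       As a rough illustration of the bandwidth concern described above, the following sketch estimates
       how long a VM move would take at several WAN rates. The 100 GB image size, the link rates, and
       the 70% protocol efficiency factor are illustrative assumptions rather than figures from this
       document.

# Back-of-envelope estimate of VM migration time over a WAN link.
# All figures here are illustrative assumptions.

def transfer_time_seconds(data_gb: float, link_gbps: float,
                          efficiency: float = 0.7) -> float:
    """Time to move data_gb gigabytes over a link_gbps link, derated by an
    assumed protocol efficiency factor."""
    bits = data_gb * 8.0e9
    return bits / (link_gbps * 1.0e9 * efficiency)

# A hypothetical 100 GB VM image (memory state plus storage delta):
for rate_gbps in (0.1, 1.0, 10.0):
    minutes = transfer_time_seconds(100, rate_gbps) / 60.0
    print(f"{rate_gbps:>5} Gbit/s -> {minutes:7.1f} minutes")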

       Multi-site connectivity can be implemented in a number of ways. Public Internet connections with
       IPSec secure tunneling are readily available and low cost, but may not provide the quality of
       service and performance guarantees required for larger enterprises. There are approaches that
       can leverage the IP network, such as Fibre Channel over IP (FC/IP), which is an IETF industry
       standard protocol used to encapsulate Fibre Channel frames and forward them across an IP
       network. FC/IP can form part of an extended distance solution for data mobility within storage
       systems. Managed data connectivity services provide additional layers of security and
       performance running over a public or private Internet connection. Leased line data services are
       available from service providers which include options for private management of point-to-point
       networks (known as private circuits or Layer 2 VPN) or full mesh connectivity (Layer 3 VPN). In
       areas where leased optical fiber (or “dark fiber”) is available, it is often cost effective for larger
       enterprises to use dedicated optical wavelength division multiplexing (WDM) solutions. The cost
       of WDM is falling rapidly, and it is also available as an integrated option on some large Ethernet
       switches. WDM connectivity may also be used either in place of, or in conjunction with, storage
       data mobility solutions over extended distances (up to several hundred kilometers or more).

       Historically, there have been four distinct generations of enterprise WAN technologies. Starting in
       the mid to late 1980s, it became common for enterprise IT organizations to deploy integrated
       TDM-based WANs to carry both voice and data traffic. In the early 1990s, IT organizations began
       to deploy Frame Relay-based WANs. In the mid to late 1990s, some IT organizations replaced
       their Frame Relay-based WANs with WANs based on ATM (Asynchronous Transfer Mode)
       technology. Since around the year 2000, most IT organizations have replaced their legacy WANs
       with MPLS based technology combined with some Internet based services. More recently, MPLS
       has also been used within a single data center to deliver the same benefits as when it is used on
       the WAN. Since the price/performance of MPLS services tends to lag behind the expected growth
       of WAN traffic, new technologies such as virtual private LAN services (VPLS) are being deployed.
       VPLS represents the combination of Ethernet and MPLS whereby an Ethernet frame is
       encapsulated inside of MPLS. As is typically the case with WAN services, the viability of using
       VPLS vs. alternative services will hinge largely on the relative cost of the services, which will vary
       by service provider and geographic location. When MPLS is deployed between data centers, it
       functions as an overlay on top of an existing leased line infrastructure; this can make it difficult to
       create a cost effective infrastructure for smaller and medium sized enterprises.
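
       As a conceptual sketch of the VPLS encapsulation described above, the fragment below prepends
       a simplified two-entry MPLS label stack (a tunnel label plus a pseudowire label) to an Ethernet
       frame. The label values and the placeholder frame are assumptions, and the control-plane
       signaling used in real deployments is not shown.

# Conceptual sketch of VPLS-style encapsulation: a customer Ethernet frame is
# carried inside an MPLS label stack (a tunnel label plus a pseudowire label).
# Label values and the placeholder frame are arbitrary assumptions.
import struct

def mpls_label(label: int, exp: int = 0, bottom: bool = False, ttl: int = 64) -> bytes:
    """Pack one 32-bit MPLS label stack entry (20-bit label, EXP, S bit, TTL)."""
    value = (label << 12) | (exp << 9) | (int(bottom) << 8) | ttl
    return struct.pack("!I", value)

def vpls_encapsulate(eth_frame: bytes, tunnel_label: int, vc_label: int) -> bytes:
    """Prepend a two-entry MPLS label stack; the inner label marks bottom of stack."""
    return mpls_label(tunnel_label) + mpls_label(vc_label, bottom=True) + eth_frame

customer_frame = bytes(64)   # placeholder Ethernet frame
packet = vpls_encapsulate(customer_frame, tunnel_label=1001, vc_label=2002)
print(len(packet), "bytes after encapsulation")   # 8 bytes of labels plus the frame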

       There are several types of MPLS/VPLS services, depending on whether the application is within
       a single data center or between data centers, and whether traffic is being managed within a
       single VPN or between VPNs. Core switches that support MPLS/VPLS standards enable
       collapsing the router and core tiers of the data center network into a flatter network with fewer
       layers. MPLS/VPLS enable VM migration between data centers by keeping VMs in a single layer-
       2 domain which spans multiple data centers.

       One of the challenges associated with modern WANs in general and hybrid cloud computing in
       particular is that hybrid clouds depend heavily on VM migration among geographically dispersed
       servers. This is necessary in order to ensure high availability and dynamic response to changes
       in user demand for services. The desire to have transparency relative to the location of the
       applications has a number of networking implications, including the following:

          •  VLAN Extension - The VLANs within which VMs are migrated must be extended over the
             WAN between the private and public data centers.
          •  Secure Tunneling - These tunnels must provide an adequate level of security for all the
             required data flows over the Internet.
          •  Universal Access to Central Services - All application services, such as load balancing,
             should be available and function transparently in this environment.
          •  Disaster recovery solutions - As data centers become larger, there is an increasing need for
             multi-site managed backup, recovery, and continuous availability solutions. The nature of
             these solutions depends on each user's tolerance for the period of time their data can be
             unavailable during an outage (recovery time objective), the amount of data which can afford
             to be lost (recovery point objective), and other factors; a rough sizing sketch follows this list.
             Technical problems remain with supporting multi-hop across FCoE switches at extended
             distance, so Fibre Channel will continue to be used for long distance storage backups. Many
             enterprise applications will continue to use ultra-high availability solutions for their mission
             critical data (such as GDPS in a mainframe environment).
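
       The sizing sketch referenced in the list above shows one way to reason about recovery point
       objectives in terms of WAN capacity; the change rate and RPO values are assumed examples
       only.

# Rough sizing of replication bandwidth against a recovery point objective (RPO).
# The change rate and RPO below are assumed example values.

def steady_state_bandwidth_gbps(change_rate_gb_per_hour: float) -> float:
    """Minimum sustained link rate (Gbit/s) needed to keep pace with data changes."""
    return change_rate_gb_per_hour * 8.0 / 3600.0

def max_backlog_gb(change_rate_gb_per_hour: float, rpo_minutes: float) -> float:
    """Largest un-replicated backlog (GB) that still honours the RPO."""
    return change_rate_gb_per_hour * rpo_minutes / 60.0

change_rate = 500.0   # GB of changed data per hour (assumed)
print(f"sustained link rate: {steady_state_bandwidth_gbps(change_rate):.2f} Gbit/s")
print(f"15 minute RPO allows {max_backlog_gb(change_rate, 15):.0f} GB of backlog")
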
       Since WAN costs can be relatively high compared with intra-site networking, small and medium
       sized clients often cannot simply add more WAN capacity to their networks on demand. There is
       a tradeoff between cost containment and increased network traffic demands. Various WAN
       optimization and acceleration techniques can be used to extract more performance from the
       existing infrastructure. WAN optimization should enable locating key servers in a centralized
       location by providing application performance similar to that achieved on a LAN. Application
       accelerators for TCP/IP and similar protocols also play an important role in performance
       optimization. If real time applications are deployed over an accelerated WAN, then quality of
       service and bandwidth optimization are desired features. There are vendor proprietary
       alternatives to MPLS/VPLS; there are a number of concerns with these alternatives, including
       requirements to configure them on core routers, security issues (particularly for alternatives which
       transport traffic over an untrusted IP connection rather than an MPLS/VPLS tunnel), and
       guaranteeing lossless performance and reserved bandwidth. Further, MPLS/VPLS is a very
       mature protocol, with well-developed traffic engineering facilities, and since MPLS/VPLS is a
       shared network model, in principle it offers lower cost to the end users.

       An MPLS backbone for site to site connectivity is compatible with a dual homed Ethernet
       architecture in the data center, including core switch connectivity with MLAG, TRILL, and other
       features described previously. Routers and firewalls should be deployed in an active/active
       configuration and use separate WAN links, cross-connected to provide high availability. Load
       balancing across redundant connections is optional depending on traffic volumes and availability
       requirements of the application. Other emerging protocols including IPv6 and OpenFlow are
       beginning to make inroads into the WAN, as well.

5.2 Ultra Low Latency Applications
       One area which has recently received considerable interest is the design of data centers to
       accommodate extremely low latency applications. In some cases, the network will not be the
       limiting factor for low latency (for example, storage controllers may have latency many times
       larger than the network); in other cases, the network latency may be a significant factor. These
       applications may include areas such as telemedicine and other remote control systems; one of
       the largest applications involves real time electronic financial transactions. Sometimes known as
       high frequency trading (HFT), this approach is currently responsible for over 1/3 of all stock
       transactions and is expected to grow significantly in coming years. The overriding design
       consideration for HFT applications is lowering latency, which refers to the total end to end time
       delay within the data center network due to a combination of time of flight and processing delays
       within the network equipment. Financial applications are especially sensitive to latency; a
       difference of microseconds or less can mean millions of dollars in lost revenue. There are several
       published examples from retail and online merchants in which increased latency reduces the
       number of search queries and retail transactions; some of these effects can persist even after
       latency has returned to nominal levels following a brief spike. High latency translates directly to lower
       performance because applications stall or idle when they are waiting for a response over the
       network. Further, new types of network traffic are particularly sensitive to latency, including virtual
       machine migration and storage traffic. In the case of HFT, both the magnitude and consistency of
       the latency (jitter, or variation in packet arrival times) are important. Low latency is critical to high
       performance, especially for modern applications where the ratio of communication to computation
       is relatively high compared to legacy applications. The Securities Technology Analysis Center
       (STAC™) is a vendor neutral benchmarking organization comprised of leading financial market
       firms, who write and maintain a library of test suites which represent customer-defined, simulated
       market trading environments. Testing with this benchmark is observed and audited by STAC™
       and made available to their members and subscribing companies.
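
       As a simple illustration of reporting both the magnitude and the consistency of latency, the
       fragment below summarizes a set of hypothetical one-way latency samples. The values and the
       use of standard deviation as a jitter metric are assumptions for demonstration and are unrelated
       to any STAC test suite.

# Summarizing hypothetical latency samples: both the average delay and its
# variation (jitter) matter. All sample values are invented for illustration.
import statistics

samples_us = [11.2, 11.4, 11.3, 11.5, 11.3, 14.9, 11.4, 11.2, 11.6, 11.3]

mean_us = statistics.mean(samples_us)
worst_us = max(samples_us)
jitter_us = statistics.pstdev(samples_us)   # one simple jitter metric

print(f"mean {mean_us:.2f} us, worst {worst_us:.1f} us, jitter {jitter_us:.2f} us")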

       Today there is a tradeoff between virtualization and latency, so environments with very low
       latency requirements typically do not virtualize their applications. In the long term, this may change as
       increased speeds of multi-core processors and better software reduce the latency overhead
       associated with virtualization.

       The internal design of data center switches can influence latency. The number of switch chip
       hops within a switch should be minimized; a single-chip switch offers not only lower latency than
       a multi-chip switch, but also provides more consistent, deterministic latency to every switch port.
       Single-chip solutions also offer higher reliability and lower power dissipation.
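
       A minimal sketch of why chip count matters follows; the per-chip latency figure is an assumed
       placeholder, not a measured value for any particular switch.

# Latency inside a switch accumulates with every switch ASIC a frame traverses.
# The per-chip figure below is an assumed placeholder value.

PER_CHIP_US = 0.5   # assumed delay per switch chip traversal, in microseconds

def switch_latency_us(chip_hops: int) -> float:
    """Total on-switch delay for a frame crossing chip_hops switch chips."""
    return chip_hops * PER_CHIP_US

print(f"single-chip switch: {switch_latency_us(1):.1f} us")
print(f"three-chip switch:  {switch_latency_us(3):.1f} us")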

       Most of the latency associated with data center networks is incurred by the upper layer protocols
       (TCP windowing, flow control, packet retransmission and routing, store and forward, etc.). For this
       reason, techniques such as iWARP and RoCE can be used to minimize the network stack latency.
       However, a significant amount of latency is also incurred from wide area network transport. There
       are three major sources of latency in the WAN: fiber latency, WAN equipment latency, and the
       contributions of equipment in the fiber path (signal regenerators, amplifiers, and dispersion
       compensators). The fiber latency is fixed at approximately 5 microseconds per kilometer, and will
       be dominated by the WAN distance rather than distances within the data center. This is
       particularly difficult to adjust, since fiber paths are often indirect and much longer than the
       geographic distance between two locations. For connections between major cities, existing fiber
       routes are not particularly direct, and new, more direct fiber builds are often not economically
       justified since it is much easier to reinforce existing fiber routes.
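
       A simple estimate of the fiber contribution is sketched below. The 5 microseconds per kilometer
       figure comes from the text above; the route factor (how much longer the installed fiber path is
       than the straight-line distance) and the 1200 km example are assumptions.

# Fiber propagation delay: roughly 5 microseconds per km of glass. The route
# factor models indirect fiber paths; its value here is an assumption.

US_PER_KM = 5.0

def fiber_latency_ms(geographic_km: float, route_factor: float = 1.4) -> float:
    """One-way propagation delay in milliseconds over the assumed fiber path."""
    return geographic_km * route_factor * US_PER_KM / 1000.0

one_way_ms = fiber_latency_ms(1200.0)   # e.g. a ~1200 km city pair
print(f"one-way {one_way_ms:.2f} ms, round trip {2 * one_way_ms:.2f} ms")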


       [Figure: a protocol stack diagram in which latency increases from Layer 1 (least)
       to Layer 7 (most). The latency sources shown at each layer are:
          Layers 5-7 (Session, Presentation, Application): data
          Layer 4: TCP windowing, flow control, packet re-send (segments)
          Layer 3: address lookup, packet forwarding, routing (packets)
          Layer 2: store-and-forward, line coding, switching (frames)
          Layer 1: framing, FEC (bits)]

Figure 5.1 – Sources of latency in the network

              There are also potentially significant sources of latency in the long distance optical transport
              equipment. For example, optical transponders are used to convert an incoming data signal to a
              specific modulated optical wavelength for multiplexing purposes, or to aggregate lower data rates
              using time division multiplexing. The electronic time multiplexing, performance monitoring,
              protocol conversion, clock recovery, and forward error correction (FEC) algorithms used in this
              application are all sources of added latency. While this is usually negligible for typical
              applications, it can be significant for latency sensitive applications. Higher data rates (over 10
              Gbit/second) require FEC in order to detect and correct bit errors, but this can add tens to
              hundreds of microseconds of latency. Similarly, the convergence of optical and electrical signals
               in a sub-rate multiplexing architecture can be achieved using the industry standard ITU-T G.709,
              known as Optical Transport Network (OTN). This approach encapsulates user data in a digital
              wrapper to decouple the server links from the long haul links, and is commonly used to
              encapsulate lower data rate traffic into a 40-100 Gbit/second backbone. However, OTN
              encapsulation also introduces tens of microseconds of additional latency, and should be disabled
              for ultra-low latency networks. We also note that many vendor proprietary inter-switch links (ISLs)
              on Fibre Channel switches are not fully compatible with OTN, and thus OTN should be disabled if
              these interconnects are used for long distance transmission.
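
       The effect of these transport choices can be summarized in a simple one-way latency budget,
       sketched below; the per-link FEC and OTN figures are placeholders chosen within the ranges
       quoted above, not vendor specifications.

# One-way WAN latency budget: fiber propagation plus optional FEC and OTN
# overheads. The FEC and OTN figures are placeholders within the ranges above.

def wan_latency_us(distance_km: float, use_fec: bool, use_otn: bool,
                   fec_us: float = 100.0, otn_us: float = 30.0) -> float:
    """Estimated one-way latency in microseconds for a WAN link."""
    total_us = distance_km * 5.0   # ~5 us per km of fiber
    if use_fec:
        total_us += fec_us
    if use_otn:
        total_us += otn_us
    return total_us

print(wan_latency_us(100.0, use_fec=True, use_otn=True))    # standard transport
print(wan_latency_us(100.0, use_fec=False, use_otn=False))  # tuned for low latency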

       For distances exceeding 80-100 km, optical amplification and dispersion compensation are
       required. Optical fiber amplifiers consist of specially doped sections of fiber which may be tens to
       hundreds of meters or more in length. Optical signals passing through this fiber are amplified
       without requiring electronic to optical signal conversion, so the overall latency from an optical
       amplifier is lower than a corresponding electronic amplifier; there is a tradeoff in signal integrity
       since the optical amplifier cannot retime a signal like the electronic amplifier. Although the latency
       introduced by a single optical amplifier is typically very low (less than a few microseconds), for
       fiber links with poor noise figures, many amplifiers placed close together may be required, thus
       increasing the aggregate latency. The type of optical amplifier will also make a difference. Erbium
       doped fiber amplifiers (EDFAs) require longer fiber lengths within the amplifier, and thus add
       more latency compared with Raman amplifiers.
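
       The sketch below illustrates how per-amplifier delay accumulates over a long route; the span
       length and per-amplifier delays are assumptions chosen only to reflect that an EDFA contains
       more internal fiber than a Raman stage.

# Aggregate amplifier delay over a long route. Span length and per-amplifier
# delays are assumed values for illustration only.
import math

def amplifier_latency_us(route_km: float, span_km: float, per_amp_us: float) -> float:
    """Total delay added by the in-line amplifiers along the route."""
    n_amps = max(math.ceil(route_km / span_km) - 1, 0)
    return n_amps * per_amp_us

print(f"EDFA chain:  {amplifier_latency_us(1000.0, 80.0, 0.5):.1f} us")
print(f"Raman chain: {amplifier_latency_us(1000.0, 80.0, 0.05):.2f} us")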

       Extended distance links also require dispersion compensation, to overcome the fixed levels of
       chromatic dispersion associated with long distances of installed fiber. The type of dispersion
       compensator can make a significant difference in latency. One approach involves inserting spools
       of specially treated dispersion compensating fiber into the link, which have a negative dispersion
       shift and cancel out the positive dispersion associated with the rest of the fiber. A typical 100 km
       link can be compensated with about 14 km of dispersion compensating fiber, which adds about 70
       microseconds to the link latency [6]. If the dispersion compensating fiber is not optimally placed,
       additional optical amplifier stages may be required, which further increases the link latency.
       Another approach is the use of dispersion compensation gratings, which are short lengths of
       optical fiber fabricated with a chirped fiber Bragg grating in their core. This diffraction grating is
       able to induce high levels of negative dispersion proportional to the optical wavelength; several
       possible designs have been proposed [7]. A 100 km length of fiber can be compensated using
       only about 20 meters of fiber Bragg grating, with an additional latency of less than 0.15
       microseconds. Although dispersion compensating gratings are currently more expensive, the cost
       difference may be justified in cases where ultra-low latency is required. Additional latency tuning
       can also be achieved through tuning of the application environment, operating system, and
       hardware environment of the servers attached to the network, although these details are beyond
       the scope of this architecture.
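
       Using the figures quoted above for a 100 km span, the comparison below converts the extra glass
       in each type of compensator into added delay at roughly 5 microseconds per kilometer.

# Added delay from the extra fiber inside a dispersion compensator, using the
# approximate figures quoted above for a 100 km span.

US_PER_KM = 5.0

def added_latency_us(extra_fiber_km: float) -> float:
    """Delay contributed by extra_fiber_km kilometers of additional fiber."""
    return extra_fiber_km * US_PER_KM

print(f"DCF spool, about 14 km:     {added_latency_us(14.0):.0f} us")    # ~70 us
print(f"Bragg grating, about 20 m:  {added_latency_us(0.020):.2f} us")   # ~0.1 us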

       In summary, for ultra-low latency applications such as high frequency financial trading, the data
       center network can introduce significant amounts of latency. Within a data center, the entire
       network stack must be considered, including the server adapter, top of rack switches, and core
       switches; end to end solutions which perform well on independently audited latency benchmark
       tests are recommended. The number of switch chips within a network switch should be
       minimized. For links between data centers, latency can be optimized by selecting the shortest
       possible physical fiber path, disabling FEC and OTN, using Raman amps instead of EDFAs, and
       using dispersion compensating Bragg gratings instead of dispersion compensating fiber. In the
       future, as transaction rates increase, we expect further reductions in latency will be possible
       through faster processors and network interface controllers, accelerated middleware appliances,
       and ultra-low latency switches, combined with a certain amount of tuning and design optimization.


Summary
       The ability to interconnect multiple data centers over extended distance is an important part of an
       overall data center strategy, particularly when considering business continuity and
       backup/recovery applications. Mature standards such as MPLS/VPLS provide a practical way to
       enable this connectivity. When designing for ultra-low latency applications, the incremental
       latency of the WAN can be minimized through techniques such as disabling OTN and FEC, using
       Fiber Bragg grating based dispersion compensation, or using Raman amplifiers instead of
       EDFAs.

Technical References
       A. Bach, “High speed networking and the race to zero”, Hot Interconnects (HOTI) conference, 11
       Madison Ave, New York, NY, August 25-27, 2009; retrieved from
       http://www.hoti.org/hoti17/program/

       “Best practices for tuning system latency”, IBM White Paper, March 2011, retrieved from:
       http://publib.boulder.ibm.com/infocenter/lnxinfo/v3r0m0/topic/performance/rtbestp/rtbestp_pdf.pdf

       Discussion on why MPLS is more secure than OTV, retrieved from:
       http://www.bandwidth.com/wiki/article/How_secure_is_MPLS

       “InfiniBand® Trade Association Announces RDMA over Converged Ethernet (RoCE); New
       Specification to Bolster Low Latency Ethernet Adoption in the Enterprise Data Center,” retrieved
       from: http://www.infinibandta.org/content/pages.php?pg=press_room_item&rec_id=663

       IETF, Request for Comments (RFC) Pages, http://www.ietf.org/rfc.html




For More Information
IBM System Networking                                             http://ibm.com/networking/
IBM PureSystems                                                   http://ibm.com/puresystems/
IBM System x Servers                                              http://ibm.com/systems/x
IBM Power Systems                                                 http://ibm.com/systems/power
IBM BladeCenter Server and options                                http://ibm.com/systems/bladecenter
IBM System x and BladeCenter Power Configurator                   http://ibm.com/systems/bladecenter/resources/powerconfig.html
IBM Standalone Solutions Configuration Tool                       http://ibm.com/systems/x/hardware/configtools.html
IBM Configuration and Options Guide                               http://ibm.com/systems/x/hardware/configtools.html
Technical Support                                                 http://ibm.com/server/support
Other Technical Support Resources                                 http://ibm.com/systems/support

Legal Information

IBM Systems and Technology Group
Route 100
Somers, NY 10589

Produced in the USA
May 2012
All rights reserved.

IBM, the IBM logo, ibm.com, BladeCenter, and VMready are trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at
ibm.com/legal/copytrade.shtml

InfiniBand is a trademark of InfiniBand Trade Association.

Intel, the Intel logo, Celeron, Itanium, Pentium, and Xeon are trademarks or registered trademarks of
Intel Corporation or its subsidiaries in the United States and other countries.

Linux is a registered trademark of Linus Torvalds.

Lotus, Domino, Notes, and Symphony are trademarks or registered trademarks of Lotus Development
Corporation and/or IBM Corporation.

Microsoft, Windows, Windows Server, the Windows logo, Hyper-V, and SQL Server are trademarks or
registered trademarks of Microsoft Corporation.

TPC Benchmark is a trademark of the Transaction Processing Performance Council.

UNIX is a registered trademark in the U.S. and/or other countries licensed exclusively through The Open
Group.

Other company, product and service names may be trademarks or service marks of others.

IBM reserves the right to change specifications or other product information without notice. References
in this publication to IBM products or services do not imply that IBM intends to make them available in all
countries in which IBM operates. IBM PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow
disclaimer of express or implied warranties in certain transactions; therefore, this statement may not
apply to you.

This publication may contain links to third party sites that are not under the control of or maintained by
IBM. Access to any such third party site is at the user's own risk and IBM is not responsible for the
accuracy or reliability of any information, data, opinions, advice or statements made on these sites. IBM
provides these links merely as a convenience and the inclusion of such links does not imply an
endorsement.

Information in this presentation concerning non-IBM products was obtained from the suppliers of these
products, published announcement material or other publicly available sources. IBM has not tested these
products and cannot confirm the accuracy of performance, compatibility or any other claims related to
non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products.

MB, GB and TB = 1,000,000, 1,000,000,000 and 1,000,000,000,000 bytes, respectively, when referring
to storage capacity. Accessible capacity is less; up to 3GB is used in service partition. Actual storage
capacity will vary based upon many factors and may be less than stated.

Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using
standard IBM benchmarks in a controlled environment. The actual throughput that any user will
experience will depend on considerations such as the amount of multiprogramming in the user's job
stream, the I/O configuration, the storage configuration and the workload processed. Therefore, no
assurance can be given that an individual user will achieve throughput improvements equivalent to the
performance ratios stated here.

Maximum internal hard disk and memory capacities may require the replacement of any standard hard
drives and/or memory and the population of all hard disk bays and memory slots with the largest
currently supported drives available. When referring to variable speed CD-ROMs, CD-Rs, CD-RWs and
DVDs, actual playback speed will vary and is often less than the maximum possible.

QCW03023USEN-00

throughput that any user will experience will depend on considerations such as the amount of multiprogramming in the Microsoft, Windows, Windows Server, the Windows logo, user’s job stream, the I/O configuration, the storage Hyper-V, and SQL Server are trademarks or registered configuration and the workload processed. Therefore, no trademarks of Microsoft Corporation. assurance can be given that an individual user will achieve TPC Benchmark is a trademark of the Transaction Processing throughput improvements equivalent to the performance ratios Performance Council. stated here. UNIX is a registered trademark in the U.S. and/or other Maximum internal hard disk and memory capacities may countries licensed exclusively through The Open Group. require the replacement of any standard hard drives and/or memory and the population of all hard disk bays and memory Other company, product and service names may be slots with the largest currently supported drives available. trademarks or service marks of others. When referring to variable speed CD-ROMs, CD-Rs, CD-RWs IBM reserves the right to change specifications or other and DVDs, actual playback speed will vary and is often less product information without notice. References in this than the maximum possible. publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates. IBM PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you. QCW03023USEN-00