ETHERNET FABRIC
MAKING CONGESTION CONTROL WORK FOR LOW LATENCY,
HIGH BANDWIDTH UNDER INCAST
A QUICK TOUR OF THE TRANSITION FROM SDP TO TCP
OVERVIEW
▸ What is SDP and why did it work well as a fabric?
▸ Why TCP?
▸ What are the problems with default TCP?
▸ Incast
▸ Slow start and delayed acks
▸ NewReno sawtooth - ECN and DCTCP
▸ What we did to implement the research suggestions
▸ Several overlooked implications
▸ Performance in practice
SDP - IN THE BEGINNING THERE WAS INFINIBAND
SDP - SOCKETS DIRECT PROTOCOL
▸ “The Sockets Direct Protocol (SDP) is a transport-agnostic
protocol to support stream sockets over Remote Direct Memory
Access (RDMA) network fabrics” - Wikipedia
▸ All congestion control, queueing, and pacing handled in
hardware/firmware
▸ Provides guaranteed delivery, low latency, and (in theory)
reduced CPU overhead
▸ On paper provides the perfect solution for a hardware agnostic,
robust, low-latency cluster interconnect
THE GOOD - COST
WHY TCP? ETHERNET - OBVIOUSLY
▸ “It’s a single set of skills, plugs and protocols to rule them
all. Perhaps it’s time to create the second law of Metcalfe:
Never bet against Ethernet.” NetworkWorld Aug 15, 2006
▸ Multiple Vendors
▸ Very commoditized below 40Gbps
▸ Fewer firmware issues
▸ Much easier to understand failure modes
THE BAD & UGLY - UTILIZATION AND LOSS RECOVERY
WHAT’S WRONG WITH TCP?
▸ Incast - catastrophic loss under highly synchronized fan-in
▸ Slow start - recovery too conservative
▸ Delayed acks - 100ms is a very long time when the RTT is
50us
▸ NewReno can’t reach consistent utilization
▸ Partially mitigated by ECN
▸ DCTCP may be a long-term fix
PROBLEMS UNIQUE TO A MESH
INCAST - DEFINITION AND MITIGATION
▸ Term coined in [PANFS] for the effect of increasing the number of
simultaneously initiated, effectively barrier-synchronized, fan-in flows into a
single port until the instantaneous switch/NIC buffering capacity is
exceeded, causing aggregate bandwidth to decline as the need for
retransmits grows. Tail-drop behavior in the switch exacerbates this:
multiple losses within individual streams exceed the recovery abilities of
duplicate ACKs or SACK, forcing RTOs before the flows resume.
▸ Solution: scale RTO continuously (sketch below)
▸ Make the retransmit timer higher resolution
▸ Make the RTT calculation higher resolution
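
A minimal sketch of the continuously scaled RTO, assuming the standard
RFC 6298 estimator carried in microseconds instead of scheduler ticks;
the names and the 200us floor are illustrative assumptions, not the
shipped code:

    /*
     * Sketch: RFC 6298-style SRTT/RTTVAR/RTO, carried in microseconds
     * so the RTO can scale continuously down toward a ~50us fabric RTT.
     * The 200us floor is an assumption for illustration.
     */
    #include <stdint.h>

    #define RTO_MIN_US  200u
    #define RTO_MAX_US  (120u * 1000000u)   /* cap near the 120s MSL */

    struct rtt_est {
        uint64_t srtt_us;       /* smoothed RTT */
        uint64_t rttvar_us;     /* mean deviation */
    };

    static uint64_t
    rto_update(struct rtt_est *e, uint64_t sample_us)
    {
        if (e->srtt_us == 0) {                  /* first sample */
            e->srtt_us = sample_us;
            e->rttvar_us = sample_us / 2;
        } else {
            uint64_t d = sample_us > e->srtt_us ?
                sample_us - e->srtt_us : e->srtt_us - sample_us;
            e->rttvar_us = (3 * e->rttvar_us + d) / 4;       /* beta = 1/4 */
            e->srtt_us   = (7 * e->srtt_us + sample_us) / 8; /* alpha = 1/8 */
        }
        uint64_t rto = e->srtt_us + 4 * e->rttvar_us;
        if (rto < RTO_MIN_US) rto = RTO_MIN_US;
        if (rto > RTO_MAX_US) rto = RTO_MAX_US;
        return rto;
    }

With 10ms tick arithmetic the minimum RTO can never approach a ~50us
fabric RTT; in microseconds the timer tracks the estimator continuously.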
BURSTINESS & LOSS
“SLOW START” - RENEGOTIATING THE LINK
▸ Under loss the window drops to 1 segment - given that we know the RTT
and the maximum peer bandwidth, this should really be a larger value
determined by the maximum number of active peers, i.e. each peer’s
share of the bandwidth-delay product: (max BW / #peers) * RTT (sketch
after this slide)
▸ LRO produces stretch ACKs, which grow the window by only one segment
per ACK under the RFC rules - Linux avoids ACK division while still
growing the window more rapidly by crediting the total bytes ACKed
▸ Delayed ACKs (100ms) - particularly during slow start this can delay
window growth - at other times it can create an artificially elevated
RTO
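
A hedged sketch of the restart-window idea from the first bullet above;
restart_cwnd and its parameters are illustrative names, not the stack’s:

    /*
     * Hypothetical restart window: rather than collapsing to 1 segment
     * after an RTO, restart from this peer's share of the path BDP.
     */
    #include <stdint.h>

    static uint32_t
    restart_cwnd(uint64_t max_bw_bps, uint32_t n_peers,
                 uint64_t rtt_us, uint32_t mss)
    {
        /* per-peer BDP in bytes: (max BW / #peers) * RTT */
        uint64_t bdp = (max_bw_bps / 8) / n_peers * rtt_us / 1000000;
        uint32_t cwnd = (uint32_t)(bdp / mss) * mss;  /* round to segments */
        return cwnd > mss ? cwnd : mss;               /* floor: 1 segment */
    }

For example, at 40Gb/s with 16 peers, a 50us RTT, and a 1448-byte MSS
this restarts at ~14KB (10 segments) rather than a single segment.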
CONGESTION CONTROL AND UTILIZATION
THE NEWRENO SAWTOOTH
▸ ECN
▸ when a CE mark is seen we only reduce the congestion window by half
rather than resetting it to 1
▸ still causes the window to bounce between two values even at a
stable “steady state”
▸ DCTCP
▸ allows continuous scaling of the congestion window - each CE mark
only reduces the congestion window by a small fraction (sketch below)
▸ interoperability issues due to redefining the ECN signal and setting
the switch’s ECN marking min/max thresholds to the same value, K
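
A sketch of the DCTCP control law from the DCTCP paper - alpha <-
(1 - g) * alpha + g * F, then cwnd <- cwnd * (1 - alpha / 2); the
fixed-point scaling and names below are illustrative, not any
particular stack’s code:

    /*
     * F = fraction of bytes CE-marked in the last window, scaled to
     * 0..1024; g = 1/16.  With light marking the cut is tiny; with
     * every byte marked it degenerates to the classic cwnd/2.
     */
    #include <stdint.h>

    #define DCTCP_SHIFT   10        /* alpha scaled to 0..1024 */
    #define DCTCP_G_SHIFT 4         /* g = 1/16 */

    struct dctcp { uint32_t alpha; };   /* 0..(1 << DCTCP_SHIFT) */

    static uint32_t
    dctcp_window_update(struct dctcp *d, uint32_t cwnd,
                        uint32_t marked_bytes, uint32_t acked_bytes)
    {
        if (acked_bytes == 0)
            return cwnd;
        uint32_t f = (uint32_t)(((uint64_t)marked_bytes << DCTCP_SHIFT)
            / acked_bytes);
        /* alpha <- (1 - g) * alpha + g * F */
        d->alpha = d->alpha - (d->alpha >> DCTCP_G_SHIFT)
            + (f >> DCTCP_G_SHIFT);
        /* cwnd <- cwnd * (1 - alpha / 2) */
        return cwnd - (uint32_t)(((uint64_t)cwnd * d->alpha)
            >> (DCTCP_SHIFT + 1));
    }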
IMPLEMENTATION
WHAT WE DID FOR INCAST
▸ Separated callout scheduling granularity from the hardclock
frequency; fixed callouts so that the timer interrupt can be cleared
when rescheduling
▸ High resolution TCP timestamps - moved from a 1kHz tick-based clock
to a TSC-derived 16MHz clock (~64ns per increment - the fastest rate
allowable given a 120s maximum segment lifetime; sketch below)
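
A minimal sketch of what a TSC-derived 16MHz timestamp clock can look
like; tsc_freq_hz and the helper name are assumptions, and unsigned
__int128 is a GCC/Clang extension used to keep the scaling
overflow-free:

    #include <stdint.h>
    #include <x86intrin.h>      /* __rdtsc() */

    #define TS_HZ 16000000ull   /* ~62.5ns/increment; 2^31 increments
                                 * is ~134s, just beyond the 120s MSL */

    extern uint64_t tsc_freq_hz;    /* calibrated at boot */

    static uint32_t
    tcp_ts_now(void)
    {
        /* scale the raw TSC down to TS_HZ increments */
        return (uint32_t)((unsigned __int128)__rdtsc() * TS_HZ
            / tsc_freq_hz);
    }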
PROBLEMS
UNFORESEEN PROBLEMS PART 1
▸ Cached timers don’t scale down while maintaining monotonicity
▸ scheduling of hardclock doesn’t provide consistent updates
▸ solution: use the TSC-based timer even though it’s more expensive;
if no invariant TSC is available, restrict measured values to integral
multiples of ticks
▸ Idle connections stop working when the timer wraps
▸ if a connection has been idle longer than it takes the timestamp
counter to wrap, the peer will see every new segment as coming
“before” the last segment sent prior to the idle period
▸ Solution: LIE. Increment the last timestamp value sent by a large
value on each segment until the value sent to the peer catches up with
the actual clock (sketch below)
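
A sketch of the lie, assuming 32-bit timestamps and signed modular
comparison; the step size and names are illustrative:

    /*
     * If the 32-bit timestamp counter wrapped while the connection was
     * idle, the real clock now appears "behind" the last value we sent.
     * Keep stepping the sent value forward in large (< 2^31) jumps
     * until, modulo 2^32, the real clock catches back up.
     */
    #include <stdint.h>

    #define TS_LIE_STEP (1u << 28)  /* big jump, well under 2^31 */

    struct ts_state { uint32_t last_sent; };

    static uint32_t
    tcp_ts_to_send(struct ts_state *s, uint32_t now)
    {
        uint32_t ts = now;
        /* signed mod-2^32 compare: does real time appear to run
         * backward from the peer's point of view? */
        if ((int32_t)(ts - s->last_sent) < 0)
            ts = s->last_sent + TS_LIE_STEP;    /* lie, monotonically */
        s->last_sent = ts;
        return ts;
    }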
PROBLEMS
UNFORESEEN PROBLEMS PART 2
▸ SRTT & RTTVAR have much too short a memory - outlier values quickly forgotten leading to
frequent spurious retransmits
▸ First solution - taken from RFC 7323 App. G:
▸ RTTVAR <- (1 - beta’) * RTTVAR + beta’ * |SRTT - R’|; SRTT <- (1 - alpha’) * SRTT + alpha’
* R’
▸ alpha’ = alpha / ExpectedSamples; beta’ = beta / ExpectedSamples;
ExpectedSamples = ceil(FlightSize/ (SMSS *2))
▸ Problem: When pipe is empty or cwnd is reset after loss return to very having short
memory
▸ Second solution: ExpectedSamples = ceil(cwnd / (SMSS * 2))
▸ Problem: When cwnd is reset after loss return to very short memory
▸ Final solution: ExpectedSamples = ceil(max(cwnd, cwnd_prev) / (SMSS *2))
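
A sketch of the final filter in floating point for clarity (a kernel
would use fixed point); the struct and names are ours, the gains are
the RFC 6298 defaults:

    #include <math.h>

    #define ALPHA 0.125
    #define BETA  0.25

    struct rtt_filter {
        double srtt, rttvar;        /* seconds */
        unsigned cwnd, cwnd_prev;   /* bytes */
        unsigned smss;              /* bytes */
    };

    static void
    rtt_filter_sample(struct rtt_filter *f, double r)
    {
        /* ExpectedSamples from max(cwnd, cwnd_prev), never below 1 */
        double n = ceil(fmax(f->cwnd, f->cwnd_prev) / (2.0 * f->smss));
        if (n < 1.0)
            n = 1.0;
        double a = ALPHA / n, b = BETA / n;     /* alpha', beta' */
        f->rttvar = (1.0 - b) * f->rttvar + b * fabs(f->srtt - r);
        f->srtt   = (1.0 - a) * f->srtt + a * r;
    }

Using max(cwnd, cwnd_prev) keeps ExpectedSamples - and therefore the
filter’s memory - large across a loss-driven window reset.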
THE PRODUCT
HOW DID IT TURN OUT?
▸ Target streaming throughput was 15GB/s on 4 nodes
▸ Actually achieved 16-20GB/s, with lower CPU utilization than
InfiniBand (!?)
▸ Next generation (2017-) will be a 100GigE fabric
