How deep is your buffer – Demystifying buffers and application performance
1
March 14, 2017
JR Rivers | Co-founder/CTO
A JOURNEY TO DEEPER UNDERSTANDING
Network Data Path
2
How Much Buffer – the takeaway
If the last bit of performance matters to you, do the testing
§ be careful what you read
If not, take solace…
…the web-scale operators use “small buffer” switches
Network Data Path
3
Tools and Knobs – Show and Tell
Network Data Path
cumulus@server02:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 26
Model name: Intel(R) Xeon(R) CPU L5520 @ 2.27GHz
Stepping: 5
CPU MHz: 1600.000
CPU max MHz: 2268.0000
CPU min MHz: 1600.0000
BogoMIPS: 4441.84
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-15
25GE attached servers, 100G interconnect
[Topology diagram: server01/server02 on leaf01 and server03/server04 on leaf03 at 25G; leaf01/leaf03 uplink at 100G via spine01 toward edge01, exit01, and the Internet; oob-mgmt-server and oob-mgmt-switch provide out-of-band management. The 25G server-to-leaf link is the Link Under Test.]
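Before testing, it is worth checking which NUMA node the NIC hangs off of; lscpu above reports a single node on this box, but on multi-node servers locality matters. A minimal sketch - the interface name and core number below are illustrative, not taken from this lab:

# Which NUMA node is the 25G NIC attached to? (-1 = none/single node)
cat /sys/class/net/enp4s0f1/device/numa_node

# Pin the traffic generator to a core on that node so memory and PCIe
# accesses stay local.
taskset -c 4 iperf3 -c rack-edge01 -p 5201 -t 30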
4
Tools and Knobs - iperf3
Network Data Path
cumulus@server01:~$ iperf3 -c rack-edge01 -p 5201 -t 30
Connecting to host rack-edge01, port 5201
[ 4] local 10.0.1.1 port 34912 connected to 10.0.3.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 2.13 GBytes 18.3 Gbits/sec 433 888 KBytes
[ 4] 1.00-2.00 sec 2.74 GBytes 23.5 Gbits/sec 0 888 KBytes
[ 4] 2.00-3.00 sec 2.74 GBytes 23.5 Gbits/sec 0 1020 KBytes
[ 4] 3.00-4.00 sec 2.74 GBytes 23.5 Gbits/sec 0 1020 KBytes
[ 4] 4.00-5.00 sec 2.74 GBytes 23.5 Gbits/sec 0 1.01 MBytes
[ 4] 5.00-6.00 sec 2.74 GBytes 23.5 Gbits/sec 0 1.02 MBytes
[ 4] 6.00-7.00 sec 2.72 GBytes 23.4 Gbits/sec 0 1.16 MBytes
[ 4] 7.00-8.00 sec 2.72 GBytes 23.4 Gbits/sec 0 1.45 MBytes
[ 4] 8.00-9.00 sec 2.74 GBytes 23.5 Gbits/sec 0 1.46 MBytes
[ 4] 9.00-10.00 sec 2.74 GBytes 23.5 Gbits/sec 0 1.46 MBytes
[ 4] 10.00-11.00 sec 2.74 GBytes 23.5 Gbits/sec 0 1.46 MBytes
[ 4] 11.00-12.00 sec 2.74 GBytes 23.5 Gbits/sec 0 1.46 MBytes
[ 4] 12.00-13.00 sec 2.74 GBytes 23.5 Gbits/sec 0 1.46 MBytes
[ 4] 13.00-14.00 sec 2.73 GBytes 23.5 Gbits/sec 0 1.57 MBytes
[ 4] 14.00-15.00 sec 2.72 GBytes 23.4 Gbits/sec 0 1.76 MBytes
[ 4] 15.00-16.00 sec 2.73 GBytes 23.4 Gbits/sec 0 1.76 MBytes
[ 4] 16.00-17.00 sec 2.73 GBytes 23.4 Gbits/sec 0 1.76 MBytes
[ 4] 17.00-18.00 sec 2.73 GBytes 23.4 Gbits/sec 0 1.76 MBytes
[ 4] 18.00-19.00 sec 2.72 GBytes 23.4 Gbits/sec 0 1.76 MBytes
[ 4] 19.00-20.00 sec 2.73 GBytes 23.4 Gbits/sec 0 1.76 MBytes
[ 4] 20.00-21.00 sec 2.73 GBytes 23.4 Gbits/sec 0 1.76 MBytes
[ 4] 21.00-22.00 sec 2.72 GBytes 23.4 Gbits/sec 0 1.76 MBytes
[ 4] 22.00-23.00 sec 2.72 GBytes 23.4 Gbits/sec 0 1.76 MBytes
[ 4] 23.00-24.00 sec 2.72 GBytes 23.4 Gbits/sec 1 1.76 MBytes
[ 4] 24.00-25.00 sec 2.74 GBytes 23.5 Gbits/sec 0 1.76 MBytes
[ 4] 25.00-26.00 sec 2.74 GBytes 23.5 Gbits/sec 0 1.76 MBytes
[ 4] 26.00-27.00 sec 2.74 GBytes 23.5 Gbits/sec 0 1.76 MBytes
[ 4] 27.00-28.00 sec 2.73 GBytes 23.4 Gbits/sec 0 1.76 MBytes
[ 4] 28.00-29.00 sec 2.65 GBytes 22.8 Gbits/sec 0 1.76 MBytes
[ 4] 29.00-30.00 sec 2.73 GBytes 23.5 Gbits/sec 0 1.76 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-30.00 sec 81.3 GBytes 23.3 Gbits/sec 434 sender
[ 4] 0.00-30.00 sec 81.3 GBytes 23.3 Gbits/sec receiver
iperf Done.
top - 17:10:44 up 21:55, 2 users, load average: 0.21, 0.07, 0.02
Tasks: 216 total, 1 running, 215 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 0.7 us, 30.8 sy, 0.0 ni, 67.9 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu5 : 0.0 us, 4.0 sy, 0.0 ni, 95.3 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu6 : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu12 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu13 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu14 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu15 : 0.4 us, 41.6 sy, 0.0 ni, 46.9 id, 0.0 wa, 0.0 hi, 11.1 si, 0.0 st
KiB Mem : 74224280 total, 73448200 free, 498208 used, 277872 buff/cache
KiB Swap: 75486208 total, 75486208 free, 0 used. 73183560 avail Mem
Note - bandwidth is reported as TCP payload (goodput), so 23.5 Gbits/sec is wire-speed 25G Ethernet once header and framing overhead are accounted for
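For reference, the listener side, plus a parallel-stream variant that was not part of the run above but is standard iperf3 usage - a sketch:

# On rack-edge01: start the listener on the test port.
iperf3 -s -p 5201

# On server01: four parallel streams (-P 4) often expose fan-in and buffer
# behavior that a single stream hides; -J emits JSON for scripted analysis.
iperf3 -c rack-edge01 -p 5201 -t 30 -P 4 -J > iperf3-p4.json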
5
Tools and Knobs – tcpdump
Network Data Path
cumulus@edge01:~/pcaps$ sudo tcpdump -i enp4s0f1 -w single.pcap tcp port 5201
tcpdump: listening on enp4s0f1, link-type EN10MB (Ethernet), capture size 262144 bytes
1098 packets captured
1098 packets received by filter
0 packets dropped by kernel
cumulus@server01:~$ iperf3 -c rack-edge01 -p 5201 -t 2 -b 50M
Connecting to host rack-edge01, port 5201
[ 4] local 10.0.1.1 port 34948 connected to 10.0.3.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 5.46 MBytes 45.8 Mbits/sec 21 109 KBytes
[ 4] 1.00-2.00 sec 5.88 MBytes 49.3 Mbits/sec 29 70.7 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-2.00 sec 11.3 MBytes 47.5 Mbits/sec 50 sender
[ 4] 0.00-2.00 sec 11.3 MBytes 47.5 Mbits/sec receiver
iperf Done.
cumulus@edge01:~/pcaps$ tcpdump -r single.pcap
reading from file single.pcap, link-type EN10MB (Ethernet)
07:52:57.600873 IP rack-server01.34946 > rack-edge01.5201: Flags [SEW], seq 1655732583, win 29200, options [mss 1460,sackOK,TS val 33182573 ecr 0,nop,wscale 7], length 0
07:52:57.600900 IP rack-edge01.5201 > rack-server01.34946: Flags [S.E], seq 319971738, ack 1655732584, win 28960, options [mss 1460,sackOK,TS val 56252912 ecr 33182573,nop,wscale 7], length 0
07:52:57.601133 IP rack-server01.34946 > rack-edge01.5201: Flags [.], ack 1, win 229, options [nop,nop,TS val 33182573 ecr 56252912], length 0
07:52:57.601160 IP rack-server01.34946 > rack-edge01.5201: Flags [P.], seq 1:38, ack 1, win 229, options [nop,nop,TS val 33182573 ecr 56252912], length 37
07:52:57.601169 IP rack-edge01.5201 > rack-server01.34946: Flags [.], ack 38, win 227, options [nop,nop,TS val 56252912 ecr 33182573], length 0
07:52:57.601213 IP rack-edge01.5201 > rack-server01.34946: Flags [P.], seq 1:2, ack 38, win 227, options [nop,nop,TS val 56252912 ecr 33182573], length 1
07:52:57.601412 IP rack-server01.34946 > rack-edge01.5201: Flags [.], ack 2, win 229, options [nop,nop,TS val 33182573 ecr 56252912], length 0
07:52:57.601419 IP rack-server01.34946 > rack-edge01.5201: Flags [P.], seq 38:42, ack 2, win 229, options [nop,nop,TS val 33182573 ecr 56252912], length 4
07:52:57.640098 IP rack-edge01.5201 > rack-server01.34946: Flags [.], ack 42, win 227, options [nop,nop,TS val 56252922 ecr 33182573], length 0
...
Make sure your capture hosts and pcap filters can keep up - dropped packets will skew the analysis!
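Two tcpdump knobs help with that at 25G - a sketch, with a buffer size to tune rather than copy:

# -s 96 truncates each packet to 96 bytes: headers are enough for TCP
# analysis and the capture keeps up far more easily at high rates.
# -B sets the kernel capture buffer (KiB) to absorb bursts.
sudo tcpdump -i enp4s0f1 -s 96 -B 65536 -w single.pcap tcp port 5201

Then check the exit summary: "0 packets dropped by kernel" is the line that matters.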
6
Tools and Knobs - wireshark
Network Data Path
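(The body of this slide is a Wireshark screenshot that does not survive text extraction.) The same analysis works from the terminal with tshark, Wireshark's CLI - a sketch against the pcap from the previous slide:

# Throughput in one-second buckets, straight from the capture file.
tshark -r single.pcap -q -z io,stat,1

# The same buckets, counting only retransmissions.
tshark -r single.pcap -q -z io,stat,1,tcp.analysis.retransmission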
7
Tools and Knobs – tcpprobe
Network Data Path
Column Contents
1 Kernel Timestamp
2 Source_IP:port
3 Destination_IP:port
4 Packet Length
5 Send Next
6 Send Unacknowledged
7 Send Congestion Window
8 Slow Start Threshold
9 Send Window
10 Smoothed RTT
11 Receive Window
cumulus@server01:~$ sudo modprobe tcp_probe port=5201 full=1
cumulus@server01:~$ sudo chmod oug+r /proc/net/tcpprobe
cumulus@server01:~$ cat /proc/net/tcpprobe > /tmp/tcpprobe.out &
[1] 6921
cumulus@server01:~$ iperf3 -c edge01-hs -t 5
...
snip
...
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-5.00 sec 13.0 GBytes 22.2 Gbits/sec 538 sender
[ 4] 0.00-5.00 sec 12.9 GBytes 22.2 Gbits/sec receiver
iperf Done.
cumulus@server01:~$ kill 6921
cumulus@server01:~$ head -10 /tmp/tcpprobe.out
4.111198452 10.0.0.2:45520 10.0.0.5:5201 32 0x358a629a 0x3589f17a 20 2147483647 57984 142 29312
4.111461826 10.0.0.2:45520 10.0.0.5:5201 32 0x358ad962 0x358a629a 21 20 115840 161 29312
4.111731474 10.0.0.2:45520 10.0.0.5:5201 32 0x358b55d2 0x358ad962 22 20 171648 173 29312
4.112000993 10.0.0.2:45520 10.0.0.5:5201 44 0x358bd7ea 0x358b55d2 23 20 170880 185 29312
4.112037126 10.0.0.2:45520 10.0.0.5:5201 32 0x358c107a 0x358b55d2 16 16 225920 195 29312
4.112260554 10.0.0.2:45520 10.0.0.5:5201 44 0x358c5faa 0x358c1622 17 16 275200 188 29312
4.112278958 10.0.0.2:45520 10.0.0.5:5201 32 0x358c983a 0x358c1622 23 20 275200 188 29312
4.112533754 10.0.0.2:45520 10.0.0.5:5201 32 0x358ced12 0x358c326a 16 16 338944 202 29312
4.112842106 10.0.0.2:45520 10.0.0.5:5201 44 0x358d63da 0x358d03b2 17 16 396800 202 29312
4.112854569 10.0.0.2:45520 10.0.0.5:5201 32 0x358d63da 0x358d03b2 23 20 396800 202 29312
Note that the smoothed RTT is ~200 µsec even with no competing traffic!
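With the column map above, plotting the congestion window is a one-liner - a sketch, assuming gnuplot is installed:

# Column 1 = kernel timestamp, column 7 = send congestion window.
awk '{ print $1, $7 }' /tmp/tcpprobe.out > cwnd.dat

# Quick ASCII plot of cwnd over time.
gnuplot -e "set terminal dumb; plot 'cwnd.dat' using 1:2 with lines title 'cwnd'"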
8
Tools and Knobs – TCP congestion algorithms and socket stats
Network Data Path
cumulus@server01:~$ ls /lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp*
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_bic.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_cdg.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_dctcp.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_diag.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_highspeed.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_htcp.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_hybla.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_illinois.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_lp.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_probe.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_scalable.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_vegas.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_veno.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_westwood.ko
/lib/modules/4.4.0-45-generic/kernel/net/ipv4/tcp_yeah.ko
cumulus@server01:~$ cat /proc/sys/net/ipv4/tcp_congestion_control
cubic
cumulus@server01:~$ ss --tcp --info dport = 5201
State   Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
ESTAB   0       2480400 10.0.0.2:45524       10.0.0.5:5201
cubic wscale:7,7 rto:204 rtt:0.137/0.008 mss:1448 cwnd:450 ssthresh:336
bytes_acked:25460316350 segs_out:17583731 segs_in:422330 send 38049.6Mbps
lastrcv:122325132 unacked:272 retrans:0/250 reordering:86 rcv_space:29200
Linux default since 2.6.19
param value
wscale 7,7
rto 204
rtt 0.137/0.008
mss 1448
cwnd 450
ssthresh 336
bytes_acked 25460316350
segs_out 17583731
segs_in 422330
send 38049.6Mbps
lastrcv 122325132
unacked 272
retrans 0/250
reordering 86
rcv_space 29200
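Swapping algorithms is a modprobe plus a sysctl - a sketch; the per-route form needs a reasonably recent iproute2, and the prefix below is illustrative:

# Load an alternative algorithm; it then appears in the available list.
sudo modprobe tcp_dctcp
cat /proc/sys/net/ipv4/tcp_available_congestion_control

# Change the system-wide default...
sudo sysctl -w net.ipv4.tcp_congestion_control=dctcp

# ...or set it per destination and leave the default alone.
sudo ip route change 10.0.3.0/24 dev enp4s0f1 congctl dctcp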
9
Tools and Knobs – NIC Tuning
Network Data Path
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.ipv4.tcp_sack
net.ipv4.tcp_sack = 1
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.core.netdev_max_backlog
net.core.netdev_max_backlog = 25000
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.core.rmem_max
net.core.rmem_max = 4194304
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.core.wmem_max
net.core.wmem_max = 4194304
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.core.rmem_default
net.core.rmem_default = 4194304
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.core.wmem_default
net.core.wmem_default = 4194304
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 87380 4194304
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096 65536 4194304
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.ipv4.tcp_low_latency
net.ipv4.tcp_low_latency = 1
cumulus@edge01:/proc/sys/net/ipv4$ sysctl net.ipv4.tcp_adv_win_scale
net.ipv4.tcp_adv_win_scale = 1
http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
10
Tools and Knobs – TCP Tuning
Network Data Path
cumulus@edge01:/proc/sys/net/ipv4$ ls tcp_*
tcp_abort_on_overflow tcp_keepalive_probes tcp_reordering
tcp_adv_win_scale tcp_keepalive_time tcp_retrans_collapse
tcp_allowed_congestion_control tcp_limit_output_bytes tcp_retries1
tcp_app_win tcp_low_latency tcp_retries2
tcp_autocorking tcp_max_orphans tcp_rfc1337
tcp_available_congestion_control tcp_max_reordering tcp_rmem
tcp_base_mss tcp_max_syn_backlog tcp_sack
tcp_challenge_ack_limit tcp_max_tw_buckets tcp_slow_start_after_idle
tcp_congestion_control tcp_mem tcp_stdurg
tcp_dsack tcp_min_rtt_wlen tcp_synack_retries
tcp_early_retrans tcp_min_tso_segs tcp_syncookies
tcp_ecn tcp_moderate_rcvbuf tcp_syn_retries
tcp_ecn_fallback tcp_mtu_probing tcp_thin_dupack
tcp_fack tcp_no_metrics_save tcp_thin_linear_timeouts
tcp_fastopen tcp_notsent_lowat tcp_timestamps
tcp_fastopen_key tcp_orphan_retries tcp_tso_win_divisor
tcp_fin_timeout tcp_pacing_ca_ratio tcp_tw_recycle
tcp_frto tcp_pacing_ss_ratio tcp_tw_reuse
tcp_fwmark_accept tcp_probe_interval tcp_window_scaling
tcp_invalid_ratelimit tcp_probe_threshold tcp_wmem
tcp_keepalive_intvl tcp_recovery tcp_workaround_signed_windows
tcp_ecn - INTEGER
Control use of Explicit Congestion Notification (ECN) by TCP.
ECN is used only when both ends of the TCP connection indicate
support for it. This feature is useful in avoiding losses due
to congestion by allowing supporting routers to signal
congestion before having to drop packets.
Possible values are:
0 Disable ECN. Neither initiate nor accept ECN.
1 Enable ECN when requested by incoming connections and
also request ECN on outgoing connection attempts.
2 Enable ECN when requested by incoming connections
but do not request ECN on outgoing connections.
Default: 2
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
11
Live Action Time!!!!
12
Tools and Knobs – What’s next for me
Find/write a good “mice” traffic generator (see the timing sketch below)
§ modify iperf3 to include mean-time-to-completion with blocks
DCTCP with both ECN and Priority Flow Control
§ High performance fabrics combine end-to-end congestion
management and lossless links
InfiniBand, Fibre Channel, PCIe, NUMAlink, etc.
Network Data Path
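Until then, a crude stand-in for a mice generator: time many short iperf3 transfers and average the flow-completion time. A sketch - the host, port, flow size, and count are all illustrative, and each run also pays TCP connection-setup cost:

# 100 "mice" of 64 KiB each; GNU time prints elapsed seconds to stderr.
for i in $(seq 1 100); do
    /usr/bin/time -f "%e" iperf3 -c rack-edge01 -p 5201 -n 64K > /dev/null
done 2>&1 | awk '{ sum += $1; n++ } END { printf "mean completion: %.3f s over %d flows\n", sum/n, n }'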
13
How Much Buffer – the takeaway
If the last bit of performance matters to you, do the testing
§ be careful what you read
If not, take solace…
…the web-scale operators use “small buffer” switches
Network Data Path
14
Thank you!
Visit us at cumulusnetworks.com or follow us @cumulusnetworks
© 2017 Cumulus Networks. Cumulus Networks, the Cumulus Networks Logo, and Cumulus Linux are trademarks or registered trademarks of Cumulus
Networks, Inc. or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners. The registered trademark
Linux® is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide basis.