Guarding Fast Data Delivery in Cloud: an Effective Approach to Isolating
Performance Bottleneck During Slow Data Delivery
Zhenyun Zhuang, Haricharan Ramachandra, Badri Sridharan
{zzhuang, hramachandra, bsridharan}@linkedin.com
LinkedIn Corporation, 2029 Stierlin Court Mountain View, CA 94043 United States
Abstract—Cloud-based products rely heavily on fast data
delivery between data centers and remote users; when data
delivery is slow, product performance is crippled. When
slow data delivery occurs, engineers need to investigate the issue
and find the root cause. The investigation requires experience
and time, as data delivery involves multiple moving parts:
the sender, the receiver, and the network.
To facilitate these investigations, we propose an algorithm
to automatically identify the performance bottleneck. The
algorithm aggregates information from multiple layers of the
data sender and receiver, and isolates the problem type by
identifying which of the sender, receiver, or network is the
bottleneck. After isolation, successive efforts can be taken
to root-cause the exact problem. We also build a prototype
to demonstrate the effectiveness of the algorithm.
I. INTRODUCTION
Cloud-based products (e.g., cloud storage products
such as Amazon S3) involve data transfer between cloud
data centers and remote users. For cloud environments,
fast data delivery is critical to ensuring high-performing
data products, as it translates to higher data throughput
and hence lower response times as experienced by users.
However, despite many techniques aimed at optimizing
data delivery over the Internet (e.g., web page optimization,
network acceleration), a critical problem we have continuously
experienced is slow data delivery. This problem takes
two forms: (1) data is not delivered at all; and (2) data delivery
is slow (i.e., it takes a long time to complete). For
example, consider a web client that is browsing a web page
and needs to download a JPEG file. In the first form, the JPEG
file is never received by the client. In the second form, the
client receives the JPEG file, but the download takes too long.
When slow data transfer happens, it cripples application
performance, and engineers need to investigate and
root-cause the issue. There are many possible causes. For
instance, the data sender may be overloaded and hence not
sending data to the downstream receiver; the network channel
may be slow; or the data receiver may be too busy to read
data from the network buffer. Based on our experience,
root-causing such issues is tedious and requires substantial
experience and expertise. Broadly, the possible causes of
such a problem can be classified into three types: (1) client
application causes; (2) network-side causes; and (3) server
application causes.
When carrying out such investigations, engineers typically
need to quickly isolate the different types of causes so that
they can focus on the suspicious component and perform
deeper analysis to find the root cause. This is a challenging
task for several reasons. First, the data delivery involves
multiple network entities, including two machines (i.e., the
sender and receiver) and the network routes between them, in
sharp contrast to the usual performance issues where only a
single machine is involved. Second, the diagnosis involves
multiple layers of information, including the application layer
and the transport layer. To isolate the causes, engineers have to
check various places, including the client log, server log, network
statistics, CPU usage, etc. Such checking takes much
time and effort, and often requires experience and
expertise from the performance engineers.
To save time and effort, performance engineers are
eager for more intelligent tools that help them quickly
isolate the root causes. Though the exact causes can vary
widely across scenarios, a quick isolation of the problem
type would still greatly help the engineers who investigate
the problem.
In this work, we specifically focus on fulfilling such a
request: quickly identifying the component to blame,
be it the sender, the receiver, or the network. This work
presents an algorithm and a prototype to help performance
engineers. The algorithm automatically isolates the root
cause when slow data delivery occurs. Once the component
to blame is identified, successive investigations can be
conducted to nail down the exact problem. These are outside
the scope of this work and are part of our future work.
For the remainder of this paper, after providing the
necessary technical background in Section II, we define
and motivate the problems being addressed using three
scenarios in Section III. We then present the designs in
Section IV. Based on the design principles, we propose
the solutions and deployment modes in Section V. We
describe a prototype and its performance evaluation in
Section VI. We then present related work in Section VII.
Finally, Section VIII concludes the work.
II. BACKGROUND AND SCOPE
A. Background
TCP transport protocol: The Transmission Control Protocol
(TCP) is a transport-layer protocol that provides
ordered and reliable delivery of streamed bytes. It is the
most widely used transport protocol today. TCP features
flow control to avoid overloading the receiver: the receiver
sets up a dedicated receive buffer, and the sender sets
up a corresponding send buffer. TCP also has congestion
control, retransmission mechanisms, etc., to ensure reliable
and fast data transfer.
Application protocol: Application protocols such as
HTTP are built on top of transport protocols. When application
performance suffers, the immediate symptoms are typically
higher latency and lower throughput. Though these two
symptoms are often related, we specifically focus on solving
the slow data transfer (i.e., low throughput) problem.
Depending on the application protocols and how applications
are built, the application layers may emit additional logs. For
instance, the data receiver may log the timestamps of sending
data requests, receiving the first byte of the requested data,
and receiving the last byte of the requested data. Similarly,
the data sender may log the timestamps of receiving the
data requests, sending the first byte of the requested data, and
sending the last byte of the requested data.
B. Scope
To understand the causes of the low-performing data
delivery problem, we first need to understand the data
delivery model, which determines the flow of bytes. For simplicity
of presentation, we assume the “client” is the data receiver,
while the “server” is the data sender. This assumption is
in line with today’s web browsing paradigm, where web
browsers are data receivers and web servers are data senders.
Generally, there are two models of data delivery: pull-based
and push-based. In the pull-based model, the client sends
a request to the server, and the server sends back a response
corresponding to the client request. In the push-based model, the
server pushes data to the client without the client explicitly
asking for it. The pull-based model is the more complicated
of the two, as the data flow of the push-based model is a subset
of that of the pull-based model. In other words, the only
difference between the two models is that the pull-based model
contains the “data request” phase, as shown in Figure 1.
Though data downloading can be carried over various
protocols including TCP and UDP, most of today’s data is
transferred over TCP, given the dominance of Internet
services. Hence, in this work we focus only on TCP-based
data downloading. For easier presentation, we use Linux
platforms to present our designs and solutions, due to Linux’s
popularity. However, the relevant designs and solutions
also apply to other platforms.
[Figure 1 contrasts the two delivery models: in the push model (a), the server sends the application “response” unprompted; in the pull model (b), the client first sends an application “request” and the server then returns the “response”.]
Figure 1. Push model and Pull model
III. PROBLEM DEFINITION AND MOTIVATION
SCENARIOS
We first define and demonstrate the problem we want to
solve using three motivating scenarios.
A. Problem definition
We use C to denote the client, which receives the data, and
S to denote the server, which sends the data back to C. We
first consider the pull-based data delivery model, where
for any data delivery, C sends a request Rq to S first; after
receiving Rq, S prepares the response data Rs and sends it
back to C.
We assume the time taken in good data delivery scenarios
is Tg, and denote the maximum of Tg as Tgmax. The
delivery time is calculated as the time difference between
when C’s application sends out Rq and when that application
receives the last byte of Rs with a read() call. Denoting the
actual data delivery time as Ta, we say that a particular data
delivery is slow when Ta > Tgmax.
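The criterion Ta > Tgmax can be expressed in a few lines of Python. This is a sketch with our own illustrative names (`is_slow` and its arguments are not from the prototype):

```python
def is_slow(ta_seconds: float, good_times_seconds: list) -> bool:
    """Return True when the actual delivery time Ta exceeds Tgmax,
    the maximum time observed across known-good deliveries."""
    tg_max = max(good_times_seconds)  # Tgmax
    return ta_seconds > tg_max
```

In practice Tgmax would be calibrated from a baseline of deliveries observed under normal conditions.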
B. Results of three motivating scenarios
To further illustrate the nature of the problem we are
addressing, we set up a simple experiment to demonstrate the
application symptoms as well as the root causes of slow data
downloading. In the experiment, a client (receiver) opens a
TCP connection to download a bulk of data from a server.
The environment is a Gigabit LAN in which the client
machine and the server machine are directly connected. The
RTT (round-trip time) is sub-millisecond. Hence, in
typical (“good”) data transfer scenarios, the expected data
transmission rate should be no less than 1 MB/s based on
common practice.
For each scenario, we measure the data delivery rate
on the client (receiver) side at the application level. The three
scenarios differ in where the performance bottleneck lies,
namely, the client (receiver), the server (sender), or the
network. Each of these bottlenecks can cause slow data
delivery. We plot the accumulated bytes received by the
receiver (at the application level) over time, where the time
line is in hour:minute:second format.
Figure 2. Similar symptoms of slow data downloading, but caused by
different types of bottlenecks: Server (a), Client (b) and Network (c)
1) Receiver is the bottleneck: In this scenario, the receiver
application is not reading fast enough. The downloading
progress as measured by the receiver is plotted in
Figure 2(a). The average throughput is about 3 KB/second.
2) Sender is the bottleneck (the application is not sending
fast enough): In this scenario, the sender application is
not sending fast enough. The average throughput is about
3 KB/second, as shown in Figure 2(b).
3) Network is the bottleneck (the network is not transmitting
fast enough): In this scenario, we introduce packet
losses into the network, and the network protocol is not
transmitting fast enough. The average throughput is about
5 KB/second, as shown in Figure 2(c).
C. Summary
We have demonstrated that in all three scenarios, the
data transfer is slow: it takes up to one minute to download
the 200 KB of data. Though the exact throughput varies across
the three scenarios, the application symptoms are similar,
regardless of which of the three bottlenecks is present.
IV. DESIGN
A. Design Overview
We observe that the various issues causing slow data transfer
between machines can be largely classified into three types based
on where the problem lies: (1) the sender application; (2) the
receiver application; or (3) the network channel. We aim
to quickly and automatically decide which type of problem
is causing the slow data downloading.
We also note that the pull model is a superset of the push
model: the push model is exactly the response-data-delivery
part of the pull model. Hence we break our
solution into two parts. First, we focus on the push model
and derive the algorithm based on observations from
experiments and our analysis of network protocols.
This solution relies on transport-layer knowledge of the network
buffer queue sizes. Second, building on the push-model
solution, we extend it to solve the pull-model
problem. The extended solution introduces a state machine and, by
incorporating application-layer knowledge, can deduce
the bottleneck that causes the slow data transfer.
B. Design for the push model
Considering the data delivery process in the push model, the
typical data flow of any TCP-based data transfer is illustrated
in Figure 3. The five steps, in terms of system calls and
network transmissions, are as follows. (1) At Step A, the server
application issues a write() system call, and the application
data is copied to the socket send buffer. (2) At Step B, the
server’s TCP layer issues a send() and transmits some data
to the network; the amount of data is subject to TCP’s
congestion control and flow control. (3) At Step C, the network
routes the data hop by hop to the receiver; the IP routing
protocol is in play in this part. (4) At Step D, the client’s
TCP layer receives the data via recv() and places it in the
receive buffer. (5) At Step E, the client
application issues a read() call to receive the data and copy
it to user space.
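The five steps can be illustrated with a minimal, self-contained Python example over a local socket pair. This is our own sketch, not the paper's code: sendall() on the server end covers steps A-B (copy to the send buffer, then transmit), the kernel's loopback transport stands in for the network routing of step C, and recv() on the client end covers steps D-E (drain the receive buffer into user space).

```python
import socket

# Create a connected pair of sockets standing in for server and client.
server_sock, client_sock = socket.socketpair()

payload = b"x" * 4096
server_sock.sendall(payload)   # steps A-B: copy to send buffer, transmit
server_sock.close()            # signal end of stream

received = bytearray()
while True:
    chunk = client_sock.recv(1024)  # steps D-E: read from receive buffer
    if not chunk:                   # empty read: sender closed the stream
        break
    received.extend(chunk)
client_sock.close()
```

If the client-side loop is delayed (the receiver bottleneck of Section III), unread bytes accumulate in the kernel receive buffer, which is exactly what the queue-size checks below detect.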
Figure 3. The design for push model

Figure 4. Time line of pull model data transfer

During this course, there could be three types of problems:
(1) Server-application bottleneck. The server application
may not have the data ready, which can be caused by multiple
reasons. For instance, the server machine may be over-committed
(e.g., running multiple applications), so the
particular application does not get a chance to write(); the
server application may have too many threads running, so
the write() thread is not scheduled on a CPU; or the
application’s data-preparation business logic may be slow.
(2) Client-application bottleneck. The client application may
be too busy to issue read() calls. Similar to the server side,
possible causes include machine over-commitment, and so
on. (3) Network-side bottleneck. The network (including
the TCP protocol) is unable to deliver data fast enough.
Possible causes are a poor network channel (e.g., high losses,
low bandwidth) or poor TCP/IP configuration or tuning.
Invariably, the symptom is that data cannot be pushed
through to the receiver side.
After careful analysis of the symptoms and the causes, we
propose to isolate the problem into the above three types based
on knowledge of the TCP sockets, specifically, the queue
lengths of the send and receive buffers.
In normal scenarios where data delivery is fast, the
receiver’s receive queue size should mostly be zero, indicating
that the receiver is fast enough to consume the received data
by copying it from the socket buffer to user space. The
sender’s send queue should be non-zero, indicating that
data is produced fast enough to keep up with
consumption. On the other hand, a non-zero receive queue
indicates a client-side bottleneck, and a zero send queue
indicates a server-side bottleneck.
When data transfer is indeed slow, and neither the server
application nor the client application is slowing down
the data transfer, we can conclude that the data transmission
(i.e., the network) between the machines is slow. Such a
network-side issue may involve both ends of the transmission. Also
note that this type of issue is not limited to the networking
or transport protocols themselves (e.g., TCP/IP); it may be
caused by a slowdown of the sender/receiver OS (e.g., in
TCP/IP protocol processing).
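The queue-size rule above can be condensed into a small classification function. This is a sketch with our own names; in particular, the priority order when both conditions hold (non-zero receive queue and zero send queue) is our own choice, since the rules as stated overlap in that corner case:

```python
def classify_bottleneck(recv_q: int, send_q: int) -> str:
    """Apply the push-model queue-size rule.

    recv_q: client-side receive queue size (bytes)
    send_q: server-side send queue size (bytes)
    """
    if recv_q > 0:        # receiver not draining its buffer fast enough
        return "client"
    if send_q == 0:       # sender not producing data fast enough
        return "server"
    return "network"      # data backed up in the send queue but not delivered
```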
C. Design for the pull model
The solution for the pull model is similar to the push-model
solution, with the key difference of the added client-request
phase. The server does not send back data until it receives the
data request. We illustrate the time line of a typical pull-based
data transfer in Figure 4. When the client needs to perform a
pull-based data delivery, it first sends out a request at T0; after the
network delivers the request to the server at T1, the server
prepares the data, followed by sending back the data starting
at T2 via a sequence of write() calls. The sending completes
at T4. After network transmission, the client begins receiving
data at T3, and completes receiving at T5. Note that though the
other timestamps are strictly ordered, the ordering of T3 and
T4 may vary depending on the exact scenario. Specifically,
for a small data transfer, T4 may precede T3, since a single
write() call may suffice. For a large data transfer, T3 typically
precedes T4.
From the above process, we can see that if the data
request fails to reach the server, the data delivery will not
happen at all. Though the symptom of this scenario differs
from the scenario where the delivery is merely slow, we
decide to solve both scenarios, since in both cases the data is
not completely delivered. To provide a complete solution to the
pull-model problem, we need to distinguish between the
possible failing components (i.e., client/network/server) for
the client-request phase as well.
We propose a state-machine-based algorithm in the following.
Figure 5 is a diagram illustrating the states monitored
by the state machine.
The state machine narrows the location of a problem in
a data transfer operation to one of three realms: a sender
realm that encompasses the sender or provider of the data
(e.g., a server), a receiver realm that encompasses the
recipient or receiver of the data (e.g., a client), and a network
realm that encompasses the communication link(s) that convey the
transferred data. In Figure 5, the states are depicted with
or near the component(s) whose action or actions cause a
change from one state to another.
A state engine process is fed the necessary knowledge
to monitor and identify the progress of a data transfer from
start (state S for a pull-based transfer, or state C
for a push-based transfer) to finish (state G). This may
require the OSes of the two entities (e.g., client and server
in Figure 5) and the applications that use the data (e.g.,
client application, server application) to emit certain types of
information at certain times. For instance, the receiver (e.g.,
the recipient’s OS and/or application) logs events at one
or more protocol layers, such as generation and dispatch of
a data request, transmission of the request from the receiver
machine, receipt of the first portion of data, and receipt of the
last portion of the data. Similarly, the data provider (e.g.,
the provider’s OS and/or application) logs events at one
or more protocol layers, such as receipt of a data request,
preparation of the data, dispatch of the first portion of the
data, and dispatch of the final portion of the data.
V. SOLUTION
We now present the detailed algorithm, which solves the
problem of slow data transfer for both the pull-model and
the push-model.
[Figure 5 depicts the client and server applications with their application-layer protocols, transport-layer protocols, and communication buffers, connected by communication links; the states S, A, B, C, D, E, F, and G are placed near the components whose actions trigger the corresponding transitions.]
Figure 5. The design for pull model
A. Solution Overview
Our proposed algorithm aggregates information from
the sender and receiver nodes, as shown in Figure 6. For
each node, information from both the application layer
and the transport layer is collected and utilized. The heart of
the algorithm is the Bottleneck Determination Engine (BDE),
which determines the bottleneck. Internally, the algorithm
maintains the current state of the data transfer and performs
state transitions when appropriate, handled by the
State Transition Engine (STE). The state transitions are based
on the information collected by the Information Collection
(IC) component, which collects and aggregates information from
both ends of the data transfer.
The algorithm we present follows these key design
principles:
• Distributed information aggregation: the solution utilizes
both client and server knowledge in order to piece
together the entire picture and identify the causes.
• Cross-layer information aggregation: the solution utilizes
both application-layer knowledge (e.g., the various
types of application logs) and transport-layer knowledge
(e.g., the send-queue and receive-queue sizes).
• State-machine-based expert system: the state transitions
are triggered by the aggregated knowledge from different
machines and different layers.
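The per-node, cross-layer inputs that the IC component aggregates can be modeled as a small record. The following sketch uses our own illustrative field names, not the prototype's data structures:

```python
from dataclasses import dataclass, field

@dataclass
class NodeInfo:
    """Cross-layer snapshot of one end of the transfer (illustrative)."""
    role: str                 # "sender" or "receiver"
    app_events: list = field(default_factory=list)  # application-layer log events
    send_queue: int = 0       # transport-layer send queue size, bytes
    recv_queue: int = 0       # transport-layer receive queue size, bytes

# One aggregated snapshot, as the IC might hand it to the STE and BDE.
sender = NodeInfo("sender", app_events=["last_data_sent"], send_queue=14480)
receiver = NodeInfo("receiver")
snapshot = {"sender": sender, "receiver": receiver}
```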
[Figure 6 shows the Information Collection component gathering application-layer and transport-layer information from both the sender node and the receiver node, and feeding the State Transition Engine and the Bottleneck Determination Engine.]
Figure 6. High level design of the algorithm
B. Collecting Queue Size Information
In order for the algorithm to identify the bottleneck, it
needs to gather information about the transport-layer queue sizes
on both the sender and receiver sides. There are many
ways to collect such information. In this work, we list two
example utilities, netstat [1] and ss [2], which can be issued
as commands on the hosts. (a) netstat (network statistics [1]) is a
command-line tool that displays network connections (both incoming and
outgoing) and network protocol statistics. For the purposes
of this work, we utilize the send-queue and receive-queue
sizes of the TCP/IP sockets. This tool is available
on many OSes, including Unix and Windows. (b) The ss command
[2] shows socket statistics. It can display statistics for
TCP as well as other types of sockets. Similar to netstat,
ss can display the send- and receive-queue sizes, which
can be used by our algorithm.
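A typical connection row of `ss -tn` output carries the queue sizes in its second and third columns (State, Recv-Q, Send-Q, local address, peer address). A minimal parser might look as follows; column order can vary across ss versions, so treat this as a sketch rather than a robust implementation:

```python
def parse_ss_line(line: str):
    """Parse one connection row of `ss -tn`-style output into
    (recv_q, send_q, local_addr, peer_addr). Assumes the common
    column order: State, Recv-Q, Send-Q, Local, Peer."""
    fields = line.split()
    _state, recv_q, send_q, local, peer = fields[:5]
    return int(recv_q), int(send_q), local, peer

# Example row (illustrative addresses and queue values).
sample = "ESTAB  0  14480  10.1.2.3:10001  10.4.5.6:36885"
recv_q, send_q, local, peer = parse_ss_line(sample)
```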
When utilizing these tools to collect transport-level queue
sizes, the collection process should meet the following
requirements. First, it should continuously output the queue
sizes of the TCP/IP sockets of interest for the duration of the
slow data transfer. If the collection tools (e.g., netstat and
ss) can only display instantaneous values at a single point in
time, they need to be invoked repeatedly to gather multiple
data points over the course of the data transfer.
Second, invoking the collection tools incurs overhead,
so they should be invoked without exerting too
much load on the system. To achieve this objective,
appropriate delays (e.g., 10 seconds) can be injected between
invocations.
Third, the delays between collection invocations should
not be fixed, to avoid synchronizing with the clocking of
application reads or TCP receives. For instance, if the receiver
application is designed to read the data every 10 seconds, and
the delays between collection invocations happen to also be
10 seconds, landing right after each application read, then the
collected queue sizes may always be zero, leading to the conclusion
that the receiver is not the bottleneck when in reality it
could be. Similarly, the sending application
or the TCP/IP stack may have similar clocking, which likewise
needs to be avoided by varying the delays injected between
collection invocations.
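The randomized delay can be implemented with simple jitter around a base interval. The following sketch uses our own names; `collect_queue_sizes` is a hypothetical hook standing in for a netstat/ss invocation:

```python
import random
import time

def jittered_delay(base_seconds: float = 2.0, spread: float = 0.5) -> float:
    """Delay drawn uniformly from [base*(1-spread), base*(1+spread)],
    so collection never locks onto an application's read/send period."""
    return base_seconds * random.uniform(1.0 - spread, 1.0 + spread)

def collect_repeatedly(collect_queue_sizes, rounds: int = 5) -> list:
    """Invoke the collection hook `rounds` times with jittered pauses."""
    samples = []
    for _ in range(rounds):
        samples.append(collect_queue_sizes())
        time.sleep(jittered_delay(0.01))  # tiny base delay for illustration
    return samples
```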
Fourth, the collection period should be long enough
to contain multiple data points, to avoid false alerts and ensure
a certain degree of confidence in the collected information.
These multiple data points can later be used to filter out
possible spike values. The reason is that some
tools (including netstat and ss) output only
the instantaneous queue sizes at the particular timestamps
when they are invoked.
C. State transitions
We describe the state transitions for both the pull and push
models. When the state machine is stuck at a particular state,
the corresponding bottleneck can be readily identified.
A pull-based data transfer begins in state S when the client
application issues a data request. When the client
(e.g., its application protocol or transport protocol) logs the queuing
of the request, the data transfer transitions from state S to
state A (the request has been queued in the client’s send
buffer). When the client’s send buffer is empty, or there
is some other indication that the request was transmitted
from the client, the data transfer transitions from state A to
state B (the request has been transmitted on the communication
link(s)). When receipt of the request is logged by the server
(e.g., its application protocol or transport protocol), the transfer
transitions from state B to state C (the server application
has received the request). A push-based data transfer may
be considered to start at state C.
After the server prepares and sends the first portion of the
data to be transferred (e.g., the first byte or the first packet),
the data transfer operation transitions to state D (the data
response is underway). Progress of the data transfer may
now depend on the amount of data to be transferred and the
speed with which it is conveyed by the communication link(s).
For example, if the amount of data being transferred is
relatively large and/or the communication path is relatively
slow, the data transfer transitions from state D to state E
when the client logs receipt of the first portion of the data,
transitions to state F when the server logs queuing/release of
the last portion of the data, and terminates at state G when
the client logs receipt of the last portion of the data. In Figure 5,
the lines with long dashes represent this chain of state transitions.
In another scenario, if the amount of data is relatively
small and/or the communication path is relatively fast, the
data transfer transitions from state D to state F when the
server logs queuing/release of the last portion of the data,
transitions to state E when the client logs receipt of the first
portion of the data, and terminates at state G when the client
logs receipt of the last portion of the data. The lines with
short dashes represent this chain of state transitions.
In other scenarios, instead of two different paths
through states E and F, separate (mirrored) states may be
defined that reflect the same statuses (i.e., client-logged
receipt of the first data, server-logged dispatch of the final data).
In these scenarios, there is only one valid
path through states E and F and through the two mirrored
states, which could illustratively be represented as E’ and
F’.
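The transitions above can be sketched as a transition table keyed by (state, event). The event names are our own illustration of the logged events described earlier; the state where the machine stalls indicates which realm to inspect:

```python
# Sketch of the pull-model state machine of Figure 5 (event names are ours).
TRANSITIONS = {
    ("S", "request_queued"): "A",
    ("A", "request_transmitted"): "B",
    ("B", "request_received"): "C",       # push-model transfers start at C
    ("C", "first_data_sent"): "D",
    # large transfer: client sees first data before the server finishes sending
    ("D", "first_data_received"): "E",
    ("E", "last_data_sent"): "F",
    # small transfer: server finishes sending before the client sees first data
    ("D", "last_data_sent"): "F",
    ("F", "first_data_received"): "E",
    # both orderings terminate when the client logs the last portion
    ("E", "last_data_received"): "G",
    ("F", "last_data_received"): "G",
}

def run(events, start="S"):
    """Advance through the state machine; an unmatched event leaves the
    state unchanged, so a stalled transfer stays at its last state."""
    state = start
    for event in events:
        state = TRANSITIONS.get((state, event), state)
    return state
```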
Note that more than a single bottleneck may exist at the
same time. For example, both the client and the server can be
bottlenecks. As another example, during the first half of the
data transfer the client may be the bottleneck, while during
the second half the network is the bottleneck. Though these
complicated cases can be handled easily by splitting the time
duration into smaller units and running the algorithm over
each unit, for simplicity of presentation we focus on the
scenarios where only a single bottleneck exists. However,
the presented algorithm/solution can be easily adapted to
accommodate the above more complicated scenarios.
The determination of which component is the bottleneck
when the state machine is stuck at one of the states D/E/F
depends on the transport-layer information. Specifically,
Table I illustrates the three scenarios and the
information needed to make the corresponding decision.
Specifically, if the receive queue on the client is non-zero,
then the client application is the bottleneck. If the send
queue on the server is zero, then the server application
is the bottleneck. If the client’s receive queue is zero
and the server’s send queue is non-zero, then the network
is the bottleneck.
D. Deployment mode
Our algorithm can be deployed and utilized in two modes,
online and off-line, depending on whether the algorithm is
invoked during the slow data transfer or after it.
Online mode: During the data transfer, if the user sees that
the transfer is slow and would like to diagnose it, the user
can invoke the algorithm. The algorithm will continuously
collect the required information at the application layer and
transport layer from both machines of the data transfer. To
achieve this online deployment, the continuous information
collection can be done in two ways. First, the information
can be streamed to the algorithm engines. Second, the
algorithm may choose to repeatedly mine the information
logged on the two machines.
Table I
THREE SCENARIOS IN PUSH MODEL
Bottleneck Recv. Que. Send Que. Notes
Client Non-zero Any Blocked receiving
Server Any Zero Blocked sending
Network Zero Non-zero Blocked delivery
Off-line mode: The user may also invoke the algorithm
in an off-line mode. For instance, the user knows the time
duration during which the data transfer was slow, and later
uses the algorithm to diagnose that particular time duration. The
advantage of this mode is that it does not need to continuously
collect the required information, which is more complex to
implement. On the other hand, it requires the user to record
the timestamps that define the slow-transfer time range.
VI. EVALUATION
A. Prototype
We implemented the proposed algorithm in a Python
prototype. The prototype works in off-line mode, and it
can be used to determine the bottleneck during a specific
time duration in which the data transfer is slow. The prototype
includes a netstat-based information collection script (i.e.,
“netstat -Ttp”) that repeatedly outputs the TCP connection
information on both the sender and receiver hosts. A random
delay is injected between two successive netstat
invocations, with an average delay of 2 seconds.
The output of each netstat invocation is prefixed by a
timestamp with a granularity of 1 ms.
The prototype first takes as user input a configuration
file. The file defines the directional data transfer connection,
the information source, and the time duration. The directional
data transfer connection is defined by a tuple of (src host, src
port, dst host, dst port). Note that unlike the common definition
of a TCP connection, which is bidirectional, the definition
of a connection here is directional, as slow data transfer
has a notion of direction. The information source is a local
directory that contains the netstat output from both ends.
The time duration is defined by a beginning timestamp
and an ending timestamp.
The prototype also takes as user input an “input”
directory, where the netstat outputs are stored. Since data
transfer is directional, the prototype treats each TCP connection
as two directional data transfer connections. Thus, each
extracted directional data transfer is recorded in a separate csv file.
Each csv file contains lines of tuples in the format (timestamp,
recv-queue, send-queue). One special aspect of
the processing is the handling of time zones. It is possible
for the sender and receiver hosts to be in different
time zones, in which case the netstat outputs would
be misaligned if not treated accordingly. To accommodate
possible differences in time zones, the configuration file
allows the user to specify the time zone of each end host.
Internally, the prototype performs time zone conversion
to align the csv data.
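Time zone alignment can be done by converting each host-local timestamp to UTC before merging the two csv streams. A sketch using Python's standard zoneinfo module follows; the timestamp format string is an assumption, not necessarily the prototype's:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def to_utc(timestamp: str, tz_name: str) -> str:
    """Interpret `timestamp` in the host's time zone and express it in UTC,
    so rows from the two ends align on a common clock."""
    local = datetime.strptime(timestamp, FMT).replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(ZoneInfo("UTC")).strftime(FMT)
```

For example, a receiver timestamp recorded in America/Los_Angeles is shifted forward to its UTC equivalent before comparison with the sender's UTC timestamps.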
Based on the user input in the configuration file, the
prototype extracts only the directional data transfer
connections of interest. Furthermore, only the time periods of
interest are extracted based on the user input. Given a particular
directional data transfer connection to be diagnosed,
Configuration File:
1 [GLOBAL]
2 ts_start=2014-04-21 20:56:40
3 ts_end=2014-04-21 20:57:40
4 time_zone=UTC
5 [SENDER]
6 host_name=sender.linkedin.com
7 port=10001
8 time_zone=UTC
9 [RECEIVER]
10 host_name=receiver.linkedin.com
11 port=36885
12 time_zone=America/Los_Angeles
Output of the algorithm:
13 [user1@host1]> ./rootcause.py -c rootcause.client.conf -i out2/resources/
14 Connection of sender.linkedin.com ['10001'] --> receiver.linkedin.com ['36885']
15 Results:
16 Client is slowly receiving, the bottleneck.
Figure 7. Sample Configuration and Sample Output
the prototype determines the bottleneck based on the rules presented earlier.
To determine the bottleneck reliably, the prototype also filters out spike values which might otherwise cause false alerts. Specifically, the decision that a queue size is zero or non-zero has to persist for at least a certain number of netstat invocations and a certain amount of time. For instance, in one debugging case, with netstat invoked every 2 seconds on average, for the prototype to conclude that the client side is the bottleneck, the receive queue size needed to be non-zero for at least 10 seconds.
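The persistence-filtered decision rule can be sketched as below. This is a minimal illustration of the design rules (persistent non-zero RecvQ points at the receiver, persistent zero SendQ at the sender, and an otherwise healthy queue pattern during a slow transfer points, by elimination, at the network); the function names and the 10-second threshold are illustrative, not the prototype's exact code:

```python
def longest_span(samples, cond):
    """Longest time span (seconds) over which cond(recvq, sendq) holds
    for consecutive samples. Each sample is (ts_seconds, recvq, sendq)."""
    best, start = 0.0, None
    for ts, rq, sq in samples:
        if cond(rq, sq):
            if start is None:
                start = ts
            best = max(best, ts - start)
        else:
            start = None
    return best

def classify(samples, min_persist=10.0):
    """Isolate the bottleneck for one directional transfer."""
    if longest_span(samples, lambda rq, sq: rq > 0) >= min_persist:
        return "receiver"          # receiver too slow to drain its RecvQ
    if longest_span(samples, lambda rq, sq: sq == 0) >= min_persist:
        return "sender"            # sender not producing data to send
    # Queues look healthy (SendQ backlog, empty RecvQ) yet the transfer
    # is slow: by elimination, the network path is the bottleneck.
    return "network"
```

Short-lived spikes never accumulate a span of `min_persist` seconds, so they are filtered out automatically.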
B. Results
We evaluated the built off-line prototype. The usage sce-
nario is as follows. After the user noticed slow data transfer
during some time period in the form of beginning time
stamp and ending time stamp, the user runs the prototype to
diagnose the issue and determine the bottleneck.
We have used this prototype to identify bottlenecks in the following LinkedIn production investigations: (1) Databus [3] bootstrapping, where a client receives bootstrapping events from the server over TCP/IP; and (2) Voldemort [4] cluster expansion, where data receivers fetch data from data senders over TCP/IP. We found that in the first scenario the Databus server (i.e., the data sender) is the bottleneck, while in the second scenario the network is the bottleneck.

(a) Send queue (b) Receive queue
Figure 8. Client is the bottleneck
To demonstrate all three scenarios, each with a different type of bottleneck (sender, receiver, or network), we designed a set of experiments using a custom-built workload. The workload can mimic the three types of bottleneck based on user inputs. For each scenario, a single separate TCP connection is created.
A sample configuration file and the corresponding sample output are shown in Figure 7. The “GLOBAL” section specifies the beginning and ending time stamps of the period during which the slow data transfer lasts, along with the time zone of the two time stamps. The “SENDER” and “RECEIVER” sections define the directional data transfer connection in the form of host names and ports. These two sections can also take the time zone used by each host to allow alignment of the netstat output.
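Since the configuration in Figure 7 is INI-shaped, it could be parsed with the standard-library `configparser`; the sketch below assumes underscore-separated option names matching the sample:

```python
import configparser

SAMPLE = """
[GLOBAL]
ts_start = 2014-04-21 20:56:40
ts_end = 2014-04-21 20:57:40
time_zone = UTC

[SENDER]
host_name = sender.linkedin.com
port = 10001
time_zone = UTC

[RECEIVER]
host_name = receiver.linkedin.com
port = 36885
time_zone = America/Los_Angeles
"""

def parse_conf(text):
    """Return (sender, receiver) endpoints from the configuration text."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    sender = (cfg["SENDER"]["host_name"], cfg["SENDER"].getint("port"))
    receiver = (cfg["RECEIVER"]["host_name"], cfg["RECEIVER"].getint("port"))
    return sender, receiver
```

The per-section `time_zone` options would feed the internal time zone conversion described above.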
1) Client is slowly receiving: We force the receiver (i.e., the client) to slow down data receiving by injecting delays between the calls to read() in application code, which represents a scenario where the client is the bottleneck. As shown in Figure 8, the client (receiver) side shows a RecvQ buildup, indicated by the non-zero values. These non-zero values last long enough for the algorithm to conclude that the client is the bottleneck.
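The delay-injection technique for the slow-receiver scenario can be sketched over a loopback connection; the function name and delay value are illustrative:

```python
import socket
import threading
import time

def run_slow_receiver(read_delay, payload=b"x" * 65536):
    """Send `payload` over a loopback TCP connection while the receiver
    sleeps `read_delay` seconds between read() calls. With a non-trivial
    delay, data backs up in the receiver's RecvQ (the client bottleneck)."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    def sender():
        s = socket.socket()
        s.connect(srv.getsockname())
        s.sendall(payload)        # sender writes as fast as it can
        s.close()

    t = threading.Thread(target=sender)
    t.start()
    conn, _ = srv.accept()
    received = bytearray()
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        received.extend(chunk)
        time.sleep(read_delay)    # injected delay between read() calls
    conn.close()
    srv.close()
    t.join()
    return bytes(received)
```

While such a transfer runs, netstat on the receiver would report a persistently non-zero Recv-Q for the connection.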
2) Server is slowly sending: We force the sender to slow down data sending by injecting delays between the calls to write() in application code, which represents a scenario where the sender is the bottleneck. As shown in Figure 9, the server side has zero SendQ values, and this lasts long enough for the algorithm to conclude that the sender is the bottleneck. Note that there is a spike in the SendQ; since the spike does not persist for a long enough period, it is internally filtered out by the algorithm.

(a) Send queue (b) Receive queue
Figure 9. Server is the bottleneck
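The symmetric slow-sender injection can be sketched the same way, with the delay placed between write() calls instead; again the names and values are illustrative:

```python
import socket
import threading
import time

def run_slow_sender(write_delay, payload=b"y" * 32768, chunk=4096):
    """Send `payload` in chunks, sleeping `write_delay` seconds between
    write() calls, while the receiver drains as fast as it can. This
    mimics the sender-bottleneck scenario: the SendQ stays near zero."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    received = bytearray()

    def receiver():
        conn, _ = srv.accept()
        while True:
            data = conn.recv(65536)   # receiver reads eagerly
            if not data:
                break
            received.extend(data)
        conn.close()

    t = threading.Thread(target=receiver)
    t.start()
    s = socket.socket()
    s.connect(srv.getsockname())
    for i in range(0, len(payload), chunk):
        s.sendall(payload[i:i + chunk])
        time.sleep(write_delay)       # injected delay between write() calls
    s.close()
    t.join()
    srv.close()
    return bytes(received)
```

Here netstat on the sender would report a Send-Q that is zero almost all of the time, matching the rule for the sender-bottleneck case.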
3) Network is slowly transmitting: We created a scenario where TCP transmits data slowly. Specifically, we inject delays and losses into the network path, such that TCP can only transmit at a very low throughput. As shown in Figure 10, the SendQ values are non-zero, while the RecvQ is zero; these values are typical of fast data delivery. Since we nevertheless observe a very low data transfer rate, the algorithm concludes that the network must be the bottleneck.
VII. RELATED WORK
Networking protocols are a critical component of distributed systems. Many protocols and algorithms [5]–[7] have been proposed to ensure high transmission rates in various scenarios.
Much work has been done on diagnosing computer performance problems, but most of it focuses on the performance of a specific component, for instance, a system performance bottleneck in a particular OS (e.g., Linux kernel 2.6.18 [8]) or a particular protocol (TCP Reno [9], [10], etc.). Though there is scattered knowledge and experience about debugging the targeted slow-data-transfer problem, to the best of our knowledge, no existing algorithm or prototype is equivalent to the one we propose.
For the overall problem of slow data transfer, all sources we can find rely only on performance expertise and experience (e.g., [11]), and there is no single algorithm or resource that achieves what our algorithm does. Moreover, we did not find any automated prototype or tool for this purpose.
(a) Send queue
(b) Receive queue
Figure 10. Network is the bottleneck
VIII. CONCLUSION
We proposed and implemented an algorithm to automatically determine the bottleneck component along the data transfer path. It reduces the effort required to diagnose and eventually root-cause slow data transfer problems.
REFERENCES
[1] “Netstat utility,” http://en.wikipedia.org/wiki/Netstat.
[2] “Ss utility,” http://man7.org/linux/man-pages/man8/ss.8.html.
[3] S. Das, C. Botev, and et al., “All aboard the databus!:
Linkedin’s scalable consistent change data capture platform,”
ser. SoCC ’12, New York, NY, USA, 2012.
[4] R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman,
and S. Shah, “Serving large-scale batch computed data with
project voldemort,” in Proceedings of the 10th USENIX
Conference on File and Storage Technologies, ser. FAST’12,
Berkeley, CA, USA, 2012, pp. 18–18.
[5] J. Zhu, S. Roy, and J. H. Kim, “Performance modelling
of tcp enhancements in terrestrial-satellite hybrid networks,”
IEEE/ACM Trans. Netw., vol. 14, no. 4, pp. 753–766, Aug.
2006.
[6] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson,
“Rtp: A transport protocol for real-time applications,” United
States, 2003.
[7] P. Sinha, T. Nandagopal, N. Venkitaraman, R. Sivakumar,
and V. Bharghavan, “Wtcp: A reliable transport protocol for
wireless wide-area networks,” Wirel. Netw., vol. 8, no. 2/3,
pp. 301–316, Mar. 2002.
[8] “Performance regression on tcp stream throughput,”
https://bugzilla.redhat.com/show_bug.cgi?id=705989.
[9] L. A. Grieco and S. Mascolo, “Performance evaluation and
comparison of westwood+, new reno, and vegas tcp conges-
tion control,” SIGCOMM Comput. Commun. Rev., vol. 34,
no. 2, pp. 25–38, Apr. 2004.
[10] C. Won, B. Lee, C. Yu, S. Moh, K. Park, and M.-J. Kim,
“A detailed performance analysis of udp/ip, tcp/ip, and m-via
network protocols using linux/simos,” J. High Speed Netw.,
vol. 13, no. 3, pp. 169–182, Aug. 2004.
[11] “Slow performance occurs when you copy data,”
http://support.microsoft.com/kb/823764.

More Related Content

PDF
Client-side web acceleration for low-bandwidth hosts
PDF
WebAccel: Accelerating Web access for low-bandwidth hosts
PDF
A SPDYier Experience by Olaniyi Jinadu
PDF
A3: application-aware acceleration for wireless data networks
PDF
Reducing download time through mirror servers
PPTX
Computer networks unit v
DOCX
Internet
PDF
IRJET- An Overview of Web Sockets: The Future of Real-Time Communication
Client-side web acceleration for low-bandwidth hosts
WebAccel: Accelerating Web access for low-bandwidth hosts
A SPDYier Experience by Olaniyi Jinadu
A3: application-aware acceleration for wireless data networks
Reducing download time through mirror servers
Computer networks unit v
Internet
IRJET- An Overview of Web Sockets: The Future of Real-Time Communication

What's hot (16)

PDF
Ieeepro techno solutions 2014 ieee java project - cloud bandwidth and cost ...
PDF
An in-building multi-server cloud system based on shortest Path algorithm dep...
PDF
MODIFIED BITTORRENT PROTOCOL AND ITS APPLICATION IN CLOUD COMPUTING ENVIRONMENT
DOCX
JPJ1410 PACK: Prediction-Based Cloud Bandwidth and Cost Reduction System
PPT
Chapter 2 v6.3
PDF
Bandwidth White Paper
PPT
IEEE ICPADS 2008 - Kalman Graffi - SkyEye.KOM: An Information Management Over...
PPTX
Middleware in Distributed System-RPC,RMI
PPTX
Unit 3 cs6601 Distributed Systems
PPT
Application layer protocols
PDF
Implementing a Caching Scheme for Media Streaming in a Proxy Server
PPTX
A Split Protocol Technique for Web Server Migration
PDF
Web Protocol Future (QUIC/SPDY/HTTP2/MPTCP/SCTP)
PDF
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
PDF
Multiple_Vendors_Part-1
PPT
Application layer
Ieeepro techno solutions 2014 ieee java project - cloud bandwidth and cost ...
An in-building multi-server cloud system based on shortest Path algorithm dep...
MODIFIED BITTORRENT PROTOCOL AND ITS APPLICATION IN CLOUD COMPUTING ENVIRONMENT
JPJ1410 PACK: Prediction-Based Cloud Bandwidth and Cost Reduction System
Chapter 2 v6.3
Bandwidth White Paper
IEEE ICPADS 2008 - Kalman Graffi - SkyEye.KOM: An Information Management Over...
Middleware in Distributed System-RPC,RMI
Unit 3 cs6601 Distributed Systems
Application layer protocols
Implementing a Caching Scheme for Media Streaming in a Proxy Server
A Split Protocol Technique for Web Server Migration
Web Protocol Future (QUIC/SPDY/HTTP2/MPTCP/SCTP)
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
Multiple_Vendors_Part-1
Application layer
Ad

Viewers also liked (20)

PDF
Mutual Exclusion in Wireless Sensor and Actor Networks
PDF
Hybrid Periodical Flooding in Unstructured Peer-to-Peer Networks
PDF
Mobile Hosts Participating in Peer-to-Peer Data Networks: Challenges and Solu...
PDF
OCPA: An Algorithm for Fast and Effective Virtual Machine Placement and Assig...
PDF
Eliminating OS-caused Large JVM Pauses for Latency-sensitive Java-based Cloud...
PDF
Building Cloud-ready Video Transcoding System for Content Delivery Networks (...
PDF
Optimizing CDN Infrastructure for Live Streaming with Constrained Server Chai...
PDF
Designing SSD-friendly Applications for Better Application Performance and Hi...
PDF
Dynamic Layer Management in Super-Peer Architectures
PDF
Wireless memory: Eliminating communication redundancy in Wi-Fi networks
PDF
Improving energy efficiency of location sensing on smartphones
PDF
Application-Aware Acceleration for Wireless Data Networks: Design Elements an...
PDF
Capacity Planning and Headroom Analysis for Taming Database Replication Latency
PDF
PAIDS: A Proximity-Assisted Intrusion Detection System for Unidentified Worms
PDF
Optimizing Streaming Server Selection for CDN-delivered Live Streaming
PDF
AOTO: Adaptive overlay topology optimization in unstructured P2P systems
PDF
Hazard avoidance in wireless sensor and actor networks
PDF
On the Impact of Mobile Hosts in Peer-to-Peer Data Networks
PDF
Optimizing JMS Performance for Cloud-based Application Servers
PDF
Enhancing Intrusion Detection System with Proximity Information
Mutual Exclusion in Wireless Sensor and Actor Networks
Hybrid Periodical Flooding in Unstructured Peer-to-Peer Networks
Mobile Hosts Participating in Peer-to-Peer Data Networks: Challenges and Solu...
OCPA: An Algorithm for Fast and Effective Virtual Machine Placement and Assig...
Eliminating OS-caused Large JVM Pauses for Latency-sensitive Java-based Cloud...
Building Cloud-ready Video Transcoding System for Content Delivery Networks (...
Optimizing CDN Infrastructure for Live Streaming with Constrained Server Chai...
Designing SSD-friendly Applications for Better Application Performance and Hi...
Dynamic Layer Management in Super-Peer Architectures
Wireless memory: Eliminating communication redundancy in Wi-Fi networks
Improving energy efficiency of location sensing on smartphones
Application-Aware Acceleration for Wireless Data Networks: Design Elements an...
Capacity Planning and Headroom Analysis for Taming Database Replication Latency
PAIDS: A Proximity-Assisted Intrusion Detection System for Unidentified Worms
Optimizing Streaming Server Selection for CDN-delivered Live Streaming
AOTO: Adaptive overlay topology optimization in unstructured P2P systems
Hazard avoidance in wireless sensor and actor networks
On the Impact of Mobile Hosts in Peer-to-Peer Data Networks
Optimizing JMS Performance for Cloud-based Application Servers
Enhancing Intrusion Detection System with Proximity Information
Ad

Similar to Guarding Fast Data Delivery in Cloud: an Effective Approach to Isolating Performance Bottleneck During Slow Data Delivery (20)

PDF
H017113842
PPTX
Online TCP-IP Networking Assignment Help
PDF
IRJET- Simulation Analysis of a New Startup Algorithm for TCP New Reno
PDF
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
PDF
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
PDF
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
PDF
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
PDF
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
PDF
Socket programming assignment
PPT
PPTX
Computer(presentation).pptx computer netwprl
PDF
Ieeepro techno solutions 2014 ieee dotnet project - cloud bandwidth and cos...
PDF
Reducing download time through mirror servers
PPTX
CN(BCS502) Module-4 _Transport Layer.pptx
DOCX
Individual CommentsYour answers missed there below topics, sp.docx
PDF
Week10 transport
PDF
DrShivashankar_Computer Net_Module-3.pdf
PDF
Transport laye
PPT
Design an Implementation of A Messaging and Resource Sharing Software
PDF
Unit 3 Assignment 1 Osi Model
H017113842
Online TCP-IP Networking Assignment Help
IRJET- Simulation Analysis of a New Startup Algorithm for TCP New Reno
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
Socket programming assignment
Computer(presentation).pptx computer netwprl
Ieeepro techno solutions 2014 ieee dotnet project - cloud bandwidth and cos...
Reducing download time through mirror servers
CN(BCS502) Module-4 _Transport Layer.pptx
Individual CommentsYour answers missed there below topics, sp.docx
Week10 transport
DrShivashankar_Computer Net_Module-3.pdf
Transport laye
Design an Implementation of A Messaging and Resource Sharing Software
Unit 3 Assignment 1 Osi Model

Recently uploaded (20)

PDF
August -2025_Top10 Read_Articles_ijait.pdf
PPTX
Chapter 2 -Technology and Enginerring Materials + Composites.pptx
PPTX
Chemical Technological Processes, Feasibility Study and Chemical Process Indu...
PDF
Applications of Equal_Area_Criterion.pdf
PPTX
Principal presentation for NAAC (1).pptx
PDF
LOW POWER CLASS AB SI POWER AMPLIFIER FOR WIRELESS MEDICAL SENSOR NETWORK
PPTX
A Brief Introduction to IoT- Smart Objects: The "Things" in IoT
PPTX
Graph Data Structures with Types, Traversals, Connectivity, and Real-Life App...
PPTX
CN_Unite_1 AI&DS ENGGERING SPPU PUNE UNIVERSITY
PDF
August 2025 - Top 10 Read Articles in Network Security & Its Applications
PPTX
tack Data Structure with Array and Linked List Implementation, Push and Pop O...
PDF
Abrasive, erosive and cavitation wear.pdf
PPTX
Amdahl’s law is explained in the above power point presentations
PDF
UEFA_Carbon_Footprint_Calculator_Methology_2.0.pdf
PDF
Exploratory_Data_Analysis_Fundamentals.pdf
PPTX
ai_satellite_crop_management_20250815030350.pptx
PPTX
Petroleum Refining & Petrochemicals.pptx
PDF
distributed database system" (DDBS) is often used to refer to both the distri...
PDF
Java Basics-Introduction and program control
PDF
MLpara ingenieira CIVIL, meca Y AMBIENTAL
August -2025_Top10 Read_Articles_ijait.pdf
Chapter 2 -Technology and Enginerring Materials + Composites.pptx
Chemical Technological Processes, Feasibility Study and Chemical Process Indu...
Applications of Equal_Area_Criterion.pdf
Principal presentation for NAAC (1).pptx
LOW POWER CLASS AB SI POWER AMPLIFIER FOR WIRELESS MEDICAL SENSOR NETWORK
A Brief Introduction to IoT- Smart Objects: The "Things" in IoT
Graph Data Structures with Types, Traversals, Connectivity, and Real-Life App...
CN_Unite_1 AI&DS ENGGERING SPPU PUNE UNIVERSITY
August 2025 - Top 10 Read Articles in Network Security & Its Applications
tack Data Structure with Array and Linked List Implementation, Push and Pop O...
Abrasive, erosive and cavitation wear.pdf
Amdahl’s law is explained in the above power point presentations
UEFA_Carbon_Footprint_Calculator_Methology_2.0.pdf
Exploratory_Data_Analysis_Fundamentals.pdf
ai_satellite_crop_management_20250815030350.pptx
Petroleum Refining & Petrochemicals.pptx
distributed database system" (DDBS) is often used to refer to both the distri...
Java Basics-Introduction and program control
MLpara ingenieira CIVIL, meca Y AMBIENTAL

Guarding Fast Data Delivery in Cloud: an Effective Approach to Isolating Performance Bottleneck During Slow Data Delivery

  • 1. Guarding Fast Data Delivery in Cloud: an Effective Approach to Isolating Performance Bottleneck During Slow Data Delivery Zhenyun Zhuang, Haricharan Ramachandra, Badri Sridharan {zzhuang, hramachandra, bsridharan}@linkedin.com LinkedIn Corporation, 2029 Stierlin Court Mountain View, CA 94043 United States Abstract—Cloud-based products heavily rely on the fast data delivery between data centers and remote users - when data delivery is slow, the products’ performance is crippled. When slow data delivery occurs, engineers need to investigate the issue and find the root cause. The investigation requires experience and time, as data delivery involves multiple playing parts including sender/receiver/network. To facilitate the investigations, we propose an algorithm to automatically identify the performance bottleneck. The algorithm aggregates information from multiple layers of data sender and receiver. It helps to automatically iso- late the problem type by identifying which component of sender/receiver/network is the bottleneck. After isolation, suc- cessive efforts can be taken to root cause the exact problem. We also build a prototype to demonstrate the effectiveness of the algorithm. I. INTRODUCTION Cloud-based products (e.g., Cloud storage productions such as Amazon S3) involve data transfer between cloud data centers and remote users. For cloud environments, fast data delivery is critical to ensure high performing data products, as it translates to higher data throughput and hence less response time as experienced by the users. However, despite many techniques aimed at optimizing the data delivery through Internet (e.g. web page optimization, network acceleration), a critical problem we have continu- ously experienced is slow data delivery. This problem has two forms: (1) data is not delivered; and (2) data delivery is slow (i.e., taking long time to complete the delivery). For example, a web client is browsing a web page, and needs to download a jpeg file. 
In the first form, the jpeg file may not be received by the client at all. In the second form, the client is receiving the jpeg file, however it takes too much time to download. When slow data transfer happens, it cripples applica- tion performance, and engineers need to investigate and root cause the issue. There are many possible causes. For instance, it could be caused by the data sender being overloaded and hence not sending data to the downstream receiver; or the network channel is slow; or the data receiver is too busy to read data from network buffer. Based on our experiences, root causing the issues is tedious and requires lots of experiences and expertise. Largely, possible causes of such a problem can be classified into three types: (1) client application causes; (2) network side causes; and (3) server application causes. When carrying on such investigations, engineers typically need to quickly isolate different types of causes, so that they can focus on the particular suspicious component and perform deeper analysis to root cause. It is a challenging task due to several reasons. First, the data delivery involves multiple network entities including two machines (i.e., the sender and receiver) and the networking routes, which is in sharp contrast to the usual performance issues where only a single machine is involved. Second, the diagnosis involves multiple layers of information including application layer and transport layer. To isolate the causes, engineers have to check various places including client log, server log, network statistics, cpu usage.etc. These types of checking take much time and efforts, and often times requires experiences and expertise from the performance engineers. To save time and efforts, performance engineers are eagerly looking forward to seeing more intelligent tools to help them quickly isolate the root causes. 
Though the exact causes can vary wildly in different scenarios, a quick isolation of the problem types still would greatly help the engineers who investigate the problem. In this work, we specifically focus on fulfilling such a request: quickly identify the component that is to blame, be it the sender, the receiver, or the network. This work presents an algorithm and a prototype to help performance engineers. The algorithm can automatically isolate the root cause when slow data delivery occurs. Once the blame part is figured out, successive investigations can be conducted to nail down the real problem. These will be outside of the scope of this work, and are part of our future work. For the remainder of the writing, after providing some necessary technical background in section II, we then define and motivate the problems being addressed in this writing using three scenarios in Section III. We then present the designs in Section IV. Based on the design principles, we propose the solutions and deployment mode in Section V. We build a prototype and perform performance evaluation using the prototype in Section VI, respectively. We also present certain related works in Section VII. Finally in Section VIII we conclude the work.
  • 2. II. BACKGROUND AND SCOPE A. Background TCP transport protocol: Transmission Control Protocol (TCP) is one of the transport-layer protocols that provides ordered and reliable delivery of streamed bytes. It is the most widely used transport protocol Today. TCP features flow control to avoid overloading the receiver. The receiver sets up a dedicated receiver buffer, and the sender sets up a corresponding send buffer. TCP also have congestion control, retransmission mechanism, etc. to ensure the reliable and fast data transfer. Application protocol: Application protocols such as HTTP are built upon lower transport protocols. When appli- cation performance suffers, the immediate symptoms typi- cally are higher latency, lower throughput. Though these two symptoms are often related, we specifically focus on solving the slow data transfer (i.e. lower throughput) problem. Depending on application protocols and how applications are built, application layers may emit additional logs. For instance, the data receiver may log the time stamps of send- ing data requests, receiving the first byte of requested data and receiving the last byte of the requested data. Similarly, the data sender may log the time stamps of receiving the data requests, sending the first byte of requested data and sending the last byte of the requested data. B. Scope To understand the causes of the low performing data delivery problem, we firstly need to understand the data de- livery model which determines the flow of bytes. For simple presentation, we assume the “client” is the data receiver, while the “server” is the data sender. This assumption is in line with today’s web browsing paradigm where web browsers are data receivers and web servers are data senders. Generally there are two models of data delivery: pull- based and push-based. In pull-based model, the client sends a request to the server, and the server sends back a response corresponding to the client request. 
In push-based model, the server pushes data to the client, without the client explic- itly asking for that. Pull-based model is more complicated model, as the data flow of push-based model is only a subset of that of the pull-based model. In other words, the only difference between the two models is that pull-based model contains the “data request” phase, as shown in Figure 1. Though data downloading can be carried over with various protocols including TCP and UDP, most of today’s data are transferred through TCP, given the dominance of Internet service. Hence, in this work we only focus on TCP-based data downloading. For easier presentation, we choose Linux platforms to present our designs and solutions due to Linux’s popularity. However, the relevant designs and solutions will also apply to other platforms. Client Server Application “Response”PUSH Model (a) Push Model Client Server Application “request” Application “Response”PULL Model (b) Pull Model Figure 1. Push model and Pull model III. PROBLEM DEFINITION AND MOTIVATION SCENARIOS We first define and demonstrate the problem we want to solve using three motivating scenarios. A. Problem definition We use C to denote the client which receives the data; use S to denote the server that sends back the data to C. We firstly consider the pull-based data delivery model, where for any data delivery, C sends a request Rq to S first; after receiving Rq, S will prepare the response data Rs and send back to C. We assume the time taken in good data delivery scenarios is Tg, and the maximum of Tg is denoted as Tgmax. The delivery time is calculated as the time difference between when C’s application sends out Rq and when that application receives the last byte of the Rs with read() call. The actual data delivery time is Ta. We say that the particular data delivery is slow when Ta > Tgmax. B. 
Results of three motivation scenarios To further illustrate the nature of the problem we are addressing, we setup a simple experiment to demonstrate the application symptoms as well as the root causes of slow data downloading. In the experiment, a client (receiver) opens a TCP connection to download a bulk of data from a server. The environment are Gigbit LAN network, where the client machine and the server machine are directly connected. The RTT (round trip delay) is at sub-millisecond. Hence, in typical (“good”) data transfer scenarios, the expected data transmission rate should not be less than 1MB/s based on many practices. For each scenario, we measure the data delivery rate on the client (receiver) side at application level. The three scenarios differ in where the performance bottleneck is, namely, the client (receiver), the server (sender) and the network. Each of these bottlenecks can cause slow data
  • 3. Figure 2. Similar symptoms of slow data downloading, but caused by different types of bottlenecks: Server (a), Client (b) and Network(c) delivery. We plot the accumulated bytes received by the receiver (from application level) across the time line, which is in the format of hour:minute:second. 1) Receiver is the bottleneck: In this scenario, the re- ceiver application is not reading fast enough. The down- loading progress as measured by the receiver is plotted in Figure 2(a). The average throughput is about 3KB/second. 2) Sender is the bottleneck (The application not sending fast enough): In this scenario, the sender application is not reading fast enough. The average throughput is about 3KB/second, as shown in Figure 2(b). 3) Network is the bottleneck (The network is not trans- mitting fast enough): In this scenario, we introduce packet losses to the network, and the network protocol is not transmitting fast enough. The average throughput is about 5KB/second, as shown in Figure 2(c). C. Summary We have demonstrated that for all the three scenarios, the data transfer is slow - it takes up to one minute to download the 200KB data. Though the exact throughput vary in three scenarios, the application symptoms are similar, regardless of the three different bottlenecks. IV. DESIGN A. Design Overview We notice that various issues causing slow data transfer between machines are largely classified into 3 types based on where the problem is: (1) the sender application; (2) the receiver application; or (3) the networking channel. We hope to quickly and automatically decide which types of problems is causing the slow data downloading. We also notice the pull-model is a super-set of push- model, and the push-model is exactly the response data delivery part of the pull-model. Hence we break down our solution into two parts. First, we focus on the push-model and derive the algorithm based on our observations with some experiments and our analysis of network protocols. 
The solution relies on network-layer knowledge of network buffer queue size. Second, stepping on the solution of push- model, we extend the solution to solve the pull-model problem. The solution introduces a state-machine, and by incorporating the application-layer knowledge, it can deduct the bottleneck that causes the slow data transfer. B. Design for the push model Considering the data delivery process in push-model, a typical data flow in any TCP-based data transfer is illustrated in Figure 3. Five steps with regard to system calls and network transmissions are as below: (1) at Step-A, the server application issues a write() system call and the application data is copied to socket send buffer. (2) at Step-B, the server’s TCP layer issues send() call and sends some data to the network; The amount of data is subject to TCP’s congestion control and flow control. (3) at Step-C, Network will route the data hop-by-hop to the receiver; IP routing protocol is in play in this part. (4) at Step-D, The client’s TCP layer will receive the data with recv()system call. The data are put in receive buffer. (5) at Step-E, The client application issues read() call to receive the data and copy to user space. During this course, there are could be 3 types of problems: (1) Server-application bottleneck. The server application may not have the data ready, this could be caused by multiple Client Application Server Application read( ) Data packets recv( ) Recv Buff write( ) send( ) Send Buff A B E D C Figure 3. The design for push model
  • 4. Client Server Application “request” Application “Response” T0 T1 T2 T3 T4 T5 Figure 4. Time line of pull model data transfer reasons. For instance, the server machine may be over- committed (e.g., running multiple applications), hence the particular application does not get chance to write(); the server application may have too many threads running and hence the write() thread is not being scheduled on cpus; the application’s data-preparation business logic is slow. etc. (2) Client-application bottleneck. The client application may be too busy to issue read() call. Similar to server side, possible causes include machine over-commitment, and so on. (3) Network-side bottleneck. The network (including the TCP protocol) is unable to delivery data fast enough. Possible causes are poor network channel (e.g., high losses, low bandwidth) or poor TCP/IP configuration or tunings. Invariably, the symptom is the data are not being able to pushed through to the receiver side. After careful analysis of the symptoms and the causes, we propose to isolate the problem into the above 3 types based on the knowledge of the TCP sockets, specifically, the queue lengths of the send and receive buffers. In normal scenarios where data delivery is fast, the re- ceiver’s receive queue size should mostly be zero, indicating that the receiver is fast enough to consume the data received by copying the data from socket buffer to user space. The sender’s send queue should be non-zero, indicating the fact that the data are produced fast enough to supply the consumption. On the other hand, when the receive queue is non-zero, that indicates client-side bottleneck; when the send queue is zero, that indicates the server-side bottleneck. When data transfer is indeed slow, and neither the server application nor the sender application is the slowing down the data transfer, we can conclude that the data transmission (i.e., the network) between machines is slow. 
Such a network-side issue may involve both ends of the transmission. Also note that this type of issue is not limited to the networking or transport protocols themselves (e.g., TCP/IP); it may be caused by a slowdown of the sender/receiver OS (e.g., in TCP/IP protocol processing).

C. Design for the pull model

The solution for the pull model is similar to the push-model solution, with the key difference of an added client-request phase: the server will not send back data until it receives a data request. We illustrate the time line of a typical pull-based data transfer in Figure 4. When the client needs to do a pull-based data delivery, it first sends out a request at T0; after the network delivers the request to the server at T1, the server prepares the data, followed by sending back the data starting at T2 by a sequence of write() calls. The sending completes at T4. After network transmission, the client begins receiving data at T3, and completes receiving at T5. Note that though the other time stamps are strictly ordered, the ordering of T3 and T4 may vary depending on the exact scenario. Specifically, for a small data transfer, T4 may precede T3 since a single write() call may suffice. For a large data transfer, T3 typically precedes T4.

From the above process, we can see that if the data request fails to reach the server, then the data delivery will not happen, hence will not complete. Though the symptom of this scenario is different from the scenario where the delivery is slow, we decide to solve both scenarios since in both the data are not completely delivered. To provide a complete solution for the pull-model problem, we need to distinguish between the various possible failing components (e.g., client/network/server) for the client-request phase as well.

We propose a state-machine based algorithm in the following. Figure 5 is a diagram illustrating the states monitored by the state machine.
The state machine narrows the location of a problem in a data transfer operation to one of three realms: a sender realm that encompasses the sender or provider of the data (e.g., a server), a receiver realm that encompasses the recipient or receiver of the data (e.g., a client), and a network realm that encompasses the communication link(s) that convey the transferred data. In Figure 5, the states are depicted with or near the component(s) whose action or actions cause a change from one state to another.

A state engine process is fed the necessary knowledge to monitor or identify the progress of a data transfer from start (at state S for a pull-based transfer, or at state C for a push-based transfer) to finish (at state G). This may require the OSes of the two entities (e.g., client and server in Figure 5) and the applications that use the data (e.g., client application, server application) to emit certain types of information at certain times. For instance, the receiver (e.g., the recipient's OS and/or the application) logs events at one or more protocol layers, such as generation and dispatch of a data request, transmission of the request from the receiver machine, receipt of the first portion of the data, and receipt of the last portion of the data. Similarly, the data provider (e.g., the provider's OS and/or the application) logs events at one or more protocol layers, such as receipt of a data request, preparation of the data, dispatch of the first portion of the data, and dispatch of the final portion of the data.

V. SOLUTION

We now present the detailed algorithm, which solves the problem of slow data transfer for both the pull model and the push model.
Figure 5. The design for pull model (states S through G overlaid on the client and server application-layer protocols, transport-layer protocols, communication buffers, and the communication links)

A. Solution Overview

Our proposed algorithm aggregates the information from the sender and the receiver nodes, as shown in Figure 6. For each node, the information from both the application layer and the transport layer is collected and utilized. The heart of the algorithm is the Bottleneck Determination Engine (BDE), which determines the bottleneck. Internally, the algorithm maintains the current state of the data transfer and performs state transitions when appropriate, which is handled by the State Transition Engine (STE). The state transitions are based on the information collected by the Information Collection (IC) component, which collects and aggregates the information from both ends of the data transfer.

The algorithm we present has the following key design principles:
• Distributed information aggregation. The solution utilizes both client and server knowledge in order to piece together the entire picture and identify the causes.
• Cross-layer information aggregation. The solution utilizes both application-layer knowledge (e.g., the various types of application logs) and transport-layer knowledge (e.g., the send queue and receive queue sizes).
• State-machine based expert system. The state transitions are triggered by the aggregated knowledge from the different machines and different layers.

Figure 6. High level design of the algorithm (application-layer and transport-layer information from both the sender node and the receiver node feeds the Information Collection, the State Transition Engine, and the Bottleneck Determination Engine)

B. Collecting Queue Size Information
In order for the algorithm to identify the bottleneck, it needs to gather information about the queue sizes at the transport layer on both the sender and receiver sides. There are many ways to collect such information; in this work, we list two example tools/utilities, namely netstat [1] and ss [2]. These two utilities can be issued on the hosts as commands. (a) netstat (network statistics [1]) is a command-line tool that displays network connections (both incoming and outgoing) and network protocol statistics. For the purpose of this work, we utilize the sizes of the send queue and the receive queue of the TCP/IP sockets. This tool is available in many OSes including Unix and Windows. (b) The ss command [2] is used to show socket statistics. It can display stats for TCP as well as other types of sockets. Similar to netstat, ss can also display the send and receive queue sizes, which can be used by our algorithm.

When utilizing these tools/utilities to collect information about transport-level queue sizes, the collection process should meet the following requirements. First, the collection process should continuously output the queue sizes of the TCP/IP sockets of interest for the duration of the slow data transfer. If the collection tools (e.g., netstat and ss) can only display the instantaneous values at a single time point, they need to be invoked repeatedly to gather multiple data points during the course of the data transfer. Second, invoking the collection tools may cause overhead, hence they should be invoked without exerting too much overhead on the system. To achieve this objective, appropriate delays (e.g., 10 seconds) can be injected between invocations. Third, the delays between collection invocations should not be fixed, to avoid the clocking of application reading or TCP receiving.
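The first three requirements can be sketched as a collection loop with jittered delays. This is a hypothetical sketch, not the paper's prototype: the `sample_queues` name is ours, and the `probe` callable stands in for a wrapper that invokes ss or netstat and parses out the queue columns for the socket of interest.

```python
import random
import time

def sample_queues(probe, samples=5, mean_delay=2.0, jitter=0.5):
    """Collect timestamped queue-size readings.  `probe` stands in
    for a wrapper that invokes a tool such as ss or netstat and
    extracts the Recv-Q/Send-Q values for the socket of interest.
    The inter-invocation delay is randomized around `mean_delay` so
    the sampling cannot lock onto a periodic read()/write() pattern
    in the application (the "clocking" problem)."""
    readings = []
    for i in range(samples):
        readings.append((time.time(), probe()))
        if i < samples - 1:
            delay = random.uniform(mean_delay - jitter, mean_delay + jitter)
            time.sleep(max(0.0, delay))
    return readings

# Demo with a stand-in probe that always reports an empty receive
# queue and a backed-up send queue.
readings = sample_queues(lambda: (0, 53248), samples=3,
                         mean_delay=0.01, jitter=0.005)
print(len(readings))
```

The jitter bound and the mean delay would be tuned per deployment; the key point is only that the delay distribution is not a constant.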
For instance, if the receiver application is designed to read the data every 10 seconds, and the delays between collection invocations happen to be 10 seconds and fall right after the application reading, then the collected queue sizes may always be zero, leading to the conclusion that
the receiver is not the bottleneck, while in reality the receiver could be the bottleneck. Similarly, the sending application or the TCP/IP layer may have similar clockings, which need to be avoided by varying the delays injected between collection invocations. Fourth, the collection process should be long enough to contain multiple data points, to avoid false alerts and ensure a certain degree of confidence in the collected information. These multiple data points can later be used to filter out possible spike values. The reason for doing so is that some tools/utilities (including netstat and ss) output the instantaneous queue sizes at the particular time stamps when they are invoked.

C. State transitions

We describe the state transitions for both the pull and push models. When the state machine is stuck at a particular state, the corresponding bottleneck can be easily identified. A pull-based data transfer begins in state S when the client (e.g., the application) issues a data request. When the client (e.g., application protocol, transport protocol) logs queuing of the request, the data transfer transitions from state S to state A (the request has been queued in the client's send buffer). When the client's send buffer is empty, or there is some other indication that the request was transmitted from the client, the data transfer transitions from state A to state B (the request has been transmitted on the communication link(s)). When receipt of the request is logged by the server (e.g., application protocol, transport protocol), the transfer transitions from state B to state C (the server application has received the request). A push-based data transfer may be considered to start at state C. After the server prepares and sends a first portion of the data to be transferred (e.g., the first byte, the first packet), the data transfer operation transitions to state D (the data response is underway).
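These transitions, together with the two alternative completion orderings through states E and F, can be sketched as a table-driven state machine. The event labels below are our own; the paper does not prescribe an event vocabulary.

```python
# Allowed transitions of the pull-model state machine (Figure 5).
# Event labels are our own names for the logged events described
# in the text.
TRANSITIONS = {
    ("S", "request_queued"): "A",      # request queued in client send buffer
    ("A", "request_sent"): "B",        # request transmitted from the client
    ("B", "request_received"): "C",    # server logged receipt of the request
    ("C", "first_data_sent"): "D",     # data response is underway
    # Large transfer / slow path: first data arrives before the last is sent.
    ("D", "first_data_received"): "E",
    ("E", "last_data_sent"): "F",
    # Small transfer / fast path: last data is sent before the first arrives.
    ("D", "last_data_sent"): "F",
    ("F", "first_data_received"): "E",
    # Either path terminates when the client logs the final portion.
    ("E", "last_data_received"): "G",
    ("F", "last_data_received"): "G",
}

def run(events, start="S"):
    """Advance the state machine over logged events; the state where
    it stalls localizes the bottleneck (push transfers start at C)."""
    state = start
    for event in events:
        nxt = TRANSITIONS.get((state, event))
        if nxt is None:
            break  # no transition fired: the transfer is stuck here
        state = nxt
    return state

# A pull transfer that stalls after the request left the client but
# before the server logged it is stuck in state B.
print(run(["request_queued", "request_sent"]))
```

A transfer stuck in state B, for example, points at the network realm (or the server host) for the request phase.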
Progress of the data transfer may now depend on the amount of data to be transferred and the speed with which it is conveyed by the communication link(s). For example, if the amount of data being transferred is relatively large and/or the communication path is relatively slow, the data transfer transitions from state D to state E when the client logs receipt of the first portion of the data, transitions to state F when the server logs queuing/release of the last portion of the data, and terminates at state G when the client logs receipt of the last portion of the data. The lines with long dashes in Figure 5 represent this chain of state transitions. In another scenario, if the amount of data is relatively small and/or the communication path is relatively fast, the data transfer transitions from state D to state F when the server logs queuing/release of the last portion of the data, transitions to state E when the client logs receipt of the first portion of the data, and terminates at state G when the client logs receipt of the last portion of the data. The lines with short dashes in Figure 5 represent this chain of state transitions. In some other scenarios, instead of two different paths through states E and F, separate (mirrored) states may be defined that reflect the same statuses (i.e., client-logged receipt of the first data, server-logged dispatch of the final data). In these scenarios, therefore, there will be only one valid path through states E and F and through the two mirrored states, which could illustratively be represented as E' and F'.

Note that more than a single bottleneck may exist at the same time. For example, both the client and the server can be bottlenecks. As another example, during the first half of the data transfer the client may be the bottleneck, while during the second half the network is the bottleneck.
Though it is possible to handle these complicated cases by splitting the time duration into smaller units and producing the algorithm's output for each of these smaller units, for simpler presentation we choose to focus on the scenarios where only a single bottleneck exists. However, the presented algorithm/solution can easily be extended to accommodate the above more complicated scenarios.

The determination of which component is the bottleneck, when the state machine is stuck at one of the states D, E, or F, depends on the transport-layer information. Specifically, Table I illustrates all three scenarios and the information needed to make the corresponding decision. If the Receive Queue on the client is not zero, then the client application is the bottleneck. If the Send Queue on the server is zero, then the server application is the bottleneck. If the client's Receive Queue is zero and the server's Send Queue is non-zero, then the network component is the bottleneck.

D. Deployment mode

Our algorithm can be deployed and utilized in two modes, online and off-line, depending on whether the algorithm is invoked during the slow data transfer or after it.

Online mode: During the data transfer, if the user sees that the transfer is slow and would like to diagnose it, the user can invoke the algorithm. The algorithm will continuously collect the required information at the application layer and the transport layer from both machines of the data transfer. To achieve this online deployment, the continuous information collection can be done in two ways. First, the information can be streamed to the algorithm engines. Second, the algorithm may choose to repeatedly mine the information logged on the two machines.

Table I
THREE SCENARIOS IN PUSH MODEL

Bottleneck | Recv. Que. | Send Que. | Notes
Client     | Non-zero   | Any      | Blocked receiving
Server     | Any        | Zero     | Blocked sending
Network    | Zero       | Non-zero | Blocked delivery
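The decision rule of Table I can be written as a small classifier. This is a minimal sketch (the function name is ours), assuming the queue readings have already been filtered for persistence:

```python
def classify_bottleneck(recv_q, send_q):
    """Apply the Table I rule to a persistent queue-size observation:
    recv_q is the client's receive queue, send_q is the server's
    send queue."""
    if recv_q > 0:
        return "client"   # blocked receiving: data sits in the receive queue
    if send_q == 0:
        return "server"   # blocked sending: nothing queued for transmission
    return "network"      # recv queue drained, send queue backed up

print(classify_bottleneck(0, 53248))  # "network"
```

The ordering of the two checks reflects the "Any" entries in the table: a non-zero receive queue implicates the client regardless of the send queue, and only an empty receive queue lets the send queue distinguish server from network.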
Off-line mode: The users may also invoke the algorithm in an off-line mode. For instance, the user knows the time duration during which the data transfer was slow, and later uses the algorithm to diagnose that particular time duration. The advantage of this mode is that it does not need to continuously collect the needed information, which is more complex to implement. On the other hand, it requires the user to record the time stamps that define the slow-transfer time range.

VI. EVALUATION

A. Prototype

We implemented the proposed algorithm in a Python prototype. The prototype works in off-line mode, and it can be used to determine the bottleneck during a specific time duration in which the data transfer is slow. The prototype includes a netstat-based information collection script (i.e., "netstat -Ttp") that repeatedly outputs the TCP connection information on both the sender and receiver hosts. A random delay period is injected between two succeeding netstat invocations, with the average delay value being 2 seconds. The output of each netstat invocation is prefixed by a time stamp with a granularity of 1 ms.

The prototype first takes as user input a configuration file. The file defines the directional data transfer connection, the information source, and the time duration. The directional data transfer connection is defined by the tuple (src host, src port, dst host, dst port). Note that unlike the common definition of TCP connections, which are bidirectional, the definition of connections here is directional, as slow data transfer has a notion of direction. The information source is a local directory which contains the netstat output from both ends. The time duration is defined by the beginning time stamp and the ending time stamp. The prototype also takes as user input an "input" directory, where the netstat outputs are stored. Since data transfer is directional, the prototype treats each TCP connection as two directional data transfer connections.
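On each host, every timestamp-prefixed netstat sample must be reduced to a (time stamp, recv-queue, send-queue) tuple, with time stamps normalized to a common time zone. The sketch below shows such a parsing step under an assumed line format modeled loosely on `netstat -tn` output; the real prototype's format, and the `parse_sample` name, are our assumptions.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def parse_sample(line, host_tz):
    """Reduce one timestamp-prefixed netstat sample line to a
    (utc_time, recv_q, send_q) tuple.  The line format here is an
    assumption: a host-local time stamp prepended to the protocol,
    Recv-Q, Send-Q, and address columns."""
    ts_str, _proto, recv_q, send_q, *_rest = line.split()
    local = datetime.strptime(ts_str, "%Y-%m-%dT%H:%M:%S.%f")
    # Normalize the host-local time stamp to UTC so samples taken on
    # sender and receiver hosts in different time zones line up.
    utc = local.replace(tzinfo=ZoneInfo(host_tz)).astimezone(ZoneInfo("UTC"))
    return utc, int(recv_q), int(send_q)

sample = "2014-04-21T20:56:40.123 tcp 0 53248 10.0.0.1:10001 10.0.0.2:36885"
ts, rq, sq = parse_sample(sample, "America/Los_Angeles")
print(ts.hour, rq, sq)
```

Normalizing both ends to UTC at parse time is one way to realize the time-zone alignment that the configuration file makes possible.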
Thus, each extracted TCP data transfer is recorded in a separate csv file. Each csv file contains lines of tuples in the format (time stamp, recv-queue, send-queue).

One special treatment in the processing is the handling of time zones. It is possible for the sender and receiver hosts to be in different time zones. Because of this, the netstat outputs might be misaligned if not treated accordingly. To accommodate possible differences in time zones, the configuration file allows the user to specify the time zones of the two end hosts. Internally, the prototype performs a time zone conversion to align the csv data.

Based on the user input in the configuration file, the prototype extracts only the directional data transfer connections of interest. Furthermore, only the time periods of interest are extracted, based on the user input. Given a particular directional data transfer connection to be diagnosed, the prototype determines the bottleneck based on the rules presented earlier.

Configuration File:
[GLOBAL]
ts_start=2014-04-21 20:56:40
ts_end=2014-04-21 20:57:40
time_zone=UTC
[SENDER]
host_name=sender.linkedin.com
port=10001
time_zone=UTC
[RECEIVER]
host_name=receiver.linkedin.com
port=36885
time_zone=America/Los_Angeles

Output of the algorithm:
[user1@host1]> ./rootcause.py -c rootcause.client.conf -i out2/resources/
Connection of sender.linkedin.com ['10001'] --> receiver.linkedin.com ['36885']
Results:
Client is slowly receiving, the bottleneck.

Figure 7. Sample Configuration and Sample Output

To reliably determine the bottleneck, the prototype also filters out spike values which might cause false alerts. Specifically, a decision about whether a queue size is zero or non-zero has to persist for at least a certain number of invocations and at least a certain amount of time.
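Such a persistence check might be sketched as follows; the function name and the default thresholds are illustrative, not the prototype's:

```python
def persists(samples, predicate, min_points=5, min_seconds=10.0):
    """Return True only if `predicate` holds over a run of
    consecutive samples that is long enough in both sample count and
    wall-clock span.  `samples` is a list of
    (timestamp_seconds, queue_size) pairs."""
    run_start, run_len = None, 0
    for ts, value in samples:
        if predicate(value):
            if run_start is None:
                run_start = ts
            run_len += 1
            if run_len >= min_points and ts - run_start >= min_seconds:
                return True
        else:
            run_start, run_len = None, 0  # a spike or a gap resets the run
    return False

# A single non-zero spike in a queue-size trace is filtered out:
spiky = [(t, 9000 if t == 4 else 0) for t in range(0, 30, 2)]
print(persists(spiky, lambda v: v > 0))  # False
```

Requiring both a minimum count and a minimum span guards against two failure modes at once: a few closely spaced samples during a transient, and a long gap bridged by only one or two readings.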
For instance, in one debugging case, with an average netstat invocation interval of 2 seconds, in order for the prototype to conclude that the client side is the bottleneck, the receive queue size needs to be non-zero for at least 10 seconds.

B. Results

We evaluated the off-line prototype we built. The usage scenario is as follows: after the user notices slow data transfer during some time period, given in the form of a beginning time stamp and an ending time stamp, the user runs the prototype to diagnose the issue and determine the bottleneck. We have used this prototype to identify bottlenecks in the following LinkedIn production investigations/issues: (1) Databus [3] bootstrapping, where a client receives bootstrapping events from the server using the TCP/IP protocol; (2) Voldemort [4] cluster expansion, where data receivers fetch data from data senders using the TCP/IP protocol. We found that in the first scenario, the Databus server (i.e., the
data sender) is the bottleneck, while in the second scenario the networking part is the bottleneck.

Figure 8. Client is the bottleneck: (a) Send queue; (b) Receive queue

To demonstrate all three scenarios, where each scenario has a different type of bottleneck (i.e., sender/receiver/networking), we designed a set of experiments using custom-built workloads. The workloads can mimic the three types of bottleneck based on user inputs. For each of the scenarios, a single separate TCP connection is created. A sample configuration file and the corresponding sample output are shown in Figure 7. The "GLOBAL" section specifies the beginning and ending time stamps of the duration in which the slow data transfer lasts. It also takes the time zone of the two time stamps. The "SENDER" and "RECEIVER" sections define the directional data transfer connection in the form of host names and ports. These two sections can also take the time zone used by each host to allow alignment of the netstat output.

1) Client is slowly receiving: We force the receiver (i.e., the client) to slow down the data receiving by injecting delays between the calls to read() in the application code, which represents a scenario where the client is the bottleneck. As shown in Figure 8, the client side (receiver side) has a certain RecvQ buildup, which is indicated by the non-zero values. These non-zero values last for a while, so the algorithm can conclude that the client is the bottleneck.

Figure 9. Server is the bottleneck: (a) Send queue; (b) Receive queue

2) Server is slowly sending: We force the sender to slow down the data sending by injecting delays between the calls to write() in the application code, which represents a scenario where the sender is the bottleneck. As shown in Figure 9, the server side has zero SendQ values. The lasting period is significant enough to allow the algorithm to conclude the sender bottleneck. Note that there is a spike in the SendQ.
Since the spike does not persist for a long enough period, it is internally filtered out by the algorithm.

3) Network is slowly transmitting: We created a scenario where TCP transmits the data slowly. Specifically, we inject delays and latencies into the network path, such that TCP can only transmit at a very low throughput. As shown in Figure 10, the SendQ values are non-zero, while the RecvQ is zero. These values are typical of a fast data delivery; since we nevertheless observe a very low data transfer rate, the algorithm concludes that it must be a networking problem.

VII. RELATED WORK

Networking protocols are a critical component of distributed systems. Many protocols and algorithms [5]–[7] have been proposed to ensure high transmission rates in various scenarios. Much work has been done on diagnosing computer performance problems, but most of it focuses on the performance of a specific component, for instance, a system performance bottleneck in a particular OS (e.g., Linux kernel 2.6.18 [8]) or a particular protocol (TCP Reno [9], [10], etc.). Though there is scattered knowledge/experience about debugging the targeted slow-data-transfer problem, to the best of our knowledge we have not seen any algorithm or prototype that is equivalent to our proposed algorithm/prototype. For the overall problem of slow data transfer, all sources we can find rely only on performance expertise/experience (e.g., [11]), and there is no single algorithm or resource that achieves what our algorithm does. Moreover, we did not find any automated prototype/tool for that purpose.
Figure 10. Network is the bottleneck: (a) Send queue; (b) Receive queue

VIII. CONCLUSION

We proposed and implemented an algorithm to automatically determine the bottleneck component along the data transfer path. It reduces the effort required to diagnose and eventually root-cause slow data transfer problems.

REFERENCES

[1] "Netstat utility," http://en.wikipedia.org/wiki/Netstat.
[2] "Ss utility," http://man7.org/linux/man-pages/man8/ss.8.html.
[3] S. Das, C. Botev et al., "All aboard the databus!: Linkedin's scalable consistent change data capture platform," ser. SoCC '12, New York, NY, USA, 2012.
[4] R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, and S. Shah, "Serving large-scale batch computed data with project voldemort," in Proceedings of the 10th USENIX Conference on File and Storage Technologies, ser. FAST '12, Berkeley, CA, USA, 2012, pp. 18–18.
[5] J. Zhu, S. Roy, and J. H. Kim, "Performance modelling of tcp enhancements in terrestrial-satellite hybrid networks," IEEE/ACM Trans. Netw., vol. 14, no. 4, pp. 753–766, Aug. 2006.
[6] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "Rtp: A transport protocol for real-time applications," United States, 2003.
[7] P. Sinha, T. Nandagopal, N. Venkitaraman, R. Sivakumar, and V. Bharghavan, "Wtcp: A reliable transport protocol for wireless wide-area networks," Wirel. Netw., vol. 8, no. 2/3, pp. 301–316, Mar. 2002.
[8] "Performance regression on tcp stream throughput," https://bugzilla.redhat.com/show_bug.cgi?id=705989.
[9] L. A. Grieco and S. Mascolo, "Performance evaluation and comparison of westwood+, new reno, and vegas tcp congestion control," SIGCOMM Comput. Commun. Rev., vol. 34, no. 2, pp. 25–38, Apr. 2004.
[10] C. Won, B. Lee, C. Yu, S. Moh, K. Park, and M.-J.
Kim, "A detailed performance analysis of udp/ip, tcp/ip, and m-via network protocols using linux/simos," J. High Speed Netw., vol. 13, no. 3, pp. 169–182, Aug. 2004.
[11] "Slow performance occurs when you copy data," http://support.microsoft.com/kb/823764.