Guarding Fast Data Delivery in Cloud: an Effective Approach to Isolating
Performance Bottleneck During Slow Data Delivery
Zhenyun Zhuang, Haricharan Ramachandra, Badri Sridharan
{zzhuang, hramachandra, bsridharan}@linkedin.com
LinkedIn Corporation, 2029 Stierlin Court Mountain View, CA 94043 United States
Abstract—Cloud-based products rely heavily on fast data
delivery between data centers and remote users; when data
delivery is slow, product performance is crippled. When
slow data delivery occurs, engineers need to investigate the issue
and find the root cause. The investigation requires experience
and time, as data delivery involves multiple moving parts:
the sender, the receiver, and the network.
To facilitate these investigations, we propose an algorithm
to automatically identify the performance bottleneck. The
algorithm aggregates information from multiple layers of the
data sender and receiver, and isolates the problem type by
identifying which of the sender, receiver, or network is the
bottleneck. After isolation, successive efforts can be taken
to root-cause the exact problem. We also build a prototype
to demonstrate the effectiveness of the algorithm.
I. INTRODUCTION
Cloud-based products (e.g., cloud storage products
such as Amazon S3) involve data transfer between cloud
data centers and remote users. For cloud environments,
fast data delivery is critical to ensuring high-performing
data products, as it translates to higher data throughput
and hence lower response times as experienced by users.
However, despite many techniques aimed at optimizing
data delivery over the Internet (e.g., web page optimization,
network acceleration), a critical problem we have continuously
experienced is slow data delivery. This problem takes
two forms: (1) data is not delivered at all; and (2) data delivery
is slow (i.e., it takes a long time to complete). For
example, consider a web client that is browsing a web page
and needs to download a JPEG file. In the first form, the JPEG
file is never received by the client. In the second form, the
client receives the JPEG file, but the download takes too long.
When slow data transfer happens, it cripples application
performance, and engineers need to investigate and
root-cause the issue. There are many possible causes. For
instance, the data sender may be overloaded and hence not
sending data to the downstream receiver; the network channel
may be slow; or the data receiver may be too busy to read
data from the network buffer. Based on our experience,
root-causing such issues is tedious and requires substantial
experience and expertise. Broadly, the possible causes of
such a problem can be classified into three types: (1) client
application causes; (2) network-side causes; and (3) server
application causes.
When carrying out such investigations, engineers typically
need to quickly isolate the different types of causes so that
they can focus on the suspicious component and perform
deeper analysis to find the root cause. This is a challenging
task for several reasons. First, the data delivery involves
multiple network entities, including two machines (i.e., the
sender and receiver) and the network routes between them, in
sharp contrast to the usual performance issues where only a
single machine is involved. Second, the diagnosis involves
multiple layers of information, including the application layer
and the transport layer. To isolate the causes, engineers have to
check various places, including the client log, server log, network
statistics, CPU usage, etc. Such checking takes much
time and effort, and often requires experience and
expertise from the performance engineers.
To save time and effort, performance engineers are
eager for more intelligent tools that help them quickly
isolate the root causes. Though the exact causes can vary
widely across scenarios, a quick isolation of the problem
type would still greatly help the engineers who investigate
the problem.
In this work, we specifically focus on fulfilling such a
request: quickly identifying the component to blame,
be it the sender, the receiver, or the network. This work
presents an algorithm and a prototype to help performance
engineers. The algorithm automatically isolates the root
cause when slow data delivery occurs. Once the component
to blame is identified, successive investigations can be
conducted to nail down the exact problem. These are outside
the scope of this work and are part of our future work.
For the remainder of this paper, after providing the
necessary technical background in Section II, we define
and motivate the problems being addressed using three
scenarios in Section III. We then present the designs in
Section IV. Based on the design principles, we propose
the solutions and deployment modes in Section V. We
describe a prototype and its performance evaluation in
Section VI. We then present related work in Section VII.
Finally, Section VIII concludes the work.
II. BACKGROUND AND SCOPE
A. Background
TCP transport protocol: The Transmission Control Protocol
(TCP) is a transport-layer protocol that provides
ordered and reliable delivery of streamed bytes. It is the
most widely used transport protocol today. TCP features
flow control to avoid overloading the receiver: the receiver
sets up a dedicated receive buffer, and the sender sets
up a corresponding send buffer. TCP also has congestion
control, retransmission mechanisms, etc., to ensure reliable
and fast data transfer.
Application protocol: Application protocols such as
HTTP are built on top of transport protocols. When application
performance suffers, the immediate symptoms are typically
higher latency and lower throughput. Though these two
symptoms are often related, we specifically focus on solving
the slow data transfer (i.e., low throughput) problem.
Depending on the application protocols and how applications
are built, the application layers may emit additional logs. For
instance, the data receiver may log the timestamps of sending
data requests, receiving the first byte of the requested data,
and receiving the last byte of the requested data. Similarly,
the data sender may log the timestamps of receiving the
data requests, sending the first byte of the requested data, and
sending the last byte of the requested data.
B. Scope
To understand the causes of the low-performing data
delivery problem, we first need to understand the data
delivery model, which determines the flow of bytes. For simplicity
of presentation, we assume the “client” is the data receiver,
while the “server” is the data sender. This assumption is
in line with today’s web browsing paradigm, where web
browsers are data receivers and web servers are data senders.
Generally, there are two models of data delivery: pull-based
and push-based. In the pull-based model, the client sends
a request to the server, and the server sends back a response
corresponding to the client request. In the push-based model, the
server pushes data to the client without the client explicitly
asking for it. The pull-based model is the more complicated
of the two, as the data flow of the push-based model is a subset
of that of the pull-based model. In other words, the only
difference between the two models is that the pull-based model
contains the “data request” phase, as shown in Figure 1.
Though data downloading can be carried over various
protocols including TCP and UDP, most of today’s data is
transferred over TCP, given the dominance of Internet
services. Hence, in this work we focus only on TCP-based
data downloading. For easier presentation, we use Linux
platforms to present our designs and solutions, due to Linux’s
popularity. However, the relevant designs and solutions
also apply to other platforms.
[Figure 1 contrasts the two delivery models: in the push model (a), the server sends the application “response” unprompted; in the pull model (b), the client first sends an application “request” and the server then returns the “response”.]
Figure 1. Push model and Pull model
III. PROBLEM DEFINITION AND MOTIVATION
SCENARIOS
We first define and demonstrate the problem we want to
solve using three motivating scenarios.
A. Problem definition
We use C to denote the client, which receives the data, and
S to denote the server, which sends the data back to C. We
first consider the pull-based data delivery model, where
for any data delivery, C sends a request Rq to S first; after
receiving Rq, S prepares the response data Rs and sends it
back to C.
We assume the time taken in good data delivery scenarios
is Tg, and denote the maximum of Tg as Tgmax. The
delivery time is calculated as the time difference between
when C’s application sends out Rq and when that application
receives the last byte of Rs with a read() call. Denoting the
actual data delivery time as Ta, we say that a particular data
delivery is slow when Ta > Tgmax.
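The criterion Ta > Tgmax can be expressed in a few lines of Python. This is a sketch with our own illustrative names (`is_slow` and its arguments are not from the prototype):

```python
def is_slow(ta_seconds: float, good_times_seconds: list) -> bool:
    """Return True when the actual delivery time Ta exceeds Tgmax,
    the maximum time observed across known-good deliveries."""
    tg_max = max(good_times_seconds)  # Tgmax
    return ta_seconds > tg_max
```

In practice Tgmax would be calibrated from a baseline of deliveries observed under normal conditions.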
B. Results of three motivating scenarios
To further illustrate the nature of the problem we are
addressing, we set up a simple experiment to demonstrate the
application symptoms as well as the root causes of slow data
downloading. In the experiment, a client (receiver) opens a
TCP connection to download a bulk of data from a server.
The environment is a Gigabit LAN in which the client
machine and the server machine are directly connected. The
RTT (round-trip time) is sub-millisecond. Hence, in
typical (“good”) data transfer scenarios, the expected data
transmission rate should be no less than 1 MB/s based on
common practice.
For each scenario, we measure the data delivery rate
on the client (receiver) side at the application level. The three
scenarios differ in where the performance bottleneck lies,
namely, the client (receiver), the server (sender), or the
network. Each of these bottlenecks can cause slow data
delivery. We plot the accumulated bytes received by the
receiver (at the application level) over time, where the time
line is in hour:minute:second format.
Figure 2. Similar symptoms of slow data downloading, but caused by
different types of bottlenecks: Server (a), Client (b) and Network (c)
1) Receiver is the bottleneck: In this scenario, the receiver
application is not reading fast enough. The downloading
progress as measured by the receiver is plotted in
Figure 2(a). The average throughput is about 3 KB/second.
2) Sender is the bottleneck (the application is not sending
fast enough): In this scenario, the sender application is
not sending fast enough. The average throughput is about
3 KB/second, as shown in Figure 2(b).
3) Network is the bottleneck (the network is not transmitting
fast enough): In this scenario, we introduce packet
losses into the network, and the network protocol is not
transmitting fast enough. The average throughput is about
5 KB/second, as shown in Figure 2(c).
C. Summary
We have demonstrated that in all three scenarios, the
data transfer is slow: it takes up to one minute to download
the 200 KB of data. Though the exact throughput varies across
the three scenarios, the application symptoms are similar,
regardless of which of the three bottlenecks is present.
IV. DESIGN
A. Design Overview
We observe that the various issues causing slow data transfer
between machines can be largely classified into three types based
on where the problem lies: (1) the sender application; (2) the
receiver application; or (3) the network channel. We aim
to quickly and automatically decide which type of problem
is causing the slow data downloading.
We also note that the pull model is a superset of the push
model: the push model is exactly the response-data-delivery
part of the pull model. Hence we break our
solution into two parts. First, we focus on the push model
and derive the algorithm based on observations from
experiments and our analysis of network protocols.
This solution relies on transport-layer knowledge of the network
buffer queue sizes. Second, building on the push-model
solution, we extend it to solve the pull-model
problem. The extended solution introduces a state machine and, by
incorporating application-layer knowledge, can deduce
the bottleneck that causes the slow data transfer.
B. Design for the push model
Considering the data delivery process in the push model, the
typical data flow of any TCP-based data transfer is illustrated
in Figure 3. The five steps, in terms of system calls and
network transmissions, are as follows. (1) At Step A, the server
application issues a write() system call, and the application
data is copied to the socket send buffer. (2) At Step B, the
server’s TCP layer issues a send() and transmits some data
to the network; the amount of data is subject to TCP’s
congestion control and flow control. (3) At Step C, the network
routes the data hop by hop to the receiver; the IP routing
protocol is in play in this part. (4) At Step D, the client’s
TCP layer receives the data via recv() and places it in the
receive buffer. (5) At Step E, the client
application issues a read() call to receive the data and copy
it to user space.
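The five steps can be illustrated with a minimal, self-contained Python example over a local socket pair. This is our own sketch, not the paper's code: sendall() on the server end covers steps A-B (copy to the send buffer, then transmit), the kernel's loopback transport stands in for the network routing of step C, and recv() on the client end covers steps D-E (drain the receive buffer into user space).

```python
import socket

# Create a connected pair of sockets standing in for server and client.
server_sock, client_sock = socket.socketpair()

payload = b"x" * 4096
server_sock.sendall(payload)   # steps A-B: copy to send buffer, transmit
server_sock.close()            # signal end of stream

received = bytearray()
while True:
    chunk = client_sock.recv(1024)  # steps D-E: read from receive buffer
    if not chunk:                   # empty read: sender closed the stream
        break
    received.extend(chunk)
client_sock.close()
```

If the client-side loop is delayed (the receiver bottleneck of Section III), unread bytes accumulate in the kernel receive buffer, which is exactly what the queue-size checks below detect.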
Figure 3. The design for push model

Figure 4. Time line of pull model data transfer

During this course, there could be three types of problems:
(1) Server-application bottleneck. The server application
may not have the data ready, which can be caused by multiple
reasons. For instance, the server machine may be over-committed
(e.g., running multiple applications), so the
particular application does not get a chance to write(); the
server application may have too many threads running, so
the write() thread is not scheduled on a CPU; or the
application’s data-preparation business logic may be slow.
(2) Client-application bottleneck. The client application may
be too busy to issue read() calls. Similar to the server side,
possible causes include machine over-commitment, and so
on. (3) Network-side bottleneck. The network (including
the TCP protocol) is unable to deliver data fast enough.
Possible causes are a poor network channel (e.g., high losses,
low bandwidth) or poor TCP/IP configuration or tuning.
Invariably, the symptom is that data cannot be pushed
through to the receiver side.
After careful analysis of the symptoms and the causes, we
propose to isolate the problem into the above three types based
on knowledge of the TCP sockets, specifically, the queue
lengths of the send and receive buffers.
In normal scenarios where data delivery is fast, the
receiver’s receive queue size should mostly be zero, indicating
that the receiver is fast enough to consume the received data
by copying it from the socket buffer to user space. The
sender’s send queue should be non-zero, indicating that
data is produced fast enough to keep up with
consumption. On the other hand, a non-zero receive queue
indicates a client-side bottleneck, and a zero send queue
indicates a server-side bottleneck.
When data transfer is indeed slow, and neither the server
application nor the client application is slowing down
the data transfer, we can conclude that the data transmission
(i.e., the network) between the machines is slow. Such a
network-side issue may involve both ends of the transmission. Also
note that this type of issue is not limited to the networking
or transport protocols themselves (e.g., TCP/IP); it may be
caused by a slowdown of the sender/receiver OS (e.g., in
TCP/IP protocol processing).
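The queue-size rule above can be condensed into a small classification function. This is a sketch with our own names; in particular, the priority order when both conditions hold (non-zero receive queue and zero send queue) is our own choice, since the rules as stated overlap in that corner case:

```python
def classify_bottleneck(recv_q: int, send_q: int) -> str:
    """Apply the push-model queue-size rule.

    recv_q: client-side receive queue size (bytes)
    send_q: server-side send queue size (bytes)
    """
    if recv_q > 0:        # receiver not draining its buffer fast enough
        return "client"
    if send_q == 0:       # sender not producing data fast enough
        return "server"
    return "network"      # data backed up in the send queue but not delivered
```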
C. Design for the pull model
The solution for the pull model is similar to the push-model
solution, with the key difference of the added client-request
phase. The server does not send back data until it receives the
data request. We illustrate the time line of a typical pull-based
data transfer in Figure 4. When the client needs to perform a
pull-based data delivery, it first sends out a request at T0; after the
network delivers the request to the server at T1, the server
prepares the data, followed by sending back the data starting
at T2 via a sequence of write() calls. The sending completes
at T4. After network transmission, the client begins receiving
data at T3, and completes receiving at T5. Note that though the
other timestamps are strictly ordered, the ordering of T3 and
T4 may vary depending on the exact scenario. Specifically,
for a small data transfer, T4 may precede T3, since a single
write() call may suffice. For a large data transfer, T3 typically
precedes T4.
From the above process, we can see that if the data
request fails to reach the server, the data delivery will not
happen at all. Though the symptom of this scenario differs
from the scenario where the delivery is merely slow, we
decide to solve both scenarios, since in both cases the data is
not completely delivered. To provide a complete solution to the
pull-model problem, we need to distinguish between the
possible failing components (i.e., client/network/server) for
the client-request phase as well.
We propose a state-machine-based algorithm in the following.
Figure 5 is a diagram illustrating the states monitored
by the state machine.
The state machine narrows the location of a problem in
a data transfer operation to one of three realms: a sender
realm that encompasses the sender or provider of the data
(e.g., a server), a receiver realm that encompasses the
recipient or receiver of the data (e.g., a client), and a network
realm that encompasses the communication link(s) that convey the
transferred data. In Figure 5, the states are depicted with
or near the component(s) whose action or actions cause a
change from one state to another.
A state engine process is fed the necessary knowledge
to monitor and identify the progress of a data transfer from
start (state S for a pull-based transfer, or state C
for a push-based transfer) to finish (state G). This may
require the OSes of the two entities (e.g., client and server
in Figure 5) and the applications that use the data (e.g.,
client application, server application) to emit certain types of
information at certain times. For instance, the receiver (e.g.,
the recipient’s OS and/or application) logs events at one
or more protocol layers, such as generation and dispatch of
a data request, transmission of the request from the receiver
machine, receipt of the first portion of data, and receipt of the
last portion of the data. Similarly, the data provider (e.g.,
the provider’s OS and/or application) logs events at one
or more protocol layers, such as receipt of a data request,
preparation of the data, dispatch of the first portion of the
data, and dispatch of the final portion of the data.
V. SOLUTION
We now present the detailed algorithm, which solves the
problem of slow data transfer for both the pull-model and
the push-model.
[Figure 5 depicts the client and server applications with their application-layer protocols, transport-layer protocols, and communication buffers, connected by communication links; the states S, A, B, C, D, E, F, and G are placed near the components whose actions trigger the corresponding transitions.]
Figure 5. The design for pull model
A. Solution Overview
Our proposed algorithm aggregates information from
the sender and receiver nodes, as shown in Figure 6. For
each node, information from both the application layer
and the transport layer is collected and utilized. The heart of
the algorithm is the Bottleneck Determination Engine (BDE),
which determines the bottleneck. Internally, the algorithm
maintains the current state of the data transfer and performs
state transitions when appropriate, handled by the
State Transition Engine (STE). The state transitions are based
on the information collected by the Information Collection
(IC) component, which collects and aggregates information from
both ends of the data transfer.
The algorithm we present follows these key design
principles:
• Distributed information aggregation: the solution utilizes
both client and server knowledge in order to piece
together the entire picture and identify the causes.
• Cross-layer information aggregation: the solution utilizes
both application-layer knowledge (e.g., the various
types of application logs) and transport-layer knowledge
(e.g., the send-queue and receive-queue sizes).
• State-machine-based expert system: the state transitions
are triggered by the aggregated knowledge from different
machines and different layers.
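The per-node, cross-layer inputs that the IC component aggregates can be modeled as a small record. The following sketch uses our own illustrative field names, not the prototype's data structures:

```python
from dataclasses import dataclass, field

@dataclass
class NodeInfo:
    """Cross-layer snapshot of one end of the transfer (illustrative)."""
    role: str                 # "sender" or "receiver"
    app_events: list = field(default_factory=list)  # application-layer log events
    send_queue: int = 0       # transport-layer send queue size, bytes
    recv_queue: int = 0       # transport-layer receive queue size, bytes

# One aggregated snapshot, as the IC might hand it to the STE and BDE.
sender = NodeInfo("sender", app_events=["last_data_sent"], send_queue=14480)
receiver = NodeInfo("receiver")
snapshot = {"sender": sender, "receiver": receiver}
```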
[Figure 6 shows the Information Collection component gathering application-layer and transport-layer information from both the sender node and the receiver node, and feeding the State Transition Engine and the Bottleneck Determination Engine.]
Figure 6. High level design of the algorithm
B. Collecting Queue Size Information
In order for the algorithm to identify the bottleneck, it
needs to gather information about the transport-layer queue sizes
on both the sender and receiver sides. There are many
ways to collect such information. In this work, we list two
example utilities, netstat [1] and ss [2], which can be issued
as commands on the hosts. (a) netstat (network statistics [1]) is a
command-line tool that displays network connections (both incoming and
outgoing) and network protocol statistics. For the purposes
of this work, we utilize the send-queue and receive-queue
sizes of the TCP/IP sockets. This tool is available
on many OSes, including Unix and Windows. (b) The ss command
[2] shows socket statistics. It can display statistics for
TCP as well as other types of sockets. Similar to netstat,
ss can display the send- and receive-queue sizes, which
can be used by our algorithm.
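A typical connection row of `ss -tn` output carries the queue sizes in its second and third columns (State, Recv-Q, Send-Q, local address, peer address). A minimal parser might look as follows; column order can vary across ss versions, so treat this as a sketch rather than a robust implementation:

```python
def parse_ss_line(line: str):
    """Parse one connection row of `ss -tn`-style output into
    (recv_q, send_q, local_addr, peer_addr). Assumes the common
    column order: State, Recv-Q, Send-Q, Local, Peer."""
    fields = line.split()
    _state, recv_q, send_q, local, peer = fields[:5]
    return int(recv_q), int(send_q), local, peer

# Example row (illustrative addresses and queue values).
sample = "ESTAB  0  14480  10.1.2.3:10001  10.4.5.6:36885"
recv_q, send_q, local, peer = parse_ss_line(sample)
```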
When utilizing these tools to collect transport-level queue
sizes, the collection process should meet the following
requirements. First, it should continuously output the queue
sizes of the TCP/IP sockets of interest for the duration of the
slow data transfer. If the collection tools (e.g., netstat and
ss) can only display instantaneous values at a single point in
time, they need to be invoked repeatedly to gather multiple
data points over the course of the data transfer.
Second, invoking the collection tools incurs overhead,
so they should be invoked without exerting too
much load on the system. To achieve this objective,
appropriate delays (e.g., 10 seconds) can be injected between
invocations.
Third, the delays between collection invocations should
not be fixed, to avoid synchronizing with the clocking of
application reads or TCP receives. For instance, if the receiver
application is designed to read the data every 10 seconds, and
the delays between collection invocations happen to also be
10 seconds, landing right after each application read, then the
collected queue sizes may always be zero, leading to the conclusion
that the receiver is not the bottleneck when in reality it
could be. Similarly, the sending application
or the TCP/IP stack may have similar clocking, which likewise
needs to be avoided by varying the delays injected between
collection invocations.
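The randomized delay can be implemented with simple jitter around a base interval. The following sketch uses our own names; `collect_queue_sizes` is a hypothetical hook standing in for a netstat/ss invocation:

```python
import random
import time

def jittered_delay(base_seconds: float = 2.0, spread: float = 0.5) -> float:
    """Delay drawn uniformly from [base*(1-spread), base*(1+spread)],
    so collection never locks onto an application's read/send period."""
    return base_seconds * random.uniform(1.0 - spread, 1.0 + spread)

def collect_repeatedly(collect_queue_sizes, rounds: int = 5) -> list:
    """Invoke the collection hook `rounds` times with jittered pauses."""
    samples = []
    for _ in range(rounds):
        samples.append(collect_queue_sizes())
        time.sleep(jittered_delay(0.01))  # tiny base delay for illustration
    return samples
```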
Fourth, the collection period should be long enough
to contain multiple data points, to avoid false alerts and ensure
a certain degree of confidence in the collected information.
These multiple data points can later be used to filter out
possible spike values. The reason is that some
tools (including netstat and ss) output only
the instantaneous queue sizes at the particular timestamps
when they are invoked.
C. State transitions
We describe the state transitions for both the pull and push
models. When the state machine is stuck at a particular state,
the corresponding bottleneck can be readily identified.
A pull-based data transfer begins in state S when the client
application issues a data request. When the client
(e.g., its application protocol or transport protocol) logs the queuing
of the request, the data transfer transitions from state S to
state A (the request has been queued in the client’s send
buffer). When the client’s send buffer is empty, or there
is some other indication that the request was transmitted
from the client, the data transfer transitions from state A to
state B (the request has been transmitted on the communication
link(s)). When receipt of the request is logged by the server
(e.g., its application protocol or transport protocol), the transfer
transitions from state B to state C (the server application
has received the request). A push-based data transfer may
be considered to start at state C.
After the server prepares and sends the first portion of the
data to be transferred (e.g., the first byte or the first packet),
the data transfer operation transitions to state D (the data
response is underway). Progress of the data transfer may
now depend on the amount of data to be transferred and the
speed with which it is conveyed by the communication link(s).
For example, if the amount of data being transferred is
relatively large and/or the communication path is relatively
slow, the data transfer transitions from state D to state E
when the client logs receipt of the first portion of the data,
transitions to state F when the server logs queuing/release of
the last portion of the data, and terminates at state G when
the client logs receipt of the last portion of the data. In Figure 5,
the lines with long dashes represent this chain of state transitions.
In another scenario, if the amount of data is relatively
small and/or the communication path is relatively fast, the
data transfer transitions from state D to state F when the
server logs queuing/release of the last portion of the data,
transitions to state E when the client logs receipt of the first
portion of the data, and terminates at state G when the client
logs receipt of the last portion of the data. The lines with
short dashes represent this chain of state transitions.
In other scenarios, instead of two different paths
through states E and F, separate (mirrored) states may be
defined that reflect the same statuses (i.e., client-logged
receipt of the first data, server-logged dispatch of the final data).
In these scenarios, there is only one valid
path through states E and F and through the two mirrored
states, which could illustratively be represented as E’ and
F’.
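The transitions above can be sketched as a transition table keyed by (state, event). The event names are our own illustration of the logged events described earlier; the state where the machine stalls indicates which realm to inspect:

```python
# Sketch of the pull-model state machine of Figure 5 (event names are ours).
TRANSITIONS = {
    ("S", "request_queued"): "A",
    ("A", "request_transmitted"): "B",
    ("B", "request_received"): "C",       # push-model transfers start at C
    ("C", "first_data_sent"): "D",
    # large transfer: client sees first data before the server finishes sending
    ("D", "first_data_received"): "E",
    ("E", "last_data_sent"): "F",
    # small transfer: server finishes sending before the client sees first data
    ("D", "last_data_sent"): "F",
    ("F", "first_data_received"): "E",
    # both orderings terminate when the client logs the last portion
    ("E", "last_data_received"): "G",
    ("F", "last_data_received"): "G",
}

def run(events, start="S"):
    """Advance through the state machine; an unmatched event leaves the
    state unchanged, so a stalled transfer stays at its last state."""
    state = start
    for event in events:
        state = TRANSITIONS.get((state, event), state)
    return state
```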
Note that more than a single bottleneck may exist at the
same time. For example, both the client and the server can be
bottlenecks. As another example, during the first half of the
data transfer the client may be the bottleneck, while during
the second half the network is the bottleneck. Though these
complicated cases can be handled easily by splitting the time
duration into smaller units and running the algorithm over
each unit, for simplicity of presentation we focus on the
scenarios where only a single bottleneck exists. However,
the presented algorithm/solution can be easily adapted to
accommodate the above more complicated scenarios.
The determination of which component is the bottleneck
when the state machine is stuck at one of the states D/E/F
depends on the transport-layer information. Specifically,
Table I illustrates the three scenarios and the
information needed to make the corresponding decision.
Specifically, if the receive queue on the client is non-zero,
then the client application is the bottleneck. If the send
queue on the server is zero, then the server application
is the bottleneck. If the client’s receive queue is zero
and the server’s send queue is non-zero, then the network
is the bottleneck.
D. Deployment mode
Our algorithm can be deployed and utilized in two modes,
online and off-line, depending on whether the algorithm is
invoked during the slow data transfer or after it.
Online mode: During the data transfer, if the user sees that
the transfer is slow and would like to diagnose it, the user
can invoke the algorithm. The algorithm will continuously
collect the required information at the application layer and
transport layer from both machines of the data transfer. To
achieve this online deployment, the continuous information
collection can be done in two ways. First, the information
can be streamed to the algorithm engines. Second, the
algorithm may choose to repeatedly mine the information
logged on the two machines.
Table I
THREE SCENARIOS IN PUSH MODEL
Bottleneck Recv. Que. Send Que. Notes
Client Non-zero Any Blocked receiving
Server Any Zero Blocked sending
Network Zero Non-zero Blocked delivery
Off-line mode: The user may also invoke the algorithm
in an off-line mode. For instance, the user knows the time
duration during which the data transfer was slow, and later
uses the algorithm to diagnose that particular time duration. The
advantage of this mode is that it does not need to continuously
collect the required information, which is more complex to
implement. On the other hand, it requires the user to record
the timestamps that define the slow-transfer time range.
VI. EVALUATION
A. Prototype
We implemented the proposed algorithm in a Python
prototype. The prototype works in off-line mode, and it
can be used to determine the bottleneck during a specific
time duration in which the data transfer is slow. The prototype
includes a netstat-based information collection script (i.e.,
“netstat -Ttp”) that repeatedly outputs the TCP connection
information on both the sender and receiver hosts. A random
delay is injected between two successive netstat
invocations, with an average delay of 2 seconds.
The output of each netstat invocation is prefixed by a
timestamp with a granularity of 1 ms.
The prototype first takes as user input a configuration
file. The file defines the directional data transfer connection,
the information source, and the time duration. The directional
data transfer connection is defined by a tuple of (src host, src
port, dst host, dst port). Note that unlike the common definition
of a TCP connection, which is bidirectional, the definition
of a connection here is directional, as slow data transfer
has a notion of direction. The information source is a local
directory that contains the netstat output from both ends.
The time duration is defined by a beginning timestamp
and an ending timestamp.
The prototype also takes as user input an “input”
directory, where the netstat outputs are stored. Since data
transfer is directional, the prototype treats each TCP connection
as two directional data transfer connections. Thus, each
extracted directional data transfer is recorded in a separate csv file.
Each csv file contains lines of tuples in the format (timestamp,
recv-queue, send-queue). One special aspect of
the processing is the handling of time zones. It is possible
for the sender and receiver hosts to be in different
time zones, in which case the netstat outputs would
be misaligned if not treated accordingly. To accommodate
possible differences in time zones, the configuration file
allows the user to specify the time zone of each end host.
Internally, the prototype performs time zone conversion
to align the csv data.
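Time zone alignment can be done by converting each host-local timestamp to UTC before merging the two csv streams. A sketch using Python's standard zoneinfo module follows; the timestamp format string is an assumption, not necessarily the prototype's:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def to_utc(timestamp: str, tz_name: str) -> str:
    """Interpret `timestamp` in the host's time zone and express it in UTC,
    so rows from the two ends align on a common clock."""
    local = datetime.strptime(timestamp, FMT).replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(ZoneInfo("UTC")).strftime(FMT)
```

For example, a receiver timestamp recorded in America/Los_Angeles is shifted forward to its UTC equivalent before comparison with the sender's UTC timestamps.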
Based on the user input in the configuration file, the
prototype extracts only the directional data transfer
connections of interest. Furthermore, only the time periods of
interest are extracted based on the user input. Given a particular
directional data transfer connection to be diagnosed,
Configuration File:
1 [GLOBAL]
2 ts_start=2014-04-21 20:56:40
3 ts_end=2014-04-21 20:57:40
4 time_zone=UTC
5 [SENDER]
6 host_name=sender.linkedin.com
7 port=10001
8 time_zone=UTC
9 [RECEIVER]
10 host_name=receiver.linkedin.com
11 port=36885
12 time_zone=America/Los_Angeles
Output of the algorithm:
13 [user1@host1]> ./rootcause.py -c rootcause.client.conf -i out2/resources/
14 Connection of sender.linkedin.com ['10001'] --> receiver.linkedin.com ['36885']
15 Results:
16 Client is slowly receiving, the bottleneck.
Figure 7. Sample Configuration and Sample Output
the prototype determines the bottleneck based on the rules presented earlier.
To determine the bottleneck reliably, the prototype also filters out spike values which might otherwise cause false alerts. Specifically, the decision that a queue size is zero or non-zero has to persist for at least a certain number of netstat invocations and a certain amount of time. For instance, in one debugging case, with netstat invoked every 2 seconds on average, for the prototype to conclude that the client side is the bottleneck, the receive queue size needed to be non-zero for at least 10 seconds.
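The persistence-filtered decision rule can be sketched as below. This is a minimal illustration of the design rules (persistent non-zero RecvQ points at the receiver, persistent zero SendQ at the sender, and an otherwise healthy queue pattern during a slow transfer points, by elimination, at the network); the function names and the 10-second threshold are illustrative, not the prototype's exact code:

```python
def longest_span(samples, cond):
    """Longest time span (seconds) over which cond(recvq, sendq) holds
    for consecutive samples. Each sample is (ts_seconds, recvq, sendq)."""
    best, start = 0.0, None
    for ts, rq, sq in samples:
        if cond(rq, sq):
            if start is None:
                start = ts
            best = max(best, ts - start)
        else:
            start = None
    return best

def classify(samples, min_persist=10.0):
    """Isolate the bottleneck for one directional transfer."""
    if longest_span(samples, lambda rq, sq: rq > 0) >= min_persist:
        return "receiver"          # receiver too slow to drain its RecvQ
    if longest_span(samples, lambda rq, sq: sq == 0) >= min_persist:
        return "sender"            # sender not producing data to send
    # Queues look healthy (SendQ backlog, empty RecvQ) yet the transfer
    # is slow: by elimination, the network path is the bottleneck.
    return "network"
```

Short-lived spikes never accumulate a span of `min_persist` seconds, so they are filtered out automatically.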
B. Results
We evaluated the built off-line prototype. The usage sce-
nario is as follows. After the user noticed slow data transfer
during some time period in the form of beginning time
stamp and ending time stamp, the user runs the prototype to
diagnose the issue and determine the bottleneck.
We have used this prototype to identify bottlenecks in the following LinkedIn production investigations: (1) Databus [3] bootstrapping, where a client receives bootstrapping events from the server over TCP/IP; and (2) Voldemort [4] cluster expansion, where data receivers fetch data from data senders over TCP/IP. We found that in the first scenario the Databus server (i.e., the data sender) is the bottleneck, while in the second scenario the network is the bottleneck.

(a) Send queue (b) Receive queue
Figure 8. Client is the bottleneck
To demonstrate all three scenarios, each with a different type of bottleneck (sender, receiver, or network), we designed a set of experiments using a custom-built workload. The workload can mimic the three types of bottleneck based on user inputs. For each scenario, a single separate TCP connection is created.
A sample configuration file and the corresponding sample output are shown in Figure 7. The “GLOBAL” section specifies the beginning and ending time stamps of the period during which the slow data transfer lasts, along with the time zone of the two time stamps. The “SENDER” and “RECEIVER” sections define the directional data transfer connection in the form of host names and ports. These two sections can also take the time zone used by each host to allow alignment of the netstat output.
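Since the configuration in Figure 7 is INI-shaped, it could be parsed with the standard-library `configparser`; the sketch below assumes underscore-separated option names matching the sample:

```python
import configparser

SAMPLE = """
[GLOBAL]
ts_start = 2014-04-21 20:56:40
ts_end = 2014-04-21 20:57:40
time_zone = UTC

[SENDER]
host_name = sender.linkedin.com
port = 10001
time_zone = UTC

[RECEIVER]
host_name = receiver.linkedin.com
port = 36885
time_zone = America/Los_Angeles
"""

def parse_conf(text):
    """Return (sender, receiver) endpoints from the configuration text."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    sender = (cfg["SENDER"]["host_name"], cfg["SENDER"].getint("port"))
    receiver = (cfg["RECEIVER"]["host_name"], cfg["RECEIVER"].getint("port"))
    return sender, receiver
```

The per-section `time_zone` options would feed the internal time zone conversion described above.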
1) Client is slowly receiving: We force the receiver (i.e., the client) to slow down data receiving by injecting delays between the calls to read() in application code, which represents a scenario where the client is the bottleneck. As shown in Figure 8, the client (receiver) side shows a RecvQ buildup, indicated by the non-zero values. These non-zero values last long enough for the algorithm to conclude that the client is the bottleneck.
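The delay-injection technique for the slow-receiver scenario can be sketched over a loopback connection; the function name and delay value are illustrative:

```python
import socket
import threading
import time

def run_slow_receiver(read_delay, payload=b"x" * 65536):
    """Send `payload` over a loopback TCP connection while the receiver
    sleeps `read_delay` seconds between read() calls. With a non-trivial
    delay, data backs up in the receiver's RecvQ (the client bottleneck)."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    def sender():
        s = socket.socket()
        s.connect(srv.getsockname())
        s.sendall(payload)        # sender writes as fast as it can
        s.close()

    t = threading.Thread(target=sender)
    t.start()
    conn, _ = srv.accept()
    received = bytearray()
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        received.extend(chunk)
        time.sleep(read_delay)    # injected delay between read() calls
    conn.close()
    srv.close()
    t.join()
    return bytes(received)
```

While such a transfer runs, netstat on the receiver would report a persistently non-zero Recv-Q for the connection.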
2) Server is slowly sending: We force the sender to slow down data sending by injecting delays between the calls to write() in application code, which represents a scenario where the sender is the bottleneck. As shown in Figure 9, the server side has zero SendQ values, and this lasts long enough for the algorithm to conclude that the sender is the bottleneck. Note that there is a spike in the SendQ; since the spike does not persist for a long enough period, it is internally filtered out by the algorithm.

(a) Send queue (b) Receive queue
Figure 9. Server is the bottleneck
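The symmetric slow-sender injection can be sketched the same way, with the delay placed between write() calls instead; again the names and values are illustrative:

```python
import socket
import threading
import time

def run_slow_sender(write_delay, payload=b"y" * 32768, chunk=4096):
    """Send `payload` in chunks, sleeping `write_delay` seconds between
    write() calls, while the receiver drains as fast as it can. This
    mimics the sender-bottleneck scenario: the SendQ stays near zero."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    received = bytearray()

    def receiver():
        conn, _ = srv.accept()
        while True:
            data = conn.recv(65536)   # receiver reads eagerly
            if not data:
                break
            received.extend(data)
        conn.close()

    t = threading.Thread(target=receiver)
    t.start()
    s = socket.socket()
    s.connect(srv.getsockname())
    for i in range(0, len(payload), chunk):
        s.sendall(payload[i:i + chunk])
        time.sleep(write_delay)       # injected delay between write() calls
    s.close()
    t.join()
    srv.close()
    return bytes(received)
```

Here netstat on the sender would report a Send-Q that is zero almost all of the time, matching the rule for the sender-bottleneck case.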
3) Network is slowly transmitting: We created a scenario where TCP transmits data slowly. Specifically, we inject delays and losses into the network path, such that TCP can only transmit at a very low throughput. As shown in Figure 10, the SendQ values are non-zero, while the RecvQ is zero; these values are typical of fast data delivery. Since we nevertheless observe a very low data transfer rate, the algorithm concludes that the network must be the bottleneck.
VII. RELATED WORK
Networking protocols are a critical component of distributed systems. Many protocols and algorithms [5]–[7] have been proposed to ensure high transmission rates in various scenarios.
Much work has been done on diagnosing computer performance problems, but most of it focuses on the performance of a specific component, for instance, a system performance bottleneck in a particular OS (e.g., Linux kernel 2.6.18 [8]) or a particular protocol (TCP Reno [9], [10], etc.). Though there is scattered knowledge and experience about debugging the targeted slow-data-transfer problem, to the best of our knowledge, no existing algorithm or prototype is equivalent to the one we propose.
For the overall problem of slow data transfer, all sources we can find rely only on performance expertise and experience (e.g., [11]), and there is no single algorithm or resource that achieves what our algorithm does. Moreover, we did not find any automated prototype or tool for this purpose.
(a) Send queue
(b) Receive queue
Figure 10. Network is the bottleneck
VIII. CONCLUSION
We proposed and implemented an algorithm to automatically determine the bottleneck component along the data transfer path. It reduces the effort required to diagnose and eventually root-cause slow data transfer problems.
REFERENCES
[1] “Netstat utility,” http://en.wikipedia.org/wiki/Netstat.
[2] “Ss utility,” http://man7.org/linux/man-pages/man8/ss.8.html.
[3] S. Das, C. Botev, and et al., “All aboard the databus!:
Linkedin’s scalable consistent change data capture platform,”
ser. SoCC ’12, New York, NY, USA, 2012.
[4] R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman,
and S. Shah, “Serving large-scale batch computed data with
project voldemort,” in Proceedings of the 10th USENIX
Conference on File and Storage Technologies, ser. FAST’12,
Berkeley, CA, USA, 2012, pp. 18–18.
[5] J. Zhu, S. Roy, and J. H. Kim, “Performance modelling
of tcp enhancements in terrestrial-satellite hybrid networks,”
IEEE/ACM Trans. Netw., vol. 14, no. 4, pp. 753–766, Aug.
2006.
[6] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson,
“Rtp: A transport protocol for real-time applications,” United
States, 2003.
[7] P. Sinha, T. Nandagopal, N. Venkitaraman, R. Sivakumar,
and V. Bharghavan, “Wtcp: A reliable transport protocol for
wireless wide-area networks,” Wirel. Netw., vol. 8, no. 2/3,
pp. 301–316, Mar. 2002.
[8] “Performance regression on tcp stream throughput,”
https://bugzilla.redhat.com/show_bug.cgi?id=705989.
[9] L. A. Grieco and S. Mascolo, “Performance evaluation and
comparison of westwood+, new reno, and vegas tcp conges-
tion control,” SIGCOMM Comput. Commun. Rev., vol. 34,
no. 2, pp. 25–38, Apr. 2004.
[10] C. Won, B. Lee, C. Yu, S. Moh, K. Park, and M.-J. Kim,
“A detailed performance analysis of udp/ip, tcp/ip, and m-via
network protocols using linux/simos,” J. High Speed Netw.,
vol. 13, no. 3, pp. 169–182, Aug. 2004.
[11] “Slow performance occurs when you copy data,”
http://support.microsoft.com/kb/823764.

More Related Content

PDF
Client-side web acceleration for low-bandwidth hosts
PDF
WebAccel: Accelerating Web access for low-bandwidth hosts
PDF
A SPDYier Experience by Olaniyi Jinadu
PDF
A3: application-aware acceleration for wireless data networks
PDF
Reducing download time through mirror servers
PPTX
Computer networks unit v
DOCX
Internet
PDF
IRJET- An Overview of Web Sockets: The Future of Real-Time Communication
Client-side web acceleration for low-bandwidth hosts
WebAccel: Accelerating Web access for low-bandwidth hosts
A SPDYier Experience by Olaniyi Jinadu
A3: application-aware acceleration for wireless data networks
Reducing download time through mirror servers
Computer networks unit v
Internet
IRJET- An Overview of Web Sockets: The Future of Real-Time Communication

What's hot (16)

PDF
Ieeepro techno solutions 2014 ieee java project - cloud bandwidth and cost ...
PDF
An in-building multi-server cloud system based on shortest Path algorithm dep...
PDF
MODIFIED BITTORRENT PROTOCOL AND ITS APPLICATION IN CLOUD COMPUTING ENVIRONMENT
DOCX
JPJ1410 PACK: Prediction-Based Cloud Bandwidth and Cost Reduction System
PPT
Chapter 2 v6.3
PDF
Bandwidth White Paper
PPT
IEEE ICPADS 2008 - Kalman Graffi - SkyEye.KOM: An Information Management Over...
PPTX
Middleware in Distributed System-RPC,RMI
PPTX
Unit 3 cs6601 Distributed Systems
PPT
Application layer protocols
PDF
Implementing a Caching Scheme for Media Streaming in a Proxy Server
PPTX
A Split Protocol Technique for Web Server Migration
PDF
Web Protocol Future (QUIC/SPDY/HTTP2/MPTCP/SCTP)
PDF
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
PDF
Multiple_Vendors_Part-1
PPT
Application layer
Ieeepro techno solutions 2014 ieee java project - cloud bandwidth and cost ...
An in-building multi-server cloud system based on shortest Path algorithm dep...
MODIFIED BITTORRENT PROTOCOL AND ITS APPLICATION IN CLOUD COMPUTING ENVIRONMENT
JPJ1410 PACK: Prediction-Based Cloud Bandwidth and Cost Reduction System
Chapter 2 v6.3
Bandwidth White Paper
IEEE ICPADS 2008 - Kalman Graffi - SkyEye.KOM: An Information Management Over...
Middleware in Distributed System-RPC,RMI
Unit 3 cs6601 Distributed Systems
Application layer protocols
Implementing a Caching Scheme for Media Streaming in a Proxy Server
A Split Protocol Technique for Web Server Migration
Web Protocol Future (QUIC/SPDY/HTTP2/MPTCP/SCTP)
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
Multiple_Vendors_Part-1
Application layer
Ad

Viewers also liked (20)

PDF
Mutual Exclusion in Wireless Sensor and Actor Networks
PDF
Hybrid Periodical Flooding in Unstructured Peer-to-Peer Networks
PDF
Mobile Hosts Participating in Peer-to-Peer Data Networks: Challenges and Solu...
PDF
OCPA: An Algorithm for Fast and Effective Virtual Machine Placement and Assig...
PDF
Eliminating OS-caused Large JVM Pauses for Latency-sensitive Java-based Cloud...
PDF
Building Cloud-ready Video Transcoding System for Content Delivery Networks (...
PDF
Optimizing CDN Infrastructure for Live Streaming with Constrained Server Chai...
PDF
Designing SSD-friendly Applications for Better Application Performance and Hi...
PDF
Dynamic Layer Management in Super-Peer Architectures
PDF
Wireless memory: Eliminating communication redundancy in Wi-Fi networks
PDF
Improving energy efficiency of location sensing on smartphones
PDF
Application-Aware Acceleration for Wireless Data Networks: Design Elements an...
PDF
Capacity Planning and Headroom Analysis for Taming Database Replication Latency
PDF
PAIDS: A Proximity-Assisted Intrusion Detection System for Unidentified Worms
PDF
Optimizing Streaming Server Selection for CDN-delivered Live Streaming
PDF
AOTO: Adaptive overlay topology optimization in unstructured P2P systems
PDF
Hazard avoidance in wireless sensor and actor networks
PDF
On the Impact of Mobile Hosts in Peer-to-Peer Data Networks
PDF
Optimizing JMS Performance for Cloud-based Application Servers
PDF
Enhancing Intrusion Detection System with Proximity Information
Mutual Exclusion in Wireless Sensor and Actor Networks
Hybrid Periodical Flooding in Unstructured Peer-to-Peer Networks
Mobile Hosts Participating in Peer-to-Peer Data Networks: Challenges and Solu...
OCPA: An Algorithm for Fast and Effective Virtual Machine Placement and Assig...
Eliminating OS-caused Large JVM Pauses for Latency-sensitive Java-based Cloud...
Building Cloud-ready Video Transcoding System for Content Delivery Networks (...
Optimizing CDN Infrastructure for Live Streaming with Constrained Server Chai...
Designing SSD-friendly Applications for Better Application Performance and Hi...
Dynamic Layer Management in Super-Peer Architectures
Wireless memory: Eliminating communication redundancy in Wi-Fi networks
Improving energy efficiency of location sensing on smartphones
Application-Aware Acceleration for Wireless Data Networks: Design Elements an...
Capacity Planning and Headroom Analysis for Taming Database Replication Latency
PAIDS: A Proximity-Assisted Intrusion Detection System for Unidentified Worms
Optimizing Streaming Server Selection for CDN-delivered Live Streaming
AOTO: Adaptive overlay topology optimization in unstructured P2P systems
Hazard avoidance in wireless sensor and actor networks
On the Impact of Mobile Hosts in Peer-to-Peer Data Networks
Optimizing JMS Performance for Cloud-based Application Servers
Enhancing Intrusion Detection System with Proximity Information
Ad

Similar to Guarding Fast Data Delivery in Cloud: an Effective Approach to Isolating Performance Bottleneck During Slow Data Delivery (20)

PDF
H017113842
PPTX
Online TCP-IP Networking Assignment Help
PDF
IRJET- Simulation Analysis of a New Startup Algorithm for TCP New Reno
PDF
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
PDF
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
PDF
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
PDF
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
PDF
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
PDF
Socket programming assignment
PPT
PPTX
Computer(presentation).pptx computer netwprl
PDF
Ieeepro techno solutions 2014 ieee dotnet project - cloud bandwidth and cos...
PDF
Reducing download time through mirror servers
PPTX
CN(BCS502) Module-4 _Transport Layer.pptx
DOCX
Individual CommentsYour answers missed there below topics, sp.docx
PDF
Week10 transport
PDF
DrShivashankar_Computer Net_Module-3.pdf
PDF
Transport laye
PPT
Design an Implementation of A Messaging and Resource Sharing Software
PDF
Unit 3 Assignment 1 Osi Model
H017113842
Online TCP-IP Networking Assignment Help
IRJET- Simulation Analysis of a New Startup Algorithm for TCP New Reno
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
A COMPREHENSIVE SOLUTION TO CLOUD TRAFFIC TRIBULATIONS
Socket programming assignment
Computer(presentation).pptx computer netwprl
Ieeepro techno solutions 2014 ieee dotnet project - cloud bandwidth and cos...
Reducing download time through mirror servers
CN(BCS502) Module-4 _Transport Layer.pptx
Individual CommentsYour answers missed there below topics, sp.docx
Week10 transport
DrShivashankar_Computer Net_Module-3.pdf
Transport laye
Design an Implementation of A Messaging and Resource Sharing Software
Unit 3 Assignment 1 Osi Model

Recently uploaded (20)

PDF
August -2025_Top10 Read_Articles_ijait.pdf
PPTX
Chapter 2 -Technology and Enginerring Materials + Composites.pptx
PPTX
Chemical Technological Processes, Feasibility Study and Chemical Process Indu...
PDF
Applications of Equal_Area_Criterion.pdf
PPTX
Principal presentation for NAAC (1).pptx
PDF
LOW POWER CLASS AB SI POWER AMPLIFIER FOR WIRELESS MEDICAL SENSOR NETWORK
PPTX
A Brief Introduction to IoT- Smart Objects: The "Things" in IoT
PPTX
Graph Data Structures with Types, Traversals, Connectivity, and Real-Life App...
PPTX
CN_Unite_1 AI&DS ENGGERING SPPU PUNE UNIVERSITY
PDF
August 2025 - Top 10 Read Articles in Network Security & Its Applications
PPTX
tack Data Structure with Array and Linked List Implementation, Push and Pop O...
PDF
Abrasive, erosive and cavitation wear.pdf
PPTX
Amdahl’s law is explained in the above power point presentations
PDF
UEFA_Carbon_Footprint_Calculator_Methology_2.0.pdf
PDF
Exploratory_Data_Analysis_Fundamentals.pdf
PPTX
ai_satellite_crop_management_20250815030350.pptx
PPTX
Petroleum Refining & Petrochemicals.pptx
PDF
distributed database system" (DDBS) is often used to refer to both the distri...
PDF
Java Basics-Introduction and program control
PDF
MLpara ingenieira CIVIL, meca Y AMBIENTAL
August -2025_Top10 Read_Articles_ijait.pdf
Chapter 2 -Technology and Enginerring Materials + Composites.pptx
Chemical Technological Processes, Feasibility Study and Chemical Process Indu...
Applications of Equal_Area_Criterion.pdf
Principal presentation for NAAC (1).pptx
LOW POWER CLASS AB SI POWER AMPLIFIER FOR WIRELESS MEDICAL SENSOR NETWORK
A Brief Introduction to IoT- Smart Objects: The "Things" in IoT
Graph Data Structures with Types, Traversals, Connectivity, and Real-Life App...
CN_Unite_1 AI&DS ENGGERING SPPU PUNE UNIVERSITY
August 2025 - Top 10 Read Articles in Network Security & Its Applications
tack Data Structure with Array and Linked List Implementation, Push and Pop O...
Abrasive, erosive and cavitation wear.pdf
Amdahl’s law is explained in the above power point presentations
UEFA_Carbon_Footprint_Calculator_Methology_2.0.pdf
Exploratory_Data_Analysis_Fundamentals.pdf
ai_satellite_crop_management_20250815030350.pptx
Petroleum Refining & Petrochemicals.pptx
distributed database system" (DDBS) is often used to refer to both the distri...
Java Basics-Introduction and program control
MLpara ingenieira CIVIL, meca Y AMBIENTAL

Guarding Fast Data Delivery in Cloud: an Effective Approach to Isolating Performance Bottleneck During Slow Data Delivery

  • 1. Guarding Fast Data Delivery in Cloud: an Effective Approach to Isolating Performance Bottleneck During Slow Data Delivery Zhenyun Zhuang, Haricharan Ramachandra, Badri Sridharan {zzhuang, hramachandra, bsridharan}@linkedin.com LinkedIn Corporation, 2029 Stierlin Court Mountain View, CA 94043 United States Abstract—Cloud-based products heavily rely on the fast data delivery between data centers and remote users - when data delivery is slow, the products’ performance is crippled. When slow data delivery occurs, engineers need to investigate the issue and find the root cause. The investigation requires experience and time, as data delivery involves multiple playing parts including sender/receiver/network. To facilitate the investigations, we propose an algorithm to automatically identify the performance bottleneck. The algorithm aggregates information from multiple layers of data sender and receiver. It helps to automatically iso- late the problem type by identifying which component of sender/receiver/network is the bottleneck. After isolation, suc- cessive efforts can be taken to root cause the exact problem. We also build a prototype to demonstrate the effectiveness of the algorithm. I. INTRODUCTION Cloud-based products (e.g., Cloud storage productions such as Amazon S3) involve data transfer between cloud data centers and remote users. For cloud environments, fast data delivery is critical to ensure high performing data products, as it translates to higher data throughput and hence less response time as experienced by the users. However, despite many techniques aimed at optimizing the data delivery through Internet (e.g. web page optimization, network acceleration), a critical problem we have continu- ously experienced is slow data delivery. This problem has two forms: (1) data is not delivered; and (2) data delivery is slow (i.e., taking long time to complete the delivery). For example, a web client is browsing a web page, and needs to download a jpeg file. 
In the first form, the jpeg file may not be received by the client at all. In the second form, the client is receiving the jpeg file, however it takes too much time to download. When slow data transfer happens, it cripples applica- tion performance, and engineers need to investigate and root cause the issue. There are many possible causes. For instance, it could be caused by the data sender being overloaded and hence not sending data to the downstream receiver; or the network channel is slow; or the data receiver is too busy to read data from network buffer. Based on our experiences, root causing the issues is tedious and requires lots of experiences and expertise. Largely, possible causes of such a problem can be classified into three types: (1) client application causes; (2) network side causes; and (3) server application causes. When carrying on such investigations, engineers typically need to quickly isolate different types of causes, so that they can focus on the particular suspicious component and perform deeper analysis to root cause. It is a challenging task due to several reasons. First, the data delivery involves multiple network entities including two machines (i.e., the sender and receiver) and the networking routes, which is in sharp contrast to the usual performance issues where only a single machine is involved. Second, the diagnosis involves multiple layers of information including application layer and transport layer. To isolate the causes, engineers have to check various places including client log, server log, network statistics, cpu usage.etc. These types of checking take much time and efforts, and often times requires experiences and expertise from the performance engineers. To save time and efforts, performance engineers are eagerly looking forward to seeing more intelligent tools to help them quickly isolate the root causes. 
Though the exact causes can vary wildly in different scenarios, a quick isolation of the problem types still would greatly help the engineers who investigate the problem. In this work, we specifically focus on fulfilling such a request: quickly identify the component that is to blame, be it the sender, the receiver, or the network. This work presents an algorithm and a prototype to help performance engineers. The algorithm can automatically isolate the root cause when slow data delivery occurs. Once the blame part is figured out, successive investigations can be conducted to nail down the real problem. These will be outside of the scope of this work, and are part of our future work. For the remainder of the writing, after providing some necessary technical background in section II, we then define and motivate the problems being addressed in this writing using three scenarios in Section III. We then present the designs in Section IV. Based on the design principles, we propose the solutions and deployment mode in Section V. We build a prototype and perform performance evaluation using the prototype in Section VI, respectively. We also present certain related works in Section VII. Finally in Section VIII we conclude the work.
  • 2. II. BACKGROUND AND SCOPE A. Background TCP transport protocol: Transmission Control Protocol (TCP) is one of the transport-layer protocols that provides ordered and reliable delivery of streamed bytes. It is the most widely used transport protocol Today. TCP features flow control to avoid overloading the receiver. The receiver sets up a dedicated receiver buffer, and the sender sets up a corresponding send buffer. TCP also have congestion control, retransmission mechanism, etc. to ensure the reliable and fast data transfer. Application protocol: Application protocols such as HTTP are built upon lower transport protocols. When appli- cation performance suffers, the immediate symptoms typi- cally are higher latency, lower throughput. Though these two symptoms are often related, we specifically focus on solving the slow data transfer (i.e. lower throughput) problem. Depending on application protocols and how applications are built, application layers may emit additional logs. For instance, the data receiver may log the time stamps of send- ing data requests, receiving the first byte of requested data and receiving the last byte of the requested data. Similarly, the data sender may log the time stamps of receiving the data requests, sending the first byte of requested data and sending the last byte of the requested data. B. Scope To understand the causes of the low performing data delivery problem, we firstly need to understand the data de- livery model which determines the flow of bytes. For simple presentation, we assume the “client” is the data receiver, while the “server” is the data sender. This assumption is in line with today’s web browsing paradigm where web browsers are data receivers and web servers are data senders. Generally there are two models of data delivery: pull- based and push-based. In pull-based model, the client sends a request to the server, and the server sends back a response corresponding to the client request. 
In push-based model, the server pushes data to the client, without the client explic- itly asking for that. Pull-based model is more complicated model, as the data flow of push-based model is only a subset of that of the pull-based model. In other words, the only difference between the two models is that pull-based model contains the “data request” phase, as shown in Figure 1. Though data downloading can be carried over with various protocols including TCP and UDP, most of today’s data are transferred through TCP, given the dominance of Internet service. Hence, in this work we only focus on TCP-based data downloading. For easier presentation, we choose Linux platforms to present our designs and solutions due to Linux’s popularity. However, the relevant designs and solutions will also apply to other platforms. Client Server Application “Response”PUSH Model (a) Push Model Client Server Application “request” Application “Response”PULL Model (b) Pull Model Figure 1. Push model and Pull model III. PROBLEM DEFINITION AND MOTIVATION SCENARIOS We first define and demonstrate the problem we want to solve using three motivating scenarios. A. Problem definition We use C to denote the client which receives the data; use S to denote the server that sends back the data to C. We firstly consider the pull-based data delivery model, where for any data delivery, C sends a request Rq to S first; after receiving Rq, S will prepare the response data Rs and send back to C. We assume the time taken in good data delivery scenarios is Tg, and the maximum of Tg is denoted as Tgmax. The delivery time is calculated as the time difference between when C’s application sends out Rq and when that application receives the last byte of the Rs with read() call. The actual data delivery time is Ta. We say that the particular data delivery is slow when Ta > Tgmax. B. 
Results of three motivation scenarios To further illustrate the nature of the problem we are addressing, we setup a simple experiment to demonstrate the application symptoms as well as the root causes of slow data downloading. In the experiment, a client (receiver) opens a TCP connection to download a bulk of data from a server. The environment are Gigbit LAN network, where the client machine and the server machine are directly connected. The RTT (round trip delay) is at sub-millisecond. Hence, in typical (“good”) data transfer scenarios, the expected data transmission rate should not be less than 1MB/s based on many practices. For each scenario, we measure the data delivery rate on the client (receiver) side at application level. The three scenarios differ in where the performance bottleneck is, namely, the client (receiver), the server (sender) and the network. Each of these bottlenecks can cause slow data
  • 3. Figure 2. Similar symptoms of slow data downloading, but caused by different types of bottlenecks: Server (a), Client (b) and Network(c) delivery. We plot the accumulated bytes received by the receiver (from application level) across the time line, which is in the format of hour:minute:second. 1) Receiver is the bottleneck: In this scenario, the re- ceiver application is not reading fast enough. The down- loading progress as measured by the receiver is plotted in Figure 2(a). The average throughput is about 3KB/second. 2) Sender is the bottleneck (The application not sending fast enough): In this scenario, the sender application is not reading fast enough. The average throughput is about 3KB/second, as shown in Figure 2(b). 3) Network is the bottleneck (The network is not trans- mitting fast enough): In this scenario, we introduce packet losses to the network, and the network protocol is not transmitting fast enough. The average throughput is about 5KB/second, as shown in Figure 2(c). C. Summary We have demonstrated that for all the three scenarios, the data transfer is slow - it takes up to one minute to download the 200KB data. Though the exact throughput vary in three scenarios, the application symptoms are similar, regardless of the three different bottlenecks. IV. DESIGN A. Design Overview We notice that various issues causing slow data transfer between machines are largely classified into 3 types based on where the problem is: (1) the sender application; (2) the receiver application; or (3) the networking channel. We hope to quickly and automatically decide which types of problems is causing the slow data downloading. We also notice the pull-model is a super-set of push- model, and the push-model is exactly the response data delivery part of the pull-model. Hence we break down our solution into two parts. First, we focus on the push-model and derive the algorithm based on our observations with some experiments and our analysis of network protocols. 
The solution relies on network-layer knowledge of network buffer queue size. Second, stepping on the solution of push- model, we extend the solution to solve the pull-model problem. The solution introduces a state-machine, and by incorporating the application-layer knowledge, it can deduct the bottleneck that causes the slow data transfer. B. Design for the push model Considering the data delivery process in push-model, a typical data flow in any TCP-based data transfer is illustrated in Figure 3. Five steps with regard to system calls and network transmissions are as below: (1) at Step-A, the server application issues a write() system call and the application data is copied to socket send buffer. (2) at Step-B, the server’s TCP layer issues send() call and sends some data to the network; The amount of data is subject to TCP’s congestion control and flow control. (3) at Step-C, Network will route the data hop-by-hop to the receiver; IP routing protocol is in play in this part. (4) at Step-D, The client’s TCP layer will receive the data with recv()system call. The data are put in receive buffer. (5) at Step-E, The client application issues read() call to receive the data and copy to user space. During this course, there are could be 3 types of problems: (1) Server-application bottleneck. The server application may not have the data ready, this could be caused by multiple Client Application Server Application read( ) Data packets recv( ) Recv Buff write( ) send( ) Send Buff A B E D C Figure 3. The design for push model
  • 4. Client Server Application “request” Application “Response” T0 T1 T2 T3 T4 T5 Figure 4. Time line of pull model data transfer reasons. For instance, the server machine may be over- committed (e.g., running multiple applications), hence the particular application does not get chance to write(); the server application may have too many threads running and hence the write() thread is not being scheduled on cpus; the application’s data-preparation business logic is slow. etc. (2) Client-application bottleneck. The client application may be too busy to issue read() call. Similar to server side, possible causes include machine over-commitment, and so on. (3) Network-side bottleneck. The network (including the TCP protocol) is unable to delivery data fast enough. Possible causes are poor network channel (e.g., high losses, low bandwidth) or poor TCP/IP configuration or tunings. Invariably, the symptom is the data are not being able to pushed through to the receiver side. After careful analysis of the symptoms and the causes, we propose to isolate the problem into the above 3 types based on the knowledge of the TCP sockets, specifically, the queue lengths of the send and receive buffers. In normal scenarios where data delivery is fast, the re- ceiver’s receive queue size should mostly be zero, indicating that the receiver is fast enough to consume the data received by copying the data from socket buffer to user space. The sender’s send queue should be non-zero, indicating the fact that the data are produced fast enough to supply the consumption. On the other hand, when the receive queue is non-zero, that indicates client-side bottleneck; when the send queue is zero, that indicates the server-side bottleneck. When data transfer is indeed slow, and neither the server application nor the sender application is the slowing down the data transfer, we can conclude that the data transmission (i.e., the network) between machines is slow. 
Such a network-side issue may involve both ends of the transmission. Also note that this type of issue is not limited to the networking or transport protocols themselves (e.g., TCP/IP); it may be caused by a slowdown of the sender/receiver OS (e.g., in TCP/IP protocol processing).

C. Design for the pull model

The solution for the pull model is similar to the push-model solution, with the key difference of an added client-request phase: the server will not send back data until it receives a data request. We illustrate the time line of a typical pull-based data transfer in Figure 4. When the client needs to do a pull-based data delivery, it first sends out a request at T0; after the network delivers the request to the server at T1, the server prepares the data, followed by sending back the data starting at T2 by a sequence of write() calls. The sending completes at T4. After network transmission, the client begins receiving data at T3, and completes receiving at T5. Note that though the other time stamps are strictly ordered, the ordering of T3 and T4 may vary depending on the exact scenario. Specifically, for a small data transfer, T4 may precede T3 since a single write() call may suffice. For a large data transfer, T3 typically precedes T4.

From the above process, we can see that if the data request fails to reach the server, then the data delivery will not happen, hence will not complete. Though the symptom of this scenario is different from the scenario where the delivery is slow, we decide to solve both scenarios since in both the data are not completely delivered. To provide a complete solution for the pull-model problem, we need to distinguish between the various possible failing components (e.g., client/network/server) for the client-request phase as well.

We propose a state-machine based algorithm in the following. Figure 5 is a diagram illustrating the states monitored by the state machine.
The state machine narrows the location of a problem in a data transfer operation to one of three realms: a sender realm that encompasses the sender or provider of the data (e.g., a server), a receiver realm that encompasses the recipient or receiver of the data (e.g., a client), and a network realm that encompasses the communication link(s) that convey the transferred data. In Figure 5, the states are depicted with or near the component(s) whose action or actions cause a change from one state to another.

A state engine process is fed the necessary knowledge to monitor or identify the progress of a data transfer from start (at state S for a pull-based transfer, or at state C for a push-based transfer) to finish (at state G). This may require the OSes of the two entities (e.g., client and server in Figure 5) and the applications that use the data (e.g., client application, server application) to emit certain types of information at certain times. For instance, the receiver (e.g., the recipient's OS and/or the application) logs events at one or more protocol layers, such as generation and dispatch of a data request, transmission of the request from the receiver machine, receipt of the first portion of the data, and receipt of the last portion of the data. Similarly, the data provider (e.g., the provider's OS and/or the application) logs events at one or more protocol layers, such as receipt of a data request, preparation of the data, dispatch of the first portion of the data, and dispatch of the final portion of the data.

V. SOLUTION

We now present the detailed algorithm, which solves the problem of slow data transfer for both the pull model and the push model.
Figure 5. The design for pull model (states S through G overlaid on the client and server application-layer protocols, transport-layer protocols, communication buffers, and the communication links)

A. Solution Overview

Our proposed algorithm aggregates the information from the sender and the receiver nodes, as shown in Figure 6. For each node, the information from both the application layer and the transport layer is collected and utilized. The heart of the algorithm is the Bottleneck Determination Engine (BDE), which determines the bottleneck. Internally, the algorithm maintains the current state of the data transfer and performs state transitions when appropriate, which is handled by the State Transition Engine (STE). The state transitions are based on the information collected by the Information Collection (IC) component, which collects and aggregates the information from both ends of the data transfer.

The algorithm we present has the following key design principles:
• Distributed information aggregation. The solution utilizes both client and server knowledge in order to piece together the entire picture and identify the causes.
• Cross-layer information aggregation. The solution utilizes both application-layer knowledge (e.g., the various types of application logs) and transport-layer knowledge (e.g., the send queue and receive queue sizes).
• State-machine based expert system. The state transitions are triggered by the aggregated knowledge from the different machines and different layers.

Figure 6. High level design of the algorithm (application-layer and transport-layer information from both the sender node and the receiver node feeds the Information Collection, the State Transition Engine, and the Bottleneck Determination Engine)

B. Collecting Queue Size Information
In order for the algorithm to identify the bottleneck, it needs to gather information about the queue sizes at the transport layer on both the sender and receiver sides. There are many ways to collect such information; in this work, we list two example tools/utilities, namely netstat [1] and ss [2]. These two utilities can be issued on the hosts as commands. (a) netstat (network statistics [1]) is a command-line tool that displays network connections (both incoming and outgoing) and network protocol statistics. For the purpose of this work, we utilize the sizes of the send queue and the receive queue of the TCP/IP sockets. This tool is available in many OSes including Unix and Windows. (b) The ss command [2] is used to show socket statistics. It can display stats for TCP as well as other types of sockets. Similar to netstat, ss can also display the send and receive queue sizes, which can be used by our algorithm.

When utilizing these tools/utilities to collect information about transport-level queue sizes, the collection process should meet the following requirements. First, the collection process should continuously output the queue sizes of the TCP/IP sockets of interest for the duration of the slow data transfer. If the collection tools (e.g., netstat and ss) can only display the instantaneous values at a single time point, they need to be invoked repeatedly to gather multiple data points during the course of the data transfer. Second, invoking the collection tools may cause overhead, hence they should be invoked without exerting too much overhead on the system. To achieve this objective, appropriate delays (e.g., 10 seconds) can be injected between invocations. Third, the delays between collection invocations should not be fixed, to avoid the clocking of application reading or TCP receiving.
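The first three requirements can be sketched as a collection loop with jittered delays. This is a hypothetical sketch, not the paper's prototype: the `sample_queues` name is ours, and the `probe` callable stands in for a wrapper that invokes ss or netstat and parses out the queue columns for the socket of interest.

```python
import random
import time

def sample_queues(probe, samples=5, mean_delay=2.0, jitter=0.5):
    """Collect timestamped queue-size readings.  `probe` stands in
    for a wrapper that invokes a tool such as ss or netstat and
    extracts the Recv-Q/Send-Q values for the socket of interest.
    The inter-invocation delay is randomized around `mean_delay` so
    the sampling cannot lock onto a periodic read()/write() pattern
    in the application (the "clocking" problem)."""
    readings = []
    for i in range(samples):
        readings.append((time.time(), probe()))
        if i < samples - 1:
            delay = random.uniform(mean_delay - jitter, mean_delay + jitter)
            time.sleep(max(0.0, delay))
    return readings

# Demo with a stand-in probe that always reports an empty receive
# queue and a backed-up send queue.
readings = sample_queues(lambda: (0, 53248), samples=3,
                         mean_delay=0.01, jitter=0.005)
print(len(readings))
```

The jitter bound and the mean delay would be tuned per deployment; the key point is only that the delay distribution is not a constant.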
For instance, if the receiver application is designed to read the data every 10 seconds, and the delays between collection invocations happen to be 10 seconds and fall right after the application reading, then the collected queue sizes may always be zero, leading to the conclusion that
the receiver is not the bottleneck, while in reality the receiver could be the bottleneck. Similarly, the sending application or the TCP/IP layer may have similar clockings, which need to be avoided by varying the delays injected between collection invocations. Fourth, the collection process should be long enough to contain multiple data points, to avoid false alerts and ensure a certain degree of confidence in the collected information. These multiple data points can later be used to filter out possible spike values. The reason for doing so is that some tools/utilities (including netstat and ss) output the instantaneous queue sizes at the particular time stamps when they are invoked.

C. State transitions

We describe the state transitions for both the pull and push models. When the state machine is stuck at a particular state, the corresponding bottleneck can be easily identified. A pull-based data transfer begins in state S when the client (e.g., the application) issues a data request. When the client (e.g., application protocol, transport protocol) logs queuing of the request, the data transfer transitions from state S to state A (the request has been queued in the client's send buffer). When the client's send buffer is empty, or there is some other indication that the request was transmitted from the client, the data transfer transitions from state A to state B (the request has been transmitted on the communication link(s)). When receipt of the request is logged by the server (e.g., application protocol, transport protocol), the transfer transitions from state B to state C (the server application has received the request). A push-based data transfer may be considered to start at state C. After the server prepares and sends a first portion of the data to be transferred (e.g., the first byte, the first packet), the data transfer operation transitions to state D (the data response is underway).
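These transitions, together with the two alternative completion orderings through states E and F, can be sketched as a table-driven state machine. The event labels below are our own; the paper does not prescribe an event vocabulary.

```python
# Allowed transitions of the pull-model state machine (Figure 5).
# Event labels are our own names for the logged events described
# in the text.
TRANSITIONS = {
    ("S", "request_queued"): "A",      # request queued in client send buffer
    ("A", "request_sent"): "B",        # request transmitted from the client
    ("B", "request_received"): "C",    # server logged receipt of the request
    ("C", "first_data_sent"): "D",     # data response is underway
    # Large transfer / slow path: first data arrives before the last is sent.
    ("D", "first_data_received"): "E",
    ("E", "last_data_sent"): "F",
    # Small transfer / fast path: last data is sent before the first arrives.
    ("D", "last_data_sent"): "F",
    ("F", "first_data_received"): "E",
    # Either path terminates when the client logs the final portion.
    ("E", "last_data_received"): "G",
    ("F", "last_data_received"): "G",
}

def run(events, start="S"):
    """Advance the state machine over logged events; the state where
    it stalls localizes the bottleneck (push transfers start at C)."""
    state = start
    for event in events:
        nxt = TRANSITIONS.get((state, event))
        if nxt is None:
            break  # no transition fired: the transfer is stuck here
        state = nxt
    return state

# A pull transfer that stalls after the request left the client but
# before the server logged it is stuck in state B.
print(run(["request_queued", "request_sent"]))
```

A transfer stuck in state B, for example, points at the network realm (or the server host) for the request phase.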
Progress of the data transfer may now depend on the amount of data to be transferred and the speed with which it is conveyed by the communication link(s). For example, if the amount of data being transferred is relatively large and/or the communication path is relatively slow, the data transfer transitions from state D to state E when the client logs receipt of the first portion of the data, transitions to state F when the server logs queuing/release of the last portion of the data, and terminates at state G when the client logs receipt of the last portion of the data. The lines with long dashes in Figure 5 represent this chain of state transitions. In another scenario, if the amount of data is relatively small and/or the communication path is relatively fast, the data transfer transitions from state D to state F when the server logs queuing/release of the last portion of the data, transitions to state E when the client logs receipt of the first portion of the data, and terminates at state G when the client logs receipt of the last portion of the data. The lines with short dashes in Figure 5 represent this chain of state transitions. In some other scenarios, instead of two different paths through states E and F, separate (mirrored) states may be defined that reflect the same statuses (i.e., client-logged receipt of the first data, server-logged dispatch of the final data). In these scenarios, therefore, there will be only one valid path through states E and F and through the two mirrored states, which could illustratively be represented as E' and F'.

Note that more than a single bottleneck may exist at the same time. For example, both the client and the server can be bottlenecks. As another example, during the first half of the data transfer the client may be the bottleneck, while during the second half the network is the bottleneck.
Though it is possible to handle these complicated cases by splitting the time duration into smaller units and producing the algorithm's output for each of these smaller units, for simpler presentation we choose to focus on the scenarios where only a single bottleneck exists. However, the presented algorithm/solution can easily be extended to accommodate the above more complicated scenarios.

The determination of which component is the bottleneck, when the state machine is stuck at one of the states D, E, or F, depends on the transport-layer information. Specifically, Table I illustrates all three scenarios and the information needed to make the corresponding decision. If the Receive Queue on the client is not zero, then the client application is the bottleneck. If the Send Queue on the server is zero, then the server application is the bottleneck. If the client's Receive Queue is zero and the server's Send Queue is non-zero, then the network component is the bottleneck.

D. Deployment mode

Our algorithm can be deployed and utilized in two modes, online and off-line, depending on whether the algorithm is invoked during the slow data transfer or after it.

Online mode: During the data transfer, if the user sees that the transfer is slow and would like to diagnose it, the user can invoke the algorithm. The algorithm will continuously collect the required information at the application layer and the transport layer from both machines of the data transfer. To achieve this online deployment, the continuous information collection can be done in two ways. First, the information can be streamed to the algorithm engines. Second, the algorithm may choose to repeatedly mine the information logged on the two machines.

Table I
THREE SCENARIOS IN PUSH MODEL

Bottleneck | Recv. Que. | Send Que. | Notes
Client     | Non-zero   | Any      | Blocked receiving
Server     | Any        | Zero     | Blocked sending
Network    | Zero       | Non-zero | Blocked delivery
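The decision rule of Table I can be written as a small classifier. This is a minimal sketch (the function name is ours), assuming the queue readings have already been filtered for persistence:

```python
def classify_bottleneck(recv_q, send_q):
    """Apply the Table I rule to a persistent queue-size observation:
    recv_q is the client's receive queue, send_q is the server's
    send queue."""
    if recv_q > 0:
        return "client"   # blocked receiving: data sits in the receive queue
    if send_q == 0:
        return "server"   # blocked sending: nothing queued for transmission
    return "network"      # recv queue drained, send queue backed up

print(classify_bottleneck(0, 53248))  # "network"
```

The ordering of the two checks reflects the "Any" entries in the table: a non-zero receive queue implicates the client regardless of the send queue, and only an empty receive queue lets the send queue distinguish server from network.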
Off-line mode: The users may also invoke the algorithm in an off-line mode. For instance, the user knows the time duration during which the data transfer was slow, and later uses the algorithm to diagnose that particular time duration. The advantage of this mode is that it does not need to continuously collect the needed information, which is more complex to implement. On the other hand, it requires the user to record the time stamps that define the slow-transfer time range.

VI. EVALUATION

A. Prototype

We implemented the proposed algorithm in a Python prototype. The prototype works in off-line mode, and it can be used to determine the bottleneck during a specific time duration in which the data transfer is slow. The prototype includes a netstat-based information collection script (i.e., "netstat -Ttp") that repeatedly outputs the TCP connection information on both the sender and receiver hosts. A random delay period is injected between two succeeding netstat invocations, with the average delay value being 2 seconds. The output of each netstat invocation is prefixed by a time stamp with a granularity of 1 ms.

The prototype first takes as user input a configuration file. The file defines the directional data transfer connection, the information source, and the time duration. The directional data transfer connection is defined by the tuple (src host, src port, dst host, dst port). Note that unlike the common definition of TCP connections, which are bidirectional, the definition of connections here is directional, as slow data transfer has a notion of direction. The information source is a local directory which contains the netstat output from both ends. The time duration is defined by the beginning time stamp and the ending time stamp. The prototype also takes as user input an "input" directory, where the netstat outputs are stored. Since data transfer is directional, the prototype treats each TCP connection as two directional data transfer connections.
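On each host, every timestamp-prefixed netstat sample must be reduced to a (time stamp, recv-queue, send-queue) tuple, with time stamps normalized to a common time zone. The sketch below shows such a parsing step under an assumed line format modeled loosely on `netstat -tn` output; the real prototype's format, and the `parse_sample` name, are our assumptions.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def parse_sample(line, host_tz):
    """Reduce one timestamp-prefixed netstat sample line to a
    (utc_time, recv_q, send_q) tuple.  The line format here is an
    assumption: a host-local time stamp prepended to the protocol,
    Recv-Q, Send-Q, and address columns."""
    ts_str, _proto, recv_q, send_q, *_rest = line.split()
    local = datetime.strptime(ts_str, "%Y-%m-%dT%H:%M:%S.%f")
    # Normalize the host-local time stamp to UTC so samples taken on
    # sender and receiver hosts in different time zones line up.
    utc = local.replace(tzinfo=ZoneInfo(host_tz)).astimezone(ZoneInfo("UTC"))
    return utc, int(recv_q), int(send_q)

sample = "2014-04-21T20:56:40.123 tcp 0 53248 10.0.0.1:10001 10.0.0.2:36885"
ts, rq, sq = parse_sample(sample, "America/Los_Angeles")
print(ts.hour, rq, sq)
```

Normalizing both ends to UTC at parse time is one way to realize the time-zone alignment that the configuration file makes possible.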
Thus, each extracted TCP data transfer is recorded in a separate csv file. Each csv file contains lines of tuples in the format (time stamp, recv-queue, send-queue).

One special treatment in the processing is the handling of time zones. It is possible for the sender and receiver hosts to be in different time zones. Because of this, the netstat outputs might be misaligned if not treated accordingly. To accommodate possible differences in time zones, the configuration file allows the user to specify the time zones of the two end hosts. Internally, the prototype performs a time zone conversion to align the csv data.

Based on the user input in the configuration file, the prototype extracts only the directional data transfer connections of interest. Furthermore, only the time periods of interest are extracted, based on the user input. Given a particular directional data transfer connection to be diagnosed, the prototype determines the bottleneck based on the rules presented earlier.

Configuration File:
[GLOBAL]
ts_start=2014-04-21 20:56:40
ts_end=2014-04-21 20:57:40
time_zone=UTC
[SENDER]
host_name=sender.linkedin.com
port=10001
time_zone=UTC
[RECEIVER]
host_name=receiver.linkedin.com
port=36885
time_zone=America/Los_Angeles

Output of the algorithm:
[user1@host1]> ./rootcause.py -c rootcause.client.conf -i out2/resources/
Connection of sender.linkedin.com ['10001'] --> receiver.linkedin.com ['36885']
Results:
Client is slowly receiving, the bottleneck.

Figure 7. Sample Configuration and Sample Output

To reliably determine the bottleneck, the prototype also filters out spike values which might cause false alerts. Specifically, a decision about whether a queue size is zero or non-zero has to persist for at least a certain number of invocations and at least a certain amount of time.
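Such a persistence check might be sketched as follows; the function name and the default thresholds are illustrative, not the prototype's:

```python
def persists(samples, predicate, min_points=5, min_seconds=10.0):
    """Return True only if `predicate` holds over a run of
    consecutive samples that is long enough in both sample count and
    wall-clock span.  `samples` is a list of
    (timestamp_seconds, queue_size) pairs."""
    run_start, run_len = None, 0
    for ts, value in samples:
        if predicate(value):
            if run_start is None:
                run_start = ts
            run_len += 1
            if run_len >= min_points and ts - run_start >= min_seconds:
                return True
        else:
            run_start, run_len = None, 0  # a spike or a gap resets the run
    return False

# A single non-zero spike in a queue-size trace is filtered out:
spiky = [(t, 9000 if t == 4 else 0) for t in range(0, 30, 2)]
print(persists(spiky, lambda v: v > 0))  # False
```

Requiring both a minimum count and a minimum span guards against two failure modes at once: a few closely spaced samples during a transient, and a long gap bridged by only one or two readings.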
For instance, in one debugging case, with an average netstat invocation interval of 2 seconds, in order for the prototype to conclude that the client side is the bottleneck, the receive queue size needs to be non-zero for at least 10 seconds.

B. Results

We evaluated the off-line prototype we built. The usage scenario is as follows: after the user notices slow data transfer during some time period, given in the form of a beginning time stamp and an ending time stamp, the user runs the prototype to diagnose the issue and determine the bottleneck. We have used this prototype to identify bottlenecks in the following LinkedIn production investigations/issues: (1) Databus [3] bootstrapping, where a client receives bootstrapping events from the server using the TCP/IP protocol; (2) Voldemort [4] cluster expansion, where data receivers fetch data from data senders using the TCP/IP protocol. We found that in the first scenario, the Databus server (i.e., the
data sender) is the bottleneck, while in the second scenario the networking part is the bottleneck.

Figure 8. Client is the bottleneck: (a) Send queue; (b) Receive queue

To demonstrate all three scenarios, where each scenario has a different type of bottleneck (i.e., sender/receiver/networking), we designed a set of experiments using custom-built workloads. The workloads can mimic the three types of bottleneck based on user inputs. For each of the scenarios, a single separate TCP connection is created. A sample configuration file and the corresponding sample output are shown in Figure 7. The "GLOBAL" section specifies the beginning and ending time stamps of the duration in which the slow data transfer lasts. It also takes the time zone of the two time stamps. The "SENDER" and "RECEIVER" sections define the directional data transfer connection in the form of host names and ports. These two sections can also take the time zone used by each host to allow alignment of the netstat output.

1) Client is slowly receiving: We force the receiver (i.e., the client) to slow down the data receiving by injecting delays between the calls to read() in the application code, which represents a scenario where the client is the bottleneck. As shown in Figure 8, the client side (receiver side) has a certain RecvQ buildup, which is indicated by the non-zero values. These non-zero values last for a while, so the algorithm can conclude that the client is the bottleneck.

Figure 9. Server is the bottleneck: (a) Send queue; (b) Receive queue

2) Server is slowly sending: We force the sender to slow down the data sending by injecting delays between the calls to write() in the application code, which represents a scenario where the sender is the bottleneck. As shown in Figure 9, the server side has zero SendQ values. The lasting period is significant enough to allow the algorithm to conclude the sender bottleneck. Note that there is a spike in the SendQ.
Since the spike does not persist for a long enough period, it is internally filtered out by the algorithm.

3) Network is slowly transmitting: We created a scenario where TCP transmits the data slowly. Specifically, we inject delays and latencies into the network path, such that TCP can only transmit at a very low throughput. As shown in Figure 10, the SendQ values are non-zero, while the RecvQ is zero. These values are typical of a fast data delivery; since we nevertheless observe a very low data transfer rate, the algorithm concludes that it must be a networking problem.

VII. RELATED WORK

Networking protocols are a critical component of distributed systems. Many protocols and algorithms [5]–[7] have been proposed to ensure high transmission rates in various scenarios. Much work has been done on diagnosing computer performance problems, but most of it focuses on the performance of a specific component, for instance, a system performance bottleneck in a particular OS (e.g., Linux kernel 2.6.18 [8]) or a particular protocol (TCP Reno [9], [10], etc.). Though there is scattered knowledge/experience about debugging the targeted slow-data-transfer problem, to the best of our knowledge we have not seen any algorithm or prototype that is equivalent to our proposed algorithm/prototype. For the overall problem of slow data transfer, all sources we can find rely only on performance expertise/experience (e.g., [11]), and there is no single algorithm or resource that achieves what our algorithm does. Moreover, we did not find any automated prototype/tool for that purpose.
Figure 10. Network is the bottleneck: (a) Send queue; (b) Receive queue

VIII. CONCLUSION

We proposed and implemented an algorithm to automatically determine the bottleneck component along the data transfer path. It reduces the effort required to diagnose and eventually root-cause slow data transfer problems.

REFERENCES

[1] "Netstat utility," http://en.wikipedia.org/wiki/Netstat.
[2] "Ss utility," http://man7.org/linux/man-pages/man8/ss.8.html.
[3] S. Das, C. Botev et al., "All aboard the databus!: Linkedin's scalable consistent change data capture platform," ser. SoCC '12, New York, NY, USA, 2012.
[4] R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, and S. Shah, "Serving large-scale batch computed data with project voldemort," in Proceedings of the 10th USENIX Conference on File and Storage Technologies, ser. FAST '12, Berkeley, CA, USA, 2012, pp. 18–18.
[5] J. Zhu, S. Roy, and J. H. Kim, "Performance modelling of tcp enhancements in terrestrial-satellite hybrid networks," IEEE/ACM Trans. Netw., vol. 14, no. 4, pp. 753–766, Aug. 2006.
[6] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "Rtp: A transport protocol for real-time applications," United States, 2003.
[7] P. Sinha, T. Nandagopal, N. Venkitaraman, R. Sivakumar, and V. Bharghavan, "Wtcp: A reliable transport protocol for wireless wide-area networks," Wirel. Netw., vol. 8, no. 2/3, pp. 301–316, Mar. 2002.
[8] "Performance regression on tcp stream throughput," https://bugzilla.redhat.com/show_bug.cgi?id=705989.
[9] L. A. Grieco and S. Mascolo, "Performance evaluation and comparison of westwood+, new reno, and vegas tcp congestion control," SIGCOMM Comput. Commun. Rev., vol. 34, no. 2, pp. 25–38, Apr. 2004.
[10] C. Won, B. Lee, C. Yu, S. Moh, K. Park, and M.-J.
Kim, "A detailed performance analysis of udp/ip, tcp/ip, and m-via network protocols using linux/simos," J. High Speed Netw., vol. 13, no. 3, pp. 169–182, Aug. 2004.
[11] "Slow performance occurs when you copy data," http://support.microsoft.com/kb/823764.