Minimizing remote accesses in MapReduce clusters
P Tandon, MJ Cafarella, TF Wenisch
2013 IEEE International Symposium on Parallel & Distributed …, 2013 · ieeexplore.ieee.org
MapReduce, in particular Hadoop, is a popular framework for the distributed processing of large datasets on clusters of relatively inexpensive servers. Although Hadoop clusters are highly scalable and ensure data availability in the face of server failures, their efficiency is poor. We study data placement as a potential source of inefficiency. Despite networking improvements that have narrowed the performance gap between map tasks that access local or remote data, we find that nodes servicing remote HDFS requests see significant slowdowns of collocated map tasks due to interference effects, whereas nodes making these requests do not experience proportionate slowdowns. To reduce remote accesses, and thus avoid their destructive performance interference, we investigate an intelligent data placement policy we call 'partitioned data placement'. We find that, in an unconstrained cluster where a job's map tasks may be scheduled dynamically on any node over time, Hadoop's default random data placement is effective in avoiding remote accesses. However, when task placement is restricted by long-running jobs or other reservations, partitioned data placement substantially reduces remote access rates (e.g., by as much as 86% over random placement for a job allocated only one-third of a cluster).
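
The abstract only sketches the idea of partitioned data placement, but the intuition can be illustrated with a small toy simulation. The sketch below is an assumption-laden illustration, not the paper's actual algorithm: it assumes a 30-node cluster, HDFS-style 3-way replication, and a partitioning scheme that places one replica of every block in each of three node partitions, then compares the remote-access rate of a job confined to one-third of the cluster under random versus partitioned placement.

```python
import random

# Illustrative toy model only. The abstract does not give the exact
# partitioned-placement algorithm; the scheme here (one replica of every
# block in each of REPLICATION node partitions), plus the cluster size,
# replication factor, and block count, are assumptions for demonstration.

NUM_NODES = 30      # assumed cluster size
REPLICATION = 3     # assumed HDFS replication factor
NUM_BLOCKS = 9000   # assumed number of input blocks


def random_placement(num_blocks, num_nodes, replication):
    """Hadoop-default-style placement: each block's replicas land on
    randomly chosen distinct nodes anywhere in the cluster."""
    return [set(random.sample(range(num_nodes), replication))
            for _ in range(num_blocks)]


def partitioned_placement(num_blocks, num_nodes, replication):
    """Partitioned placement sketch: split the nodes into `replication`
    equal partitions and put one replica of every block in each partition,
    so any single partition holds a complete copy of the dataset."""
    part_size = num_nodes // replication
    partitions = [range(i * part_size, (i + 1) * part_size)
                  for i in range(replication)]
    return [{random.choice(part) for part in partitions}
            for _ in range(num_blocks)]


def remote_access_rate(placement, allowed_nodes):
    """Fraction of blocks with no replica on the job's allowed nodes,
    i.e. blocks that a map task must fetch over the network."""
    allowed = set(allowed_nodes)
    remote = sum(1 for replicas in placement if not replicas & allowed)
    return remote / len(placement)


if __name__ == "__main__":
    random.seed(0)
    # Job restricted to one-third of the cluster, echoing the abstract's example.
    job_nodes = range(NUM_NODES // 3)

    rnd = random_placement(NUM_BLOCKS, NUM_NODES, REPLICATION)
    par = partitioned_placement(NUM_BLOCKS, NUM_NODES, REPLICATION)

    print(f"remote-access rate, random placement:      {remote_access_rate(rnd, job_nodes):.1%}")
    print(f"remote-access rate, partitioned placement: {remote_access_rate(par, job_nodes):.1%}")
```

Under these assumptions, the partitioned layout gives the restricted job a complete local copy of its input, while random placement leaves a noticeable fraction of blocks with no replica on the job's nodes. The 86% reduction quoted in the abstract refers to the paper's own experimental setup, not to this toy model.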