Apache Hadoop for System Administrators
Allen Wittenauer
Twitter: @_a__w_
Email: aw@apache.org
Hadoop Deployed Now?
Planning Hadoop Deployment?
Needed some place to sit before lunch?
An Extremely Quick & Incomplete Intro to Hadoop
• Map (“transform”)
– Perl:

@items = (1, 2, 3, 4, 5);
sub sqr { return $_ ** 2 }    # squares the current element held in $_
print join(',', map { sqr() } @items);
1,4,9,16,25
– Python:

items = [1, 2, 3, 4, 5]
def sqr(x): return x**2
print(list(map(sqr, items)))
[1, 4, 9, 16, 25]
• Reduce (“compress” or “fold”)
– Perl:

use List::Util qw/reduce/;
@items = (1, 4, 9, 16, 25);
print reduce { $a > $b ? $a : $b } @items;    # keeps the larger of each pair
25
– Python:

from functools import reduce
items = [1, 4, 9, 16, 25]
print(reduce(lambda x, y: x if x > y else y, items))
25
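To tie the two primitives above to the MapReduce model, here is a minimal single-process sketch in Python. It does not use Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for illustration, and the grouping step only approximates what the framework does between the map and reduce stages.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records):
    """Emit one (word, 1) pair per word, the way a mapper would."""
    for line in records:
        for word in line.split():
            yield (word, 1)


def shuffle(pairs):
    """Group values by key (illustrative stand-in for the framework's shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Fold each key's list of values down to a single number."""
    return {key: reduce(lambda x, y: x + y, values)
            for key, values in groups.items()}


def word_count(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(word_count(["never gonna give you up", "never gonna let you down"]))
```

On a real cluster the map and reduce calls run as separate tasks on different nodes; this sketch only shows the data flow.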
NEVER

GIVE

GONNA

YOU UP
Hadoop (‘common’ or ‘core’) / MapReduce / HDFS
Hadoop (‘common’ or ‘core’) / MapReduce / S3
Hadoop (‘common’ or ‘core’) / MapReduce / Gluster
Hadoop (‘common’ or ‘core’) / HBase / HDFS
[Diagram: NameNode, 3 × DataNode, 6 × D, 6 × ext4]
[Diagram: JobTracker, 3 × TaskTracker, each with M M R R]
[Diagram: NameNode, JobTracker, 5 × DN, 5 × TT, 5 × node block (D D D D / M M R R)]
Hadoop isn’t designed for system administrators and/or support staff.
“Hadoop is not a developer problem; it’s an operations problem.”
-- Hadoop vendor ex-employee
Don’t Make Assumptions
tail’ing the logs won’t tell you the whole story.
Monitor the masters!
• LinkedIn’s Configuration
– 30+ Health Checks per Grid
• Masters, canary report, daily fsck, etc.
– 10+ Health Checks per DC
• LDAP, Kerberos, etc. ...
– Cross-DC Nagios Server Checks

• Warn: 5% down nodes
• Panic: 30% down
• HDFS: 20% Free Space
• Gateway home dir: 10% free space
• ...

[Diagram: Nagios, ZK, VD, NN, JT, Compute Nodes, AZ, GW]
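The thresholds listed above can be expressed as a small check. This is a sketch only: the function names are hypothetical, and it assumes the thresholds are inclusive and that the free-space figures mean "alert when free space drops below this percentage", neither of which the slide states.

```python
def grid_status(total_nodes, down_nodes, warn_pct=5.0, panic_pct=30.0):
    """Classify a grid by the percentage of its nodes that are down."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    down_pct = 100.0 * down_nodes / total_nodes
    if down_pct >= panic_pct:      # assumption: threshold is inclusive
        return "PANIC"
    if down_pct >= warn_pct:       # assumption: threshold is inclusive
        return "WARN"
    return "OK"


def free_space_ok(free_bytes, total_bytes, min_free_pct):
    """True when at least min_free_pct percent of the space is still free."""
    return 100.0 * free_bytes / total_bytes >= min_free_pct


if __name__ == "__main__":
    print(grid_status(total_nodes=400, down_nodes=25))   # 6.25% of nodes down
    print(free_space_ok(15, 100, min_free_pct=20))       # below the 20% floor
```

A wrapper for a monitoring system would map these results onto that system's own exit codes.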
• Health Check Script
– “OK” - good status
– “ERROR (message)” - bad status

mapred.healthChecker.script.path

• Consider checking ...
– critical software
– ownership & permissions
– network connection speed
– drive count
– file system space
– RO file systems
– IO errors
– missing memory
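As an illustration of the output convention above, here is a minimal health check sketch in Python covering three of the listed items (missing directories, read-only or unwritable file systems, and file system space). The function name `check_node`, the 10% free-space floor, and the example directory are assumptions for illustration, not values from the talk.

```python
import os
import shutil


def check_node(paths, min_free_fraction=0.10):
    """Return "OK" or "ERROR (message)" for the given directories."""
    for path in paths:
        if not os.path.isdir(path):
            return "ERROR (%s is missing)" % path
        # A read-only mount or bad ownership both show up as "not writable".
        if not os.access(path, os.W_OK):
            return "ERROR (%s is not writable)" % path
        usage = shutil.disk_usage(path)
        if usage.free < min_free_fraction * usage.total:
            return "ERROR (%s is low on space)" % path
    return "OK"


if __name__ == "__main__":
    # Hypothetical directory; substitute the node's real data and task dirs.
    print(check_node(["/tmp"]))
```

The script reports only the first problem it finds, which keeps the message short; collecting every failure into one message is an equally reasonable design.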
• Use the tools most of your users’ code is written in!
• Pig
– testfile:

100
– Code:

A = load 'testfile' using PigStorage(',') as (i: int);
B = foreach A generate i;
C = distinct B;
dump C;
– Output:

(100)
Reactive
Proactive
Resource Controls
• JobTracker Memory Resource Controls
– Limit jobs stored in JT heap: mapred.jobtracker.completeuserjobs.maximum
– Limit total # of job tasks: mapred.jobtracker.maxtasks.per.job

• Job Memory Resource Controls
– Scheduler-level: mapred.cluster.*.memory.mb
– TT-level: auto-calculated based upon MR slot counts & scheduler-level settings
– MR Job-level: mapred.job.*.memory.mb
– Linux only: /proc memory calculator and task killer
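The arithmetic behind the TT-level and job-level settings can be sketched as follows. This assumes the TaskTracker total is simply slots multiplied by per-slot memory, and that a task asking for more than one slot's worth of memory is charged a whole number of slots; both are assumptions about scheduler behavior, and the function names are hypothetical.

```python
def tasktracker_task_memory_mb(map_slots, reduce_slots, map_slot_mb, reduce_slot_mb):
    """Memory (MB) a TaskTracker would set aside for tasks under this assumption."""
    return map_slots * map_slot_mb + reduce_slots * reduce_slot_mb


def slots_needed(job_task_mb, cluster_slot_mb):
    """Slots one task occupies when it requests job_task_mb of memory."""
    return -(-job_task_mb // cluster_slot_mb)  # ceiling division on integers


if __name__ == "__main__":
    # Hypothetical node: 2 map slots at 1024 MB and 2 reduce slots at 2048 MB.
    print(tasktracker_task_memory_mb(2, 2, 1024, 2048))
    # A task asking for 2500 MB when one slot is sized at 1024 MB.
    print(slots_needed(2500, 1024))
```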
“I set the heap to 1G but my
process ran out of memory?”
Treat HDFS like any other multi-tenant FS
• Quota everything
– Yes, including /tmp
– No “show me all quotas” functionality

dfsadmin -setQuota
dfsadmin -setSpaceQuota

• Be consistent:
– /user/* all get the same quota

• Be flexible:
– Make another dir for users to store big projects (e.g., /project)

• Be smart:
– Have a policy that content in /tmp gets deleted after X days. Automate this!
– Build reporting that shows files that are replicated fewer than 3 times
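One way to keep quotas consistent is to generate the dfsadmin commands from a single policy rather than typing them per user. The sketch below only builds the argument lists and does not run anything; the function name, the user directories, and the quota values are hypothetical.

```python
def quota_commands(user_dirs, name_quota, space_quota):
    """Build dfsadmin argument lists giving every directory the same quotas."""
    commands = []
    for path in user_dirs:
        commands.append(
            ["hadoop", "dfsadmin", "-setQuota", str(name_quota), path])
        commands.append(
            ["hadoop", "dfsadmin", "-setSpaceQuota", str(space_quota), path])
    return commands


if __name__ == "__main__":
    # Hypothetical users and limits; print the commands for review first.
    for cmd in quota_commands(["/user/alice", "/user/bob"], 100000, "1t"):
        print(" ".join(cmd))
```

Printing the commands before executing them makes the policy easy to review and to store alongside other configuration.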
Compute Node Disk
Partitioning as Protective Measure
• root partitioning

20 GB /, ...
200 GB task space
(rest) HDFS

• non-root partitioning

5 GB swap
200 GB task space
(rest) HDFS
Security!
• Queue Level ACLs (mapred-queue-acls.xml)
– users
– groups
– netgroups

• Service Level ACLs (hadoop-policy.xml)
– hosts
– users
– groups
– netgroups

– Limitation: Web services are all or nothing! :(
– Be aware: Hadoop uses ephemeral ports all over the place! :(
Kerberos!
[Diagram: Kerberos realms. Corp IT Active Directory (@CORP), Grid Realm (@GRID), Client Node, Password, Hadoop Services; principals shown: krbtgt/GRID@CORP, krbtgt/user@CORP, krbtgt/host@GRID, krbtgt/service@GRID]
http://data.linkedin.com/opensource/white-elephant
• Fonzi: http://www.flickr.com/photos/elzey/7224689810
• Captain Obvious by artist Stuart McGhee. http://stuartmcghee.com/
• Ant on flower: http://www.flickr.com/photos/bolonski/6116358907
• Ant Colony: http://www.flickr.com/photos/klearchos/2821230516
• Ant Queen: http://commons.wikimedia.org/wiki/File:Camponotus_crispulus_queen_ant.jpg
• Canary: http://www.flickr.com/photos/nathan_and_jenny/2454127424
• Mona Lisa: Leonardo Da Vinci
• White Elephant: http://data.linkedin.com/opensource/white-elephant
• Ecce Homo:
– Elías García Martínez (original)
– Cecilia Giménez (restored)
Thanks!
Contact:
Twitter: @_a__w_
Email: aw@apache.org
More info:
Quora: www.quora.com/user/allenwittenauer
SlideShare: www.slideshare.net/allenwittenauer
