Introduction to Hadoop (part 2)
Vienna Technical University - IT Services (TU.it)
Dr. Giovanna Roda
Some advanced topics
▶ the YARN resource manager
▶ fault tolerance and HDFS Erasure Coding
▶ the mrjob library
▶ HDFS I/O benchmarking
▶ MapReduce spilling
YARN
Hadoop jobs are managed by YARN (short for Yet Another Resource Negotiator), which is responsible
for allocating resources and scheduling jobs.
Basic resource types are:
▶ memory (memory-mb)
▶ virtual cores (vcores)
YARN supports an extensible resource model that allows you to define any countable resource. A
countable resource is one that is consumed while a container is running and released
afterwards. Such a resource can be, for instance:
▶ GPUs (gpu)
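You can inspect the resources actually available on the nodes of a running cluster with the standard YARN CLI; a quick check (the -showDetails flag is only available in recent Hadoop versions, and plain yarn node -list works everywhere but shows less detail):
yarn node -list -showDetails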
YARN
Each job submitted to YARN is assigned:
▶ a container, an abstract entity that bundles resources such as memory, CPU,
disk, network, etc. Container resources are allocated by the Scheduler.
▶ an ApplicationMaster service, assigned by the Applications Manager, that monitors
the progress of the job and restarts tasks if needed
YARN
The main idea of YARN is to have two distinct daemons for job scheduling and monitoring,
one global and one local to each application:
▶ the Resource Manager is the global job manager, consisting of:
▶ Scheduler → responsible for allocating resources across all applications
▶ Applications Manager → accepts job submissions and restarts Application Masters on
failure
▶ an Application Master for each application, a daemon responsible for negotiating
resources, monitoring the status of the job, and restarting failed tasks.
YARN architecture
[Architecture diagram] Source: Apache Software Foundation
YARN Schedulers
YARN supports two types of schedulers out of the box:
▶ Capacity Scheduler
shares resources across queues with capacity guarantees → allows deterministic scheduling
▶ Fair Scheduler
all apps get, on average, an equal share of resources over time. Fair sharing allows one
app to use the entire cluster if it’s the only one running.
By default, fairness is based on memory alone, but this can be configured to take other
resources into account as well.
YARN Dynamic Resource Pools
YARN supports Dynamic Resource Pools (or dynamic queues).
Each job is assigned to a queue (the queue can be determined, for instance, by the user’s Unix
primary group).
Queues are assigned weights, and resources are split among queues in proportion to their weights.
For instance, if you have one queue with weight 20 and three other queues with weight 10, the
first queue will get 40% and each of the other queues 20% of the available resources, because solving
20x + 3 · 10x = 100
gives x = 2, so the first queue’s share is 20x = 40% and each other queue’s is 10x = 20%.
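To sanity-check this kind of weight arithmetic, here is a small plain-Python sketch (illustrative only, not part of YARN) that computes each queue's share from its weight:
def queue_shares(weights):
    # each queue's fraction of the cluster is its weight / total weight
    total = sum(weights)
    return [w / total for w in weights]

print(queue_shares([20, 10, 10, 10]))  # [0.4, 0.2, 0.2, 0.2]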
Testing HDFS I/O throughput with TestDFSIO
TestDFSIO is a tool included in the Hadoop software distribution for measuring read and write
performance of the HDFS filesystem.
It’s a useful tool for assessing the performance of your Hadoop filesystem, identifying performance
bottlenecks, and supporting decisions on tuning your HDFS configuration.
TestDFSIO uses MapReduce to write and read files on the HDFS filesystem; the reducer is used
to collect and summarize test data.
Testing HDFS I/O throughput with TestDFSIO
Note: you won’t be able to run TestDFSIO on a cluster unless you’re the cluster’s
administrator. It is recommended to run TestDFSIO tests when the Hadoop cluster is not
otherwise in use.
To run tests as a non-superuser you need read/write access to
/benchmarks/TestDFSIO on HDFS, or else you must specify another directory (on HDFS) where to
write and read data with the option -D test.build.data=myDir.
Testing HDFS I/O throughput with TestDFSIO
TestDFSIO writes and reads files on HDFS, spawning exactly one mapper per file.
These are the main options for running stress tests:
▶ -write to run write tests
▶ -read to run read tests
▶ -nrFiles the number of files (set to be equal to the number of mappers)
▶ -fileSize size of files (followed by B|KB|MB|GB|TB)
TestDFSIO: run a write test
Run a test with 80 files (remember: this is the number of mappers), each of size 10GB.
JARFILE=/opt/pkg/software/Hadoop/2.10.0-GCCcore-8.3.0-native/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.10.0-tests.jar
yarn jar $JARFILE TestDFSIO -write -nrFiles 80 -fileSize 10GB
The jar file needed to run TestDFSIO is hadoop-mapreduce-client-jobclient*tests.jar
and its location depends on your Hadoop installation.
TestDFSIO: examine test results
Test results are appended by default to the file TestDFSIO_results.log. The log file can
be changed with the option -resFile resultFileName.
The main measurements returned by TestDFSIO are:
▶ throughput in mb/sec
▶ average IO rate in mb/sec
▶ standard deviation of the IO rate
▶ test execution time
TestDFSIO: examine test results
Main units of measure:
▶ throughput or data transfer rate measures the amount of data read from or written
to the filesystem per unit of time (expressed in Megabytes per second, MB/s).
▶ IO rate, also abbreviated as IOPS, measures IO operations per second, i.e. the number of
read or write operations that can be performed in one second.
Throughput and IO rate are related:
average IO size × IOPS = throughput in MB/s
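A worked example with made-up numbers (not taken from any test run): at an average IO size of 2 MB and 15 IO operations per second, throughput is 30 MB/s:
average_io_size_mb = 2.0  # hypothetical average size of one IO operation, in MB
iops = 15                 # hypothetical IO operations per second
print(average_io_size_mb * iops)  # 30.0 MB/s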
TestDFSIO: examine test results
Sample output of a test write run:
----- TestDFSIO ----- : write
Date & time: Sun Sep 13 16:14:39 CEST 2020
Number of files: 80
Total MBytes processed: 819200
Throughput mb/sec: 30.64
Average IO rate mb/sec: 34.44
IO rate std deviation: 17.45
Test exec time sec: 429.73
TestDFSIO: examine test results
Concurrent versus overall throughput
The throughput reported by TestDFSIO is the average throughput across all map tasks. To get an
approximate overall throughput on the cluster, divide the total MBytes processed by the test
execution time in seconds.
In our test the overall write throughput on the cluster is:
819200 / 429.73 ≈ 1906 MB/sec
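The same computation in Python, using the figures from the sample run above:
total_mbytes = 819200   # Total MBytes processed
exec_time_sec = 429.73  # Test exec time sec
print(total_mbytes / exec_time_sec)  # ≈ 1906 MB/sec overall, vs. 30.64 MB/sec per map task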
TestDFSIO: some things to try
▶ run a read test using the same command as the write test, but with the option
-read in place of -write (a read test requires that the test files have been written first).
▶ use the option -D dfs.replication=1 to measure the effect of replication on I/O
performance:
yarn jar $JARFILE TestDFSIO -D dfs.replication=1 -write -nrFiles 80 -fileSize 10GB
TestDFSIO: remove data after completing tests
When done with testing, remove the temporary files with:
yarn jar $JARFILE TestDFSIO -clean
If you used a custom myDir output directory, use:
yarn jar $JARFILE TestDFSIO -clean -D test.build.data=myDir
Further reading: the TestDFSIO documentation and the source code, TestDFSIO.java.
The mrjob library
A typical data processing pipeline consists of multiple steps that need to run in sequence.
mrjob is a Python library that offers a convenient framework for writing multi-step
MapReduce jobs in pure Python, testing them on your own machine, and running them on a cluster.
The mrjob library: how to define a MapReduce job
A job is defined by a class that inherits from MRJob. This class contains methods that define
the steps of your job.
A “step” consists of a mapper, a combiner, and a reducer; all of these are optional, but each step must define at least one.
Step-by-step tutorials and documentation can be found at https://mrjob.readthedocs.io/
The mrjob library: a sample MapReduce job
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        # for each input line, emit its character, word, and line counts
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        # sum the partial counts for each key
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
The mrjob library: a sample MapReduce job
Save the script as mr_wordcount.py and run it with:
python mr_wordcount.py file.txt
This command returns the number of characters, words, and lines in file.txt.
To get aggregated counts for multiple files use:
python mr_wordcount.py file1.txt file2.txt …
The mrjob library: a sample MapReduce pipeline
To define a succession of MapReduce jobs, override the steps() method to return a list of MRStep objects (MRStep is imported from mrjob.step):
def steps(self):
    return [
        MRStep(mapper=self.mapper_get_words,
               combiner=self.combiner_count_words,
               reducer=self.reducer_count_words),
        MRStep(reducer=self.reducer_find_max_word)
    ]
A complete, runnable version of this job is sketched below.
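Here is a minimal self-contained sketch of the two-step pipeline above, adapted from the “most used word” example in the mrjob documentation; it finds the most frequently used word in the input:
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")

class MRMostUsedWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   combiner=self.combiner_count_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        # emit (word, 1) for every word in the line
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner_count_words(self, word, counts):
        # sum partial counts locally before the shuffle
        yield (word, sum(counts))

    def reducer_count_words(self, word, counts):
        # key every (count, word) pair by None so they all reach one reducer
        yield None, (sum(counts), word)

    def reducer_find_max_word(self, _, word_count_pairs):
        # pick the (count, word) pair with the highest count
        yield max(word_count_pairs)

if __name__ == '__main__':
    MRMostUsedWord.run()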
The mrjob library: run offline and on a cluster
Note that up to now we ran everything offline, on our local machine. To run the script on a
cluster use the option -r hadoop. This sets Hadoop as the target runner.
Alternatively:
▶ -r emr for Amazon EMR clusters
▶ -r dataproc for Google Cloud Dataproc
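For example, a hypothetical invocation on an HDFS input file (the path is illustrative):
python mr_wordcount.py -r hadoop hdfs:///user/alice/file.txt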
THANK YOU FOR YOUR ATTENTION
www.prace-ri.eu