How To Run Mapreduce Jobs In Python
mrjob: Python Mapreduce Library
 Some Important Features:
 mrjob helps you to write MapReduce jobs in Python
and run them on Hadoop
 mrjob also allows you to run test code locally without
installing Hadoop
 mrjob supports multi-step MapReduce jobs
(e.g., CS246 Homework 1, Question 1)
 More Information:
 https://github.com/Yelp/mrjob
Run Hadoop Without a VM
 OSX platform
 brew install hadoop
 pip install mrjob
 Change configs
 core-site.xml
 hdfs-site.xml
 mapred-site.xml
 yarn-site.xml
 More detailed information:
http://stackoverflow.com/questions/25358793/error-launching-job-using-mrjob-on-hadoop
Example: Word Count
 Run locally
 python WordCount.py read.txt
 Run on Hadoop
 python WordCount.py read.txt -r hadoop
 Python vs. Java
 Lines of code: 13 vs. 61
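The slides do not show the source of WordCount.py, so the following is only a minimal sketch of the map, shuffle, and reduce phases of a word count in plain Python, with no Hadoop or mrjob required. The names `mapper`, `reducer`, `run_local`, and `WORD_RE` are assumptions made for illustration. With mrjob, the same two functions would become the `mapper(self, _, line)` and `reducer(self, word, counts)` methods of an `MRJob` subclass.

```python
import re
from itertools import groupby
from operator import itemgetter

# Assumption: a "word" is a run of letters, digits, underscores, or apostrophes.
WORD_RE = re.compile(r"[\w']+")


def mapper(_, line):
    """Map phase: emit (word, 1) for every word in the input line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce phase: add up all the 1s emitted for a single word."""
    yield word, sum(counts)


def run_local(lines):
    """Hypothetical local driver that imitates map -> shuffle -> reduce."""
    mapped = [pair for line in lines for pair in mapper(None, line)]
    # Shuffle: sort by key so that equal words sit next to each other.
    mapped.sort(key=itemgetter(0))
    results = {}
    for word, group in groupby(mapped, key=itemgetter(0)):
        for key, total in reducer(word, (count for _, count in group)):
            results[key] = total
    return results


counts = run_local(["the quick fox", "The lazy dog the end"])
print(counts["the"])  # 3
```

The driver is only there to make the sketch runnable on its own; when the job runs through mrjob, the framework takes care of reading the input, shuffling, and writing the output.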
Homework1
People You Might Know
Write a MapReduce program in
Hadoop that implements a simple
“People You Might Know” social
network friendship recommendation
algorithm. The key idea is that if two
people have a lot of mutual friends,
then the system should recommend
that they connect with each other.
Input: <User><TAB><Friends>
Output: <User><TAB><Recommendations>
Homework1
People You Might Know
 First Mapper
 Read each line and generate friend pairs
 Parameters: key – friend_pair
 If the pair is already friends, value = 0
 If the pair has this user as a common friend, value = 1
 E.g. {1: (2, 4)} → [((1, 2), 0), ((1, 4), 0), ((2, 4), 1)]
 First Reducer
 Count the total number of common friends for each key
 Parameters: key – friend_pair
value – sum(value)
 E.g. {(2, 4): [1, 1, 1, 0, 1, 0]} → {(2, 4): 4}
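The first step can be sketched as two plain Python generator functions that reproduce the examples on this slide. This is an illustration under stated assumptions, not the author's code: it assumes integer user IDs and a comma-separated friend list after the tab, and the names `first_mapper`, `first_reducer`, and the `drop_existing` flag are hypothetical. The slides do not say how pairs that are already friends are kept out of the recommendations; the optional flag shows one possible way, which is to drop any pair whose values contain a 0.

```python
from itertools import combinations


def first_mapper(line):
    """Turn one '<User><TAB><Friends>' line into (friend_pair, value) records.

    Assumption: friends are comma-separated integer IDs.
    """
    user_field, _, friends_field = line.partition("\t")
    user = int(user_field)
    friends = [int(f) for f in friends_field.strip().split(",") if f]

    # The user and each listed friend are already connected: value 0.
    for friend in friends:
        yield tuple(sorted((user, friend))), 0

    # Any two of this user's friends share the user as a common friend: value 1.
    for a, b in combinations(sorted(friends), 2):
        yield (a, b), 1


def first_reducer(pair, values, drop_existing=False):
    """Sum the values for one pair, as on the slide.

    drop_existing is a hypothetical switch: when True, a pair that has
    any 0 among its values (already friends) produces no output.
    """
    values = list(values)
    if drop_existing and 0 in values:
        return
    yield pair, sum(values)


print(list(first_mapper("1\t2,4")))
# [((1, 2), 0), ((1, 4), 0), ((2, 4), 1)]
print(list(first_reducer((2, 4), [1, 1, 1, 0, 1, 0])))
# [((2, 4), 4)]
```

Sorting each pair before emitting it means that (2, 4) and (4, 2) land on the same reducer key.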
Homework1
People You Might Know
 Second Mapper
 Re-key each friend pair by user
 Parameters: key – user
value – (friend, sum(value))
 E.g. {(2, 4): 4} → {2: (4, 4)}
 Second Reducer
 For each user, find the 10 candidates with the most common friends
 Parameters: key – user
value – the 10 candidates with the most common friends
 E.g. {2: [<4, 4>, <5, 11>, <6, 10>, …]} → {2: [5, 6, …]}
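A minimal sketch of the second step, again in plain Python, might look as follows. Several details are assumptions rather than something the slides state: the mapper emits the pair in both directions so that each user sees the other as a candidate, candidates with a count of 0 are skipped, and ties are broken by ascending user ID. The names `second_mapper`, `second_reducer`, and `top_n` are hypothetical.

```python
def second_mapper(pair, count):
    """Re-key a (friend_pair, count) record by each user in the pair."""
    a, b = pair
    yield a, (b, count)
    # Assumption: the relation is symmetric, so emit the reverse direction too.
    yield b, (a, count)


def second_reducer(user, candidates, top_n=10):
    """Keep the top_n candidates with the most common friends for one user."""
    # Assumption: a count of 0 means no common friends, so skip it.
    useful = [(friend, count) for friend, count in candidates if count > 0]
    # Highest count first; ties go to the smaller user ID (an assumption).
    ranked = sorted(useful, key=lambda item: (-item[1], item[0]))
    yield user, [friend for friend, _ in ranked[:top_n]]


print(list(second_mapper((2, 4), 4)))
# [(2, (4, 4)), (4, (2, 4))]
print(list(second_reducer(2, [(4, 4), (5, 11), (6, 10)])))
# [(2, [5, 6, 4])]
```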
Homework1
People You Might Know
 Multi-step jobs
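 Override MRJob.steps(self) to chain the steps of an
mrjob pipeline
from mrjob.step import MRStep  # at the top of the file

    # Inside the MRJob subclass:
    def steps(self):
        # Step 1 runs first; its output is the input of step 2.
        return [
            MRStep(mapper=self.mapper1,
                   reducer=self.reducer1),
            MRStep(mapper=self.mapper2,
                   reducer=self.reducer2),
        ]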
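To make the chaining concrete, here is a small plain Python simulation of what a multi-step job does conceptually: the (key, value) records produced by one step's reducer become the input of the next step's mapper. This is not how mrjob is implemented; `run_steps` and the toy mappers and reducers are hypothetical names used only for illustration.

```python
from collections import defaultdict


def run_steps(steps, records):
    """Run (mapper, reducer) pairs in order, feeding each step's output onward."""
    for mapper, reducer in steps:
        # Map, then group the mapped values by key (the "shuffle").
        groups = defaultdict(list)
        for key, value in records:
            for out_key, out_value in mapper(key, value):
                groups[out_key].append(out_value)
        # Reduce each group; the results are the next step's input records.
        records = [
            pair
            for out_key in sorted(groups)
            for pair in reducer(out_key, groups[out_key])
        ]
    return records


# Toy step 1: count words.
def mapper1(_, line):
    for word in line.split():
        yield word, 1


def reducer1(word, counts):
    yield word, sum(counts)


# Toy step 2: send every count to a single key and keep the largest.
def mapper2(word, count):
    yield "max", (count, word)


def reducer2(_, pairs):
    count, word = max(pairs)
    yield word, count


result = run_steps(
    [(mapper1, reducer1), (mapper2, reducer2)],
    [(None, "a b a"), (None, "b a")],
)
print(result)  # [('a', 3)]
```

The list passed to `run_steps` plays the role of the list returned by `steps(self)`: the order of the entries is the order in which the steps run.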
Homework1
People You Might Know
 Run locally
 python friends_commend.py soc-LiveJournal1Adj-2.txt > friends.txt
 Run on Hadoop
 python friends_commend.py soc-LiveJournal1Adj-2.txt -r hadoop > friends.txt
References
 Stanford CS246: Mining Massive Data Sets (Winter 2015). (n.d.).
Retrieved August 22, 2015.
 Yelp/mrjob. (n.d.). Retrieved August 22, 2015.
 Error launching job using mrjob on Hadoop. (n.d.). Retrieved August
22, 2015.
 MapReduce Tutorial. (n.d.). Retrieved August 22, 2015.
