Hadoop Streaming
IT Merchant Development Team
Tae Young Lee
2014.11.11
5th study session
Developing MapReduce in Python
Hadoop Streaming
• Developers and teams can write MR jobs in a language they already know
• Libraries provided by that language can be used
• Data is exchanged over standard I/O, so performance is lower than Java MR
• But what if it buys you development productivity?
Hadoop Streaming works with any language that supports standard input/output (stdio).
※ Two components must be defined:
1. An executable Mapper file implementing the map function
2. An executable Reducer file implementing the reduce function
MapReduce
1. MAP's role: read the input data from standard input
2. MAP's role: write Key, Value pairs to standard output
3. REDUCER's role: read MAP's <Key, Value> output from standard input
4. REDUCER's role: write Key, Value pairs to standard output
[Diagram: data input (file read, pipe, streaming, etc.) → Python map processing → PIPE → Python reduce processing → MR result output]
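Because every stage talks only over stdin/stdout, the whole flow can be rehearsed locally with Unix pipes before going anywhere near the cluster. A minimal sketch (mapper.py and reducer.py are the scripts built on the following slides, input.txt is a stand-in input file, and sort emulates Hadoop's shuffle/sort phase):
[hadoop@big01 ~]$ cat input.txt | ./mapper.py | sort -k 1 | ./reducer.py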
Installing Python (for the standard-I/O Mapper examples)
1. Download 2.7.8 from the python site and extract it
2. In the account's home directory, create a symbolic link named python pointing to the Python directory
3. ./configure
4. make
Typing the python command every time is tedious too, so it is shortened to py
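As shell commands, the install might look like this (a sketch; the exact paths are assumptions, chosen to match the /home/hadoop/python/py shebang used later):
[hadoop@big01 ~]$ wget https://www.python.org/ftp/python/2.7.8/Python-2.7.8.tgz
[hadoop@big01 ~]$ tar xzf Python-2.7.8.tgz
[hadoop@big01 ~]$ ln -s Python-2.7.8 python          # symlink: ~/python -> the extracted source tree
[hadoop@big01 ~]$ cd python
[hadoop@big01 python]$ ./configure && make           # builds the python binary in place
[hadoop@big01 python]$ ln -s python py               # shorten the command to py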
Running Python: the Hello BC example
hello.py:
print 'Hello BC'
Typing py every time is still a chore. Make the Python script itself executable with a #! (shebang) line:
hello.py:
#!/home/hadoop/python/py
print 'Hello BC'
[hadoop@big01 ~]$ chmod 755 hello.py
[hadoop@big01 ~]$ ./hello.py
Hello BC
[hadoop@big01 ~]$ py hello.py
Hello BC
Python MAP: a standard-I/O Mapper example
mapper.py:
#!/home/hadoop/python/py
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)   # emit <word, 1>, tab-separated

[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py
bc	1
bc	1
bc	1
card	1
bc	1
card	1
it	1
Python MAP: sorting the Mapper output
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py
bc	1
bc	1
bc	1
card	1
bc	1
card	1
it	1
[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1
bc	1
bc	1
bc	1
bc	1
card	1
card	1
it	1
Sorted on the first field. Locally, sort plays the role of the shuffle/sort phase that Hadoop runs between map and reduce.
Python REDUCE: a standard-I/O Reducer example
reducer.py:
#!/home/hadoop/python/py
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count       # same as the current word: add the count
    else:
        if current_word:             # current word is not None:
            print '{0}\t{1}'.format(current_word, current_count)   # emit the M/R result
        current_count = count
        current_word = word          # start a new current word

if current_word == word:             # flush the last word
    print '{0}\t{1}'.format(current_word, current_count)

[hadoop@big01 python]$ echo "bc bc bc card bc card it" | ./mapper.py | sort -k 1 | ./reducer.py
bc	4
card	2
it	1
Python ♥ Hadoop
Hadoop Streaming requirements
1. In Hadoop Streaming, the mapper/reducer must be specified as an executable command:
[OK] hadoop jar hadoop-streaming*.jar -mapper map.py -reducer reduce.py …
[NO] hadoop jar hadoop-streaming*.jar -mapper python map.py -reducer python reduce.py …
2. Set the directory PATH so the Python scripts are reachable from anywhere
If these are not met, the job fails with:
Caused by: java.lang.RuntimeException: configuration exception
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 24 more
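A quick pre-flight check that both requirements hold (a sketch; assumes the scripts sit in the current directory):
[hadoop@big01 ~]$ chmod 755 mapper.py reducer.py   # must be executable
[hadoop@big01 ~]$ head -1 mapper.py                # shebang must point at a real interpreter
#!/home/hadoop/python/py
[hadoop@big01 ~]$ echo "bc card" | ./mapper.py     # runs without typing python explicitly
bc	1
card	1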
Python ♥ Hadoop
Hadoop Streaming
hadoop command [genericOptions] [streamingOptions]
hadoop jar hadoop-streaming-2.5.1.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /usr/bin/wc
Location of Hadoop Streaming in Hadoop 2.x:
$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar
Python ♥ Hadoop
Hadoop Streaming command options (streamingOptions)
hadoop command [genericOptions] [streamingOptions]

Parameter | Optional/Required | Description
-input directoryname or filename | Required | Input location for mapper
-output directoryname | Required | Output location for reducer
-mapper executable or JavaClassName | Required | Mapper executable
-reducer executable or JavaClassName | Required | Reducer executable
-file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default
-partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output
-cmdenv name=value | Optional | Pass environment variable to streaming commands
-inputreader | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class)
-verbose | Optional | Verbose output
-lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks | Optional | Specify the number of reducers
-mapdebug | Optional | Script to call when map task fails
-reducedebug | Optional | Script to call when reduce task fails
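Because word counting is an associative sum, the same reducer script can double as the -combiner to shrink map output before the shuffle. A sketch, not from the deck (wc_alice_comb is a hypothetical output directory; the other file names follow the deck's examples):
hadoop jar hadoop-streaming-2.5.1.jar \
-input alice -output wc_alice_comb \
-mapper mapper.py -combiner reducer.py -reducer reducer.py \
-file mapper.py -file reducer.py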
Python ♥ Hadoop
Hadoop Streaming generic options
hadoop command [genericOptions] [streamingOptions]

Parameter | Optional/Required | Description
-conf configuration_file | Optional | Specify an application configuration file
-D property=value | Optional | Use value for given property
-fs host:port or local | Optional | Specify a namenode
-files | Optional | Specify comma-separated files to be copied to the Map/Reduce cluster
-libjars | Optional | Specify comma-separated jar files to include in the classpath
-archives | Optional | Specify comma-separated archives to be unarchived on the compute machines

Usage example:
hadoop jar hadoop-streaming-2.5.1.jar \
-D mapreduce.job.reduces=2 \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /usr/bin/wc
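The generic -files option can ship the scripts in place of the per-file -file streaming option; note that generic options must come before streaming options. A sketch (wc_alice_files is a hypothetical output directory):
hadoop jar hadoop-streaming-2.5.1.jar \
-files mapper.py,reducer.py \
-input alice -output wc_alice_files \
-mapper mapper.py -reducer reducer.py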
Python ♥ Hadoop
Running Hadoop Streaming: WordCount
[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
-input alice -output wc_alice \
-mapper mapper.py -reducer reducer.py \
-file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:43 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: number of splits:2
14/11/11 23:51:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1416242552451_0009
14/11/11 23:51:44 INFO impl.YarnClientImpl: Submitted application application_1416242552451_0009
14/11/11 23:51:44 INFO mapreduce.Job: The url to track the job: http://big01:8088/proxy/application_1416242552451_0009/
14/11/11 23:51:44 INFO mapreduce.Job: Running job: job_1416242552451_0009
14/11/11 23:51:53 INFO mapreduce.Job: Job job_1416242552451_0009 running in uber mode : false
14/11/11 23:51:53 INFO mapreduce.Job: map 0% reduce 0%
14/11/11 23:52:05 INFO mapreduce.Job: map 100% reduce 0%
14/11/11 23:52:13 INFO mapreduce.Job: map 100% reduce 100%
14/11/11 23:52:13 INFO mapreduce.Job: Job job_1416242552451_0009 completed successfully
14/11/11 23:52:13 INFO mapreduce.Job: Counters: 49
File System Counters
…..
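The part files can be read straight off HDFS with the standard hdfs dfs CLI (a sketch; wc_alice is the output directory from the run above):
[hadoop@big01 ~]$ hdfs dfs -ls wc_alice
[hadoop@big01 ~]$ hdfs dfs -cat wc_alice/part-00000 | tail -25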
Python ♥ Hadoop
Hadoop Streaming: checking the results
Opening part-00000 shows:
….
you'd	8
you'll	4
you're	15
you've	5
you,	25
you,'	6
you--all	1
you--are	1
you.	1
you.'	1
you:	1
you?	2
you?'	7
young	5
your	62
yours	1
yours."'	1
yourself	5
yourself!'	1
yourself,	1
yourself,'	1
yourself.'	2
youth,	3
youth,'	3
zigzag,	1
Python ♥ Hadoop
Hadoop Streaming example: refining WordCount
Modified mapper.py, using a regular expression to strip special characters:
#!/home/hadoop/python/py
import sys
import re

for line in sys.stdin:
    line = line.strip()
    line = re.sub('[=.#/?:$\'!,"}]', '', line)   # remove punctuation/special characters
    words = line.split()
    for word in words:
        print '{0}\t{1}'.format(word, 1)

[hadoop@big01 ~]$ hadoop jar hadoop-streaming-2.5.1.jar \
-input alice -output wc_alice2 \
-mapper mapper.py -reducer reducer.py \
-file mapper.py -file reducer.py
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-hadoop/hadoop-unjar2252553335408523254/] [] /tmp/streamjob911479792088347698.jar tmpDir=null
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
14/11/11 23:51:41 INFO client.RMProxy: Connecting to ResourceManager at big01/192.168.56.101:8040
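A quick local sanity check of the modified mapper (illustrative input):
[hadoop@big01 ~]$ echo "you've said: 'Hello BC.'" | ./mapper.py
youve	1
said	1
Hello	1
BC	1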
Python ♥ Hadoop
Hadoop Streaming: checking the results
Opening part-00000 under wc_alice2 shows:
…..
ye;	1
year	2
years	2
yelled	1
yelp	1
yer	4
yesterday	3
yet	18
yet--Oh	1
yet--and	1
yet--its	1
you	357
you)	1
you--all	1
you--are	1
youd	8
youll	4
young	5
your	62
youre	15
yours	2
yourself	10
youth	6
youve	5
zigzag	1
The end.
