Hadoop Introduction II
K-means && Python && Dumbo
Outline

• Dumbo
• K-means
• Python and Data Mining




Hadoop in Python
• Jython: Happy
• CPython / Cython:
     • Pydoop
           • Pluggable components (RecordReader, RecordWriter and Partitioner)
           • Get the job configuration, set counters and report status
           • Being CPython, it can use any native Python module
           • HDFS API
     • Hadoopy: another Cython-based wrapper
• Streaming (see the sketch below):
     • Dumbo
     • Other small MapReduce wrappers

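The Streaming route at the bottom of this list needs no special library at all: the mapper and reducer are plain programs that read stdin and write tab-separated key/value lines. A minimal word-count sketch of that style (the file names mapper.py and reducer.py are just examples), shown here for comparison with the Dumbo version later in the deck:

```python
#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write("%s\t1\n" % word)
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts per word; Streaming sorts map output by key,
# so all lines for one word arrive consecutively
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            sys.stdout.write("%s\t%d\n" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    sys.stdout.write("%s\t%d\n" % (current_word, current_count))
```
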
Hadoop in Python




Hadoop in Python Extension

Integration with Pipes (C++) + Integration with libhdfs (C)

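Pydoop is the framework that takes this route: Hadoop Pipes for the MapReduce side and libhdfs for file access. As an illustration of the HDFS API bullet above, a minimal sketch using pydoop.hdfs; the paths are hypothetical and error handling is omitted:

```python
# Minimal sketch of Pydoop's HDFS API (pydoop.hdfs wraps libhdfs).
import pydoop.hdfs as hdfs

print(hdfs.ls("/user/someuser/input"))        # list a directory
with hdfs.open("/user/someuser/input/data.txt") as f:
    data = f.read()                           # read a whole file from HDFS
print("read %d bytes" % len(data))
```
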
Dumbo
•   Dumbo is a project that allows you to easily write and
    run Hadoop programs in Python. More generally, Dumbo can be
    considered a convenient Python API for writing MapReduce
    programs.
•   Advantages:
     • Easy: Dumbo strives to be as Pythonic as possible
     • Efficient: Dumbo programs communicate with Hadoop in a very
       efficient way by relying on typed bytes, a nifty serialisation
       mechanism that was specifically added to Hadoop with Dumbo
       in mind.
     • Flexible: we can extend it
     • Mature

Dumbo: Review WordCount




Dumbo – Word Count




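The original slide shows the word-count program as a screenshot. For the record, a minimal sketch of the classic Dumbo word count, using plain mapper/reducer functions and dumbo.run:

```python
# wordcount.py -- the classic Dumbo word count (minimal sketch)
def mapper(key, value):
    # key: byte offset of the line, value: the line itself
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values: all counts emitted for this word
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    # summing is associative, so the reducer can double as the combiner
    dumbo.run(mapper, reducer, combiner=reducer)
```
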
Dumbo IP counts




Dumbo IP counts




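The two slides above show the IP-count job as screenshots. A minimal sketch of what such a job could look like in Dumbo, assuming an access-log format whose first whitespace-separated field is the client IP:

```python
# ipcount.py -- count requests per client IP (sketch; assumes the IP is the
# first whitespace-separated field of every log line)
def mapper(key, value):
    fields = value.split()
    if fields:
        yield fields[0], 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
```
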
K-means in Map-Reduce
•   Normal K-means:
     •     Inputs: a set of n d-dimensional points and the number of desired clusters, k.

     •     Step 1: Randomly choose k points from the n input points as the initial centers.
     •     Step 2: Compute the distance from every point to each of the k centers and assign the point to the closest one.
     •     Step 3: Using this assignment of points to cluster centers, recalculate each cluster center as the centroid of its member points.
     •     Step 4: Iterate this process until convergence is reached.
     •     Final: points are reassigned to centers, and centroids recalculated, until the k cluster centers shift by less than some delta value.

•   k-means is a surprisingly parallelizable algorithm (a single-machine sketch of one iteration follows below).

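For reference, a minimal single-machine sketch of one assign-and-recompute iteration (Steps 2 and 3 above); the MapReduce formulation on the next slides splits exactly these two steps between the map and reduce phases:

```python
import math

def closest_center(point, centers):
    # index of the center nearest to point (Euclidean distance)
    best, best_dist = 0, float("inf")
    for i, center in enumerate(centers):
        d = math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))
        if d < best_dist:
            best, best_dist = i, d
    return best

def kmeans_iteration(points, centers):
    # one assign + recompute step; returns the new centers
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in points:
        i = closest_center(point, centers)
        counts[i] += 1
        sums[i] = [s + p for s, p in zip(sums[i], point)]
    new_centers = []
    for i in range(len(centers)):
        if counts[i]:
            new_centers.append([s / counts[i] for s in sums[i]])
        else:
            new_centers.append(centers[i])  # keep an empty cluster's old center
    return new_centers
```
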
K-means in Map-Reduce
•   Key-points:
     • we want to come up with a scheme where we can operate on
       each point in the data set independently.
     • only a small amount of data needs to be shared (the cluster centers).
     • when we partition points among MapReduce nodes, we
       also distribute a copy of the cluster centers. This results
       in a small amount of data duplication, but it is minimal.
       In this way each of the points can be operated on
       independently.

Hadoop Phase
• Map:
  • In : points in the data set
  • Out : a (ClusterID, Point) pair for each point,
    where ClusterID is the integer ID of the
    cluster whose center is closest to the point
    (see the mapper sketch below).

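A minimal sketch of such a mapper in Dumbo, written as a callable class so the shared cluster centers are loaded once per task. The file name centers.txt and its format (one comma-separated point per line) are assumptions of this sketch; the file would be shipped to every node, for example with Streaming's -file option:

```python
# kmeans map phase (sketch): assign each input point to its closest center
import math

class Mapper(object):
    def __init__(self):
        # centers.txt: hypothetical local file shipped to every task,
        # one comma-separated point per line
        self.centers = [[float(x) for x in line.split(",")]
                        for line in open("centers.txt")]

    def __call__(self, key, value):
        point = [float(x) for x in value.split(",")]
        dists = [math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))
                 for center in self.centers]
        yield dists.index(min(dists)), point
```
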
Hadoop Phase
• Reduce Phase:
   • In : (ClusterID, Point) pairs
   • Operation:
      • the outputs of the map phase are grouped by
        ClusterID.
      • for each ClusterID, the centroid of the points
        associated with that ClusterID is calculated.
   • Out : (ClusterID, Centroid) pairs, which represent the
     newly calculated cluster centers (see the reducer sketch below).

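A matching reducer sketch: it averages, coordinate by coordinate, all points grouped under one ClusterID, and wires the job together with dumbo.run (assuming the Mapper class from the previous sketch lives in the same file):

```python
# kmeans reduce phase (sketch): recompute each cluster's centroid
def reducer(cluster_id, points):
    total, count = None, 0
    for point in points:
        total = list(point) if total is None else [t + p for t, p in zip(total, point)]
        count += 1
    yield cluster_id, [t / count for t in total]

if __name__ == "__main__":
    import dumbo
    dumbo.run(Mapper, reducer)  # Mapper: the class from the map-phase sketch
```
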
External Program
•   Each iteration of the algorithm is structured as a single
    MapReduce job.

•   After each iteration, our driver reads the output and determines
    whether convergence has been reached by calculating how far the
    cluster centers have moved; if not, it runs another MapReduce job
    (see the convergence-check sketch below).

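A minimal sketch of that convergence check, which the driver would call between jobs with some small delta threshold:

```python
import math

def converged(old_centers, new_centers, delta=1e-4):
    # True when every center moved less than delta since the last iteration
    for old, new in zip(old_centers, new_centers):
        shift = math.sqrt(sum((o - n) ** 2 for o, n in zip(old, new)))
        if shift >= delta:
            return False
    return True
```

If the centers are still moving, the driver writes out the new centers file and launches the next Dumbo job; the exact command-line options depend on the setup (see the usage examples near the end of the deck).
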
Write in Dumbo




Write in Dumbo




Write in Dumbo




Results




Next
• Write n-times iteration wrapper
• Optimize K-means
• Result Visualization with Python




Optimize
•   Right now the mapping is one to one: for every point read, our
    mapper emits a single (ClusterID, Point) record that then has to
    be sorted and transferred to a reducer.
•   Instead, partial centroids can be computed on the map nodes
    themselves (mapper-local aggregation!), and the reducer then takes
    a weighted average of those partial centroids.

• We can use a Combiner! (see the sketch below)

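A minimal sketch of that idea: every intermediate record becomes a (partial sum, count) pair, the combiner merges pairs on the map side, and the reducer finishes the weighted average. The record format is an assumption of this sketch, and keeping the same format for combiner input and output means the job stays correct whether or not the combiner actually runs:

```python
# combiner-friendly k-means (sketch): records are (partial_sum, count) pairs

def _merge(partials):
    # add up (partial_sum, count) pairs
    total, count = None, 0
    for partial_sum, n in partials:
        total = list(partial_sum) if total is None else [t + p for t, p in zip(total, partial_sum)]
        count += n
    return total, count

def combiner(cluster_id, partials):
    yield cluster_id, _merge(partials)

def reducer(cluster_id, partials):
    total, count = _merge(partials)
    yield cluster_id, [t / count for t in total]

# The mapper from the earlier sketch would now end with
#     yield cluster_id, (point, 1)
# and the job would be wired up as dumbo.run(Mapper, reducer, combiner=combiner)
```
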
Dumbo Usage
•   Very easy to use
•   You can write your own code for Dumbo
•   Easy to debug
•   Simple command line (examples below)

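For reference, typical command lines (the paths are hypothetical): dumbo start runs a program, locally when no -hadoop option is given and on the cluster otherwise, and dumbo cat prints the typed-bytes output in readable form:

    dumbo start wordcount.py -input input.txt -output wc_out
    dumbo start wordcount.py -hadoop /usr/lib/hadoop -input input.txt -output wc_out
    dumbo cat wc_out/part* -hadoop /usr/lib/hadoop
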
Python and Data Mining
• Books:
   • Scientific Computing with Python (用 Python 进行科学计算)
   • Programming Collective Intelligence (集体智慧编程)
   • Mining the Social Web (挖掘社交网络)
   • Natural Language Processing with Python (用 Python 进行自然语言处理)
   • Think Stats: Python and Data Analysis (Think Stats Python 与数据分析)

Python and Data Mining
• Tools
   • NumPy
   • SciPy
   • Orange (association rule mining with Orange)

Thanks



