Hadoop Introduction II
K-means && Python && Dumbo
Outline

• Dumbo
• K-means
• Python and Data Mining




Hadoop in Python
• Jython: Happy
• CPython / Cython:
     • Pydoop
           • Pluggable components (RecordReader, RecordWriter and Partitioner)
           • Get the job configuration, set counters and report status
           • Being CPython, it can use any native Python module
           • HDFS API
     • Hadoopy: another Cython-based wrapper
• Streaming (see the sketch below):
     • Dumbo
     • Other small MapReduce wrappers

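The Streaming route at the bottom of this list needs no special library at all: the mapper and reducer are plain programs that read stdin and write tab-separated key/value lines. A minimal word-count sketch of that style (the file names mapper.py and reducer.py are just examples), shown here for comparison with the Dumbo version later in the deck:

```python
#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write("%s\t1\n" % word)
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts per word; Streaming sorts map output by key,
# so all lines for one word arrive consecutively
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            sys.stdout.write("%s\t%d\n" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    sys.stdout.write("%s\t%d\n" % (current_word, current_count))
```
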
Hadoop in Python




Hadoop in Python Extension

Integration with Pipes (C++) + Integration with libhdfs (C)

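Pydoop is the framework that takes this route: Hadoop Pipes for the MapReduce side and libhdfs for file access. As an illustration of the HDFS API bullet above, a minimal sketch using pydoop.hdfs; the paths are hypothetical and error handling is omitted:

```python
# Minimal sketch of Pydoop's HDFS API (pydoop.hdfs wraps libhdfs).
import pydoop.hdfs as hdfs

print(hdfs.ls("/user/someuser/input"))        # list a directory
with hdfs.open("/user/someuser/input/data.txt") as f:
    data = f.read()                           # read a whole file from HDFS
print("read %d bytes" % len(data))
```
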
Dumbo
•   Dumbo is a project that allows you to easily write and
    run Hadoop programs in Python. More generally, Dumbo can be
    considered a convenient Python API for writing MapReduce
    programs.
•   Advantages:
     • Easy: Dumbo strives to be as Pythonic as possible
     • Efficient: Dumbo programs communicate with Hadoop in a very
       efficient way by relying on typed bytes, a nifty serialisation
       mechanism that was specifically added to Hadoop with Dumbo
       in mind.
     • Flexible: we can extend it
     • Mature

Dumbo: Review WordCount




Dumbo – Word Count




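The original slide shows the word-count program as a screenshot. For the record, a minimal sketch of the classic Dumbo word count, using plain mapper/reducer functions and dumbo.run:

```python
# wordcount.py -- the classic Dumbo word count (minimal sketch)
def mapper(key, value):
    # key: byte offset of the line, value: the line itself
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values: all counts emitted for this word
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    # summing is associative, so the reducer can double as the combiner
    dumbo.run(mapper, reducer, combiner=reducer)
```
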
Dumbo IP counts




Dumbo IP counts




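The two slides above show the IP-count job as screenshots. A minimal sketch of what such a job could look like in Dumbo, assuming an access-log format whose first whitespace-separated field is the client IP:

```python
# ipcount.py -- count requests per client IP (sketch; assumes the IP is the
# first whitespace-separated field of every log line)
def mapper(key, value):
    fields = value.split()
    if fields:
        yield fields[0], 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
```
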
K-means in Map-Reduce
•   Normal K-means:
     •     Inputs: a set of n d-dimensional points and the number of desired clusters, k.

     •     Step 1: Randomly choose k points from the n input points as the initial centers.
     •     Step 2: Compute the distance from every point to each of the k centers and assign the point to the closest one.
     •     Step 3: Using this assignment of points to cluster centers, recalculate each cluster center as the centroid of its member points.
     •     Step 4: Iterate this process until convergence is reached.
     •     Final: points are reassigned to centers, and centroids recalculated, until the k cluster centers shift by less than some delta value.

•   k-means is a surprisingly parallelizable algorithm (a single-machine sketch of one iteration follows below).

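For reference, a minimal single-machine sketch of one assign-and-recompute iteration (Steps 2 and 3 above); the MapReduce formulation on the next slides splits exactly these two steps between the map and reduce phases:

```python
import math

def closest_center(point, centers):
    # index of the center nearest to point (Euclidean distance)
    best, best_dist = 0, float("inf")
    for i, center in enumerate(centers):
        d = math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))
        if d < best_dist:
            best, best_dist = i, d
    return best

def kmeans_iteration(points, centers):
    # one assign + recompute step; returns the new centers
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in points:
        i = closest_center(point, centers)
        counts[i] += 1
        sums[i] = [s + p for s, p in zip(sums[i], point)]
    new_centers = []
    for i in range(len(centers)):
        if counts[i]:
            new_centers.append([s / counts[i] for s in sums[i]])
        else:
            new_centers.append(centers[i])  # keep an empty cluster's old center
    return new_centers
```
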
K-means in Map-Reduce
•   Key-points:
     • we want to come up with a scheme where we can operate on
       each point in the data set independently.
     • only a small amount of data needs to be shared (the cluster centers).
     • when we partition points among MapReduce nodes, we
       also distribute a copy of the cluster centers. This results
       in a small amount of data duplication, but it is minimal.
       In this way each of the points can be operated on
       independently.

Hadoop Phase
• Map:
  • In : points in the data set
  • Out : a (ClusterID, Point) pair for each point,
    where ClusterID is the integer ID of the
    cluster whose center is closest to the point
    (see the mapper sketch below).

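A minimal sketch of such a mapper in Dumbo, written as a callable class so the shared cluster centers are loaded once per task. The file name centers.txt and its format (one comma-separated point per line) are assumptions of this sketch; the file would be shipped to every node, for example with Streaming's -file option:

```python
# kmeans map phase (sketch): assign each input point to its closest center
import math

class Mapper(object):
    def __init__(self):
        # centers.txt: hypothetical local file shipped to every task,
        # one comma-separated point per line
        self.centers = [[float(x) for x in line.split(",")]
                        for line in open("centers.txt")]

    def __call__(self, key, value):
        point = [float(x) for x in value.split(",")]
        dists = [math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))
                 for center in self.centers]
        yield dists.index(min(dists)), point
```
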
Hadoop Phase
• Reduce Phase:
   • In : (ClusterID, Point) pairs
   • Operation:
      • the outputs of the map phase are grouped by
        ClusterID.
      • for each ClusterID, the centroid of the points
        associated with that ClusterID is calculated.
   • Out : (ClusterID, Centroid) pairs, which represent the
     newly calculated cluster centers (see the reducer sketch below).

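A matching reducer sketch: it averages, coordinate by coordinate, all points grouped under one ClusterID, and wires the job together with dumbo.run (assuming the Mapper class from the previous sketch lives in the same file):

```python
# kmeans reduce phase (sketch): recompute each cluster's centroid
def reducer(cluster_id, points):
    total, count = None, 0
    for point in points:
        total = list(point) if total is None else [t + p for t, p in zip(total, point)]
        count += 1
    yield cluster_id, [t / count for t in total]

if __name__ == "__main__":
    import dumbo
    dumbo.run(Mapper, reducer)  # Mapper: the class from the map-phase sketch
```
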
External Program
•   Each iteration of the algorithm is structured as a single
    MapReduce job.

•   After each iteration, our driver reads the output and determines
    whether convergence has been reached by calculating how far the
    cluster centers have moved; if not, it runs another MapReduce job
    (see the convergence-check sketch below).

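A minimal sketch of that convergence check, which the driver would call between jobs with some small delta threshold:

```python
import math

def converged(old_centers, new_centers, delta=1e-4):
    # True when every center moved less than delta since the last iteration
    for old, new in zip(old_centers, new_centers):
        shift = math.sqrt(sum((o - n) ** 2 for o, n in zip(old, new)))
        if shift >= delta:
            return False
    return True
```

If the centers are still moving, the driver writes out the new centers file and launches the next Dumbo job; the exact command-line options depend on the setup (see the usage examples near the end of the deck).
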
Write in Dumbo




Write in Dumbo




Write in Dumbo




Results




Next
• Write n-times iteration wrapper
• Optimize K-means
• Result Visualization with Python




Optimize
•   Right now the mapping is one to one: for every point read, our
    mapper emits a single (ClusterID, Point) record that then has to
    be sorted and transferred to a reducer.
•   Instead, partial centroids can be computed on the map nodes
    themselves (mapper-local aggregation!), and the reducer then takes
    a weighted average of those partial centroids.

• We can use a Combiner! (see the sketch below)

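A minimal sketch of that idea: every intermediate record becomes a (partial sum, count) pair, the combiner merges pairs on the map side, and the reducer finishes the weighted average. The record format is an assumption of this sketch, and keeping the same format for combiner input and output means the job stays correct whether or not the combiner actually runs:

```python
# combiner-friendly k-means (sketch): records are (partial_sum, count) pairs

def _merge(partials):
    # add up (partial_sum, count) pairs
    total, count = None, 0
    for partial_sum, n in partials:
        total = list(partial_sum) if total is None else [t + p for t, p in zip(total, partial_sum)]
        count += n
    return total, count

def combiner(cluster_id, partials):
    yield cluster_id, _merge(partials)

def reducer(cluster_id, partials):
    total, count = _merge(partials)
    yield cluster_id, [t / count for t in total]

# The mapper from the earlier sketch would now end with
#     yield cluster_id, (point, 1)
# and the job would be wired up as dumbo.run(Mapper, reducer, combiner=combiner)
```
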
Dumbo Usage
•   Very easy to use
•   You can write your own code for Dumbo
•   Easy to debug
•   Simple command line (examples below)

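For reference, typical command lines (the paths are hypothetical): dumbo start runs a program, locally when no -hadoop option is given and on the cluster otherwise, and dumbo cat prints the typed-bytes output in readable form:

    dumbo start wordcount.py -input input.txt -output wc_out
    dumbo start wordcount.py -hadoop /usr/lib/hadoop -input input.txt -output wc_out
    dumbo cat wc_out/part* -hadoop /usr/lib/hadoop
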
Python and Data Mining
• Books:
   • Scientific Computing with Python (用 Python 进行科学计算)
   • Programming Collective Intelligence (集体智慧编程)
   • Mining the Social Web (挖掘社交网络)
   • Natural Language Processing with Python (用 Python 进行自然语言处理)
   • Think Stats: Python and Data Analysis (Think Stats Python 与数据分析)

Python and Data Mining
• Tools
   • NumPy
   • SciPy
   • Orange (association rule mining with Orange)

Thanks



