Quick Wikipedia Mining using Elastic Map Reduce

This document summarizes Amazon's Elastic MapReduce service. Elastic MapReduce allows users to run Hadoop/MapReduce jobs on Amazon Web Services infrastructure. It launches Hadoop clusters across Amazon EC2 instances and stores data in Amazon S3. The document provides step-by-step examples of using Elastic MapReduce to analyze Japanese Wikipedia data stored in S3, including counting article links, analyzing publication dates over time, and calculating PageRank scores for articles. It concludes by discussing potential use cases for analyzing larger datasets like blog posts.

http://ohkura.com

• 2008 1
•
• blog
• 2007

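• Hadoop

• Hadoop Streaming
o Mapper and Reducer are written as standalone scripts
o Any language is OK, for example Python
o Records are exchanged over standard IO (stdin/stdout)
• Amazon AWS (S3, EC2)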

Elastic MapReduce

• Amazon's cloud computing service
• Runs MapReduce jobs on Hadoop

• Master and Worker nodes run on EC2
• Input and output data are stored in S3

• http://aws.amazon.com/elasticmapreduce/

Step0:

• AWS
• Elastic MapReduce 1
• S3 client
o Ruby s3sync: http://s3sync.net/wiki
• elastic-mapreduce command-line client
o Amazon's Ruby client
o http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264

Step1:

• Download the Wikipedia dump
o wget "http://download.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2"
o bunzip2 jawiki-latest-pages-articles.xml.bz2
• Split the dump
o One file per 20000 <page> elements
o Lets Hadoop Streaming spread the files across workers
• Upload to S3
o ohkura-wikipedia:jawiki/articles/part-00000, 00001, ...
o EC2

Step2:

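Mapper
    import re
    import sys

    # Capture the target of each [[link]], stopping at "]", "|" or "#".
    link_pat = re.compile(r"\[\[([^\]|#]*?)[\]|#]")

    for line in sys.stdin:
        for link in link_pat.findall(line):
            # Skip namespaced links such as Category: or File:.
            if ":" not in link:
                print("LongValueSum:%s\t1" % link)

Reducer
aggregate (Hadoop's built-in Reducer)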
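
To try the link-count job without a cluster, the behaviour of the aggregate reducer for LongValueSum records can be imitated locally: it sums the integer values emitted for each key. This is a rough sketch for local testing only, not Hadoop's implementation; the function names map_links and aggregate are hypothetical.

```python
import re
from collections import defaultdict

LINK_PAT = re.compile(r"\[\[([^\]|#]*?)[\]|#]")


def map_links(lines):
    """Emit one 'LongValueSum:<target>\\t1' record per non-namespaced wiki link."""
    for line in lines:
        for link in LINK_PAT.findall(line):
            if ":" not in link:
                yield "LongValueSum:%s\t1" % link


def aggregate(records):
    """Sum the values of LongValueSum records per key, as a local stand-in."""
    totals = defaultdict(int)
    for record in records:
        key, value = record.rstrip("\n").split("\t", 1)
        kind, name = key.split(":", 1)
        if kind != "LongValueSum":
            raise ValueError("unsupported aggregator: %s" % kind)
        totals[name] += int(value)
    return dict(totals)


sample = [
    "[[Tokyo]] and [[Osaka|the city]] and [[Tokyo#History]]",
    "[[Category:Cities]] [[Tokyo]]",
]
link_counts = aggregate(map_links(sample))
```

With the two sample lines, Tokyo is counted three times, Osaka once, and the Category: link is skipped.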

2007 92008
2006 88376
2008 82821
2005 77964
76111
2000 68078
2004 64921
63660
58081
2001 57419
2003 57130

Step3:

Mapper
    import datetime
    import re
    import sys

    timestamp_pat = re.compile("<timestamp>(.+?)</timestamp>")
    # ArticleExtractor is a helper that is not shown in the slides.
    articles = ArticleExtractor(sys.stdin)
    for article in articles:
        for line in article:
            m = timestamp_pat.search(line)
            if m:
                dt = m.group(1)
                # e.g. 2009-10-08T05:55:49Z
                t = datetime.datetime.strptime(dt, "%Y-%m-%dT%H:%M:%SZ")
                print("LongValueSum:%s\t1" % t.year)

Reducer
aggregate (Hadoop's built-in Reducer)
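
The ArticleExtractor helper used by the mapper is not included in the slides. One possible shape for it, assuming it yields the lines of one <page> element at a time and that the page tags sit on their own lines, is sketched below. The years function is a hypothetical wrapper added here only to make the sketch easy to test.

```python
import re

TIMESTAMP_PAT = re.compile(r"<timestamp>(.+?)</timestamp>")


class ArticleExtractor(object):
    """Iterate over a line stream, yielding one list of lines per <page> element."""

    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        article = None
        for line in self.stream:
            if "<page>" in line:
                article = []
            if article is not None:
                article.append(line)
                if "</page>" in line:
                    yield article
                    article = None


def years(stream):
    """Yield the year of every <timestamp> found inside a <page> element."""
    for article in ArticleExtractor(stream):
        for line in article:
            m = TIMESTAMP_PAT.search(line)
            if m:
                # Timestamps look like 2009-10-08T05:55:49Z.
                yield int(m.group(1)[:4])
```

Because each input file holds only whole pages after the split in Step1, a line-based extractor like this never sees a page cut in half.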

JSON Wizard

$ elastic-mapreduce --create --num-instances 4 \
    --instance-type m1.small \
    --json count-year-jobflow.json
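
The contents of count-year-jobflow.json are not shown in the slides. As a rough sketch, a streaming step description could be generated as follows. The field names follow the Elastic MapReduce step structure (Name, ActionOnFailure, HadoopJarStep with Jar and Args). Everything else is an assumption for illustration: the streaming jar path, the step name, the s3n:// form of the input location, and the S3 locations of the mapper script and the output.

```python
import json

# Assumed location of the streaming jar on the cluster; check your Hadoop version.
STREAMING_JAR = "/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar"


def streaming_step(name, input_uri, output_uri, mapper, reducer="aggregate",
                   jar=STREAMING_JAR):
    """Build one Hadoop Streaming step as a plain dictionary."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": jar,
            "Args": [
                "-input", input_uri,
                "-output", output_uri,
                "-mapper", mapper,
                "-reducer", reducer,
            ],
        },
    }


steps = [
    streaming_step(
        name="count-year",  # hypothetical step name
        input_uri="s3n://ohkura-wikipedia/jawiki/articles/",
        output_uri="s3n://ohkura-wikipedia/jawiki/count-year/",  # hypothetical
        mapper="s3n://ohkura-wikipedia/scripts/count_year_mapper.py",  # hypothetical
    )
]
jobflow_json = json.dumps(steps, indent=2)
```

Writing jobflow_json to count-year-jobflow.json gives a file of the kind passed to the --json option above.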

2002: 1
2003: 4107
2004: 19630
2005: 44766
2006: 103018
2007: 151382
2008: 217252
2009: 683079

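PageRank MapReduce

• Step1
o Input: the Wikipedia dump
o M: emit the outgoing links of each article

o R: Identity
• Step2
o M: divide each article's score by its number of outgoing links and send one share to every link target
o R: sum the shares each article receives
• Repeat Step2 10 times
o Intermediate results are kept in HDFS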
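
The iteration can be tried on a toy graph before running it on a cluster. The sketch below is a simplified, in-memory imitation of the map and reduce phases, not the code used in the talk. It assumes every article starts with a score of 1 and uses no damping factor, since the slides do not mention one. All function names are hypothetical.

```python
from collections import defaultdict


def pagerank_map(page, score, links):
    """Pass the link list along and send an equal share of the score to each target."""
    yield page, ("links", links)
    if links:
        share = score / float(len(links))
        for target in links:
            yield target, ("score", share)


def pagerank_reduce(page, values):
    """Recover the link list and sum the shares received by this page."""
    links, score = [], 0.0
    for kind, value in values:
        if kind == "links":
            links = value
        else:
            score += value
    return page, score, links


def pagerank(graph, iterations=10):
    """Run the map/reduce pair repeatedly; graph maps page -> list of link targets."""
    state = dict((page, (1.0, links)) for page, links in graph.items())
    for _ in range(iterations):
        grouped = defaultdict(list)  # stands in for the shuffle phase
        for page, (score, links) in state.items():
            for key, value in pagerank_map(page, score, links):
                grouped[key].append(value)
        state = {}
        for page, values in grouped.items():
            page, score, links = pagerank_reduce(page, values)
            state[page] = (score, links)
    return dict((page, score) for page, (score, _) in state.items())


ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this toy graph B ends up with the lowest score, because it receives only half of A's score on each round. On a cluster, each round would read the previous round's output from HDFS instead of keeping state in memory.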

1803.63759701
1568.19638967
1029.67219551 2006
991.646816399 2007
930.652982148 2005
885.892964893
866.358526418 2008
798.668799871 2004
779.443042817
.
.

1803.63759701
1568.19638967
885.892964893
779.443042817
755.488775376
728.882441149
682.257070166
623.000478660
580.347125978
569.411885196
...

779.443042817
728.882441149
682.257070166
580.347125978
522.618667481
495.986145911
452.646283200
444.036370473
443.043952427
441.486349135
392.427995635

=100

0.00682557409174 785 ...
0.00682555111099 JR 700
0.00682544488688
0.00682540998664
0.00682540375114
0.00682528989653 ( )
0.00682524117061
0.00682521978481 ( )
0.00682521236658
0.00682517459662
0.00682512260620

• Wikipedia (JA)
o 1,900,000 articles
o 4.2GB
o 20
o ~30
• Blog from blogeye.jp
o 200,000,000 articles
o 800GB
o 80
o 70

•
o
o Master
•
o
o
o
o
o 1 1 0.1 1 100 1000
