Elastic MapReduce
   Wikipedia
http://ohkura.com

• 2008                  1
•
•              blog
•              2007
Python




     Wikipedia       (   120   )
MapReduce
• Hadoop
  o

• Hadoop Streaming
  o Mapper Reducer


  o                  OK   Python
  o            IO
• Amazon AWS (S3, EC2)
Elastic MapReduce

• Amazon          Cloud Computing
• MapReduce                    Hadoop


• Master                       Worker EC2
• S3

• http://aws.amazon.com/elasticmapreduce/
Step0:

• AWS
• Elastic MapReduce                                    1
• S3
  o   Ruby                     s3sync
      http://s3sync.net/wiki
• elastic-mapreduce
   o Amazon                       Ruby
  o   http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264
Step1:

• Wikipedia
    o wget "http://download.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2"
    o bunzip2 jawiki-latest-pages-articles.xml.bz2
•
    o   <page>      20000
    o   Hadoop Streaming        worker
• S3
    o   ohkura-wikipedia:jawiki/articles/part-00000, 00001, ...
    o   EC2
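
The split into parts of 20000 `<page>` elements could be done along these lines. This is a minimal sketch, not the author's script: it assumes every `</page>` closing tag sits on its own line in the dump, and the names `split_pages` and `write_parts` are illustrative. The `part-%05d` file naming follows the S3 listing above.

```python
import io


def split_pages(lines, pages_per_part=20000):
    """Yield lists of lines, each holding at most pages_per_part <page> elements."""
    part, pages = [], 0
    for line in lines:
        part.append(line)
        if "</page>" in line:
            pages += 1
            if pages == pages_per_part:
                yield part
                part, pages = [], 0
    if part:
        # Remaining pages (and any trailing lines) form the last part.
        yield part


def write_parts(dump_path, out_dir, pages_per_part=20000):
    """Write part-00000, part-00001, ... into out_dir (must already exist)."""
    with io.open(dump_path, encoding="utf-8") as dump:
        for i, part in enumerate(split_pages(dump, pages_per_part)):
            name = "%s/part-%05d" % (out_dir, i)
            with io.open(name, "w", encoding="utf-8") as out:
                out.writelines(part)
```

Keeping whole pages together in one file means a streaming worker never sees an article cut in half at a file boundary.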
Step2:

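Mapper
 import re
 import sys

 # Capture the target of each [[...]] wiki link, stopping at "]", "|" or "#".
 link_pat = re.compile(r"\[\[([^\]|#]*?)[\]|#]")

 for line in sys.stdin:
     for link in link_pat.findall(line):
         # Skip namespaced links such as "Category:...".
         if ":" not in link:
             print("LongValueSum:%s\t1" % link)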

Reducer
  aggregate (Hadoop's built-in Reducer)
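
For testing on a laptop before starting a cluster, the effect of the `aggregate` reducer on `LongValueSum:` records can be approximated in plain Python. This sketch only imitates the behaviour (sum the values per key and drop the prefix); it is not Hadoop's implementation, and `map_links` and `aggregate_long_value_sum` are hypothetical names.

```python
import re
from collections import defaultdict

LINK_PAT = re.compile(r"\[\[([^\]|#]*?)[\]|#]")
PREFIX = "LongValueSum:"


def map_links(lines):
    """Emit one 'LongValueSum:<title><TAB>1' record per non-namespaced link."""
    for line in lines:
        for link in LINK_PAT.findall(line):
            if ":" not in link:
                yield "%s%s\t1" % (PREFIX, link)


def aggregate_long_value_sum(records):
    """Sum the integer values of LongValueSum records, grouped by key."""
    totals = defaultdict(int)
    for record in records:
        key, value = record.rstrip("\n").split("\t", 1)
        if key.startswith(PREFIX):
            totals[key[len(PREFIX):]] += int(value)
    return dict(totals)


sample = [
    "[[2007]] and [[2008|next year]]",
    "[[2007#Events]] [[Category:Years]]",
]
counts = aggregate_long_value_sum(map_links(sample))
# Print the most frequently linked titles first.
for title, count in sorted(counts.items(), key=lambda kv: -kv[1]):
    print("%s\t%d" % (title, count))
```

The sample shows the three cases the pattern handles: a plain link, a piped link, and a link with a section anchor. The namespaced `Category:` link is filtered out by the mapper.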
2007     92008
2006     88376
2008     82821
2005     77964
       76111
2000     68078
2004     64921
                 63660
        58081
2001     57419
2003     57130
Step3:

Mapper
 import datetime
 import re
 import sys

 timestamp_pat = re.compile("<timestamp>(.+?)</timestamp>")
 # ArticleExtractor is the author's own helper (not shown in the slides).
 articles = ArticleExtractor(sys.stdin)
 for article in articles:
     for line in article:
         m = timestamp_pat.search(line)
         if m:
             dt = m.group(1)
             # e.g. 2009-10-08T05:55:49Z
             t = datetime.datetime.strptime(dt, "%Y-%m-%dT%H:%M:%SZ")
             print("LongValueSum:%s\t1" % t.year)


Reducer
  aggregate (Hadoop's built-in Reducer)
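
`ArticleExtractor` itself is not shown in the slides. One way such a helper might look, assuming it groups the input lines into one list per `<page>` element, is sketched below. The class body and the function `last_edit_years` are guesses for illustration, not the author's code.

```python
import datetime
import re


class ArticleExtractor(object):
    """Hypothetical helper: iterate over articles, each a list of lines."""

    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        article = None
        for line in self.stream:
            if "<page>" in line:
                article = []              # start collecting a new article
            if article is not None:
                article.append(line)
                if "</page>" in line:
                    yield article
                    article = None        # ignore lines between articles


TIMESTAMP_PAT = re.compile("<timestamp>(.+?)</timestamp>")


def last_edit_years(stream):
    """Return the year of every <timestamp> found inside a <page> element."""
    years = []
    for article in ArticleExtractor(stream):
        for line in article:
            m = TIMESTAMP_PAT.search(line)
            if m:
                t = datetime.datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%SZ")
                years.append(t.year)
    return years


sample = [
    "<mediawiki>\n",
    "<page>\n",
    "<timestamp>2009-10-08T05:55:49Z</timestamp>\n",
    "</page>\n",
    "<timestamp>2001-01-01T00:00:00Z</timestamp>\n",  # outside any page: ignored
    "<page>\n",
    "<timestamp>2008-03-01T12:00:00Z</timestamp>\n",
    "</page>\n",
]
print(last_edit_years(sample))
```

Grouping by article is more than this particular mapper needs, but it makes the same helper reusable for per-article jobs.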
JSON                   Wizard




$ elastic-mapreduce --create --num-instances 4 \
                  --instance-type m1.small \
                  --json count-year-jobflow.json
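
The contents of count-year-jobflow.json are not shown on the slide. As a rough guide only, a step list for a Hadoop Streaming job usually carries a name, a failure action, and the jar with its arguments. The sketch below generates such a file in Python. The jar path, the output and mapper locations, and the helper name `streaming_step` are all assumptions and should be checked against the samples shipped with the client; only the input location comes from the Step1 slide.

```python
import json

# Assumed location of the streaming jar on the cluster; verify for your setup.
STREAMING_JAR = "/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar"


def streaming_step(name, input_uri, output_uri, mapper_uri, reducer="aggregate"):
    """Build one job-flow step that runs Hadoop Streaming (assumed layout)."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": STREAMING_JAR,
            "Args": [
                "-input", input_uri,
                "-output", output_uri,
                "-mapper", mapper_uri,
                "-reducer", reducer,
            ],
        },
    }


steps = [
    streaming_step(
        "count-year",
        "s3n://ohkura-wikipedia/jawiki/articles/",            # from the Step1 slide
        "s3n://example-bucket/count-year/output/",            # hypothetical
        "s3n://example-bucket/scripts/count_year_mapper.py",  # hypothetical
    )
]
jobflow_json = json.dumps(steps, indent=2)
print(jobflow_json)
```

Generating the file from code keeps the S3 locations in one place when the same job flow is reused for several mappers.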
2002: 1
2003: 4107
2004: 19630
2005: 44766
2006: 103018
2007: 151382
2008: 217252
2009: 683079
Step4: PageRank

•
    o       1        1/
    o                     /
    o
    o   2       10
PageRank                 MapReduce

• Step1
    o          Wikipedia
    o   M:


    o   R: Identity
• Step2
  o M:          /
    o   R:
•         Step2     10
    o                    HDFS
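
The two-step structure above can be imitated in memory to check the logic before running it on a cluster. This is a simplified sketch under stated assumptions, not the job from the talk: every page starts with score 1, the mapper shares a page's score equally among its link targets, and the reducer sums what each page receives. The damping factor (0.85 by default) is an assumption that the slides do not confirm, and all function names are hypothetical.

```python
from collections import defaultdict


def map_scores(page, score, links):
    """Mapper: pass the link list through and share the score among targets."""
    yield page, ("links", links)
    for target in links:
        yield target, ("score", score / float(len(links)))


def reduce_scores(page, values, damping=0.85):
    """Reducer: recover the link list and sum the received scores."""
    links, total = [], 0.0
    for kind, value in values:
        if kind == "links":
            links = value
        else:
            total += value
    return page, (1.0 - damping) + damping * total, links


def pagerank(graph, iterations=10, damping=0.85):
    """Run the map/reduce round `iterations` times over {page: [targets]}."""
    state = dict((page, (1.0, links)) for page, links in graph.items())
    for _ in range(iterations):
        grouped = defaultdict(list)       # stands in for the shuffle phase
        for page, (score, links) in state.items():
            for key, value in map_scores(page, score, links):
                grouped[key].append(value)
        state = {}
        for page, values in grouped.items():
            page, score, links = reduce_scores(page, values, damping)
            state[page] = (score, links)
    return dict((page, score) for page, (score, _) in state.items())


ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print("%s\t%.4f" % (page, score))
```

Emitting the link list alongside the scores is what lets each round's output serve directly as the next round's input, which is why the second step can simply be repeated.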
1803.63759701
1568.19638967
1029.67219551 2006
991.646816399 2007
930.652982148 2005
885.892964893
866.358526418 2008
798.668799871 2004
779.443042817
.
.
1803.63759701
1568.19638967
885.892964893
779.443042817
755.488775376
728.882441149
682.257070166
623.000478660
580.347125978
569.411885196
...
779.443042817
728.882441149
682.257070166
580.347125978
522.618667481
495.986145911
452.646283200
444.036370473
443.043952427
441.486349135
392.427995635
=100

0.00682557409174                785       ...
0.00682555111099 JR   700
0.00682544488688
0.00682540998664
0.00682540375114
0.00682528989653      (     )
0.00682524117061
0.00682521978481      (               )
0.00682521236658
0.00682517459662
0.00682512260620
• Wikipedia (JA)
  o 1,900,000 articles
  o 4.2GB
  o 20
  o   ~30
• Blog          from   blogeye.jp
  o   200,000,000 articles
  o   800GB
  o   80
  o   70
•
    o
    o                 Master
•
    o
    o
    o
    o
    o   1   1   0.1   1    100   1000
Q&A
