Hadoop, Pig, and Python
PyData NYC 2012
Overview
OF THIS SESSION




Why Python on Hadoop?
Fast Hadoop overview
Jython
Python
MrJob
Pig
(How they work, challenges, efficiency,
how to start)
Too much data
FOR ONE MACHINE




Data doubles every 18 mo
ETL / Munging




Cleanse
Format
Simple calculations
Social Graph
Predict
Detect
Genetics
Hadoop
RAPID OVERVIEW




MapReduce programming model
from Google
(Jeff Dean and Sanjay Ghemawat)
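As a rough illustration of the MapReduce programming model, the three phases can be sketched in plain Python on one machine. Nothing here is Hadoop's API; the function names (map_words, shuffle, reduce_counts, run_job) are made up for this sketch, and word count is just the customary example.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: collapse one key's values into a single result."""
    return key, sum(values)


def run_job(lines):
    """Chain the three phases in-process; a cluster spreads them over machines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["pig and python", "python on hadoop"]))
```

On a real cluster the map and reduce functions run on many machines and the framework does the grouping; only the shape of the computation is the same.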
Hadoop
RAPID OVERVIEW




Hadoop implements MapReduce (Java)
(Doug Cutting)
Incubated at Yahoo
Indexing, Spam detection, more
Hadoop
PROBLEMS




Difficult
Not much Python
Batch only (...or it was)
Hadoop
FUTURE




Yarn
MapReduce optional
Generic management + distributed
apps
Impala
Hadoop
AND PYTHON
Jython
ON HADOOP (MAP)
Jython
ON HADOOP (REDUCE; 1ST HALF)
Jython
ON HADOOP (REDUCE; 2ND HALF)
Jython
ON HADOOP
Python
ON HADOOP




Streaming




(Works with any language, not just Python)
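A minimal sketch of what a Streaming job looks like from the Python side: the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines. The word-count logic and the function names are assumptions for illustration; on a cluster each function would loop over sys.stdin and print to stdout, and Hadoop would do the sort between them.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper writes to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word. Input must arrive sorted by key, as Hadoop delivers it."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Local stand-in for the cluster: map, sort, then reduce in one process.
    sample = ["pig python\n", "python hadoop\n"]
    for out in reducer(sorted(mapper(sample))):
        print(out)
```

Because everything crosses the process boundary as text, any value that is not a string (here the count) has to be parsed again in the reducer.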
MrJob (Python)
ON HADOOP




Streaming + local / EMR / your Hadoop
MrJob (Python)
ON HADOOP




Multi-step jobs
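To show what "multi-step" means, here is a plain-Python sketch in which the output of one map/reduce pass feeds the next. It deliberately does not use MrJob's API; run_step and the other names are hypothetical, and the two steps (count words, then pick the most frequent) are only an example.

```python
from collections import defaultdict


def run_step(records, mapper, reducer):
    """Run one map -> group-by-key -> reduce pass over (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


# Step 1: count words.
def split_words(_, line):
    for word in line.split():
        yield word, 1


def sum_counts(word, counts):
    yield word, sum(counts)


# Step 2: route every (word, count) to a single key, then keep the largest.
def to_single_key(word, count):
    yield None, (count, word)


def pick_max(_, count_word_pairs):
    count, word = max(count_word_pairs)
    yield word, count


def most_common_word(lines):
    """Chain the two steps: step 1's output records are step 2's input."""
    step1 = run_step(((None, line) for line in lines), split_words, sum_counts)
    step2 = run_step(step1, to_single_key, pick_max)
    return step2[0]


if __name__ == "__main__":
    print(most_common_word(["pig python", "python hadoop"]))
```

Written directly against Hadoop, each step would be a separate job with its own intermediate output to manage; a multi-step framework takes over that bookkeeping.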
Pig
ON HADOOP




Less code
Expressive code
Pig
BRIEF, EXPRESSIVE




(thanks: twitter hadoop world presentation)
The Same Script, In
FOR SERIOUS
Pig
ON HADOOP




Less code
Expressive code
Compiles to MR
Insulates from API
Popular
(LinkedIn, Twitter, Salesforce, Yahoo, Stanford)
Pig
ON HADOOP




Works with Jython
Not Python
Stream, no types
UDF read stdin
UDF deserialize, no types
Serialize for Pig
Write to stdout
Exceptions
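The steps listed above for a streamed (non-Jython) Python UDF can be sketched roughly as follows. This assumes tab-delimited text records, one per line; the record layout (name, clicks, views) and the function names are hypothetical, and this is not a description of how any particular Pig integration is implemented.

```python
import sys


def deserialize(line):
    """Split one tab-delimited record; every field arrives as an untyped string."""
    return line.rstrip("\n").split("\t")


def serialize(fields):
    """Join fields back into the tab-delimited form the parent process reads."""
    return "\t".join(str(field) for field in fields)


def score(fields):
    """Hypothetical UDF logic: (name, clicks, views) -> (name, click rate)."""
    name, clicks, views = fields
    # Types are lost in the stream, so the UDF must cast each field itself.
    return [name, round(int(clicks) / int(views), 4)]


def process(lines, out=sys.stdout, err=sys.stderr):
    """Read records, apply the UDF, write results; report bad records on err."""
    for line in lines:
        try:
            out.write(serialize(score(deserialize(line))) + "\n")
        except (ValueError, ZeroDivisionError) as exc:
            # A bad record should not kill the whole task; report and continue.
            err.write(f"skipped {line!r}: {exc}\n")


if __name__ == "__main__":
    process(["home\t1\t4\n", "about\tx\t2\n"])
```

In a deployed script the last line would be process(sys.stdin). Whether to skip bad records or fail the task is a design choice; skipping is shown here only because it keeps the example short.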
Pig + Python
ON HADOOP
Hadoop + Python
NOT ACTUALLY MAGIC




Hadoop won’t magically parallelize
your algorithm
Hadoop + Python
EFFICIENCY




Don’t stream Java-based languages
• Jython
• Pig + Jython


Streaming has ~30% overhead
• Python
• MrJob
• Pig + Python
Hadoop + Python
EXCITED?




Well... 90-95% of time isn’t spent on
algos
Hadoop + Python
HARD STUFF: SETUP




Get Hadoop running
Software where it needs to be
Processes communicating
Data available
Hadoop + Python
HARD STUFF: DEVELOP




Learn
Project structure, modularity
Dev environment like Production
Hadoop + Python
HARD STUFF: VALIDATE




Syntax check
Packages available
Data readable
Data writable
Without long waits for failure
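A minimal sketch of what a fast pre-flight check for the items above might look like, run before submitting a job rather than after waiting for it to fail. It only handles local Python source and local filesystem paths (HDFS or S3 locations would need their own checks), and all function names are assumptions for illustration.

```python
import ast
import importlib.util
import os


def syntax_ok(source):
    """Parse the script without running it; return (ok, message)."""
    try:
        ast.parse(source)
        return True, "ok"
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"


def missing_packages(names):
    """Return the top-level packages that cannot be found on this machine."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def path_problems(input_path, output_dir):
    """Check that the input is readable and the output location is writable."""
    problems = []
    if not os.access(input_path, os.R_OK):
        problems.append(f"cannot read {input_path}")
    if not os.access(output_dir, os.W_OK):
        problems.append(f"cannot write to {output_dir}")
    return problems


if __name__ == "__main__":
    print(syntax_ok("def f(:\n    pass\n"))
    print(missing_packages(["json", "no_such_package_xyz"]))
    print(path_problems("/no/such/input", "/no/such/output"))
```

Each check takes milliseconds, which is the point: the same mistakes found on a cluster cost a job submission and a wait.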
Hadoop + Python
HARD STUFF: DEBUG




Distributed execution is hard to debug
Hadoop + Python
HARD STUFF: TEST




Data processing is hard to test
But critical
Hadoop + Python
HARD STUFF: DEPLOY




Environments identical
Code correctly deployed
Configuration changes
Non-disruptive
Hadoop + Python
HARD STUFF: HISTORY




Stats about prior runs
What code was run?
What’s changed?
Hadoop + Python
HARD STUFF: LOGS




Distributed logs hard to make sense of
Hadoop logs hard to understand
Ephemeral clusters lose logs
Hadoop + Python
HARD STUFF: MORTAR’S APPROACH




Setup: PaaS, pip installation,
connectors
Develop: learning, structure, instant
dev env
Validate: fast validate
Debug: printf, more coming
Test: Rails-like test suites
Deploy: one-button deploy
K Young
 @kky
