Pig
      Dataflow Scripting for Hadoop


      Alan F. Gates
      @alanfgates




© Hortonworks, Inc 2011
Who Am I?

•   Pig committer and PMC Member
•   HCatalog committer and mentor
•   Member of ASF and Incubator PMC
•   Co-founder of Hortonworks
•   Author of Programming Pig from O’Reilly




           Photo credit: Steven Guarnaccia, The Three Little Pigs
Who Are You?




Example


For all of your registered users, you want to count how many came to your
site this month. You want this count both by geography (zip code) and by
demographic group (age and gender).

   Load Users        Load Logs
         \             /
           Semi-join
         /             \
   Count by zip      Count by age, gender
        |                 |
   Store results     Store results
In Pig Latin
-- Load web server logs
logs      = load 'server_logs' using HCatLoader();
thismonth = filter logs by date >= '20110801'
            and date < '20110901';

-- Load users
users     = load 'users' using HCatLoader();

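-- Remove any users that did not visit this month.
-- After the cogroup, each bag is named after its input alias
-- (thismonth, users), so the emptiness test is on thismonth.
grpd      = cogroup thismonth by userid, users by userid;
fltrd     = filter grpd by not IsEmpty(thismonth);
visited   = foreach fltrd generate flatten(users);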

-- Count by zip code
grpbyzip = group visited by zip;
cntzip    = foreach grpbyzip generate group, COUNT(visited);
store cntzip into 'by_zip' using HCatStorer('date=201108');

-- Count by demographics
grpbydemo = group visited by (age, gender);
cntdemo   = foreach grpbydemo
             generate flatten(group), COUNT(visited);
store cntdemo into 'by_demo' using HCatStorer('date=201108');
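
To make the data flow concrete, here is a minimal plain-Python sketch of the same steps (filter, semi-join, two group-and-count branches) on a tiny in-memory data set. It is an illustration only, not how Pig executes the script. The sample rows and the `count_visitors` helper are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the 'users' and 'server_logs' tables.
users = [
    {"userid": 1, "zip": "27701", "age": 34, "gender": "f"},
    {"userid": 2, "zip": "27701", "age": 41, "gender": "m"},
    {"userid": 3, "zip": "27513", "age": 34, "gender": "f"},
]
logs = [
    {"userid": 1, "date": "20110803"},
    {"userid": 1, "date": "20110815"},  # second visit by the same user
    {"userid": 3, "date": "20110720"},  # outside August, filtered out
    {"userid": 2, "date": "20110830"},
    {"userid": 9, "date": "20110810"},  # not a registered user
]


def count_visitors(users, logs, start, end):
    # filter: keep only log records with start <= date < end
    this_month = [r for r in logs if start <= r["date"] < end]
    # semi-join: keep each user row once if any log record matches,
    # no matter how many times that user visited
    visitor_ids = {r["userid"] for r in this_month}
    visited = [u for u in users if u["userid"] in visitor_ids]
    # group + COUNT, once per grouping key; both branches reuse 'visited'
    by_zip = Counter(u["zip"] for u in visited)
    by_demo = Counter((u["age"], u["gender"]) for u in visited)
    return by_zip, by_demo


by_zip, by_demo = count_visitors(users, logs, "20110801", "20110901")
print(dict(by_zip))
print(dict(by_demo))
```

Both counts are computed from the single `visited` list, which mirrors how the two `store` branches in the script share one upstream pipeline.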
Pig’s Place in the Data World




 Data Collection   Data Factory           Data Warehouse
                   Pig                    Hive

                   Pipelines              BI Tools
                   Iterative Processing   Analysis
                   Research



Why not MapReduce?

• Pig provides a number of standard data operators
   – Five different implementations of join (hash, fragment-replicate,
     merge, merge-sparse, skewed)
   – Order by provides total ordering across reducers in a balanced
     way
• Provides optimizations that are hard to do by hand
   – Multi-query: Pig will combine certain types of operations
     together in a single pipeline to reduce the number of times data
     is scanned
• User Defined Functions provide a way to inject your code
  into the data transformation
   – can be written in Java or Python
   – can do column transformation (TOUPPER) and aggregation
     (SUM)
   – can be written to take advantage of the combiner
• Control flow can be done via Python or Java
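
As a rough illustration of what "written to take advantage of the combiner" means, the sketch below splits an average into three stages so that partial results can be merged before the final step. Pig's Java UDF API expresses this idea through its Algebraic interface. The function names and sample values here are illustrative assumptions, not Pig API.

```python
# Sketch of a combiner-friendly ("algebraic") aggregate: AVG in three stages.


def initial(value):
    # map side: turn one input value into a partial (sum, count)
    return (value, 1)


def intermediate(partials):
    # combiner: merge any number of partials into one partial
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return (total, count)


def final(partials):
    # reduce side: merge the remaining partials and produce the answer
    total, count = intermediate(partials)
    return total / count if count else None


# Hypothetical values for one key, spread over two map tasks.
map_task_1 = [3.0, 5.0, 10.0]
map_task_2 = [2.0]

combined_1 = intermediate([initial(v) for v in map_task_1])
combined_2 = intermediate([initial(v) for v in map_task_2])
avg = final([combined_1, combined_2])
print(avg)
```

Carrying (sum, count) pairs rather than per-task averages is what keeps the result correct: averaging the two per-task averages would give 4.0 here instead of 5.0.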

Embedding Example: Compute Pagerank


PageRank:
A system of linear equations (as many as there are pages on the web,
yeah, a lot).

It can be approximated iteratively: compute the new page rank based on
the page ranks of the previous iteration. Start with some value.

Ref: http://en.wikipedia.org/wiki/PageRank


                                   Slide courtesy of Julien Le Dem
Or more visually




Each page sends a fraction of its PageRank to each page it links to.
That fraction is inversely proportional to the number of links on the
sending page.
              Slide courtesy of Julien Le Dem
Slide courtesy of Julien Le Dem
Let’s zoom in



Pig script: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Callouts on the slide:
•   Iterate 10 times
•   Pass parameters as a dictionary
•   Just run P, that was declared above
•   The output becomes the new input
      Slide courtesy of Julien Le Dem
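
The iteration described above can be sketched in a few lines. This is plain Python rather than a Pig embedding script: it applies PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) ten times, feeding each round's output into the next. The damping factor d = 0.85, the starting value of 1.0 and the three-page link graph are assumptions chosen for illustration.

```python
def pagerank(links, d=0.85, iterations=10):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    ranks = {p: 1.0 for p in pages}  # "start with some value"
    for _ in range(iterations):
        # each page T sends PR(T)/C(T) to every page it links to
        received = {p: 0.0 for p in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page with no outgoing links sends nothing
            share = ranks[source] / len(targets)
            for target in targets:
                received[target] += share
        # PR(A) = (1-d) + d * (sum of the shares A received);
        # the output of this round becomes the input of the next
        ranks = {p: (1 - d) + d * received[p] for p in pages}
    return ranks


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

In the embedded-Pig version each round would be a Pig job reading the previous round's output; here the `ranks` dictionary plays that role.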
Recently Added Features

• New in 0.9 (released July 2011):
  – Embedding in Python
  – Macros and Imports
• New in 0.10 (should release in Dec 2011)
  – Boolean data type
  – Hash-based aggregation for aggregates with
    low-cardinality keys
  – UDFs to build and apply Bloom filters
  – UDFs in JRuby (may slip to next release)
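
For readers unfamiliar with Bloom filters, the sketch below shows the build-then-apply pattern in plain Python. A common use is to build the filter from the keys of a small data set and apply it to a large one before a join. This is not Pig's UDF code: the `BloomFilter` class, its parameters and the sample keys are all assumptions made up for this example.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash positions per key; no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # derive num_hashes bit positions by salting one hash function
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # True can be a false positive; False is always correct
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


# Build: insert the keys of the small side.
small_side_keys = ["u1", "u7", "u9"]
bf = BloomFilter()
for k in small_side_keys:
    bf.add(k)

# Apply: drop rows from the large side whose key cannot match.
large_side = [("u1", "/home"), ("u2", "/cart"), ("u9", "/help")]
candidates = [row for row in large_side if bf.might_contain(row[0])]
print(candidates)
```

Rows that survive the filter still need the real join, since a Bloom filter can report a key as present when it is not.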

Learn More

• Read the online documentation:
  http://pig.apache.org/
• Programming Pig from O’Reilly Press
• Join the mailing lists:
  – user@pig.apache.org for user questions
  – dev@pig.apache.org for developer issues
• Follow me on Twitter, @alanfgates
Questions





TriHUG November Pig Talk by Alan Gates


Editor's Notes

  • #3: Say a little about Hortonworks
  • #7:
    SQL is a query language
      – Declarative, what not how
      – Oriented around answering a question
      – Requires uniform schema
      – Requires metadata
      – Known by everyone
      – A great choice for answering queries, building reports, use with automated tools
    Pig Latin is a data flow language
      – Script defines a data flow
      – Intended for pipelines where there may be tens or hundreds of steps
      – Built for raw world of Hadoop where schemas are optional, data may not be clean, etc.
      – Can operate with or without metadata
      – A great choice for ETL pipelines, data models, iterative processing, and research on raw data