Pig, Making Hadoop Easy
Alan F. Gates, Yahoo!

Who Am I?
  • Pig committer
  • Hadoop PMC member
  • An architect on the Yahoo! grid team
  • Or, as one coworker put it, “the lipstick on the Pig”

Who are you?

Motivation By Example
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18 to 25.
  • Load Users
  • Load Pages
  • Filter by age
  • Join on name
  • Group on url
  • Count clicks
  • Order by clicks
  • Take top 5

In Map Reduce

In Pig Latin
    Users = load 'users' as (name, age);
    Fltrd = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Jnd   = join Fltrd by name, Pages by user;
    Grpd  = group Jnd by url;
    Smmd  = foreach Grpd generate group, COUNT(Jnd) as clicks;
    Srtd  = order Smmd by clicks desc;
    Top5  = limit Srtd 5;
    store Top5 into 'top5sites';
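
To make the dataflow concrete, here is a minimal plain-Python sketch of what the Pig Latin script computes. It is an illustration only, not how Pig executes the script: the sample `users` and `pages` data and the helper name `top_pages` are made up for this example, and the join shortcut assumes user names are unique.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the 'users' and 'pages' files.
users = [("alice", 22), ("bob", 31), ("carol", 19), ("dave", 17)]
pages = [
    ("alice", "a.com"), ("alice", "b.com"), ("carol", "a.com"),
    ("bob", "a.com"), ("carol", "c.com"), ("dave", "b.com"),
    ("alice", "a.com"),
]


def top_pages(users, pages, low=18, high=25, n=5):
    """Return up to n (url, clicks) pairs for users aged low..high."""
    # filter Users by age >= low and age <= high
    names = {name for name, age in users if low <= age <= high}
    # join Fltrd by name, Pages by user
    # (assumes names are unique, so a membership test is enough)
    joined_urls = [url for user, url in pages if user in names]
    # group by url, COUNT, order by clicks desc, limit n
    return Counter(joined_urls).most_common(n)


top5 = top_pages(users, pages)
```

With the sample data only alice and carol pass the age filter, so `top5` holds three entries and `a.com` comes first with three clicks.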

Performance
[Figure: performance chart; data points labeled 0.1, 0.2, 0.3, 0.4/0.5, 0.6/0.7]

Why not SQL?
  • Data Collection
  • Data Factory: Pig (Pipelines, Iterative Processing, Research)
  • Data Warehouse: Hive (BI Tools, Analysis)

Pig Highlights
  • User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM)
  • UDFs can be written to take advantage of the combiner
  • Four join implementations built in: hash, fragment-replicate, merge, skewed
  • Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
  • Order by provides total ordering across reducers in a balanced way
  • Writing load and store functions is easy once an InputFormat and OutputFormat exist
  • Piggybank, a collection of user-contributed UDFs
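
The point about UDFs and the combiner can be illustrated with a small plain-Python sketch. An aggregate can use the combiner when it can be split into a per-record step, a merge step that may run any number of times, and a final step. The function names `initial`, `intermediate`, and `final` are placeholders chosen for this sketch, not Pig's Java API, and the average is just one example of such an aggregate.

```python
def initial(value):
    """Map side: turn one input value into a partial (sum, count) pair."""
    return (value, 1)


def intermediate(partials):
    """Combiner: merge partial pairs. Safe to apply zero or more times."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (total, count)


def final(partials):
    """Reduce side: merge what is left and produce the average."""
    total, count = intermediate(partials)
    return total / count if count else None


def average_with_combiner(splits):
    """Simulate one combiner pass per input split, then a single reduce."""
    combined = [intermediate([initial(v) for v in split]) for split in splits]
    return final(combined)


result = average_with_combiner([[1, 2, 3], [4, 5], [6]])
```

Because only small `(sum, count)` pairs leave each split, far less data has to cross the network than if every raw value were sent to the reducer.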

Who uses Pig for What?
  • 70% of production jobs at Yahoo (tens of thousands per day)
  • Also used by Twitter, LinkedIn, eBay, AOL, …
  • Used to
      • Process web logs
      • Build user behavior models
      • Process images
      • Build maps of the web
      • Do research on raw data sets

Accessing Pig
  • Submit a script directly
  • Grunt, the Pig shell
  • PigServer Java class, a JDBC-like interface

Components
  • User machine: Pig resides on the user machine
  • Hadoop cluster: the job executes on the cluster
  • No need to install anything extra on your Hadoop cluster.

How It Works
Pig Latin:
    A = LOAD 'myfile' AS (x, y, z);
    B = FILTER A BY x > 0;
    C = GROUP B BY x;
    D = FOREACH C GENERATE group, COUNT(B);
    STORE D INTO 'output';
pig.jar:
  • parses
  • checks
  • optimizes
  • plans execution
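
For the filter, group, and count script above, the deck's execution plan puts the filter and the count on the map side and a sum in the combine/reduce step. The following plain-Python sketch imitates that split on in-memory data. The function names, the `splits` sample data, and the two-split layout are assumptions for illustration, and nothing here reflects Pig's actual internals.

```python
from collections import defaultdict


def map_phase(records):
    """Map: apply the filter (x > 0) and emit a count of 1 keyed by x."""
    for x, y, z in records:
        if x > 0:
            yield x, 1


def sum_by_key(pairs):
    """Combine/Reduce: sum the partial counts for each key."""
    totals = defaultdict(int)
    for key, n in pairs:
        totals[key] += n
    return dict(totals)


def run_plan(splits):
    """Run map + combine per split, then one reduce over all partial sums."""
    combined = []
    for split in splits:
        combined.extend(sum_by_key(map_phase(split)).items())
    return sum_by_key(combined)


# Hypothetical input, already divided into two map splits.
splits = [
    [(1, "a", "b"), (2, "c", "d"), (-1, "e", "f"), (1, "g", "h")],
    [(1, "i", "j"), (0, "k", "l"), (2, "m", "n")],
]
counts = run_plan(splits)
```

The same `sum_by_key` function serves as both combiner and reducer, which is what makes a count cheap to run this way: summing partial sums gives the same answer as summing the original ones.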


Pig, Making Hadoop Easy

  • 17. pig.jar: submits jar to Hadoop
  • 18. pig.jar: monitors job progress
      Execution Plan
        Map: Filter, Count
        Combine/Reduce: Sum
  • 20. Upcoming Features
      In 0.8 (plan to branch end of August, release this fall):
        • Runtime statistics collection
        • UDFs in scripting languages (e.g. Python)
        • Ability to specify a custom partitioner
        • Adding many string and math functions as Pig-supported UDFs
      Post 0.8:
        • Adding branches, loops, functions, and modules
        • Usability
            • Better error messages
            • Fix ILLUSTRATE
        • Improved integration with workflow systems
  • 21. Learn More
      • Read the online documentation: http://hadoop.apache.org/pig/
      • Online tutorials
          • From Yahoo: http://developer.yahoo.com/hadoop/tutorial/
          • From Cloudera: http://www.cloudera.com/hadoop-training
      • Using Pig on EC2: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2728
      • A couple of Hadoop books that include chapters on Pig are available; search at your favorite bookstore
      • Join the mailing lists:
          • [email protected] for user
          • [email protected] for developer
          • [email protected] for Howl

Editor's Notes

  • #4: How many have used Pig? How many have looked at it and have a basic understanding of it?
  • #15: Demo script:
      Show group query first, talk about:
        • load and schema (none, declared, from data)
        • data types
        • data sources need not be from HDFS or even from files
        • parallel clause, how parallelism is determined on maps
        • how grouping works in Pig Latin
      So far what I’ve shown you is a simple join/group query. Now let’s look at something less straightforward in SQL.
      Often people want to group data a number of different ways. Look at multiquery script:
        • Note how there’s a branch in the logic now
      Often want to operate on the result of each record in a previous statement. Look at top5 query:
        • Note nested foreach allows you to operate on each record coming out of group by
        • Since result of group by is a bag in each record, can apply operators to that bag
        • Currently support order, distinct, filter, limit
        • Use of flatten at the end
        • Use of positional parameters
      There will always be logic you need to write that you can’t get from Pig Latin. This is where rich support of UDFs comes in. Look at session query:
        • Note registering UDF
        • UDF now called like any other Pig builtin function (in fact Pig builtins implemented as UDFs)
      Look at SessionAnalysis.java:
        • Class name is UDF name
        • Input to UDF is always a Tuple, avoids need to declare expected input, means UDF has to check what it gets
        • Talk about how projection of bags works
        • Talk about how EvalFunc is templatized on return type
      Also easy to write load and store functions to fit your data needs
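
The nested foreach idea in the notes (group, apply order and limit to each group's bag, then flatten) can be pictured with a short plain-Python sketch. The demo scripts themselves are not part of this page, so the data and the name `top_n_per_group` are invented for illustration, and this is only a rough analogue of the Pig Latin operators.

```python
from collections import defaultdict


def top_n_per_group(records, n):
    """Group (key, value) records, keep the n largest values per key."""
    # group: each key ends up with a "bag" holding its values
    bags = defaultdict(list)
    for key, value in records:
        bags[key].append(value)

    result = []
    for key, bag in bags.items():
        # nested order + limit, applied to the bag of this one group
        top = sorted(bag, reverse=True)[:n]
        # flatten: one output row per surviving bag element
        result.extend((key, value) for value in top)
    return result


# Hypothetical (url, visit_count) records.
visits = [("a.com", 3), ("a.com", 9), ("b.com", 4), ("a.com", 5), ("b.com", 1)]
top2 = top_n_per_group(visits, 2)
```

Each group is processed independently, which is why operators such as order, distinct, filter, and limit can be applied per bag without a second pass over the whole data set.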