Introduction To PIG
The evolution of data processing frameworks

What is PIG?
  • Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
  • Pig compiles each script into one or more Map/Reduce programs on the fly.

Why PIG?
  • Ease of programming: it is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.

File Formats
  • PigStorage
  • Custom Load / Store Functions

Installing PIG
  • Download / unpack tarball (pig.apache.org)
  • Install RPM / DEB package (cloudera.com)

Running PIG
  • Grunt Shell: Enter Pig commands manually using Pig's interactive shell, Grunt.
  • Script File: Place Pig commands in a script file and run the script.
  • Embedded Program: Embed Pig commands in a host language and run the program.

Run Modes
  • Local Mode: To run Pig in local mode, you need access to a single machine.
  • Hadoop (mapreduce) Mode: To run Pig in Hadoop (mapreduce) mode, you need access to a Hadoop cluster and an HDFS installation.

Sample PIG script
    A = load 'passwd' using PigStorage(':');
    B = foreach A generate $0 as id;
    store B into 'id.out';

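To make the data flow of the script above concrete, here is a minimal plain-Java sketch of roughly what it computes: split each line on ':' and keep the first field ($0). It does not use Pig at all; the class name PasswdIds and the sample lines are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PasswdIds {
    /** Returns the first ':'-separated field of every line, like "generate $0". */
    static List<String> firstFields(List<String> lines) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            // A limit of -1 keeps empty fields, so split always yields at least one element.
            ids.add(line.split(":", -1)[0]);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical passwd-style input, not taken from the slides.
        List<String> sample = Arrays.asList(
                "root:x:0:0:root:/root:/bin/bash",
                "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin");
        for (String id : firstFields(sample)) {
            System.out.println(id);
        }
    }
}
```

The point of the Pig version is that the same three-line script can run unchanged over a file far too large for a single machine.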
Sample Script With Schema
    A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
    B = FOREACH A GENERATE myudfs.UPPER(name);

Eval Functions
  • AVG
  • CONCAT
  • COUNT
  • COUNT_STAR
  • DIFF
  • IsEmpty
  • MAX
  • MIN
  • SIZE
  • SUM
  • TOKENIZE

Math Functions
  • ABS, ACOS, ASIN, ATAN
  • CBRT, CEIL, COS, COSH
  • EXP, FLOOR, LOG, LOG10
  • RANDOM, ROUND
  • SIN, SINH, SQRT
  • TAN, TANH

Pig Types
Sample CW PIG script
    RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
    input = FOREACH RawInput GENERATE ContextCategoryId as Category, TagId, URL, Impressions;
    GroupedInput = GROUP input BY (Category, TagId, URL);
    result = FOREACH GroupedInput GENERATE group, SUM(input.Impressions) as Impressions;
    STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();

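The GROUP ... BY followed by SUM in the script above amounts to totalling Impressions per (Category, TagId, URL) key. A minimal single-machine sketch of that aggregation in plain Java follows; the Row class, the class name ImpressionTotals, and the sample values are assumptions for illustration and are not part of the original deck or of Pig.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ImpressionTotals {
    /** One projected row: (Category, TagId, URL, Impressions). Hypothetical type. */
    static final class Row {
        final String category;
        final String tagId;
        final String url;
        final long impressions;

        Row(String category, String tagId, String url, long impressions) {
            this.category = category;
            this.tagId = tagId;
            this.url = url;
            this.impressions = impressions;
        }
    }

    /** Sums impressions per (category, tagId, url) key, in first-seen key order. */
    static Map<List<String>, Long> sumByKey(List<Row> rows) {
        Map<List<String>, Long> totals = new LinkedHashMap<>();
        for (Row r : rows) {
            List<String> key = Arrays.asList(r.category, r.tagId, r.url);
            totals.merge(key, r.impressions, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up rows for demonstration only.
        List<Row> rows = Arrays.asList(
                new Row("news", "t1", "a.example/x", 10),
                new Row("news", "t1", "a.example/x", 5),
                new Row("sports", "t2", "b.example/y", 7));
        System.out.println(sumByKey(rows));
    }
}
```

In Pig the same grouping is spread across mappers and reducers; the sketch only shows what the result means, not how it is distributed.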
Sample PIG script (Filtering)
    RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
    input = FOREACH RawInput GENERATE ContextCategoryId as Category, DefLevelId, TagId, URL, Impressions;
    defFilter = FILTER input BY (DefLevelId == 8) or (DefLevelId == 12);
    GroupedInput = GROUP defFilter BY (Category, TagId, URL);
    result = FOREACH GroupedInput GENERATE group, SUM(defFilter.Impressions) as Impressions;
    STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();

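Because the FILTER step runs before the GROUP, only rows with DefLevelId 8 or 12 contribute to the sums. A minimal plain-Java sketch of that predicate is shown here; the Row type, the class name DefLevelFilter, and the sample data are hypothetical and exist only to illustrate the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DefLevelFilter {
    /** Hypothetical row holding just the fields this sketch needs. */
    static final class Row {
        final int defLevelId;
        final long impressions;

        Row(int defLevelId, long impressions) {
            this.defLevelId = defLevelId;
            this.impressions = impressions;
        }
    }

    /** Keeps rows matching (DefLevelId == 8) or (DefLevelId == 12). */
    static List<Row> keepLevels8And12(List<Row> rows) {
        List<Row> kept = new ArrayList<>();
        for (Row r : rows) {
            if (r.defLevelId == 8 || r.defLevelId == 12) {
                kept.add(r);
            }
        }
        return kept;
    }

    /** Adds up impressions over whatever rows it is given. */
    static long totalImpressions(List<Row> rows) {
        long total = 0;
        for (Row r : rows) {
            total += r.impressions;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up rows: only the first and third pass the filter.
        List<Row> rows = Arrays.asList(new Row(8, 100), new Row(3, 40), new Row(12, 60));
        System.out.println(totalImpressions(keepLevels8And12(rows)));
    }
}
```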
What is PIG UDF?
  • UDF: User Defined Function
  • Types of UDFs:
      • Eval Functions (extends EvalFunc<String>)
      • Aggregate Functions (extends EvalFunc<Long> implements Algebraic)
      • Filter Functions (extends FilterFunc)
  • UDFContext
      • Allows UDFs to get access to the JobConf object
      • Allows UDFs to pass configuration information between instantiations of the UDF on the front end and back end

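Aggregate functions implement Algebraic so that the work can be split into an initial, an intermediate, and a final stage (Pig's interface names them through getInitial(), getIntermed(), and getFinal()), which lets partial results be combined before the reduce step. The sketch below illustrates that staging for a simple count in plain Java, without the Pig classes; the class and method names are assumptions, and this is not how Pig's built-in COUNT is actually written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AlgebraicCountSketch {
    /** Initial stage: each input record contributes a partial count of 1. */
    static long initial(Object record) {
        return 1L;
    }

    /** Intermediate stage: merges partial counts (the combiner-like step). */
    static long intermediate(List<Long> partials) {
        return sum(partials);
    }

    /** Final stage: merges the remaining partial counts into the result. */
    static long finalStep(List<Long> partials) {
        return sum(partials);
    }

    private static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    /** Counts records spread over several chunks by running the three stages. */
    static long count(List<List<String>> chunks) {
        List<Long> perChunk = new ArrayList<>();
        for (List<String> chunk : chunks) {
            List<Long> ones = new ArrayList<>();
            for (String record : chunk) {
                ones.add(initial(record));
            }
            perChunk.add(intermediate(ones));
        }
        return finalStep(perChunk);
    }

    public static void main(String[] args) {
        // Two made-up chunks standing in for data seen by two different mappers.
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"));
        System.out.println(count(chunks));
    }
}
```

The design works because counting (like SUM, MIN, or MAX) gives the same answer no matter how the input is partitioned, so intermediate results are safe to merge in any grouping.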
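Sample UDF
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class TopLevelDomain extends EvalFunc<String> {
        @Override
        public String exec(Tuple tuple) throws IOException {
            // The first field of the input tuple is the domain to inspect.
            Object o = tuple.get(0);
            if (o == null) {
                return null;  // propagate nulls instead of failing the job
            }
            return Validator.getTLD(o.toString());
        }
    }
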
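The deck does not show Validator.getTLD, so its exact behavior is unknown. As a rough, hypothetical stand-in, the sketch below returns the text after the last dot of a host name and mirrors the UDF's null handling. The class name TldSketch and the "last label" rule are assumptions; the real helper may well return something different, such as a registrable root domain.

```java
import java.util.Locale;

final class TldSketch {
    /**
     * Hypothetical stand-in for Validator.getTLD: returns the lower-cased label
     * after the last '.', or null when the input is null or has no usable label.
     */
    static String topLevelDomain(Object value) {
        if (value == null) {
            return null; // same null-in, null-out contract as the UDF above
        }
        String host = value.toString().trim().toLowerCase(Locale.ROOT);
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return null; // no dot at all, or nothing after the last dot
        }
        return host.substring(dot + 1);
    }

    public static void main(String[] args) {
        System.out.println(topLevelDomain("www.Example.COM"));
        System.out.println(topLevelDomain("localhost"));
    }
}
```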
UDF In Action
    REGISTER '$WORK_DIR/pig-support.jar';
    DEFINE getTopLevelDomain com.contextweb.pig.udf.TopLevelDomain();
    AA = FOREACH input GENERATE TagId, getTopLevelDomain(PublisherDomain) as RootDomain;

Resources
  • Apache PIG: http://pig.apache.org/
  • Apache Hadoop: http://hadoop.apache.org/
  • Cloudera CDH: https://wiki.cloudera.com/display/DOC/CDH3+Installation

PIG DEMO

Introduction to Apache Pig
