Hadoop Developer
Training
Session 04 - PIG
Agenda
• PIG
• PIG - Overview
• Installation and Running Pig
• Load in Pig
• Macros in Pig
What is Apache Pig?
• Apache Pig is an abstraction over MapReduce. It is a tool/platform used to
analyze large data sets by representing them as data flows. Pig is generally
used with Hadoop; we can perform all the data manipulation operations in
Hadoop using Apache Pig.
• To write data analysis programs, Pig provides a high-level language
known as Pig Latin. This language provides various operators with which
programmers can develop their own functions for reading, writing, and
processing data.
• To analyze data using Apache Pig, programmers write scripts in the Pig
Latin language. All these scripts are internally converted to Map and
Reduce tasks. Apache Pig has a component known as Pig Engine that
accepts Pig Latin scripts as input and converts them into MapReduce jobs.
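To give a feel for the language, here is a minimal word-count sketch in Pig
Latin; Pig compiles these few statements into one or more MapReduce jobs.
(The input file name input.txt is illustrative, not part of the original slides.)
lines = LOAD 'input.txt' AS (line:chararray);
-- split each line into words, one word per tuple
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- group identical words, then count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words) AS total;
DUMP counts;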
Installation and Running Pig
• Pig is a data analytics framework in the Hadoop ecosystem. It can be run in
the following modes:
• Local Mode: Pig runs on a single machine and uses the local file system for
storage. Local mode is used mainly for debugging and testing Pig Latin
scripts. Specify 'local' as an argument to pig to start it in this mode.
• MapReduce Mode: This is the default mode. It requires a configured Hadoop
cluster: files are loaded into HDFS, and Pig Latin scripts run as MapReduce jobs.
Follow the steps below to configure Pig in your environment:
Download the latest version of Pig from the Apache Pig releases page.
Untar the Pig tarball:
tar -xzvf pig-0.13.1.tar.gz
Create a folder in /opt for the Pig installation:
mkdir /opt/pig
Move the extracted directory into it:
mv /opt/setups/pig-0.13.1 /opt/pig
Open .bash_profile
nano ~/.bash_profile
Installation and Running Pig
• Add Pig to the PATH:
export PIG_HOME=/opt/pig/pig-0.13.1
export PATH=$PATH:$PIG_HOME/bin
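Reload the profile so the new variables take effect in the current shell
(assuming a Bash login shell):
source ~/.bash_profile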
Test that Pig was installed successfully:
pig -h
Start Pig in local mode:
pig -x local
To work with data in HDFS, start Pig in MapReduce mode:
pig -x mapreduce
Installation and Running Pig
In general, Apache Pig works on top of Hadoop. It is an analytical tool that
analyzes large datasets stored in the Hadoop File System. To analyze data
using Apache Pig, we first have to load the data into it. This section
explains how to load data into Apache Pig from HDFS.
Preparing HDFS
In MapReduce mode, Pig reads (loads) data from HDFS and stores the
results back in HDFS. Therefore, let us start HDFS and create the following
sample data in HDFS.
The dataset below contains personal details (id, first name, last name,
phone number, and city) of five students.
Student ID  First Name  Last Name   Phone       City
1           Peter       Burke       4353521729  Salt Lake City
2           Aaron       Kimberlake  8013528191  Salt Lake City
3           Danny       Jacob       2958295582  Salt Lake City
4           Angela      Kouth       2938811911  Salt Lake City
5           Peggy       Karter      3202289119  Salt Lake City
Installation and Running Pig
Create a Directory in HDFS
In Hadoop DFS, you can create directories using the mkdir command.
The input file of Pig contains one tuple/record per line, with the fields of
each record separated by a delimiter (in our example, a comma).
In the local file system, create an input file student_data.txt containing data
as shown below.
1, Peter, Burke, 4353521729, Salt Lake City
2, Aaron, Kimberlake, 8013528191, Salt Lake City
3, Danny, Jacob, 2958295582, Salt Lake City
4, Angela, Kouth, 2938811911, Salt Lake City
5, Peggy, Karter, 3202289119, Salt Lake City
Now, move the file from the local file system to HDFS using the put
command, as shown below.
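A minimal sequence of HDFS shell commands for this step; the directory
/pig_data matches the path used in the LOAD examples that follow, and the
hdfs dfs form assumes Hadoop 2.x or later (hadoop fs works as well).
Create the target directory in HDFS:
hdfs dfs -mkdir -p /pig_data
Copy the local file into it:
hdfs dfs -put student_data.txt /pig_data/
Verify the upload:
hdfs dfs -cat /pig_data/student_data.txt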
Load in Pig
The load statement simply loads the data into the specified relation in
Pig.
The load statement consists of two parts divided by the "=" operator. On
the left-hand side, we mention the name of the relation where we want to
store the data; on the right-hand side, we define how the data is to be
loaded. Given below is the syntax of the Load operator.
Relation_name = LOAD 'Input file path' USING function as schema;
Schema − We have to define the schema of the data. We can define the
required schema as follows −
(column1 : data type, column2 : data type, column3 : data type);
Now load the data from the file student_data.txt into Pig by executing the
following Pig Latin statement in the Grunt shell.
student = LOAD '/pig_data/student_data.txt' USING PigStorage(',') as (
id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Load in Pig
We have used the PigStorage() function. It loads and stores data as
structured text files. It takes as a parameter the delimiter by which the
fields of each tuple are separated. By default, the delimiter is '\t' (tab).
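Because tab is the default, a tab-delimited file can be loaded without
naming a function at all; omitting the USING clause is equivalent to
USING PigStorage('\t'). (The file name tab_data.txt below is illustrative.)
grunt> tabbed = LOAD '/pig_data/tab_data.txt' AS (id:int, name:chararray);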
We have now seen how to load data into Apache Pig. You can write the
loaded data back to the file system using the STORE operator. In
MapReduce mode, the stored output is written to HDFS.
STORE Relation_name INTO 'required_directory_path' [USING function];
student = LOAD '/pig_data/student_data.txt' USING PigStorage(',') as (
id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
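To check the result, list and read the output directory from HDFS.
PigStorage writes one or more part files, so a wildcard is safest (a sketch,
assuming the NameNode address localhost:9000 used above):
hdfs dfs -ls /pig_Output/
hdfs dfs -cat '/pig_Output/part-*'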
Load in Pig
Dump Operator
The Dump operator is used to run the Pig Latin statements and display the
results on the screen.
Given below is the syntax of the Dump operator.
grunt> Dump Relation_Name
grunt> Dump student
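Assuming student was loaded as above, Dump prints one tuple per line,
along these lines (the leading spaces inside the fields are preserved,
since the sample file has a space after each comma):
(1, Peter, Burke, 4353521729, Salt Lake City)
(2, Aaron, Kimberlake, 8013528191, Salt Lake City)
(3, Danny, Jacob, 2958295582, Salt Lake City)
(4, Angela, Kouth, 2938811911, Salt Lake City)
(5, Peggy, Karter, 3202289119, Salt Lake City)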
Describe Operator
The describe operator is used to view the schema of a relation.
grunt> Describe Relation_name
grunt> describe student;
Load in Pig
Output
Once you execute the above Pig Latin statement, it will produce the
following output.
grunt> student: { id: int,firstname: chararray,lastname: chararray,phone:
chararray,city: chararray }
The illustrate operator gives you the step-by-step execution of a sequence
of statements.
Syntax
Given below is the syntax of the illustrate operator.
grunt> illustrate Relation_name;
Now, let us illustrate the relation named student as shown below.
grunt> illustrate student
Load in Pig
Assume that we have a file named student_details.txt in the HDFS directory
/pig_data/ as shown below.
student_details.txt
1, Peter, Burke, 21, 4353521729, Salt Lake City
2, Aaron, Kimberlake, 22, 8013528191, Salt Lake City
3, Danny, Jacob, 22, 2958295582, Salt Lake City
4, Angela, Kouth, 21, 2938811911, Salt Lake City
5, Peggy, Karter, 23, 3202289119, Salt Lake City
6, King, Salmon, 23, 2398329282, Salt Lake City
7, Carolyn, Fisher, 24, 2293322829, Salt Lake City
8, John, Hopkins, 24, 2102392020, Salt Lake City
grunt> student_details = LOAD '/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
Load in Pig
The GROUP operator is used to group the data in one or more relations. It
collects the data having the same key.
Syntax
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY age;
grunt> group_data = GROUP student_details by age;
grunt> dump group_data
Load in Pig
Output
Then you will get output displaying the contents of the relation named
group_data as shown below. Here you can observe that the resulting
schema has two columns −
One is age, by which we have grouped the relation.
The other is a bag, which contains the group of tuples: the student records
with that age.
(21,{(4, Angela, Kouth, 21, 2938811911, Salt Lake City), (1, Peter, Burke,
21, 4353521729, Salt Lake City)})
(22,{(3, Danny, Jacob, 22, 2958295582, Salt Lake City), (2, Aaron, Kimberlake,
22, 8013528191, Salt Lake City)})
(23,{(6, King, Salmon, 23, 2398329282, Salt Lake City), (5, Peggy, Karter,
23, 3202289119, Salt Lake City)})
(24,{(8, John, Hopkins, 24, 2102392020, Salt Lake City), (7, Carolyn, Fisher,
24, 2293322829, Salt Lake City)})
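A natural next step on a grouped relation is to aggregate each bag. For
example, counting the students in each age group (a small sketch using the
relations defined above):
grunt> age_counts = FOREACH group_data GENERATE group AS age, COUNT(student_details) AS total;
grunt> dump age_counts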
Load in Pig
customers.txt
1, Peter, 32, Salt Lake City, 2000.00
2, Aaron, 25, Salt Lake City, 1500.00
3, Danny, 23, Salt Lake City, 2000.00
4, Angela, 25, Salt Lake City, 6500.00
5, Peggy, 27, Salt Lake City, 8500.00
6, King, 22, Salt Lake City, 4500.00
7, Carolyn, 24, Salt Lake City, 10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
Topics to be covered in next session
PIG
• Loads in Pig Continued
• Verification
• Filters
• Macros in Pig
Thank you!