Execution Environments for Distributed Computing
Apache Pig
EEDC
34330 Master in Computer Architecture, Networks and Systems - CANS
Homework number: 3
Group number: EEDC-3
Group members:
Javier Álvarez – javicid@gmail.com
Francesc Lordan – francesc.lordan@gmail.com
Roger Rafanell – rogerrafanell@gmail.com
Outline
1.- Introduction
2.- Pig Latin
2.1.- Data model
2.2.- Relational commands
3.- Implementation
4.- Conclusions
Part 1
Introduction
Why Apache Pig?
Today’s Internet companies need to process huge data sets:
– Parallel databases can be prohibitively expensive at this scale.
– Programmers tend to find declarative languages such as SQL very unnatural.
– Other approaches, such as map-reduce, are low-level and rigid.
What is Apache Pig?
A platform for analyzing large data sets that:
– Is based on Pig Latin, a language that lies between declarative (SQL) and procedural (C++) programming languages.
– At the same time, enables the construction of programs with an easily parallelizable structure.
Which features does it have?
• Dataflow Language
– Data processing is expressed step by step.
• Quick Start & Interoperability
– Pig can work over any kind of input and produce any kind of output.
• Nested Data Model
– Pig works with complex types such as tuples, bags, ...
• User Defined Functions (UDFs)
– Potentially in any programming language (only Java for the moment).
• Only parallel
– Pig Latin forces the programmer to use only directives that can be parallelized directly.
• Debugging environment
– Debugging at programming time.
Part 2
Pig Latin
Section 2.1
Data model
Data Model
A rich data model built from four data types:
• Atom: a simple atomic value, such as a string or a number.
'Alice'
• Tuple: a sequence of fields, each of any data type.
('Alice', 'Apple')
('Alice', ('Barça', 'football'))
• Bag: a collection of tuples, possibly with duplicates.
{ ('Alice', 'Apple'),
  ('Alice', ('Barça', 'football')) }
• Map: a collection of data items, each with an associated key (always an atom).
[ 'Fan of' → { ('Apple'),
               ('Barça', 'football') } ]
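As a rough analogy only, the four types can be mimicked with built-in Python values. This is not how Pig stores data: the choice of a Python list for a bag (it keeps duplicates) and a dict for a map is an assumption made for illustration, and the variable names are hypothetical.

```python
# Hypothetical Python stand-ins for Pig's data model (illustration only).
atom = 'Alice'                                   # atom: a single simple value
tuple_flat = ('Alice', 'Apple')                  # tuple: a sequence of fields
tuple_nested = ('Alice', ('Barça', 'football'))  # a field may itself be a tuple
bag = [tuple_flat, tuple_nested, tuple_flat]     # bag: tuples, duplicates allowed
fan_map = {'Fan of': [('Apple',), ('Barça', 'football')]}  # map: atom key -> data item

print(len(bag))              # number of tuples in the bag, duplicates included
print(fan_map['Fan of'][1])  # look up a data item by its key
```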
Section 2.2
Relational commands
Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);
visits: ('Amy', 'cnn.com', '8am')
        ('Amy', 'nytimes.com', '9am')
        ('Bob', 'elmundotoday.com', '11am')
pages:  ('cnn.com', '0.8')
        ('nytimes.com', '0.6')
        ('elmundotoday.com', '0.2')
Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);
vp = JOIN visits BY url, pages BY url;
vp: ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8')
    ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6')
    ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2')
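The effect of the JOIN can be sketched in plain Python. This is only a model of the semantics, not Pig's implementation; the helper name join_by_url is hypothetical, and ranks are written as floats here as an assumption.

```python
# Plain-Python sketch of an equi-join on url (not Pig's implementation).
visits = [('Amy', 'cnn.com', '8am'),
          ('Amy', 'nytimes.com', '9am'),
          ('Bob', 'elmundotoday.com', '11am')]
pages = [('cnn.com', 0.8), ('nytimes.com', 0.6), ('elmundotoday.com', 0.2)]

def join_by_url(visits, pages):
    """Concatenate every visit with every page that has the same url."""
    pages_by_url = {}
    for page in pages:                      # index pages by the join key
        pages_by_url.setdefault(page[0], []).append(page)
    return [visit + page                    # both url fields are kept
            for visit in visits
            for page in pages_by_url.get(visit[1], [])]

vp = join_by_url(visits, pages)
for row in vp:
    print(row)
```

Note that the joined tuples carry the url twice, once from each input, just as in the output shown on the slide.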
Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);
vp = JOIN visits BY url, pages BY url;
users = GROUP vp BY user;
users: ('Amy', { ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8'),
                 ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6') })
       ('Bob', { ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2') })
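A minimal Python sketch of what GROUP produces: one (key, bag) pair per distinct key, where the bag holds the complete original tuples. The helper group_by is hypothetical and the float ranks are an assumption.

```python
# Plain-Python model of GROUP ... BY (illustration only).
vp = [('Amy', 'cnn.com', '8am', 'cnn.com', 0.8),
      ('Amy', 'nytimes.com', '9am', 'nytimes.com', 0.6),
      ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', 0.2)]

def group_by(rows, field):
    """Return (key, bag) pairs; each bag lists the full tuples sharing that key."""
    groups = {}
    for row in rows:
        groups.setdefault(row[field], []).append(row)
    return list(groups.items())

users = group_by(vp, 0)   # field 0 is the user
for key, bag in users:
    print(key, bag)
```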
Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);
vp = JOIN visits BY url, pages BY url;
users = GROUP vp BY user;
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
useravg: ('Amy', '0.7')
         ('Bob', '0.2')
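The FOREACH ... GENERATE step can be read as "for every (group, bag) pair, emit the group key and an aggregate over one field of the bag". A small Python sketch under that reading, with a hypothetical RANK index and float ranks assumed:

```python
# Plain-Python model of FOREACH users GENERATE group, AVG(vp.rank).
users = [('Amy', [('Amy', 'cnn.com', '8am', 'cnn.com', 0.8),
                  ('Amy', 'nytimes.com', '9am', 'nytimes.com', 0.6)]),
         ('Bob', [('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', 0.2)])]

RANK = 4  # position of the rank field inside each joined tuple (assumed layout)

def average_rank(bag):
    """Average the rank field over all tuples in one group's bag."""
    ranks = [row[RANK] for row in bag]
    return sum(ranks) / len(ranks)

useravg = [(group, average_rank(bag)) for group, bag in users]
for row in useravg:
    print(row)
```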
Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);
vp = JOIN visits BY url, pages BY url;
users = GROUP vp BY user;
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
answer = FILTER useravg BY avgpr > 0.5;
answer: ('Amy', '0.7')
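FILTER keeps the tuples that satisfy a condition and drops the rest. A short Python sketch, assuming avgpr is numeric; the helper filter_by is hypothetical:

```python
# Plain-Python model of FILTER useravg BY avgpr > 0.5.
# Assumed input: the (user, avgpr) pairs from the slide, with avgpr as a number.
useravg = [('Amy', 0.7), ('Bob', 0.2)]

def filter_by(rows, predicate):
    """Keep only the tuples for which the predicate holds."""
    return [row for row in rows if predicate(row)]

answer = filter_by(useravg, lambda row: row[1] > 0.5)
print(answer)
```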
Relational commands
Other relational operators:
– STORE : exports data into a file.
STORE var1_name INTO 'output.txt';
– COGROUP : groups together tuples from different data sets.
COGROUP var1_name BY field_id, var2_name BY field_id;
– UNION : computes the union of two data sets.
– CROSS : computes the cross product.
– ORDER : sorts a data set by one or more fields.
– DISTINCT : removes duplicate tuples from a data set.
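To make COGROUP and DISTINCT more concrete, here is a plain-Python sketch of their semantics. The helpers cogroup and distinct are hypothetical, the sample data reuses the visits and pages from the earlier slides, and the output order is an artifact of this sketch (Pig itself makes no such ordering promise).

```python
# Plain-Python models of COGROUP and DISTINCT (illustration only).
visits = [('Amy', 'cnn.com', '8am'), ('Amy', 'nytimes.com', '9am')]
pages = [('cnn.com', 0.8), ('nytimes.com', 0.6), ('elmundotoday.com', 0.2)]

def cogroup(left, left_key, right, right_key):
    """For each key, emit (key, bag of left tuples, bag of right tuples)."""
    groups = {}
    for row in left:
        groups.setdefault(row[left_key], ([], []))[0].append(row)
    for row in right:
        groups.setdefault(row[right_key], ([], []))[1].append(row)
    return [(key, bags[0], bags[1]) for key, bags in groups.items()]

def distinct(rows):
    """Drop duplicate tuples, keeping the first occurrence of each."""
    return list(dict.fromkeys(rows))

by_url = cogroup(visits, 1, pages, 0)
for row in by_url:
    print(row)
print(distinct([('Amy',), ('Bob',), ('Amy',)]))
```

Unlike JOIN, the cogrouped result keeps a key even when one side has no matching tuples (its bag is simply empty), and the two inputs stay in separate bags instead of being flattened into one tuple.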
Part 3
Implementation
Implementation: Highlights
• Works on top of the Hadoop ecosystem:
– The current implementation uses Hadoop as its execution platform.
• On-the-fly compilation:
– Pig translates the Pig Latin commands into Map and Reduce methods.
• Lazy-style language:
– Pig tries to postpone data materialization (writes to disk) as much as possible.
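To give a feel for what "Map and Reduce methods" means, the sketch below runs a grouped average as a toy, single-process map-reduce job in Python. It is not code generated by Pig; the function names map_phase, reduce_phase and run_job are hypothetical, and the input reuses the joined visit and page data from the earlier slides with ranks as floats.

```python
from collections import defaultdict

# Toy single-process map-reduce job computing the average rank per user.
vp = [('Amy', 'cnn.com', '8am', 'cnn.com', 0.8),
      ('Amy', 'nytimes.com', '9am', 'nytimes.com', 0.6),
      ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', 0.2)]

def map_phase(row):
    """Map: emit (grouping key, value) pairs for one input tuple."""
    yield row[0], row[4]          # (user, rank)

def reduce_phase(user, ranks):
    """Reduce: receive every value emitted for one key and aggregate them."""
    return user, sum(ranks) / len(ranks)

def run_job(rows):
    shuffled = defaultdict(list)  # stands in for the shuffle between map and reduce
    for row in rows:
        for key, value in map_phase(row):
            shuffled[key].append(value)
    return [reduce_phase(key, values) for key, values in shuffled.items()]

result = run_job(vp)
print(result)
```

In a real cluster many map and reduce instances would run in parallel on different parts of the data; here the loop processes them one after another.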
Implementation: Building the logical plan
• Query parsing:
– The Pig interpreter parses the commands and verifies that the input files and bags they reference are valid.
• On-the-fly compilation:
– Pig compiles the logical plan for a bag into a physical plan (Map-Reduce statements) only when execution can no longer be delayed.
• Lazy characteristics:
– No processing is carried out while the logical plan is being built.
– Processing is triggered only when the user invokes a STORE command on a bag.
– Lazy-style execution permits in-memory pipelining and other interesting optimizations.
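The lazy behaviour described above can be imitated with a small Python class. This is a toy model, not Pig's planner: the Relation class and its methods are hypothetical, and the "plan" is just a tuple of recorded steps that is executed in one pipelined pass when store() is called.

```python
class Relation:
    """A node in a toy logical plan: it records steps but runs nothing yet."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = steps

    def filter(self, predicate):
        return Relation(self.source, self.steps + (('FILTER', predicate),))

    def foreach(self, function):
        return Relation(self.source, self.steps + (('FOREACH', function),))

    def store(self):
        """Only here is the plan executed, as one pipelined pass over the rows."""
        rows = iter(self.source)
        for kind, fn in self.steps:
            rows = filter(fn, rows) if kind == 'FILTER' else map(fn, rows)
        return list(rows)

calls = []

def is_morning(visit):
    calls.append(visit)               # record that real work happened
    return visit[2].endswith('am')

visits = Relation([('Amy', 'cnn.com', '8am'), ('Bob', 'elmundotoday.com', '11am')])
plan = visits.filter(is_morning).foreach(lambda visit: visit[0])
print(len(calls))                     # the predicate has not run yet
result = plan.store()                 # execution is triggered here
print(result, len(calls))
```

Because the steps are chained as iterators, each row flows through FILTER and FOREACH without an intermediate result being materialized, which is the kind of in-memory pipelining the slide refers to.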
Implementation: Map-Reduce plan compilation
• (CO)GROUP:
– Each command is compiled into a distinct map-reduce job with its own map and reduce functions.
– Parallelism is achieved because the output of multiple map instances is repartitioned in parallel to multiple reduce instances.
• LOAD:
– Parallelism is obtained because Pig operates over files residing in the Hadoop distributed file system.
• FILTER/FOREACH:
– Parallelism comes automatically, because a map-reduce job runs several map and reduce instances in parallel.
• ORDER (compiled into two map-reduce jobs):
– First: determines the quantiles of the sort key.
– Second: partitions the data according to those quantiles and sorts each partition locally in the reduce phase, resulting in a globally sorted file.
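The two-job structure of ORDER can be sketched in Python as follows. This is a single-process illustration, not Pig's code: the function names are hypothetical, and estimating the quantiles from a random sample is an assumption of this sketch.

```python
import bisect
import random

def sample_quantiles(keys, partitions, sample_size=100, seed=0):
    """Job 1: estimate partitions-1 split points from a random sample of keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    return [sample[len(sample) * i // partitions] for i in range(1, partitions)]

def range_partition_sort(keys, splits):
    """Job 2: route each key to its range, then sort every range locally."""
    buckets = [[] for _ in range(len(splits) + 1)]
    for key in keys:
        buckets[bisect.bisect_right(splits, key)].append(key)
    return [sorted(bucket) for bucket in buckets]

keys = [42, 7, 19, 88, 3, 56, 71, 25, 64, 11]
splits = sample_quantiles(keys, partitions=3)
parts = range_partition_sort(keys, splits)
globally_sorted = [key for part in parts for key in part]
print(splits)
print(parts)
```

Since every key in one partition is smaller than every key in the next, concatenating the locally sorted partitions yields a globally sorted result without any final merge step.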
Part 4
Conclusions
Conclusions
• Advantages:
– Step-by-step syntax.
– Flexible: UDFs, not locked to a fixed schema (allows schema changes over time).
– Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, …
– Takes advantage of Hadoop's native properties, such as parallelism, load balancing and fault tolerance.
– Debugging environment.
– Open Source (IMPORTANT!!)
• Disadvantages:
– UDF methods could be a source of performance loss (control rests with the user).
– Overhead while compiling Pig Latin into map-reduce jobs.
• Usage Scenarios:
– Temporal analysis: analyzing search logs mainly involves studying how the search query distribution changes over time.
– Session analysis: web user sessions, i.e., sequences of page views and clicks made by users, are analyzed to calculate metrics such as:
   – how long is the average user session?
   – how many links does a user click on before leaving a website?
– Others, ...
Q&A
