Eedc.apache.pig last

Execution Environments for Distributed Computing
Apache Pig
EEDC 34330
Master in Computer Architecture, Networks and Systems - CANS
Homework number: 3
Group number: EEDC-3
Group members:
Javier Álvarez – javicid@gmail.com
Francesc Lordan – francesc.lordan@gmail.com
Roger Rafanell – rogerrafanell@gmail.com

Outline
1.- Introduction
2.- Pig Latin
2.1.- Data model
2.2.- Relational commands
3.- Implementation
4.- Conclusions

Part 1
Introduction

Why Apache Pig?
Today’s Internet companies need to process huge data sets:
– Parallel databases can be prohibitively expensive at this scale.
– Programmers tend to find declarative languages such as SQL very unnatural.
– Other approaches, such as map-reduce, are low-level and rigid.

What is Apache Pig?
A platform for analyzing large data sets that:
– Is based on Pig Latin, a language that sits between declarative (SQL) and procedural (C++) programming languages.
– At the same time, enables the construction of programs with an easily parallelizable structure.

Which features does it have?
• Dataflow Language
– Data processing is expressed step by step.
• Quick Start & Interoperability
– Pig can work over any kind of input and produce any kind of output.
• Nested Data Model
– Pig works with complex types such as tuples, bags, ...
• User Defined Functions (UDFs)
– Potentially in any programming language (only Java for the moment).
• Only parallel
– Pig Latin restricts the programmer to directives that can be parallelized directly.
• Debugging environment
– Debugging at programming time.

Part 2
Pig Latin

Section 2.1
Data model

Data Model
Very rich data model consisting of 4 simple data types:
• Atom: simple atomic value such as a string or a number.
'Alice'
• Tuple: sequence of fields, each of any data type.
('Alice', 'Apple')
('Alice', ('Barça', 'football'))
• Bag: collection of tuples with possible duplicates.
{ ('Alice', 'Apple'),
  ('Alice', ('Barça', 'football')) }
• Map: collection of data items, each with an associated key (always an atom).
'Fan of' → { ('Apple'),
             ('Barça', 'football') }
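The four types can be pictured with ordinary Python values. This is only an analogy under stated assumptions: the variable names are hypothetical, a Python list stands in for a bag, and a dict stands in for a map. It is not how Pig represents data internally.

```python
# Hypothetical Python stand-ins for Pig's four data types (an analogy, not Pig's API).
atom = "Alice"                                  # atom: a single string or number
tup = ("Alice", ("Barça", "football"))          # tuple: ordered fields, which may nest
bag = [("Alice", "Apple"), ("Alice", "Apple")]  # bag: tuples, duplicates allowed
fan_map = {"Fan of": [("Apple",), ("Barça", "football")]}  # map: atom key -> any value

print(tup[1][0])          # fields are addressed by position, including nested ones
print(len(bag))           # both copies of the duplicate tuple are kept
print(fan_map["Fan of"])  # values are looked up by key
```

A list is used for the bag rather than a set because a set would silently drop the duplicate tuple.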

Section 2.2
Relational
commands

Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);
visits: ('Amy', 'cnn.com', '8am')
        ('Amy', 'nytimes.com', '9am')
        ('Bob', 'elmundotoday.com', '11am')
pages: ('cnn.com', '0.8')
       ('nytimes.com', '0.6')
       ('elmundotoday.com', '0.2')
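What LOAD ... AS does can be sketched in Python: read delimited lines and attach the declared field names. The helper `load` is hypothetical, and the file contents are assumed to be tab-separated (the default delimiter of Pig's standard text loader); an in-memory buffer replaces the real file.

```python
import io

def load(lines, schema):
    """Split each tab-delimited line and label the fields with the schema names."""
    return [dict(zip(schema, line.rstrip("\n").split("\t"))) for line in lines]

# Assumed contents of visits.txt, held in memory for the example.
visits_txt = io.StringIO(
    "Amy\tcnn.com\t8am\n"
    "Amy\tnytimes.com\t9am\n"
    "Bob\telmundotoday.com\t11am\n"
)
visits = load(visits_txt, ("user", "url", "time"))
print(visits[0])
```

No types are declared, so every field stays a string until an operator interprets it.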

Relational commands
vp = JOIN visits BY url, pages BY url;
vp: ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8')
    ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6')
    ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2')
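The result of this JOIN can be reproduced with a small hash join in Python. This is a sketch of the semantics only, not of how Pig executes it; the helper `join` is hypothetical and the ranks are assumed to be numbers rather than strings.

```python
from collections import defaultdict

visits = [("Amy", "cnn.com", "8am"),
          ("Amy", "nytimes.com", "9am"),
          ("Bob", "elmundotoday.com", "11am")]
pages = [("cnn.com", 0.8), ("nytimes.com", 0.6), ("elmundotoday.com", 0.2)]

def join(left, left_key, right, right_key):
    """Equi-join: concatenate every pair of tuples whose key fields match."""
    index = defaultdict(list)
    for r in right:                      # build a lookup table on the right side
        index[r[right_key]].append(r)
    return [l + r for l in left for r in index.get(l[left_key], [])]

vp = join(visits, 1, pages, 0)           # key positions: url in visits, url in pages
for row in vp:
    print(row)
```

Both copies of the join key are kept in the output tuple, which matches the five-field tuples shown on the slide.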

Relational commands
users = GROUP vp BY user;
users: ('Amy', { ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8'),
                 ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6') })
       ('Bob', { ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2') })
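GROUP produces one output tuple per key, holding the key and a bag of all the original tuples with that key. A minimal Python sketch, assuming a hypothetical helper `group`, a list as the bag, and numeric ranks:

```python
from collections import defaultdict

vp = [("Amy", "cnn.com", "8am", "cnn.com", 0.8),
      ("Amy", "nytimes.com", "9am", "nytimes.com", 0.6),
      ("Bob", "elmundotoday.com", "11am", "elmundotoday.com", 0.2)]

def group(relation, key_index):
    """Return (key, bag) pairs; each bag holds the complete tuples for that key."""
    groups = defaultdict(list)
    for t in relation:
        groups[t[key_index]].append(t)
    return list(groups.items())

users = group(vp, 0)                     # group by the user field
for key, bag in users:
    print(key, len(bag))
```

Unlike SQL's GROUP BY, no aggregation happens here: the nested bags are kept and can be processed by a later step.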

Relational commands
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
useravg: ('Amy', 0.7)
         ('Bob', 0.2)
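FOREACH ... GENERATE builds one output tuple per input tuple; here `vp.rank` projects the rank field out of each nested bag and AVG reduces it to a single number. A Python sketch of that step, assuming numeric ranks and a hypothetical helper `avg`:

```python
users = [
    ("Amy", [("Amy", "cnn.com", "8am", "cnn.com", 0.8),
             ("Amy", "nytimes.com", "9am", "nytimes.com", 0.6)]),
    ("Bob", [("Bob", "elmundotoday.com", "11am", "elmundotoday.com", 0.2)]),
]

def avg(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)

RANK = 4  # position of the rank field inside each joined tuple
useravg = [(key, avg(t[RANK] for t in bag)) for key, bag in users]
for row in useravg:
    print(row)
```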

Relational commands
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
answer = FILTER useravg BY avgpr > 0.5;
answer: ('Amy', 0.7)
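FILTER tests each tuple on its own, which is why it parallelizes trivially. A short Python sketch, where `filter_relation` is a hypothetical helper and the averages are assumed to be numbers:

```python
useravg = [("Amy", 0.7), ("Bob", 0.2)]

def filter_relation(relation, predicate):
    """Keep the tuples for which the predicate holds; no tuple depends on another."""
    return [t for t in relation if predicate(t)]

answer = filter_relation(useravg, lambda t: t[1] > 0.5)
print(answer)
```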

Relational commands
Other relational operators:
– STORE: exports data into a file.
STORE var1_name INTO 'output.txt';
– COGROUP: groups together tuples from different datasets.
COGROUP var1_name BY field_id, var2_name BY field_id;
– UNION: computes the union of two variables.
– CROSS: computes the cross product.
– ORDER: sorts a data set by one or more fields.
– DISTINCT: removes duplicate tuples in a dataset.
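The remaining operators can be approximated in a few lines of Python. All names below are hypothetical and the data is made up for the example; the sketch shows the semantics only (for instance, that COGROUP keeps one bag per input and that UNION keeps duplicates).

```python
from collections import defaultdict
from itertools import product

pages = [("cnn.com", 0.8), ("nytimes.com", 0.6)]
visits = [("Amy", "cnn.com"), ("Bob", "cnn.com")]

def cogroup(left, left_key, right, right_key):
    """Return (key, left_bag, right_bag): one bag per input for every key."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[t[left_key]][0].append(t)
    for t in right:
        groups[t[right_key]][1].append(t)
    return [(k, l, r) for k, (l, r) in groups.items()]

cg = cogroup(pages, 0, visits, 1)
union = pages + pages                             # UNION: duplicates are kept
cross = [p + v for p, v in product(pages, visits)]  # CROSS: every pairing
ordered = sorted(pages, key=lambda t: t[1])       # ORDER BY rank
distinct = sorted(set(union))                     # DISTINCT: duplicates removed

print(cg)
```

A key that appears in only one input still produces an output tuple, with an empty bag for the other input.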

Part 3
Implementation

Implementation: Highlights
• Works on top of the Hadoop ecosystem:
– The current implementation uses Hadoop as its execution platform.
• On-the-fly compilation:
– Pig translates the Pig Latin commands into Map and Reduce methods.
• Lazy style language:
– Pig tries to postpone data materialization (writes to disk) as much as possible.

Implementation: Building the logical plan
• Query parsing:
– The Pig interpreter parses the commands, verifying that the input files and bags referenced are valid.
• On-the-fly compilation:
– Pig compiles the logical plan for a bag into a physical plan (Map-Reduce statements) only when the command can no longer be delayed and must be executed.
• Lazy characteristics:
– No processing is carried out while the logical plan is being built.
– Processing is triggered only when the user invokes the STORE command on a bag.
– Lazy execution permits in-memory pipelining and other interesting optimizations.
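The lazy behaviour can be illustrated with a toy Python class that only records operators and runs them when `store()` is called. The class `Relation` and its methods are hypothetical and far simpler than Pig's real logical plan; chained iterators stand in for in-memory pipelining.

```python
class Relation:
    """Hypothetical lazy relation: records operators, executes only on store()."""

    def __init__(self, source, plan=()):
        self.source = source
        self.plan = plan                 # the "logical plan": a tuple of (op, fn)

    def filter(self, predicate):
        return Relation(self.source, self.plan + (("FILTER", predicate),))

    def foreach(self, fn):
        return Relation(self.source, self.plan + (("FOREACH", fn),))

    def store(self):
        rows = iter(self.source)
        for op, fn in self.plan:         # chain iterators: no intermediate lists
            rows = filter(fn, rows) if op == "FILTER" else map(fn, rows)
        return list(rows)                # materialize once, at the very end

calls = []

def is_popular(t):
    calls.append(t)                      # record every evaluation of the predicate
    return t[1] > 0.5

pages = Relation([("cnn.com", 0.8), ("nytimes.com", 0.6), ("elmundotoday.com", 0.2)])
good = pages.filter(is_popular).foreach(lambda t: t[0])
pending = len(calls)                     # still zero: only the plan exists so far
result = good.store()                    # execution is triggered here
print(pending, result)
```

Because each tuple flows through the whole chain before the next one is read, the filtered intermediate relation is never built in full.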

Implementation: Map-Reduce plan compilation
• (CO)GROUP:
– Each command is compiled into a distinct map-reduce job with its own map and reduce functions.
– Parallelism is achieved because the output of multiple map instances is repartitioned in parallel to multiple reduce instances.
• LOAD:
– Parallelism is obtained because Pig operates over files residing in the Hadoop distributed file system.
• FILTER/FOREACH:
– Parallelism is automatic because several map and reduce instances of a map-reduce job run in parallel.
• ORDER (compiled into two map-reduce jobs):
– First: determines quantiles of the sort key.
– Second: partitions the data according to the quantiles and performs a local sort in the reduce phase, resulting in a globally sorted file.
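The two-job ORDER strategy can be sketched in single-process Python: a sample gives approximate quantiles of the sort key, the quantiles route each record to a "reducer", and each reducer sorts locally. The function `order`, its parameters and the sample size are assumptions for illustration; the input is assumed to be non-empty.

```python
import bisect
import random

def order(records, key, n_reducers, sample_size=100, seed=0):
    """Range-partition records by sampled quantiles, then sort each partition."""
    rng = random.Random(seed)
    # Job 1: estimate quantiles of the sort key from a random sample.
    sample = sorted(key(r) for r in rng.sample(records, min(sample_size, len(records))))
    cuts = [sample[len(sample) * i // n_reducers] for i in range(1, n_reducers)]
    # Job 2, map side: send each record to the partition that owns its key range.
    partitions = [[] for _ in range(n_reducers)]
    for r in records:
        partitions[bisect.bisect_right(cuts, key(r))].append(r)
    # Job 2, reduce side: every partition is sorted independently.
    return [sorted(p, key=key) for p in partitions]

records = [("page%d" % i, (i * 37) % 101) for i in range(50)]
parts = order(records, key=lambda r: r[1], n_reducers=3)
flat = [r for p in parts for r in p]     # concatenating the partitions in order
print(flat[:3])
```

Since every key in partition i is smaller than every key in partition i+1, concatenating the locally sorted partitions gives a globally sorted result without a final merge.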

Part 4
Conclusions

Conclusions
• Advantages:
– Step-by-step syntax.
– Flexible: UDFs, not locked to a fixed schema (allows schema changes over time).
– Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, …
– Takes advantage of native Hadoop properties such as parallelism, load balancing and fault tolerance.
– Debugging environment.
– Open Source (IMPORTANT!!)
• Disadvantages:
– UDFs can be a source of performance loss (control rests with the user).
– Overhead of compiling Pig Latin into map-reduce jobs.
• Usage Scenarios:
– Temporal analysis: analysis of search logs mainly involves studying how the search query distribution changes over time.
– Session analysis: web user sessions, i.e., sequences of page views and clicks made by users, are analyzed to calculate metrics such as:
  – How long is the average user session?
  – How many links does a user click on before leaving a website?
– Others, ...
