
Poster, Hadoop Summit 2011: Pig embedding in scripting languages

PIG Scripting
Making Pig Turing-complete through embedding in a scripting language
Julien Le Dem - Yahoo

Overview

Pig execution model
 - Map/Reduce: a solution to process big data.
 - Pig: makes it easier to manipulate data on the grid.
 - Pig scripting: makes Pig easier with iterative algorithms and User Defined Functions.

Jiras: PIG-1479, PIG-1794 (in 0.9), PIG-928 (in 0.8)

Example: Transitive closure
 - Iterative process: requires a loop and a termination condition
 - Requires multiple join/group by: typical Pig usage
 - Requires User Defined Functions

Solution 1: plain Pig

7 files required.

Unfortunately this solution does not fit here. It requires 7 artifacts:
 - 3 Java UDFs: 3 classes must be compiled and packaged in a jar. The average UDF size is 50 to 100 lines of code.
 - 3 Pig scripts in 3 separate files: init, main loop content, finalization. The average Pig script size is 5 to 10 lines.
 - Main program: executes and coordinates the Pig scripts. It contains about 50 lines of code.

[Figure: modification flow]

Solution 2: using Pig scripting

1 file required.

This solution does fit on the poster. It requires 1 artifact, containing:
 - 3 UDFs provided as functions: the average UDF size is 8 to 12 lines of code.
 - 3 Pig queries: init, main loop content, finalization. Each Pig query adds an average of 5 to 10 lines.
 - Main function: executes and coordinates the Pig scripts. It adds about 10 lines to the script.

[Figure: modification flow]

Annotations on the script:
 - UDFs take the elements of the tuple as parameters, not a tuple.
 - The output schema is specified using a decorator (it can be a function if you need to manipulate the input schema).
 - UDFs use standard Python constructs (tuple/list/dictionary), automatically converted to Pig.
 - Python functions are automatically available as UDFs.
 - Embedded Pig calls in Python.
 - Python variables can be used in the Pig scripts: $n, $i, ...
 - Access to the output of Pig scripts.
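
The poster's script itself is not part of this transcript. As a rough illustration of why the transitive closure example needs a loop and a termination condition, here is a minimal sketch in plain Python. It does not call Pig: the in-memory join stands in for the "main loop content" query, and the names `extend_paths` and `transitive_closure`, as well as the `max_iterations` safety cap, are assumptions made for this example rather than anything taken from the poster.

```python
# Plain-Python sketch of the transitive-closure loop; it does not call Pig.
# extend_paths and transitive_closure are hypothetical names for illustration.

def extend_paths(paths, edges):
    """One iteration: join the known paths with the edges on the shared node.

    Stands in for the "main loop content" step (a join followed by a
    de-duplication) that would run as a Pig query on the grid.
    """
    by_source = {}
    for src, dst in edges:
        by_source.setdefault(src, set()).add(dst)
    extended = set(paths)
    for start, mid in paths:
        for end in by_source.get(mid, ()):
            extended.add((start, end))
    return extended


def transitive_closure(edges, max_iterations=100):
    """Repeat the join until an iteration adds no new pair."""
    paths = set(edges)                      # init step
    for _ in range(max_iterations):         # the loop that plain Pig lacks
        extended = extend_paths(paths, edges)
        if len(extended) == len(paths):     # termination condition
            break
        paths = extended
    return sorted(paths)                    # finalization step


closure = transitive_closure([("a", "b"), ("b", "c"), ("c", "d")])
print(closure)
```

The three stages marked in the comments (init, main loop content, finalization) mirror the three Pig queries listed above; in the embedded version each stage would be a Pig query driven from the host script instead of an in-memory set operation.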