Embedding Pig in scripting languages
What happens when you feed a Pig to a Python?
Julien Le Dem, Principal Engineer, Content Platforms at Yahoo!
Pig committer
julien@ledem.net
@julienledem

Disclaimer
No animals were hurt in the making of this presentation.
"I’m cute"
"I’m hungry"
Picture credits:
- OZinOH: http://www.flickr.com/photos/75905404@N00/5421543577/
- Stephen & Claire Farnsworth: http://www.flickr.com/photos/the_farnsworths/4720850597/

What for?
Simplifying the implementation of iterative algorithms:
- Loop and exit criteria
- Simpler User Defined Functions
- Easier parameter passing

Before
The implementation has the following artifacts:

Pig Script(s)

    -- Load the state produced by the previous iteration.
    warshall_n_minus_1 = LOAD '$workDir/warshall_0'
        USING BinStorage AS (id1:chararray, id2:chararray, status:chararray);
    to_join_n_minus_1 = LOAD '$workDir/to_join_0'
        USING BinStorage AS (id1:chararray, id2:chararray, status:chararray);
    -- Follow the links that have not been followed yet.
    joined = COGROUP to_join_n_minus_1 BY id2, warshall_n_minus_1 BY id1;
    followed = FOREACH joined
        GENERATE FLATTEN(followRel(to_join_n_minus_1, warshall_n_minus_1));
    followed_byid = GROUP followed BY id1;
    warshall_n = FOREACH followed_byid
        GENERATE group, FLATTEN(coalesceLine(followed.(id2, status)));
    to_join_n = FILTER warshall_n BY $2 == 'notfollowed' AND $0 != $1;
    -- Store the state for the next iteration.
    STORE warshall_n INTO '$workDir/warshall_1' USING BinStorage;
    STORE to_join_n INTO '$workDir/to_join_1' USING BinStorage;

External loop

    #!/usr/bin/python
    import os

    num_iter = 10
    for i in range(num_iter):
        # Run one iteration of the Pig script as a separate process.
        os.system('java -jar ./lib/pig.jar -x local plsi_singleiteration.pig')
        # Archive this iteration's input, then promote the .nxt output to be the next input.
        os.rename('output_results/p_z_u', 'output_results/p_z_u.' + str(i))
        os.system('cp output_results/p_z_u.nxt output_results/p_z_u')
        os.rename('output_results/p_z_u.nxt', 'output_results/p_z_u.' + str(i + 1))
        os.rename('output_results/p_s_z', 'output_results/p_s_z.' + str(i))
        os.system('cp output_results/p_s_z.nxt output_results/p_s_z')
        os.rename('output_results/p_s_z.nxt', 'output_results/p_s_z.' + str(i + 1))

Java UDF(s)

Development Iteration

So… What happens?
Credits: Mango Atchar: http://www.flickr.com/photos/mangoatchar/362439607/

After
One script (to rule them all):
- main program
- UDFs as script functions
- embedded Pig statements
The whole algorithm in one place

References
It uses JVM implementations of scripting languages (Jython, Rhino).
This is a joint effort, see the following Jiras:
- in Pig 0.8: PIG-928 Python UDFs
- in Pig 0.9: PIG-1479 embedding, PIG-1794 JavaScript support
Doc: http://pig.apache.org/docs/

Examples
1) Simple example: fixed loop iteration
2) Adding convergence criteria and accessing intermediary output
3) More advanced example with UDFs

1) A Simple Example
PageRank: a system of linear equations (as many as there are pages on the web, yeah, a lot).
It can be approximated iteratively: compute the new page rank based on the page ranks of the previous iteration. Start with some value.
Ref: http://en.wikipedia.org/wiki/PageRank

Or more visually
Each page sends a fraction of its PageRank to each page it links to. The fraction is inversely proportional to the page's number of outgoing links.

Let’s zoom in
- Pig script: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A, C(T) is the number of outgoing links of T, and d is the damping factor
- Iterate 10 times
- Pass parameters as a dictionary
- Just run P, that was declared above
- The output becomes the new input

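As a rough illustration of the fixed-iteration loop described above, here is a minimal plain-Python sketch of the same update rule. It does not call Pig at all: the `LINKS` graph, the function names and the damping factor `d=0.5` are assumptions made up for this example, and the Pig job of the slide is replaced by an in-memory function.

```python
# Hypothetical toy link graph (not from the talk): page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}


def pagerank_step(ranks, links, d=0.5):
    """Apply PR(A) = (1-d) + d * sum(PR(T)/C(T)) once, for every page A."""
    received = {page: 0.0 for page in links}
    for page, targets in links.items():
        # Each page splits its current rank evenly among its outgoing links.
        share = ranks[page] / len(targets)
        for target in targets:
            received[target] += share
    return {page: (1 - d) + d * received[page] for page in links}


def pagerank_fixed(links, iterations=10, d=0.5):
    """Iterate a fixed number of times; each output becomes the next input."""
    ranks = {page: 1.0 for page in links}  # "Start with some value."
    for _ in range(iterations):
        ranks = pagerank_step(ranks, links, d)
    return ranks


ranks = pagerank_fixed(LINKS)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

In the embedded version, the body of the loop would instead bind a parameter dictionary to the compiled Pig script and run it, with the output path of one run reused as the input path of the next.
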
Practical result
Applied to the English Wikipedia link graph:
http://wiki.dbpedia.org/Downloads36#owikipediapagelinks
It turns out that the max PageRank is awarded to:
http://en.wikipedia.org/wiki/United_States
Thanks @ogrisel for the suggestion

2) Same example, one step further
Now let’s say that we define a threshold as a convergence criterion instead of a fixed iteration count.

Same thing as previously
- Computation of the maximum difference with the previous iteration
… continued next slide

The main program
- The parameter-less bind() uses the variables in the current scope
- We can easily read the output of Pig from the grid
- Stop if we reach a threshold

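The threshold-based exit can be sketched in plain Python as follows. This is only an illustration of the control flow: the toy graph, the `threshold` of 0.01, the `max_iterations` safety cap and all function names are assumptions for this example, and the maximum difference is computed in memory rather than read back from a Pig output.

```python
def pagerank_step(ranks, links, d=0.5):
    """One update of PR(A) = (1-d) + d * sum(PR(T)/C(T)) for every page A."""
    received = {page: 0.0 for page in links}
    for page, targets in links.items():
        share = ranks[page] / len(targets)
        for target in targets:
            received[target] += share
    return {page: (1 - d) + d * received[page] for page in links}


def pagerank_until_converged(links, threshold=0.01, max_iterations=50, d=0.5):
    """Iterate until the largest per-page change drops below the threshold."""
    ranks = {page: 1.0 for page in links}
    max_diff = float("inf")
    for iteration in range(1, max_iterations + 1):
        new_ranks = pagerank_step(ranks, links, d)
        # Maximum difference with the previous iteration.
        max_diff = max(abs(new_ranks[p] - ranks[p]) for p in links)
        ranks = new_ranks
        if max_diff < threshold:
            return ranks, iteration, max_diff
    # Safety cap reached without meeting the threshold.
    return ranks, max_iterations, max_diff


# Hypothetical toy link graph (not from the talk).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks, iterations, max_diff = pagerank_until_converged(links)
print("stopped after", iterations, "iterations, max diff", max_diff)
```

Keeping an upper bound on the number of iterations alongside the threshold is a design choice of this sketch, so that a badly chosen threshold cannot make the loop run forever.
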
3) Now something more complex
Compute a transitive closure: find the connected components of a graph.
- Useful if you’re doing de-duplication
- Requires iterations and UDFs

Or more visually
Turn this: [figure]
Into this: [figure]

Convergence
Converges in: log2(max(diameter of a component)) iterations
Diameter = “the longest shortest path”
Bottom line: the number of iterations will be reasonable.

UDFs are in the same script as the main program
Zoom next slide
Page 1/3

Zoom on UDFs
- The output schema of the UDF is defined using a decorator
- The native structures of the language can be used directly

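To give an idea of what such a UDF can look like, here is a small sketch that runs outside Pig. The `outputSchema` function below is a local stand-in written for this example so that the code is self-contained; inside a Pig script the decorator is supplied by Pig itself. The UDF name `coalesce_links`, its schema string and its merging rule are hypothetical and are not the functions used in the talk.

```python
def outputSchema(schema):
    """Local stand-in for the schema decorator: it only records the schema string."""
    def decorate(func):
        func.outputSchema = schema
        return func
    return decorate


@outputSchema("links:bag{t:(id2:chararray, status:chararray)}")
def coalesce_links(bag):
    """Merge duplicate links, keeping 'followed' when any duplicate was followed.

    The bag arrives as a plain list of (id2, status) tuples, so native
    Python structures (dict, tuple unpacking, sorted) can be used directly.
    """
    merged = {}
    for id2, status in bag:
        if merged.get(id2) != "followed":
            merged[id2] = status
    return sorted(merged.items())


result = coalesce_links([
    ("b", "notfollowed"),
    ("b", "followed"),
    ("c", "notfollowed"),
])
print(result)
```
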
Zoom next slides
Page 2/3

Zoom on the Pig script…
UDFs are directly available, no extra declaration needed

Zoom on the loop
- Iterate a maximum of 10 times (2^10 maximum diameter of a component)
- Convergence criterion: all links have been followed

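The loop and its exit criterion can be illustrated with a plain-Python sketch of a transitive closure by repeated doubling. This is an assumption-laden stand-in for the Pig job and its UDFs: the function name, the in-memory `reach` dictionary and the path-shaped test graph are invented for this example, and "all links have been followed" is modelled as "an iteration added nothing new".

```python
def transitive_closure(edges, max_iterations=10):
    """Return (reach, productive_iterations) for an undirected edge list.

    reach[node] is the set of nodes known to be connected to node. Each
    iteration lets every node adopt what its known neighbours already
    reach, which doubles the distance covered.
    """
    reach = {}
    for a, b in edges:
        reach.setdefault(a, {a}).add(b)
        reach.setdefault(b, {b}).add(a)

    for iteration in range(max_iterations):
        changed = False
        new_reach = {}
        for node, seen in reach.items():
            expanded = set(seen)
            for other in seen:
                expanded |= reach[other]  # uses the previous iteration's state
            if expanded != seen:
                changed = True
            new_reach[node] = expanded
        reach = new_reach
        if not changed:
            # Nothing new was followed: converged.
            return reach, iteration
    # Cap reached; the result may be incomplete for very large diameters.
    return reach, max_iterations


# A path of 9 nodes has diameter 8, so log2(8) = 3 productive iterations.
edges = [(i, i + 1) for i in range(8)]
closure, productive = transitive_closure(edges)
print("productive iterations:", productive)
```

With a cap of 10 iterations, this doubling scheme covers components of diameter up to 2^10, which is the bound quoted on the slide.
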
Final part: formatting
- Turning the graph representation into a component list representation
- This is necessary when we have UDFs, so that the script can be evaluated again on the slaves without running main()
Page 3/3

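A possible shape for this last step is sketched below in plain Python. It assumes the graph representation is a dictionary mapping each node to the set of nodes in its component, and it assumes that the remark about main() refers to the usual `if __name__ == "__main__":` guard; both are guesses made for illustration, as are the function names and the sample data.

```python
def components(reach):
    """Collapse node -> connected-set into a sorted list of distinct components."""
    return sorted({tuple(sorted(nodes)) for nodes in reach.values()})


def main():
    # Hypothetical closure result: two components, {a, b} and {c}.
    reach = {
        "a": {"a", "b"},
        "b": {"a", "b"},
        "c": {"c"},
    }
    for component in components(reach):
        print(", ".join(component))


# The guard keeps main() from running when the file is merely evaluated
# to load its function definitions, as opposed to being run as the driver.
if __name__ == "__main__":
    main()
```
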
One more thing …
- I presented Python but JavaScript is available as well (experimental).
- The framework is extensible. Any JVM implementation of a language could be integrated (contribute!).
- The examples can be found at: https://github.com/julienledem/Pig-scripting-examples

Questions???
