Hadoop Record Reader in Python
  • HUG: Nov 18 2009
  • Paul Tarjan
  • http://paulisageek.com
  • @ptarjan
  • http://github.com/ptarjan/hadoop_record
Hey Jute…
  • Tabs and newlines are good and all
  • For lots of data, don’t do that
don’t make it bad...
  • Hadoop has a native data storage format called Hadoop Record or “Jute”
  • org.apache.hadoop.record
  • http://en.wikipedia.org/wiki/Jute
take a data structure…
  • There is a Data Definition Language!

        module links {
            class Link {
                ustring URL;
                boolean isRelative;
                ustring anchorText;
            };
        }
and make it better…
  • And a compiler

        $ rcc -lc++ inclrec.jr testrec.jr

        namespace inclrec {
            class RI :
            public hadoop::Record {
              private:
                int32_t I32;
                double D;
                std::string S;
remember, to only use C++/Java

        $ rcc --help
        Usage: rcc --language [java|c++] ddl-files
then you can start to make it better…
  • I wanted it in Python
  • Need 2 parts: parsing library and DDL translator
  • I only did the first part
  • If you need the second part, let me know
Hey Jute don't be afraid…
you were made to go out and get her…
  • http://github.com/ptarjan/hadoop_record
the minute you let her under your skin…
  • I bet you thought I was done with “Hey Jude” references, eh?
How I built it
  • Ply == lex and yacc
  • Parser == 234 lines including tests!
  • Outputs generic data types
  • You have to do the class transform yourself
  • You can use my lex and yacc stuff in your language of choice
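The "class transform" step can be sketched as follows. This is only an illustration under stated assumptions, not the library's API: the field names come from the Link class in the DDL example, while `to_link`, the plain-tuple input, and the "T"/"F" text for booleans are all hypothetical.

```python
from collections import namedtuple

# Field names copied from the Link class in the DDL example.
Link = namedtuple("Link", ["URL", "isRelative", "anchorText"])


def to_link(fields):
    """Turn a generic 3-item sequence into a Link (hypothetical helper)."""
    url, is_relative, anchor_text = fields
    # Assumption: the parser hands booleans back as the text "T" or "F".
    return Link(str(url), is_relative == "T", str(anchor_text))
```

A caller would then use `to_link(record).anchorText` instead of indexing the generic record by position.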
and any time you feel the pain…
Parsing the binary format is hard
  • Vector vs struct???

        struct = "s{" record *("," record) "}"
        vector = "v{" [record *("," record)] "}"

  • LazyString – don’t decode if not needed
  • 99% of my hadoop time was decoding strings I didn’t need
  • Binary on disk -> CSV -> python == wasteful
  • Hadoop unpacks zip files – name it .mod
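The struct and vector rules above, together with the LazyString idea, can be sketched as a small recursive-descent parser. This is not the author's Ply-based parser; it is a standard-library stand-in. `parse_record`, `parse_items`, and this `LazyString` class are hypothetical names, and the leading apostrophe on strings and the `%XX` escapes are assumptions about the text encoding.

```python
from urllib.parse import unquote


class LazyString:
    """Keeps the raw encoded text and decodes it only on first use."""

    def __init__(self, raw):
        self.raw = raw
        self._decoded = None

    def __str__(self):
        if self._decoded is None:
            # Assumption: special characters are stored as %XX escapes.
            self._decoded = unquote(self.raw)
        return self._decoded

    def __repr__(self):
        return "LazyString(%r)" % self.raw


def parse_record(text, pos=0):
    """Parse one record starting at text[pos]; return (value, next_pos)."""
    if text.startswith("s{", pos):
        items, pos = parse_items(text, pos + 2)
        if not items:
            raise ValueError("a struct needs at least one record")
        return tuple(items), pos  # struct -> tuple
    if text.startswith("v{", pos):
        return parse_items(text, pos + 2)  # vector -> list, may be empty
    # Primitive: runs until the next "," or "}" (or the end of the input).
    end = pos
    while end < len(text) and text[end] not in ",}":
        end += 1
    token = text[pos:end]
    if token.startswith("'"):
        # Assumption: strings are marked with a leading apostrophe.
        return LazyString(token[1:]), end
    return token, end  # other primitives are left as raw text


def parse_items(text, pos):
    """Parse [record *("," record)] followed by the closing "}"."""
    items = []
    if text.startswith("}", pos):
        return items, pos + 1
    while True:
        value, pos = parse_record(text, pos)
        items.append(value)
        if text.startswith(",", pos):
            pos += 1
        elif text.startswith("}", pos):
            return items, pos + 1
        else:
            raise ValueError("expected ',' or '}' at offset %d" % pos)
```

The only difference between the two rules is that a vector may be empty while a struct may not, which is why both share `parse_items`. Strings that are never passed to `str()` are never decoded.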
nanananana
Future work
  • DDL Converter
  • Integrate it officially
  • Record writer (should be easy)
  • SequenceFileAsOutputFormat
  • Integrate your feedback
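As a rough idea of what a record writer might involve, here is a minimal sketch for the text form described by the struct/vector grammar on the earlier slide. It is not part of the library: `write_record` and `escape` are hypothetical names, and the "T"/"F" booleans, the leading apostrophe on strings, and the `%XX` escapes are assumptions about the encoding.

```python
def escape(text):
    """Percent-escape characters that would clash with the delimiters."""
    # "%" goes first so the escapes added afterwards are not re-escaped.
    for ch in "%,}\n\r\0":
        text = text.replace(ch, "%%%02X" % ord(ch))
    return text


def write_record(value):
    """Serialize nested Python values: tuple -> struct, list -> vector."""
    if isinstance(value, bool):  # checked before int: bool is an int subclass
        return "T" if value else "F"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return "'" + escape(value)
    if isinstance(value, tuple):
        if not value:
            raise ValueError("a struct needs at least one record")
        return "s{" + ",".join(write_record(v) for v in value) + "}"
    if isinstance(value, list):
        return "v{" + ",".join(write_record(v) for v in value) + "}"
    raise TypeError("unsupported type: %s" % type(value).__name__)
```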

Nov HUG 2009: Hadoop Record Reader In Python
