Hadoop Record Reader in Python
  • HUG: Nov 18 2009
  • Paul Tarjan
  • http://paulisageek.com
  • @ptarjan
  • http://github.com/ptarjan/hadoop_record
Hey Jute…
  • Tabs and newlines are good and all
  • For lots of data, don’t do that
don’t make it bad...
  • Hadoop has a native data storage format called Hadoop Record or “Jute”
  • org.apache.hadoop.record
  • http://en.wikipedia.org/wiki/Jute
take a data structure…
  • There is a Data Definition Language!
      module links {
          class Link {
              ustring URL;
              boolean isRelative;
              ustring anchorText;
          };
      }
and make it better…
  • And a compiler
      $ rcc -lc++ inclrec.jr testrec.jr
      namespace inclrec {
          class RI : public hadoop::Record {
          private:
              int32_t I32;
              double D;
              std::string S;
remember, to only use C++/Java
      $ rcc --help
      Usage: rcc --language [java|c++] ddl-files
then you can start to make it better…
  • I wanted it in Python
  • Need 2 parts: a parsing library and a DDL translator
  • I only did the first part
  • If you need the second part, let me know
Hey Jute don't be afraid…
you were made to go out and get her…
  • http://github.com/ptarjan/hadoop_record
the minute you let her under your skin…
  • I bet you thought I was done with “Hey Jude” references, eh?
  • How I built it
      • Ply == lex and yacc
      • Parser == 234 lines including tests!
      • Outputs generic data types
      • You have to do the class transform yourself
      • You can use my lex and yacc stuff in your language of choice
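The "class transform" step mentioned on this slide could look something like the sketch below. It is only an illustration, not code from hadoop_record: it assumes the parser hands back a Link record as a plain three-item sequence whose values are already a string, a bool, and a string. The field names come from the Link class on the DDL slide; the `to_link` helper is hypothetical.

```python
from collections import namedtuple

# Field names copied from the Link class on the DDL slide.
Link = namedtuple("Link", ["URL", "isRelative", "anchorText"])


def to_link(fields):
    """Wrap a generic parsed struct in a named class.

    Assumption: `fields` is a sequence of exactly three values in DDL
    order. The real parser's output shape may differ.
    """
    return Link._make(fields)


link = to_link(["http://example.com/", False, "home"])
print(link.URL, link.isRelative, link.anchorText)
```

A namedtuple keeps the record as cheap as the tuple it wraps while giving each field a name, which is most of what a generated class would offer.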
and any time you feel the pain…
  • Parsing the binary format is hard
  • Vector vs struct???
      struct = "s{" record *("," record) "}"
      vector = "v{" [record *("," record)] "}"
  • LazyString: don’t decode if not needed
  • 99% of my Hadoop time was decoding strings I didn’t need
  • Binary on disk -> CSV -> Python == wasteful
  • Hadoop unpacks zip files: name it .mod
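The two grammar rules and the LazyString idea from this slide can be sketched together as a small recursive-descent parser. This is a minimal illustration under stated assumptions, not the Ply-based parser from hadoop_record: the names `LazyString`, `parse_record` and `_parse_items` are hypothetical, leaves are treated as opaque byte runs ending at the next `,` or `}`, and type prefixes and escape sequences in leaf values are ignored.

```python
class LazyString:
    """Hold raw bytes and decode them to text only on first use."""

    def __init__(self, raw):
        self._raw = raw
        self._text = None

    @property
    def decoded(self):
        return self._text is not None

    def __str__(self):
        if self._text is None:
            self._text = self._raw.decode("utf-8")
        return self._text


def parse_record(data, pos=0):
    """Parse one record at `pos`; return (value, next_pos).

    Structs become tuples, vectors become lists, and everything else
    becomes an undecoded LazyString.
    """
    if data.startswith(b"s{", pos):
        items, pos = _parse_items(data, pos + 2)
        if not items:
            # The struct rule requires at least one record.
            raise ValueError("empty struct")
        return tuple(items), pos
    if data.startswith(b"v{", pos):
        # The vector rule allows zero records.
        return _parse_items(data, pos + 2)
    end = pos
    while end < len(data) and data[end] not in b",}":
        end += 1
    return LazyString(data[pos:end]), end


def _parse_items(data, pos):
    """Parse comma-separated records up to and including the closing '}'."""
    items = []
    if data.startswith(b"}", pos):
        return items, pos + 1
    while True:
        value, pos = parse_record(data, pos)
        items.append(value)
        if data.startswith(b",", pos):
            pos += 1
        elif data.startswith(b"}", pos):
            return items, pos + 1
        else:
            raise ValueError(f"expected ',' or '}}' at offset {pos}")


record, end = parse_record(b"s{abc,v{1,2},v{}}")
print(str(record[0]), len(record[1]), record[2])
```

The only difference between the two rules is that a vector may be empty while a struct may not, which is why the sketch shares one item loop and checks for emptiness only in the struct branch. Leaves stay as bytes until `str()` is called on them, so fields a job never reads are never decoded.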
nanananana
  • Future work
      • DDL converter
      • Integrate it officially
      • Record writer (should be easy)
      • SequenceFileAsOutputFormat
      • Integrate your feedback

