Nov HUG 2009: Hadoop Record Reader In Python

The document discusses creating a Hadoop Record Reader in Python, highlighting the use of the native data storage format called Jute and a data definition language (DDL) for data structures. It touches on the challenges of parsing binary formats and provides insights on building a parser using lex and yacc. The author mentions future work, including developing a DDL converter and integrating feedback into the project.

Hadoop Record Reader in Python
HUG: Nov 18 2009
Paul Tarjan
http://paulisageek.com
@ptarjan
http://github.com/ptarjan/hadoop_record

Hey Jute…
Tabs and newlines are good and all
For lots of data, don’t do that

don’t make it bad...
Hadoop has a native data storage format called Hadoop Record or “Jute”
org.apache.hadoop.record
http://en.wikipedia.org/wiki/Jute

take a data structure…
There is a Data Definition Language!
    module links {
        class Link {
            ustring URL;
            boolean isRelative;
            ustring anchorText;
        };
    }
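
To make the DDL concrete from the Python side, here is a minimal sketch of how the Link definition could be turned into a Python class. It is not part of hadoop_record: the names DDL_TO_PYTHON and classes_from_ddl and the type mapping are assumptions made for illustration, and only flat classes with primitive fields are handled.

```python
import re
from dataclasses import fields, make_dataclass

# Assumed mapping from DDL primitive types to Python types (illustrative only).
DDL_TO_PYTHON = {
    "boolean": bool, "byte": int, "int": int, "long": int,
    "float": float, "double": float, "ustring": str, "buffer": bytes,
}

CLASS_RE = re.compile(r"class\s+(\w+)\s*\{(.*?)\}", re.DOTALL)
FIELD_RE = re.compile(r"(\w+)\s+(\w+)\s*;")


def classes_from_ddl(ddl):
    """Build one dataclass per DDL class, keyed by class name.

    Composite fields such as vectors and maps do not match FIELD_RE
    and are silently skipped in this sketch.
    """
    result = {}
    for class_name, body in CLASS_RE.findall(ddl):
        members = [(field_name, DDL_TO_PYTHON[field_type])
                   for field_type, field_name in FIELD_RE.findall(body)]
        result[class_name] = make_dataclass(class_name, members)
    return result


DDL = """
module links {
    class Link {
        ustring URL;
        boolean isRelative;
        ustring anchorText;
    };
}
"""

Link = classes_from_ddl(DDL)["Link"]
print([f.name for f in fields(Link)])
print(Link("http://example.com/", False, "home"))
```

A regular expression is enough for this one flat class; a real converter would need a proper grammar to cope with nested and composite types.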

and make it better…
And a compiler
    $ rcc -lc++ inclrec.jr testrec.jr

    namespace inclrec {
      class RI : public hadoop::Record {
      private:
        int32_t I32;
        double D;
        std::string S;

remember, to only use C++/Java
    $ rcc --help
    Usage: rcc --language [java|c++] ddl-files

then you can start to make it better…
I wanted it in Python
Need 2 parts: parsing library and DDL translator
I only did the first part
If you need second part, let me know

Hey Jute don’t be afraid…

you were made to go out and get her…
http://github.com/ptarjan/hadoop_record

the minute you let her under your skin…
I bet you thought I was done with “Hey Jude” references, eh?
How I built it
Ply == lex and yacc
Parser == 234 lines including tests!
Outputs generic data types
You have to do the class transform yourself
You can use my lex and yacc stuff in your language of choice
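
Because the reader hands back generic data types, the class transform is left to the caller. A minimal sketch of that step follows, assuming a struct arrives as a plain tuple of already-decoded values in DDL field order. The talk does not show the exact output shape, so that is an assumption, and Link and to_link are hypothetical names.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """Python counterpart of the Link class from the DDL slide."""
    URL: str
    isRelative: bool
    anchorText: str


def to_link(generic):
    """Turn a generic 3-item sequence into a Link (assumed field order)."""
    if len(generic) != len(Link._fields):
        raise ValueError(f"expected {len(Link._fields)} fields, got {len(generic)}")
    return Link._make(generic)


link = to_link(("http://example.com/", False, "home"))
print(link.anchorText)
```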

and any time you feel the pain…
Parsing the binary format is hard
Vector vs struct???
    struct = "s{" record *("," record) "}"
    vector = "v{" [record *("," record)] "}"
LazyString – don’t decode if not needed
99% of my Hadoop time was decoding strings I didn’t need
Binary on disk -> CSV -> Python == wasteful
Hadoop unpacks zip files – name it .mod
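
The struct and vector rules and the LazyString idea can be sketched together. This is a hand-written recursive-descent parser, not the PLY grammar from the repository, and the LazyString class here is an illustration rather than the author's code. It assumes primitives are opaque byte runs that never contain an unescaped "," or "}", and the sample record is made up.

```python
class LazyString:
    """Holds the raw bytes of a field and decodes them only on first use."""

    __slots__ = ("raw", "_text")

    def __init__(self, raw):
        self.raw = raw
        self._text = None

    def __str__(self):
        if self._text is None:  # decode once, then reuse the cached result
            self._text = self.raw.decode("utf-8")
        return self._text

    def __repr__(self):
        return f"LazyString({self.raw!r})"


def parse_record(data, pos=0):
    """Parse one record at data[pos:]; return (value, position after it)."""
    if data.startswith(b"s{", pos):
        items, pos = _parse_items(data, pos + 2)
        if not items:  # the struct rule requires at least one record
            raise ValueError("a struct needs at least one record")
        return tuple(items), pos
    if data.startswith(b"v{", pos):  # the vector rule allows zero records
        return _parse_items(data, pos + 2)
    # Primitive: keep the bytes up to the next delimiter, undecoded.
    end = pos
    while end < len(data) and data[end] not in b",}":
        end += 1
    return LazyString(data[pos:end]), end


def _parse_items(data, pos):
    """Parse comma-separated records up to and including the closing '}'."""
    items = []
    if data[pos:pos + 1] == b"}":
        return items, pos + 1
    while True:
        value, pos = parse_record(data, pos)
        items.append(value)
        sep = data[pos:pos + 1]
        if sep == b"}":
            return items, pos + 1
        if sep != b",":
            raise ValueError(f"expected ',' or '}}' at offset {pos}")
        pos += 1


sample = b"s{'http://example.com/,T,v{'a,'b},v{}}"
record, end = parse_record(sample)
url, is_relative, anchors, empty = record
print(str(url))  # only this field is ever decoded
```

Structs come back as tuples and vectors as lists, so the two stay distinguishable after parsing, and a field that is never passed to str() is never decoded.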

nanananana
Future work
DDL Converter
Integrate it officially
Record writer (should be easy)
SequenceFileAsOutputFormat
Integrate your feedback
