Command-line Data Tools
Peter Wang
@pwang
Why?
• Sometimes you just want to sling data
• Text is still king; lowest common denominator
• Machines are pretty honking big now
This Presentation
• List of some good collections of cmd-line tools
• Call out and describe a few in particular
• The PyDataTool of my desire
Sources
• From author of “Data Science at the Command Line”: http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html (larger list at http://datascienceatthecommandline.com/)
• HN discussion: https://news.ycombinator.com/item?id=6412190
• https://github.com/bitly/data_hacks
Tools
• JSON:
  • jq: https://stedolan.github.io/jq/
  • RecordStream: https://github.com/benbernard/RecordStream
• csvkit: https://csvkit.readthedocs.io/en/1.0.2/
• dt: https://github.com/clarkgrubb/data-tools
• XMLStarlet: http://xmlstar.sourceforge.net/overview.php
Honorable Mentions
• Pythonic awk: https://github.com/alecthomas/pawk
• Google Crush Tools: https://github.com/google/crush-tools
• Xonsh: http://xon.sh/tutorial.html
The PyDataTool of My Desire
• Support for csv, json, sql, xls, hdf5; image formats; network formats (pcap etc.)
• Capability of:
  • csvkit, jq, dt, “cols” tool
  • unix tools: sed, sort, shuf, split, tr, tee, uniq, wc, head, tail, bc
  • netpbm, imagemagick for images
• Work in streaming mode (netcat, wget, curl)
• First-class support for dask, spark
• Basic plotting via gnuplot, mpl, bokeh
• Built-in SQLite to support in-memory queries
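To make the wish list concrete, here is a minimal sketch of one item: reading CSV from a text stream and querying it through an in-memory SQLite database. It uses only the Python standard library. The function name `query_csv` and the default table name `data` are hypothetical, chosen for illustration; no such tool is described in the slides beyond the bullets above.

```python
import csv
import io
import sqlite3


def query_csv(stream, sql, table="data"):
    """Load CSV rows from a text stream into in-memory SQLite and run one query.

    Hypothetical helper: the first CSV row is treated as the header, and all
    values are stored as text because no column types are declared.
    """
    reader = csv.reader(stream)
    header = next(reader)

    conn = sqlite3.connect(":memory:")
    # Quote column names so headers with spaces or keywords still work.
    columns = ", ".join('"{}"'.format(name.replace('"', '""')) for name in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))

    # Stream the remaining rows straight from the reader into the table.
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader
    )

    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    sample = io.StringIO("city,temp\nAustin,35\nOslo,12\nAustin,31\n")
    query = "SELECT city, COUNT(*) FROM data GROUP BY city ORDER BY city"
    for row in query_csv(sample, query):
        print(row)
```

In a shell pipeline the same function could take `sys.stdin` as its stream, which would cover the "streaming mode" bullet for input that arrives from curl or netcat. Type inference, other formats, and plotting are left out of this sketch.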
Continuum Is Hiring!
• Creators of Anaconda, conda, bokeh, blaze, dask,
holoviews, numba, phosphorJS
• Maintainers/contributors to Jupyter, JupyterLab,
Spyder, pandas, conda-forge, …
• 150+ ppl, 80 in Austin
• Venture backed
• Enterprise product, OSS community innovation,
consulting, training
Continuum Is Hiring
• Enterprise Product Team:
• Dev Manager (reports to CTO, runs product engineering)
• QA Lead Engineer - creates test plans, coordinates with
product mgmt, dev, and testing team
• Senior Python Developer - enterprise product development;
backend, web tech; full stack preferred
• DevOps and Operations - enterprise product, anaconda.org,
Anaconda build system
• Email careers@continuum.io

