January 21st, 2015
Data Science Consulting
Héloïse Nonne
@heloisenonne,
hnonne@quantmetry.com
Data Science at the Command Line
Paris Data Geek @ AXA
PLEASE DO NOT PRINT THE BLACK SLIDES.
THINK ABOUT THE ENVIRONMENT
$ Let’s stop clicking and start typing
The CLI is 40 years old and counting
A modern tool!
• Agile and interactive
read–eval–print loop (REPL) vs edit-compile-run-debug cycle
• Close to the filesystem
• A good tool for the lazy
Automates repetitive tasks
• Good for integration with other technologies
C/C++, python, perl, R, ruby, etc.
Create your own tools
$ why use the command line?
Do we really need
Hadoop to process
a few GB of data?
1.75 GB – 2 million chess games
• Hadoop: 26 minutes (1.14MB/sec)
• Bash, local: 12 seconds (270MB/sec)
Source: Adam Drake, aadrake.com 2014
Streaming at the command line
• Spout: <
• Bolt: |
• Sink: >
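The spout / bolt / sink analogy above can be sketched as a single pipeline. This is a minimal illustration, not taken from the talk: the temporary directory and the file names input.txt and output.txt are made up for the example.

```shell
# Work in a throwaway directory; the file names are hypothetical.
workdir=$(mktemp -d)
printf 'b\na\nc\na\n' > "$workdir/input.txt"

# "<" plays the spout (reads the source), "|" chains the bolts
# (transformations), ">" plays the sink (writes the result).
sort < "$workdir/input.txt" | uniq > "$workdir/output.txt"

result=$(cat "$workdir/output.txt")
```

Each stage starts emitting as soon as it has data, so nothing beyond what sort needs is held in memory.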
MapReduce at the command line
Word count
• Mapper: grep -oE '[a-zA-Z]{2,}'
• (Shuffle) & Sort: sort
• Reduce: uniq -c
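Chained together, the three stages above form a complete word count. The sketch below assumes a small inline sample sentence; the final sort -rn and the awk reordering are additions for readability and are not part of the slide.

```shell
# Sample input, made up for the example.
text='the cat and the hat and the bat'

# Map: grep -o prints each match on its own line, -E enables the {2,} syntax.
# Shuffle & sort: sort brings identical words next to each other.
# Reduce: uniq -c counts each run of identical lines.
# sort -rn (an addition) ranks the words by descending count.
counts=$(printf '%s\n' "$text" | grep -oE '[a-zA-Z]{2,}' | sort | uniq -c | sort -rn)

# Most frequent word, printed as "word count".
top=$(printf '%s\n' "$counts" | head -n 1 | awk '{print $2, $1}')
```

On a real corpus the same pipeline would read from a file instead of a variable.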
$ Let’s go parallel
We have many jobs to run and 4 CPUs
• Naive parallelization
• GNU parallel: spawns a new process when one finishes, so all CPUs remain active
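The "start a new job as soon as a slot frees up" behaviour can be tried without installing anything. The sketch below is not from the talk: it uses xargs -P, which schedules jobs the same way, and the job itself (squaring a number) is a stand-in for a real script. The GNU parallel call in the comment is a rough equivalent for machines where it is installed.

```shell
# Hypothetical job: square a number. A real job would be a script or a command.
# xargs -P 4 keeps at most 4 processes running and launches the next job as
# soon as one of them exits. With GNU parallel, a roughly equivalent call is:
#   seq 1 8 | parallel -j 4 'echo $(( {} * {} ))'
squares=$(seq 1 8 \
  | xargs -P 4 -n 1 sh -c 'echo $(( $1 * $1 ))' _ \
  | sort -n \
  | paste -sd' ' -)
```

Jobs finish in no particular order, which is why the results are sorted before being joined onto one line.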
$ Data Scientist Toolkit
csvcut
csvsort
csvstack
csvjoin
csvstat
gnuplot
lowercase
regex
sed
tr
csvcut
cut
awk
sort
uniq
curl
in2csv
sql2csv
scrape
jq
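As a small illustration of how these tools combine, the sketch below uses only standard utilities from the list (cut, tr, sort, uniq, awk), plus tail and paste for glue. The CSV content is invented for the example; the csvkit tools (csvcut, csvstat, ...) would handle quoted fields more robustly but need a separate install.

```shell
# Hypothetical data: name,city,amount
data='name,city,amount
alice,Paris,10
bob,PARIS,5
carol,Lyon,7'

# tail drops the header, cut keeps the city column, tr lowercases it,
# sort + uniq -c count each distinct city, awk reshapes to "city=count".
city_counts=$(printf '%s\n' "$data" \
  | tail -n +2 \
  | cut -d, -f2 \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c \
  | awk '{print $2 "=" $1}' \
  | paste -sd' ' -)

# awk sums the third column, skipping the header row.
total=$(printf '%s\n' "$data" | awk -F, 'NR > 1 { sum += $3 } END { print sum }')
```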
$ Machine learning at
the command line
mlpack
$ linear_regression --input_file dataset.csv --test_file predict.csv -v
dbacl
Don’t Be Afraid of the Command Line?
Online learning with
Vowpal Wabbit
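Vowpal Wabbit reads a plain-text format in which the label comes before a "|" and the features follow as name:value pairs. A minimal sketch of preparing such input with awk is shown below; the CSV columns (label, age, income) and the file names in the comments are hypothetical, and the vw invocations are left as comments because they require vw to be installed.

```shell
# Hypothetical CSV: label first, then two numeric features.
csv='label,age,income
1,34,2.5
-1,51,1.2'

# Turn each data row into "label | name:value name:value ...".
vw_lines=$(printf '%s\n' "$csv" | awk -F, '
  NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }  # remember column names
  {
    line = $1 " |"
    for (i = 2; i <= NF; i++) line = line " " name[i] ":" $i
    print line
  }')

# With vw installed, training and prediction might then look like:
#   printf '%s\n' "$vw_lines" > train.vw
#   vw -d train.vw -f model.vw
#   vw -d train.vw -t -i model.vw -p predictions.txt
```

Because vw learns online, the converted lines could also be piped straight into it instead of being written to a file first.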
The command line is good for
• Starting a data project before moving to Hadoop, Spark, …
• Data discovery
• Data cleaning
• Efficient machine learning (online, C)
• Model / feature discovery
What next?
• Online learning
• Benchmark with bigger data
• Hadoop (Hive) vs CLI
• Benchmark of Machine learning at the CLI
• CLI tools vs Python / R
William Shotts
Jeroen Janssens
The man pages!
$ Bibliography
$ Questions?
