SlideShare a Scribd company logo
Data science at the command line
Rapid prototyping and reproducible science
Sharat Chikkerur
sharat@alum.mit.edu
Principal Data Scientist
Nanigans Inc.
1
Outline
Introduction
Motivation
Data science workflow
Obtaining data
Scrubbing data
Exploring data
Managing large workflows
Modeling using vowpal wabbit
2
Introduction
Setup
• Follow instructions at
https://github.com/sharatsc/cdse
• Install virtualbox virtualbox.org
• Install vagrant vagrantup.com
• Initialize virtual machine
mkdir datascience
cd datascience
vagrant init data-science-toolbox/data-science-at-the-command-line
vagrant up && vagrant ssh
3
About me
• UB Alumni, 2005 (M.S., EE Dept, cubs.buffalo.edu)
• MIT Alumni, 2010 (PhD., EECS Dept, cbcl.mit.edu)
• Senior Software Engineer, Google AdWords modeling
• Senior Software Engineer, Microsoft Machine learning
• Principal data scientist, Nanigans Inc
4
About the workshop
• Based on a book by Jeroen Janssens 1
• Vowpal wabbit 2
1
http://datascienceatthecommandline.com/
2
https://github.com/JohnLangford/vowpal_wabbit
5
POSIX command line
• Exposes operating system functionalities through a shell
• Example shells include bash, zsh, tcsh, fish etc.
• Comprises of a large number of utility programs
• Examples: grep, awk, sed, etc.
• GNU user space programs
• Common API: Input through stdin and output to stdout
• stdin and stdout can be redirected during execution
• Pipes (|) allows composition through pipes (chaining).
• Allows redirection of output of one command as input to
another
ls -lh | more
cat file | sort | uniq -c
6
Why command line ?
• REPL allows rapid iteration (through immediate feedback)
• Allows composition of scripts and commands using pipes
• Automation and scaling
• Reproducibility
• Extensibility
• R, python, perl, ruby scripts can be invoked like command
line utilities
7
Data science workflow
A typical workflow OSEMN model
• Obtaining data
• Scrubbing data
• Exploring data
• Modeling data
• Interpreting data
8
Workshop outline
• Background
• Getting data: Curl, scrape
• Scrubbing: jq, csvkit
• Exploring: csvkit
• Modeling: vowpal wabbit
• Scaling: parallel
Hands on exercises
• Obtaining data walkthrough
• JQ walkthrough
• CSVKit walkthrough
• Vowpal wabbit
9
Workflow example: Boston housing dataset
https://github.com/sharatsc/cdse/blob/master/
boston-housing
10
Boston housing dataset
Python workflow 3
import urllib
import pandas as pd
# Obtain data
urllib.urlretrieve(’https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housin
df = pd.read_csv(’boston.csv’)
# Scrub data
df = df.fillna(0)
# Model data
from statsmodels import regression
from statsmodels.formula import api as smf
formula = ’medv~’ + ’ + ’.join(df.columns - [’medv’])
model = smf.ols(formula=formula, data=df)
res=model.fit()
res.summary()
Command line workflow
URL="https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv"
curl $URL| Rio -e ’model=lm("medv~.", df);model’
3
https://github.com/sharatsc/cdse/blob/master/boston-housing
11
Obtaining data
Obtaining data from the web: curl
• CURL (curl.haxx.se)
• Cross platform command line tool that supports data
transfer using
HTTP, HTTPS, FTP, IMAP, SCP, SFTP
• Supports cookies, user+password authentication
• can be used to get data from RESTful APIs 4
• http GET
curl http://www.google.com
• ftp GET
curl ftp://catless.ncl.ac.uk
• scp COPY
curl -u username: --key key_file --pass password scp://example.com/~/file.txt
4
www.codingpedia.org/ama/how-to-test-a-rest-api-from-command-line-with-curl/
12
Scrubbing data
Scrubbing web data: Scrape
• Scrape is a python command line tool to parse html
documents
• Queries can be made in CSS selector or XPath syntax
htmldoc=$(cat << EOF
<div id=a>
<a href="x.pdf">x</a>
</div>
<div id=b>
<a href="png.png">y</a>
<a href="pdf.pdf">y</a>
</div>
EOF
)
# Select liks that end with pdf and are within div with id=b (Use CSS3 selector)
echo $htmldoc | scrape -e "$b a[href$=pdf]"
# Select all anchors (use Xpath)
echo $htmldoc | scrape -e "//a"
<a href="pdf.pdf">y</a>
13
CSS selectors
.class selects all elements with class=’class’
div p selects all <p> elements inside div elements
div > p selects <p> elements where parent is <div>
[target=blank] selects all elements with target="blank"
[href^=https] selects urls beginning with https
[href$=pdf] selects urls ending with pdf
More examples at
https://www.w3schools.com/cssref/css_selectors.asp
14
XPath Query
author
selects all <author> elements at
the current level
//author
selects all <author> elements at
any level
//author[@class=’x’]
selects <author> elements with
class=famous
//book//author
All <author> elements that are
below <book> element"
//author/* All children of <author> nodes
More examples at
https://msdn.microsoft.com/en-us/library/ms256086
15
Getting data walkthrough
https:
//github.com/sharatsc/cdse/tree/master/curl
16
Scrubbing JSON data: JQ 5
• JQ is a portable command line utility to manipulate and
filter JSON data
• Filters can be defined to access individual fields, transform
records or produce derived objets
• Filters can be composed and combined
• Provides builtin functions an operators
Example:
curl ’https://api.github.com/repos/stedolan/jq/commits’ | jq ’.[0]’
5
https://stedolan.github.io/jq/
17
Basic filters
# Identity ’.’
echo ’"Hello world"’ | jq ’.’
"Hello world"
# Examples ’.foo’, ’.foo.bar’, ’.foo|.bar’, ’.["foo"]
echo ’{"foo": 42, "bar": 10}’ | jq ’.foo’
42
echo ’{"foo": {"bar": 10, "baz": 20}} | jq ’.foo.bar’
10
#Arrays ’.[]’, ’.[0]’, ’.[1:3]’
echo ’["foo", "bar"]’ | jq ’.[0]’
"foo"
echo ’["foo", "bar", "baz"]’ | jq ’.[1:3]’
["bar", "baz"]
# Pipe ’.foo|.bar’, ’.[]|.bar’
echo ’[{"f": 10}, {"f": 20}, {"f": 30}]’ | jq ’.[] | .f
10 20 30
18
Object construction
Object construction allows you to derive new objects out of
existing ones.
# Field selection
echo {"foo": "F", "bar": "B", "baz": "Z"} | jq ’{"foo": .foo}’
{"foo": "F"}
# Array expansion
echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{"foo": .foo, "bar": .bar[]}’
{"foo": "F", "bar": "X"}
{"foo": "F", "bar": "Y"}
# Expression evaluation, key and value can be substituted
echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{(.foo): .bar[]}’
{"A": "X"}
{"A": "Y"}
19
Operators
Addition
• Numbers are added by normal arithmetic.
• Arrays are added by being concatenated into a larger array.
• Strings are added by being joined into a larger string.
• Objects are added by merging, that is, inserting all the
key-value pairs from both objects into a single combined
object.
# Adding fields
echo ’{"foo": 10}’ | jq ’.foo + 1’
11
# Adding arrays
echo ’{"foo": [1,2,3], "bar": [11,12,13]}’ | jq ’.foo + .bar’
[1,2,3,11,12,13]
20
JQ walkthrough
https:
//github.com/sharatsc/cdse/tree/master/jq
21
Exploring data
CSV
• CSV (Comma separated value) is the common demoniator
for data exchange
• Tabular data with ’,’ as a separator
• Can be ingested by R, python, excel etc.
• No explicit specification of data types (ARFF supports type
annotation)
Example
state,county,quantity
NE,ADAMS,1
NE,BUFFALO,1
NE,THURSTON,1
22
CSVKit (Groskopf and contributors [2016])
csvkit 6 is a suite of command line tools for converting and
working with CSV, the defacto standard for tabular file formats.
Example use cases
• Importing data from excel, sql
• Select subset of columns
• Reorder columns
• Mergeing multiple files (row and column wise)
• Summary statistics
6
https://csvkit.readthedocs.io/
23
Importing data
# Fetch data in XLS format
# (LESO) 1033 Program dataset, which describes how surplus military arms have been distributed
# This data was widely cited in the aftermath of the Ferguson, Missouri protests. T
curl -L https://github.com/sharatsc/cdse/blob/master/csvkit/ne_1033_data.xls?raw=true -o ne_10
# Convert to csv
in2csv ne_1033_data.xlxs > data.csv
# Inspect the columns
csvcut -n data.csv
# Inspect the data in specific columns
csvcut -c county, quantity data.csv | csvlook
24
CSVKit: Examining data
csvstat provides a summary view of the data similar to
summary() function in R.
# Get summary for county, and cost
csvcut -c county,acquisition_cost,ship_date data.csv | csvstat
1. county
Text
Nulls: False
Unique values: 35
Max length: 10
5 most frequent values:
DOUGLAS: 760
DAKOTA: 42
CASS: 37
HALL: 23
LANCASTER: 18
2. acquisition_cost
Number
Nulls: False
Min: 0.0
Max: 412000.0
Sum: 5430787.55
Mean: 5242.072924710424710424710425
Median: 6000.0
Standard Deviation: 13368.07836799839045093904423
Unique values: 75
5 most frequent values:
6800.0: 304
25
CSVKit: searching data
csvgrep can be used to search the content of the CSV file.
Options include
• Exact match -m
• Regex match -r
• Invert match -i
• Search specific columns -c columns
csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER | csvlook
| county | item_name | total_cost |
| --------- | ------------------------------ | ---------- |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
26
CSVKit: Power tools
• csvjoin can be used to combine columns from multiple files
csvjoin -c join_column data.csv other_data.csv
• csvsort can be used to sort the file based on specific
columns
csvsort -c total_population | csvlook | head
• csvstack allows you to merge mutiple files together
(row-wise)
curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/re
in2csv ne_1033_data.xls > ne_1033_data.csv
csvstack -g region ne_1033_data.csv ks_1033_data.csv > region.csv
27
CSVKit: SQL
• csvsql allows queryring against one or more CSV files.
• The results of the query can be inserted back into a db
Examples
• Import from csv into a table
# Inserts into a specific table
csvsql --db postgresql:///test --table data --insert data.csv
# Inserts each file into a separate table
csvsql --db postgresql:///test --insert examples/*_tables.csv
• Regular SQL query
csvsql --query "select count(*) from data" data.csv
28
CSVKit Walkthrough
https:
//github.com/sharatsc/cdse/tree/master/csvkit
29
Managing large workflows
GNU parallel
• GNU parallel (Tange [2011]) is a tool for executing jobs in
parallel on one or more machines.
• It can be used to parallelize across arguments, lines and
files.
Examples
# Parallelize across lines
seq 1000 | parallel "echo {}"
# Parallelize across file content
cat input.csv | parallel -C, "mv {1} {2}"
cat input.csv | parallel -C --header "mv {source} {dest}"
30
Parallel (cont.)
• By default, parallel runs one job per cpu core
• Concurrency can be controlled by --jobs or -j option
seq 100 | parallel -j2 "echo number: {}"
seq 100 | parallel -j200% "echo number: {}"
Logging
• Output from each parallel job can be captured separately
using --results
seq 10 | parallel --results data/outdir "echo number: {}"
find data/outdir
31
Parallel (cont.)
• Remote execution
parallel --nonall --slf instances hostname
# nonall - no argument command to follow
# slf - uses ~/.parallel/sshloginfile as the list of sshlogins
• Distributing data
# Split 1-1000 into sections of 100 and pipe it to remote instances
seq 1000 | parallel -N100 --pipe --slf instances "(wc -l)"
#transmit, retrieve and cleanup
# sends jq to all instances
# transmits the input file, retrives the results into {.}csv and cleanup
ls *.gz | parallel -v --basefile jq --trc {.} csv
32
Modeling using vowpal wabbit
Overview
• Fast, online , scalable learning system
• Supports out of core execution with in memory model
• Scalable (Terascale)
• 1000 nodes (??)
• Billions of examples
• Trillions of unique features.
• Actively developed
https://github.com/JohnLangford/vowpal_wabbit
33
Swiss army knife of online algorithms
• Binary classification
• Multiclass classification
• Linear regression
• Quantile regression
• Topic modeling (online LDA)
• Structured prediction
• Active learning
• Recommendation (Matrix factorization)
• Contextual bandit learning (explore/exploit algorithms)
• Reductions
34
Features
• Flexible input format: Allows free form text, identifying tags,
multiple labels
• Speed: Supports online learning
• Scalable
• Streaming removes row limit.
• Hashing removes dimension limit.
• Distributed training: models merged using AllReduce
operation.
• Cross product features: Allows multi-task learning
35
Optimization
VW solves optimization of the form
i
l(wT
xi; yi) + λR(w)
Here, l() is convex, R(w) = λ1|w| + λ2||w||2.
VW support a variety of loss function
Linear regression (y − wT x)2
Logistic regression log(1 + exp(−ywT x))
SVM regression max(0, 1 − ywT x)
Quantile regression τ(wT x − y) ∗ I(y < wT x) + (1 − τ)(y − wT x)I
36
Detour: Feature hashing
• Feature hashing can be used to reduce dimensions of
sparse features.
• Unlike random projections ? , retains sparsity
• Preserves dot products (random projection preserves
distances).
• Model can fit in memory.
• Unsigned ?
Consider a hash function
h(x) : [0 . . . N] → [0 . . . m], m << N.
φi(x) =
j:h(j)=i
xj
• Signed ?
Consider additionaly a hash function
ξ(x) : [0 . . . N] → {1, −1}.
φi(x) = ξ(j)xj
37
Detour: Generalized linear models
A generalized linear predictor specifies
• A linear predictor of the form η(x) = wT x
• A mean estimate µ
• A link function g(µ) such that g(µ) = η(x) that relates the
mean estimate to the linear predictor.
This framework supports a variety of regression problems
Linear regression µ = wT x
Logistic regression log( µ
1−µ) = wT x
Poisson regression log(µ) = wT x
38
fragileInput format
Label Importance [Tag]|namespace Feature . . . | namespace
Feature . . .
namespace = String[:Float]
feature = String[:Float]
Examples:
• 1 | 1:0.01 32:-0.1
• example|namespace normal text features
• 1 3 tag|ad-features ad description |user-features name
address age
39
Input options
• data file -d datafile
• network --daemon --port <port=26542>
• compressed data --compressed
• stdin cat <data> | vw
40
Manipulation options
• ngrams --ngram
• skips --skips
• quadratic interaction -q args. e.g -q ab
• cubic interaction --cubic args. e.g. --cubic ccc
41
Output options
• Examining feature construction --audit
• Generating prediction --predictions or -p
• Unnormalized predictions --raw_predictions
• Testing only --testonly or -t
42
Model options
• Model size --bit_precision or -b . Number of
coefficients limited to 2b
• Update existing model --initial_regressor or -i.
• Final model destination --final_regressor or -f
• Readable model definition --readable_model
• Readable feature values --invert_hash
• Snapshot model every pass --save_per_pass
• Weight initialization
--initial_weight or --random_weights
43
Regression (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/linear-regression
• Linear regression --loss_function square
• Quantile regression
--loss_function quantile --quantile_tau <=0.5>
44
Binary classification (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/classification
• Note: a linear regressor can be used as a classifier as well
• Logistic loss
--loss_function logistic, --link logistic
• Hinge loss (SVM loss function)
--loss_function hinge
• Report binary loss instead of logistic loss --binary
45
Multiclass classification (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/multiclass
• One against all --oaa <k>
• Error correcting tournament --ect <k>
• Online decision trees ---log_multi <k>
• Cost sensitive one-against-all --csoaa <k>
46
LDA options (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/lda
• Number of topics --lda
• Prior on per-document topic weights --lda_alpha
• Prior on topic distributions --lda_rho
• Estimated number of documents --lda_D
• Convergence parameter for topic estimation
--lda_epsilon
• Mini batch size --minibatch
47
Daemon mode (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/daemon-respond
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/daemon-request
• Loads model and answers any prediction request coming
over the network
• Preferred way to deploy a VW model
• Options
• --daemon. Enables demon mode
• --testonly or -t. Does not update the model in
response to requests
• --initial_model or -i. Model to load
• --port <arg>. Port to listen to the request
• --num_children <arg>. Number of threads listening to
request 48
References
Christopher Groskopf and contributors. csvkit, 2016. URL
https://csvkit.readthedocs.org/.
O. Tange. Gnu parallel - the command-line power tool. ;login:
The USENIX Magazine, 36(1):42–47, Feb 2011. doi:
http://dx.doi.org/10.5281/zenodo.16303. URL
http://www.gnu.org/s/parallel.
48

More Related Content

What's hot (20)

PPTX
MongoDB - Aggregation Pipeline
Jason Terpko
 
PDF
Data Processing and Aggregation with MongoDB
MongoDB
 
PDF
Aggregation Framework MongoDB Days Munich
Norberto Leite
 
PDF
Working With a Real-World Dataset in Neo4j: Import and Modeling
Neo4j
 
PDF
MongoDB and Python
Norberto Leite
 
PPTX
MongoDB World 2016 : Advanced Aggregation
Joe Drumgoole
 
KEY
Python Development (MongoSF)
Mike Dirolf
 
PDF
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
PDF
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Henrik Ingo
 
KEY
MongoDB Aggregation Framework
Tyler Brock
 
PDF
On Beyond (PostgreSQL) Data Types
Jonathan Katz
 
PDF
Drill 1.0
MapR Technologies
 
PPTX
Introduction to MongoDB and Hadoop
Steven Francia
 
PPTX
PostgreSQL 9.4 JSON Types and Operators
Nicholas Kiraly
 
PDF
Webinar: Data Processing and Aggregation Options
MongoDB
 
PPTX
MongoDB Aggregation
Amit Ghosh
 
PPTX
Webinar: Exploring the Aggregation Framework
MongoDB
 
PPTX
Conceptos básicos. Seminario web 5: Introducción a Aggregation Framework
MongoDB
 
PDF
Fast track to getting started with DSE Max @ ING
Duyhai Doan
 
PDF
Spark cassandra integration 2016
Duyhai Doan
 
MongoDB - Aggregation Pipeline
Jason Terpko
 
Data Processing and Aggregation with MongoDB
MongoDB
 
Aggregation Framework MongoDB Days Munich
Norberto Leite
 
Working With a Real-World Dataset in Neo4j: Import and Modeling
Neo4j
 
MongoDB and Python
Norberto Leite
 
MongoDB World 2016 : Advanced Aggregation
Joe Drumgoole
 
Python Development (MongoSF)
Mike Dirolf
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Henrik Ingo
 
MongoDB Aggregation Framework
Tyler Brock
 
On Beyond (PostgreSQL) Data Types
Jonathan Katz
 
Introduction to MongoDB and Hadoop
Steven Francia
 
PostgreSQL 9.4 JSON Types and Operators
Nicholas Kiraly
 
Webinar: Data Processing and Aggregation Options
MongoDB
 
MongoDB Aggregation
Amit Ghosh
 
Webinar: Exploring the Aggregation Framework
MongoDB
 
Conceptos básicos. Seminario web 5: Introducción a Aggregation Framework
MongoDB
 
Fast track to getting started with DSE Max @ ING
Duyhai Doan
 
Spark cassandra integration 2016
Duyhai Doan
 

Viewers also liked (20)

PPTX
Make Sense Out of Data with Feature Engineering
DataRobot
 
PDF
Work - LIGHT Ministry
Jeong-Yoon Lee
 
PDF
6 rules of enterprise innovation
HackerEarth
 
PPTX
Leverage Social Media for Employer Brand and Recruiting
HackerEarth
 
PDF
Data Science Competition
Jeong-Yoon Lee
 
PDF
Ethics in Data Science and Machine Learning
HJ van Veen
 
PDF
Doing your first Kaggle (Python for Big Data sets)
Domino Data Lab
 
PPTX
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Spark Summit
 
PDF
Menstrual Health Reader - mEo
HackerEarth
 
PDF
Open Innovation - A Case Study
HackerEarth
 
PPTX
Managing Data Science | Lessons from the Field
Domino Data Lab
 
PPTX
State of women in technical workforce
HackerEarth
 
ODP
Wapid and wobust active online machine leawning with Vowpal Wabbit
Antti Haapala
 
PPTX
How to recruit excellent tech talent
HackerEarth
 
PDF
How hackathons can drive top line revenue growth
HackerEarth
 
PDF
No-Bullshit Data Science
Domino Data Lab
 
PPTX
Kill the wabbit
Joe Kleinwaechter
 
PDF
USC LIGHT Ministry Introduction
Jeong-Yoon Lee
 
PDF
Intra company hackathons using HackerEarth
HackerEarth
 
PPTX
A Panorama of Natural Language Processing
Ted Xiao
 
Make Sense Out of Data with Feature Engineering
DataRobot
 
Work - LIGHT Ministry
Jeong-Yoon Lee
 
6 rules of enterprise innovation
HackerEarth
 
Leverage Social Media for Employer Brand and Recruiting
HackerEarth
 
Data Science Competition
Jeong-Yoon Lee
 
Ethics in Data Science and Machine Learning
HJ van Veen
 
Doing your first Kaggle (Python for Big Data sets)
Domino Data Lab
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Spark Summit
 
Menstrual Health Reader - mEo
HackerEarth
 
Open Innovation - A Case Study
HackerEarth
 
Managing Data Science | Lessons from the Field
Domino Data Lab
 
State of women in technical workforce
HackerEarth
 
Wapid and wobust active online machine leawning with Vowpal Wabbit
Antti Haapala
 
How to recruit excellent tech talent
HackerEarth
 
How hackathons can drive top line revenue growth
HackerEarth
 
No-Bullshit Data Science
Domino Data Lab
 
Kill the wabbit
Joe Kleinwaechter
 
USC LIGHT Ministry Introduction
Jeong-Yoon Lee
 
Intra company hackathons using HackerEarth
HackerEarth
 
A Panorama of Natural Language Processing
Ted Xiao
 
Ad

Similar to Data science at the command line (20)

PDF
Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens
Hakka Labs
 
PPTX
Your data isn't that big @ Big Things Meetup 2016-05-16
Boaz Menuhin
 
PPTX
Web Scraping Basics
Kyle Banerjee
 
PPTX
Nix for etl using scripting to automate data cleaning & transformation
Lynchpin Analytics Consultancy
 
PDF
Data handling in r
Abhik Seal
 
PDF
How researchers and developers can benefit from the command line
Data Science Workshops
 
DOCX
Ravi software faculty
ravikumar4java
 
PDF
Python for Data Analysis Data Wrangling with Pandas NumPy and IPython 1st Edi...
morvehnobbe
 
PDF
Computational Social Science, Lecture 09: Data Wrangling
jakehofman
 
PDF
Linux intro 3 grep + Unix piping
Giovanni Marco Dall'Olio
 
ODP
Love Your Command Line
Liz Henry
 
PDF
Introduction to Linux with Focus on Raspberry Pi
David Horvath
 
PDF
Command line Data Tools
Peter Wang
 
PDF
Linux Command Line - By Ranjan Raja
Ranjan Raja
 
PDF
Wes McKinney - Python for Data Analysis-O'Reilly Media (2012).pdf
Blue Sea
 
PDF
Data analystics with R module 3 cseds vtu
LalithauLali
 
PPTX
Linux tech talk
Prince Raj
 
PDF
Adaptive Query Processing on RAW Data
Manos Karpathiotakis
 
PDF
20180324 leveraging unix tools
David Horvath
 
PDF
One-Liners to Rule Them All
egypt
 
Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens
Hakka Labs
 
Your data isn't that big @ Big Things Meetup 2016-05-16
Boaz Menuhin
 
Web Scraping Basics
Kyle Banerjee
 
Nix for etl using scripting to automate data cleaning & transformation
Lynchpin Analytics Consultancy
 
Data handling in r
Abhik Seal
 
How researchers and developers can benefit from the command line
Data Science Workshops
 
Ravi software faculty
ravikumar4java
 
Python for Data Analysis Data Wrangling with Pandas NumPy and IPython 1st Edi...
morvehnobbe
 
Computational Social Science, Lecture 09: Data Wrangling
jakehofman
 
Linux intro 3 grep + Unix piping
Giovanni Marco Dall'Olio
 
Love Your Command Line
Liz Henry
 
Introduction to Linux with Focus on Raspberry Pi
David Horvath
 
Command line Data Tools
Peter Wang
 
Linux Command Line - By Ranjan Raja
Ranjan Raja
 
Wes McKinney - Python for Data Analysis-O'Reilly Media (2012).pdf
Blue Sea
 
Data analystics with R module 3 cseds vtu
LalithauLali
 
Linux tech talk
Prince Raj
 
Adaptive Query Processing on RAW Data
Manos Karpathiotakis
 
20180324 leveraging unix tools
David Horvath
 
One-Liners to Rule Them All
egypt
 
Ad

Recently uploaded (20)

PDF
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
PPTX
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PDF
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
PDF
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
PDF
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
DOCX
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
PPTX
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
PDF
How do you fast track Agentic automation use cases discovery?
DianaGray10
 
PDF
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
PPTX
Agentforce World Tour Toronto '25 - Supercharge MuleSoft Development with Mod...
Alexandra N. Martinez
 
PDF
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
PPTX
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
PPTX
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PPT
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
PPTX
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
PDF
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
PDF
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
Edge AI and Vision Alliance
 
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
How do you fast track Agentic automation use cases discovery?
DianaGray10
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
Agentforce World Tour Toronto '25 - Supercharge MuleSoft Development with Mod...
Alexandra N. Martinez
 
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
Edge AI and Vision Alliance
 

Data science at the command line

  • 1. Data science at the command line Rapid prototyping and reproducible science Sharat Chikkerur [email protected] Principal Data Scientist Nanigans Inc. 1
  • 2. Outline Introduction Motivation Data science workflow Obtaining data Scrubbing data Exploring data Managing large workflows Modeling using vowpal wabbit 2
  • 4. Setup • Follow instructions at https://github.com/sharatsc/cdse • Install virtualbox virtualbox.org • Install vagrant vagrantup.com • Initialize virtual machine mkdir datascience cd datascience vagrant init data-science-toolbox/data-science-at-the-command-line vagrant up && vagrant ssh 3
  • 5. About me • UB Alumni, 2005 (M.S., EE Dept, cubs.buffalo.edu) • MIT Alumni, 2010 (PhD., EECS Dept, cbcl.mit.edu) • Senior Software Engineer, Google AdWords modeling • Senior Software Engineer, Microsoft Machine learning • Principal data scientist, Nanigans Inc 4
  • 6. About the workshop • Based on a book by Jeroen Janssens 1 • Vowpal wabbit 2 1 http://datascienceatthecommandline.com/ 2 https://github.com/JohnLangford/vowpal_wabbit 5
  • 7. POSIX command line • Exposes operating system functionalities through a shell • Example shells include bash, zsh, tcsh, fish etc. • Comprises of a large number of utility programs • Examples: grep, awk, sed, etc. • GNU user space programs • Common API: Input through stdin and output to stdout • stdin and stdout can be redirected during execution • Pipes (|) allows composition through pipes (chaining). • Allows redirection of output of one command as input to another ls -lh | more cat file | sort | uniq -c 6
  • 8. Why command line ? • REPL allows rapid iteration (through immediate feedback) • Allows composition of scripts and commands using pipes • Automation and scaling • Reproducibility • Extensibility • R, python, perl, ruby scripts can be invoked like command line utilities 7
  • 9. Data science workflow A typical workflow OSEMN model • Obtaining data • Scrubbing data • Exploring data • Modeling data • Interpreting data 8
  • 10. Workshop outline • Background • Getting data: Curl, scrape • Scrubbing: jq, csvkit • Exploring: csvkit • Modeling: vowpal wabbit • Scaling: parallel Hands on exercises • Obtaining data walkthrough • JQ walkthrough • CSVKit walkthrough • Vowpal wabbit 9
  • 11. Workflow example: Boston housing dataset https://github.com/sharatsc/cdse/blob/master/ boston-housing 10
  • 12. Boston housing dataset Python workflow 3 import urllib import pandas as pd # Obtain data urllib.urlretrieve(’https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housin df = pd.read_csv(’boston.csv’) # Scrub data df = df.fillna(0) # Model data from statsmodels import regression from statsmodels.formula import api as smf formula = ’medv~’ + ’ + ’.join(df.columns - [’medv’]) model = smf.ols(formula=formula, data=df) res=model.fit() res.summary() Command line workflow URL="https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv" curl $URL| Rio -e ’model=lm("medv~.", df);model’ 3 https://github.com/sharatsc/cdse/blob/master/boston-housing 11
  • 14. Obtaining data from the web: curl • CURL (curl.haxx.se) • Cross platform command line tool that supports data transfer using HTTP, HTTPS, FTP, IMAP, SCP, SFTP • Supports cookies, user+password authentication • can be used to get data from RESTful APIs 4 • http GET curl http://www.google.com • ftp GET curl ftp://catless.ncl.ac.uk • scp COPY curl -u username: --key key_file --pass password scp://example.com/~/file.txt 4 www.codingpedia.org/ama/how-to-test-a-rest-api-from-command-line-with-curl/ 12
  • 16. Scrubbing web data: Scrape • Scrape is a python command line tool to parse html documents • Queries can be made in CSS selector or XPath syntax htmldoc=$(cat << EOF <div id=a> <a href="x.pdf">x</a> </div> <div id=b> <a href="png.png">y</a> <a href="pdf.pdf">y</a> </div> EOF ) # Select liks that end with pdf and are within div with id=b (Use CSS3 selector) echo $htmldoc | scrape -e "$b a[href$=pdf]" # Select all anchors (use Xpath) echo $htmldoc | scrape -e "//a" <a href="pdf.pdf">y</a> 13
  • 17. CSS selectors .class selects all elements with class=’class’ div p selects all <p> elements inside div elements div > p selects <p> elements where parent is <div> [target=blank] selects all elements with target="blank" [href^=https] selects urls beginning with https [href$=pdf] selects urls ending with pdf More examples at https://www.w3schools.com/cssref/css_selectors.asp 14
  • 18. XPath Query author selects all <author> elements at the current level //author selects all <author> elements at any level //author[@class=’x’] selects <author> elements with class=famous //book//author All <author> elements that are below <book> element" //author/* All children of <author> nodes More examples at https://msdn.microsoft.com/en-us/library/ms256086 15
  • 20. Scrubbing JSON data: JQ 5 • JQ is a portable command line utility to manipulate and filter JSON data • Filters can be defined to access individual fields, transform records or produce derived objets • Filters can be composed and combined • Provides builtin functions an operators Example: curl ’https://api.github.com/repos/stedolan/jq/commits’ | jq ’.[0]’ 5 https://stedolan.github.io/jq/ 17
  • 21. Basic filters # Identity ’.’ echo ’"Hello world"’ | jq ’.’ "Hello world" # Examples ’.foo’, ’.foo.bar’, ’.foo|.bar’, ’.["foo"] echo ’{"foo": 42, "bar": 10}’ | jq ’.foo’ 42 echo ’{"foo": {"bar": 10, "baz": 20}} | jq ’.foo.bar’ 10 #Arrays ’.[]’, ’.[0]’, ’.[1:3]’ echo ’["foo", "bar"]’ | jq ’.[0]’ "foo" echo ’["foo", "bar", "baz"]’ | jq ’.[1:3]’ ["bar", "baz"] # Pipe ’.foo|.bar’, ’.[]|.bar’ echo ’[{"f": 10}, {"f": 20}, {"f": 30}]’ | jq ’.[] | .f 10 20 30 18
  • 22. Object construction Object construction allows you to derive new objects out of existing ones. # Field selection echo {"foo": "F", "bar": "B", "baz": "Z"} | jq ’{"foo": .foo}’ {"foo": "F"} # Array expansion echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{"foo": .foo, "bar": .bar[]}’ {"foo": "F", "bar": "X"} {"foo": "F", "bar": "Y"} # Expression evaluation, key and value can be substituted echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{(.foo): .bar[]}’ {"A": "X"} {"A": "Y"} 19
  • 23. Operators Addition • Numbers are added by normal arithmetic. • Arrays are added by being concatenated into a larger array. • Strings are added by being joined into a larger string. • Objects are added by merging, that is, inserting all the key-value pairs from both objects into a single combined object. # Adding fields echo ’{"foo": 10}’ | jq ’.foo + 1’ 11 # Adding arrays echo ’{"foo": [1,2,3], "bar": [11,12,13]}’ | jq ’.foo + .bar’ [1,2,3,11,12,13] 20
  • 26. CSV • CSV (Comma separated value) is the common demoniator for data exchange • Tabular data with ’,’ as a separator • Can be ingested by R, python, excel etc. • No explicit specification of data types (ARFF supports type annotation) Example state,county,quantity NE,ADAMS,1 NE,BUFFALO,1 NE,THURSTON,1 22
  • 27. CSVKit (Groskopf and contributors [2016]) csvkit 6 is a suite of command line tools for converting and working with CSV, the defacto standard for tabular file formats. Example use cases • Importing data from excel, sql • Select subset of columns • Reorder columns • Mergeing multiple files (row and column wise) • Summary statistics 6 https://csvkit.readthedocs.io/ 23
  • 28. Importing data # Fetch data in XLS format # (LESO) 1033 Program dataset, which describes how surplus military arms have been distributed # This data was widely cited in the aftermath of the Ferguson, Missouri protests. T curl -L https://github.com/sharatsc/cdse/blob/master/csvkit/ne_1033_data.xls?raw=true -o ne_10 # Convert to csv in2csv ne_1033_data.xlxs > data.csv # Inspect the columns csvcut -n data.csv # Inspect the data in specific columns csvcut -c county, quantity data.csv | csvlook 24
  • 29. CSVKit: Examining data csvstat provides a summary view of the data similar to summary() function in R. # Get summary for county, and cost csvcut -c county,acquisition_cost,ship_date data.csv | csvstat 1. county Text Nulls: False Unique values: 35 Max length: 10 5 most frequent values: DOUGLAS: 760 DAKOTA: 42 CASS: 37 HALL: 23 LANCASTER: 18 2. acquisition_cost Number Nulls: False Min: 0.0 Max: 412000.0 Sum: 5430787.55 Mean: 5242.072924710424710424710425 Median: 6000.0 Standard Deviation: 13368.07836799839045093904423 Unique values: 75 5 most frequent values: 6800.0: 304 25
  • 30. CSVKit: searching data csvgrep can be used to search the content of the CSV file. Options include • Exact match -m • Regex match -r • Invert match -i • Search specific columns -c columns csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER | csvlook | county | item_name | total_cost | | --------- | ------------------------------ | ---------- | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | 26
  • 31. CSVKit: Power tools • csvjoin can be used to combine columns from multiple files csvjoin -c join_column data.csv other_data.csv • csvsort can be used to sort the file based on specific columns csvsort -c total_population | csvlook | head • csvstack allows you to merge mutiple files together (row-wise) curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/re in2csv ne_1033_data.xls > ne_1033_data.csv csvstack -g region ne_1033_data.csv ks_1033_data.csv > region.csv 27
  • 32. CSVKit: SQL • csvsql allows queryring against one or more CSV files. • The results of the query can be inserted back into a db Examples • Import from csv into a table # Inserts into a specific table csvsql --db postgresql:///test --table data --insert data.csv # Inserts each file into a separate table csvsql --db postgresql:///test --insert examples/*_tables.csv • Regular SQL query csvsql --query "select count(*) from data" data.csv 28
  • 35. GNU parallel • GNU parallel (Tange [2011]) is a tool for executing jobs in parallel on one or more machines. • It can be used to parallelize across arguments, lines and files. Examples # Parallelize across lines seq 1000 | parallel "echo {}" # Parallelize across file content cat input.csv | parallel -C, "mv {1} {2}" cat input.csv | parallel -C --header "mv {source} {dest}" 30
  • 36. Parallel (cont.) • By default, parallel runs one job per cpu core • Concurrency can be controlled by --jobs or -j option seq 100 | parallel -j2 "echo number: {}" seq 100 | parallel -j200% "echo number: {}" Logging • Output from each parallel job can be captured separately using --results seq 10 | parallel --results data/outdir "echo number: {}" find data/outdir 31
  • 37. Parallel (cont.) • Remote execution parallel --nonall --slf instances hostname # nonall - no argument command to follow # slf - uses ~/.parallel/sshloginfile as the list of sshlogins • Distributing data # Split 1-1000 into sections of 100 and pipe it to remote instances seq 1000 | parallel -N100 --pipe --slf instances "(wc -l)" #transmit, retrieve and cleanup # sends jq to all instances # transmits the input file, retrives the results into {.}csv and cleanup ls *.gz | parallel -v --basefile jq --trc {.} csv 32
  • 39. Overview • Fast, online , scalable learning system • Supports out of core execution with in memory model • Scalable (Terascale) • 1000 nodes (??) • Billions of examples • Trillions of unique features. • Actively developed https://github.com/JohnLangford/vowpal_wabbit 33
  • 40. Swiss army knife of online algorithms • Binary classification • Multiclass classification • Linear regression • Quantile regression • Topic modeling (online LDA) • Structured prediction • Active learning • Recommendation (Matrix factorization) • Contextual bandit learning (explore/exploit algorithms) • Reductions 34
  • 41. Features • Flexible input format: Allows free form text, identifying tags, multiple labels • Speed: Supports online learning • Scalable • Streaming removes row limit. • Hashing removes dimension limit. • Distributed training: models merged using AllReduce operation. • Cross product features: Allows multi-task learning 35
  • 42. Optimization VW solves optimization of the form i l(wT xi; yi) + λR(w) Here, l() is convex, R(w) = λ1|w| + λ2||w||2. VW support a variety of loss function Linear regression (y − wT x)2 Logistic regression log(1 + exp(−ywT x)) SVM regression max(0, 1 − ywT x) Quantile regression τ(wT x − y) ∗ I(y < wT x) + (1 − τ)(y − wT x)I 36
  • 43. Detour: Feature hashing • Feature hashing can be used to reduce dimensions of sparse features. • Unlike random projections ? , retains sparsity • Preserves dot products (random projection preserves distances). • Model can fit in memory. • Unsigned ? Consider a hash function h(x) : [0 . . . N] → [0 . . . m], m << N. φi(x) = j:h(j)=i xj • Signed ? Consider additionaly a hash function ξ(x) : [0 . . . N] → {1, −1}. φi(x) = ξ(j)xj 37
  • 44. Detour: Generalized linear models A generalized linear predictor specifies • A linear predictor of the form η(x) = wT x • A mean estimate µ • A link function g(µ) such that g(µ) = η(x) that relates the mean estimate to the linear predictor. This framework supports a variety of regression problems Linear regression µ = wT x Logistic regression log( µ 1−µ) = wT x Poisson regression log(µ) = wT x 38
  • 45. fragileInput format Label Importance [Tag]|namespace Feature . . . | namespace Feature . . . namespace = String[:Float] feature = String[:Float] Examples: • 1 | 1:0.01 32:-0.1 • example|namespace normal text features • 1 3 tag|ad-features ad description |user-features name address age 39
  • 46. Input options • data file -d datafile • network --daemon --port <port=26542> • compressed data --compressed • stdin cat <data> | vw 40
  • 47. Manipulation options • ngrams --ngram • skips --skips • quadratic interaction -q args. e.g -q ab • cubic interaction --cubic args. e.g. --cubic ccc 41
  • 48. Output options • Examining feature construction --audit • Generating prediction --predictions or -p • Unnormalized predictions --raw_predictions • Testing only --testonly or -t 42
  • 49. Model options • Model size --bit_precision or -b . Number of coefficients limited to 2b • Update existing model --initial_regressor or -i. • Final model destination --final_regressor or -f • Readable model definition --readable_model • Readable feature values --invert_hash • Snapshot model every pass --save_per_pass • Weight initialization --initial_weight or --random_weights 43
  • 50. Regression (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/linear-regression • Linear regression --loss_function square • Quantile regression --loss_function quantile --quantile_tau <=0.5> 44
  • 51. Binary classification (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/classification • Note: a linear regressor can be used as a classifier as well • Logistic loss --loss_function logistic, --link logistic • Hinge loss (SVM loss function) --loss_function hinge • Report binary loss instead of logistic loss --binary 45
  • 52. Multiclass classification (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/multiclass • One against all --oaa <k> • Error correcting tournament --ect <k> • Online decision trees ---log_multi <k> • Cost sensitive one-against-all --csoaa <k> 46
  • 53. LDA options (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/lda • Number of topics --lda • Prior on per-document topic weights --lda_alpha • Prior on topic distributions --lda_rho • Estimated number of documents --lda_D • Convergence parameter for topic estimation --lda_epsilon • Mini batch size --minibatch 47
  • 54. Daemon mode (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/daemon-respond https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/daemon-request • Loads model and answers any prediction request coming over the network • Preferred way to deploy a VW model • Options • --daemon. Enables demon mode • --testonly or -t. Does not update the model in response to requests • --initial_model or -i. Model to load • --port <arg>. Port to listen to the request • --num_children <arg>. Number of threads listening to request 48
  • 56. Christopher Groskopf and contributors. csvkit, 2016. URL https://csvkit.readthedocs.org/. O. Tange. Gnu parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42–47, Feb 2011. doi: http://dx.doi.org/10.5281/zenodo.16303. URL http://www.gnu.org/s/parallel. 48