R meetup talk scaling data science with dgit

Scaling Data Science with dgit
Dr. Venkata Pingali
Founder, Scribble Data
pingali@scribbledata.io
https://github.com/pingali

Summary
1. Scaling impact of data science requires increasing trust and efficiency
a. Trust requires auditability and reproducibility of results
b. Efficiency requires standardization and automation
2. Dataset is a fundamental abstraction of data science
3. dgit enables git-like management of datasets
a. Python package, open source, MIT licence
b. Familiar git interface with modifications
4. Call to collaborate

dgit - git wrapper for datasets
1. Python package, MIT license
2. Application of git
3. Beyond git - “Understands” data
a. Metadata generation and management
b. Automatic scanning of working directory for changes
c. Automatic validation and materialization
d. Dependency tracking across repos
e. Automatic audit trails with execution
f. Pipeline support
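
The "automatic scanning of working directory for changes" item can be illustrated with content hashing. This is a minimal sketch, not dgit's actual implementation: the function names `snapshot` and `diff_snapshots` are hypothetical, and it assumes datasets are plain files small enough to read into memory.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file's relative path to the SHA-256 of its content."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_snapshots(old, new):
    """Classify files as added, removed, or modified between two snapshots."""
    common = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(k for k in common if old[k] != new[k]),
    }
```

Taking a snapshot before and after an analysis step and diffing the two gives the list of dataset files that the step touched, which is the raw material for metadata and audit entries.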

Anonymized Random Slide from an Actual Presentation
Implication: Large wasted spend, poor production design, baseline worsening

Decision-maker Questions
1. Where did the numbers come from? (Correctness, Lineage)
a. Assumption, models, datasets
2. Is this an accident? Does it hold now? (Reproducibility, Retargetability)
a. Model, dataset, and question revisions
3. Can you get the results faster? (Efficiency)
a. Time, effort, cost
4. Can you also analyze X? (Extensibility)
a. Different dataset, question
5. Could we try X? (Dataset generation - synthetic and real)
a. What if scenarios, field experiments

Conceptual Process
[Diagram: three roles (Biz, Analytics Team, Data Engg) exchanging Qtns and Context, Data Req, Datasets, Model Results, and Story Telling]
All three roles could be in a single team!

Business Complexity is Discovered Over Time
Incomplete context (history, semantics)
Questions not thought through
Continuous revisions
[Diagram: three roles (Biz, Analytics Team, Data Engg) exchanging Qtns and Context, Data Req, Datasets, Model Results, and Story Telling]

Imperfect Data Queries due to Limited Understanding
Dependencies not specified
Wrong filters
Known outliers
Narrow specification (cubes)
[Diagram: three roles (Biz, Analytics Team, Data Engg) exchanging Qtns and Context, Data Req, Datasets, Model Results, and Story Telling]

Weak process
Lack of protocol (email/files)
Missing validation checks
No lineage
No revisions
[Diagram: three roles (Biz, Analytics Team, Data Engg) exchanging Qtns and Context, Data Req, Datasets, Model Results, and Story Telling]

Eagerness to Present Great Narratives
Wrong input dataset
Mistakes in pipeline
Excel/ad hoc transformations
Model evolution
Continuous revision of narratives
Missing interpretation integrity checks (e.g. other time windows)
Better methodology
[Diagram: three roles (Biz, Analytics Team, Data Engg) exchanging Qtns and Context, Data Req, Datasets, Model Results, and Story Telling]

Process in Reality
[Diagram: three roles (Biz, Analytics Team, Data Engg) exchanging Qtns and Context, Data Req, Datasets, Model Results, and Story Telling]
Iterative
Expensive
Laborious

Actual Process
[Diagram: three roles (Biz, Analytics Team, Data Engg) exchanging Qtns and Context, Data Req, Datasets, Model Results, and Story Telling]
Iterative
Expensive
Laborious
http://fortune.com/2016/02/05/why-big-data-isnt-paying-off-for-companies-yet/
"80% of .. companies strategic decision go haywire .. “flawed” data"

Desired State
1. Trusted
a. Every model should be auditable to the last record and step ⬅
b. Every model should be reproducible with zero human intervention ⬅
c. Enables use and development of mathematical judgment
2. Scalable
a. Highly automated through most of the lifecycle ⬅
b. Continuous reduction in costs ⬅
c. Grow sublinearly with questions, datasets, models
3. Robust to
a. Younger, inexperienced staff ⬅
b. Weak processes
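
To make "auditable to the last record and step" concrete, one option is to record a digest of every step's input and output as the pipeline runs. The sketch below is illustrative only and does not reflect how dgit stores its audit trail; `digest`, `run_step`, and the sample rows are hypothetical, and it assumes the data is JSON-serializable.

```python
import hashlib
import json
from datetime import datetime, timezone


def digest(obj):
    """Stable SHA-256 of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_step(name, func, data, trail):
    """Run one pipeline step and append an audit entry to `trail`."""
    result = func(data)
    trail.append({
        "step": name,
        "input_sha256": digest(data),
        "output_sha256": digest(result),
        "finished_at": datetime.now(timezone.utc).isoformat(),
    })
    return result


# Hypothetical two-step pipeline: drop incomplete rows, then aggregate.
trail = []
rows = [{"x": 1, "y": 2.0}, {"x": 2, "y": 4.0}, {"x": 3, "y": None}]
clean = run_step("drop_missing",
                 lambda rs: [r for r in rs if r["y"] is not None],
                 rows, trail)
total = run_step("sum_y", lambda rs: sum(r["y"] for r in rs), clean, trail)
```

Because each step's output digest equals the next step's input digest, a reviewer can check that the chain is unbroken, and rerunning the pipeline on the same input should reproduce the same digests.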

Process with Dataset Repository
[Diagram: Biz, Analytics Team, Data Engg, and Server Side CI around a dataset repository. Labels: Context, Questions; Dataset Rules; Evaluation Rules; Dependencies; Materialize; Materialized dataset; Model Pipeline; Pipeline Execution; Evaluation; Interpretation; Slide Content; URN; versions v1 to v6]
Dataset as mutable object with memory
No emails/Google Docs
Continuous validation by third party (server)
Separate model development and evaluation
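
The "Dataset Rules" checked by a server-side process could be expressed as named predicates over records. This is a hedged sketch of the idea, not dgit's rule format: the `RULES` list, the field names `age` and `country`, and `check_dataset` are all hypothetical.

```python
# Each rule is a (description, predicate) pair; the fields are made up.
RULES = [
    ("age is present", lambda r: r.get("age") is not None),
    ("age in 0..120", lambda r: r.get("age") is None or 0 <= r["age"] <= 120),
    ("country is 2 letters",
     lambda r: isinstance(r.get("country"), str) and len(r["country"]) == 2),
]


def check_dataset(records, rules):
    """Return (row_index, rule_name) for every rule a record violates."""
    violations = []
    for i, record in enumerate(records):
        for name, predicate in rules:
            if not predicate(record):
                violations.append((i, name))
    return violations
```

Running such checks on every new dataset version, on a machine the analyst does not control, is one way to get the continuous third-party validation described on the slide.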

dgit Structure
[Diagram labels: dgit CLI; dgitcore API; Repo Mgr, Backend, Validator, Generator, Instrumentation, Metadata; Git, S3, MySQL, S3, Regression, Content, Platform, Basic]

Demo Goals
1. Show end-to-end example (command line)
a. Simple regression
2. Explain structure
3. Advanced features
a. Validation (regression quality plugin)
b. Generator (SQL)
c. Pipeline (Dora)
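
A regression quality check of the kind listed under validation can be approximated with an ordinary least squares fit and an R² threshold. The sketch below is an assumption about what such a check might compute, not the dgit plugin itself; `validate_regression` and the `min_r2` default of 0.9 are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination for the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def validate_regression(xs, ys, min_r2=0.9):
    """Fit a line and report whether R^2 meets the (assumed) threshold."""
    slope, intercept = fit_line(xs, ys)
    r2 = r_squared(xs, ys, slope, intercept)
    return {"slope": slope, "intercept": intercept,
            "r2": r2, "passed": r2 >= min_r2}
```

For example, `validate_regression([1, 2, 3, 4], [2, 4, 6, 8])` fits the data exactly and passes, while a scattered series falls below the threshold and is flagged.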

Open Tasks
1. dgit-specific
a. Cleanup and stabilization
i. Python v2/3 compatibility
ii. Plugins to do various tasks (anonymization, Hive, etc.)
b. Testing infrastructure
c. Integration
i. Windows and macOS support
ii. Support for instabase/dat/other services
2. Ideas for new tools to reduce cost and complexity of data science

Speaker
Dr. Venkata Pingali
Founder, Scribble Data
Former VP Analytics, FourthLion
IIT(B), PhD (USC)
http://linkedin.com/in/pingali
