Scaling Data Science
with dgit
Dr. Venkata Pingali
Founder, Scribble Data
pingali@scribbledata.io
https://github.com/pingali
Summary
1. Scaling impact of data science requires increasing trust and efficiency
a. Trust requires auditability and reproducibility of results
b. Efficiency requires standardization and automation
2. Dataset is a fundamental abstraction of data science
3. dgit enables git-like management of datasets
a. Python package, open source, MIT license
b. Familiar git interface with modifications
4. Call to collaborate
dgit - 1 min summary
dgit - git wrapper for datasets
1. Python package, MIT license
2. Application of git
3. Beyond git - “Understands” data
a. Metadata generation and management
b. Automatic scanning of working directory for changes
c. Automatic validation and materialization
d. Dependency tracking across repos
e. Automatic audit trails with execution
f. Pipeline support
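As a rough illustration of what "automatic scanning of working directory for changes" could involve, the sketch below hashes every file in a dataset directory and compares two snapshots. It is not dgit's implementation or API; the function names (`snapshot`, `diff_snapshots`, `demo`) and the sample files are hypothetical and use only the Python standard library.

```python
import hashlib
import os
import tempfile


def snapshot(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def diff_snapshots(old, new):
    """Classify paths as added, removed, or modified between two snapshots."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }


def demo():
    """Create a throwaway dataset directory, change it, and report the diff."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical dataset file, used only for illustration.
        with open(os.path.join(root, "sales.csv"), "w") as handle:
            handle.write("region,amount\nnorth,10\n")
        before = snapshot(root)
        with open(os.path.join(root, "sales.csv"), "a") as handle:
            handle.write("south,20\n")
        with open(os.path.join(root, "notes.txt"), "w") as handle:
            handle.write("added south region\n")
        return diff_snapshots(before, snapshot(root))
```

Storing the snapshot alongside the dataset is one simple way to keep metadata that a later scan can be compared against.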
Growing Pains in Data Science
Anonymized Random Slide from an Actual Presentation
Implication: Large wasted spend, poor production design, baseline worsening
Decision-maker Questions
1. Where did the numbers come from? (Correctness, Lineage)
a. Assumption, models, datasets
2. Is this an accident? Does it hold now? (Reproducibility, Retargetability)
a. Model, dataset, and question revisions
3. Can you get the results faster? (Efficiency)
a. Time, effort, cost
4. Can you also analyze X? (Extensibility)
a. Different dataset, question
5. Could we try X? (Dataset generation - synthetic and real)
a. What if scenarios, field experiments
Conceptual Process
[Diagram: process flow among Biz, Analytics Team, and Data Engg, with edges labeled Qtns/Context, Data Req, Datasets, Model Results, and Story Telling]
All three roles could be in a single team!
Business Complexity is Discovered Over Time
Incomplete context (history, semantics)
Qtns not thought through
Continuous revisions
[Diagram: process flow among Biz, Analytics Team, and Data Engg, with edges labeled Qtns/Context, Data Req, Datasets, Model Results, and Story Telling]
Imperfect Data Queries due to Limited Understanding
Dependencies not specified
Wrong filters
Known outliers
Narrow specification (cubes)
[Diagram: process flow among Biz, Analytics Team, and Data Engg, with edges labeled Qtns/Context, Data Req, Datasets, Model Results, and Story Telling]
Weak process
Lack of protocol (email/files)
Missing validation checks
No lineage
No revisions
[Diagram: process flow among Biz, Analytics Team, and Data Engg, with edges labeled Qtns/Context, Data Req, Datasets, Model Results, and Story Telling]
Eagerness to Present Great Narratives
Wrong input dataset
Mistakes in pipeline
Excel/adhoc transformations
Model evolution
Continuous revision of narratives
Missing interpretation integrity checks (e.g. other time windows)
Better methodology
[Diagram: process flow among Biz, Analytics Team, and Data Engg, with edges labeled Qtns/Context, Data Req, Datasets, Model Results, and Story Telling]
Process in Reality
[Diagram: process flow among Biz, Analytics Team, and Data Engg, with edges labeled Qtns/Context, Data Req, Datasets, Model Results, and Story Telling]
Iterative
Expensive
Laborious
Actual Process
[Diagram: process flow among Biz, Analytics Team, and Data Engg, with edges labeled Qtns/Context, Data Req, Datasets, Model Results, and Story Telling]
Iterative
Expensive
Laborious
http://fortune.com/2016/02/05/why-big-data-isnt-paying-off-for-companies-yet/
"80% of .. companies strategic decision go haywire.. “flawed” data"
Desired State
1. Trusted
a. Every model should be auditable to the last record and step ⬅
b. Every model should be reproducible with zero human intervention ⬅
c. Enables use and development of mathematical judgment
2. Scalable
a. Highly automated through most of the lifecycle ⬅
b. Continuous reduction in costs ⬅
c. Grow sublinearly with questions, datasets, models
3. Robust
a. Younger, inexperienced staff ⬅
b. Weak processes
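One way to picture "auditable to the last record and step" and "reproducible with zero human intervention" is a pipeline runner that records a fingerprint of every step's input and output. The following is a minimal sketch under that assumption, not how dgit records audit trails; the step functions and record fields are hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 digest of rows serialized as canonical JSON."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_with_audit(step_name, func, rows, trail):
    """Run one pipeline step and append an audit record to trail."""
    result = func(rows)
    trail.append({
        "step": step_name,
        "input_sha256": fingerprint(rows),
        "output_sha256": fingerprint(result),
        "input_rows": len(rows),
        "output_rows": len(result),
    })
    return result


def drop_missing(rows):
    """Hypothetical cleaning step: discard rows without an amount."""
    return [row for row in rows if row["amount"] is not None]


def double_amount(rows):
    """Hypothetical transformation step: double every amount."""
    return [{**row, "amount": row["amount"] * 2} for row in rows]


def run_pipeline(rows):
    """Run both steps in order and return (final rows, audit trail)."""
    trail = []
    cleaned = run_with_audit("drop_missing", drop_missing, rows, trail)
    final = run_with_audit("double_amount", double_amount, cleaned, trail)
    return final, trail
```

Because each record's output digest equals the next record's input digest, a reviewer can check that no undocumented manual edit happened between steps, and rerunning the pipeline on the same input should yield an identical trail.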
Process with Dataset Repository
[Diagram: Biz, Analytics Team, and Data Engg working through a shared dataset repository with server-side CI]
Server Side CI: Dataset Rules, Evaluation Rules, Dependencies
Dataset version labels: v1, v2, v3, v4, v5, v6
Stage labels: Materialize, Materialized dataset, Model Pipeline, Pipeline Execution, Evaluation, Interpretation
Other labels: Slide Content, URN, Context, Questions
Dataset as mutable object with memory
No emails/google docs
Continuous validation by third party (server)
Separate model development and evaluation
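The "Dataset Rules" and "continuous validation" ideas on this slide can be sketched as a small rule checker that a server could run on every new dataset version. This is an illustrative assumption about what such rules might look like, not dgit's validator interface; the rule list and the column names `region` and `amount` are hypothetical.

```python
import csv
import io

# Hypothetical rule set: each rule is (description, predicate over one row).
RULES = [
    ("amount is a non-negative number", lambda row: float(row["amount"]) >= 0),
    ("region is not empty", lambda row: row["region"].strip() != ""),
]


def validate(csv_text, rules=RULES):
    """Return (line_number, rule description) for every rule violation."""
    failures = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for description, predicate in rules:
            try:
                ok = predicate(row)
            except (KeyError, ValueError, TypeError, AttributeError):
                ok = False  # missing or unparseable values count as violations
            if not ok:
                failures.append((line_number, description))
    return failures
```

An empty result means the version passes; a non-empty one gives the exact lines and rules to report back to whoever committed the dataset.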
dgit
Dgit Structure
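[Diagram: dgit CLI and dgitcore API with plugin components]
Components: dgit CLI, dgitcore API, Repo Mgr, Backend, Validator, Generator, Instrumentation, Metadata
Plugin labels: Git, S3, MySQL, S3, Regression, Content, Platform, Basic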
Demo Goals
1. Show end-to-end example (command line)
a. Simple regression
2. Explain structure
3. Advanced features
a. Validation (regression quality plugin)
b. Generator (SQL)
c. Pipeline (Dora)
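To make the "simple regression" and "regression quality" demo items concrete, here is a minimal sketch of a quality check: fit a one-variable least-squares line and accept the result only if R² clears a threshold. It is not the dgit regression quality plugin; the function names and the `min_r2` default of 0.9 are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def passes_quality_check(xs, ys, min_r2=0.9):
    """Return True when the fitted line explains at least min_r2 of variance."""
    slope, intercept = fit_line(xs, ys)
    return r_squared(xs, ys, slope, intercept) >= min_r2
```

A check like this could run automatically whenever the input dataset changes, so a regression that stops fitting is flagged before it reaches a slide.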
Open Tasks
1. Dgit specific
a. Cleanup and stabilization
i. Python v2/3 compatibility
ii. Plugins for various tasks (anonymization, Hive, etc.)
b. Testing infrastructure
c. Integration
i. Windows and MacOS support
ii. Support for instabase/dat/other services
2. Ideas for new tools to reduce cost and complexity of data science
Speaker
Dr. Venkata Pingali
Founder, Scribble Data
Former-VP Analytics, FourthLion
IIT(B) PhD (USC)
http://linkedin.com/in/pingali


R meetup talk scaling data science with dgit
