Making Data Science Work
By
Dr. Kuldeep Deshpande
Saumitra Modak
10x increase in data science jobs!
Out of all Data Science Projects… only a few succeed.
Gartner says 80% of analytics insights will not deliver business outcomes through 2022 and 80% of AI projects will “remain alchemy, run by wizards” through 2020.
Is DS = BS* ?
It is!
Data Science = Business Sense
Why Do Data Science Projects Fail?
Initiation
• Inadequate research
• Starting with the wrong questions
• Not addressing the root cause
• Initiating a data science project because of a blog post
• Trying to take on too large a first project
People and Process
• Lack of diverse Subject Matter Experts
• Lacking an experienced data science leader
• Limited business understanding
• Lack of a standardized data science process
• Failing to communicate the value of data science
Solution Design
• The solutions are too complex
• Forming conclusions before data scientists start
• Poorly designed models
• Failing to provide actionable insights
• Using technologies because they are cool
Data Access
• Lack of access to data
• Using faulty / bad data
• Having data scientists build their own ETLs
• Relying on Excel as the main data storage
• Big data silos or vendor-owned data!
Data Fallacies
• Simpson’s Paradox
• Setting wrong performance measures
• McNamara Fallacy
• Overfitting
• Data Dredging
Availability of right data
More Data Beats A Cleverer Algorithm!
As a rule of thumb, you need about 10 times as many examples as there are degrees of freedom in the model.
What model should I use? How much training data should I gather?
How much data is required for algorithms? IT DEPENDS!
• More features may help compensate for having less data.
• Correctly tagged data may be more useful than a large untagged dataset.
Small Data problems
• Over-fitting becomes much harder to avoid.
• Outliers become much more dangerous.
• Noise becomes a real issue!
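The rule of thumb above (about 10 examples per degree of freedom) is simple arithmetic, and a rough sketch makes it concrete. The function names below are hypothetical, and counting one weight per feature plus an intercept is an assumption that only holds for a plain linear model; it is a starting estimate, not a guarantee.

```python
def linear_model_dof(n_features):
    """Degrees of freedom of a plain linear model (assumption):
    one weight per feature plus one intercept."""
    return n_features + 1


def min_examples(degrees_of_freedom, factor=10):
    """Rule-of-thumb lower bound on training examples."""
    return factor * degrees_of_freedom


# Example: a linear model with 20 input features.
dof = linear_model_dof(20)
needed = min_examples(dof)
print(f"{dof} degrees of freedom -> at least ~{needed} examples")
```

As the slide says, the real answer still depends on the problem: noisy or poorly tagged data can push the requirement well above this estimate.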
Setting The Right Performance Measure
Regression
o MSPE
o MSAE
o R Square
o Adjusted R Square
Classification
o Precision-Recall
o ROC-AUC
o Accuracy
o Log-Loss
Unsupervised Models
o Rand Index
o Mutual Information
Others
o CV Error
o Heuristic methods to find K
o BLEU Score (NLP)
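Why the choice of measure matters can be shown with a small sketch of the classification measures listed above. The data is made up for illustration: 5 positive cases out of 100, and a model that never predicts a positive. Returning 0.0 when a ratio is undefined is a convention chosen here, not a standard.

```python
def classification_metrics(actual, predicted):
    """Accuracy, precision and recall from two lists of booleans."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)          # true positives
    fp = sum((not a) and p for a, p in pairs)    # false positives
    fn = sum(a and (not p) for a, p in pairs)    # false negatives
    tn = sum((not a) and (not p) for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # 0.0 when undefined (no predicted / no actual positives).
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative, imbalanced data: 5 positives among 100 cases.
actual = [True] * 5 + [False] * 95
predicted = [False] * 100  # a "model" that always says no

metrics = classification_metrics(actual, predicted)
print(metrics)
```

Accuracy looks excellent here even though the model finds none of the positive cases, which is why accuracy alone is a poor measure on imbalanced data and why recall is reported alongside it.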
Classification of legal documents using Data Science
Availability of right data | Right Performance Measure
Objective
• Automated classification of legal documents from 3000+ government web pages
Data Used
• Unstructured data
• Training data: more than 90,000 manually classified documents
Result
• 95% overall accuracy
• More than 85% recall
Algorithms And Technologies
• Naïve Bayes
• SVM
• Rocchio
• Technologies: Python and Redis
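To give a feel for one of the algorithms named above, here is a minimal sketch of a Rocchio-style (nearest centroid) text classifier. It is not the project's implementation: the plain word-count vectors, the cosine similarity, the tiny training set and the labels "tax" and "labor" are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts, scaled to unit length."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}


def train_centroids(examples):
    """Average the document vectors of each label into one centroid."""
    sums = defaultdict(Counter)
    sizes = Counter()
    for text, label in examples:
        sums[label].update(vectorize(text))  # adds the term weights
        sizes[label] += 1
    return {label: {t: w / sizes[label] for t, w in vec.items()}
            for label, vec in sums.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(text, centroids):
    """Pick the label whose centroid is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical, hand-made training examples.
examples = [
    ("income tax return filing deadline", "tax"),
    ("tax rate on corporate income", "tax"),
    ("minimum wage and overtime rules", "labor"),
    ("employee wage dispute and overtime pay", "labor"),
]
centroids = train_centroids(examples)
print(classify("corporate tax filing", centroids))
print(classify("overtime wage claim", centroids))
```

A real system would typically use TF-IDF weights and far more training data, as in the 90,000 manually classified documents mentioned above.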
Milking the bull – When data science is going to fail!
Beware of data issues that can lead to failure
• Images and text with very few features
• Inaccurately tagged data
• Low quantity of data
• Absence of actual predictors
• Time series data without features
Overfitting
The model does not categorize new data correctly because it has picked up too much detail and noise from the training data.
A statistical model is said to be overfitted when it fits its training data too closely. When a model is too flexible for the amount of data available, it starts learning from the noise and inaccurate data entries instead of the underlying pattern, and its predictions fail to generalize.
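The effect described above can be reproduced with a small sketch. Everything here is synthetic and chosen for illustration: the true rule is "x > 0.5", every fifth training label is deliberately wrong, a 1-nearest-neighbour model stands in for an overly flexible model, and a single learned threshold stands in for a simple one.

```python
def make_data(n=40, noise_every=5):
    """Synthetic data: label is x > 0.5, with some training labels flipped."""
    train = []
    for i in range(n):
        x = i / n
        label = x > 0.5
        if i % noise_every == 0:
            label = not label  # simulate an inaccurate data entry
        train.append((x, label))
    # Test points sit between the training points and carry clean labels.
    test = [((i + 0.25) / n, (i + 0.25) / n > 0.5) for i in range(n)]
    return train, test


def nearest_neighbour(train):
    """Flexible model: copy the label of the closest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def best_threshold(train):
    """Simple model: one cut-off chosen to fit the training data best."""
    def hits(t):
        return sum((x > t) == y for x, y in train)
    t = max((x for x, _ in train), key=hits)
    return lambda x: x > t


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


train, test = make_data()
flexible = nearest_neighbour(train)
simple = best_threshold(train)

print("flexible: train", accuracy(flexible, train), "test", accuracy(flexible, test))
print("simple:   train", accuracy(simple, train), "test", accuracy(simple, test))
```

The flexible model reproduces every training label, including the wrong ones, and then repeats those mistakes on new data. The simple model scores lower on the noisy training set but does better on the clean test set.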
Production forecasting for manufacturers
Milking The Bull | Overfitting
Objective
• Predicting daily production for a manufacturing company
Data Used
• 6 months of hourly production numbers
• Monthly targets not available
• Labor details unreliable
• Machine details not available
Result
• Patterns found in subsets of data not generalizing
• No useful prediction
Algorithms And Technologies
• R
• ARIMA, Linear Regression, CatBoost, and Random Forests with feature engineering
How companies use hammers to kill mosquitos!
Sexy technologies don’t guarantee success, appropriate technologies do!
Data must be clean or there should be a way to clean data.
Better Data > Efficient Algorithms
Domain understanding
Data Owner interactions
Acceptance of missing values
Data Profiling automation
• Remove Unwanted observations
• Fix Structural Errors
• Filter Unwanted Outliers
• Handle Missing Data
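The four cleaning steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout (`region`, `amount`), the valid range and the choice of median imputation are hypothetical, and in practice each of these decisions should come from the domain experts and data owners mentioned above.

```python
from statistics import median


def clean_records(records, valid_range):
    """Apply the four cleaning steps to a list of dicts
    with keys 'region' (str) and 'amount' (float or None)."""
    low, high = valid_range
    seen = set()
    cleaned = []
    for rec in records:
        # Fix structural errors: inconsistent case and stray whitespace.
        region = rec["region"].strip().title()
        amount = rec["amount"]
        # Filter unwanted outliers: values outside an agreed valid range.
        if amount is not None and not (low <= amount <= high):
            continue
        # Remove unwanted observations: duplicates after normalization.
        key = (region, amount)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"region": region, "amount": amount})
    # Handle missing data: fill gaps with the median of observed values.
    observed = [r["amount"] for r in cleaned if r["amount"] is not None]
    fill = median(observed) if observed else None
    for r in cleaned:
        if r["amount"] is None:
            r["amount"] = fill
    return cleaned


raw = [
    {"region": " north ", "amount": 120.0},
    {"region": "North", "amount": 120.0},    # duplicate once normalized
    {"region": "south", "amount": None},     # missing value
    {"region": "SOUTH", "amount": 95000.0},  # outside the valid range
    {"region": "east", "amount": 80.0},
]
result = clean_records(raw, valid_range=(0, 1000))
print(result)
```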
Customer Churn Prediction For Specialty Insurance
Appropriate Technologies | Clean Data
Objective
• Predict likelihood of churn for each policy
• Get maximum churn detection rate while keeping false alarms lower than 20%
Data Used
• Data understanding with domain experts
• Data cleaning and exploration for insights and predictor identification
Result
• Able to achieve 75% churn detection rate while keeping the false alarm rate less than 20%
• Timely prediction of churn helping in taking retention action
Algorithms And Technologies
• R
• Random Forest, Logistic Regression, Gradient Boosting and CatBoost
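The objective above (maximum detection rate with false alarms under 20%) amounts to choosing a cut-off on the model's churn score. A minimal sketch follows, written in Python rather than the R used in the project. The scores and labels are invented, `pick_threshold` is a hypothetical helper, and "false alarm rate" is assumed to mean false positives divided by all policies that did not churn.

```python
def pick_threshold(scores, churned, max_false_alarm=0.20):
    """Return (threshold, detection_rate, false_alarm_rate) with the
    highest detection rate whose false alarm rate stays below the cap."""
    positives = sum(churned)
    negatives = len(churned) - positives
    best = None
    for t in sorted(set(scores)):
        flagged = [s >= t for s in scores]
        tp = sum(f and y for f, y in zip(flagged, churned))
        fp = sum(f and not y for f, y in zip(flagged, churned))
        detection = tp / positives
        false_alarm = fp / negatives
        if false_alarm < max_false_alarm and (best is None or detection > best[1]):
            best = (t, detection, false_alarm)
    return best


# Invented churn scores for 10 policies and whether each actually churned.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
churned = [True, True, False, True, False, True, False, False, False, False]

threshold, detection, false_alarm = pick_threshold(scores, churned)
print(threshold, detection, round(false_alarm, 3))
```

Lowering the threshold further would catch the last churner but push false alarms past the cap, which is the trade-off the business target encodes.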
Simpson’s Paradox
A phenomenon in which a trend appears in different groups of data but disappears or reverses when the groups are combined.
Admission to UC Berkeley
• Taken as a whole, the admission process seems significantly biased against women.
• But examined department by department, the apparent bias largely disappears, and several departments are significantly biased against men.
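The reversal is easy to reproduce with made-up numbers. The figures below are hypothetical and are not the actual UC Berkeley data; they are chosen so that women have the higher admission rate in each department yet the lower rate overall, because most women apply to the more selective department.

```python
# Hypothetical (admitted, applied) counts per department and group.
admissions = {
    "Dept A": {"men": (80, 100), "women": (18, 20)},
    "Dept B": {"men": (4, 20), "women": (30, 100)},
}


def rate(admitted, applied):
    return admitted / applied


def pooled_rate(group):
    """Admission rate for a group with all departments combined."""
    admitted = sum(dept[group][0] for dept in admissions.values())
    applied = sum(dept[group][1] for dept in admissions.values())
    return admitted / applied


for name, dept in admissions.items():
    print(name, "men", rate(*dept["men"]), "women", rate(*dept["women"]))
print("Pooled", "men", pooled_rate("men"), "women", pooled_rate("women"))
```

Comparing only the pooled rates would suggest a bias against women, while the per-department rates point the other way.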
McNamara Fallacy
Making a decision based solely on quantitative observations and ignoring all others.
• Measure whatever can be easily measured.
• Disregard that which cannot be measured easily.
• Presume that which cannot be measured easily is not important.
Web Traffic Measurement
• Let us assume a company has developed a new e-commerce website.
• After the new website launches, site visits are up 50% and the number of newsletter subscriptions is up 25%.
But what if the percentage of people who never open their emails, or who unsubscribe immediately, has increased?
Biotech Innovation Efficiency Analytics
Simpson’s Paradox | McNamara Fallacy | Right Performance Measure | Domain Understanding
Objective
• Analyze the potential of early stage biotech firms by analyzing their innovation efficiency.
• Find out the statistical correlation of a company's performance with its innovation efficiency.
Data Used
• Clinical trials
• Press releases
• Stock details
• Patents
• Publications
• Company financials
Result
• Statistical correlation between financial performance and innovation efficiency for a certain category of companies
Algorithms And Technologies
• Random Forest
• Decision Tree
• Neural Networks
• Regression

Ellicium Solutions - Making Data Science Work
