Data Science
A practitioner’s perspective
Amir Ziai
@amirziai
Who am I?
● Data Scientist at ZEFR, ad tech, LA
● Previously worked in healthcare, SaaS, and finance
Agenda
● Data Science
● My perspective
○ Problems
○ Pitfalls
○ Minimum skills
○ How to build your skills
● Resources
Data Science, a short history
● 1960, Peter Naur used “data science” as a substitute for computer science
● 1997, Jeff Wu gave the “Statistics = Data Science?” lecture
● 2008, DJ Patil and Jeff Hammerbacher used “data scientist” to describe their job
● 2011, McKinsey projected a US shortage of 140k to 190k analysts and 1.5M managers by 2018
● 2015, “Data Scientists Don’t Scale”
● 2016, “Why You’re Not Getting Value from Your Data Science”
https://whatsthebigdata.com/2012/04/26/a-very-short-history-of-data-science/
Data Science, growth
Data Science, hyped?
http://www.kdnuggets.com/wp-content/uploads/gartner-2014-hype-cycle.jpeg
Data Science, too broad
● BI Analyst/Engineer
● Analytics Engineer
● Data Engineer
● Statistician
● Research Scientist
● Machine Learning Engineer
● AI Engineer
● Solutions Specialist (with analytical background)
● Software Architect
● Financial Modeler
● Actuary
● ...
Data Science, definition
“Data Scientist is a Data Analyst who lives in California”
“Data Scientist is statistics on a Mac”
“...someone who is better at statistics than any software engineer and better at software
engineering than any statistician”
Data Science, the many Venn diagrams
Data Science, process
● Data wrangling (get data from any source, reshape, scale up if needed)
● Problem formulation and modeling (ML, DL, AI)
● Communicate the findings (visualization, UI/UX)
● Productize (SWE, Data Engineering, DevOps)
In the context of:
● Benefit (business value)
● Cost (development, infrastructure, and architecture)
My perspective, what does ZEFR do?
● Ingest hundreds of millions of videos per day
● Help brands show relevant ads
● Identify content for monetization
● Data science
○ Optimize advertising campaigns
○ Forecast inventory
○ Process text, image, audio, and video
○ Petabyte scale
My perspective, scale and automation
Requirements
● Billions of examples and millions of features to train the models with
● Scoring on a similar scale of data
● Models to be re-trained near real-time
Implications
● Have to use cloud computing and distributed systems
● Small deltas in quality and algorithm efficiency are magnified into massive cost or
benefit deltas
● Solid software engineering and automation are key
My perspective, example
Task
● Train a better forecasting model (vs. a benchmark statistic)
● Hundreds of terabytes of historical data available
Process
● Wrangling: pre-process and featurize (Spark, S3, Redshift)
● Modeling: VW (Vowpal Wabbit), H2O, hyper-parameter optimization
● Communication: justify the cost of a 100-node EMR cluster ($1,000 per day)
● Productize: test, deploy, and automate with Jenkins, ECS, and Kafka
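The cost justification in the Communication step comes down to simple arithmetic. The sketch below uses the two figures from the slide (100 nodes, $1,000 per day); the training duration and the daily benefit are hypothetical inputs chosen only to illustrate the calculation, not numbers from ZEFR.

```python
# Figures from the slide: a 100-node EMR cluster at about $1,000 per day.
NODES = 100
CLUSTER_COST_PER_DAY = 1_000.0  # USD


def node_hour_cost(cost_per_day: float, nodes: int) -> float:
    """Average cost of one node for one hour, in USD."""
    return cost_per_day / (nodes * 24)


def break_even_days(training_days: float, cost_per_day: float,
                    daily_benefit: float) -> float:
    """Days of production use needed for the benefit to repay the cluster spend."""
    return training_days * cost_per_day / daily_benefit


per_node_hour = node_hour_cost(CLUSTER_COST_PER_DAY, NODES)

# Hypothetical inputs: 14 days of cluster time, $500 per day of extra value
# from the better forecast. Replace these with measured numbers.
payback = break_even_days(training_days=14,
                          cost_per_day=CLUSTER_COST_PER_DAY,
                          daily_benefit=500.0)

print(f"{per_node_hour:.3f} USD per node-hour, payback in {payback:.0f} days")
```

Framing the request as a payback period tends to be easier to defend than a raw daily cost, because it ties the infrastructure spend to the benefit side of the cost/benefit context.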
My perspective, the grind
Weeks of tuning the infrastructure,
finding the right features, reasoning
through algorithm complexity
My perspective, pitfalls
● Unreasonable expectations
○ Hype, just hire a few PhDs
○ Is data science too easy?
● Throwing it over the fence*
○ Data science builds models in R/Python, engineering re-implements them in Java, C, Scala
● Dismissing the importance of good software engineering practices
○ Use tests, understand algo complexity, do code reviews, experiments should be reproducible
● Dismissing the importance of understanding and formulating the problem
○ Get out and talk to people
● Dismissing or not understanding architecture, infrastructure, and cost/benefit
* Full disclosure: the article was written by my boss, Jonathan Morra, at ZEFR
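As one small illustration of "use tests" and "experiments should be reproducible", the sketch below fixes the random seed of a train/test split and checks that property with a test. The function name, the default seed, and the 80/20 split are assumptions made for this example, not part of the original talk.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed, then split into (train, test)."""
    rng = random.Random(seed)      # local generator, so global state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def test_split_is_reproducible():
    rows = list(range(100))
    first = train_test_split(rows, seed=7)
    second = train_test_split(rows, seed=7)
    assert first == second                      # same seed, same split
    train, test = first
    assert len(test) == 20                      # 20% held out
    assert sorted(train + test) == rows         # nothing lost or duplicated


test_split_is_reproducible()
```

Passing the seed explicitly, rather than relying on global random state, means the same split can be recreated in a notebook, in a code review, and in the production pipeline.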
My perspective, data science platforms
● Many companies have recognized the problem of the disconnect between
data science and engineering
● Facebook and Uber have in-house platforms
● A number of commercial solutions: Sense, Domino Data Lab, DataScience, DataRobot,
Yhat, just to name a few
● Very expensive and inflexible in our case
https://blog.dominodatalab.com/uber-and-the-need-for-a-data-science-platform/
https://medium.com/@novakkm/the-purpose-of-platforms-in-data-science-965e2124edf8#.vwlz3idyw
https://code.facebook.com/posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/
My perspective, minimum data science requirements
- Statically-typed language (C, Java, Scala)
- Dynamically-typed language (Python, R)
- SQL (lag, partition, joins, rank, nested subqueries)
- NoSQL (JSON, MongoDB, CouchDB)
- Data wrangling (Pandas, dplyr, Julia, PySpark, Dask)
- Command-line fu
- Cloud computing (spin up instances, S3, ssh) and environment isolation
- Software engineering best practices (testing, version control, complexity)
- ML theory (bias/variance, complexity, encoding, hashing, feature engineering)
- ML practice (sklearn, R, Julia, MLlib, H2O, TensorFlow)
- Basic stats (experiment design, hypothesis testing, moments)
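To make the "encoding, hashing, feature engineering" item concrete, here is a minimal sketch of the hashing trick, a common way to encode high-cardinality categorical or text features into a fixed-size sparse vector. The function name, the choice of CRC32 as the hash, and the bucket count are assumptions for illustration; libraries such as Vowpal Wabbit and sklearn ship their own implementations with different hash functions.

```python
import zlib
from collections import Counter


def hash_features(tokens, n_buckets=2 ** 18):
    """Map string features to a sparse {bucket_index: count} vector.

    The vector size is fixed up front, so no vocabulary has to be built
    or stored. Distinct features may collide in the same bucket.
    """
    vec = Counter()
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % n_buckets
        vec[idx] += 1
    return dict(vec)


# Hypothetical example row: categorical features written as "name=value".
example = ["country=US", "device=mobile", "word=cat", "word=cat"]
vec = hash_features(example, n_buckets=16)
print(vec)
```

The trade-off is memory versus collisions: fewer buckets keep the model small, while more buckets reduce the chance that unrelated features share an index.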
My perspective, how to build your skills
● Take courses in areas of weakness (Udacity, Coursera)
● Showcase your skills with projects on GitHub
● Write a blog about things you’re good at to refine your understanding
● Do Kaggle competitions
● Contribute to StackOverflow and/or CrossValidated
● Contribute to open source projects (sklearn, tensorflow, dask, spark)
Resources
Newsletters, blogs and people to follow
Data Elixir, Data Science Weekly, The Morning Paper, Intuition Machine, The Wild
Week in AI, MLConf, Talking Machines, Partially Derivative, Brandon Rohrer, Julia
Evans, Chris Fregly, Bryan Smith, Stitch Fix, Unofficial Google Data Science Blog,
Variance Explained, Wes McKinney, Peter Norvig’s IPython notebooks, Frank Chen of
a16z, Fast Forward Labs, Chris Olah, Andrej Karpathy, OpenAI, Indico, John Cook, ...
