Machine Learning
what, why and how
me ...
Harshad
➔ Senior Data Scientist @ Sokrati
➔ Spent last 4 years trying to understand and
apply machine learning
Sokrati is a digital advertising startup based out of Pune
what are we going to do ?
get a 10,000-foot view ...
then go to specifics ...
10,000-foot view
what is ML ?
➔ Too many definitions!
➔ Too much debate
➔ Analytics vs ML vs data mining vs Big Data vs
Statistics vs next buzzword in the market
what is ML ?
Teaching machines to take decisions
with the help of data
practical man’s definition of machine learning!
bit of history ...
That’s too ancient!
That’s not
ML...
bit of (relevant) history ...
➔ Insurance, Banking industry
◆ Credit scoring
◆ mathematical models in finance
➔ Artificial Intelligence and other fancy ideas
◆ Deep Blue, Samuel’s checkers machine
◆ IBM Watson
Find ‘f’ => endeavour to understand the world!
g( y ) = f( X )
the two cultures in ML
Stats Culture
Vs
AI / ML Culture
the two cultures in ML
Stats Culture
➔ Focus on ‘why this model’
➔ Goodness of fit, hypothesis
tests, residuals
➔ or MCMC methods, Bayesian
modeling
➔ regression models, survival
analysis
AI / ML Culture
➔ Focus on ‘good predictions’
➔ Cross validations, ensemble of
models
➔ Focus on underfit vs overfit
analysis
➔ neural nets, tree based models
(random forests et al.)
but let’s build bridges
Stats
➔ Focus on basics, sound theory
➔ Exploration, summaries
➔ Models
ML / AI
➔ Focus on predictions
➔ Model evaluation
➔ Feature selection
Computations
➔ Focus on application
➔ Achieving scale and usability
➔ Hadoop , Storm and friends..
Business Knowledge
➔ Focus on interpretation
➔ Visualizations
➔ Creating stories out of data
typical ML process
➔ Objective
➔ Source data
➔ Explore
➔ Model
➔ Evaluate
➔ Apply
➔ Validate
objective
in brief!
bottom line : not in terms of algorithm but outcome
probability of customer churn
group set of emails by topic
predict rainfall
recommend item to a consumer to
increase likelihood of click
sourcing data
to be covered at end!
explore
R you ready ?
introduction to R
➔ Starting R
➔ Data Structures
◆ Atomic Vectors, matrix
◆ Lists
◆ Data frames
➔ Data types
◆ usual suspects (integer, double, character)
◆ Factors
◆ logical
data frame , the workhorse
➔ Load sample data frame
➔ Explore data frame (head, tail)
➔ Access elements by index
◆ access rows
◆ access columns
● single
● multiple (by name, by index, by -ve index)
➔ Find metadata
◆ names
◆ dimensions
➔ Explore using plot (pairs)
broadcasting / vectorization
➔ Very important concept
➔ Subsetting
◆ vectors
◆ data frames
➔ Applying operations
◆ operations on entire column
data exploration
➔ summarizing using mean
➔ quantiles, when mean is not enough
◆ outlier detection
➔ functional roots of R : sapply summary
➔ summary function
◆ on numerical
◆ on factors
➔ plots (basic)
➔ histograms
➔ correlations
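
A minimal sketch of the "quantiles, when mean is not enough" point above. It uses Python's standard library rather than the R used in the session. The click counts are made up for illustration, and the 1.5 × IQR cutoff is a common rule of thumb, not something the slides prescribe.

```python
import statistics

# Hypothetical daily click counts; the last value is a deliberate outlier.
clicks = [12, 15, 14, 13, 16, 15, 14, 13, 15, 120]

# The mean is dragged upward by the single extreme value; the median is not.
mean = statistics.mean(clicks)
median = statistics.median(clicks)

# Quartiles (n=4 gives three cut points: Q1, Q2, Q3).
q1, q2, q3 = statistics.quantiles(clicks, n=4, method="inclusive")
iqr = q3 - q1

# Rule-of-thumb fences: anything beyond 1.5 * IQR from the quartiles is flagged.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in clicks if x < low or x > high]

print("mean:", mean, "median:", median)
print("quartiles:", q1, q2, q3)
print("flagged as outliers:", outliers)
```

Comparing the mean with the median is a quick first check: a large gap between the two usually means the distribution is skewed or contains extreme values.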
models
simple & real world
basic types of models
➔ Supervised learning
➔ Unsupervised learning
➔ Semi-supervised learning
➔ Reinforcement learning
linear regression
➔ load sample dataset (cars)
➔ build linear regression model
➔ understand the output
◆ summary
◆ plots
➔ understanding train vs predict cycle
◆ most important idea!
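
The train vs predict cycle can be sketched as follows. This is a plain-Python illustration with ordinary least squares on one feature, not the R `lm` workflow from the session. The numbers are invented and are not the `cars` dataset; `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Train step: estimate slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict step: apply the learned parameters to an unseen input."""
    slope, intercept = model
    return slope * x + intercept


# Training data (made up): the model only ever sees these points while fitting.
train_x = [1, 2, 3, 4, 5]
train_y = [3.1, 4.9, 7.2, 8.8, 11.0]

model = fit_line(train_x, train_y)

# New input that was not part of training.
new_x = 6
print("slope, intercept:", model)
print("prediction for x = 6:", predict(model, new_x))
```

The key separation is that `fit_line` touches only the training data, while `predict` touches only the fitted parameters and new inputs.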
demo on real world
dataset
data exploration, classification
hierarchical clustering
➔ basic idea of clustering
◆ distance as a proxy for similarity, group by distance
◆ group anything as long as distance can be
calculated
➔ load and explore eurodist data
➔ fit hierarchical cluster
➔ plot dendrogram
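
A rough sketch of the "group by distance" idea, assuming single linkage (the distance between two clusters is the distance between their closest members). That choice is made here for brevity; R's `hclust` defaults to complete linkage. The four labelled points and their positions are invented, not taken from `eurodist`.

```python
from itertools import combinations

# Hypothetical items placed on a line; any pairwise distance would work.
positions = {"A": 0, "B": 1, "C": 5, "D": 7}
dist = {
    frozenset((a, b)): abs(positions[a] - positions[b])
    for a, b in combinations(positions, 2)
}


def cluster_distance(c1, c2):
    """Single linkage: distance between the closest pair across two clusters."""
    return min(dist[frozenset((a, b))] for a in c1 for b in c2)


def agglomerate(labels):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters by index.
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


merges = agglomerate(sorted(positions))
for left, right, height in merges:
    print(left, "+", right, "merged at distance", height)
```

The merge history is what a dendrogram draws: each merge becomes a join, and the merge distance becomes its height.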
demo on real world data
and the most important idea!
concept of vector space model
➔ Words as axes
➔ Bag of words defines
vector space
➔ Document as a point in
space
➔ We can
◆ define distance
◆ measure similarity
(cosine similarity)
◆ group documents
➔ what can be a document ?
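
A small sketch of the vector space model described above: each document becomes a bag-of-words count vector, and cosine similarity measures the angle between two such vectors. The three example documents are made up, and whitespace tokenisation is a simplifying assumption.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a document to word counts; each distinct word is one axis."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("cheap flights to pune")
doc2 = bag_of_words("cheap flights to mumbai")
doc3 = bag_of_words("machine learning workshop")

print("doc1 vs doc2:", cosine_similarity(doc1, doc2))
print("doc1 vs doc3:", cosine_similarity(doc1, doc3))
```

Anything that can be reduced to counts over a fixed vocabulary can play the role of a "document" here, which is why the same machinery applies well beyond text.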
evaluation
evaluation metrics
➔ depends on type of model
◆ regression : MAPE, MSSE
◆ classification : accuracy, precision, recall, F score
◆ clustering : within vs between variance
➔ ML world (ref : two cultures) has much
better story
➔ Not enough to perform well on training set
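
The classification metrics and MAPE listed above can be computed by hand as in the sketch below. Labels, predictions and forecasts are invented; `classification_metrics` and `mape` are hypothetical helper names, and class 1 is assumed to be the positive class.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F score for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


def mape(actual, forecast):
    """Mean absolute percentage error; assumes no actual value is zero."""
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


# Made-up held-out labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)

# Made-up regression targets and forecasts.
print("MAPE (%):", mape([100, 200, 400], [110, 180, 400]))
```

These numbers only mean something when `y_true` comes from data the model did not see during training, which is the point of the last bullet above.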
brief intro regularization
➔ Bias vs Variance problem
➔ We want to be ‘just right’
➔ Concept of regularization
➔ Intro Cross validation
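
One way to connect regularization and cross validation, sketched under simplifying assumptions: a one-feature model with no intercept, an L2 (ridge) penalty of strength `lam`, and contiguous unshuffled folds. The data and the candidate `lam` values are invented; the helper names are hypothetical.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n rows."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, test
        start = stop


def fit_ridge_slope(xs, ys, lam):
    """Minimise sum((y - w*x)^2) + lam * w^2; closed form for a single weight."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_mse(xs, ys, lam, k=3):
    """Average held-out mean squared error across the k folds."""
    fold_errors = []
    for train, test in k_fold_splits(len(xs), k):
        w = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        fold_errors.append(sum((ys[i] - w * xs[i]) ** 2 for i in test) / len(test))
    return sum(fold_errors) / len(fold_errors)


# Made-up noisy data, roughly y = 2x.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.3, 3.8, 6.4, 7.7, 10.5, 11.6]

# Score each candidate penalty on held-out folds and keep the best one.
scores = {lam: cv_mse(xs, ys, lam) for lam in (0.0, 1.0, 10.0)}
best_lam = min(scores, key=scores.get)
print(scores)
print("penalty with lowest held-out error:", best_lam)
```

A larger `lam` shrinks the weight towards zero (more bias, less variance). Cross validation chooses the penalty by held-out error instead of training error.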
demo of evaluation
and the fantastic scikit-learn API
sourcing and applying
and the great ML divide
the great ML divide
Lab Culture
➔ Theory
➔ Small Datasets
➔ In memory
➔ Not live
➔ R, Octave,
Python..
Source and Apply
➔ The practice
➔ Huge datasets
➔ Live in production
➔ Hadoop and
friends, Python ? ,
R ?
processing data at scale
➔ Data is not available in final form
➔ Non standard data
◆ click streams
◆ event logs
◆ free form text
➔ Process at scale
➔ Transform , clean, group in final form
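
The transform, clean and group steps can be sketched in miniature as a map step followed by a reduce step. This is plain in-memory Python to show the shape of the work, not Hadoop code. The log format, field names and timestamps are invented.

```python
from collections import defaultdict

# Hypothetical raw event log; one line is malformed on purpose.
raw_events = [
    "2014-01-01T10:00:00 user=42 action=click",
    "2014-01-01T10:00:05 user=42 action=view",
    "garbage line",
    "2014-01-01T10:01:00 user=7 action=click",
    "2014-01-01T10:02:00 user=42 action=click",
]


def parse(line):
    """Map step: turn one raw line into (user, action), or None if malformed."""
    fields = line.split()
    pairs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    if "user" not in pairs or "action" not in pairs:
        return None
    return pairs["user"], pairs["action"]


def count_by_key(records):
    """Reduce step: group identical keys and count them."""
    counts = defaultdict(int)
    for key in records:
        counts[key] += 1
    return dict(counts)


# Clean: drop lines that failed to parse. Group: count events per (user, action).
clean_records = [r for r in map(parse, raw_events) if r is not None]
event_counts = count_by_key(clean_records)
print(event_counts)
```

At scale the same two steps are spread across many machines, with the grouping by key handled by the framework between them.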
5 min intro to Hadoop Ecosystem
➔ Write in assembly
◆ Java
➔ DSLs
◆ Pig
◆ Hive
◆ Impala
➔ Functional Languages to rescue!
Introduction to Cascalog
processing data at scale
Recap
➔ Pragmatic ML
➔ Key phases
➔ Supervised learning
➔ Unsupervised learning
➔ Evaluation
➔ Application issues
➔ Processing data @ scale
Let’s discuss...

Workshop on Machine Learning
