Introduction to Data Science, Understanding Machine Learning and Embracing it within IoT Solution
Meet Ganesh Raskar | @geekwhocodes
• Intern at RapidCircle India
• Microsoft Student Partner
• Microsoft Certified Professional
• Microsoft Specialist (HTML5, CSS3 & JavaScript, Azure Web Services)
• Periodic Blogger (http://geekwhocodes.me)
• Lifelong learner
Email geekwhocodes@outlook.com
Twitter https://twitter.com/geekwhocodes
About https://about.me/geekwhocodes
LinkedIn https://in.linkedin.com/in/geekwhocodes
Modules
Module 01 | Introduction to Data Science
Module 02 | Understanding Machine Learning
Module 03 | Machine Learning Workflow
• Module 03-01 | Regression
• Module 03-02 | Classification
• Module 03-03 | Clustering
• Module 03-04 | Recommenders
Module 04 | Demo – Classification
Module 05 | Demo – Embracing ML in IoT solution
Module 01 | Introduction to Data Science
It has its own jargon
What is Data Science ?
• Evolving subject, no single definition
• Requires a range of skills
Data science is the exploration and quantitative analysis of all available
structured and unstructured data to develop understanding, extract
knowledge, and formulate actionable results.
[Figure: the path from Data to Decision to Action. Value grows as the questions move from "What happened?" (manual process) to "Why did it happen?", "What will happen?" and "What should I do?", and as the process moves from decision support to decision automation.]
Types of Analytics: Predictive vs Prescriptive
• Retrospective analytics looks back at what has already happened
• Predictive analytics, calibrated on past data, tells us what to expect
• Prescriptive analytics tells us what actions to take
Module 02 | Understanding Machine Learning
What is Machine Learning?
What Does Machine Learning Do?
Finds patterns in data
Uses those patterns to predict the future
Examples:
• Detecting credit card fraud
• Determining whether a customer is likely to
switch to a competitor
• Detecting machine failure
• Lots more
What Does It Mean to Learn?
How did you learn to read?
Learning requires:
• Identifying patterns
• Recognizing those patterns when you see them again
• Theory -> Simulation -> Try to understand things
This is what machine learning does
Finding Patterns
A simple example:
Name      Amount      Fraudulent
Omkar     ₹ 10,000    No
Amit      ₹ 17,000    Yes
Ankit     ₹ 20,000    Yes
Ganesh    ₹ 19,000    No
A slightly more complex example:
Name      Amount      Issued   Used        Age   Fraudulent
Omkar     ₹ 10,000    India    India       27    No
Ganesh    ₹ 23,000    India    India       21    No
Ankit     ₹ 12,000    India    USA         25    Yes
Amit      ₹ 2,000     USA      India       27    Yes
Avani     ₹ 14,000    India    Amsterdam   26    No
Vinit     ₹ 69,000    India    Holland     25    Yes
Aditi     ₹ 70,000    USA      USA         26    No
Swapnil   ₹ 9,000     India    India       21    No
Gayatri   ₹ 30,000    India    London      20    Yes
What's the pattern for fraudulent transactions?
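One way to feel why a machine is useful here is to try a pattern by hand. The sketch below checks a single hand-written rule (the card was used somewhere other than where it was issued) against the rows of the larger table. The rule and the name `looks_fraudulent` are illustrative assumptions, not something a learning algorithm produced.

```python
# Rows from the larger table: (name, amount, issued, used, age, fraudulent)
transactions = [
    ("Omkar", 10000, "India", "India", 27, False),
    ("Ganesh", 23000, "India", "India", 21, False),
    ("Ankit", 12000, "India", "USA", 25, True),
    ("Amit", 2000, "USA", "India", 27, True),
    ("Avani", 14000, "India", "Amsterdam", 26, False),
    ("Vinit", 69000, "India", "Holland", 25, True),
    ("Aditi", 70000, "USA", "USA", 26, False),
    ("Swapnil", 9000, "India", "India", 21, False),
    ("Gayatri", 30000, "India", "London", 20, True),
]


def looks_fraudulent(issued, used):
    """Hypothetical hand-written rule: flag cards used away from where they were issued."""
    return issued != used


# Count the rows where the rule agrees with the Fraudulent column.
correct = sum(
    looks_fraudulent(issued, used) == fraudulent
    for _, _, issued, used, _, fraudulent in transactions
)
# Collect the names of the rows the rule gets wrong.
misses = [
    name
    for name, _, issued, used, _, fraudulent in transactions
    if looks_fraudulent(issued, used) != fraudulent
]

print(f"rule matches {correct} of {len(transactions)} rows")
print("rows the rule gets wrong:", misses)
```

The rule fits most rows but not all of them, which is the point of the slide: with more columns and more rows, finding a pattern that holds up is better left to an algorithm.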
Machine Learning in a Nutshell
[Figure: Data -> Machine learning algorithm -> Model <- Application]
• Data: contains patterns
• Machine learning algorithm: finds patterns
• Model: recognizes patterns
• Application: supplies new data to see if it matches known patterns
Why Is Machine Learning So Hot Right Now?
Doing machine learning well requires:
• Lots of data
• Lots of compute power
• Effective machine learning algorithms
All of those things are now more available
than ever
Who's Interested in Machine Learning?
• Business leaders: want solutions to business problems
• Data scientists: want powerful, easy-to-use tools
• Software developers: want to create better applications
Who Is a Data Scientist?
Someone who knows about:
• Statistics
• Machine learning software
• Some problem domain (ideally)
Key facts about data scientists:
• Good ones are scarce
• Good ones are expensive
The Role of R
R is an open source programming language
• Supports machine learning, statistical
computing, and more
• Has many available packages
• R is very popular
• Many commercial machine learning
offerings support R
But it’s not the only choice:
• Python is also increasingly popular
Summary
• Machine learning lets us find patterns in existing data, then create and use a model that recognizes those patterns in new data
• Machine learning has gone mainstream
• Big vendors think there's big money in this market
• Machine learning can probably help your organization
Module 03 | Machine Learning Process
ML Process is:
Iterative
• In both big and small ways
Challenging
• It’s rarely easy
Often rewarding
• But not always
First Step: Asking the Right Question
Choosing what question to ask is the most important part of the ML process
Ask yourself: Do you have the right data to answer the question?
Ask yourself: Do you know how you'll evaluate the result?
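Machine Learning Flow
[Figure: the machine learning workflow, from raw data to applications]
1) Choose data: pick the raw data to work with
2) Apply pre-processing to data (data processing modules) to get prepared data; iterate until the data is ready
3) Apply a learning algorithm (ML algorithms) to the prepared data to get a candidate model; iterate to find the best model
4) Deploy the chosen model so that applications can use it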
Repeating The Process
[Figure: the same flow (raw data -> apply pre-processing to data -> prepared data -> apply learning algorithm to data -> candidate model -> chosen model -> deploy chosen model), with a loop back to the start: re-create the model regularly]
Scenario: Predicting Customer Churn
[Figure: churn-prediction pipeline. Labels: Customers; Call Center Staff; Call Center Application; Detailed Call Data; Aggregation Application (Hadoop, Spark, etc.); Aggregated Call Data; CRM Data; ML Prep Application; Data for ML; Machine; Model]
Summary
• Choose the right question
• Data Transformation
• Iterate until you have a model that makes good predictions
• Periodically rebuild the model
• Deploy the solution
A Closer Look at Machine Learning
ML has its own jargon
Terminology
• Training data: the prepared data used to create a model
• Creating a model is called training a model
• Supervised learning (most common): the value you want to predict is in the training data; the data is labeled
• Unsupervised learning: the value you want to predict is not in the training data; the data is unlabeled
* We'll focus on supervised ML
Data Processing for Supervised Machine Learning
[Figure: Data Source 1, Data Source 2, ... Data Source N feed raw data into the available data preprocessing modules]
1) Read raw data
2) Create training data
The training (or prepared) data consists of features and a target value.
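To make "features" and "target value" concrete, here is a minimal sketch that turns labeled records into the two pieces a supervised algorithm needs. The records, the column names and the helper `to_training_data` are hypothetical, chosen only to echo the fraud table earlier in the deck.

```python
# Hypothetical labeled records: every row carries the feature columns
# plus the value we want to predict (the target).
raw_rows = [
    {"amount": 10000, "issued": "India", "used": "India", "age": 27, "fraudulent": "No"},
    {"amount": 12000, "issued": "India", "used": "USA", "age": 25, "fraudulent": "Yes"},
    {"amount": 70000, "issued": "USA", "used": "USA", "age": 26, "fraudulent": "No"},
]

FEATURES = ["amount", "issued", "used", "age"]
TARGET = "fraudulent"


def to_training_data(rows, feature_names, target_name):
    """Split each record into a feature list (x) and a target value (y)."""
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[target_name] for row in rows]
    return X, y


X, y = to_training_data(raw_rows, FEATURES, TARGET)
print(X[0], "->", y[0])
```

Because the target column is present, this is labeled data, which is what makes the problem supervised.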
Categorizing Machine Learning Problems
Regression: for predicting real-valued outcomes
• How many customers will visit our site next week?
• How many TVs will sell next year?
• Can we predict someone's income from their click-through information?
• "How many?" means it's a regression problem
Classification: for predicting truth-valued outcomes
• Will I pass next semester?
• Is this transaction fraudulent?
• Is this a spam e-mail?
Clustering: for solving unsupervised learning problems
• Identifying chairs among a bunch of different objects
• Handwriting recognition
• Is this Ganesh's voice?
Recommenders: for recommending products based on history
• Building recommender engines
Summary
• Machine learning has come of age
• Machine learning isn't hard to understand
• Although it can be hard to do well
• Machine learning can probably help your organization
Module 03-01 | Regression
How many ?
Regression
• Introduction to Regression
• Simple Linear Regression (1 Feature)
• Ridge Regression
• SVM Regression
• Cross-Validation
Introduction to Regression
• Each observation is represented by a set of numbers.
A person is represented as:
Name     Clicks (single feature, called x)   Income (label, called y)
Ganesh   [10]                                53
Ankit    [7]                                 -15
[…]      […]                                 …
Need a function that estimates y for a new x.
Simple Linear Regression
• Formally, given a training set $(x_i, y_i)$ for $i = 1 \ldots n$, we want to create a regression model f that can predict label y for a new x.
f(x) = function(Number of Businessweek clicks)
[Figure: Income (0 to 1,000K) plotted against Number of Businessweek clicks (0 to 2000), with the fitted line f(x)]
$$ f(x_i) = b_0 + b_1 x_i $$
For example: f(x) = 100K + 5K * Number of Businessweek clicks (x)
• Want the model to be as close to the data as possible.
want these to be small: $y_i - f(x_i)$
equivalently, want these to be small: $(y_i - f(x_i))^2$
SSE(f) is the sum of these squared differences:
$$ \mathrm{SSE}(f) = \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 $$
• You do not need to solve the minimization problem; the machine learning algorithm will do it for you.
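For a single feature, the minimization that "the algorithm will do for you" has a well-known closed-form answer. The sketch below fits b0 and b1 by least squares and computes the SSE. The data is made up (income in thousands), and the names `fit_line` and `sse` are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of f(x) = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sse(xs, ys, b0, b1):
    """Sum of squared differences between each label and its prediction."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up training set: number of clicks and income in thousands.
clicks = [0, 1, 2, 3, 4]
income = [98, 107, 109, 116, 120]

b0, b1 = fit_line(clicks, income)
print("b0 =", b0, "b1 =", b1)
print("SSE of fitted line:", sse(clicks, income, b0, b1))
print("SSE of a hand-picked line (100 + 5x):", sse(clicks, income, 100, 5))
```

Any other choice of b0 and b1, such as the hand-picked line, gives an SSE at least as large as the fitted one.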
Ridge Regression
• Extension of simple linear regression to many features
• Formally, given a training set $(x_i, y_i)$ for $i = 1 \ldots n$, we want to create a regression model f that can predict label y for a new x.
Estimated income:
f(x) = function(feature1, feature2, feature3, feature4, feature5, … etc.)
For instance,
f(x) = 3 * Number of visits
+ 10 * Number of Businessweek clicks
+ 100 * Number of people emailed per day
+ 2 * Number of purchases of over 5K within the last month
+ 10 * Number of visits to airlines
But f(x) could be much more complicated
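The example model above is just a weighted sum of features. A minimal sketch, using the weights from the slide; the feature key names and the sample person are hypothetical.

```python
# Weights taken from the example f(x) above; the key names are illustrative.
WEIGHTS = {
    "visits": 3,
    "businessweek_clicks": 10,
    "people_emailed_per_day": 100,
    "purchases_over_5k_last_month": 2,
    "airline_visits": 10,
}


def estimate_income(person):
    """Weighted sum of the person's feature values."""
    return sum(weight * person[name] for name, weight in WEIGHTS.items())


# Hypothetical person, not taken from any real data.
person = {
    "visits": 2,
    "businessweek_clicks": 3,
    "people_emailed_per_day": 1,
    "purchases_over_5k_last_month": 0,
    "airline_visits": 4,
}

# 3*2 + 10*3 + 100*1 + 2*0 + 10*4 = 176
print(estimate_income(person))
```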
Ridge Regression
Over-fitting model:
• Multiple features
• Wrong ML algorithm
• It just remembers the training data
• Worst: it generalizes poorly to new data
Could choose b0, b1, b2, etc., to minimize the total error on the training set plus a regularization term that keeps the model simple:
$$ \min_{b_0, b_1, b_2, \ldots} \left[ \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_{i,1} + b_2 x_{i,2} + \cdots)\bigr)^2 + C\,(b_0^2 + b_1^2 + b_2^2 + \cdots) \right] $$
• C is chosen using cross-validation
• This is called "Ridge Regression"
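To see what the regularization term does, the sketch below solves the ridge objective for one feature. It follows the formula as written on the slide, so b0 is penalized along with b1 (many libraries leave the intercept unpenalized). Setting the derivatives to zero gives a 2x2 linear system, solved here with Cramer's rule. The data and function names are made up for illustration.

```python
def fit_ridge(xs, ys, C):
    """Minimize sum((y - (b0 + b1*x))**2) + C*(b0**2 + b1**2); returns (b0, b1)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Zero-gradient conditions:
    #   (n + C) * b0 + sx * b1         = sy
    #   sx * b0      + (sxx + C) * b1  = sxy
    det = (n + C) * (sxx + C) - sx * sx
    b0 = (sy * (sxx + C) - sx * sxy) / det
    b1 = ((n + C) * sxy - sx * sy) / det
    return b0, b1


def objective(xs, ys, b0, b1, C):
    """Training error plus the regularization term."""
    error = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return error + C * (b0 ** 2 + b1 ** 2)


# Made-up training set: clicks and income in thousands.
clicks = [0, 1, 2, 3, 4]
income = [98, 107, 109, 116, 120]

plain = fit_ridge(clicks, income, C=0)    # C = 0 is ordinary least squares
shrunk = fit_ridge(clicks, income, C=10)  # larger C pulls coefficients toward 0
print("C=0 :", plain)
print("C=10:", shrunk)
```

With C = 0 this reduces to ordinary least squares; as C grows, the squared size of the coefficients shrinks, which is the "keeping the model simple" part.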
Support Vector Machine Regression
$$ \min_{b_0, b_1, b_2, \ldots} \left[ \sum_{i=1}^{n} g_\varepsilon\bigl(y_i - f(x_i)\bigr) + C\,(b_0^2 + b_1^2 + b_2^2 + \cdots) \right] $$
[Figure: the loss $g_\varepsilon(y_i - f(x_i))$ plotted against $y_i - f(x_i)$, with 0 and $\varepsilon$ marked on the horizontal axis]
• The difference between ridge and SVM regression is how they measure the difference between the prediction and the truth
• Epsilon: as long as f(x) and y are within epsilon of each other, on either side, the difference y - f(x) is counted as 0
• You don't need to do this yourself; it is handled by the ML algorithm
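The difference between the two error measures is easy to see in code. This sketch assumes the standard epsilon-insensitive form, which is zero inside the band and grows linearly outside it; the slide only states the "zero inside the band" part, so the linear growth is an assumption.

```python
def squared_loss(residual):
    """Ridge regression: every difference counts, large ones count a lot."""
    return residual ** 2


def epsilon_insensitive_loss(residual, epsilon):
    """SVM regression: differences smaller than epsilon cost nothing."""
    return max(0.0, abs(residual) - epsilon)


# Compare the two losses on a few residuals y - f(x), with epsilon = 0.5.
for residual in [-2.0, -0.3, 0.0, 0.3, 2.0]:
    print(
        residual,
        squared_loss(residual),
        epsilon_insensitive_loss(residual, epsilon=0.5),
    )
```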
Cross Validation
• Cross-validation (CV) is one of the most popular ways to evaluate a machine learning algorithm on a dataset.
• You will need a dataset, an algorithm, and an evaluation measure for the quality of the result. The evaluation measure might be the squared error between the predictions and the truth.
• Divide the data into 10 "folds" of approximately equal size.
• Train the algorithm on 9 folds, compute the evaluation measure on the last fold.
• Repeat this 10 times, using each fold in turn as the test fold.
• Report the mean and standard deviation of the evaluation measure over the 10 folds.
[Figure: the data split into folds, with 9 folds marked Train and 1 fold marked Test]
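The steps above can be sketched in a few lines. The "algorithm" here is a deliberately trivial stand-in (predict the mean of the training labels) and the data is made up; both are assumptions for illustration. In practice the data would usually be shuffled before splitting.

```python
import statistics


def k_fold_indices(n_samples, k=10):
    """Split the indices 0..n_samples-1 into k folds of nearly equal size."""
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)
    return folds


def cross_validate(xs, ys, train, evaluate, k=10):
    """Train on k-1 folds, score on the held-out fold, repeat for every fold."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = train(train_x, train_y)
        test_x = [xs[i] for i in test_idx]
        test_y = [ys[i] for i in test_idx]
        scores.append(evaluate(model, test_x, test_y))
    return statistics.mean(scores), statistics.stdev(scores)


def train_mean(train_x, train_y):
    """Stand-in algorithm: the 'model' is just the mean training label."""
    return sum(train_y) / len(train_y)


def mean_squared_error(model, test_x, test_y):
    """Evaluation measure: average squared error on the test fold."""
    return sum((y - model) ** 2 for y in test_y) / len(test_y)


# Made-up dataset of 30 observations.
xs = list(range(30))
ys = [2 * x + 1 for x in xs]

folds = k_fold_indices(len(xs), k=10)
mean_score, std_score = cross_validate(xs, ys, train_mean, mean_squared_error)
print("mean squared error over folds:", mean_score, "+/-", std_score)
```

Swapping in a real learning algorithm only means replacing `train_mean` and, if needed, the evaluation measure.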
Module 03-02 | Classification
True or False?
Class1 or Class2 or Class N?
Classification
• What is classification?
• Loss functions for classification
• Logistic regression
• SVM
• AdaBoost
• Decision trees
• Multiclass classification
• Imbalanced learning
• ROC curves and the AUC
Introduction to Classification
• Formally, given a training set $(x_i, y_i)$ for $i = 1 \ldots n$, we want to create a classification model f that can predict label y for a new x.
A person is represented as: features, called x, and labels, called y.
[Figure: example feature vectors x shown as columns of numbers ([5] [10] [7] […], [12] [14] [47] […], and so on) next to their labels y (1, -1, 1, …)]
Need a function that estimates y for a new x.
Introduction to Classification
[Figure: last-year backlog (0 to 3) plotted against study hours per day (0 to 8). The line f(x) = 0 separates the Pass region from the Fail region, with f(x) > 0 on one side and f(x) < 0 on the other.]
f(x) = function(last-year backlog, study hours per day)
The machine learning algorithm will create the function f for you. It might be very complicated, but the way to use it is not complicated:
The predicted value of y for a new x is the sign of f(x).
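Using a trained classifier really is this simple. In the sketch below the function f is hand-written and purely hypothetical (a real one would be produced by the learning algorithm); the label convention of 1 for pass and -1 for fail, and sending the boundary case f(x) = 0 to -1, are also assumptions.

```python
def f(backlog, study_hours):
    """Hypothetical scoring function; a real f would be learned from data."""
    return study_hours - 2.0 * backlog - 1.0


def predict(backlog, study_hours):
    """Predicted label is the sign of f(x): 1 (pass) or -1 (fail)."""
    return 1 if f(backlog, study_hours) > 0 else -1


print(predict(backlog=0, study_hours=6))  # f = 5.0, positive side
print(predict(backlog=3, study_hours=2))  # f = -5.0, negative side
```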
Module 03-03 | Clustering
Clustering
• Clustering is a key unsupervised problem.
• "Unsupervised" means that the training data has no ground truth labels to learn from.
• This means unsupervised methods are much harder to evaluate.
[Figure: Supervised: objects labeled "chair" or "not a chair". Unsupervised: the same kind of objects with no labels.]
Clustering
Applications include:
• Automatically grouping documents/webpages into topics
– For instance, grouping news stories from today into categories
• Clustering a large number of products
– E.g. online shopping sites (search)
• Clustering customers into those with similar purchase behavior
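The deck does not name a clustering algorithm, so as one common choice the sketch below uses k-means to group customers by purchase behavior. The customer data, the two features and the starting centroids are all made up; real implementations usually pick starting centroids at random.

```python
def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        distances = [
            sum((a - b) ** 2 for a, b in zip(point, centroid))
            for centroid in centroids
        ]
        clusters[distances.index(min(distances))].append(point)
    return clusters


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids to cluster means."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            tuple(sum(values) / len(cluster) for values in zip(*cluster))
            if cluster
            else old  # keep an empty cluster's centroid where it was
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)


# Made-up customers: (purchases per month, average basket size).
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(customers, centroids=[(0, 0), (10, 10)])

for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Note that no labels were supplied: the two groups come only from how close the points are to each other.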
Module 03-04 | Recommenders
Introduction to Recommenders
• Self-explanatory
• Market basket analysis
• Customer purchasing behaviour
• Increase sales and maintain inventory
Approaches:
• Matrix factorisation
• Collaborative filtering: k-NN & Pearson correlation
• Content-based: Bayesian classifiers, cluster analysis, decision trees, artificial neural networks
Used in: Facebook, LinkedIn, Netflix
Recommenders
Terminology:
Items: [1,2,3,4,5,6,7,8,9]
Itemset: any subset of the items, e.g. {3,5}, {5,8}, {1,3}
Transaction: e.g. {2,3}, {4,9}, {7,2}, {9,3}
Rule: e.g. {7 -> 2}
Support of an itemset: the proportion of transactions containing the itemset
Confidence of a rule: if a user buys 7, what are the chances they buy 2 as well?
Approaches:
• Collaborative filtering
• Content-based filtering: works on the metadata of the item
• Hybrid approach
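Support can be computed directly from the four example transactions above. The sketch also computes the confidence of the rule {7 -> 2}, meaning the share of transactions containing 7 that also contain 2, which is the "if a user buys 7, what are the chances of buying 2" question. The function names are illustrative.

```python
# The example transactions from the terminology above.
transactions = [{2, 3}, {4, 9}, {7, 2}, {9, 3}]


def support(itemset, transactions):
    """Proportion of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= transaction for transaction in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent. Assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({3}, transactions))            # item 3 is in 2 of 4 transactions
print(support({7, 2}, transactions))         # {7, 2} is in 1 of 4 transactions
print(confidence({7}, {2}, transactions))    # every basket with 7 also has 2
```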
Recommenders
• User-movie matrix
• The goal is to predict a user's rating for a movie that they haven't watched yet
• The intuition behind using matrix factorization to solve this problem is that there should be some latent features that determine how a user rates an item.
[Figure: a user-movie rating matrix with rows User 1, User 2, User 3, ..., User n]
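A minimal sketch of the latent-feature idea: each user and each movie gets a short vector, and a rating is predicted as their dot product. The vectors are fitted to the known ratings with stochastic gradient descent, one common approach. The rating matrix, the hyperparameters and the function names are all assumptions for illustration.

```python
import random


def factorize(ratings, n_factors=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Learn user vectors P and movie vectors Q from the known ratings."""
    rng = random.Random(seed)
    n_users, n_items = len(ratings), len(ratings[0])
    P = [[rng.random() for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(n_factors)] for _ in range(n_items)]
    known = [
        (u, i, r)
        for u, row in enumerate(ratings)
        for i, r in enumerate(row)
        if r is not None
    ]
    for _ in range(steps):
        for u, i, r in known:
            err = r - sum(P[u][k] * Q[i][k] for k in range(n_factors))
            for k in range(n_factors):
                p, q = P[u][k], Q[i][k]
                # Step toward a smaller error, with a small penalty on size.
                P[u][k] += lr * (err * q - reg * p)
                Q[i][k] += lr * (err * p - reg * q)
    return P, Q


def predict(P, Q, user, item):
    """Predicted rating: dot product of the user and movie vectors."""
    return sum(p * q for p, q in zip(P[user], Q[item]))


def rmse(ratings, P, Q):
    """Root mean squared error over the known ratings only."""
    errors = [
        (r - predict(P, Q, u, i)) ** 2
        for u, row in enumerate(ratings)
        for i, r in enumerate(row)
        if r is not None
    ]
    return (sum(errors) / len(errors)) ** 0.5


# Made-up user-movie matrix; None marks a movie the user has not rated.
ratings = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [1, 1, None, 5],
    [1, None, None, 4],
    [None, 1, 5, 4],
]

P0, Q0 = factorize(ratings, steps=0)  # random starting vectors, no training
P, Q = factorize(ratings)             # same start (same seed), then trained
print("error before training:", rmse(ratings, P0, Q0))
print("error after training: ", rmse(ratings, P, Q))
print("predicted rating, user 0 / movie 2:", predict(P, Q, 0, 2))
```

The last line fills in one of the missing cells, which is the prediction a recommender would use to decide whether to suggest that movie.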
Module 04 | Demo
Will I pass next semester?
Module 05 | Demo
How can we use ML in IoT?
Information
Intel Edison
• Dual-core, dual-threaded Intel® Atom™ CPU at 500 MHz
• 32-bit Intel® Quark™ microcontroller at 100 MHz
• 1 GB LPDDR3 memory
• 20 digital input/output pins including 4 pins as
PWM(pulse width modulation) outputs
• 6 analog inputs
• 1 I2C
• 1 ICSP(In-Circuit Serial Programming)
• Micro USB device connector
• SD Card connector
• BLE 4.0
• Yocto Linux 1.6*
Water flow sensor (1-30 L/min), specific to my experiment
Thank you
Machine learning workshop @DYP Pune
