Introduction to Data Science

Prithwis Mukerjee, PhD
Praxis Business School, Calcutta
prithwis mukerjee, ph.d.
Agenda
● Why data science?
● Techniques
  ○ Statistics
  ○ Data Mining
  ○ Visualisation
● Tools & Platforms
  ○ R
  ○ Hadoop / MapReduce
  ○ Real Time Systems
● Business Domains

Volume
Data is being acquired from a variety of sources
● EFT in Banks, Credit card payments
● Cell phones
● Sensors attached to a variety of equipment
● Surveillance cameras, CCTV
● Social Media Updates
● Blogs
● Websites

Variety / Velocity
● Numeric data
● Structured text data
● Unstructured text data
● Images
● Sound and video recordings
● Graph Nodes
  ○ Social Media “friends”
  ○ Websites linked to each other
Data is being generated fast and is becoming obsolete or useless equally fast
● Realtime (or near realtime) data from sensors, cameras
● Website traffic
● Social media “trends”

So what is Big Data?
● Volume
● Velocity
● Variety?
A new term coined by IT vendors to push new technology like
● Map Reduce
● Hadoop
● NoSQL
A new way to
● collect
● store
● manage
● analyse
● visualise data

Big Data is like Crude Oil { not new Oil }
Think of data as crude oil!
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos.
But what about refining?

The Science (and Art) of Data
Think of data as crude oil!
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos.
Refining
Data Science
● Discovering what we do not know about the data
● Obtaining predictive, actionable insight
● Creating data products that have business impacts
● Communicating relevant business stories

Two Perspectives
[Diagram labels: Programming or “Hacking” Skills; Mathematics, Statistics Knowledge; Business Domain Knowledge; Machine Learning; RDBMS, ERP / BI; Operations Research; Data Science]

10 Things {most} Data Scientists do ...
1. Ask good questions
   What is what? We do not know! We would like to know
2. Define and test hypotheses, run experiments
3. Scoop, scrape, sample business data
4. Wrestle and tame data
5. Play with data, discover unknowns
6. Create models, algorithms
7. Understand data relationships
8. Tell the machine how to learn from the data
9. Create data products that deliver actionable insights
10. Tell relevant business stories from data

Statistics - World of Data
● Data comes in various types
  ○ Nominal - colour, gender, PIN code
  ○ Ordinal - scale of 1-10, {high, medium, low}
  ○ Interval - Dates, Temperature (Centigrade)
  ○ Ratio - length, weight, count
● Data comes in various structures
  ○ Structured data - nominal, ordinal, interval, ratio
  ○ Unstructured text - email, tweets, reviews
  ○ Images, voice prints
  ○ Graphs, networks - social media friendships, likes

Descriptive Statistics
● Numeric Description
  ○ Mean, Median, Mode
  ○ Quartile, Percentile
  ○ Variance / Standard Deviation

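The numeric descriptions listed above can be tried out with a few lines of Python. This is only a minimal sketch using the standard `statistics` module; the sample values are made up for illustration and do not come from the slides, and the "inclusive" quantile method is one of several reasonable conventions.

```python
import statistics

# Hypothetical sample, made up for illustration (not taken from the slides).
data = [4, 8, 15, 16, 23, 42, 8]

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

# Cut points that split the sorted data into 4 parts (quartiles)
# or 100 parts (percentiles). "inclusive" treats the sample as the
# whole population of interest; this choice is an assumption.
quartiles = statistics.quantiles(data, n=4, method="inclusive")
percentile_90 = statistics.quantiles(data, n=100, method="inclusive")[89]

variance = statistics.variance(data)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)      # square root of the sample variance

print(mean, median, mode, quartiles, percentile_90, variance, std_dev)
```

The second quartile is the median, which is a quick sanity check on any quartile calculation.
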
Statistics: The Path Ahead
● Probability, Distributions
● Testing of Hypothesis
● Regression, Testing
● Predictive Analysis

Data Mining / Machine Learning
Is the process of obtaining
● novel
● valid
● potentially useful
● understandable
patterns in data
Typical tasks are
● classification
● clustering
● association rules
● sequential patterns
● regression
● deviation detection

Some definitions
Instance (an item or record)
● an observation that is characterised by a number of attributes
  ○ person - with attributes like age, salary, qualification
  ○ sale - with product, quantity, price
Attribute
● a measured characteristic of an instance
Class
● grouping of instances into categories such as
  ○ acceptable, not acceptable
  ○ mammal, fish, bird
Nominal
● colour, PIN code, state
Ordinal
● ranking: tall, medium, short or feedback on a scale of 1 - 10
Ratio
● length, price, duration, quantity
Interval
● date, temperature

Data Mining: Classification
Classification
● Which loan applicant will not default on the loan?
● Which potential customer will respond to a mailer campaign?

Classification Example
[Figure labels: categorical, categorical, continuous, class (column types); Training Set; Learn Classifier; Model; Test Set]

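The training set / learn classifier / model / test set flow can be sketched in plain Python. Everything here is an assumption made for illustration: the loan records are invented, and the "model" is a single income threshold (a one-split decision stump), which is far simpler than what a real project would use.

```python
# Each record: (owns_home, job_type, income, label).
# Two categorical attributes, one continuous attribute and a class label.
# All values are hypothetical.
training_set = [
    ("yes", "salaried", 125.0, "repaid"),
    ("no", "self-employed", 100.0, "repaid"),
    ("no", "salaried", 70.0, "repaid"),
    ("yes", "salaried", 120.0, "repaid"),
    ("no", "self-employed", 45.0, "defaulted"),
    ("no", "salaried", 60.0, "defaulted"),
    ("yes", "self-employed", 40.0, "defaulted"),
    ("no", "salaried", 85.0, "repaid"),
]
test_set = [
    ("no", "salaried", 50.0, "defaulted"),
    ("yes", "salaried", 110.0, "repaid"),
    ("no", "self-employed", 65.0, "repaid"),
]


def predict(threshold, record):
    """Apply the model: incomes below the threshold are predicted to default."""
    return "defaulted" if record[2] < threshold else "repaid"


def learn_classifier(records):
    """Pick the income threshold that makes the fewest errors on the records."""
    incomes = sorted({r[2] for r in records})
    # Candidate thresholds are midpoints between neighbouring income values.
    candidates = [(a + b) / 2 for a, b in zip(incomes, incomes[1:])]

    def errors(threshold):
        return sum(predict(threshold, r) != r[3] for r in records)

    return min(candidates, key=errors)


model = learn_classifier(training_set)  # learn on the training set
accuracy = sum(predict(model, r) == r[3] for r in test_set) / len(test_set)
print("threshold:", model, "test accuracy:", accuracy)
```

The point of holding back a test set is that accuracy is measured on records the classifier never saw while learning.
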
Data Mining: Clustering
Given a set of unclassified data points, how to find a natural grouping within them
● Can we segment the market in some way that is not yet known?

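The slide does not name an algorithm, so the sketch below assumes k-means, one common way to look for a natural grouping. The customer points are hypothetical, and the first k points are used as starting centroids only to keep the example deterministic.

```python
from math import dist


def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point
    to its nearest centroid and then recomputing the centroids."""
    centroids = list(points[:k])  # simplistic, deterministic initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean of the cluster's points;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (visits per month, average spend).
customers = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, segments = k_means(customers, k=2)
print(segments)
```

No labels are supplied anywhere; the two segments emerge from the distances alone, which is what separates clustering from classification.
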
Example of Document Clustering
Clustering points: 3204 articles from the Los Angeles Times
Similarity Measure: How many words are common in these documents (after excluding some common words)

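The similarity measure described above can be sketched as a count of shared words. The stop-word list and the sample headlines are hypothetical, chosen only to show the idea; they are not the data behind the slide.

```python
# Hypothetical list of common words to exclude before comparing documents.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}


def content_words(text):
    """Return the set of lower-cased words left after removing common words."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}


def shared_word_count(doc_a, doc_b):
    """Similarity = number of distinct non-common words the documents share."""
    return len(content_words(doc_a) & content_words(doc_b))


doc1 = "the mayor of the city announced a budget"
doc2 = "city budget talks stall in the council"
doc3 = "lakers win in overtime"

print(shared_word_count(doc1, doc2))  # the two city-hall stories
print(shared_word_count(doc1, doc3))  # a city-hall story and a sports story
```

Documents that share more words are treated as closer, so a clustering algorithm fed with this measure tends to group articles on the same topic.
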
Clustering of S&P Stock Data
● Observe Stock Movements every day.
● Clustering points: Stock{UP/DOWN}
● Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
● We used association rules to quantify a similarity measure.

Regression
● Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
  ○ Extensively studied in statistics and neural network fields.
● Examples:
  ○ Predicting sales amounts of a new product based on advertising expenditure.
  ○ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  ○ Time series prediction of stock market indices.

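For the sales versus advertising example, a linear model with one predictor can be fitted by ordinary least squares. The sketch below assumes that simplest case; the figures are invented and deliberately lie on a straight line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend and resulting sales (same units).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 6.0 + intercept  # prediction for a spend of 6.0
print(slope, intercept, predicted_sales)
```

Real data will not sit exactly on a line, and a nonlinear dependency needs a different model, as the slide notes.
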
Data Mining: Association Rules Mining
Association Rules
● which products should be kept along with other products
● which two products should never be discounted together

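Questions like these are usually answered with two numbers per rule, support and confidence. The slide does not define them, so the sketch below uses the standard definitions on a handful of invented shopping baskets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of products per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def pair_support(baskets):
    """Support of a pair = fraction of baskets that contain both products."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: count / len(baskets) for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Confidence of antecedent -> consequent = fraction of baskets with
    the antecedent that also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(baskets)
print(support[("bread", "butter")])             # how often the pair occurs
print(confidence(baskets, "bread", "butter"))   # bread buyers who also buy butter
```

A pair with high support and high confidence is a candidate for shelving together; discounting both at once gives away margin on a purchase that would probably have happened anyway.
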
Visualisation : The need to tell a story

Definitions
Data Mining
● Is the process of extracting unknown, valid and actionable information from large databases and using this to make business decisions
● Non trivial process of identifying valid, novel, potentially useful and understandable / explainable patterns in data
Data Science is a rare combination of multiple skills that include
● Technology: obviously!
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles

R : Your first step into Data Science

Try out this free interactive tutorial just now
Statistical Tools

http://r4stats.com/articles/popularity/
Some Comparisons

Map Reduce
● Input: A set of (key, value) pairs
● User supplies two functions
  ○ Map (k,v) => List(k1,v1)
  ○ Reduce (k1, list(v1)) => v2
● Output is the set of (k1,v2) pairs

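The two user-supplied functions can be illustrated with the classic word count, run here in a single Python process. This is a sketch of the programming model only: `run_map_reduce` is a hypothetical stand-in for the grouping step that a real framework performs across many machines, and the documents are made up.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map (k, v) => list of (k1, v1): emit (word, 1) for every word.
    key is the document name (unused here), value is its text."""
    return [(word, 1) for word in value.lower().split()]


def reduce_fn(key, values):
    """Reduce (k1, list(v1)) => v2: add up the counts for one word."""
    return sum(values)


def run_map_reduce(pairs, map_fn, reduce_fn):
    """Apply map to every input pair, group the results by key,
    then apply reduce to each group."""
    grouped = defaultdict(list)
    for k, v in pairs:
        for k1, v1 in map_fn(k, v):
            grouped[k1].append(v1)
    return {k1: reduce_fn(k1, v1_list) for k1, v1_list in grouped.items()}


documents = [("doc1", "big data is big"), ("doc2", "data science")]
word_counts = run_map_reduce(documents, map_fn, reduce_fn)
print(word_counts)
```

Because each call to `map_fn` and each call to `reduce_fn` is independent of the others, the work can be spread over a cluster without changing either function.
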
Hadoop
A programming framework that allows you to run Map-Reduce jobs on a distributed cluster of low cost machines without having to bother about anything except
● the Map and Reduce functions
● loading data into HDFS
1. HIVE
   a. A plug-in that allows one to use SQL-like queries that are converted into map-reduce jobs
2. PIG
   a. A scripting language for writing long queries
3. HBASE
   a. A non-relational DBMS
4. SQOOP
   a. moves data to and from HDFS

Data-in-Flight

JavaScript for Data Visualisation

Business Domain
● Financial Sector
  ○ Risk Management, Credit Scoring
  ○ Predict Customer Spend
  ○ Stock and Investment Analysis
  ○ Loan approval
● Telecom Sector
  ○ Fraud Detection
  ○ Churn Prediction
● Retail and Marketing
  ○ Market segmentation
  ○ Promotional strategy
  ○ Market Basket Analysis
  ○ Trend Analysis
● Healthcare & Insurance
  ○ Fraud Detection
  ○ Drug Development
  ○ Medical Diagnostic Tools

Conclusion
● Why data science?
● Techniques
  ○ Statistics
  ○ Data Mining
  ○ Visualisation
● Tools & Platforms
  ○ R
  ○ Hadoop / MapReduce
  ○ Real Time Systems
● Business Domains
Data Science is a rare combination of multiple skills that include
● Technology: obviously!
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles

Thank You
Contact
Prithwis Mukerjee
Professor, Praxis Business School
http://blog.yantrajaal.com
prithwis@praxis.ac.in
This presentation is accessible at the blog at the following URL
http://bit.ly/pm-datascience
