Introduction to Data Science

Prithwis Mukerjee, PhD
Praxis Business School, Calcutta
prithwis mukerjee, ph.d.
Agenda
● Why data science ?
● Techniques
○ Statistics
○ Data Mining
○ Visualisation
● Tools & Platforms
○ R
○ Hadoop / MapReduce
○ Real Time Systems
● Business Domains

Volume
Data is being acquired from a variety of sources
● EFT in banks, credit card payments
● Cell phones
● Sensors attached to a variety of equipment
● Surveillance cameras, CCTV
● Social media updates
● Blogs
● Websites
Variety / Velocity
● Numeric data
● Structured text data
● Unstructured text data
● Images
● Sound and video recordings
● Graph nodes
○ Social media “friends”
○ Websites linked to each other

Data is being generated fast and is becoming obsolete or useless equally fast
● Realtime (or near realtime) data from sensors, cameras
● Website traffic
● Social media “trends”
So what is Big Data ?
● Volume
● Velocity
● Variety ?

A new term coined by IT vendors to push new technology like
● Map Reduce
● Hadoop
● NoSQL

A new way to
● collect
● store
● manage
● analyse
● visualise data
Big Data is like Crude Oil { not new Oil }
Think of data as crude oil !
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos

But what about refining ?
The Science (and Art ) of Data
Think of data as crude oil !
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos

Data Science : Refining
● Discovering what we do not know about the data
● Obtaining predictive, actionable insight
● Creating data products that have business impact
● Communicating relevant business stories
Two Perspectives

[Figure: skills diagram. Labels: Programming or “Hacking” Skills; Mathematics, Statistics Knowledge; Business Domain Knowledge; Machine Learning; Operations Research; RDBMS, ERP / BI; Data Science]
10 Things {most} Data Scientists do ...
1. Ask good questions
   What is what ? We do not know ! We would like to know
2. Define, test hypotheses, run experiments
3. Scoop, scrape, sample business data
4. Wrestle and tame data
5. Play with data, discover unknowns
6. Create models, algorithms
7. Understand data relationships
8. Tell the machine how to learn from the data
9. Create data products that deliver actionable insights
10. Tell relevant business stories from data
Statistics - World of Data
● Data comes in various types
○ Nominal - colour, gender, PIN code
○ Ordinal - scale of 1-10, {high, medium, low}
○ Interval - dates, temperature (Centigrade)
○ Ratio - length, weight, count
● Data comes in various structures
○ Structured data - nominal, ordinal, interval, ratio
○ Unstructured text - email, tweets, reviews
○ Images, voice prints
○ Graphs, networks - social media friendships, likes
Descriptive Statistics
● Numeric Description
○ Mean, Median, Mode
○ Quartile, Percentile
○ Variance / Standard Deviation
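The numeric descriptions listed on this slide can be tried out with a minimal sketch like the one below. It uses Python's standard `statistics` module; the `observations` list is made-up illustrative data, and the population forms of variance and standard deviation are an assumption (the sample forms `statistics.variance` and `statistics.stdev` may suit other cases better).

```python
import statistics

# Hypothetical observations, invented purely for illustration
observations = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency
mean = statistics.mean(observations)      # 5
median = statistics.median(observations)  # 4.5
mode = statistics.mode(observations)      # 4

# Quartiles: the three cut points that split the data into four groups
quartiles = statistics.quantiles(observations, n=4)

# Percentiles: 99 cut points; index 89 is the 90th percentile
percentiles = statistics.quantiles(observations, n=100)
p90 = percentiles[89]

# Spread, treating the list as the whole population
variance = statistics.pvariance(observations)  # 4
std_dev = statistics.pstdev(observations)      # 2.0

print("mean", mean, "median", median, "mode", mode)
print("quartiles", quartiles, "90th percentile", p90)
print("variance", variance, "standard deviation", std_dev)
```

The second quartile is the median, which is a quick sanity check on any quartile computation.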
Statistics : The Path Ahead

● Probability, Distributions
● Testing of Hypothesis
● Regression, Testing
● Predictive Analysis
Data Mining / Machine Learning
Is the process of obtaining
● novel
● valid
● potentially useful
● understandable
patterns in data

Typical tasks are
● classification
● clustering
● association rules
● sequential patterns
● regression
● deviation detection
Some definitions
Instance (an item or record)
● an observation that is characterised by a number of attributes
○ person - with attributes like age, salary, qualification
○ sale - with product, quantity, price

Attribute
● a measured characteristic of an instance

Class
● the group into which an instance is placed
○ acceptable, not acceptable
○ mammal, fish, bird

Nominal
● colour, PIN code, state

Ordinal
● ranking : tall, medium, short or feedback on a scale of 1 - 10

Ratio
● length, price, duration, quantity

Interval
● date, temperature
Data Mining : Classification
Classification
● Which loan applicant will not default on the loan ?
● Which potential customer will respond to a mailer campaign ?
Classification Example
[Figure: a table of records with columns marked categorical, categorical, continuous and class. Boxes labelled Training Set, Learn Classifier, Model and Test Set]
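The flow named on this slide (training set, learn classifier, model, test set) can be sketched as below. This is only an illustration under stated assumptions: the loan records are invented, and the learner is a very simple one-attribute rule (predict the majority class seen for each attribute value), which the slides do not prescribe. The names `learn_classifier` and `predict` are hypothetical helpers.

```python
from collections import Counter, defaultdict

# Hypothetical records: (employment, owns_home, income_in_thousands, defaulted)
# Two categorical attributes, one continuous attribute, and the class label.
training_set = [
    ("salaried", "yes", 125, "no"),
    ("salaried", "no", 70, "no"),
    ("salaried", "no", 85, "no"),
    ("self-employed", "yes", 120, "no"),
    ("self-employed", "no", 60, "yes"),
    ("self-employed", "no", 100, "no"),
    ("unemployed", "no", 95, "yes"),
    ("unemployed", "no", 75, "yes"),
    ("unemployed", "yes", 220, "no"),
]
test_set = [
    ("salaried", "no", 55, "no"),
    ("unemployed", "no", 80, "yes"),
    ("self-employed", "yes", 110, "no"),
    ("unemployed", "yes", 150, "no"),
]

def learn_classifier(records, attribute_index):
    """Learn one rule per attribute value: predict the majority class seen for it."""
    by_value = defaultdict(Counter)
    overall = Counter()
    for record in records:
        label = record[-1]
        by_value[record[attribute_index]][label] += 1
        overall[label] += 1
    rules = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    default = overall.most_common(1)[0][0]  # used for attribute values never seen
    return rules, default

def predict(model, record, attribute_index):
    rules, default = model
    return rules.get(record[attribute_index], default)

# Learn the model from the training set, using attribute 0 (employment)
model = learn_classifier(training_set, attribute_index=0)

# Apply the model to the test set and measure how often it is right
correct = sum(1 for r in test_set if predict(model, r, 0) == r[-1])
accuracy = correct / len(test_set)
print("rules:", model[0])
print("accuracy on test set:", accuracy)
```

Keeping the test set separate from the training set is the point of the diagram: accuracy is measured on records the learner has not seen.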
Data Mining : Clustering
Given a set of unclassified data points, how to find a natural grouping within them
● Can we segment the market in some way that is not yet known ?
Example of Document Clustering
Clustering points : 3204 articles from the Los Angeles Times
Similarity Measure : How many words are common in these documents (after excluding some common words)
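The similarity measure described above (words in common, after excluding some common words) can be sketched as follows. The three headlines and the `STOP_WORDS` set are invented for illustration, and the helper names are hypothetical. This shows only the similarity measure, not a full clustering algorithm.

```python
from itertools import combinations

# Hypothetical set of common words to exclude
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is", "on"}

def content_words(text):
    """Lower-case the text, split on whitespace, and drop the common words."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}

def common_word_count(doc_a, doc_b):
    """Similarity = number of distinct non-common words shared by both documents."""
    return len(content_words(doc_a) & content_words(doc_b))

# Hypothetical documents standing in for newspaper articles
documents = {
    "a1": "the city council approved a new transit budget",
    "a2": "transit budget cuts anger the city council",
    "a3": "the lakers won the game in overtime",
}

# Similarity of every pair of documents
similarity = {
    (x, y): common_word_count(documents[x], documents[y])
    for x, y in combinations(documents, 2)
}
most_similar = max(similarity, key=similarity.get)
print(similarity)
print("most similar pair:", most_similar)
```

A clustering algorithm would then group documents whose pairwise similarity is high, so the two council articles would land together and the sports article apart.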
Clustering of S&P Stock Data
● Observe stock movements every day.
● Clustering points : Stock-{UP/DOWN}
● Similarity Measure : Two points are more similar if the events described by them frequently happen together on the same day.
● We used association rules to quantify a similarity measure.
Regression
● Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
○ Extensively studied in statistics and in the neural network field.
● Examples:
○ Predicting sales amounts of a new product based on advertising expenditure.
○ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
○ Time series prediction of stock market indices.
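The first example above (sales as a function of advertising expenditure) can be sketched with ordinary least squares, assuming a linear model of dependency with a single predictor. The figures are invented for illustration, and `fit_line` is a hypothetical helper rather than a library function.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend and resulting sales (arbitrary units)
advertising = [1, 2, 3, 4, 5]
sales = [12, 15, 15, 19, 19]

slope, intercept = fit_line(advertising, sales)

# Predict sales for a spend that was not observed
predicted_sales = intercept + slope * 6
print("slope", round(slope, 2), "intercept", round(intercept, 2))
print("predicted sales at spend 6:", round(predicted_sales, 2))
```

For this data the fitted slope is 1.8 and the intercept is 10.6, so a spend of 6 gives a predicted sales figure of 21.4.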
Data Mining : Association Rules Mining
Association Rules
● which products should be kept along with other products
● which two products should never be discounted together
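Questions like these are usually answered by measuring how often products appear together in the same basket. The sketch below computes two standard measures, support and confidence, over a handful of invented transactions; the helper names are hypothetical and no rule-mining library is assumed.

```python
# Hypothetical shopping baskets, one set of products per transaction
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    with_antecedent = [basket for basket in baskets if antecedent <= basket]
    if not with_antecedent:
        return 0.0
    with_both = sum(1 for basket in with_antecedent if consequent <= basket)
    return with_both / len(with_antecedent)

# Rule under test: diapers => beer
rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print("support:", rule_support)        # 3 of 5 baskets
print("confidence:", rule_confidence)  # 3 of the 4 baskets with diapers
```

A rule with high support and high confidence suggests products to shelve together; it also flags pairs where discounting both at once gives away margin on sales that would happen anyway.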
Visualisation : The need to tell a story

Definitions
Data Mining
● Is the process of extracting unknown, valid and actionable information from large databases and using this to make business decisions
● Non-trivial process of identifying valid, novel, potentially useful and understandable / explainable patterns in data

Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles
R : Your first step into Data Science

Try out this free interactive tutorial just now
Statistical Tools

http://r4stats.com/articles/popularity/
Some Comparisons

Map Reduce
● Input : A set of (key, value) pairs
● User supplies two functions
○ Map (k, v) => list(k1, v1)
○ Reduce (k1, list(v1)) => v2
● Output is the set of (k1, v2) pairs
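The contract above can be illustrated with the classic word count, run in a single Python process. This is only a sketch of the programming model, not of Hadoop itself: `run_map_reduce` is a hypothetical driver that stands in for the grouping step a real framework performs across a cluster, and the input documents are invented.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map (k, v) => list(k1, v1): emit (word, 1) for every word in the text."""
    return [(word, 1) for word in value.lower().split()]

def reduce_fn(key, values):
    """Reduce (k1, list(v1)) => v2: add up the counts collected for one word."""
    return sum(values)

def run_map_reduce(inputs, mapper, reducer):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for key, value in inputs:
        for k1, v1 in mapper(key, value):
            grouped[k1].append(v1)  # stands in for the shuffle between map and reduce
    return {k1: reducer(k1, v1_list) for k1, v1_list in grouped.items()}

# Input: a set of (key, value) pairs, here (document name, document text)
inputs = [
    ("doc1", "big data is big"),
    ("doc2", "data science refines data"),
]

# Output: the set of (k1, v2) pairs, here (word, total count)
word_counts = run_map_reduce(inputs, map_fn, reduce_fn)
print(word_counts)
```

Because each call to the map function sees one input pair and each call to the reduce function sees one key, both steps can be spread over many machines without changing the two user-supplied functions.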
Hadoop
A programming framework that allows you to run Map-Reduce jobs on a distributed cluster of low cost machines without having to bother about anything except
● the Map and Reduce functions
● loading data into HDFS

1. HIVE
a. A plug-in that allows one to use SQL-like queries that are converted into map-reduce jobs
2. PIG
a. A scripting language for writing long queries
3. HBASE
a. A non-relational DBMS
4. SQOOP
a. Moves data to and from HDFS
Data-in-Flight

JavaScript for Data Visualisation

Business Domain
● Financial Sector
○ Risk Management, Credit Scoring
○ Predict Customer Spend
○ Stock and Investment Analysis
○ Loan approval
● Telecom Sector
○ Fraud Detection
○ Churn Prediction
● Retail and Marketing
○ Market segmentation
○ Promotional strategy
○ Market Basket Analysis
○ Trend Analysis
● Healthcare & Insurance
○ Fraud Detection
○ Drug Development
○ Medical Diagnostic Tools
Conclusion
● Why data science ?
● Techniques
○ Statistics
○ Data Mining
○ Visualisation
● Tools & Platforms
○ R
○ Hadoop / MapReduce
○ Real Time Systems
● Business Domains

Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles
Thank You
Contact
Prithwis Mukerjee
Professor, Praxis Business School
prithwis@praxis.ac.in

This presentation is accessible at the blog
http://blog.yantrajaal.com
at the following URL
http://bit.ly/pm-datascience
