This document provides an introduction to data science. It discusses why data science is important and covers key techniques like statistics, data mining, and visualization. It also reviews popular tools and platforms for data science like R, Hadoop, and real-time systems. Finally, it discusses how data science can be applied across different business domains such as financial services, telecom, retail, and healthcare.
Agenda
● Why data science?
● Techniques
○ Statistics
○ Data Mining
○ Visualisation
● Tools & Platforms
○ R
○ Hadoop / MapReduce
○ Real Time Systems
● Business Domains
prithwis mukerjee, ph.d.
Volume
Data is being acquired from a variety of sources
● EFT in banks, credit card payments
● Cell phones
● Sensors attached to a variety of equipment
● Surveillance cameras, CCTV
● Social media updates
● Blogs
● Websites
Variety / Velocity
Variety
● Numeric data
● Structured text data
● Unstructured text data
● Images
● Sound and video recordings
● Graph nodes
○ Social media “friends”
○ Websites linked to each other
Velocity
Data is being generated fast and is becoming obsolete or useless equally fast
● Real-time (or near real-time) data from sensors, cameras
● Website traffic
● Social media “trends”
So what is Big Data?
● Volume
● Velocity
● Variety?
A new term coined by IT vendors to push new technology like
● Map Reduce
● Hadoop
● NoSQL
A new way to
● collect
● store
● manage
● analyse
● visualise data
Big Data is like Crude Oil { not new Oil }
Think of data as crude oil!
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos.
But what about refining?
The Science (and Art) of Data
Think of data as crude oil!
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos.
Data Science is the refining
● Discovering what we do not know about the data
● Obtaining predictive, actionable insight
● Creating data products that have business impact
● Communicating relevant business stories
10 Things {most} Data Scientists do ...
1. Ask good questions
○ What is what?
○ We do not know! We would like to know
2. Define and test hypotheses, run experiments
3. Scoop, scrape, sample business data
4. Wrestle and tame data
5. Play with data, discover unknowns
6. Create models, algorithms
7. Understand data relationships
8. Tell the machine how to learn from the data
9. Create data products that deliver actionable insights
10. Tell relevant business stories from data
Statistics - World of Data
● Data comes in various types
○ Nominal - colour, gender, PIN code
○ Ordinal - scale of 1-10, {high, medium, low}
○ Interval - dates, temperature (Centigrade)
○ Ratio - length, weight, count
● Data comes in various structures
○ Structured data - nominal, ordinal, interval, ratio
○ Unstructured text - email, tweets, reviews
○ Images, voice prints
○ Graphs, networks - social media friendships, likes
Data Mining / Machine Learning
Is the process of obtaining
● novel
● valid
● potentially useful
● understandable
patterns in data
Typical tasks are
● classification
● clustering
● association rules
● sequential patterns
● regression
● deviation detection
Some definitions
Instance (an item or record)
● an observation that is characterised by a number of attributes
○ person - with attributes like age, salary, qualification
○ sale - with product, quantity, price
Attribute
● a measured characteristic of an instance
Class
● the group an instance is assigned to, such as
○ acceptable, not acceptable
○ mammal, fish, bird
Nominal
● colour, PIN code, state
Ordinal
● ranking: tall, medium, short, or feedback on a scale of 1 - 10
Ratio
● length, price, duration, quantity
Interval
● date, temperature
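As a minimal sketch of these definitions, the snippet below models a few instances of a hypothetical `Person` record and summarises each attribute with a statistic that suits its scale: the mode for a nominal attribute, the median for an ordinal one, and the mean for a ratio one. The field names, the 1-3 coding of qualification and all values are invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean, median, mode


@dataclass
class Person:
    """One instance: an observation described by a set of attributes."""
    pin_code: str       # nominal attribute
    qualification: int  # ordinal attribute, coded 1 (lowest) to 3 (highest)
    age: int            # ratio attribute
    label: str          # class the instance belongs to


# Hypothetical records, not real data
people = [
    Person("700001", 2, 34, "acceptable"),
    Person("700001", 3, 45, "acceptable"),
    Person("110001", 1, 23, "not acceptable"),
]

# Nominal values can only be counted, so the mode is the natural summary
most_common_pin = mode([p.pin_code for p in people])
# Ordinal values can be ranked, so the median is meaningful
median_qualification = median([p.qualification for p in people])
# Ratio values support arithmetic, so the mean is meaningful
mean_age = mean([p.age for p in people])
```

Taking the mean of a PIN code would run without error if the codes were stored as numbers, which is why it helps to record the scale of each attribute explicitly.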
Data Mining : Classification
Classification
● Which loan applicant will not default on the loan?
● Which potential customer will respond to a mailer campaign?
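One way to picture the loan question is a nearest-neighbour classifier, sketched below under stated assumptions: the two attributes (income and existing debt), the four training instances and the class names are all hypothetical, and nearest-neighbour is only one of many classification techniques.

```python
import math

# Hypothetical training instances: (annual income, existing debt) in thousands,
# each labelled with the class observed after the loan was granted.
training = [
    ((80.0, 5.0), "repaid"),
    ((75.0, 10.0), "repaid"),
    ((30.0, 25.0), "defaulted"),
    ((25.0, 30.0), "defaulted"),
]


def classify(applicant, examples=training):
    """Return the class of the training instance closest to the applicant."""
    _, label = min(examples, key=lambda ex: math.dist(ex[0], applicant))
    return label


likely_good = classify((70.0, 8.0))   # closest to the (75, 10) instance
likely_bad = classify((28.0, 27.0))   # closest to the (30, 25) instance
```

The important point is that the classes are known in advance and the model learns from labelled examples.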
Data Mining : Clustering
Given a set of unclassified data points, how to find a natural grouping within them
● Can we segment the market in some way that is not yet known?
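A basic k-means loop is one common way to look for such a grouping; the sketch below is illustrative rather than the method used in the presentation. The customer coordinates are invented, and starting from the first k points is a simplification (real implementations usually pick starting centroids more carefully).

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters with a basic k-means loop."""
    centroids = list(points[:k])  # naive start: use the first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:  # an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
    return clusters


# Hypothetical customers described by two numeric attributes
customers = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.5, 9.5), (1.0, 1.5), (9.0, 8.0)]
segments = k_means(customers, k=2)
```

Unlike classification, no labels are supplied: the two segments emerge only from how close the points are to each other.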
Example of Document Clustering
Clustering points: 3204 articles from the Los Angeles Times
Similarity measure: how many words are common in these documents (after excluding some common words)
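The similarity measure described above can be sketched as a count of distinct shared words. The stop-word list and the sample sentences below are hypothetical, and splitting on whitespace is a simplification of real text processing.

```python
# Illustrative list of common words to exclude; a real list would be much longer
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}


def shared_words(doc_a, doc_b, stop_words=STOP_WORDS):
    """Count distinct words two documents share, ignoring common words."""
    words_a = set(doc_a.lower().split()) - stop_words
    words_b = set(doc_b.lower().split()) - stop_words
    return len(words_a & words_b)


article_1 = "The mayor opened the new stadium"
article_2 = "A new stadium is planned in the city"
article_3 = "The stock market fell"

similar = shared_words(article_1, article_2)    # shares "new" and "stadium"
unrelated = shared_words(article_1, article_3)  # shares nothing but "the"
```

A clustering algorithm would then place documents with many shared words in the same group.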
Clustering of S&P Stock Data
● Observe stock movements every day.
● Clustering points: Stock{UP/DOWN}
● Similarity measure: two points are more similar if the events described by them frequently happen together on the same day.
● We used association rules to quantify a similarity measure.
Regression
● Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
○ Extensively studied in statistics and neural network fields.
● Examples:
○ Predicting sales amounts of a new product based on advertising expenditure.
○ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
○ Time series prediction of stock market indices.
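For the linear case with a single predictor, the sales example can be sketched with an ordinary least squares fit. The advertising and sales figures below are made up and lie exactly on a straight line so the result is easy to check; real data would be noisy.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical figures: sales = 1 + 2 * advertising spend
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(ad_spend, sales)
predicted_sales = intercept + slope * 5.0  # forecast for a spend of 5
```

The fitted line is then used to predict the continuous target for values of the predictor that were not observed.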
Data Mining : Association Rules Mining
Association Rules
● Which products should be kept along with other products?
● Which two products should never be discounted together?
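Questions like these are usually answered with two standard measures, support and confidence, computed over past transactions. The sketch below uses four invented shopping baskets; the product names are placeholders.

```python
# Hypothetical transactions: each basket is the set of products bought together
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]


def support(items, transactions=baskets):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent, consequent, transactions=baskets):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


pair_support = support({"bread", "butter"})          # 2 of 4 baskets
rule_confidence = confidence({"bread"}, {"butter"})  # 2 of the 3 bread baskets
```

A rule with high support and high confidence suggests products that could be shelved together, or that need not both be discounted.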
Definitions
Data Mining
● Is the process of extracting unknown, valid and actionable information from large databases and using this to make business decisions
● Non-trivial process of identifying valid, novel, potentially useful and understandable / explainable patterns in data
Data Science is a rare combination of multiple skills that include
● Technology: obviously!
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles
Map Reduce
● Input: a set of (key, value) pairs
● User supplies two functions
○ Map (k,v) => List(k1,v1)
○ Reduce (k1, list(v1)) => v2
● Output is the set of (k1,v2) pairs
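The Map and Reduce signatures above can be illustrated with the classic word count, run here in a single Python process. This is only a sketch of the programming model: `map_reduce` is a hypothetical helper that stands in for the grouping a distributed framework performs between the two phases, and the input lines are invented.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map (k, v) => list of (k1, v1): emit (word, 1) for each word in a line."""
    return [(word, 1) for word in value.lower().split()]


def reduce_fn(key, values):
    """Reduce (k1, list(v1)) => v2: total count for one word."""
    return sum(values)


def map_reduce(pairs, mapper, reducer):
    """Run mapper over all input pairs, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for k, v in pairs:
        for k1, v1 in mapper(k, v):
            grouped[k1].append(v1)  # "shuffle": collect all values for a key
    return {k1: reducer(k1, v1s) for k1, v1s in grouped.items()}


# Input is a set of (key, value) pairs: here (line number, line text)
lines = [(0, "big data is big"), (1, "data science")]
counts = map_reduce(lines, map_fn, reduce_fn)
```

Because each call to `map_fn` and `reduce_fn` depends only on its own arguments, a framework is free to run those calls on different machines.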
Hadoop
A programming framework that allows you to run Map-Reduce jobs on a distributed cluster of low-cost machines without having to bother about anything except
● the Map and Reduce functions
● loading data into HDFS
1. HIVE
a. A plug-in that allows one to use SQL-like queries that are converted into map-reduce jobs
2. PIG
a. A scripting language for writing long queries
3. HBASE
a. A non-relational DBMS
4. SQOOP
a. Moves data to and from HDFS
Conclusion
● Why data science?
● Techniques
○ Statistics
○ Data Mining
○ Visualisation
● Tools & Platforms
○ R
○ Hadoop / MapReduce
○ Real Time Systems
● Business Domains
Thank You
Contact
Prithwis Mukerjee
Professor, Praxis Business School
[email protected]
This presentation is accessible at the blog
http://blog.yantrajaal.com
at the following URL
http://bit.ly/pm-datascience