DATA ANALYSIS
CHARAK RAY
libra.charak@gmail.com
COURSE CONTENTS
•Core Data Analysis
• 1D analysis
• 2D analysis: both quantitative
• 2D analysis: both nominal
• Learning multivariate correlation
• Principal components (PCA) and SVD: Mathematical foundations
• Principal components (PCA) and SVD: Applications
• Clustering with k-means
INTRO: WHAT IS CORE DATA ANALYSIS?
Four main parts
1. Data Mining and data patterns and their use
2. Core data analysis: two main goals for Knowledge Enhancing
3. Visualization: How it works
4. Illustrative data cases
INTRO: DATA MINING AND DATA PATTERNS AND THEIR USE
•Is it Data Mining?
• Well, what is Data Mining?
• Generically, Data Mining is looking for (i) patterns in data stored in (ii) Databases as part of (iii) Knowledge Discovery
• Core data analysis is not concerned with (ii) Databases
• Core data analysis is concerned with (ia) specific patterns in data as part of (iia) Knowledge Enhancing
INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 1
The History of Laws for planetary motion
Double success
Ptolemy (c. 150 AD):
• Sun and planets circle Earth
• Does not match data well
INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 2
The History of Laws for planetary motion
• Copernicus (c. 1540):
• Planets circle Sun
• Does not match data well either
INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 3
Laws for planetary motion:
Kepler (c. 1605):
• 1st Law: Planets revolve around the Sun in ellipses (ovals)
• 2nd Law: Speed changes: the further away from the Sun, the slower
• Does match data well
INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 4
Planet     Period (year)   Distance (average, relative to that of Earth)
Mercury        0.241            0.39
Venus          0.615            0.72
Earth          1.00             1.00
Mars           1.88             1.52
Jupiter       11.8              5.20
Saturn        29.5              9.54
Uranus        84.0             19.18
Neptune      165               30.06
Pluto        248               39.44
3rd Law: Is there any relation between speed/period and distance?
INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 5
Kepler’s 3rd Law: Is there any relation between speed/period and distance?
No straight line fits…
INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 6
Kepler’s 3rd Law (1619):
[J. Napier invented logarithms (1614)]
$\log P = \frac{3}{2} \log D$, that is, $P^2 = D^3$
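The log-log relation can be checked against the planet table from the earlier slide. The sketch below is only an illustration, not the course's own code: it fits a least-squares line to log(P) against log(D) using the nine tabulated planets. The names `PLANETS` and `fit_line` are made up for this example.

```python
import math

# (period in years, average distance relative to Earth), as tabulated on the slide.
PLANETS = {
    "Mercury": (0.241, 0.39),
    "Venus": (0.615, 0.72),
    "Earth": (1.00, 1.00),
    "Mars": (1.88, 1.52),
    "Jupiter": (11.8, 5.20),
    "Saturn": (29.5, 9.54),
    "Uranus": (84.0, 19.18),
    "Neptune": (165.0, 30.06),
    "Pluto": (248.0, 39.44),
}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


log_d = [math.log(d) for _, d in PLANETS.values()]
log_p = [math.log(p) for p, _ in PLANETS.values()]
slope, intercept = fit_line(log_d, log_p)

# A slope close to 3/2 and an intercept close to 0 correspond to P^2 = D^3.
print(f"log(P) = {slope:.3f} * log(D) + {intercept:.3f}")
```

The raw (D, P) points do not lie on a straight line, but their logarithms do, which is why the availability of logarithms mattered for spotting the pattern.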
INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 7
Kepler’s three laws: What is so grand?
Substantiated theoretically by
R. Hooke (1635-1703) and I. Newton (1642-1727)
UNIVERSAL GRAVITATION LAW
Mathematical equation, cornerstone of modern science
INTRO: EXAMPLE OF PATTERN
FAILURE? 1
Imagine this:
Broad Street, Soho, London: cholera outbreak, September 1854
Dr. Snow’s report: “On proceeding to the spot, I found that nearly all the deaths had taken place within a short distance of the pump.”
Dr John Snow’s map: deaths are marked by ticks. The handle of the pump was removed on 7/9/1854.
INTRO 1: EXAMPLE OF PATTERN
FAILURE? 2
Myth: Deaths stopped. Data analysis won.
Fact: Data analysis lost. The health commission rejected the water pump theory as contradicting the science of the day (a cholera outbreak was held to be caused by “concentrated noxious atmospheric influence, no doubt emanating from putrefying organic matter”). The pump handle was ordered to be put back. Deaths stopped because those exposed had already died.
More deaths occurred in further cholera outbreaks until R. Koch identified and publicized Vibrio cholerae in 1883.
Dr John Snow’s map: each death is marked by a tick.
PATTERN FOUND
Success: if compatible with existing knowledge
Failure: if not compatible with existing knowledge
Advice
• Find a pattern
• Interpret using existing knowledge
• Care not whether interpretation is compatible
INTRO: WHAT IS CORE DATA ANALYSIS II 1
• Core data analysis is concerned with (ia) specific patterns in data as part of (iia) Knowledge Enhancing
• What are these (ia), (iia) specifics?
• They have something to do with the notion of Knowledge
• Statements of fact (“I teach this class.”): factual
• Statements of pattern, regularity (“Professors usually teach classes.”): structural
INTRO: WHAT IS CORE DATA ANALYSIS II 2
• Core data analysis is concerned with (ia) specific patterns in data as part of (iia) Knowledge Enhancing
• (ia), (iia) specifics relate to elements of structural knowledge
• Elements of Structural knowledge:
• Concepts (“Professor”, “Teach”, “Class”)
• Statements of relation between concepts (“Professors usually teach classes.”)
INTRO: WHAT IS CORE DATA ANALYSIS II
•List elements of structural knowledge, concepts and statements of relation among them, for
•Kepler’s 3rd Law
•Dr Snow’s cholera outbreak map
INTRO: WHAT IS CORE DATA ANALYSIS II 3
• Core data analysis is concerned with (ia) deriving concepts and statements of relation between them from data
• (iia) Structural Knowledge Enhancing, generically, via either of the two pathways
• Two pathways for Structural Knowledge Enhancing
• Summarization: Developing Concepts
• Correlation: Deriving Statements of relation between concepts
W1. INTRO: WHAT IS CORE DATA ANALYSIS II 4
• Two pathways for Structural Knowledge Enhancing
• Summarization: Developing Concepts
• Correlation: Deriving Statements of relation between concepts
• Two major formats:
• Quantitative (both concepts and statements)
• Kepler’s 3rd Law: $\text{Period}^2 = \text{Distance}^3$
• Categorical (both concepts and statements)
• Dr Snow’s conclusion: Cholera death is caused by pump water
INTRO II: STRUCTURAL KNOWLEDGE ENHANCING GENERIC METHODS
•Two pathways & Two formats
• Summarization methods:
• Quantitative: Principal component analysis (PCA)
• Categorical: Cluster analysis
• Correlation methods:
• Quantitative: Regression
• Categorical: Classifier
INTRO II: THREE POSSIBLE LAYERS OF STUDY
Layer        Pro                               Con
• Systems    Usable now, Simple                Short lived, Too many
• Concepts   Awareness                         Superficial
• Methods    Workable, Extendable, Long-term   Technical, Boring
INTRO II: COURSE CONTENTS REVIEW
•Summarization: PCA (Weeks 6 and 7), Cluster analysis (Week 8)
•Correlation: Classifier (Week 5) (no Regression, sorry; if needed, go to Statistics, Econometrics and Neural Networks courses)
•Prequel: 1D and 2D analyses to study basic concepts and basic methods
•Pre-prequel: Intro – Data and problems
INTRO II: RELATION TO OTHER APPROACHES
• Classical mathematical statistics: data is just a vehicle to fit and test mathematical models in the applied domain (say, in data analysis a feature is a column in a table, whereas statisticians model it as a random variable!)
• Machine Learning: prediction rules are to be built incrementally (say, here PCA is a major method; in machine learning it is just a method to preprocess the data)
• Data Mining: adding new knowledge by finding interesting patterns in databases, which is the initial stage of knowledge discovery (CDA is part of that, apart from the databases)
OVERALL: METHODS ARE THE SAME, PERSPECTIVES DIFFER
INTRO III: VISUALIZATION
• Visualization of data is an important activity assisting data analysis by a human in many ways, including
A. Highlighting
B. Integrating different aspects
C. Manipulating (not shown)
A few examples follow.
INTRO III: VISUALIZATION
A. Highlighting 1
Figure 1. A fragment of a London Tube map made after H. Beck (1931); the central part is highlighted by disproportionate scaling. Initially rejected by the authorities, the design has since become a standard for metro maps worldwide.
INTRO III: VISUALIZATION
A. Highlighting 2: Cheating by distortion
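Figure 2. A decline in the relative number of general practitioners in California in the 1970s is visualized by scaling the height (1D size) of a picture of a doctor in proportion to the numbers. Since the picture's 2D area shrinks much faster than its height, the decline looks larger than it is.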
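The distortion in Figure 2 comes down to simple arithmetic: when a picture is scaled by its height with the aspect ratio kept, its area changes with the square of the scaling factor. The sketch below illustrates this with a hypothetical ratio of 0.5 (not the figure's actual data); both function names are made up for this example.

```python
def perceived_area_ratio(value_ratio: float) -> float:
    """Area ratio of two same-shape pictures whose heights are in `value_ratio`."""
    return value_ratio ** 2


def honest_height_ratio(value_ratio: float) -> float:
    """Height ratio that makes the picture areas proportional to the values."""
    return value_ratio ** 0.5


# Hypothetical example: the later value is half of the earlier one.
value_ratio = 0.5
print(perceived_area_ratio(value_ratio))  # area shrinks to a quarter: 0.25
print(honest_height_ratio(value_ratio))   # height ratio that keeps areas honest
```

A halving of the quantity thus shows up as a picture with only a quarter of the ink, unless the height is scaled by the square root of the ratio instead.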
INTRO III: VISUALIZATION
A. Highlighting 3: Cheating by distortion
Figure 3. Another unintended distortion: a newspaper’s self-satisfaction report (July 2005) is visualized with bars that grow from the 500,000 mark rather than from 0. A 25% advantage has visually grown ten-fold!
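The effect of a truncated baseline can be reproduced with a small calculation. The figures below are hypothetical, chosen only so that one value exceeds the other by 25%; they are not the newspaper's actual numbers. The function name `visual_ratio` is likewise made up for this example.

```python
def visual_ratio(a: float, b: float, baseline: float = 0.0) -> float:
    """Ratio of the drawn bar lengths when both bars start at `baseline`."""
    if min(a, b) <= baseline:
        raise ValueError("both values must exceed the baseline")
    return (a - baseline) / (b - baseline)


# Hypothetical values: the leader is 25% above the rival.
rival = 515_000
leader = 643_750

print(visual_ratio(leader, rival))                    # bars from 0: true ratio, 1.25
print(visual_ratio(leader, rival, baseline=500_000))  # bars from 500,000: close to ten-fold
```

The closer the baseline sits to the smaller value, the more the visual ratio inflates, so the same 25% gap can be made to look arbitrarily large.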
INTRO III: VISUALIZATION
B. Integrating aspects 1
Figure 4. Con Edison’s power grid screen over Manhattan, NY. Grid repair problems are dealt with on the fly by dispatching operators as soon as disorders are seen on the screen.
INTRO III: VISUALIZATION
B. Integrating aspects 2
Figure 5. Minard’s (1869) depiction of Napoleon’s lost 1812 campaign, integrating space, time and the strength of the French army.
INTRO III: VISUALIZATION
B. Integrating aspects 3
Figure 6. The structure of research activities of CENTRIA (UNL, Lisbon) in 2007 represented over the ACM Computing Classification System 1998.
INTRO IV: ILLUSTRATIVE DATA CASES
Case 1: Companies 1
Company name   Income, $mln   MShare, %   NSup   EC    Sector
Aversiona          19.0          43.7       2    No    Utility
Antyops            29.4          36.0       3    No    Utility
Astonite           23.9          38.0       3    No    Industrial
Bayermart          18.4          27.9       2    Yes   Utility
Breaktops          25.7          22.3       3    Yes   Industrial
Bumchista          12.1          16.9       2    Yes   Industrial
Civiok             23.9          30.2       4    Yes   Retail
Cyberdam           27.2          58.0       5    Yes   Retail
Companies characterized by mixed scale features; the first three companies make product A, the next three make product B, and the last two make product C.
Metadata: A. Features and Domain knowledge
1) Income, $ Mln;
2) MShare - Market share, per cent;
3) NSup - Number of principal suppliers;
4) ECommerce - Yes e-trade or No;
5) Sector - (a) Retail, (b) Utility, and (c) Industrial.
B. Main production (A,B,C)
C. Feature scale types (3 main types)
INTRO IV: ILLUSTRATIVE DATA CASES
Case 1: Companies 2
Metadata: A. Features and Domain knowledge
1) Income, $ Mln;
2) Mshare - Market share , per cent;
3) NSup - Number of principal suppliers;
4) ECommerce - Yes e-trade or No;
5) Sector - (a) Retail, (b) Utility, and (c) Industrial.
Feature: Maps entities to feature values (Synonyms: Variable, Attribute, Character, Parameter)
Feature. Quantitative scale: Arithmetic averaging makes sense
Examples: 1) Income, 2) Mshare, 3) NSup
INTRO IV: ILLUSTRATIVE DATA CASES
Case 1: Companies 3
Metadata: A. Features and Domain knowledge
1) Income, $ Mln;
2) Mshare - Market share , per cent;
3) NSup - Number of principal suppliers;
4) ECommerce - Yes e-trade or No;
5) Sector - (a) Retail, (b) Utility, and (c) Industrial.
Feature. Nominal scale: Disjoint categories; only the comparison “equal or not” makes sense (a special case of categorical scales)
Example: 5) Sector (Retail, Utility, Industrial are its values)
Feature. Binary scale: Two disjoint categories, “Yes” and “No”; shares properties of the nominal scale and, if 1/0 coded, of the quantitative scale
Example: 4) ECommerce
INTRO IV: QUANTITATIVE CODING
Case 1: Companies 4
Quantitative coding: Each category is made into a 1/0 binary (dummy) feature “Does it hold? 1 if Yes, 0 if No.”
Entity   Income   MShare   NSup   EC?   Util?   Indu?   Retail?
1         19.0     43.7     2      0      1       0        0
2         29.4     36.0     3      0      1       0        0
3         23.9     38.0     3      0      0       1        0
4         18.4     27.9     2      1      1       0        0
5         25.7     22.3     3      1      0       1        0
6         12.1     16.9     2      1      0       1        0
7         23.9     30.2     4      1      0       0        1
8         27.2     58.0     5      1      0       0        1
Company data 8x5 converted to the quantitative format 8x7
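The 8x5 to 8x7 conversion can be sketched in a few lines. This is a minimal illustration using the company table from the slides; the names `COMPANIES`, `SECTORS` and `encode` are made up for this example and are not part of the course material.

```python
# Company data as on the slide: (name, Income, MShare, NSup, EC, Sector).
COMPANIES = [
    ("Aversiona", 19.0, 43.7, 2, "No", "Utility"),
    ("Antyops", 29.4, 36.0, 3, "No", "Utility"),
    ("Astonite", 23.9, 38.0, 3, "No", "Industrial"),
    ("Bayermart", 18.4, 27.9, 2, "Yes", "Utility"),
    ("Breaktops", 25.7, 22.3, 3, "Yes", "Industrial"),
    ("Bumchista", 12.1, 16.9, 2, "Yes", "Industrial"),
    ("Civiok", 23.9, 30.2, 4, "Yes", "Retail"),
    ("Cyberdam", 27.2, 58.0, 5, "Yes", "Retail"),
]

# Column order of the dummy features, matching Util?, Indu?, Retail?.
SECTORS = ("Utility", "Industrial", "Retail")


def encode(row):
    """Turn one company row into 7 numbers: 3 quantitative + EC? + 3 sector dummies."""
    _name, income, mshare, nsup, ecommerce, sector = row
    ec_flag = 1 if ecommerce == "Yes" else 0
    sector_flags = [1 if sector == s else 0 for s in SECTORS]
    return [income, mshare, nsup, ec_flag, *sector_flags]


coded = [encode(row) for row in COMPANIES]
for entity, values in enumerate(coded, start=1):
    print(entity, values)

# Averaging a 1/0 column is meaningful: it is the proportion of "Yes" entities.
ec_share = sum(values[3] for values in coded) / len(coded)
print(ec_share)  # 5 of 8 companies use e-commerce: 0.625
```

The last two lines show why a 1/0-coded binary feature behaves like a quantitative one: its mean is simply the share of entities in the category.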
INTRO IV: ILLUSTRATIVE DATA CASES
Case 1: Companies 5
Data analysis:
• How to map companies to the screen with their similarity reflected in distances between points? (Summarization/visualization)
• Would clustering of companies reflect the product? What features would be involved then? (Summarization)
• Can rules be derived to predict the product for another company, from outside the table? (Correlation)
• Is there any relation between the structural features (NSup, EC, Sector) and the market-related features (Income, MShare)? (Correlation)
INTRO IV: ILLUSTRATIVE DATA CASES
Case 2: Iris 1
Anderson–Fisher Iris data (150x4) on three taxa:
Specimen   Taxon
1-50       Iris setosa (diploid)
51-100     Iris versicolor (tetraploid)
101-150    Iris virginica (hexaploid)
Features
W1 Sepal length
W2 Sepal width
W3 Petal length
W4 Petal width
INTRO IV: DATA CASES
Case 2: Iris 2
      I Iris setosa          II Iris versicolor     III Iris virginica
#     w1   w2   w3   w4      w1   w2   w3   w4      w1   w2   w3   w4
1     5.1  3.5  1.4  0.3     6.4  3.2  4.5  1.5     6.3  3.3  6.0  2.5
2     4.4  3.2  1.3  0.2     5.5  2.4  3.8  1.1     6.7  3.3  5.7  2.1
3     4.4  3.0  1.3  0.2     5.7  2.9  4.2  1.3     7.2  3.6  6.1  2.5
4     5.0  3.5  1.6  0.6     5.7  3.0  4.2  1.2     7.7  3.8  6.7  2.2
5     5.1  3.8  1.6  0.2     5.6  2.9  3.6  1.3     7.2  3.0  5.8  1.6
6     4.9  3.1  1.5  0.2     7.0  3.2  4.7  1.4     7.4  2.8  6.1  1.9
7     5.0  3.2  1.2  0.2     6.8  2.8  4.8  1.4     7.6  3.0  6.6  2.1
8     4.6  3.2  1.4  0.2     6.1  2.8  4.7  1.2     7.7  2.8  6.7  2.0
9     5.0  3.3  1.4  0.2     4.9  2.4  3.3  1.0     6.2  3.4  5.4  2.3
50    5.1  3.5  1.4  0.2     6.0  2.2  4.0  1.0     6.5  3.2  5.1  2.0
Data analysis
• Visualise the data so that similar specimens are mapped into points that are near each other, and dissimilar ones into points that are far apart
• Build a predictor of sepal sizes from the petal sizes (to lessen the burden of measurement)
• Build a predictor of taxa (classifier) based on the petal/sepal sizes
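To make the classifier task concrete, here is a minimal sketch of one very simple rule, a nearest-centroid classifier. It is not the method taught in the course, only an illustration. It uses just the 30 specimens listed in the table on this slide, and the three query specimens at the end are hypothetical.

```python
import math

# The 30 specimens shown on the slide: (w1, w2, w3, w4) per taxon.
SAMPLE = {
    "setosa": [
        (5.1, 3.5, 1.4, 0.3), (4.4, 3.2, 1.3, 0.2), (4.4, 3.0, 1.3, 0.2),
        (5.0, 3.5, 1.6, 0.6), (5.1, 3.8, 1.6, 0.2), (4.9, 3.1, 1.5, 0.2),
        (5.0, 3.2, 1.2, 0.2), (4.6, 3.2, 1.4, 0.2), (5.0, 3.3, 1.4, 0.2),
        (5.1, 3.5, 1.4, 0.2),
    ],
    "versicolor": [
        (6.4, 3.2, 4.5, 1.5), (5.5, 2.4, 3.8, 1.1), (5.7, 2.9, 4.2, 1.3),
        (5.7, 3.0, 4.2, 1.2), (5.6, 2.9, 3.6, 1.3), (7.0, 3.2, 4.7, 1.4),
        (6.8, 2.8, 4.8, 1.4), (6.1, 2.8, 4.7, 1.2), (4.9, 2.4, 3.3, 1.0),
        (6.0, 2.2, 4.0, 1.0),
    ],
    "virginica": [
        (6.3, 3.3, 6.0, 2.5), (6.7, 3.3, 5.7, 2.1), (7.2, 3.6, 6.1, 2.5),
        (7.7, 3.8, 6.7, 2.2), (7.2, 3.0, 5.8, 1.6), (7.4, 2.8, 6.1, 1.9),
        (7.6, 3.0, 6.6, 2.1), (7.7, 2.8, 6.7, 2.0), (6.2, 3.4, 5.4, 2.3),
        (6.5, 3.2, 5.1, 2.0),
    ],
}


def centroid(rows):
    """Feature-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


CENTROIDS = {taxon: centroid(rows) for taxon, rows in SAMPLE.items()}


def classify(specimen):
    """Return the taxon whose centroid is nearest in Euclidean distance."""
    return min(CENTROIDS, key=lambda taxon: math.dist(specimen, CENTROIDS[taxon]))


# Hypothetical new specimens, not taken from the data set.
print(classify((5.0, 3.4, 1.5, 0.2)))
print(classify((6.0, 2.9, 4.5, 1.5)))
print(classify((7.0, 3.1, 6.0, 2.2)))
```

Each taxon is summarized by its mean specimen (a concept, in the deck's terms), and the rule "assign to the nearest mean" is a statement of relation between the measurements and the taxon.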
INTRO IV: DATA CASES
Case 3: Intrusion attack 1
Features
1) Pr, the protocol type, which is either tcp, icmp or udp (a nominal feature),
2) BySD, the number of data bytes from source to destination,
3) SH, the number of connections to the same host as the current one in the past two seconds,
4) SS, the number of connections to the same service as the current one in the past two seconds,
5) SE, the rate of connections (per cent in SHCo) that have SYN errors,
6) RE, the rate of connections (per cent in SHCo) that have REJ errors,
7) A, the type of attack (ap - apache, sa - saint, sm - smurf, and no attack) (a nominal feature).
Pr    BySD    SH   SS   SE     RE     A
Tcp   62344   16   16   0      0.94   Ap
Tcp   60884   17   17   0.06   0.88   Ap
Tcp   59424   18   18   0.06   0.89   Ap
Tcp   59424   19   19   0.05   0.89   Ap
Tcp   59424   20   20   0.05   0.9    Ap
Tcp   75484   21   21   0.05   0.9    Ap
Tcp   287     14   14   0      0      no
Tcp   308     1    1    0      0      no
Tcp   284     5    5    0      0      no
Udp   105     2    2    0      0      no
Udp   105     2    2    0      0      no
Udp   105     2    2    0      0      no
INTRO IV: DATA CASES
Case 3: Intrusion attack 2
Data analysis
• Build a classifier to judge whether the system functions normally or is under attack (Correlation);
• Is there any relation between the protocol and the type of attack? (Correlation);
• Visualize the data reflecting similarity of the patterns (Summarization).
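For two nominal features such as protocol and attack type, a natural first step is a cross-classification (contingency) table of counts. The sketch below builds one from the 12 connections shown on the slides. With so few rows it only illustrates the computation and says nothing reliable about the full data set; the names `ROWS`, `counts` and `attack_share` are made up for this example.

```python
from collections import Counter

# (Pr, A) pairs for the 12 connections shown on the slide.
ROWS = [("Tcp", "Ap")] * 6 + [("Tcp", "no")] * 3 + [("Udp", "no")] * 3

# Cross-classification: how often each (protocol, attack) pair occurs.
counts = Counter(ROWS)
protocol_totals = Counter(protocol for protocol, _ in ROWS)


def attack_share(protocol: str, attack: str) -> float:
    """Share of connections with `protocol` whose attack label is `attack`."""
    return counts[(protocol, attack)] / protocol_totals[protocol]


for protocol in sorted(protocol_totals):
    print(protocol, {a: counts[(protocol, a)] for a in ("Ap", "no")})

print(attack_share("Tcp", "Ap"))  # 6 of the 9 Tcp rows are apache attacks
print(attack_share("Udp", "Ap"))  # none of the 3 Udp rows is an attack: 0.0
```

If the attack shares differ noticeably between protocols, as they do in this small sample, the two features are related.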
TOPICS COVERED:
1. Data Mining and data patterns and their use: if you find a pattern, interpret it!
2. Knowledge Enhancing: summarize to concepts, correlate to statements of relation.
3. Visualize: to highlight or integrate aspects.
4. Illustrative data cases: concept of feature, feature scale, data table, data analysis problem.
THANK YOU…

More Related Content

What's hot (20)

PPT
Data management and analysis
ILRI
 
PPTX
Data analysis
Yusuf Khan
 
PPTX
Exploratory data analysis
Vishwas N
 
PPTX
Data visualization
Jan Willem Tulp
 
PDF
Foundations of analytics.ppt
Surekha98
 
PPTX
Data Visualization - A Brief Overview
Rotary Club of North Raleigh
 
PPTX
Identification of Research Gaps Through Literature Review.pptx
EricBalan1
 
PPTX
Data analysis
Carthikvinay1
 
PPTX
Univariate & bivariate analysis
sristi1992
 
PPT
Introduction to statistics
Kapil Dev Ghante
 
PPTX
Data analysis
Mira K Desai
 
PPTX
Introduction to Data Visualization
Stephen Tracy
 
PDF
Exploratory data analysis data visualization
Dr. Hamdan Al-Sabri
 
PPT
Quant Vs Qual Research
guesta861fa
 
PPT
Data analysis powerpoint
Sarah Hallum
 
PDF
Data Visualization
javaidsameer123
 
PPT
Quantitative data analysis - John Richardson
OUmethods
 
PPTX
Predictive Analytics - An Overview
MachinePulse
 
PPTX
"A basic guide to SPSS"
Bashir7576
 
PDF
Statistics For Data Science | Statistics Using R Programming Language | Hypot...
Edureka!
 
Data management and analysis
ILRI
 
Data analysis
Yusuf Khan
 
Exploratory data analysis
Vishwas N
 
Data visualization
Jan Willem Tulp
 
Foundations of analytics.ppt
Surekha98
 
Data Visualization - A Brief Overview
Rotary Club of North Raleigh
 
Identification of Research Gaps Through Literature Review.pptx
EricBalan1
 
Data analysis
Carthikvinay1
 
Univariate & bivariate analysis
sristi1992
 
Introduction to statistics
Kapil Dev Ghante
 
Data analysis
Mira K Desai
 
Introduction to Data Visualization
Stephen Tracy
 
Exploratory data analysis data visualization
Dr. Hamdan Al-Sabri
 
Quant Vs Qual Research
guesta861fa
 
Data analysis powerpoint
Sarah Hallum
 
Data Visualization
javaidsameer123
 
Quantitative data analysis - John Richardson
OUmethods
 
Predictive Analytics - An Overview
MachinePulse
 
"A basic guide to SPSS"
Bashir7576
 
Statistics For Data Science | Statistics Using R Programming Language | Hypot...
Edureka!
 

Similar to DATA ANALYSIS (20)

PDF
Data Science in 2016: Moving Up
Paco Nathan
 
PDF
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Big Data Spain
 
PPTX
The End(s) of e-Research
Eric Meyer
 
PPT
Sensors1(1)
Lakmal Pathirana
 
PDF
R - datascience
ike kurniati
 
PPT
Diagram webinar gould_30oct12
Tom Kuipers
 
PDF
Innovative design methods for data science - beyond brainstorming
Akin Osman Kazakci
 
PDF
Data Curation and Debugging for Data Centric AI
Paul Groth
 
PDF
CSCW in Times of Social Media
Hendrik Drachsler
 
PPTX
Integration of oreChem with the eCrystals repository for crystal structures
Mark Borkum
 
PDF
Information Retrieval The Early Years Donna K Harman
simonnplasma
 
PDF
SE2016 BigData Denis Reznik "Data driven future"
Inhacking
 
PDF
Denis Reznik Data driven future
Аліна Шепшелей
 
PDF
Scientific Data Visualizations - Data Doesn't Care What You Believe.
contact14711
 
PDF
Let's talk about Data Science
Carlo Lauro
 
PDF
Data Science definition
CarloLauro1
 
PDF
Kid171 chap0 english version
Frank S.C. Tseng
 
PDF
Handbook of Statistics 24 Data Mining and Data Visualization C.R. Rao
brafovitian
 
PDF
05 astrostat feigelson
Marco Quartulli
 
PPTX
Term Paper Presentation
Shubham Singh
 
Data Science in 2016: Moving Up
Paco Nathan
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Big Data Spain
 
The End(s) of e-Research
Eric Meyer
 
Sensors1(1)
Lakmal Pathirana
 
R - datascience
ike kurniati
 
Diagram webinar gould_30oct12
Tom Kuipers
 
Innovative design methods for data science - beyond brainstorming
Akin Osman Kazakci
 
Data Curation and Debugging for Data Centric AI
Paul Groth
 
CSCW in Times of Social Media
Hendrik Drachsler
 
Integration of oreChem with the eCrystals repository for crystal structures
Mark Borkum
 
Information Retrieval The Early Years Donna K Harman
simonnplasma
 
SE2016 BigData Denis Reznik "Data driven future"
Inhacking
 
Denis Reznik Data driven future
Аліна Шепшелей
 
Scientific Data Visualizations - Data Doesn't Care What You Believe.
contact14711
 
Let's talk about Data Science
Carlo Lauro
 
Data Science definition
CarloLauro1
 
Kid171 chap0 english version
Frank S.C. Tseng
 
Handbook of Statistics 24 Data Mining and Data Visualization C.R. Rao
brafovitian
 
05 astrostat feigelson
Marco Quartulli
 
Term Paper Presentation
Shubham Singh
 
Ad

More from CHARAK RAY (20)

PPTX
SENSING ENTREPRENEURIAL OPPORTUNITY.pptx
CHARAK RAY
 
PPTX
CH-1C.pptx
CHARAK RAY
 
PPTX
CH-1B.pptx
CHARAK RAY
 
PPTX
BUSINESS, TRADE & COMMERCE
CHARAK RAY
 
PPT
INTRODUCTION TO BUSINESS MANAGEMENT
CHARAK RAY
 
PPTX
CASE STUDY.pptx
CHARAK RAY
 
PPTX
HUMAN TRANSFORMATION.pptx
CHARAK RAY
 
PPTX
ENTREPRENEURSHIP.pptx
CHARAK RAY
 
PPTX
Research Methodology
CHARAK RAY
 
PPTX
MS WORD
CHARAK RAY
 
PDF
PROJECT REPORT ON COLD STORAGE
CHARAK RAY
 
PPT
PROJECT FINANCE
CHARAK RAY
 
PPTX
LINEAR ALGEBRA, WITH OPTIMIZATION
CHARAK RAY
 
PDF
TRAINING OF POLLING PERSONNEL
CHARAK RAY
 
PDF
WRITING AN ABSTRACT
CHARAK RAY
 
PDF
BUSINESS STUDIES PROJECT ON PRINCIPLES OF MANAGEMENT
CHARAK RAY
 
PDF
BUSINESS STUDIES PROJECT ON MARKETING MANAGEMENT
CHARAK RAY
 
PDF
BUSINESS STUDIES PROJECT GUIDELINES
CHARAK RAY
 
PPTX
ROYAL CLEAN SERVICES.
CHARAK RAY
 
PPTX
ROYAL MEGA ORGANIC FOOD PARK
CHARAK RAY
 
SENSING ENTREPRENEURIAL OPPORTUNITY.pptx
CHARAK RAY
 
CH-1C.pptx
CHARAK RAY
 
CH-1B.pptx
CHARAK RAY
 
BUSINESS, TRADE & COMMERCE
CHARAK RAY
 
INTRODUCTION TO BUSINESS MANAGEMENT
CHARAK RAY
 
CASE STUDY.pptx
CHARAK RAY
 
HUMAN TRANSFORMATION.pptx
CHARAK RAY
 
ENTREPRENEURSHIP.pptx
CHARAK RAY
 
Research Methodology
CHARAK RAY
 
MS WORD
CHARAK RAY
 
PROJECT REPORT ON COLD STORAGE
CHARAK RAY
 
PROJECT FINANCE
CHARAK RAY
 
LINEAR ALGEBRA, WITH OPTIMIZATION
CHARAK RAY
 
TRAINING OF POLLING PERSONNEL
CHARAK RAY
 
WRITING AN ABSTRACT
CHARAK RAY
 
BUSINESS STUDIES PROJECT ON PRINCIPLES OF MANAGEMENT
CHARAK RAY
 
BUSINESS STUDIES PROJECT ON MARKETING MANAGEMENT
CHARAK RAY
 
BUSINESS STUDIES PROJECT GUIDELINES
CHARAK RAY
 
ROYAL CLEAN SERVICES.
CHARAK RAY
 
ROYAL MEGA ORGANIC FOOD PARK
CHARAK RAY
 
Ad

Recently uploaded (20)

PDF
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
PPTX
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
PPTX
M1-T1.pptxM1-T1.pptxM1-T1.pptxM1-T1.pptx
teodoroferiarevanojr
 
PDF
Top Civil Engineer Canada Services111111
nengineeringfirms
 
PPTX
Introduction-to-Python-Programming-Language (1).pptx
dhyeysapariya
 
PPTX
Nursing Shift Supervisor 24/7 in a week .pptx
amjadtanveer
 
PPT
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
PPTX
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
PDF
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
PPT
introdution to python with a very little difficulty
HUZAIFABINABDULLAH
 
PPTX
UVA-Ortho-PPT-Final-1.pptx Data analytics relevant to the top
chinnusindhu1
 
PDF
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
PPTX
Introduction to computer chapter one 2017.pptx
mensunmarley
 
PPTX
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
PPTX
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
PPTX
7 Easy Ways to Improve Clarity in Your BI Reports
sophiegracewriter
 
PPTX
Insurance-Analytics-Branch-Dashboard (1).pptx
trivenisapate02
 
PDF
apidays Munich 2025 - Developer Portals, API Catalogs, and Marketplaces, Miri...
apidays
 
PPTX
short term internship project on Data visualization
JMJCollegeComputerde
 
PPTX
Introduction to Data Analytics and Data Science
KavithaCIT
 
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
M1-T1.pptxM1-T1.pptxM1-T1.pptxM1-T1.pptx
teodoroferiarevanojr
 
Top Civil Engineer Canada Services111111
nengineeringfirms
 
Introduction-to-Python-Programming-Language (1).pptx
dhyeysapariya
 
Nursing Shift Supervisor 24/7 in a week .pptx
amjadtanveer
 
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
lecture 13 mind test academy it skills.pptx
ggesjmrasoolpark
 
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
introdution to python with a very little difficulty
HUZAIFABINABDULLAH
 
UVA-Ortho-PPT-Final-1.pptx Data analytics relevant to the top
chinnusindhu1
 
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
Introduction to computer chapter one 2017.pptx
mensunmarley
 
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
7 Easy Ways to Improve Clarity in Your BI Reports
sophiegracewriter
 
Insurance-Analytics-Branch-Dashboard (1).pptx
trivenisapate02
 
apidays Munich 2025 - Developer Portals, API Catalogs, and Marketplaces, Miri...
apidays
 
short term internship project on Data visualization
JMJCollegeComputerde
 
Introduction to Data Analytics and Data Science
KavithaCIT
 

DATA ANALYSIS

  • 2. COURSE CONTENTS •Core Data Analysis • 1D analysis • 2D analysis: both quantitative • 2D analysis: both nominal • Learning multivariate correlation • Principal components (PCA) and SVD: Mathematical foundations • Principal components (PCA) and SVD: Applications • Clustering with k-means
  • 3. INTRO: WHAT IS CORE DATA ANALYSIS? Four main parts 1. Data Mining and data patterns and their use 2. Core data analysis: two main goals for Knowledge Enhancing 3. Visualization: How it works 4. Illustrative data cases
  • 4. INTRO: DATA MINING AND DATA PATTERNS AND THEIR USE •Is it Data Mining? • Well, what is Data Mining? • Generically, Data Mining is looking for (i) patterns in data stored in (ii) Databases as part of (iii) Knowledge Discovery • Core data analysis does not care of (ii) Databases • Core data analysis does care of (ia) specific patterns in data as part of (iia) Knowledge Enhancing
  • 5. INTRO: EXAMPLE OF PATTERN DOUBLE SUCCESS 1 The History of Laws for planetary motion Double success Ptolemy (c. 150 a.d.): • Sun and planets • circle Earth • Does not match data well
  • 6. INTRO: EXAMPLE OF PATTERN DOUBLE SUCCESS 2 The History of Laws for planetary motion • Copernicus (c. 1540): • Planets circle Sun • Does not match data well • either
  • 7. INTRO: EXAMPLE OF PATTERN DOUBLE SUCCESS 3 Laws for planetary motion: Kepler (c. 1605): • 1st Law: Planets revolve Sun in ellipses (ovals) • 2d Law: Speed changes – the further away from Sun, the faster • Does either
  • 8. INTRO: EXAMPLE OF PATTERN DOUBLE SUCCESS 4 Planet Period (year) Distance (average, relative to that of Earth) Mercury Venus Earth Mars Jupiter Saturn Uranus Neptune Pluto 0.241 0.615 1.00 1.88 11.8 29.5 84.0 165 248 0.39 0.72 1.00 1.52 5.20 9.54 19.18 30.06 39.44 3d Law: Is there any relation between speed/period and distance?
  • 9. INTRO: EXAMPLE OF PATTERN DOUBLE SUCCESS 5 3d Kepler’s Law: Is there any relation between speed/period and distance? Fit no line…
  • 10. INTRO: EXAMPLE OF PATTERN DOUBLE SUCCESS 6 3d Kepler’s Law (1619): [J. Napier invented logarithm (1614)] Log(P)= 𝟑 𝟐 Log(D) P2=D3
  • 11. INTRO: EXAMPLE OF PATTERN DOUBLE SUCCESS 7 Three Kepler’s Laws: What is so grand? Substantiated theoretically by R. Hooke (1635-1703) and I. Newton (1642-1727) UNIVERSAL GRAVITATION LAW Mathematical equation, cornerstone of modern science
  • 12. INTRO: EXAMPLE OF PATTERN FAILURE? 1 Imagine this: Broad street, Soho, London, Cholera outbreak September 1854 Dr. Snow report: “On proceeding to the spot, I found that nearly all the deaths had taken place within a short distance of the pump.” Dr John Snow’s map: Cases of death labeled by ticks. The handle of pump removed 7/9/1854.
  • 13. INTRO 1: EXAMPLE OF PATTERN FAILURE? 2 Myth: Death stopped. Data analysis won. Fact: Data analysis lost. The health commission rejected the water pump theory, as contradicting the science of the day (cholera outbreak caused by “concentrated noxious atmospheric influence, no doubt emanating from putrefying organic matter”). The handle of the pump was ordered back. Death stopped because all died already. More death occurred at further cholera outbreaks till R. Koch discovered and publicized the vibrio cholera in 1883. Dr John Snow’s map: A case of death Is labeled by a tick
  • 14. PATTERN FOUND Success: if Compatible with existing knowledge Failure: if Not compatible with existing knowledge Advice • Find a pattern • Interpret using existing knowledge • Care not whether interpretation is compatible
  • 15. INTRO: WHAT IS CORE DATA ANALYSIS II 1 • Core data analysis does care of (ia) specific patterns in data as part of (iia) Knowledge Enhancing • What are these (ia), (iia) specifics? • Have something to do with the notion of Knowledge • Statements of fact (“I teach this class.”) – factual • Statements of pattern, regularity (“Professors use to teach classes.”) - structural
  • 16. INTRO: WHAT IS CORE DATA ANALYSIS II 2 • Core data analysis does care of (ia) specific patterns in data as part of (iia) Knowledge Enhancing • (ia), (iia) specifics relate to elements of structural knowledge • Elements of Structural knowledge: • Concepts (“Professor”, “Teach”, “Class”) • Statements of relation between concepts (“Professors use to teach classes.”) - structural
  • 17. INTRO: WHAT IS CORE DATA ANALYSIS II •List elements of structural knowledge, •concepts and •statements of relation among them, for •3d Kepler’s Law •Dr Snow’s cholera outbreak map
  • 18. INTRO: WHAT IS CORE DATA ANALYSIS II 3 • Core data analysis does care of (ia) deriving concepts and statements of relation between them from data • (iia) Structural Knowledge Enhancing, generically, via either of the two pathways • Two pathways for Structural Knowledge Enhancing • Summarization: Developing Concepts • Correlation: Deriving Statements of relation between concepts
  • 19. W1. INTRO: WHAT IS CORE DATA ANALYSIS II 4 • Two pathways for Structural Knowledge Enhancing • Summarization: Developing Concepts • Correlation: Deriving Statements of relation between concepts  Two major formats:  Quantitative (both concepts and statements)  3d Kepler’s Law Period2 = Distance3  Categorical (both concepts and statements)  Dr Snow’s conclusion: Cholera death is caused by pump water
  • 20. INTRO II: STRUCTURAL KNOWLEDGE ENHANCING GENERIC METHODS •Two pathways & Two formats • Summarization methods: • Quantitative Principal component analysis (PCA) • Categorical Cluster analysis • Correlation methods: • Quantitative Regression • Categorical Classifier
  • 21. INTRO II: THREE POSSIBLE LAYERS OF STUDY Pro Con • Systems Usable now Short lived Simple Too many • Concepts Awareness Superficial • Methods Workable Technical Extendable Boring Long-term
  • 22. INTRO II: COURSE CONTENTS REVIEW •Summarization: PCA (Weeks 6 and 7), Cluster analysis (Week 8) •Correlation: Classifier (Week 5), (no Regression, sorry; if needed, go to Statistics, Econometrics and Neuron Networks courses) •Prequel: 1D and 2D analyses to study basic concepts and basic methods •Pre-prequel: Intro – Data and problems
  • 23. INTRO II: RELATION TO OTHER APPROACHES • Classical mathematical statistics: data is just a vehicle to fit and test mathematical models in the applied domain (say, in data analysis, a feature is a column in table, they model it as a random variable!) • Machine Learning: Prediction rules to be built incrementally (say, here PCA is a major method; for them, just a method to preprocess the data) • Data Mining: adding new knowledge by finding interesting patterns in databases, which is initial stage of knowledge discovery (CDA is part of that, up to databases) OVERALL: METHODS are SAME, PERSPECTIVES DO DIFFER
  • 24. INTRO III: VISUALIZATION • Visualization of data is an important activity assisting data analysis by a human in many ways including A. Highlighting B. Integrating different aspects C. Manipulating (not shown) A few examples follow.
  • 25. INTRO III: VISUALIZATION A. Highlighting 1 Figure 1. A fragment of London Tube map made after H. Beck (1906); the central part is highlighted by disproportionate scaling. Being, for a long while, totally rejected by the authorities, a standard for metro maps worldwide.
  • 26. INTRO III: VISUALIZATION A. Highlighting 2: Cheating by distortion Figure 2. A decline in relative numbers of general practitioner doctors in California in 70- es is conveniently visualized using 1D size-, not 2D area-related, scaling of a picture of doctor.
  • 27. INTRO III: VISUALIZATION Highlighting 3: Cheating by distortion Figure 3. Another unintended distortion: a newspaper’s self- satisfaction report (July 2005) is visualized with bars that grow from mark 500,000 rather than 0. A 25% advantage has visually grown ten-fold!
  • 28. INTRO III: VISUALIZATION B. Integrating aspects 1 Figure 4. Con Edison company’s power grid screen over Manhattan NY. Grid repair problems are dealt with on the fly by sending operators upon seeing disorders on the screen.
  • 29. INTRO III: VISUALIZATION B. Integrating aspects 2 Figure 5. Minard’s (1869) depiction of Napoleon’s lost campaign of 1812, integrating space, time and the strength of the French army.
  • 30. INTRO III: VISUALIZATION B. Integrating aspects 3 Figure 6. The structure of research activities of CENTRIA (UNL, Lisbon) in 2007 represented over the ACM Computing Classification System 1998.
  • 31. INTRO IV: ILLUSTRATIVE DATA CASES Case 1: Companies 1
    Company name   Income, $mln   MShare, %   NSup   EC    Sector
    Aversiona      19.0           43.7        2      No    Utility
    Antyops        29.4           36.0        3      No    Utility
    Astonite       23.9           38.0        3      No    Industrial
    Bayermart      18.4           27.9        2      Yes   Utility
    Breaktops      25.7           22.3        3      Yes   Industrial
    Bumchista      12.1           16.9        2      Yes   Industrial
    Civiok         23.9           30.2        4      Yes   Retail
    Cyberdam       27.2           58.0        5      Yes   Retail
    Companies characterized by mixed scale features; the first three companies making product A, the next three making product B, and the last two product C. Metadata: A. Features and Domain knowledge 1) Income, $ mln; 2) MShare - Market share, per cent; 3) NSup - Number of principal suppliers; 4) ECommerce - Yes (e-trade) or No; 5) Sector - (a) Retail, (b) Utility, and (c) Industrial. B. Main production (A, B, C) C. Feature scale types (3 main types)
  • 32. INTRO IV: ILLUSTRATIVE DATA CASES Case 1: Companies 2 Feature: maps entities to feature values (synonyms: variable, attribute, character, parameter). Quantitative scale: a feature for which arithmetic averaging makes sense. Examples: 1) Income, 2) MShare, 3) NSup.
  • 33. INTRO IV: ILLUSTRATIVE DATA CASES Case 1: Companies 3 Nominal scale: disjoint categories for which only the comparison “equal or not” makes sense (a special case of categorical scales). Example: 5) Sector (its values are Retail, Utility, Industrial). Binary scale: two disjoint categories, “Yes” and “No”; it shares the properties of the nominal scale and, if 1/0 coded, of the quantitative scale. Example: 4) ECommerce.
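Why a 1/0-coded binary feature behaves like a quantitative one can be checked on the ECommerce column of the company table in Case 1: the arithmetic mean of the coded values is exactly the share of “Yes” entries, so averaging is meaningful. A minimal sketch in Python; the variable names are illustrative.

```python
# ECommerce column of the eight companies, in table order.
ec = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes"]

# 1/0 coding: 1 if the category "Yes" holds, 0 otherwise.
ec_coded = [1 if value == "Yes" else 0 for value in ec]

mean_ec = sum(ec_coded) / len(ec_coded)    # arithmetic mean of the coded feature
share_yes = ec.count("Yes") / len(ec)      # proportion of "Yes" entries

print(mean_ec, share_yes)
```

No such reading exists for an arbitrary numeric labelling of a three-category nominal feature such as Sector, which is why each of its categories is coded separately.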
  • 34. INTRO IV: QUANTITATIVE CODING Case 1: Companies 4 Quantitative coding: each category is made into a 1/0 binary (dummy) feature answering “Does it hold? 1 if Yes, 0 if No.” Company data 8x5 converted to the quantitative format 8x7:
    Entity   Income   MShare   NSup   EC?   Util?   Indu?   Retail?
    1        19.0     43.7     2      0     1       0       0
    2        29.4     36.0     3      0     1       0       0
    3        23.9     38.0     3      0     0       1       0
    4        18.4     27.9     2      1     1       0       0
    5        25.7     22.3     3      1     0       1       0
    6        12.1     16.9     2      1     0       1       0
    7        23.9     30.2     4      1     0       0       1
    8        27.2     58.0     5      1     0       0       1
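The quantitative coding described on this slide can be sketched in plain Python as follows. The data are the eight companies of Case 1; the helper name `one_hot` and the variable names are illustrative assumptions, and the sector columns are ordered by first appearance in the table (Utility, Industrial, Retail).

```python
# (name, Income, MShare, NSup, EC, Sector) for the eight companies of Case 1.
companies = [
    ("Aversiona", 19.0, 43.7, 2, "No", "Utility"),
    ("Antyops", 29.4, 36.0, 3, "No", "Utility"),
    ("Astonite", 23.9, 38.0, 3, "No", "Industrial"),
    ("Bayermart", 18.4, 27.9, 2, "Yes", "Utility"),
    ("Breaktops", 25.7, 22.3, 3, "Yes", "Industrial"),
    ("Bumchista", 12.1, 16.9, 2, "Yes", "Industrial"),
    ("Civiok", 23.9, 30.2, 4, "Yes", "Retail"),
    ("Cyberdam", 27.2, 58.0, 5, "Yes", "Retail"),
]


def one_hot(value, categories):
    """One 1/0 dummy feature per category: 1 if the category holds, else 0."""
    return [1 if value == category else 0 for category in categories]


# Sector categories in order of first appearance: Utility, Industrial, Retail.
sectors = list(dict.fromkeys(row[5] for row in companies))

# 8x5 mixed-scale table -> 8x7 quantitative table.
coded = [
    [income, mshare, nsup, 1 if ec == "Yes" else 0] + one_hot(sector, sectors)
    for _, income, mshare, nsup, ec, sector in companies
]

for row in coded:
    print(row)
```

The binary feature EC needs only one column, whereas the three-category Sector needs three, which is how five features become seven.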
  • 35. INTRO IV: ILLUSTRATIVE DATA CASES Case 1: Companies 5 Data analysis: • How to map companies to the screen with their similarity reflected in distances between points? (Summarization/visualization) • Would clustering of companies reflect the product? What features would be involved then? (Summarization) • Can rules be derived to predict the product for another company, coming from outside the table? (Correlation) • Is there any relation between the structural features (NSup, EC, Sector) and the market-related features (Income, MShare)? (Correlation)
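As a first step towards the visualization question, similarity between companies can be expressed as a distance between rows of the quantitatively coded 8x7 table. The sketch below is only one possible choice and rests on assumptions not stated on the slide: each column is divided by its range so that no feature dominates, and the Euclidean distance is used. The function names are illustrative.

```python
import math

names = ["Aversiona", "Antyops", "Astonite", "Bayermart",
         "Breaktops", "Bumchista", "Civiok", "Cyberdam"]

# Columns: Income, MShare, NSup, EC?, Util?, Indu?, Retail?
coded = [
    [19.0, 43.7, 2, 0, 1, 0, 0],
    [29.4, 36.0, 3, 0, 1, 0, 0],
    [23.9, 38.0, 3, 0, 0, 1, 0],
    [18.4, 27.9, 2, 1, 1, 0, 0],
    [25.7, 22.3, 3, 1, 0, 1, 0],
    [12.1, 16.9, 2, 1, 0, 1, 0],
    [23.9, 30.2, 4, 1, 0, 0, 1],
    [27.2, 58.0, 5, 1, 0, 0, 1],
]


def range_scale(rows):
    """Divide every column by its range (assumes no column is constant)."""
    columns = list(zip(*rows))
    ranges = [max(col) - min(col) for col in columns]
    return [[x / r for x, r in zip(row, ranges)] for row in rows]


def euclidean(u, v):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


scaled = range_scale(coded)
dist = [[euclidean(u, v) for v in scaled] for u in scaled]

# Example: compare a same-product pair with a different-product pair.
print(f"{names[6]} - {names[7]}: {dist[6][7]:.3f}")
print(f"{names[0]} - {names[7]}: {dist[0][7]:.3f}")
```

A distance matrix of this kind is the usual input for mapping entities onto a 2D screen or for clustering them; other scalings and distances would give different pictures.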
  • 36. INTRO IV: ILLUSTRATIVE DATA CASES Case 2: Iris 1 Anderson–Fisher Iris 150x4 data of three taxa:
    Specimen (1-150)   Taxon
    1-50               Iris setosa (diploid)
    51-100             Iris versicolor (tetraploid)
    101-150            Iris virginica (hexaploid)
    Features: W1 Sepal length; W2 Sepal width; W3 Petal length; W4 Petal width
  • 37. INTRO IV: DATA CASES Case 2: Iris 2
         I Iris setosa         II Iris versicolor    III Iris virginica
    #    w1   w2   w3   w4     w1   w2   w3   w4     w1   w2   w3   w4
    1    5.1  3.5  1.4  0.3    6.4  3.2  4.5  1.5    6.3  3.3  6.0  2.5
    2    4.4  3.2  1.3  0.2    5.5  2.4  3.8  1.1    6.7  3.3  5.7  2.1
    3    4.4  3.0  1.3  0.2    5.7  2.9  4.2  1.3    7.2  3.6  6.1  2.5
    4    5.0  3.5  1.6  0.6    5.7  3.0  4.2  1.2    7.7  3.8  6.7  2.2
    5    5.1  3.8  1.6  0.2    5.6  2.9  3.6  1.3    7.2  3.0  5.8  1.6
    6    4.9  3.1  1.5  0.2    7.0  3.2  4.7  1.4    7.4  2.8  6.1  1.9
    7    5.0  3.2  1.2  0.2    6.8  2.8  4.8  1.4    7.6  3.0  6.6  2.1
    8    4.6  3.2  1.4  0.2    6.1  2.8  4.7  1.2    7.7  2.8  6.7  2.0
    9    5.0  3.3  1.4  0.2    4.9  2.4  3.3  1.0    6.2  3.4  5.4  2.3
    50   5.1  3.5  1.4  0.2    6.0  2.2  4.0  1.0    6.5  3.2  5.1  2.0
    Data analysis • Visualise the data so that similar specimens are mapped into points that are near each other, and dissimilar ones into points that are far apart • Build a predictor of sepal sizes from the petal sizes (to lessen the burden of measurement) • Build a predictor of taxa (classifier) based on the petal/sepal sizes
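To make the third task concrete, here is a minimal sketch of a taxon predictor. It is an assumption made for illustration, not the classifier taught in the course: a nearest-centroid (minimum-distance) rule that uses only petal length (w3) and petal width (w4) from the ten rows per taxon shown on the slide. The names `petals`, `centroid` and `classify` are illustrative.

```python
import math

# (w3 petal length, w4 petal width) for the ten rows per taxon shown on the slide.
petals = {
    "setosa": [(1.4, 0.3), (1.3, 0.2), (1.3, 0.2), (1.6, 0.6), (1.6, 0.2),
               (1.5, 0.2), (1.2, 0.2), (1.4, 0.2), (1.4, 0.2), (1.4, 0.2)],
    "versicolor": [(4.5, 1.5), (3.8, 1.1), (4.2, 1.3), (4.2, 1.2), (3.6, 1.3),
                   (4.7, 1.4), (4.8, 1.4), (4.7, 1.2), (3.3, 1.0), (4.0, 1.0)],
    "virginica": [(6.0, 2.5), (5.7, 2.1), (6.1, 2.5), (6.7, 2.2), (5.8, 1.6),
                  (6.1, 1.9), (6.6, 2.1), (6.7, 2.0), (5.4, 2.3), (5.1, 2.0)],
}


def centroid(points):
    """Component-wise mean of a list of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


centroids = {taxon: centroid(points) for taxon, points in petals.items()}


def classify(petal):
    """Return the taxon whose centroid is nearest (Euclidean) to `petal`."""
    return min(
        centroids,
        key=lambda taxon: math.sqrt(
            (petal[0] - centroids[taxon][0]) ** 2
            + (petal[1] - centroids[taxon][1]) ** 2
        ),
    )


print(classify((1.5, 0.2)))
print(classify((4.0, 1.2)))
print(classify((6.0, 2.0)))
```

The rule is deliberately crude; it shows that a classifier is nothing more than a mapping from feature values to a taxon, built from the data table.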
  • 38. INTRO IV: DATA CASES Case 3: Intrusion attack 1 Features 1) Pr, the protocol type, which is either tcp or icmp or udp (a nominal feature), 2) BySD, the number of data bytes from source to destination, 3) SH, the number of connections to the same host as the current one in the past two seconds, 4) SS, the number of connections to the same service as the current one in the past two seconds, 5) SE, the rate of connections (per cent in SHCo) that have SYN errors, 6) RE, the rate of connections (per cent in SHCo) that have REJ errors, 7) A, the type of attack (ap - apache, sa - saint, sm - smurf, and no attack) (a nominal feature)
    Pr    BySD    SH   SS   SE     RE     A     |  Pr    BySD   SH   SS   SE   RE   A
    Tcp   62344   16   16   0      0.94   Ap    |  Tcp   287    14   14   0    0    no
    Tcp   60884   17   17   0.06   0.88   Ap    |  Tcp   308    1    1    0    0    no
    Tcp   59424   18   18   0.06   0.89   Ap    |  Tcp   284    5    5    0    0    no
    Tcp   59424   19   19   0.05   0.89   Ap    |  Udp   105    2    2    0    0    no
    Tcp   59424   20   20   0.05   0.9    Ap    |  Udp   105    2    2    0    0    no
    Tcp   75484   21   21   0.05   0.9    Ap    |  Udp   105    2    2    0    0    no
  • 39. INTRO IV: DATA CASES Case 3: Intrusion attack 2 Data analysis • Build a classifier to judge whether the system functions normally or is under attack (Correlation); • Is there any relation between the protocol and the type of attack? (Correlation); • Visualize the data reflecting similarity of the patterns (Summarization).
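The question about protocol and attack type concerns two nominal features, and the natural first step is a cross-classification (contingency) table of their co-occurrence counts. A minimal sketch in Python on the twelve records listed for Case 3; the variable names are illustrative, and twelve records are far too few to draw conclusions from.

```python
from collections import Counter

# (Pr, A) pairs for the twelve records listed in Case 3.
records = (
    [("Tcp", "Ap")] * 6      # left-hand block: six tcp records under apache attack
    + [("Tcp", "no")] * 3    # right-hand block: three normal tcp records
    + [("Udp", "no")] * 3    # right-hand block: three normal udp records
)

table = Counter(records)

protocols = sorted({pr for pr, _ in records})
attacks = sorted({a for _, a in records})

# Print the contingency table: rows are protocols, columns are attack types.
print("Pr".ljust(6) + "".join(a.ljust(6) for a in attacks))
for pr in protocols:
    print(pr.ljust(6) + "".join(str(table[(pr, a)]).ljust(6) for a in attacks))
```

In this small sample every attack record uses tcp and every udp record is normal; whether such a pattern holds in general is what the analysis of two nominal features later in the course addresses.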
  • 40. TOPICS COVERED: 1. Data Mining and data patterns and their use: if you find a pattern, interpret it! 2. Knowledge Enhancing: summarize into concepts, correlate into statements of relation. 3. Visualize: to highlight or integrate aspects. 4. Illustrative data cases: the concepts of feature, feature scale, data table, and data analysis problem.
