Datasets and
Data Science
Damian Gordon
Contents
•Data Science
•Types of Data
•Picturing the Data
•ACTIVITY
•Dataset Characteristics
Data Science
•Data science is…
•“the science of Data” (!)
• Cao, L. (2017) Data science: a comprehensive overview. ACM Computing
Surveys (CSUR), 50(3), pp.1-42.
Data Science
•What is Data?
•It’s a set of facts and figures
Data Science
•OK, so what is Data Science?
•Extracting insight and information from
data sets to make better decisions.
• Kelleher, J.D., Tierney, B. (2018) Data Science. MIT press.
Data Science
•There is a (possibly apocryphal) story that
is often used to illustrate data mining, and
it’s called the “Beers and Nappies” story.
Data Science
•The story goes that a large American
supermarket (usually said to be Walmart) was
exploring the sales data from its cash
registers. The data was stored one customer’s
purchase after another, and when the
supermarket mined the dataset, it looked
at each product to see whether it was commonly
associated with any other products.
Data Science
•They found an unexpected pattern
between the purchase of beers and the
purchase of nappies. The supermarket
started to place those two products right
beside each other on the supermarket
floor and they made lots of money.
Data Science
•The explanation for the association between
the products could not be deduced from the
dataset, but the cashiers explained that if a
couple with a baby has one partner at home
minding the baby and one going to work, the
partner who is going to work will pop into the
supermarket after work to buy some nappies,
and will decide to get
themselves some beers as well ;-)
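The kind of association the story describes can be found by counting how often pairs of products appear in the same purchase. The following is a minimal sketch using only the Python standard library; the baskets, the helper names `support` and `confidence`, and the product names are made up for illustration and are not from the original story.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each set is one customer's purchase (made-up data).
baskets = [
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"milk", "bread"},
    {"nappies", "beer", "milk"},
    {"bread", "crisps"},
]

# Count how many baskets contain each item and each unordered pair of items.
item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))


def support(pair):
    """Fraction of all baskets that contain both items of the pair."""
    return pair_counts[tuple(sorted(pair))] / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    return both / item_counts[antecedent]


most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)
print(support(("beer", "nappies")), confidence("nappies", "beer"))
```

As in the story, the counts show that the pattern exists but say nothing about why it exists.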
Data Science: Statistics
•A model is
really an approximation
of something else. It’s
not supposed to be a
perfect representation.
Data Science: Statistics
[Figure: real data and predicted data, with three candidate predictions labelled A, B, and C]
•Which is the best prediction for new data, based on the real data: A, B, or C?
Data Science: Statistics
•Any of the three predictions (A, B, or
C) is possible in terms of new data,
so there is no “right answer”, but
based on the linear model we have
created for the existing data, line
B looks like the most likely predictor
of any new data.
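A linear model of this kind can be sketched with ordinary least squares. The example below is only an illustration: the data points and the helper names `fit_line` and `predict` are assumptions, and the slide does not say how its line was fitted.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up "real data" that roughly follows a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """Predicted value for a new x, using the fitted line."""
    return slope * x + intercept


print(round(slope, 2), round(intercept, 2), round(predict(6), 2))
```

The fitted line is an approximation of the data, not a guarantee about any single new point.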
Data Science: Some Software Tools
•Python (with Pandas, NumPy, Scikit-learn, Matplotlib)
•TensorFlow
•Hadoop
•R Programming Language
•WEKA
•Tableau
Data Science: Main Application Areas
•Healthcare
•Finance
•Transportation
•Marketing
•Energy Consumption
•Sports
•Genetics
•Manufacturing
Data Science
• In Data Science, one of the
key formal processes
(methodologies) that
businesses follow is called
CRISP-DM (CRoss Industry
Standard Process for Data
Mining), and it provides
organisations with a step-by-step
guide to using
Data Science in businesses.
Data Science: Computer Science
•DATA CLEANING (or Data Cleansing) is
fixing or removing data that is
incorrect (in some way) from the
dataset.
Data Science: Computer Science
•Let’s imagine one of the columns of
the dataset is a date, but different
rows have different formats, e.g.
•12-3-1992
•06/11/1946
•23rd November 2022
Data Science: Computer Science
•We can write a computer program to
reformat all of these dates into one
common format, e.g.
•DD-MM-YYYY
•This is called Data
Transformation.
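A minimal sketch of such a program, using the three example dates from the previous slide. The function name `to_dd_mm_yyyy` and the list of accepted input formats are assumptions made for illustration; in particular, the sketch assumes day-first dates, so "12-3-1992" is read as 12 March.

```python
import re
from datetime import datetime

# Candidate input formats, tried in order. Day-first is an assumption:
# "12-3-1992" is read as 12 March, not 3 December.
INPUT_FORMATS = ["%d-%m-%Y", "%d/%m/%Y", "%d %B %Y"]


def to_dd_mm_yyyy(text):
    """Return the date as DD-MM-YYYY, or raise ValueError if no format matches."""
    # Drop ordinal suffixes such as "23rd" -> "23" and collapse whitespace.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    cleaned = " ".join(cleaned.split())
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%d-%m-%Y")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {text!r}")


for raw in ["12-3-1992", "06/11/1946", "23rd November 2022"]:
    print(raw, "->", to_dd_mm_yyyy(raw))
```

Raising an error for unrecognised formats, rather than guessing, keeps bad rows visible so they can be fixed by hand.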
Data Science: Computer Science
•Another issue might be that some of
the rows of data are recorded multiple
times. So we can write a program to
scan for this kind of duplication.
•This is called
Duplicate Elimination
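A scan for exact duplicates can be sketched as follows. The rows and the helper name `drop_duplicates` are hypothetical; the sketch only catches rows that are identical in every column and keeps the first occurrence.

```python
def drop_duplicates(rows):
    """Return rows with exact repeats removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # tuples are hashable, so they can be stored in a set
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    ["Ann", "06-11-1946"],
    ["Bob", "12-03-1992"],
    ["Ann", "06-11-1946"],  # recorded a second time
]
print(drop_duplicates(rows))
```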
Data Science: Computer Science
•One more issue to mention is that if a
column has text in it, we can write
programs to check if the text is
suitable.
•This is called
Parsing.
Data Science: Computer Science
•Another area the computer programs
can help us with is in creating graphs
to show trends in the data.
•This is called
Data Visualisation.
The Use of Data and Datasets in Data Science
Types of Data
Continuous Data
•It’s data with a decimal place.
•Continuous Data is data that can take
on any value within a given range. In
principle, it can be measured to any level of
precision.
•e.g. height, 1.8542 metres.
•e.g. time, 3 hrs, 4 mins, 34 secs, 34 ms, etc.
Discrete Data
•It’s data without a decimal place.
•Discrete Data is data that consists of
distinct, separate values that can be
counted.
•e.g. number of days worked this week: 3.
•e.g. number of leaves on a tree: 426.
Ordinal Data
•It’s data with ordered categories.
•Ordinal Data is categorical data where
the categories have a meaningful order
or ranking.
•e.g. {Very Good, Good, O.K., Bad, Very Bad}
•e.g. Pain severity rated as {0 (no pain), 1
(mild), 2 (moderate), 3 (severe)}
Nominal Data
•It’s data without ordered categories.
•Nominal Data is categorical data that
consists of distinct categories with no
inherent order or ranking.
•e.g. {Yes, No}
•e.g. {Teacher, Chemist, Haberdasher}
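The difference between ordinal and nominal data shows up in which summaries make sense. The sketch below reuses the example categories from the slides; the response lists and the variable names are made up for illustration.

```python
from collections import Counter

# Ordinal data: categories carry a rank, so sorting by rank is meaningful.
RATING_ORDER = ["Very Bad", "Bad", "O.K.", "Good", "Very Good"]
rank = {label: position for position, label in enumerate(RATING_ORDER)}

responses = ["Good", "Very Bad", "O.K.", "Very Good", "Bad", "Good"]
sorted_responses = sorted(responses, key=rank.__getitem__)
# A median is meaningful for ordinal data (the upper-middle element is taken
# here, because the number of responses is even).
median_response = sorted_responses[len(sorted_responses) // 2]

# Nominal data: no order, so the sensible summaries are counts and the mode.
occupations = ["Teacher", "Chemist", "Teacher", "Haberdasher"]
most_common_occupation = Counter(occupations).most_common(1)[0][0]

print(median_response, most_common_occupation)
```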
Picturing the Data
Pie Chart
Pie Charts
• Nominal/Ordinal
• Only suitable for
parts of a whole (proportions that add up
to 1, i.e. 100%)
• Hard to compare
values in the chart
Bar Chart
Bar Charts
• Nominal/Ordinal
• Easier to compare
values than pie
chart
• Suitable for a wider
range of data
Histogram
Histograms
• Continuous Data
• Divide Data into
ranges
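Dividing continuous data into ranges can be sketched as below. The heights, the range limits and the helper name `histogram_counts` are assumptions made for illustration; values that sit exactly on a range boundary may be affected by floating-point rounding, so the sample data avoids them.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall in each of `bins` equal-width ranges over [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # The top edge is folded into the last range.
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts


# Made-up heights in metres.
heights = [1.52, 1.58, 1.64, 1.68, 1.71, 1.75, 1.79, 1.85, 1.93]
counts = histogram_counts(heights, low=1.50, high=2.00, bins=5)

# A simple text histogram: one '#' per value in the range.
for i, c in enumerate(counts):
    start = 1.50 + i * 0.10
    print(f"{start:.2f}-{start + 0.10:.2f} m | {'#' * c}")
```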
Dot Plot
Dot Plots
• Nominal/Ordinal
• Represents all the
data
• Difficult to read
Scatter Plot
Scatter Plots
• Excellent for
examining
association
between two
variables
Time-Series Plot
Time-Series Plots
• Time related Data
• e.g. Stock Prices
Box Plot
Box Plots
• Continuous data (optionally grouped
by a nominal/ordinal variable)
• Q1 - First quartile
• Q3 - Third quartile
• IQR - Interquartile range (Q3 - Q1)
• Outliers
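The numbers behind a box plot can be computed with the standard library. This is a sketch under two assumptions that the slide does not state: quartiles are computed with the "inclusive" method of `statistics.quantiles`, and outliers are points more than 1.5 × IQR beyond the quartiles (a common convention). The data and the helper name `box_plot_summary` are made up.

```python
import statistics


def box_plot_summary(values):
    """Return Q1, median, Q3, IQR and the points outside the 1.5 * IQR fences."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(data)
print(summary)
```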
John Tukey
• Born June 16, 1915
• Died July 26, 2000
• Born in New Bedford,
Massachusetts
• He introduced the box
plot in his 1977 book
"Exploratory Data
Analysis"
• Also known for the Cooley–Tukey
FFT algorithm and
jackknife estimation
• While working with John von Neumann on early computer
designs, Tukey introduced the word "bit" as a contraction of
"binary digit". The term "bit" was first used in print in an article by
Claude Shannon in 1948.
• The term "software", which Paul Niquette claims he coined in
1953, was first used in print by Tukey in a 1958 article in
American Mathematical Monthly, and thus some people
attribute the term to him.
[Photos: John Tukey, Paul Niquette, Claude Shannon, John von Neumann]
ACTIVITY
Question 1
In a telephone survey of 68
households, respondents were asked
whether they have pets. The
responses were:
• 16 : No Pets
• 28 : Dogs
• 32 : Cats
Draw the appropriate graphic to
illustrate the results !!
Question 1 - Solution
Total number surveyed = 68
Number with no pets = 16
=>Total with pets = (68 - 16) = 52
But total 28 dogs + 32 cats = 60
=> So some people have both cats and dogs
Question 1 - Solution
How many? It must be (60 - 52) = 8 households
=> No pets = 16
=> Dogs only = 20
=> Cats only = 24
=> Both = 8
-------------------------
Total = 68
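The arithmetic in this solution can be checked with a few lines of code. The variable names are made up for illustration; the numbers are the ones given in the question.

```python
total_households = 68
no_pets = 16
dog_owners = 28   # includes households that also have cats
cat_owners = 32   # includes households that also have dogs

with_pets = total_households - no_pets        # households with at least one pet
both = dog_owners + cat_owners - with_pets    # counted twice in 28 + 32
dogs_only = dog_owners - both
cats_only = cat_owners - both

# The four groups do not overlap, so they must add back up to the total.
assert no_pets + dogs_only + cats_only + both == total_households

groups = {"No pets": no_pets, "Dogs only": dogs_only, "Cats only": cats_only, "Both": both}
# A simple text bar chart of the four groups.
for label, value in groups.items():
    print(f"{label:<10} | {'#' * value} {value}")
```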
Question 1 - Solution
Graphic: Pie Chart or Bar Chart
• Bar Chart is easier to read
More Complex Diagrams
Stacked Graph (Stream Graph)
• Stacked Graph of
Unemployed U.S.
Workers by Industry
• Stacked graphs do
not support negative
numbers and are
meaningless for data
that should not be
summed
(temperatures, for
instance).
Parallel Coordinates
• Used for visualizing
multivariate data.
Instead of graphing
every pair of
variables in two
dimensions, we
repeatedly plot the
data on parallel axes
and then connect the
corresponding
points with lines
Flow Map
• A flow map can depict
the movement of a
quantity in space and
(implicitly) in time.
• Here we see a modern
interpretation of
Charles Minard's
depiction of
Napoleon's ill-fated
march on Moscow.
Node-Link Diagrams
• The word "tree" is used
interchangeably with
"hierarchy", as the
fractal branches of an
oak might mirror the
nesting of data.
• If we take a two-
dimensional blueprint
of a tree, we have a
popular choice for
visualizing hierarchies:
a node-link diagram.
Arc Diagrams
• An arc diagram uses a
one-dimensional layout
of nodes, with circular
arcs to represent links.
While arc diagrams may
not convey the overall
structure of the graph as
effectively as a two-
dimensional layout, with
a good ordering of
nodes it is easy to
identify cliques and
bridges.
Dataset Characteristics
• Some students will be using a dataset as part of their
research. This is typically thousands of rows of data.
• We are not talking about the data you might be
collecting from surveys and interviews, but rather a
pre-existing set of data.
• If the data is the key consideration in your research
(although not all projects will necessarily be
concerned with large datasets) it is important to
consider several questions.
Dataset Characteristics: Questions
• How suitable is the data?
• What is the type of the
data?
• Where will you get it
from?
• What size is the dataset?
• What format is it in?
• How much cleaning is
required?
• What is the quality of the
data?
• How do you deal with
missing data?
• How will you evaluate
your analysis?
• etc.
Dataset Characteristics: Suitability
• Determining the suitability of the data is a vital
consideration. It is not sufficient to simply locate a
dataset that is thematically linked to your research
question; it must be appropriate for exploring the
questions that you want to ask.
• For example, just because you want to do Credit Card
Fraud detection and you have a dataset that contains
Credit Card transactions, or one that was used in another
Credit Card Fraud project, does not mean that it will be
suitable for your project.
Suitability: Labelling
• Is the data already labelled?
• This is very important for supervised learning
problems.
• To take the credit card fraud example again, you can
probably get as many credit card transactions as you
like but you probably won't be able to get them
marked up as fraudulent and non-fraudulent.
• The same thing goes for a lot of text analytics
problems - can you get people to label thousands of
documents as being interesting or non-interesting
to them so that you can train a predictive model?
• The availability of labelled data is a key consideration
for any supervised learning problem.
• The areas of semi-supervised learning and active
learning try to address this problem and have some
very interesting open research questions.
• Two important considerations:
• The Curse of Dimensionality – When the dimensionality increases,
the volume of the space increases so fast that the available data
becomes sparse. In order to obtain a statistically sound result,
the amount of data you need often grows exponentially with the
dimensionality.
• The No Free Lunch Theorem - Classifier performance depends
greatly on the characteristics of the data to be classified. There is
no single classifier that works best on all given problems.
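The Curse of Dimensionality can be made concrete by counting grid cells. The sketch below assumes each feature is cut into 10 ranges and that the dataset has 10,000 rows; both numbers, and the helper name `cells`, are made up for illustration.

```python
def cells(bins_per_axis, dimensions):
    """Number of grid cells when each of `dimensions` axes is cut into `bins_per_axis` ranges."""
    return bins_per_axis ** dimensions


n_rows = 10_000  # hypothetical dataset size

for d in (1, 2, 3, 5, 10):
    total = cells(10, d)
    # Upper bound on coverage: each row can occupy at most one cell.
    max_occupied = min(n_rows, total) / total
    print(f"{d:>2} features: {total:>14,} cells, at most {max_occupied:.4%} occupied")
```

With 10 features, even in the best case the rows can touch only a tiny fraction of the cells, which is what "the available data becomes sparse" means in practice.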
• Also remember that, for labelling, you might be aiming for one of
three goals:
• Binary classification - assigning each data item to one of two
categories.
• Multiclass classification - assigning each data item to exactly one of
more than two categories.
• Multi-label classification - assigning each data item one or more
target labels at once.
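The three labelling goals (binary, multiclass and multi-label) differ in the shape of the label attached to each data item. The sketch below shows one possible representation; the item identifiers, the class names and the helper `to_indicator` are hypothetical.

```python
# Binary: each item gets exactly one of two labels.
binary_labels = {"txn_001": "fraud", "txn_002": "not fraud"}

# Multiclass: each item gets exactly one label from more than two.
multiclass_labels = {"doc_001": "sport", "doc_002": "politics", "doc_003": "science"}

# Multi-label: each item can carry several labels at once.
multilabel_labels = {"doc_001": {"sport", "business"}, "doc_002": {"politics"}}


def to_indicator(labels, classes):
    """Encode a set of labels as a 0/1 vector, one position per class."""
    return [1 if c in labels else 0 for c in classes]


classes = ["business", "politics", "science", "sport"]
encoded = to_indicator(multilabel_labels["doc_001"], classes)
print(encoded)
```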
Types of Data
• Federated data
• High dimensional data
• Descriptive data
• Longitudinal data
• Streaming data
• Web (scraped) data
• Numeric vs. categorical vs. text data
• Image data
• Video data
• Audio data
• etc.
Locating Datasets
•e.g.
•http://www.kdnuggets.com/datasets/
•http://www.google.com/publicdata/directory
•http://opendata.ie/
•http://lib.stat.cmu.edu/datasets/
Size of the Dataset
• What is a reasonable size for a dataset?
• Obviously it varies a lot from problem to problem, but
in general we would recommend at least 10 features
(columns) in the dataset, and we’d like to see
thousands of instances.
Format of the Data
• TXT (Text file)
• MIME (Multipurpose Internet Mail Extensions)
• XML (Extensible Markup Language)
• CSV (Comma-Separated Values)
• ASCII (American Standard Code for Information
Interchange)
• etc.
Cleaning of Data
•Parsing
•Correcting
•Standardizing
•Matching
•Consolidating
Quality of the Data
•Frequency counts
•Descriptive statistics (mean, standard
deviation, median)
•Normality (skewness, kurtosis, frequency
histograms, normal probability plots)
•Associations (correlations, scatter plots)
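Several of the data quality checks (frequency counts, descriptive statistics and skewness) can be sketched with the standard library. The column of values is made up, and the `skewness` helper uses the population formula, which is one of several definitions in use.

```python
import statistics
from collections import Counter

values = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # made-up numeric column

frequency = Counter(values)          # frequency counts
mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.stdev(values)     # sample standard deviation


def skewness(xs):
    """Population skewness: third central moment over the 1.5 power of the second."""
    m = sum(xs) / len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m3 = sum((x - m) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5


print(frequency.most_common(1), mean, median, round(stdev, 2), round(skewness(values), 2))
```

A mean well above the median, together with positive skewness, is a hint that a few large values (here, 30) are pulling the distribution to the right.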
Missing Data?
•Imputation
•Partial imputation
•Partial deletion
•Full analysis
•Also consider database nullology
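Two of the options for missing data, deletion and imputation, can be compared on a small example. The ages are made up, `None` is assumed to mark a missing value, and mean imputation is only one of many possible imputation strategies.

```python
import statistics

ages = [34, None, 29, 41, None, 36]  # None marks a missing value (made-up data)

# Deletion: drop the rows where the value is missing.
observed = [a for a in ages if a is not None]

# Imputation: fill each gap with the mean of the observed values.
fill = statistics.mean(observed)
imputed = [fill if a is None else a for a in ages]

print(observed, imputed)
# Mean imputation keeps the mean but shrinks the spread of the data.
print(statistics.stdev(observed), statistics.stdev(imputed))
```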
Dataset types
•Training Dataset (Build dataset)
•Test Dataset
•Apply Dataset (Scoring Dataset)
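Separating a training dataset from a test dataset can be sketched as a shuffled split. The helper name `split_rows`, the 80/20 proportion and the fixed seed are assumptions made for illustration, not a recommendation from the slides.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 rows of data
train, test = split_rows(rows)
print(len(train), len(test))
```

The test rows are held back while the model is built, so that they can give an honest estimate of how it behaves on data it has not seen.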
Evaluation
•What about measures like these?
•Area under the Curve
•Misclassification Error
•Confusion Matrix
•N-fold Cross Validation
•ROC Graph
•Log-Loss and Hinge-Loss
• These are good for evaluating the analysis: they
check how good the model is based
on the dataset, and they are definitely part of the
evaluation. But if you want to discuss the findings
with respect to the real world (and to the research
question), you must also do the following:
•Test predictions against the real world
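Two of the measures listed under Evaluation, the confusion matrix and the misclassification error, can be sketched for a binary problem as follows. The labels (1 for the positive class, 0 for the negative class), the example predictions and the helper names are made up for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a binary problem with labels 1 and 0."""
    pairs = Counter(zip(actual, predicted))
    return {
        "TP": pairs[(1, 1)],  # actual 1, predicted 1
        "FN": pairs[(1, 0)],  # actual 1, predicted 0
        "FP": pairs[(0, 1)],  # actual 0, predicted 1
        "TN": pairs[(0, 0)],  # actual 0, predicted 0
    }


def misclassification_error(cm):
    """Fraction of all predictions that were wrong."""
    wrong = cm["FP"] + cm["FN"]
    return wrong / sum(cm.values())


actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
print(cm, misclassification_error(cm))
```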
Thanks !!!
