5. Data Science
• OK, so what is Data Science?
• Extracting insight and information from data sets to make better decisions.
• Kelleher, J.D., Tierney, B. (2018) Data Science. MIT Press.
6. Data Science
• There is a (possibly apocryphal) story that is often used to illustrate data mining, and it's called the "Beers and Nappies" story.
7. Data Science
• The story goes that a large American supermarket, usually said to be Walmart, was exploring the sales data from its cash registers. The data is stored one customer's purchase after another, but when the supermarket mined the dataset, they looked at each product to see whether it was commonly associated with any other products.
8. Data Science
• They found an unexpected pattern between the purchase of beers and the purchase of nappies. The supermarket started placing those two products right beside each other on the supermarket floor, and they made lots of money.
9. Data Science
• The explanation for the association between the products could not be deduced from the dataset, but the cashiers explained it: if a couple with a baby have one partner at home minding the baby and one going to work, the partner who is going to work will pop into the supermarket after work to buy some nappies, and will decide that they need to get themselves some beers as well ;-)
10. Data Science: Statistics
• A model is really an approximation of something else. It's not supposed to be a perfect representation.
11. Data Science: Statistics
[Figure: real data points plotted alongside three candidate prediction lines, labelled A, B, and C]
• Which is the best prediction for new data, based on the real data: A, B, or C?
12. Data Science: Statistics
• Any of the three predictions (A, B, and C) is possible in terms of new data, so there is no "right answer", but based on the linear model we have created for the existing data, line B looks like the most likely predictor of any new data.
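As a minimal sketch of this idea (with made-up data, not the data in the figure), we can fit a linear model with scikit-learn and use the fitted line to predict values for new points:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up "real data": a roughly linear relationship plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=(30, 1))
y = 2.0 * x.ravel() + 1.0 + rng.normal(0, 1.5, size=30)

# Fit a linear model to the existing data
model = LinearRegression().fit(x, y)

# The fitted line (like line B above) is the best guess for new data
x_new = np.array([[7.5]])
print(model.predict(x_new))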
13. Data Science: Some Software Tools
• Python (with Pandas, NumPy, Scikit-learn, Matplotlib)
• TensorFlow
• Hadoop
• R Programming Language
• WEKA
• Tableau
14. Data Science: Main Application Areas
• Healthcare
• Finance
• Transportation
• Marketing
• Energy Consumption
• Sports
• Genetics
• Manufacturing
15. Data Science
• In Data Science, one of the key formal processes (methodologies) that businesses follow is called CRISP-DM (CRoss Industry Standard Process for Data Mining), and it provides organisations with a step-by-step guide to using Data Science in businesses.
16. Data Science: Computer Science
• DATA CLEANING (or Data Cleansing) is fixing or removing data that is incorrect (in some way) in the dataset.
17. Data Science: Computer Science
• Let's imagine one of the columns of the dataset is a date, but different rows have different formats, e.g.
• 12-3-1992
• 06/11/1946
• 23rd November 2022
18. Data Science: Computer Science
• We can write a computer program to reformat all of these dates into one common format, e.g.
• DD-MM-YYYY
• This is called Data Transformation.
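A minimal sketch of such a program, using Python's dateutil parser (an assumed choice) to read the three formats above and write them back out as DD-MM-YYYY; dayfirst=True is assumed because the sample dates are day-first:

```python
from dateutil import parser

raw_dates = ["12-3-1992", "06/11/1946", "23rd November 2022"]

# Parse each messy date, then re-format it to a common DD-MM-YYYY string
cleaned = [parser.parse(d, dayfirst=True).strftime("%d-%m-%Y")
           for d in raw_dates]
print(cleaned)  # ['12-03-1992', '06-11-1946', '23-11-2022']
```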
19. Data Science: Computer Science
• Another issue might be that some of the rows of data are recorded multiple times, so we can write a program to scan for this kind of duplication.
• This is called Duplicate Elimination.
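With Pandas this scan is a single call; a sketch on a made-up table:

```python
import pandas as pd

# Made-up purchases, with one row recorded twice
df = pd.DataFrame({
    "customer": ["Ann", "Bob", "Bob", "Cara"],
    "product":  ["beer", "nappies", "nappies", "milk"],
})

# Drop rows that are exact copies of an earlier row
deduped = df.drop_duplicates()
print(deduped)
```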
20. Data Science: Computer Science
• One more issue to mention is that if a column has text in it, we can write programs to check if the text is suitable.
• This is called Parsing.
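A minimal sketch of such a check, using a regular expression; the rule itself (a product code is three capital letters followed by four digits) is purely an assumption for illustration:

```python
import re

# Assumed rule: three capital letters followed by four digits
pattern = re.compile(r"[A-Z]{3}\d{4}")

codes = ["ABC1234", "XYZ0001", "bad code"]
for code in codes:
    ok = bool(pattern.fullmatch(code))
    print(code, "->", "valid" if ok else "invalid")
```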
21. Data Science: Computer Science
• Another area where computer programs can help us is in creating graphs to show trends in the data.
• This is called Data Visualisation.
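A minimal Matplotlib sketch, with made-up monthly figures:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 128, 150, 162]  # illustrative numbers only

plt.plot(months, sales, marker="o")
plt.title("Monthly sales trend")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()
```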
24. Continuous Data
• It's data with a decimal place.
• Continuous Data is data that can take on any value within a given range. It can be measured to an infinite level of precision.
• e.g. height: 1.8542 metres.
• e.g. time: 3 hrs, 4 mins, 34 secs, 34 ms, etc.
25. Discrete Data
• It's data without a decimal place.
• Discrete Data is data that consists of distinct, separate values that can be counted.
• e.g. number of days worked this week: 3.
• e.g. number of leaves on a tree: 426.
26. Ordinal Data
• It's data with ordered categories.
• Ordinal Data is categorical data where the categories have a meaningful order or ranking.
• e.g. {Very Good, Good, O.K., Bad, Very Bad}
• e.g. Pain severity rated as {0 (no pain), 1 (mild), 2 (moderate), 3 (severe)}
27. Nominal Data
• It's data without ordered categories.
• Nominal Data is categorical data that consists of distinct categories with no inherent order or ranking.
• e.g. {Yes, No}
• e.g. {Teacher, Chemist, Haberdasher}
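In Pandas, both kinds can be represented with pd.Categorical; setting ordered=True is what makes the data ordinal rather than nominal. A sketch using the examples above:

```python
import pandas as pd

# Ordinal: the categories carry a meaningful ranking
pain = pd.Categorical(
    ["mild", "severe", "no pain", "moderate"],
    categories=["no pain", "mild", "moderate", "severe"],
    ordered=True,
)
print(pain.min(), pain.max())  # "no pain" and "severe"

# Nominal: distinct categories with no inherent order
occupation = pd.Categorical(["Teacher", "Chemist", "Haberdasher"])
```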
46. John Tukey
• Born June 16, 1915
• Died July 26, 2000
• Born in New Bedford, Massachusetts
• He introduced the box plot in his 1977 book "Exploratory Data Analysis"
• Also the Cooley–Tukey FFT algorithm and jackknife estimation
47. • While working with John von Neumann on early computer designs, Tukey introduced the word "bit" as a contraction of "binary digit". The term "bit" was first used in an article by Claude Shannon in 1948.
• The term "software", which Paul Niquette claims he coined in 1953, was first used in print by Tukey in a 1958 article in American Mathematical Monthly, and thus some people attribute the term to him.
[Photos: John Tukey, Paul Niquette, Claude Shannon, John von Neumann]
49. Question 1
In a telephone survey of 68 households, when asked whether they have pets, the following were the responses:
• 16: No Pets
• 28: Dogs
• 32: Cats
Draw the appropriate graphic to illustrate the results!
50. Question 1 - Solution
Total number surveyed = 68
Number with no pets = 16
=> Total with pets = (68 - 16) = 52
But total 28 dogs + 32 cats = 60
=> So some people have both cats and dogs
52. Question 1 - Solution
How many? It must be (60 - 52) = 8 people who have both.
=> No pets = 16
=> Dogs only = (28 - 8) = 20
=> Cats only = (32 - 8) = 24
=> Both = 8
-------------------------
Total = 68
53. Question 1 - Solution
Graphic: Pie Chart or Bar Chart
54. Question 1 - Solution
Graphic: Pie Chart or Bar Chart (a Bar Chart is easier to read)
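A minimal Matplotlib sketch of that bar chart, using the counts from the worked solution:

```python
import matplotlib.pyplot as plt

categories = ["No pets", "Dogs only", "Cats only", "Both"]
counts = [16, 20, 24, 8]  # from the solution above (sums to 68)

plt.bar(categories, counts)
plt.title("Pets in 68 surveyed households")
plt.ylabel("Number of households")
plt.show()
```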
56. Stacked Graph (Stream Graph)
• Stacked Graph of Unemployed U.S. Workers by Industry
• Stacked graphs do not support negative numbers and are meaningless for data that should not be summed (temperatures, for instance).
57. Parallel Coordinates
• Used for visualizing multivariate data. Instead of graphing every pair of variables in two dimensions, we repeatedly plot the data on parallel axes and then connect the corresponding points with lines.
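Pandas ships a helper for this; a sketch on a small made-up multivariate dataset, with one line per row and one parallel axis per numeric column:

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Made-up multivariate data; "species" is the class column
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 4.9],
    "sepal_width":  [3.5, 3.2, 3.3, 3.0],
    "petal_length": [1.4, 4.7, 6.0, 1.4],
    "species": ["setosa", "versicolor", "virginica", "setosa"],
})

parallel_coordinates(df, "species")
plt.show()
```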
58. Flow Map
• A flow map can depict the movement of a quantity in space and (implicitly) in time.
• Here we see a modern interpretation of Charles Minard's depiction of Napoleon's ill-fated march on Moscow.
59. Node-Link Diagrams
• The word "tree" is used interchangeably with "hierarchy", as the fractal branches of an oak might mirror the nesting of data.
• If we take a two-dimensional blueprint of a tree, we have a popular choice for visualizing hierarchies: a node-link diagram.
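A minimal sketch using NetworkX (an assumed choice of library) to draw a tiny hierarchy as a node-link diagram:

```python
import networkx as nx
import matplotlib.pyplot as plt

# A tiny hierarchy: each edge points from parent to child
tree = nx.DiGraph()
tree.add_edges_from([
    ("root", "a"), ("root", "b"),
    ("a", "a1"), ("a", "a2"), ("b", "b1"),
])

nx.draw(tree, with_labels=True, node_color="lightblue")
plt.show()
```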
61. Arc Diagrams
• An arc diagram uses a one-dimensional layout of nodes, with circular arcs to represent links. While arc diagrams may not convey the overall structure of the graph as effectively as a two-dimensional layout, with a good ordering of nodes it is easy to identify cliques and bridges.
63. Dataset Characteristics
• Some students will be using a dataset as part of their research. This is typically thousands of rows of data.
• We are not talking about the data you might be collecting from surveys and interviews, but rather a pre-existing set of data.
• If the data is the key consideration in your research (although not all projects will necessarily be concerned with large datasets), it is important to consider several questions.
64. Dataset Characteristics: Questions
• How suitable is the data?
• What is the type of the data?
• Where will you get it from?
• What size is the dataset?
• What format is it in?
• How much cleaning is required?
• What is the quality of the data?
• How do you deal with missing data?
• How will you evaluate your analysis?
• etc.
65. Dataset Characteristics: Suitability
• Determining the suitability of the data is a vital consideration. It is not sufficient to simply locate a dataset that is thematically linked to your research question; it must be appropriate for exploring the questions that you want to ask.
• For example, just because you want to do Credit Card Fraud detection and you have a dataset that contains Credit Card transactions or was used in another Credit Card Fraud project does not mean that it will be suitable for your project.
66. Suitability: Labelling
• Is the data already labelled?
• This is very important for supervised learning problems.
• To take the credit card fraud example again, you can probably get as many credit card transactions as you like, but you probably won't be able to get them marked up as fraudulent and non-fraudulent.
67. Suitability: Labelling
• The same thing goes for a lot of text analytics problems: can you get people to label thousands of documents as being interesting or non-interesting to them so that you can train a predictive model?
• The availability of labelled data is a key consideration for any supervised learning problem.
• The areas of semi-supervised learning and active learning try to address this problem and have some very interesting open research questions.
68. Suitability: Labelling
• Two important considerations:
• The Curse of Dimensionality – when the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse. In order to obtain a statistically sound result, the amount of data you need often grows exponentially with the dimensionality (see the sketch after this list).
• The No Free Lunch Theorem – classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems.
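A minimal NumPy sketch of the sparsity effect: the fraction of uniformly random points in a unit cube that land inside the inscribed sphere collapses as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# As dimension d grows, almost none of the uniformly random points in
# the cube [-0.5, 0.5]^d fall inside the inscribed sphere of radius 0.5:
# the volume spreads out and the data becomes sparse.
for d in (2, 5, 10, 20):
    points = rng.uniform(-0.5, 0.5, size=(100_000, d))
    inside = (np.linalg.norm(points, axis=1) <= 0.5).mean()
    print(f"d={d:2d}: fraction inside sphere = {inside:.4f}")
```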
69. Suitability: Labelling
• Also remember, for labelling you might be aiming for one of three goals (as illustrated below):
• Binary classification – classifying each data item into one of two categories.
• Multiclass classification – classifying each data item into one of more than two categories.
• Multi-label classification – classifying each data item with multiple target labels.
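A toy sketch of what the labels look like in each case (arrays made up for illustration):

```python
import numpy as np

# Binary: each item gets one of two labels
y_binary = np.array([0, 1, 1, 0])

# Multiclass: each item gets exactly one of several labels
y_multiclass = np.array([0, 2, 1, 2])

# Multi-label: each item can carry several labels at once
# (one indicator column per possible label)
Y_multilabel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
])
```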
70. Types of Data
• Federated data
• High dimensional data
• Descriptive data
• Longitudinal data
• Streaming data
• Web (scraped) data
• Numeric vs. categorical vs. text data
• Image data
• Video data
• Audio data
• etc.
72. Size of the Dataset
• What is a reasonable size of a dataset?
• Obviously it varies a lot from problem to problem, but in general we would recommend at least 10 features (columns) in the dataset, and we'd like to see thousands of instances.
73. Format of the Data
• TXT (Text file)
• MIME (Multipurpose Internet Mail Extensions)
• XML (Extensible Markup Language)
• CSV (Comma-Separated Values)
• ASCII (American Standard Code for Information Interchange)
• etc.
75. Quality of the Data
• Frequency counts
• Descriptive statistics (mean, standard deviation, median)
• Normality (skewness, kurtosis, frequency histograms, normal probability plots)
• Associations (correlations, scatter plots)
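All four checks are a few lines in Pandas; a sketch on a made-up two-column sample:

```python
import pandas as pd

# Made-up sample: height (m) and weight (kg) of six people
df = pd.DataFrame({
    "height": [1.62, 1.75, 1.80, 1.68, 1.91, 1.75],
    "weight": [58.0, 72.5, 80.1, 63.2, 95.0, 74.3],
})

print(df["height"].value_counts())                   # frequency counts
print(df.describe())                                 # mean, std, median (50%)
print(df["height"].skew(), df["height"].kurtosis())  # normality indicators
print(df.corr())                                     # associations
```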
78. Evaluation
• What about stuff like the following?
• Area under the Curve
• Misclassification Error
• Confusion Matrix
• N-fold Cross Validation
• ROC Graph
• Log-Loss and Hinge-Loss
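Most of these are one call each in scikit-learn; a sketch assuming made-up true labels, hard predictions, and probability scores:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, log_loss)

# Made-up labels (y_true), predictions (y_pred), and scores (y_score)
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])
y_score = np.array([0.2, 0.9, 0.4, 0.1, 0.8, 0.6])

print(1 - accuracy_score(y_true, y_pred))  # misclassification error
print(confusion_matrix(y_true, y_pred))    # confusion matrix
print(roc_auc_score(y_true, y_score))      # area under the ROC curve
print(log_loss(y_true, y_score))           # log-loss
```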
79. Evaluation
• These are good for evaluating the analysis, so they are good for checking how good the model is based on the dataset, and they are definitely part of the evaluation; but if you want to discuss the findings with respect to the real world (and to the research question), you must do the following:
• Test predictions using the real world.