Lecture Outline
• Goals of visualisation
• Usefulness of visualisation
• Basic data visualisation with R and Python
• Visualising categorical variables
• Visualising numerical variables
• Visualising the relationship between variables
Visualisation in Data Analytics
"We cannot expect a small number of numerical values
[summary statistics] to consistently convey the wealth of
information that exists in data. Numerical reduction
methods do not retain the information in the data."
William Cleveland, The Elements of Graphing Data
"The simple graph has brought more information to the
data analyst's mind than any other device."
John Tukey
The use of graphics to examine data is called visualisation.
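Cleveland's point about summary statistics can be checked numerically. The sketch below uses two of the four datasets from Anscombe's quartet, a well-known published example that is not part of the lecture data. The helper name `summarise` is an assumption made for illustration.

```python
from statistics import mean, variance

# Two of the y series from Anscombe's quartet (Anscombe, 1973).
# Both are paired with the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def summarise(values):
    """Return (mean, sample variance), each rounded to two decimals."""
    return round(mean(values), 2), round(variance(values), 2)


summary_1 = summarise(y1)
summary_2 = summarise(y2)

# The two summaries agree to two decimals even though the raw values differ.
print("y1:", summary_1)
print("y2:", summary_2)
print("identical data?", y1 == y2)
```

Plotted against x, the first series follows a roughly straight line while the second follows a curve, a difference that the mean and variance alone do not reveal.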
Visualisation in Data Analytics
An important step in the data science methodology is obtaining a visual
representation of the data.
This has multiple advantages:
• We are better at extracting information from visual cues, so a visual
representation is usually more intuitive than a textual representation.
• A visualisation provides a concise snapshot and summarisation of the
data.
The goal of data visualisation is to convey a story to the viewer. This story
could be a general trend in the data or a specific insight.
A picture is worth a thousand words
This visualisation summarises the relationship between BMI and pulse
and the corresponding health status. What do you discover?
“The greatest value of a picture is when it forces us to notice what we never expected to see.”
What Makes a Good Visualisation?
The McCandless Method
Four elements to achieve success in
data visualisation.
1. Information: the data you are working
with must be accurate
2. Story: a clear, compelling, interesting
and relevant concept
3. Goal: a specific objective or function
for the visual
4. Visual form: an effective use of
metaphor or visual expression
Source:
https://www.informationisbeautiful.net/visualizations/what-makes-a-good-data-visualization/
Data Visualisation with R
• R includes a basic graphing function, plot(), but we will use the ggplot2 package.
> install.packages("ggplot2")
> library(ggplot2)
> ggplot(data = dta, aes(sex)) + geom_bar(fill = "blue")
The three main components of a ggplot command are:
• Data: dta is the dataset being summarised.
We refer to this as the data component.
• Aesthetic mapping: The plot uses visual cues to
represent the information in the dataset.
aes(sex) maps the sex variable from the dataset to the x-axis. We
refer to this as the aesthetic mapping component.
• Geometry: geom_bar indicates that the plot is a bar graph.
This is referred to as the geometry component.
To use ggplot2 you will have to learn several functions and arguments. These are hard to
memorise, so we highly recommend you have the ggplot2 cheat sheet handy.
Summarising a continuous numerical variable
• The first step of summarising a continuous
numerical variable is to identify the
distribution of the variable using a histogram
or boxplot.
• A histogram groups the values into intervals (bins) and reveals
the overall shape of the frequencies across those bins.
• Suppose we want to visualise the distribution
of the weight of the respondents in our
sample.
• A bar graph has little explanatory power here
because the variable is continuous: almost every value
is distinct, so nearly every bar has the same height.
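The difference between counting distinct values and counting values per bin can be sketched in Python. The weights here are simulated, not the lecture's dataset, and the names `weight`, `n_distinct` and `bin_counts` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 200 simulated body weights in kg (not the lecture data).
rng = np.random.default_rng(seed=1)
weight = pd.Series(rng.normal(loc=70, scale=12, size=200), name="weight")

# A bar graph draws one bar per distinct value. For a continuous variable
# almost every value is distinct, so almost every bar would have height 1.
n_distinct = weight.nunique()

# A histogram first groups the values into intervals (bins), then counts.
bin_counts = pd.cut(weight, bins=10).value_counts(sort=False)

print("distinct values:", n_distinct)
print(bin_counts)
```

The ten bin counts are what a histogram with ten bins would draw, which is why it shows the shape of the distribution while a bar graph of raw values does not.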
Data Visualisation with R
To summarise a numerical variable with a histogram:
> ggplot(dta, aes(x = height)) +
  geom_histogram(bins = 10, fill = "blue")
We can add more arguments, for example to overlay a density curve
(ggplot2 3.4 or later; older versions use ..density.. and size instead):
> ggplot(dta, aes(x = height, y = after_stat(density))) +
  geom_histogram(bins = 10, fill = "blue") +
  geom_density(color = "red", linewidth = 1.2)
Using a boxplot to summarise a numerical variable:
> ggplot(dta, aes(weight)) + geom_boxplot() +
  coord_flip()
Note: For a variable with a right-skewed distribution
and non-negative values (such as income or number of
employees), we may need to use a logarithmic scale
for the histogram or boxplot.
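The effect of a logarithmic scale on a right-skewed variable can be sketched in Python. The incomes are simulated from a log-normal distribution, which is an assumption chosen for illustration and not the lecture's data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a plot window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable: 1000 simulated incomes.
rng = np.random.default_rng(seed=0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000), name="income")
log_income = np.log10(income)

# Sample skewness: far above zero on the raw scale, close to zero after
# taking logs.
raw_skew = income.skew()
log_skew = log_income.skew()
print(f"skewness raw: {raw_skew:.2f}, log10: {log_skew:.2f}")

# Histograms of the same variable side by side, raw and on the log scale.
fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(9, 3))
income.plot.hist(bins=30, ax=ax_raw, title="income")
log_income.plot.hist(bins=30, ax=ax_log, title="log10(income)")
fig.tight_layout()
```

The logarithm is only defined for strictly positive values, so a variable that contains zeros needs extra handling first, for example `np.log1p`.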
Data Visualisation with R
Use a stacked or clustered bar graph to visualise the
association between two categorical variables.
> ggplot(dta, aes(sex, fill = status)) +
  geom_bar(position = "stack")
> ggplot(dta, aes(sex, fill = status)) +
  geom_bar(position = "dodge")
To compare proportions rather than counts, use position = "fill":
> ggplot(dta, aes(sex, fill = status)) +
  geom_bar(position = "fill")
Use a multiple boxplot to visualise the association
between a categorical and a numerical variable.
> ggplot(dta, aes(status, bmi)) +
  geom_boxplot() + coord_flip()
Use a scatterplot to visualise the association between two
numerical variables.
> ggplot(dta, aes(bmi, pulse)) + geom_point() +
  stat_smooth(method = "lm")
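The straight line that stat_smooth(method = "lm") adds to a scatterplot is an ordinary least-squares fit. A minimal Python sketch of that fit follows, using `numpy.polyfit` on made-up BMI and pulse readings. The values are hypothetical, and the sketch does not reproduce the confidence band that ggplot2 also draws by default.

```python
import numpy as np

# Hypothetical BMI and pulse readings (illustrative values only).
bmi = np.array([19.5, 21.0, 22.4, 24.1, 25.3, 27.8, 30.2, 33.0])
pulse = np.array([62.0, 66.0, 64.0, 70.0, 73.0, 75.0, 80.0, 84.0])

# Fit pulse = intercept + slope * bmi by ordinary least squares.
# polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(bmi, pulse, deg=1)

# Fitted values on the line and the vertical distances from each point to it.
fitted = intercept + slope * bmi
residuals = pulse - fitted

print(f"pulse = {intercept:.2f} + {slope:.2f} * bmi")
```

A positive slope means pulse tends to rise with BMI in this toy sample. The residuals of a least-squares line with an intercept always sum to zero, up to rounding error.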
Data Visualisation with Python
We can use the pandas, matplotlib and seaborn packages for visualisation in Python.
In pandas, append the "plot" accessor, followed by a suitable graph type, to a Series or DataFrame. To plot a bar chart for a
categorical variable, apply plot.bar() to its category counts.
We can pass additional arguments in the brackets, such as the colour of the bars or the font size.
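A minimal sketch of a pandas bar chart for a categorical variable is given here. The small `dta` frame is a hypothetical stand-in for the lecture's dataset, and the colour, font size and title are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a plot window
import pandas as pd

# Hypothetical stand-in for the lecture's dataset.
dta = pd.DataFrame({"sex": ["F", "M", "F", "F", "M", "F"]})

# Count the observations in each category, then draw one bar per category.
counts = dta["sex"].value_counts()
ax = counts.plot.bar(color="blue", fontsize=12, title="Respondents by sex")
ax.set_ylabel("Count")

print(counts)
```

`plot.bar()` returns a matplotlib Axes object, so labels and other details can be adjusted afterwards with the usual matplotlib methods.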
Data Visualisation with Python
To summarise a numerical variable, use plot.hist():
dta["weight"].plot.hist()
To add a density line, use the seaborn package:
import seaborn as sns
sns.histplot(dta["weight"], bins=12, color='k', kde=True)
Using a boxplot to summarise a numerical variable:
dta["weight"].plot.box()
dta.boxplot(column=["weight"])
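To read a boxplot it helps to know which numbers it draws. The sketch below computes them for a small made-up sample, assuming the default rule of matplotlib (which pandas uses for plotting): points more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers. The weights are hypothetical.

```python
import pandas as pd

# Hypothetical weights in kg, including one unusually large value.
weight = pd.Series([50, 55, 60, 62, 65, 68, 70, 72, 75, 120], name="weight")

# The box spans the first to the third quartile; the line inside is the median.
q1, median, q3 = weight.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are plotted individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = weight[(weight < lower_fence) | (weight > upper_fence)].tolist()

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("outliers:", outliers)
```

Calling `weight.plot.box()` on the same Series should draw a box between the two quartiles and a single point for the value flagged here.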
Data Visualisation with Python
To visualise the relationship between two categorical
variables, append plot.bar() to the cross-tabulation of
the categorical variables.
pd.crosstab(dta.sex, dta.status).plot.bar()
Pass stacked=True for a stacked bar graph, and normalize="index"
to show proportions within each row instead of counts:
pd.crosstab(dta.sex, dta.status).plot.bar(stacked=True)
pd.crosstab(dta.sex, dta.status, normalize="index").plot.bar(stacked=True)
Use a multiple boxplot to visualise the association
between a categorical and a numerical variable.
dta.boxplot(column=["pulse"], by="exercise", showmeans=True)
Use seaborn's regplot to visualise the association
between two numerical variables with a trend line.
sns.regplot(x="BMI", y="pulse", data=dta)
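The difference between a plain cross-tabulation and one with normalize="index" is easiest to see on a small table. The sketch below uses a hypothetical ten-row `dta` frame, not the lecture's data.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's dataset: 4 women and 6 men.
dta = pd.DataFrame({
    "sex": ["F", "F", "F", "F", "M", "M", "M", "M", "M", "M"],
    "status": ["good", "good", "good", "poor",
               "good", "good", "good", "poor", "poor", "poor"],
})

# Raw counts: one row per sex, one column per health status.
counts = pd.crosstab(dta["sex"], dta["status"])

# Row proportions: each row is divided by its own total, so it sums to 1.
shares = pd.crosstab(dta["sex"], dta["status"], normalize="index")

print(counts)
print(shares)
# shares.plot.bar(stacked=True) would draw every bar with a total height of 1.
```

In this toy table both groups contain the same number of respondents in good health, yet the proportions differ because the groups are not the same size. That is the situation in which the normalised stacked bar graph is the more informative choice.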
Scatter plot with a categorical variable
We can also add a third dimension to a scatter plot by
giving the dots a different colour for each group
of a categorical variable. For example, we can assign
a different colour to observations with a different health
status.
In Python (seaborn):
sns.scatterplot(x="BMI", y="pulse", data=dta, hue="status")
In R (ggplot2):
ggplot(dta, aes(bmi, pulse, colour = status)) +
  geom_point(size = 3)
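The same idea can be written with pandas and plain matplotlib by drawing one set of points per group. This is a sketch under stated assumptions: the `dta` frame holds made-up values, and the loop is an illustration of colouring by group, not a description of how seaborn or ggplot2 work internally.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a plot window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the lecture's dataset.
dta = pd.DataFrame({
    "BMI": [20.1, 22.5, 24.0, 23.2, 29.5, 31.0, 33.4, 27.8],
    "pulse": [60, 64, 66, 63, 78, 82, 85, 74],
    "status": ["good", "good", "good", "good",
               "poor", "poor", "poor", "poor"],
})

fig, ax = plt.subplots()

# One scatter call per health status; matplotlib assigns each call the
# next colour in its default colour cycle.
for status, group in dta.groupby("status"):
    ax.scatter(group["BMI"], group["pulse"], label=status)

ax.set_xlabel("BMI")
ax.set_ylabel("pulse")
ax.legend(title="status")
```

Each group ends up as a separate collection of points with its own legend entry, which is the visual effect that the colour mapping produces.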
Summary of the lecture
In this section, we covered:
• the goals of visualisation
• usefulness of visualisation
• how to visualise categorical variables
• how to visualise numerical variables
• visualising the relationship between variables