Introduction to Statistics
Script
Reinhard Furrer
and the Applied Statistics Group
Preface v
2 Random Variables 21
2.1 Basics of Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Discrete Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Expectation and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Independent Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 The Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.7 Limit Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.8 Functions of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.9 Bibliographic remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.10 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3 Estimation 43
3.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Construction of Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Comparison of Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Interval Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5 Bibliographic remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4 Statistical Testing 57
4.1 The General Concept of Significance Testing . . . . . . . . . . . . . . . . . . . . . 57
4.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6 Rank-Based Methods 93
6.1 Robust Point Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 Rank-Based Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3 Other Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.4 Bibliographic remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.5 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
B Calculus 205
B.1 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
B.2 Functions in Higher Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
B.3 Approximating Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
References 216
Glossary 221
This document accompanies the lecture STA120 Introduction to Statistics that has been given
each spring semester since 2013. The lecture is given in the framework of the minor in Applied
Probability and Statistics (www.math.uzh.ch/aws) and comprises 14 weeks of two hours of lecture
and one hour of exercises per week.
As the lecture’s topics are structured on a week-by-week basis, the script contains thirteen chapters, each covering “one” topic. Some of the chapters contain consolidations or in-depth studies of topics from previous chapters. The last week is dedicated to a recap/review of the material.
I have thought long and hard about an optimal structure for this script. Let me quickly
summarize my thoughts. It is very important that the document contains a structure that is
tailored to the content I cover in class each week. This inherently leads to 13 “chapters.” Instead
of covering Linear Models over four weeks, I framed the material in four seemingly different
chapters. This structure helps me to better frame the lectures: each week having a start, a set
of learning goals and a predetermined end.
So to speak, the script covers not 13 but essentially only three topics:
1. Background
2. Statistical Foundations
3. Linear Modeling
We will not cover these topics chronologically. This is not necessary and I have opted for a
smoother setting. For example, we do not cover the multivariate Gaussian distribution at the
beginning but just before we need it. This also allows for a recap of several univariate concepts.
We use a path illustrated in Figure 1.
In case you use this document outside the lecture, here are several alternative paths through
the chapters, with a minimal impact on concepts that have not been covered:
All the datasets that are not part of regular CRAN packages are available via the url
www.math.uzh.ch/furrer/download/sta120/. The script is equipped with appropriate links that
facilitate the download.
[Figure 1: Path through the chapters. Start → Background: Exploratory Data Analysis, Random Variables, Multivariate Normal Distribution → Statistical Foundations: Estimation, Statistical Testing, with a frequentist branch (Proportions, Rank-Based Methods) and a Bayesian branch (Bayesian Approach, Monte Carlo Methods) → Linear Modeling: Correlation and Simple Regression, Multiple Regression, Analysis of Variance, Design of Experiments → End.]
The lecture STA120 Introduction to Statistics formally requires the prerequisites MAT183
Stochastic for the Natural Sciences and MAT141 Linear Algebra for the Natural Sciences or
equivalent modules. For the content of these lectures we refer to the corresponding course
web pages www.math.uzh.ch/fs20/mat183 and www.math.uzh.ch/hs20/mat141. It is possible to successfully pass the lecture without having taken the aforementioned modules, though some self-study is necessary. This script and the accompanying exercises require some calculus and linear algebra: differentiation, integration, matrix notation and basic operations, and the concept of solving a linear system of equations.
Appendix B and C give the bare minimum of relevant concepts in calculus and in linear algebra.
We review and summarize the relevant concepts of probability theory in Chapter 2.
I have therefore augmented this script with short video sequences giving additional – often
more technical – insight. These videos are indicated in the margins with a ‘video’ symbol as
here.
Many have contributed to this document. A big thanks to all of them, especially (alphabet-
ically) Zofia Baranczuk, Julia Braun, Eva Furrer, Florian Gerber, Lisa Hofer, Mattia Molinaro,
Franziska Robmann, Leila Schuh and many more. Kelly Reeve spent many hours improving my
English. Without their help, you would not be reading these lines. Yet, this document needs more than a polishing, so please let me know of any necessary improvements. I highly appreciate all forms of contributions in the form of errata, examples, or text blocks. Contributions can be deposited directly in the following Google Doc sheet.
Major errors that were corrected after the lecture of the corresponding semester are listed at www.math.uzh.ch/furrer/download/sta120/errata.txt. I try hard to ensure that the pagination of the document does not change after the lecture.
Reinhard Furrer
February 2020
Chapter 1
Exploratory Data Analysis and Visualization of Data
We start with a rather pragmatic setup: suppose we have some data. This chapter illustrates the first steps thereafter: exploring and visualizing the data. Of course, many of the visualization aspects are also used after the statistical analysis. No worries, subsequent chapters come back to the questions we should ask ourselves before we start collecting data, i.e., before we start an experiment, and to how to conduct the analysis. Figure 1.1 shows one representation of a data analysis flowchart; in this chapter we discuss the two rightmost boxes.
Assuming that the data collection process is completed, the “analysis” of these data is one of the next steps. This analysis is typically done in an appropriate software environment. There are many such environments, but our prime choice is R (R Core Team, 2020); often used alternatives are SPSS, SAS and Minitab. Appendix A gives some links to R and R resources.
The first step is loading the data into the software environment. This task sounds trivial and for pre-processed, readily available datasets it often is. Cleaning one’s own and others’ data is typically
[Figure 1.1: Data analysis flowchart: phenomena to study / hypothesis to investigate → design experiment → data collection → exploratory data analysis.]
very painful and eats up much unanticipated time. We do load external data but will not cover
the aspect of data cleaning — be aware when planning your analysis.
Example 1.1. There are many datasets available in R, the command data() would list these.
Packages often provide additional datasets, which can be listed with data(package="spam"),
here for the package spam (the command data( package=.packages( all.available=TRUE))
would list all datasets from all installed packages).
Often, we will work with our own data and hence we have to “load” the data. It is recommended to store the data in a simple tabular comma-separated format, typically a csv file. After importing (loading/reading) data, it is of utmost importance to check that the variables have been properly read and that (possible) row and column names are correctly parsed. In R-Code 1.1 we load observations of mercury content in Lake Geneva sediments. There is also a commented example that illustrates how the format of the imported dataset changes. ♣
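A minimal sketch in the spirit of R-Code 1.1, assuming the csv file contains a single column named Hg (the column name is an assumption based on later code chunks):

leman <- read.csv("https://www.math.uzh.ch/furrer/download/sta120/lemanHg.csv")
str( leman)      # check that the variable has been read as numeric
head( leman)     # inspect the first rows and the column names
Hg <- leman$Hg   # extract the variable for later use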
At the beginning of any statistical analysis, an exploratory data analysis (EDA) should be
performed (Tukey, 1977). An EDA summarizes the main characteristics of the data (mainly)
graphically, i.e., observations or measured values are depicted, and qualitatively and quantita-
tively described. Each dataset tells us a ‘story’ that we should try to understand. To do so, we
should ask questions like
• What are the key summary statistics of the data? (discussed in Sections 1.2 and 1.3)
At the end of a study, results are often summarized graphically because it is generally easier
to interpret and understand graphics than values in a table. As such, graphical representation
of data is an essential part of statistical analysis, from start to finish.
Scale      Mathematical operators     Statistical measures (location and spread)
Nominal    =, ≠                       mode
Ordinal    =, ≠, <, >                 mode, median
Interval   =, ≠, <, >, +, −           mode, median, arithmetic mean, standard deviation, range
Ratio      =, ≠, <, >, +, −, *, /     mode, median, arithmetic mean, geometric mean, standard
                                      deviation, coefficient of variation, studentized range
Figure 1.2: Types of scales according to Stevens (1946) and possible mathematical
operations. The statistical measures are for a description of location and spread.
Example 1.2. The classification of elements as either “C” or “H” results in a nominal variable. If
we associate “C” with cold and “H” with hot we can use an ordinal scale (based on temperature).
In R, nominal scales are represented with factors. R-Code 1.2 illustrates the creation of
nominal and interval scales as well as some simple operations. It would be possible to create
ordinal scales as well, but we will not use them in this script.
When measuring temperature in Kelvin (absolute zero at −273.15◦ C), a statement such as
“The temperature has increased by 20%” can be made. However, a comparison of twice as hot
(in degrees Celsius) does not make sense as the origin is arbitrary. ♣
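A minimal sketch in the spirit of R-Code 1.2, creating nominal and interval scaled variables (the specific values are chosen only for illustration):

elements <- factor( c("C", "H", "H", "C", "C"))   # nominal scale, represented as a factor
table( elements)                   # frequencies; only the operations = and != are meaningful
temp <- c(15.2, 25.0, 24.1, 12.8, 16.3)   # interval scaled temperatures in degrees Celsius
temp - mean( temp)                 # differences are meaningful on an interval scale
# ratios such as temp/temp[1] are not meaningful, as the origin of the Celsius scale is arbitrary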
evaluate if the missing values are due to some random mechanism, emerge consistently or with
some deterministic pattern, appear in all variables, for example.
For a basic analysis, one often discards observations that have missing values in any variable. There exist techniques to fill in (impute) missing values, but these are more involved and not treated here.
As a side note, with a careful inspection of missing values in ozone readings, the Antarctic
“ozone hole” would have been discovered more than one decade earlier (see, e.g., en.wikipedia.
org/wiki/Ozone_depletion#Research_history).
Informally a statistic is a single measure of some attribute of the data, in the context of
this chapter a statistic gives a good first impression of the distribution of the data. Typical
statistics for the location parameter include the (empirical) mean, truncated/trimmed mean,
median, quantiles and quartiles. The trimmed mean omits a fraction of the smallest and largest
values. A trimming of 50% is equivalent to the (empirical) median. Quantiles or more specifically
percentiles link observations or values with the position in the ordered data. The median is the
50th-percentile, half the data is smaller than the median, the other half is larger. The 25th
and 75th-percentile are also called the lower and upper quartiles, i.e., the quartiles divide the
data in four equally sized groups. Depending on the number of observations at hand, arbitrary quantiles are not precisely defined. In such cases, a linearly interpolated value is used, for which the precise interpolation weights depend on the software at hand. It is important to be aware of this potential ambiguity; it is less important to know the exact values of the weights.
Typical statistics for the scale parameter include the variance, the standard deviation (square root of the variance), the interquartile range (third quartile minus first quartile) and the coefficient of variation (standard deviation divided by the mean). Note that the coefficient of variation is dimensionless and should be used only with ratio scaled data.
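The following lines sketch these statistics in R (assuming the vector Hg from R-Code 1.1); the type argument of quantile() illustrates the software-dependent interpolation mentioned above.

mean( Hg, trim=0.1)        # trimmed mean, omitting 10% of the smallest and largest values
quantile( Hg, probs=c(0.25, 0.5, 0.75))          # default interpolation (type=7)
quantile( Hg, probs=c(0.25, 0.5, 0.75), type=6)  # another convention, slightly different values
IQR( Hg)                   # interquartile range
sd( Hg)/mean( Hg)          # coefficient of variation (only for ratio scaled data)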
We often denote data with x1 , . . . , xn , with n denoting the data size. The ordered data
(smallest to largest) is denoted with x(1) ≤ · · · ≤ x(n) . Hence, we use the following classical
notation:
empirical mean (average): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$,  (1.1)

empirical median: $x_{(n/2+1/2)}$ if $n$ odd, $\frac{1}{2}\big(x_{(n/2)} + x_{(n/2+1)}\big)$ if $n$ even,  (1.2)

empirical variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$,  (1.3)

empirical standard deviation: $s = \sqrt{s^2}$.  (1.4)
Example 1.3. In R-Code 1.3 several summary statistics for the 293 observations of mercury content in Lake Geneva sediments are calculated (data at www.math.uzh.ch/furrer/download/sta120/lemanHg.csv, see also R-Code 1.1). ♣
R-Code 1.3 A quantitative EDA of the mercury dataset (subset of the ‘leman’ dataset).
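A plausible minimal sketch in the spirit of R-Code 1.3, assuming the vector Hg from R-Code 1.1:

summary( Hg)                          # minimum, quartiles, median, mean, maximum
c( mean=mean( Hg), median=median( Hg), var=var( Hg), sd=sd( Hg))
quantile( Hg, probs=c(0.025, 0.975))  # selected quantiles
length( Hg)                           # sample size (293 observations)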
For discrete data, the mode is the most frequent value of an empirical frequency distribution; in order to calculate the mode, only the operations {=, ≠} are necessary. Continuous data are first divided into categories (discretized/binned) and then the mode can be determined.
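As a small sketch, the mode of discrete data can be computed with a frequency table (the data vector is purely illustrative):

x <- c(2, 3, 3, 5, 3, 2, 6)                 # small illustrative sample
tab <- table( x)                            # empirical frequency distribution
as.numeric( names( tab)[which.max( tab)])   # value with the highest frequency, i.e., the mode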
Another important aspect is the identification of outliers, which are defined (verbatim from Olea, 1991): “In a sample, any of the few observations that are separated so far in value from the remaining measurements that the questions arise whether they belong to a different population, or that the sampling technique is faulty. The determination that an observation is an outlier may be highly subjective, as there is no strict criteria for deciding what is and what is not an outlier”. Graphical representations of the data often help in the identification of outliers.
Figure 1.3: Bar plots: juxtaposed bars (left), stacked (right) of CO2 emissions ac-
cording to different sources taken from SWISS Magazine (2011). (See R-Code 1.4.)
Example 1.4. R-Code 1.4 and Figure 1.3 illustrate bar plots with data giving aggregated CO2
emissions from different sources (transportation, electricity production, deforestation, . . . ) in
the year 2005. Note that the numbers vary considerably according to different sources, mainly
due to the political and interest factors associated with these numbers. ♣
R-Code 1.4 Emissions sources for the year 2005 as presented by the SWISS Magazine
10/2011-01/2012, page 107 (SWISS Magazine, 2011). (See Figure 1.3.)
dat <- c(2, 15, 16, 32, 25, 10) # see Figure 1.9
emissionsource <- c('Air', 'Transp', 'Manufac', 'Electr', 'Deforest', 'Other')
barplot( dat, names=emissionsource, ylab="Percent", las=2)
barplot( cbind('2005'=dat), col=c(2,3,4,5,6,7), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', xlim=c(0.2,4))
Do not use pie charts unless absolutely necessary. Pie charts are often difficult to read. When
slices are similar in size it is nearly impossible to distinguish which is larger. Barplots allow an
easier comparison.
Histograms illustrate the frequency distribution of observations graphically and are easy to
construct and to interpret. Histograms allow one to quickly assess whether the data is symmetric
or rather left- or right-skewed, whether the data has rather one mode or several or whether ex-
ceptional values are present. Important statistics like mean and median can be added. However,
the number of bins (categories to break the data down into) is a subjective choice that affects
the look of the histogram and several valid rules of thumb exist for choosing the optimal num-
ber of bins. R-Code 1.5 and the associated Figure 1.4 illustrate the construction and resulting
histograms of the mercury dataset. In one of the histograms, a “smoothed density” has been
superimposed. Such curves will be helpful when comparing the data with different statistical
models, as we will see in later chapters. The histograms show that the data is unimodal and right-skewed, with no exceptional values.
R-Code 1.5 Different histograms (good and bad ones) for the mercury dataset. (See
Figure 1.4.)
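A sketch in the spirit of R-Code 1.5, assuming the vector Hg from R-Code 1.1:

hist( Hg)                              # default binning
hist( Hg, probability=TRUE, main="")   # density scale instead of counts
lines( density( Hg), col=2)            # superimpose a smoothed density
abline( v=c( mean( Hg), median( Hg)), col=c(3, 4))   # add mean and median
hist( Hg, breaks=100)                  # far too many bins: a 'bad' histogram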
Figure 1.4: Histograms with various bin sizes. (See R-Code 1.5.)
When constructing histograms for discrete data (e.g., integer values), one has to be careful
with the binning. Often it is better to manually specify the bins. To represent the result of many
dice tosses, it would be advisable to use hist( x, breaks=seq( from=0.5, to=6.5, by=1)),
or possibly use a bar plot as explained above. A stem-and-leaf plot is similar to a histogram; however, this plot is rarely used today. Figure 1.5 gives an example.
A quantile-quantile plot (Q-Q plot) is used to visually compare empirical data quantiles with
the quantiles of a theoretical distribution (we will talk more about “theoretical distributions” in
the next chapter). The ordered values are compared with the i/(n + 1)-quantiles. In practice,
some software use (i − a)/(n + 1 − 2a), for a specific a ∈ [0, 1]. R-Code 1.7 and Figure 1.7
illustrate a Q-Q plot for the mercury dataset by comparing it to a normal distribution and a so-
called chi-squared distribution. In the case of a good fit, the points are aligned almost on a straight line. To “guide the eye”, one often adds such a line to the plots.
R-Code 1.6 Box plot. Notice that the function boxplot has several arguments for tailoring
the appearance of the box plots. These are discussed in the function’s help file. (See Figure 1.6.)
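A sketch in the spirit of R-Code 1.6; the violin plot requires an add-on package, here assumed to be vioplot:

boxplot( Hg, notch=TRUE, ylab="Hg")    # notched box plot of the mercury data
# With the add-on package 'vioplot' (assumed installed), a violin plot would be:
# vioplot::vioplot( Hg)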
Figure 1.6: Box plots (notched version) and violin plot. (See R-Code 1.6.)
Remark 1.1. There are several fundamentally different approaches to creating plots in R: base
graphics (package graphics, which is automatically loaded upon startup), trellis graphics (pack-
ages lattice and latticeExtra), and the grammar of graphics approach (package ggplot2).
We focus on base graphics. This approach is in sync with the R source code style and gives us clear, direct handling of all elements. ggplot functionality may produce seemingly fancier graphics at the price of certain black-box elements. ♣
R-Code 1.7 Q-Q plot of the mercury dataset. (See Figure 1.7.)
qqnorm( Hg)
qqline( Hg, col=2, main='')
qqplot( qchisq( ppoints( 293), df=5), Hg, xlab="Theoretical quantiles")
# For 'chisq' some a priori knowledge was used, for 'df=5' minimal
# trial and error was used.
qqline( Hg, distribution=function(p) qchisq( p, df=5), col=2)
Figure 1.7: Q-Q plots using the normal distribution (left) and a so-called chi-squared distribution with five degrees of freedom (right). The red line passes through the lower and upper quantiles of both the empirical and theoretical distribution. (See R-Code 1.7.)
In a scatter plot, “guide-the-eye” lines are often included. In such situations, some care is needed as there is a perception of asymmetry between y versus x and x versus y. We will discuss this further in Chapter 8.
In the case of several frequency distributions, bar plots, either stacked or grouped, may also
be used in an intuitive way. See R-Code 1.8 and Figure 1.8 for two slightly different partitions
of emission sources.
R-Code 1.8 Emissions sources for the year 2005 from www.c2es.org/facts-
figures/international-emissions/sector (approximate values). (See Figure 1.8.)
dat2 <- c(2, 10, 12, 28, 26, 22) # source c2es.org
mat <- cbind( SWISS=dat, c2es.org=dat2)
barplot(mat, col=c(2,3,4,5,6,7), xlim=c(0.2,5), legend=emissionsource,
args.legend=list(bty='n') ,ylab='Percent', las=2)
barplot(mat, col=c(2,3,4,5,6,7), xlim=c(1,30), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', beside=TRUE, las=2)
Figure 1.8: Bar plots for two variables: stacked (left), grouped (right). (See R-
Code 1.8.)
Example 1.5. The iris dataset is a classic teaching dataset. It gives measurements (in cen-
timeters) for the variables sepal length and width and petal length and width for 50 flowers from
each of three species of iris (Iris setosa L., Iris versicolor L., and Iris virginica L.; see Figure 1.9).
In R it is part of the package datasets and thus automatically available.
We use different types of graphics to represent the data (Figure 1.10 and R-Code 1.9). ♣
Figure 1.9: Photos of the three species of iris (setosa, versicolor and virginica). The
images are taken from en.wikipedia.org/wiki/Iris_flower_data_set.
R-Code 1.9 Constructing histograms, box plots, violin plots and scatter plot with iris
data. (See Figure 1.10.)
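A sketch in the spirit of R-Code 1.9 with base graphics (violin plots are omitted, as they would require an add-on package):

hist( iris$Petal.Length, main="", xlab="Petal.Length")  # histogram of one variable
boxplot( iris[, 1:4], las=2)                            # box plots of the four measurements
boxplot( Petal.Length ~ Species, data=iris, las=2)      # petal length by species
pairs( iris[, 1:4], col=as.numeric( iris$Species))      # simple matrix of scatter plots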
Figure 1.10: Top, left to right: histogram of the variable petal length, box plots for several variables, and violin plots of petal length from the iris dataset. Bottom: simple matrix of scatter plots. (See R-Code 1.9.)
Parallel coordinate plots are a popular way of representing several observations in high di-
mensions (i.e., many variables). Each variable (scaled to zero-one) is recorded along a vertical
axis. The values of each observation are then connected with a line across the various variables.
That means that points in the usual (Euclidean) representation correspond to lines in a paral-
lel coordinate plot. All interval scaled variables are normalized to [0, 1]. Additionally, nominal
variables may also be depicted.
Example 1.6. The dataset swiss (provided by the package datasets) contains 47 observa-
tions on 6 variables (standardized fertility measure and socio-economic indicators) for each of 47
French-speaking provinces of Switzerland at about 1888.
R-Code 1.10 and Figure 1.11 give an example of a parallel coordinate plot. Groups can be
quickly detected and strong associations are spotted directly.
As a side note, the raw data is available at https://opr.princeton.edu/archive/Download.
aspx?FileID=1113 and documentation at https://opr.princeton.edu/archive/Download.aspx?FileID=
1116, see also https://opr.princeton.edu/archive/pefp/switz.aspx. It would be a fun task to ex-
tract the corresponding data not only for the French-speaking provinces but also for entire
Switzerland. ♣
R-Code 1.10 Parallel coordinate plot for the swiss dataset. (See Figure 1.11.)
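One possible sketch in the spirit of R-Code 1.10 uses parcoord() from the package MASS; the coloring by the variable Catholic is an arbitrary illustrative choice.

require( MASS)                                   # provides parcoord()
parcoord( swiss, col=1 + (swiss$Catholic > 50))  # color predominantly catholic provinces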
The open source visualization program ggobi may be used to explore high-dimensional data
(Swayne et al., 2003). It provides highly dynamic and interactive graphics such as tours, as well
as familiar graphics such as scatter plots, bar charts and parallel coordinates plots. All plots
Figure 1.11: Parallel coordinate plot of the swiss dataset. (See R-Code 1.10.)
are interactive and linked with brushing and identification. The package rggobi provides a link
to R. Figure 1.12 gives a screenshot of the 2D Tour, a projection pursuit visualization which
involves finding the most “interesting” possible projections of multidimensional data (Friedman
and Tukey, 1974). Such a projection should highlight interesting features of the data.
Figure 1.12: GGobi screenshot based on the state.x77 data with Alaska marked.
R-Code 1.11 GGobi example based on state.x77. (See Figure 1.12 for a screenshot)
require( rggobi)
ggobi( state.x77)
Figure 1.13: Bad example (above) and improved but still not ideal graphic (below).
Figures from university documents.
Many consider John Tukey to be the founder and promoter of exploratory data analysis.
Thus his EDA book (Tukey, 1977) is often seen as the (first) authoritative text on the subject.
In a series of books, Tufte rigorously yet vividly explains all relevant elements of visualization
and displaying information (Tufte, 1983, 1990, 1997b,a). Many university programs offer lectures
on information visualization or similar topics. The lecture by Ross Ihaka is worth mentioning:
Figure 1.15: Examples of bad graphs in scientific journals. The figure is taken
from www.biostat.wisc.edu/˜kbroman/topten_worstgraphs/. The website discusses
the problems with each graph and possible improvements (‘[Discussion]’ links).
www.stat.auckland.ac.nz/~ihaka/120/lectures.html.
In a lengthy article, Friendly and Denis (2001) give an extensive historical overview of the evolution of cartography, graphics and visualization. The pdf has active links for virtually end-
less browsing: euclid.psych.yorku.ca/SCS/Gallery/milestone/milestone.pdf. See also the applet
at www.datavis.ca/milestones/.
There are also many interesting videos available illustrating good and not-so-good graphics.
For example, www.youtube.com/watch?v=ajUcjR0ma4c.
i) (Datasets). R has many built-in datasets, one example is volcano. What’s the name of
the Volcano? Describe the dataset in a few words.
ii) (Help, plotting). Use the R help function to get information on how to use the image()
function for plotting matrices. Display the volcano data.
iii) (loading packages). Install the package fields. Display the volcano data with the function
image.plot().
iv) (demo, 3D plotting). Use the R help function to find out the purpose of the function demo() and have a look at the list of available demos. The demo persp utilizes the volcano
data to illustrate basic three-dimensional plotting. Call persp and have a look at the plots.
What is the maximum height of the volcano depicted?
Problem 1.2 (EDA of multivariate data) In this problem we want to explore a classical dataset.
Load the mtcars dataset. Perform an EDA of the dataset mtcars and provide at least three
meaningful plots (as part of the EDA) and a short description of what they display.
i) Construct a boxplot and a Q-Q plot of the moose and wolf data. Give a short interpretation.
ii) Jointly visualize the wolves and moose data, as well as their abundances over the years.
Give a short interpretation of what you see in the figures. (Of course you may compare
the result with what is given on the aforementioned web page).
Problem 1.5 (parallel coordinate plot) Construct a parallel coordinate plot using the built-in
dataset state.x77. In the left and right margins, annotate the states. Give a few interpretations
that can be derived from the plot.
Problem 1.6 (Feature detection) The synthetic dataset whatfeature, available at www.math.
uzh.ch/furrer/download/sta120/whatfeature.RData has a hidden feature. Try to find it using
projection pursuit in rggobi and notice how difficult it is to find structures in small, low-
dimensional datasets.
Chapter 2
Random Variables
Pass from a cdf to a quantile function, pdf or pmf and vice versa
Schematically sketch, plot in R and interpret pdf, pmf, cdf, quantile function,
Q-Q plot
Give the definition and intuition of an expected value (E), variance (Var),
know the basic properties of E, Var used for calculation
(*) Given the formula, calculate the cdf/pdf of transformed random variables
Probability theory is the prime tool of all statistical modeling. Hence we need a minimal understanding of the theory of probability in order to properly understand statistical models, their interpretation, etc. This chapter may seem quite dense, but several results are given for reference only.
ii) P(Ω) = 1,

iii) $P\big(\bigcup_i A_i\big) = \sum_i P(A_i)$, for $A_i \cap A_j = \emptyset$, $i \neq j$.

In the last sum we only specify the index without indicating start and end, which means we sum over all, say $\sum_{i=1}^{n}$, where n may be finite or infinite (similarly for the union).
Informally, a probability function P assigns a value in [0, 1], i.e., the probability, to each event of the sample space, subject to the constraints:
iii) the probability of several events is equal to the sum of the individual probabilities, if the
events are mutually exclusive.
Probabilities are often visualized with Venn diagrams (Figure 2.1), which clearly and intu-
itively illustrate more complex facts, such as:
The last statement can be written for an arbitrary number of events $B_i$ with $B_i \cap B_j = \emptyset$, $i \neq j$, and $\bigcup_i B_i = \Omega$, yielding $P(A) = \sum_i P(A \mid B_i)\, P(B_i)$.
[Figure 2.1: Venn diagram of the events A, B and C in the sample space Ω.]
We consider a random variable as a function that assigns values to the outcomes (events) of a random experiment; these values, or values in an interval, are assumed with certain probabilities. The outcomes of the experiment, i.e., the values, are called realizations of the random variable.
The following definition defines a random variable and gives a (unique) characterization
of random variables. In subsequent sections, we will see additional characterizations. These,
however, will depend on the type of values the random variable takes.
Definition 2.1. A random variable X is a function from the sample space Ω to R and represents a
possible numerical outcome of an experiment. The distribution function (cumulative distribution
function, cdf) of a random variable X is $F_X(x) = P(X \le x)$, for all $x \in \mathbb{R}$.
Random variables are denoted with uppercase letters (e.g. X, Y ), while realizations are
denoted by the corresponding lowercase letters (x, y). This means that the theoretical concept,
or the random variable as a function, is denoted by uppercase letters. Actual values or data, for
example the columns in your dataset, would be denoted with lowercase letters.
Example 2.1. Let X be the sum of the roll of two dice. The random variable X assumes
the values 2, 3, . . . , 12. The right panel of Figure 2.2 illustrates the distribution function. The
distribution function (as for all discrete random variables) is piece-wise constant with jumps
equal to the probability of that value. ♣
Example 2.2. A boy practices free throws, i.e., foul shots to the basket standing at a distance
of 15 ft to the board. Let the random variable X be the number of throws that are necessary
until the boy succeeds. Theoretically, there is no upper bound on this number. Hence X can
take the values 1, 2, . . . . ♣
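If each throw succeeds independently with the same probability p (an assumption; p is not specified in the example), X follows a shifted geometric distribution. A small simulation sketch:

p <- 0.3                        # assumed success probability, for illustration only
x <- rgeom( 1000, prob=p) + 1   # rgeom() counts failures before the first success, hence +1
table( x)[1:8]                  # empirical frequencies of 1, 2, ... throws until success
mean( x)                        # close to the theoretical expectation 1/p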
Another way of describing discrete random variables is the probability mass function, defined
as follows.
Definition 2.2. The probability mass function (pmf) of a discrete random variable X is defined
by fX (x) = P(X = x). ♦
In other words, the pmf gives the probability that the random variable takes one single value, whereas, as seen, the cdf gives the probability that the random variable takes that or any smaller value.
Property 2.2. Let X be a discrete random variable with probability mass function fX (x) and
cumulative distribution function FX (x). Then:
The two points iii) and iv) show that there is a one-to-one relation (also called a bijection)
between the cumulative distribution function and probability mass function. Given one, we can
construct the other.
Figure 2.2 illustrates the probability mass function and cumulative distribution function of
the random variable X as given in Example 2.1. The jump locations and sizes (discontinuities)
of the cdf correspond to probabilities given in the left panel. Notice that we have emphasized
the right continuity of the cdf (see Proposition 2.1.ii)) with the additional dot.
x <- 2:12
p <- c(1:6, 5:1)/36
plot( x, p, type='h', ylim=c(0, .2),
xlab=expression(x[i]), ylab=expression(p[i]==f[X](x[i])))
points( x, p, pch = 19)
plot.ecdf( outer(1:6, 1:6, "+"), ylab=expression(F[X](x)), main='')
Figure 2.2: Probability mass function (left) and cumulative distribution function
(right) of X = “the sum of the roll of two dice”. (See R-Code 2.1.)
A random experiment with exactly two possible outcomes (for example: heads/tails, male/female, success/failure) is called a Bernoulli trial. For simplicity, we code the sample space with ‘1’ (success) and ‘0’ (failure). The probability mass function is determined by a single probability: P(X = 1) = p and P(X = 0) = 1 − p, for 0 ≤ p ≤ 1.

A random variable X with probability mass function

$P(X = k) = \frac{\lambda^k}{k!} \exp(-\lambda)$, $\lambda > 0$, $k = 0, 1, \ldots$,  (2.8)

is said to follow a Poisson distribution with parameter λ, denoted by X ∼ Pois(λ). ♦
The Poisson distribution is also a good approximation for the binomial distribution with large
n and small p (as a rule of thumb if n > 20 and np < 10).
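The quality of this approximation can be inspected numerically; a brief sketch with an arbitrary choice of n and p satisfying the rule of thumb:

n <- 100; p <- 0.05             # n > 20 and np = 5 < 10
k <- 0:15
round( cbind( binom=dbinom( k, size=n, prob=p), pois=dpois( k, lambda=n*p)), 3)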
Definition 2.4. The probability density function (density function, pdf) fX (x), or density for
short, of a continuous random variable X is defined by
$P(a < X \le b) = \int_a^b f_X(x)\, dx$, $a < b$.  (2.9)
Property 2.3. Let X be a continuous random variable with density function fX (x) and distri-
bution function FX (x). Then:
i) The density function satisfies fX (x) ≥ 0 for all x ∈ R and fX (x) is continuous almost
everywhere.
ii) $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$.

iii) $F_X(x) = \int_{-\infty}^{x} f_X(y)\, dy$.

iv) $f_X(x) = F_X'(x) = \frac{d}{dx} F_X(x)$.

v) The cumulative distribution function FX (x) is continuous everywhere.
vi) P(X = x) = 0.
As given by Property 2.3.iii) and iv), there is again a bijection between the density function
and the cumulative distribution function: if we know one we can construct the other. Actually,
there is a third characterization of random variables, called the quantile function, which is es-
sentially the inverse of the cdf. That means, we are interested in values x for which FX (x) = p.
Definition 2.5. The quantile function QX (p) of a random variable X with (strictly) monotone cumulative distribution function FX (x) is defined by $Q_X(p) = F_X^{-1}(p)$, $0 < p < 1$, i.e., the quantile function is equivalent to the inverse of the distribution function. ♦
The quantile function can be used to define the theoretical counterpart of the empirical quartiles of Chapter 1, as illustrated next.
Definition 2.6. The median ν of a continuous random variable X with cumulative distribution
function FX (x) is defined by ν = QX (1/2). ♦
Remark 2.1. For discrete random variables the cdf is not continuous (see the plateaus in the right panel of Figure 2.2) and the inverse does not exist. The quantile function then returns the minimum value of x from amongst all those values with $p \le P(X \le x) = F_X(x)$; more formally, $Q_X(p) = \min\{x : p \le F_X(x)\}$.
Example 2.3. The continuous uniform distribution U(a, b) is defined by a constant density function over the interval [a, b], a < b, i.e.,

$f(x) = \begin{cases} \frac{1}{b-a}, & \text{if } a \le x \le b,\\ 0, & \text{otherwise.} \end{cases}$
The quantile function is QX (p) = a + p(b − a) for 0 < p < 1. Figure 2.3 shows the density and
cumulative distribution function of the uniform distribution U(0, 1). ♣
R-Code 2.2 Density and distribution function of a uniform distribution. (See Figure 2.3.)
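A minimal sketch in the spirit of R-Code 2.2:

x <- seq( -1, 2, length=500)
plot( x, dunif( x), type='l', ylab='density')       # density of U(0,1)
plot( x, punif( x), type='l', ylab='distribution')  # cumulative distribution function of U(0,1)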
Figure 2.3: Density and distribution function of the uniform distribution U(0, 1). (See
R-Code 2.2.)
Many other “summary” values reduce to the calculation of a particular expectation. The following
property states how to calculate the expectation of a function of the random variable X, which
is in turn used to summarize the spread of X.
Property 2.4. For an “arbitrary” real function g we have:

$E\big(g(X)\big) = \begin{cases} \sum_i g(x_i)\, P(X = x_i), & \text{if } X \text{ discrete},\\ \int_{\mathbb{R}} g(x)\, f_X(x)\, dx, & \text{if } X \text{ continuous.} \end{cases}$
$\operatorname{Var}(X) = E\big((X - E(X))^2\big)$  (2.14)

and is also denoted as the centered second moment, in contrast to the second moment E(X²). ♦
The expectation is “linked” to the average (or empirical mean, mean) if we have a set of
realizations thought to be from the particular random variable. Similarly, the variance, the
expectation of the squared deviation from its expected value is “linked” to the empirical variance
(var). This link will be formalized in later chapters.
E(X) = 0 · (1 − p) + 1 · p = p,  (2.15)
Var(X) = (0 − p)² · (1 − p) + (1 − p)² · p = p(1 − p).  (2.16)

ii) The expectation and variance of a Poisson random variable are E(X) = λ and Var(X) = λ (see Problem 1.i)).
Property 2.5. For random variables X and Y , regardless of whether discrete or continuous,
and for a and b given constants, we have
i) $\operatorname{Var}(X) = E(X^2) - \big(E(X)\big)^2$;

The second to last property seems somewhat surprising. But starting from the definition of the variance, one quickly realizes that the variance is not a linear operator:

$\operatorname{Var}(a + bX) = E\big((a + bX - E(a + bX))^2\big) = E\big((a + bX - (a + b\,E(X)))^2\big)$,  (2.18)

followed by a factorization of b².
Example 2.5. We consider again the setting of Example 2.1, and straightforward calculation
shows that
$E(X) = \sum_{i=2}^{12} i\, P(X = i) = 7$, by equation (2.12),  (2.19)

$E(X) = 2 \sum_{i=1}^{6} i \cdot \frac{1}{6} = 2 \cdot \frac{7}{2}$, by using Property 2.5.iv) first.  (2.20)
♣
The definition also implies that the joint density and joint cumulative distribution is simply
the product of the individual ones, also called marginal ones.
We will often use many independent random variables with a common distribution function.
The iid assumption is very crucial and relaxing the assumptions to allow, for example, de-
pendence between the random variables, has severe implications on the statistical modeling.
Independence also implies a simple formula for the variance of the sum of two or many random
variables.
The latter two properties will be used when we investigate statistical properties of the sample mean, i.e., linking the empirical mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ with the random sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$.
Example 2.6. For X ∼ Bin(n, p), we have E(X) = np and Var(X) = np(1 − p).
Definition 2.11. The random variable X is said to be normally distributed if the cumulative
distribution function is given by
$F_X(x) = \int_{-\infty}^{x} f_X(y)\, dy$  (2.24)

with density function

$f(x) = f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{1}{2} \cdot \frac{(x-\mu)^2}{\sigma^2}\Big)$,  (2.25)
While the exact form of the density (2.25) is not important, a certain recognition factor will be very useful. In particular, for a standard normal random variable, the density is proportional to exp(−z²/2).
The following property is essential and will be consistently used throughout the work. We
justify the first one later in this chapter. The second one is a result of the particular form of the
density.
Property 2.7. i) Let X ∼ N(µ, σ²); then $\frac{X - \mu}{\sigma} \sim N(0, 1)$. Conversely, if Z ∼ N(0, 1), then σZ + µ ∼ N(µ, σ²), σ > 0.
ii) Let X1 ∼ N(µ1, σ1²) and X2 ∼ N(µ2, σ2²) be independent and a and b arbitrary; then aX1 + bX2 ∼ N(aµ1 + bµ2, a²σ1² + b²σ2²).
The cumulative distribution function Φ has no closed form and the corresponding probabilities
must be determined numerically. In the past, so-called “standard tables” were often used and
included in statistics books. Table 2.1 gives an excerpt of such a table. Now even “simple”
pocket calculators have the corresponding functions to calculate the probabilities. It is probably
worthwhile to remember 84% = Φ(1), 98% = Φ(2), 100% ≈ Φ(3), as well as 95% = Φ(1.64) and
97.5% = Φ(1.96). Relevant quantiles have been illustrated in Figure 2.4 for a standard normal
random variable. For arbitrary normal density, the density scales linearly with the standard
deviation.
[Figure panels: P(Z < 0) = 50.0%, P(Z < 1) = 84.1%, P(Z < 2) = 97.7%; P(−1 < Z < 1) = 68.3%, P(−2 < Z < 2) = 95.4%, P(−3 < Z < 3) = 99.7%.]
Figure 2.4: Different probabilities for some quantiles of the standard normal distribu-
tion.
Table 2.1: Probabilities of the standard normal distribution. The table gives the value
of Φ(zp ) for selected values of zp . For example, Φ(0.2 + 0.04) = 0.595.
zp      0.0    0.1    0.2    0.3    0.4    ...    1      ...    1.6    1.7    1.8    1.9    2      ...    3
+0.00   0.500  0.540  0.579  0.618  0.655         0.841         0.945  0.955  0.964  0.971  0.977         0.999
+0.02   0.508  0.548  0.587  0.626  0.663         0.846         0.947  0.957  0.966  0.973  0.978         0.999
+0.04   0.516  0.556  0.595  0.633  0.670         0.851         0.949  0.959  0.967  0.974  0.979         ...
+0.06   0.524  0.564  0.603  0.641  0.677         0.855         0.952  0.961  0.969  0.975  0.980
+0.08   0.532  0.571  0.610  0.648  0.684         0.860         0.954  0.962  0.970  0.976  0.981
R-Code 2.3 Calculation of the “z-table” (see Table 2.1) and density, distribution, and
quantile functions of the standard normal distribution. (See Figure 2.5.)
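A sketch in the spirit of R-Code 2.3, computing the entries of Table 2.1 and the three functions shown in Figure 2.5:

zp <- c( seq(0, 0.4, by=0.1), 1, seq(1.6, 2, by=0.1), 3)   # column values of Table 2.1
round( outer( seq(0, 0.08, by=0.02), zp, function(a, z) pnorm( z + a)), 3)
curve( dnorm( x), -3, 3)          # density of the standard normal
curve( pnorm( x), -3, 3)          # distribution function
curve( qnorm( x), 0.01, 0.99)     # quantile function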
Property 2.8. (Central Limit Theorem (CLT), classical version) Let X1, X2, X3, . . . be an infinite sequence of iid random variables with E(Xi) = µ and Var(Xi) = σ². Then

$\lim_{n\to\infty} P\Big(\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \le z\Big) = \Phi(z)$,  (2.26)

where we kept the subscript n for the sample mean to emphasize its dependence on n.
The proof of the CLT is a typical exercise in a probability theory lecture. Many extensions
of the CLT exist, for example, the independence assumptions can be relaxed.
Using the central limit theorem argument, we can show that the distribution of a binomial random variable X ∼ Bin(n, p) converges to that of a normal random variable as n → ∞. Thus, the distribution of a normal random variable N(np, np(1 − p)) can be used as an approximation for the binomial distribution Bin(n, p). For the approximation, n should be larger than 30 for p ≈ 0.5. For p closer to 0 or 1, n needs to be much larger.
Example 2.8. Let X ∼ Bin(30, 0.5). Then P(X ≤ 10) = 0.049, “exactly”. However,
$P(X \le 10) \approx P\Big(\frac{X - np}{\sqrt{np(1-p)}} \le \frac{10 - np}{\sqrt{np(1-p)}}\Big) = \Phi\Big(\frac{10 - 15}{\sqrt{30/4}}\Big) = 0.034$,  (2.27)

$P(X \le 10) \approx P\Big(\frac{X + 0.5 - np}{\sqrt{np(1-p)}} \le \frac{10 + 0.5 - np}{\sqrt{np(1-p)}}\Big) = \Phi\Big(\frac{10.5 - 15}{\sqrt{30/4}}\Big) = 0.05$.  (2.28)

The second approximation uses a continuity correction of 0.5 and is closer to the exact value.
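These numbers can be verified directly in R; a short sketch:

pbinom( 10, size=30, prob=0.5)        # 'exact' value, approximately 0.049
pnorm( (10 - 15)/sqrt( 30/4))         # approximation without continuity correction
pnorm( (10.5 - 15)/sqrt( 30/4))       # approximation with continuity correction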
Another very important law is the law of large numbers (LLN), which essentially states that for X1, . . . , Xn iid with E(Xi) = µ, the average $\bar{X}_n$ converges to µ. We have deliberately used the somewhat ambiguous “convergence” statement; a more rigorous statement is technically a bit more involved. We will use the LLN in the next chapter, when we try to infer parameter values from data, i.e., say something about µ when we observe x1, . . . , xn.
Remark 2.2. There are actually two forms of the LLN, the strong and the weak formulation. We do not need the precise formulations later and thus simply state them here for the sole reason of stating them. For every ε > 0,

weak LLN: $\lim_{n\to\infty} P\big(|\bar{X}_n - \mu| > \varepsilon\big) = 0$,  (2.29)

strong LLN: $P\big(\lim_{n\to\infty} \bar{X}_n = \mu\big) = 1$.  (2.30)

The differences between both formulations are subtle. The weak version states that the average is close to the mean but excursions (for specific n) beyond µ ± ε can happen arbitrarily often. The strong version states that there exists a large n such that the average is always within µ ± ε.

The two forms represent fundamentally different notions of convergence of random variables: (2.30) is almost sure convergence, (2.29) is convergence in probability. The CLT represents convergence in distribution. ♣
To derive the probability mass function we apply Property 2.2.iv). In the more interesting setting
of continuous random variables, the density function is derived by Property 2.3.iv) and is thus
$f_Y(y) = \frac{d}{dy} g^{-1}(y)\; f_X\big(g^{-1}(y)\big)$.  (2.33)
Example 2.9. Let X be a random variable with cdf FX(x) and pdf fX(x). We consider Y = a + bX, for b > 0 and a arbitrary. Hence, g(·) is a linear function and its inverse g⁻¹(y) = (y − a)/b is monotonically increasing. The cdf of Y is thus $F_X\big((y - a)/b\big)$ and the pdf is $f_X\big((y - a)/b\big) \cdot 1/b$. This fact has already been stated in Property 2.7 for Gaussian random variables. ♣
Example 2.10. Let X ∼ U(0, 1) and, for 0 < x < 1, set g(x) = − log(1 − x), thus g⁻¹(y) = 1 − exp(−y). Then the distribution and density function of Y = g(X) are $F_Y(y) = 1 - \exp(-y)$ and $f_Y(y) = \exp(-y)$, for y > 0. This random variable is called the exponential random variable (with rate parameter one). Notice further that g(x) is the quantile function of this random variable. ♣
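Example 2.10 is the basis of so-called inversion sampling; a small sketch comparing the transformed uniforms with direct exponential draws:

set.seed( 1)
u <- runif( 1000)
y <- -log( 1 - u)                  # transformed uniform random variables
qqplot( y, rexp( 1000, rate=1))    # points should lie close to the diagonal
abline( 0, 1, col=2)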
As we are often interested in summarizing a random variable by its mean and variance, we
have a very convenient short-cut.
The expectation and the variance of a transformed random variable Y = g(X) can be approximated by the so-called delta method. The idea thereof consists of a Taylor expansion around the expectation E(X):

$g(X) \approx g\big(E(X)\big) + g'\big(E(X)\big)\,\big(X - E(X)\big)$,  (2.36)

which leads to the approximations

$E\big(g(X)\big) \approx g\big(E(X)\big)$,  (2.37)

$\operatorname{Var}\big(g(X)\big) \approx g'\big(E(X)\big)^2 \operatorname{Var}(X)$.  (2.38)

♣
Of course, in the case of a linear transformation (as, e.g., in Example 2.9), equation (2.36) is
an equality and thus relations (2.37) and (2.38) are exact, which is in sync with Property 2.7.
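The quality of the delta method can be checked by simulation; a sketch with the arbitrarily chosen transformation g(x) = exp(x) and X ∼ N(1, 0.1²):

mu <- 1; sigma <- 0.1
x <- rnorm( 100000, mean=mu, sd=sigma)
c( mean( exp( x)), exp( mu))                 # simulated versus approximated expectation
c( var( exp( x)), exp( mu)^2 * sigma^2)      # simulated versus approximated variance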
Of course it is also possible to construct random variables based on an entire random sample, say Y = g(X1, . . . , Xn). Property 2.8 uses exactly such an approach, where g(·) is given by $g(X_1, \ldots, X_n) = \Big(\frac{1}{n}\sum_i X_i - \mu\Big)\Big/\big(\sigma/\sqrt{n}\big)$.
The next section discusses random variables that are essentially derived (obtained as func-
tions) from normal random variables. We will encounter these much later, for example, the t
distribution in Chapter 4 and the F distribution in Chapter 9, as well as some other handy
distributions.
Let Z1, . . . , Zn be iid standard normal random variables. The distribution of the random variable

$\mathcal{X}_n^2 = \sum_{i=1}^{n} Z_i^2$  (2.41)

is called the chi-square distribution (χ²-distribution) with n degrees of freedom. The following applies: $E(\mathcal{X}_n^2) = n$ and $\operatorname{Var}(\mathcal{X}_n^2) = 2n$.
Here and for the next two distributions, we do not give the densities as they are very com-
plex. Similarly, the expectation and the variance here and for the next two distributions are for
reference only.
The chi-square distribution is used in numerous statistical tests that we will see in Chapters 4 and 6.
R-Code 2.4 Chi-square distribution for various degrees of freedom. (See Figure 2.6.)
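A sketch in the spirit of R-Code 2.4, with the degrees of freedom 1, 2, 4, . . . , 64 shown in Figure 2.6:

dfs <- 2^(0:6)                    # degrees of freedom 1, 2, 4, ..., 64
curve( dchisq( x, df=dfs[1]), 0, 50, ylim=c(0, 0.5), ylab="Density")
for (i in 2:length( dfs))
  curve( dchisq( x, df=dfs[i]), add=TRUE, col=i)
legend( "topright", legend=dfs, col=1:length( dfs), lty=1, bty="n")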
Figure 2.6: Densities of the Chi-square distribution for various degrees of freedom.
(See R-Code 2.4.)
Let Z ∼ N(0, 1) and X ∼ $\mathcal{X}_m^2$ be two independent random variables. The distribution of the random variable

$T_m = \frac{Z}{\sqrt{X/m}}$  (2.43)

is called the t-distribution (or Student’s t-distribution) with m degrees of freedom. We have $E(T_m) = 0$, for m > 1, and $\operatorname{Var}(T_m) = m/(m-2)$, for m > 2.
Remark 2.3. For m = 1, 2 the density is heavy-tailed and the variance of the distribution does
not exist. Realizations of this random variable occasionally manifest with extremely large values.
Of course, the empirical variance can still be calculated (see R-Code 2.6). We come back to this
issue in Chapter 5. ♣
R-Code 2.5 t-distribution for various degrees of freedom. (See Figure 2.7.)
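A sketch in the spirit of R-Code 2.5, with the normal density as reference:

dfs <- 2^(0:6)                    # degrees of freedom 1, 2, 4, ..., 64
curve( dnorm( x), -3, 3, ylab="Density")            # normal density in black
for (i in 1:length( dfs))
  curve( dt( x, df=dfs[i]), add=TRUE, col=i+1)
legend( "topright", legend=dfs, col=2:(length( dfs)+1), lty=1, bty="n")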
Figure 2.7: Densities of the t-distribution for various degrees of freedom. The normal distribution is in black. A density with 2⁷ = 128 degrees of freedom would make the normal density function appear thicker. (See R-Code 2.5.)
2.8.3 F -Distribution
The F -distribution is mainly used to compare two empirical variances with each other, as we
will see in Chapter 12.
Let X ∼ $\mathcal{X}_m^2$ and Y ∼ $\mathcal{X}_n^2$ be two independent random variables. The distribution of the random variable

$F_{m,n} = \frac{X/m}{Y/n}$  (2.46)

is called the F-distribution with m and n degrees of freedom. We have

$E(F_{m,n}) = \frac{n}{n-2}$, for n > 2;  (2.47)

$\operatorname{Var}(F_{m,n}) = \frac{2n^2(m+n-2)}{m(n-2)^2(n-4)}$, for n > 4.  (2.48)
That means that if n increases the expectation gets closer to one and the variance to 2/m, with
m fixed.
Figure 2.8 shows the density for various degrees of freedom.
R-Code 2.6 Empirical variances of the t-distribution with one degree of freedom.
set.seed( 14)
tmp <- rt( 1000, df=1)
print( c(summary( tmp), Var=var( tmp)))
## Min. 1st Qu. Median Mean 3rd Qu.
## -1.9093e+02 -1.1227e+00 -2.0028e-02 8.1108e+00 1.0073e+00
## Max. Var
## 5.7265e+03 3.7391e+04
sort( tmp)[1:10] # many "large" values, but 2 exceptionally large
## [1] -190.929 -168.920 -60.603 -53.736 -47.764 -43.377 -36.252
## [8] -31.498 -30.029 -25.596
sort( tmp, decreasing=TRUE)[1:10]
## [1] 5726.531 2083.682 280.848 239.752 137.363 119.157 102.702
## [8] 47.376 37.887 32.443
R-Code 2.7 F -distribution for various degrees of freedom. (See Figure 2.8.)
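A sketch in the spirit of R-Code 2.7, with the degree-of-freedom pairs shown in Figure 2.8:

m <- c(1, 2, 5, 10, 50, 100, 250)
n <- c(1, 50, 10, 50, 50, 300, 250)
curve( df( x, df1=m[1], df2=n[1]), 0.01, 4, ylim=c(0, 3), ylab="Density")
for (i in 2:length( m))
  curve( df( x, df1=m[i], df2=n[i]), add=TRUE, col=i)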
In general, Wikipedia has nice summaries of many distributions. The page https://en.wikipedia.org/wiki/List_of_probability_distributions lists many of them.
The ultimate reference for (univariate) distributions is the encyclopedic series of Johnson
et al. (2005, 1994, 1995). Figure 1 of Leemis and McQueston (2008) illustrates extensively
the links between countless univariate distributions, a simplified version is available at https:
//www.johndcook.com/blog/distribution_chart/.
Figure 2.8: Density of the F -distribution for various degrees of freedom. (See R-
Code 2.7.)
iv) (*) Starting from the pmf of a Binomial random variable, derive the pmf of the Poisson
random variable when n → ∞, p → 0 but λ = np constant.
Problem 2.2 (Exponential Distribution) In this problem you get to know another important distribution you will frequently come across: the exponential distribution. Consider the random variable X with density

$f(x) = \begin{cases} 0, & x < 0,\\ c \cdot \exp(-\lambda x), & x \ge 0, \end{cases}$

with λ > 0. The parameter λ is called the rate. Subsequently, we denote an exponential random variable with X ∼ Exp(λ).
v) Let λ = 2. Calculate:
i) Simulate realizations of X1 , . . . , Xn for n = 10, 100 and 1000. Display the results with
histograms and superimpose the theoretical densities. Further, use the functions density()
to add empirical densities and rug() to visualize the values of the realizations. Give an
interpretation of the empirical density.
ii) Let $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Determine $E(\bar{X})$ and $\operatorname{Var}(\bar{X})$. When is the hypothesis of independence necessary for your calculations? In addition, what happens when n → +∞?

iii) Assume n = 100 and simulate 500 realizations of X1, . . . , Xn. Calculate $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ for each realization and plot a histogram of the averages. Compare it to the histograms from i): Which distribution do you get?
iv) Calculate the (empirical) median for each of the 500 simulations from iii). How many of the 500 medians are bigger than the corresponding averages?
v) Draw a histogram of min(X1 , . . . , Xn ) for the 500 realizations from iii) and compare it to
the theoretical result from Problem 2.vii).
Chapter 3
Estimation
Yi = µ + εi,  i = 1, . . . , n,  (3.1)

where Yi are the observations, µ is an unknown constant and εi are random variables representing measurement error. It is often reasonable to assume E(εi) = 0 with a symmetric density. Here, we even assume the εi to be iid N(0, σ²). Thus, Y1, . . . , Yn are normally distributed with mean µ and variance σ².

As typically both parameters µ and σ² are unknown, we first address the question of how we can determine plausible values for these model parameters from observed data.
Example 3.1. In R-Code 3.1 hemoglobin levels of blood samples from patients with Hb SS and
Hb S/β sickle cell disease are given (Hüsler and Zimmermann, 2010). The data are summarized
in Figure 3.1.
Equation (3.1) can be used as a simple statistical model for both diseases individually, with µ representing the corresponding population mean and εi describing the variability of the individuals around the population mean. A slightly more involved model that links the data of both diseases is

We assume the εi to be iid N(0, σ²). Thus the model states that both diseases have a different mean but the same variability. This assumption pools information from both samples to estimate the variance σ². The parameters of the model are µSS, µSb and (of lesser interest) σ².

Natural questions that arise are: What are plausible values of the population levels? How much do the individual deviations vary?

The questions of whether both levels are comparable or whether the level of Hb SS patients is (statistically) smaller than 10 are of a completely different nature and will be discussed in the next chapter, where we formally discuss statistical tests. ♣
Figure 3.1: Hemoglobin levels of patients with Hb SS and Hb S/β sickle cell disease.
(See R-Code 3.1.)
R-Code 3.1 Hemoglobin levels of patients with sickle cell disease and some summary statistics. (See Figure 3.1.)
HbSS <- c( 7.2, 7.7, 8, 8.1, 8.3, 8.4, 8.4, 8.5, 8.6, 8.7, 9.1,
9.1, 9.1, 9.8, 10.1, 10.3)
HbSb <- c(8.1, 9.2, 10, 10.4, 10.6, 10.9, 11.1, 11.9, 12.0, 12.1)
boxplot( list( HbSS=HbSS, HbSb=HbSb), col=c(3, 4))
qqnorm( HbSS, xlim=c(-2, 2), ylim=c(7, 12), col=3, main='')
qqline( HbSS, col=3)
tmp <- qqnorm( HbSb, plot.it=FALSE)
points( tmp, col=4)
qqline( HbSb, col=4)
c( mean( HbSS), mean( HbSb)) # means for both diseases
## [1] 8.7125 10.6300
var( HbSS) # here and below spread measures
## [1] 0.71317
c( var(HbSb), sum( (HbSb-mean(HbSb))^2)/(length(HbSb)-1) )
## [1] 1.649 1.649
c( sd( HbSS), sqrt( var( HbSS)))
## [1] 0.84449 0.84449
Example 3.2. i) The numerical values shown in R-Code 3.1 are estimates.
ii) Ȳ = (1/n) Σ_{i=1}^n Yi is an estimator.
ȳ = (1/10) Σ_{i=1}^{10} yi = 10.6 is a point estimate.
iii) S² = 1/(n−1) Σ_{i=1}^n (Yi − Ȳ)² is an estimator.
s² = 1/(n−1) Σ_{i=1}^{10} (yi − ȳ)² = 1.65 or s = 0.844 is a point estimate. ♣
Often, we denote parameters with Greek letters (µ, σ, λ, . . . ), with θ being the generic one. The estimator and estimate of a parameter θ are denoted by θ̂. Context makes clear which of the two cases is meant.
In the method of moments, the theoretical moment is written as a function of the parameter, E(Y) = g(θ), and θ̂ solves g(θ̂) = Ȳ.
In linear regression settings, the ordinary least squares method minimizes the sum of squares of the differences between observed responses and those predicted by a linear function of the explanatory variables. Due to the linearity, simple closed-form solutions exist (see Chapters 8ff).
By plugging the observed values of a random sample into the method of moments estimator, the estimates of the corresponding parameters are obtained.
Example 3.3. Let Y1, . . . , Yn ~iid Exp(λ). Since E(Y) = 1/λ, setting Ȳ = 1/λ̂ yields
λ̂ = λ̂_MM = 1/Ȳ.   (3.8)
Thus, the estimate of λ is the value 1/ȳ. ♣
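A minimal sketch of such a calculation with simulated data (the sample size and true rate are arbitrary choices for illustration):
set.seed(1)
y <- rexp(50, rate = 2)   # simulated exponential sample with true lambda = 2
1 / mean(y)               # method of moments estimate, close to 2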
Example 3.4. Let Y1, . . . , Yn ~iid F with expectation µ and variance σ². Since Var(Y) = E(Y²) − E(Y)² (Property 2.5.i)), we can write σ² = µ₂ − µ², with µ₂ = E(Y²) the second moment, and we have the estimator
σ̂²_MM = (1/n) Σ_{i=1}^n Yi² − Ȳ² = (1/n) Σ_{i=1}^n (Yi − Ȳ)².   (3.9)
♣
For a given distribution, we call L(θ) the likelihood function, or simply the likelihood.
Definition 3.2. The maximum likelihood estimator θ̂_ML of the parameter θ is based on maximizing the likelihood, i.e., θ̂_ML = argmax_θ L(θ). ♦
Maximizing the log-likelihood ℓ(θ) = log L(θ) instead of the likelihood is often preferred because the expressions simplify more and maximizing sums is much easier than maximizing products.
Example 3.5. Let Y1, . . . , Yn ~iid Exp(λ), thus
L(λ) = Π_{i=1}^n fY(yi) = Π_{i=1}^n λ exp(−λ yi) = λⁿ exp(−λ Σ_{i=1}^n yi).   (3.13)
Then
dℓ(λ)/dλ = d log(λⁿ exp(−λ Σ_{i=1}^n yi))/dλ = d(n log(λ) − λ Σ_{i=1}^n yi)/dλ = n/λ − Σ_{i=1}^n yi, which we set to zero,   (3.14)
so that
λ̂ = λ̂_ML = n / Σ_{i=1}^n yi = 1/ȳ.   (3.15)
In this case (as in others), λ̂_ML = λ̂_MM. ♣
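As a quick numerical check (a sketch, using the simulated data from above), the log-likelihood of Example 3.5 can also be maximized numerically and compared with the closed-form solution 1/ȳ:
set.seed(1)
y <- rexp(50, rate = 2)
negloglik <- function(lambda) -sum(dexp(y, rate = lambda, log = TRUE))
optimize(negloglik, interval = c(0.01, 20))$minimum   # numerical ML estimate
1 / mean(y)                                           # closed-form ML (and MM) estimate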
In the vast majority of cases, maximum likelihood estimators possess very nice properties. Intuitively, because we use information about the density and not only about the moments, they are "better" than method of moments and least squares estimators. Further, for many common random variables, the likelihood function has a single optimum, in fact a maximum, for all permissible θ.
E(θ̂) = θ,   (3.16)
Example 3.6. Let Y1, . . . , Yn ~iid N(µ, σ²).
i) Ȳ is unbiased for µ, since
E(Ȳ) = E((1/n) Σ_{i=1}^n Yi) = (1/n) · n E(Yi) = µ.   (3.17)
ii) S² = 1/(n−1) Σ_{i=1}^n (Yi − Ȳ)² is unbiased for σ². To show this we use the following two identities,
i.e., we rewrite the square so that the cross-term cancels with the second squared term. Collecting everything finally leads to
(n − 1) S² = Σ_{i=1}^n (Yi − µ)² − n(µ − Ȳ)²   (3.21)
(n − 1) E(S²) = Σ_{i=1}^n Var(Yi) − n E((µ − Ȳ)²)   (3.22)
(n − 1) E(S²) = nσ² − n · σ²/n = (n − 1)σ².   (3.23)
iii) σ̂² = (1/n) Σ_i (Yi − Ȳ)² is biased, since
E(σ̂²) = (1/n) (n − 1) · 1/(n−1) E(Σ_i (Yi − Ȳ)²) = ((n − 1)/n) σ²,   (3.24)
where we used E(S²) = σ². The bias is
E(σ̂²) − σ² = ((n − 1)/n) σ² − σ² = −(1/n) σ²,   (3.25)
which amounts to a slight underestimation of the variance. ♣
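A small simulation makes the bias of σ̂² visible (a sketch; the sample size n = 5 and σ² = 1 are arbitrary choices):
set.seed(1)
n <- 5; R <- 10000
sam <- matrix(rnorm(R * n), nrow = R)        # true sigma^2 = 1
S2     <- apply(sam, 1, var)                 # unbiased estimator S^2
sigma2 <- S2 * (n - 1) / n                   # biased estimator with divisor n
c(mean(S2), mean(sigma2), (n - 1) / n)       # approx 1, approx 0.8, and 0.8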
Example 3.7. Let Y1, . . . , Yn ~iid N(µ, σ²). Using the result (3.17) and Property 2.6.iii), we have
MSE(Ȳ) = bias(Ȳ)² + Var(Ȳ) = 0 + σ²/n.   (3.27)
Hence, the MSE vanishes as n increases. ♣
There is a second "classical" example for the calculation of the mean squared error; however, it requires some properties of squared Gaussian variables.
Example 3.8. If Y1, . . . , Yn ~iid N(µ, σ²), it is possible to show that (n − 1)S²/σ² ∼ χ²_{n−1}. Then
MSE(S²) = Var(S²) = σ⁴/(n − 1)² · Var((n − 1)S²/σ²) = σ⁴/(n − 1)² · (2n − 2) = 2σ⁴/(n − 1).   (3.28)
Analogously, one can show that MSE(σ̂²_MM) is smaller than (3.28). Moreover, the estimator (n − 1)S²/(n + 1) possesses the smallest MSE. ♣
Definition 3.4. Let Y1, . . . , Yn ~iid N(µ, σ²) with known σ². The interval
[Ȳ − z_{1−α/2} σ/√n , Ȳ + z_{1−α/2} σ/√n]   (3.32)
is an exact (1 − α) confidence interval for the parameter µ; 1 − α is called the level of the confidence interval. ♦
If the standard deviation σ is unknown, the ansatz must be modified by using a point estimate for σ, typically S = √(S²) with S² = 1/(n−1) Σ_i (Yi − Ȳ)². Since (Ȳ − µ)/(S/√n) ∼ T_{n−1}, the corresponding quantile must be modified:
1 − α = P( t_{n−1,α/2} ≤ (Ȳ − µ)/(S/√n) ≤ t_{n−1,1−α/2} ).   (3.33)
Definition 3.5. Let Y1, . . . , Yn ~iid N(µ, σ²). The interval
[Ȳ − t_{n−1,1−α/2} S/√n , Ȳ + t_{n−1,1−α/2} S/√n]   (3.34)
is an exact (1 − α) confidence interval for the parameter µ. ♦
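As a small illustration (a sketch; HbSS is the vector from R-Code 3.1), the t-based interval (3.34) can be computed by hand or obtained directly from t.test():
n <- length(HbSS)
mean(HbSS) + c(-1, 1) * qt(0.975, df = n - 1) * sd(HbSS) / sqrt(n)   # interval (3.34)
t.test(HbSS, conf.level = 0.95)$conf.int                             # same interval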
Confidence intervals are, as shown in the previous two definitions, constituted by random
variables (functions of Y1 , . . . , Yn ). Similar to estimators and estimates, confidence intervals are
computed with the corresponding realization y1 , . . . , yn of the random sample. Subsequently,
confidence intervals will be outlined in the blue-highlighted text boxes, as shown here.
Notice that the empirical approximate and empirical exact confidence intervals are of the form
[θ̂ − q_{1−α/2} SE(θ̂) , θ̂ + q_{1−α/2} SE(θ̂)],
that is, symmetric intervals around the estimate. Here, SE(·) denotes the standard error of the estimate, that is, an estimate of the standard deviation of the estimator.
Example 3.9. Let Y1, . . . , Y4 ~iid N(0, 1). R-Code 3.2 and Figure 3.2 show 100 empirical confidence intervals based on Equation (3.32) (top) and Equation (3.34) (bottom). Because n is small, the difference between the normal and the t-distribution is quite pronounced. This becomes clear when
[Ȳ − z_{1−α/2} S/√n , Ȳ + z_{1−α/2} S/√n]   (3.38)
is used as an approximation (Figure 3.2, middle).
A few more points to note are as follows. As we do not estimate the variance, all intervals in the top panel have the same length. On average we should observe 5% of the intervals colored red in the top and bottom panels. In the middle panel there are typically more, as the normal quantiles are too small compared to the t-quantiles (see Figure 2.7). ♣
Confidence intervals can often be constructed from a starting estimator and its distribution. In many cases it is possible to isolate the parameter to arrive at 1 − α = P(B_l ≤ θ ≤ B_u); often some approximations are necessary. We consider another classical case in the framework of Gaussian random variables.
Let Y1, . . . , Yn ~iid N(µ, σ²). The estimator S² for the parameter σ² is such that (n − 1)S²/σ² ∼ χ²_{n−1}, i.e., a chi-square distribution with n − 1 degrees of freedom. Hence
1 − α = P( χ²_{n−1,α/2} ≤ (n − 1)S²/σ² ≤ χ²_{n−1,1−α/2} )   (3.39)
      = P( (n − 1)S²/χ²_{n−1,α/2} ≥ σ² ≥ (n − 1)S²/χ²_{n−1,1−α/2} ),   (3.40)
where χ²_{n−1,p} is the p-quantile of the chi-square distribution with n − 1 degrees of freedom. The corresponding exact (1 − α) confidence interval no longer has the form θ̂ ± q_{1−α/2} SE(θ̂), because the chi-square distribution is not symmetric.
For large n, the chi-square distribution can be approximated with a normal one (see also Section 2.8.1) with mean n and variance 2n. Hence we can also use a Gaussian approximation.
Example 3.10. For the Hb SS data with sample variance 0.71, we have the empirical confidence interval [0.39, 1.71] for σ², computed with (16-1)*var(HbSS)/qchisq( c(.975, .025), df=16-1).
If we want to construct a confidence interval for the standard deviation parameter σ = √(σ²), we can use the approximation [√0.39, √1.71]. That means we have used the same transformation for the bounds as for the estimate, a tool that is often used. ♣
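A compact version of this calculation (a sketch; HbSS as in R-Code 3.1):
ci.var <- (16 - 1) * var(HbSS) / qchisq(c(.975, .025), df = 16 - 1)
ci.var          # approximately [0.39, 1.71] for sigma^2
sqrt(ci.var)    # transformed bounds for sigma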
For a fixed model the width of a confidence interval can be reduced by reducing the level or
increasing n.
The coverage probability of an exact confidence interval amounts to exactly 1−α. This is not
the case for approximate confidence intervals (we see a particular example in the next chapter).
R-Code 3.2 100 confidence intervals for the parameter µ = 0, based on three different approaches: exact with known σ = 1, a Gaussian approximation with estimated σ, and exact again with estimated σ (t-based).
set.seed( 1)
ex.n <- 100 # 100 confidence intervals
alpha <- .05 # 95% confidence intervals
n <- 4 # sample size
mu <- 0
sigma <- 1
sample <- array( rnorm( ex.n * n, mu, sigma), c(n,ex.n))
yl <- mu + c( -6, 6)*sigma/sqrt(n) # same y-axis for all
ybar <- apply( sample, 2, mean) # mean
# Sigma known:
sigmaybar <- sigma/sqrt(n)
plot( 1:ex.n, 1:ex.n, type='n', ylim=yl, xaxt='n', ylab='',
main=expression(sigma~known))
abline( h=mu)
for ( i in 1:ex.n){
ci <- ybar[i] + sigmaybar * qnorm(c(alpha/2,1-alpha/2))
lines( c(i,i), ci, col=ifelse( ci[1]>mu|ci[2]<mu, 2, 1))
}
[Figure panels: "σ known" (top), "Gaussian Approximation" (middle), "t-distribution" (bottom); see Figure 3.2.]
Figure 3.2: Normal and t-based confidence intervals for the parameter µ = 0 with σ = 1 known (above) and σ unknown (middle and below). The sample size is n = 4 and the confidence level is (1 − α) = 95%. Confidence intervals which do not cover the true value zero are in red. (Figure based on R-Code 3.2.)
ii) Plot the cdf and the pmf for λ = 1 and λ = 5.5.
Hint: use a discrete grid {0, 1, 2, . . . } (why?) and, where necessary, the R command
stepfun.
iii) For λ = 1 and λ = 5.5 sample m = 1000 random variables and draw histograms. Compare
the histograms with ii). What do you expect to happen when m is large?
iv) Let λ̂ = (1/n) Σ_{i=1}^n Xi be an estimator of λ. Calculate E(λ̂), Var(λ̂) and MSE(λ̂).
vi) Let λ = 3: calculate P(X1 ≤ 2), P(X1 < 2) and P(X1 ≥ 3).
vii) Observe that P(X1 ≤ 2) ≠ P(X1 < 2). Would this still be true if X1 had a continuous distribution?
Problem 3.2 (Germany cancer counts) The dataset Oral is available in the R package spam
and contains oral cavity cancer counts for 544 districts in Germany.
i) Load the data and take a look at its help page using ?Oral.
Hint: The R package spam is available on CRAN and can be installed with the com-
mand install.packages("spam") and loaded with require(spam) or, alternatively, with
library(spam). The command data(Oral) copies the dataset to the global environment.
iii) The Poisson distribution is common for modeling rare events such as deaths caused by cavity cancer (column Y in the data). However, the districts differ greatly in their populations. Define a subset of the data which only considers districts with expected fatal casualties caused by cavity cancer between 35 and 45 (subset, column E). Perform a Q-Q plot for a Poisson distribution.
Hint: use qqplot() from the stats package. Note that you need to define the distribution and the number of quantiles ppoints. Only qqnorm does this automatically. You also need to define lambda for the Poisson distribution.
Simulate a Poisson distributed random variable with the same length and the same lambda as your subset. Perform a Q-Q plot of your simulated data. Also check the histograms for visualization. What can you say about the distribution of your subset of the cancer data?
iv) Assume that the standardized mortality ratio Zi = Yi /Ei is normally distributed, i.e.,
iid
Z1 , . . . , Z544 ∼ N (µ, σ 2 ). Estimate µ and give a 95% (exact) confidence interval (CI).
What is the precise meaning of the CI?
v) Simulate a 95% confidence interval based on the following bootstrap scheme (sampling with replacement):
Repeat 100 000 times: draw a sample of size 544 with replacement from z1, . . . , z544, calculate its mean and store it.
Construct the confidence interval by taking the 2.5% and the 97.5% quantiles of the stored means.
Compare it to the CI from iv).
Chapter 4
Statistical Testing
Be aware of the multiple testing problem and know how to deal with it
As we are all aware, when tossing a fair coin 2m times we do not always observe exactly m heads but often a number close to it. This is illustrated in R with rbinom(1, size=2*m, prob=1/2) for 2m tosses, since rbinom(1, size=1, prob=1/2) "is" a fair coin. We use this idea for an arbitrary number of tosses and expect — for a fair coin — roughly half of them to be heads. We need to quantify "roughly", in the sense of what is natural or normal variability. We will discuss what can be said if we are outside the range of normal variability of a fair coin. As an illustration, suppose we observe 13 heads in 17 tosses, which seems to be an unusual case, and we intuitively wonder if the coin is fair. In other words, do the observed data provide enough evidence against a fair coin? We will formulate a formal statistical procedure to answer such questions.
Example 4.1. In rabbits, pododermatitis is a chronic multifactorial skin disease that manifests
mainly on the hind legs. This presumably progressive disease can cause pain leading to poor
welfare. To study the progression of this disease on the level of individual animals, scientists
assessed many rabbits in three farms over the period of an entire year (Ruchti et al., 2019). We
use a subset of the dataset in this and later chapters, consisting of one farm (with two barns) and
two visits (July 19/20, 2016 and June 29/30, 2017). The 6 stages from Drescher and Schlender-Böbbis (1996) were used as a tagged visual-analogue scale to score the occurrence and severity of pododermatitis on 4 spots on the rabbits' hind legs (left and right, heel and middle position), resulting in the variable PDHmean with range 0–10; for details on the scoring see Ruchti et al. (2018).
We consider the visits in June 2017 and would like to assess if the score of the 17 rabbits is comparable to 3.333, representing a low-grade scoring (low-grade hyperkeratosis, hypotrichosis or alopecia). (The observed mean is 3.87 with a standard deviation of 0.64.) R-Code 4.1 illustrates
the calculation of empirical confidence intervals, visualized in Figure 4.1. ♣
Figure 4.1: Rug plot with sample mean (vertical green) and confidence intervals based
on the t-distribution (green) and normal approximation (blue). (See R-Code 4.1.)
The idea of a statistical testing procedure is to formulate statistical hypotheses and to draw
conclusions from them based on the data. We always start with a null hypothesis, denoted with
H0 , and – informally – we compare how compatible the data is with respect to this hypothesis.
Simply stated, starting from a statistical hypothesis a statistical test calculates a value from
the data and places that value in the context of the hypothetical density induced by the statistical
(null) hypothesis. If the value is unlikely to occur, we argue that the data provides evidence
against the (null) hypothesis.
More formally, if we want to test a statement about a certain parameter, say θ, we need an estimator θ̂ for that parameter. We often need to transform the estimator such that its distribution does not depend on (the) parameter(s). We call this random variable (a function of the random sample) the test statistic. Some of these test statistics are well known and have been named, as we shall see later. Once the test statistic has been determined, we evaluate it at the observed sample and compare it with quantiles of its distribution under the null hypothesis; the comparison is typically expressed as a probability, i.e., the famous p-value.
A more formal definition of p-value follows.
Definition 4.1. The p-value is the probability, under the distribution of the null hypothesis, of
obtaining a result equal to or more extreme than the observed result. ♦
Example 4.2. We assume a fair coin is tossed 17 times and we observe 13 heads. Hence, the p-
value is the probability of observing 13, 14,. . . ,17 heads, or by symmetry of observing 13, 14, . . . ,
17 tails (corresponding to 4,3,. . . ,0 heads), which is sum( dbinom(0:4, size=17, prob=1/2) +
dbinom(13:17, size=17, prob=1/2)), equaling 0.049, i.e., we observe such a seemingly unlikely
event roughly every 20th time.
Note that we could calculate the p-value as 2*pbinom(4, size=17, prob=1/2) or equiva-
lently as 2*pbinom(12, size=17, prob=1/2, lower.tail=FALSE). ♣
In practice, one often starts with a scientific hypothesis and then collects data or performs an experiment. The data are then "modeled statistically", e.g., we need to determine a theoretical distribution for them. In our discussion here, the distribution typically involves parameters that are linked to the scientific question (probability p in a binomial distribution for coin tosses, mean µ of a Gaussian distribution for testing differences in pododermatitis scores). We formulate the null hypothesis H0. In many cases we pick a "known" test instead of "manually" constructing a test statistic. Of course this test has to be in sync with the statistical model. Based on the p-value we then summarize the evidence against the null hypothesis. We cannot make any statement in favor of the hypothesis.
Figure 4.2 illustrates graphically the p-value in two hypothetical situations. Suppose that
under the null hypothesis the density of the test statistic is Gaussian and suppose that we observe
a value of the test statistic of 1.8. If more extreme is considered on both sides of the density then
the p-value consists of two probabilities (here because of the symmetry, twice the probability of
either side). If more extreme is actually larger (here, possibly smaller in other situations), the
p-value is calculated based on a one-sided probability. As the Gaussian distribution is symmetric, the two-sided p-value is twice the one-sided one, here computed with 1-pnorm(1.8), or, equivalently, pnorm(1.8, lower.tail=FALSE).
[Figure panels: H0 two-sided, p-value = 0.072 (left); H0 one-sided, p-value = 0.036 (right); the observation is marked in both panels.]
Figure 4.2: Illustration of the p-value in the case of a Gaussian test statistic with
observed value 1.8. Two-sided (left panel) and one-sided setting (right panel).
Example 4.3. For simplicity we assume that the pododermatitis scores of Example 4.1 are a realization of X1, . . . , X17 ~iid N(µ, 0.8²), i.e., n = 17 and the standard deviation is known. Hence, X̄ ∼ N(µ, 0.8²/17). Moreover, under the distributional assumption H0: µ = µ0 = 3.333, X̄ ∼ N(3.333, 0.8²/17). Thus — taking again a two-sided setting —
p-value = 2 P(X̄ ≥ x̄) = 2 (1 − P(X̄ < 3.869))   (4.1)
        = 2 (1 − P( (X̄ − 3.333)/(0.8/√17) < (3.869 − 3.333)/(0.8/√17) )) = 2 (1 − Φ(2.762)) ≈ 0.6%.   (4.2)
There is evidence in the data against the null hypothesis. ♣
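The same number can be obtained directly in R (a sketch of the calculation in (4.1)–(4.2)):
2 * pnorm(3.869, mean = 3.333, sd = 0.8 / sqrt(17), lower.tail = FALSE)   # about 0.006
2 * (1 - pnorm(2.762))                                                    # standardized version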
Some authors summarize p-values in [1, 0.1] as no evidence, in [0.1, 0.01] as weak evidence, in [0.01, 0.001] as substantial evidence, and smaller ones as strong evidence (Held and Sabanés Bové, 2014). In R, significance codes are used for similar ranges: ' ', '.', '*', '**', and '***'.
Although we discuss all six points in various settings, in this chapter emphasis lies on points
ii) to v).
More precisely, we start with a null hypothesis H0 and an alternative hypothesis, denoted by H1 or HA. These hypotheses are with respect to a parameter, say θ. Hypotheses are classified as simple if the parameter θ assumes only a single value (e.g., H0: θ = 0), or composite if the parameter θ can take on a range of values (e.g., H0: θ ≤ 0 or H1: µ ≠ µ0). Often, you will encounter a simple null hypothesis with a simple or composite alternative hypothesis. For example, when testing a mean parameter we would have H0: µ = µ0 vs H1: µ ≠ µ0 for the latter. In practice the case of a simple null and a simple alternative hypothesis, e.g., H0: µ = µ0 vs H1: µ = µA ≠ µ0, is rarely used but has considerable didactic value.
Hypothesis tests may be either one-sided (directional), in which only a relationship in a
prespecified direction is of interest, or two-sided, in which a relationship in either direction is
tested. One could use a one-sided test for “Hb SS has a lower average hemoglobin value than
Hb Sβ”, but a two-sided test is needed for “Hb SS and Hb Sβ have different average hemoglobin
values". Further examples of hypotheses are given in Rudolf and Kuhlisch (2008). We strongly recommend always using two-sided tests (e.g., Bland and Bland, 1994; Moyé and Tita, 2002), not only in clinical studies where this is the norm, but also because, as Bland and Bland (1994) state, "a one-sided test is appropriate when a large difference in one direction would lead to the same action as no difference at all. Expectation of a difference in a particular direction is not adequate justification." However, to illustrate certain concepts, a one-sided setting may be simpler and more accessible.
As in the case of a significance test, we compare the value of the test statistic with the
quantiles of the distribution of the null hypothesis. In the case of small p-values we reject H0 , if
not, we fail to reject H0 . The decision of whether the p-value is “small” or not is based on the
so-called significance level, α.
Definition 4.2. The rejection region of a test includes all values of the test statistic with a
p-value smaller than the significance level. The boundary values of the rejection region are called
critical values. ♦
If we assume that the distribution of the test statistic is Gaussian and α = 0.05, the critical values are ±1.96 and 1.64, respectively (qnorm(c(0.05/2, 1 - 0.05/2)) and qnorm(1 - 0.05)). These critical values are typically linked with the so-called z-test. Note the similarity with Example 4.3.
Figure 4.3: Critical values (red) and rejection regions (orange) for two-sided H0 : µ =
µ0 = 0 (left) and one-sided H0 : µ ≤ µ0 = 0 (right) hypothesis test with significance
level α = 5%.
Two types of errors can occur in significance testing: Type I errors, where we reject H0 when we should not, and Type II errors, where we fail to reject H0 when we should. The framework of hypothesis testing allows us to quantify the probability of committing these two errors. The probability of a Type I error is exactly α. To calculate the Type II error, we need to assume a specific value for our parameter within the alternative hypothesis, e.g., a simple alternative. The probability of a Type II error is often denoted by β. Table 4.1 summarizes the errors in a classical 2 × 2 layout.
Ideally we would like to construct tests that have small Type I and Type II errors. This is not possible and one typically fixes the Type I error to some small value, say 5%, 1% or suchlike (committing a Type I error typically has more severe consequences than a Type II error). Type I and Type II errors are shown in Figure 4.4 for two different alternative hypotheses. This illustrates that reducing the significance level α leads to an increase in β, the probability of committing a Type II error.
Table 4.1: Type I and Type II errors in the setting of significance tests.
                      H0 is true                      H0 is false
Fail to reject H0     correct decision                Type II error (probability β)
Reject H0             Type I error (probability α)    correct decision
Example 4.4. Suppose we reject the null hypothesis of having a fair coin if we observe 0,. . . ,3
or 14,. . . ,17 heads out of 17 tosses. The Type I error is 2*pbinom(3, size=17, prob=1/2), i.e.,
0.013, and, if the coin has a probability of 0.6 for heads, the Type II error is sum(dbinom(4:13,
size=17, prob=0.6)), i.e., 0.953. ♣
Figure 4.4: Type I error with significance level α (red) and Type II error with probability β (blue) for two different alternative hypotheses (µ = 2 top row, µ = 4 bottom row), with a two-sided hypothesis H0: µ = µ0 = 0 (left column) and a one-sided hypothesis H0: µ ≤ µ0 = 0 (right column).
The value 1 − β is called the power of a test. High power of a test is desirable in an experiment: we want to detect small effects with a large probability. R-Code 4.2 computes the power under a Gaussian assumption. More specifically, under the assumption of σ = 1 we test H0: µ = µ0 = 0 versus H1: µ ≠ µ0. The power can only be calculated for a specific assumption about the "actual" mean µ1, i.e., for a simple alternative. Thus, as typically done, Figure 4.5 plots the power as a function of µ1 − µ0.
The workflow of a hypothesis test is very similar to that of a statistical significance test and only points iii) and v) need to be slightly modified:
R-Code 4.2 A one-sided and two-sided power curve for a z-test. (See Figure 4.5.)
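A minimal sketch of such a power computation for the z-test (assuming the standard error of the mean equals one, so that the horizontal axis corresponds to µ1 − µ0, and α = 5%; the original code may differ in details):
alpha <- 0.05
delta <- seq(-1, 4, by = 0.01)   # mu_1 - mu_0 in units of the standard error
pow1 <- 1 - pnorm(qnorm(1 - alpha) - delta)                   # one-sided power
pow2 <- 1 - pnorm(qnorm(1 - alpha/2) - delta) +
        pnorm(-qnorm(1 - alpha/2) - delta)                    # two-sided power
plot(delta, pow1, type = "l", col = 4, ylim = c(0, 1),
     xlab = expression(mu[1] - mu[0]), ylab = "Power")
lines(delta, pow2, lty = 2)
abline(h = alpha, col = "gray")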
Figure 4.5: Power: one-sided (blue solid line) and two-sided (black dashed line). The
gray line represents the level of the test, here α = 5%. The vertical lines represent the
alternative hypotheses µ = 2 and µ = 4 of Figure 4.4. (See R-Code 4.2.)
The choice of test is again constrained by the assumptions. The significance level must, however,
always be chosen before the computations.
The test statistic value tobs calculated in step vi) is compared with critical values tcrit in order
to reach a decision in step vii). When the decision is based on the calculation of the p-value,
it consists of a comparison with α. The p-value can be difficult to calculate, but is valuable
because of its direct interpretation as the strength (or weakness) of the evidence against the null
hypothesis.
4.2. HYPOTHESIS TESTING 65
• tcrit, Fcrit, . . . the critical values, i.e., the quantiles according to the distribution of the test statistic and the significance level.
Our scientific question can only be formulated in general terms and we call it a hypothesis. Generally, two-sided tests are performed. The significance level is modified accordingly for one-sided tests.
In unclear cases, the statistical hypothesis is specified.
For most tests, there is a corresponding R function. The arguments x, y usually
represent vectors containing the data and alpha the significance level. From the
output, it is possible to get the p-value.
In this work we consider various test situations. The choice of test is primarily dependent on
the parameter and secondly on the statistical assumptions. The following list can be used as a
decision tree.
Distribution tests (also goodness-of-fit tests) differ from the other tests discussed here, in the sense that they do not test or compare a single parameter. Of course there are many additional possible tests; the approaches described in the first two sections allow one to construct arbitrary tests. In Sections 4.3 and 4.6 we present some of these tests in more detail by motivating the test statistic, giving an explicit example and summarizing the test in yellow boxes. Ultimately, we perform the test with a single call in R. However, the underlying mechanism has to be understood; it would be too dangerous to use statistical tests only as black-box tools.
Ȳ ∼ N(µ, σ²/n)   ⟹   (Ȳ − µ)/(σ/√n) ∼ N(0, 1)   and   (Ȳ − µ)/(S/√n) ∼ T_{n−1}.   (4.3)
Under the null hypothesis H0 , µ is of course our theoretical value, say µT . We typically use the
last distribution, as σ 2 is unknown (Example 4.3 was linked to the first distribution).
The test statistic is t-distributed (see Section 2.8.2) and so the function pt is used to calculate p-values. As we have only one sample, the test is typically called the "one-sample t-test", illustrated in box Test 1 and in R-Code 4.3 with an example based on the pododermatitis data.
Assumptions: The population, from which the sample arises, is normally dis-
tributed with unknown mean µT . The observed data are independent and
the variance is unknown.
Calculation: tobs = |x̄ − µT| / s · √n.
Decision: Reject H0 : µ = µT , if tobs > tcrit = tn−1,1−α/2 .
Example 4.5. We test the hypothesis that the animals have a stronger pododermatitis compared to low-grade hyperkeratosis. The latter corresponds to scores lower than 3 · 1/3. The statistical null hypothesis is that the mean score is equal to 3.333, and we want to know if the mean of the (single) sample deviates from this specified value sufficiently for a statistical claim.
The following values are given: mean: 3.869, standard deviation: 0.638, sample size: 17.
H0: µ = 3.333;
tobs = |3.869 − 3.333| / (0.638/√17) = 3.467;
tcrit = t_{16,1−0.05/2} = 2.120;   p-value: 0.003.
The p-value is low and hence there is evidence against the null hypothesis. R-Code 4.3 illustrates the direct calculation of the p-value. Figure 4.1 illustrated that the value 3.333 is not in the confidence interval(s) of the mean. The evidence against the null hypothesis is thus not too surprising. ♣
R-Code 4.3 One sample t-test, pododermatitis (see Example 4.5 and Test 1)
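A sketch of the direct calculation from the reported summaries; with the raw June 2017 scores one would simply call t.test(scores, mu = 3.333) (the object name scores is an assumption):
xbar <- 3.869; s <- 0.638; n <- 17; mu0 <- 3.333
tobs <- abs(xbar - mu0) / (s / sqrt(n))
c(tobs = tobs, tcrit = qt(1 - 0.05/2, df = n - 1),
  p.value = 2 * pt(tobs, df = n - 1, lower.tail = FALSE))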
Often we want to compare two different means, say x̄ and ȳ. We assume that both random samples are normally distributed, i.e., X1, . . . , Xn ~iid N(µx, σ²) and Y1, . . . , Yn ~iid N(µy, σ²), and independent. Then
X̄ ∼ N(µx, σ²/n),   Ȳ ∼ N(µy, σ²/n)   ⟹   X̄ − Ȳ ∼ N(µx − µy, 2σ²/n)   (4.4)
⟹   (X̄ − Ȳ − (µx − µy)) / (σ/√(n/2)) ∼ N(0, 1)   and, under H0: µx = µy,   (X̄ − Ȳ) / (σ/√(n/2)) ∼ N(0, 1).   (4.5)
The difficulty is that we do not know σ and we have to estimate it. The estimate takes a somewhat complicated form as both samples need to be taken into account, with possibly different lengths and means. This pooled estimate is denoted by sp and given in Test 2. Ultimately, we again have a t-distribution for the test statistic, as we use an estimate of the standard deviation in the standardization of a normal random variable. As the calculation of sp requires the estimates of µx and µy, we adjust the degrees of freedom to nx + ny − 2.
R-Code 4.4 is again based on the pododermatitis data and compares the scores between the two different barns.
Assumptions: Both populations are normally distributed with the same unknown
variance. The samples are independent.
Calculation: tobs = |x̄ − ȳ| / sp · √( nx · ny / (nx + ny) ),
where sp² = ( (nx − 1) sx² + (ny − 1) sy² ) / (nx + ny − 2).
Decision: Reject H0 : µx = µy if tobs > tcrit = tnx +ny −2,1−α/2 .
Example 4.6. For the pododermatitis scores of the two barns, we have the following summaries. Means: 3.83 and 3.67; standard deviations: 0.88 and 0.87; sample sizes: 20 and 14. Hence, using the formulas given in Test 2, we have
H0: µx = µy
sp² = (19 · 0.884² + 13 · 0.868²) / (20 + 14 − 2) = 0.770, i.e., sp = 0.878
tobs = |3.826 − 3.675| / 0.878 · √(20 · 14 / (20 + 14)) = 0.494
tcrit = t_{32,1−0.05/2} = 2.037;   p-value: 0.625.
Hence, 3.826 and 3.675 are not statistically different. See also R-Code 4.4. ♣
R-Code 4.4 Two-sample t-test with independent samples, pododermatitis (see Exam-
ple 4.6 and Test 2).
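A sketch of the corresponding call (the column name Barn for the barn indicator is an assumption; the original code may subset the pododermatitis data differently):
t.test(PDHmean ~ Barn, data = podo, var.equal = TRUE)   # pooled two-sample t-test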
In practice, the variances of both samples are often different, say σx² and σy². In such a setting, we have to normalize the mean difference by √(sx²/nx + sy²/ny). While this estimate seems simpler than the pooled estimate sp, the degrees of freedom of the resulting t-distribution are not intuitive and difficult to derive, and we refrain from elaborating on them here. In the literature, this test is called Welch's t-test and it is actually the default choice of t.test( x, y, conf.level=1-alpha).
The assumption of independence of the two samples in the previous Test 2 may not be valid if the two samples consist of two measurements of the same individual, e.g., observations at two different instances in time. In such settings, where we have a "before" and "after" measurement, it would be better to take this pairing into account by considering the differences instead of the two samples. Hence, instead of constructing a test statistic based on X̄ − Ȳ we consider
X1 − Y1, . . . , Xn − Yn ~iid N(µx − µy, σd²)   ⟹   X̄ − Ȳ ∼ N(µx − µy, σd²/n)   (4.6)
and, under H0: µx = µy,   (X̄ − Ȳ) / (σd/√n) ∼ N(0, 1),   (4.7)
where σd² is essentially the sum of the two variances minus (twice) the "dependence" between Xi and Yi. We formalize this dependence, called covariance, starting in Chapter 7.
The paired two-sample t-test can thus be considered a one sample t-test of the differences
with mean µT = 0.
Question: Are the means x and y of two paired samples significantly different?
Assumptions: The samples are paired, the observed values are on the interval
scale. The differences are normally distributed with unknown mean δ. The
variance is unknown.
Calculation: tobs = |d̄| / sd · √n, where
• di = xi − yi is the i-th observed difference,
• d̄ is the arithmetic mean and sd is the standard deviation of the differences di.
Example 4.7. We consider the pododermatitis measurements from July 2016 and June 2017
and test if there is a progression over time. We have the following summaries for the differences
(see R-Code 4.5 and Test 3). Mean: 0.21; standard deviation: 1.26; and sample size: 17.
H0: d = 0 or H0: µx = µy;
tobs = |0.210| / (1.262/√17) = 0.687;
tcrit = t_{16,1−0.05/2} = 2.12;   p-value: 0.502.
There is no evidence that there is a progression over time. ♣
R-Code 4.5 Two-sample t-test with paired samples, pododermatitis (see Example 4.7
and Test 3).
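A sketch of the corresponding call (the object names pdh2016 and pdh2017 for the scores of the two visits are assumptions):
t.test(pdh2017, pdh2016, paired = TRUE)   # equivalent to t.test(pdh2017 - pdh2016, mu = 0)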
The "t-tests" require normally distributed data. The tests are relatively robust towards deviations from normality, as long as there are no extreme outliers. Otherwise, rank-based tests can be used (see Chapter 6). The assumption of normality can be verified quantitatively with formal normality tests (χ²-test as shown in Section 4.6, Shapiro–Wilk test, Kolmogorov–Smirnov test). Often, however, a qualitative verification is sufficient (e.g., with the help of a Q-Q plot).
which corresponds to the boundaries of the empirical (1 − α) confidence interval for µT . Analo-
gously, this duality can be established for the other tests described in this chapter.
Example 4.8. Consider the situation from Example 4.5. Instead of comparing the p-value, we
can also consider the confidence interval, whose boundary values are 3.54 and 4.20. Since the
value 3.33 is not in this range, the null hypothesis is rejected.
This is shown in Figure 4.1. ♣
In R most test functions give the corresponding confidence intervals with the value of the
statistic and p-value. Some functions require the additional argument conf.int=TRUE, as well.
i) p-values can indicate how incompatible the data are with a specified statistical model.
ii) p-values do not measure the probability that the studied hypothesis is true, or the proba-
bility that the data were produced by random chance alone.
iii) Scientific conclusions and business or policy decisions should not be based only on whether
a p-value passes a specific threshold.
v) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
vi) By itself, a p-value does not provide a good measure of evidence regarding a model or
hypothesis.
P(at least one false significant result) = 1 − P(no false significant result)   (4.13)
= 1 − (1 − α)^m.   (4.14)
In Table 4.2 the probabilities of at least one false significant result for α = 0.05 and various m
are presented. Even for just a few tests, the probability increases drastically, which should not
be tolerated.
Table 4.2: Probabilities of at least one false significant test result when performing m
tests at level α = 5% (top row) and at level αnew = α/m (bottom row).
m 1 2 3 4 5 6 8 10 20 100
1 − (1 − α)m 0.05 0.098 0.143 0.185 0.226 0.265 0.337 0.401 0.642 0.994
1 − (1 − αnew )m 0.05 0.049 0.049 0.049 0.049 0.049 0.049 0.049 0.049 0.049
There are several different methods that allow multiple tests to be performed while maintain-
ing the selected significance level. The simplest and most well-known of them is the Bonferroni
correction. Instead of comparing the p-value of every test to α, they are compared to a new
significance level, αnew = α/m, see second row of Table 4.2. There are several alternative meth-
ods which, depending on the situation, may be more appropriate. We recommend using at least method="holm" (the default) in p.adjust. For more details see, for example, Farcomeni (2008).
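A short sketch of such an adjustment (the p-values are made up for illustration only):
pvals <- c(0.001, 0.012, 0.030, 0.045, 0.20)
rbind(bonferroni = p.adjust(pvals, method = "bonferroni"),
      holm       = p.adjust(pvals, method = "holm"))   # compare adjusted p-values with alpha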
Hypothesizing After the Results are Known (HARKing) is another inappropriate scientific practice in which a post hoc hypothesis is presented as an a priori hypothesis. In a nutshell, we collect the data of the experiment and adjust the hypothesis after we have analysed the data, e.g., we select effects small enough such that significant results have been observed.
Along similar lines, analyzing a dataset with many different methods will lead to many p-values, out of which a proportion α are significant (when the null hypothesis holds); due to various inherent decisions often even more. When searching for a good statistical analysis one often has to make many choices and thus inherently selects the best one among many. This danger is often called the 'garden of forking paths'. Conceptually, adjusting the p-value for the many (not performed) tests would mitigate the problem.
Often if a result is not significant, the study is not published and is left in a 'file drawer'. A seemingly significant result might well be due to a Type I error, but this is not evident as many similar experiments lead to non-significant outcomes that are not published.
For many scientific domains, it is possible to preregister the study, i.e., to declare the study
experiment, analysis methods, etc. before the actual data has been collected. In other words,
everything is determined except the actual data collection and actual numbers of the statistical
analysis. The idea is that the scientific question is worth investigating and reporting independent of the actual outcome. Such an approach reduces HARKing, the garden-of-forking-paths issue, publication bias and more.
density, and distribution functions implemented in R with [q,d,p]f). This "classic" F-test is given in Test 4.
In later chapters we will see more natural settings where we need to compare variances (not necessarily from two a priori different samples).
Example 4.9. The two-sample t-test (Test 2) requires equal variances. The pododermatitis data do not contradict this null hypothesis, as R-Code 4.6 shows (compare Example 4.6 and R-Code 4.4). ♣
The so-called chi-square test (χ² test) verifies whether the observed data follow a particular distribution. This test is based on a comparison of the observed with the expected frequencies and can also be used in other situations, for example, to test if two categorical variables are independent (Test 6).
Under the null hypothesis, the test statistic of the chi-square test is χ²-distributed. The quantile, density, and distribution functions are implemented in R with [q,d,p]chisq (see Section 2.8.1).
In Test 5, the categories should be aggregated so that all bins contain a reasonable number of counts, e.g., ei ≥ 5. Additionally, N − k > 1.
Question: Are the variances s2x and s2y of two samples significantly different?
Assumptions: Both populations, from which the samples arise, are normally dis-
tributed. The samples are independent and the observed values are from the
interval scale.
Calculation: Fobs = sx² / sy²
(the larger variance always goes in the numerator, so sx² > sy²).
Decision: Reject H0: σx² = σy² if Fobs > Fcrit = f_{nx−1, ny−1, 1−α}.
R-Code 4.6 Comparison of two variances, PDH (see Example 4.9 and Test 4).
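A sketch of the comparison (the objects scores1 and scores2 holding the PDHmean values of the two barns are assumptions):
var.test(scores1, scores2)   # F-test for the equality of two variances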
Example 4.10. With few observations (10 to 50) it is often pointless to test for normality of the data. Even for larger samples, a Q-Q plot is often more informative. For completeness, we illustrate a simple goodness-of-fit test by comparing the pododermatitis data with expected counts constructed from a Gaussian density with matching mean and variance (R-Code 4.7). As there is no significant difference in the means and the variances, we pool over both periods and barns (n = 34).
The binning of the data is done through a histogram-type binning (an alternative way would be table( cut( podo$PDHmean))). As we have fewer than five observations in several bins, the function chisq.test issues a warning. This effect could be mitigated if we calculate the p-value using a bootstrap simulation by setting the argument simulate.p.value=TRUE. Pooling the bins, say breaks=c(1.5,2.5,3.5,4,4.5,5), would be an alternative as well.
Decision: Reject H0: "no deviation between the observed and expected" if χ²obs > χ²crit = χ²_{N−1−k, 1−α}, where k is the number of parameters estimated from the data to calculate the expected counts.
R-Code 4.7 Testing normality, pododermatitis (see Example 4.10 and Test 5).
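A sketch of such a goodness-of-fit calculation (the binning via hist() and the pooled scores podo$PDHmean follow the description in Example 4.10; the original code may differ in details):
x <- podo$PDHmean                          # pooled scores, n = 34
h <- hist(x, plot = FALSE)                 # histogram-type binning
p <- diff(pnorm(h$breaks, mean = mean(x), sd = sd(x)))
chisq.test(h$counts, p = p / sum(p))       # warning expected: some bins have few counts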
iii) Determine the lower and upper bound of a confidence interval, Bl and Bu (both functions of X̄), such that
P(−q ≤ √n (X̄ − µ)/σ ≤ q) = P(Bl ≤ µ ≤ Bu).
vi) Use the sickle-cell disease data and construct 90%-confidence intervals for the means of
HbSS and HbSβ variants (assume σ = 1). sickle.RData is available on the web-page and
provides the HbSS and HbSb measurements.
Problem 4.2 (Normal distribution with unknown σ) Let X1, . . . , Xn ~iid N(µ, σ²) with σ > 0 unknown and S² = 1/(n−1) Σ_{i=1}^n (Xi − X̄)².
i) What is the distribution of √n (X̄ − µ)/S? (No formal proof required.)
iv) Use the sickle-cell disease data. Construct 90%-confidence intervals for the means of vari-
ants HbSS and HbSβ (assume σ is unknown).
Problem 4.3 (t-Test) Use again the sickle-cell disease data. For the cases listed below, spec-
ify the null and alternative hypothesis. Then use R to perform the tests and give a careful
interpretation.
Explain and apply estimation, confidence interval and hypothesis testing for
proportions
In this chapter we have a closer look at statistical techniques that help us to correctly answer the above questions. More precisely, we will estimate proportions and then compare proportions with each other. To simplify the exposition, we discuss the estimation using only one of the two risk factors.
                           Diagnosis
                     positive   negative   Total
Risk   with factor      h11        h12       n1
       without          h21        h22       n2
5.1 Estimation
We start with a simple setting where we observe occurrences of a certain event and are interested
in the proportion of the events over the total population. More specifically, we consider the
number of successes in a sequence of experiments, i.e., whether a certain treatment had an effect.
We often use a binomial random variable X ∼ Bin(n, p) for such a setting, where n is given or
known and p is unknown, the parameter of interest. Intuitively, we find x/n to be an estimate
of p and X/n the corresponding estimator. We will construct a confidence interval for p in the
next section.
With the method of moments we obtain the estimator p̂_MM = X/n, since np = E(X) and we have only one observation (the total number of cases). The estimator is identical to the intuitive estimator.
The likelihood estimator is constructed as follows:
L(p) = (n choose x) p^x (1 − p)^{n−x}   (5.1)
ℓ(p) = log L(p) = log (n choose x) + x log(p) + (n − x) log(1 − p)   (5.2)
dℓ(p)/dp = x/p − (n − x)/(1 − p)   ⇒   x/p̂_ML = (n − x)/(1 − p̂_ML),   i.e., p̂_ML = x/n.   (5.3)
In our example we have the following estimates: p̂T = 138/1370 ≈ 10% for the treatment group and p̂C = 175/1336 ≈ 13% for the control group. The question as to whether the two proportions are different enough to be able to speak of an effect of the drug will be discussed in a later part of this chapter.
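These estimates are quickly reproduced in R (a sketch based on the counts given above):
c(p_T = 138 / 1370, p_C = 175 / 1336)   # approximately 0.10 and 0.13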
i) How many cases of pre-eclampsia can be expected in a group of 100 pregnant women?
Note, however, that the estimator p̂ = X/n does not have a "classical" distribution. Figure 5.1 illustrates the probability mass function based on the estimate for the pre-eclampsia cases in the treated group. The figure visually suggests using a Gaussian approximation, which is well justified here as n p̂(1 − p̂) ≥ 9. The Gaussian approximation for X is then used to state that the estimator p̂ is also approximately Gaussian.
Figure 5.1: Probability mass function (top) with zoom in and the normal approxima-
tion (bottom).
When dealing with proportions we often speak of odds, or simply of chance, defined by ω = p/(1 − p). The corresponding intuitive estimator (and estimate) is ω̂ = p̂/(1 − p̂). As a side note, this estimator also coincides with the maximum likelihood estimator. Similarly, θ̂ = log(ω̂) = log(p̂/(1 − p̂)) is an intuitive (and the maximum likelihood) estimator (and estimate) of the log odds.
Remark 5.1. For (regular) models with parameter θ, as n → ∞, likelihood theory states that the estimator θ̂_ML is normally distributed with expected value θ and variance Var(θ̂_ML).
Since Var(X/n) = p(1 − p)/n, one can assume that SE(p̂) = √(p̂(1 − p̂)/n). The so-called Wald confidence interval rests upon this assumption (which can be shown more formally) and is identical to (5.8). ♣
If the inequality in (5.4) is solved through a quadratic equation, we obtain the empirical Wilson confidence interval
b_{l,u} = 1/(1 + q²/n) · ( p̂ + q²/(2n) ± q · √( p̂(1 − p̂)/n + q²/(4n²) ) ),   (5.9)
The Wilson confidence interval is “more complicated” than the Wald confidence interval. Is
it also “better” because one less approximation is required?
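A sketch comparing the two intervals for the treatment group (prop.test with correct=FALSE returns the Wilson interval; the counts are those given earlier):
x <- 138; n <- 1370; phat <- x / n; q <- qnorm(0.975)
wald   <- phat + c(-1, 1) * q * sqrt(phat * (1 - phat) / n)
wilson <- prop.test(x, n, correct = FALSE)$conf.int
rbind(wald, wilson = as.numeric(wilson))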
Ideally the coverage probability of a (1 − α) confidence interval should be 1 − α. For a discrete random variable, the coverage is
P(p ∈ CI) = Σ_{x=0}^n P(X = x) · I{p ∈ CI}.   (5.12)
R-Code 5.2 calculates the coverage of the 95% confidence intervals for X ∼ Bin(n = 40, p = 0.4) and demonstrates that the Wilson confidence interval has better coverage (96% compared to 94%).
R-Code 5.2: Coverage of 95% confidence intervals for X ∼ Bin(n = 40, p = 0.4).
p <- .4
n <- 40
x <- 0:n
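The remaining steps of the coverage calculation can be sketched as follows (shown for the Wald interval; the Wilson interval is treated analogously; the original R-Code 5.2 may differ in details):
q <- qnorm(0.975); phat <- x / n
se <- sqrt(phat * (1 - phat) / n)
covered <- (phat - q * se <= p) & (p <= phat + q * se)
sum(dbinom(x, n, p) * covered)     # Wald coverage, approximately 0.94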
The coverage depends on p, as shown in Figure 5.2 (from R-Code 5.3). The Wilson confidence interval has coverage closer to the nominal level near the center. This observation also holds when n is varied, as in Figure 5.3. Note that the top "row" of the left and right parts of the panel corresponds to the top and bottom panels of Figure 5.2.
[Figure panels: Wald coverage (top) and Wilson coverage (bottom) as functions of p.]
Figure 5.2: Coverage of the 95% confidence intervals for X ∼ Bin(n = 40, p). The red
dashed line is the nominal level 1 − α and in green we have a smoothed curve to “guide
the eye”.
The width of an empirical confidence interval is bu − bl. For the Wald confidence interval we obtain
2q · √( p̂(1 − p̂)/n ).   (5.13)
[Figure panels: coverage of the Wald CI (left) and Wilson CI (right) as functions of p and n.]
Figure 5.3: Coverage of the 95% confidence intervals for X ∼ Bin(n, p) as functions
of p and n. The probabilities are symmetric around p = 1/2. All values smaller than
0.7 are represented with dark red.
Figure 5.4: Widths of the empirical 95% confidence intervals for X ∼ Bin(n = 40, p)
(The Wald is in solid green, the Wilson in dashed blue).
The widths are depicted in Figure 5.4. For 5 < x < 36, the Wilson confidence interval has a smaller width and better nominal coverage. For small and very large values of x, the Wald confidence interval has far too small a coverage, and thus wider intervals would be desirable.
case of Test 5. That is, an overall (joint) proportion is estimated and the observed and expected
counts are compared. Typically, continuity corrections are applied.
Calculation: We use the notation for cells of a contingency table, as in Table 5.1.
The test statistic is
χ²obs = ( (h11 h22 − h12 h21)² · (h11 + h12 + h21 + h22) ) / ( (h11 + h12)(h21 + h22)(h12 + h22)(h11 + h21) )
and, under the null hypothesis that the proportions are the same, it is χ²-distributed with one degree of freedom.
Example 5.1. R-Code 5.4 shows the results for the pre-eclampsia data, once using a proportion test and once using a chi-squared test (comparing expected and observed frequencies). ♣
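A sketch of these two calls (the 2 × 2 counts are constructed from the totals given earlier in this chapter):
tab <- matrix(c(138, 1370 - 138, 175, 1336 - 175), nrow = 2, byrow = TRUE,
              dimnames = list(c("treatment", "control"), c("positive", "negative")))
prop.test(tab)    # test of equal proportions
chisq.test(tab)   # chi-squared test on the 2 x 2 table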
Remark 5.2. We have presented the rows of Table 5.1 in terms of two binomials, i.e., with
two fixed marginals. In certain situations, such a table can be seen from a hypergeometric
distribution point of view (see help( dhyper)), where three margins are fixed. For this latter
view, fisher.test is the test of choice.
It is natural to extend the 2 × 2-tables to general two-way tables or to include covariates etc.
Several concepts discussed here may still apply but need to be extended. Often, at the very end,
a test based on a chi-squared distributed test statistic is used. ♣
The goal of this section is to introduce formal approaches for a comparison of two proportions p1 and p2. This can be accomplished using (i) the difference p1 − p2, (ii) the quotient p1/p2, or (iii) the odds ratio (p1/(1 − p1)) / (p2/(1 − p2)), which we consider in the following three sections.
The relative risk assumes positive values. A value of 1 means that the risk is the same in both groups and there is no evidence of an association between the diagnosis/disease/event and the risk factor. A value greater than one is evidence of a possible positive association between a risk factor and a diagnosis/disease. If the relative risk is less than one, the exposure has a protective effect, as is the case, for example, for vaccinations.
implies positive confidence boundaries. Note that with the back-transformation we lose the 'symmetry' of estimate plus/minus standard error.
Example 5.2. The relative risk and corresponding confidence interval for the pre-eclampsia
data are given in R-Code 5.5. The relative risk is smaller than one (diuretics reduce the risk).
An approximate 95% confidence interval does not include one. ♣
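A sketch of such a calculation on the log scale (the standard error formula √(1/h11 − 1/n1 + 1/h21 − 1/n2) is the common large-sample approximation; the original R-Code 5.5 may be organized differently):
h11 <- 138; n1 <- 1370; h21 <- 175; n2 <- 1336
rr <- (h11 / n1) / (h21 / n2)
se <- sqrt(1/h11 - 1/n1 + 1/h21 - 1/n2)
c(rr = rr, exp(log(rr) + c(-1, 1) * qnorm(0.975) * se))   # estimate with approximate 95% CI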
with A and B the positive diagnosis with and without risk factors. The odds ratio indicates the
strength of an association between factors (association measure). The calculation of the odds
ratio also makes sense when the number of diseased is determined by study design, as is the case
for case-control studies.
When a disease is rare (very low probability of disease), the odds ratio and relative risk are
approximately equal.
An estimate of the odds ratio is
ÔR = (h11/h12) / (h21/h22) = h11 h22 / (h12 h21).   (5.26)
The construction of confidence intervals for the odds ratio is based on Equation (2.39) and
Equation (2.40), analogous to that of the relative risk.
Example 5.3. The odds ratio with confidence interval for the pre-eclampsia data is given in R-Code 5.6. The 95% confidence interval is again similar to the one calculated for the relative risk and also does not include one, strengthening the claim (i.e., a significant result).
Notice that the function fisher.test (see Remark 5.2) also calculates an odds ratio. As it is based on a likelihood calculation, there are minor differences between the two estimates. ♣
R-Code 5.6 Odds ratio with confidence interval, approximate and exact.
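A sketch of the approximate and exact calculations (the log-scale standard error √(1/h11 + 1/h12 + 1/h21 + 1/h22) is the usual large-sample approximation):
h11 <- 138; h12 <- 1370 - 138; h21 <- 175; h22 <- 1336 - 175
or <- (h11 * h22) / (h12 * h21)
se <- sqrt(1/h11 + 1/h12 + 1/h21 + 1/h22)
exp(log(or) + c(-1, 1) * qnorm(0.975) * se)                    # approximate 95% CI
fisher.test(matrix(c(h11, h12, h21, h22), 2, byrow = TRUE))    # exact version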
i) Derive the test statistic of the test of proportions (without continuity correction).
Problem 5.2 (Binomial distribution) Suppose that among n = 95 Swiss males, eight are red-green colour blind. We are interested in estimating the proportion p of people suffering from this condition in the male population.
ii) Calculate the maximum likelihood estimate (ML) p̂ML and the ML of the odds ω̂.
iii) Using the central limit theorem (CLT), it can be shown that p̂ approximately follows N(p, p(1 − p)/n). Compare the binomial distribution to the normal approximation for different n and p. To do so, plot the exact cumulative distribution function (CDF) and compare it with the CDF obtained from the CLT. For which values of n and p is the approximation reasonable? Is the approximation reasonable for the red-green colour blindness data?
iv) Use the R functions binom.test() and prop.test() to compute two-sided 95%-confidence
intervals for the exact and for the approximate proportion. Compare the results.
vi) Compute the Wilson 95%-confidence interval and compare it to the confidence intervals
from (d).
Problem 5.3 (A simple clinical trial) A clinical trial is performed to compare two treatments,
A and B, that are intended to treat a skin disease named psoriasis. The outcome shown in the
following table is whether the patient’s skin cleared within 16 weeks of the start of treatment.
Treatment A Treatment B
Cleared 9 5
Not cleared 18 22
i) Compute for each of the two treatments a Wald type and a Wilson confidence interval for
the proportion of patients whose skin cleared.
ii) Test whether the risk difference is significantly different to zero (i.e., RD = 0). Use both
an exact and an approximated approach.
iii) Compute CIs for both, relative risk (RR) and odds ratio (OR).
Chapter 6
Rank-Based Methods
Until now we have often assumed that we have a realization of a Gaussian random sample.
In this chapter, we discuss basic approaches to estimation and testing for cases in which this is
not the case. This includes the presence of outliers, or that the data might not be measured on
the interval/ratio scale, etc.
If, hypothetically, we set an arbitrary value xi to an infinitely large value (i.e., we create an
extreme outlier), these estimates above also “explode”. A single value may exert enough influence
on the estimate such that the estimate is not representative of the bulk of the data.
Robust estimators are not sensitive to one or possibly several outliers. They often do not require specific distributional assumptions on the random sample.
The mean is therefore not a robust estimate of location. A robust estimate of location is the
trimmed mean, in which the biggest and smallest values are trimmed away and not considered.
The (empirical) median (the middle value of an odd amount of data or the center of the two
middle-most values of an even amount of data) is another robust estimate of location.
Robust estimates of the dispersion (data spread) are (1) the (empirical) interquartile range (IQR), calculated as the difference between the third and first quartiles, and (2) the (empirical) median absolute deviation (MAD), calculated as
MAD = c · median_i( |xi − median_j(xj)| ),
where most software programs (including R) use c = 1.4826. The choice of c is such that for normally distributed random variables we have an unbiased estimator, i.e., E(MAD) = σ for the MAD seen as an estimator. Since for normally distributed random variables IQR = 2Φ⁻¹(3/4)·σ ≈ 1.349σ, IQR/1.349 is an estimator of σ, for the IQR seen as an estimator.
Example 6.1. Let the values 1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 2.7 be given. Unfortunately, we have
entered the final number as 27. R-Code 6.1 compares several statistics (for location and scale)
and illustrates the effect of a single outlier on the estimates. ♣
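A sketch of such a comparison (the original R-Code 6.1 may present the output differently):
x    <- c(1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 2.7)
xerr <- c(1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 27)    # last value mistyped as 27
rbind(correct  = c(mean(x), median(x), sd(x), IQR(x), mad(x)),        # columns: mean, median,
      mistyped = c(mean(xerr), median(xerr), sd(xerr), IQR(xerr), mad(xerr)))  # sd, IQR, mad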
The estimators of the trimmed mean or the median do not possess simple distribution functions and for this reason the corresponding confidence intervals are not easy to calculate. If we assume that the distribution of a robust estimator is approximately Gaussian (for large samples), we can calculate approximate confidence intervals based on
robust estimator ± z_{α/2} · √( Var̂(robust estimator) / n ),   (6.3)
which is of course equivalent to θ̂ ± z_{α/2} SE(θ̂) for a robust estimator θ̂. Note that we have
b Note that we have
deliberately put a hat on the variance term in (6.3) as the variance often needs to be estimated
as well (which is reflected in a precise definition of the standard error). For example, the R
expression median( x)+c(-2,2)*mad( x)/sqrt(length( x)) yields an approximate empirical
95% confidence interval for the median.
A second disadvantage of robust estimators is their lower efficiency, i.e., these estimators
have larger variances. Formally, the efficiency is the ratio of the variance of one estimator to the
variance of the second estimator.
In some cases the exact variance of robust estimators can be determined; often only approximations or asymptotic results exist. For a continuous random variable with cdf F(x), the median is asymptotically normally distributed around the true median η = Q(1/2) = F⁻¹(1/2) with variance (4n f(η)²)⁻¹, where f(x) is the density function. The following example illustrates this result and R-Code 6.2 compares the efficiency of two estimators based on repeated sampling.
Example 6.2. Let X1, . . . , X10 ~iid N(0, σ²). We simulate realizations of this random sample and calculate the empirical mean and median of the sample. We repeat this R = 1000 times. Figure 6.1 shows the histogram of the means and medians including a (smoothed) empirical density. The histogram and empirical density of the median are wider and thus the mean is more efficient. For this particular example, the empirical efficiency is roughly 72%. Because the density is symmetric, η = µ = 0 and thus the asymptotic efficiency is
(σ²/n) / ( 1/(4n f(0)²) ) = (σ²/n) · 4n · (1/(√(2π) σ))² = 2/π ≈ 64%.   (6.4)
Of course, if we change the distribution of X1, . . . , X10, the efficiency changes. For example, let us consider the case of a t-distribution with 4 degrees of freedom, a density with heavier tails than the normal. Now the empirical efficiency for sample size n = 10 is 1.26, which means that the median performs better than the mean. ♣
R-Code 6.2 Distribution of empirical mean and median, see Example 6.2. (See Figure 6.1.)
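A sketch of the simulation (the original R-Code 6.2 additionally draws the histograms and densities of Figure 6.1):
set.seed(14)
R <- 1000; n <- 10
sam <- matrix(rnorm(R * n), nrow = R)
means   <- apply(sam, 1, mean)
medians <- apply(sam, 1, median)
var(means) / var(medians)     # empirical efficiency, roughly 0.7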
Robust estimates have the advantage that outliers do not have to be identified and eliminated before estimation.
The decision as to whether a realization of a random sample contains outliers is not always easy and some care is needed. For example, for all distributions with values in R, observations will lie outside the whiskers of a box plot when n is sufficiently large and are thus "marked" as outliers. Obvious outliers are easy to identify and eliminate, but in less clear cases robust estimation methods are preferred.
Figure 6.1: Comparing the efficiency of the mean and median. Medians in yellow with
a red smoothed empirical density, means in black. (See R-Code 6.2.)
Outliers can be very difficult to recognize in multivariate random samples, because they are
not readily apparent with respect to the marginal distributions. Robust methods for random
vectors exist, but are often computationally intensive and not as intuitive as for scalar values.
It has to be added that independent of the estimation procedures, if an EDA finds outliers,
these should be noted and scrutinized.
In Chapter 4 we considered tests to compare means. These tests assume normally distributed
data for exact results. Slight deviations from a normal distribution typically have negligible
consequences, as the central limit theorem reassures us that the mean is approximately normal.
However, if outliers are present, or the data are skewed, or the data are measured on the ordinal
scale, the use of so-called ‘rank-based’ tests is recommended. Classical tests typically assume a
distribution that is parametrized (e.g., µ, σ² in N(µ, σ²) or p in Bin(n, p)). Rank-based tests do
not prescribe a detailed distribution and are thus also called non-parametric tests.
The rank of a value in a sequence is the position (order) of that value in the ordered sequence
(from smallest to largest). In particular, the smallest value has rank 1 and the largest rank n.
In the case of ties, the arithmetic mean of the ranks is used.
Example 6.3. The values 1.1, −0.6, 0.3, 0.1, 0.6, 2.1 have ranks 5, 1, 3, 2, 4 and 6. However,
the ranks of the absolute values are 5, (3+4)/2, 2, 1, (3+4)/2 and 6. ♣
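In R, ranks are computed with rank(), which by default averages tied ranks; for the values of Example 6.3:

x <- c(1.1, -0.6, 0.3, 0.1, 0.6, 2.1)
rank(x)        # 5 1 3 2 4 6
rank(abs(x))   # 5 3.5 2 1 3.5 6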
Rank-based tests only consider the ranks of the observations or of the differences, not the
observation value itself or the magnitude of the differences between the observations. The largest
value always has the same rank and therefore always has the same influence on the test statistic.
We now introduce two classical rank tests (i) the Mann–Whitney U test (Wilcoxon–Mann–
Whitney U test) and (ii) the Wilcoxon test, i.e., rank-based versions of Test 2 and Test 3
respectively.
To motivate this test, assume that we have two samples with equal sample sizes available. The
idea is that if both samples come from one common underlying density, the observations mingle nicely and hence
the ranks are comparable. Alternatively, if the first sample has a much smaller median
(or mean), then the ranks of the first sample would be smaller than those of the sample with the
larger median (or mean).
When using rank tests, the symmetry assumption is dropped and we test whether the samples
come from the same distribution; under the alternative, the two distributions have the same shape but are shifted.
The Wilcoxon–Mann–Whitney test can thus be interpreted as comparing the medians of the
two populations, see Test 7.
The quantile, density and distribution functions of the test statistic U are implemented in
R with [q,d,p]wilcox. For example, the critical value Ucrit (nx , ny ; α/2) mentioned in Test 7 is
qwilcox( .025, nx, ny) for α = 5% and corresponding p-value 2*pwilcox( Uobs, nx, ny).
Decision: Reject H0 : “medians are the same” if Uobs < Ucrit (nx , ny ; α/2), where
Ucrit is the critical value.
It is possible to approximate the distribution of the test statistic by a Gaussian one. The U
statistic value is then transformed by
\[
z_{\text{obs}} = \frac{U_{\text{obs}} - \dfrac{n_x n_y}{2}}{\sqrt{\dfrac{n_x n_y (n_x + n_y + 1)}{12}}}, \tag{6.5}
\]
where nx ny /2 is the mean of U and the denominator is the standard deviation. This value is
then compared with the respective quantile of the standard normal distribution. The normal
approximation may be used with sufficiently large samples, nx ≥ 2 and ny ≥ 8. With additional
continuity corrections, the approximation may be improved.
To construct confidence intervals, the argument conf.int=TRUE must be used in the function
wilcox.test, and conf.level needs to be specified unless α = 5% is desired. The numerical
values of the confidence interval are accessed with the list element $conf.int.
In case of ties, R may not be capable of calculating exact p-values and thus will issue a warning.
The warning can be avoided by not requiring exact p-values, through the argument
exact=FALSE.
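As a brief illustration (with two hypothetical samples x and y, not data from the script):

set.seed(2)
x <- rnorm(12, mean = 1); y <- rnorm(12)
wt <- wilcox.test(x, y, conf.int = TRUE, conf.level = 0.95, exact = FALSE)
wt$conf.int    # confidence interval for the location shift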
The quantile, density and distribution functions of the Wilcoxon signed rank test statistic
are implemented in R with [q,d,p]signrank. For example, the critical value Wcrit (n; α/2)
mentioned in Test 8 is qsignrank( .025, n) for α = 5% and the corresponding p-value is
2*psignrank( Wobs, n).
A normal approximation analogous to (6.5) transforms the test statistic to
\[
z_{\text{obs}} = \frac{W_{\text{obs}} - \dfrac{n^\star(n^\star+1)}{4}}{\sqrt{\dfrac{n^\star(n^\star+1)(2n^\star+1)}{24}}}, \tag{6.6}
\]
and then $z_{\text{obs}}$ is compared with the corresponding quantile of the standard normal distribution.
This approximation may be used when the sample is sufficiently large; as a rule of thumb,
$n^\star \geq 20$.
Example 6.4. We consider again the podo data as introduced in Example 4.1. R-Code 6.3
performs various rank tests (comparing the expected median with a theoretical value, comparing
the median of two samples, comparing the medians of two paired samples). As expected, the
p-values are similar to those obtained with “classical” t-tests in Chapter 4. Because ties may
exist, we use the argument exact=FALSE to avoid the warnings that arise when exact p-values are requested.
The advantage of robust methods becomes clear when the first value is changed from 3.75
to 37.5, as shown towards the end of the same R-Code. While the p-value of the signed rank
test hardly changes, the one from the paired two-sample t-test changes from 0.5 (see
R-Code 4.5) to 0.31. More importantly, the confidence intervals are now considerably different as
the outlier inflated the estimated standard deviation of the t-test. In other situations, it is quite
likely that with or without a particular “outlier” the p-value falls below the magical threshold α
(recall the discussion of Section 4.5.3).
Of course, a corrupt value as introduced in this example would be detected with a proper
EDA of the data (scales are within zero and ten). ♣
R-Code 6.3: Rank tests and comparison of paired tests with a corrupted observation.
# Possibly reload the 'podo.csv' data and construct the variables as in Example 4.1
wilcox.test( PDHmean, mu=3.333, exact=FALSE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: PDHmean
## V = 133, p-value = 0.008
## alternative hypothesis: true location is not equal to 3.333
We conclude the chapter with two additional tests, one relying on very few statistical assumptions
and one being more of a toolbox for constructing arbitrary tests.
(3) If the medians of both samples are the same, k is a realization from a
binomial distribution $\operatorname{Bin}(n^\star, p = 0.5)$.
Decision: Reject $H_0$: p = 0.5 (i.e., the medians are the same) if $k > b_{\text{crit}} = b(n^\star, 0.5, 1-\alpha/2)$
or $k < b_{\text{crit}} = b(n^\star, 0.5, \alpha/2)$.
Calculation in R:
binom.test( sum( d>0), sum( d!=0), conf.level=1-alpha)
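For illustration, a minimal sketch with hypothetical paired differences d and α = 5%:

d <- c(0.3, -0.1, 0.8, 0.4, 0.0, 1.2, 0.5, -0.2)   # hypothetical differences
alpha <- 0.05
binom.test(sum(d > 0), sum(d != 0), conf.level = 1 - alpha)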
Permutation tests can be used to answer many different questions. They are based on the idea
that, under the null hypothesis, the samples being compared are the same. In this case, the
result of the test would not change if the values of both samples were to be randomly reassigned
to either group. In R-Code 6.5 we show an example of a permutation test for comparing the
means of two independent samples.
require(coin)
oneway_test( PDHmean ~ as.factor(Visit), data=podo)
##
## Asymptotic Two-Sample Fisher-Pitman Permutation Test
##
## data: PDHmean by as.factor(Visit) (1, 13)
## Z = -0.707, p-value = 0.48
## alternative hypothesis: true mu is not equal to 0
Assumptions: The null hypothesis is formulated, such that the groups, under H0 ,
are exchangeable.
Calculation: (1) Calculate the difference dobs in the means of the two groups to
be compared (m observations in group 1, n observations in group 2).
(2) Form a random permutation of the values of both groups by randomly
allocating the observed values to the two groups (m observations in
group 1, n observations in group 2). There are $\binom{m+n}{n}$ possibilities.
Permutation tests are straightforward to implement manually and thus are often used in
settings where the distribution of the test statistic is complex or even unknown.
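A minimal sketch of such a manual permutation test for a difference in means (hypothetical samples x and y; the median or any other statistic could be used instead):

set.seed(3)
x <- rnorm(10, mean = 1); y <- rnorm(12)
d_obs <- mean(x) - mean(y)              # observed difference in means
pooled <- c(x, y); m <- length(x)
d_sim <- replicate(1000, {
  perm <- sample(pooled)                # randomly reassign all values to the two groups
  mean(perm[1:m]) - mean(perm[-(1:m)])
})
mean(abs(d_sim) >= abs(d_obs))          # two-sided permutation p-value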
Note that the package exactRankTests will no longer be further developed. Therefore, use of
the function perm.test is discouraged. Functions within the package coin can be used instead
and this package includes extensions to other rank-based tests.
Problem 6.2 (Rank and permutation tests) Download the water_transfer.csv data from the
course web page and read it into R with read.csv(). The data describes tritiated water diffusion
across human chorioamnion and were taken from Hollander & Wolfe (1999), Nonparametric
Statistical Methods, Table 4.1, page 110. The pd values for age = "At term" and age =
"12-26 Weeks" are denoted with yA and yB , respectively. We will statistically determine whether
the yA values are “different” from yB values or not. That means we test whether there is a shift
in the distribution of the second group compared to the first.
i) Use a Wilcoxon-Mann-Whitney test to test for a shift in the groups. Interpret the results.
ii) Now, use a permutation test as implemented by the function wilcox_test() from R pack-
age coin to test for a potential shift. Compare to (a).
iii) Under the null hypothesis, we are allowed to permute the observations (all y-values) while
keeping the group assignments fix. Keeping this in mind, we will now manually construct
a permutation test to detect a potential shift. Write an R function perm_test() that
implements a two-sample permutation test and returns the p-value. Your function should
execute the following steps.
• Compute the test statistic $t_{\text{obs}} = \tilde{y}_A - \tilde{y}_B$, where $\tilde{\cdot}$ denotes the empirical median.
• Then repeat many (n = 1000) times
– Randomly assign all the values of pd to two groups xA and xB of the same size
as yA and yB .
– Store the test statistic $t_{\text{sim}} = \tilde{x}_A - \tilde{x}_B$.
• Return the two-sided p-value, i.e., the number of permuted test statistics $t_{\text{sim}}$ that
are smaller than or equal to $-|t_{\text{obs}}|$, or larger than or equal to $|t_{\text{obs}}|$, divided by the total
number of permutations (in our case n = 1000).
Chapter 7

Multivariate Normal Distribution
Describe a random vector, cdf, pdf of a random vector and its properties
Give the definition and intuition of E, Var and Cov for a random vector
Explain the relationship between the eigenvalues and eigenvectors of the covariance
matrix and the shape of the density function.
In Chapter 2 we have introduced univariate random variables. We now extend the framework
to random vectors (i.e., multivariate random variables). In the framework of this document, we
can only cover a tiny part of the beautiful theory and thus we will mainly focus on continuous
random vectors, especially Gaussian random vectors. We are pragmatic and discuss what will
be needed in the sequel.
Definition 7.1. The multivariate (or multidimensional) distribution function of a random vector
$\mathbf{X}$ is defined as
\[
F_{\mathbf{X}}(\mathbf{x}) = P(\mathbf{X} \le \mathbf{x}) = P(X_1 \le x_1, \dots, X_p \le x_p). \tag{7.1}
\]
The multivariate distribution function generally contains more information than the set of
marginal distribution functions $P(X_i \le x_i)$, because (7.1) only simplifies to
$F_{\mathbf{X}}(\mathbf{x}) = \prod_{i=1}^{p} P(X_i \le x_i)$
under independence of all random variables $X_i$ (compare to Equation (2.22)).
Definition 7.2. The probability density function (or density function, pdf) $f_{\mathbf{X}}(\mathbf{x})$ of a p-dimensional
continuous random vector $\mathbf{X}$ is defined by
\[
P(\mathbf{X} \in A) = \int_A f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x}, \qquad \text{for all } A \subset \mathbb{R}^p. \tag{7.2}
\]
For convenience, we summarize here a few facts of random vectors with two continuous
components, i.e., for a bivariate random vector (X, Y )> . The univariate counterparts are stated
in Properties 2.1 and 2.3.
• $f_{X,Y}(x, y) = \dfrac{\partial^2}{\partial x\,\partial y}\, F_{X,Y}(x, y)$.
• $P(a < X \le b,\; c < Y \le d) = \displaystyle\int_a^b\!\!\int_c^d f_{X,Y}(x, y)\,dy\,dx
  = F_{X,Y}(b, d) - F_{X,Y}(b, c) - F_{X,Y}(a, d) + F_{X,Y}(a, c)$.
In the multivariate setting there is also the concept termed marginalization, i.e., reducing a
higher-dimensional random vector to a lower-dimensional one. Intuitively, we “neglect” components
of the random vector by allowing them to take any value. In two dimensions, we have, for example,
$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$.
Hence the expectation of a random vector is simply the vector of the individual expectations.
Of course, to calculate these, we only need the marginal univariate densities fXi (x) and thus the
expectation does not change whether (7.1) can be factored or not. The expectation of products
of random variables is defined as
\[
\operatorname{E}(X_1 X_2) = \int\!\!\int x_1 x_2\, f(x_1, x_2)\,dx_1\,dx_2 \tag{7.4}
\]
(for continuous random variables). The variance of a random vector requires a bit more thought
and we first need the following.
Definition 7.4. The covariance between two arbitrary random variables X1 and X2 is defined
as
\[
\operatorname{Cov}(X_1, X_2) = \operatorname{E}\bigl((X_1 - \operatorname{E}(X_1))(X_2 - \operatorname{E}(X_2))\bigr) = \operatorname{E}(X_1 X_2) - \operatorname{E}(X_1)\operatorname{E}(X_2). \tag{7.5}
\]
Using the linearity properties of the expectation operator, it is possible to show the following
handy properties.
i) Cov(X1 , X2 ) = Cov(X2 , X1 ),
The covariance describes the linear relationship between the random variables. The correla-
tion between two random variables X1 and X2 is defined as
\[
\operatorname{Corr}(X_1, X_2) = \frac{\operatorname{Cov}(X_1, X_2)}{\sqrt{\operatorname{Var}(X_1)\operatorname{Var}(X_2)}} \tag{7.6}
\]
and corresponds to the normalized covariance. It holds that $-1 \le \operatorname{Corr}(X_1, X_2) \le 1$, with
equality only in the degenerate case $X_2 = a + bX_1$ for some $a$ and $b \neq 0$.
Definition 7.5. The variance of a p-variate random vector $\mathbf{X} = (X_1, \dots, X_p)^\top$ is defined as
$\operatorname{Var}(\mathbf{X}) = \operatorname{E}\bigl((\mathbf{X} - \operatorname{E}(\mathbf{X}))(\mathbf{X} - \operatorname{E}(\mathbf{X}))^\top\bigr)$,
the $p \times p$ matrix of all pairwise covariances, also called the covariance matrix.
The covariance matrix is a symmetric matrix and – except for degenerate cases – a positive
definite matrix. We will not consider degenerate cases and thus we can assume that the inverse
of the matrix Var(X) exists and is called the precision.
Similar to Properties 2.5, we have the following properties for random vectors.
Property 7.2. For an arbitrary p-variate random vector $\mathbf{X}$, (fixed) vector $\mathbf{a} \in \mathbb{R}^q$ and matrix
$B \in \mathbb{R}^{q \times p}$ it holds:
$\operatorname{E}(\mathbf{a} + B\mathbf{X}) = \mathbf{a} + B\operatorname{E}(\mathbf{X})$ and
$\operatorname{Var}(\mathbf{a} + B\mathbf{X}) = B\operatorname{Var}(\mathbf{X})B^\top$.
Definition 7.6. The random variable pair (X, Y ) has a bivariate normal distribution if
\[
F_{X,Y}(x, y) = \int_{-\infty}^{x}\!\int_{-\infty}^{y} f_{X,Y}(u, v)\,dv\,du \tag{7.9}
\]
with density
\[
f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}
\exp\biggl(-\frac{1}{2(1-\rho^2)}\Bigl(\frac{(x-\mu_x)^2}{\sigma_x^2}
- 2\rho\,\frac{(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}
+ \frac{(y-\mu_y)^2}{\sigma_y^2}\Bigr)\biggr) \tag{7.10}
\]
for all x and y and where µx ∈ R, µy ∈ R, σx > 0, σy > 0 and −1 < ρ < 1. ♦
The role of some of the parameters µx , µy , σx , σy and ρ might be guessed. We will discuss
their precise meaning after the following example.
Example 7.1. R-Code 7.1 and Figure 7.1 show the density of a bivariate normal distribution
with $\mu_x = \mu_y = 0$, $\sigma_x = 1$, $\sigma_y = \sqrt{5}$, and $\rho = 2/\sqrt{5} \approx 0.9$. Because of the quadratic form
in (7.10), the contour lines (isolines) are ellipses.
Several R packages implement the bivariate/multivariate normal distribution. We recommend
the package mvtnorm. ♣
require( mvtnorm)
require( fields) # providing tim.colors() and image.plot()
Sigma <- array( c(1,2,2,5), c(2,2))
x <- y <- seq( -3, to=3, length=100)
grid <- expand.grid( x=x, y=y)
densgrid <- dmvnorm( grid, mean=c(0,0), sigma=Sigma)
density <- array( densgrid, c(100,100))
image.plot(x, y, density, col=tim.colors()) # left panel
Property 7.3. For the bivariate normal distribution we have: The marginal distributions are
X ∼ N (µx , σx2 ) and Y ∼ N (µy , σy2 ) and
\[
\operatorname{E}\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \qquad
\operatorname{Var}\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}. \tag{7.11}
\]
Thus, the parameter ρ is the correlation between X and Y, and in the bivariate normal case ρ = 0 implies independence of X and Y.
[Figure 7.1: density (left) and cumulative distribution function (right) of the bivariate normal distribution of Example 7.1, shown over x and y; see R-Code 7.1.]
Note, however, that the equivalence of independence and uncorrelatedness is specific to jointly
normal variables and cannot be assumed for random variables that are not jointly normal.
Example 7.2. R-Code 7.2 and Figure 7.2 show realizations from a bivariate normal distribution
for various values of the correlation ρ. Even for the large sample size shown here (n = 500), correlations
between −0.25 and 0.25 are barely perceptible. ♣
R-Code 7.2 Realizations from a bivariate normal distribution for various values of ρ,
termed binorm (See Figure 7.2.)
set.seed(12)
rho <- c(-.25, 0, .1, .25, .75, .9)
for (i in 1:6) {
Sigma <- array( c(1, rho[i], rho[i], 1), c(2,2))
sample <- rmvnorm( 500, sigma=Sigma)
plot(sample, pch='.', xlab='', ylab='')
legend( "topleft", legend=bquote(rho==.(rho[i])), bty='n')
}
Figure 7.2: Realizations from a bivariate normal distribution. (See R-Code 7.2.)
Definition 7.7. The random vector X = (X1 , . . . , Xp )> is multivariate normally distributed if
\[
F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_1}\!\!\cdots\!\int_{-\infty}^{x_p} f_{\mathbf{X}}(x_1, \dots, x_p)\,dx_1 \dots dx_p \tag{7.13}
\]
with density
\[
f_{\mathbf{X}}(x_1, \dots, x_p) = f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\det(\Sigma)^{1/2}}
\exp\Bigl(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\Bigr) \tag{7.14}
\]
for all x ∈ Rp (with µ ∈ Rp and symmetric, positive-definite Σ). We denote this distribution
with X ∼ Np (µ, Σ). ♦
Property 7.4. For the multivariate normal distribution we have:
\[
\mathbf{a} + B\mathbf{X} \sim \mathcal{N}_q\bigl(\mathbf{a} + B\boldsymbol{\mu},\, B\Sigma B^\top\bigr). \tag{7.16}
\]
This last property has profound consequences. It also asserts that the one-dimensional
marginal distributions are again Gaussian with $X_i \sim \mathcal{N}\bigl((\boldsymbol{\mu})_i, (\Sigma)_{ii}\bigr)$, $i = 1, \dots, p$. Similarly,
any subset and any (non-degenerate) linear combination of random variables of X is again Gaus-
sian with appropriate subset selection of the mean and covariance matrix.
We now discuss how to draw realizations from an arbitrary Gaussian random vector, much
in the spirit of Property 2.7.ii). Let $I \in \mathbb{R}^{p\times p}$ be the identity matrix, a square matrix which has
only ones on the main diagonal and zeros elsewhere, and let $L \in \mathbb{R}^{p\times p}$ be such that $LL^\top = \Sigma$.
That means, L is like a “matrix square root” of Σ.
To draw a realization $\mathbf{x}$ from a p-variate random vector $\mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \Sigma)$, one starts by
drawing p values from $Z_1, \dots, Z_p \overset{\text{iid}}{\sim} \mathcal{N}(0, 1)$ and sets $\mathbf{z} = (z_1, \dots, z_p)^\top$. The vector is then
(linearly) transformed with $\boldsymbol{\mu} + L\mathbf{z}$. Since $\mathbf{Z} \sim \mathcal{N}_p(\mathbf{0}, I)$, Property 7.5 asserts that $\mathbf{X} = \boldsymbol{\mu} + L\mathbf{Z} \sim \mathcal{N}_p(\boldsymbol{\mu}, LL^\top)$.
In practice, the Cholesky decomposition of Σ is often used. This decomposes a symmetric
positive-definite matrix into the product of a lower triangular matrix L and its transpose. It
holds that $\det(\Sigma) = \det(L)^2 = \prod_{i=1}^{p} (L)_{ii}^2$.
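A minimal sketch of this construction (using the mean and covariance matrix that also appear in R-Code 7.3):

mu <- c(2, 1)
Sigma <- matrix(c(4, 2, 2, 2), 2)
L <- t(chol(Sigma))        # chol() returns the upper triangular factor, hence the transpose
z <- rnorm(length(mu))     # iid standard normal draws
x <- drop(mu + L %*% z)    # one realization of X ~ N_2(mu, Sigma)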
Property 7.6. If one conditions a multivariate normally distributed random vector (7.18) on a
sub-vector, the result is itself multivariate normally distributed: writing $\mathbf{X} = (\mathbf{X}_1^\top, \mathbf{X}_2^\top)^\top$ with
corresponding partitions of $\boldsymbol{\mu}$ and $\Sigma$,
\[
\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \;\sim\; \mathcal{N}\bigl(\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\;
\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\bigr). \tag{7.19}
\]
Equation (7.19) is probably one of the most important formulas you will encounter in statistics,
albeit not always explicitly. It is illustrated in Figure 7.3 for the case of p = 2.
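As a small numerical sketch of (7.19) in the bivariate case (µ and Σ as in R-Code 7.3; the conditioning value x2 is arbitrary):

mu <- c(2, 1); Sigma <- matrix(c(4, 2, 2, 2), 2)
x2 <- 0
mu[1] + Sigma[1, 2] / Sigma[2, 2] * (x2 - mu[2])   # conditional mean of X1 given X2 = x2
Sigma[1, 1] - Sigma[1, 2]^2 / Sigma[2, 2]          # conditional variance of X1 given X2 = x2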
Figure 7.3: Graphical illustration of the conditional distribution of a bivariate normal
random vector. Blue: bivariate density with isolines indicating quartiles; cyan: marginal
densities; red: conditional densities. The respective means are indicated in green. The
heights of the univariate densities are exaggerated by a factor of five.
Remark 7.1. Actually, it is possible to show that these former two estimators in (7.20) are
unbiased estimators of E(X) and Var(X) for an arbitrary sample of random vectors. ♣
Example 7.3. Similarly to R-Code 7.2, we generate bivariate realizations with different sample
sizes (n = 10, 50, 100, 500). We estimate the mean vector and covariance matrix according
to (7.21); from these we can calculate the corresponding isolines of the bivariate normal density
(with plug-in estimates for µ and Σ). Figure 7.4 (based on R-Code 7.3) shows the esti-
mated 95% and 50% confidence regions (isolines). As n increases, the estimation improves, i.e.,
the estimated ellipses are closer to the ellipses based on the true (unknown) parameters. ♣
R-Code 7.3 Bivariate normally distributed random numbers for various sample sizes with
contour lines of the density and estimated moments. (See Figure 7.4.)
set.seed( 14)
require( ellipse)
n <- c( 10, 50, 100, 500)
mu <- c(2, 1) # theoretical mean
Sigma <- matrix( c(4, 2, 2, 2), 2) # and covariance matrix
cov2cor( Sigma)[2] # equal to sqrt(2)/2
## [1] 0.70711
for (i in 1:4) {
plot(ellipse( Sigma, cent=mu, level=.95), col='gray',
xaxs='i', yaxs='i', xlim=c(-4, 8), ylim=c(-4, 6), type='l')
lines( ellipse( Sigma, cent=mu, level=.5), col='gray')
sample <- rmvnorm( n[i], mean=mu, sigma=Sigma)
points( sample, pch='.', cex=2)
Sigmahat <- cov( sample) # var( sample) # is identical
muhat <- colMeans( sample) # apply( sample, 2, mean) # is identical
lines( ellipse( Sigmahat, cent=muhat, level=.95), col=2, lwd=2)
lines( ellipse( Sigmahat, cent=muhat, level=.5), col=4, lwd=2)
points( rbind( muhat), col=3, cex=2)
text( -2, 4, paste('n =',n[i]))
}
muhat # Estimates for n=500
## [1] 2.02040 0.95478
Sigmahat
## [,1] [,2]
## [1,] 4.1921 2.0638
## [2,] 2.0638 2.0478
cov2cor( Sigmahat)[2]
## [1] 0.70437
Figure 7.4: Bivariate normally distributed random numbers. The contour lines of
the (theoretical) density are in gray, the isolines corresponding to the estimated 95% (50%)
probability in red (blue), and the empirical mean in green. (See R-Code 7.3.)
Hints:
• E(X> ) = E(X)> .
• E(E(X)) = E(X), because E(X) is a constant (non-stochastic) vector.
• For a given random vector $\mathbf{Y}$ and conformable matrices C and D (non-stochastic), it
holds that $\operatorname{E}(C\mathbf{Y}D) = C\,\operatorname{E}(\mathbf{Y})\,D$.
Problem 7.2 (Bivariate normal distribution) Consider the random sample $\mathbf{X}_1, \dots, \mathbf{X}_n \overset{\text{iid}}{\sim} \mathcal{N}_2(\boldsymbol{\mu}, \Sigma)$ with
\[
\mathbf{X}_i = \begin{pmatrix} X_{i,1} \\ X_{i,2} \end{pmatrix}, \qquad
\boldsymbol{\mu} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}.
\]
i) Explain in words that these estimators for µ and Σ “generalize” the univariate estimators
for µ and σ.
ii) Simulate n = 500 iid realizations from N2 (µ, Σ) using the function rmvnorm() from package
mvtnorm. Draw a scatter plot of the results and interpret the figure.
iii) Add contour lines of the density of X to the plot. Calculate an eigendecomposition of Σ
and place the two eigenvectors in the center of the ellipses.
iv) Estimate µ, Σ and the correlation between X1 and X2 from the 500 simulated values using
mean(), cov() and cor(), respectively.
v) Redo the simulation with several different covariance matrices, i.e., choose different values
as entries for the covariance matrices. What is the influence of the diagonal elements and
the off-diagonal elements of the covariance matrix on the shape of the scatter plot?
Chapter 8

A Closer Look: Correlation and Simple Regression

In this chapter (and the following three chapters) we consider “linear models.” A detailed
discussion of linear models would fill an entire lecture module, hence we consider here only the
most important elements.
The (simple) linear regression is commonly considered the archetypical task of statistics and
is often introduced as early as middle school.
In this chapter, we will (i) quantify the (linear) relationship between two variables, (ii) explain
one variable through another variable with the help of a “model”.
The goal of this section is to quantify the linear relationship between two random variables X
and Y with the help of n pairs of values $(x_1, y_1), \dots, (x_n, y_n)$, i.e., realizations of the two random
variables (X, Y).
An intuitive estimator of the correlation between the random variables X and Y (i.e., of the parameter ρ) is
\[
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2\,\sum_{i=1}^{n}(y_i - \bar{y})^2}}, \tag{8.1}
\]
which is also called the Pearson correlation coefficient. Just like the correlation, the Pearson
correlation coefficient also lies in the interval [−1, 1].
We will introduce a handy notation that is often used in the following:
\[
s_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \qquad
s_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad
s_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2. \tag{8.2}
\]
Hence, we can express (8.1) as $r = s_{xy}/\sqrt{s_{xx}s_{yy}}$. Further, an estimator for the covariance is
$s_{xy}/(n-1)$ (which would be an unbiased estimator).
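A small sketch (with hypothetical data) verifying this identity against cor():

set.seed(4)
x <- rnorm(20); y <- 0.5 * x + rnorm(20)
sxy <- sum((x - mean(x)) * (y - mean(y)))
sxx <- sum((x - mean(x))^2); syy <- sum((y - mean(y))^2)
c(manual = sxy / sqrt(sxx * syy), builtin = cor(x, y))   # identical values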
Alternatives to the Pearson correlation coefficient are so-called rank correlation coefficients, such
as Spearman’s ρ or Kendall’s τ, which are seen as non-parametric correlation estimates. In brief,
Spearman’s ρ is calculated similarly to (8.1), with the values replaced by their ranks.
Kendall’s τ compares the number of concordant pairs (if $x_i < x_j$ then $y_i < y_j$) and discordant pairs
(if $x_i < x_j$ then $y_i > y_j$).
R-Code 8.1 binorm data: Pearson correlation coefficient of the scatter plot from Figure 7.2.
Spearman’s ρ or Kendall’s τ are given as well.
require( mvtnorm)
set.seed(12)
rho <- array(0,c(4,6))
rownames(rho) <- c("rho","Pearson","Spearman","Kendall")
rho[1,] <- c(-.25, 0, .1, .25, .75, .9)
for (i in 1:6) {
Sigma <- array( c(1, rho[1,i], rho[1,i], 1), c(2,2))
sample <- rmvnorm( 500, sigma=Sigma)
rho[2,i] <- cor( sample)[2]
rho[3,i] <- cor( sample, method="spearman")[2]
rho[4,i] <- cor( sample, method="kendall")[2]
}
print( rho, digits=2)
## [,1] [,2] [,3] [,4] [,5] [,6]
## rho -0.25 0.000 0.10 0.25 0.75 0.90
## Pearson -0.22 0.048 0.22 0.28 0.78 0.91
## Spearman -0.22 0.066 0.18 0.26 0.77 0.91
## Kendall -0.15 0.045 0.12 0.18 0.57 0.74
Example 8.1. R-Code 8.1 estimates the correlation of the scatter plot data from Figure 7.2.
Although n = 500, in the case ρ = 0.1, the estimate is more than two times too large.
We also calculate Spearman’s ρ or Kendall’s τ for the same data. There is, of course, quite
some agreement between the estimates. ♣
Pearson’s correlation coefficient is not robust, while Spearman’s ρ and Kendall’s τ are “robust”;
see R-Code 8.2 and Figure 8.1, which compare the various correlation estimates of the so-called
anscombe data.
R-Code 8.2: anscombe data: visualization and correlation estimates. (See Figure 8.1.)
library( faraway)
data( anscombe)
head( anscombe, 3)
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
with( anscombe, { plot(x1, y1); plot(x2, y2); plot(x3, y3); plot(x4, y4) })
sel <- c(0:3*9+5) # extract diagonal entries of sub-block
print(rbind( pearson=cor(anscombe)[sel],
spearman=cor(anscombe, method='spearman')[sel],
kendall=cor(anscombe, method='kendall')[sel]), digits=2)
## [,1] [,2] [,3] [,4]
## pearson 0.82 0.82 0.82 0.82
## spearman 0.82 0.69 0.99 0.50
## kendall 0.64 0.56 0.96 0.43
Let us consider bivariate normally distributed random variables as discussed in the last chapter.
Naturally, r as given in (8.1) is an estimate of the correlation parameter ρ explicated in the
density (7.10). Let R be the corresponding estimator of ρ based on (8.1), i.e., replacing $(x_i, y_i)$
by $(X_i, Y_i)$. The random variable
\[
T = R\,\frac{\sqrt{n-2}}{\sqrt{1-R^2}} \tag{8.3}
\]
is, under H0 : ρ = 0, t-distributed with n − 2 degrees of freedom. The corresponding test is
described under Test 11.
In order to construct confidence intervals for correlation estimates, we typically need the
so-called Fisher transformation
\[
W(r) = \frac{1}{2}\log\Bigl(\frac{1+r}{1-r}\Bigr) = \operatorname{arctanh}(r) \tag{8.4}
\]
and the fact that, for bivariate normally distributed random variables, the distribution of W(R) is
approximately $\mathcal{N}\bigl(W(\rho), 1/(n-3)\bigr)$; a straightforward confidence interval can then be constructed.
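A minimal sketch of such an interval for hypothetical values r = 0.6 and n = 30:

r <- 0.6; n <- 30
W <- atanh(r)                                     # Fisher transformation (8.4)
ci_W <- W + c(-1, 1) * qnorm(0.975) / sqrt(n - 3)
tanh(ci_W)                                        # back-transformed 95% CI for rho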
Figure 8.1: anscombe data: the four cases all have the same Pearson correlation
coefficient of 0.82, yet the scatter plots show completely different relationships. (See
R-Code 8.2.)
Astoundingly large sample sizes are needed for correlation estimates around 0.25 to be sig-
nificant.
\[
\begin{aligned}
Y_i &= \mu_i + \varepsilon_i && (8.5)\\
    &= \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n, && (8.6)
\end{aligned}
\]
with
• $\beta_0$, $\beta_1$: parameters (unknown);
• $\varepsilon_i$: measurement error, error term, noise (unknown), with symmetric distribution around
$\operatorname{E}(\varepsilon_i) = 0$.
It is often also assumed that Var(εi ) = σ 2 and/or that the errors are independent of each other.
For simplicity, we assume $\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$ with unknown $\sigma^2$. Thus, $Y_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2)$,
$i = 1, \dots, n$, and $Y_i$ and $Y_j$ are independent for $i \neq j$.
Example 8.2. (hardness data) One of the steps in the manufacturing of metal springs is a
quenching bath. The temperature of the bath has an influence on the hardness of the springs.
Data is taken from Abraham and Ledolter (2006). Figure 8.2 shows the Rockwell hardness of
coil springs as a function of the temperature of the quenching bath, as well as the line of best fit.
The Rockwell scale measures the hardness of technical materials and is denoted by HR (Hardness
Rockwell). R-Code 8.3 shows how a simple linear regression for this data is performed with the
command lm and the ‘formula’ statement Hard~Temp. ♣
The main idea of regression is to determine estimates $\hat\beta_0$ and $\hat\beta_1$ that minimize the sum of squared errors.
That means that $\hat\beta_0$ and $\hat\beta_1$ are determined such that
\[
\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 \tag{8.7}
\]
R-Code 8.3 hardness data from Example 8.2, see Figure 8.2.
Figure 8.2: hardness data: hardness as a function of temperature (see R-Codes 8.3
and 8.4). The black line is the fitted regression line.
is minimized. This concept is also called the least squares method. The solution, i.e., the
estimated regression coefficients, are
\[
\widehat{\beta}_1 = \frac{s_{xy}}{s_{xx}} = r\,\sqrt{\frac{s_{yy}}{s_{xx}}}
= \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \tag{8.8}
\]
R-Code 8.4 hardness data from Example 8.2, see Figure 8.2.
where we have used a slightly different denominator than in Example 3.6.ii), to obtain an unbiased
estimate. The variance estimate $\hat\sigma^2$ is often termed mean squared error and its root, $\hat\sigma$, the
residual standard error.
Example 8.3. (hardness data Part 2) R-Code 8.4 illustrates how to access the estimates,
the fitted values and the residuals. An estimate of $\sigma^2$ can be obtained via resid( lm1) and
Equation (8.13) (e.g., sum(resid( lm1)^2)/12) or directly via summary(lm1)$sigma^2. ♣
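A small sketch computing the estimates of (8.8) “by hand” for simulated data and comparing them with lm() (hypothetical data, not the hardness data):

set.seed(5)
x <- 1:15; y <- 1 + 2 * x + rnorm(15, sd = 2)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope, Equation (8.8)
b0 <- mean(y) - b1 * mean(x)                                      # intercept
sigma2 <- sum((y - b0 - b1 * x)^2) / (length(x) - 2)              # mean squared error
c(b0, b1, sigma2)
coef(lm(y ~ x))                                                   # same b0 and b1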
In simple linear regression, the central task is often to determine whether there exists a linear
relationship between the dependent and independent variables. This can be tested with the
hypothesis H0: β1 = 0 (Test 12). We do not formally derive the test statistic here. The idea is
to replace in Equation (8.8) the observations $y_i$ with random variables $Y_i$ with distribution (8.6)
and derive the distribution of the resulting test statistic.
Comparing the expression of Pearson’s correlation coefficient (8.1) and the estimate of β1 (8.8),
it is not surprising that the p-values of Tests 12 and 11 coincide.
For prediction of the dependent variable at a given independent variable x0 , we plug in the
value x0 in Equation (8.11). The function predict can be used for prediction in R, as illustrated
in R-Code 8.5.
Prediction at a (potentially) new value x0 can also be written as
This equation is equivalent to Equation (7.19) but with estimates instead of (unknown) param-
eters.
The uncertainty of the prediction depends on the uncertainty of the estimated parameter.
Specifically:
\[
\operatorname{Var}(\hat\mu_0) = \operatorname{Var}(\hat\beta_0 + \hat\beta_1 x_0), \tag{8.15}
\]
which also depends on the variance of the error term. In general, however, βb0 and βb1 are not
independent and with matrix notation, the variance is easy to calculate, as will be illustrated in
Chapter 9.
To construct confidence intervals for a prediction, we must first discern whether the prediction
is for the mean response $\hat\mu_0$ or for an unobserved (e.g., future) observation $\hat y_0$ at $x_0$. The former
interval depends on the variability of the estimates $\hat\beta_0$ and $\hat\beta_1$. The latter interval depends on the
uncertainty of $\hat\mu_0$ and additionally on the variability of the error $\varepsilon$, i.e., $\hat\sigma^2$. Hence, the latter is
always wider than the former. In R, these two types are
requested — somewhat intuitively — with interval="confidence" and interval="prediction",
see R-Code 8.5. The confidence interval summary CI 7 gives the precise formulas. We will see
another, yet easier approach in Chapter 9.
R-Code 8.5 hardness data: predictions and pointwise confidence intervals. (See Fig-
ure 8.3.)
[Figure 8.3: hardness data with predictions and pointwise confidence and prediction intervals as a function of temperature; see R-Code 8.5.]
In both intervals we use estimators, i.e., all estimates $y_i$ are to be replaced with $Y_i$ to obtain estimators.
Hint: you might consider a parallel coordinate plot, as shown in Figure 1.11 in Chapter 1.
Problem 8.2 (Linear regression) In a simple linear regression, the data are assumed to follow
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$, $i = 1, \dots, n$. We simulate n = 15 data points from that
model with $\beta_0 = 1$, $\beta_1 = 2$, $\sigma = 2$ and the following values for $x_i$.
Hint: copy & paste the following lines into your R-Script.
## simulation of y values:
y <- beta0.true + x * beta1.true + rnorm(15, sd = 2)
data <- data.frame(x = x, y = y)
i) Plot the simulated data in a scatter plot. Calculate the Pearson correlation coefficient and
the Spearman’s rank correlation coefficient. Why do they agree well?
ii) Estimate the linear regression coefficients βb0 and βb1 using the formulas from the script.
Add the estimated regression line to the plot from (a).
iii) Calculate the fitted values Ybi for the data in x and add them to the plot from (a).
iv) Calculate the residuals $(y_i - \hat y_i)$ for all n points and the residual sum of squares
$SS = \sum_i (y_i - \hat y_i)^2$. Visualize the residuals by adding lines to the plot with segments(). Are
the residuals normally distributed? Do the residuals increase or decrease with the fitted
values?
v) Calculate standard errors for $\beta_0$ and $\beta_1$. For $\hat\sigma_\varepsilon = \sqrt{SS/(n-2)}$, they are given by
\[
\hat\sigma_{\beta_0} = \hat\sigma_\varepsilon\,\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}}, \qquad
\hat\sigma_{\beta_1} = \hat\sigma_\varepsilon\,\sqrt{\frac{1}{\sum_i (x_i - \bar{x})^2}}.
\]
vi) Give an empirical 95% confidence interval for β0 and β1 . (The degree of freedom is the
number of observations minus the number of parameters in the model.)
vii) Calculate the values of the t statistic for βb0 and βb1 and the corresponding two-sided p-
values.
viii) Verify your result with the R function lm() and the corresponding S3 methods summary(),
fitted(), residuals() and plot() (i.e., apply these functions to the returned object of
lm()).
ix) Use predict() to add a “confidence” and a “prediction” interval to the plot from (a). What
is the difference?
Hint: The meanings of "confidence" and "predict" here are based on the R function. Use
the help of those functions to understand their behaviour.
x) Fit a linear model without intercept (i.e., force β0 to be zero). Add the corresponding
regression line to the plot from (a). Discuss if the model fits “better” the data.
xii) What is the difference between a model with formula y ∼ x and x ∼ y? Explain it from
a stochastic and fitting perspective.
Chapter 9
Multiple Regression
– Multicollinearity
– Influential points
– Interactions between variables
– Categorical variables (factors)
– Model validation and information criterion (basic theory and R)
The simple linear regression model can be extended by the addition of further independent
variables. We first introduce the model and estimators. Subsequently, we become acquainted
with the most important steps in model validation. Two typical examples of multiple regression
are given. At the end, several typical examples of extensions of linear regression are illustrated.
\[
\begin{aligned}
Y_i &= \mu_i + \varepsilon_i, && (9.1)\\
    &= \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, && (9.2)\\
    &= \mathbf{x}_i^\top\boldsymbol{\beta} + \varepsilon_i, \qquad i = 1, \dots, n,\; n > p, && (9.3)
\end{aligned}
\]
with
• εi : (unknown) error term, noise, “measurement error”, with symmetric distribution around
zero, E(εi ) = 0.
It is often also assumed that Var(εi ) = σ 2 and/or that the errors are independent of each other.
To derive estimators with simple, closed-form distributions, we further assume that
$\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$, with unknown $\sigma^2$. In matrix notation, Equation (9.3) is written as
\[
\mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{9.4}
\]
The mean of the response varies, implying that $Y_1, \dots, Y_n$ are only independent and not iid.
We assume that the rank of X equals p + 1 ($\operatorname{rank}(X) = p + 1$, full column rank). This assumption
guarantees that the inverse of $X^\top X$ exists. In practical terms, this implies that we do not include
the same predictor twice, and that each predictor carries additional information on top of the already
included predictors.
We estimate the parameter vector $\boldsymbol{\beta}$ with the method of ordinary least squares (see Section 3.2.1).
That means the estimate $\widehat{\boldsymbol{\beta}}$ is such that the sum of the squared errors (residuals) is
minimal and is thus derived as follows:
\[
\begin{aligned}
\widehat{\boldsymbol{\beta}} &= \operatorname*{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top\boldsymbol{\beta})^2
= \operatorname*{argmin}_{\boldsymbol{\beta}}\, (\mathbf{y} - X\boldsymbol{\beta})^\top(\mathbf{y} - X\boldsymbol{\beta}) && (9.7)\\
\Rightarrow\;& \frac{d}{d\boldsymbol{\beta}}\, (\mathbf{y} - X\boldsymbol{\beta})^\top(\mathbf{y} - X\boldsymbol{\beta}) && (9.8)\\
&= \frac{d}{d\boldsymbol{\beta}}\,\bigl(\mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top X^\top\mathbf{y} + \boldsymbol{\beta}^\top X^\top X\boldsymbol{\beta}\bigr)
= -2X^\top\mathbf{y} + 2X^\top X\boldsymbol{\beta} && (9.9)\\
\Rightarrow\;& X^\top X\boldsymbol{\beta} = X^\top\mathbf{y} && (9.10)\\
\Rightarrow\;& \widehat{\boldsymbol{\beta}} = (X^\top X)^{-1}X^\top\mathbf{y} && (9.11)
\end{aligned}
\]
Equation (9.10) is also called the normal equation and Equation (9.11) indicates why we need
to assume full column rank of the matrix X.
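A short sketch solving the normal equations directly for a simulated design (hypothetical data) and comparing with lm():

set.seed(6)
n <- 50
X <- cbind(1, x1 = runif(n), x2 = runif(n))   # design matrix with intercept column
beta <- c(-1, 3, 1.5)
y <- drop(X %*% beta + rnorm(n, sd = 0.25))
solve(t(X) %*% X, t(X) %*% y)                 # beta-hat as in (9.11)
coef(lm(y ~ X[, -1]))                         # identical estimates via lm()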
We now derive the distributions of the estimator and other related and important vectors.
The derivation of the results are based directly on Property 7.5. Starting from the distributional
assumption of the errors (9.5), jointly with Equations (9.6) and (9.11), it can be shown that
where we term the matrix $H = X(X^\top X)^{-1}X^\top$ the hat matrix. In the left column we find the
estimates, in the right column the functions of the random samples. Notice the subtle difference
in the covariance matrices of the distributions of $\mathbf{Y}$ and $\widehat{\mathbf{Y}}$: the hat matrix H is not I, but hopefully
quite close to it. The latter would imply that the variances of $\mathbf{R}$ are close to zero.
The distribution of the coefficients will be used when interpreting a fitted regression model
(similarly as in the case of the simple regression). The marginal distributions of the individual
coefficients $\hat\beta_i$ are determined by the distribution (9.12): $\hat\beta_i \sim \mathcal{N}(\beta_i, \sigma^2 v_{ii})$, with $v_{ii}$ the
$i$th diagonal element of $(X^\top X)^{-1}$ (again a direct consequence of Property 7.5). Hence
$(\hat\beta_i - \beta_i)/\sqrt{\sigma^2 v_{ii}} \sim \mathcal{N}(0, 1)$. However, since $\sigma^2$ is
usually unknown, we use the unbiased estimate
\[
\hat\sigma^2 = \frac{1}{n-p-1}\sum_{i=1}^{n}(y_i - \hat y_i)^2 = \frac{1}{n-p-1}\,\mathbf{r}^\top\mathbf{r}, \tag{9.16}
\]
again termed mean squared error. Its square root is termed residual standard error (with n−p−1
degrees of freedom). Finally, we use the same approach when deriving the t-test in Equation (4.3)
and obtain
\[
\frac{\hat\beta_i - \beta_i}{\sqrt{\hat\sigma^2 v_{ii}}} \sim t_{n-p-1} \tag{9.17}
\]
as our statistic for testing and to derive confidence intervals about individual coefficients βi .
For testing, we are often interested in H0 : βi = 0. Confidence intervals are constructed along
equation (3.34) and summarized in the subsequent blue box.
Model validation verifies (i) the fixed components (or fixed part) µi and (ii) the stochastic
components (or stochastic part) εi and is typically an iterative process (arrow back to Propose
statistical model in Figure 1.1).
\[
\text{Coefficient of determination, or } R^2: \quad
R^2 = 1 - \frac{SS_E}{SS_T} = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}, \tag{9.19}
\]
\[
\text{Adjusted } R^2: \quad R^2_{\text{adj}} = R^2 - (1 - R^2)\,\frac{p}{n-p-1}, \tag{9.20}
\]
\[
\text{Observed value of the } F\text{-test:} \quad \frac{(SS_T - SS_E)/p}{SS_E/(n-p-1)}, \tag{9.21}
\]
where SS stands for sums of squares, and $SS_T$, $SS_E$ for the total sum of squares and the sum of squares
of the error, respectively. The last statistic quantifies how much of the variability in the data is explained
by the model; it is essentially equivalent to Test 4 and performs the omnibus test $H_0: \beta_1 =
\beta_2 = \dots = \beta_p = 0$. When we reject this test, this merely signifies that at least one of the
coefficients is significant, which is often not very informative.
A slightly more general version of the F -Test (9.21) is used to compare nested models. Let
M0 be the simpler model with only q out of the p predictors of the more complex model M1
(0 ≤ q < p). The test H0 : “M0 is sufficient” is based on the statistic
\[
\frac{(SS_{\text{simple model}} - SS_{\text{complex model}})/(p-q)}{SS_{\text{complex model}}/(n-p-1)} \tag{9.22}
\]
and often runs under the name ANOVA (analysis of variance). We see an alternative derivation
thereof in Chapter 10.
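In R, this comparison of nested models is carried out with anova(); a minimal sketch with two hypothetical fits:

set.seed(7)
n <- 50
x1 <- runif(n); x2 <- runif(n)
y <- -1 + 3 * x1 + 1.5 * x2 + rnorm(n, sd = 0.25)
m0 <- lm(y ~ x1)         # simpler model, q = 1 predictor
m1 <- lm(y ~ x1 + x2)    # more complex model, p = 2 predictors
anova(m0, m1)            # F-test of H0: "M0 is sufficient", cf. (9.22)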
In order to validate the fixed components of a model, it must be verified whether the necessary
predictors are in the model. We do not want too many, nor too few. Unnecessary predictors
are often identified through insignificant coefficients. When predictors are missing, the residuals
show (in the ideal case) structure, indicative for model improvement. In other cases, the quality
of the regression is low (F -Test, R2 (too) small). Example 9.1 below will illustrate the most
important elements.
Example 9.1. We construct synthetic data in order to better illustrate the difficulty of detecting
a suitable model. Table 9.1 gives the actual models and the five fitted models. In all cases we use
a small dataset of size n = 50 and predictors ($x_1$, $x_2$ and $x_3$) that we construct from a uniform
distribution. Further, we set $\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, 0.25^2)$. R-Code 9.1 and the corresponding Figure 9.1
illustrate how model deficiencies manifest.
We illustrate how residual plots may or may not show missing or unnecessary predictors.
Because of the ‘textbook’ example, the adjusted R2 values are very high and the p-value of the
F -Test is – as often in practice – of little value.
Since the output of summary() is quite long, here we show only elements from it. This is
achieved with the functions print() and cat(). For Examples 2 to 5 the output has been
constructed by a function call to subset_of_summary() constructing the output as the first
example.
The plots should supplement a classical graphical analysis through plot( res). ♣
Table 9.1: Fitted models for the five examples of Example 9.1. The true model is always
$Y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \varepsilon_i$.
R-Code 9.1: Illustration of missing and unnecessary predictors for an artificial dataset.
(See Figure 9.1.)
set.seed( 18)
n <- 50
x1 <- runif( n); x2 <- runif( n); x3 <- runif( n)
eps <- rnorm( n, sd=0.16)
y <- -1 + 3*x1 + 2.5*x1^2 + 1.5*x2 + eps
# Example 1: Correct model
sres <- summary( res <- lm( y ~ x1 + I(x1^2) + x2 ))
print( sres$coef, digits=2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.96 0.067 -14.4 1.3e-18
## x1 2.90 0.264 11.0 1.8e-14
## I(x1^2) 2.54 0.268 9.5 2.0e-12
## x2 1.49 0.072 20.7 5.8e-25
cat("Adjusted R-squared: ", formatC( sres$adj.r.squared),
" F-Test: ", pf(sres$fstatistic[1], 1, n-2, lower.tail = FALSE))
## Adjusted R-squared: 0.9927 F-Test: 6.5201e-42
plotit( res$fitted, res$resid, "fitted values") # Essentially equivalent to:
# plot( res, which=1, caption=NA, sub.caption=NA, id.n=0) with ylim=c(-1,1)
It is important to understand that the stochastic part εi does not only represent measurement
error. In general, the error is the remaining “variability” (also noise) that is not explained through
the predictors (“signal”).
Figure 9.1: Residual plots. Residuals versus fitted values (left column), predictor x1
(middle) and x2 (right column). The rows correspond to the different fitted models.
The panels in the left column have different scaling of the x-axis. (See R-Code 9.1.)
With respect to the stochastic part εi , the following points should be verified:
i) constant variance: if the (absolute) residuals are displayed as a function of the predictors,
the estimated values, or the index, no structure should be discernible. The observations
can often be transformed in order to achieve constant variance. Constant variance is also
called homoscedasticity and the terms heteroscedasticity or variance heterogeneity are used
otherwise.
More precisely, for heteroscedasticity we relax Model (9.3) to $\varepsilon_i \overset{\text{indep}}{\sim} \mathcal{N}(0, \sigma_i^2)$, i.e.,
$\boldsymbol{\varepsilon} \sim \mathcal{N}_n(\mathbf{0}, \sigma^2 V)$. In case the diagonal matrix V is known, we can use so-called weighted least
squares (WLS) by specifying the argument weights in R (weights=1/diag(V)).
iii) symmetric distribution: it is not easy to find evidence against this assumption. If the
distribution is strongly right- or left-skewed, the scatter plots of the residuals will have
structure. Transformations or generalized linear models may help. We have a quick look
at a generalized linear model in Section 9.4.
We assume that the distribution of the variables follows a known distribution with an unknown
parameter θ with p components. In maximum likelihood estimation, the larger the
likelihood function $L(\hat\theta)$ or, equivalently, the smaller the negative log-likelihood function $-\ell(\hat\theta)$,
the better the model. The oldest criterion was proposed as “an information criterion” in 1973
by Hirotugu Akaike and is known today as the Akaike information criterion (AIC):
\[
\mathrm{AIC} = -2\,\ell(\hat\theta) + 2p. \tag{9.23}
\]
In regression models with normally distributed errors, the maximized log-likelihood is linear in
$\log(\hat\sigma^2)$ and so the first term describes the goodness of fit.
The disadvantage of AIC is that the penalty term is independent of the sample size. The
Bayesian information criterion (BIC)
\[
\mathrm{BIC} = -2\,\ell(\hat\theta) + \log(n)\,p \tag{9.24}
\]
penalizes the model more heavily based on both the number of parameters p and the sample size n,
and its use is recommended.
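In R, both criteria are available through AIC() and BIC(), and step() performs a stepwise search; a brief sketch on a hypothetical fit:

set.seed(8)
n <- 50
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y <- -1 + 3 * x1 + 1.5 * x2 + rnorm(n, sd = 0.25)
full <- lm(y ~ x1 + x2 + x3)
AIC(full)
BIC(full)
step(full, k = log(n), trace = FALSE)   # backward elimination with the BIC penalty log(n)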
9.3 Examples
In this section we give two more examples of multiple regression problems, based on classical
datasets.
Example 9.2. (abrasion data) The data comes from an experiment investigating how rubber’s
resistance to abrasion is affected by the hardness of the rubber and its tensile strength (Cleveland,
1993). Each of the 30 rubber samples was tested for hardness and tensile strength, and then
subjected to steady abrasion for a fixed time.
R-Code 9.2 performs the regression analysis based on two predictors. The empirical confidence
intervals confint( res) do not contain zero. Accordingly, the p-values of the three t-tests are
small.
A quadratic term for strength is not necessary. However, the residuals appear to be slightly
correlated. We do not have further information about the data here and cannot investigate this
aspect further. ♣
Figure 9.2: abrasion data: EDA in form of a pairs plot. (See R-Code 9.2.)
R-Code 9.2: abrasion data: EDA, fitting a linear model and model validation. (See
Figures 9.2 and 9.3.)
# Fitted values
plot( loss~hardness, ylim=c(0,400), yaxs='i', data=abrasion)
points( res$fitted~hardness, col=4, data=abrasion)
plot( loss~strength, ylim=c(0,400), yaxs='i', data=abrasion)
points( res$fitted~strength, col=4, data=abrasion)
# Residuals vs ...
plot( res$resid~res$fitted)
lines( lowess( res$fitted, res$resid), col=2)
abline( h=0, col='gray')
plot( res$resid~hardness, data=abrasion)
lines( lowess( abrasion$hardness, res$resid), col=2)
abline( h=0, col='gray')
plot( res$resid~strength, data=abrasion)
lines( lowess( abrasion$strength, res$resid), col=2)
abline( h=0, col='gray')
plot( res$resid[-1]~res$resid[-30])
abline( h=0, col='gray')
Figure 9.3: abrasion data: model validation. Top row shows the loss (black) and
fitted values (blue) as a function of hardness (left) and strength (right). Middle and
bottom panels are different residual plots. (See R-Code 9.2.)
Example 9.3. (LifeCycleSavings data) Under the life-cycle savings hypothesis developed by
Franco Modigliani, the savings ratio (aggregate personal savings divided by disposable income) is
explained by per-capita disposable income, the percentage rate of change in per-capita disposable
income, and two demographic variables: the percentage of population less than 15 years old and
the percentage of the population over 75 years old. The data are averaged over the decade
1960–1970 to remove the business cycle or other short-term fluctuations.
The dataset contains information from 50 countries about these five variables: sr (aggregate personal savings ratio), pop15 (percentage of population under 15), pop75 (percentage of population over 75), dpi (real per-capita disposable income) and ddpi (percentage growth rate of dpi).
Scatter plots are shown in Figure 9.4. R-Code 9.3 fits a multiple linear model, selects models
through comparison of various goodness of fit criteria (AIC, BIC) and shows the model validation
plots for the model selected using AIC. The step function is a convenient way for selecting
relevant predictors. Figure 9.5 gives four of the most relevant diagnostic plots, obtained by
passing a fitted object to plot (compare with the manual construction of Figure 9.3).
Different models may result from different criteria: when using BIC for model selection,
pop75 drops out of the model. ♣
Figure 9.4: LifeCycleSavings data: scatter plots of the data. (See R-Code 9.3.)
R-Code 9.3: LifeCycleSavings data: EDA, linear model and model selection. (See
Figures 9.4 and 9.5.)
data( LifeCycleSavings)
head( LifeCycleSavings)
## sr pop15 pop75 dpi ddpi
## Australia 11.43 29.35 2.87 2329.68 2.87
## Austria 12.07 23.32 4.41 1507.99 3.93
## Belgium 13.17 23.80 4.43 2108.47 3.82
## Bolivia 5.75 41.89 1.67 189.13 0.22
## Brazil 12.88 42.19 0.83 728.47 4.56
## Canada 8.79 31.72 2.85 2982.88 2.43
pairs(LifeCycleSavings, upper.panel=panel.smooth, lower.panel=NULL, gap=0)
lcs.all <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
summary( lcs.all)
##
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.242 -2.686 -0.249 2.428 9.751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.566087 7.354516 3.88 0.00033 ***
## pop15 -0.461193 0.144642 -3.19 0.00260 **
## pop75 -1.691498 1.083599 -1.56 0.12553
## dpi -0.000337 0.000931 -0.36 0.71917
## ddpi 0.409695 0.196197 2.09 0.04247 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.8 on 45 degrees of freedom
## Multiple R-squared: 0.338,Adjusted R-squared: 0.28
## F-statistic: 5.76 on 4 and 45 DF, p-value: 0.00079
\[
Y_i \approx f(\mathbf{x}_i, \boldsymbol{\beta}), \tag{9.25}
\]
[Figure 9.5: model validation plots for the AIC-selected model (residuals vs. fitted values, standardized residuals and Cook’s distance); observations such as Zambia, Chile, the Philippines, Japan and Libya are flagged. See R-Code 9.3.]
For a Poisson random variable, the variance increases with increasing mean. Similarly, it is possible
that the variance of the residuals increases with increasing observations. Instead of “modeling”
increasing variances, transformations of the response variable often render the variances of the
residuals sufficiently constant. For example, instead of a linear model for $Y_i$, a linear model for
$\log(Y_i)$ is constructed.
In situations where we do not have count observations but nevertheless increasing variance
with increasing mean, a log-transformation is helpful. If the original data stems from a “truly”
linear model, a transformation typically leads to an approximation. On the other hand,
We may pick “any” other reasonable transformation. There are formal approaches to determine
an optimal transformation of the data, notably the function boxcox from the package MASS.
However, in practice $\log(\cdot)$ and $\sqrt{\cdot}$ are used predominantly.
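A minimal sketch of such a Box–Cox profile for a hypothetical positive, right-skewed response:

library(MASS)
set.seed(9)
x <- runif(50, 1, 10)
y <- exp(0.3 * x + rnorm(50, sd = 0.2))   # positive response with increasing variance
boxcox(lm(y ~ x))                         # profile over lambda; a peak near 0 suggests log()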
However, we do not have closed forms for the resulting estimates and iterative approaches are
needed. That is, starting from an initial condition we improve the solution by small steps. The
correction factor typically depends on the gradient. Such algorithms are variants of the so-called
Gauss–Newton algorithm. The details of these are beyond the scope of this document.
There is no guarantee that an optimal solution exists (global minimum). Using nonlinear
least squares as a black box approach is dangerous and should be avoided. In R, the function
nls (nonlinear least-squares) can be used. General optimizer functions are nlm and optim.
There are some prototype nonlinear functions that are often used:
Example 9.4. (orings data) In January 1986 the space shuttle Challenger exploded shortly
after taking off, killing all seven crew members aboard. Part of the problem was with the rubber
seals, the so-called O-rings, of the booster rockets. Due to low ambient temperature, the seals
started to leak, causing the catastrophe. The data set data( orings, package="faraway")
contains the number of defects in the six seals in 23 previous launches (Figure 9.6). The question
we ask here is whether the probability of a defect for an arbitrary seal can be predicted for an
air temperature of 31◦ F (as in January 1986). See Dalal et al. (1989) for a detailed statistical
account or simply https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster.
The variable of interest is a probability (failure of a rubber seal) that we estimate based on
binomial data (failures of O-rings), but a linear model cannot guarantee $\hat{p}_i \in [0, 1]$ (see the linear fit in
Figure 9.6). In this and similar cases, logistic regression is appropriate. The logistic regression
models the probability of a defect as
\[
p = P(\text{defect}) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 x)}, \tag{9.31}
\]
where x is the air temperature. Through inversion one obtains a linear model for the log odds
\[
g(p) = \log\Bigl(\frac{p}{1-p}\Bigr) = \beta_0 + \beta_1 x, \tag{9.32}
\]
where g(·) is generally called the link function. In this special case, the function g −1 (·) is called
the logistic function. ♣
R-Code 9.4 orings data and estimated probability of defect dependent on air temperature.
(See Figure 9.6.)
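Since the body of R-Code 9.4 is not reproduced here, the following hedged sketch shows one way to obtain such a fit; it assumes the faraway::orings data frame has the columns temp and damage (out of six seals), which may differ from the original code.

# Hedged sketch of a logistic fit for the orings data (column names assumed).
data( orings, package="faraway")
fit <- glm( cbind( damage, 6 - damage) ~ temp, family=binomial, data=orings)
summary( fit)
# Predicted probability of a defect at 31 degrees Fahrenheit:
predict( fit, newdata=data.frame( temp=31), type="response")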
Figure 9.6: orings data (proportion of damaged orings, black crosses) and estimated
probability of defect (red dots) dependent on air temperature. Linear fit is given by
the gray solid line. Dotted vertical line is the ambient launch temperature at the time
of launch. (See R-Code 9.4.)
Problem 9.1 (Multiple linear regression 1) The data stackloss.txt are available on the course web page. The data represent the production of nitric acid in the process of oxidizing ammonia. The response variable, stack loss, is the percentage of the ingoing ammonia that escapes unabsorbed. Key process variables are the airflow, the cooling water temperature (in degrees C), and the acid concentration (in percent).
Construct a regression model that relates the three predictors to the response, stack loss.
Check the adequacy of the model.
Exercise and data are from B. Abraham and J. Ledolter, Introduction to Regression Modeling,
2006, Thomson Brooks/Cole.
Hints:
• Look at the data. Outliers?
• Try to find an “optimal” model. Exclude predictors that do not improve the model fit.
• Use model diagnostics; use t- and F-tests and (adjusted) R² values to compare different models.
• Which data points have a (too) strong influence on the model fit? (influence.measures())
• Are the predictors correlated? In case of a high correlation, what are possible implications?
Problem 9.2 (Multiple linear regression 2) The file salary.txt contains information about
average teacher salaries for 325 school districts in Iowa. The variables are
District: name of the district
districtSize: size of the district (1 = small, less than 1000 students; 2 = medium, between 1000 and 2000 students; 3 = large, more than 2000 students)
salary: average teacher salary (in dollars)
experience: average teacher experience (in years)
ii) For each of the three district sizes, fit a linear model using salary as the dependent variable
and experience as the covariate. Is there an effect of experience? How can we compare the
results?
iii) We now use all data jointly and use districtSize as covariate as well. However, districtSize is not numerical but categorical, and thus we set mydata$districtSize <- as.factor( mydata$districtSize) (with the appropriate data frame name). Fit a linear model using
salary as the dependent variable and the remaining data as the covariates. Is there an
effect of experience and/or district size? How can we interpret the parameter estimates?
Chapter 10
Analysis of Variance
In Test 2 discussed in Chapter 4, we compared the means of two independent samples with
each other. Naturally, the same procedure can be applied to I independent samples, which would
amount to $\binom{I}{2}$ tests and would require adjustments due to multiple testing (see Section 4.5.2).
In this chapter we learn a “better” method, based on the concept of analysis of variance,
termed ANOVA. We focus on a linear model approach to ANOVA. Due to historical reasons, the
notation is slightly different than what we have seen in the last two chapters; but we try to link
and unify as much as possible.
where we use the indices to indicate the group and the within-group observation. Similarly as in the regression models of the last chapters, we again assume $\varepsilon_{ij} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$. Formulation (10.1)
represents the individual group means directly, whereas formulation (10.2) models an overall
mean and deviations from the mean.
However, model (10.2) is overparameterized (I levels and I + 1 parameters) and an additional constraint on the parameters is necessary. Often, the sum-to-zero contrast or the treatment contrast, written as

\sum_{i=1}^{I} \beta_i = 0   or   \beta_1 = 0,   (10.3)

are used.
We are inherently interested in whether there exists a difference between the groups and so
our null hypothesis is H0 : β1 = β2 = · · · = βI = 0. Note that the hypothesis is independent of
the constraint. To develop the associated test, we proceed in several steps. We first link the two
group case to the notation from the last chapter. In a second step we intuitively derive estimates
in the general setting. Finally, we state the test statistic.
Model (10.2) with I = 2 and treatment constraint $\beta_1 = 0$ can be written as a regression problem with $Y_i^*$ the components of the vector $(Y_{11}, Y_{12}, \dots, Y_{1 n_1}, Y_{21}, \dots, Y_{2 n_2})^\top$ and $x_i = 0$ if $i = 1, \dots, n_1$ and $x_i = 1$ otherwise. We simplify the notation and spare ourselves from writing the index denoted by the asterisk with

\begin{pmatrix} \boldsymbol{Y}_1 \\ \boldsymbol{Y}_2 \end{pmatrix} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} = \boldsymbol{X} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \boldsymbol{\varepsilon} = \begin{pmatrix} \boldsymbol{1} & \boldsymbol{0} \\ \boldsymbol{1} & \boldsymbol{1} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \boldsymbol{\varepsilon},   (10.5)

\hat{\boldsymbol{\beta}} = \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \begin{pmatrix} \boldsymbol{y}_1 \\ \boldsymbol{y}_2 \end{pmatrix} = \begin{pmatrix} N & n_2 \\ n_2 & n_2 \end{pmatrix}^{-1} \begin{pmatrix} \boldsymbol{1}^\top & \boldsymbol{1}^\top \\ \boldsymbol{0}^\top & \boldsymbol{1}^\top \end{pmatrix} \begin{pmatrix} \boldsymbol{y}_1 \\ \boldsymbol{y}_2 \end{pmatrix}   (10.6)

= \frac{1}{n_1 n_2} \begin{pmatrix} n_2 & -n_2 \\ -n_2 & N \end{pmatrix} \begin{pmatrix} \sum_{i,j} y_{ij} \\ \sum_j y_{2j} \end{pmatrix} = \begin{pmatrix} \frac{1}{n_1} \sum_j y_{1j} \\ \frac{1}{n_2} \sum_j y_{2j} - \frac{1}{n_1} \sum_j y_{1j} \end{pmatrix}.   (10.7)
Thus the least squares estimates of µ and β2 in (10.2) for two groups are the mean of the first
group and the difference between the two group means.
The null hypothesis H0 : β1 = β2 = 0 in Model (10.2) is equivalent to the null hypothesis
H0 : β1∗ = 0 in Model (10.4) or to the null hypothesis H0 : β1 = 0 in Model (10.5). The latter is
of course based on a t-test for a linear association (Test 12) and coincides with the two sample
t-test for two independent samples (Test 2).
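As a quick numerical check (with simulated data, so all names are illustrative), the two-sample t-test with equal variances and the t-test on the slope of the corresponding lm fit give identical p-values:

# Minimal sketch: two-sample t-test versus regression on a 0/1 dummy.
set.seed( 3)
y <- c( rnorm( 10, mean=0), rnorm( 12, mean=1))
group <- factor( rep( c("A","B"), times=c(10, 12)))
t.test( y ~ group, var.equal=TRUE)$p.value
summary( lm( y ~ group))$coefficients[2, 4]   # identical p-value for the slope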
Estimators can also be derived in a similar fashion under other constraints or for more factor
levels.
With the least squares method, $\hat\mu$ and $\hat\beta_i$ are chosen such that

\sum_{i,j} (y_{ij} - \hat\mu - \hat\beta_i)^2 = \sum_{i,j} (\bar y_{\cdot\cdot} + \bar y_{i\cdot} - \bar y_{\cdot\cdot} + y_{ij} - \bar y_{i\cdot} - \hat\mu - \hat\beta_i)^2   (10.10)

= \sum_{i,j} \bigl( (\bar y_{\cdot\cdot} - \hat\mu) + (\bar y_{i\cdot} - \bar y_{\cdot\cdot} - \hat\beta_i) + (y_{ij} - \bar y_{i\cdot}) \bigr)^2   (10.11)

is minimized. We evaluate the square of this last equation and note that the cross terms are zero since

\sum_{j=1}^{J} (y_{ij} - \bar y_{i\cdot}) = 0   and   \sum_{i=1}^{I} (\bar y_{i\cdot} - \bar y_{\cdot\cdot} - \hat\beta_i) = 0.   (10.12)

The fitted decomposition of the observations is then

y_{ij} = \hat\mu + \hat\beta_i + r_{ij}.   (10.13)
The observations are orthogonally projected in the space spanned by µ and βi . This orthog-
onal projection allows for the division of the sums of squares of the observations (mean corrected
to be precise) into the sums of squares of the model and sum of squares of the error component.
These sums of squares are then weighted and compared. The representation of this process
in table form and the subsequent interpretation is often equated with the analysis of variance,
denoted ANOVA.
Remark 10.1. This orthogonal projection also holds in the case of a classical regression frame-
work, of course. Using (9.13) and (9.14), we have
because the hat matrix $\boldsymbol{H}$ is symmetric ($\boldsymbol{H}^\top = \boldsymbol{H}$) and idempotent ($\boldsymbol{H}\boldsymbol{H} = \boldsymbol{H}$). ♣
The decomposition of the sums of squares can be derived with help from (10.9). No assumptions about constraints or $n_i$ are made:

\sum_{i,j} (y_{ij} - \bar y_{\cdot\cdot})^2 = \sum_{i,j} (\bar y_{i\cdot} - \bar y_{\cdot\cdot} + y_{ij} - \bar y_{i\cdot})^2   (10.15)

= \sum_{i,j} (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 + \sum_{i,j} (y_{ij} - \bar y_{i\cdot})^2 + \sum_{i,j} 2 (\bar y_{i\cdot} - \bar y_{\cdot\cdot})(y_{ij} - \bar y_{i\cdot}),   (10.16)
where the cross term is again zero because $\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\cdot}) = 0$. Hence we have the decomposition of the sums of squares

\underbrace{\sum_{i,j} (y_{ij} - \bar y_{\cdot\cdot})^2}_{\text{Total}} = \underbrace{\sum_{i,j} (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2}_{\text{Model}} + \underbrace{\sum_{i,j} (y_{ij} - \bar y_{i\cdot})^2}_{\text{Error}}   (10.17)
or SST = SSA + SSE. We deliberately choose SSA instead of SSM as this will simplify subsequent extensions. Using the least squares estimates $\hat\mu = \bar y_{\cdot\cdot}$ and $\hat\beta_i = \bar y_{i\cdot} - \bar y_{\cdot\cdot}$, this equation can be read as

\frac{1}{N} \sum_{i,j} (y_{ij} - \hat\mu)^2 = \frac{1}{N} \sum_{i} n_i (\widehat{\mu + \beta_i} - \hat\mu)^2 + \frac{1}{N} \sum_{i,j} (y_{ij} - \widehat{\mu + \beta_i})^2   (10.18)

\widehat{\operatorname{Var}}(y_{ij}) = \frac{1}{N} \sum_{i} n_i \hat\beta_i^2 + \hat\sigma^2,   (10.19)
(where we could have used some divisor other than N). The test statistic for the statistical hypothesis H0 : β1 = β2 = · · · = βI = 0 is based on the idea of decomposing the variance into variance between groups and variance within groups, just as illustrated in (10.19), and comparing them. Formally, this must be made more precise. A good model has a small estimate $\hat\sigma^2$ in comparison to the between-group term. We now develop a quantitative comparison of the sums.
A raw comparison of both variance terms is not sufficient; the number of observations must be considered: SSE increases as N increases, even for a high-quality model. In order to weight the individual sums of squares, we divide them by their degrees of freedom, e.g., instead of SSE we use SSE/(N − I) and instead of SSA we use SSA/(I − 1), which we term mean squares. Under the null hypothesis, the (suitably scaled) mean squares are chi-square distributed and thus their quotient is F distributed. Hence, an F-test as illustrated in Test 4 is needed again. Historically, such a test has been “constructed” via a table and is still represented as such.
This so-called ANOVA table consists of columns for the sums of squares, degrees of freedom,
mean squares and F -test statistic due to variance between groups, within groups, and the total
variance. Table 10.1 illustrates such a generic ANOVA table, numerical examples are given in
Example 10.1 later in this section.
Calculation in R: summary( lm(...)) for the value of the test statistic or anova(
lm(...)) for the explicit ANOVA table.
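As a minimal sketch of these two calls (with simulated data; all names are purely illustrative):

# Minimal sketch: one-way ANOVA via lm() with three simulated groups.
set.seed( 4)
y <- rnorm( 30, mean=rep( c(0, 0.5, 1), each=10))
group <- factor( rep( c("g1","g2","g3"), each=10))
summary( lm( y ~ group))      # F-statistic in the last line of the output
anova( lm( y ~ group))        # explicit ANOVA table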
Example 10.1. (retardant data) Many substances related to human activities end up in
wastewater and accumulate in sewage sludge. The present study focuses on hexabromocyclodo-
decane (HBCD) detected in sewage sludge collected from a monitoring network in Switzerland.
HBCD’s main use is in expanded and extruded polystyrene for thermal insulation foams, in
building and construction. HBCD is also applied in the backcoating of textiles, mainly in furni-
ture upholstery. A very small application of HBCD is in high impact polystyrene, which is used
for electrical and electronic appliances, for example in audio visual equipment. Data and more
detailed background information are given in Kupper et al. (2008) where it is also argued that
loads from different types of monitoring sites showed that brominated flame retardants ending
up in sewage sludge originate mainly from surface runoff, industrial and domestic wastewater.
HBCD is harmful to one’s health, may affect reproductive capacity, and may harm children
in the mother’s womb.
In R-Code 10.1 the data are loaded and reduced to Hexabromocyclododecane. First we use
constraint β1 = 0, i.e., Model (10.5). The estimates naturally agree with those from (10.7). Then
we use the sum-to-zero constraint and compare the results. The estimates and the standard errors
changed (and thus the p-values of the t-test). The p-values of the F -test are, however, identical,
since the same test is used.
The R command aov is an alternative for performing ANOVA and its use is illustrated in
R-Code 10.2. We prefer, however, the more general lm approach. Nevertheless we need a function
which provides results on which, for example, Tukey’s honest significant difference (HSD) test
can be performed with the function TukeyHSD. The differences can also be calculated from the
coefficients in R-Code 10.1. The p-values are larger because multiple tests are accounted for. ♣
R-Code 10.1: retardant data: ANOVA with lm command and illustration of various
contrasts.
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.8 on 13 degrees of freedom
## Multiple R-squared: 0.249,Adjusted R-squared: 0.134
## F-statistic: 2.16 on 2 and 13 DF, p-value: 0.155
options( "contrasts")
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
# manually construct the estimates:
c( mean(HBCD[1:4]), mean(HBCD[5:8])-mean(HBCD[1:4]),
mean(HBCD[9:16])-mean(HBCD[1:4]))
## [1] 75.675 77.250 107.788
# change the contrasts to sum-to-zero
options(contrasts=c("contr.sum","contr.sum"))
lmout1 <- lm( HBCD ~ type )
summary(lmout1)
##
## Call:
## lm(formula = HBCD ~ type)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.6 -44.4 -26.3 22.0 193.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.4 22.3 6.15 3.5e-05 ***
## type1 -61.7 33.1 -1.86 0.086 .
## type2 15.6 33.1 0.47 0.646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.8 on 13 degrees of freedom
## Multiple R-squared: 0.249,Adjusted R-squared: 0.134
## F-statistic: 2.16 on 2 and 13 DF, p-value: 0.155
beta <- as.numeric(coef(lmout1))
# Construct 'contr.treat' coefficients:
c( beta[1]+beta[2], beta[3]-beta[2], -2*beta[2]-beta[3])
## [1] 75.675 77.250 107.787
R-Code 10.2 retardant data: ANOVA with aov and multiple testing of the means.
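The body of R-Code 10.2 is not reproduced here; a hedged sketch of what such an analysis typically looks like (reusing the objects HBCD and type from R-Code 10.1; the exact code in the script may differ) is:

# Hedged sketch: ANOVA via aov() and Tukey's HSD (objects from R-Code 10.1).
aovout <- aov( HBCD ~ type)
summary( aovout)        # same F-test as anova( lm( HBCD ~ type))
TukeyHSD( aovout)       # all pairwise differences with adjusted p-values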
with i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , $n_{ij}$ and $\varepsilon_{ijk} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$. The indices again specify the levels of the first and second factor as well as the count for that configuration. As stated, the model is overparameterized and additional constraints are again necessary, in which case

\sum_{i=1}^{I} \beta_i = 0, \quad \sum_{j=1}^{J} \gamma_j = 0   or   \beta_1 = 0, \quad \gamma_1 = 0   (10.22)
lm(...). In case we do compare sums of squares, there are resulting ambiguities and the factors need to be included in decreasing order of “natural” importance.
For the sake of illustration, we consider the balanced case $n_{ij} = K$, called a complete two-way ANOVA. More precisely, the model consists of I · J groups, every group contains K samples and N = I · J · K. The calculation of the estimates is easier than in the unbalanced case and is illustrated as follows.
As in the one-way case, we can derive the least squares estimates

y_{ijk} = \underbrace{\bar y_{\cdot\cdot\cdot}}_{\hat\mu} + \underbrace{\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot}}_{\hat\beta_i} + \underbrace{\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot}}_{\hat\gamma_j} + \underbrace{y_{ijk} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot}}_{r_{ijk}}   (10.23)
Factor A:  SS_A = \sum_{i,j,k} (\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot})^2,   DF = I − 1,   MS_A = SS_A/(I − 1),   F_{obs,A} = MS_A/MS_E
Factor B:  SS_B = \sum_{i,j,k} (\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot})^2,   DF = J − 1,   MS_B = SS_B/(J − 1),   F_{obs,B} = MS_B/MS_E
Error:     SS_E = \sum_{i,j,k} (y_{ijk} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot})^2,   DF_E = N − I − J + 1,   MS_E = SS_E/DF_E
Total:     SS_T = \sum_{i,j,k} (y_{ijk} - \bar y_{\cdot\cdot\cdot})^2,   DF = N − 1
Model (10.21) is additive: “More of both leads to even more.” It might be that there is a certain canceling or saturation effect. To model such a situation, we need to include an interaction term $(\beta\gamma)_{ij}$ in the model, to account for the non-additive effects:

Y_{ijk} = \mu + \beta_i + \gamma_j + (\beta\gamma)_{ij} + \varepsilon_{ijk},
with $\varepsilon_{ijk} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$ and corresponding ranges for the indices. In addition to the constraints (10.22) we require

\sum_{i=1}^{I} (\beta\gamma)_{ij} = 0   and   \sum_{j=1}^{J} (\beta\gamma)_{ij} = 0   for all j and i, respectively,   (10.26)
or analogous treatment constraints are often used. As in the previous two-way case, we can
derive the least squares estimates
y_{ijk} = \underbrace{\bar y_{\cdot\cdot\cdot}}_{\hat\mu} + \underbrace{\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot}}_{\hat\beta_i} + \underbrace{\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot}}_{\hat\gamma_j} + \underbrace{\bar y_{ij\cdot} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot}}_{\widehat{(\beta\gamma)}_{ij}} + \underbrace{y_{ijk} - \bar y_{ij\cdot}}_{r_{ijk}}   (10.27)
Factor A:  SS_A = \sum_{i,j,k} (\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot})^2,   DF_A = I − 1,   MS_A = SS_A/DF_A,   F_{obs,A} = MS_A/MS_E
Factor B:  SS_B = \sum_{i,j,k} (\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot})^2,   DF_B = J − 1,   MS_B = SS_B/DF_B,   F_{obs,B} = MS_B/MS_E
with j = 1, . . . , J, k = 1, . . . , $n_j$ and $\varepsilon_{ijk} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$. Additional constraints are again necessary.
Keeping track of indices and Greek letters quickly gets cumbersome and one often uses the R formula notation instead. For example, if the predictor $x_i$ is in the variable Population and $\gamma_j$ is in the variable Treatment (in form of a factor), then the model can be written as

Y ~ Population + Treatment.   (10.29)

Hence our unified approach via lm(...). Notice that for the estimates the order of the variables in formula (10.29) does not play a role; for the decomposition of the sums of squares it does.
Different statistical software packages have different approaches and thus may lead to minor
differences.
10.4 Example
Example 10.2. (UVfilter data) Octocrylene is an organic UV filter found in sunscreen and
cosmetics. The substance is classified as a contaminant and dangerous for the environment by
the EU under the CLP Regulation.
Because the substance is difficult to break down, the environmental burden of Octocrylene
can be estimated through measurement of its concentration in sludge from waste treatment
facilities.
The study by Plagellat et al. (2006) analyzed Octocrylene (OC) concentrations from 24 different purification plants (consisting of three different types of Treatment), each with two samples (Month). Additionally, the catchment area (Population) and the amount of sludge (Production) are known. Treatment type A refers to small plants, B to medium-sized plants without considerable industry and C to medium-sized plants with industry.
R-Code 10.3 prepares the data and shows a one-way ANOVA. R-Code 10.4 shows a two-way
ANOVA (with and without interactions).
Figure 10.1 shows why the interaction is not significant. First, the seasonal effects of groups A and B are very similar and, second, the variability in group C is too large. ♣
R-Code 10.4: UVfilter data: two-way ANOVA and two-way ANOVA with interactions
using lm. (See Figure 10.1.)
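The body of R-Code 10.4 is not reproduced here; a hedged sketch of the two fits (reusing the data frame UV with the variables OT, Month and Treatment as in R-Code 12.2; the exact code in the script may differ) is:

# Hedged sketch: two-way ANOVA without and with interaction for the UVfilter data.
lm2 <- lm( log(OT) ~ Month + Treatment, data=UV)
anova( lm2)                                    # main effects only
lm3 <- lm( log(OT) ~ Month * Treatment, data=UV)
anova( lm3)                                    # includes the Month:Treatment interaction
interaction.plot( UV$Treatment, UV$Month, log(UV$OT))  # as in Figure 10.1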
Figure 10.1: UVfilter data: box plots sorted by treatment and interaction plot. (See
R-Code 10.4.)
i) Describe the data. Do a visual inspection to check for differences between the treatment types and between the months of data acquisition. Use an appropriate plot function to do so. Describe your results.
Hint: Also try the function table()
ii) Fit a one-way ANOVA with log(OC) as response variable and Behandlung as explanatory
variable.
Hint: use lm and perform an anova on the output. Don’t forget to check model assump-
tions.
iii) Extend the model to a two-way ANOVA by adding Monat as a predictor. Interpret the
summary table.
iv) Test if there is a significant interaction between Behandlung and Monat. Compare the
result with the output of interaction.plot
v) Extend the model from (b) by adding Produktion as an explanatory variable. Perform an
anova on the model output and interpret the summary table. (Such a model is sometimes
called Analysis of Covariance, ANCOVA).
Switch the order of your explanatory variables and run an anova on both model outputs.
Discuss the results of Behandlung + Produktion and Produktion + Behandlung. What
causes the differences?
Chapter 11
Bayesian Methods
In statistics there exist two different philosophical approaches to inference: frequentist and Bayesian inference. Past chapters dealt with the frequentist approach; now we deal with the Bayesian approach. Here, we consider the parameter as a random variable with a suitable
distribution, which is chosen a priori, i.e., before the data is collected and analyzed. The goal is
to update this prior knowledge after observation of the data in order to draw conclusions (with
the help of the so-called posterior distribution).
and is shown by using Equation (2.3) twice. Bayes' theorem is often used in probability theory to calculate probabilities along an event tree, as illustrated in the classic example below.
Example 11.1. A patient sees a doctor and gets tested for a (relatively) rare disease. The prevalence of this disease is 0.5%. As is typical, the screening test is not perfect and has a sensitivity
of 99%, i.e., the true positive rate (the disease is properly identified in a sick patient), and a specificity of 98%, i.e., the true negative rate (a healthy person is correctly identified as disease free). What is the probability that the patient has the disease given that the test is positive?
Denoting the events D = ‘patient has the disease’ and + = ‘test is positive’, we have, using (11.1),

P(D \mid +) = \frac{P(+ \mid D)\, P(D)}{P(+)} = \frac{P(+ \mid D)\, P(D)}{P(+ \mid D)\, P(D) + P(+ \mid \neg D)\, P(\neg D)}   (11.2)

= \frac{99\% \cdot 0.5\%}{99\% \cdot 0.5\% + 2\% \cdot 99.5\%} \approx 20\%.   (11.3)
Note that for the denominator we have used the so-called law of total probability to get an
expression for P(+). ♣
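The arithmetic in (11.3) can be verified directly (a minimal sketch):

# Probability of disease given a positive test, following (11.2)-(11.3).
prev <- 0.005; sens <- 0.99; spec <- 0.98
sens * prev / ( sens * prev + (1 - spec) * (1 - prev))   # approximately 0.199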
Extending Bayes’ theorem to the setting of two continuous random variables X and Y we
have
f_{X \mid Y=y}(x \mid y) = \frac{f_{Y \mid X=x}(y \mid x)\, f_X(x)}{f_Y(y)}.   (11.4)
In the context of Bayesian inference the random variable X will now be a parameter, typically
of the distribution of Y :
f_{\Theta \mid Y=y}(\theta \mid y) = \frac{f_{Y \mid \Theta=\theta}(y \mid \theta)\, f_\Theta(\theta)}{f_Y(y)}.   (11.5)
Hence, current knowledge about the parameter is expressed by a probability distribution on
the parameter: the prior distribution. The model for our observations is called the likelihood. We
use our observed data to update the prior distribution and thus obtain the posterior distribution.
Notice that P(B) in (11.1), P(+) in (11.2), or fY (y) in (11.4) and (11.5) serves as a normal-
izing constant, i.e., it is independent of A, D, x or the parameter θ. Thus, we often write the
posterior without this normalizing constant
(or in short form f (θ | y) ∝ f (y | θ)f (θ) if the context is clear). The symbol “∝” means
“proportional to”.
Finally, we can summarize the most important result in Bayesian inference as: the posterior density is proportional to the likelihood multiplied by the prior density, i.e., Posterior ∝ Likelihood × Prior.
Until recently, there were clear fronts between frequentists and Bayesians. Luckily these differ-
ences have vanished.
11.2 Examples
We start with two very typical examples that are tractable and well illustrate the concept of
Bayesian inference.
Example 11.2. (Beta-Binomial) Let Y ∼ Bin(n, p). We observe y successes (out of n). As was shown in Section 5.1, $\hat p = y/n$. We often have additional knowledge about the parameter p. For example, let n = 13 be the number of autumn lambs in a herd of sheep, of which we count the number of male lambs. It is highly unlikely that p ≤ 0.1. Thus we assume that p is beta distributed, i.e., its density is $c \cdot p^{\alpha-1}(1-p)^{\beta-1}$ for $0 \le p \le 1$, with normalization constant c. We write p ∼ Beta(α, β). Figure 11.4 shows densities for various pairs (α, β).
The posterior density is then

f(p \mid y) \propto \binom{n}{y} p^y (1-p)^{n-y} \times c \cdot p^{\alpha-1} (1-p)^{\beta-1}   (11.9)

\propto p^y p^{\alpha-1} (1-p)^{n-y} (1-p)^{\beta-1}   (11.10)

\propto p^{y+\alpha-1} (1-p)^{n-y+\beta-1},   (11.11)

i.e., the posterior is again a beta distribution, namely Beta(y + α, n − y + β).
Figure 11.1: Beta-binomial model with prior density (cyan), data/likelihood (green)
and posterior density (blue).
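A hedged sketch reproducing a figure of this type (the prior parameters alpha and beta below are placeholder values, not necessarily those used in the script) could look as follows:

# Hedged sketch: prior, (scaled) likelihood and posterior for the beta-binomial model.
n <- 13; y <- 10
alpha <- 2; beta <- 4                  # placeholder hyper-parameters
p <- seq( 0, 1, length.out=501)
plot( p, dbeta( p, y+alpha, n-y+beta), type="l", col=4, ylab="Density")  # posterior
lines( p, dbeta( p, alpha, beta), col=5)                                 # prior
lines( p, dbeta( p, y+1, n-y+1), col=3)         # likelihood, normalized as a density
abline( v=y/n, lty=3)                           # maximum likelihood estimate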
In the previous example, the prior is p ∼ Beta(α, β), where α and β are fixed during the model specification and are thus called hyper-parameters.
Example 11.3. (Normal-normal) Let $Y_1, \dots, Y_n \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$. We assume that $\mu \sim \mathcal{N}(\eta, \tau^2)$ and that σ is known. Thus we have the Bayesian model:

Y_i \mid \mu \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2), \quad i = 1, \dots, n,   (11.13)

\mu \sim \mathcal{N}(\eta, \tau^2).   (11.14)
where the constants $(2\pi\sigma^2)^{-1/2}$ and $(2\pi\tau^2)^{-1/2}$ do not need to be considered. Through further manipulation (completing the square in µ) one obtains

\propto \exp\biggl( -\frac{1}{2} \Bigl( \frac{n}{\sigma^2} + \frac{1}{\tau^2} \Bigr) \Bigl( \mu - \Bigl( \frac{n}{\sigma^2} + \frac{1}{\tau^2} \Bigr)^{-1} \Bigl( \frac{n\bar y}{\sigma^2} + \frac{\eta}{\tau^2} \Bigr) \Bigr)^2 \biggr)   (11.18)

and hence the posterior mean

E(\mu \mid y_1, \dots, y_n) = \frac{\sigma^2}{n\tau^2 + \sigma^2}\, \eta + \frac{n\tau^2}{n\tau^2 + \sigma^2}\, \bar y   (11.20)
is a weighted mean of the prior mean η and the mean of the likelihood, $\bar y$. The greater n is, the less weight there is on the prior mean, since $\sigma^2/(n\tau^2 + \sigma^2) \to 0$ as $n \to \infty$.
Figure 11.2 illustrates the setting of this example with artificial data (see R-Code 11.1).
Typically, the prior is fixed but if more data is collected, the likelihood gets more and more
peaked. As a result, the posterior mean will be closer to the mean of the data. We discuss this
further in the next section. ♣
Figure 11.2: Normal-normal model with prior (cyan), data/likelihood (green) and
posterior (blue). (See R-Code 11.1.)
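Since the body of R-Code 11.1 is not reproduced here, the following hedged sketch (with artificial data and placeholder values for σ, η and τ) produces a comparable figure:

# Hedged sketch: prior, likelihood and posterior for the normal-normal model.
set.seed( 11)
n <- 10; sigma <- 1; eta <- 0; tau <- 1          # placeholder values
y <- rnorm( n, mean=2, sd=sigma)
post.var <- 1 / ( n/sigma^2 + 1/tau^2)
post.mean <- post.var * ( n*mean(y)/sigma^2 + eta/tau^2)
mu <- seq( -2, 4, length.out=401)
plot( mu, dnorm( mu, post.mean, sqrt(post.var)), type="l", col=4, ylab="Density")
lines( mu, dnorm( mu, eta, tau), col=5)                    # prior
lines( mu, dnorm( mu, mean(y), sigma/sqrt(n)), col=3)      # likelihood of the mean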
The posterior mode is often used as a summary statistic of the posterior distribution. Nat-
urally, the posterior median and posterior mean (i.e., expectation of the posterior distribution)
are intuitive alternatives.
With the frequentist approach we have constructed confidence intervals, but these intervals
with $v = n/\sigma^2 + 1/\tau^2$ and $m = n\bar y/\sigma^2 + \eta/\tau^2$. That means that the bounds $v^{-1}m \pm z_{1-\alpha/2}\, v^{-1/2}$ can be used to construct a Bayesian counterpart to a confidence interval.
is called a (1 − α)% credible interval for θ with respect to the posterior density f (θ | y1 , . . . , yn )
and 1 − α is the credible level of the interval. ♦
The definition states that the random variable whose density is given by f (θ | y1 , . . . , yn ) is
contained in the (1 − α)% credible interval with probability (1 − α).
Since the credible interval for a fixed α is not unique, the “narrowest” is often used. This
is the so-called HPD Interval (highest posterior density interval). A detailed discussion can be
found in Held (2008). Credible intervals are often determined numerically.
Example 11.4. In the context of Example 11.2, the 2.5% and 97.5% quantiles of the posterior are 0.45 and 0.83, respectively. An HPD interval is given by 0.46 and 0.84. The differences are not pronounced, as the posterior density is fairly symmetric. Hence, the widths of both are almost identical: 0.377 and 0.375.
The frequentist empirical 95% CI is [0.5, 0.92], with width 0.42, see Equation (5.9). ♣
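A hedged sketch of how such quantile-based bounds are obtained (again with placeholder hyper-parameters alpha and beta, which may differ from those used in the script):

# Hedged sketch: equi-tailed credible interval from the beta posterior.
n <- 13; y <- 10
alpha <- 2; beta <- 4                              # placeholder hyper-parameters
qbeta( c(0.025, 0.975), y + alpha, n - y + beta)   # 2.5% and 97.5% posterior quantiles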
that means that the Bayes factor BF01 summarizes the evidence of the data for the hypothesis
H0 versus the hypothesis H1 . However, it has to be mentioned that a Bayes factor needs to
exceed 3 to talk about substantial evidence for H0 . For strong evidence we typically require
Bayes factors larger than 10. More precisely, Jeffreys (1983) differentiates
For values smaller than one, we would favor H1 and the situation is similar by inverting the
ratio, as also illustrated in the following example.
Example 11.5. We consider the setup of Example 11.2 and want to compare the models with p = 1/2 and p = 0.8 when observing 10 successes among the 13 trials. To calculate the Bayes factor, we need to calculate P(Y = 10 | p) for p = 1/2 and p = 0.8. Hence, the Bayes factor is

BF_{01} = \frac{\binom{13}{10}\, 0.5^{10} (1-0.5)^3}{\binom{13}{10}\, 0.8^{10} (1-0.8)^3} = \frac{0.0349}{0.2457} = 0.1421,   (11.24)
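The two binomial probabilities and their ratio can be checked directly (a minimal sketch):

# Bayes factor comparing p = 0.5 against p = 0.8 for y = 10 successes in n = 13 trials.
dbinom( 10, size=13, prob=0.5)                        # 0.0349
dbinom( 10, size=13, prob=0.8)                        # 0.2457
dbinom( 10, size=13, prob=0.5) / dbinom( 10, size=13, prob=0.8)   # BF01 = 0.142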
Bayes factors are popular because they are linked to the BIC (Bayesian Information Crite-
rion) and thus automatically penalize model complexity. Further, they also work for non-nested
models.
Example 11.6. We consider again the normal-normal model and compare the posterior density for various n with the likelihood. We keep $\bar y = 2.1$, independent of n. As shown in Figure 11.3, the maximum likelihood estimate does not depend on n ($\bar y$ is kept constant by design). The uncertainty decreases, however (the standard error is $\sigma/\sqrt{n}$). For increasing n, the posterior approaches the likelihood density. In the limit there is no difference between the posterior and the likelihood.
Figure 11.3: Normal-normal model with prior (cyan), data/likelihood (green) and posterior (blue). Top: two different priors; bottom: increasing n (n = 4, 36, 64, 100).
The choice of prior distribution leads to several discussions and we refer you to Held and Sa-
banés Bové (2014) for more details.
The source https://www.nicebread.de/grades-of-evidence-a-cheat-sheet compares different
categorizations of evidence based on a Bayes factor and illustrates that the terminology is not
universal.
We introduce a random variable with support [0, 1]. Hence this random variable is well suited
to model probabilities (proportions, fractions) in the context of Bayesian modeling.
A random variable X with density

f_X(x) = c \cdot x^{\alpha-1} (1-x)^{\beta-1}, \qquad 0 \le x \le 1,

where c is a normalization constant, is called beta distributed with parameters α and β. We write this as X ∼ Beta(α, β). The normalization constant cannot be written in closed form for all parameters α and β. For α = β the density is symmetric around 1/2 and for α > 1, β > 1
the density is concave with mode (α − 1)/(α + β − 2). For arbitrary α > 0, β > 0 we have:

E(X) = \frac{\alpha}{\alpha + \beta};   (11.27)

\operatorname{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.   (11.28)
Figure 11.4 shows densities of the beta distribution for various pairs of (α, β).
R-Code 11.2 Densities of beta distributed random variables for various pairs of (α, β).
(See Figure 11.4.)
Figure 11.4: Densities of beta distributed random variables for various pairs of (α, β).
(See R-Code 11.2.)
ii) We choose the following Gamma prior density for the parameter κ:

f(\kappa \mid \alpha, \beta) = \begin{cases} \dfrac{\beta^\alpha}{\Gamma(\alpha)}\, \kappa^{\alpha-1} \exp(-\beta\kappa), & \text{if } \kappa > 0, \\ 0, & \text{otherwise,} \end{cases}

for fixed hyper-parameters α > 0, β > 0, i.e., κ ∼ Gamma(α, β). How does this distribution relate to the exponential distribution?

iii) Plot four pdfs for (α, β) = (1,1), (1,2), (2,1) and (2,2). How can a certain choice of α, β be interpreted with respect to our “beliefs” on κ?
v) Compare the prior and posterior distributions. Why is the choice in ii) sensible?
vi) Simulate some data with n = 50, µ = 10 and κ = 0.25. Plot the prior and posterior
distributions of κ for α = 2 and β = 1.
Problem 11.2 (Bayesian statistics) For the following Bayesian models, derive the posterior
distribution and give an interpretation thereof in terms of prior and data.
i) Let Y | µ ∼ N (µ, 1/κ), where κ is the precision (inverse of the variance) and is assumed
to be known (hyper-parameter). Further, we assume that µ ∼ N (η, 1/ν), for fixed hyper-
parameters η and ν > 0.
ii) Let Y | λ ∼ Pois(λ) with a prior λ ∼ Gamma(α, β) for fixed hyper-parameters α > 0,
β > 0.
Chapter 12
Design of Experiments
Design of Experiments (DoE) is a relatively old field of statistics. Pioneering work was done almost 100 years ago by Sir Ronald Fisher and co-workers at Rothamsted Experimental Station, England, where mainly agricultural questions were discussed. The topic was taken up by industry after the second world war to, e.g., optimize the production of chemical compounds or work with robust parameter designs. In recent decades, advances are still being made, on the one hand using the abundance of data in machine-learning-type discovery and on the other hand in preclinical and clinical research where the sample sizes are often extremely small.
In this chapter we will selectively cover different aspects of DoE, focusing on sample size
calculations and randomization. Additionally, in the last section we also cover a few domain
specific concepts and terms that are often used in the context of setting up experiments for
clinical or preclinical trials.
Maximize primary variance, minimize error variance and control for secondary variance,

which translates to: maximize the signal we are investigating, minimize the noise we are not modeling and control uncertainties with carefully chosen independent variables.
In the context of DoE we often want to compare the effect of a treatment (or procedure) on an outcome. Examples that have been discussed in previous chapters are: “Is there a progression of pododermatitis at the hind paws over time?”, “Does a diuretic medication during pregnancy reduce the risk of pre-eclampsia?”, “How much can we increase the hardness of metal springs with lower-temperature quenching baths?”, “Is residual octocrylene in waste water sludge linked to particular waste water types?”.
To design an experiment, it is very important to differentiate between exploratory or con-
firmatory research questions. An exploratory experiment tries to discover as much as possible
about the sample material or the phenomenon under investigation, given time and resource constraints, whereas in a confirmatory experiment we want to verify, to confirm, or to validate a result, which was often derived from an earlier exploratory experiment. Table 12.1 summarizes
both approaches in a two-valued setting. Some of the design elements will be further discussed
in later sections of this chapter. The binary classification should be understood within each
domain: few observations in one domain may be very many in another one. In both situations
and all scientific domains, however, proper statistical analysis is crucial.
n \approx 4 z_{1-\alpha/2}^2\, \frac{\sigma^2}{\omega^2}   (12.1)

observations. In this setting, the right-hand side of (12.1) does not involve the data and thus the width of the confidence interval is guaranteed in any case. Note that to reduce the width by half, we need to quadruple the sample size.
The same approach is used when estimating a proportion. We can, for example, use the pre-
cise Wilson confidence interval (5.11) and solve a quadratic equation to obtain n. Alternatively,
we can use the Wald confidence interval (5.10) to get
n \approx 4 z_{1-\alpha/2}^2\, \frac{\hat p (1 - \hat p)}{\omega^2},   (12.2)
which corresponds to (12.1) with the plug-in estimate for $\hat\sigma^2$. Of course, $\hat p$ is not known a priori and we often take the conservative choice $\hat p = 1/2$, as the function x(1 − x) is maximized over (0, 1) at x = 1/2. Thus we may choose $n \approx (z_{1-\alpha/2}/\omega)^2$.
If we are estimating a Pearson’s correlation coefficient, we can use CI 6 to again link interval
width with n. Here, we use an alternative approach, and would like to determine sample size
such that the interval does not contain the value zero, i.e., the width is just smaller than 2r. The
derivation relies on the duality of tests and confidence intervals (see Section 4.4). Recall Test 11
for Pearson’s correlation coefficient. From Equation (8.3) we construct the critical value for the
test (boundary of the rejection region, see Figure 4.3) and based on that we can calculate the
minimum sample size necessary to detect a correlation |r| ≥ rcrit as significant:
t_{\text{crit}} = r_{\text{crit}} \frac{\sqrt{n-2}}{\sqrt{1 - r_{\text{crit}}^2}} \quad \Longrightarrow \quad r_{\text{crit}} = \frac{t_{\text{crit}}}{\sqrt{n - 2 + t_{\text{crit}}^2}}.   (12.3)
Figure 12.1 illustrates the least significant correlation for specific sample sizes. Specifically, with
sample size n < 24 correlations below 0.4 are not significant and for a correlation of 0.25 to be
significant, we require n > 62 at level α = 5% (see R-Code 12.1).
R-Code 12.1 Significant correlation for specific sample sizes (See Figure 12.1.)
Figure 12.1: Significant correlation for specific sample sizes (at level α = 5%). For an
empirical correlation of 0.25, n needs to be larger than 62 as indicated with the gray
lines. For a particular n correlations above the line are significant, below are not. (See
R-Code 12.1.)
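As the body of R-Code 12.1 is not reproduced here, a hedged sketch reproducing the key quantities of Figure 12.1 is:

# Hedged sketch: least significant correlation as a function of n (alpha = 5%).
n <- 3:100
tcrit <- qt( 0.975, df=n-2)
rcrit <- tcrit / sqrt( n - 2 + tcrit^2)          # Equation (12.3)
plot( n, rcrit, type="l", ylab="Least significant correlation")
abline( h=0.25, v=min( n[ rcrit <= 0.25]), col="gray")   # n must exceed 62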
Sample sizes are most often determined to be able to “detect” an alternative hypothesis with a
certain probability. That means we need to work with power 1 − β of a particular statistical test.
As a simple example, we consider a one-sided z-test with H0 : µ ≤ µ0 and H1 : µ > µ0 . The
Type II error is
\beta = \beta(\mu_1) = P(H_0 \text{ not rejected given } \mu = \mu_1) = \dots = \Phi\Bigl( z_{1-\alpha} + \frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} \Bigr).   (12.4)
Suppose we want to be able to detect the alternative µ1 with probability 1 − β(µ1), i.e., reject the null hypothesis with probability 1 − β when the true mean is µ1. Hence, plugging the values into (12.4) and solving for n we obtain the approximate sample size
n \approx \Bigl( \frac{(z_{1-\alpha} + z_{1-\beta})\, \sigma}{\mu_0 - \mu_1} \Bigr)^2.   (12.5)
Hence, the sample size depends on the Type I and II errors as well as the standard deviation and the difference of the means. The latter two quantities are often combined into the standardized effect size
d = \frac{\mu_0 - \mu_1}{\sigma}, \quad \text{called Cohen's } d.   (12.6)
For t-tests, Cohen (1988) defined small, medium and large (standardized) effect sizes as d = 0.2, 0.5 and 0.8, respectively. These are often termed the conventional effect sizes but depend on the type of test; they are also implemented in the function cohen.ES() of the R package pwr.
Example 12.1. In the setting of a two-sample t-test with equal group sizes, at level α = 5% and power 1 − β = 80% we need in each group 26, 64 and 394 observations for a large, medium and small effect size, respectively, see, e.g., pwr.t.test( d=0.2, power=.8) from the pwr package.
For unequal sample sizes, the sum of both group sizes is a bit larger compared to equal sample sizes (balanced setting). For a large effect size, we would, for example, require n1 = 20 and n2 = 35, leading to three more observations compared to the balanced setting (pwr.t2n.test( n1=20, d=0.8, power=.8)).
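A minimal sketch of these calls (the pwr package must be installed; the default significance level of 5% is assumed):

# Sample sizes for a two-sample t-test at 80% power (alpha = 5% by default).
library( pwr)
pwr.t.test( d=0.8, power=0.8)            # large effect: about 26 per group
pwr.t.test( d=0.2, power=0.8)            # small effect: about 394 per group
pwr.t2n.test( n1=20, d=0.8, power=0.8)   # unequal groups: n2 about 35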
12.3 ANOVA
DoE in the Fisher sense is heavily ANOVA driven, stemming from his analysis of the crop experiments at Rothamsted Experimental Station, and thus in many textbooks DoE is equated with the discussion of ANOVA. Here, we have separated the statistical analysis (Chapter 10) from the conceptual setup of the experiment in this chapter.
In a typical ANOVA setting we should strive to have the same number of observations in each cell (for all combinations of levels). Such a setting is called a balanced design (otherwise it is unbalanced). If every treatment has the same number of observations, the effects of unequal variances are mitigated.
In a simple regression setting, the standard errors of $\hat\beta_0$ and $\hat\beta_1$ depend on $1/\sum_i (x_i - \bar x)^2$, see the expressions for the estimates (8.8) and (8.9). Hence, to reduce the variability of the estimates, we should increase $\sum_i (x_i - \bar x)^2$ as much as possible. Specifically, suppose the interval [a, b] represents a natural range for the predictor; then we should choose half of the predictor values as a and the other half as b.
This last argument justifies a discretization of continuous predictor variables into levels. Of course, this implies that we expect a linear relationship. If the relationship is not linear, a discretization may be fatal.
(see also Equation (10.24)). In the unbalanced setting this is not the case and the decomposition depends on the order in which we introduce the factors in the model. At each step, we reduce additional variability. Hence, we should rather write

SS_T = SS_A + SS_{B|A} + SS_{AB|A,B} + SS_E,   (12.8)

where the term SS_{B|A} indicates the sums of squares of factor B after correction for factor A and, similarly, the term SS_{AB|A,B} indicates the sums of squares of the interaction AB after correction for factors A and B.
This concept of sums of squares after correction is not new. We have encountered this type
of correction already: SST is actually calculated after correcting for the overall mean.
Equation (12.8) represents the sequential sums of squares decomposition, called Type I sequential SS: SS_A, SS_{B|A} and SS_{AB|A,B}. It is possible to show that SS_{B|A} = SS_{A,B} − SS_A, where SS_{A,B} is the model sums of squares of a model with both factors but without interaction. An ANOVA
table such as given in Table 10.3 yields different p-values for H0 : β1 = · · · = βI = 0 and
H0 : γ1 = · · · = γJ = 0 if the order of the factors is exchanged. This is often a disadvantage
and for the F -test the so-called Type II partial SS, being SSA|B and SSB|A should be used. As
there is no interaction involved, we should use Type II only if the interaction is not significant
12.4. RANDOMIZATION 181
(in which case it is to be preferred over Type I). Alternatively, Type III partial SS, SSA|B,AB and
SSB|A,AB , may be used.
In R, the outputs of aov or anova are Type I SS. To obtain the other types, manual calculations may be done, or the function Anova(..., type=2) (or type=3) from the package car may be used.
Example 12.2. Consider Example 10.2 in Section 10.4 but we eliminate the first observation
and the design is unbalanced in both factors. R-Code 12.2 calculates the Type I sequential SS
for the same order as in R-Code 10.4. Type II partial SS are subsequently slightly different.
Note that the design is balanced for the factor Month and thus simply exchanging the order
does not alter the SS here. ♣
R-Code 12.2 Type I and II SS for UVfilter data without the first observation.
require( car)
lmout2 <- lm( log(OT) ~ Month + Treatment, data=UV, subset=-1) # omit 1st!
print( anova( lmout2), signif.stars=FALSE)
## Analysis of Variance Table
##
## Response: log(OT)
## Df Sum Sq Mean Sq F value Pr(>F)
## Month 1 1.14 1.137 4.28 0.053
## Treatment 2 5.38 2.692 10.12 0.001
## Residuals 19 5.05 0.266
print( Anova( lmout2, type=2), signif.stars=FALSE) # type=2 is default
## Anova Table (Type II tests)
##
## Response: log(OT)
## Sum Sq Df F value Pr(>F)
## Month 1.41 1 5.31 0.033
## Treatment 5.38 2 10.12 0.001
## Residuals 5.05 19
we use sample(x=4, size=20, replace=TRUE). This procedure has the disadvantage of leading to a possibly unbalanced design. Constrained randomization places the same number of subjects in all groups (conditional on an appropriate sample size). This can be achieved by numbering the subjects, randomly drawing the corresponding numbers and putting the corresponding subjects in the appropriate four groups: sample(x=20, size=20, replace=FALSE).
In the case of discrete confounders it is possible to split your sample into subgroups accord-
ing to these pre-defined factors. These subgroups are often called blocks (when controllable)
or strata (when not). To randomize, randomized complete block design (RCBD) or stratified
randomization is used.
In RCBD each block receives the same number of subjects. Within each block this can again be achieved by numbering the subjects, randomly drawing the corresponding numbers and putting the corresponding subjects in the appropriate groups: sample(x=20, size=20, replace=FALSE).
Of course, the corresponding sample sizes are determined a priori. Finally, randomization also protects against spurious correlations in the observations.
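A minimal sketch contrasting complete and constrained randomization of 20 subjects into four groups (variable names are illustrative):

# Complete randomization: group sizes may be unbalanced.
set.seed( 12)
table( sample( x=4, size=20, replace=TRUE))

# Constrained randomization: a random permutation split into four groups of five.
perm <- sample( x=20, size=20, replace=FALSE)
split( perm, rep( 1:4, each=5))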
Example 12.3. Suppose we are studying the effect of irrigation amount and fertilizer type
on crop yield. We have access to eight fields, which can be treated independently and without
proximity effects. If applying irrigation and fertilizer is equally easy, we can use a complete
2 × 2 factorial design and assign levels of both factors randomly to fields in a balanced way (each
combination of factor levels is equally represented).
Alternatively, the following options are possible and are further illustrated in Figure 12.2.
In CRD, levels of irrigation and fertilizer are assigned to plots of land (experimental units) in a
random and balanced fashion. In RCBD, similar experimental units are grouped (for example, by
field) into blocks and treatments are distributed in a CRD fashion within the block. If irrigation
is more difficult to vary on a small scale and fields are large enough to be split, a split plot
design becomes appropriate. Irrigation levels are assigned to whole plots by CRD and fertilizer
is assigned to subplots using RCBD (irrigation is the block). Finally, if the fields are large
enough, they can be used as blocks for two levels of irrigation. Each field is composed of two
whole plots, each composed of two subplots. Irrigation is assigned to whole plots using RCBD
(blocked by field) and fertilizer assigned to subplots using RCBD (blocked by irrigation). ♣
Figure 12.2: Different randomization of eight fields. CRD (a), RCBD (b) and split
plot CRD (c) and RCBD (d). Source ??.
In many experiments the subjects are inherently heterogeneous with respect to factors
that we are not interested in. This heterogeneity may imply variability in the data masking the
effect we would like to study. Blocking is a technique for dealing with this nuisance heterogeneity.
Hence, we distinguish between the treatment factors that we are interested in and the nuisance
factors which have some effect on the response but are not of interest to us.
The term blocking comes from agricultural experiments, where it designated a set of plots of land that have very similar characteristics with respect to crop yield; in other words, they are homogeneous.
If a nuisance factor is known and controllable, we use blocking and control for it by including a blocking factor in the experiment. Typical blocking factors are sex, factory or production batch. These are controllable in the sense that we are able to choose which levels to include.
If a nuisance factor is known but uncontrollable, we may use the concept of ANCOVA, i.e., remove the effect of the factor in the analysis. Suppose that age has an effect on the treatment. It is not possible to control for age and creating age batches may not be efficient either. Hence we include age in our model. This approach is less efficient than blocking, as we only correct for the factor in the analysis instead of designing the experiment to account for it.
Unfortunately, there are also unknown and uncontrollable nuisance factors. To protect against these we use proper randomization such that their impact is balanced across all groups. Hence, we can see randomization as an insurance against systematic biases due to nuisance factors.
184 CHAPTER 12. DESIGN OF EXPERIMENTS
Simple examples of RCBD are Example 10.2 and Exercise 1. Treatment type is the main “Treatment” and we control for, e.g., season, population size, etc.
Figure 12.4: (a) A crossed design examines every combination of levels for each fixed
factor. (b) Nested design can progressively subreplicate a fixed factor with nested levels
of a random factor that are unique to the level within which they are nested. (c) If
a random factor can be reused for different levels of the treatment, it can be crossed
with the treatment and modeled as a block. (d) A split plot design in which the fixed
effects (tissue, drug) are crossed (each combination of tissue and drug are tested) but
themselves nested within replicates. Source from ?.
Figure 12.5: (a) A two-factor, split plot animal experiment design. The whole plot is
represented by a mouse assigned to drug, and tissues represent subplots. (b) Biological
variability coming from nuisance factors, such as weight, can be addressed by blocking
the whole plot factor, whose levels are now sampled using RCBD. (c) With three factors,
the design is split-split plot. The housing unit is the whole plot experimental unit, each
subject to a different temperature. Temperature is assigned to housing using CRD.
Within each whole plot, the design shown in b is performed. Drug and tissue are
subplot and sub-subplot units. Replication is done by increasing the number of housing
units. Source from ?
Often a treatment is compared to an existing one and the aim is to show that it is at least as good as or equivalent to the existing one. In such a situation it is not appropriate to state H0 : µE = µN, then to compare the mean (effect) of the existing treatment with the new one and, finally, in case of failure to reject, to claim that the two are equivalent.
We need to reformulate the alternative hypothesis, stating that the effects are equivalent.
A randomized controlled trial (RCT) is a study in which people are allocated at random (by chance alone) to receive one of several clinical interventions. One of these interventions is the standard of comparison or control. The control may be a standard practice, a placebo (“sugar pill”), or no intervention at all. Someone who takes part in a randomized controlled trial (RCT) is called a participant or subject. RCTs seek to measure and compare the outcomes after the participants
receive the interventions. Because the outcomes are measured, RCTs are quantitative studies.
In sum, RCTs are quantitative, comparative, controlled experiments in which investigators
study two or more interventions in a series of individuals who receive them in random order.
The RCT is one of the simplest and most powerful tools in clinical research.
An intervention is a process to which a group of subjects (or experimental units) is subjected, such as a surgical procedure, a drug injection, or some other form of treatment.
Control has several different uses in design. First, an experiment is controlled because we as experimenters assign treatments to experimental units. Otherwise, we would have an observational study. Second, a control treatment is a “standard” treatment that is used as a baseline or basis of comparison for the other treatments. This control treatment might be the treatment in common use, or it might be a null treatment (no treatment at all). For example, a study of new pain killing drugs could use a standard pain killer as a control treatment, or a study on the efficacy of fertilizer could give some fields no fertilizer at all. This would control for average soil fertility or weather conditions.
Placebo is a null treatment that is used when the act of applying a treatment, any treatment, has an effect. Placebos are often used with human subjects, because people often respond to any treatment: for example, reduction in headache pain when given a sugar pill. Blinding is important when placebos are used with human subjects. Placebos are also useful for nonhuman subjects. The apparatus for spraying a field with a pesticide may compact the soil. Thus we drive the apparatus over the field, without actually spraying, as a placebo treatment.
Factors combine to form treatments. For example, the baking treatment for a cake involves a given time at a given temperature. The treatment is the combination of time and temperature, but we can vary the time and temperature separately. Thus we speak of a time factor and a temperature factor. Individual settings for each factor are called levels of the factor.
Confounding occurs when the effect of one factor or treatment cannot be distinguished from
that of another factor or treatment. The two factors or treatments are said to be confounded.
Except in very special circumstances, confounding should be avoided. Consider planting corn
variety A in Minnesota and corn variety B in Iowa. In this experiment, we cannot distinguish
location effects from variety effects—the variety factor and the location factor are confounded.
Blinding occurs when the evaluators of a response do not know which treatment was given to which unit. Blinding helps prevent bias in the evaluation, even unconscious bias from well-intentioned evaluators. Double blinding occurs when both the evaluators of the response and the (human subject) experimental units do not know the assignment of treatments to units.
Blinding the subjects can also prevent bias, because subject responses can change when subjects
have expectations for certain treatments.
Before a new drug is admitted to the market, many steps are necessary: starting from a discovery-based step toward highly standardized clinical trials (phases I, II and III). At the very end, there are typically randomized controlled trials, which by design (should) eliminate all possible confounders.
At later steps, when searching for an appropriate drug, we may base the decision on available “evidence”: what has been used in the past, what has been shown to work (in similar situations). This is part of evidence-based medicine. Past information may be of varying quality, ranging from ideas and opinions to case studies to RCTs or systematic reviews. Figure 12.6 represents a so-called
evidence-based medicine pyramid which reflects the quality of research designs (increasing) and
quantity (decreasing) of each study design in the body of published literature (from bottom
to top). For other scientific domains, similar pyramids exist, with bottom and top typically
remaining the same.
Problem 12.2 (Sample size calculation) Suppose we compare the mean of some treatment
in two equally sized groups. Let zγ denote the γ-quantile of the standard normal distribution.
Furthermore, the following properties are assumed to be known or fixed:
• clinically relevant difference ∆ = µ1 − µ0 , we can assume without loss of generality that
∆>0
• Power 1 − β.
i) Write down the suitable test statistic and its distributions under the null hypothesis.
ii) Derive an expression for the power using the test statistic.
iii) Prove analytically that the required sample size n in each group is at least
n = \frac{2\sigma^2 (z_{1-\beta} + z_{1-\alpha/2})^2}{\Delta^2}.
Problem 12.3 (Sample size and group allocation) A randomized clinical trial to compare
treatment A to treatment B is being conducted. To this end 20 patients need to be allocated to
the two treatment arms.
i) Using R randomize the 20 patients to the two treatments with equal probability. Repeat
the randomization in total a 1000 times retaining the difference in group size and visualize
the distribution of the differences with a histogram.
ii) In order to obtain group sizes that are closer while keeping randomization codes secure a
random permuted block design with varying block sizes 2 and 4 and respective probabilities
0.25 and 0.75 is now to be used. Here, for a given length, each possible block of equal
numbers of As and Bs is chosen with equal probability. Using R randomize the 20 patients
to the two treatments using this design. Repeat the randomization in total a 1000 times
retaining the difference in group size. What are the possible values this difference may take?
How often did these values occur?
Chapter 13
A Closer Look: Monte Carlo Methods
Several examples in Chapter 11 resulted in the same posterior and prior distributions albeit
with different parameters, for example for binomial data with a beta prior, the posterior is again
beta. This was no coincidence; rather, we chose so-called conjugate priors based on our likelihood
(distribution of the data).
With other prior distributions, we may have “complicated”, non-standard posterior distributions, for which we no longer know the normalizing constant and thus, in general, the expected value or any other moment. Theoretically, we could derive the normalizing constant and then the expectation (via integration). The calculation of these two integrals is often complex and so here we consider classic simulation procedures as a solution to this problem. In general, so-called Monte Carlo simulation is used to numerically solve a complex problem through repeated random sampling.
In this chapter, we start with illustrating the power of Monte Carlo simulation where we
utilize, above all, the law of large numbers. We then discuss one method to draw a sample from
an arbitrary density and, finally, illustrate a method to derive (virtually) arbitrary posterior
densities by simulation.
Hence, g(x) cannot be entirely arbitrary, but such that the integral is well defined. An approximation of this integral is (along the idea of the method of moments)

E\bigl(g(X)\bigr) = \int_{\mathbb{R}} g(x)\, f_X(x)\, dx \approx \widehat{E\bigl(g(X)\bigr)} = \frac{1}{n} \sum_{i=1}^{n} g(x_i),   (13.2)
where x1 , . . . , xn is a random sample of fX (x). The method relies on the law of large numbers
(see Section 2.7).
Example 13.1. To estimate the expectation of a $\chi^2_1$ random variable we can use mean( rnorm( 100000)^2), yielding 1 with a couple of digits of precision, close to what we expect according to Equation (2.42).
Of course, we can use the same approach to calculate arbitrary moments of a $\chi^2_n$ or $F_{n,m}$ distribution. ♣
We now discuss this justification in slightly more detail. We consider a continuous real function g and the integral $I = \int_a^b g(x)\, dx$. There exists a value ξ such that I = (b − a) g(ξ) (often termed the mean value theorem for definite integrals). We do not know ξ nor g(ξ), but we hope that the “average” value of g is close to g(ξ). More formally, let $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathcal{U}(a, b)$, which we use to calculate the average (the density of $X_i$ is $f_X(x) = 1/(b − a)$ over the interval [a, b] and zero elsewhere). We now show that on average, our approximation is correct:
E(\hat I) = E\Bigl( (b-a) \frac{1}{n} \sum_{i=1}^{n} g(X_i) \Bigr) = (b-a) \frac{1}{n} \sum_{i=1}^{n} E\bigl(g(X_i)\bigr) = (b-a)\, E\bigl(g(X)\bigr)

= (b-a) \int_a^b g(x)\, f_X(x)\, dx = (b-a) \int_a^b g(x) \frac{1}{b-a}\, dx = \int_a^b g(x)\, dx = I.   (13.3)
We can generalize this to almost arbitrary densities fX (x) having a sufficiently large support:
\hat I = \frac{1}{n} \sum_{i=1}^{n} \frac{g(x_i)}{f_X(x_i)},   (13.4)
where the justification is as in (13.3). The density in the denominator takes the role of an
additional weight for each term.
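As a minimal sketch of (13.4), we can approximate $\int_{-\infty}^{\infty} \exp(-x^2)\, dx = \sqrt{\pi}$ by sampling from a standard normal density and weighting by it (the choice of integrand and sampling density is illustrative):

# Monte Carlo approximation of the integral of exp(-x^2), following (13.4).
set.seed( 13)
x <- rnorm( 100000)
g <- exp( -x^2)
mean( g / dnorm( x))      # approximately 1.77, compare with sqrt(pi) = 1.7725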
Similarly, to integrate over a rectangle R in two dimensions (or a cuboid in three dimensions,
etc.), we use a uniform random variable for each dimension. More specifically, let R = [a, b]×[c, d]
then
\int_R g(x, y)\, dx\, dy = \int_a^b \int_c^d g(x, y)\, dx\, dy \approx (b-a)(d-c) \frac{1}{n} \sum_{i=1}^{n} g(x_i, y_i),   (13.5)
random vector having a density fX,Y (x, y) whose support contains A. For example we define a
rectangle R such that A ⊂ R and let fX,Y (x, y) = (b − a)(d − c) over R and zero otherwise. We
define the indicator function 1A (x, y) that is one if (x, y) ∈ A and zero otherwise. Then we have
the general formula
\int_A g(x, y)\, dx\, dy = \int_a^b \int_c^d 1_A(x, y)\, g(x, y)\, dx\, dy \approx \frac{1}{n} \sum_{i=1}^{n} 1_A(x_i, y_i) \frac{g(x_i, y_i)}{f_{X,Y}(x_i, y_i)}.   (13.6)
Example 13.2. Consider the bivariate normal density specified in Example 7.1 and suppose we are interested in evaluating the probability P(X > Y²). To approximate this probability we can draw a large sample from the bivariate normal density and calculate the proportion for which $x_i > y_i^2$, as illustrated in R-Code 13.1, yielding 10.47%.
In this case, the function g is the density with which we are drawing the data points. Hence, Equation (13.6) reduces to calculating the proportion of the data satisfying $x_i > y_i^2$. ♣
R-Code 13.1 Calculating probability with the aid of a Monte Carlo simulation
set.seed( 14)
require(mvtnorm)
l.sample <- rmvnorm( 10000, mean=c(0,0), sigma=matrix( c(1,2,2,5), 2))
mean( l.sample[,1] > l.sample[,2]^2)
## [1] 0.1047
Example 13.3. The area of the unit circle is π, as is the volume of a cylinder of height one placed over the unit circle. To estimate π we estimate the volume of this cylinder and we consider U(−1, 1) for both coordinates, i.e., a square that contains the unit circle. The function g(x, y) = 1 is the constant function and $1_A(x, y)$ is the indicator function of the set $x^2 + y^2 \le 1$. We have the following approximation of the number π:

\pi = \int_{-1}^{1} \int_{-1}^{1} 1_A(x, y)\, dx\, dy \approx 4\, \frac{1}{n} \sum_{i=1}^{n} 1_A(x_i, y_i),   (13.7)
where xi and yi , i = 1 . . . , n are two independent random samples from U(−1, 1). Equation (13.6)
reduces to calculate a proportion again.
It is important to note that the convergence is very slow, see Figure 13.1. It can be shown that the rate of convergence is of the order 1/\sqrt{n}. ♣
R-Code 13.2 Approximation of π with the aid of Monte Carlo integration. (See Fig-
ure 13.1.)
set.seed( 14)
m <- 49                                        # number of different sample sizes
n <- round( 10 + 1.4^(1:m))                    # sample sizes, roughly geometrically increasing
piapprox <- numeric( m)
for (i in 1:m) {
  st <- matrix( runif( 2*n[i]), ncol=2)        # n[i] uniform points in the unit square
  piapprox[i] <- 4*mean( rowSums( st^2) <= 1)  # proportion inside the quarter circle, times 4
}
plot( n, abs( piapprox-pi)/pi, log='xy', type='l')   # relative error
lines( n, 1/sqrt(n), col=2, lty=2)             # reference rate 1/sqrt(n)
sel <- (1:7)*7                                 # show every seventh value
cbind( n=n[sel], pi.approx=piapprox[sel], rel.error=abs( piapprox[sel]-pi)/pi,
    abs.error=abs( piapprox[sel]-pi))
## n pi.approx rel.error abs.error
## [1,] 21 2.4762 0.21180409 0.66540218
## [2,] 121 3.0083 0.04243968 0.13332819
## [3,] 1181 3.1634 0.00694812 0.02182818
## [4,] 12358 3.1662 0.00783535 0.02461547
## [5,] 130171 3.1403 0.00040166 0.00126186
## [6,] 1372084 3.1424 0.00025959 0.00081554
## [7,] 14463522 3.1406 0.00032656 0.00102592
Figure 13.1: Convergence of the approximation for π: the relative error as a function
of n. (See R-Code 13.2.)
In practice, more efficient “sampling” schemes are used. More specifically, we do not sample uniformly but in a deliberately “stratified” manner. There are several reasons to sample in a randomly stratified way, but the discussion is beyond the scope of this work.
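To convey the idea only (a minimal added sketch, not an efficient implementation; the grid size k is an arbitrary choice), one can stratify the unit square into k² cells and draw one uniform point per cell to approximate π:

set.seed( 14)
k <- 100                                             # k^2 strata over the unit square
lower <- expand.grid( x=(1:k - 1)/k, y=(1:k - 1)/k)  # lower-left corners of the cells
x <- lower$x + runif( k^2)/k                         # one uniform draw within each cell
y <- lower$y + runif( k^2)/k
4 * mean( x^2 + y^2 <= 1)                            # stratified estimate of pi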
13.2 Rejection Sampling

Suppose we want to draw a sample from a density f_Y(y) = c · f*(y), where the normalizing constant c may be unknown. We choose a proposal density f_Z(y) and a constant m such that f*(y) ≤ m · f_Z(y) for all y. In Step 1, we draw a proposal ỹ from f_Z(y) and, independently, u from U(0, 1). In Step 2, the proposal is accepted as a realization from f_Y(y) if u < f*(ỹ)/(m · f_Z(ỹ)); otherwise it is rejected and no longer considered. We cycle along Steps 1 and 2 until a sufficiently large sample has been obtained. The algorithm is illustrated in the following example.
Example 13.4. The goal is to draw a sample from a Beta(6, 3) distribution with the rejection sampling method. That means f_Y(y) = c · y^{6−1}(1 − y)^{3−1} and f*(y) = y^5 (1 − y)^2. As proposal density we use a uniform distribution, hence f_Z(y) = 1_{[0,1]}(y). We select m = 0.02, which fulfills the condition f*(y) ≤ m · f_Z(y), since the maximum of f*, obtained with optimize( function(x) x^5*(1-x)^2, c(0, 1), maximum=TRUE)$objective, is roughly 0.0152 < 0.02.
An implementation of the example is given in R-Code 13.3. Of course, f_Z is always one here. The R code could be optimized with respect to speed; it would then, however, be more difficult to read.
Figure 13.2 shows a histogram and the density of the simulated values. By construction, the bars of the target density are smaller than those of the proposal density. In this particular example, the accepted sample has size 285. ♣
R-Code 13.3: Rejection sampling in the setting of a beta distribution. (See Figure 13.2.)
set.seed( 14)
n.sim <- 1000
m <- 0.02
fst <- function(y) y^( 6-1) * (1-y)^(3-1)
f_Z <- function(y) ifelse( y >= 0 & y <= 1, 1, 0)
result <- sample <- rep( NA, n.sim)
for (i in 1:n.sim){
  sample[i] <- runif(1)                                # ytilde, proposal
  u <- runif(1)                                        # u, uniform
  if( u < fst( sample[i]) /( m * f_Z( sample[i])) )    # if accept ...
    result[i] <- sample[i]                             # ... keep
}
mean( !is.na(result)) # proportion of accepted samples
## [1] 0.285
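To compare the accepted values with the target density in the spirit of Figure 13.2 and to check the acceptance rate against its theoretical value \int f^*(y) \, dy / m (an added sketch, using the objects result and m from R-Code 13.3):

hist( result[ !is.na( result)], prob=TRUE, main="", xlab="y")  # accepted values only
curve( dbeta( x, 6, 3), add=TRUE, col=4)                       # target Beta(6,3) density
beta( 6, 3) / m                                                # theoretical acceptance rate, roughly 0.298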
Figure 13.2: On the left we have a histogram of the simulated values of fZ (y) (light
blue) and fY (y) (dark blue). On the right the theoretical density (truth) and the
simulated density (smoothed empirical) are drawn. (See R-Code 13.3.)
For efficiency reasons the constant m should be chosen as small as possible, as this reduces the number of rejections. Even so, while rejection sampling is intuitive, in practice it is often very inefficient. The next section illustrates an approach well suited for complex Bayesian models.
13.3 Gibbs Sampling

In many cases one does not have to program a Gibbs sampler oneself but can use a pre-programmed sampler. We use the sampler JAGS (Just Another Gibbs Sampler) (Plummer, 2003) with the R interface package rjags (Plummer, 2016).
R-Codes 13.4, 13.5 and 13.6 give a short but practical overview of Markov chain Monte Carlo methods with JAGS in the case of a simple Gaussian likelihood. More complex models can easily be constructed based on the approach shown here.
When using MCMC methods, you may encounter situations in which the sampler does not converge (or converges too slowly). In such a case the posterior distribution cannot be approximated with the simulated values. It is therefore important to examine the simulated values for eye-catching patterns. For example, the so-called trace plot, showing the observations as a function of their index, as illustrated in the right panel of Figure 13.3, is often used.
Example 13.5. R-Code 13.4 implements the normal-normal model for a single observation, y = 1, n = 1, with known variance σ² = 1.1, and a normal prior for the mean µ:

Y \mid \mu \sim \mathcal{N}(\mu, 1.1),
\mu \sim \mathcal{N}(0, 0.8).
The basic approach to using JAGS is to first create a file containing the Bayesian model definition. This file is then transcribed into a model graph (function jags.model()), from which we can finally draw samples (coda.samples()).
Defining a model for JAGS is quite straightforward, as the notation is very close to that of R. Some care is needed when specifying variance parameters. In our notation, we typically use the variance σ², as in N(·, σ²); in R we have to specify the standard deviation σ via the argument sd in the function dnorm(..., sd=sigma); and in JAGS the second argument of dnorm() is the precision 1/σ², as in dnorm(mu, 1/sigma2).
The resulting samples are typically plotted with smoothed densities, as seen in the left panel of Figure 13.3, together with the prior and likelihood, if possible. The posterior is affected roughly equally by the likelihood (data) and the prior; its mean is close to the average of the prior mean and the observation. The prior is slightly tighter, as its variance is slightly smaller (0.8 vs. 1.1), but this does not seem to have a visual impact on the posterior. The setting here is identical to Example 11.3 and thus the posterior is again normal, N(0.8/(0.8 + 1.1), 0.8 · 1.1/(0.8 + 1.1)), see Equation (11.19). ♣
R-Code 13.4: JAGS sampler for normal-normal model, with n = 1. (See Figure 13.3.)
require( rjags)
writeLines("model { # File with Bayesian model definition
y ~ dnorm( mu, 1/1.1) # here Precision = 1/Variance
mu ~ dnorm( 0, 1/0.8) # Precision again!
}", con="jags01.txt")
jagsModel <- jags.model( "jags01.txt", data=list( 'y'=1)) # transcription
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
Figure 13.3: Left: empirical densities: MCMC based posterior (black), exact (red),
prior (blue), likelihood (green). Right: trace plot of the posterior µ | y = 1. (See
R-Code 13.4.)
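R-Code 13.4 shows the model transcription only; a minimal sketch of the remaining steps (drawing the samples and comparing them with the exact posterior of Example 11.3), assuming the object jagsModel from above, is:

postSamples <- coda.samples( jagsModel, 'mu', n.iter=2000)        # draw 2000 samples of mu
plot( density( as.matrix( postSamples)), main="")                 # MCMC based posterior
curve( dnorm( x, 0.8/(0.8+1.1), sqrt( 0.8*1.1/(0.8+1.1))), add=TRUE, col=2)  # exact posterior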
Example 13.6. R-Code 13.5 extends the normal-normal model to n = 10 observations, still with known variance:

Y_1, \ldots, Y_n \mid \mu \overset{\text{iid}}{\sim} \mathcal{N}(\mu, 1.1),    (13.10)
\mu \sim \mathcal{N}(0, 0.8).    (13.11)
We draw the data in R via rnorm( n, 1, sqrt(1.1)) and proceed similarly as in R-Code 13.4. Figure 13.4 gives the empirical and exact densities of the posterior, prior and likelihood and shows a trace plot as a basic graphical diagnostic tool. The density of the likelihood is that of \mathcal{N}(\bar{y}, 1.1/n), the prior density is based on (13.11) and the posterior density is based on (11.19). The latter simplifies considerably because we have η = 0 in (13.11).
As the number of observations increases, the data gets more “weight”: from (11.20), the weight increases from 0.8/(0.8 + 1.1) ≈ 0.42 to 0.8n/(0.8n + 1.1) ≈ 0.88. Thus, the posterior is “closer” to the likelihood but slightly more peaked. As the variance of the data and the variance of the prior are comparable, the prior has an impact on the posterior comparable to that of one additional observation with value zero. ♣
Figure 13.4: Left: empirical and exact densities of the posterior, prior and likelihood. Right: trace plot of the posterior. (See R-Code 13.5.)
R-Code 13.5: JAGS sampler for the normal-normal model, with n = 10. (See Figure 13.4.)
set.seed( 4)
n <- 10
obs <- rnorm( n, 1, sqrt(1.1))          # generate artificial data
writeLines("model {
  for (i in 1:n) {                      # define a likelihood for each
    y[i] ~ dnorm( mu, 1/1.1)            # individual observation
  }
  mu ~ dnorm( 0, 1/0.8)
}", con="jags02.txt")
jagsModel <- jags.model( "jags02.txt", data=list('y'=obs, 'n'=n), quiet=TRUE)
postSamples <- coda.samples( jagsModel, 'mu', n.iter=2000)
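As a quick added numerical check of the weights discussed above (a sketch assuming the normal-normal posterior of Equation (11.19) and the objects n and obs from R-Code 13.5), the exact posterior mean and variance are:

w <- 0.8*n/(0.8*n + 1.1)                 # weight of the data, approximately 0.88
c( post.mean=w*mean( obs),               # the prior mean is zero, hence no second term
   post.var=0.8*1.1/(0.8*n + 1.1))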
Example 13.7. In this last example, we consider an extension of the previous example by including an unknown variance, or equivalently an unknown precision. That means that we now specify two prior distributions; we have a priori no knowledge of the posterior and cannot compare the empirical posterior density with a true (bivariate) density (as we did with the red densities in Figures 13.3 and 13.4).
R-Code 13.6 implements the following model in JAGS:

Y_i \mid \mu, \kappa \overset{\text{iid}}{\sim} \mathcal{N}(\mu, 1/\kappa), \quad i = 1, \ldots, n, \text{ with } n = 10,    (13.12)
\mu \sim \mathcal{N}(\eta, 1/\lambda), \quad \text{with } \eta = 0, \ \lambda = 1.25,    (13.13)
\kappa \sim \text{Gamma}(\alpha, \beta), \quad \text{with } \alpha = 1, \ \beta = 0.2.    (13.14)
For more flexibility with the code, we also pass the hyper-parameters η, λ, α, β to the JAGS MCMC engine.
Figure 13.5 gives the marginal empirical posterior densities of µ and κ, as well as the priors (based on (13.13) and (13.14)) and likelihoods (based on (13.12)).
The likelihood for µ is \mathcal{N}(\bar{y}, s^2/n), i.e., we have replaced the parameters in the model with their unbiased estimates. For κ, it is a Gamma distribution with parameters n/2 + 1 and n s^2/2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 / 2, see Problem 11.1 i).
Note that this is another classical example; with a very careful specification of the priors, we can construct a closed form posterior density. Problem 13.2 gives a hint towards this more advanced topic. ♣
R-Code 13.6: JAGS sampler for priors on mean and precision parameter, with n = 10.
(See Figure 13.5.)
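A minimal JAGS implementation consistent with (13.12)–(13.14), passing the hyper-parameters as data, could look as follows (a sketch only; the file name jags03.txt is illustrative and the data obs and n are taken from R-Code 13.5):

writeLines("model {
  for (i in 1:n) {
    y[i] ~ dnorm( mu, kappa)             # kappa is the precision
  }
  mu ~ dnorm( eta, lambda)               # prior on the mean, precision lambda
  kappa ~ dgamma( alpha, beta)           # prior on the precision
}", con="jags03.txt")
jagsModel <- jags.model( "jags03.txt", quiet=TRUE, data=list( 'y'=obs, 'n'=n,
    'eta'=0, 'lambda'=1.25, 'alpha'=1, 'beta'=0.2))
postSamples <- coda.samples( jagsModel, c('mu', 'kappa'), n.iter=2000)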
Figure 13.5: marginal empirical posterior densities of µ and κ, together with the corresponding priors and likelihoods. (See R-Code 13.6.)
Note that the model definition files written with writeLines() (e.g., jags01.txt) remain in the working directory and may be cleaned up after the analysis.
An alternative to JAGS is BUGS (Bayesian inference Using Gibbs Sampling), which is distributed in two main versions: WinBUGS and OpenBUGS, see also Lunn et al. (2012). Additionally, there is the R interface package R2WinBUGS (Sturtz et al., 2005). Other possibilities are the Stan or INLA engines, with convenient user interfaces to R through rstan and INLA (Gelman et al., 2015; Rue et al., 2009; Lindgren and Rue, 2015).
The list of textbooks discussing MCMC is long and extensive. Held and Sabanés Bové (2014) give some basic and accessible ideas. Accessible examples for actual implementations can be found in Kruschke (2015) (JAGS and Stan) and Kruschke (2010) (BUGS).
We use inverse transform sampling, which is well suited for distributions whose cdf is easily
invertible.
i) Find c such that fX (x) is an actual pdf (two points are to be checked).
ii) Assume an arbitrary cumulative distribution function F (x) with an existing inverse F −1 (p) =
Q(p) (quantile function). Show that the random variable X = F −1 (U ), where U ∼ U(0, 1),
has cdf F (x).
iii) Without using the functions rexp and qexp, implement your own code to simulate from
an exponential distribution of rate λ > 0.
v) Check the correctness of your sampler(s); you can use, e.g., hist(..., prob=TRUE) and/or QQ-plots.
Problem 13.2 (⋆ Normal-normal-gamma model) Let Y_1, Y_2, \ldots, Y_n \mid \mu, \kappa \overset{\text{iid}}{\sim} \mathcal{N}(\mu, 1/\kappa). Instead of independent priors on µ and κ, we propose a joint prior density that can be factorized into the density of κ and that of µ | κ. We assume κ ∼ Gamma(α, β) and µ | κ ∼ N(η, 1/(κν)), for some hyper-parameters η, ν > 0, α > 0, and β > 0. This distribution is a so-called normal-gamma distribution, denoted by NΓ(η, ν, α, β).
i) Create an artificial dataset consisting of Y_1, \ldots, Y_n \overset{\text{iid}}{\sim} \mathcal{N}(1, 1), with n = 20.
ii) Write a function called dnormgamma() that calculates the density at mu, kappa based on the
parameters eta, nu, alpha, beta. Visualize the bivariate density based on η = 1, ν = 1.5,
α = 1, and β = 0.2.
iii) Set up a Gibbs sampler for the values η = 0, ν = 1.5, α = 1, and β = 0.2. For a sample of length 2000, illustrate the (empirical) joint posterior density of µ, κ | y_1, . . . , y_n.
iv) It can be shown that the posterior is again normal-gamma with parameters

\eta_{\text{post}} = \frac{1}{n+\nu}\,(n\bar{y} + \nu\eta), \qquad \nu_{\text{post}} = \nu + n,    (13.15)
\alpha_{\text{post}} = \alpha + \frac{n}{2}, \qquad \beta_{\text{post}} = \beta + \frac{1}{2}\Bigl( (n-1)s^2 + \frac{n\nu(\eta - \bar{y})^2}{n+\nu} \Bigr),    (13.16)

where s² is the usual unbiased estimate of σ². Superimpose the true isolines of the normal-gamma prior and posterior density on the plot from the previous problem.
Appendix A

Software Environment R

R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques. It compiles and runs on a wide variety of operating systems (Windows, Mac, and Linux); its central entry point is https://www.r-project.org.
The R software can be downloaded from CRAN (Comprehensive R Archive Network), https://cran.r-project.org, a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. Figure A.1 shows a screenshot of the web page.
R is console based, meaning that individual commands have to be typed. It is very important to save these commands as we construct a reproducible workflow – the big advantage over a “click-and-go” approach. We strongly recommend using a graphical, integrated de-
velopment environment (IDE) for R. The prime choice these days is RStudio. RStudio includes
a console, syntax-highlighting editor that supports direct code execution, as well as tools for
plotting, history, debugging and workspace management, see Figure A.2.
RStudio is available in a desktop open source version for many different operating systems
(Windows, Mac, and Linux) or in a browser connected to an RStudio Server. There are several
providers of such servers, including rstudio.math.uzh.ch for the students of the STA120 lecture.
Figure A.2: RStudio screenshot. The four panels shown are (clockwise, starting top left): (i) console, (ii) plots, (iii) environment, (iv) script.
The installation of all software components is quite straightforward, but the look of the download page may change from time to time and the precise steps may vary a bit. Some examples are given in the accompanying videos.
The biggest advantage of using R is the support from and for a huge user community. A seemingly endless number of packages cover almost every statistical task, often implemented by several authors. The packages are documented and, through the upload to CRAN, held to a minimum level of documentation, coding standards, (unit) testing, etc. There are several forums (e.g., the R mailing lists, Stack Overflow with tag “r”) to get additional help, see https://www.r-project.org/help.html.
Appendix B
Calculus
In this chapter we present some of the most important ideas and concepts of calculus. It is impossible to give a formal, mathematically precise exposition here, and we cannot present all rules, identities, guidelines or even tricks; for example, we do not discuss sequences and series.
B.1 Functions
We start with one of the most basic concepts, a formal definition that describes a relation between
two sets.
Definition B.1. A function f from a set D to a set W is a rule that assigns a unique element f(x) ∈ W to each element x ∈ D. We write

f : D → W    (B.1)
x ↦ f(x).    (B.2)

The set D is called the domain, the set W is called the range (or target set or codomain). The graph of a function f is the set {(x, f(x)) : x ∈ D}.
♦
The function will not necessarily map to every element in W , and there may be several
elements in D with the same image in W . These functions are characterized as follows.
Definition B.2. i) A function f is called injective, if the image of two different elements in
D is different.
ii) A function f is called surjective, if for every element y in W there is at least one element
x in D such that y = f (x).
iii) A function f is called bijective if it is surjective and injective. Such a function is also called
a one-to-one function. ♦
In general, there is virtually no restriction on the domain and codomain. However, we often
work with real functions, i.e., D ⊂ R and W ⊂ R.
There are many different characterizations of functions. Some relevant ones are as follows. A function f is called:

i) periodic if there exists an ω > 0 such that f(x + ω) = f(x) for all x ∈ D; the smallest such ω is called the period of f;

ii) increasing if f(x) ≤ f(x + h) for all h ≥ 0. In case of strict inequalities, we call the function strictly increasing. Similar definitions hold when reversing the inequalities. ♦
For a bijective function f, the inverse function f^{-1} is defined by

f^{-1} : W → D
y ↦ f^{-1}(y), \quad \text{such that } y = f\bigl( f^{-1}(y) \bigr).    (B.3)
To capture the behavior of a function locally, say at a point x0 ∈ D, we use the concept of a
limit.
The latter definition does not assume that the function is defined at x0 .
It is possible to define “directional” limits, in the sense that x approaches x_0 from above (from the right side) or from below (from the left side). These limits are denoted by

\lim_{x \to x_0^{+}} f(x) = \lim_{x \searrow x_0} f(x) \quad \text{for the former; or} \qquad \lim_{x \to x_0^{-}} f(x) = \lim_{x \nearrow x_0} f(x) \quad \text{for the latter.}    (B.4)
We are used to interpreting graphs, and when we sketch an arbitrary function we often use a single, continuous line. This concept of not lifting the pen while sketching is formalized as follows and linked directly to limits, introduced above.
There are many other approaches to define continuity, for example in terms of neighborhoods or in terms of limits of sequences.
Another very important (local) characterization of a function is the derivative, which quan-
tifies the (infinitesimal) rate of change.
Definition B.6. The derivative of a function f(x) with respect to the variable x at the point x_0 is defined by

f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h},    (B.6)

provided the limit exists. We also write \frac{df(x_0)}{dx} = f'(x_0). If the derivative exists for all x_0 ∈ D, the function f is differentiable. ♦
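As a small added illustration of Definition B.6, take f(x) = x^2:

f'(x_0) = \lim_{h \to 0} \frac{(x_0 + h)^2 - x_0^2}{h} = \lim_{h \to 0} \frac{2 x_0 h + h^2}{h} = \lim_{h \to 0} (2 x_0 + h) = 2 x_0 .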
ii) (Mean value theorem) For a continuous function f : [a, b] → R which is differentiable on (a, b), there exists a point ξ ∈ (a, b) such that f'(\xi) = \frac{f(b) - f(a)}{b - a}.
The integral of a (positive) function quantifies the area between the function and the x-axis. A mathematical definition is a bit more complicated.

Definition B.7. Let f(x) : D → R be a function and [a, b] ⊂ D a finite interval such that |f(x)| < ∞ for x ∈ [a, b]. For any n, let t_0 = a < t_1 < \cdots < t_n = b be a partition of [a, b]. The integral of f from a to b is defined as

\int_a^b f(x) \, dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(t_i)(t_i - t_{i-1}).    (B.7)
For non-finite a and b, the definition of the integral can be extended via limits.
Property B.2. (Fundamental theorem of calculus (I)). Let f : [a, b] → R be continuous. For all x ∈ [a, b], let F(x) = \int_a^x f(u) \, du. Then F is continuous on [a, b], differentiable on (a, b) and F'(x) = f(x), for all x ∈ (a, b).
The function F is often called the antiderivative of f . There exists a second form of the
previous theorem that does not assume continuity of f but only Riemann integrability, that
means that an integral exists.
Property B.3. (Fundamental theorem of calculus (II)). Let f : [a, b] → R and let F be such that F'(x) = f(x), for all x ∈ (a, b). If f is Riemann integrable, then \int_a^b f(u) \, du = F(b) - F(a).
There are many ‘rules’ to calculate integrals. One of the most used ones is called integration
by substitution and is as follows.
Property B.4. Let I be an interval and ϕ : [a, b] → I be a differentiable function with integrable derivative. Let f : I → R be a continuous function. Then

\int_{\varphi(a)}^{\varphi(b)} f(u) \, du = \int_a^b f(\varphi(x)) \, \varphi'(x) \, dx.    (B.8)
B.2 Functions in Higher Dimensions

We denote with R^m the vector space with elements x = (x_1, \ldots, x_m)^⊤, called vectors, equipped with the standard operations. We will discuss vectors and vector notation in more detail in the subsequent chapter.
A natural extension of a real function is as follows. The set D is a subset of R^m and thus we write

f : D ⊂ R^m → W
x ↦ f(x).    (B.9)
The partial derivative of f with respect to x_i, denoted ∂f(x)/∂x_i, is the derivative of f seen as a function of the single component x_i, the other components being held fixed. These are collected in the vector

f'(\boldsymbol{x}) = \Bigl( \frac{\partial f(\boldsymbol{x})}{\partial x_1}, \ldots, \frac{\partial f(\boldsymbol{x})}{\partial x_m} \Bigr)^{\!\top}    (B.11)

(provided it exists). ♦
Remark B.1. The existence of partial derivatives is not sufficient for the differentiability of the
function f . ♣
In a similar fashion, higher order derivatives can be calculated. For example, taking the derivative of each component of (B.11) with respect to all components yields a matrix with components

f''(\boldsymbol{x}) = \Bigl( \frac{\partial^2 f(\boldsymbol{x})}{\partial x_i \partial x_j} \Bigr)_{i,j},    (B.12)

for i, j = 1, \ldots, m; this matrix is often called the Hessian.
Property B.5. Let f : D → R with continuous derivatives up to order m + 1. Then there exists ξ ∈ [a, x] such that

f(x) = f(a) + f'(a)(x - a) + \frac{1}{2} f''(a)(x - a)^2 + \ldots + \frac{1}{m!} f^{(m)}(a)(x - a)^m + \frac{1}{(m+1)!} f^{(m+1)}(\xi)(x - a)^{m+1} .    (B.13)

We call (B.13) Taylor’s formula and the last term, often denoted by R_m(x), the remainder of order m. Taylor’s formula is an extension of the mean value theorem.
If the function has bounded derivatives, the remainder R_m(x) converges to zero as x → a. Hence, if the function is at least twice differentiable in a neighborhood of a, then

f(a) + f'(a)(x - a) + \frac{1}{2} f''(a)(x - a)^2    (B.14)

is the best quadratic approximation in this neighborhood.
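As a short added example, the quadratic approximation (B.14) of f(x) = \log(x) around a = 1 reads

\log(x) \approx \log(1) + \frac{1}{1}(x - 1) - \frac{1}{2 \cdot 1^2}(x - 1)^2 = (x - 1) - \frac{1}{2}(x - 1)^2 .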
Taylor’s formula can also be expressed for multivariate real functions. Without stating the precise assumptions, we consider here the following expansion

f(\boldsymbol{a} + \boldsymbol{h}) = \sum_{r=0}^{\infty} \; \sum_{i_1 + \cdots + i_n = r} \frac{1}{i_1! \, i_2! \cdots i_n!} \, \frac{\partial^r f(\boldsymbol{a})}{\partial x_1^{i_1} \cdots \partial x_n^{i_n}} \, h_1^{i_1} h_2^{i_2} \cdots h_n^{i_n} .    (B.16)
Appendix C

Linear Algebra

In this chapter we cover the most important aspects of linear algebra, which are mainly of a notational nature.
The n × n identity matrix I is defined as the matrix with ones on the diagonal and zeros elsewhere. We denote the vector consisting solely of ones by 1; similarly, 0 is a vector with only zero elements. A matrix with entries d_1, \ldots, d_n on the diagonal and zeros elsewhere is denoted by diag(d_1, \ldots, d_n), or diag(d_i) for short, and is called a diagonal matrix. Hence, I = diag(1).
To indicate the ijth element of A, we use (A)_{ij}. The transpose of a vector or a matrix flips its dimensions. When a matrix is transposed, i.e., when all rows of the matrix are turned into columns (and vice versa), the elements a_{ij} and a_{ji} are exchanged. Thus (A^⊤)_{ij} = (A)_{ji}. The vector x^⊤ = (x_1, \ldots, x_p) is termed a row vector. We work mainly with column vectors as shown in (C.1).
In the classical setting of real numbers, there is only one type of multiplication. As soon
as we have several dimensions, several different types of multiplications exist, notably scalar
multiplication, matrix multiplication and inner product (and actually more such as the vector
product, outer product).
Let A and B be two n × p and p × m matrices. Matrix multiplication AB is defined as

AB = C \quad \text{with} \quad (C)_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj} .    (C.2)
This last equation shows that the matrix I is the neutral element (or identity element) of the
matrix multiplication.
Definition C.1. The inner product between two p-vectors x and y is defined as x^⊤ y = \sum_{i=1}^{p} x_i y_i. There are several different notations used: x^⊤ y = ⟨x, y⟩ = x · y.
If for a square matrix A there exists a matrix B such that

AB = BA = I,    (C.3)

then the matrix B is uniquely determined by A and is called the inverse of A, denoted by A^{-1}.
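The notation above translates directly into R (a small added illustration; the numerical values are arbitrary):

A <- matrix( c(2, 1, 0, 3), nrow=2)   # a 2 x 2 matrix, filled column by column
x <- c(1, 2)
A %*% x                               # matrix multiplication
t( A)                                 # transpose
solve( A)                             # inverse of A
sum( x * x)                           # inner product x^T x, equivalently crossprod( x)
diag( 2)                              # 2 x 2 identity matrix I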
Definition C.2. A vector space over R is a set V with the following two operations:

i) + : V × V → V (vector addition);

ii) · : R × V → V (scalar multiplication).

Typically, V is R^p, p ∈ N.
In the following we assume a fixed d and the usual operations on the vectors.
Definition C.3. i) The vectors v_1, \ldots, v_k are linearly dependent if there exist scalars a_1, \ldots, a_k (not all equal to zero), such that a_1 v_1 + \cdots + a_k v_k = 0.

In a set of linearly dependent vectors, at least one vector can be expressed as a linear combination of the others.
Definition C.4. The set of vectors {b_1, \ldots, b_d} is a basis of a vector space V if the set is linearly independent and any other vector v ∈ V can be expressed as v = v_1 b_1 + \cdots + v_d b_d. ♦

ii) All bases of V have the same cardinality, which is called the dimension of V, dim(V).

iii) If there are two bases {b_1, \ldots, b_d} and {e_1, \ldots, e_d}, then there exists a d × d matrix A such that e_i = A b_i, for all i.
Definition C.6. Let A be an n × m matrix. The column rank of the matrix is the dimension
of the subspace that the m columns of A span and is denoted by rank(A). A matrix is said to
have full rank if rank(A) = m.
The row rank is the column rank of A> . ♦
C.3 Projections

We consider classical Euclidean vector spaces with elements x = (x_1, \ldots, x_p)^⊤ ∈ R^p with Euclidean norm ||\boldsymbol{x}|| = \bigl( \sum_i x_i^2 \bigr)^{1/2}.
To illustrate projections, consider the setup illustrated in Figure C.1, where y and a are two vectors in R². The subspace spanned by a is

\{ \lambda \boldsymbol{a} : \lambda \in \mathbb{R} \} = \{ \lambda \, \boldsymbol{a}/||\boldsymbol{a}|| : \lambda \in \mathbb{R} \},    (C.4)

where the second expression is based on a normalized vector a/||a||. By the (geometric) definition of the inner product (dot product),

\boldsymbol{a}^{\top} \boldsymbol{y} = ||\boldsymbol{a}|| \, ||\boldsymbol{y}|| \cos(\theta),    (C.5)

where θ is the angle between the vectors. Classical trigonometric properties state that the length of the projection onto the subspace spanned by a is ||y|| cos(θ). Hence, the projected vector is

\frac{\boldsymbol{a}}{||\boldsymbol{a}||} \, \frac{\boldsymbol{a}^{\top}}{||\boldsymbol{a}||} \, \boldsymbol{y} = \boldsymbol{a} (\boldsymbol{a}^{\top} \boldsymbol{a})^{-1} \boldsymbol{a}^{\top} \boldsymbol{y} .    (C.6)
In statistics we often encounter expressions like this last term. For example, ordinary least squares (“classical” multiple regression) is a projection of the vector y onto the column space of X, i.e., the space spanned by the columns of the matrix X. The projected vector is X(X^⊤X)^{-1}X^⊤ y. Usually, the column space is of much lower dimension than the space containing y.
Figure C.1: Projection of the vector y onto the subspace spanned by a; θ is the angle between the two vectors.
Remark C.1. Projection matrices (like H = X(X^⊤X)^{-1}X^⊤) have many nice properties, such as being symmetric, being idempotent, i.e., H = HH, having eigenvalues within [0, 1] (see next section), rank(H) = rank(X), etc. ♣
A scalar λ and a nonzero vector x satisfying

A x = \lambda x    (C.7)

are called an eigenvalue and an associated eigenvector of the square matrix A. We often denote the eigenvectors by γ_1, \ldots, γ_n. Let Γ be the matrix with columns γ_i, i.e., Γ = (γ_1, \ldots, γ_n). Then, for a symmetric matrix A, the eigenvectors can be chosen such that Γ^⊤Γ = I (orthogonality property) and Γ^⊤ A Γ = diag(λ_1, \ldots, λ_n). This last identity also implies that A = Γ diag(λ_1, \ldots, λ_n) Γ^⊤.
An arbitrary matrix B can be factorized as

B = U D V^{\top},    (C.9)

the so-called singular value decomposition (SVD), where U and V have orthonormal columns and D is a diagonal matrix containing the singular values. Besides the SVD there are many other matrix factorizations. We often use the so-called Cholesky factorization, as – to a certain degree – it generalizes the concept of a square root for matrices. Assume that all eigenvalues of the (symmetric) matrix A are strictly positive; then there exists a unique lower triangular matrix L with positive entries on the diagonal such that A = LL^⊤. There exist very efficient algorithms to calculate L, and solving large linear systems is often based on a Cholesky factorization.
The determinant of a square matrix essentially describes the change in “volume” that the associated linear transformation induces. The formal definition is quite complex, but the determinant can be written as \det(A) = \prod_{i=1}^{n} \lambda_i for matrices with real eigenvalues.
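These decompositions are directly available in R (an added sketch with an arbitrary symmetric positive definite matrix):

A <- matrix( c(4, 2, 2, 3), 2)            # symmetric and positive definite
e <- eigen( A)
all.equal( A, e$vectors %*% diag( e$values) %*% t( e$vectors))  # spectral decomposition
L <- t( chol( A))                         # chol() returns the upper triangular factor
all.equal( A, L %*% t( L))                # Cholesky factorization A = L L^T
c( det( A), prod( e$values))              # determinant equals the product of the eigenvalues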
For a non-singular matrix A, written as a 2 × 2 block matrix (with square matrices A_{11} and A_{22}), we have

A^{-1} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^{-1}
       = \begin{pmatrix} A_{11}^{-1} + A_{11}^{-1} A_{12} C A_{21} A_{11}^{-1} & -A_{11}^{-1} A_{12} C \\ -C A_{21} A_{11}^{-1} & C \end{pmatrix},    (C.13)

with C = (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1}.
Bland, J. M. and Bland, D. G. (1994). Statistics notes: One and two sided tests of significance.
BMJ, 309, 248.
Box, G. E. P. and Draper, N. R. (1987). Empirical Model-building and Response Surfaces. Wiley.
Brown, L. D., Cai, T. T., and DasGupta, A. (2002). Confidence intervals for a binomial propor-
tion and asymptotic expansions. The Annals of Statistics, 30, 160–201.
Cleveland, W. S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, U.S.A.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Routledge.
Dalal, S. R., Fowlkes, E. B., and Hoadley, B. (1989). Risk analysis of the space shuttle: Pre-Challenger prediction of failure. Journal of the American Statistical Association, 84, 945–957.
Devore, J. L. (2011). Probability and Statistics for Engineering and the Sciences. Brooks/Cole,
8th edition.
Fahrmeir, L., Kneib, T., and Lang, S. (2009). Regression: Modelle, Methoden und Anwendungen.
Springer, 2 edition.
Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). Regression: Models, Methods and
Applications. Springer.
Faraway, J. J. (2006). Extending the Linear Model with R: Generalized Linear, Mixed Effects
and Nonparametric Regression Models. CRC Press.
Farcomeni, A. (2008). A review of modern multiple hypothesis testing, with particular attention
to the false discovery proportion. Statistical Methods in Medical Research, 17, 347–388.
Fisher, R. A. (1938). Presidential address. Sankhyā: The Indian Journal of Statistics, 4, 14–17.
Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data
analysis. IEEE Transactions on Computers, C-23, 881–890.
Gelman, A., Lee, D., and Guo, J. (2015). Stan: A probabilistic programming language for Bayesian inference and optimization. Journal of Educational and Behavioral Statistics, 40, 530–543.
Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences.
Statistical Science, 7, 457–511.
Held, L. (2008). Methoden der statistischen Inferenz: Likelihood und Bayes. Springer, Heidelberg.
Hollander, M. and Wolfe, D. A. (1999). Nonparametric Statistical Methods. John Wiley & Sons.
Hüsler, J. and Zimmermann, H. (2010). Statistische Prinzipien für medizinische Projekte. Huber,
5 edition.
Jeffreys, H. (1983). Theory of probability. The Clarendon Press Oxford University Press, third
edition.
Johnson, N. L., Kemp, A. W., and Kotz, S. (2005). Univariate Discrete Distributions. Wiley-
Interscience, 3rd edition.
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994). Continuous Univariate Distributions,
Vol. 1. Wiley-Interscience, 2nd edition.
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995). Continuous Univariate Distributions,
Vol. 2. Wiley-Interscience, 2nd edition.
Kruschke, J. K. (2010). Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic
Press, first edition.
Kruschke, J. K. (2015). Doing Bayesian Data Analysis: A Tutorial with R, JAGS and Stan.
Academic Press/Elsevier, second edition.
Kupper, T., De Alencastro, L., Gatsigazi, R., Furrer, R., Grandjean, D., and Tarradellas, J. (2008). Concentrations and specific loads of brominated flame retardants in sewage sludge. Chemosphere, 71, 1173–1180.
Landesman, R., Aguero, O., Wilson, K., LaRussa, R., Campbell, W., and Penaloza, O. (1965).
The prophylactic use of chlorthalidone, a sulfonamide diuretic, in pregnancy. J. Obstet. Gy-
naecol., 72, 1004–1010.
Lindgren, F. and Rue, H. (2015). Bayesian spatial modelling with R-INLA. Journal of Statistical
Software, 63, i19.
Lunn, D., Jackson, C., Best, N., Thomas, A., and Spiegelhalter, D. (2012). The BUGS Book:
A Practical Introduction to Bayesian Analysis. Texts in Statistical Science. Chapman &
Hall/CRC.
Moyé, L. A. and Tita, A. T. (2002). Defending the rationale for the two-tailed test in clinical
research. Circulation, 105, 3062–3065.
Olea, R. A. (1991). Geostatistical Glossary and Multilingual Dictionary. Oxford University Press.
Petersen, K. B. and Pedersen, M. S. (2008). The Matrix Cookbook. Version 2008-11-14, http:
//matrixcookbook.com.
Plagellat, C., Kupper, T., Furrer, R., de Alencastro, L. F., Grandjean, D., and Tarradellas, J.
(2006). Concentrations and specific loads of UV filters in sewage sludge originating from a
monitoring network in Switzerland. Chemosphere, 62, 915–925.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003). Vienna, Austria.
Plummer, M. (2016). rjags: Bayesian Graphical Models using MCMC. R package version 4-6.
R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria.
Raftery, A. E. and Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies
for Markov chain Monte Carlo. Statistical Science, 7, 493–497.
Ruchti, S., Kratzer, G., Furrer, R., Hartnack, S., Würbel, H., and Gebhardt-Henrich, S. G. (2019). Progression and risk factors of pododermatitis in part-time group housed rabbit does in Switzerland. Preventive Veterinary Medicine, 166, 56–64.
Ruchti, S., Meier, A. R., Würbel, H., Kratzer, G., Gebhardt-Henrich, S. G., and Hartnack, S. (2018). Pododermatitis in group housed rabbit does in Switzerland: prevalence, severity and risk factors. Preventive Veterinary Medicine, 158, 114–121.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian
models by using integrated nested Laplace approximations. Journal of the Royal Statistical
Society B, 71, 319–392.
Siegel, S. and Castellan Jr, N. J. (1988). Nonparametric Statistics for The Behavioral Sciences.
McGraw-Hill, 2nd edition.
Sturtz, S., Ligges, U., and Gelman, A. (2005). R2WinBUGS: A package for running WinBUGS from
R. Journal of Statistical Software, 12, 1–16.
Swayne, D. F., Temple Lang, D., Buja, A., and Cook, D. (2003). GGobi: evolving from XGobi
into an extensible framework for interactive data visualization. Computational Statistics &
Data Analysis, 43, 423–444.
Tufte, E. R. (1997a). Visual and Statistical Thinking: Displays of Evidence for Making Decisions.
Graphics Press.
Tufte, E. R. (1997b). Visual Explanations: Images and Quantities, Evidence and Narrative.
Graphics Press.
Wasserstein, R. L. and Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70, 129–133.
Glossary
:= Define the left hand side by the expression on the other side.
♣, ♦ End of example or remark, end of definition.
∫, Σ, Π Integration, summation and product symbols. If there is no ambiguity, we omit the limits.
\binom{n}{k} Binomial coefficient, defined as \binom{n}{k} = \frac{n!}{k!(n-k)!}.
In = I Identity matrix, I = (δij ).
I{A} Indicator function, taking the value one if A is true and zero otherwise.
lim Limit.
log(·) Logarithmic function to the base e.
max{A}, min{A} Maximum, minimum of the set A.
N, Nd Space of natural numbers, of d-vectors with natural elements.
ϕ(x) Gaussian probability density function ϕ(x) = (2π)^{-1/2} \exp(-x^2/2).
Φ(x) Gaussian cumulative distribution function Φ(x) = \int_{-\infty}^{x} ϕ(z) \, dz.
π Transcendental number π = 3.14159 26535.
P(A) Probability of the event A.
R, Rn , Rn×m Space of real numbers, real n-vectors and real (n × m)-matrices.
rank(A) The rank of a matrix A is defined as the number of linearly independent rows
(or columns) of A.
tr(A) Trace of a matrix A, defined as the sum of its diagonal elements.
Var(X) Variance of the random variable X.
Z, Zd Space of integers, of d-vectors with integer elements.
The following table contains the abbreviations of the statistical distributions (dof denotes degrees
of freedom).
The following table contains the abbreviations of the statistical methods, properties and quality
measures.
Video Index
The following index gives a short description of the available videos, including a link to the
referenced page. The videos are uploaded to https://tube.switch.ch/.
Chapter 0
What are all these videos about?, vi
Chapter 7
Construction of general multivariate normal variables, 112
Important comment about an important equation, 112
Proof that the correlation is bounded, 107
Properties of expectation and variance in the setting of random vectors, 108
Two classical estimators and estimates for random vectors, 113
Chapter A
Installing RStudio, 204