Introduction to Statistics
Script
Reinhard Furrer
and the Applied Statistics Group
Preface v
2 Random Variables 21
2.1 Basics of Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Discrete Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Expectation and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Independent Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 The Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.7 Limit Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.8 Functions of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.9 Bibliographic remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.10 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3 Estimation 43
3.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Construction of Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Comparison of Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Interval Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5 Bibliographic remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4 Statistical Testing 57
4.1 The General Concept of Significance Testing . . . . . . . . . . . . . . . . . . . . . 57
4.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6 Rank-Based Methods 93
6.1 Robust Point Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 Rank-Based Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3 Other Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.4 Bibliographic remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.5 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
B Calculus 205
B.1 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
B.2 Functions in Higher Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
B.3 Approximating Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
References 216
Glossary 221
This document accompanies the lecture STA120 Introduction to Statistics that has been given
each spring semester since 2013. The lecture is given in the framework of the minor in Applied
Probability and Statistics (www.math.uzh.ch/aws) and comprises 14 weeks of two hours of lecture
and one hour of exercises per week.
As the lecture’s topics are structured on a week-by-week basis, the script contains thirteen chapters, each covering “one” topic. Some of the chapters contain consolidations or in-depth studies of topics from previous chapters. The last week is dedicated to a recap/review of the material.
I have thought long and hard about an optimal structure for this script. Let me quickly
summarize my thoughts. It is very important that the document contains a structure that is
tailored to the content I cover in class each week. This inherently leads to 13 “chapters.” Instead
of covering Linear Models over four weeks, I framed the material in four seemingly different
chapters. This structure helps me to better frame the lectures: each week having a start, a set
of learning goals and a predetermined end.
So to speak, the script covers not 13 but essentially only three topics:
1. Background
2. Statistical Foundations
3. Linear Modeling
We will not cover these topics chronologically. This is not necessary and I have opted for a
smoother setting. For example, we do not cover the multivariate Gaussian distribution at the
beginning but just before we need it. This also allows for a recap of several univariate concepts.
We use a path illustrated in Figure 1.
In case you use this document outside the lecture, here are several alternative paths through
the chapters, with a minimal impact on concepts that have not been covered:
All the datasets that are not part of regular CRAN packages are available via the url
www.math.uzh.ch/furrer/download/sta120/. The script is equipped with appropriate links that
facilitate the download.
[Figure 1: Path through the chapters. Start → Background: Exploratory Data Analysis, Random Variables, Multivariate Normal Distribution → Statistical Foundations: Estimation, Statistical Testing, with a frequentist branch (Proportions, Rank-Based Methods) and a Bayesian branch (Bayesian Approach, Monte Carlo Methods) → Linear Modeling: Correlation and Simple Regression, Multiple Regression, Analysis of Variance, Design of Experiments → End.]
The lecture STA120 Introduction to Statistics formally requires the prerequisites MAT183
Stochastic for the Natural Sciences and MAT141 Linear Algebra for the Natural Sciences or
equivalent modules. For the content of these lectures we refer to the corresponding course
web pages www.math.uzh.ch/fs20/mat183 and www.math.uzh.ch/hs20/mat141. It is possible to successfully pass the lecture without having taken the aforementioned modules, though some self-study is necessary. This script and the accompanying exercises require some calculus and linear algebra: differentiation, integration, matrix notation and basic operations, and the concept of solving a linear system of equations.
Appendix B and C give the bare minimum of relevant concepts in calculus and in linear algebra.
We review and summarize the relevant concepts of probability theory in Chapter 2.
I have therefore augmented this script with short video sequences giving additional – often
more technical – insight. These videos are indicated in the margins with a ‘video’ symbol as
here.
Many have contributed to this document. A big thanks to all of them, especially (alphabet-
ically) Zofia Baranczuk, Julia Braun, Eva Furrer, Florian Gerber, Lisa Hofer, Mattia Molinaro,
Franziska Robmann, Leila Schuh and many more. Kelly Reeve spent many hours improving my
English. Without their help, you would not be reading these lines. Yet, this document needs more than a polishing, so please let me know of any necessary improvements. I highly appreciate all forms of contributions in the form of errata, examples, or text blocks. Contributions can be deposited directly in the following Google Doc sheet.
Major errors that were corrected after the lecture of the corresponding semester are listed at www.math.uzh.ch/furrer/download/sta120/errata.txt. I try hard to ensure that the pagination of the document does not change after the lecture.
Reinhard Furrer
February 2020
Chapter 1
Exploratory Data Analysis and Visualization of Data
We start with a rather pragmatic setup: suppose we have some data. This chapter illustrates the first steps thereafter: exploring and visualizing the data. Of course, many of the visualization aspects are also used after the statistical analysis. No worries, subsequent chapters come back to the questions we should ask ourselves before we start collecting data, i.e., before we start an experiment, and to how to conduct the analysis. Figure 1.1 shows one representation of a data analysis flowchart; in this chapter we discuss the two rightmost boxes.
Assuming that the data collection process is completed, the “analysis” of these data is one of the next steps. This analysis is typically done in an appropriate software environment. There are many such environments, but our prime choice is R (R Core Team, 2020); often used alternatives are SPSS, SAS and Minitab. Appendix A gives some links to R and R resources.
The first step is loading the data into the software environment. This task sounds trivial and for pre-processed, readily available datasets it often is. Cleaning one’s own and others’ data is typically
[Figure 1.1: Data analysis flowchart: phenomena to study / hypothesis to investigate → design experiment → data collection → exploratory data analysis.]
very painful and eats up much unanticipated time. We do load external data but will not cover
the aspect of data cleaning — be aware when planning your analysis.
Example 1.1. There are many datasets available in R, the command data() would list these.
Packages often provide additional datasets, which can be listed with data(package="spam"),
here for the package spam (the command data( package=.packages( all.available=TRUE))
would list all datasets from all installed packages).
Often, we will work with our own data and hence we have to “load” the data. It is recommended to store the data in a simple tabular comma-separated format, typically a csv file. After importing (loading/reading) data, it is of utmost importance to check that the variables have been properly read and that (possible) row and column names are correctly parsed. In R-Code 1.1 we load observations of mercury content in Lake Geneva sediments. There is also a commented example that illustrates how the format of the imported dataset changes. ♣
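A minimal sketch in the spirit of R-Code 1.1, assuming the csv file contains a single column named Hg (the column name is an assumption based on later code chunks):

leman <- read.csv("https://www.math.uzh.ch/furrer/download/sta120/lemanHg.csv")
str( leman)      # check that the variable has been read as numeric
head( leman)     # inspect the first rows and the column names
Hg <- leman$Hg   # extract the variable for later use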
At the beginning of any statistical analysis, an exploratory data analysis (EDA) should be
performed (Tukey, 1977). An EDA summarizes the main characteristics of the data (mainly)
graphically, i.e., observations or measured values are depicted, and qualitatively and quantita-
tively described. Each dataset tells us a ‘story’ that we should try to understand. To do so, we
should ask questions like
• What are the key summary statistics of the data? (discussed in Sections 1.2 and 1.3)
At the end of a study, results are often summarized graphically because it is generally easier
to interpret and understand graphics than values in a table. As such, graphical representation
of data is an essential part of statistical analysis, from start to finish.
Scale      Mathematical operators     Statistical measures (location and spread)
Nominal    =, ≠                       mode
Ordinal    =, ≠, <, >                 mode, median
Interval   =, ≠, <, >, +, −           mode, median, arithmetic mean, standard deviation, range
Ratio      =, ≠, <, >, +, −, *, /     mode, median, arithmetic mean, geometric mean, standard
                                      deviation, coefficient of variation, studentized range
Figure 1.2: Types of scales according to Stevens (1946) and possible mathematical
operations. The statistical measures are for a description of location and spread.
Example 1.2. The classification of elements as either “C” or “H” results in a nominal variable. If
we associate “C” with cold and “H” with hot we can use an ordinal scale (based on temperature).
In R, nominal scales are represented with factors. R-Code 1.2 illustrates the creation of
nominal and interval scales as well as some simple operations. It would be possible to create
ordinal scales as well, but we will not use them in this script.
When measuring temperature in Kelvin (absolute zero at −273.15◦ C), a statement such as
“The temperature has increased by 20%” can be made. However, a comparison of twice as hot
(in degrees Celsius) does not make sense as the origin is arbitrary. ♣
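A minimal sketch in the spirit of R-Code 1.2, creating nominal and interval scaled variables (the specific values are chosen only for illustration):

elements <- factor( c("C", "H", "H", "C", "C"))   # nominal scale, represented as a factor
table( elements)                   # frequencies; only the operations = and != are meaningful
temp <- c(15.2, 25.0, 24.1, 12.8, 16.3)   # interval scaled temperatures in degrees Celsius
temp - mean( temp)                 # differences are meaningful on an interval scale
# ratios such as temp/temp[1] are not meaningful, as the origin of the Celsius scale is arbitrary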
evaluate if the missing values are due to some random mechanism, emerge consistently or with
some deterministic pattern, appear in all variables, for example.
For a basic analysis, one often discards observations that have missing values in any variable. There exist techniques to fill in (impute) missing values, but these are more involved and not treated here.
As a side note, with a careful inspection of missing values in ozone readings, the Antarctic
“ozone hole” would have been discovered more than one decade earlier (see, e.g., en.wikipedia.
org/wiki/Ozone_depletion#Research_history).
Informally a statistic is a single measure of some attribute of the data, in the context of
this chapter a statistic gives a good first impression of the distribution of the data. Typical
statistics for the location parameter include the (empirical) mean, truncated/trimmed mean,
median, quantiles and quartiles. The trimmed mean omits a fraction of the smallest and largest
values. A trimming of 50% is equivalent to the (empirical) median. Quantiles or more specifically
percentiles link observations or values with the position in the ordered data. The median is the
50th-percentile, half the data is smaller than the median, the other half is larger. The 25th
and 75th-percentile are also called the lower and upper quartiles, i.e., the quartiles divide the
data in four equally sized groups. Depending on the number of observations at hand, arbitrary quantiles are not precisely defined. In such cases, a linearly interpolated value is used, for which the precise interpolation weights depend on the software at hand. It is important to be aware of this potential ambiguity; it is less important to know the exact values of the weights.
Typical statistics for the scale parameter include the variance, the standard deviation (square root of the variance), the interquartile range (third quartile minus first quartile) and the coefficient of variation (standard deviation divided by the mean). Note that the coefficient of variation is dimensionless and should be used only with ratio scaled data.
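The following lines sketch these statistics in R (assuming the vector Hg from R-Code 1.1); the type argument of quantile() illustrates the software-dependent interpolation mentioned above.

mean( Hg, trim=0.1)        # trimmed mean, omitting 10% of the smallest and largest values
quantile( Hg, probs=c(0.25, 0.5, 0.75))          # default interpolation (type=7)
quantile( Hg, probs=c(0.25, 0.5, 0.75), type=6)  # another convention, slightly different values
IQR( Hg)                   # interquartile range
sd( Hg)/mean( Hg)          # coefficient of variation (only for ratio scaled data)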
We often denote data with x1 , . . . , xn , with n denoting the data size. The ordered data
(smallest to largest) is denoted with x(1) ≤ · · · ≤ x(n) . Hence, we use the following classical
notation:
empirical mean (average): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$,  (1.1)

empirical median: $x_{(n/2+1/2)}$ if $n$ odd, $\frac{1}{2}\big(x_{(n/2)} + x_{(n/2+1)}\big)$ if $n$ even,  (1.2)

empirical variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$,  (1.3)

empirical standard deviation: $s = \sqrt{s^2}$.  (1.4)
Example 1.3. In R-Code 1.3 several summary statistics for the 293 observations of mercury content in Lake Geneva sediments are calculated (data at www.math.uzh.ch/furrer/download/sta120/lemanHg.csv, see also R-Code 1.1). ♣
R-Code 1.3 A quantitative EDA of the mercury dataset (subset of the ‘leman’ dataset).
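A plausible minimal sketch in the spirit of R-Code 1.3, assuming the vector Hg from R-Code 1.1:

summary( Hg)                          # minimum, quartiles, median, mean, maximum
c( mean=mean( Hg), median=median( Hg), var=var( Hg), sd=sd( Hg))
quantile( Hg, probs=c(0.025, 0.975))  # selected quantiles
length( Hg)                           # sample size (293 observations)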
For discrete data, the mode is the most frequent value of an empirical frequency distribution; in order to calculate the mode, only the operations {=, ≠} are necessary. Continuous data are first divided into categories (discretized/binned) and then the mode can be determined.
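As a small sketch, the mode of discrete data can be computed with a frequency table (the data vector is purely illustrative):

x <- c(2, 3, 3, 5, 3, 2, 6)                 # small illustrative sample
tab <- table( x)                            # empirical frequency distribution
as.numeric( names( tab)[which.max( tab)])   # value with the highest frequency, i.e., the mode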
Another important aspect is the identification of outliers, which are defined (verbatim from Olea, 1991): “In a sample, any of the few observations that are separated so far in value from the remaining measurements that the questions arise whether they belong to a different population, or that the sampling technique is faulty. The determination that an observation is an outlier may be highly subjective, as there is no strict criteria for deciding what is and what is not an outlier”. Graphical representations of the data often help in the identification of outliers.
Figure 1.3: Bar plots: juxtaposed bars (left), stacked (right) of CO2 emissions ac-
cording to different sources taken from SWISS Magazine (2011). (See R-Code 1.4.)
Example 1.4. R-Code 1.4 and Figure 1.3 illustrate bar plots with data giving aggregated CO2
emissions from different sources (transportation, electricity production, deforestation, . . . ) in
the year 2005. Note that the numbers vary considerably according to different sources, mainly
due to the political and interest factors associated with these numbers. ♣
R-Code 1.4 Emissions sources for the year 2005 as presented by the SWISS Magazine
10/2011-01/2012, page 107 (SWISS Magazine, 2011). (See Figure 1.3.)
dat <- c(2, 15, 16, 32, 25, 10) # see Figure 1.9
emissionsource <- c('Air', 'Transp', 'Manufac', 'Electr', 'Deforest', 'Other')
barplot( dat, names=emissionsource, ylab="Percent", las=2)
barplot( cbind('2005'=dat), col=c(2,3,4,5,6,7), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', xlim=c(0.2,4))
Do not use pie charts unless absolutely necessary. Pie charts are often difficult to read. When
slices are similar in size it is nearly impossible to distinguish which is larger. Barplots allow an
easier comparison.
Histograms illustrate the frequency distribution of observations graphically and are easy to
construct and to interpret. Histograms allow one to quickly assess whether the data is symmetric
or rather left- or right-skewed, whether the data has rather one mode or several or whether ex-
ceptional values are present. Important statistics like mean and median can be added. However,
the number of bins (categories to break the data down into) is a subjective choice that affects
the look of the histogram and several valid rules of thumb exist for choosing the optimal num-
ber of bins. R-Code 1.5 and the associated Figure 1.4 illustrate the construction and resulting
histograms of the mercury dataset. In one of the histograms, a “smoothed density” has been
superimposed. Such curves will be helpful when comparing the data with different statistical
models, as we will see in later chapters. The histograms show that the data is unimodal and right-skewed, with no exceptional values.
R-Code 1.5 Different histograms (good and bad ones) for the mercury dataset. (See
Figure 1.4.)
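A sketch in the spirit of R-Code 1.5, assuming the vector Hg from R-Code 1.1:

hist( Hg)                              # default binning
hist( Hg, probability=TRUE, main="")   # density scale instead of counts
lines( density( Hg), col=2)            # superimpose a smoothed density
abline( v=c( mean( Hg), median( Hg)), col=c(3, 4))   # add mean and median
hist( Hg, breaks=100)                  # far too many bins: a 'bad' histogram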
Figure 1.4: Histograms with various bin sizes. (See R-Code 1.5.)
When constructing histograms for discrete data (e.g., integer values), one has to be careful
with the binning. Often it is better to manually specify the bins. To represent the result of many
dice tosses, it would be advisable to use hist( x, breaks=seq( from=0.5, to=6.5, by=1)),
or possibly use a bar plot as explained above. A stem-and-leaf plot is similar to a histogram; however, this plot is rarely used today. Figure 1.5 gives an example.
A quantile-quantile plot (Q-Q plot) is used to visually compare empirical data quantiles with
the quantiles of a theoretical distribution (we will talk more about “theoretical distributions” in
the next chapter). The ordered values are compared with the i/(n + 1)-quantiles. In practice,
some software use (i − a)/(n + 1 − 2a), for a specific a ∈ [0, 1]. R-Code 1.7 and Figure 1.7
illustrate a Q-Q plot for the mercury dataset by comparing it to a normal distribution and a so-
called chi-squared distribution. In the case of a good fit, the points are aligned almost on a straight line. To “guide the eye”, one often adds such a line to the plots.
R-Code 1.6 Box plot. Notice that the function boxplot has several arguments for tailoring
the appearance of the box plots. These are discussed in the function’s help file. (See Figure 1.6.)
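A sketch in the spirit of R-Code 1.6; the violin plot requires an add-on package, here assumed to be vioplot:

boxplot( Hg, notch=TRUE, ylab="Hg")    # notched box plot of the mercury data
# With the add-on package 'vioplot' (assumed installed), a violin plot would be:
# vioplot::vioplot( Hg)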
Figure 1.6: Box plots (notched version) and violin plot. (See R-Code 1.6.)
Remark 1.1. There are several fundamentally different approaches to creating plots in R: base
graphics (package graphics, which is automatically loaded upon startup), trellis graphics (pack-
ages lattice and latticeExtra), and the grammar of graphics approach (package ggplot2).
We focus on base graphics. This approach is in sync with the R source code style and gives us clear, direct handling of all elements. ggplot functionality may produce seemingly fancier graphics at the price of certain black-box elements. ♣
R-Code 1.7 Q-Q plot of the mercury dataset. (See Figure 1.7.)
qqnorm( Hg)
qqline( Hg, col=2, main='')
qqplot( qchisq( ppoints( 293), df=5), Hg, xlab="Theoretical quantiles")
# For 'chisq' some a priori knowledge was used, for 'df=5' minimal
# trial and error was used.
qqline( Hg, distribution=function(p) qchisq( p, df=5), col=2)
Figure 1.7: Q-Q plots using the normal distribution (left) and a so-called chi-squared distribution with five degrees of freedom (right). The red line passes through the lower and upper quantiles of both the empirical and theoretical distribution. (See R-Code 1.7.)
In a scatter plot, “guide-the-eye” lines are often included. In such situations, some care is needed as there is a perception of asymmetry between y versus x and x versus y. We will discuss this further in Chapter 8.
In the case of several frequency distributions, bar plots, either stacked or grouped, may also
be used in an intuitive way. See R-Code 1.8 and Figure 1.8 for two slightly different partitions
of emission sources.
R-Code 1.8 Emissions sources for the year 2005 from www.c2es.org/facts-
figures/international-emissions/sector (approximate values). (See Figure 1.8.)
dat2 <- c(2, 10, 12, 28, 26, 22) # source c2es.org
mat <- cbind( SWISS=dat, c2es.org=dat2)
barplot(mat, col=c(2,3,4,5,6,7), xlim=c(0.2,5), legend=emissionsource,
args.legend=list(bty='n') ,ylab='Percent', las=2)
barplot(mat, col=c(2,3,4,5,6,7), xlim=c(1,30), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', beside=TRUE, las=2)
Figure 1.8: Bar plots for two variables: stacked (left), grouped (right). (See R-
Code 1.8.)
Example 1.5. The iris dataset is a classic teaching dataset. It gives measurements (in cen-
timeters) for the variables sepal length and width and petal length and width for 50 flowers from
each of three species of iris (Iris setosa L., Iris versicolor L., and Iris virginica L.; see Figure 1.9).
In R it is part of the package datasets and thus automatically available.
We use different types of graphics to represent the data (Figure 1.10 and R-Code 1.9). ♣
Figure 1.9: Photos of the three species of iris (setosa, versicolor and virginica). The
images are taken from en.wikipedia.org/wiki/Iris_flower_data_set.
R-Code 1.9 Constructing histograms, box plots, violin plots and scatter plot with iris
data. (See Figure 1.10.)
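A sketch in the spirit of R-Code 1.9 with base graphics (violin plots are omitted, as they would require an add-on package):

hist( iris$Petal.Length, main="", xlab="Petal.Length")  # histogram of one variable
boxplot( iris[, 1:4], las=2)                            # box plots of the four measurements
boxplot( Petal.Length ~ Species, data=iris, las=2)      # petal length by species
pairs( iris[, 1:4], col=as.numeric( iris$Species))      # simple matrix of scatter plots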
Figure 1.10: Top, left to right: histogram of the variable petal length, box plots for several variables, and violin plots of petal length from the iris dataset. Bottom: simple matrix of scatter plots. (See R-Code 1.9.)
Parallel coordinate plots are a popular way of representing several observations in high di-
mensions (i.e., many variables). Each variable (scaled to zero-one) is recorded along a vertical
axis. The values of each observation are then connected with a line across the various variables.
That means that points in the usual (Euclidean) representation correspond to lines in a paral-
lel coordinate plot. All interval scaled variables are normalized to [0, 1]. Additionally, nominal
variables may also be depicted.
Example 1.6. The dataset swiss (provided by the package datasets) contains 47 observa-
tions on 6 variables (standardized fertility measure and socio-economic indicators) for each of 47
French-speaking provinces of Switzerland at about 1888.
R-Code 1.10 and Figure 1.11 give an example of a parallel coordinate plot. Groups can be
quickly detected and strong associations are spotted directly.
As a side note, the raw data is available at https://opr.princeton.edu/archive/Download.
aspx?FileID=1113 and documentation at https://opr.princeton.edu/archive/Download.aspx?FileID=
1116, see also https://opr.princeton.edu/archive/pefp/switz.aspx. It would be a fun task to ex-
tract the corresponding data not only for the French-speaking provinces but also for entire
Switzerland. ♣
R-Code 1.10 Parallel coordinate plot for the swiss dataset. (See Figure 1.11.)
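One possible sketch in the spirit of R-Code 1.10 uses parcoord() from the package MASS; the coloring by the variable Catholic is an arbitrary illustrative choice.

require( MASS)                                   # provides parcoord()
parcoord( swiss, col=1 + (swiss$Catholic > 50))  # color predominantly catholic provinces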
The open source visualization program ggobi may be used to explore high-dimensional data
(Swayne et al., 2003). It provides highly dynamic and interactive graphics such as tours, as well
as familiar graphics such as scatter plots, bar charts and parallel coordinates plots. All plots
Figure 1.11: Parallel coordinate plot of the swiss dataset. (See R-Code 1.10.)
are interactive and linked with brushing and identification. The package rggobi provides a link
to R. Figure 1.12 gives a screenshot of the 2D Tour, a projection pursuit visualization which
involves finding the most “interesting” possible projections of multidimensional data (Friedman
and Tukey, 1974). Such a projection should highlight interesting features of the data.
Figure 1.12: GGobi screenshot based on the state.x77 data with Alaska marked.
R-Code 1.11 GGobi example based on state.x77. (See Figure 1.12 for a screenshot)
require( rggobi)
ggobi( state.x77)
Figure 1.13: Bad example (above) and improved but still not ideal graphic (below).
Figures from university documents.
Many consider John Tukey to be the founder and promoter of exploratory data analysis.
Thus his EDA book (Tukey, 1977) is often seen as the (first) authoritative text on the subject.
In a series of books, Tufte rigorously yet vividly explains all relevant elements of visualization
and displaying information (Tufte, 1983, 1990, 1997b,a). Many university programs offer lectures
on information visualization or similar topics. The lecture by Ross Ihaka is worth mentioning:
Figure 1.15: Examples of bad graphs in scientific journals. The figure is taken
from www.biostat.wisc.edu/˜kbroman/topten_worstgraphs/. The website discusses
the problems with each graph and possible improvements (‘[Discussion]’ links).
www.stat.auckland.ac.nz/~ihaka/120/lectures.html.
In a lengthy article, Friendly and Denis (2001) give an extensive historical overview of the evolution of cartography, graphics and visualization. The pdf has active links for virtually end-
less browsing: euclid.psych.yorku.ca/SCS/Gallery/milestone/milestone.pdf. See also the applet
at www.datavis.ca/milestones/.
There are also many interesting videos available illustrating good and not-so-good graphics.
For example, www.youtube.com/watch?v=ajUcjR0ma4c.
i) (Datasets). R has many built-in datasets, one example is volcano. What’s the name of
the Volcano? Describe the dataset in a few words.
ii) (Help, plotting). Use the R help function to get information on how to use the image()
function for plotting matrices. Display the volcano data.
iii) (loading packages). Install the package fields. Display the volcano data with the function
image.plot().
iv) (demo, 3D plotting). Use the R help function to find out the purpose of the function demo() and have a look at the list of available demos. The demo persp utilizes the volcano
data to illustrate basic three-dimensional plotting. Call persp and have a look at the plots.
What is the maximum height of the volcano depicted?
Problem 1.2 (EDA of multivariate data) In this problem we want to explore a classical dataset.
Load the mtcars dataset. Perform an EDA of the dataset mtcars and provide at least three
meaningful plots (as part of the EDA) and a short description of what they display.
i) Construct a boxplot and a Q-Q plot of the moose and wolf data. Give a short interpretation.
ii) Jointly visualize the wolves and moose data, as well as their abundances over the years.
Give a short interpretation of what you see in the figures. (Of course you may compare
the result with what is given on the aforementioned web page).
Problem 1.5 (parallel coordinate plot) Construct a parallel coordinate plot using the built-in
dataset state.x77. In the left and right margins, annotate the states. Give a few interpretations
that can be derived from the plot.
Problem 1.6 (Feature detection) The synthetic dataset whatfeature, available at www.math.
uzh.ch/furrer/download/sta120/whatfeature.RData has a hidden feature. Try to find it using
projection pursuit in rggobi and notice how difficult it is to find structures in small, low-
dimensional datasets.
Chapter 2
Random Variables
Pass from a cdf to a quantile function, pdf or pmf and vice versa
Schematically sketch, plot in R and interpret pdf, pmf, cdf, quantile function,
Q-Q plot
Give the definition and intuition of an expected value (E), variance (Var),
know the basic properties of E, Var used for calculation
(*) Given the formula, calculate the cdf/pdf of transformed random variables
Probability theory is the prime tool of all statistical modeling. Hence we need a minimal understanding of the theory of probability in order to properly understand statistical models, their interpretation, etc. This chapter may seem quite dense, but several results are given for reference only.
ii) P(Ω) = 1,

iii) $P\big(\bigcup_i A_i\big) = \sum_i P(A_i)$, for $A_i \cap A_j = \emptyset$, $i \neq j$.

In the last sum we only specify the index without indicating start and end, which means we sum over all, say $\sum_{i=1}^{n}$, where n may be finite or infinite (similarly for the union).
Informally, a probability function P assigns a value in [0, 1], i.e., the probability, to each event of the sample space, subject to the constraints:
iii) the probability of several events is equal to the sum of the individual probabilities, if the
events are mutually exclusive.
Probabilities are often visualized with Venn diagrams (Figure 2.1), which clearly and intu-
itively illustrate more complex facts, such as:
The last statement can be written for an arbitrary number of events $B_i$ with $B_i \cap B_j = \emptyset$, $i \neq j$, and $\bigcup_i B_i = \Omega$, yielding $P(A) = \sum_i P(A \mid B_i)\, P(B_i)$.
[Figure 2.1: Venn diagram of the events A, B and C in the sample space Ω.]
We consider a random variable as a function that assigns values to the outcomes (events) of a random experiment; these values, or values in an interval, are assumed with certain probabilities. The outcomes of the experiment, i.e., the values, are called realizations of the random variable.
The following definition defines a random variable and gives a (unique) characterization
of random variables. In subsequent sections, we will see additional characterizations. These,
however, will depend on the type of values the random variable takes.
Definition 2.1. A random variable X is a function from the sample space Ω to R and represents a
possible numerical outcome of an experiment. The distribution function (cumulative distribution
function, cdf) of a random variable X is $F_X(x) = P(X \le x)$, for all $x \in \mathbb{R}$.
Random variables are denoted with uppercase letters (e.g. X, Y ), while realizations are
denoted by the corresponding lowercase letters (x, y). This means that the theoretical concept,
or the random variable as a function, is denoted by uppercase letters. Actual values or data, for
example the columns in your dataset, would be denoted with lowercase letters.
Example 2.1. Let X be the sum of the roll of two dice. The random variable X assumes
the values 2, 3, . . . , 12. The right panel of Figure 2.2 illustrates the distribution function. The
distribution function (as for all discrete random variables) is piece-wise constant with jumps
equal to the probability of that value. ♣
Example 2.2. A boy practices free throws, i.e., foul shots to the basket standing at a distance
of 15 ft to the board. Let the random variable X be the number of throws that are necessary
until the boy succeeds. Theoretically, there is no upper bound on this number. Hence X can
take the values 1, 2, . . . . ♣
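If each throw succeeds independently with the same probability p (an assumption; p is not specified in the example), X follows a shifted geometric distribution. A small simulation sketch:

p <- 0.3                        # assumed success probability, for illustration only
x <- rgeom( 1000, prob=p) + 1   # rgeom() counts failures before the first success, hence +1
table( x)[1:8]                  # empirical frequencies of 1, 2, ... throws until success
mean( x)                        # close to the theoretical expectation 1/p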
Another way of describing discrete random variables is the probability mass function, defined
as follows.
Definition 2.2. The probability mass function (pmf) of a discrete random variable X is defined
by fX (x) = P(X = x). ♦
In other words, the pmf gives the probability that the random variable takes one single value, whereas, as seen, the cdf gives the probability that the random variable takes that or any smaller value.
Property 2.2. Let X be a discrete random variable with probability mass function fX (x) and
cumulative distribution function FX (x). Then:
The two points iii) and iv) show that there is a one-to-one relation (also called a bijection)
between the cumulative distribution function and probability mass function. Given one, we can
construct the other.
Figure 2.2 illustrates the probability mass function and cumulative distribution function of
the random variable X as given in Example 2.1. The jump locations and sizes (discontinuities)
of the cdf correspond to probabilities given in the left panel. Notice that we have emphasized
the right continuity of the cdf (see Proposition 2.1.ii)) with the additional dot.
x <- 2:12
p <- c(1:6, 5:1)/36
plot( x, p, type='h', ylim=c(0, .2),
xlab=expression(x[i]), ylab=expression(p[i]==f[X](x[i])))
points( x, p, pch = 19)
plot.ecdf( outer(1:6, 1:6, "+"), ylab=expression(F[X](x)), main='')
Figure 2.2: Probability mass function (left) and cumulative distribution function
(right) of X = “the sum of the roll of two dice”. (See R-Code 2.1.)
A random experiment with exactly two possible outcomes (for example: heads/tails, male/female, success/failure) is called a Bernoulli trial. For simplicity, we code the sample space with ‘1’ (success) and ‘0’ (failure). The probability mass function is determined by a single probability: P(X = 1) = p and P(X = 0) = 1 − p, for 0 ≤ p ≤ 1.

A random variable X with probability mass function

$P(X = k) = \frac{\lambda^k}{k!} \exp(-\lambda)$, $\lambda > 0$, $k = 0, 1, \ldots$,  (2.8)

is said to follow a Poisson distribution with parameter λ, denoted by X ∼ Pois(λ). ♦
The Poisson distribution is also a good approximation for the binomial distribution with large
n and small p (as a rule of thumb if n > 20 and np < 10).
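The quality of this approximation can be inspected numerically; a brief sketch with an arbitrary choice of n and p satisfying the rule of thumb:

n <- 100; p <- 0.05             # n > 20 and np = 5 < 10
k <- 0:15
round( cbind( binom=dbinom( k, size=n, prob=p), pois=dpois( k, lambda=n*p)), 3)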
Definition 2.4. The probability density function (density function, pdf) fX (x), or density for
short, of a continuous random variable X is defined by
$P(a < X \le b) = \int_a^b f_X(x)\, dx$, $a < b$.  (2.9)
Property 2.3. Let X be a continuous random variable with density function fX (x) and distri-
bution function FX (x). Then:
i) The density function satisfies fX (x) ≥ 0 for all x ∈ R and fX (x) is continuous almost
everywhere.
ii) $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$.

iii) $F_X(x) = \int_{-\infty}^{x} f_X(y)\, dy$.

iv) $f_X(x) = F_X'(x) = \frac{d}{dx} F_X(x)$.

v) The cumulative distribution function FX (x) is continuous everywhere.
vi) P(X = x) = 0.
As given by Property 2.3.iii) and iv), there is again a bijection between the density function
and the cumulative distribution function: if we know one we can construct the other. Actually,
there is a third characterization of random variables, called the quantile function, which is es-
sentially the inverse of the cdf. That means, we are interested in values x for which FX (x) = p.
Definition 2.5. The quantile function QX (p) of a random variable X with (strictly) monotone cumulative distribution function FX (x) is defined by $Q_X(p) = F_X^{-1}(p)$, $0 < p < 1$, i.e., the quantile function is equivalent to the inverse of the distribution function. ♦
The quantile function can be used to define the theoretical counterpart of the empirical quartiles of Chapter 1, as illustrated next.
Definition 2.6. The median ν of a continuous random variable X with cumulative distribution
function FX (x) is defined by ν = QX (1/2). ♦
Remark 2.1. For discrete random variables the cdf is not continuous (see the plateaus in the right panel of Figure 2.2) and the inverse does not exist. The quantile function then returns the minimum value of x from amongst all those values with $p \le P(X \le x) = F_X(x)$; more formally, $Q_X(p) = \min\{x : p \le F_X(x)\}$.
Example 2.3. The continuous uniform distribution U(a, b) is defined by a constant density function over the interval [a, b], a < b, i.e.,

$f(x) = \begin{cases} \frac{1}{b-a}, & \text{if } a \le x \le b,\\ 0, & \text{otherwise.} \end{cases}$
The quantile function is QX (p) = a + p(b − a) for 0 < p < 1. Figure 2.3 shows the density and
cumulative distribution function of the uniform distribution U(0, 1). ♣
R-Code 2.2 Density and distribution function of a uniform distribution. (See Figure 2.3.)
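A minimal sketch in the spirit of R-Code 2.2:

x <- seq( -1, 2, length=500)
plot( x, dunif( x), type='l', ylab='density')       # density of U(0,1)
plot( x, punif( x), type='l', ylab='distribution')  # cumulative distribution function of U(0,1)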
Figure 2.3: Density and distribution function of the uniform distribution U(0, 1). (See
R-Code 2.2.)
Many other “summary” values reduce to the calculation of a particular expectation. The following
property states how to calculate the expectation of a function of the random variable X, which
is in turn used to summarize the spread of X.
Property 2.4. For an “arbitrary” real function g we have:

$E\big(g(X)\big) = \begin{cases} \sum_i g(x_i)\, P(X = x_i), & \text{if } X \text{ discrete},\\ \int_{\mathbb{R}} g(x)\, f_X(x)\, dx, & \text{if } X \text{ continuous.} \end{cases}$
$\operatorname{Var}(X) = E\big((X - E(X))^2\big)$  (2.14)

and is also denoted as the centered second moment, in contrast to the second moment E(X²). ♦
The expectation is “linked” to the average (or empirical mean, mean) if we have a set of
realizations thought to be from the particular random variable. Similarly, the variance, the
expectation of the squared deviation from its expected value is “linked” to the empirical variance
(var). This link will be formalized in later chapters.
E(X) = 0 · (1 − p) + 1 · p = p,  (2.15)
Var(X) = (0 − p)² · (1 − p) + (1 − p)² · p = p(1 − p).  (2.16)

ii) The expectation and variance of a Poisson random variable are E(X) = λ and Var(X) = λ (see Problem 1.i)).
Property 2.5. For random variables X and Y , regardless of whether discrete or continuous,
and for a and b given constants, we have
i) $\operatorname{Var}(X) = E(X^2) - \big(E(X)\big)^2$;

The second to last property seems somewhat surprising. But starting from the definition of the variance, one quickly realizes that the variance is not a linear operator:

$\operatorname{Var}(a + bX) = E\big((a + bX - E(a + bX))^2\big) = E\big((a + bX - (a + b\,E(X)))^2\big)$,  (2.18)

followed by a factorization of b².
Example 2.5. We consider again the setting of Example 2.1, and straightforward calculation
shows that
$E(X) = \sum_{i=2}^{12} i\, P(X = i) = 7$, by equation (2.12),  (2.19)

$E(X) = 2 \sum_{i=1}^{6} i \cdot \frac{1}{6} = 2 \cdot \frac{7}{2}$, by using Property 2.5.iv) first.  (2.20)
♣
The definition also implies that the joint density and joint cumulative distribution is simply
the product of the individual ones, also called marginal ones.
We will often use many independent random variables with a common distribution function.
The iid assumption is very crucial and relaxing the assumptions to allow, for example, de-
pendence between the random variables, has severe implications on the statistical modeling.
Independence also implies a simple formula for the variance of the sum of two or many random
variables.
The latter two properties will be used when we investigate statistical properties of the sample mean, i.e., linking the empirical mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ with the random sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$.
Example 2.6. For X ∼ Bin(n, p), we have E(X) = np and Var(X) = np(1 − p).
Definition 2.11. The random variable X is said to be normally distributed if the cumulative
distribution function is given by
$F_X(x) = \int_{-\infty}^{x} f_X(y)\, dy$  (2.24)

with density function

$f(x) = f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{1}{2} \cdot \frac{(x-\mu)^2}{\sigma^2}\Big)$,  (2.25)
While the exact form of the density (2.25) is not important, a certain recognition factor will be very useful. In particular, for a standard normal random variable, the density is proportional to exp(−z²/2).
The following property is essential and will be consistently used throughout the work. We
justify the first one later in this chapter. The second one is a result of the particular form of the
density.
Property 2.7. i) Let X ∼ N(µ, σ²); then $\frac{X - \mu}{\sigma} \sim N(0, 1)$. Conversely, if Z ∼ N(0, 1), then σZ + µ ∼ N(µ, σ²), σ > 0.
ii) Let X1 ∼ N(µ1, σ1²) and X2 ∼ N(µ2, σ2²) be independent and a and b arbitrary; then aX1 + bX2 ∼ N(aµ1 + bµ2, a²σ1² + b²σ2²).
The cumulative distribution function Φ has no closed form and the corresponding probabilities
must be determined numerically. In the past, so-called “standard tables” were often used and
included in statistics books. Table 2.1 gives an excerpt of such a table. Now even “simple”
pocket calculators have the corresponding functions to calculate the probabilities. It is probably
worthwhile to remember 84% = Φ(1), 98% = Φ(2), 100% ≈ Φ(3), as well as 95% = Φ(1.64) and
97.5% = Φ(1.96). Relevant quantiles have been illustrated in Figure 2.4 for a standard normal
random variable. For arbitrary normal density, the density scales linearly with the standard
deviation.
[Figure panels: P(Z < 0) = 50.0%, P(Z < 1) = 84.1%, P(Z < 2) = 97.7%; P(−1 < Z < 1) = 68.3%, P(−2 < Z < 2) = 95.4%, P(−3 < Z < 3) = 99.7%.]
Figure 2.4: Different probabilities for some quantiles of the standard normal distribu-
tion.
Table 2.1: Probabilities of the standard normal distribution. The table gives the value
of Φ(zp ) for selected values of zp . For example, Φ(0.2 + 0.04) = 0.595.
zp      0.0    0.1    0.2    0.3    0.4    ...    1      ...    1.6    1.7    1.8    1.9    2      ...    3
+0.00   0.500  0.540  0.579  0.618  0.655         0.841         0.945  0.955  0.964  0.971  0.977         0.999
+0.02   0.508  0.548  0.587  0.626  0.663         0.846         0.947  0.957  0.966  0.973  0.978         0.999
+0.04   0.516  0.556  0.595  0.633  0.670         0.851         0.949  0.959  0.967  0.974  0.979         ...
+0.06   0.524  0.564  0.603  0.641  0.677         0.855         0.952  0.961  0.969  0.975  0.980
+0.08   0.532  0.571  0.610  0.648  0.684         0.860         0.954  0.962  0.970  0.976  0.981
R-Code 2.3 Calculation of the “z-table” (see Table 2.1) and density, distribution, and
quantile functions of the standard normal distribution. (See Figure 2.5.)
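A sketch in the spirit of R-Code 2.3, computing the entries of Table 2.1 and the three functions shown in Figure 2.5:

zp <- c( seq(0, 0.4, by=0.1), 1, seq(1.6, 2, by=0.1), 3)   # column values of Table 2.1
round( outer( seq(0, 0.08, by=0.02), zp, function(a, z) pnorm( z + a)), 3)
curve( dnorm( x), -3, 3)          # density of the standard normal
curve( pnorm( x), -3, 3)          # distribution function
curve( qnorm( x), 0.01, 0.99)     # quantile function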
Property 2.8. (Central Limit Theorem (CLT), classical version) Let X1, X2, X3, . . . be an infinite sequence of iid random variables with E(Xi) = µ and Var(Xi) = σ². Then

$\lim_{n\to\infty} P\Big(\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \le z\Big) = \Phi(z)$,  (2.26)

where we kept the subscript n for the sample mean to emphasize its dependence on n.
The proof of the CLT is a typical exercise in a probability theory lecture. Many extensions
of the CLT exist, for example, the independence assumptions can be relaxed.
Using the central limit theorem argument, we can show that the distribution of a binomial random variable X ∼ Bin(n, p) converges to that of a normal random variable as n → ∞. Thus, the distribution of a normal random variable N(np, np(1 − p)) can be used as an approximation for the binomial distribution Bin(n, p). For the approximation, n should be larger than 30 for p ≈ 0.5. For p closer to 0 or 1, n needs to be much larger.
Example 2.8. Let X ∼ Bin(30, 0.5). Then P(X ≤ 10) = 0.049, “exactly”. However,
$P(X \le 10) \approx P\Big(\frac{X - np}{\sqrt{np(1-p)}} \le \frac{10 - np}{\sqrt{np(1-p)}}\Big) = \Phi\Big(\frac{10 - 15}{\sqrt{30/4}}\Big) = 0.034$,  (2.27)

$P(X \le 10) \approx P\Big(\frac{X + 0.5 - np}{\sqrt{np(1-p)}} \le \frac{10 + 0.5 - np}{\sqrt{np(1-p)}}\Big) = \Phi\Big(\frac{10.5 - 15}{\sqrt{30/4}}\Big) = 0.05$.  (2.28)

The second approximation uses a continuity correction of 0.5 and is closer to the exact value.
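These numbers can be verified directly in R; a short sketch:

pbinom( 10, size=30, prob=0.5)        # 'exact' value, approximately 0.049
pnorm( (10 - 15)/sqrt( 30/4))         # approximation without continuity correction
pnorm( (10.5 - 15)/sqrt( 30/4))       # approximation with continuity correction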
Another very important law is the law of large numbers (LLN), which essentially states that for X1, . . . , Xn iid with E(Xi) = µ, the average $\bar{X}_n$ converges to µ. We have deliberately used the somewhat ambiguous “convergence” statement; a more rigorous statement is technically a bit more involved. We will use the LLN in the next chapter, when we try to infer parameter values from data, i.e., say something about µ when we observe x1, . . . , xn.
Remark 2.2. There are actually two forms of the LLN, the strong and the weak formulation. We do not need the precise formulations later and thus simply state them here for the sole reason of stating them. For every ε > 0,

weak LLN: $\lim_{n\to\infty} P\big(|\bar{X}_n - \mu| > \varepsilon\big) = 0$,  (2.29)

strong LLN: $P\big(\lim_{n\to\infty} \bar{X}_n = \mu\big) = 1$.  (2.30)

The differences between both formulations are subtle. The weak version states that the average is close to the mean but excursions (for specific n) beyond µ ± ε can happen arbitrarily often. The strong version states that there exists a large n such that the average is always within µ ± ε.

The two forms represent fundamentally different notions of convergence of random variables: (2.30) is almost sure convergence, (2.29) is convergence in probability. The CLT represents convergence in distribution. ♣
To derive the probability mass function we apply Property 2.2.iv). In the more interesting setting
of continuous random variables, the density function is derived by Property 2.3.iv) and is thus
$f_Y(y) = \frac{d}{dy} g^{-1}(y)\; f_X\big(g^{-1}(y)\big)$.  (2.33)
Example 2.9. Let X be a random variable with cdf FX(x) and pdf fX(x). We consider Y = a + bX, for b > 0 and a arbitrary. Hence, g(·) is a linear function and its inverse g⁻¹(y) = (y − a)/b is monotonically increasing. The cdf of Y is thus $F_X\big((y - a)/b\big)$ and the pdf is $f_X\big((y - a)/b\big) \cdot 1/b$. This fact has already been stated in Property 2.7 for Gaussian random variables. ♣
Example 2.10. Let X ∼ U(0, 1) and, for 0 < x < 1, set g(x) = − log(1 − x), thus g⁻¹(y) = 1 − exp(−y). Then the distribution and density function of Y = g(X) are $F_Y(y) = 1 - \exp(-y)$ and $f_Y(y) = \exp(-y)$, for y > 0. This random variable is called the exponential random variable (with rate parameter one). Notice further that g(x) is the quantile function of this random variable. ♣
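Example 2.10 is the basis of so-called inversion sampling; a small sketch comparing the transformed uniforms with direct exponential draws:

set.seed( 1)
u <- runif( 1000)
y <- -log( 1 - u)                  # transformed uniform random variables
qqplot( y, rexp( 1000, rate=1))    # points should lie close to the diagonal
abline( 0, 1, col=2)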
As we are often interested in summarizing a random variable by its mean and variance, we
have a very convenient short-cut.
The expectation and the variance of a transformed random variable Y = g(X) can be approximated by the so-called delta method. The idea thereof consists of a Taylor expansion around the expectation E(X):

$g(X) \approx g\big(E(X)\big) + g'\big(E(X)\big)\,\big(X - E(X)\big)$,  (2.36)

which leads to the approximations

$E\big(g(X)\big) \approx g\big(E(X)\big)$,  (2.37)

$\operatorname{Var}\big(g(X)\big) \approx g'\big(E(X)\big)^2 \operatorname{Var}(X)$.  (2.38)

♣
Of course, in the case of a linear transformation (as, e.g., in Example 2.9), equation (2.36) is
an equality and thus relations (2.37) and (2.38) are exact, which is in sync with Property 2.7.
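The quality of the delta method can be checked by simulation; a sketch with the arbitrarily chosen transformation g(x) = exp(x) and X ∼ N(1, 0.1²):

mu <- 1; sigma <- 0.1
x <- rnorm( 100000, mean=mu, sd=sigma)
c( mean( exp( x)), exp( mu))                 # simulated versus approximated expectation
c( var( exp( x)), exp( mu)^2 * sigma^2)      # simulated versus approximated variance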
Of course it is also possible to construct random variables based on an entire random sample, say Y = g(X1, . . . , Xn). Property 2.8 uses exactly such an approach, where g(·) is given by $g(X_1, \ldots, X_n) = \Big(\frac{1}{n}\sum_i X_i - \mu\Big)\Big/\big(\sigma/\sqrt{n}\big)$.
The next section discusses random variables that are essentially derived (obtained as func-
tions) from normal random variables. We will encounter these much later, for example, the t
distribution in Chapter 4 and the F distribution in Chapter 9, as well as some other handy
distributions.
Let Z1, . . . , Zn be iid standard normal random variables. The distribution of the random variable

$\mathcal{X}_n^2 = \sum_{i=1}^{n} Z_i^2$  (2.41)

is called the chi-square distribution (χ²-distribution) with n degrees of freedom. The following applies: $E(\mathcal{X}_n^2) = n$ and $\operatorname{Var}(\mathcal{X}_n^2) = 2n$.
Here and for the next two distributions, we do not give the densities as they are very com-
plex. Similarly, the expectation and the variance here and for the next two distributions are for
reference only.
The chi-square distribution is used in numerous statistical tests that we will see in Chapters 4 and 6.
R-Code 2.4 Chi-square distribution for various degrees of freedom. (See Figure 2.6.)
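A sketch in the spirit of R-Code 2.4, with the degrees of freedom 1, 2, 4, . . . , 64 shown in Figure 2.6:

dfs <- 2^(0:6)                    # degrees of freedom 1, 2, 4, ..., 64
curve( dchisq( x, df=dfs[1]), 0, 50, ylim=c(0, 0.5), ylab="Density")
for (i in 2:length( dfs))
  curve( dchisq( x, df=dfs[i]), add=TRUE, col=i)
legend( "topright", legend=dfs, col=1:length( dfs), lty=1, bty="n")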
Figure 2.6: Densities of the Chi-square distribution for various degrees of freedom.
(See R-Code 2.4.)
Let Z ∼ N(0, 1) and X ∼ $\mathcal{X}_m^2$ be two independent random variables. The distribution of the random variable

$T_m = \frac{Z}{\sqrt{X/m}}$  (2.43)

is called the t-distribution (or Student’s t-distribution) with m degrees of freedom. We have $E(T_m) = 0$, for m > 1, and $\operatorname{Var}(T_m) = m/(m-2)$, for m > 2.
Remark 2.3. For m = 1, 2 the density is heavy-tailed and the variance of the distribution does
not exist. Realizations of this random variable occasionally manifest with extremely large values.
Of course, the empirical variance can still be calculated (see R-Code 2.6). We come back to this
issue in Chapter 5. ♣
R-Code 2.5 t-distribution for various degrees of freedom. (See Figure 2.7.)
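A sketch in the spirit of R-Code 2.5, with the normal density as reference:

dfs <- 2^(0:6)                    # degrees of freedom 1, 2, 4, ..., 64
curve( dnorm( x), -3, 3, ylab="Density")            # normal density in black
for (i in 1:length( dfs))
  curve( dt( x, df=dfs[i]), add=TRUE, col=i+1)
legend( "topright", legend=dfs, col=2:(length( dfs)+1), lty=1, bty="n")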
Figure 2.7: Densities of the t-distribution for various degrees of freedom. The normal distribution is in black. A density with 2⁷ = 128 degrees of freedom would make the normal density function appear thicker. (See R-Code 2.5.)
2.8.3 F -Distribution
The F -distribution is mainly used to compare two empirical variances with each other, as we
will see in Chapter 12.
Let X ∼ $\mathcal{X}_m^2$ and Y ∼ $\mathcal{X}_n^2$ be two independent random variables. The distribution of the random variable

$F_{m,n} = \frac{X/m}{Y/n}$  (2.46)

is called the F-distribution with m and n degrees of freedom. We have

$E(F_{m,n}) = \frac{n}{n-2}$, for n > 2;  (2.47)

$\operatorname{Var}(F_{m,n}) = \frac{2n^2(m+n-2)}{m(n-2)^2(n-4)}$, for n > 4.  (2.48)
That means that if n increases the expectation gets closer to one and the variance to 2/m, with
m fixed.
Figure 2.8 shows the density for various degrees of freedom.
R-Code 2.6 Empirical variances of the t-distribution with one degree of freedom.
set.seed( 14)
tmp <- rt( 1000, df=1)
print( c(summary( tmp), Var=var( tmp)))
## Min. 1st Qu. Median Mean 3rd Qu.
## -1.9093e+02 -1.1227e+00 -2.0028e-02 8.1108e+00 1.0073e+00
## Max. Var
## 5.7265e+03 3.7391e+04
sort( tmp)[1:10] # many "large" values, but 2 exceptionally large
## [1] -190.929 -168.920 -60.603 -53.736 -47.764 -43.377 -36.252
## [8] -31.498 -30.029 -25.596
sort( tmp, decreasing=TRUE)[1:10]
## [1] 5726.531 2083.682 280.848 239.752 137.363 119.157 102.702
## [8] 47.376 37.887 32.443
R-Code 2.7 F -distribution for various degrees of freedom. (See Figure 2.8.)
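A sketch in the spirit of R-Code 2.7, with the degree-of-freedom pairs shown in Figure 2.8:

m <- c(1, 2, 5, 10, 50, 100, 250)
n <- c(1, 50, 10, 50, 50, 300, 250)
curve( df( x, df1=m[1], df2=n[1]), 0.01, 4, ylim=c(0, 3), ylab="Density")
for (i in 2:length( m))
  curve( df( x, df1=m[i], df2=n[i]), add=TRUE, col=i)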
In general, Wikipedia has nice summaries of many distributions. The page https://en.wikipedia.org/wiki/List_of_probability_distributions lists many of them.
The ultimate reference for (univariate) distributions is the encyclopedic series of Johnson
et al. (2005, 1994, 1995). Figure 1 of Leemis and McQueston (2008) illustrates extensively
the links between countless univariate distributions, a simplified version is available at https:
//www.johndcook.com/blog/distribution_chart/.
Figure 2.8: Density of the F -distribution for various degrees of freedom. (See R-
Code 2.7.)
iv) (*) Starting from the pmf of a Binomial random variable, derive the pmf of the Poisson
random variable when n → ∞, p → 0 but λ = np constant.
Problem 2.2 (Exponential Distribution) In this problem you get to know another important distribution you will frequently come across: the exponential distribution. Consider the random variable X with density

$f(x) = \begin{cases} 0, & x < 0,\\ c \cdot \exp(-\lambda x), & x \ge 0, \end{cases}$

with λ > 0. The parameter λ is called the rate. Subsequently, we denote an exponential random variable with X ∼ Exp(λ).
v) Let λ = 2. Calculate:
i) Simulate realizations of X1 , . . . , Xn for n = 10, 100 and 1000. Display the results with
histograms and superimpose the theoretical densities. Further, use the functions density()
to add empirical densities and rug() to visualize the values of the realizations. Give an
interpretation of the empirical density.
ii) Let $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Determine $E(\bar{X})$ and $\operatorname{Var}(\bar{X})$. When is the hypothesis of independence necessary for your calculations? In addition, what happens when n → +∞?

iii) Assume n = 100 and simulate 500 realizations of X1, . . . , Xn. Calculate $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ for each realization and plot a histogram of the averages. Compare it to the histograms from i): Which distribution do you get?
iv) Calculate the (empirical) median for each of the 500 simulations from iii). How many of the 500 medians are bigger than the corresponding averages?
v) Draw a histogram of min(X1 , . . . , Xn ) for the 500 realizations from iii) and compare it to
the theoretical result from Problem 2.vii).
Chapter 3
Estimation
Yi = µ + εi,  i = 1, . . . , n,  (3.1)

where Yi are the observations, µ is an unknown constant and εi are random variables representing measurement error. It is often reasonable to assume E(εi) = 0 with a symmetric density. Here, we even assume the εi to be iid N(0, σ²). Thus, Y1, . . . , Yn are normally distributed with mean µ and variance σ².

As typically both parameters µ and σ² are unknown, we first address the question of how we can determine plausible values for these model parameters from observed data.
Example 3.1. In R-Code 3.1 hemoglobin levels of blood samples from patients with Hb SS and
Hb S/β sickle cell disease are given (Hüsler and Zimmermann, 2010). The data are summarized
in Figure 3.1.
Equation (3.1) can be used as a simple statistical model for both diseases individually, with µ representing the corresponding population mean and εi describing the variability of the individuals around the population mean. A slightly more involved model that links the data of both diseases is

We assume the εi to be iid N(0, σ²). Thus the model states that both diseases have a different mean but the same variability. This assumption pools information from both samples to estimate the variance σ². The parameters of the model are µSS, µSb and (of lesser interest) σ².

Natural questions that arise are: What are plausible values of the population levels? How much do the individual deviations vary?

The questions of whether both levels are comparable or whether the level of Hb SS patients is (statistically) smaller than 10 are of a completely different nature and will be discussed in the next chapter, where we formally discuss statistical tests. ♣
Figure 3.1: Hemoglobin levels of patients with Hb SS and Hb S/β sickle cell disease.
(See R-Code 3.1.)
R-Code 3.1 Hemoglobin levels of patients with sickle cell disease and some summary statistics. (See Figure 3.1.)
HbSS <- c( 7.2, 7.7, 8, 8.1, 8.3, 8.4, 8.4, 8.5, 8.6, 8.7, 9.1,
9.1, 9.1, 9.8, 10.1, 10.3)
HbSb <- c(8.1, 9.2, 10, 10.4, 10.6, 10.9, 11.1, 11.9, 12.0, 12.1)
boxplot( list( HbSS=HbSS, HbSb=HbSb), col=c(3, 4))
qqnorm( HbSS, xlim=c(-2, 2), ylim=c(7, 12), col=3, main='')
qqline( HbSS, col=3)
tmp <- qqnorm( HbSb, plot.it=FALSE)
points( tmp, col=4)
qqline( HbSb, col=4)
c( mean( HbSS), mean( HbSb)) # means for both diseases
## [1] 8.7125 10.6300
var( HbSS) # here and below spread measures
## [1] 0.71317
c( var(HbSb), sum( (HbSb-mean(HbSb))^2)/(length(HbSb)-1) )
## [1] 1.649 1.649
c( sd( HbSS), sqrt( var( HbSS)))
## [1] 0.84449 0.84449
Example 3.2. i) The numerical values shown in R-Code 3.1 are estimates.
ii) Ȳ = (1/n) Σ_{i=1}^n Yi is an estimator.
ȳ = (1/10) Σ_{i=1}^{10} yi = 10.6 is a point estimate.
iii) S² = 1/(n−1) Σ_{i=1}^n (Yi − Ȳ)² is an estimator.
s² = 1/(n−1) Σ_{i=1}^{10} (yi − ȳ)² = 1.65 or s = 0.844 is a point estimate. ♣
Often, we denote parameters with Greek letters (µ, σ, λ, . . . ), with θ being the generic one. The estimator and estimate of a parameter θ are denoted by θ̂. Context makes clear which of the two cases is meant.
In the method of moments, the theoretical moment is written as a function of the parameter, E(Y) = g(θ), and θ̂ solves g(θ̂) = Ȳ.
In linear regression settings, the ordinary least squares method minimizes the sum of squares of the differences between observed responses and those predicted by a linear function of the explanatory variables. Due to the linearity, simple closed-form solutions exist (see Chapters 8ff).
By plugging the observed values of a random sample into the method of moments estimator, the estimates of the corresponding parameters are obtained.
Example 3.3. Let Y1, . . . , Yn ~iid Exp(λ). Since E(Y) = 1/λ, setting Ȳ = 1/λ̂ yields
λ̂ = λ̂_MM = 1/Ȳ.   (3.8)
Thus, the estimate of λ is the value 1/ȳ. ♣
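A minimal sketch of such a calculation with simulated data (the sample size and true rate are arbitrary choices for illustration):
set.seed(1)
y <- rexp(50, rate = 2)   # simulated exponential sample with true lambda = 2
1 / mean(y)               # method of moments estimate, close to 2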
Example 3.4. Let Y1, . . . , Yn ~iid F with expectation µ and variance σ². Since Var(Y) = E(Y²) − E(Y)² (Property 2.5.i)), we can write σ² = µ₂ − µ², with µ₂ = E(Y²) the second moment, and we have the estimator
σ̂²_MM = (1/n) Σ_{i=1}^n Yi² − Ȳ² = (1/n) Σ_{i=1}^n (Yi − Ȳ)².   (3.9)
♣
For a given distribution, we call L(θ) the likelihood function, or simply the likelihood.
Definition 3.2. The maximum likelihood estimator θ̂_ML of the parameter θ is based on maximizing the likelihood, i.e., θ̂_ML = argmax_θ L(θ). ♦
Maximizing the log-likelihood ℓ(θ) = log L(θ) instead of the likelihood is often preferred because the expressions simplify more and maximizing sums is much easier than maximizing products.
Example 3.5. Let Y1, . . . , Yn ~iid Exp(λ), thus
L(λ) = Π_{i=1}^n fY(yi) = Π_{i=1}^n λ exp(−λ yi) = λⁿ exp(−λ Σ_{i=1}^n yi).   (3.13)
Then
dℓ(λ)/dλ = d log(λⁿ exp(−λ Σ_{i=1}^n yi))/dλ = d(n log(λ) − λ Σ_{i=1}^n yi)/dλ = n/λ − Σ_{i=1}^n yi, which we set to zero,   (3.14)
so that
λ̂ = λ̂_ML = n / Σ_{i=1}^n yi = 1/ȳ.   (3.15)
In this case (as in others), λ̂_ML = λ̂_MM. ♣
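As a quick numerical check (a sketch, using the simulated data from above), the log-likelihood of Example 3.5 can also be maximized numerically and compared with the closed-form solution 1/ȳ:
set.seed(1)
y <- rexp(50, rate = 2)
negloglik <- function(lambda) -sum(dexp(y, rate = lambda, log = TRUE))
optimize(negloglik, interval = c(0.01, 20))$minimum   # numerical ML estimate
1 / mean(y)                                           # closed-form ML (and MM) estimate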
In the vast majority of cases, maximum likelihood estimators possess very nice properties. Intuitively, because we use information about the density and not only about the moments, they are "better" than method of moments and least squares estimators. Further, for many common random variables, the likelihood function has a single optimum, in fact a maximum, for all permissible θ.
E(θ̂) = θ,   (3.16)
Example 3.6. Let Y1, . . . , Yn ~iid N(µ, σ²).
i) Ȳ is unbiased for µ, since
E(Ȳ) = E((1/n) Σ_{i=1}^n Yi) = (1/n) · n E(Yi) = µ.   (3.17)
ii) S² = 1/(n−1) Σ_{i=1}^n (Yi − Ȳ)² is unbiased for σ². To show this we use the following two identities,
i.e., we rewrite the square so that the cross-term cancels with the second squared term. Collecting everything finally leads to
(n − 1) S² = Σ_{i=1}^n (Yi − µ)² − n(µ − Ȳ)²   (3.21)
(n − 1) E(S²) = Σ_{i=1}^n Var(Yi) − n E((µ − Ȳ)²)   (3.22)
(n − 1) E(S²) = nσ² − n · σ²/n = (n − 1)σ².   (3.23)
iii) σ̂² = (1/n) Σ_i (Yi − Ȳ)² is biased, since
E(σ̂²) = (1/n) (n − 1) · 1/(n−1) E(Σ_i (Yi − Ȳ)²) = ((n − 1)/n) σ²,   (3.24)
where we used E(S²) = σ². The bias is
E(σ̂²) − σ² = ((n − 1)/n) σ² − σ² = −(1/n) σ²,   (3.25)
which amounts to a slight underestimation of the variance. ♣
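A small simulation makes the bias of σ̂² visible (a sketch; the sample size n = 5 and σ² = 1 are arbitrary choices):
set.seed(1)
n <- 5; R <- 10000
sam <- matrix(rnorm(R * n), nrow = R)        # true sigma^2 = 1
S2     <- apply(sam, 1, var)                 # unbiased estimator S^2
sigma2 <- S2 * (n - 1) / n                   # biased estimator with divisor n
c(mean(S2), mean(sigma2), (n - 1) / n)       # approx 1, approx 0.8, and 0.8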
Example 3.7. Let Y1, . . . , Yn ~iid N(µ, σ²). Using the result (3.17) and Property 2.6.iii), we have
MSE(Ȳ) = bias(Ȳ)² + Var(Ȳ) = 0 + σ²/n.   (3.27)
Hence, the MSE vanishes as n increases. ♣
There is a second "classical" example for the calculation of the mean squared error; however, it requires some properties of squared Gaussian variables.
Example 3.8. If Y1, . . . , Yn ~iid N(µ, σ²), it is possible to show that (n − 1)S²/σ² ∼ χ²_{n−1}. Then
MSE(S²) = Var(S²) = σ⁴/(n − 1)² · Var((n − 1)S²/σ²) = σ⁴/(n − 1)² · (2n − 2) = 2σ⁴/(n − 1).   (3.28)
Analogously, one can show that MSE(σ̂²_MM) is smaller than (3.28). Moreover, the estimator (n − 1)S²/(n + 1) possesses the smallest MSE. ♣
Definition 3.4. Let Y1, . . . , Yn ~iid N(µ, σ²) with known σ². The interval
[Ȳ − z_{1−α/2} σ/√n , Ȳ + z_{1−α/2} σ/√n]   (3.32)
is an exact (1 − α) confidence interval for the parameter µ; 1 − α is called the level of the confidence interval. ♦
If the standard deviation σ is unknown, the ansatz must be modified by using a point estimate for σ, typically S = √(S²) with S² = 1/(n−1) Σ_i (Yi − Ȳ)². Since (Ȳ − µ)/(S/√n) ∼ T_{n−1}, the corresponding quantile must be modified:
1 − α = P( t_{n−1,α/2} ≤ (Ȳ − µ)/(S/√n) ≤ t_{n−1,1−α/2} ).   (3.33)
Definition 3.5. Let Y1, . . . , Yn ~iid N(µ, σ²). The interval
[Ȳ − t_{n−1,1−α/2} S/√n , Ȳ + t_{n−1,1−α/2} S/√n]   (3.34)
is an exact (1 − α) confidence interval for the parameter µ. ♦
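As a small illustration (a sketch; HbSS is the vector from R-Code 3.1), the t-based interval (3.34) can be computed by hand or obtained directly from t.test():
n <- length(HbSS)
mean(HbSS) + c(-1, 1) * qt(0.975, df = n - 1) * sd(HbSS) / sqrt(n)   # interval (3.34)
t.test(HbSS, conf.level = 0.95)$conf.int                             # same interval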
Confidence intervals are, as shown in the previous two definitions, constituted by random
variables (functions of Y1 , . . . , Yn ). Similar to estimators and estimates, confidence intervals are
computed with the corresponding realization y1 , . . . , yn of the random sample. Subsequently,
confidence intervals will be outlined in the blue-highlighted text boxes, as shown here.
Notice that the empirical approximate and empirical exact confidence intervals are of the form
[θ̂ − q_{1−α/2} SE(θ̂) , θ̂ + q_{1−α/2} SE(θ̂)],
that is, symmetric intervals around the estimate. Here, SE(·) denotes the standard error of the estimate, that is, an estimate of the standard deviation of the estimator.
Example 3.9. Let Y1, . . . , Y4 ~iid N(0, 1). R-Code 3.2 and Figure 3.2 show 100 empirical confidence intervals based on Equation (3.32) (top) and Equation (3.34) (bottom). Because n is small, the difference between the normal and the t-distribution is quite pronounced. This becomes clear when
[Ȳ − z_{1−α/2} S/√n , Ȳ + z_{1−α/2} S/√n]   (3.38)
is used as an approximation (Figure 3.2, middle).
A few more points to note are as follows. As we do not estimate the variance, all intervals in the top panel have the same length. On average we should observe 5% of the intervals colored red in the top and bottom panels. In the middle panel there are typically more, as the normal quantiles are too small compared to the t-quantiles (see Figure 2.7). ♣
Confidence intervals can often be constructed from a starting estimator and its distribution. In many cases it is possible to isolate the parameter to arrive at 1 − α = P(B_l ≤ θ ≤ B_u); often some approximations are necessary. We consider another classical case in the framework of Gaussian random variables.
Let Y1, . . . , Yn ~iid N(µ, σ²). The estimator S² for the parameter σ² is such that (n − 1)S²/σ² ∼ χ²_{n−1}, i.e., a chi-square distribution with n − 1 degrees of freedom. Hence
1 − α = P( χ²_{n−1,α/2} ≤ (n − 1)S²/σ² ≤ χ²_{n−1,1−α/2} )   (3.39)
      = P( (n − 1)S²/χ²_{n−1,α/2} ≥ σ² ≥ (n − 1)S²/χ²_{n−1,1−α/2} ),   (3.40)
where χ²_{n−1,p} is the p-quantile of the chi-square distribution with n − 1 degrees of freedom. The corresponding exact (1 − α) confidence interval no longer has the form θ̂ ± q_{1−α/2} SE(θ̂), because the chi-square distribution is not symmetric.
For large n, the chi-square distribution can be approximated with a normal one (see also Section 2.8.1) with mean n and variance 2n. Hence we can also use a Gaussian approximation.
Example 3.10. For the Hb SS data with sample variance 0.71, we have the empirical confidence interval [0.39, 1.71] for σ², computed with (16-1)*var(HbSS)/qchisq( c(.975, .025), df=16-1).
If we want to construct a confidence interval for the standard deviation parameter σ = √(σ²), we can use the approximation [√0.39, √1.71]. That means we have used the same transformation for the bounds as for the estimate, a tool that is often used. ♣
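A compact version of this calculation (a sketch; HbSS as in R-Code 3.1):
ci.var <- (16 - 1) * var(HbSS) / qchisq(c(.975, .025), df = 16 - 1)
ci.var          # approximately [0.39, 1.71] for sigma^2
sqrt(ci.var)    # transformed bounds for sigma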
For a fixed model the width of a confidence interval can be reduced by reducing the level or
increasing n.
The coverage probability of an exact confidence interval amounts to exactly 1−α. This is not
the case for approximate confidence intervals (we see a particular example in the next chapter).
R-Code 3.2 100 confidence intervals for the parameter µ = 0, based on three different approaches: exact with known σ = 1, a Gaussian approximation with estimated σ, and exact again with estimated σ (t-based).
set.seed( 1)
ex.n <- 100 # 100 confidence intervals
alpha <- .05 # 95% confidence intervals
n <- 4 # sample size
mu <- 0
sigma <- 1
sample <- array( rnorm( ex.n * n, mu, sigma), c(n,ex.n))
yl <- mu + c( -6, 6)*sigma/sqrt(n) # same y-axis for all
ybar <- apply( sample, 2, mean) # mean
# Sigma known:
sigmaybar <- sigma/sqrt(n)
plot( 1:ex.n, 1:ex.n, type='n', ylim=yl, xaxt='n', ylab='',
main=expression(sigma~known))
abline( h=mu)
for ( i in 1:ex.n){
ci <- ybar[i] + sigmaybar * qnorm(c(alpha/2,1-alpha/2))
lines( c(i,i), ci, col=ifelse( ci[1]>mu|ci[2]<mu, 2, 1))
}
[Figure panels: "σ known" (top), "Gaussian Approximation" (middle), "t-distribution" (bottom); see Figure 3.2.]
Figure 3.2: Normal and t-based confidence intervals for the parameter µ = 0 with σ = 1 known (above) and σ unknown (middle and below). The sample size is n = 4 and the confidence level is (1 − α) = 95%. Confidence intervals which do not cover the true value zero are in red. (Figure based on R-Code 3.2.)
ii) Plot the cdf and the pmf for λ = 1 and λ = 5.5.
Hint: use a discrete grid {0, 1, 2, . . . } (why?) and, where necessary, the R command
stepfun.
iii) For λ = 1 and λ = 5.5 sample m = 1000 random variables and draw histograms. Compare
the histograms with ii). What do you expect to happen when m is large?
iv) Let λ̂ = (1/n) Σ_{i=1}^n Xi be an estimator of λ. Calculate E(λ̂), Var(λ̂) and MSE(λ̂).
vi) Let λ = 3: calculate P(X1 ≤ 2), P(X1 < 2) and P(X1 ≥ 3).
vii) Observe that P(X1 ≤ 2) ≠ P(X1 < 2). Would this still be true if X1 had a continuous distribution?
Problem 3.2 (Germany cancer counts) The dataset Oral is available in the R package spam
and contains oral cavity cancer counts for 544 districts in Germany.
i) Load the data and take a look at its help page using ?Oral.
Hint: The R package spam is available on CRAN and can be installed with the com-
mand install.packages("spam") and loaded with require(spam) or, alternatively, with
library(spam). The command data(Oral) copies the dataset to the global environment.
iii) The Poisson distribution is common for modeling rare events such as deaths caused by cavity cancer (column Y in the data). However, the districts differ greatly in their populations. Define a subset of the data which only considers districts with expected fatal casualties caused by cavity cancer between 35 and 45 (subset, column E). Perform a Q-Q plot for a Poisson distribution.
Hint: use qqplot() from the stats package. Note that you need to define the distribution and the number of quantiles ppoints. Only qqnorm does this automatically. You also need to define lambda for the Poisson distribution.
Simulate a Poisson distributed random variable with the same length and the same lambda as your subset. Perform a Q-Q plot of your simulated data. Also check the histograms for visualization. What can you say about the distribution of your subset of the cancer data?
iv) Assume that the standardized mortality ratio Zi = Yi /Ei is normally distributed, i.e.,
iid
Z1 , . . . , Z544 ∼ N (µ, σ 2 ). Estimate µ and give a 95% (exact) confidence interval (CI).
What is the precise meaning of the CI?
v) Simulate a 95% confidence interval based on the following bootstrap scheme (sampling with replacement):
Repeat 100 000 times: draw a sample of size 544 with replacement from z1, . . . , z544, calculate its mean and store it.
Construct the confidence interval by taking the 2.5% and the 97.5% quantiles of the stored means.
Compare it to the CI from iv).
Chapter 4
Statistical Testing
Be aware of the multiple testing problem and know how to deal with it
As we are all aware, when tossing a fair coin 2m times we do not always observe exactly m heads but often a number close to it. This is illustrated in R with rbinom(1, size=2*m, prob=1/2) for 2m tosses, since rbinom(1, size=1, prob=1/2) "is" a fair coin. We use this idea for an arbitrary number of tosses and expect — for a fair coin — roughly half of them to be heads. We need to quantify "roughly", in the sense of what is natural or normal variability. We will discuss what can be said if we are outside the range of normal variability of a fair coin. As an illustration, suppose we observe 13 heads in 17 tosses, which seems to be an unusual case, and we intuitively wonder if the coin is fair. In other words, do the observed data provide enough evidence against a fair coin? We will formulate a formal statistical procedure to answer such questions.
Example 4.1. In rabbits, pododermatitis is a chronic multifactorial skin disease that manifests
mainly on the hind legs. This presumably progressive disease can cause pain leading to poor
welfare. To study the progression of this disease on the level of individual animals, scientists
assessed many rabbits in three farms over the period of an entire year (Ruchti et al., 2019). We
use a subset of the dataset in this and later chapters, consisting of one farm (with two barns) and
two visits (July 19/20, 2016 and June 29/30, 2017). The 6 stages from Drescher and Schlender-Böbbis (1996) were used as a tagged visual-analogue scale to score the occurrence and severity of pododermatitis on 4 spots on the rabbits' hind legs (left and right, heel and middle position), resulting in the variable PDHmean with range 0–10; for details on the scoring see Ruchti et al. (2018).
We consider the visits in June 2017 and would like to assess if the score of the 17 rabbits is comparable to 3.333, representing a low-grade scoring (low-grade hyperkeratosis, hypotrichosis or alopecia). (The observed mean is 3.87 with a standard deviation of 0.64.) R-Code 4.1 illustrates
the calculation of empirical confidence intervals, visualized in Figure 4.1. ♣
Figure 4.1: Rug plot with sample mean (vertical green) and confidence intervals based
on the t-distribution (green) and normal approximation (blue). (See R-Code 4.1.)
The idea of a statistical testing procedure is to formulate statistical hypotheses and to draw
conclusions from them based on the data. We always start with a null hypothesis, denoted with
H0 , and – informally – we compare how compatible the data is with respect to this hypothesis.
Simply stated, starting from a statistical hypothesis a statistical test calculates a value from
the data and places that value in the context of the hypothetical density induced by the statistical
(null) hypothesis. If the value is unlikely to occur, we argue that the data provides evidence
against the (null) hypothesis.
More formally, if we want to test a statement about a certain parameter, say θ, we need an estimator θ̂ for that parameter. We often need to transform the estimator such that its distribution does not depend on (the) parameter(s). We call this random variable (a function of the random sample) the test statistic. Some of these test statistics are well known and have been named, as we shall see later. Once the test statistic has been determined, we evaluate it at the observed sample and compare it with quantiles of its distribution under the null hypothesis; the comparison is typically expressed as a probability, i.e., the famous p-value.
A more formal definition of p-value follows.
Definition 4.1. The p-value is the probability, under the distribution of the null hypothesis, of
obtaining a result equal to or more extreme than the observed result. ♦
Example 4.2. We assume a fair coin is tossed 17 times and we observe 13 heads. Hence, the p-
value is the probability of observing 13, 14,. . . ,17 heads, or by symmetry of observing 13, 14, . . . ,
17 tails (corresponding to 4,3,. . . ,0 heads), which is sum( dbinom(0:4, size=17, prob=1/2) +
dbinom(13:17, size=17, prob=1/2)), equaling 0.049, i.e., we observe such a seemingly unlikely
event roughly every 20th time.
Note that we could calculate the p-value as 2*pbinom(4, size=17, prob=1/2) or equiva-
lently as 2*pbinom(12, size=17, prob=1/2, lower.tail=FALSE). ♣
In practice, one often starts with a scientific hypothesis and then collects data or performs an experiment. The data are then "modeled statistically", e.g., we need to determine a theoretical distribution for them. In our discussion here, the distribution typically involves parameters that are linked to the scientific question (probability p in a binomial distribution for coin tosses, mean µ of a Gaussian distribution for testing differences in pododermatitis scores). We formulate the null hypothesis H0. In many cases we pick a "known" test instead of "manually" constructing a test statistic. Of course this test has to be in sync with the statistical model. Based on the p-value we then summarize the evidence against the null hypothesis. We cannot make any statement in favor of the hypothesis.
Figure 4.2 illustrates graphically the p-value in two hypothetical situations. Suppose that
under the null hypothesis the density of the test statistic is Gaussian and suppose that we observe
a value of the test statistic of 1.8. If more extreme is considered on both sides of the density then
the p-value consists of two probabilities (here because of the symmetry, twice the probability of
either side). If more extreme is actually larger (here, possibly smaller in other situations), the
p-value is calculated based on a one-sided probability. As the Gaussian distribution is symmetric, the two-sided p-value is twice the one-sided one, here computed with 1-pnorm(1.8), or, equivalently, pnorm(1.8, lower.tail=FALSE).
[Figure panels: H0 two-sided, p-value = 0.072 (left); H0 one-sided, p-value = 0.036 (right); the observation is marked in both panels.]
Figure 4.2: Illustration of the p-value in the case of a Gaussian test statistic with
observed value 1.8. Two-sided (left panel) and one-sided setting (right panel).
Example 4.3. For simplicity we assume that the pododermatitis scores of Example 4.1 are a realization of X1, . . . , X17 ~iid N(µ, 0.8²), i.e., n = 17 and the standard deviation is known. Hence, X̄ ∼ N(µ, 0.8²/17). Moreover, under the distributional assumption H0: µ = µ0 = 3.333, X̄ ∼ N(3.333, 0.8²/17). Thus — taking again a two-sided setting —
p-value = 2 P(X̄ ≥ x̄) = 2 (1 − P(X̄ < 3.869))   (4.1)
        = 2 (1 − P( (X̄ − 3.333)/(0.8/√17) < (3.869 − 3.333)/(0.8/√17) )) = 2 (1 − Φ(2.762)) ≈ 0.6%.   (4.2)
There is evidence in the data against the null hypothesis. ♣
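The same number can be obtained directly in R (a sketch of the calculation in (4.1)–(4.2)):
2 * pnorm(3.869, mean = 3.333, sd = 0.8 / sqrt(17), lower.tail = FALSE)   # about 0.006
2 * (1 - pnorm(2.762))                                                    # standardized version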
Some authors summarize p-values in [1, 0.1] as no evidence, in [0.1, 0.01] as weak evidence, in [0.01, 0.001] as substantial evidence, and smaller ones as strong evidence (Held and Sabanés Bové, 2014). In R, significance codes are used for similar ranges: ' ', '.', '*', '**', and '***'.
Although we discuss all six points in various settings, in this chapter emphasis lies on points
ii) to v).
More precisely, we start with a null hypothesis H0 and an alternative hypothesis, denoted by H1 or HA. These hypotheses are with respect to a parameter, say θ. Hypotheses are classified as simple if the parameter θ assumes only a single value (e.g., H0: θ = 0), or composite if the parameter θ can take on a range of values (e.g., H0: θ ≤ 0 or H1: µ ≠ µ0). Often, you will encounter a simple null hypothesis with a simple or composite alternative hypothesis. For example, when testing a mean parameter we would have H0: µ = µ0 vs H1: µ ≠ µ0 for the latter. In practice the case of a simple null and a simple alternative hypothesis, e.g., H0: µ = µ0 vs H1: µ = µA ≠ µ0, is rarely used but has considerable didactic value.
Hypothesis tests may be either one-sided (directional), in which only a relationship in a
prespecified direction is of interest, or two-sided, in which a relationship in either direction is
tested. One could use a one-sided test for “Hb SS has a lower average hemoglobin value than
Hb Sβ”, but a two-sided test is needed for “Hb SS and Hb Sβ have different average hemoglobin
values". Further examples of hypotheses are given in Rudolf and Kuhlisch (2008). We strongly recommend always using two-sided tests (e.g., Bland and Bland, 1994; Moyé and Tita, 2002), not only in clinical studies where this is the norm, but also because, as Bland and Bland (1994) state, "a one-sided test is appropriate when a large difference in one direction would lead to the same action as no difference at all. Expectation of a difference in a particular direction is not adequate justification." However, to illustrate certain concepts, a one-sided setting may be simpler and more accessible.
As in the case of a significance test, we compare the value of the test statistic with the
quantiles of the distribution of the null hypothesis. In the case of small p-values we reject H0 , if
not, we fail to reject H0 . The decision of whether the p-value is “small” or not is based on the
so-called significance level, α.
Definition 4.2. The rejection region of a test includes all values of the test statistic with a
p-value smaller than the significance level. The boundary values of the rejection region are called
critical values. ♦
If we assume that the distribution of the test statistic is Gaussian and α = 0.05, the critical values are ±1.96 and 1.64, respectively (qnorm(c(0.05/2, 1 - 0.05/2)) and qnorm(1 - 0.05)). These critical values are typically linked with the so-called z-test. Note the similarity with Example 4.3.
Figure 4.3: Critical values (red) and rejection regions (orange) for two-sided H0 : µ =
µ0 = 0 (left) and one-sided H0 : µ ≤ µ0 = 0 (right) hypothesis test with significance
level α = 5%.
Two types of errors can occur in significance testing: Type I errors, where we reject H0 when we should not, and Type II errors, where we fail to reject H0 when we should. The framework of hypothesis testing allows us to quantify the probability of committing these two errors. The probability of a Type I error is exactly α. To calculate the Type II error, we need to assume a specific value for our parameter within the alternative hypothesis, e.g., a simple alternative. The probability of a Type II error is often denoted by β. Table 4.1 summarizes the errors in a classical 2 × 2 layout.
Ideally we would like to construct tests that have small Type I and Type II errors. This is not possible and one typically fixes the Type I error to some small value, say 5%, 1% or suchlike (committing a Type I error typically has more severe consequences than a Type II error). Type I and Type II errors are shown in Figure 4.4 for two different alternative hypotheses. This illustrates that reducing the significance level α leads to an increase in β, the probability of committing a Type II error.
Table 4.1: Type I and Type II errors in the setting of significance tests.
                      H0 is true                      H0 is false
Fail to reject H0     correct decision                Type II error (probability β)
Reject H0             Type I error (probability α)    correct decision
Example 4.4. Suppose we reject the null hypothesis of having a fair coin if we observe 0,. . . ,3
or 14,. . . ,17 heads out of 17 tosses. The Type I error is 2*pbinom(3, size=17, prob=1/2), i.e.,
0.013, and, if the coin has a probability of 0.6 for heads, the Type II error is sum(dbinom(4:13,
size=17, prob=0.6)), i.e., 0.953. ♣
Figure 4.4: Type I error with significance level α (red) and Type II error with probability β (blue) for two different alternative hypotheses (µ = 2 top row, µ = 4 bottom row), with a two-sided hypothesis H0: µ = µ0 = 0 (left column) and a one-sided hypothesis H0: µ ≤ µ0 = 0 (right column).
The value 1 − β is called the power of a test. High power of a test is desirable in an experiment: we want to detect small effects with a large probability. R-Code 4.2 computes the power under a Gaussian assumption. More specifically, under the assumption of σ = 1 we test H0: µ = µ0 = 0 versus H1: µ ≠ µ0. The power can only be calculated for a specific assumption about the "actual" mean µ1, i.e., for a simple alternative. Thus, as typically done, Figure 4.5 plots the power as a function of µ1 − µ0.
The workflow of a hypothesis test is very similar to that of a statistical significance test and only points iii) and v) need to be slightly modified:
R-Code 4.2 A one-sided and two-sided power curve for a z-test. (See Figure 4.5.)
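A minimal sketch of such a power computation for the z-test (assuming the standard error of the mean equals one, so that the horizontal axis corresponds to µ1 − µ0, and α = 5%; the original code may differ in details):
alpha <- 0.05
delta <- seq(-1, 4, by = 0.01)   # mu_1 - mu_0 in units of the standard error
pow1 <- 1 - pnorm(qnorm(1 - alpha) - delta)                   # one-sided power
pow2 <- 1 - pnorm(qnorm(1 - alpha/2) - delta) +
        pnorm(-qnorm(1 - alpha/2) - delta)                    # two-sided power
plot(delta, pow1, type = "l", col = 4, ylim = c(0, 1),
     xlab = expression(mu[1] - mu[0]), ylab = "Power")
lines(delta, pow2, lty = 2)
abline(h = alpha, col = "gray")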
Figure 4.5: Power: one-sided (blue solid line) and two-sided (black dashed line). The
gray line represents the level of the test, here α = 5%. The vertical lines represent the
alternative hypotheses µ = 2 and µ = 4 of Figure 4.4. (See R-Code 4.2.)
The choice of test is again constrained by the assumptions. The significance level must, however,
always be chosen before the computations.
The test statistic value tobs calculated in step vi) is compared with critical values tcrit in order
to reach a decision in step vii). When the decision is based on the calculation of the p-value,
it consists of a comparison with α. The p-value can be difficult to calculate, but is valuable
because of its direct interpretation as the strength (or weakness) of the evidence against the null
hypothesis.
4.2. HYPOTHESIS TESTING 65
• tcrit, Fcrit, . . . the critical values, i.e., the quantiles according to the distribution of the test statistic and the significance level.
Our scientific question can only be formulated in general terms and we call it a hypothesis. Generally, two-sided tests are performed. The significance level is modified accordingly for one-sided tests.
In unclear cases, the statistical hypothesis is specified.
For most tests, there is a corresponding R function. The arguments x, y usually
represent vectors containing the data and alpha the significance level. From the
output, it is possible to get the p-value.
In this work we consider various test situations. The choice of test is primarily dependent on
the parameter and secondly on the statistical assumptions. The following list can be used as a
decision tree.
Distribution tests (also goodness-of-fit tests) differ from the other tests discussed here, in the sense that they do not test or compare a single parameter. Of course there are many additional possible tests; the approaches described in the first two sections allow one to construct arbitrary tests. In Sections 4.3 and 4.6 we present some of these tests in more detail by motivating the test statistic, giving an explicit example and summarizing the test in yellow boxes. Ultimately, we perform the test with a single call in R. However, the underlying mechanism has to be understood; it would be too dangerous to use statistical tests only as black-box tools.
Ȳ ∼ N(µ, σ²/n)   ⟹   (Ȳ − µ)/(σ/√n) ∼ N(0, 1)   and   (Ȳ − µ)/(S/√n) ∼ T_{n−1}.   (4.3)
Under the null hypothesis H0 , µ is of course our theoretical value, say µT . We typically use the
last distribution, as σ 2 is unknown (Example 4.3 was linked to the first distribution).
The test statistic is t-distributed (see Section 2.8.2) and so the function pt is used to calculate p-values. As we have only one sample, the test is typically called the "one-sample t-test", illustrated in box Test 1 and in R-Code 4.3 with an example based on the pododermatitis data.
Assumptions: The population, from which the sample arises, is normally dis-
tributed with unknown mean µT . The observed data are independent and
the variance is unknown.
Calculation: tobs = |x̄ − µT| / s · √n.
Decision: Reject H0 : µ = µT , if tobs > tcrit = tn−1,1−α/2 .
Example 4.5. We test the hypothesis that the animals have a stronger pododermatitis compared to low-grade hyperkeratosis. The latter corresponds to scores lower than 3 · 1/3. The statistical null hypothesis is that the mean score is equal to 3.333, and we want to know if the mean of the (single) sample deviates from this specified value sufficiently for a statistical claim.
The following values are given: mean: 3.869, standard deviation: 0.638, sample size: 17.
H0: µ = 3.333;
tobs = |3.869 − 3.333| / (0.638/√17) = 3.467;
tcrit = t_{16,1−0.05/2} = 2.120;   p-value: 0.003.
The p-value is low and hence there is evidence against the null hypothesis. R-Code 4.3 illustrates the direct calculation of the p-value. Figure 4.1 illustrated that the value 3.333 is not in the confidence interval(s) of the mean. The evidence against the null hypothesis is thus not too surprising. ♣
R-Code 4.3 One sample t-test, pododermatitis (see Example 4.5 and Test 1)
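A sketch of the direct calculation from the reported summaries; with the raw June 2017 scores one would simply call t.test(scores, mu = 3.333) (the object name scores is an assumption):
xbar <- 3.869; s <- 0.638; n <- 17; mu0 <- 3.333
tobs <- abs(xbar - mu0) / (s / sqrt(n))
c(tobs = tobs, tcrit = qt(1 - 0.05/2, df = n - 1),
  p.value = 2 * pt(tobs, df = n - 1, lower.tail = FALSE))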
Often we want to compare two different means, say x̄ and ȳ. We assume that both random samples are normally distributed, i.e., X1, . . . , Xn ~iid N(µx, σ²) and Y1, . . . , Yn ~iid N(µy, σ²), and independent. Then
X̄ ∼ N(µx, σ²/n),   Ȳ ∼ N(µy, σ²/n)   ⟹   X̄ − Ȳ ∼ N(µx − µy, 2σ²/n)   (4.4)
⟹   (X̄ − Ȳ − (µx − µy)) / (σ/√(n/2)) ∼ N(0, 1)   and, under H0: µx = µy,   (X̄ − Ȳ) / (σ/√(n/2)) ∼ N(0, 1).   (4.5)
The difficulty is that we do not know σ and we have to estimate it. The estimate takes a somewhat complicated form as both samples need to be taken into account, with possibly different lengths and means. This pooled estimate is denoted by sp and given in Test 2. Ultimately, we again have a t-distribution for the test statistic, as we use an estimate of the standard deviation in the standardization of a normal random variable. As the calculation of sp requires the estimates of µx and µy, we adjust the degrees of freedom to nx + ny − 2.
R-Code 4.4 is again based on the pododermatitis data and compares the scores between the two different barns.
Assumptions: Both populations are normally distributed with the same unknown
variance. The samples are independent.
Calculation: tobs = |x̄ − ȳ| / sp · √( nx · ny / (nx + ny) ),
where sp² = ( (nx − 1) sx² + (ny − 1) sy² ) / (nx + ny − 2).
Decision: Reject H0 : µx = µy if tobs > tcrit = tnx +ny −2,1−α/2 .
Example 4.6. For the pododermatitis scores of the two barns, we have the following summaries. Means: 3.83 and 3.67; standard deviations: 0.88 and 0.87; sample sizes: 20 and 14. Hence, using the formulas given in Test 2, we have
H0: µx = µy
sp² = (19 · 0.884² + 13 · 0.868²) / (20 + 14 − 2) = 0.770, i.e., sp = 0.878
tobs = |3.826 − 3.675| / 0.878 · √(20 · 14 / (20 + 14)) = 0.494
tcrit = t_{32,1−0.05/2} = 2.037;   p-value: 0.625.
Hence, 3.826 and 3.675 are not statistically different. See also R-Code 4.4. ♣
R-Code 4.4 Two-sample t-test with independent samples, pododermatitis (see Exam-
ple 4.6 and Test 2).
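A sketch of the corresponding call (the column name Barn for the barn indicator is an assumption; the original code may subset the pododermatitis data differently):
t.test(PDHmean ~ Barn, data = podo, var.equal = TRUE)   # pooled two-sample t-test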
In practice, the variances of both samples are often different, say σx² and σy². In such a setting, we have to normalize the mean difference by √(sx²/nx + sy²/ny). While this estimate seems simpler than the pooled estimate sp, the degrees of freedom of the resulting t-distribution are not intuitive and difficult to derive, and we refrain from elaborating on them here. In the literature, this test is called Welch's t-test and it is actually the default choice of t.test( x, y, conf.level=1-alpha).
The assumption of independence of the two samples in the previous Test 2 may not be valid if the two samples consist of two measurements of the same individual, e.g., observations at two different instances in time. In such settings, where we have a "before" and "after" measurement, it would be better to take this pairing into account by considering the differences instead of the two samples. Hence, instead of constructing a test statistic based on X̄ − Ȳ we consider
X1 − Y1, . . . , Xn − Yn ~iid N(µx − µy, σd²)   ⟹   X̄ − Ȳ ∼ N(µx − µy, σd²/n)   (4.6)
and, under H0: µx = µy,   (X̄ − Ȳ) / (σd/√n) ∼ N(0, 1),   (4.7)
where σd² is essentially the sum of the two variances minus (twice) the "dependence" between Xi and Yi. We formalize this dependence, called covariance, starting in Chapter 7.
The paired two-sample t-test can thus be considered a one sample t-test of the differences
with mean µT = 0.
Question: Are the means x and y of two paired samples significantly different?
Assumptions: The samples are paired, the observed values are on the interval
scale. The differences are normally distributed with unknown mean δ. The
variance is unknown.
Calculation: tobs = |d̄| / sd · √n, where
• di = xi − yi is the i-th observed difference,
• d̄ is the arithmetic mean and sd is the standard deviation of the differences di.
Example 4.7. We consider the pododermatitis measurements from July 2016 and June 2017
and test if there is a progression over time. We have the following summaries for the differences
(see R-Code 4.5 and Test 3). Mean: 0.21; standard deviation: 1.26; and sample size: 17.
H0: d = 0 or H0: µx = µy;
tobs = |0.210| / (1.262/√17) = 0.687;
tcrit = t_{16,1−0.05/2} = 2.12;   p-value: 0.502.
There is no evidence that there is a progression over time. ♣
R-Code 4.5 Two-sample t-test with paired samples, pododermatitis (see Example 4.7
and Test 3).
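A sketch of the corresponding call (the object names pdh2016 and pdh2017 for the scores of the two visits are assumptions):
t.test(pdh2017, pdh2016, paired = TRUE)   # equivalent to t.test(pdh2017 - pdh2016, mu = 0)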
The "t-tests" require normally distributed data. The tests are relatively robust towards deviations from normality, as long as there are no extreme outliers. Otherwise, rank-based tests can be used (see Chapter 6). The assumption of normality can be verified quantitatively with formal normality tests (χ²-test as shown in Section 4.6, Shapiro–Wilk test, Kolmogorov–Smirnov test). Often, however, a qualitative verification is sufficient (e.g., with the help of a Q-Q plot).
which corresponds to the boundaries of the empirical (1 − α) confidence interval for µT . Analo-
gously, this duality can be established for the other tests described in this chapter.
Example 4.8. Consider the situation from Example 4.5. Instead of comparing the p-value, we
can also consider the confidence interval, whose boundary values are 3.54 and 4.20. Since the
value 3.33 is not in this range, the null hypothesis is rejected.
This is shown in Figure 4.1. ♣
In R most test functions give the corresponding confidence intervals with the value of the
statistic and p-value. Some functions require the additional argument conf.int=TRUE, as well.
i) p-values can indicate how incompatible the data are with a specified statistical model.
ii) p-values do not measure the probability that the studied hypothesis is true, or the proba-
bility that the data were produced by random chance alone.
iii) Scientific conclusions and business or policy decisions should not be based only on whether
a p-value passes a specific threshold.
v) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
vi) By itself, a p-value does not provide a good measure of evidence regarding a model or
hypothesis.
P(at least one false significant result) = 1 − P(no false significant result)   (4.13)
= 1 − (1 − α)^m.   (4.14)
In Table 4.2 the probabilities of at least one false significant result for α = 0.05 and various m
are presented. Even for just a few tests, the probability increases drastically, which should not
be tolerated.
Table 4.2: Probabilities of at least one false significant test result when performing m
tests at level α = 5% (top row) and at level αnew = α/m (bottom row).
m 1 2 3 4 5 6 8 10 20 100
1 − (1 − α)m 0.05 0.098 0.143 0.185 0.226 0.265 0.337 0.401 0.642 0.994
1 − (1 − αnew )m 0.05 0.049 0.049 0.049 0.049 0.049 0.049 0.049 0.049 0.049
There are several different methods that allow multiple tests to be performed while maintain-
ing the selected significance level. The simplest and most well-known of them is the Bonferroni
correction. Instead of comparing the p-value of every test to α, they are compared to a new
significance level, αnew = α/m, see second row of Table 4.2. There are several alternative meth-
ods which, depending on the situation, may be more appropriate. We recommend using at least method="holm" (the default) in p.adjust. For more details see, for example, Farcomeni (2008).
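A short sketch of such an adjustment (the p-values are made up for illustration only):
pvals <- c(0.001, 0.012, 0.030, 0.045, 0.20)
rbind(bonferroni = p.adjust(pvals, method = "bonferroni"),
      holm       = p.adjust(pvals, method = "holm"))   # compare adjusted p-values with alpha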
Hypothesizing After the Results are Known (HARKing) is another inappropriate scientific practice in which a post hoc hypothesis is presented as an a priori hypothesis. In a nutshell, we collect the data of the experiment and adjust the hypothesis after we have analysed the data, e.g., we select effects small enough such that significant results have been observed.
Along similar lines, analyzing a dataset with many different methods will lead to many p-values, out of which a proportion α are significant (when the null hypothesis holds); due to various inherent decisions often even more. When searching for a good statistical analysis one often has to make many choices and thus inherently selects the best one among many. This danger is often called the 'garden of forking paths'. Conceptually, adjusting the p-value for the many (not performed) tests would mitigate the problem.
Often if a result is not significant, the study is not published and is left in a 'file drawer'. A seemingly significant result might well be due to a Type I error, but this is not evident as many similar experiments lead to non-significant outcomes that are not published.
For many scientific domains, it is possible to preregister the study, i.e., to declare the study
experiment, analysis methods, etc. before the actual data has been collected. In other words,
everything is determined except the actual data collection and actual numbers of the statistical
analysis. The idea is that the scientific question is worth investigating and reporting independent of the actual outcome. Such an approach reduces HARKing, the garden-of-forking-paths issue, publication bias and more.
density, and distribution functions implemented in R with [q,d,p]f). This "classic" F-test is given in Test 4.
In later chapters we will see more natural settings where we need to compare variances (not necessarily from two a priori different samples).
Example 4.9. The two-sample t-test (Test 2) requires equal variances. The pododermatitis data do not contradict this null hypothesis, as R-Code 4.6 shows (compare Example 4.6 and R-Code 4.4). ♣
The so-called chi-square test (χ² test) verifies whether the observed data follow a particular distribution. This test is based on a comparison of the observed with the expected frequencies and can also be used in other situations, for example, to test if two categorical variables are independent (Test 6).
Under the null hypothesis, the test statistic of the chi-square test is χ²-distributed. The quantile, density, and distribution functions are implemented in R with [q,d,p]chisq (see Section 2.8.1).
In Test 5, the categories should be aggregated so that all bins contain a reasonable number of counts, e.g., ei ≥ 5. Additionally, N − k > 1.
Question: Are the variances s2x and s2y of two samples significantly different?
Assumptions: Both populations, from which the samples arise, are normally dis-
tributed. The samples are independent and the observed values are from the
interval scale.
Calculation: Fobs = sx² / sy²
(the larger variance always goes in the numerator, so sx² > sy²).
Decision: Reject H0: σx² = σy² if Fobs > Fcrit = f_{nx−1, ny−1, 1−α}.
R-Code 4.6 Comparison of two variances, PDH (see Example 4.9 and Test 4).
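A sketch of the comparison (the objects scores1 and scores2 holding the PDHmean values of the two barns are assumptions):
var.test(scores1, scores2)   # F-test for the equality of two variances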
Example 4.10. With few observations (10 to 50) it is often pointless to test for normality of the data. Even for larger samples, a Q-Q plot is often more informative. For completeness, we illustrate a simple goodness-of-fit test by comparing the pododermatitis data with expected counts constructed from a Gaussian density with matching mean and variance (R-Code 4.7). As there is no significant difference in the means and the variances, we pool over both periods and barns (n = 34).
The binning of the data is done through a histogram-type binning (an alternative way would be table( cut( podo$PDHmean))). As we have fewer than five observations in several bins, the function chisq.test issues a warning. This effect could be mitigated if we calculate the p-value using a bootstrap simulation by setting the argument simulate.p.value=TRUE. Pooling the bins, say breaks=c(1.5,2.5,3.5,4,4.5,5), would be an alternative as well.
Decision: Reject H0: "no deviation between the observed and expected" if χ²obs > χ²crit = χ²_{N−1−k, 1−α}, where k is the number of parameters estimated from the data to calculate the expected counts.
R-Code 4.7 Testing normality, pododermatitis (see Example 4.10 and Test 5).
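A sketch of such a goodness-of-fit calculation (the binning via hist() and the pooled scores podo$PDHmean follow the description in Example 4.10; the original code may differ in details):
x <- podo$PDHmean                          # pooled scores, n = 34
h <- hist(x, plot = FALSE)                 # histogram-type binning
p <- diff(pnorm(h$breaks, mean = mean(x), sd = sd(x)))
chisq.test(h$counts, p = p / sum(p))       # warning expected: some bins have few counts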
iii) Determine the lower and upper bound of a confidence interval, Bl and Bu (both functions of X̄), such that
P(−q ≤ √n (X̄ − µ)/σ ≤ q) = P(Bl ≤ µ ≤ Bu).
vi) Use the sickle-cell disease data and construct 90%-confidence intervals for the means of
HbSS and HbSβ variants (assume σ = 1). sickle.RData is available on the web-page and
provides the HbSS and HbSb measurements.
Problem 4.2 (Normal distribution with unknown σ) Let X1, . . . , Xn ~iid N(µ, σ²) with σ > 0 unknown and S² = 1/(n−1) Σ_{i=1}^n (Xi − X̄)².
i) What is the distribution of √n (X̄ − µ)/S? (No formal proof required.)
iv) Use the sickle-cell disease data. Construct 90%-confidence intervals for the means of vari-
ants HbSS and HbSβ (assume σ is unknown).
Problem 4.3 (t-Test) Use again the sickle-cell disease data. For the cases listed below, spec-
ify the null and alternative hypothesis. Then use R to perform the tests and give a careful
interpretation.
Explain and apply estimation, confidence interval and hypothesis testing for
proportions
In this chapter we have a closer look at statistical techniques that help us to correctly answer the above questions. More precisely, we will estimate proportions and then compare proportions with each other. To simplify the exposition, we discuss the estimation using only one of the two risk factors.
                           Diagnosis
                     positive   negative   Total
Risk   with factor      h11        h12       n1
       without          h21        h22       n2
5.1 Estimation
We start with a simple setting where we observe occurrences of a certain event and are interested
in the proportion of the events over the total population. More specifically, we consider the
number of successes in a sequence of experiments, i.e., whether a certain treatment had an effect.
We often use a binomial random variable X ∼ Bin(n, p) for such a setting, where n is given or
known and p is unknown, the parameter of interest. Intuitively, we find x/n to be an estimate
of p and X/n the corresponding estimator. We will construct a confidence interval for p in the
next section.
With the method of moments we obtain the estimator p̂_MM = X/n, since np = E(X) and we have only one observation (the total number of cases). The estimator is identical to the intuitive estimator.
The likelihood estimator is constructed as follows:
L(p) = (n choose x) p^x (1 − p)^{n−x}   (5.1)
ℓ(p) = log L(p) = log (n choose x) + x log(p) + (n − x) log(1 − p)   (5.2)
dℓ(p)/dp = x/p − (n − x)/(1 − p)   ⇒   x/p̂_ML = (n − x)/(1 − p̂_ML),   i.e., p̂_ML = x/n.   (5.3)
In our example we have the following estimates: p̂T = 138/1370 ≈ 10% for the treatment group and p̂C = 175/1336 ≈ 13% for the control group. The question as to whether the two proportions are different enough to be able to speak of an effect of the drug will be discussed in a later part of this chapter.
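These estimates are quickly reproduced in R (a sketch based on the counts given above):
c(p_T = 138 / 1370, p_C = 175 / 1336)   # approximately 0.10 and 0.13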
i) How many cases of pre-eclampsia can be expected in a group of 100 pregnant women?
Note, however, that the estimator p̂ = X/n does not have a "classical" distribution. Figure 5.1 illustrates the probability mass function based on the estimate for the pre-eclampsia cases in the treated group. The figure visually suggests using a Gaussian approximation, which is well justified here as n p̂(1 − p̂) ≥ 9. The Gaussian approximation for X is then used to state that the estimator p̂ is also approximately Gaussian.
Figure 5.1: Probability mass function (top) with zoom in and the normal approxima-
tion (bottom).
When dealing with proportions we often speak of odds, or simply of chance, defined by ω = p/(1 − p). The corresponding intuitive estimator (and estimate) is ω̂ = p̂/(1 − p̂). As a side note, this estimator also coincides with the maximum likelihood estimator. Similarly, θ̂ = log(ω̂) = log(p̂/(1 − p̂)) is an intuitive (and the maximum likelihood) estimator (and estimate) of the log odds.
Remark 5.1. For (regular) models with parameter θ, as n → ∞, likelihood theory states that the estimator θ̂_ML is normally distributed with expected value θ and variance Var(θ̂_ML).
Since Var(X/n) = p(1 − p)/n, one can assume that SE(p̂) = √(p̂(1 − p̂)/n). The so-called Wald confidence interval rests upon this assumption (which can be shown more formally) and is identical to (5.8). ♣
If the inequality in (5.4) is solved through a quadratic equation, we obtain the empirical Wilson confidence interval
b_{l,u} = 1/(1 + q²/n) · ( p̂ + q²/(2n) ± q · √( p̂(1 − p̂)/n + q²/(4n²) ) ),   (5.9)
The Wilson confidence interval is “more complicated” than the Wald confidence interval. Is
it also “better” because one less approximation is required?
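A sketch comparing the two intervals for the treatment group (prop.test with correct=FALSE returns the Wilson interval; the counts are those given earlier):
x <- 138; n <- 1370; phat <- x / n; q <- qnorm(0.975)
wald   <- phat + c(-1, 1) * q * sqrt(phat * (1 - phat) / n)
wilson <- prop.test(x, n, correct = FALSE)$conf.int
rbind(wald, wilson = as.numeric(wilson))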
Ideally the coverage probability of a (1 − α) confidence interval should be 1 − α. For a discrete random variable, the coverage is
P(p ∈ CI) = Σ_{x=0}^n P(X = x) · I{p ∈ CI}.   (5.12)
R-Code 5.2 calculates the coverage of the 95% confidence intervals for X ∼ Bin(n = 40, p = 0.4) and demonstrates that the Wilson confidence interval has better coverage (96% compared to 94%).
R-Code 5.2: Coverage of 95% confidence intervals for X ∼ Bin(n = 40, p = 0.4).
p <- .4
n <- 40
x <- 0:n
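The remaining steps of the coverage calculation can be sketched as follows (shown for the Wald interval; the Wilson interval is treated analogously; the original R-Code 5.2 may differ in details):
q <- qnorm(0.975); phat <- x / n
se <- sqrt(phat * (1 - phat) / n)
covered <- (phat - q * se <= p) & (p <= phat + q * se)
sum(dbinom(x, n, p) * covered)     # Wald coverage, approximately 0.94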
The coverage depends on p, as shown in Figure 5.2 (from R-Code 5.3). The Wilson confidence interval has coverage closer to the nominal level near the center. This observation also holds when n is varied, as in Figure 5.3. Note that the top "row" of the left and right parts of the panel corresponds to the top and bottom panels of Figure 5.2.
[Figure panels: Wald coverage (top) and Wilson coverage (bottom) as functions of p.]
Figure 5.2: Coverage of the 95% confidence intervals for X ∼ Bin(n = 40, p). The red
dashed line is the nominal level 1 − α and in green we have a smoothed curve to “guide
the eye”.
The width of an empirical confidence interval is bu − bl. For the Wald confidence interval we obtain
2q · √( p̂(1 − p̂)/n ).   (5.13)
[Figure panels: coverage of the Wald CI (left) and Wilson CI (right) as functions of p and n.]
Figure 5.3: Coverage of the 95% confidence intervals for X ∼ Bin(n, p) as functions
of p and n. The probabilities are symmetric around p = 1/2. All values smaller than
0.7 are represented with dark red.
Figure 5.4: Widths of the empirical 95% confidence intervals for X ∼ Bin(n = 40, p)
(The Wald is in solid green, the Wilson in dashed blue).
The widths are depicted in Figure 5.4. For 5 < x < 36, the Wilson confidence interval has a smaller width and better nominal coverage. For small and very large values of x, the Wald confidence interval has far too small a coverage, and thus wider intervals would be desirable.
case of Test 5. That is, an overall (joint) proportion is estimated and the observed and expected
counts are compared. Typically, continuity corrections are applied.
Calculation: We use the notation for cells of a contingency table, as in Table 5.1.
The test statistic is
χ²obs = ( (h11 h22 − h12 h21)² · (h11 + h12 + h21 + h22) ) / ( (h11 + h12)(h21 + h22)(h12 + h22)(h11 + h21) )
and, under the null hypothesis that the proportions are the same, it is χ²-distributed with one degree of freedom.
Example 5.1. R-Code 5.4 shows the results for the pre-eclampsia data, once using a proportion test and once using a chi-squared test (comparing expected and observed frequencies). ♣
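A sketch of these two calls (the 2 × 2 counts are constructed from the totals given earlier in this chapter):
tab <- matrix(c(138, 1370 - 138, 175, 1336 - 175), nrow = 2, byrow = TRUE,
              dimnames = list(c("treatment", "control"), c("positive", "negative")))
prop.test(tab)    # test of equal proportions
chisq.test(tab)   # chi-squared test on the 2 x 2 table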
Remark 5.2. We have presented the rows of Table 5.1 in terms of two binomials, i.e., with
two fixed marginals. In certain situations, such a table can be seen from a hypergeometric
distribution point of view (see help( dhyper)), where three margins are fixed. For this latter
view, fisher.test is the test of choice.
It is natural to extend the 2 × 2-tables to general two-way tables or to include covariates etc.
Several concepts discussed here may still apply but need to be extended. Often, at the very end,
a test based on a chi-squared distributed test statistic is used. ♣
The goal of this section is to introduce formal approaches for a comparison of two proportions p1 and p2. This can be accomplished using (i) the difference p1 − p2, (ii) the quotient p1/p2, or (iii) the odds ratio (p1/(1 − p1)) / (p2/(1 − p2)), which we consider in the following three sections.
The relative risk assumes positive values. A value of 1 means that the risk is the same in both groups and there is no evidence of an association between the diagnosis/disease/event and the risk factor. A value greater than one is evidence of a possible positive association between a risk factor and a diagnosis/disease. If the relative risk is less than one, the exposure has a protective effect, as is the case, for example, for vaccinations.
implies positive confidence boundaries. Note that with the back-transformation we lose the 'symmetry' of estimate plus/minus standard error.
Example 5.2. The relative risk and corresponding confidence interval for the pre-eclampsia
data are given in R-Code 5.5. The relative risk is smaller than one (diuretics reduce the risk).
An approximate 95% confidence interval does not include one. ♣
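A sketch of such a calculation on the log scale (the standard error formula √(1/h11 − 1/n1 + 1/h21 − 1/n2) is the common large-sample approximation; the original R-Code 5.5 may be organized differently):
h11 <- 138; n1 <- 1370; h21 <- 175; n2 <- 1336
rr <- (h11 / n1) / (h21 / n2)
se <- sqrt(1/h11 - 1/n1 + 1/h21 - 1/n2)
c(rr = rr, exp(log(rr) + c(-1, 1) * qnorm(0.975) * se))   # estimate with approximate 95% CI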
with A and B the positive diagnosis with and without risk factors. The odds ratio indicates the
strength of an association between factors (association measure). The calculation of the odds
ratio also makes sense when the number of diseased is determined by study design, as is the case
for case-control studies.
When a disease is rare (very low probability of disease), the odds ratio and relative risk are
approximately equal.
An estimate of the odds ratio is
ÔR = (h11/h12) / (h21/h22) = h11 h22 / (h12 h21).   (5.26)
The construction of confidence intervals for the odds ratio is based on Equation (2.39) and
Equation (2.40), analogous to that of the relative risk.
Example 5.3. The odds ratio with confidence interval for the pre-eclampsia data is given in R-Code 5.6. The 95% confidence interval is again similar to the one calculated for the relative risk and also does not include one, strengthening the claim (i.e., a significant result).
Notice that the function fisher.test (see Remark 5.2) also calculates an odds ratio. As it is based on a likelihood calculation, there are minor differences between the two estimates. ♣
R-Code 5.6 Odds ratio with confidence interval, approximate and exact.
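A sketch of the approximate and exact calculations (the log-scale standard error √(1/h11 + 1/h12 + 1/h21 + 1/h22) is the usual large-sample approximation):
h11 <- 138; h12 <- 1370 - 138; h21 <- 175; h22 <- 1336 - 175
or <- (h11 * h22) / (h12 * h21)
se <- sqrt(1/h11 + 1/h12 + 1/h21 + 1/h22)
exp(log(or) + c(-1, 1) * qnorm(0.975) * se)                    # approximate 95% CI
fisher.test(matrix(c(h11, h12, h21, h22), 2, byrow = TRUE))    # exact version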
i) Derive the test statistic of the test of proportions (without continuity correction).
Problem 5.2 (Binomial distribution) Suppose that among n = 95 Swiss males, eight are red-green colour blind. We are interested in estimating the proportion p of people suffering from this condition in the male population.
ii) Calculate the maximum likelihood estimate (ML) p̂ML and the ML of the odds ω̂.
iii) Using the central limit theorem (CLT), it can be shown that p̂ approximately follows N(p, p(1 − p)/n). Compare the binomial distribution to the normal approximation for different n and p. To do so, plot the exact cumulative distribution function (CDF) and compare it with the CDF obtained from the CLT. For which values of n and p is the approximation reasonable? Is the approximation reasonable for the red-green colour blindness data?
iv) Use the R functions binom.test() and prop.test() to compute two-sided 95%-confidence
intervals for the exact and for the approximate proportion. Compare the results.
vi) Compute the Wilson 95%-confidence interval and compare it to the confidence intervals
from (d).
Problem 5.3 (A simple clinical trial) A clinical trial is performed to compare two treatments,
A and B, that are intended to treat a skin disease named psoriasis. The outcome shown in the
following table is whether the patient’s skin cleared within 16 weeks of the start of treatment.
Treatment A Treatment B
Cleared 9 5
Not cleared 18 22
i) Compute for each of the two treatments a Wald type and a Wilson confidence interval for
the proportion of patients whose skin cleared.
ii) Test whether the risk difference is significantly different to zero (i.e., RD = 0). Use both
an exact and an approximated approach.
iii) Compute CIs for both, relative risk (RR) and odds ratio (OR).
Chapter 6
Rank-Based Methods
Until now we have often assumed that we have a realization of a Gaussian random sample.
In this chapter, we discuss basic approaches to estimation and testing for cases in which this is
not the case. This includes the presence of outliers, or that the data might not be measured on
the interval/ratio scale, etc.
If, hypothetically, we set an arbitrary value xi to an infinitely large value (i.e., we create an
extreme outlier), these estimates above also “explode”. A single value may exert enough influence
on the estimate such that the estimate is not representative of the bulk of the data.
Robust estimators are not sensitive to one or possibly several outliers. They often do not require specific distributional assumptions on the random sample.
The mean is therefore not a robust estimate of location. A robust estimate of location is the
trimmed mean, in which the biggest and smallest values are trimmed away and not considered.
The (empirical) median (the middle value of an odd amount of data or the center of the two
middle-most values of an even amount of data) is another robust estimate of location.
Robust estimates of the dispersion (data spread) are (1) the (empirical) interquartile range (IQR), calculated as the difference between the third and first quartiles, and (2) the (empirical) median absolute deviation (MAD), calculated as
MAD = c · median_i( |xi − median_j(xj)| ),
where most software programs (including R) use c = 1.4826. The choice of c is such that for normally distributed random variables we have an unbiased estimator, i.e., E(MAD) = σ for the MAD seen as an estimator. Since for normally distributed random variables IQR = 2Φ⁻¹(3/4)·σ ≈ 1.349σ, IQR/1.349 is an estimator of σ, for the IQR seen as an estimator.
Example 6.1. Let the values 1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 2.7 be given. Unfortunately, we have
entered the final number as 27. R-Code 6.1 compares several statistics (for location and scale)
and illustrates the effect of a single outlier on the estimates. ♣
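A sketch of such a comparison (the original R-Code 6.1 may present the output differently):
x    <- c(1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 2.7)
xerr <- c(1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 27)    # last value mistyped as 27
rbind(correct  = c(mean(x), median(x), sd(x), IQR(x), mad(x)),        # columns: mean, median,
      mistyped = c(mean(xerr), median(xerr), sd(xerr), IQR(xerr), mad(xerr)))  # sd, IQR, mad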
The estimators of the trimmed mean or the median do not possess simple distribution functions and for this reason the corresponding confidence intervals are not easy to calculate. If we assume that the distribution of a robust estimator is approximately Gaussian (for large samples), we can calculate approximate confidence intervals based on
robust estimator ± z_{α/2} · √( Var̂(robust estimator) / n ),   (6.3)
which is of course equivalent to θ̂ ± z_{α/2} SE(θ̂) for a robust estimator θ̂. Note that we have
b Note that we have
deliberately put a hat on the variance term in (6.3) as the variance often needs to be estimated
as well (which is reflected in a precise definition of the standard error). For example, the R
expression median( x)+c(-2,2)*mad( x)/sqrt(length( x)) yields an approximate empirical
95% confidence interval for the median.
A second disadvantage of robust estimators is their lower efficiency, i.e., these estimators
have larger variances. Formally, the efficiency is the ratio of the variance of one estimator to the
variance of the second estimator.
In some cases the exact variance of robust estimators can be determined; often only approximations or asymptotic results exist. For a continuous random variable with cdf F(x), the median is asymptotically normally distributed around the true median η = Q(1/2) = F⁻¹(1/2) with variance (4n f(η)²)⁻¹, where f(x) is the density function. The following example illustrates this result and R-Code 6.2 compares the efficiency of two estimators based on repeated sampling.
Example 6.2. Let X1, . . . , X10 ~iid N(0, σ²). We simulate realizations of this random sample and calculate the empirical mean and median of the sample. We repeat this R = 1000 times. Figure 6.1 shows the histogram of the means and medians including a (smoothed) empirical density. The histogram and empirical density of the median are wider and thus the mean is more efficient. For this particular example, the empirical efficiency is roughly 72%. Because the density is symmetric, η = µ = 0 and thus the asymptotic efficiency is
(σ²/n) / ( 1/(4n f(0)²) ) = (σ²/n) · 4n · (1/(√(2π) σ))² = 2/π ≈ 64%.   (6.4)
Of course, if we change the distribution of X1, . . . , X10, the efficiency changes. For example, let us consider the case of a t-distribution with 4 degrees of freedom, a density with heavier tails than the normal. Now the empirical efficiency for sample size n = 10 is 1.26, which means that the median performs better than the mean. ♣
R-Code 6.2 Distribution of empirical mean and median, see Example 6.2. (See Figure 6.1.)
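A sketch of the simulation (the original R-Code 6.2 additionally draws the histograms and densities of Figure 6.1):
set.seed(14)
R <- 1000; n <- 10
sam <- matrix(rnorm(R * n), nrow = R)
means   <- apply(sam, 1, mean)
medians <- apply(sam, 1, median)
var(means) / var(medians)     # empirical efficiency, roughly 0.7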
Robust estimates have the advantage that outliers do not have to be identified and eliminated before estimation.
The decision as to whether a realization of a random sample contains outliers is not always easy and some care is needed. For example, for all distributions with values in R, observations will lie outside the whiskers of a box plot when n is sufficiently large and are thus "marked" as outliers. Obvious outliers are easy to identify and eliminate, but in less clear cases robust estimation methods are preferred.
Figure 6.1: Comparing the efficiency of the mean and median. Medians in yellow with
a red smoothed empirical density, means in black. (See R-Code 6.2.)
Outliers can be very difficult to recognize in multivariate random samples, because they are
not readily apparent with respect to the marginal distributions. Robust methods for random
vectors exist, but are often computationally intensive and not as intuitive as for scalar values.
It has to be added that independent of the estimation procedures, if an EDA finds outliers,
these should be noted and scrutinized.
In Chapter 4 we considered tests to compare means. These tests assume normally distributed
data for exact results. Slight deviations from a normal distribution typically have negligible
consequences, as the central limit theorem reassures us that the mean is approximately normal.
However, if outliers are present, or the data are skewed, or the data are measured on the ordinal
scale, the use of so-called ‘rank-based’ tests is recommended. Classical tests typically assume a
distribution that is parametrized (e.g., µ, σ² in N(µ, σ²) or p in Bin(n, p)). Rank-based tests do
not prescribe a detailed distribution and are thus also called non-parametric tests.
The rank of a value in a sequence is the position (order) of that value in the ordered sequence
(from smallest to largest). In particular, the smallest value has rank 1 and the largest rank n.
In the case of ties, the arithmetic mean of the ranks is used.
Example 6.3. The values 1.1, −0.6, 0.3, 0.1, 0.6, 2.1 have ranks 5, 1, 3, 2, 4 and 6. However,
the ranks of the absolute values are 5, (3+4)/2, 2, 1, (3+4)/2 and 6. ♣
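In R, ranks are computed with rank(), which by default averages tied ranks; for the values of Example 6.3:

x <- c(1.1, -0.6, 0.3, 0.1, 0.6, 2.1)
rank(x)        # 5 1 3 2 4 6
rank(abs(x))   # 5 3.5 2 1 3.5 6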
Rank-based tests only consider the ranks of the observations or of the differences, not the
observation value itself or the magnitude of the differences between the observations. The largest
value always has the same rank and therefore always has the same influence on the test statistic.
We now introduce two classical rank tests (i) the Mann–Whitney U test (Wilcoxon–Mann–
Whitney U test) and (ii) the Wilcoxon test, i.e., rank-based versions of Test 2 and Test 3
respectively.
To motivate this test, assume that we have two samples with equal sample sizes available. The
idea is that if both samples come from one common underlying density, the observations mingle nicely and hence
the ranks are comparable. Alternatively, if the first sample has a much smaller median
(or mean), then the ranks of the first sample would be smaller than those of the sample with the
larger median (or mean).
When using rank tests, the symmetry assumption is dropped and we test whether the samples
come from the same distribution; under the alternative, the two distributions have the same shape but are shifted.
The Wilcoxon–Mann–Whitney test can thus be interpreted as comparing the medians of the
two populations, see Test 7.
The quantile, density and distribution functions of the test statistic U are implemented in
R with [q,d,p]wilcox. For example, the critical value Ucrit (nx , ny ; α/2) mentioned in Test 7 is
qwilcox( .025, nx, ny) for α = 5% and corresponding p-value 2*pwilcox( Uobs, nx, ny).
Decision: Reject H0 : “medians are the same” if Uobs < Ucrit (nx , ny ; α/2), where
Ucrit is the critical value.
It is possible to approximate the distribution of the test statistic by a Gaussian one. The U
statistic value is then transformed by
\[
z_{\text{obs}} = \frac{U_{\text{obs}} - \dfrac{n_x n_y}{2}}{\sqrt{\dfrac{n_x n_y (n_x + n_y + 1)}{12}}}, \tag{6.5}
\]
where nx ny /2 is the mean of U and the denominator is the standard deviation. This value is
then compared with the respective quantile of the standard normal distribution. The normal
approximation may be used with sufficiently large samples, nx ≥ 2 and ny ≥ 8. With additional
continuity corrections, the approximation may be improved.
To construct confidence intervals, the argument conf.int=TRUE must be used in the function
wilcox.test, and conf.level needs to be specified unless α = 5% is desired. The numerical
values of the confidence interval are accessed with the list element $conf.int.
In case of ties, R may not be capable of calculating exact p-values and thus will issue a warning.
The warning can be avoided by not requiring exact p-values, through the argument
exact=FALSE.
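As a brief illustration (with two hypothetical samples x and y, not data from the script):

set.seed(2)
x <- rnorm(12, mean = 1); y <- rnorm(12)
wt <- wilcox.test(x, y, conf.int = TRUE, conf.level = 0.95, exact = FALSE)
wt$conf.int    # confidence interval for the location shift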
The quantile, density and distribution functions of the Wilcoxon signed rank test statistic
are implemented in R with [q,d,p]signrank. For example, the critical value Wcrit (n; α/2)
mentioned in Test 8 is qsignrank( .025, n) for α = 5% and the corresponding p-value is
2*psignrank( Wobs, n).
A normal approximation analogous to (6.5) transforms the test statistic to
\[
z_{\text{obs}} = \frac{W_{\text{obs}} - \dfrac{n^\star(n^\star+1)}{4}}{\sqrt{\dfrac{n^\star(n^\star+1)(2n^\star+1)}{24}}}, \tag{6.6}
\]
and then $z_{\text{obs}}$ is compared with the corresponding quantile of the standard normal distribution.
This approximation may be used when the sample is sufficiently large; as a rule of thumb,
$n^\star \geq 20$.
Example 6.4. We consider again the podo data as introduced in Example 4.1. R-Code 6.3
performs various rank tests (comparing the expected median with a theoretical value, comparing
the median of two samples, comparing the medians of two paired samples). As expected, the
p-values are similar to those obtained with “classical” t-tests in Chapter 4. Because ties may
exist, we use the argument exact=FALSE to avoid the warnings that arise when exact p-values are requested.
The advantage of robust methods becomes clear when the first value is changed from 3.75
to 37.5, as shown towards the end of the same R-Code. While the p-value of the signed rank
test hardly changes, the one from the paired two-sample t-test changes from 0.5 (see
R-Code 4.5) to 0.31. More importantly, the confidence intervals are now considerably different as
the outlier inflated the estimated standard deviation of the t-test. In other situations, it is quite
likely that with or without a particular “outlier” the p-value falls below the magical threshold α
(recall the discussion of Section 4.5.3).
Of course, a corrupt value as introduced in this example would be detected with a proper
EDA of the data (scales are within zero and ten). ♣
R-Code 6.3: Rank tests and comparison of paired tests with a corrupted observation.
# Possibly reload the 'podo.csv' data and construct the variables as in Example 4.1
wilcox.test( PDHmean, mu=3.333, exact=FALSE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: PDHmean
## V = 133, p-value = 0.008
## alternative hypothesis: true location is not equal to 3.333
We conclude the chapter with two additional tests, one relying on very few statistical assumptions
and one being more of a toolbox for constructing arbitrary tests.
(3) If the medians of both samples are the same, k is a realization from a
binomial distribution $\operatorname{Bin}(n^\star, p = 0.5)$.
Decision: Reject $H_0$: p = 0.5 (i.e., the medians are the same) if $k > b_{\text{crit}} = b(n^\star, 0.5, 1-\alpha/2)$
or $k < b_{\text{crit}} = b(n^\star, 0.5, \alpha/2)$.
Calculation in R:
binom.test( sum( d>0), sum( d!=0), conf.level=1-alpha)
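For illustration, a minimal sketch with hypothetical paired differences d and α = 5%:

d <- c(0.3, -0.1, 0.8, 0.4, 0.0, 1.2, 0.5, -0.2)   # hypothetical differences
alpha <- 0.05
binom.test(sum(d > 0), sum(d != 0), conf.level = 1 - alpha)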
Permutation tests can be used to answer many different questions. They are based on the idea
that, under the null hypothesis, the samples being compared are the same. In this case, the
result of the test would not change if the values of both samples were to be randomly reassigned
to either group. In R-Code 6.5 we show an example of a permutation test for comparing the
means of two independent samples.
require(coin)
oneway_test( PDHmean ~ as.factor(Visit), data=podo)
##
## Asymptotic Two-Sample Fisher-Pitman Permutation Test
##
## data: PDHmean by as.factor(Visit) (1, 13)
## Z = -0.707, p-value = 0.48
## alternative hypothesis: true mu is not equal to 0
Assumptions: The null hypothesis is formulated, such that the groups, under H0 ,
are exchangeable.
Calculation: (1) Calculate the difference dobs in the means of the two groups to
be compared (m observations in group 1, n observations in group 2).
(2) Form a random permutation of the values of both groups by randomly
allocating the observed values to the two groups (m observations in
group 1, n observations in group 2). There are $\binom{m+n}{n}$ possibilities.
Permutation tests are straightforward to implement manually and thus are often used in
settings where the distribution of the test statistic is complex or even unknown.
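A minimal sketch of such a manual permutation test for a difference in means (hypothetical samples x and y; the median or any other statistic could be used instead):

set.seed(3)
x <- rnorm(10, mean = 1); y <- rnorm(12)
d_obs <- mean(x) - mean(y)              # observed difference in means
pooled <- c(x, y); m <- length(x)
d_sim <- replicate(1000, {
  perm <- sample(pooled)                # randomly reassign all values to the two groups
  mean(perm[1:m]) - mean(perm[-(1:m)])
})
mean(abs(d_sim) >= abs(d_obs))          # two-sided permutation p-value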
Note that the package exactRankTests will no longer be further developed. Therefore, use of
the function perm.test is discouraged. Functions within the package coin can be used instead
and this package includes extensions to other rank-based tests.
Problem 6.2 (Rank and permutation tests) Download the water_transfer.csv data from the
course web page and read it into R with read.csv(). The data describes tritiated water diffusion
across human chorioamnion and were taken from Hollander & Wolfe (1999), Nonparametric
Statistical Methods, Table 4.1, page 110. The pd values for age = "At term" and age =
"12-26 Weeks" are denoted with yA and yB , respectively. We will statistically determine whether
the yA values are “different” from yB values or not. That means we test whether there is a shift
in the distribution of the second group compared to the first.
i) Use a Wilcoxon-Mann-Whitney test to test for a shift in the groups. Interpret the results.
ii) Now, use a permutation test as implemented by the function wilcox_test() from R pack-
age coin to test for a potential shift. Compare to (a).
iii) Under the null hypothesis, we are allowed to permute the observations (all y-values) while
keeping the group assignments fix. Keeping this in mind, we will now manually construct
a permutation test to detect a potential shift. Write an R function perm_test() that
implements a two-sample permutation test and returns the p-value. Your function should
execute the following steps.
• Compute the test statistic $t_{\text{obs}} = \tilde{y}_A - \tilde{y}_B$, where $\tilde{\cdot}$ denotes the empirical median.
• Then repeat many (n = 1000) times
– Randomly assign all the values of pd to two groups xA and xB of the same size
as yA and yB .
– Store the test statistic $t_{\text{sim}} = \tilde{x}_A - \tilde{x}_B$.
• Return the two-sided p-value, i.e., the number of permuted test statistics $t_{\text{sim}}$ that
are smaller than or equal to $-|t_{\text{obs}}|$, or larger than or equal to $|t_{\text{obs}}|$, divided by the total
number of permutations (in our case n = 1000).
Chapter 7

Multivariate Normal Distribution
Describe a random vector, cdf, pdf of a random vector and its properties
Give the definition and intuition of E, Var and Cov for a random vector
Explain the relationship between the eigenvalues and eigenvectors of the covariance
matrix and the shape of the density function.
In Chapter 2 we have introduced univariate random variables. We now extend the framework
to random vectors (i.e., multivariate random variables). In the framework of this document, we
can only cover a tiny part of the beautiful theory and thus we will mainly focus on continuous
random vectors, especially Gaussian random vectors. We are pragmatic and discuss what will
be needed in the sequel.
Definition 7.1. The multivariate (or multidimensional) distribution function of a random vector
$\mathbf{X}$ is defined as
\[
F_{\mathbf{X}}(\mathbf{x}) = P(\mathbf{X} \le \mathbf{x}) = P(X_1 \le x_1, \dots, X_p \le x_p). \tag{7.1}
\]
The multivariate distribution function generally contains more information than the set of
marginal distribution functions $P(X_i \le x_i)$, because (7.1) only simplifies to
$F_{\mathbf{X}}(\mathbf{x}) = \prod_{i=1}^{p} P(X_i \le x_i)$
under independence of all random variables $X_i$ (compare to Equation (2.22)).
Definition 7.2. The probability density function (or density function, pdf) $f_{\mathbf{X}}(\mathbf{x})$ of a p-dimensional
continuous random vector $\mathbf{X}$ is defined by
\[
P(\mathbf{X} \in A) = \int_A f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x}, \qquad \text{for all } A \subset \mathbb{R}^p. \tag{7.2}
\]
For convenience, we summarize here a few facts of random vectors with two continuous
components, i.e., for a bivariate random vector (X, Y )> . The univariate counterparts are stated
in Properties 2.1 and 2.3.
• $f_{X,Y}(x, y) = \dfrac{\partial^2}{\partial x\,\partial y}\, F_{X,Y}(x, y)$.
• $P(a < X \le b,\; c < Y \le d) = \displaystyle\int_a^b\!\!\int_c^d f_{X,Y}(x, y)\,dy\,dx
  = F_{X,Y}(b, d) - F_{X,Y}(b, c) - F_{X,Y}(a, d) + F_{X,Y}(a, c)$.
In the multivariate setting there is also the concept termed marginalization, i.e., reducing a
higher-dimensional random vector to a lower-dimensional one. Intuitively, we “neglect” components
of the random vector by allowing them to take any value. In two dimensions, we have, for example,
$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$.
Hence the expectation of a random vector is simply the vector of the individual expectations.
Of course, to calculate these, we only need the marginal univariate densities fXi (x) and thus the
expectation does not change whether (7.1) can be factored or not. The expectation of products
of random variables is defined as
\[
\operatorname{E}(X_1 X_2) = \int\!\!\int x_1 x_2\, f(x_1, x_2)\,dx_1\,dx_2 \tag{7.4}
\]
(for continuous random variables). The variance of a random vector requires a bit more thought
and we first need the following.
Definition 7.4. The covariance between two arbitrary random variables X1 and X2 is defined
as
\[
\operatorname{Cov}(X_1, X_2) = \operatorname{E}\bigl((X_1 - \operatorname{E}(X_1))(X_2 - \operatorname{E}(X_2))\bigr) = \operatorname{E}(X_1 X_2) - \operatorname{E}(X_1)\operatorname{E}(X_2). \tag{7.5}
\]
Using the linearity properties of the expectation operator, it is possible to show the following
handy properties.
i) Cov(X1 , X2 ) = Cov(X2 , X1 ),
The covariance describes the linear relationship between the random variables. The correla-
tion between two random variables X1 and X2 is defined as
\[
\operatorname{Corr}(X_1, X_2) = \frac{\operatorname{Cov}(X_1, X_2)}{\sqrt{\operatorname{Var}(X_1)\operatorname{Var}(X_2)}} \tag{7.6}
\]
and corresponds to the normalized covariance. It holds that $-1 \le \operatorname{Corr}(X_1, X_2) \le 1$, with
equality only in the degenerate case $X_2 = a + bX_1$ for some $a$ and $b \neq 0$.
Definition 7.5. The variance of a p-variate random vector $\mathbf{X} = (X_1, \dots, X_p)^\top$ is defined as
$\operatorname{Var}(\mathbf{X}) = \operatorname{E}\bigl((\mathbf{X} - \operatorname{E}(\mathbf{X}))(\mathbf{X} - \operatorname{E}(\mathbf{X}))^\top\bigr)$,
the $p \times p$ matrix of all pairwise covariances, also called the covariance matrix.
The covariance matrix is a symmetric matrix and – except for degenerate cases – a positive
definite matrix. We will not consider degenerate cases and thus we can assume that the inverse
of the matrix Var(X) exists and is called the precision.
Similar to Properties 2.5, we have the following properties for random vectors.
Property 7.2. For an arbitrary p-variate random vector $\mathbf{X}$, (fixed) vector $\mathbf{a} \in \mathbb{R}^q$ and matrix
$B \in \mathbb{R}^{q \times p}$ it holds:
$\operatorname{E}(\mathbf{a} + B\mathbf{X}) = \mathbf{a} + B\operatorname{E}(\mathbf{X})$ and
$\operatorname{Var}(\mathbf{a} + B\mathbf{X}) = B\operatorname{Var}(\mathbf{X})B^\top$.
Definition 7.6. The random variable pair (X, Y ) has a bivariate normal distribution if
\[
F_{X,Y}(x, y) = \int_{-\infty}^{x}\!\int_{-\infty}^{y} f_{X,Y}(u, v)\,dv\,du \tag{7.9}
\]
with density
\[
f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}
\exp\biggl(-\frac{1}{2(1-\rho^2)}\Bigl(\frac{(x-\mu_x)^2}{\sigma_x^2}
- 2\rho\,\frac{(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}
+ \frac{(y-\mu_y)^2}{\sigma_y^2}\Bigr)\biggr) \tag{7.10}
\]
for all x and y and where µx ∈ R, µy ∈ R, σx > 0, σy > 0 and −1 < ρ < 1. ♦
The role of some of the parameters µx , µy , σx , σy and ρ might be guessed. We will discuss
their precise meaning after the following example.
Example 7.1. R-Code 7.1 and Figure 7.1 show the density of a bivariate normal distribution
with $\mu_x = \mu_y = 0$, $\sigma_x = 1$, $\sigma_y = \sqrt{5}$, and $\rho = 2/\sqrt{5} \approx 0.9$. Because of the quadratic form
in (7.10), the contour lines (isolines) are ellipses.
Several R packages implement the bivariate/multivariate normal distribution. We recommend
the package mvtnorm. ♣
require( mvtnorm)
require( fields) # providing tim.colors() and image.plot()
Sigma <- array( c(1,2,2,5), c(2,2))
x <- y <- seq( -3, to=3, length=100)
grid <- expand.grid( x=x, y=y)
densgrid <- dmvnorm( grid, mean=c(0,0), sigma=Sigma)
density <- array( densgrid, c(100,100))
image.plot(x, y, density, col=tim.colors()) # left panel
Property 7.3. For the bivariate normal distribution we have: The marginal distributions are
X ∼ N (µx , σx2 ) and Y ∼ N (µy , σy2 ) and
\[
\operatorname{E}\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \qquad
\operatorname{Var}\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}. \tag{7.11}
\]
Thus, the parameter ρ is the correlation between X and Y, and in the bivariate normal case ρ = 0 implies independence of X and Y.
[Figure 7.1: density (left) and cumulative distribution function (right) of the bivariate normal distribution of Example 7.1, shown over x and y; see R-Code 7.1.]
Note, however, that the equivalence of independence and uncorrelatedness is specific to jointly
normal variables and cannot be assumed for random variables that are not jointly normal.
Example 7.2. R-Code 7.2 and Figure 7.2 show realizations from a bivariate normal distribution
for various values of the correlation ρ. Even for the large sample size shown here (n = 500), correlations
between −0.25 and 0.25 are barely perceptible. ♣
R-Code 7.2 Realizations from a bivariate normal distribution for various values of ρ,
termed binorm (See Figure 7.2.)
set.seed(12)
rho <- c(-.25, 0, .1, .25, .75, .9)
for (i in 1:6) {
Sigma <- array( c(1, rho[i], rho[i], 1), c(2,2))
sample <- rmvnorm( 500, sigma=Sigma)
plot(sample, pch='.', xlab='', ylab='')
legend( "topleft", legend=bquote(rho==.(rho[i])), bty='n')
}
Figure 7.2: Realizations from a bivariate normal distribution. (See R-Code 7.2.)
Definition 7.7. The random vector X = (X1 , . . . , Xp )> is multivariate normally distributed if
\[
F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_1}\!\!\cdots\!\int_{-\infty}^{x_p} f_{\mathbf{X}}(x_1, \dots, x_p)\,dx_1 \dots dx_p \tag{7.13}
\]
with density
\[
f_{\mathbf{X}}(x_1, \dots, x_p) = f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\det(\Sigma)^{1/2}}
\exp\Bigl(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\Bigr) \tag{7.14}
\]
for all x ∈ Rp (with µ ∈ Rp and symmetric, positive-definite Σ). We denote this distribution
with X ∼ Np (µ, Σ). ♦
Property 7.4. For the multivariate normal distribution we have:
\[
\mathbf{a} + B\mathbf{X} \sim \mathcal{N}_q\bigl(\mathbf{a} + B\boldsymbol{\mu},\, B\Sigma B^\top\bigr). \tag{7.16}
\]
This last property has profound consequences. It also asserts that the one-dimensional
marginal distributions are again Gaussian with $X_i \sim \mathcal{N}\bigl((\boldsymbol{\mu})_i, (\Sigma)_{ii}\bigr)$, $i = 1, \dots, p$. Similarly,
any subset and any (non-degenerate) linear combination of random variables of X is again Gaus-
sian with appropriate subset selection of the mean and covariance matrix.
We now discuss how to draw realizations from an arbitrary Gaussian random vector, much
in the spirit of Property 2.7.ii). Let $I \in \mathbb{R}^{p\times p}$ be the identity matrix, a square matrix which has
only ones on the main diagonal and zeros elsewhere, and let $L \in \mathbb{R}^{p\times p}$ be such that $LL^\top = \Sigma$.
That means, L is like a “matrix square root” of Σ.
To draw a realization $\mathbf{x}$ from a p-variate random vector $\mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \Sigma)$, one starts by
drawing p values from $Z_1, \dots, Z_p \overset{\text{iid}}{\sim} \mathcal{N}(0, 1)$ and sets $\mathbf{z} = (z_1, \dots, z_p)^\top$. The vector is then
(linearly) transformed with $\boldsymbol{\mu} + L\mathbf{z}$. Since $\mathbf{Z} \sim \mathcal{N}_p(\mathbf{0}, I)$, Property 7.5 asserts that $\mathbf{X} = \boldsymbol{\mu} + L\mathbf{Z} \sim \mathcal{N}_p(\boldsymbol{\mu}, LL^\top)$.
In practice, the Cholesky decomposition of Σ is often used. This decomposes a symmetric
positive-definite matrix into the product of a lower triangular matrix L and its transpose. It
holds that $\det(\Sigma) = \det(L)^2 = \prod_{i=1}^{p} (L)_{ii}^2$.
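A minimal sketch of this construction (using the mean and covariance matrix that also appear in R-Code 7.3):

mu <- c(2, 1)
Sigma <- matrix(c(4, 2, 2, 2), 2)
L <- t(chol(Sigma))        # chol() returns the upper triangular factor, hence the transpose
z <- rnorm(length(mu))     # iid standard normal draws
x <- drop(mu + L %*% z)    # one realization of X ~ N_2(mu, Sigma)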
Property 7.6. If one conditions a multivariate normally distributed random vector (7.18) on a
sub-vector, the result is itself multivariate normally distributed: writing $\mathbf{X} = (\mathbf{X}_1^\top, \mathbf{X}_2^\top)^\top$ with
corresponding partitions of $\boldsymbol{\mu}$ and $\Sigma$,
\[
\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \;\sim\; \mathcal{N}\bigl(\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\;
\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\bigr). \tag{7.19}
\]
Equation (7.19) is probably one of the most important formulas you will encounter in statistics,
albeit not always explicitly. It is illustrated in Figure 7.3 for the case of p = 2.
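As a small numerical sketch of (7.19) in the bivariate case (µ and Σ as in R-Code 7.3; the conditioning value x2 is arbitrary):

mu <- c(2, 1); Sigma <- matrix(c(4, 2, 2, 2), 2)
x2 <- 0
mu[1] + Sigma[1, 2] / Sigma[2, 2] * (x2 - mu[2])   # conditional mean of X1 given X2 = x2
Sigma[1, 1] - Sigma[1, 2]^2 / Sigma[2, 2]          # conditional variance of X1 given X2 = x2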
Figure 7.3: Graphical illustration of the conditional distribution of a bivariate normal
random vector. Blue: bivariate density with isolines indicating quartiles; cyan: marginal
densities; red: conditional densities. The respective means are indicated in green. The
heights of the univariate densities are exaggerated by a factor of five.
Remark 7.1. Actually, it is possible to show that these former two estimators in (7.20) are
unbiased estimators of E(X) and Var(X) for an arbitrary sample of random vectors. ♣
Example 7.3. Similarly to R-Code 7.2, we generate bivariate realizations with different sample
sizes (n = 10, 50, 100, 500). We estimate the mean vector and covariance matrix according
to (7.21); from these we can calculate the corresponding isolines of the bivariate normal density
(with plug-in estimates for µ and Σ). Figure 7.4 (based on R-Code 7.3) shows the esti-
mated 95% and 50% confidence regions (isolines). As n increases, the estimation improves, i.e.,
the estimated ellipses are closer to the ellipses based on the true (unknown) parameters. ♣
R-Code 7.3 Bivariate normally distributed random numbers for various sample sizes with
contour lines of the density and estimated moments. (See Figure 7.4.)
set.seed( 14)
require( ellipse)
n <- c( 10, 50, 100, 500)
mu <- c(2, 1) # theoretical mean
Sigma <- matrix( c(4, 2, 2, 2), 2) # and covariance matrix
cov2cor( Sigma)[2] # equal to sqrt(2)/2
## [1] 0.70711
for (i in 1:4) {
plot(ellipse( Sigma, cent=mu, level=.95), col='gray',
xaxs='i', yaxs='i', xlim=c(-4, 8), ylim=c(-4, 6), type='l')
lines( ellipse( Sigma, cent=mu, level=.5), col='gray')
sample <- rmvnorm( n[i], mean=mu, sigma=Sigma)
points( sample, pch='.', cex=2)
Sigmahat <- cov( sample) # var( sample) # is identical
muhat <- colMeans( sample) # apply( sample, 2, mean) # is identical
lines( ellipse( Sigmahat, cent=muhat, level=.95), col=2, lwd=2)
lines( ellipse( Sigmahat, cent=muhat, level=.5), col=4, lwd=2)
points( rbind( muhat), col=3, cex=2)
text( -2, 4, paste('n =',n[i]))
}
muhat # Estimates for n=500
## [1] 2.02040 0.95478
Sigmahat
## [,1] [,2]
## [1,] 4.1921 2.0638
## [2,] 2.0638 2.0478
cov2cor( Sigmahat)[2]
## [1] 0.70437
Figure 7.4: Bivariate normally distributed random numbers. The contour lines of
the (theoretical) density are in gray, the isolines corresponding to the estimated 95% (50%)
probability in red (blue), and the empirical mean in green. (See R-Code 7.3.)
Hints:
• E(X> ) = E(X)> .
• E(E(X)) = E(X), because E(X) is a constant (non-stochastic) vector.
• For a given random vector $\mathbf{Y}$ and conformable matrices C and D (non-stochastic), it
holds that $\operatorname{E}(C\mathbf{Y}D) = C\,\operatorname{E}(\mathbf{Y})\,D$.
Problem 7.2 (Bivariate normal distribution) Consider the random sample $\mathbf{X}_1, \dots, \mathbf{X}_n \overset{\text{iid}}{\sim} \mathcal{N}_2(\boldsymbol{\mu}, \Sigma)$ with
\[
\mathbf{X}_i = \begin{pmatrix} X_{i,1} \\ X_{i,2} \end{pmatrix}, \qquad
\boldsymbol{\mu} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}.
\]
i) Explain in words that these estimators for µ and Σ “generalize” the univariate estimators
for µ and σ.
ii) Simulate n = 500 iid realizations from N2 (µ, Σ) using the function rmvnorm() from package
mvtnorm. Draw a scatter plot of the results and interpret the figure.
iii) Add contour lines of the density of X to the plot. Calculate an eigendecomposition of Σ
and place the two eigenvectors in the center of the ellipses.
iv) Estimate µ, Σ and the correlation between X1 and X2 from the 500 simulated values using
mean(), cov() and cor(), respectively.
v) Redo the simulation with several different covariance matrices, i.e., choose different values
as entries for the covariance matrices. What is the influence of the diagonal elements and
the off-diagonal elements of the covariance matrix on the shape of the scatter plot?
Chapter 8

A Closer Look: Correlation and Simple Regression

In this chapter (and the following three chapters) we consider “linear models.” A detailed
discussion of linear models would fill an entire lecture module, hence we consider here only the
most important elements.
The (simple) linear regression is commonly considered the archetypical task of statistics and
is often introduced as early as middle school.
In this chapter, we will (i) quantify the (linear) relationship between two variables, (ii) explain
one variable through another variable with the help of a “model”.
The goal of this section is to quantify the linear relationship between two random variables X
and Y with the help of n pairs of values $(x_1, y_1), \dots, (x_n, y_n)$, i.e., realizations of the two random
variables (X, Y).
An intuitive estimator of the correlation between the random variables X and Y (i.e., of the parameter ρ) is
\[
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2\,\sum_{i=1}^{n}(y_i - \bar{y})^2}}, \tag{8.1}
\]
which is also called the Pearson correlation coefficient. Just like the correlation, the Pearson
correlation coefficient also lies in the interval [−1, 1].
We will introduce a handy notation that is often used in the following:
\[
s_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \qquad
s_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad
s_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2. \tag{8.2}
\]
Hence, we can express (8.1) as $r = s_{xy}/\sqrt{s_{xx}s_{yy}}$. Further, an estimator for the covariance is
$s_{xy}/(n-1)$ (which would be an unbiased estimator).
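A small sketch (with hypothetical data) verifying this identity against cor():

set.seed(4)
x <- rnorm(20); y <- 0.5 * x + rnorm(20)
sxy <- sum((x - mean(x)) * (y - mean(y)))
sxx <- sum((x - mean(x))^2); syy <- sum((y - mean(y))^2)
c(manual = sxy / sqrt(sxx * syy), builtin = cor(x, y))   # identical values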
Alternatives to the Pearson correlation coefficient are so-called rank correlation coefficients, such
as Spearman’s ρ or Kendall’s τ, which are seen as non-parametric correlation estimates. In brief,
Spearman’s ρ is calculated similarly to (8.1), with the values replaced by their ranks.
Kendall’s τ compares the number of concordant pairs (if $x_i < x_j$ then $y_i < y_j$) and discordant pairs
(if $x_i < x_j$ then $y_i > y_j$).
R-Code 8.1 binorm data: Pearson correlation coefficient of the scatter plot from Figure 7.2.
Spearman’s ρ or Kendall’s τ are given as well.
require( mvtnorm)
set.seed(12)
rho <- array(0,c(4,6))
rownames(rho) <- c("rho","Pearson","Spearman","Kendall")
rho[1,] <- c(-.25, 0, .1, .25, .75, .9)
for (i in 1:6) {
Sigma <- array( c(1, rho[1,i], rho[1,i], 1), c(2,2))
sample <- rmvnorm( 500, sigma=Sigma)
rho[2,i] <- cor( sample)[2]
rho[3,i] <- cor( sample, method="spearman")[2]
rho[4,i] <- cor( sample, method="kendall")[2]
}
print( rho, digits=2)
## [,1] [,2] [,3] [,4] [,5] [,6]
## rho -0.25 0.000 0.10 0.25 0.75 0.90
## Pearson -0.22 0.048 0.22 0.28 0.78 0.91
## Spearman -0.22 0.066 0.18 0.26 0.77 0.91
## Kendall -0.15 0.045 0.12 0.18 0.57 0.74
Example 8.1. R-Code 8.1 estimates the correlation of the scatter plot data from Figure 7.2.
Although n = 500, in the case ρ = 0.1, the estimate is more than two times too large.
We also calculate Spearman’s ρ or Kendall’s τ for the same data. There is, of course, quite
some agreement between the estimates. ♣
Pearson’s correlation coefficient is not robust, while Spearman’s ρ and Kendall’s τ are “robust”;
see R-Code 8.2 and Figure 8.1, which compare the various correlation estimates of the so-called
anscombe data.
R-Code 8.2: anscombe data: visualization and correlation estimates. (See Figure 8.1.)
library( faraway)
data( anscombe)
head( anscombe, 3)
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
with( anscombe, { plot(x1, y1); plot(x2, y2); plot(x3, y3); plot(x4, y4) })
sel <- c(0:3*9+5) # extract diagonal entries of sub-block
print(rbind( pearson=cor(anscombe)[sel],
spearman=cor(anscombe, method='spearman')[sel],
kendall=cor(anscombe, method='kendall')[sel]), digits=2)
## [,1] [,2] [,3] [,4]
## pearson 0.82 0.82 0.82 0.82
## spearman 0.82 0.69 0.99 0.50
## kendall 0.64 0.56 0.96 0.43
Let us consider bivariate normally distributed random variables as discussed in the last chapter.
Naturally, r as given in (8.1) is an estimate of the correlation parameter ρ explicated in the
density (7.10). Let R be the corresponding estimator of ρ based on (8.1), i.e., replacing $(x_i, y_i)$
by $(X_i, Y_i)$. The random variable
\[
T = R\,\frac{\sqrt{n-2}}{\sqrt{1-R^2}} \tag{8.3}
\]
is, under H0 : ρ = 0, t-distributed with n − 2 degrees of freedom. The corresponding test is
described under Test 11.
In order to construct confidence intervals for correlation estimates, we typically need the
so-called Fisher transformation
\[
W(r) = \frac{1}{2}\log\Bigl(\frac{1+r}{1-r}\Bigr) = \operatorname{arctanh}(r) \tag{8.4}
\]
and the fact that, for bivariate normally distributed random variables, the distribution of W(R) is
approximately $\mathcal{N}\bigl(W(\rho), 1/(n-3)\bigr)$; a straightforward confidence interval can then be constructed.
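A minimal sketch of such an interval for hypothetical values r = 0.6 and n = 30:

r <- 0.6; n <- 30
W <- atanh(r)                                     # Fisher transformation (8.4)
ci_W <- W + c(-1, 1) * qnorm(0.975) / sqrt(n - 3)
tanh(ci_W)                                        # back-transformed 95% CI for rho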
Figure 8.1: anscombe data: the four cases all have the same Pearson correlation
coefficient of 0.82, yet the scatter plots show completely different relationships. (See
R-Code 8.2.)
Astoundingly large sample sizes are needed for correlation estimates around 0.25 to be sig-
nificant.
\[
\begin{aligned}
Y_i &= \mu_i + \varepsilon_i && (8.5)\\
    &= \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n, && (8.6)
\end{aligned}
\]
with
• $\beta_0$, $\beta_1$: parameters (unknown);
• $\varepsilon_i$: measurement error, error term, noise (unknown), with symmetric distribution around
$\operatorname{E}(\varepsilon_i) = 0$.
It is often also assumed that Var(εi ) = σ 2 and/or that the errors are independent of each other.
For simplicity, we assume $\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$ with unknown $\sigma^2$. Thus, $Y_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2)$,
$i = 1, \dots, n$, and $Y_i$ and $Y_j$ are independent for $i \neq j$.
Example 8.2. (hardness data) One of the steps in the manufacturing of metal springs is a
quenching bath. The temperature of the bath has an influence on the hardness of the springs.
Data is taken from Abraham and Ledolter (2006). Figure 8.2 shows the Rockwell hardness of
coil springs as a function of the temperature of the quenching bath, as well as the line of best fit.
The Rockwell scale measures the hardness of technical materials and is denoted by HR (Hardness
Rockwell). R-Code 8.3 shows how a simple linear regression for this data is performed with the
command lm and the ‘formula’ statement Hard~Temp. ♣
The main idea of regression is to determine estimates $\hat\beta_0$ and $\hat\beta_1$ that minimize the sum of squared errors.
That means that $\hat\beta_0$ and $\hat\beta_1$ are determined such that
\[
\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 \tag{8.7}
\]
R-Code 8.3 hardness data from Example 8.2, see Figure 8.2.
Figure 8.2: hardness data: hardness as a function of temperature (see R-Codes 8.3
and 8.4). The black line is the fitted regression line.
is minimized. This concept is also called the least squares method. The solution, i.e., the
estimated regression coefficients, are
\[
\widehat{\beta}_1 = \frac{s_{xy}}{s_{xx}} = r\,\sqrt{\frac{s_{yy}}{s_{xx}}}
= \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \tag{8.8}
\]
R-Code 8.4 hardness data from Example 8.2, see Figure 8.2.
where we have used a slightly different denominator than in Example 3.6.ii), to obtain an unbiased
estimate. The variance estimate $\hat\sigma^2$ is often termed mean squared error and its root, $\hat\sigma$, the
residual standard error.
Example 8.3. (hardness data Part 2) R-Code 8.4 illustrates how to access the estimates,
the fitted values and the residuals. An estimate of $\sigma^2$ can be obtained via resid( lm1) and
Equation (8.13) (e.g., sum(resid( lm1)^2)/12) or directly via summary(lm1)$sigma^2. ♣
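A small sketch computing the estimates of (8.8) “by hand” for simulated data and comparing them with lm() (hypothetical data, not the hardness data):

set.seed(5)
x <- 1:15; y <- 1 + 2 * x + rnorm(15, sd = 2)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope, Equation (8.8)
b0 <- mean(y) - b1 * mean(x)                                      # intercept
sigma2 <- sum((y - b0 - b1 * x)^2) / (length(x) - 2)              # mean squared error
c(b0, b1, sigma2)
coef(lm(y ~ x))                                                   # same b0 and b1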
In simple linear regression, the central task is often to determine whether there exists a linear
relationship between the dependent and independent variables. This can be tested with the
hypothesis H0: β1 = 0 (Test 12). We do not formally derive the test statistic here. The idea is
to replace in Equation (8.8) the observations $y_i$ with random variables $Y_i$ with distribution (8.6)
and derive the distribution of the resulting test statistic.
Comparing the expression of Pearson’s correlation coefficient (8.1) and the estimate of β1 (8.8),
it is not surprising that the p-values of Tests 12 and 11 coincide.
For prediction of the dependent variable at a given independent variable x0 , we plug in the
value x0 in Equation (8.11). The function predict can be used for prediction in R, as illustrated
in R-Code 8.5.
Prediction at a (potentially) new value x0 can also be written as
This equation is equivalent to Equation (7.19) but with estimates instead of (unknown) param-
eters.
The uncertainty of the prediction depends on the uncertainty of the estimated parameter.
Specifically:
\[
\operatorname{Var}(\hat\mu_0) = \operatorname{Var}(\hat\beta_0 + \hat\beta_1 x_0), \tag{8.15}
\]
which also depends on the variance of the error term. In general, however, βb0 and βb1 are not
independent and with matrix notation, the variance is easy to calculate, as will be illustrated in
Chapter 9.
To construct confidence intervals for a prediction, we must first discern whether the prediction
is for the mean response $\hat\mu_0$ or for an unobserved (e.g., future) observation $\hat y_0$ at $x_0$. The former
interval depends on the variability of the estimates $\hat\beta_0$ and $\hat\beta_1$. The latter interval depends on the
uncertainty of $\hat\mu_0$ and additionally on the variability of the error $\varepsilon$, i.e., $\hat\sigma^2$. Hence, the latter is
always wider than the former. In R, these two types are
requested — somewhat intuitively — with interval="confidence" and interval="prediction",
see R-Code 8.5. The confidence interval summary CI 7 gives the precise formulas. We will see
another, yet easier approach in Chapter 9.
R-Code 8.5 hardness data: predictions and pointwise confidence intervals. (See Fig-
ure 8.3.)
[Figure 8.3: hardness data with predictions and pointwise confidence and prediction intervals as a function of temperature; see R-Code 8.5.]
In both intervals we use estimators, i.e., all estimates $y_i$ are to be replaced with $Y_i$ to obtain estimators.
Hint: you might consider a parallel coordinate plot, as shown in Figure 1.11 in Chapter 1.
Problem 8.2 (Linear regression) In a simple linear regression, the data are assumed to follow
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$, $i = 1, \dots, n$. We simulate n = 15 data points from that
model with $\beta_0 = 1$, $\beta_1 = 2$, $\sigma = 2$ and the following values for $x_i$.
Hint: copy & paste the following lines into your R-Script.
## simulation of y values:
y <- beta0.true + x * beta1.true + rnorm(15, sd = 2)
data <- data.frame(x = x, y = y)
i) Plot the simulated data in a scatter plot. Calculate the Pearson correlation coefficient and
the Spearman’s rank correlation coefficient. Why do they agree well?
ii) Estimate the linear regression coefficients βb0 and βb1 using the formulas from the script.
Add the estimated regression line to the plot from (a).
iii) Calculate the fitted values Ybi for the data in x and add them to the plot from (a).
iv) Calculate the residuals $(y_i - \hat y_i)$ for all n points and the residual sum of squares
$SS = \sum_i (y_i - \hat y_i)^2$. Visualize the residuals by adding lines to the plot with segments(). Are
the residuals normally distributed? Do the residuals increase or decrease with the fitted
values?
v) Calculate standard errors for $\beta_0$ and $\beta_1$. For $\hat\sigma_\varepsilon = \sqrt{SS/(n-2)}$, they are given by
\[
\hat\sigma_{\beta_0} = \hat\sigma_\varepsilon\,\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}}, \qquad
\hat\sigma_{\beta_1} = \hat\sigma_\varepsilon\,\sqrt{\frac{1}{\sum_i (x_i - \bar{x})^2}}.
\]
vi) Give an empirical 95% confidence interval for β0 and β1 . (The degree of freedom is the
number of observations minus the number of parameters in the model.)
vii) Calculate the values of the t statistic for βb0 and βb1 and the corresponding two-sided p-
values.
viii) Verify your result with the R function lm() and the corresponding S3 methods summary(),
fitted(), residuals() and plot() (i.e., apply these functions to the returned object of
lm()).
ix) Use predict() to add a “confidence” and a “prediction” interval to the plot from (a). What
is the difference?
Hint: The meanings of "confidence" and "predict" here are based on the R function. Use
the help of those functions to understand their behaviour.
x) Fit a linear model without intercept (i.e., force β0 to be zero). Add the corresponding
regression line to the plot from (a). Discuss if the model fits “better” the data.
xii) What is the difference between a model with formula y ∼ x and x ∼ y? Explain it from
a stochastic and fitting perspective.
Chapter 9
Multiple Regression
– Multicollinearity
– Influential points
– Interactions between variables
– Categorical variables (factors)
– Model validation and information criterion (basic theory and R)
The simple linear regression model can be extended by the addition of further independent
variables. We first introduce the model and estimators. Subsequently, we become acquainted
with the most important steps in model validation. Two typical examples of multiple regression
are given. At the end, several typical examples of extensions of linear regression are illustrated.
\[
\begin{aligned}
Y_i &= \mu_i + \varepsilon_i, && (9.1)\\
    &= \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, && (9.2)\\
    &= \mathbf{x}_i^\top\boldsymbol{\beta} + \varepsilon_i, \qquad i = 1, \dots, n,\; n > p, && (9.3)
\end{aligned}
\]
with
• εi : (unknown) error term, noise, “measurement error”, with symmetric distribution around
zero, E(εi ) = 0.
It is often also assumed that Var(εi ) = σ 2 and/or that the errors are independent of each other.
To derive estimators with simple, closed-form distributions, we further assume that
$\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$, with unknown $\sigma^2$. In matrix notation, Equation (9.3) is written as
\[
\mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{9.4}
\]
The mean of the response varies, implying that $Y_1, \dots, Y_n$ are only independent and not iid.
We assume that the rank of X equals p + 1 ($\operatorname{rank}(X) = p + 1$, full column rank). This assumption
guarantees that the inverse of $X^\top X$ exists. In practical terms, this implies that we do not include
the same predictor twice, and that each predictor carries additional information on top of the already
included predictors.
We estimate the parameter vector $\boldsymbol{\beta}$ with the method of ordinary least squares (see Section 3.2.1).
That means the estimate $\widehat{\boldsymbol{\beta}}$ is such that the sum of the squared errors (residuals) is
minimal and is thus derived as follows:
\[
\begin{aligned}
\widehat{\boldsymbol{\beta}} &= \operatorname*{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top\boldsymbol{\beta})^2
= \operatorname*{argmin}_{\boldsymbol{\beta}}\, (\mathbf{y} - X\boldsymbol{\beta})^\top(\mathbf{y} - X\boldsymbol{\beta}) && (9.7)\\
\Rightarrow\;& \frac{d}{d\boldsymbol{\beta}}\, (\mathbf{y} - X\boldsymbol{\beta})^\top(\mathbf{y} - X\boldsymbol{\beta}) && (9.8)\\
&= \frac{d}{d\boldsymbol{\beta}}\,\bigl(\mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top X^\top\mathbf{y} + \boldsymbol{\beta}^\top X^\top X\boldsymbol{\beta}\bigr)
= -2X^\top\mathbf{y} + 2X^\top X\boldsymbol{\beta} && (9.9)\\
\Rightarrow\;& X^\top X\boldsymbol{\beta} = X^\top\mathbf{y} && (9.10)\\
\Rightarrow\;& \widehat{\boldsymbol{\beta}} = (X^\top X)^{-1}X^\top\mathbf{y} && (9.11)
\end{aligned}
\]
Equation (9.10) is also called the normal equation and Equation (9.11) indicates why we need
to assume full column rank of the matrix X.
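A short sketch solving the normal equations directly for a simulated design (hypothetical data) and comparing with lm():

set.seed(6)
n <- 50
X <- cbind(1, x1 = runif(n), x2 = runif(n))   # design matrix with intercept column
beta <- c(-1, 3, 1.5)
y <- drop(X %*% beta + rnorm(n, sd = 0.25))
solve(t(X) %*% X, t(X) %*% y)                 # beta-hat as in (9.11)
coef(lm(y ~ X[, -1]))                         # identical estimates via lm()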
We now derive the distributions of the estimator and other related and important vectors.
The derivation of the results are based directly on Property 7.5. Starting from the distributional
assumption of the errors (9.5), jointly with Equations (9.6) and (9.11), it can be shown that
where we term the matrix $H = X(X^\top X)^{-1}X^\top$ the hat matrix. In the left column we find the
estimates, in the right column the functions of the random samples. Notice the subtle difference
in the covariance matrices of the distributions of $\mathbf{Y}$ and $\widehat{\mathbf{Y}}$: the hat matrix H is not I, but hopefully
quite close to it. The latter would imply that the variances of $\mathbf{R}$ are close to zero.
The distribution of the coefficients will be used when interpreting a fitted regression model
(similarly as in the case of the simple regression). The marginal distributions of the individual
coefficients $\hat\beta_i$ are determined by the distribution (9.12): $\hat\beta_i \sim \mathcal{N}(\beta_i, \sigma^2 v_{ii})$, with $v_{ii}$ the
$i$th diagonal element of $(X^\top X)^{-1}$ (again a direct consequence of Property 7.5). Hence
$(\hat\beta_i - \beta_i)/\sqrt{\sigma^2 v_{ii}} \sim \mathcal{N}(0, 1)$. However, since $\sigma^2$ is
usually unknown, we use the unbiased estimate
\[
\hat\sigma^2 = \frac{1}{n-p-1}\sum_{i=1}^{n}(y_i - \hat y_i)^2 = \frac{1}{n-p-1}\,\mathbf{r}^\top\mathbf{r}, \tag{9.16}
\]
again termed mean squared error. Its square root is termed residual standard error (with n−p−1
degrees of freedom). Finally, we use the same approach when deriving the t-test in Equation (4.3)
and obtain
\[
\frac{\hat\beta_i - \beta_i}{\sqrt{\hat\sigma^2 v_{ii}}} \sim t_{n-p-1} \tag{9.17}
\]
as our statistic for testing and to derive confidence intervals about individual coefficients βi .
For testing, we are often interested in H0 : βi = 0. Confidence intervals are constructed along
equation (3.34) and summarized in the subsequent blue box.
Model validation verifies (i) the fixed components (or fixed part) µi and (ii) the stochastic
components (or stochastic part) εi and is typically an iterative process (arrow back to Propose
statistical model in Figure 1.1).
\[
\text{Coefficient of determination, or } R^2: \quad
R^2 = 1 - \frac{SS_E}{SS_T} = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}, \tag{9.19}
\]
\[
\text{Adjusted } R^2: \quad R^2_{\text{adj}} = R^2 - (1 - R^2)\,\frac{p}{n-p-1}, \tag{9.20}
\]
\[
\text{Observed value of the } F\text{-test:} \quad \frac{(SS_T - SS_E)/p}{SS_E/(n-p-1)}, \tag{9.21}
\]
where SS stands for sums of squares, and $SS_T$, $SS_E$ for the total sum of squares and the sum of squares
of the error, respectively. The last statistic quantifies how much of the variability in the data is explained
by the model; it is essentially equivalent to Test 4 and performs the omnibus test $H_0: \beta_1 =
\beta_2 = \dots = \beta_p = 0$. When we reject this test, this merely signifies that at least one of the
coefficients is significant, which is often not very informative.
A slightly more general version of the F -Test (9.21) is used to compare nested models. Let
M0 be the simpler model with only q out of the p predictors of the more complex model M1
(0 ≤ q < p). The test H0 : “M0 is sufficient” is based on the statistic
\[
\frac{(SS_{\text{simple model}} - SS_{\text{complex model}})/(p-q)}{SS_{\text{complex model}}/(n-p-1)} \tag{9.22}
\]
and often runs under the name ANOVA (analysis of variance). We see an alternative derivation
thereof in Chapter 10.
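In R, this comparison of nested models is carried out with anova(); a minimal sketch with two hypothetical fits:

set.seed(7)
n <- 50
x1 <- runif(n); x2 <- runif(n)
y <- -1 + 3 * x1 + 1.5 * x2 + rnorm(n, sd = 0.25)
m0 <- lm(y ~ x1)         # simpler model, q = 1 predictor
m1 <- lm(y ~ x1 + x2)    # more complex model, p = 2 predictors
anova(m0, m1)            # F-test of H0: "M0 is sufficient", cf. (9.22)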
In order to validate the fixed components of a model, it must be verified whether the necessary
predictors are in the model. We do not want too many, nor too few. Unnecessary predictors
are often identified through insignificant coefficients. When predictors are missing, the residuals
show (in the ideal case) structure, indicative for model improvement. In other cases, the quality
of the regression is low (F -Test, R2 (too) small). Example 9.1 below will illustrate the most
important elements.
Example 9.1. We construct synthetic data in order to better illustrate the difficulty of detecting
a suitable model. Table 9.1 gives the actual models and the five fitted models. In all cases we use
a small dataset of size n = 50 and predictors ($x_1$, $x_2$ and $x_3$) that we construct from a uniform
distribution. Further, we set $\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, 0.25^2)$. R-Code 9.1 and the corresponding Figure 9.1
illustrate how model deficiencies manifest.
We illustrate how residual plots may or may not show missing or unnecessary predictors.
Because of the ‘textbook’ example, the adjusted R2 values are very high and the p-value of the
F -Test is – as often in practice – of little value.
Since the output of summary() is quite long, here we show only elements from it. This is
achieved with the functions print() and cat(). For Examples 2 to 5 the output has been
constructed by a function call to subset_of_summary() constructing the output as the first
example.
The plots should supplement a classical graphical analysis through plot( res). ♣
Table 9.1: Fitted models for the five examples of Example 9.1. The true model is always
$Y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \varepsilon_i$.
R-Code 9.1: Illustration of missing and unnecessary predictors for an artificial dataset.
(See Figure 9.1.)
set.seed( 18)
n <- 50
x1 <- runif( n); x2 <- runif( n); x3 <- runif( n)
eps <- rnorm( n, sd=0.16)
y <- -1 + 3*x1 + 2.5*x1^2 + 1.5*x2 + eps
# Example 1: Correct model
sres <- summary( res <- lm( y ~ x1 + I(x1^2) + x2 ))
print( sres$coef, digits=2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.96 0.067 -14.4 1.3e-18
## x1 2.90 0.264 11.0 1.8e-14
## I(x1^2) 2.54 0.268 9.5 2.0e-12
## x2 1.49 0.072 20.7 5.8e-25
cat("Adjusted R-squared: ", formatC( sres$adj.r.squared),
" F-Test: ", pf(sres$fstatistic[1], 1, n-2, lower.tail = FALSE))
## Adjusted R-squared: 0.9927 F-Test: 6.5201e-42
plotit( res$fitted, res$resid, "fitted values") # Essentially equivalent to:
# plot( res, which=1, caption=NA, sub.caption=NA, id.n=0) with ylim=c(-1,1)
It is important to understand that the stochastic part εi does not only represent measurement
error. In general, the error is the remaining “variability” (also noise) that is not explained through
the predictors (“signal”).
Figure 9.1: Residual plots. Residuals versus fitted values (left column), predictor x1
(middle) and x2 (right column). The rows correspond to the different fitted models.
The panels in the left column have different scaling of the x-axis. (See R-Code 9.1.)
With respect to the stochastic part εi , the following points should be verified:
i) constant variance: if the (absolute) residuals are displayed as a function of the predictors,
the estimated values, or the index, no structure should be discernible. The observations
can often be transformed in order to achieve constant variance. Constant variance is also
called homoscedasticity and the terms heteroscedasticity or variance heterogeneity are used
otherwise.
More precisely, for heteroscedasticity we relax Model (9.3) to $\varepsilon_i \overset{\text{indep}}{\sim} \mathcal{N}(0, \sigma_i^2)$, i.e.,
$\boldsymbol{\varepsilon} \sim \mathcal{N}_n(\mathbf{0}, \sigma^2 V)$. In case the diagonal matrix V is known, we can use so-called weighted least
squares (WLS) by specifying the argument weights in R (weights=1/diag(V)).
iii) symmetric distribution: it is not easy to find evidence against this assumption. If the
distribution is strongly right- or left-skewed, the scatter plots of the residuals will have
structure. Transformations or generalized linear models may help. We have a quick look
at a generalized linear model in Section 9.4.
We assume that the distribution of the variables follows a known distribution with an unknown
parameter θ with p components. In maximum likelihood estimation, the larger the
likelihood function $L(\hat\theta)$ or, equivalently, the smaller the negative log-likelihood function $-\ell(\hat\theta)$,
the better the model. The oldest criterion was proposed as “an information criterion” in 1973
by Hirotugu Akaike and is known today as the Akaike information criterion (AIC):
\[
\mathrm{AIC} = -2\,\ell(\hat\theta) + 2p. \tag{9.23}
\]
In regression models with normally distributed errors, the maximized log-likelihood is linear in
$\log(\hat\sigma^2)$ and so the first term describes the goodness of fit.
The disadvantage of AIC is that the penalty term is independent of the sample size. The
Bayesian information criterion (BIC)
\[
\mathrm{BIC} = -2\,\ell(\hat\theta) + \log(n)\,p \tag{9.24}
\]
penalizes the model more heavily based on both the number of parameters p and the sample size n,
and its use is recommended.
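In R, both criteria are available through AIC() and BIC(), and step() performs a stepwise search; a brief sketch on a hypothetical fit:

set.seed(8)
n <- 50
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y <- -1 + 3 * x1 + 1.5 * x2 + rnorm(n, sd = 0.25)
full <- lm(y ~ x1 + x2 + x3)
AIC(full)
BIC(full)
step(full, k = log(n), trace = FALSE)   # backward elimination with the BIC penalty log(n)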
9.3 Examples
In this section we give two more examples of multiple regression problems, based on classical
datasets.
Example 9.2. (abrasion data) The data comes from an experiment investigating how rubber’s
resistance to abrasion is affected by the hardness of the rubber and its tensile strength (Cleveland,
1993). Each of the 30 rubber samples was tested for hardness and tensile strength, and then
subjected to steady abrasion for a fixed time.
R-Code 9.2 performs the regression analysis based on two predictors. The empirical confidence
intervals confint( res) do not contain zero. Accordingly, the p-values of the three t-tests are
small.
A quadratic term for strength is not necessary. However, the residuals appear to be slightly
correlated. We do not have further information about the data here and cannot investigate this
aspect further. ♣
Figure 9.2: abrasion data: EDA in form of a pairs plot. (See R-Code 9.2.)
R-Code 9.2: abrasion data: EDA, fitting a linear model and model validation. (See
Figures 9.2 and 9.3.)
# Fitted values
plot( loss~hardness, ylim=c(0,400), yaxs='i', data=abrasion)
points( res$fitted~hardness, col=4, data=abrasion)
plot( loss~strength, ylim=c(0,400), yaxs='i', data=abrasion)
points( res$fitted~strength, col=4, data=abrasion)
# Residuals vs ...
plot( res$resid~res$fitted)
lines( lowess( res$fitted, res$resid), col=2)
abline( h=0, col='gray')
plot( res$resid~hardness, data=abrasion)
lines( lowess( abrasion$hardness, res$resid), col=2)
abline( h=0, col='gray')
plot( res$resid~strength, data=abrasion)
lines( lowess( abrasion$strength, res$resid), col=2)
abline( h=0, col='gray')
plot( res$resid[-1]~res$resid[-30])
abline( h=0, col='gray')
Figure 9.3: abrasion data: model validation. Top row shows the loss (black) and
fitted values (blue) as a function of hardness (left) and strength (right). Middle and
bottom panels are different residual plots. (See R-Code 9.2.)
Example 9.3. (LifeCycleSavings data) Under the life-cycle savings hypothesis developed by
Franco Modigliani, the savings ratio (aggregate personal savings divided by disposable income) is
explained by per-capita disposable income, the percentage rate of change in per-capita disposable
income, and two demographic variables: the percentage of population less than 15 years old and
the percentage of the population over 75 years old. The data are averaged over the decade
1960–1970 to remove the business cycle or other short-term fluctuations.
The dataset contains information from 50 countries about these five variables: sr (aggregate personal savings ratio), pop15 (percentage of population under 15), pop75 (percentage of population over 75), dpi (real per-capita disposable income) and ddpi (percentage growth rate of dpi).
Scatter plots are shown in Figure 9.4. R-Code 9.3 fits a multiple linear model, selects models
through comparison of various goodness of fit criteria (AIC, BIC) and shows the model validation
plots for the model selected using AIC. The step function is a convenient way for selecting
relevant predictors. Figure 9.5 gives four of the most relevant diagnostic plots, obtained by
passing a fitted object to plot (compare with the manual construction of Figure 9.3).
Different models may result from different criteria: when using BIC for model selection,
pop75 drops out of the model. ♣
Figure 9.4: LifeCycleSavings data: scatter plots of the data. (See R-Code 9.3.)
R-Code 9.3: LifeCycleSavings data: EDA, linear model and model selection. (See
Figures 9.4 and 9.5.)
data( LifeCycleSavings)
head( LifeCycleSavings)
## sr pop15 pop75 dpi ddpi
## Australia 11.43 29.35 2.87 2329.68 2.87
## Austria 12.07 23.32 4.41 1507.99 3.93
## Belgium 13.17 23.80 4.43 2108.47 3.82
## Bolivia 5.75 41.89 1.67 189.13 0.22
## Brazil 12.88 42.19 0.83 728.47 4.56
## Canada 8.79 31.72 2.85 2982.88 2.43
pairs(LifeCycleSavings, upper.panel=panel.smooth, lower.panel=NULL, gap=0)
lcs.all <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
summary( lcs.all)
##
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.242 -2.686 -0.249 2.428 9.751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.566087 7.354516 3.88 0.00033 ***
## pop15 -0.461193 0.144642 -3.19 0.00260 **
## pop75 -1.691498 1.083599 -1.56 0.12553
## dpi -0.000337 0.000931 -0.36 0.71917
## ddpi 0.409695 0.196197 2.09 0.04247 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.8 on 45 degrees of freedom
## Multiple R-squared: 0.338,Adjusted R-squared: 0.28
## F-statistic: 5.76 on 4 and 45 DF, p-value: 0.00079
\[
Y_i \approx f(\mathbf{x}_i, \boldsymbol{\beta}), \tag{9.25}
\]
[Figure 9.5: model validation plots for the AIC-selected model (residuals vs. fitted values, standardized residuals and Cook’s distance); observations such as Zambia, Chile, the Philippines, Japan and Libya are flagged. See R-Code 9.3.]
For a Poisson random variable, the variance increases with increasing mean. Similarly, it is possible
that the variance of the residuals increases with increasing observations. Instead of “modeling”
increasing variances, transformations of the response variable often render the variances of the
residuals sufficiently constant. For example, instead of a linear model for $Y_i$, a linear model for
$\log(Y_i)$ is constructed.
In situations where we do not have count observations but nevertheless increasing variance
with increasing mean, a log-transformation is helpful. If the original data stems from a “truly”
linear model, a transformation typically leads to an approximation. On the other hand,
We may pick “any” other reasonable transformation. There are formal approaches to determine
an optimal transformation of the data, notably the function boxcox from the package MASS.
However, in practice $\log(\cdot)$ and $\sqrt{\cdot}$ are used predominantly.
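A minimal sketch of such a Box–Cox profile for a hypothetical positive, right-skewed response:

library(MASS)
set.seed(9)
x <- runif(50, 1, 10)
y <- exp(0.3 * x + rnorm(50, sd = 0.2))   # positive response with increasing variance
boxcox(lm(y ~ x))                         # profile over lambda; a peak near 0 suggests log()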
However, we do not have closed forms for the resulting estimates and iterative approaches are
needed. That is, starting from an initial condition we improve the solution by small steps. The
correction factor typically depends on the gradient. Such algorithms are variants of the so-called
Gauss–Newton algorithm. The details of these are beyond the scope of this document.
There is no guarantee that an optimal solution exists (global minimum). Using nonlinear
least squares as a black box approach is dangerous and should be avoided. In R, the function
nls (nonlinear least-squares) can be used. General optimizer functions are nlm and optim.
There are some prototype nonlinear functions that are often used:
Example 9.4. (orings data) In January 1986 the space shuttle Challenger exploded shortly
after taking off, killing all seven crew members aboard. Part of the problem was with the rubber
seals, the so-called O-rings, of the booster rockets. Due to low ambient temperature, the seals
started to leak, causing the catastrophe. The data set data( orings, package="faraway")
contains the number of defects in the six seals in 23 previous launches (Figure 9.6). The question
we ask here is whether the probability of a defect for an arbitrary seal can be predicted for an
air temperature of 31◦ F (as in January 1986). See Dalal et al. (1989) for a detailed statistical
account or simply https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster.
The variable of interest is a probability (failure of a rubber seal) that we estimate based on
binomial data (failures of O-rings), but a linear model cannot guarantee $\hat{p}_i \in [0, 1]$ (see the linear fit in
Figure 9.6). In this and similar cases, logistic regression is appropriate. The logistic regression
models the probability of a defect as
\[
p = P(\text{defect}) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 x)}, \tag{9.31}
\]
where x is the air temperature. Through inversion one obtains a linear model for the log odds
\[
g(p) = \log\Bigl(\frac{p}{1-p}\Bigr) = \beta_0 + \beta_1 x, \tag{9.32}
\]
where g(·) is generally called the link function. In this special case, the function g −1 (·) is called
the logistic function. ♣
R-Code 9.4 orings data and estimated probability of defect dependent on air temperature.
(See Figure 9.6.)
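Since the body of R-Code 9.4 is not reproduced here, the following hedged sketch shows one way to obtain such a fit; it assumes the faraway::orings data frame has the columns temp and damage (out of six seals), which may differ from the original code.

# Hedged sketch of a logistic fit for the orings data (column names assumed).
data( orings, package="faraway")
fit <- glm( cbind( damage, 6 - damage) ~ temp, family=binomial, data=orings)
summary( fit)
# Predicted probability of a defect at 31 degrees Fahrenheit:
predict( fit, newdata=data.frame( temp=31), type="response")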
Figure 9.6: orings data (proportion of damaged orings, black crosses) and estimated
probability of defect (red dots) dependent on air temperature. Linear fit is given by
the gray solid line. Dotted vertical line is the ambient launch temperature at the time
of launch. (See R-Code 9.4.)
Problem 9.1 (Multiple linear regression 1) The data stackloss.txt are available on the course web page. The data represent the production of nitric acid in the process of oxidizing ammonia. The response variable, stack loss, is the percentage of the ingoing ammonia that escapes unabsorbed. Key process variables are the airflow, the cooling water temperature (in degrees C), and the acid concentration (in percent).
Construct a regression model that relates the three predictors to the response, stack loss.
Check the adequacy of the model.
Exercise and data are from B. Abraham and J. Ledolter, Introduction to Regression Modeling,
2006, Thomson Brooks/Cole.
Hints:
• Look at the data. Outliers?
• Try to find an “optimal” model. Exclude predictors that do not improve the model fit.
• Use model diagnostics; use t- and F-tests and (adjusted) R² values to compare different models.
• Which data points have a (too) strong influence on the model fit? (influence.measures())
• Are the predictors correlated? In case of a high correlation, what are possible implications?
Problem 9.2 (Multiple linear regression 2) The file salary.txt contains information about
average teacher salaries for 325 school districts in Iowa. The variables are
District: name of the district
districtSize: size of the district (1 = small, less than 1000 students; 2 = medium, between 1000 and 2000 students; 3 = large, more than 2000 students)
salary: average teacher salary (in dollars)
experience: average teacher experience (in years)
ii) For each of the three district sizes, fit a linear model using salary as the dependent variable
and experience as the covariate. Is there an effect of experience? How can we compare the
results?
iii) We now use all data jointly and use districtSize as covariate as well. However, districtSize is not numerical but categorical, and thus we set mydata$districtSize <- as.factor( mydata$districtSize) (with the appropriate data frame name). Fit a linear model using
salary as the dependent variable and the remaining data as the covariates. Is there an
effect of experience and/or district size? How can we interpret the parameter estimates?
Chapter 10
Analysis of Variance
In Test 2 discussed in Chapter 4, we compared the means of two independent samples with
each other. Naturally, the same procedure can be applied to I independent samples, which would
amount to $\binom{I}{2}$ tests and would require adjustments due to multiple testing (see Section 4.5.2).
In this chapter we learn a “better” method, based on the concept of analysis of variance,
termed ANOVA. We focus on a linear model approach to ANOVA. Due to historical reasons, the
notation is slightly different than what we have seen in the last two chapters; but we try to link
and unify as much as possible.
where we use the indices to indicate the group and the within-group observation. Similarly as in the regression models of the last chapters, we again assume $\varepsilon_{ij} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$. Formulation (10.1)
represents the individual group means directly, whereas formulation (10.2) models an overall
mean and deviations from the mean.
However, model (10.2) is overparameterized (I levels and I + 1 parameters) and an additional constraint on the parameters is necessary. Often, the sum-to-zero contrast or the treatment contrast, written as

\sum_{i=1}^{I} \beta_i = 0   or   \beta_1 = 0,   (10.3)

are used.
We are inherently interested in whether there exists a difference between the groups and so
our null hypothesis is H0 : β1 = β2 = · · · = βI = 0. Note that the hypothesis is independent of
the constraint. To develop the associated test, we proceed in several steps. We first link the two
group case to the notation from the last chapter. In a second step we intuitively derive estimates
in the general setting. Finally, we state the test statistic.
Model (10.2) with I = 2 and treatment constraint $\beta_1 = 0$ can be written as a regression problem with $Y_i^*$ the components of the vector $(Y_{11}, Y_{12}, \dots, Y_{1 n_1}, Y_{21}, \dots, Y_{2 n_2})^\top$ and $x_i = 0$ if $i = 1, \dots, n_1$ and $x_i = 1$ otherwise. We simplify the notation and spare ourselves from writing the index denoted by the asterisk with

\begin{pmatrix} \boldsymbol{Y}_1 \\ \boldsymbol{Y}_2 \end{pmatrix} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} = \boldsymbol{X} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \boldsymbol{\varepsilon} = \begin{pmatrix} \boldsymbol{1} & \boldsymbol{0} \\ \boldsymbol{1} & \boldsymbol{1} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \boldsymbol{\varepsilon},   (10.5)

\hat{\boldsymbol{\beta}} = \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \begin{pmatrix} \boldsymbol{y}_1 \\ \boldsymbol{y}_2 \end{pmatrix} = \begin{pmatrix} N & n_2 \\ n_2 & n_2 \end{pmatrix}^{-1} \begin{pmatrix} \boldsymbol{1}^\top & \boldsymbol{1}^\top \\ \boldsymbol{0}^\top & \boldsymbol{1}^\top \end{pmatrix} \begin{pmatrix} \boldsymbol{y}_1 \\ \boldsymbol{y}_2 \end{pmatrix}   (10.6)

= \frac{1}{n_1 n_2} \begin{pmatrix} n_2 & -n_2 \\ -n_2 & N \end{pmatrix} \begin{pmatrix} \sum_{i,j} y_{ij} \\ \sum_j y_{2j} \end{pmatrix} = \begin{pmatrix} \frac{1}{n_1} \sum_j y_{1j} \\ \frac{1}{n_2} \sum_j y_{2j} - \frac{1}{n_1} \sum_j y_{1j} \end{pmatrix}.   (10.7)
Thus the least squares estimates of µ and β2 in (10.2) for two groups are the mean of the first
group and the difference between the two group means.
The null hypothesis H0 : β1 = β2 = 0 in Model (10.2) is equivalent to the null hypothesis
H0 : β1∗ = 0 in Model (10.4) or to the null hypothesis H0 : β1 = 0 in Model (10.5). The latter is
of course based on a t-test for a linear association (Test 12) and coincides with the two sample
t-test for two independent samples (Test 2).
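As a quick numerical check (with simulated data, so all names are illustrative), the two-sample t-test with equal variances and the t-test on the slope of the corresponding lm fit give identical p-values:

# Minimal sketch: two-sample t-test versus regression on a 0/1 dummy.
set.seed( 3)
y <- c( rnorm( 10, mean=0), rnorm( 12, mean=1))
group <- factor( rep( c("A","B"), times=c(10, 12)))
t.test( y ~ group, var.equal=TRUE)$p.value
summary( lm( y ~ group))$coefficients[2, 4]   # identical p-value for the slope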
Estimators can also be derived in a similar fashion under other constraints or for more factor
levels.
With the least squares method, $\hat\mu$ and $\hat\beta_i$ are chosen such that

\sum_{i,j} (y_{ij} - \hat\mu - \hat\beta_i)^2 = \sum_{i,j} (\bar y_{\cdot\cdot} + \bar y_{i\cdot} - \bar y_{\cdot\cdot} + y_{ij} - \bar y_{i\cdot} - \hat\mu - \hat\beta_i)^2   (10.10)

= \sum_{i,j} \bigl( (\bar y_{\cdot\cdot} - \hat\mu) + (\bar y_{i\cdot} - \bar y_{\cdot\cdot} - \hat\beta_i) + (y_{ij} - \bar y_{i\cdot}) \bigr)^2   (10.11)

is minimized. We evaluate the square of this last equation and note that the cross terms are zero since

\sum_{j=1}^{J} (y_{ij} - \bar y_{i\cdot}) = 0   and   \sum_{i=1}^{I} (\bar y_{i\cdot} - \bar y_{\cdot\cdot} - \hat\beta_i) = 0.   (10.12)

The fitted decomposition of the observations is then

y_{ij} = \hat\mu + \hat\beta_i + r_{ij}.   (10.13)
The observations are orthogonally projected in the space spanned by µ and βi . This orthog-
onal projection allows for the division of the sums of squares of the observations (mean corrected
to be precise) into the sums of squares of the model and sum of squares of the error component.
These sums of squares are then weighted and compared. The representation of this process
in table form and the subsequent interpretation is often equated with the analysis of variance,
denoted ANOVA.
Remark 10.1. This orthogonal projection also holds in the case of a classical regression frame-
work, of course. Using (9.13) and (9.14), we have
because the hat matrix $\boldsymbol{H}$ is symmetric ($\boldsymbol{H}^\top = \boldsymbol{H}$) and idempotent ($\boldsymbol{H}\boldsymbol{H} = \boldsymbol{H}$). ♣
The decomposition of the sums of squares can be derived with help from (10.9). No assumptions about constraints or $n_i$ are made:

\sum_{i,j} (y_{ij} - \bar y_{\cdot\cdot})^2 = \sum_{i,j} (\bar y_{i\cdot} - \bar y_{\cdot\cdot} + y_{ij} - \bar y_{i\cdot})^2   (10.15)

= \sum_{i,j} (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 + \sum_{i,j} (y_{ij} - \bar y_{i\cdot})^2 + \sum_{i,j} 2 (\bar y_{i\cdot} - \bar y_{\cdot\cdot})(y_{ij} - \bar y_{i\cdot}),   (10.16)
where the cross term is again zero because $\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\cdot}) = 0$. Hence we have the decomposition of the sums of squares

\underbrace{\sum_{i,j} (y_{ij} - \bar y_{\cdot\cdot})^2}_{\text{Total}} = \underbrace{\sum_{i,j} (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2}_{\text{Model}} + \underbrace{\sum_{i,j} (y_{ij} - \bar y_{i\cdot})^2}_{\text{Error}}   (10.17)
or SST = SSA + SSE. We deliberately choose SSA instead of SSM as this will simplify subsequent extensions. Using the least squares estimates $\hat\mu = \bar y_{\cdot\cdot}$ and $\hat\beta_i = \bar y_{i\cdot} - \bar y_{\cdot\cdot}$, this equation can be read as

\frac{1}{N} \sum_{i,j} (y_{ij} - \hat\mu)^2 = \frac{1}{N} \sum_{i} n_i (\widehat{\mu + \beta_i} - \hat\mu)^2 + \frac{1}{N} \sum_{i,j} (y_{ij} - \widehat{\mu + \beta_i})^2   (10.18)

\widehat{\operatorname{Var}}(y_{ij}) = \frac{1}{N} \sum_{i} n_i \hat\beta_i^2 + \hat\sigma^2,   (10.19)
(where we could have used some divisor other than N). The test statistic for the statistical hypothesis H0 : β1 = β2 = · · · = βI = 0 is based on the idea of decomposing the variance into variance between groups and variance within groups, just as illustrated in (10.19), and comparing them. Formally, this must be made more precise. A good model has a small estimate $\hat\sigma^2$ in comparison to the between-group term. We now develop a quantitative comparison of the sums.
A raw comparison of both variance terms is not sufficient; the number of observations must be considered: SSE increases as N increases, even for a high-quality model. In order to weight the individual sums of squares, we divide them by their degrees of freedom, e.g., instead of SSE we use SSE/(N − I) and instead of SSA we use SSA/(I − 1), which we term mean squares. Under the null hypothesis, the (suitably scaled) mean squares are chi-square distributed and thus their quotient is F distributed. Hence, an F-test as illustrated in Test 4 is needed again. Historically, such a test has been “constructed” via a table and is still represented as such.
This so-called ANOVA table consists of columns for the sums of squares, degrees of freedom,
mean squares and F -test statistic due to variance between groups, within groups, and the total
variance. Table 10.1 illustrates such a generic ANOVA table, numerical examples are given in
Example 10.1 later in this section.
Calculation in R: summary( lm(...)) for the value of the test statistic or anova(
lm(...)) for the explicit ANOVA table.
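As a minimal sketch of these two calls (with simulated data; all names are purely illustrative):

# Minimal sketch: one-way ANOVA via lm() with three simulated groups.
set.seed( 4)
y <- rnorm( 30, mean=rep( c(0, 0.5, 1), each=10))
group <- factor( rep( c("g1","g2","g3"), each=10))
summary( lm( y ~ group))      # F-statistic in the last line of the output
anova( lm( y ~ group))        # explicit ANOVA table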
Example 10.1. (retardant data) Many substances related to human activities end up in
wastewater and accumulate in sewage sludge. The present study focuses on hexabromocyclodo-
decane (HBCD) detected in sewage sludge collected from a monitoring network in Switzerland.
HBCD’s main use is in expanded and extruded polystyrene for thermal insulation foams, in
building and construction. HBCD is also applied in the backcoating of textiles, mainly in furni-
ture upholstery. A very small application of HBCD is in high impact polystyrene, which is used
for electrical and electronic appliances, for example in audio visual equipment. Data and more
detailed background information are given in Kupper et al. (2008) where it is also argued that
loads from different types of monitoring sites showed that brominated flame retardants ending
up in sewage sludge originate mainly from surface runoff, industrial and domestic wastewater.
HBCD is harmful to one’s health, may affect reproductive capacity, and may harm children
in the mother’s womb.
In R-Code 10.1 the data are loaded and reduced to Hexabromocyclododecane. First we use
constraint β1 = 0, i.e., Model (10.5). The estimates naturally agree with those from (10.7). Then
we use the sum-to-zero constraint and compare the results. The estimates and the standard errors
changed (and thus the p-values of the t-test). The p-values of the F -test are, however, identical,
since the same test is used.
The R command aov is an alternative for performing ANOVA and its use is illustrated in
R-Code 10.2. We prefer, however, the more general lm approach. Nevertheless we need a function
which provides results on which, for example, Tukey’s honest significant difference (HSD) test
can be performed with the function TukeyHSD. The differences can also be calculated from the
coefficients in R-Code 10.1. The p-values are larger because multiple tests are accounted for. ♣
R-Code 10.1: retardant data: ANOVA with lm command and illustration of various
contrasts.
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.8 on 13 degrees of freedom
## Multiple R-squared: 0.249,Adjusted R-squared: 0.134
## F-statistic: 2.16 on 2 and 13 DF, p-value: 0.155
options( "contrasts")
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
# manually construct the estimates:
c( mean(HBCD[1:4]), mean(HBCD[5:8])-mean(HBCD[1:4]),
mean(HBCD[9:16])-mean(HBCD[1:4]))
## [1] 75.675 77.250 107.788
# change the contrasts to sum-to-zero
options(contrasts=c("contr.sum","contr.sum"))
lmout1 <- lm( HBCD ~ type )
summary(lmout1)
##
## Call:
## lm(formula = HBCD ~ type)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.6 -44.4 -26.3 22.0 193.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.4 22.3 6.15 3.5e-05 ***
## type1 -61.7 33.1 -1.86 0.086 .
## type2 15.6 33.1 0.47 0.646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.8 on 13 degrees of freedom
## Multiple R-squared: 0.249,Adjusted R-squared: 0.134
## F-statistic: 2.16 on 2 and 13 DF, p-value: 0.155
beta <- as.numeric(coef(lmout1))
# Construct 'contr.treat' coefficients:
c( beta[1]+beta[2], beta[3]-beta[2], -2*beta[2]-beta[3])
## [1] 75.675 77.250 107.787
R-Code 10.2 retardant data: ANOVA with aov and multiple testing of the means.
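The body of R-Code 10.2 is not reproduced here; a hedged sketch of what such an analysis typically looks like (reusing the objects HBCD and type from R-Code 10.1; the exact code in the script may differ) is:

# Hedged sketch: ANOVA via aov() and Tukey's HSD (objects from R-Code 10.1).
aovout <- aov( HBCD ~ type)
summary( aovout)        # same F-test as anova( lm( HBCD ~ type))
TukeyHSD( aovout)       # all pairwise differences with adjusted p-values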
with i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , $n_{ij}$ and $\varepsilon_{ijk} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$. The indices again specify the levels of the first and second factor as well as the count for that configuration. As stated, the model is overparameterized and additional constraints are again necessary, in which case

\sum_{i=1}^{I} \beta_i = 0, \quad \sum_{j=1}^{J} \gamma_j = 0   or   \beta_1 = 0, \quad \gamma_1 = 0   (10.22)
lm(...). In case we do compare sums of squares, there are resulting ambiguities and the factors need to be included in decreasing order of “natural” importance.
For the sake of illustration, we consider the balanced case $n_{ij} = K$, called a complete two-way ANOVA. More precisely, the model consists of I · J groups, every group contains K samples and N = I · J · K. The calculation of the estimates is easier than in the unbalanced case and is illustrated as follows.
As in the one-way case, we can derive the least squares estimates

y_{ijk} = \underbrace{\bar y_{\cdot\cdot\cdot}}_{\hat\mu} + \underbrace{\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot}}_{\hat\beta_i} + \underbrace{\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot}}_{\hat\gamma_j} + \underbrace{y_{ijk} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot}}_{r_{ijk}}   (10.23)
Factor A:  SS_A = \sum_{i,j,k} (\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot})^2,   DF = I − 1,   MS_A = SS_A/(I − 1),   F_{obs,A} = MS_A/MS_E
Factor B:  SS_B = \sum_{i,j,k} (\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot})^2,   DF = J − 1,   MS_B = SS_B/(J − 1),   F_{obs,B} = MS_B/MS_E
Error:     SS_E = \sum_{i,j,k} (y_{ijk} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot})^2,   DF_E = N − I − J + 1,   MS_E = SS_E/DF_E
Total:     SS_T = \sum_{i,j,k} (y_{ijk} - \bar y_{\cdot\cdot\cdot})^2,   DF = N − 1
Model (10.21) is additive: “More of both leads to even more.” It might be that there is a certain canceling or saturation effect. To model such a situation, we need to include an interaction term $(\beta\gamma)_{ij}$ in the model, to account for the non-additive effects:

Y_{ijk} = \mu + \beta_i + \gamma_j + (\beta\gamma)_{ij} + \varepsilon_{ijk},
with $\varepsilon_{ijk} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$ and corresponding ranges for the indices. In addition to the constraints (10.22) we require

\sum_{i=1}^{I} (\beta\gamma)_{ij} = 0   and   \sum_{j=1}^{J} (\beta\gamma)_{ij} = 0   for all j and i, respectively,   (10.26)
or analogous treatment constraints are often used. As in the previous two-way case, we can
derive the least squares estimates
y_{ijk} = \underbrace{\bar y_{\cdot\cdot\cdot}}_{\hat\mu} + \underbrace{\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot}}_{\hat\beta_i} + \underbrace{\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot}}_{\hat\gamma_j} + \underbrace{\bar y_{ij\cdot} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot}}_{\widehat{(\beta\gamma)}_{ij}} + \underbrace{y_{ijk} - \bar y_{ij\cdot}}_{r_{ijk}}   (10.27)
Factor A:  SS_A = \sum_{i,j,k} (\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot})^2,   DF_A = I − 1,   MS_A = SS_A/DF_A,   F_{obs,A} = MS_A/MS_E
Factor B:  SS_B = \sum_{i,j,k} (\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot})^2,   DF_B = J − 1,   MS_B = SS_B/DF_B,   F_{obs,B} = MS_B/MS_E
with j = 1, . . . , J, k = 1, . . . , $n_j$ and $\varepsilon_{ijk} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$. Additional constraints are again necessary.
Keeping track of indices and Greek letters quickly gets cumbersome and one often uses the R formula notation instead. For example, if the predictor $x_i$ is in the variable Population and $\gamma_j$ is in the variable Treatment (in form of a factor), then the model can be written as

Y ~ Population + Treatment.   (10.29)

Hence our unified approach via lm(...). Notice that for the estimates the order of the variables in formula (10.29) does not play a role; for the decomposition of the sums of squares it does.
Different statistical software packages have different approaches and thus may lead to minor
differences.
10.4 Example
Example 10.2. (UVfilter data) Octocrylene is an organic UV filter found in sunscreen and
cosmetics. The substance is classified as a contaminant and dangerous for the environment by
the EU under the CLP Regulation.
Because the substance is difficult to break down, the environmental burden of Octocrylene
can be estimated through measurement of its concentration in sludge from waste treatment
facilities.
The study by Plagellat et al. (2006) analyzed Octocrylene (OC) concentrations from 24 different purification plants (consisting of three different types of Treatment), each with two samples (Month). Additionally, the catchment area (Population) and the amount of sludge (Production) are known. Treatment type A refers to small plants, B to medium-sized plants without considerable industry and C to medium-sized plants with industry.
R-Code 10.3 prepares the data and shows a one-way ANOVA. R-Code 10.4 shows a two-way
ANOVA (with and without interactions).
Figure 10.1 shows why the interaction is not significant. First, the seasonal effects of groups A and B are very similar and, second, the variability in group C is too large. ♣
R-Code 10.4: UVfilter data: two-way ANOVA and two-way ANOVA with interactions
using lm. (See Figure 10.1.)
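The body of R-Code 10.4 is not reproduced here; a hedged sketch of the two fits (reusing the data frame UV with the variables OT, Month and Treatment as in R-Code 12.2; the exact code in the script may differ) is:

# Hedged sketch: two-way ANOVA without and with interaction for the UVfilter data.
lm2 <- lm( log(OT) ~ Month + Treatment, data=UV)
anova( lm2)                                    # main effects only
lm3 <- lm( log(OT) ~ Month * Treatment, data=UV)
anova( lm3)                                    # includes the Month:Treatment interaction
interaction.plot( UV$Treatment, UV$Month, log(UV$OT))  # as in Figure 10.1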
Figure 10.1: UVfilter data: box plots sorted by treatment and interaction plot. (See
R-Code 10.4.)
i) Describe the data. Do a visual inspection to check for differences between the treatment types and between the months of data acquisition. Use an appropriate plot function to do so. Describe your results.
Hint: Also try the function table()
ii) Fit a one-way ANOVA with log(OC) as response variable and Behandlung as explanatory
variable.
Hint: use lm and perform an anova on the output. Don’t forget to check model assump-
tions.
iii) Extend the model to a two-way ANOVA by adding Monat as a predictor. Interpret the
summary table.
iv) Test if there is a significant interaction between Behandlung and Monat. Compare the
result with the output of interaction.plot
v) Extend the model from (b) by adding Produktion as an explanatory variable. Perform an
anova on the model output and interpret the summary table. (Such a model is sometimes
called Analysis of Covariance, ANCOVA).
Switch the order of your explanatory variables and run an anova on both model outputs.
Discuss the results of Behandlung + Produktion and Produktion + Behandlung. What
causes the differences?
Chapter 11
Bayesian Methods
In statistics there exist two different philosophical approaches to inference: frequentist and Bayesian inference. Past chapters dealt with the frequentist approach; now we deal with the Bayesian approach. Here, we consider the parameter as a random variable with a suitable
distribution, which is chosen a priori, i.e., before the data is collected and analyzed. The goal is
to update this prior knowledge after observation of the data in order to draw conclusions (with
the help of the so-called posterior distribution).
and is shown by using Equation (2.3) twice. Bayes' theorem is often used in probability theory to calculate probabilities along an event tree, as illustrated in the classic example below.
Example 11.1. A patient sees a doctor and gets tested for a (relatively) rare disease. The prevalence of this disease is 0.5%. As is typical, the screening test is not perfect and has a sensitivity
of 99%, i.e., the true positive rate (the disease is properly identified in a sick patient), and a specificity of 98%, i.e., the true negative rate (a healthy person is correctly identified as disease free). What is the probability that the patient has the disease given that the test is positive?
Denoting the events D = ‘patient has the disease’ and + = ‘test is positive’, we have, using (11.1),

P(D \mid +) = \frac{P(+ \mid D)\, P(D)}{P(+)} = \frac{P(+ \mid D)\, P(D)}{P(+ \mid D)\, P(D) + P(+ \mid \neg D)\, P(\neg D)}   (11.2)

= \frac{99\% \cdot 0.5\%}{99\% \cdot 0.5\% + 2\% \cdot 99.5\%} \approx 20\%.   (11.3)
Note that for the denominator we have used the so-called law of total probability to get an
expression for P(+). ♣
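The arithmetic in (11.3) can be verified directly (a minimal sketch):

# Probability of disease given a positive test, following (11.2)-(11.3).
prev <- 0.005; sens <- 0.99; spec <- 0.98
sens * prev / ( sens * prev + (1 - spec) * (1 - prev))   # approximately 0.199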
Extending Bayes’ theorem to the setting of two continuous random variables X and Y we
have
f_{X \mid Y=y}(x \mid y) = \frac{f_{Y \mid X=x}(y \mid x)\, f_X(x)}{f_Y(y)}.   (11.4)
In the context of Bayesian inference the random variable X will now be a parameter, typically
of the distribution of Y :
f_{\Theta \mid Y=y}(\theta \mid y) = \frac{f_{Y \mid \Theta=\theta}(y \mid \theta)\, f_\Theta(\theta)}{f_Y(y)}.   (11.5)
Hence, current knowledge about the parameter is expressed by a probability distribution on
the parameter: the prior distribution. The model for our observations is called the likelihood. We
use our observed data to update the prior distribution and thus obtain the posterior distribution.
Notice that P(B) in (11.1), P(+) in (11.2), or fY (y) in (11.4) and (11.5) serves as a normal-
izing constant, i.e., it is independent of A, D, x or the parameter θ. Thus, we often write the
posterior without this normalizing constant
(or in short form f (θ | y) ∝ f (y | θ)f (θ) if the context is clear). The symbol “∝” means
“proportional to”.
Finally, we can summarize the most important result in Bayesian inference as: the posterior density is proportional to the likelihood multiplied by the prior density, i.e., Posterior ∝ Likelihood × Prior.
Until recently, there were clear fronts between frequentists and Bayesians. Luckily these differ-
ences have vanished.
11.2 Examples
We start with two very typical examples that are tractable and well illustrate the concept of
Bayesian inference.
Example 11.2. (Beta-Binomial) Let Y ∼ Bin(n, p). We observe y successes (out of n). As was shown in Section 5.1, $\hat p = y/n$. We often have additional knowledge about the parameter p. For example, let n = 13 be the number of autumn lambs in a herd of sheep, of which we count the number of male lambs. It is highly unlikely that p ≤ 0.1. Thus we assume that p is beta distributed, i.e., its density is $c \cdot p^{\alpha-1}(1-p)^{\beta-1}$ for $0 \le p \le 1$, with normalization constant c. We write p ∼ Beta(α, β). Figure 11.4 shows densities for various pairs (α, β).
The posterior density is then

f(p \mid y) \propto \binom{n}{y} p^y (1-p)^{n-y} \times c \cdot p^{\alpha-1} (1-p)^{\beta-1}   (11.9)

\propto p^y p^{\alpha-1} (1-p)^{n-y} (1-p)^{\beta-1}   (11.10)

\propto p^{y+\alpha-1} (1-p)^{n-y+\beta-1},   (11.11)

i.e., the posterior is again a beta distribution, namely Beta(y + α, n − y + β).
Figure 11.1: Beta-binomial model with prior density (cyan), data/likelihood (green)
and posterior density (blue).
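A hedged sketch reproducing a figure of this type (the prior parameters alpha and beta below are placeholder values, not necessarily those used in the script) could look as follows:

# Hedged sketch: prior, (scaled) likelihood and posterior for the beta-binomial model.
n <- 13; y <- 10
alpha <- 2; beta <- 4                  # placeholder hyper-parameters
p <- seq( 0, 1, length.out=501)
plot( p, dbeta( p, y+alpha, n-y+beta), type="l", col=4, ylab="Density")  # posterior
lines( p, dbeta( p, alpha, beta), col=5)                                 # prior
lines( p, dbeta( p, y+1, n-y+1), col=3)         # likelihood, normalized as a density
abline( v=y/n, lty=3)                           # maximum likelihood estimate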
In the previous example, the prior is p ∼ Beta(α, β), where α and β are fixed during the model specification and are thus called hyper-parameters.
Example 11.3. (Normal-normal) Let $Y_1, \dots, Y_n \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$. We assume that $\mu \sim \mathcal{N}(\eta, \tau^2)$ and that σ is known. Thus we have the Bayesian model:

Y_i \mid \mu \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2), \quad i = 1, \dots, n,   (11.13)

\mu \sim \mathcal{N}(\eta, \tau^2).   (11.14)
where the constants $(2\pi\sigma^2)^{-1/2}$ and $(2\pi\tau^2)^{-1/2}$ do not need to be considered. Through further manipulation (completing the square in µ) one obtains

\propto \exp\biggl( -\frac{1}{2} \Bigl( \frac{n}{\sigma^2} + \frac{1}{\tau^2} \Bigr) \Bigl( \mu - \Bigl( \frac{n}{\sigma^2} + \frac{1}{\tau^2} \Bigr)^{-1} \Bigl( \frac{n\bar y}{\sigma^2} + \frac{\eta}{\tau^2} \Bigr) \Bigr)^2 \biggr)   (11.18)

and hence the posterior mean

E(\mu \mid y_1, \dots, y_n) = \frac{\sigma^2}{n\tau^2 + \sigma^2}\, \eta + \frac{n\tau^2}{n\tau^2 + \sigma^2}\, \bar y   (11.20)
is a weighted mean of the prior mean η and the mean of the likelihood, $\bar y$. The greater n is, the less weight there is on the prior mean, since $\sigma^2/(n\tau^2 + \sigma^2) \to 0$ as $n \to \infty$.
Figure 11.2 illustrates the setting of this example with artificial data (see R-Code 11.1).
Typically, the prior is fixed but if more data is collected, the likelihood gets more and more
peaked. As a result, the posterior mean will be closer to the mean of the data. We discuss this
further in the next section. ♣
Figure 11.2: Normal-normal model with prior (cyan), data/likelihood (green) and
posterior (blue). (See R-Code 11.1.)
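Since the body of R-Code 11.1 is not reproduced here, the following hedged sketch (with artificial data and placeholder values for σ, η and τ) produces a comparable figure:

# Hedged sketch: prior, likelihood and posterior for the normal-normal model.
set.seed( 11)
n <- 10; sigma <- 1; eta <- 0; tau <- 1          # placeholder values
y <- rnorm( n, mean=2, sd=sigma)
post.var <- 1 / ( n/sigma^2 + 1/tau^2)
post.mean <- post.var * ( n*mean(y)/sigma^2 + eta/tau^2)
mu <- seq( -2, 4, length.out=401)
plot( mu, dnorm( mu, post.mean, sqrt(post.var)), type="l", col=4, ylab="Density")
lines( mu, dnorm( mu, eta, tau), col=5)                    # prior
lines( mu, dnorm( mu, mean(y), sigma/sqrt(n)), col=3)      # likelihood of the mean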
The posterior mode is often used as a summary statistic of the posterior distribution. Nat-
urally, the posterior median and posterior mean (i.e., expectation of the posterior distribution)
are intuitive alternatives.
With the frequentist approach we have constructed confidence intervals, but these intervals
with $v = n/\sigma^2 + 1/\tau^2$ and $m = n\bar y/\sigma^2 + \eta/\tau^2$. That means that the bounds $v^{-1}m \pm z_{1-\alpha/2}\, v^{-1/2}$ can be used to construct a Bayesian counterpart to a confidence interval.
is called a (1 − α)% credible interval for θ with respect to the posterior density f (θ | y1 , . . . , yn )
and 1 − α is the credible level of the interval. ♦
The definition states that the random variable whose density is given by f (θ | y1 , . . . , yn ) is
contained in the (1 − α)% credible interval with probability (1 − α).
Since the credible interval for a fixed α is not unique, the “narrowest” is often used. This
is the so-called HPD Interval (highest posterior density interval). A detailed discussion can be
found in Held (2008). Credible intervals are often determined numerically.
Example 11.4. In the context of Example 11.2, the 2.5% and 97.5% quantiles of the posterior are 0.45 and 0.83, respectively. An HPD interval is given by 0.46 and 0.84. The differences are not pronounced, as the posterior density is fairly symmetric. Hence, the widths of both are almost identical: 0.377 and 0.375.
The frequentist empirical 95% CI is [0.5, 0.92], with width 0.42, see Equation (5.9). ♣
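A hedged sketch of how such quantile-based bounds are obtained (again with placeholder hyper-parameters alpha and beta, which may differ from those used in the script):

# Hedged sketch: equi-tailed credible interval from the beta posterior.
n <- 13; y <- 10
alpha <- 2; beta <- 4                              # placeholder hyper-parameters
qbeta( c(0.025, 0.975), y + alpha, n - y + beta)   # 2.5% and 97.5% posterior quantiles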
that means that the Bayes factor BF01 summarizes the evidence of the data for the hypothesis
H0 versus the hypothesis H1 . However, it has to be mentioned that a Bayes factor needs to
exceed 3 to talk about substantial evidence for H0 . For strong evidence we typically require
Bayes factors larger than 10. More precisely, Jeffreys (1983) differentiates
For values smaller than one, we would favor H1 and the situation is similar by inverting the
ratio, as also illustrated in the following example.
Example 11.5. We consider the setup of Example 11.2 and want to compare the models with p = 1/2 and p = 0.8 when observing 10 successes among the 13 trials. To calculate the Bayes factor, we need to calculate P(Y = 10 | p) for p = 1/2 and p = 0.8. Hence, the Bayes factor is

BF_{01} = \frac{\binom{13}{10}\, 0.5^{10} (1-0.5)^3}{\binom{13}{10}\, 0.8^{10} (1-0.8)^3} = \frac{0.0349}{0.2457} = 0.1421,   (11.24)
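The two binomial probabilities and their ratio can be checked directly (a minimal sketch):

# Bayes factor comparing p = 0.5 against p = 0.8 for y = 10 successes in n = 13 trials.
dbinom( 10, size=13, prob=0.5)                        # 0.0349
dbinom( 10, size=13, prob=0.8)                        # 0.2457
dbinom( 10, size=13, prob=0.5) / dbinom( 10, size=13, prob=0.8)   # BF01 = 0.142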
Bayes factors are popular because they are linked to the BIC (Bayesian Information Crite-
rion) and thus automatically penalize model complexity. Further, they also work for non-nested
models.
Example 11.6. We consider again the normal-normal model and compare the posterior density for various n with the likelihood. We keep $\bar y = 2.1$, independent of n. As shown in Figure 11.3, the maximum likelihood estimate does not depend on n ($\bar y$ is kept constant by design). The uncertainty decreases, however (the standard error is $\sigma/\sqrt{n}$). For increasing n, the posterior approaches the likelihood density. In the limit there is no difference between the posterior and the likelihood.
Figure 11.3: Normal-normal model with prior (cyan), data/likelihood (green) and posterior (blue). Top: two different priors; bottom: increasing n (n = 4, 36, 64, 100).
The choice of prior distribution leads to several discussions and we refer you to Held and Sa-
banés Bové (2014) for more details.
The source https://www.nicebread.de/grades-of-evidence-a-cheat-sheet compares different
categorizations of evidence based on a Bayes factor and illustrates that the terminology is not
universal.
We introduce a random variable with support [0, 1]. Hence this random variable is well suited
to model probabilities (proportions, fractions) in the context of Bayesian modeling.
A random variable X with density

f_X(x) = c \cdot x^{\alpha-1} (1-x)^{\beta-1}, \qquad 0 \le x \le 1,

where c is a normalization constant, is called beta distributed with parameters α and β. We write this as X ∼ Beta(α, β). The normalization constant cannot be written in closed form for all parameters α and β. For α = β the density is symmetric around 1/2 and for α > 1, β > 1
the density is concave with mode (α − 1)/(α + β − 2). For arbitrary α > 0, β > 0 we have:

E(X) = \frac{\alpha}{\alpha + \beta};   (11.27)

\operatorname{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.   (11.28)
Figure 11.4 shows densities of the beta distribution for various pairs of (α, β).
R-Code 11.2 Densities of beta distributed random variables for various pairs of (α, β).
(See Figure 11.4.)
Figure 11.4: Densities of beta distributed random variables for various pairs of (α, β).
(See R-Code 11.2.)
ii) We choose the following Gamma prior density for the parameter κ:

f(\kappa \mid \alpha, \beta) = \begin{cases} \dfrac{\beta^\alpha}{\Gamma(\alpha)}\, \kappa^{\alpha-1} \exp(-\beta\kappa), & \text{if } \kappa > 0, \\ 0, & \text{otherwise,} \end{cases}

for fixed hyper-parameters α > 0, β > 0, i.e., κ ∼ Gamma(α, β). How does this distribution relate to the exponential distribution?

iii) Plot four pdfs for (α, β) = (1,1), (1,2), (2,1) and (2,2). How can a certain choice of α, β be interpreted with respect to our “beliefs” on κ?
v) Compare the prior and posterior distributions. Why is the choice in ii) sensible?
vi) Simulate some data with n = 50, µ = 10 and κ = 0.25. Plot the prior and posterior
distributions of κ for α = 2 and β = 1.
Problem 11.2 (Bayesian statistics) For the following Bayesian models, derive the posterior
distribution and give an interpretation thereof in terms of prior and data.
i) Let Y | µ ∼ N (µ, 1/κ), where κ is the precision (inverse of the variance) and is assumed
to be known (hyper-parameter). Further, we assume that µ ∼ N (η, 1/ν), for fixed hyper-
parameters η and ν > 0.
ii) Let Y | λ ∼ Pois(λ) with a prior λ ∼ Gamma(α, β) for fixed hyper-parameters α > 0,
β > 0.
Chapter 12
Design of Experiments
Design of Experiments (DoE) is a relatively old field of statistics. Pioneering work was done almost 100 years ago by Sir Ronald Fisher and co-workers at Rothamsted Experimental Station, England, where mainly agricultural questions were discussed. The topic was taken up by industry after the second world war to, e.g., optimize the production of chemical compounds or work with robust parameter designs. In recent decades, advances are still being made, on the one hand using the abundance of data in machine-learning-type discovery and on the other hand in preclinical and clinical research where the sample sizes are often extremely small.
In this chapter we will selectively cover different aspects of DoE, focusing on sample size
calculations and randomization. Additionally, in the last section we also cover a few domain
specific concepts and terms that are often used in the context of setting up experiments for
clinical or preclinical trials.
Maximize primary variance, minimize error variance and control for secondary variance,

which translates to: maximize the signal we are investigating, minimize the noise we are not modeling and control uncertainties with carefully chosen independent variables.
In the context of DoE we often want to compare the effect of a treatment (or procedure) on an outcome. Examples that have been discussed in previous chapters are: “Is there a progression of pododermatitis at the hind paws over time?”, “Does a diuretic medication during pregnancy reduce the risk of pre-eclampsia?”, “How much can we increase the hardness of metal springs with lower-temperature quenching baths?”, “Is residual octocrylene in waste water sludge linked to particular waste water types?”.
To design an experiment, it is very important to differentiate between exploratory or con-
firmatory research questions. An exploratory experiment tries to discover as much as possible
about the sample material or the phenomenon under investigation, given time and resource constraints, whereas in a confirmatory experiment we want to verify, to confirm, or to validate a result, which was often derived from an earlier exploratory experiment. Table 12.1 summarizes
both approaches in a two-valued setting. Some of the design elements will be further discussed
in later sections of this chapter. The binary classification should be understood within each
domain: few observations in one domain may be very many in another one. In both situations
and all scientific domains, however, proper statistical analysis is crucial.
n \approx 4 z_{1-\alpha/2}^2\, \frac{\sigma^2}{\omega^2}   (12.1)

observations. In this setting, the right-hand side of (12.1) does not involve the data and thus the width of the confidence interval is guaranteed in any case. Note that to reduce the width by half, we need to quadruple the sample size.
The same approach is used when estimating a proportion. We can, for example, use the pre-
cise Wilson confidence interval (5.11) and solve a quadratic equation to obtain n. Alternatively,
we can use the Wald confidence interval (5.10) to get
n \approx 4 z_{1-\alpha/2}^2\, \frac{\hat p (1 - \hat p)}{\omega^2},   (12.2)
which corresponds to (12.1) with the plug-in estimate for $\hat\sigma^2$. Of course, $\hat p$ is not known a priori and we often take the conservative choice $\hat p = 1/2$, as the function x(1 − x) is maximized over (0, 1) at x = 1/2. Thus we may choose $n \approx (z_{1-\alpha/2}/\omega)^2$.
If we are estimating a Pearson’s correlation coefficient, we can use CI 6 to again link interval
width with n. Here, we use an alternative approach, and would like to determine sample size
such that the interval does not contain the value zero, i.e., the width is just smaller than 2r. The
derivation relies on the duality of tests and confidence intervals (see Section 4.4). Recall Test 11
for Pearson’s correlation coefficient. From Equation (8.3) we construct the critical value for the
test (boundary of the rejection region, see Figure 4.3) and based on that we can calculate the
minimum sample size necessary to detect a correlation |r| ≥ rcrit as significant:
t_{\text{crit}} = r_{\text{crit}} \frac{\sqrt{n-2}}{\sqrt{1 - r_{\text{crit}}^2}} \quad \Longrightarrow \quad r_{\text{crit}} = \frac{t_{\text{crit}}}{\sqrt{n - 2 + t_{\text{crit}}^2}}.   (12.3)
Figure 12.1 illustrates the least significant correlation for specific sample sizes. Specifically, with
sample size n < 24 correlations below 0.4 are not significant and for a correlation of 0.25 to be
significant, we require n > 62 at level α = 5% (see R-Code 12.1).
R-Code 12.1 Significant correlation for specific sample sizes (See Figure 12.1.)
Figure 12.1: Significant correlation for specific sample sizes (at level α = 5%). For an
empirical correlation of 0.25, n needs to be larger than 62 as indicated with the gray
lines. For a particular n correlations above the line are significant, below are not. (See
R-Code 12.1.)
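As the body of R-Code 12.1 is not reproduced here, a hedged sketch reproducing the key quantities of Figure 12.1 is:

# Hedged sketch: least significant correlation as a function of n (alpha = 5%).
n <- 3:100
tcrit <- qt( 0.975, df=n-2)
rcrit <- tcrit / sqrt( n - 2 + tcrit^2)          # Equation (12.3)
plot( n, rcrit, type="l", ylab="Least significant correlation")
abline( h=0.25, v=min( n[ rcrit <= 0.25]), col="gray")   # n must exceed 62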
Sample sizes are most often determined to be able to “detect” an alternative hypothesis with a
certain probability. That means we need to work with power 1 − β of a particular statistical test.
As a simple example, we consider a one-sided z-test with H0 : µ ≤ µ0 and H1 : µ > µ0 . The
Type II error is
\beta = \beta(\mu_1) = P(H_0 \text{ not rejected given } \mu = \mu_1) = \dots = \Phi\Bigl( z_{1-\alpha} + \frac{\mu_0 - \mu_1}{\sigma/\sqrt{n}} \Bigr).   (12.4)
Suppose we want to be able to detect the alternative µ1 with probability 1 − β(µ1), i.e., reject the null hypothesis with probability 1 − β when the true mean is µ1. Hence, plugging the values into (12.4) and solving for n we obtain the approximate sample size
n \approx \Bigl( \frac{(z_{1-\alpha} + z_{1-\beta})\, \sigma}{\mu_0 - \mu_1} \Bigr)^2.   (12.5)
Hence, the sample size depends on the Type I and II errors as well as the standard deviation and the difference of the means. The latter two quantities are often combined into the standardized effect size
d = \frac{\mu_0 - \mu_1}{\sigma}, \quad \text{called Cohen's } d.   (12.6)
For t-tests, Cohen (1988) defined small, medium and large (standardized) effect sizes as d = 0.2, 0.5 and 0.8, respectively. These are often termed the conventional effect sizes but depend on the type of test; they are also implemented in the function cohen.ES() of the R package pwr.
Example 12.1. In the setting of a two-sample t-test with equal group sizes, at level α = 5% and power 1 − β = 80% we need in each group 26, 64 and 394 observations for a large, medium and small effect size, respectively, see, e.g., pwr.t.test( d=0.2, power=.8) from the pwr package.
For unequal sample sizes, the sum of both group sizes is a bit larger compared to equal sample sizes (balanced setting). For a large effect size, we would, for example, require n1 = 20 and n2 = 35, leading to three more observations compared to the balanced setting (pwr.t2n.test( n1=20, d=0.8, power=.8)).
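A minimal sketch of these calls (the pwr package must be installed; the default significance level of 5% is assumed):

# Sample sizes for a two-sample t-test at 80% power (alpha = 5% by default).
library( pwr)
pwr.t.test( d=0.8, power=0.8)            # large effect: about 26 per group
pwr.t.test( d=0.2, power=0.8)            # small effect: about 394 per group
pwr.t2n.test( n1=20, d=0.8, power=0.8)   # unequal groups: n2 about 35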
12.3 ANOVA
DoE in the Fisher sense is heavily ANOVA driven, stemming from his analysis of the crop experiments at Rothamsted Experimental Station, and thus in many textbooks DoE is equated with the discussion of ANOVA. Here, we have separated the statistical analysis (Chapter 10) from the conceptual setup of the experiment in this chapter.
In a typical ANOVA setting we should strive to have the same number of observations in each cell (for all combinations of levels). Such a setting is called a balanced design (otherwise it is unbalanced). If every treatment has the same number of observations, the effects of unequal variances are mitigated.
In a simple regression setting, the standard errors of $\hat\beta_0$ and $\hat\beta_1$ depend on $1/\sum_i (x_i - \bar x)^2$, see the expressions for the estimates (8.8) and (8.9). Hence, to reduce the variability of the estimates, we should increase $\sum_i (x_i - \bar x)^2$ as much as possible. Specifically, suppose the interval [a, b] represents a natural range for the predictor; then we should choose half of the predictor values as a and the other half as b.
This last argument justifies a discretization of continuous predictor variables into levels. Of course, this implies that we expect a linear relationship. If the relationship is not linear, a discretization may be fatal.
(see also Equation (10.24)). In the unbalanced setting this is not the case and the decomposition depends on the order in which we introduce the factors in the model. At each step, we reduce additional variability. Hence, we should rather write

SS_T = SS_A + SS_{B|A} + SS_{AB|A,B} + SS_E,   (12.8)

where the term SS_{B|A} indicates the sums of squares of factor B after correction for factor A and, similarly, the term SS_{AB|A,B} indicates the sums of squares of the interaction AB after correction for factors A and B.
This concept of sums of squares after correction is not new. We have encountered this type
of correction already: SST is actually calculated after correcting for the overall mean.
Equation (12.8) represents the sequential sums of squares decomposition, called Type I sequential SS: SS_A, SS_{B|A} and SS_{AB|A,B}. It is possible to show that SS_{B|A} = SS_{A,B} − SS_A, where SS_{A,B} is the model sums of squares of a model with both factors but without interaction. An ANOVA
table such as given in Table 10.3 yields different p-values for H0 : β1 = · · · = βI = 0 and
H0 : γ1 = · · · = γJ = 0 if the order of the factors is exchanged. This is often a disadvantage
and for the F -test the so-called Type II partial SS, being SSA|B and SSB|A should be used. As
there is no interaction involved, we should use Type II only if the interaction is not significant
12.4. RANDOMIZATION 181
(in which case it is to be preferred over Type I). Alternatively, Type III partial SS, SSA|B,AB and
SSB|A,AB , may be used.
In R, the outputs of aov or anova are Type I SS. To obtain the other types, manual calculations may be done, or the function Anova(..., type=2) (or type=3) from the package car may be used.
Example 12.2. Consider Example 10.2 in Section 10.4 but we eliminate the first observation
and the design is unbalanced in both factors. R-Code 12.2 calculates the Type I sequential SS
for the same order as in R-Code 10.4. Type II partial SS are subsequently slightly different.
Note that the design is balanced for the factor Month and thus simply exchanging the order
does not alter the SS here. ♣
R-Code 12.2 Type I and II SS for UVfilter data without the first observation.
require( car)
lmout2 <- lm( log(OT) ~ Month + Treatment, data=UV, subset=-1) # omit 1st!
print( anova( lmout2), signif.stars=FALSE)
## Analysis of Variance Table
##
## Response: log(OT)
## Df Sum Sq Mean Sq F value Pr(>F)
## Month 1 1.14 1.137 4.28 0.053
## Treatment 2 5.38 2.692 10.12 0.001
## Residuals 19 5.05 0.266
print( Anova( lmout2, type=2), signif.stars=FALSE) # type=2 is default
## Anova Table (Type II tests)
##
## Response: log(OT)
## Sum Sq Df F value Pr(>F)
## Month 1.41 1 5.31 0.033
## Treatment 5.38 2 10.12 0.001
## Residuals 5.05 19
we use sample(x=4, size=20, replace=TRUE). This procedure has the disadvantage of leading to a possibly unbalanced design. Constrained randomization places the same number of subjects in all groups (conditional on an appropriate sample size). This can be achieved by numbering the subjects, randomly drawing the corresponding numbers and putting the corresponding subjects in the appropriate four groups: sample(x=20, size=20, replace=FALSE).
In the case of discrete confounders it is possible to split your sample into subgroups accord-
ing to these pre-defined factors. These subgroups are often called blocks (when controllable)
or strata (when not). To randomize, randomized complete block design (RCBD) or stratified
randomization is used.
In RCBD each block receives the same number of subjects. Within each block this can again be achieved by numbering the subjects, randomly drawing the corresponding numbers and putting the corresponding subjects in the appropriate groups: sample(x=20, size=20, replace=FALSE).
Of course, the corresponding sample sizes are determined a priori. Finally, randomization also protects against spurious correlations in the observations.
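A minimal sketch contrasting complete and constrained randomization of 20 subjects into four groups (variable names are illustrative):

# Complete randomization: group sizes may be unbalanced.
set.seed( 12)
table( sample( x=4, size=20, replace=TRUE))

# Constrained randomization: a random permutation split into four groups of five.
perm <- sample( x=20, size=20, replace=FALSE)
split( perm, rep( 1:4, each=5))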
Example 12.3. Suppose we are studying the effect of irrigation amount and fertilizer type
on crop yield. We have access to eight fields, which can be treated independently and without
proximity effects. If applying irrigation and fertilizer is equally easy, we can use a complete
2 × 2 factorial design and assign levels of both factors randomly to fields in a balanced way (each
combination of factor levels is equally represented).
Alternatively, the following options are possible and are further illustrated in Figure 12.2.
In CRD, levels of irrigation and fertilizer are assigned to plots of land (experimental units) in a
random and balanced fashion. In RCBD, similar experimental units are grouped (for example, by
field) into blocks and treatments are distributed in a CRD fashion within the block. If irrigation
is more difficult to vary on a small scale and fields are large enough to be split, a split plot
design becomes appropriate. Irrigation levels are assigned to whole plots by CRD and fertilizer
is assigned to subplots using RCBD (irrigation is the block). Finally, if the fields are large
enough, they can be used as blocks for two levels of irrigation. Each field is composed of two
whole plots, each composed of two subplots. Irrigation is assigned to whole plots using RCBD
(blocked by field) and fertilizer assigned to subplots using RCBD (blocked by irrigation). ♣
Figure 12.2: Different randomization of eight fields. CRD (a), RCBD (b) and split
plot CRD (c) and RCBD (d). Source ??.
In many experiments the subjects are inherently heterogeneous with respect to factors
that we are not interested in. This heterogeneity may imply variability in the data masking the
effect we would like to study. Blocking is a technique for dealing with this nuisance heterogeneity.
Hence, we distinguish between the treatment factors that we are interested in and the nuisance
factors which have some effect on the response but are not of interest to us.
The term blocking comes from agricultural experiments, where it designated a set of plots of land that have very similar characteristics with respect to crop yield; in other words, they are homogeneous.
If a nuisance factor is known and controllable, we use blocking and control for it by including a blocking factor in the experiment. Typical blocking factors are sex, factory or production batch. These are controllable in the sense that we are able to choose which levels to include.
If a nuisance factor is known but uncontrollable, we may use the concept of ANCOVA, i.e., remove the effect of the factor in the analysis. Suppose that age has an effect on the treatment. It is not possible to control for age and creating age batches may not be efficient either. Hence we include age in our model. This approach is less efficient than blocking, as we only correct for the factor in the analysis instead of designing the experiment to account for it.
Unfortunately, there are also unknown and uncontrollable nuisance factors. To protect against these we use proper randomization such that their impact is balanced across all groups. Hence, we can see randomization as an insurance against systematic biases due to nuisance factors.
184 CHAPTER 12. DESIGN OF EXPERIMENTS
Simple examples of RCBD are Example 10.2 and Exercise 1. Treatment type is the main “Treatment” and we control for, e.g., season, population size, etc.
Figure 12.4: (a) A crossed design examines every combination of levels for each fixed
factor. (b) Nested design can progressively subreplicate a fixed factor with nested levels
of a random factor that are unique to the level within which they are nested. (c) If
a random factor can be reused for different levels of the treatment, it can be crossed
with the treatment and modeled as a block. (d) A split plot design in which the fixed
effects (tissue, drug) are crossed (each combination of tissue and drug are tested) but
themselves nested within replicates. Source from ?.
Figure 12.5: (a) A two-factor, split plot animal experiment design. The whole plot is
represented by a mouse assigned to drug, and tissues represent subplots. (b) Biological
variability coming from nuisance factors, such as weight, can be addressed by blocking
the whole plot factor, whose levels are now sampled using RCBD. (c) With three factors,
the design is split-split plot. The housing unit is the whole plot experimental unit, each
subject to a different temperature. Temperature is assigned to housing using CRD.
Within each whole plot, the design shown in b is performed. Drug and tissue are
subplot and sub-subplot units. Replication is done by increasing the number of housing
units. Source from ?
Often a treatment is compared to an existing one and the aim is to show that it is at least as good as or equivalent to the existing one. In such a situation it is not appropriate to state H0 : µE = µN, then to compare the mean (effect) of the existing treatment with the new one and, finally, in case of failure to reject, to claim that the two are equivalent.
We need to reformulate the alternative hypothesis, stating that the effects are equivalent.
A randomized controlled trial (RCT) is a study in which people are allocated at random (by chance alone) to receive one of several clinical interventions. One of these interventions is the standard of comparison or control. The control may be a standard practice, a placebo (“sugar pill”), or no intervention at all. Someone who takes part in a randomized controlled trial (RCT) is called a participant or subject. RCTs seek to measure and compare the outcomes after the participants
receive the interventions. Because the outcomes are measured, RCTs are quantitative studies.
In sum, RCTs are quantitative, comparative, controlled experiments in which investigators
study two or more interventions in a series of individuals who receive them in random order.
The RCT is one of the simplest and most powerful tools in clinical research.
An intervention is a process to which a group of subjects (or experimental units) is subjected, such as a surgical procedure, a drug injection, or some other form of treatment.
Control has several different uses in design. First, an experiment is controlled because we as experimenters assign treatments to experimental units. Otherwise, we would have an observational study. Second, a control treatment is a “standard” treatment that is used as a baseline or basis of comparison for the other treatments. This control treatment might be the treatment in common use, or it might be a null treatment (no treatment at all). For example, a study of new pain killing drugs could use a standard pain killer as a control treatment, or a study on the efficacy of fertilizer could give some fields no fertilizer at all. This would control for average soil fertility or weather conditions.
Placebo is a null treatment that is used when the act of applying a treatment, any treatment, has an effect. Placebos are often used with human subjects, because people often respond to any treatment: for example, reduction in headache pain when given a sugar pill. Blinding is important when placebos are used with human subjects. Placebos are also useful for nonhuman subjects. The apparatus for spraying a field with a pesticide may compact the soil. Thus we drive the apparatus over the field, without actually spraying, as a placebo treatment.
Factors combine to form treatments. For example, the baking treatment for a cake involves a given time at a given temperature. The treatment is the combination of time and temperature, but we can vary the time and temperature separately. Thus we speak of a time factor and a temperature factor. Individual settings for each factor are called levels of the factor.
Confounding occurs when the effect of one factor or treatment cannot be distinguished from
that of another factor or treatment. The two factors or treatments are said to be confounded.
Except in very special circumstances, confounding should be avoided. Consider planting corn
variety A in Minnesota and corn variety B in Iowa. In this experiment, we cannot distinguish
location effects from variety effects—the variety factor and the location factor are confounded.
Blinding occurs when the evaluators of a response do not know which treatment was given to which unit. Blinding helps prevent bias in the evaluation, even unconscious bias from well-intentioned evaluators. Double blinding occurs when both the evaluators of the response and the (human subject) experimental units do not know the assignment of treatments to units.
Blinding the subjects can also prevent bias, because subject responses can change when subjects
have expectations for certain treatments.
Before a new drug is admitted to the market, many steps are necessary: starting from a discovery-based step toward highly standardized clinical trials (phases I, II and III). At the very end, there are typically randomized controlled trials, which by design (should) eliminate all possible confounders.
At later steps, when searching for an appropriate drug, we may base the decision on available “evidence”: what has been used in the past, what has been shown to work (in similar situations). This is part of evidence-based medicine. Past information may be of varying quality, ranging from ideas and opinions to case studies to RCTs or systematic reviews. Figure 12.6 represents a so-called
evidence-based medicine pyramid which reflects the quality of research designs (increasing) and
quantity (decreasing) of each study design in the body of published literature (from bottom
to top). For other scientific domains, similar pyramids exist, with bottom and top typically
remaining the same.
Problem 12.2 (Sample size calculation) Suppose we compare the mean of some treatment
in two equally sized groups. Let zγ denote the γ-quantile of the standard normal distribution.
Furthermore, the following properties are assumed to be known or fixed:
• clinically relevant difference ∆ = µ1 − µ0 , we can assume without loss of generality that
∆>0
• Power 1 − β.
i) Write down the suitable test statistic and its distributions under the null hypothesis.
ii) Derive an expression for the power using the test statistic.
iii) Prove analytically that the required sample size n in each group is at least
n = \frac{2\sigma^2 (z_{1-\beta} + z_{1-\alpha/2})^2}{\Delta^2}.
Problem 12.3 (Sample size and group allocation) A randomized clinical trial to compare
treatment A to treatment B is being conducted. To this end 20 patients need to be allocated to
the two treatment arms.
i) Using R randomize the 20 patients to the two treatments with equal probability. Repeat
the randomization in total a 1000 times retaining the difference in group size and visualize
the distribution of the differences with a histogram.
ii) In order to obtain group sizes that are closer while keeping randomization codes secure a
random permuted block design with varying block sizes 2 and 4 and respective probabilities
0.25 and 0.75 is now to be used. Here, for a given length, each possible block of equal
numbers of As and Bs is chosen with equal probability. Using R randomize the 20 patients
to the two treatments using this design. Repeat the randomization in total a 1000 times
retaining the difference in group size. What are the possible values this difference may take?
How often did these values occur?
Chapter 13
A Closer Look: Monte Carlo Methods
Several examples in Chapter 11 resulted in the same posterior and prior distributions albeit
with different parameters, for example for binomial data with a beta prior, the posterior is again
beta. This was no coincidence; rather, we chose so-called conjugate priors based on our likelihood
(distribution of the data).
With other prior distributions, we may have “complicated”, non-standard posterior distributions, for which we no longer know the normalizing constant and thus, in general, the expected value or any other moment. Theoretically, we could derive the normalizing constant and then the expectation (via integration). The calculation of these two integrals is often complex and so here we consider classic simulation procedures as a solution to this problem. In general, so-called Monte Carlo simulation is used to numerically solve a complex problem through repeated random sampling.
In this chapter, we start with illustrating the power of Monte Carlo simulation where we
utilize, above all, the law of large numbers. We then discuss one method to draw a sample from
an arbitrary density and, finally, illustrate a method to derive (virtually) arbitrary posterior
densities by simulation.
Hence, g(x) cannot be entirely arbitrary, but such that the integral is well defined. An approximation of this integral is (along the idea of the method of moments)

E\bigl(g(X)\bigr) = \int_{\mathbb{R}} g(x)\, f_X(x)\, dx \approx \widehat{E\bigl(g(X)\bigr)} = \frac{1}{n} \sum_{i=1}^{n} g(x_i),   (13.2)
where x1 , . . . , xn is a random sample of fX (x). The method relies on the law of large numbers
(see Section 2.7).
Example 13.1. To estimate the expectation of a $\chi^2_1$ random variable we can use mean( rnorm( 100000)^2), yielding 1 with a couple of digits of precision, close to what we expect according to Equation (2.42).
Of course, we can use the same approach to calculate arbitrary moments of a $\chi^2_n$ or $F_{n,m}$ distribution. ♣
We now discuss this justification in slightly more detail. We consider a continuous real function g and the integral $I = \int_a^b g(x)\, dx$. There exists a value ξ such that I = (b − a) g(ξ) (often termed the mean value theorem for definite integrals). We do not know ξ nor g(ξ), but we hope that the “average” value of g is close to g(ξ). More formally, let $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathcal{U}(a, b)$, which we use to calculate the average (the density of $X_i$ is $f_X(x) = 1/(b − a)$ over the interval [a, b] and zero elsewhere). We now show that on average, our approximation is correct:
E(\hat I) = E\Bigl( (b-a) \frac{1}{n} \sum_{i=1}^{n} g(X_i) \Bigr) = (b-a) \frac{1}{n} \sum_{i=1}^{n} E\bigl(g(X_i)\bigr) = (b-a)\, E\bigl(g(X)\bigr)

= (b-a) \int_a^b g(x)\, f_X(x)\, dx = (b-a) \int_a^b g(x) \frac{1}{b-a}\, dx = \int_a^b g(x)\, dx = I.   (13.3)
We can generalize this to almost arbitrary densities fX (x) having a sufficiently large support:
\hat I = \frac{1}{n} \sum_{i=1}^{n} \frac{g(x_i)}{f_X(x_i)},   (13.4)
where the justification is as in (13.3). The density in the denominator takes the role of an
additional weight for each term.
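As a minimal sketch of (13.4), we can approximate $\int_{-\infty}^{\infty} \exp(-x^2)\, dx = \sqrt{\pi}$ by sampling from a standard normal density and weighting by it (the choice of integrand and sampling density is illustrative):

# Monte Carlo approximation of the integral of exp(-x^2), following (13.4).
set.seed( 13)
x <- rnorm( 100000)
g <- exp( -x^2)
mean( g / dnorm( x))      # approximately 1.77, compare with sqrt(pi) = 1.7725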
Similarly, to integrate over a rectangle R in two dimensions (or a cuboid in three dimensions,
etc.), we use a uniform random variable for each dimension. More specifically, let R = [a, b]×[c, d]
then
\int_R g(x, y)\, dx\, dy = \int_a^b \int_c^d g(x, y)\, dx\, dy \approx (b-a)(d-c) \frac{1}{n} \sum_{i=1}^{n} g(x_i, y_i),   (13.5)
random vector having a density fX,Y (x, y) whose support contains A. For example we define a
rectangle R such that A ⊂ R and let fX,Y (x, y) = (b − a)(d − c) over R and zero otherwise. We
define the indicator function 1A (x, y) that is one if (x, y) ∈ A and zero otherwise. Then we have
the general formula
\int_A g(x, y)\, dx\, dy = \int_a^b \int_c^d 1_A(x, y)\, g(x, y)\, dx\, dy \approx \frac{1}{n} \sum_{i=1}^{n} 1_A(x_i, y_i) \frac{g(x_i, y_i)}{f_{X,Y}(x_i, y_i)}.   (13.6)
Example 13.2. Consider the bivariate normal density specified in Example 7.1 and suppose we are interested in evaluating the probability P(X > Y²). To approximate this probability we can draw a large sample from the bivariate normal density and calculate the proportion for which $x_i > y_i^2$, as illustrated in R-Code 13.1, yielding 10.47%.
In this case, the function g is the density with which we are drawing the data points. Hence, Equation (13.6) reduces to calculating the proportion of the data satisfying $x_i > y_i^2$. ♣
R-Code 13.1 Calculating probability with the aid of a Monte Carlo simulation
set.seed( 14)
require(mvtnorm)
l.sample <- rmvnorm( 10000, mean=c(0,0), sigma=matrix( c(1,2,2,5), 2))
mean( l.sample[,1] > l.sample[,2]^2)
## [1] 0.1047
Example 13.3. The area of the unit circle is π, as is the volume of a cylinder of height one placed over the unit circle. To estimate π we estimate the volume of this cylinder and we consider U(−1, 1) for both coordinates, i.e., a square that contains the unit circle. The function g(x, y) = 1 is the constant function and $1_A(x, y)$ is the indicator function of the set $x^2 + y^2 \le 1$. We have the following approximation of the number π:

\pi = \int_{-1}^{1} \int_{-1}^{1} 1_A(x, y)\, dx\, dy \approx 4\, \frac{1}{n} \sum_{i=1}^{n} 1_A(x_i, y_i),   (13.7)
where xi and yi , i = 1 . . . , n are two independent random samples from U(−1, 1). Equation (13.6)
reduces to calculate a proportion again.
It is important to note that the convergence is very slow, see Figure 13.1. It can be shown that the rate of convergence is of the order 1/\sqrt{n}. ♣
R-Code 13.2 Approximation of π with the aid of Monte Carlo integration. (See Fig-
ure 13.1.)
set.seed( 14)
m <- 49                                        # number of different sample sizes
n <- round( 10 + 1.4^(1:m))                    # sample sizes, roughly geometrically increasing
piapprox <- numeric( m)
for (i in 1:m) {
  st <- matrix( runif( 2*n[i]), ncol=2)        # n[i] uniform points in the unit square
  piapprox[i] <- 4*mean( rowSums( st^2) <= 1)  # proportion inside the quarter circle, times 4
}
plot( n, abs( piapprox-pi)/pi, log='xy', type='l')   # relative error
lines( n, 1/sqrt(n), col=2, lty=2)             # reference rate 1/sqrt(n)
sel <- (1:7)*7                                 # show every seventh value
cbind( n=n[sel], pi.approx=piapprox[sel], rel.error=abs( piapprox[sel]-pi)/pi,
    abs.error=abs( piapprox[sel]-pi))
## n pi.approx rel.error abs.error
## [1,] 21 2.4762 0.21180409 0.66540218
## [2,] 121 3.0083 0.04243968 0.13332819
## [3,] 1181 3.1634 0.00694812 0.02182818
## [4,] 12358 3.1662 0.00783535 0.02461547
## [5,] 130171 3.1403 0.00040166 0.00126186
## [6,] 1372084 3.1424 0.00025959 0.00081554
## [7,] 14463522 3.1406 0.00032656 0.00102592
Figure 13.1: Convergence of the approximation for π: the relative error as a function
of n. (See R-Code 13.2.)
In practice, more efficient “sampling” schemes are used. More specifically, we do not sample uniformly but in a deliberately “stratified” manner. There are several reasons to sample in a randomly stratified way, but the discussion is beyond the scope of this work.
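To convey the idea only (a minimal added sketch, not an efficient implementation; the grid size k is an arbitrary choice), one can stratify the unit square into k² cells and draw one uniform point per cell to approximate π:

set.seed( 14)
k <- 100                                             # k^2 strata over the unit square
lower <- expand.grid( x=(1:k - 1)/k, y=(1:k - 1)/k)  # lower-left corners of the cells
x <- lower$x + runif( k^2)/k                         # one uniform draw within each cell
y <- lower$y + runif( k^2)/k
4 * mean( x^2 + y^2 <= 1)                            # stratified estimate of pi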
13.2 Rejection Sampling

Suppose we want to draw a sample from a density f_Y(y) = c · f*(y), where the normalizing constant c may be unknown. We choose a proposal density f_Z(y) and a constant m such that f*(y) ≤ m · f_Z(y) for all y. In Step 1, we draw a proposal ỹ from f_Z(y) and, independently, u from U(0, 1). In Step 2, the proposal is accepted as a realization from f_Y(y) if u < f*(ỹ)/(m · f_Z(ỹ)); otherwise it is rejected and no longer considered. We cycle along Steps 1 and 2 until a sufficiently large sample has been obtained. The algorithm is illustrated in the following example.
Example 13.4. The goal is to draw a sample from a Beta(6, 3) distribution with the rejection sampling method. That means f_Y(y) = c · y^{6−1}(1 − y)^{3−1} and f*(y) = y^5 (1 − y)^2. As proposal density we use a uniform distribution, hence f_Z(y) = 1_{[0,1]}(y). We select m = 0.02, which fulfills the condition f*(y) ≤ m · f_Z(y), since the maximum of f*, obtained with optimize( function(x) x^5*(1-x)^2, c(0, 1), maximum=TRUE)$objective, is roughly 0.0152 < 0.02.
An implementation of the example is given in R-Code 13.3. Of course, f_Z is always one here. The R code could be optimized with respect to speed; it would then, however, be more difficult to read.
Figure 13.2 shows a histogram and the density of the simulated values. By construction, the bars of the target density are smaller than those of the proposal density. In this particular example, the accepted sample has size 285. ♣
R-Code 13.3: Rejection sampling in the setting of a beta distribution. (See Figure 13.2.)
set.seed( 14)
n.sim <- 1000
m <- 0.02
fst <- function(y) y^( 6-1) * (1-y)^(3-1)
f_Z <- function(y) ifelse( y >= 0 & y <= 1, 1, 0)
result <- sample <- rep( NA, n.sim)
for (i in 1:n.sim){
  sample[i] <- runif(1)                                # ytilde, proposal
  u <- runif(1)                                        # u, uniform
  if( u < fst( sample[i]) /( m * f_Z( sample[i])) )    # if accept ...
    result[i] <- sample[i]                             # ... keep
}
mean( !is.na(result)) # proportion of accepted samples
## [1] 0.285
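To compare the accepted values with the target density in the spirit of Figure 13.2 and to check the acceptance rate against its theoretical value \int f^*(y) \, dy / m (an added sketch, using the objects result and m from R-Code 13.3):

hist( result[ !is.na( result)], prob=TRUE, main="", xlab="y")  # accepted values only
curve( dbeta( x, 6, 3), add=TRUE, col=4)                       # target Beta(6,3) density
beta( 6, 3) / m                                                # theoretical acceptance rate, roughly 0.298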
Figure 13.2: On the left we have a histogram of the simulated values of fZ (y) (light
blue) and fY (y) (dark blue). On the right the theoretical density (truth) and the
simulated density (smoothed empirical) are drawn. (See R-Code 13.3.)
For efficiency reasons the constant m should be chosen as small as possible, as this reduces the number of rejections. Even so, while rejection sampling is intuitive, in practice it is often very inefficient. The next section illustrates an approach well suited for complex Bayesian models.
13.3 Gibbs Sampling

In many cases one does not have to program a Gibbs sampler oneself but can use a pre-programmed sampler. We use the sampler JAGS (Just Another Gibbs Sampler) (Plummer, 2003) with the R interface package rjags (Plummer, 2016).
R-Codes 13.4, 13.5 and 13.6 give a short but practical overview of Markov chain Monte Carlo methods with JAGS in the case of a simple Gaussian likelihood. More complex models can easily be constructed based on the approach shown here.
When using MCMC methods, you may encounter situations in which the sampler does not converge (or converges too slowly). In such a case the posterior distribution cannot be approximated with the simulated values. It is therefore important to examine the simulated values for eye-catching patterns. For example, the so-called trace plot, showing the observations as a function of their index, as illustrated in the right panel of Figure 13.3, is often used.
Example 13.5. R-Code 13.4 implements the normal-normal model for a single observation, y = 1, n = 1, with known variance σ² = 1.1, and a normal prior for the mean µ:

Y \mid \mu \sim \mathcal{N}(\mu, 1.1),
\mu \sim \mathcal{N}(0, 0.8).
The basic approach to using JAGS is to first create a file containing the Bayesian model definition. This file is then transcribed into a model graph (function jags.model()), from which we can finally draw samples (coda.samples()).
Defining a model for JAGS is quite straightforward, as the notation is very close to that of R. Some care is needed when specifying variance parameters. In our notation, we typically use the variance σ², as in N(·, σ²); in R we have to specify the standard deviation σ via the argument sd in the function dnorm(..., sd=sigma); and in JAGS the second argument of dnorm() is the precision 1/σ², as in dnorm(mu, 1/sigma2).
The resulting samples are typically plotted with smoothed densities, as seen in the left panel of Figure 13.3, together with the prior and likelihood, if possible. The posterior is affected roughly equally by the likelihood (data) and the prior; its mean is close to the average of the prior mean and the observation. The prior is slightly tighter, as its variance is slightly smaller (0.8 vs. 1.1), but this does not seem to have a visual impact on the posterior. The setting here is identical to Example 11.3 and thus the posterior is again normal, N(0.8/(0.8 + 1.1), 0.8 · 1.1/(0.8 + 1.1)), see Equation (11.19). ♣
R-Code 13.4: JAGS sampler for normal-normal model, with n = 1. (See Figure 13.3.)
require( rjags)
writeLines("model { # File with Bayesian model definition
y ~ dnorm( mu, 1/1.1) # here Precision = 1/Variance
mu ~ dnorm( 0, 1/0.8) # Precision again!
}", con="jags01.txt")
jagsModel <- jags.model( "jags01.txt", data=list( 'y'=1)) # transcription
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
Figure 13.3: Left: empirical densities: MCMC based posterior (black), exact (red),
prior (blue), likelihood (green). Right: trace plot of the posterior µ | y = 1. (See
R-Code 13.4.)
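R-Code 13.4 shows the model transcription only; a minimal sketch of the remaining steps (drawing the samples and comparing them with the exact posterior of Example 11.3), assuming the object jagsModel from above, is:

postSamples <- coda.samples( jagsModel, 'mu', n.iter=2000)        # draw 2000 samples of mu
plot( density( as.matrix( postSamples)), main="")                 # MCMC based posterior
curve( dnorm( x, 0.8/(0.8+1.1), sqrt( 0.8*1.1/(0.8+1.1))), add=TRUE, col=2)  # exact posterior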
Example 13.6. R-Code 13.5 extends the normal-normal model to n = 10 observations, still with known variance:

Y_1, \ldots, Y_n \mid \mu \overset{\text{iid}}{\sim} \mathcal{N}(\mu, 1.1),    (13.10)
\mu \sim \mathcal{N}(0, 0.8).    (13.11)
We draw the data in R via rnorm( n, 1, sqrt(1.1)) and proceed similarly as in R-Code 13.4. Figure 13.4 gives the empirical and exact densities of the posterior, prior and likelihood and shows a trace plot as a basic graphical diagnostic tool. The density of the likelihood is that of \mathcal{N}(\bar{y}, 1.1/n), the prior density is based on (13.11) and the posterior density is based on (11.19). The latter simplifies considerably because we have η = 0 in (13.11).
As the number of observations increases, the data gets more “weight”: from (11.20), the weight increases from 0.8/(0.8 + 1.1) ≈ 0.42 to 0.8n/(0.8n + 1.1) ≈ 0.88. Thus, the posterior is “closer” to the likelihood but slightly more peaked. As the variance of the data and the variance of the prior are comparable, the prior has an impact on the posterior comparable to that of one additional observation with value zero. ♣
Figure 13.4: Left: empirical and exact densities of the posterior, prior and likelihood. Right: trace plot of the posterior. (See R-Code 13.5.)
R-Code 13.5: JAGS sampler for the normal-normal model, with n = 10. (See Figure 13.4.)
set.seed( 4)
n <- 10
obs <- rnorm( n, 1, sqrt(1.1))          # generate artificial data
writeLines("model {
  for (i in 1:n) {                      # define a likelihood for each
    y[i] ~ dnorm( mu, 1/1.1)            # individual observation
  }
  mu ~ dnorm( 0, 1/0.8)
}", con="jags02.txt")
jagsModel <- jags.model( "jags02.txt", data=list('y'=obs, 'n'=n), quiet=TRUE)
postSamples <- coda.samples( jagsModel, 'mu', n.iter=2000)
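As a quick added numerical check of the weights discussed above (a sketch assuming the normal-normal posterior of Equation (11.19) and the objects n and obs from R-Code 13.5), the exact posterior mean and variance are:

w <- 0.8*n/(0.8*n + 1.1)                 # weight of the data, approximately 0.88
c( post.mean=w*mean( obs),               # the prior mean is zero, hence no second term
   post.var=0.8*1.1/(0.8*n + 1.1))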
Example 13.7. In this last example, we consider an extension of the previous example by including an unknown variance, or equivalently an unknown precision. That means that we now specify two prior distributions; we have a priori no knowledge of the posterior and cannot compare the empirical posterior density with a true (bivariate) density (as we did with the red densities in Figures 13.3 and 13.4).
R-Code 13.6 implements the following model in JAGS:

Y_i \mid \mu, \kappa \overset{\text{iid}}{\sim} \mathcal{N}(\mu, 1/\kappa), \quad i = 1, \ldots, n, \text{ with } n = 10,    (13.12)
\mu \sim \mathcal{N}(\eta, 1/\lambda), \quad \text{with } \eta = 0, \ \lambda = 1.25,    (13.13)
\kappa \sim \text{Gamma}(\alpha, \beta), \quad \text{with } \alpha = 1, \ \beta = 0.2.    (13.14)
For more flexibility with the code, we also pass the hyper-parameters η, λ, α, β to the JAGS MCMC engine.
Figure 13.5 gives the marginal empirical posterior densities of µ and κ, as well as the priors (based on (13.13) and (13.14)) and likelihoods (based on (13.12)).
The likelihood for µ is \mathcal{N}(\bar{y}, s^2/n), i.e., we have replaced the parameters in the model with their unbiased estimates. For κ, it is a Gamma distribution with parameters n/2 + 1 and n s^2/2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 / 2, see Problem 11.1 i).
Note that this is another classical example; with a very careful specification of the priors, we can construct a closed form posterior density. Problem 13.2 gives a hint towards this more advanced topic. ♣
R-Code 13.6: JAGS sampler for priors on mean and precision parameter, with n = 10.
(See Figure 13.5.)
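A minimal JAGS implementation consistent with (13.12)–(13.14), passing the hyper-parameters as data, could look as follows (a sketch only; the file name jags03.txt is illustrative and the data obs and n are taken from R-Code 13.5):

writeLines("model {
  for (i in 1:n) {
    y[i] ~ dnorm( mu, kappa)             # kappa is the precision
  }
  mu ~ dnorm( eta, lambda)               # prior on the mean, precision lambda
  kappa ~ dgamma( alpha, beta)           # prior on the precision
}", con="jags03.txt")
jagsModel <- jags.model( "jags03.txt", quiet=TRUE, data=list( 'y'=obs, 'n'=n,
    'eta'=0, 'lambda'=1.25, 'alpha'=1, 'beta'=0.2))
postSamples <- coda.samples( jagsModel, c('mu', 'kappa'), n.iter=2000)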
Figure 13.5: marginal empirical posterior densities of µ and κ, together with the corresponding priors and likelihoods. (See R-Code 13.6.)
Note that the model definition files written with writeLines() (e.g., jags01.txt) remain in the working directory and may be cleaned up after the analysis.
An alternative to JAGS is BUGS (Bayesian inference Using Gibbs Sampling), which is distributed in two main versions: WinBUGS and OpenBUGS, see also Lunn et al. (2012). Additionally, there is the R interface package R2WinBUGS (Sturtz et al., 2005). Other possibilities are the Stan or INLA engines, with convenient user interfaces to R through rstan and INLA (Gelman et al., 2015; Rue et al., 2009; Lindgren and Rue, 2015).
The list of textbooks discussing MCMC is long and extensive. Held and Sabanés Bové (2014) give some basic and accessible ideas. Accessible examples for actual implementations can be found in Kruschke (2015) (JAGS and Stan) and Kruschke (2010) (BUGS).
We use inverse transform sampling, which is well suited for distributions whose cdf is easily
invertible.
i) Find c such that fX (x) is an actual pdf (two points are to be checked).
ii) Assume an arbitrary cumulative distribution function F (x) with an existing inverse F −1 (p) =
Q(p) (quantile function). Show that the random variable X = F −1 (U ), where U ∼ U(0, 1),
has cdf F (x).
iii) Without using the functions rexp and qexp, implement your own code to simulate from
an exponential distribution of rate λ > 0.
v) Check the correctness of your sampler(s); you can use, e.g., hist(..., prob=TRUE) and/or QQ-plots.
Problem 13.2 (⋆ Normal-normal-gamma model) Let Y_1, Y_2, \ldots, Y_n \mid \mu, \kappa \overset{\text{iid}}{\sim} \mathcal{N}(\mu, 1/\kappa). Instead of independent priors on µ and κ, we propose a joint prior density that can be factorized into the density of κ and that of µ | κ. We assume κ ∼ Gamma(α, β) and µ | κ ∼ N(η, 1/(κν)), for some hyper-parameters η, ν > 0, α > 0, and β > 0. This distribution is a so-called normal-gamma distribution, denoted by NΓ(η, ν, α, β).
i) Create an artificial dataset consisting of Y_1, \ldots, Y_n \overset{\text{iid}}{\sim} \mathcal{N}(1, 1), with n = 20.
ii) Write a function called dnormgamma() that calculates the density at mu, kappa based on the
parameters eta, nu, alpha, beta. Visualize the bivariate density based on η = 1, ν = 1.5,
α = 1, and β = 0.2.
iii) Set up a Gibbs sampler for the values η = 0, ν = 1.5, α = 1, and β = 0.2. For a sample of length 2000, illustrate the (empirical) joint posterior density of µ, κ | y_1, . . . , y_n.
iv) It can be shown that the posterior is again normal-gamma with parameters

\eta_{\text{post}} = \frac{1}{n+\nu}\,(n\bar{y} + \nu\eta), \qquad \nu_{\text{post}} = \nu + n,    (13.15)
\alpha_{\text{post}} = \alpha + \frac{n}{2}, \qquad \beta_{\text{post}} = \beta + \frac{1}{2}\Bigl( (n-1)s^2 + \frac{n\nu(\eta - \bar{y})^2}{n+\nu} \Bigr),    (13.16)

where s² is the usual unbiased estimate of σ². Superimpose the true isolines of the normal-gamma prior and posterior density on the plot from the previous problem.
Appendix A

Software Environment R

R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques. It compiles and runs on a wide variety of operating systems (Windows, Mac, and Linux); its central entry point is https://www.r-project.org.
The R software can be downloaded from CRAN (Comprehensive R Archive Network), https://cran.r-project.org, a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. Figure A.1 shows a screenshot of the web page.
R is console based, meaning that individual commands have to be typed. It is very important to save these commands as we construct a reproducible workflow – the big advantage over a “click-and-go” approach. We strongly recommend using a graphical, integrated de-
velopment environment (IDE) for R. The prime choice these days is RStudio. RStudio includes
a console, syntax-highlighting editor that supports direct code execution, as well as tools for
plotting, history, debugging and workspace management, see Figure A.2.
RStudio is available in a desktop open source version for many different operating systems
(Windows, Mac, and Linux) or in a browser connected to an RStudio Server. There are several
providers of such servers, including rstudio.math.uzh.ch for the students of the STA120 lecture.
Figure A.2: RStudio screenshot. The four panels shown are (clockwise, starting top left): (i) console, (ii) plots, (iii) environment, (iv) script.
The installation of all software components is quite straightforward, but the look of the download page may change from time to time and the precise steps may vary a bit. Some examples are given in the accompanying videos.
The biggest advantage of using R is the support from and for a huge user community. A seemingly endless number of packages cover almost every statistical task, often implemented by several authors. The packages are documented and, through the upload to CRAN, held to a minimum level of documentation, coding standards, (unit) testing, etc. There are several forums (e.g., the R mailing lists, Stack Overflow with tag “r”) to get additional help, see https://www.r-project.org/help.html.
Appendix B
Calculus
In this chapter we present some of the most important ideas and concepts of calculus. It is impossible to give a formal, mathematically precise exposition here, and we cannot present all rules, identities, guidelines or even tricks; for example, we do not discuss sequences and series.
B.1 Functions
We start with one of the most basic concepts, a formal definition that describes a relation between
two sets.
Definition B.1. A function f from a set D to a set W is a rule that assigns a unique element f(x) ∈ W to each element x ∈ D. We write

f : D → W    (B.1)
x ↦ f(x).    (B.2)

The set D is called the domain, the set W is called the range (or target set or codomain). The graph of a function f is the set {(x, f(x)) : x ∈ D}.
♦
The function will not necessarily map to every element in W , and there may be several
elements in D with the same image in W . These functions are characterized as follows.
Definition B.2. i) A function f is called injective, if the image of two different elements in
D is different.
ii) A function f is called surjective, if for every element y in W there is at least one element
x in D such that y = f (x).
iii) A function f is called bijective if it is surjective and injective. Such a function is also called
a one-to-one function. ♦
In general, there is virtually no restriction on the domain and codomain. However, we often
work with real functions, i.e., D ⊂ R and W ⊂ R.
There are many different characterizations of functions. Some relevant ones are as follows. A function f is called:

i) periodic if there exists an ω > 0 such that f(x + ω) = f(x) for all x ∈ D; the smallest such ω is called the period of f;

ii) increasing if f(x) ≤ f(x + h) for all h ≥ 0. In case of strict inequalities, we call the function strictly increasing. Similar definitions hold when reversing the inequalities. ♦
For a bijective function f, the inverse function f^{-1} is defined by

f^{-1} : W → D
y ↦ f^{-1}(y), \quad \text{such that } y = f\bigl( f^{-1}(y) \bigr).    (B.3)
To capture the behavior of a function locally, say at a point x0 ∈ D, we use the concept of a
limit.
The latter definition does not assume that the function is defined at x0 .
It is possible to define “directional” limits, in the sense that x approaches x_0 from above (from the right side) or from below (from the left side). These limits are denoted by

\lim_{x \to x_0^{+}} f(x) = \lim_{x \searrow x_0} f(x) \quad \text{for the former; or} \qquad \lim_{x \to x_0^{-}} f(x) = \lim_{x \nearrow x_0} f(x) \quad \text{for the latter.}    (B.4)
We are used to interpreting graphs, and when we sketch an arbitrary function we often use a single, continuous line. This concept of not lifting the pen while sketching is formalized as follows and linked directly to limits, introduced above.
There are many other approaches to define continuity, for example in terms of neighborhoods or in terms of limits of sequences.
Another very important (local) characterization of a function is the derivative, which quan-
tifies the (infinitesimal) rate of change.
Definition B.6. The derivative of a function f(x) with respect to the variable x at the point x_0 is defined by

f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h},    (B.6)

provided the limit exists. We also write \frac{df(x_0)}{dx} = f'(x_0). If the derivative exists for all x_0 ∈ D, the function f is differentiable. ♦
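As a small added illustration of Definition B.6, take f(x) = x^2:

f'(x_0) = \lim_{h \to 0} \frac{(x_0 + h)^2 - x_0^2}{h} = \lim_{h \to 0} \frac{2 x_0 h + h^2}{h} = \lim_{h \to 0} (2 x_0 + h) = 2 x_0 .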
ii) (Mean value theorem) For a continuous function f : [a, b] → R which is differentiable on (a, b), there exists a point ξ ∈ (a, b) such that f'(\xi) = \frac{f(b) - f(a)}{b - a}.
The integral of a (positive) function quantifies the area between the function and the x-axis. A mathematical definition is a bit more complicated.

Definition B.7. Let f(x) : D → R be a function and [a, b] ⊂ D a finite interval such that |f(x)| < ∞ for x ∈ [a, b]. For any n, let t_0 = a < t_1 < \cdots < t_n = b be a partition of [a, b]. The integral of f from a to b is defined as

\int_a^b f(x) \, dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(t_i)(t_i - t_{i-1}).    (B.7)
For non-finite a and b, the definition of the integral can be extended via limits.
Property B.2. (Fundamental theorem of calculus (I)). Let f : [a, b] → R be continuous. For all x ∈ [a, b], let F(x) = \int_a^x f(u) \, du. Then F is continuous on [a, b], differentiable on (a, b) and F'(x) = f(x), for all x ∈ (a, b).
The function F is often called the antiderivative of f . There exists a second form of the
previous theorem that does not assume continuity of f but only Riemann integrability, that
means that an integral exists.
Property B.3. (Fundamental theorem of calculus (II)). Let f : [a, b] → R and let F be such that F'(x) = f(x), for all x ∈ (a, b). If f is Riemann integrable, then \int_a^b f(u) \, du = F(b) - F(a).
There are many ‘rules’ to calculate integrals. One of the most used ones is called integration
by substitution and is as follows.
Property B.4. Let I be an interval and ϕ : [a, b] → I be a differentiable function with integrable derivative. Let f : I → R be a continuous function. Then

\int_{\varphi(a)}^{\varphi(b)} f(u) \, du = \int_a^b f(\varphi(x)) \, \varphi'(x) \, dx.    (B.8)
B.2 Functions in Higher Dimensions

We denote with R^m the vector space with elements x = (x_1, \ldots, x_m)^⊤, called vectors, equipped with the standard operations. We will discuss vectors and vector notation in more detail in the subsequent chapter.
A natural extension of a real function is as follows. The set D is a subset of R^m and thus we write

f : D ⊂ R^m → W
x ↦ f(x).    (B.9)
The partial derivative of f with respect to x_i, denoted ∂f(x)/∂x_i, is the derivative of f seen as a function of the single component x_i, the other components being held fixed. These are collected in the vector

f'(\boldsymbol{x}) = \Bigl( \frac{\partial f(\boldsymbol{x})}{\partial x_1}, \ldots, \frac{\partial f(\boldsymbol{x})}{\partial x_m} \Bigr)^{\!\top}    (B.11)

(provided it exists). ♦
Remark B.1. The existence of partial derivatives is not sufficient for the differentiability of the
function f . ♣
In a similar fashion, higher order derivatives can be calculated. For example, taking the derivative of each component of (B.11) with respect to all components yields a matrix with components

f''(\boldsymbol{x}) = \Bigl( \frac{\partial^2 f(\boldsymbol{x})}{\partial x_i \partial x_j} \Bigr)_{i,j},    (B.12)

for i, j = 1, \ldots, m; this matrix is often called the Hessian.
Property B.5. Let f : D → R with continuous derivatives up to order m + 1. Then there exists ξ ∈ [a, x] such that

f(x) = f(a) + f'(a)(x - a) + \frac{1}{2} f''(a)(x - a)^2 + \ldots + \frac{1}{m!} f^{(m)}(a)(x - a)^m + \frac{1}{(m+1)!} f^{(m+1)}(\xi)(x - a)^{m+1} .    (B.13)

We call (B.13) Taylor’s formula and the last term, often denoted by R_m(x), the remainder of order m. Taylor’s formula is an extension of the mean value theorem.
If the function has bounded derivatives, the remainder R_m(x) converges to zero as x → a. Hence, if the function is at least twice differentiable in a neighborhood of a, then

f(a) + f'(a)(x - a) + \frac{1}{2} f''(a)(x - a)^2    (B.14)

is the best quadratic approximation in this neighborhood.
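As a short added example, the quadratic approximation (B.14) of f(x) = \log(x) around a = 1 reads

\log(x) \approx \log(1) + \frac{1}{1}(x - 1) - \frac{1}{2 \cdot 1^2}(x - 1)^2 = (x - 1) - \frac{1}{2}(x - 1)^2 .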
Taylor’s formula can also be expressed for multivariate real functions. Without stating the precise assumptions, we consider here the following expansion

f(\boldsymbol{a} + \boldsymbol{h}) = \sum_{r=0}^{\infty} \; \sum_{i_1 + \cdots + i_n = r} \frac{1}{i_1! \, i_2! \cdots i_n!} \, \frac{\partial^r f(\boldsymbol{a})}{\partial x_1^{i_1} \cdots \partial x_n^{i_n}} \, h_1^{i_1} h_2^{i_2} \cdots h_n^{i_n} .    (B.16)
Appendix C

Linear Algebra

In this chapter we cover the most important aspects of linear algebra, which are mainly of a notational nature.
The n × n identity matrix I is defined as the matrix with ones on the diagonal and zeros elsewhere. We denote the vector consisting solely of ones by 1; similarly, 0 is a vector with only zero elements. A matrix with entries d_1, \ldots, d_n on the diagonal and zeros elsewhere is denoted by diag(d_1, \ldots, d_n), or diag(d_i) for short, and is called a diagonal matrix. Hence, I = diag(1).
To indicate the ijth element of A, we use (A)_{ij}. The transpose of a vector or a matrix flips its dimensions. When a matrix is transposed, i.e., when all rows of the matrix are turned into columns (and vice versa), the elements a_{ij} and a_{ji} are exchanged. Thus (A^⊤)_{ij} = (A)_{ji}. The vector x^⊤ = (x_1, \ldots, x_p) is termed a row vector. We work mainly with column vectors as shown in (C.1).
In the classical setting of real numbers, there is only one type of multiplication. As soon
as we have several dimensions, several different types of multiplications exist, notably scalar
multiplication, matrix multiplication and inner product (and actually more such as the vector
product, outer product).
Let A and B be two n × p and p × m matrices. Matrix multiplication AB is defined as

AB = C \quad \text{with} \quad (C)_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj} .    (C.2)
This last equation shows that the matrix I is the neutral element (or identity element) of the
matrix multiplication.
Definition C.1. The inner product between two p-vectors x and y is defined as x^⊤ y = \sum_{i=1}^{p} x_i y_i. There are several different notations used: x^⊤ y = ⟨x, y⟩ = x · y.
If for a square matrix A there exists a matrix B such that

AB = BA = I,    (C.3)

then the matrix B is uniquely determined by A and is called the inverse of A, denoted by A^{-1}.
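The notation above translates directly into R (a small added illustration; the numerical values are arbitrary):

A <- matrix( c(2, 1, 0, 3), nrow=2)   # a 2 x 2 matrix, filled column by column
x <- c(1, 2)
A %*% x                               # matrix multiplication
t( A)                                 # transpose
solve( A)                             # inverse of A
sum( x * x)                           # inner product x^T x, equivalently crossprod( x)
diag( 2)                              # 2 x 2 identity matrix I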
Definition C.2. A vector space over R is a set V with the following two operations:

i) + : V × V → V (vector addition);

ii) · : R × V → V (scalar multiplication).

Typically, V is R^p, p ∈ N.
In the following we assume a fixed d and the usual operations on the vectors.
Definition C.3. i) The vectors v_1, \ldots, v_k are linearly dependent if there exist scalars a_1, \ldots, a_k (not all equal to zero), such that a_1 v_1 + \cdots + a_k v_k = 0.

In a set of linearly dependent vectors, at least one vector can be expressed as a linear combination of the others.
Definition C.4. The set of vectors {b_1, \ldots, b_d} is a basis of a vector space V if the set is linearly independent and any other vector v ∈ V can be expressed as v = v_1 b_1 + \cdots + v_d b_d. ♦

ii) All bases of V have the same cardinality, which is called the dimension of V, dim(V).

iii) If there are two bases {b_1, \ldots, b_d} and {e_1, \ldots, e_d}, then there exists a d × d matrix A such that e_i = A b_i, for all i.
Definition C.6. Let A be an n × m matrix. The column rank of the matrix is the dimension
of the subspace that the m columns of A span and is denoted by rank(A). A matrix is said to
have full rank if rank(A) = m.
The row rank is the column rank of A> . ♦
C.3 Projections

We consider classical Euclidean vector spaces with elements x = (x_1, \ldots, x_p)^⊤ ∈ R^p with Euclidean norm ||\boldsymbol{x}|| = \bigl( \sum_i x_i^2 \bigr)^{1/2}.
To illustrate projections, consider the setup illustrated in Figure C.1, where y and a are two vectors in R². The subspace spanned by a is

\{ \lambda \boldsymbol{a} : \lambda \in \mathbb{R} \} = \{ \lambda \, \boldsymbol{a}/||\boldsymbol{a}|| : \lambda \in \mathbb{R} \},    (C.4)

where the second expression is based on a normalized vector a/||a||. By the (geometric) definition of the inner product (dot product),

\boldsymbol{a}^{\top} \boldsymbol{y} = ||\boldsymbol{a}|| \, ||\boldsymbol{y}|| \cos(\theta),    (C.5)

where θ is the angle between the vectors. Classical trigonometric properties state that the length of the projection onto the subspace spanned by a is ||y|| cos(θ). Hence, the projected vector is

\frac{\boldsymbol{a}}{||\boldsymbol{a}||} \, \frac{\boldsymbol{a}^{\top}}{||\boldsymbol{a}||} \, \boldsymbol{y} = \boldsymbol{a} (\boldsymbol{a}^{\top} \boldsymbol{a})^{-1} \boldsymbol{a}^{\top} \boldsymbol{y} .    (C.6)
In statistics we often encounter expressions like this last term. For example, ordinary least squares (“classical” multiple regression) is a projection of the vector y onto the column space of X, i.e., the space spanned by the columns of the matrix X. The projected vector is X(X^⊤X)^{-1}X^⊤ y. Usually, the column space is of much lower dimension than the space containing y.
Figure C.1: Projection of the vector y onto the subspace spanned by a; θ is the angle between the two vectors.
Remark C.1. Projection matrices (like H = X(X^⊤X)^{-1}X^⊤) have many nice properties, such as being symmetric, being idempotent, i.e., H = HH, having eigenvalues within [0, 1] (see next section), rank(H) = rank(X), etc. ♣
A scalar λ and a nonzero vector x satisfying

A x = \lambda x    (C.7)

are called an eigenvalue and an associated eigenvector of the square matrix A. We often denote the eigenvectors by γ_1, \ldots, γ_n. Let Γ be the matrix with columns γ_i, i.e., Γ = (γ_1, \ldots, γ_n). Then, for a symmetric matrix A, the eigenvectors can be chosen such that Γ^⊤Γ = I (orthogonality property) and Γ^⊤ A Γ = diag(λ_1, \ldots, λ_n). This last identity also implies that A = Γ diag(λ_1, \ldots, λ_n) Γ^⊤.
An arbitrary matrix B can be factorized as

B = U D V^{\top},    (C.9)

the so-called singular value decomposition (SVD), where U and V have orthonormal columns and D is a diagonal matrix containing the singular values. Besides the SVD there are many other matrix factorizations. We often use the so-called Cholesky factorization, as – to a certain degree – it generalizes the concept of a square root for matrices. Assume that all eigenvalues of the (symmetric) matrix A are strictly positive; then there exists a unique lower triangular matrix L with positive entries on the diagonal such that A = LL^⊤. There exist very efficient algorithms to calculate L, and solving large linear systems is often based on a Cholesky factorization.
The determinant of a square matrix essentially describes the change in “volume” that the associated linear transformation induces. The formal definition is quite complex, but the determinant can be written as \det(A) = \prod_{i=1}^{n} \lambda_i for matrices with real eigenvalues.
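These decompositions are directly available in R (an added sketch with an arbitrary symmetric positive definite matrix):

A <- matrix( c(4, 2, 2, 3), 2)            # symmetric and positive definite
e <- eigen( A)
all.equal( A, e$vectors %*% diag( e$values) %*% t( e$vectors))  # spectral decomposition
L <- t( chol( A))                         # chol() returns the upper triangular factor
all.equal( A, L %*% t( L))                # Cholesky factorization A = L L^T
c( det( A), prod( e$values))              # determinant equals the product of the eigenvalues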
For a non-singular matrix A, written as a 2 × 2 block matrix (with square matrices A_{11} and A_{22}), we have

A^{-1} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^{-1}
       = \begin{pmatrix} A_{11}^{-1} + A_{11}^{-1} A_{12} C A_{21} A_{11}^{-1} & -A_{11}^{-1} A_{12} C \\ -C A_{21} A_{11}^{-1} & C \end{pmatrix},    (C.13)

with C = (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1}.
Bland, J. M. and Bland, D. G. (1994). Statistics notes: One and two sided tests of significance.
BMJ, 309, 248.
Box, G. E. P. and Draper, N. R. (1987). Empirical Model-building and Response Surfaces. Wiley.
Brown, L. D., Cai, T. T., and DasGupta, A. (2002). Confidence intervals for a binomial propor-
tion and asymptotic expansions. The Annals of Statistics, 30, 160–201.
Cleveland, W. S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, U.S.A.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Routledge.
Dalal, S. R., Fowlkes, E. B., and Hoadley, B. (1989). Risk analysis of the space shuttle: Pre-Challenger prediction of failure. Journal of the American Statistical Association, 84, 945–957.
Devore, J. L. (2011). Probability and Statistics for Engineering and the Sciences. Brooks/Cole,
8th edition.
Fahrmeir, L., Kneib, T., and Lang, S. (2009). Regression: Modelle, Methoden und Anwendungen.
Springer, 2 edition.
Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). Regression: Models, Methods and
Applications. Springer.
Faraway, J. J. (2006). Extending the Linear Model with R: Generalized Linear, Mixed Effects
and Nonparametric Regression Models. CRC Press.
Farcomeni, A. (2008). A review of modern multiple hypothesis testing, with particular attention
to the false discovery proportion. Statistical Methods in Medical Research, 17, 347–388.
Fisher, R. A. (1938). Presidential address. Sankhyā: The Indian Journal of Statistics, 4, 14–17.
Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data
analysis. IEEE Transactions on Computers, C-23, 881–890.
Gelman, A., Lee, D., and Guo, J. (2015). Stan: A probabilistic programming language for Bayesian inference and optimization. Journal of Educational and Behavioral Statistics, 40, 530–543.
Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences.
Statistical Science, 7, 457–511.
Held, L. (2008). Methoden der statistischen Inferenz: Likelihood und Bayes. Springer, Heidelberg.
Hollander, M. and Wolfe, D. A. (1999). Nonparametric Statistical Methods. John Wiley & Sons.
Hüsler, J. and Zimmermann, H. (2010). Statistische Prinzipien für medizinische Projekte. Huber,
5 edition.
Jeffreys, H. (1983). Theory of probability. The Clarendon Press Oxford University Press, third
edition.
Johnson, N. L., Kemp, A. W., and Kotz, S. (2005). Univariate Discrete Distributions. Wiley-
Interscience, 3rd edition.
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994). Continuous Univariate Distributions,
Vol. 1. Wiley-Interscience, 2nd edition.
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995). Continuous Univariate Distributions,
Vol. 2. Wiley-Interscience, 2nd edition.
Kruschke, J. K. (2010). Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic
Press, first edition.
Kruschke, J. K. (2015). Doing Bayesian Data Analysis: A Tutorial with R, JAGS and Stan.
Academic Press/Elsevier, second edition.
Kupper, T., De Alencastro, L., Gatsigazi, R., Furrer, R., Grandjean, D., and Tarradellas, J. (2008). Concentrations and specific loads of brominated flame retardants in sewage sludge. Chemosphere, 71, 1173–1180.
Landesman, R., Aguero, O., Wilson, K., LaRussa, R., Campbell, W., and Penaloza, O. (1965).
The prophylactic use of chlorthalidone, a sulfonamide diuretic, in pregnancy. J. Obstet. Gy-
naecol., 72, 1004–1010.
Lindgren, F. and Rue, H. (2015). Bayesian spatial modelling with R-INLA. Journal of Statistical
Software, 63, i19.
Lunn, D., Jackson, C., Best, N., Thomas, A., and Spiegelhalter, D. (2012). The BUGS Book:
A Practical Introduction to Bayesian Analysis. Texts in Statistical Science. Chapman &
Hall/CRC.
Moyé, L. A. and Tita, A. T. (2002). Defending the rationale for the two-tailed test in clinical
research. Circulation, 105, 3062–3065.
Olea, R. A. (1991). Geostatistical Glossary and Multilingual Dictionary. Oxford University Press.
Petersen, K. B. and Pedersen, M. S. (2008). The Matrix Cookbook. Version 2008-11-14, http:
//matrixcookbook.com.
Plagellat, C., Kupper, T., Furrer, R., de Alencastro, L. F., Grandjean, D., and Tarradellas, J.
(2006). Concentrations and specific loads of UV filters in sewage sludge originating from a
monitoring network in Switzerland. Chemosphere, 62, 915–925.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003). Vienna, Austria.
Plummer, M. (2016). rjags: Bayesian Graphical Models using MCMC. R package version 4-6.
R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria.
Raftery, A. E. and Lewis, S. M. (1992). One long run with diagnostics: Implementation strategies
for Markov chain Monte Carlo. Statistical Science, 7, 493–497.
Ruchti, S., Kratzer, G., Furrer, R., Hartnack, S., Würbel, H., and Gebhardt-Henrich, S. G. (2019). Progression and risk factors of pododermatitis in part-time group housed rabbit does in Switzerland. Preventive Veterinary Medicine, 166, 56–64.
Ruchti, S., Meier, A. R., Würbel, H., Kratzer, G., Gebhardt-Henrich, S. G., and Hartnack, S. (2018). Pododermatitis in group housed rabbit does in Switzerland: prevalence, severity and risk factors. Preventive Veterinary Medicine, 158, 114–121.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian
models by using integrated nested Laplace approximations. Journal of the Royal Statistical
Society B, 71, 319–392.
Siegel, S. and Castellan Jr, N. J. (1988). Nonparametric Statistics for The Behavioral Sciences.
McGraw-Hill, 2nd edition.
Sturtz, S., Ligges, U., and Gelman, A. (2005). R2WinBUGS: A package for running WinBUGS from
R. Journal of Statistical Software, 12, 1–16.
Swayne, D. F., Temple Lang, D., Buja, A., and Cook, D. (2003). GGobi: evolving from XGobi
into an extensible framework for interactive data visualization. Computational Statistics &
Data Analysis, 43, 423–444.
Tufte, E. R. (1997a). Visual and Statistical Thinking: Displays of Evidence for Making Decisions.
Graphics Press.
Tufte, E. R. (1997b). Visual Explanations: Images and Quantities, Evidence and Narrative.
Graphics Press.
Wasserstein, R. L. and Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70, 129–133.
Glossary
:= Define the left hand side by the expression on the other side.
♣, ♦ End of example or remark, end of definition.
∫, Σ, Π Integration, summation and product symbols. If there is no ambiguity, we omit the limits.
\binom{n}{k} Binomial coefficient, defined as \binom{n}{k} = \frac{n!}{k!(n-k)!}.
In = I Identity matrix, I = (δij ).
I{A} Indicator function, taking the value one if A is true and zero otherwise.
lim Limit.
log(·) Logarithmic function to the base e.
max{A}, min{A} Maximum, minimum of the set A.
N, Nd Space of natural numbers, of d-vectors with natural elements.
ϕ(x) Gaussian probability density function ϕ(x) = (2π)^{-1/2} \exp(-x^2/2).
Φ(x) Gaussian cumulative distribution function Φ(x) = \int_{-\infty}^{x} ϕ(z) \, dz.
π Transcendental number π = 3.14159 26535.
P(A) Probability of the event A.
R, Rn , Rn×m Space of real numbers, real n-vectors and real (n × m)-matrices.
rank(A) The rank of a matrix A is defined as the number of linearly independent rows
(or columns) of A.
tr(A) Trace of a matrix A, defined as the sum of its diagonal elements.
Var(X) Variance of the random variable X.
Z, Zd Space of integers, of d-vectors with integer elements.
The following table contains the abbreviations of the statistical distributions (dof denotes degrees
of freedom).
The following table contains the abbreviations of the statistical methods, properties and quality
measures.
Video Index
The following index gives a short description of the available videos, including a link to the
referenced page. The videos are uploaded to https://tube.switch.ch/.
Chapter 0
What are all these videos about?, vi
Chapter 7
Construction of general multivariate normal variables, 112
Important comment about an important equation, 112
Proof that the correlation is bounded, 107
Properties of expectation and variance in the setting of random vectors, 108
Two classical estimators and estimates for random vectors, 113
Chapter A
Installing RStudio, 204