Computational Statistics: An Introduction to R, 1st Edition
Günther Sawitzki
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted,
or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
What Is R?
For a long time the AT&T implementation of S has been the reference for the S language.
Today, S is available in a commercial version called S-Plus <http://www.insightful.com/>
(based on the AT&T implementation) and as a free software version R (see footnote 1), or
“Gnu S” <http://www.r-project.org/>.
1 R got its name by accident — the same accident that made the first names of the original authors of
R (Ross Ihaka and Robert Gentleman) start with R.
© 2009 Taylor & Francis Group, LLC
In the meantime, R has become the reference implementation. Essentially, more precise
definitions and, if necessary, even modifications of the language are given by R. For
simplicity, here and in the sequel we use “R language” as a common term even where
the precise term should be “the S language using the R implementation”.
R is an interpreted programming language. Instructions in R are executed immediately.
In addition to the original elements of S, R has several extensions. Some of these have
been introduced in response to recent developments in statistics, some are introduced
to open experimental facilities. Advancements in the S language are taken into account.
The most recent (2008) version of R is 2.x. This version is largely compatible with the
previous version R 1.x. The essential changes are internal to the system. For initial use,
there is no significant difference from R 1.x. For the advanced user, there are three
essential innovations:
• Graphics: The basic graphics system in R implements a model inspired by pen and
paper drawing. A graphics port (paper) is opened, and lines, points or symbols are
drawn in this port. In R 2.x there is a second graphics system, oriented towards
a viewport/object model. Graphical objects in various positions and orientations are
mapped in a visual space.
• Packages: The original R had a linear command history and a uniform workspace.
R 2.x introduced improved support for “packages” that may have encapsulated
information. To support this, language concepts such as “name spaces” and various
tools have been introduced.
• Internationalisation: Originally, R was based on the English language, and ASCII
was the general encoding used. With R 2.x extensive support for other languages and
encodings has been introduced. With this, it has become possible to provide localised
versions.
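The role of name spaces can be illustrated with a small sketch. It is not taken from the text: the masking function below is a made-up example, and it assumes only that the standard base package is available. It shows that a definition in the workspace can hide a function of the same name, while the :: operator still reaches the version encapsulated in the package.

```r
# A user-defined function that masks a name from the base package
# (hypothetical example, defined only for this illustration)
sum <- function(...) "masked"

masked_result <- sum(1, 2)        # calls the masking definition in the workspace
base_result   <- base::sum(1, 2)  # :: reaches the function in the base name space

rm(sum)                           # remove the masking definition again
restored_result <- sum(1, 2)      # the base function is visible once more
```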
Two aspects of R are active areas of recent development: interactive access and
integration in a networked environment. These and other aspects are part of Omegahat,
an attempt to develop a next generation system based on the experiences from R. This
more experimental project is accessible at <http://www.omegahat.org/>. R already
provides simple possibilities to call functions implemented in other languages like
C or FORTRAN. Omegahat extends these possibilities and allows direct access to Java,
Perl, ... A Java-based graphical interface for R is JGR, accessible at
<http://stats.math.uni-augsburg.de/software/>. A collection of interactive displays
for R is in iplots, available at the same site.
Recent developments related to R are in <http://r-forge.r-project.org/>. Many
helpful extensions are in <http://www.bioconductor.org/>, a site that is targeted at
biocomputing.
R has been developed for practical work in statistics. Usability often has been given pri-
ority over abstract design principles. As a consequence, it is not easy to give a systematic
In its basic version, R contains more than 1500 functions, too many to introduce in just
one course, and too many to learn. This course can only open the door to R.
Course participants can come from various backgrounds, with different prerequisites.
For pupils and younger students, a mere programming course on technical basics may
be appropriate. Later, questions about meaningful classification and background will be
more important. This is the aim of the present course. The “technical” material provides
a skeleton. Beyond this, we try to open the view for statistical questions and to stimulate
interest in the background. This course should whet the appetite for the substance that
may be offered in a subsequent well-founded statistical course.
The first part of this course material is organised by themes, using example topics to
illustrate how R can be used to tackle statistical problems.
The appendix provides a collection of R language elements and functions. During the
course, it can serve as a quick reference and perhaps as a starting point and orientation
to access the rich information material that comes bundled with R. After the course, it
may serve as a note pad. Finally, in the long run for practical work, the online help and
online manuals for R are the prime sources of information. This appendix is not meant
to be comprehensive. If a concise syntax description or example could be given, it is
included. In other cases, the online help information should be consulted.
A generous time slot for exercises is recommended (an additional half day for the introductory
exercises, an additional half day for one of the project exercises). The course
can then be covered in one week, provided follow-up facilities are established to answer
questions that have come up, and possibilities are available to follow the interest in the
statistical background that may have resulted.
At a more leisurely pace, Chapter 1 with its exercises can be used on its own. This should
provide a working base to use R, and more material from the subsequent chapters can be
added later as needed. The first chapter is fairly self-contained, including the necessary
basic definitions of statistical terms. The other chapters assume that the reader can look
up terms if necessary.
Using the course during a term in a weekly class requires more time, since time for
repetition must be allowed for. Each of the first four chapters will cover about four lectures, plus
time for exercises. For this time schedule, a course covering the statistical background
is recommended, running in parallel with this one.
For a subsequent self-paced study that goes into detail on R as a programming language,
the recommended reading is ([51]).
Statistical literature is evolving, and new publications will be available at the time you
read this text. Instead of giving a long list of the relevant literature available at the time
this text is written, the sections include keywords that can be used to locate up-to-date
literature.
For economic reasons, most of the illustrations are printed in black and white. Colour
versions are available at the web site accompanying this book:
http://sintro.r-forge.r-project.org/.
Examples and input code are formatted so that they can be used with “Cut & Paste”
and entered as program input. To allow this, punctuation marks are omitted and the
input code is shown without a “prompt”. For example:
Example 0.1:
Input
1 + 2
Output
3
instead of the form shown in an interactive session, with the prompt >:
> 1+2
[1] 3
>
Acknowledgements
Thanks to the R core team for comments and hints. Special thanks to Friedrich Leisch (R
core team) and Antony Unwin (Univ. Augsburg) who worked through an early version
of this manuscript. Thanks to Rudy Beran, Lucien Birgé, Dianne Cook, Lutz Dümbgen,
Jan Johannes, Deepayan Sarkar, Bill Venables, Ali Ünlü and Adalbert Wilhelm for
comments and hints.
Thanks to Dagmar Neubauer and Shashi Kumar for helping with the TeX pre-production,
and very special thanks to Gerben Wierda for making the necessary tools accessible.
[2] Becker, R.A.; Chambers, J.M.; Wilks, A.R. (1988): The New S Language.
Chapman & Hall, New York.
[52] Venables, W.N.; Ripley, B.D. (2002): Modern Applied Statistics with S.
Springer, Heidelberg.
See: <http://www.stats.ox.ac.uk/pub/MASS4/>.
Introduction
Executing Files
1.5.6 Packages
1.6 Statistical Summary
1.7 Literature and Additional References
2 Regression
2.1 General Regression Model
2.2 Linear Model
2.2.1 Factors
2.2.2 Least Squares Estimation
2.2.3 Regression Diagnostics
2.2.4 More Examples for Linear Models
2.2.5 Model Formulae
2.2.6 Gauss-Markov Estimator and Residuals
2.3 Variance Decomposition and Analysis of Variance
2.4 Simultaneous Inference
2.4.1 Scheffé’s Confidence Bands
2.4.2 Tukey’s Confidence Intervals
Case Study: Titre Plates
2.5 Beyond Linear Regression
Transformations
2.5.1 Generalised Linear Models
2.5.2 Local Regression
2.6 R Complements
2.6.1 Discretisation
2.6.2 External Data
2.6.3 Testing Software
2.6.4 R Data Types
2.6.5 Classes and Polymorphic Functions
2.6.6 Extractor Functions
2.7 Statistical Summary
2.8 Literature and Additional References
4 Dimensions 1, 2, 3, ..., ∞
4.1 R Complements
4.2 Dimensions
4.3 Selections
4.4 Projections
4.4.1 Marginal Distributions and Scatter Plot Matrices
4.4.2 Projection Pursuit
4.4.3 Projections for Dimensions 1, 2, 3, ... 7
4.4.4 Parallel Coordinates
4.5 Sections, Conditional Distributions and Coplots
4.6 Transformations and Dimension Reduction
4.7 Higher Dimensions
4.7.1 Linear Case
Partial Residuals and Added Variable Plots
4.7.2 Non-Linear Case
Example: Cusp Non-Linearity
4.7.3 Case Study: Melbourne Temperature Data
4.7.4 Curse of Dimensionality
4.7.5 Case Study: Body Fat
4.8 High Dimensions
4.9 Statistical Summary
References
Like any programming language, R has certain conventions. Here are the basic rules.
R Conventions
Numbers
    A point is used as a decimal separator. Numbers can be written
    in exponential form; the exponential part is introduced by E.
    Numbers can be complex numbers; the imaginary part is marked
    by i.
    Example: 1
             2.3
             3.4E5
             6i+7.8
    Numbers can take the values Inf, -Inf, NaN for “not a number”
    and NA for “not available” = missing.
    Example: 1/0 results in Inf
             0/0 results in NaN
    NA is used as a placeholder for missing numbers.
Strings
    Strings are delimited by " or '.
    Example: "ABC"
             'def'
             "gh'ij"
Comments
    Comments start with # and go to the end of the current line.
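The conventions for numbers, strings and comments can be tried out directly. The following lines are a minimal sketch; the variable names are chosen here for illustration only and do not come from the text.

```r
# Numbers: decimal point, exponential notation, complex numbers
a <- 2.3
b <- 3.4E5            # the same value as 340000
z <- 6i + 7.8         # complex number: real part 7.8, imaginary part 6

# Special values
pos_inf  <- 1/0       # Inf
not_num  <- 0/0       # NaN
miss_val <- NA        # placeholder for a missing value

# is.nan() and is.na() test for these values; NaN also counts as NA
nan_check <- is.nan(not_num)
na_check  <- is.na(miss_val)

# Strings: either quote character may be used, the other can appear inside
s <- "gh'ij"
n_chars <- nchar(s)   # number of characters in the string
```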
Objects
    The basic elements in R are objects. Objects have types, for
    example logical or integer. Objects can have a class attribute
    specifying more complex type information.
    Example: The basic objects in R are vectors.
Names
    R objects can have names, by which they can be accessed.
    Names begin with a letter or a dot, followed by a sequence of
    letters, digits, or the special characters _ or .
    Examples: x
              y_1
    Lower- and uppercase are treated as different.
    Examples: Y87
              y87
Assignments
    Assignments have the form
    Syntax: name <- value or alternatively name = value.
    Example: a <- 10
             x <- 1:10
Queries
    If only the name of an object is entered, the value of the object
    is returned.
    Example: x
Indices
    Vector components are accessed by index. The lowest index is 1.
    Example: x[3]
    The indices can be specified directly, or using symbolic names or
    rules.
    Examples:
    a[1]             the first element
    x[-3]            all elements except the third
    x[x^2 < 10]      all elements where x^2 < 10
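The rules for names, assignments and indices can be combined in a short session. This is a sketch for illustration; the vector v and its element names "low", "mid" and "high" are made up here and are not part of the text.

```r
x <- 1:10                  # assign the vector 1, 2, ..., 10 to the name x
Y87 <- "upper"             # names are case sensitive:
y87 <- "lower"             # Y87 and y87 are two different objects

third     <- x[3]          # third element (the lowest index is 1)
without_3 <- x[-3]         # all elements except the third
small_sq  <- x[x^2 < 10]   # elements whose square is below 10

# Symbolic names as indices (names invented for this example)
v <- c(low = 1, mid = 5, high = 9)
mid_value <- v["mid"]
```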
Help and Inspection
Help
    Documentation and additional information about an object can
    be requested using help.
    Syntax: help(name)
    Examples: help(exp)
              help(x)
    Alternative form ?name
    Examples: ?exp
              ?x
    A hypertext (currently HTML) version of R’s online documentation
    is activated by help.start(). This allows us to search
    by topics, and provides a more structured access to information.
Inspection
    help() can only provide information that has been prepared in
    advance. str() can inspect the actual state of an object and
    display this information.
    Syntax: str(object, ...)
    Examples: str(x)
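The difference between prepared documentation and inspection of the current state can be seen in a short sketch. The calls class(), typeof() and args() are standard R functions that the text does not mention; they are added here as further ways to inspect an object.

```r
x <- runif(5)

str(x)                    # compact display of the current state of x
x_class <- class(x)       # class of the object: "numeric"
x_type  <- typeof(x)      # internal type of the object: "double"

# help(runif) or ?runif would open the prepared documentation page.
# The argument list with its defaults can also be displayed directly:
args(runif)
```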
Functions
    Function calls in R have the form:
    Syntax: name(argument ...)
    Example: e_10 <- exp(10)
    This convention holds even when there are no arguments at all.
    Example: To quit R, you call a “quit” function q().
    Function arguments are treated in a very flexible way. They can
    have default values, which are used if no explicit argument value
    is given.
    Examples: log(x, base = exp(1))
    Functions can be polymorphic. For a polymorphic function,
    the actual function is determined by the class of the actual
    arguments.
    Examples: plot(x)        # a one-dimensional serial plot
              plot(x, x^2)   # a two-dimensional scatter plot
              summary(x)
Operators
    When applied to vectors, operators operate on each of the vector
    components.
    Example: For vectors y, z, the product y*z is the vector
             of component-wise products.
    Operators are special functions. They can be called in prefix form
    (function form).
    Example: "+"(x, y)
    When applied to two operands with different lengths, the smaller
    operand is repeated cyclically.
    Example: (1:2)*(1:6)
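The conventions for functions and operators can be checked in a few lines. This is a sketch with variable names chosen for illustration; the factor used in the last line is an invented example of an object with a different class.

```r
x <- 1:6

log_default <- log(100)               # base defaults to exp(1): natural logarithm
log_base10  <- log(100, base = 10)    # explicit base 10

# Operators act component-wise and can be called in function form
y <- x * x
same <- identical("*"(x, x), y)

# Cyclic repetition: the shorter operand 1:2 is used as 1 2 1 2 1 2
recycled <- (1:2) * (1:6)             # 1 4 3 8 5 12

# Polymorphism: summary() selects its method from the class of the argument
summary(x)                            # numeric summary
summary(factor(c("a", "b", "a")))     # counts per level
```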
Our subject is statistical methods. As a first step, we apply the methods in simulations,
that is, we use synthetic data. Generating these data is largely under our control. This
gives us the opportunity to gain experience with the methods and allows a critical
evaluation. Only then will we use the methods for data analysis.
Random variables with a uniform distribution can be generated by the function runif().
Using help(runif) or ?runif, we get information on how to use this function:
help(runif)
Description
These functions provide information about the uniform distribution on the interval
from min to max. dunif gives the density, punif gives the distribution function,
qunif gives the quantile function and runif generates random deviates.
Usage
dunif(x, min=0, max=1, log = FALSE)
punif(q, min=0, max=1, lower.tail = TRUE, log.p = FALSE)
qunif(p, min=0, max=1, lower.tail = TRUE, log.p = FALSE)
runif(n, min=0, max=1)
Arguments
x,q vector of quantiles.
p vector of probabilities.
n number of observations. If length(n) > 1, the length is taken to
be the number required.
min,max lower and upper limits of the distribution. Must be finite.
log, log.p logical; if TRUE, probabilities p are given as log(p).
lower.tail logical; if TRUE (default), probabilities are P [X ≤ x], otherwise,
P [X > x].
Details
If min or max are not specified they assume the default values of 0 and 1 respectively.
The uniform distribution has density
    f(x) = 1 / (max − min)
for min ≤ x ≤ max.
For the case of u := min == max, the limit case of X ≡ u is assumed, although
there is no density in that case and dunif will return NaN (the error condition).
runif will not generate either of the extreme values unless max = min or max-min
is small compared to min, and in particular not for the default arguments.
See Also
.Random.seed about random number generation, rnorm, etc for other distributions.
Examples
u <- runif(20)
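The four functions described in the help page can be compared on one example. The interval from 2 to 6 is an arbitrary choice made for this sketch; the comments follow from the density formula given in the help text.

```r
d_default <- dunif(0.5)                      # density on (0, 1): 1 / (1 - 0)
d_val <- dunif(3, min = 2, max = 6)          # 1 / (6 - 2) = 0.25
p_val <- punif(3, min = 2, max = 6)          # P[X <= 3] = (3 - 2) / (6 - 2) = 0.25
q_val <- qunif(0.25, min = 2, max = 6)       # inverse of punif: 3
p_upper <- punif(3, min = 2, max = 6,
                 lower.tail = FALSE)         # P[X > 3] = 0.75

u <- runif(20, min = 2, max = 6)             # 20 random deviates
range(u)                                     # smallest and largest value drawn
```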
The help information tells us that as an argument for runif() we have to supply the
number n of random variates to generate. As additional arguments for runif() we can
specify the minimum and the maximum for the range of the random numbers. If we do
not specify additional arguments, the default values min = 0 and max = 1 are taken.
For example, runif(100) generates a vector with 100 uniform random numbers with
range (0, 1). Calling runif(100, -10, 10) generates a vector with 100 uniform random
numbers in the range (−10, 10).
The additional arguments can be supplied in the defined order, or specified by name.
If the name of the argument is given, the position can be chosen freely. So instead
of runif(100, -10, 10) it is possible to use runif(100, min = -10, max = 10) or
runif(100, max = 10, min = -10). Using the name, it is also possible to set only
chosen arguments. For example, if the minimum is not specified, the default value for
the minimum is taken: using runif(100, max = 10) is equivalent to runif(100,
min = 0, max = 10). For better readability, we often write the names of arguments,
even when this is not necessary.
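That these calls are equivalent can be verified by resetting the random number generator before each call. The function set.seed() is a standard R function that the text has not introduced at this point; the seed value 1 is an arbitrary choice for this sketch.

```r
# Arguments by position, by name, and by name in a different order
set.seed(1); by_position <- runif(100, -10, 10)
set.seed(1); by_name     <- runif(100, min = -10, max = 10)
set.seed(1); by_name_rev <- runif(100, max = 10, min = -10)

same_1 <- identical(by_position, by_name)
same_2 <- identical(by_position, by_name_rev)

# Setting only one argument by name: min keeps its default value 0
set.seed(1); max_only    <- runif(100, max = 10)
set.seed(1); min_and_max <- runif(100, min = 0, max = 10)

same_3 <- identical(max_only, min_and_max)
```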
Each execution of runif(100) generates 100 new uniform random numbers. We can store
these numbers.
x <- runif(100)
assigns the values to the name x. Entering the name
x
returns the values. By default, they are written to the output, and we can inspect the result.
We get a graphical representation, the serial plot, a scatterplot of the entries of x against
their running index, by using
plot(x)
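The running index used by the serial plot can also be written out explicitly. The following sketch assumes that a graphics device is available; seq_along() and head() are standard R functions not mentioned in the text.

```r
x <- runif(100)         # store 100 uniform random numbers

head(x)                 # inspect the first few values
n <- length(x)          # number of stored values

plot(x)                 # serial plot: x against its running index
plot(seq_along(x), x)   # the same points, with the index 1, 2, ..., n given explicitly
```

Both plot calls draw the same points; only the default axis labels differ.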