Probability and Statistics For Engineers
First Edition
Seth Antanah
Contents
1 An Introduction to R
1.1 Introduction
1.2.1 Installing R
1.3.1 Arithmetic
1.3.3 Vectors
2 Introduction to Probability
2.2 Introduction
2.2.8 Definition
2.3 Questions
3 Introduction to Statistics
3.9.1 Gaps
3.9.2 Outliers
3.13.6 Skewness
3.13.7 Kurtosis
3.13.8 Questions
4.1 Introduction
6 Estimations
6.0.1 Introduction
6.0.5 Confidence Interval For The Difference Between Two Population Proportions
8 Regression
Even when flawless, the process of synthesis that all ‘data’ goes through before the
communication step entails by its very nature reshaping and loss of information. This
book is designed to cater to the needs of those who want to delve into the practical
aspects of statistics without delving deeply into the theoretical underpinnings of
the subject. It serves as a handy reference guide for common statistical techniques
frequently employed in fields like business, demography, and health.
1. Introduction to statistics
2. Introduction to probability
3. Random variables and distributions
4. Special distribution
5. Estimations
6. Hypothesis testing
7. Regression
The initial section comprises a single chapter that introduces the rationale behind
studying statistics and establishes fundamental definitions essential for the course.
Sections two through four are subdivided into chapters, each dedicated to elucidating
a particular concept or technique that complements the overarching theme of the
respective section. To illustrate, the section on descriptive statistics is further divided
into two parts: one that delves into graphical methods for summarizing data and
another that explores numerical data summaries. These sections employ real-world
examples to elucidate the techniques, and readers can reinforce their understanding
through practice problems conveniently embedded within the chapters.
1. An Introduction to R
1.1 Introduction
Having worked through this chapter the student will be able to:
The instructions for obtaining R largely depend on the user’s hardware and operating
system. The R Project has written an R Installation and Administration manual with
complete, precise instructions about what to do, together with all sorts of additional
information. The following is just a primer to get a person started.
1.2.1 Installing R
Visit one of the links below to download the latest version of R for your operating
system:
1. Microsoft Windows: http://cran.r-project.org/bin/windows/base/
2. MacOS: http://cran.r-project.org/bin/macosx/
3. Linux: http://cran.r-project.org/bin/linux/
There are base packages (which come with R automatically), and contributed packages
(which must be downloaded for installation). For example, on the version of R being
used for this document the default base packages loaded at startup are
The base packages are maintained by a select group of volunteers, called “R Core”. In
addition to the base packages, there are literally thousands of additional contributed
packages written by individuals all over the world. These are stored worldwide on
mirrors of the Comprehensive R Archive Network, or CRAN for short. Given an
active Internet connection, anybody is free to download and install these packages
and even inspect the source code. To install a package named foo, open up R and
type install.packages("foo"). To install foo and additionally install all of the other
packages on which foo depends, instead type install.packages("foo", dependencies =
TRUE). The general command install.packages() will (on most operating systems)
open a window containing a huge list of available packages; simply choose one or
more to install. No matter how many packages are installed onto the system, each
one must first be loaded for use with the library function. For instance, the foreign
package [18] contains all sorts of functions needed to import data sets into R from other
software such as SPSS, SAS, etc. But none of those functions will be available until the
command library(foreign) is issued. Type library() at the command prompt (described
below) to see a list of all available packages in your library. For complete, precise
information regarding installation of R and add-on packages, see the R Installation
and Administration manual, http://cran.r-project.org/manuals.html.
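A minimal sketch of the install-and-load cycle described above, using the foreign package mentioned earlier as the example:
# Install a contributed package from a CRAN mirror (done once per system)
install.packages("foreign", dependencies = TRUE)
# Load the package so its functions become available in this session
library(foreign)
# List all packages available in your library
library()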
1.3.1 Arithmetic
# Addition
a <- 5
b <- 3
result <- a + b
print(result)
# Subtraction
a <- 5
b <- 3
result <- a - b
print(result)
# Multiplication
a <- 5
b <- 3
result <- a * b
print(result)
# Division
a <- 6
b <- 2
result <- a / b
print(result)
> options(digits = 16)
> 10/3            # see more digits
[1] 3.333333333333333
> sqrt(2)         # square root
[1] 1.414213562373095
> exp(1)          # Euler's constant, e
[1] 2.718281828459045
> pi
[1] 3.141592653589793
> options(digits = 7)   # back to default
Note that it is possible to set digits up to 22, but setting it over 16 is not
recommended (the extra significant digits are not necessarily reliable). Notice above
the sqrt function for square roots and the exp function for powers of e, Euler's number.
When choosing a variable name you can use letters, numbers, dots (.), or underscore
(_) characters. You cannot use mathematical operators, and a leading dot may not be
followed by a number. Examples of valid names are: x, x1, y.value, and y.hat. (More
precisely, the set of allowable characters in object names depends on one’s particular
system and locale; see An Introduction to R for more discussion on this.) Objects can
be of many types, modes, and classes. At this level, it is not necessary to investigate
all of the intricacies of the respective types, but there are some with which you need
to become familiar:
> (0+1i)^2
[1] -1+0i
> typeof((0+1i)^2)
[1] "complex"
1.3.3 Vectors
All of this time we have been manipulating vectors of length 1. Now let us move to
vectors with multiple entries.
R Entering data vectors: If you would like to enter the data 74, 31, 95, 61, 76, 34, 23, 54, 96
into R, you may create a data vector with the c function (which is short for
concatenate).
> x <- c(74, 31, 95, 61, 76, 34, 23, 54, 96)
> x
[1] 74 31 95 61 76 34 23 54 96
> x[1]
[1] 74
> x[2:4]
[1] 31 95 61
> x[c(1, 3, 4, 8)]
[1] 74 95 61 54
> x[-c(1, 3, 4, 8)]
[1] 31 76 34 23 96
# Notice that we used the minus sign to specify those elements that we want to exclude
> LETTERS[1:5]
[1] "A" "B" "C" "D" "E"
> letters[-(6:24)]
[1] "a" "b" "c" "d" "e" "y" "z"
> x <- 1:5
> sum(x)
[1] 15
> length(x)
[1] 5
> min(x)
[1] 1
> mean(x)
[1] 3
> sd(x)   # sample standard deviation
[1] 1.581139
Typing a function's name without parentheses prints its definition. For example:
> intersect
function (x, y)
{
    y <- as.vector(y)
    unique(y[match(as.vector(x), y, 0L)])
}
For generic functions, the visible definition only dispatches on the class of the argument:
> wilcox.test
function (x, ...)
UseMethod("wilcox.test")
> methods(wilcox.test)
[1] wilcox.test.default* wilcox.test.formula*
    Non-visible functions are asterisked
When you are using R, it will not take long before you find yourself needing
help. Fortunately, R has extensive help resources and you should immediately
become familiar with them. Begin by clicking Help on Rgui. The following
options are available.
1. Console: gives useful shortcuts, for instance, Ctrl+L, to clear the R console
screen.
2. FAQ on R: frequently asked questions concerning general R operation.
3. FAQ on R for Windows: frequently asked questions about R, tailored to
the Microsoft Windows operating system.
2. Introduction to Probability
Having worked through this chapter the student will be able to:
2.2 Introduction
Computing in R:
Let’s examine the random experiment involving dropping a Sty-
rofoam cup from a height of four feet to the floor. After the cup
hits the ground, it eventually comes to rest, and there are three
possible outcomes: it could land upside down, right side up, or
on its side. These potential results of the random experiment are
represented as follows
> S <- data.frame(lands = c("down", "up", "side"))
> S
  lands
1  down
2    up
3  side
Here the sample space contains the column lands which stores the
outcomes "down", "up", and "side".
We can also use the package "prob" in computing the sample space
of an experiment in R. Consider the random experiment of tossing
a coin. The outcomes are H and T. We can set up the sample space as follows:
> library(prob)
> tosscoin(1)
  toss1
1     H
2     T
This is based on the assumption that the outcomes of an experiment are equally
likely. For example, if an experiment can lead to n mutually exclusive and
equally likely outcomes, of which n(A) are favourable to the event A, then the
probability of the event A is defined by
P(A) = n(A)/n
This concept uses the relative frequencies of past occurrences to develop
probabilities for the future. The probability of an event A happening in the future
is determined by observing what fraction of the time similar events happened in
the past. That is,
P(A) ≈ (number of times A occurred)/(total number of trials)
The relative frequency of the occurrence of the event A used to estimate P(A)
becomes more accurate if the number of trials is large. The relative frequency
approach of defining P(A) is sometimes called posterior probability because
P(A) is determined only after event A is observed.
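As a hedged illustration of the relative frequency approach, the following sketch simulates tosses of an assumed fair coin and estimates P(H) from the observed fraction of heads:
# Simulate 10000 tosses of a fair coin (assumed for illustration)
set.seed(1)                        # for reproducibility
tosses <- sample(c("H", "T"), size = 10000, replace = TRUE)
rel_freq <- mean(tosses == "H")    # fraction of tosses showing heads
print(rel_freq)                    # approaches the true value 0.5 as trials grow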
Examples
Solution:
Solution:
i
P(A) = n(A)/n(S) = 13/52 = 1/4
ii
P(B or Q) = P(B) + P(Q)
          = n(B)/n(S) + n(Q)/n(S) = 4/52 + 4/52 = 2/13
(2) The sample space consists of 4 red, 2 blue and 3 white balls (S = {4R, 2B, 3W},
9 balls in all); let R = drawing a red ball. Then
P(R) = n(R)/n(S) = 4/9
R A die is tossed twice. List all the outcomes in each of the following
events and compute the probability of each event.
Solution:
The sample space for the experiment is the set of ordered pairs
(m, n), where each of m and n takes the values 1, 2, 3, 4, 5 and 6. Thus,
P(A) = 3/36 = 1/12
P(B) = 6/36 = 1/6
P(D) = 15/36 = 5/12
E = {(4, 5), (4, 6), (5, 4), (5, 5), (5, 6), (6, 4), (6, 5), (6, 6)}
P(E) = 8/36 = 2/9
  toss1 toss2 probs
1     H     H  0.25
2     T     H  0.25
3     H     T  0.25
4     T     T  0.25
> S[1:3, ]
  toss1 toss2 probs
1     H     H  0.25
2     T     H  0.25
3     H     T  0.25
> S[c(2, 4), ]
  toss1 toss2 probs
2     T     H  0.25
4     T     T  0.25
Two or more events are combined to form a single event using the set operations
∩ and ∪. The event A ∪ B occurs if A or B (or both) occur, and the event A ∩ B
occurs if both A and B occur.
Definitions:
• Ai ≠ ∅ for all i = 1, 2, 3, ..., n
• Ai ∩ Aj = ∅ for all i ≠ j, where i = 1, 2, 3, ..., n and j = 1, 2, 3, ..., n
• ⋃_{i=1}^{n} Ai = S
Examples
1. P (A/B)
2. Are A and B independent ?
Solution:
P(B ∪ W) = P(B) + P(W) − P(B ∩ W)
         = 0.47
P(A ∩ B) = P(A) + P(B) − P(A ∪ B)
         = 0.3
P(A|B) = P(A ∩ B)/P(B),   P(B) > 0   (2.2.1)
Condition    B    B′    Total
A            2    6      8
A′           1    9     10
Total        3   15     18
P(A) = N(A)/N = 8/18 = 0.44   (2.2.2)
P(A|B) = P(A ∩ B)/P(B)   (2.2.3)
       = (2/18)/(3/18)   (2.2.4)
       = 2/3   (2.2.5)
• P(A|B) = P(A ∩ B)/P(B)
• P(A|B) × P(B) = [P(A ∩ B)/P(B)] × P(B)
• P(A|B) · P(B) = P(A ∩ B)
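A small sketch, with the contingency-table entries above entered by hand, showing that the conditional probability is just a ratio of counts:
# Counts from the table above
n_AB <- 2     # outcomes in both A and B
n_B  <- 3     # outcomes in B
N    <- 18    # total outcomes
P_A_given_B <- (n_AB / N) / (n_B / N)   # P(A|B) = P(A ∩ B) / P(B)
print(P_A_given_B)                      # 2/3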
Bayes’ Rule
The power of Bayes’ rule is that in many situations where we want to compute
P (A|B) it turns out that it is difficult to do so directly, yet we might have
direct information about P (B|A). Bayes’ rule enables us to compute P (A|B)
in terms of P (B|A).
P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B)
Bayes Theorem
Let A and Ac constitute a partition of the sample space S with
P(A) > 0 and P(Ac) > 0. Then for any event B in S such that P(B) > 0,
P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|Ac)P(Ac)]
Solution:
P(R|L) = 0.6
P(R|S) = 0.3
P(L|R) = P(L ∩ R)/P(R)
       = P(R|L)P(L)/P(R)
       = (0.6 × 0.75)/(0.6 × 0.75 + 0.3 × 0.25)
       = 0.857
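The same computation can be checked in R; a minimal sketch using the probabilities from the example above:
# Prior and conditional probabilities from the example
P_L <- 0.75; P_S <- 0.25
P_R_given_L <- 0.6; P_R_given_S <- 0.3
# Bayes' theorem: P(L|R) = P(R|L)P(L) / [P(R|L)P(L) + P(R|S)P(S)]
P_L_given_R <- (P_R_given_L * P_L) / (P_R_given_L * P_L + P_R_given_S * P_S)
print(P_L_given_R)   # 0.857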
Additive Rule
If A1, A2, ..., An are mutually exclusive events, then
P(A1 ∪ A2 ∪ ... ∪ An) = P(A1) + P(A2) + P(A3) + ... + P(An)
Multiplicative Rule
The Multiplication Principle, also known as the Basic Counting Principle, states
that: if one operation can be performed in m ways and a second operation can be
performed in n ways, then the two operations together can be performed in m × n ways.
Examples
R Tossing a coin has two possible outcomes and tossing a die has
six possible outcomes. Then the combined experiment, tossing
the coin and die together will result in (2 × 6 = 12) twelve possible
outcomes provided below:
The combination of the 8 different shirts and the six different pairs
of trousers results in (8 × 6 = 48) possible ways.
Definitions:
2. The number of permutations of n distinct objects taken k at a time is:
ⁿPk = n!/(n − k)!,  where k ≤ n
3. The number of permutations of n objects consisting of groups of which n1
of the first group are alike,n2 of the second group are alike and so on for
the k th group with objects which are alike is:
n!/(n1! n2! n3! ... nk!),  where n1 + n2 + n3 + ... + nk = n
n!/n = (n − 1)!
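These permutation formulas can be evaluated directly with R's factorial function; a brief sketch with assumed values of n and k:
factorial(5) / factorial(5 - 2)   # 5P2 = 20 ordered arrangements of 2 from 5 objects
factorial(6) / 6                  # n!/n = (n - 1)! = 120 for n = 6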
Examples
any order and the remaining 5 are the digits 1, 2, 3, 4 and 5 also in
any order. If each letter and digit can appear only once then the
number of customers the company can code is obtained as follows:
Solution:
# Using the permute package
library(permute)
result <- allPerms(3)   # all permutations of 3 elements (excluding the identity ordering)
print(result)
# Using the gtools package
library(gtools)
data <- 1:3
result <- permutations(n = 3, r = 3, v = data)   # all 3! = 6 orderings
print(result)
Combination of Objects
2.2.8 Definition:
The number of ways in which k objects can be selected from n distinct objects,
irrespective of their order, is defined by:
C(n, k) = n!/(k!(n − k)!)
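R provides this count directly through choose(); a quick sketch comparing it with the formula:
choose(5, 2)                                   # 10 ways to pick 2 of 5 objects
factorial(5) / (factorial(2) * factorial(3))   # same value from n!/(k!(n - k)!)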
2.3 Questions
Solution:
C(6, 2) × C(5, 2) = (6!/(4! 2!)) × (5!/(3! 2!))
                  = 15 × 10 = 150
= (6!/(5! 1!)) × (5!/(2! 3!)) + (6!/(4! 2!)) × (5!/(2! 3!)) + (6!/(3! 3!)) × (5!/(4! 1!))
= (6 × 10) + (15 × 10) + (20 × 5) = 310
R A box contains 6 red, 3 white and 5 blue balls. If three balls are
drawn at random, one after the other without replacement, find
the probability that
Solution:
1. Pr(all the 3 are red balls) = (No. of selections of 3 from 6)/(No. of selections of 3 from 14)
   = C(6, 3)/C(14, 3) = (6 × 5 × 4)/(14 × 13 × 12) = 5/91
2. Pr(2 red and 1 white ball) = C(6, 2) × C(3, 1)/C(14, 3)
   = (15 × 3)/(14 × 13 × 2) = 45/364
3. Pr(at least one red ball) = 1 − C(8, 3)/C(14, 3)
   = 1 − (8 × 7)/(14 × 13 × 2)
   = 1 − 2/13 = 11/13
4. Pr(1 of each colour) = C(6, 1) × C(3, 1) × C(5, 1)/C(14, 3)
   = (6 × 3 × 5)/(14 × 13 × 2) = 45/182
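All four probabilities can be verified with choose(); a short sketch:
choose(6, 3) / choose(14, 3)                                 # all three red: 5/91
choose(6, 2) * choose(3, 1) / choose(14, 3)                  # 2 red, 1 white: 45/364
1 - choose(8, 3) / choose(14, 3)                             # at least one red: 11/13
choose(6, 1) * choose(3, 1) * choose(5, 1) / choose(14, 3)   # one of each colour: 45/182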
Solution:
The number of ways of forming the committee of 3 from twelve
men and 8 women (12M + 8W = 20) is C(20, 3) = 1140.
= Pr(2W, 1M) + Pr(3W)
= [C(8, 2) × C(12, 1) + C(8, 3)] / C(20, 3)
= ((28 × 12) + 56)/1140
Solution:
∴ P(W ∩ C) = (40/100) × (30/40) = 30/100 = 0.3
2. P(M ∩ B) = P(M) × P(B|M)
P(M) = 60/100;  P(B|M) = 10/60
∴ P(M ∩ B) = P(M)P(B|M) = (60/100) × (10/60) = 10/100 = 0.10
OR
P(M ∩ B) = P(B)P(M|B) = (25/100) × (10/25) = 10/100 = 0.10
Solution:
S = {HH, HT, TH, TT}
A = {HH, HT, TH}
N = C(52, 5) = 52!/(5! 47!) = 2598960
P(C) = n/N = 24/2598960 ≈ 9.2 × 10⁻⁶
R Suppose two people each flip a fair coin simultaneously. Will the
results of the two flips usually be independent? Under what sorts
of circumstances might they not be independent? (List as many
such circumstances as you can.)
3. Introduction to Statistics
Having worked through this chapter the student will be able to:
The field of statistics deals with the collection, presentation, analysis, and use
of data to make decisions, solve problems, and design products and processes.
In simple terms, statistics is the science of data.
Because many aspects of engineering practice involve working with data, ob-
viously knowledge of statistics is just as important to an engineer as are the
other engineering sciences. Specifically, statistical techniques can be powerful
aids in designing new products and systems, improving existing designs, and
designing, developing, and improving production processes.
Statistical analysis provides objective ways of evaluating patterns of events or
patterns in our data by computing the probability of observing such patterns
by chance alone.
Insisting on the use of statistical analyses on which to draw conclusions is an
extension of the argument that objectivity is critical in science. Without the
use of statistics, little can be learnt from most research studies.
Because of the increasing use of statistics in so many areas of our lives, it has
become very desirable to understand and practice statistical thinking. This is
important even if you do not use statistical methods directly.
• Organize the entire set of scores into a table or a graph that allows
researchers (and others) to see the whole set of scores. (summarizing data
graphically)
• Compute one or two summary values (such as the average) that describe
the entire group. (summarizing data numerically).
This is the branch of statistics that involves using a sample to draw conclu-
sions about a population. A basic tool in the study of inferential statistics is
probability.
3.5 Variables
For example, a person’s hair color is a potential variable, which could have the
value of "blond" for one person and "brunette" for another.
For example, when we talk about the population of a
city, we are talking about the number of people in the city - a measurable
attribute of the city. Therefore, population would be a quantitative
variable. In algebraic equations, quantitative variables are represented by
symbols (e.g., x, y, or z).
Statistical data are often classified according to the number of variables being
studied.
When we conduct a study that looks at only one variable, we say that we
are working with univariate data. Suppose, for example, that we conducted
a survey to estimate the average weight of high school students. Since we are
only working with one variable (weight), we would be working with univariate
data.
When we conduct a study that examines the relationship between two variables,
we are working with bivariate data. Suppose we conducted a study to see if
there was a relationship between the height and weight of high school students.
Since we are working with two variables (height and weight), we would be
working with bivariate data.
The study of statistics revolves around the study of data sets. This lesson
describes two important types of data sets - populations and samples. Along
the way, we introduce simple random sampling, the main method used in this
tutorial to select samples.
The main difference between a population and sample has to do with how
observations are assigned to the data set. A population includes all of the
elements from a set of data. A sample consists of one or more observations from
the population. Depending on the sampling method, a sample can have fewer
observations than the population, the same number of observations, or more
observations. More than one sample can be derived from the same population.
• Pie chart
• Bar Chart (Also frequency distribution)
• Box plot
• Dot plot
• Stem-and-leaf
• Histogram
These provide an indication of the center of the distribution where most of the
scores tend to cluster. There are three principal measures of central tendency:
Mode, Median, and Mean.
Variability is the measure of the spread in the data. The three common
variability concepts are: Range, Variance and Standard deviation.
Graphic displays are useful for seeing patterns in data. Patterns in data are
commonly described in terms of: Center, Spread, Shape, Symmetry, Skewness
and Kurtosis
3.9.1 Gaps
3.9.2 Outliers
Common graphical displays (e.g., dot plots, box plots, stem plots, bar charts)
can be effective tools for comparing data from two or more data sets.
When you compare two or more data sets, focus on four features:
Example:
There are C(26, 5) = 65780 different samples of 5 letters that can be obtained
from the 26 letters of the alphabet. If a procedure for selecting a
sample of 5 letters was devised such that each of these 65780 samples
had the same chance of being selected, the result would be a simple random sample.
Examples
R The (testable) population list alternates between men (on the even
numbers) and women (on the odd numbers). You choose to sample
every tenth individual, which will therefore result in only men
being included in your sample. This would be unrepresentative of
the population.
R You run a department store and are interested in how you can
improve the store experience for your customers. To investigate
this question, you ask an employee to stand by the store entrance
and survey every 20th visitor who leaves, every day for a week.
Although you do not necessarily have a list of all your customers
ahead of time, this method should still provide you with a rep-
resentative sample of your customers since their order of exit is
essentially random.
• Define and list your population, ensuring that it is not ordered in a cyclical
or periodic order.
• Decide on your sample size and calculate your interval, k, by dividing your
population size by your target sample size.
• Choose every k th member of the population as your sample.
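A minimal sketch of these steps in R, with an assumed population of 1000 and a target sample size of 100:
population <- 1:1000                     # assumed population list
k <- length(population) / 100            # interval k = population size / sample size
start <- sample(1:k, 1)                  # random starting point within the first interval
systematic_sample <- population[seq(start, length(population), by = k)]
length(systematic_sample)                # 100 members selected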
Example:
In determining the distribution of incomes among engineers in the
Bay Area, we can divide the population of engineers into sub-
populations corresponding to each major engineering speciality (elec-
trical, chemical, mechanical, civil, industrial, etc.). Random samples
can then be selected from each of these sub-populations of engineers.
The logic behind this sampling structure is the reasonable assump-
tion that the income of an engineer depends, to a large extent, on
his particular speciality.
# Sample data
data <- c(23, 35, 36, 42, 45, 47, 48, 49, 52, 53, 56, 58)
# Create a stem-and-leaf plot
stem(data)
For a larger data set, the stem-and-leaf display looks like this:
1 | 2: represents 120
 leaf unit: 10
           n: 192
10 | 57
11 | 136678
12 | 123889
13 | 0255666888899
14 | 00001222344444555556667788889
15 | 0000111112222223444455555566677779
16 | 01222333444445555555678888889
17 | 11233344566667799
18 | 00011235568
19 | 01234455667799
20 | 0000113557788899
21 | 145599
22 | 013467
23 | 9
24 | 7
• Dot plot:
dot plots are a valuable tool for exploratory data analysis. They offer a
concise and informative representation of data distribution, aiding in the
identification of patterns and outliers. Whether used independently or in
conjunction with other visualization techniques, dot plots contribute to a
richer understanding of datasets. Here’s an example of creating a simple
dot plot in R :
# Sample data
values <- c(15, 10, 25, 18)
# Create a dot plot (dotchart draws a Cleveland dot plot in base R)
dotchart(values, main = "Dot Plot Example")
• Box plot:
# Sample data
data <- c(25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75)
# Create a box-and-whisker plot
boxplot(data, main = "Box-and-Whisker Plot Example", ylab = "Values")
• Histogram;
A histogram is a graphical representation of the distribution of a dataset.
It provides a visual summary of the underlying frequency distribution of
a continuous or discrete variable. The main purpose of a histogram is
to show the underlying frequency distribution of a set of continuous or
discrete data. Here’s an example of creating a simple histogram in R :
# Sample data
data <- c(22, 28, 30, 35, 40, 42, 45, 50, 55, 60, 65)
# Create a histogram
hist(data, main = "Histogram Example", xlab = "Values")
• Pareto chart;
A Pareto chart is a type of chart that combines both bar and line charts
to highlight the most significant factors in a dataset. It is named after the
Italian economist Vilfredo Pareto, who observed that roughly 80% of the
effects come from 20% of the causes. Pareto charts are particularly useful
for identifying the most important factors or issues within a dataset and
# Sample data
categories <- c("Cat A", "Cat B", "Cat C", "Cat D", "Cat E")
frequencies <- c(30, 20, 15, 10, 25)
# Order bars from largest to smallest and compute cumulative percentages
ord <- order(frequencies, decreasing = TRUE)
frequencies <- frequencies[ord]
categories <- categories[ord]
cum_percentages <- cumsum(frequencies) / sum(frequencies) * 100
# Create Pareto chart: bars for frequencies, overlaid line for cumulative percentage
bar_pos <- barplot(frequencies, names.arg = categories, ylim = c(0, sum(frequencies)))
lines(bar_pos, cum_percentages / 100 * sum(frequencies), type = "b", pch = 19)
axis(side = 4, at = seq(0, sum(frequencies), length.out = 11),
     labels = seq(0, 100, by = 10))
• Pie chart
A pie chart is a circular statistical graphic that is divided into slices to
illustrate numerical proportions. Each slice represents a proportionate
part of the whole, and the total sum of all slices is equal to 100%. Pie
charts are commonly used to display the distribution of a categorical
variable. Here's an example of creating a simple pie chart in R:
# Sample data
categories <- c("Cat A", "Cat B", "Cat C", "Cat D", "Cat E")
percentages <- c(25, 20, 15, 10, 30)
# Create a pie chart
pie(percentages, labels = categories, main = "Pie Chart Example")
• Bar chart;
A bar chart is a graphical representation of data in which rectangular
bars of equal width are drawn with lengths proportional to the values
they represent. The bars can be oriented horizontally or vertically. Bar
charts are used to visually represent and compare the magnitudes of
different categories or groups. Each bar in the chart typically corresponds
to a specific category, and the length of the bar represents the value or
frequency of that category. Here’s an example of creating a simple bar
chart in R :
# Sample data
values <- c(25, 40, 15, 30, 20)
# Create a bar chart
barplot(values, main = "Bar Chart Example", ylab = "Values")
The Mode:
The mode is defined as the observation in the sample which occurs most
frequently. If there is only one mode, the distribution is unimodal; otherwise it is
multimodal.
# Example dataset
data <- c(1, 2, 2, 3, 4, 4, 4, 5)
# Create a frequency table and take the most frequent value as the mode
freq_table <- table(data)
mode_result <- as.numeric(names(freq_table)[which.max(freq_table)])
# Print the mode result
print(mode_result)
The Arithmetic Mean:
It is the most commonly used measure of location. Let the variable x take the
values x1, x2, ..., xn. The arithmetic mean is defined as:
x̄ = (1/n) Σ_{i=1}^{n} xi
For large data it may be advantageous to classify the data. If these n observa-
tions have corresponding frequencies, the arithmetic mean is computed using
the formula,
x̄ = (x1 f1 + x2 f2 + ... + xn fn)/n   (3.13.1)
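A sketch of the grouped mean (3.13.1) in R, using assumed values and frequencies:
x <- c(10, 20, 30)           # assumed observed values
f <- c(2, 5, 3)              # assumed frequencies
weighted.mean(x, w = f)      # same as sum(x * f) / sum(f)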
# Sample data
data <- c(10, 15, 20, 25, 30)
# Calculate the mean
mean_result <- mean(data)
# Print the result
print(mean_result)
The geometric mean is the average of a set of products, the calculation of which
is commonly used to determine the performance results of an investment or
portfolio. It is technically defined as "the nth root product of n numbers."
G = antilog((1/N) Σ fi log xi)   (3.13.3)
GM = (X1 × X2 × ... × Xn)^(1/n)
GM is most often used to calculate the average growth rate over time of some
given series. R codes for computing geometric mean;
# Sample data
data <- c(2, 4, 8, 16, 32)
# Calculate the geometric mean (nth root of the product, via logarithms)
geometric_mean_result <- exp(mean(log(data)))
# Print the result
print(geometric_mean_result)
Year    Value     Ratio to previous year
1992    50 000    -
1993    55 000    55/50 = 1.10
1994    66 000    66/55 = 1.20
1995    70 000    70/66 = 1.06
1996    78 000    78/70 = 1.11
Solution:
GM = (1.10 × 1.20 × 1.06 × 1.11)^(1/4) = 1.12   (3.13.4)
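The same result can be checked in R from the ratios in the table:
ratios <- c(1.10, 1.20, 1.06, 1.11)
prod(ratios)^(1 / length(ratios))   # geometric mean of the growth ratios, about 1.12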
Harmonic Mean
In statistics, the harmonic mean is used to find the average rate. The harmonic mean
is the reciprocal of the arithmetic mean of the reciprocals.
Note:
# Sample data
data <- c(2, 4, 8, 16, 32)
# Calculate the harmonic mean (reciprocal of the mean of the reciprocals)
harmonic_mean_result <- length(data) / sum(1 / data)
# Print the result
print(harmonic_mean_result)
Median
If the sample observations are arranged in order from smallest to largest, the
median is defined as the middle observation if the number of observations is
odd, and as the number halfway between the two middle observations if the
number of observations is even. The general formula for the median is given as
MD = bL + ((n/2 − f_{m−1})/fm) × c   (3.13.6)
where,
bL = lower boundary of the median class
n = number of observations
fm = the number of observations in the median class
f_{m−1} = the cumulative frequency of the class preceding the median class
c = the width of the median class
# Sample data
data <- c(15, 20, 25, 30, 35)
# Calculate the median
median_result <- median(data)
# Print the result
print(median_result)
R Example 6:
If bL = 199.5, n = 300, f_{m−1} = 116, fm = 73, c = 50.
Solution:
Median = 199.5 + ((150 − 116)/73) × 50 = 222.79
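A sketch of formula (3.13.6) evaluated in R with the values from Example 6:
bL <- 199.5; n <- 300; f_m1 <- 116; f_m <- 73; c_width <- 50
bL + ((n / 2 - f_m1) / f_m) * c_width   # 222.79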
Range
It is the difference between the extreme values of the variate i.e. (xn − x1 ) when
the values are arranged in ascending order.
Interquartile Range
It is the difference between the 75th and the 25th percentiles, i.e. (X_75% − X_25%).
The interdecile range is the difference between the ninth and first deciles, i.e.
X_0.9 − X_0.1. This combines eighty percent of the total frequency, while the
interquartile range contains fifty percent. They are mainly used in descriptive
statistics because of the mathematical difficulty of handling them in advanced statistics.
MD = Σ|xi − x̄| / n
If the deviations were averaged without taking absolute values, the positive and
negative deviations would cancel (their sum is always zero), giving a poor expression
of the intrinsic dispersion even when the individual deviations are numerically large.
Taking absolute values, as in the mean absolute deviation (M.A.D.) above, provides a
better and more useful measure of dispersion. R codes;
# Sample data
data <- c(15, 20, 25, 30, 35)
# Calculate the mean
mean_value <- mean(data)
# Calculate the mean (absolute) deviation
mean_deviation <- mean(abs(data - mean_value))
# Print the result
print(mean_deviation)
For theoretical reasons, the sum of squares is divided by (n − 1) rather than n
because it represents a better estimate of the population standard deviation.
s² = Σ(xi − x̄)² / (n − 1)
# Sample data
data <- c(15, 20, 25, 30, 35)
# Calculate the sample variance
variance_result <- var(data)
# Print the result
print(variance_result)
s² = (1/(n − 1)) [Σ xi² − (Σ xi)²/n]   (3.13.7)
For grouped data,
s² = (1/(n − 1)) [Σ xi² fi − (Σ xi fi)²/n]
where xi = midpoint (class mark) of the ith class, fi = the number of observations
in the ith class, and n = the total number of observations,
n = Σ_{i=1}^{k} fi
sample mean:  x̄ = (1/n) Σ xi fi
R A test in probability and statistics was taken by 51 students at UMaT. The scores
ranged from 50% to 95% and were classified into 8 classes of width 6 units. Find the
variance and standard deviation.
x̄ = Σ xi fi / n = 3825/51 = 75
S² = Σ(xi − x̄)² fi / (n − 1) = 5328/50 = 106.56
S = √106.56 = 10.32
Table 3.13.4: Class limits, class marks (xi), frequencies (fi), and the quantities
xi fi and (xi − x̄)² fi used in the computation above.
xi      fi    xi²     xi fi   xi² fi
51       2    2601     102     5202
57       3    3249     171     9747
63       5    3969     315    19845
69       8    4761     552    38088
75      10    5625     750    56250
81      12    6561     972    78732
87      10    7569     870    75690
93       1    8649      93     8649
Totals  51            3825   292203
S² = (1/(n − 1)) [Σ xi² fi − (Σ xi fi)²/n]
   = (1/50) [292203 − (3825)²/51]
   = 106.56
S = 10.32
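The grouped-data computation above can be reproduced in R:
xi <- c(51, 57, 63, 69, 75, 81, 87, 93)
fi <- c(2, 3, 5, 8, 10, 12, 10, 1)
n <- sum(fi)                                            # 51
s2 <- (sum(xi^2 * fi) - sum(xi * fi)^2 / n) / (n - 1)   # computational formula
s2          # 106.56
sqrt(s2)    # 10.32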
CV = (s/x̄) × 100%
R codes;
# Sample data
data <- c(15, 20, 25, 30, 35)
# Calculate the coefficient of variation (in percent)
cv_result <- sd(data) / mean(data) * 100
# Print the result
print(cv_result)
3.13.6 Skewness
α3 = Σ(xi − x̄)³ / S³
# Install and load the e1071 package
install.packages("e1071")
library(e1071)
# Sample data
data <- c(15, 20, 25, 30, 35)
# Calculate the skewness
skewness_result <- skewness(data)
# Print the result
print(skewness_result)
3.13.7 Kurtosis
α4 = Σ(xi − x̄)⁴ / S⁴
R codes;
# Install and load the e1071 package
install.packages("e1071")
library(e1071)
# Sample data
data <- c(15, 20, 25, 30, 35)
# Calculate the kurtosis
kurtosis_result <- kurtosis(data)
# Print the result
print(kurtosis_result)
3.13.8 Questions
> str(precip)
 Named num [1:70] 67 54.7 7 48.5 14 17.2 20.7 13 43.4 40.2 ...
> precip[1:4]
> str(rivers)
Answer:
4. Random Variables and Distribution
4.1 Introduction
Having worked through this chapter the student will be able to:
There are two types of random variables. The two random variables in the above
examples are representatives of the two types of random variables that we will
consider. These definitions are not quite precise, but more examples should
make the idea clearer.
F(x) = P(X ≤ x)
where,
F(x) = P(X ≤ x) = Σ_{t = xmin}^{x} P(X = t).
P(X = x) = x/15,  x ∈ {1, 2, 3, 4, 5}   (4.2.1)
P(X ≤ 3) = Σ_{t=1}^{3} P(X = t)
That is:
P(X ≤ 3) = P(X = 1) + P(X = 2) + P(X = 3)
         = 1/15 + 2/15 + 3/15
P(X ≤ 3) = 6/15
R Examples:
1. P(X = x) = x²/30
2. f(x) = x/15
Probability Function
Let X be a discrete random variable, and suppose that the possible values that
it can assume are given by x1, x2, ..., xn, arranged in some order. Suppose also
that these values are assumed with probabilities given by
P (X = xk ) = f (xk ) k = 1, 2, . . . (4.2.2)
• f(x) ≥ 0
• Σ_x f(x) = 1
Examples
1. f(x) = 1/5, where x = 0, 1, 2, 3, 4, 5
2. f(x) = x²/30, where x = 0, 1, 2, 3, 4
3. f(x) = (x − 2)/5, where x = 1, 2, 3, 4, 5
R Question 2: Suppose that a pair of fair coins is tossed and let the
random variable X denote the number of heads minus the number
of tails.
We cannot define a probability function for a continuous
random variable in the same way as for a discrete random variable. In order to
arrive at a probability distribution for a continuous random variable we note
that the probability that X lies between two different values is meaningful. Thus,
a continuous random variable is the type whose spaces are not composed of a
countable number of points but takes on values in some interval or a union of
intervals of the real line.
Examples
R A fair coin is tossed three times. Let the random variable represent
the number of heads which come up.
Solution:
(a). The sample space is
Sample point      HHH  THH  HTH  HHT  HTT  THT  TTH  TTT
Number of heads    3    2    2    2    1    1    1    0
P(X = x)          1/8  1/8  1/8  1/8  1/8  1/8  1/8  1/8
The table above shows that the random variable can take the values
0, 1, 2, 3. The next task is to compute the probability distribution
p(xi) of X. Thus,
p(3) = P(X = 3) = P(HHH) = 1/8
p(2) = P(X = 2) = P({HHT} ∪ {HTH} ∪ {THH}) = 1/8 + 1/8 + 1/8 = 3/8
p(1) = P(X = 1) = P({TTH} ∪ {THT} ∪ {HTT}) = 1/8 + 1/8 + 1/8 = 3/8
p(0) = P(X = 0) = P(TTT) = 1/8
xi      0    1    2    3
p(xi)  1/8  3/8  3/8  1/8
f(0) = P(X = 0) = C(3, 0) C(5, 2)/C(8, 2) = 10/28
f(1) = P(X = 1) = C(3, 1) C(5, 1)/C(8, 2) = 15/28
f(2) = P(X = 2) = C(3, 2) C(5, 0)/C(8, 2) = 3/28
xi      0      1      2
f(x)  10/28  15/28  3/28
Solution:
(i). For the probability distribution function to be a probability function, the
probabilities must sum to 1:
Σ_{x=1}^{3} (2x + 3)/21 = (1/21)[{2(1) + 3} + {2(2) + 3} + {2(3) + 3}]
                        = (1/21)[5 + 7 + 9]
                        = 1
Σ_{x=3}^{5} p(x) = Σ_{x=3}^{5} k(x − 1) = 1
⟹ k[(3 − 1) + (4 − 1) + (5 − 1)] = 1
⟹ k[2 + 3 + 4] = 1
⟹ 9k = 1
⟹ k = 1/9
Verify,
• If f(x) is a PDF
• P(0 < x ≤ 1)
Solution:
(a).
∫_{−∞}^{∞} f(x) dx = ∫_{−1}^{2} (x²/3) dx
                   = [x³/9]_{−1}^{2} = 8/9 + 1/9 = 1
(b).
P(0 < x ≤ 1) = ∫₀¹ (x²/3) dx
             = [x³/9]₀¹ = 1/9 − 0 = 1/9
(a).
∫_{−∞}^{∞} f(x) dx = ∫₀³ cx² dx
                   = [cx³/3]₀³ = 9c
But since ∫_{−∞}^{∞} f(x) dx = 1, 9c = 1  ∴  c = 1/9
(b).
P(1 < x ≤ 2) = ∫₁² (x²/9) dx
             = [x³/27]₁² = 8/27 − 1/27 = 7/27
where c is a constant.
∫_{−∞}^{∞} f(x) dx = 1
∫₀^∞ c(1 + x)^(−3) dx = 1
Substituting u = 1 + x (so du = dx, and u runs from 1 to ∞):
c ∫₁^∞ u^(−3) du = 1
c [u^(−2)/(−2)]₁^∞ = 1
c (1/2) = 1
c = 2
For each of the following functions, find the constant c so that f(x)
is a p.d.f of a random variable X.
Solution:
(i)
f(x) = 4x^c
∫₀¹ f(x) dx = 1
∫₀¹ 4x^c dx = 1
4 [x^(c+1)/(c + 1)]₀¹ = 1
4/(c + 1) = 1
c + 1 = 4
c = 3
(ii)
f(x) = c√x
∫₀⁴ c x^(1/2) dx = 1
c [x^(3/2)/(3/2)]₀⁴ = 1
c (2/3) × 4^(3/2) = 1  ⟹  (16/3) c = 1
c = 3/16
(iii)
f(x) = c/x^(3/4)
∫₀¹ c x^(−3/4) dx = 1
c [x^(1/4)/(1/4)]₀¹ = 1
4c = 1
c = 1/4
R For each of the following functions, find the constant c so that f(x) is a PDF
of a random variable X.
1. f(x) = 4x^c, 0 ≤ x ≤ 1
2. f(x) = c√x, 0 ≤ x ≤ 4
R Consider flipping two fair coins. Let X = 1 if the first coin is heads, and X = 0 if the
first coin is tails. Let Y = 1 if the two coins show the same thing (i.e., both heads
or both tails), with Y = 0 otherwise. Let Z = X + Y and W = XY.
Certain comparisons hold significance, yet the description of their sampling distribution
isn’t as straightforward or neat in an analytical sense. So, what’s the next step
in such scenarios? Interestingly, having precise analytical details of the sampling
distribution isn’t always necessary. In many cases, using a simulated distribution
as an approximation suffices. This part will guide you through that process. It’s
worth highlighting that R programming excels in computing these simulated sampling
distributions, showing a distinct advantage over other statistical software like SPSS
or SAS.
[1] 0.9833985
and we can see the standard deviation:
> sd(mads)
[1] 0.1139002
i. Pr[1 or 2]
ii. Pr[1 ≤ X ≤ 3]
iii. Pr[1/2 ≤ X ≤ 5/2]
R A fair coin is tossed three times. Let X represent the number of heads
which come up
Solution:
xi      0    1    2    3
p(xi)  1/8  3/8  3/8  1/8
Step 2:
Find the cumulative distribution function:
F(0) = 0 + p(0) = 1/8
F(1) = 0 + p(0) + p(1) = 1/8 + 3/8 = 4/8
F(x) = 0,    x < 0
     = 1/8,  0 ≤ x < 1
     = 4/8,  1 ≤ x < 2
     = 7/8,  2 ≤ x < 3
     = 1,    x ≥ 3
p(x) = 1/8,  x = 0
     = 3/8,  x = 1
     = 3/8,  x = 2
     = 1/8,  x = 3
Note:
We can obtain this result without the graph by finding the difference in the
adjacent values of F (x).
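A sketch of this relationship in R: cumulative sums of p(x) give F(x), and differencing F(x) recovers p(x):
p <- c(1/8, 3/8, 3/8, 1/8)   # p(0), p(1), p(2), p(3)
Fx <- cumsum(p)              # 1/8, 4/8, 7/8, 1
diff(c(0, Fx))               # recovers p(x) from adjacent values of F(x)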
The probability density function of a continuous random variable X is given by
f(x) = 0,    x < 0
     = x/2,  0 ≤ x ≤ 2
     = 0,    x > 2
If x < 0, then
F(x) = ∫_{−∞}^{x} f(t) dt = 0
If 0 < x < 2, then
F(x) = ∫_{−∞}^{0} f(t) dt + ∫₀^x (t/2) dt
     = 0 + [t²/4]₀^x = x²/4
If x > 2, then
F(x) = ∫_{−∞}^{0} f(t) dt + ∫₀² f(t) dt + ∫₂^x f(t) dt
     = 0 + ∫₀² (t/2) dt + 0
     = [t²/4]₀² = 1
R According to Professor Doob, the two of them had an argument about whether
random variables should be called “random variables” or “chance variables.”
They decided by flipping a coin — and “random variables” won. (Source:
Statistical Science 12 (1997), No. 4, page 307.) Which name do you think
would have been a better choice?
5. Special Distribution
5.1 Introduction
Having worked through this chapter the student will be able to:
• Understand the assumptions for each of the discrete and continuous probability
distributions presented.
• Select an appropriate discrete and continuous probability distribution to calculate
probabilities in specific applications.
• Calculate probabilities, determine means and variances for each of the discrete
and continuous probability distributions presented.
Bernoulli Distribution:
A single trial of an experiment may result in one of two mutually exclusive
outcomes such as defective and non-defective, dead or alive, yes or no, male or female,
etc. Such a trial is called a Bernoulli trial, and a sequence of these trials forms a
Bernoulli process, satisfying the following conditions:
• Each trial results in one of the two mutually exclusive outcomes, success and
failure.
• The probability of a success, p remains constant, from trial to trial. The
probability of failure is denoted by q = 1 − p
• The trials are independent. That is, the outcome of any particular trial is not
affected by the outcome of any other trial.
Definition
A random variable X has a Bernoulli distribution if its probability function is
f(x) = p^x (1 − p)^(1−x),  x = 0, 1
and 0 < p < 1, where the mean and variance of the distribution are as follows:
• µ = E(X) = p;
• σ² = Var(X) = p(1 − p)
An important distribution arising from counting the number of successes in a fixed
number of independent Bernoulli trials is the Binomial distribution.
Implementation in R
In R, you can work with the Bernoulli distribution mainly through the rbinom function,
which is part of the base R distribution and is typically used for generating random
numbers from a binomial distribution. However, since the Bernoulli distribution is a
special case of the binomial distribution (where the number of trials is 1), you can
use rbinom for this purpose.
# Probability of success
p <- 0.5
# A Bernoulli trial is a binomial with a single trial (size = 1);
# generate 10 Bernoulli(p) random values
rbinom(10, size = 1, prob = p)
R Example 35: An urn contains 5 red and 15 green balls. Draw one ball at
random from the urn. Let X=1 if the ball drawn is red, and X=0 if a green
ball is drawn. Obtain;
• the p.d.f. of X,
• mean of X and
• variance of X.
f(x) = p^x q^(1−x),  x = 0, 1
where p = 5/20 and q = 15/20, so
f(x) = (5/20)^x (15/20)^(1−x),  x = 0, 1
Mean of X = E(X) = Σ_{x=0}^{1} x (5/20)^x (15/20)^(1−x)
          = (0)(5/20)⁰(15/20)¹ + (1)(5/20)¹(15/20)⁰
          = 5/20
Variance of X:
V(X) = Σ x² f(x) − [E(X)]² = (1)²(5/20)¹(15/20)⁰ − (5/20)²
     = (5/20) − (5/20)² = 3/16
Implementation in R
In R, you can work with the binomial distribution using the following functions:
dbinom gives the probability of observing exactly k successes in n trials.
# Define the parameters (values assumed for illustration)
n <- 10    # number of trials
p <- 0.5   # probability of success
# dbinom: calculate the probability of getting exactly k successes in n trials
# For example, probability of getting exactly 5 successes out of 10 trials
k <- 5
prob_exactly_5 <- dbinom(k, size = n, prob = p)
print(paste("Probability of exactly", k, "successes:", prob_exactly_5))
# pbinom: calculate the cumulative probability of getting k or fewer successes
# For example, cumulative probability of getting 5 or fewer successes
cum_prob_up_to_5 <- pbinom(k, size = n, prob = p)
print(paste("Cumulative probability of up to", k, "successes:", cum_prob_up_to_5))
# qbinom: determine the quantile function for a given cumulative probability
# For example, the number of successes associated with a 50% cumulative probability
quantile_50_percent <- qbinom(0.5, size = n, prob = p)
f(x) = Pr(X = x) = C(n, x) p^x q^(n−x) = (n!/(x!(n − x)!)) p^x q^(n−x)
where the random variable X denotes the number of successes in n trials and x =
0, 1, 2, ..., n.
The shape of the distribution depends on the two parameters n and p.
1. when p < 0.5 and n is small, the distribution will be skewed to the right.
2. when p > 0.5 and n is small, the distribution will be skewed to the left
3. when p = 0.5 the distribution will be symmetric.
4. In all cases, as n gets larger the distribution gets closer to being a symmetric,
bell-shaped distribution.
Properties
1. Mean = np
2. Variance =npq
3. Standard Deviation = √(npq)
R If 20% of the bolts produced by a machine are bad. Determine the probability
that out of 4 bolts chosen at random.
• one is defective
• none is defective
• at most 2 bolts will be defective.
Solution:
n = 4, p = 0.2, q = 0.8
P[X = 1] = f(1) = C(4, 1)(0.2)¹(0.8)³ = 0.4096
P[X = 0] = f(0) = C(4, 0)(0.2)⁰(0.8)⁴ = 0.4096
P[X ≤ 2] = P[X = 0] + P[X = 1] + P[X = 2]
or       = 1 − P[X ≥ 3] = 1 − P(X = 3) − P(X = 4)
         = 1 − C(4, 3)(0.2)³(0.8)¹ − C(4, 4)(0.2)⁴(0.8)⁰
         = 1 − 0.0256 − 0.0016
         = 0.9728
R From the experiment “toss four coins and count the number of tails” what is
the variance of X?
R Roll a fair 6-sided die 20 times and count the number of times that 6 shows
up. What is the standard deviation of your random variable?
V(X) = npq = 20 × (1/6) × (5/6) = 100/36
σ = √V(X) = √(100/36) = 10/6
R The following data are the number of seeds germinating out of 10 on damp
filter paper for 80 sets of seeds. Fit a binomial distribution to these data.
x   0   1   2   3   4   5   6   7   8   9   10   Total
f   6  20  28  12   8   6   0   0   0   0    0     80
Solution:
Here n = 10, N = 80 and Σfi = 80.
Arithmetic mean = Σ fi xi / Σ fi = 174/80 = 2.175
np = 2.175  ⟹  p = 174/800 = 0.2175,  q = 1 − p = 0.7825
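To complete the fit, the expected frequencies are N × b(x; 10, p̂); a sketch in R:
x <- 0:10
f <- c(6, 20, 28, 12, 8, 6, 0, 0, 0, 0, 0)
N <- sum(f)                          # 80
p_hat <- sum(f * x) / (N * 10)       # 174/800 = 0.2175
expected <- N * dbinom(x, size = 10, prob = p_hat)
round(expected, 1)                   # expected frequencies under the fitted binomial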
Let us consider an experiment in which the properties are the same as those listed
for a binomial experiment with the exception that the trials will be repeated until a
fixed number of successes occur. Therefore, instead of finding the probability of x
successes in n trials, where n is fixed, we are now interested in the probability that
the kth success occurs on the xth trial. Experiments of this kind are called ‘negative
binomial experiments’. (Walpole and Myres, 1993). The number X of trials to produce
k successes in a negative binomial experiment is called a “negative binomial random
variable” and its probability distribution is called the “negative binomial distribution”.
Since its probabilities depend on the number of successes desired and the probability
of success on a given trial, we shall denote them by the symbol b∗ (x; k, p). For the
general formula b∗ (x; k, p), consider the probability of a success on the trial preceded
by k − 1 successes and x − k failures in some specified order. The probability for the
specified order ending in success is p^(k−1) q^(x−k) p = p^k q^(x−k). The total number of sample
points in the experiment ending in success, after the occurrence of k − 1 successes
and x − k failures in any order, is equal to the number of partitions of x − 1 trials
into two groups with k − 1 successes corresponding to one group and x − k failures
corresponding to the other group. This number is given by the term C(x−1, k−1), each
mutually exclusive and occurring with equal probability p^k q^(x−k). We obtain the general
formula by multiplying p^k q^(x−k) by C(x−1, k−1). In other words:
b*(x; k, p) = C(x−1, k−1) p^(k−1) q^(x−k) p = C(x−1, k−1) p^k q^(x−k),   x = k, k+1, ...
p = probability of success
q = (1-p) = probability of failure
x = total number of trials on which the k th success occurs.
Implementation in R
# Load the MASS package for functions related to the negative binomial distribution
install.packages("MASS")
library(MASS)
# Define the parameters (values assumed for illustration)
size <- 5    # target number of successes
prob <- 0.5  # probability of success on each trial
# dnbinom: calculate the probability of getting a specific number of failures
# For example, probability of getting exactly 3 failures before 5 successes
failures <- 3
prob_3_failures <- dnbinom(failures, size = size, prob = prob)
print(paste("Probability of exactly", failures, "failures:", prob_3_failures))
# pnbinom: calculate the cumulative probability of getting a certain number of failures or fewer
# For example, cumulative probability of getting 3 or fewer failures
cum_prob_up_to_3_failures <- pnbinom(failures, size = size, prob = prob)
print(paste("Cumulative probability of up to", failures, "failures:", cum_prob_up_to_3_failures))
# qnbinom: determine the quantile function for a given cumulative probability
quantile_50_percent <- qnbinom(0.5, size = size, prob = prob)
Examples
P = C(r+x−1, x) (1 − p)^x p^r
  = ((r+x−1)! / ((r−1)! x!)) (1 − p)^x p^r
If the regional success ratio is assumed to be 10% then the probability that a
two-hole program will meet the company’s goal of two discoveries will be:
P = ((2 + 0 − 1)! / ((2 − 1)! 0!)) (1 − 0.1)⁰ (0.1)²
  = (1!/(1! 0!)) × 0.9⁰ × 0.1²
  = 1 × 1 × 0.01 = 0.01
The probability that five holes will be required to achieve two successes is:
p = ((2 + 3 − 1)! / ((2 − 1)! 3!)) (1 − 0.1)³ (0.1)²
  = (24/(1 × 6)) × 0.729 × 0.01 = 0.029
or
C(x−1, k−1) p^k q^(x−k) = C(4, 1)(0.1)²(0.9)³ = 4 × 0.01 × 0.729 = 0.029
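These values can be checked with R's dnbinom, which parameterises the negative binomial by the number of failures x − k observed before the k-th success:
dnbinom(0, size = 2, prob = 0.1)   # two holes, two discoveries: 0.01
dnbinom(3, size = 2, prob = 0.1)   # five holes for two discoveries: about 0.029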
R Find the probability that a person tossing three coins will get either all heads
or all tails for the second time in the fifth toss?
Solution:
x = 5, k = 2, p = 1/8 + 1/8 = 1/4
b*(5; 2, 1/4) = C(4, 1) (1/4)² (3/4)³ = 27/256
The negative binomial distribution derives its name from the fact that each
term in the expansion of p^k (1 − p)^(x−k) corresponds to the values of b*(x; k, p) for
x = k, k+1, k+2, ...
Geometric Distribution
The geometric distribution is a special case of the negative binomial distribution for
which k = 1. This is the probability distribution for the number of trials required for
a single success. Thus:
g(x; p) = p q^(x−1)
Implementation in R
In R, you can work with the geometric distribution using functions that are similar to
those for the binomial and negative binomial distributions. Here’s an R code snippet
demonstrating various functions related to the geometric distribution, along with
comments explaining each part:
# Define the parameter for the geometric distribution
prob <- 0.5   # probability of success in each trial
# dgeom: calculate the probability of observing a specific number of failures before the first success
# For example, probability of getting exactly 3 failures before the first success
failures <- 3
prob_3_failures <- dgeom(failures, prob = prob)
print(paste("Probability of exactly", failures, "failures before first success:", prob_3_failures))
# pgeom: calculate the cumulative probability of observing a certain number of failures or fewer before the first success
# For example, cumulative probability of getting 3 or fewer failures before the first success
cum_prob_up_to_3_failures <- pgeom(failures, prob = prob)
print(paste("Cumulative probability of up to", failures, "failures before first success:", cum_prob_up_to_3_failures))
# qgeom: determine the quantile function for a given cumulative probability
# For example, the number of failures associated with a 50% cumulative probability before the first success
quantile_50_percent <- qgeom(0.5, prob = prob)
print(paste("Quantile at 50% cumulative probability of failures before first success:", quantile_50_percent))
Examples
µ = 1/p = 100        σ² = (1 − p)/p² = 9900
Poisson Distribution
1. The number of successes occurring in one time interval or specified region are
independent of those occurring in any other disjoint time interval or region of
space.
2. The probability of a single success occurring during a very short time interval
or in a small region is proportional to the length of the time interval or the size
of the region and does not depend on the number of successes occurring outside
this time interval or region.
3. The probability of more than one success occurring in such a short time interval
or falling in such a small region is negligible.
The probability distribution of the Poisson random variable is called the Poisson
distribution and is denoted by P(x; µ), since its values depend only on µ, the average
number of successes occurring in the given time interval or specified region. This
formula is given by the definition below:
P(x; µ) = (e^(−µ) µ^x)/x!,  x = 0, 1, 2, ...
where µ is the average number of successes occurring in the given time interval or
specified region and e = 2.7183.
Theorem: The mean and variance of the Poisson distribution both have the value µ.
Implementation in R
In R, you can work with the Poisson distribution using the following functions:
# Define the parameter for the Poisson distribution
lambda <- 4   # the average number of events in the interval (e.g., 4 events per time unit)
# qpois: determine the quantile for a given cumulative probability
# For example, the number of events associated with a 50% cumulative probability
quantile_50_percent <- qpois(0.5, lambda)
Examples
R Suppose that an urn contains 100,000 marbles and 120 are red. If a random
sample of 1000 is drawn what are the probabilities that 0, 1, 2, 3, and 4
respectively will be red.
n = 1000,  p = 120/100000 = 0.0012,  q = 0.9988
Solution:
Binomial: C(1000, x) (0.0012)^x (0.9988)^(1000−x)
Using the Poisson approximation with λ = np = 1.2, for x = 3:
e^(−1.2) = 0.3012
f(3) = (e^(−1.2) (1.2)³)/3! = 0.0867
P(X > 5) = 1 − P(X ≤ 5)
         = 1 − 0.9985 = 0.0015
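Both quantities are available directly in R:
dpois(3, lambda = 1.2)       # f(3) = 0.0867
1 - ppois(5, lambda = 1.2)   # P(X > 5) = 0.0015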
i. P(X ≤ 6)
ii. P(X > 5)
iii. P(X = 6)
iv. P(X ≥ 4)
Solution:
i. P(X ≤ 6) = Σ_{x=0}^{6} (5^x e^(−5))/x! = 0.762
iv. P (X ≥ 4) = 1 − P (X < 4)
Solution:
i. p(X = 2) = (e^(−λ) λ^x)/x!, with λ = 3
   = (e^(−3) 3²)/2! = 0.05(9)/2 = 0.225
ii. p(X = 0) = (e^(−3) 3⁰)/0! = 0.05
iii. p(X = 3) + p(X = 4) = (e^(−3) 3³)/3! + (e^(−3) 3⁴)/4!
   = e^(−3) (27/6 + 81/24)
   = 0.05 (9/2 + 27/8)
   = 0.05(7.875)
   = 0.394
R Fit a Poisson distribution to the following data which gives the number of yeast
cells per square for 400 squares
Solution:
The expected theoretical frequency for r successes is N e^(−m) m^r / r!, but m is not
known, so it is estimated by the sample mean:
m = Σ f x / Σ f = 529/400 = 1.32
P(X = x) = (λ^x e^(−λ))/x! = ((1.32)^x e^(−1.32))/x!
thus,
f = 400 × (e^(−1.32) (1.32)^x)/x!
P(X < 7) = Σ_{x=0}^{6} b(x; 8000, 0.001)
         ≈ Σ_{x=0}^{6} p(x; 8)
         = 0.3134
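A sketch comparing the exact binomial tail with its Poisson approximation in R:
pbinom(6, size = 8000, prob = 0.001)   # exact binomial probability
ppois(6, lambda = 8)                   # Poisson approximation: 0.3134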
5.1.2 Post-Test
1. Suppose that 24% of a certain population have blood group B, for a sample of
size 20 drawn from this population, find the probability that
7. The phone calls arriving at a given telephone exchange within one minute
follow the Poisson distribution with parameter value equal to ten. What is the
probability that in a given minute:
a) No calls arrive?
b) Exactly 10 calls arrive?
c) At least 10 calls arrive
Normal Distribution
The graph of the normal distribution which is a bell-shaped smooth curve approxi-
mately describes many phenomena that occur in nature, industry and research. In
addition, errors in scientific measurements are extremely well approximated by a
normal distribution. Thus, the normal distribution is one of the most widely used
probability distributions for modelling random experiments. It provides a good model
for continuous random variables involving measurements such as time, heights/weights
of persons, marks scored in an examination, amount of rainfall, growth rate and many
other scientific measurements.
Definition:
The probability density function for the normal random variable X which is simply
called normal distribution is defined by:
f(x) = (1/(σ√(2π))) e^(−(1/2)((x−µ)/σ)²),  −∞ < x < ∞
     = 0,  elsewhere
where σ > 0 and −∞ < µ < ∞.
The mean and variance of the measurements X are E(X) = µ and Var(X) = σ².
If the random variable is modelled by the normal distribution with mean µ and
variance σ², it is simply denoted as X ∼ N(µ, σ²).
This fact considerably simplifies the calculations of probabilities concerning normally
distributed variables, as seen in the following illustration: Suppose that X ∼ N(µ, σ²),
and let c1 < c2. Then
P(c1 < X < c2) = P((X − µ)/σ < (c2 − µ)/σ) − P((X − µ)/σ < (c1 − µ)/σ)
              = Φ((c2 − µ)/σ) − Φ((c1 − µ)/σ)
Note that
Φ(−x) = 1 − Φ(x)
Properties
• Mean = E(x) = µ
• Variance = σ 2
• Standard Deviation = σ
Solution:
1.
(i) P (0.53 < Z < 2.06) = Φ(2.06) − Φ(0.53) = 0.9803 − 0.7019 = 0.2784
2.
P(X < 60) = P((X − 75)/10 < (60 − 75)/10) = P(Z < −1.5) = 0.0668
P(6 ≤ X ≤ 12) = P((6 − 6)/5 ≤ Z ≤ (12 − 6)/5) = P(0 ≤ Z ≤ 1.2)
             = Φ(1.2) − Φ(0) = 0.8849 − 0.5000 = 0.3849
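The probabilities above can be checked directly with pnorm:
pnorm(2.06) - pnorm(0.53)                                  # 0.2784
pnorm(60, mean = 75, sd = 10)                              # 0.0668
pnorm(12, mean = 6, sd = 5) - pnorm(6, mean = 6, sd = 5)   # 0.3849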
A measure of the number of standard deviations the data falls above or below the
mean.
Z = (observation − mean)/SD   (5.1.2)
We can calculate Z scores for distributions of any shape, but with normal distributions
we use Z scores to calculate probabilities. Observations that are more than 2 SD
away from the mean are typically considered unusual. Another reason we use Z
scores is that if the distribution of X is nearly normal, then the Z scores of X will have
a Z distribution (unit normal). Note that the Z distribution is a special case of
the normal distribution where mean (µ) = 0 and standard deviation (σ) = 1. Linear
transformations of a normally distributed random variable are also normally distributed.
Hence, if
Z = (X − µ)/σ
then
P(Z < a) = Φ(a)
The area under the unit normal curve from a to b, where a ≤ b, is given by Φ(b) − Φ(a).
The area under the unit normal curve outside of a to b, where a ≤ b, is given by
1 − [Φ(b) − Φ(a)].
Examples
R Find the area under the standard normal curve between z = −1.5 and z = 1.25.
Solution
The area under the standard normal curve between z = −1.5 and z = 1.25 is found as follows. From the Standard Normal Table, the area to the left of z = 1.25 is 0.8944 and the area to the left of z = −1.5 is 0.0668. So, the area between z = −1.5 and z = 1.25 is 0.8944 − 0.0668 = 0.8276.
Interpretation: So, 82.76% of the area under the curve falls between z = -1.5
and z = 1.25.
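The same area can be obtained directly in R with pnorm, the normal cumulative distribution function; a minimal check of the table-based answer:

pnorm(1.25) - pnorm(-1.5)
# approximately 0.8275, matching the table-based value 0.8276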
R A survey indicates that people use their cellular phones an average of 1.5 years
before buying a new one. The standard deviation is 0.25 year. A cellular phone
user is selected at random. Find the probability that the user will use their
current phone for less than 1 year before buying a new one. Assume that the
variable x is normally distributed.
Solution: The graph shows a normal curve with µ = 1.5 and σ = 0.25 and a shaded area for x less than 1. The z-score that corresponds to 1 year is

z = (1 − 1.5)/0.25 = −2

so the required probability is P(X < 1) = P(Z < −2) = 0.0228.
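In R the probability can be read off directly from pnorm, either on the original scale or via the z-score; a quick sketch:

pnorm(1, mean = 1.5, sd = 0.25)   # P(X < 1) on the original scale
pnorm(-2)                         # same probability via the z-score
# both return approximately 0.0228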
R The weights of chocolate bars are normally distributed with mean 205 g and
standard deviation 2.6 g. The stated weight of each bar is 200 g.
Solution
(a) Let W be the weight of a chocolate bar, W ∼ N(205, 2.6²). Then

Z = (W − µ)/σ = (200 − 205)/2.6 = −1.923077

so the probability that a bar is underweight is Φ(−1.92) ≈ 0.027, and the probability that fewer than two bars are underweight is 0.841.
Implementation in R
In R, you can work with the normal distribution using several functions:
1. dnorm: Probability Density Function (PDF) - gives the height of the probability
distribution at each point for a given mean and standard deviation.
2. pnorm: Cumulative Distribution Function (CDF) - calculates the probability that a
normally distributed random variable will be less than or equal to a given value.
3. qnorm: Quantile Function - finds the quantile (the inverse of the CDF) for a
given probability.
4. rnorm: Generates random numbers from the normal distribution.
# Define the parameters for the normal distribution
mean <- 0   # Mean (mu)
sd <- 1     # Standard deviation (sigma)

# dnorm: calculate the density (height of the probability
# distribution) at a specific value
# For example, density at value 1
value <- 1
density_at_value <- dnorm(value, mean = mean, sd = sd)
print(paste("Density at value", value, ":", density_at_value))

# pnorm: calculate the cumulative probability up to a specific value
# For example, probability of being less than or equal to 1
cum_prob_up_to_value <- pnorm(value, mean = mean, sd = sd)
print(paste("Cumulative probability up to value", value, ":", cum_prob_up_to_value))

# qnorm: determine the quantile for a given cumulative probability
# For example, finding the value associated with a 50% cumulative probability
quantile_50_percent <- qnorm(0.5, mean = mean, sd = sd)
print(paste("Value at 50% cumulative probability:", quantile_50_percent))

# rnorm: generate random numbers from the normal distribution
# (the rnorm call itself was lost in the original; it is reconstructed here)
random_samples <- rnorm(5, mean = mean, sd = sd)
print(paste("Random samples:", toString(random_samples)))
Uniform Distribution
Suppose that a continuous random variable X can assume values in a bounded interval only, say the open interval (a, b), and suppose the p.d.f. of X is given as

f(x; a, b) = f(x) = 1/(b − a),  a < x < b
           = 0,  elsewhere.
Solution:
a) P[60 < X < 70] = ∫₆₀⁷⁰ 1/(b − a) dx
                  = (1/(75 − 50)) [x]₆₀⁷⁰
                  = 10/25 = 2/5

b) E(X) = ∫₅₀⁷⁵ x/(b − a) dx = 125/2

Or

E(X) = (b + a)/2 = (75 + 50)/2 = 125/2

c) Var(X) = E(X²) − E²(X) = (1/25) ∫₅₀⁷⁵ x² dx − (125/2)² = 625/12

Or

Var(X) = (b − a)²/12 = 625/12
The Gamma distribution arises in the study of waiting times, for example, in the
lifetime of devices. It is also useful in modeling many nonnegative continuous variables.
The gamma distribution requires the knowledge of the gamma function.
Definition:
Γ(α) = ∫₀^∞ x^(α−1) e^(−x) dx        (5.2.3)

Integrating by parts,

Γ(α) = [−e^(−x) x^(α−1)]₀^∞ + (α − 1) ∫₀^∞ x^(α−2) e^(−x) dx
     = (α − 1) ∫₀^∞ x^(α−2) e^(−x) dx

so that

Γ(α) = (α − 1)Γ(α − 1)
The continuous random variable X has a gamma distribution, with parameters α and β, if its density function is given by:

f(x) = (1/(β^α Γ(α))) x^(α−1) e^(−x/β),  x > 0
     = 0,  elsewhere

Its mean and variance are:
a) E(X) = αβ
b) Var(X) = αβ²
Implementation in R
# Define the parameters for the gamma distribution
shape <- 2   # Shape parameter (alpha, sometimes written k)
scale <- 3   # Scale parameter (beta, sometimes written theta)

# dgamma: calculate the density (height of the probability
# distribution) at a specific value
# For example, density at value 5
value <- 5
density_at_value <- dgamma(value, shape = shape, scale = scale)
print(paste("Density at value", value, ":", density_at_value))

# pgamma: calculate the cumulative probability up to a specific value
# For example, probability of being less than or equal to 5
cum_prob_up_to_value <- pgamma(value, shape = shape, scale = scale)
print(paste("Cumulative probability up to value", value, ":", cum_prob_up_to_value))

# qgamma: determine the quantile for a given cumulative probability
# For example, finding the value associated with a 50% cumulative probability
quantile_50_percent <- qgamma(0.5, shape = shape, scale = scale)
print(paste("Value at 50% cumulative probability:", quantile_50_percent))
The special case of the gamma distribution with α = 1 is the exponential distribution, whose p.d.f. is

f(x) = (1/β) e^(−x/β),  x > 0
     = 0,  elsewhere

where β > 0.        (5.3.4)
Properties
Mean E(X) = β
Variance Var(X) = β²
Standard Deviation σ = β

So if λ is the mean number of changes in the unit interval, then β = 1/λ is the mean waiting time for the first change.
Solution:
Implementation in R
In R, you can work with the exponential distribution using several functions, as shown in the code snippet below:

# Define the parameter for the exponential distribution
rate <- 0.5   # Rate parameter (lambda = 1/beta)

# dexp: calculate the density (height of the probability
# distribution) at a specific value
# For example, density at value 2
value <- 2
density_at_value <- dexp(value, rate = rate)
print(paste("Density at value", value, ":", density_at_value))

# pexp: calculate the cumulative probability up to a specific value
# For example, probability of being less than or equal to 2
cum_prob_up_to_value <- pexp(value, rate = rate)
print(paste("Cumulative probability up to value", value, ":", cum_prob_up_to_value))

# qexp: determine the quantile for a given cumulative probability
# For example, finding the value associated with a 50% cumulative probability
quantile_50_percent <- qexp(0.5, rate = rate)
print(paste("Value at 50% cumulative probability:", quantile_50_percent))
µ = E(X) = Σ x f(x)   for the discrete case

µ = E(X) = ∫_{−∞}^{∞} x f(x) dx   for the continuous case
5.3.2 Post-Test
3. Customers arrive randomly at a bank teller’s window. Given that one customer
arrived during a particular 10-minute period, let X equal the time within the 10
minutes that the customer arrived. If X is U(0, 10), find
i The p.d.f of X
ii P (X ≥ 8)
iii P (2 ≤ X < 8)
iv E(X)
v Var(X)
4. Explain the relationship that exists between the Poisson and the Exponential
distributions.
6. Estimations
6.0.1 Introduction
The basic reason for the need to estimate population parameters from sample information is that it is ordinarily too expensive or simply infeasible to enumerate complete populations to obtain the required information. The cost of complete censuses may be prohibitive in finite populations, while complete enumerations are impossible in the case of infinite populations. Hence, estimation procedures are useful in providing the means of obtaining estimates of population parameters with a desired degree of precision. We now consider estimation, the first of the two general areas
of statistical inference. The second general area is hypothesis testing which will be
examined later. The subject of estimation is concerned with the methods by which
population characteristics are measured from sample information. The objectives are
to present:
1. properties for judging how well a given sample statistic estimates the parent
population parameter.
2. several methods for estimating these parameters.
There are basically two types of estimation: point estimation and interval estimation. In point estimation, a single sample statistic, such as X̄, s, or p̄, is calculated from the sample to provide a best estimate of the true value of the corresponding population parameter, such as µ, σ, or p. Such a statistic is termed a point estimator. The
function or rule that is used to estimate the value of a parameter is called an estimator.
An estimate is a particular value calculated from a particular sample of observations.
On the other hand, an interval estimate consists of two numerical values defining an
interval which, with varying degrees of confidence, we feel includes the parameter
being estimated.
Unbiasedness:
If the expected value or mean of all possible values of a statistic over all possible
samples is equal to the population parameter being estimated, the sample statistic
is said to be unbiased. That is, if the expected value of an estimator is equal to the corresponding population parameter, the estimator is unbiased:

E(X̄) = E((1/n) Σᵢ₌₁ⁿ xᵢ) = µ        (6.0.1)
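A quick simulation sketch of unbiasedness (with assumed illustrative values µ = 10 and σ = 2): the average of many sample means settles near µ.

# Simulate many samples and average their sample means
set.seed(1)
mu <- 10; sigma <- 2; n <- 5
sample_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))
mean(sample_means)   # close to mu = 10, illustrating E(X_bar) = mu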
Efficiency:
The most efficient estimator among a group of unbiased estimators is the one with
the smallest variance. This concept refers to the sampling variability of an estimator.
Consistency:
An estimator is consistent if as the sample size increases, the probability increases that
the estimator will approach the true value of the population parameter. Alternatively,
an estimator is consistent if it satisfies the following conditions:
1. Var(θ̂) → 0 as n → ∞
2. θ̂ becomes unbiased as n → ∞
For most practical purposes, it would not suffice to have merely a single value estimate of a population parameter. Any single point estimate will be either right or wrong. Therefore, instead of obtaining only a single estimate of a population parameter, it would certainly seem extremely useful, and perhaps necessary, to obtain two estimators, say X̄₁ and X̄₂, and say with some confidence that the interval between X̄₁ and X̄₂ includes the true mean µ. Thus, an interval estimate of a population parameter θ is a statement of two values between which it is estimated that the parameter lies. We shall be discussing the construction of confidence intervals as a means of interval estimation. The confidence we have that a population parameter θ will fall within some confidence interval will equal (1 − α), where α is the probability that the interval does not contain θ (i.e., the probability α is an allowance for error). To construct a 95% confidence interval, α = 0.05. That is, the probability is 0.05 that the value θ will not lie within the interval.
Note that the larger the confidence interval, the smaller the probability of error α for the interval estimator.
X̄ − Z_{α/2} (σ/√n) ≤ µ ≤ X̄ + Z_{α/2} (σ/√n)        (6.0.2)

Simply written as

X̄ ± Z_{α/2} (σ/√n)        (6.0.3)

where Z_{α/2} is the Z value cutting off an area of α/2 in each of the right and left tails of the standard normal probability distribution.
90.48 − Z₀.₀₂₅ (3/√5) ≤ µ ≤ 90.48 + Z₀.₀₂₅ (3/√5)
90.48 − 1.96(1.3416) ≤ µ ≤ 90.48 + 1.96(1.3416)
87.8505 ≤ µ ≤ 93.1095
Solution
(a) 74.036 − Z₀.₀₀₅ (0.001/√15) ≤ µ ≤ 74.036 + Z₀.₀₀₅ (0.001/√15)
    74.036 − 2.575(0.000258) ≤ µ ≤ 74.036 + 2.575(0.000258)
    74.0353 ≤ µ ≤ 74.0367

(b) 74.036 − Z₀.₀₂₅ (0.001/√15) ≤ µ ≤ 74.036 + Z₀.₀₂₅ (0.001/√15)
    74.036 − 0.00051 ≤ µ ≤ 74.036 + 0.00051
    74.0355 ≤ µ ≤ 74.0365
R ASTM Standard E23 defines standard test methods for notched bar impact
testing of metallic materials. The Charpy V-notch (CVN) technique mea-
sures impact energy and is often used to determine whether or not a material
experiences a ductile-to-brittle transition with decreasing temperature. Ten
measurements of impact energy (J) on specimens of A238 steel cut at 60ºC are
as follows: 64.1, 64.7, 64.5, 64.6, 64.5, 64.3, 64.6, 64.8, 64.2, and 64.3. Assume
that impact energy is normally distributed with σ = 1 J (the value used in the computation below). We want to find a 95% CI for µ, the mean impact energy. What is the resulting 95% CI?
Solution:
x̄ − Z_{α/2} (σ/√n) ≤ µ ≤ x̄ + Z_{α/2} (σ/√n)

64.46 − Z₀.₀₂₅ (1/√10) ≤ µ ≤ 64.46 + Z₀.₀₂₅ (1/√10)
64.46 − 1.96(1/√10) ≤ µ ≤ 64.46 + 1.96(1/√10)
64.46 − 0.6198 ≤ µ ≤ 64.46 + 0.6198
63.84 ≤ µ ≤ 65.08
That is, based on the sample data, a range of highly plausible values for mean
impact energy for A238 steel at 60°C is 63.84J ≤ µ ≤ 65.08J.
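The same interval can be computed in a few lines of base R; a minimal sketch using the ten CVN measurements:

energy <- c(64.1, 64.7, 64.5, 64.6, 64.5, 64.3, 64.6, 64.8, 64.2, 64.3)
sigma <- 1                               # known population standard deviation (J)
x_bar <- mean(energy)                    # 64.46
margin <- qnorm(0.975) * sigma / sqrt(length(energy))
c(lower = x_bar - margin, upper = x_bar + margin)   # approximately (63.84, 65.08)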
Exercise:
When σ is not known and the sample size is small, the procedure for interval estimation of the population mean is based on a probability distribution known as the Student t-distribution. When the population variance is unknown and the sample size is small, the correct distribution for constructing a confidence interval for µ is the t-distribution. Here, an estimate s must be calculated from the sample to substitute for the unknown population standard deviation. The t-distribution is used such that

t = (X̄ − µ)/(s/√n)

where

s = √( Σ(Xᵢ − X̄)² / (n − 1) )        (6.0.5)
where v = n − 1. Notice that a requirement for the valid use of the t-distribution is
that the sample must be drawn from a normal distribution.
x̄ ± t_{α/2} (s/√n) = x̄ ± t₀.₀₅ (10/√25)
                   = 73 ± 1.711(2)
                   = (69.578, 76.422)
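A sketch of the same computation in R, assuming the summary figures used above (x̄ = 73, s = 10, n = 25; t₀.₀₅ with 24 df gives a 90% interval):

x_bar <- 73; s <- 10; n <- 25
t_crit <- qt(0.95, df = n - 1)          # 1.7109
margin <- t_crit * s / sqrt(n)
c(x_bar - margin, x_bar + margin)       # approximately (69.578, 76.422)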
Z = (X − np)/√(np(1 − p)) = (P̄ − p)/√(p(1 − p)/n)        (6.0.7)
is approximately standard normal. The 100(1 − α)% CI for p is then given by:

P̄ − Z_{α/2} √(P̄(1 − P̄)/n) ≤ p ≤ P̄ + Z_{α/2} √(P̄(1 − P̄)/n)        (6.0.8)
0.0125 − Z₀.₀₀₅ √(0.0125(1 − 0.0125)/800) ≤ p ≤ 0.0125 + Z₀.₀₀₅ √(0.0125(1 − 0.0125)/800)

0.0125 ± 2.575(0.003928)
R Of 1000 randomly selected cases of lung cancer, 823 resulted in death within
10 years. Construct a 95% confidence interval on the death rate from lung cancer.
Solution:
0.823 ± 1.96(0.0121)
0.823 ± 0.0237
(0.7993, 0.8467)
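A base-R sketch of this proportion interval (823 deaths out of 1000 cases):

x <- 823; n <- 1000
p_hat <- x / n
se <- sqrt(p_hat * (1 - p_hat) / n)
p_hat + c(-1, 1) * qnorm(0.975) * se    # approximately (0.7993, 0.8467)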
Exercise:
There are instances where we are interested in estimating the difference between two population means. Here, a sample is drawn from each of the populations and, from the data of each, the sample means x̄₁ and x̄₂ respectively are computed. The estimator x̄₁ − x̄₂ yields an unbiased estimate of µ₁ − µ₂, the difference between the population means. The confidence interval is given by the quantity
X̄₁ − X̄₂ ± Z_{α/2} √(σ₁²/n₁ + σ₂²/n₂)        (6.0.10)
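A sketch of this interval in R with hypothetical summary values (the means, standard deviations, and sample sizes below are illustrative only, not taken from the exercise):

x1_bar <- 16.015; x2_bar <- 16.005     # hypothetical sample means
sigma1 <- 0.020; sigma2 <- 0.025       # known standard deviations
n1 <- 10; n2 <- 10
se <- sqrt(sigma1^2 / n1 + sigma2^2 / n2)
(x1_bar - x2_bar) + c(-1, 1) * qnorm(0.975) * se   # 95% CI for mu1 - mu2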
Exercise:
Machine 1 Machine 2
16.03 16.01 16.02 16.03
16.04 15.96 15.97 16.04
16.05 15.98 15.96 16.02
16.05 16.02 16.01 16.01
16.02 15.99 15.99 16.00
6.0.5 Confidence Interval For The Difference Between Two Population Proportions
The magnitude of the difference between two population proportions is often of interest.
An unbiased point estimator of the difference in the two population proportions is provided by the difference in the sample proportions, p̄₁ − p̄₂. When n₁ and n₂ are large and the population proportions are not too close to 0 or 1, the central limit theorem applies, the standardized difference is approximately a standard normal random variable, and thus normal distribution theory may be employed to obtain confidence intervals. A 100(1 − α)% confidence interval for p₁ − p₂ is given by
[p̄₁ − p̄₂] − Z_{α/2} √(p̄₁(1 − p̄₁)/n₁ + p̄₂(1 − p̄₂)/n₂) ≤ p₁ − p₂ ≤ [p̄₁ − p̄₂] + Z_{α/2} √(p̄₁(1 − p̄₁)/n₁ + p̄₂(1 − p̄₂)/n₂)
Examples
R Two different types of injection-molding machines are used to form plastic parts.
A part is considered defective if it has excessive shrinkage or is discolored. Two
random samples, each of size 300, are selected, and 15 defective parts are found
in the sample from machine 1 while 8 defective parts are found in the sample
from machine 2. Construct a 95% confidence interval on the difference in the
two fractions defective.
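A sketch of this first example in R (15/300 and 8/300 defectives):

p1 <- 15 / 300; p2 <- 8 / 300
se <- sqrt(p1 * (1 - p1) / 300 + p2 * (1 - p2) / 300)
(p1 - p2) + c(-1, 1) * qnorm(0.975) * se   # 95% CI for p1 - p2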
R Two hundred patients suffering from a certain disease were randomly divided
into two equal groups. Of the first group, who received the standard treatment,
78 recovered within three days. Out of the other 100, who were treated by a
new method, 90 recovered within three days. The physician wished to estimate
the true difference in the proportions who would recover within three days.
Find 95% CI for p1 − p2 .
R In a study designed to assess the side effects of two drugs, 50 animals were
given Drug A and 50 animals were given Drug B. Of the 50 receiving Drug A,
11 showed undesirable side effects, while 8 of those receiving Drug B reacted
similarly. Find the 90 and 95 percent confidence intervals for PA − PB .
Results from an Experiment on Plant Growth. The PlantGrowth data frame gives the results of an experiment to measure plant yield (as measured by the weight of the plant). We would like to construct a 95% confidence interval for the mean weight of the plants. Suppose that we know from prior research that the true population standard deviation of the plant weights is 0.7 g. The parameter of interest is µ, which represents the true mean weight of the population of all plants of the particular species in the study. We will first take a look at a stemplot in R of the data:
> library(aplpack)
> with(PlantGrowth, stem.leaf(weight))
1 | 2: represents 1.2
 leaf unit: 0.1
           n: 30
1 f | 5
s |
2 3. | 8
4 4∗ | 11
5 t | 3
8 f | 455
10 s | 66
13 4 . | 889
( 4 ) 5∗ | 1111
13 t | 2233
9 f | 555
s |
6 5 . | 88
4 6∗ | 011
1 t | 3
> library(TeachingDemos)
> temp <- with(PlantGrowth, z.test(weight, stdev = 0.7))
> temp

        One Sample z-test

data:  weight
z = 39.6942, n = 30.000, Std. Dev. = 0.700, Std. Dev. of the sample mean = 0.128, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 4.822513 5.323487
sample estimates:
mean of weight
         5.073
The confidence interval bounds are shown in the sixth line down of the output. We
can make a plot with
> library(IPSUR)
> plot(temp, "Conf")
7. Hypothesis Testing
Learning Objectives
Having worked through this chapter the student will be able to:
7.1.1 Introduction
We now discuss the subject of hypothesis testing, which as earlier noted is one of
the two basic classes of statistical inference. Testing of hypotheses involves using
statistical inference to test the validity of postulated values for population parameters.
If the hypothesis specifies the distribution completely it is called simple, otherwise
it is called composite. For example, a demographer interested in the mean age of
residents in a certain local government area might pose a simple hypothesis such as
µ = 24 or he might specify a composite hypothesis as µ < 24 or µ > 24.
A statistical test is usually structured in terms of two mutually exclusive hypotheses
referred to as the null hypothesis and the alternative hypothesis denoted by H0 and
H1 respectively.
Two types of error occur in hypothesis testing: type I error and type II error. A type I error occurs if H0 is rejected when it is true. The probability of a type I error is the conditional probability P(reject H0 | H0 is true), denoted by α.
Hence,

              H0 is true                H0 is false
Accept H0     1 − α (correct decision)  β (Type II error)
Reject H0     α (Type I error)          1 − β (correct decision)
Parameter   Test statistic                                Conditions
µ           Z = (X̄ − µ₀)/(σ/√n)                           σ known, population normal
µ           Z = (X̄ − µ₀)/(s/√n)                           σ unknown, n 'large' (usually n ≥ 30)
µ           t = (X̄ − µ₀)/(s/√n), with (n − 1) df          σ unknown, n small, population normal
p           Z = (x/n − p₀)/√(p₀(1 − p₀)/n)                n large
Step 3: Determine the critical region using the cumulative distribution table for the
test statistic. The set of values that lead to the rejection of the null hypothesis is
called the critical region. A statistical test may be a one-tail or two-tail test. Whether
one uses a one- or two- tail test of significance depends upon how the alternative
hypothesis is formulated.
Step 4: Compute the value of the test statistic based on the sample information, e.g. the computed Z, t, or χ² value.
Step 5: Make a statistical decision and interpretation. H0 is rejected if the computed
value of the test statistic falls in the critical region otherwise it is accepted.
Possible situation in testing a statistical hypothesis
We shall consider testing of hypothesis about a population mean under three different
conditions:
Examples
Solutions:
Step 1:
H0: µ0 = 25
H1: µ0 ≠ 25
Step 2:
Z = (X̄ − µ0)/(σ/√n)
with σ² = 45, n = 10, X̄ = 22
Step 3:
Critical region: |Z| > 1.96 at α = 0.05 (two-tailed).
Step 4:
Zc = (X̄ − µ0)/(σ/√n) = (22 − 25)/(√45/√10) = −3/2.1213 = −1.41
Step 5:
We are unable to reject the null hypothesis, since −1.41 > −1.96.
R Aircrew escape systems are powered by a solid propellant. The burning rate of
this propellant is an important product characteristic. Specifications require
that the mean burning rate must be 50 centimeters per second. We know that
the standard deviation of burning rate is σ = 2 centimeters per second. A sample of n = 25 specimens is tested and gives a sample mean burning rate of X̄ = 51.3 centimeters per second (these values are used in the computation below). Test at α = 0.05.
Step 1:
H0: µ0 = 50
H1: µ0 ≠ 50
Step 2:
Zc = (X̄ − µ0)/(σ/√n)
   = (51.3 − 50)/(2/√25)
   = 1.3/0.4
   = 3.25
Since 3.25 > 1.96, H0 is rejected at α = 0.05.
install.packages("BSDA")
# Load the BSDA package
library(BSDA)

# Given values
mu <- 50          # Hypothesized population mean
sigma <- 2        # Known population standard deviation
X_bar <- 51.3     # Sample mean
n <- 25           # Sample size (25 specimens, as in the worked example)
alpha <- 0.05     # Significance level

# BSDA's z.test() expects the raw data vector; with only summary
# values available, the z statistic and p-value are computed manually
# (this step was missing in the original snippet)
z_stat <- (X_bar - mu) / (sigma / sqrt(n))
p_value <- 2 * pnorm(-abs(z_stat))
test_result <- c(z = z_stat, p.value = p_value)

# Output the test result
print(test_result)
The test statistic is

t = (X̄ − µ)/(s/√n)
R A study revealed that the upper limit of the Normal Body Temperature of
males is 98.6. The body temperatures for 25 male subjects were taken and
recorded as follows: 97.8, 97.2, 97.4, 97.6, 97.8, 97.9, 98.0, 98.0, 98.0, 98.1, 98.2,
98.3, 98.3, 98.4, 98.4, 98.4, 98.5, 98.6, 98.6, 98.7, 98.8, 98.8, 98.9, 98.9 and 99.0.
Test the hypothesis H0 : µ0 = 98.6 versus H1 : µ0 ̸= 98.6, using α = 0.05
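A sketch of this test in R using t.test on the recorded temperatures:

temps <- c(97.8, 97.2, 97.4, 97.6, 97.8, 97.9, 98.0, 98.0, 98.0, 98.1, 98.2,
           98.3, 98.3, 98.4, 98.4, 98.4, 98.5, 98.6, 98.6, 98.7, 98.8, 98.8,
           98.9, 98.9, 99.0)
t.test(temps, mu = 98.6)   # two-sided one-sample t test of H0: mu = 98.6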
R Nine patients suffering from the same physical handicap, but otherwise com-
parable were asked to perform a certain task as part of an experiment. The
average time required to perform the task was seven minutes with a standard
deviation of two minutes. Assuming normality, can we conclude that the true
mean time required to perform the task by this type of patient is at least ten
minutes?
R The increased availability of light materials with high strength has revolutionized
the design and manufacture of golf clubs, particularly drivers. Clubs with hollow
heads and very thin faces can result in much longer tee shots, especially for
players of modest skills. This is due partly to the “spring-like effect” that
the thin face imparts to the ball. Firing a golf ball at the head of the club
and measuring the ratio of the outgoing velocity of the ball to the incoming
velocity can quantify this spring-like effect. The ratio of velocities is called the
coefficient of restitution of the club. An experiment was performed in which
15 drivers produced by a particular club maker were selected at random and
their coefficients of restitution measured. In the experiment, the golf balls were
fired from an air cannon so that the incoming velocity and spin rate of the ball
could be precisely controlled. Determine if there is evidence (with α = 0.05)
to support a claim that the mean coefficient of restitution exceeds 0.82. The
observations are:
The sample mean and sample standard deviation are X̄ = 0.83725 and s =
0.02456.
Perform the one-sample t-test and draw conclusions based on the p-value (here coefficients is the vector of the 15 measured values, with mu0 <- 0.82 and alpha <- 0.05):

# One-sample t test of H0: mu = 0.82 vs H1: mu > 0.82
test_result <- t.test(coefficients, mu = mu0, alternative = "greater")

# Draw conclusions based on the p-value
if (test_result$p.value < alpha) {
  print("Reject the null hypothesis: there is evidence that the mean coefficient of restitution exceeds 0.82.")
} else {
  print("Do not reject the null hypothesis: there is not enough evidence to conclude that the mean coefficient of restitution exceeds 0.82.")
}
Consider testing

H0: p = p₀
H1: p ≠ p₀

For large samples, the normal approximation to the binomial with the test statistic

Z = (X − np₀)/√(np₀(1 − p₀)) = (X/n − p₀)/√(p₀(1 − p₀)/n) = (P̄ − p₀)/√(p₀(1 − p₀)/n)

may be used. This presents the test statistic in terms of the sample proportion instead of the number of items X in the sample that belong to the class of interest.
Hypothesis testing involving the difference between two population means is most
frequently employed to determine whether or not it is reasonable to conclude that the
two are unequal. In such cases, one or other of the following hypotheses is tested:
H0: µ₁ − µ₂ = 0    H1: µ₁ − µ₂ ≠ 0
H0: µ₁ − µ₂ ≥ 0    H1: µ₁ − µ₂ < 0
H0: µ₁ − µ₂ ≤ 0    H1: µ₁ − µ₂ > 0
Solution
n₁ = 12, n₂ = 15
σ₁² = 1, σ₂² = 1
H0: µ₁ − µ₂ = 0
H1: µ₁ − µ₂ ≠ 0

Z = ((4.5 − 3.4) − 0)/√(1/12 + 1/15)
  = 1.1/√0.15
  = 1.1/0.3873
  = 2.84

Reject H0 since 2.84 > 1.96; on the basis of these data, there is an indication that the means are not equal.
Solution:
n₁ = 10, n₂ = 10
σ₁² = 64, σ₂² = 64
H0: µ₁ − µ₂ ≤ 0
H1: µ₁ − µ₂ > 0

Z = ((121 − 112) − 0)/√(64/10 + 64/10)
  = 9/√12.8
  = 9/3.5777
  = 2.52

Conclusion: Reject H0.
R Exercises
i Two machines are used for filling plastic bottles with a net volume of
16.0 ounces. The fill volume can be assumed normal, with standard
deviations σ₁ = 0.020 and σ₂ = 0.025 ounces. A member of the quality
engineering staff suspects that both machines fill to the same mean net
volume, whether or not this volume is 16.0 ounces. A random sample of
10 bottles is taken from the output of each machine.
Machine 1 Machine 2
16.03 16.01 16.02 16.03
16.04 15.96 15.97 16.04
16.05 15.98 15.96 16.03
16.05 16.02 16.01 16.01
16.02 15.99 15.99 16.00
The normality assumption is required to develop the test procedure, but moderate departures from normality do not adversely affect the procedure. Two different situations must be treated. In the first case, we assume that the variances of the two normal distributions are unknown but equal; that is, σ₁² = σ₂² = σ². In the second, we assume that σ₁² and σ₂² are unknown and not necessarily equal. For the equal-variance case, the test statistic is

t = (X̄₁ − X̄₂)/(Sₚ √(1/n₁ + 1/n₂)),  with n₁ + n₂ − 2 degrees of freedom.

The two sample variances are combined to form an estimator of σ². The pooled estimator of σ² is defined as

Sₚ² = ((n₁ − 1)S₁² + (n₂ − 1)S₂²)/(n₁ + n₂ − 2)
Examples
and s21 = 0.40, respectively. Assume that σ12 = σ22 and that the data are drawn
from a normal distribution. Is there evidence to support the claim that the two
machines produce rods with different mean diameters? Use α = 0.05 in arriving
at this conclusion.
R Two catalysts are being analyzed to determine how they affect the mean yield
of a chemical process. Specifically, catalyst 1 is currently in use, but catalyst 2
is acceptable. Since catalyst 2 is cheaper, it should be adopted, providing it
does not change the process yield. A test is run in the pilot plant and results
in the data shown in the following table. Is there any difference between the
mean yields? Use α = 0.05, and assume equal variances.
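In R, a pooled (equal-variance) two-sample t test is carried out by t.test with var.equal = TRUE; a sketch with hypothetical yield vectors, since the example's data table is not reproduced here:

yield1 <- c(91.5, 94.2, 92.2, 95.4, 91.8, 89.1, 94.7, 89.2)  # hypothetical catalyst 1 yields
yield2 <- c(89.2, 91.0, 90.5, 93.2, 97.2, 97.0, 91.1, 92.8)  # hypothetical catalyst 2 yields
t.test(yield1, yield2, var.equal = TRUE)   # pooled two-sample t test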
populations, and let X₁ and X₂ represent the number of observations that belong to the class of interest in samples 1 and 2, respectively. Furthermore, suppose that the normal approximation to the binomial is applied to each population, so the estimators of the population proportions P̂₁ = X₁/n₁ and P̂₂ = X₂/n₂ have approximate normal distributions. The pooled estimator of the common proportion is

P̂ = (X₁ + X₂)/(n₁ + n₂)

and the test statistic is

Z = (P̂₁ − P̂₂)/√(p̂(1 − p̂)(1/n₁ + 1/n₂))
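A sketch of this pooled test in R, using the counts from the speed-limit exercise in the questions below (385 of 500 and 267 of 400); prop.test gives an equivalent built-in chi-square version:

x1 <- 385; n1 <- 500; x2 <- 267; n2 <- 400
p_hat <- (x1 + x2) / (n1 + n2)            # pooled proportion
z <- (x1 / n1 - x2 / n2) / sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
2 * pnorm(-abs(z))                        # two-sided p-value
prop.test(c(x1, x2), c(n1, n2))           # equivalent built-in test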
7.2 Questions
R A random sample of 500 adult residents of Maricopa County found that 385
were in favor of increasing the highway speed limit to 75 mph, while another
sample of 400 adult residents of Pima County found that 267 were in favor
of the increased speed limit. Do these data indicate that there is a difference
in the support for increasing the speed limit between the residents of the two counties?
R Out of a sample of 150, selected from patients admitted over a two-year period
to a large hospital, 129 had some type of hospitalization insurance. In a sample
of 160 similarly selected patients from a second hospital, 144 had some type of
hospitalization insurance. Test the null hypothesis that p1 = p2 . Let α = 0.05.
8. Regression
Learning Objectives
Having worked through this chapter the student will be able to:
• Use simple linear regression for building empirical models of engineering and
scientific data.
• Understand how the method of least squares is used to estimate the parameters
in a linear regression model.
• Test statistical hypotheses and construct confidence intervals on regression model
parameters.
• Use the regression model to make a prediction of a future observation and
construct an appropriate prediction interval on the future observation.
• Apply the correlation model.
8.1.1 Introduction
Many problems in engineering and science involve exploring the relationships between
two or more variables. Regression analysis is a statistical technique that is very useful
for these types of problems. For example, in a chemical process, suppose that the yield
of the product is related to the process-operating temperature. Regression analysis
can be used to build a model to predict yield at a given temperature level. This model
can also be used for process optimization, such as finding the level of temperature
that maximizes yield, or for process control purposes. Other examples include studying the relationship between blood pressure and age, or between the concentration of an injected drug and heart rate.
Regression analysis is concerned with the study of the dependence of one variable, the
dependent variable, on one or more other variables, the independent or explanatory
variables with a view to estimating and predicting the (population) mean or average of
the former (dependent) in terms of the known or fixed (in repeated sampling) values
of the latter (independent).
Very often in practice, a relationship is found to exist between two (or more) variables
and one wishes to express this relationship in mathematical form by determining
an equation connecting the variables. Correlation analysis, on the other hand, is
concerned with measuring the strength of the relationship between variables. When
we compute measures of correlation from a set of data, we are interested in the degree
of the correlation between variables.
In the typical regression problem, the researcher has available for analysis a sample of
observations from some real or hypothetical population. Based on the result of his
8.1 Regression and Correlation Analysis 183
analysis of the sample data, he is interested in reaching decisions about the population
from which the sample is presumed to have been drawn. It is important that the
researcher understand the nature of the population in which he is interested.
In the simple linear regression model two variables X and Y, are of interest. The
variable X is usually referred to as the independent variable, while the other variable,
Y is called the dependent variable; and we speak of the regression of Y on X. The
following are the assumptions underlying the simple linear regression model:

E(Y|x) = µ_{Y|x} = α + βx

where α and β (intercept and slope) are called the population regression coefficients, and e is the error term with mean zero and variance σ². The random errors corresponding to different observations are also assumed to be uncorrelated random variables.
The results of n observations of the set of random variables X and Y can be summarized
by drawing a scatter diagram. A straight line passing closely to the points may be
drawn. The main problem arises when the points do not all lie exactly on the straight
line, but simply form a cloud of points around it. Thus, it may be possible by guess
work to draw quite a number of lines each of which will appear to be able to explain
the relationship between X and Y. We shall consider finding a best fit line. Such a
line will then be used as a model relating the random variable Y with the random
variable X.
Suppose that we have n pairs of observations (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ). The
following figure shows a typical scatter plot of observed data and a candidate for the
estimated regression line.
The estimates of α and β should result in a line that is (in some sense) a "best fit" to the data. The German scientist Karl Gauss proposed estimating the parameters α and β in Equation (8.1.2) so as to minimize the sum of the squares of the vertical deviations in the diagram. We call this criterion for estimating the regression coefficients the method of least squares. Using Equation (8.1.2), we may express the n observations in the sample as
yᵢ = α + βxᵢ + eᵢ,   i = 1, 2, ..., n

and the sum of the squares of the deviations of the observations from the true regression line is

L = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − α − βxᵢ)²
δL/δα = −2 Σᵢ₌₁ⁿ (yᵢ − α̂ − β̂xᵢ) = 0

δL/δβ = −2 Σᵢ₌₁ⁿ (yᵢ − α̂ − β̂xᵢ)xᵢ = 0

Simplifying these gives

nα̂ + β̂ Σᵢ₌₁ⁿ xᵢ = Σᵢ₌₁ⁿ yᵢ

α̂ Σᵢ₌₁ⁿ xᵢ + β̂ Σᵢ₌₁ⁿ xᵢ² = Σᵢ₌₁ⁿ xᵢyᵢ
These are called the least squares normal equations. The solution to the normal equations gives the least squares estimators α̂ and β̂. The least squares estimates of the intercept and slope in the simple linear regression model are

α̂ = ȳ − β̂x̄

β̂ = [Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n] / [Σxᵢ² − (Σxᵢ)²/n]
  = [n Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)] / [n Σxᵢ² − (Σxᵢ)²]

or, equivalently,

β̂ = [Σxᵢyᵢ − n x̄ȳ] / [Σxᵢ² − n x̄²]
We shall now find a and b, the estimates of α and β, so that the sum of the squares of the residuals is a minimum. The residual sum of squares is often called the Sum of Squares of Errors (SSE) about the regression line. This minimisation procedure for estimating the parameters is called the "method of least squares". Hence we shall find a and b so as to minimise

SSE = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − a − bxᵢ)²

δ(SSE)/δa = −2 Σᵢ₌₁ⁿ (yᵢ − a − bxᵢ)
δ(SSE)/δb = −2 Σᵢ₌₁ⁿ (yᵢ − a − bxᵢ)xᵢ
Setting the partial derivatives equal to zero and rearranging the terms, we obtain the equations (called the normal equations)

na + b Σᵢ₌₁ⁿ xᵢ = Σᵢ₌₁ⁿ yᵢ        ......(1)

a Σᵢ₌₁ⁿ xᵢ + b Σᵢ₌₁ⁿ xᵢ² = Σᵢ₌₁ⁿ xᵢyᵢ        ......(2)

which may be solved to give

b = [n Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)] / [n Σxᵢ² − (Σxᵢ)²] = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = SSxy/SSxx

a = (Σyᵢ − b Σxᵢ)/n = ȳ − bx̄
Equations (1) and (2) can also be solved using matrices as:
[ n    Σx  ] [ a ]   [ Σy  ]
[ Σx   Σx² ] [ b ] = [ Σxy ]

[ a ]   [ n    Σx  ]⁻¹ [ Σy  ]
[ b ] = [ Σx   Σx² ]   [ Σxy ]
Examples
Solution:
Using the above equation:

[ 8     140  ] [ a ]   [ 382  ]
[ 140   3500 ] [ b ] = [ 3870 ]

Solving gives b = (8 × 3870 − 140 × 382)/(8 × 3500 − 140²) = −2815/1050 = −2.681 and a = (382 − (−2.681)(140))/8 = 94.67.
It may be noted that the least-squares line passes through the point (x̄, ȳ), called the 'centroid' or centre of gravity of the data. The slope b of the regression line is independent of the origin of coordinates; it is therefore said that b is invariant under translation of the axes. Besides assuming that the regression of y on x is a linear function having the form E(Y|X) = α + βx, we have made three further assumptions which may be summarised as follows:
How it is computed in R
# Given quantities
n <- 8
sum_x <- 140
sum_y <- 382
sum_xy <- 3870
sum_x2 <- 3500

# Compute the least squares estimates from the summary quantities
# (the computation lines were cut off in the original and are completed here)
b <- (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x^2)
a <- (sum_y - b * sum_x) / n
print(paste("a =", a, "b =", b))
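With the raw data available, R's built-in lm() fits the same least-squares line directly; a minimal sketch with hypothetical x and y vectors:

x <- c(10, 15, 20, 25, 30, 35, 40, 45)   # hypothetical predictor values
y <- c(70, 62, 55, 48, 40, 36, 30, 25)   # hypothetical responses
fit <- lm(y ~ x)                          # least squares fit of y on x
coef(fit)                                 # intercept a and slope b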
Σyᵢ = 572, Σxᵢ² = 157.42, Σyᵢ² = 23,530, and Σxᵢyᵢ = 1697.80. Assume that the two variables are related according to the simple linear regression model.
The following data were obtained from a study investigating the relationship
between noise exposure and hypertension.
Y:  1  0  1  2  5  1  4  6  2  3  5  4  6  8  4  5  7  9  7  6
X: 60 63 65 70 70 70 80 90 80 80 85 89 90 90 90 90 94 100 100 100
Closely related but conceptually very much different from regression analysis is
correlation analysis, where the primary objective is to measure the strength or degree
of linear association between two variables. The correlation coefficient measures this
strength of (linear) association. For example, we may be interested in finding the
correlation between smoking and lung cancer; between scores on mathematics and
fluid mechanics examinations, between high school grades and college grades etc.
In regression analysis, as already noted, we are not primarily interested in such a
measure. Instead, we try to estimate the average value of one variable on the basis of
the fixed values of another variable.
The population correlation coefficient between two random variables X and Y is defined as

ρ = σ_XY / (σ_X σ_Y)

where σ_XY is the covariance between X and Y, and σ_X and σ_Y are the standard deviations of X and Y respectively. It is possible to draw inferences about the
correlation coefficient ρ using its estimator, the sample correlation coefficient, r. “r” is
the correlation coefficient between “n” pairs of observations whose values are (Xi , Yi )
and is given by
r = [Σxᵢyᵢ − n x̄ȳ] / √[(Σxᵢ² − n x̄²)(Σyᵢ² − n ȳ²)]
  = [n Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)] / √[(n Σxᵢ² − (Σxᵢ)²)(n Σyᵢ² − (Σyᵢ)²)]
3 It can be positive or negative, the sign depending on the sign of the term in the numerator, which measures the sample covariation of the two variables.
4 It lies between the limits of -1 and +1; that is, −1 ≤ r ≤ +1.
5 If X and Y are independent, the correlation coefficient between them is zero but
if r=0 it does not mean that the two variables are independent.
# Example data vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

# Compute Pearson correlation coefficient
correlation_pearson <- cor(x, y, method = "pearson")

# Print the result
print(correlation_pearson)
To test:

H0: ρ = 0
H1: ρ ≠ 0

the statistic t = r√(n − 2)/√(1 − r²), which follows a t-distribution with n − 2 degrees of freedom under H0, may be used.
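R's cor.test carries out exactly this t test; a sketch with hypothetical data vectors:

x <- c(1.1, 2.3, 3.1, 4.8, 5.6, 6.9)    # hypothetical
y <- c(2.0, 3.9, 6.4, 8.1, 11.0, 12.5)  # hypothetical
cor.test(x, y, method = "pearson")       # t test of H0: rho = 0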
8.4 Questions
R Using the following data, test the hypothesis that there is no linear correlation among the variables that generated them, at the 5% level of significance: SSxx = 0.11273, SSyy = 11,807,324,786, SSxy = 34,422.75972
Solution:

r = SSxy/√(SSxx · SSyy) = 34422.75972/√((0.11273)(11807324786)) = 0.9435

H0: ρ = 0
H1: ρ ≠ 0

α = 0.05, df = n − 2 = 27

t = r√(n − 2)/√(1 − r²) = 0.9435√27/√(1 − 0.9435²) ≈ 14.8, so P < 0.0001

Decision: Since t > t₀.₀₂₅(27), reject the hypothesis of no linear correlation.
More generally, if X and Y follow the bivariate normal distribution, it can be shown that the quantity ½ ln[(1 + r)/(1 − r)] is a random variable that follows approximately the normal distribution with mean ½ ln[(1 + ρ)/(1 − ρ)] and variance 1/(n − 3). The procedure is therefore to compute

z = (√(n − 3)/2) [ ln((1 + r)/(1 − r)) − ln((1 + ρ₀)/(1 − ρ₀)) ] = (√(n − 3)/2) ln[ (1 + r)(1 − ρ₀) / ((1 − r)(1 + ρ₀)) ]
R Considering the immediately preceding example data, test the null hypothesis that ρ = 0.9 against the alternative that ρ > 0.9 at the 5% level of significance.
Solution:
H0: ρ = 0.9
H1: ρ > 0.9
Critical region: Z > 1.645
Here z = (√26/2) ln[(1.9435)(0.1)/((0.0565)(1.9))] ≈ 1.51.
Decision: Since Z < Z₀.₀₅, there is no evidence that the correlation coefficient differs from 0.9.
In ordinary usage of this method, it is not necessary to use the formula for z; for r values between 0.0 and 0.99, tables containing Fisher Z values Z_f are available. In this case, to test H0: ρ = ρ₀ vs H1: ρ ≠ ρ₀, we compute

Z = √(n − 3) (Z_f(r) − Z_f(ρ₀))

The critical region is Z ≤ −Z_{α/2} and Z ≥ Z_{α/2}, where Z_f denotes the Fisher Z value.
x      y      x      y
23.1 10.5 37.9 22.8
32.8 16.7 30.5 14.1
31.8 18.2 25.1 12.9
32.0 17.0 12.4 8.8
30.4 16.3 35.1 17.4
24.0 10.5 31.5 14.9
39.5 23.1 21.1 10.5
24.2 12.4 27.6 16.1
52.5 24.9
R A group of eight athletes ran a 400 metres race twice. The times
in seconds were recorded as follows for each athlete.
Runner
1st Trial x 2nd Trial Y
48.4 48.0
51.2 54.3
48.6 49.4
49.5 48.4
51.6 54.0
49.3 47.2
50.8 51.8
49.7 50.3
# Load necessary library
library(ggplot2)

# Creating a data frame (the data.frame() wrapper and the plotting call
# were cut off in the original and are reconstructed here)
athletes <- data.frame(
  first_trial = c(48.4, 51.2, 48.6, 49.5, 51.6, 49.3, 50.8, 49.7),
  second_trial = c(48.0, 54.3, 49.4, 48.4, 54.0, 47.2, 51.8, 50.3)
)

# Plotting
ggplot(athletes, aes(x = first_trial, y = second_trial)) +
  geom_point() +
  labs(x = "1st trial time (s)", y = "2nd trial time (s)")