Data sampling and probability

Avjinder Singh Kaler and Kristi Mai

• Multiplication Rule: Complements and Conditional Probability
• Counting
• Types of Sampling Methods
• Summarizing Data
• Statistical Graphs
• Probability Distributions
• Normal and Standard Normal Distribution

• A conditional probability of an event is a probability obtained with the
additional information that some other event has already occurred.
• $P(B \mid A)$ denotes the conditional probability of event B occurring,
given that event A has already occurred. It can be found by
dividing the probability of events A and B both occurring by the
probability of event A:

\[ P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)} \]

Refer to Table 4-1 to find the following:
a) If 1 of the 1000 test subjects is randomly selected, find the probability that
the subject had a positive test result, given that the subject actually uses
drugs. That is, find P(positive test result | subject uses drugs).
b) If 1 of the 1000 test subjects is randomly selected, find the probability that
the subject actually uses drugs, given that he or she had a positive test
result. That is, find P(subject uses drugs | positive test result).

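Solution:
a) Of the 1000 subjects, 50 use drugs (44 + 6) and 44 of those tested positive:

\[ P(\text{positive test result} \mid \text{subject uses drugs}) = \frac{P(\text{subject uses drugs and had a positive test result})}{P(\text{subject uses drugs})} = \frac{44/1000}{50/1000} = \frac{44}{50} = 0.88 \]

b) Of the 1000 subjects, 134 tested positive (44 + 90) and 44 of those use drugs:

\[ P(\text{subject uses drugs} \mid \text{positive test result}) = \frac{P(\text{subject uses drugs and had a positive test result})}{P(\text{positive test result})} = \frac{44/1000}{134/1000} = \frac{44}{134} = 0.328 \]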
Table 4-1 Pre-Employment Drug Screening Results
                              Positive Test Result    Negative Test Result
Subject Uses Drugs            44 (True Positive)      6 (False Negative)
Subject Is Not a Drug User    90 (False Positive)     860 (True Negative)
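
The two conditional probabilities can be checked directly from the counts in Table 4-1. The sketch below is only an illustration: the dictionary layout, the function name `conditional_probability`, and the predicate names are assumptions made for this example, not part of the original slides.

```python
# Counts copied from Table 4-1; keys are (drug status, test result).
# The data layout and all names here are illustrative assumptions.
counts = {
    ("uses drugs", "positive"): 44,
    ("uses drugs", "negative"): 6,
    ("not a user", "positive"): 90,
    ("not a user", "negative"): 860,
}


def conditional_probability(counts, event, given):
    """Return P(event | given) = P(event and given) / P(given)."""
    total = sum(counts.values())
    p_given = sum(n for key, n in counts.items() if given(key)) / total
    p_both = sum(n for key, n in counts.items() if given(key) and event(key)) / total
    return p_both / p_given


def uses_drugs(key):
    return key[0] == "uses drugs"


def positive(key):
    return key[1] == "positive"


p_a = conditional_probability(counts, positive, uses_drugs)  # 44 / 50
p_b = conditional_probability(counts, uses_drugs, positive)  # 44 / 134
print(round(p_a, 3), round(p_b, 3))
```

Swapping the roles of the event and the condition changes the denominator from 50 to 134, which is why the two answers differ so much.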

• For a sequence of two events in which the first event can occur in m
ways and the second event can occur in n ways, the events together
can occur in a total of m · n ways.
Example:
For a two-character code consisting of a letter followed by a digit, the
number of different possible codes is 26 · 10 = 260.
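
As a quick check of the example, the codes can be listed by brute force. This is a minimal sketch that assumes the letters are the 26 uppercase English letters and the digits are 0 through 9.

```python
from itertools import product
from string import ascii_uppercase, digits

# Every (letter, digit) pair, joined into a two-character code.
codes = ["".join(pair) for pair in product(ascii_uppercase, digits)]

print(len(codes))                          # m * n = 26 * 10
print(len(ascii_uppercase) * len(digits))  # same count from the rule
```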

• The factorial symbol ! denotes the product of decreasing positive
whole numbers.
• For example, 4! = 4 · 3 · 2 · 1 = 24.
• By special definition, 0! = 1.

• n! = the number of different permutations (order counts) in which n different
items can be arranged when all n of them are selected. (This factorial rule
reflects the fact that the first item may be selected in n different ways, the
second item may be selected in n – 1 ways, and so on.)
Example:
The number of ways that the five letters {a, b, c, d, e} can be arranged is as
follows: 5! = 5 ∙ 4 ∙ 3 ∙ 2 ∙ 1 = 120

Requirements:
1. There are n different items available. (This rule does not apply if some of
the items are identical to others.)
2. We select r of the n items (without replacement).
3. We consider rearrangements of the same items to be different sequences.
(The permutation of ABC is different from CBA and is counted separately.)
If the preceding requirements are satisfied, the number of permutations (or
sequences) of r items selected from n available items (without replacement) is

\[ {}_nP_r = \frac{n!}{(n-r)!} \]



If the five letters {a, b, c, d, e} are available and three of them are to be
selected without replacement, the number of different permutations is
as follows:

\[ {}_nP_r = \frac{n!}{(n-r)!} = \frac{5!}{(5-3)!} = 60 \]
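
The count of 60 can be confirmed three ways: from the formula, from the standard-library helper `math.perm` (available in Python 3.8 and later), and by listing every ordered selection. The variable names are illustrative.

```python
from itertools import permutations
from math import factorial, perm

letters = ["a", "b", "c", "d", "e"]
n, r = len(letters), 3

by_formula = factorial(n) // factorial(n - r)      # n! / (n - r)!
by_library = perm(n, r)                            # same count via math.perm
by_listing = len(list(permutations(letters, r)))   # brute-force enumeration

print(by_formula, by_library, by_listing)
```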

Requirements:
1. There are n items available, and some items are identical to others.
2. We select all of the n items (without replacement).
3. We consider rearrangements of distinct items to be different sequences.
If the preceding requirements are satisfied, and if there are n1 alike, n2 alike,
. . . , nk alike, the number of permutations (or sequences) of all items selected
without replacement is

\[ \frac{n!}{n_1!\, n_2! \cdots n_k!} \]

If the 10 letters {a, a, a, a, b, b, c, c, d, e} are available and all 10 of
them are to be selected without replacement, the number of different
permutations is as follows:

\[ \frac{n!}{n_1!\, n_2! \cdots n_k!} = \frac{10!}{4!\, 2!\, 2!} = \frac{3{,}628{,}800}{24 \cdot 2 \cdot 2} = 37{,}800 \]
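
The same calculation can be sketched in code by counting how many times each letter repeats and dividing n! by the factorial of each count. The helper name `distinct_arrangements` is made up for this illustration.

```python
from collections import Counter
from math import factorial


def distinct_arrangements(items):
    """Number of distinct orderings of items when some items are identical."""
    result = factorial(len(items))        # n!
    for repeat in Counter(items).values():
        result //= factorial(repeat)      # divide by n1!, n2!, ..., nk!
    return result


letters = list("aaaabbccde")
print(distinct_arrangements(letters))
```

Letters that appear only once contribute a factor of 1! = 1, so they do not change the result.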

Requirements:
1. There are n different items available.
2. We select r of the n items (without replacement).
3. We consider rearrangements of the same items to be the same. (The
combination of ABC is the same as CBA.)
If the preceding requirements are satisfied, the number of combinations of r
items selected from n different items is

\[ {}_nC_r = \frac{n!}{(n-r)!\, r!} \]



In the Pennsylvania Match 6 Lotto, winning the jackpot requires that you
select six different numbers from 1 to 49. The winning numbers may be
drawn in any order. Find the probability of winning if one ticket is
purchased.

Number of combinations:

\[ {}_nC_r = \frac{n!}{(n-r)!\, r!} = \frac{49!}{43!\, 6!} = 13{,}983{,}816 \]

\[ P(\text{winning}) = \frac{1}{13{,}983{,}816} \]



When different orderings of the same items are to be counted
separately, we have a permutation problem, but when different
orderings are not to be counted separately, we have a combination
problem.
Permutations are for lists (order matters) and combinations are for
groups (order doesn’t matter).
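
The lottery count, and the difference between counting lists and counting groups, can be illustrated with `math.comb` and `math.perm` (Python 3.8 and later). This is a small sketch; the variable names are illustrative.

```python
from math import comb, perm

n, r = 49, 6

groups = comb(n, r)   # combinations: order does not matter
lists = perm(n, r)    # permutations: order matters

print(groups)             # number of possible tickets
print(1 / groups)         # probability that one ticket wins
print(lists // groups)    # each group of 6 can be ordered in 6! ways
```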

• Data – collections of observations, such as measurements, genders,
or survey responses
• Population – the complete collection of all individuals to be studied
• Sample – a subcollection of members selected from the population
• Census – the collection of data from every member of the population

Statistics is the science of:
• planning studies, designing experiments, and
obtaining data
• organizing, summarizing, analyzing, interpreting,
drawing conclusions about, and presenting data

The Gallup corporation collected data from 1013 adults in the United
States. Results showed that 66% of the respondents worried about
identity theft.
• The population consists of all 241,472,385 adults in the United States.
• The sample consists of the 1013 polled adults.
• The objective is to use the sample data as a basis for drawing a
conclusion about the whole population.

• Simple random sample
• Random sample
• Systematic sampling
• Convenience sampling
• Stratified sampling
• Cluster sampling

Simple random sample: A sample of n subjects is selected in such a way that
every possible sample of the same size n has the same chance of being chosen.

Random sample: Members from the population are selected in such a way that
each individual member in the population has an equal chance of being
selected.

Systematic sampling: Select some starting point and then select every kth
element in the population.

Convenience sampling: Use results that are easy to get.

Stratified sampling: Subdivide the population into at least two different
subgroups (or strata) so that subjects within the same subgroup share the same
characteristics, then draw a sample from each subgroup.

Cluster sampling: Divide the population area into sections (or clusters). Then
randomly select some of those clusters. Now choose all members from the
selected clusters.
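
To make the differences concrete, here is a rough sketch of four of these methods applied to a hypothetical population of 100 numbered members. The population, the odd/even strata, the cluster size, and the fixed seed are all assumptions chosen only to keep the example small and repeatable.

```python
import random

rng = random.Random(0)             # fixed seed so the sketch is repeatable
population = list(range(1, 101))   # hypothetical member IDs 1..100

# Simple random sample: every possible sample of size 10 is equally likely.
simple = rng.sample(population, 10)

# Systematic sampling: random starting point, then every k-th element.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sampling: form subgroups, then sample from every subgroup.
strata = {
    "odd": [m for m in population if m % 2 == 1],
    "even": [m for m in population if m % 2 == 0],
}
stratified = [m for group in strata.values() for m in rng.sample(group, 5)]

# Cluster sampling: form clusters, pick some clusters, keep all their members.
clusters = [population[i:i + 20] for i in range(0, len(population), 20)]
chosen_clusters = rng.sample(clusters, 2)
cluster_sample = [m for cluster in chosen_clusters for m in cluster]

print(len(simple), len(systematic), len(stratified), len(cluster_sample))
```

Note the contrast in the last two: stratified sampling takes some members from every subgroup, while cluster sampling takes every member from some subgroups.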

When working with large data sets, it is often helpful to
organize and summarize data by constructing a table called
a frequency distribution.

• Shows how a data set is partitioned among all of several
categories (or classes) by listing all of the categories along
with the number (frequency) of data values in each of them
• Lists all categories/classes and the number of observations in
each category/class

IQ Score    Frequency
50-69       2
70-89       33
90-109      35
110-129     7
130-149     1

Lower class limits are the smallest numbers that can actually belong to the
different classes (here 50, 70, 90, 110, and 130).

Upper class limits are the largest numbers that can actually belong to the
different classes (here 69, 89, 109, 129, and 149).

Class boundaries are the numbers used to separate classes, but without the
gaps created by class limits (here 49.5, 69.5, 89.5, 109.5, 129.5, and 149.5).

Class midpoints are the values in the middle of the classes and can be found
by adding the lower class limit to the upper class limit and dividing the sum
by 2 (here 59.5, 79.5, 99.5, 119.5, and 139.5).

\[ \text{class midpoint} = \frac{\text{lower limit} + \text{upper limit}}{2} \]

Class width is the difference between two consecutive lower class limits or
two consecutive lower class boundaries (here 20 for every class).
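
The boundaries, midpoints, and widths for the IQ score classes can be derived from the class limits alone. The sketch below assumes, as in the table, that the gap between one class's upper limit and the next class's lower limit is the same throughout; the variable names are illustrative.

```python
# (lower limit, upper limit) for each IQ score class in the table.
classes = [(50, 69), (70, 89), (90, 109), (110, 129), (130, 149)]

# Gap between an upper limit and the next lower limit (70 - 69 = 1).
gap = classes[1][0] - classes[0][1]

# Boundaries sit halfway across each gap, plus one at each outer end.
boundaries = [classes[0][0] - gap / 2] + [upper + gap / 2 for _, upper in classes]

# Midpoint = (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# Width = difference between consecutive lower class limits.
widths = [nxt[0] - cur[0] for cur, nxt in zip(classes, classes[1:])]

print(boundaries)
print(midpoints)
print(widths)
```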


A relative frequency distribution includes the same class limits as a
frequency distribution, but the frequency of a class is replaced with a
relative frequency (a proportion) or a percentage frequency (a percent).

\[ \text{relative frequency} = \frac{\text{class frequency}}{\text{sum of all frequencies}} \]

\[ \text{percentage frequency} = \frac{\text{class frequency}}{\text{sum of all frequencies}} \times 100\% \]

IQ Score    Frequency    Relative Frequency
50-69       2            2.6%
70-89       33           42.3%
90-109      35           44.9%
110-129     7            9.0%
130-149     1            1.3%

Cumulative Frequencies
IQ Score    Frequency    Cumulative Frequency
50-69       2            2
70-89       33           35
90-109      35           70
110-129     7            77
130-149     1            78
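
The percentage and cumulative columns follow mechanically from the frequency column. A minimal sketch using the IQ score frequencies from the tables above (variable names are illustrative):

```python
from itertools import accumulate

classes = ["50-69", "70-89", "90-109", "110-129", "130-149"]
frequencies = [2, 33, 35, 7, 1]

total = sum(frequencies)  # sum of all frequencies

# Percentage frequency = class frequency / total * 100, rounded to 1 decimal.
percentages = [round(100 * f / total, 1) for f in frequencies]

# Cumulative frequency = running total of the class frequencies.
cumulative = list(accumulate(frequencies))

for name, f, pct, cum in zip(classes, frequencies, percentages, cumulative):
    print(f"{name:<8} {f:>3} {pct:>5}% {cum:>3}")
```

The last cumulative frequency always equals the total number of data values, which is a handy check on the arithmetic.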

• The frequencies start low, then
increase to higher frequencies until
reaching a maximum, and then
decrease to low again.
• The distribution is approximately
symmetric, with the frequencies
preceding the maximum being roughly
a mirror image of those that follow
the maximum.

• Numerical in nature
• Consists of numbers representing counts or measurements
• Has units and can be used arithmetically
• Quantitative data can be further described by distinguishing
between discrete and continuous types.
Examples:
• The weights of supermodels
• The ages of respondents

Discrete data: the number of possible values is either a finite number or a
‘countable’ number (i.e., the number of possible values is 0,
1, 2, 3, . . .).
Example:
The number of eggs that a hen lays

Continuous data: infinitely many possible values that correspond to some
continuous scale that covers a range of values without gaps,
interruptions, or jumps
Example:
The amount of milk that a cow produces;
e.g., 2.343115 gallons per day

Categorical (qualitative) data: consists of names or labels (representing
categories)
Examples:
• The gender (male/female) of professional athletes.
• Shirt numbers on professional athletes' uniforms, which are substitutes for names.

A bar graph:
• Uses bars of equal width to show
frequencies of categorical, or
qualitative, data
• Vertical scale represents frequencies or
relative frequencies.
• Horizontal scale identifies the different
categories of qualitative data.

A multiple bar graph has two or more sets of bars and is used to
compare two or more data sets.

A Pareto chart is a bar graph for qualitative data, with the bars arranged in
descending order according to frequencies.

A pie chart is a graph depicting qualitative data as slices of a circle, in
which the size of each slice is proportional to the frequency count.

• Random variable – a variable (typically represented by x) that has a single
numerical value, determined by chance, for each outcome of a given
procedure
• Can be discrete or continuous, just like data

• Discrete Random Variable
has either a finite number of values or a countable number of values, where
“countable” refers to the fact that there might be infinitely many
values, but that they result from a counting process
• Continuous Random Variable
has infinitely many values, and those values can be associated with
measurements on a continuous scale without gaps or interruptions.

• Probability distribution – a description that gives the probability for each
value of the random variable
• Often expressed in the format of a graph, table, or formula
Note:
If a probability is very small, it is represented as 0+ in tables
(i.e., it is very small, yet positive).

1. There is a numerical random variable x and its values are
associated with corresponding probabilities.
2. The sum of all probabilities must be 1: $\sum P(x) = 1$.
3. Each probability value must be between 0 and 1 inclusive: $0 \le P(x) \le 1$.

The probability histogram is very similar to a relative frequency
histogram, but the vertical scale shows probabilities.
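
The requirements that the probabilities sum to 1 and that each lies between 0 and 1 are easy to check in code. The sketch below is illustrative only: the function name is made up, and the example distribution for the number of girls in two births assumes that boys and girls are equally likely, which the slides do not state.

```python
import math


def is_probability_distribution(dist, tol=1e-9):
    """Check that every P(x) is in [0, 1] and that the P(x) values sum to 1."""
    probs = list(dist.values())
    each_in_range = all(0 <= p <= 1 for p in probs)
    sums_to_one = math.isclose(sum(probs), 1.0, abs_tol=tol)
    return each_in_range and sums_to_one


# Assumed distribution: number of girls among two children.
girls = {0: 0.25, 1: 0.50, 2: 0.25}
print(is_probability_distribution(girls))

# Fails: the probabilities sum to 1.1.
print(is_probability_distribution({0: 0.5, 1: 0.6}))
```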

• According to the range rule of thumb, most values should lie within 2
standard deviations of the mean.
• We can therefore identify “unusual” values by determining if they lie
outside these limits:
Maximum usual value = μ + 2σ
Minimum usual value = μ − 2σ

We found that for families with two children, the mean number of girls is 1.0
and the standard deviation is 0.7 girls.
Use those values to find the maximum and minimum usual values for the
number of girls.
Solution:
maximum usual value = μ + 2σ = 1.0 + 2(0.7) = 2.4
minimum usual value = μ − 2σ = 1.0 − 2(0.7) = −0.4

Rare Event Rule for Inferential Statistics
If, under a given assumption (such as the assumption that a coin is fair), the
probability of a particular observed event (such as 992 heads in 1000 tosses of
a coin) is extremely small, we conclude that the assumption is probably not
correct.

Using Probabilities to Determine When Results Are Unusual
• Unusually high number of successes: x successes among n trials is an
unusually high number of successes if P(x or more) ≤ 0.05.
• Unusually low number of successes: x successes among n trials is an
unusually low number of successes if P(x or fewer) ≤ 0.05.
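
As an illustration of the "x or more" criterion, the sketch below evaluates the coin example from the rare event rule (992 heads in 1000 tosses). It assumes the tosses are independent with a fixed probability of heads, so that the number of heads follows a binomial model; the slides do not name a model at this point, and the function name is made up for the example.

```python
from fractions import Fraction
from math import comb


def prob_at_least(x, n, p):
    """P(x or more successes in n independent trials), assuming a binomial model."""
    p = Fraction(p)
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


# Exact tail probability of 992 or more heads in 1000 tosses of a fair coin.
tail = prob_at_least(992, 1000, Fraction(1, 2))

print(tail <= Fraction(5, 100))  # is 992 heads an unusually high count?
print(float(tail))
```

Exact fractions are used so that the extremely small tail probability is not lost to floating-point rounding before the comparison with 0.05.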

A density curve is the graph of a continuous probability
distribution. It must satisfy the following properties:
1. The total area under the curve must equal 1.
2. Every point on the curve must have a vertical height that is 0 or
greater. (That is, the curve cannot fall below the x-axis.)

Because the total area under the density curve is equal to 1, there is a
correspondence between area and probability.

A continuous random variable has a uniform distribution if its values are
spread evenly over the range of possibilities. The graph of a uniform
distribution results in a rectangular shape.

Given the uniform distribution illustrated, find the probability that a
randomly selected voltage level is greater than 124.5 volts.
[Figure: uniform density curve; the shaded area represents voltage levels
greater than 124.5 volts.]

A continuous random variable has a normal distribution if it has a graph that
is symmetric and bell-shaped and if the random variable can be described by
the following equation:

\[ f(x) = \frac{e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}}{\sigma\sqrt{2\pi}} \]

The standard normal distribution is a normal probability distribution with
μ = 0 and σ = 1. The total area under its density curve is equal to 1.

• A z-score represents how much a given value, x, deviates from the center of a
set of data.
• This value can help to assess how “extreme” a particular data value is, based
on the distribution the value is supposed to follow.
• This score can also be used to convert sample data (sample statistics) to a
measure of relative standing so that we can compare samples to
one another.
• Basic idea behind formulas for z-scores:

\[ Z = \frac{\text{the value} - \text{the center of the data}}{\text{a deviation measure}} \]

• If the z-score is positive (+), the specific value falls above the center
value.
• If the z-score is negative (-), the specific value falls below the center
value.
• “Usual” values have z-scores between -2 and 2.
• “Unusual” values have z-scores less than -2 or greater than 2.
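
The z-score idea and the range rule of thumb describe the same cutoff: a value is "usual" when it lies within 2 deviation units of the center. The sketch below reuses the two-children example (mean 1.0 girls, standard deviation 0.7); the function names are illustrative, and the mean and standard deviation are taken as the center and the deviation measure.

```python
def z_score(value, center, deviation):
    """(the value - the center of the data) / a deviation measure."""
    return (value - center) / deviation


def is_unusual(value, center, deviation):
    """Unusual means a z-score less than -2 or greater than 2."""
    return abs(z_score(value, center, deviation)) > 2


mu, sigma = 1.0, 0.7  # two-children example: mean and standard deviation

# Range rule of thumb limits, equivalent to z = -2 and z = +2.
min_usual = mu - 2 * sigma
max_usual = mu + 2 * sigma
print(round(min_usual, 1), round(max_usual, 1))

# Two girls among two children: positive z-score, but still a usual value.
print(round(z_score(2, mu, sigma), 2), is_unusual(2, mu, sigma))
```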

• We can find areas (probabilities) for different regions under
a normal model using StatCrunch.

A bone mineral density test can be helpful in identifying the presence of
osteoporosis.
The result of the test is commonly measured as a z score, which has a
normal distribution with a mean of 0 and a standard deviation of 1.
A randomly selected adult undergoes a bone density test.
Find the probability that the result is a reading less than 1.27.

The probability of a randomly selected adult having a bone
density reading less than 1.27 is 0.8980:
P(z < 1.27) = 0.8980

Using the same bone density test, find the probability that a randomly
selected person has a result above –1.00 (which is considered to be in
the “normal” range of bone density readings).
The probability of a randomly
selected adult having a bone
density reading above –1.00 is 0.8413.

A bone density reading between –1.00 and –2.50 indicates the subject has
osteopenia. Find this probability.
The probability of a randomly selected adult having osteopenia is 0.1525.
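
The slides use StatCrunch for these areas; as an alternative sketch, the three bone density probabilities can be approximated with `statistics.NormalDist` from the Python standard library (Python 3.8 and later). The variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal distribution

p_below = z.cdf(1.27)                    # P(z < 1.27)
p_above = 1 - z.cdf(-1.00)               # P(z > -1.00)
p_between = z.cdf(-1.00) - z.cdf(-2.50)  # P(-2.50 < z < -1.00)

print(round(p_below, 4))
print(round(p_above, 4))
print(round(p_between, 4))
```

The first two values agree with 0.8980 and 0.8413 to four decimal places. The third may differ from 0.1525 in the last digit, because a table-based answer subtracts two areas that have each already been rounded.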

P(a < z < b) denotes the probability that the z score is between a and b.
P(z > a) denotes the probability that the z score is greater than a.
P(z < a) denotes the probability that the z score is less than a.

Finding the 95th Percentile
[Figure: standard normal curve with an area of 5% (0.05) in the right tail.
The 95th percentile is z = 1.645; the z score is positive.]

Using the same bone density test, find the bone density score that
separates the bottom 2.5% and the score that separates the top
2.5%.

For the standard normal distribution, a critical value is a z score
separating unlikely values from those that are likely to occur.
Notation:
The expression zα denotes the z score with an area of α to its right.

Find the value of z0.025.
The notation z0.025 is used to represent the z score with an area of
0.025 to its right.
Referring back to the bone density example,
z0.025 = 1.96.
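
Going from an area back to a z score is the inverse of the earlier area lookups. The sketch below uses `NormalDist.inv_cdf` from the Python standard library; the helper `critical_value` is a made-up name for this illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# 95th percentile: the z score with an area of 0.95 to its left.
p95 = z.inv_cdf(0.95)

# Scores separating the bottom 2.5% and the top 2.5%.
bottom = z.inv_cdf(0.025)
top = z.inv_cdf(0.975)


def critical_value(alpha):
    """z score with an area of alpha to its right."""
    return NormalDist().inv_cdf(1 - alpha)


print(round(p95, 3))
print(round(bottom, 2), round(top, 2))
print(round(critical_value(0.025), 2))
```

Because the curve is symmetric about 0, the bottom 2.5% cutoff is simply the negative of the top 2.5% cutoff.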

• Complete HW1 and HW2 on MLP
