DATA SAMPLING AND
PROBABILITY
Avjinder Singh Kaler and Kristi Mai
• Multiplication Rule: Complements and Conditional Probability
• Counting
• Types of Sampling Methods
• Summarizing Data
• Statistical Graphs
• Probability Distributions
• Normal and Standard Normal Distribution
• A conditional probability of an event is a probability obtained with the
additional information that some other event has already occurred.
• $P(B \mid A)$ denotes the conditional probability of event B occurring,
given that event A has already occurred. It can be found by
dividing the probability of events A and B both occurring by the
probability of event A:
$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$$
Refer to Table 4-1 to find the following:
a) If 1 of the 1000 test subjects is randomly selected, find the probability that
the subject had a positive test result, given that the subject actually uses
drugs. That is, find $P(\text{positive test result} \mid \text{subject uses drugs})$.
b) If 1 of the 1000 test subjects is randomly selected, find the probability that
the subject actually uses drugs, given that he or she had a positive test
result. That is, find $P(\text{subject uses drugs} \mid \text{positive test result})$.
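Solution:
a) Of the 1000 subjects, 50 use drugs and 44 both use drugs and tested positive:
$$P(\text{positive test result} \mid \text{subject uses drugs}) = \frac{P(\text{subject uses drugs and had a positive test result})}{P(\text{subject uses drugs})} = \frac{44/1000}{50/1000} = \frac{44}{50} = 0.88$$
b) Of the 1000 subjects, 44 + 90 = 134 tested positive:
$$P(\text{subject uses drugs} \mid \text{positive test result}) = \frac{P(\text{subject uses drugs and had a positive test result})}{P(\text{positive test result})} = \frac{44/1000}{134/1000} = \frac{44}{134} = 0.328$$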
Table 4-1 Pre-Employment Drug Screening Results

                             Positive Test Result   Negative Test Result
Subject Uses Drugs           44 (True Positive)     6 (False Negative)
Subject Is Not a Drug User   90 (False Positive)    860 (True Negative)
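
The two conditional probabilities can also be checked numerically. The following is a minimal Python sketch that works directly from the counts in Table 4-1; the dictionary layout and the helper name `conditional_probability` are illustrative choices, not part of the original slides.

```python
# Counts from Table 4-1, keyed by (actual status, test result).
counts = {
    ("uses_drugs", "positive"): 44,
    ("uses_drugs", "negative"): 6,
    ("no_drugs", "positive"): 90,
    ("no_drugs", "negative"): 860,
}


def conditional_probability(counts, event, given):
    """Return P(event | given); both arguments are predicates on a table cell."""
    given_total = sum(n for cell, n in counts.items() if given(cell))
    both_total = sum(n for cell, n in counts.items() if given(cell) and event(cell))
    return both_total / given_total


def uses_drugs(cell):
    return cell[0] == "uses_drugs"


def positive(cell):
    return cell[1] == "positive"


p_pos_given_user = conditional_probability(counts, positive, uses_drugs)
p_user_given_pos = conditional_probability(counts, uses_drugs, positive)
print(round(p_pos_given_user, 3), round(p_user_given_pos, 3))  # 0.88 0.328
```

Swapping the event and the condition changes the denominator (50 drug users versus 134 positive results), which is why the two answers differ so much.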
 For a sequence of two events in which the first event can occur 𝑚
ways and the second event can occur 𝑛 ways, the events together
can occur a total of 𝑚 ∗ 𝑛 ways.
Example:
For a two-character code consisting of a letter followed by a digit, the
number of different possible codes is 26 ∗ 10 = 260.
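
As a quick check of the counting rule, the sketch below enumerates every letter-digit code with Python's `itertools.product`. It assumes the 26 uppercase English letters and the digits 0 through 9.

```python
from itertools import product
from string import ascii_uppercase, digits

# Every code is one letter followed by one digit.
codes = ["".join(pair) for pair in product(ascii_uppercase, digits)]

print(len(codes))  # 260, matching 26 * 10
```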
• The factorial symbol ! denotes the product of decreasing positive
whole numbers.
• For example, $4! = 4 \cdot 3 \cdot 2 \cdot 1 = 24$.
• By special definition, $0! = 1$.
• n! is the number of different permutations (order counts) of n different items
when all n of them are selected. (This factorial rule reflects the
fact that the first item may be selected in n different ways, the second item
may be selected in n – 1 ways, and so on.)
Example:
The number of ways that the five letters {a, b, c, d, e} can be arranged is as
follows: 5! = 5 ∙ 4 ∙ 3 ∙ 2 ∙ 1 = 120
Requirements:
1. There are n different items available. (This rule does not apply if some of
the items are identical to others.)
2. We select r of the n items (without replacement).
3. We consider rearrangements of the same items to be different sequences.
(The permutation of ABC is different from CBA and is counted separately.)
If the preceding requirements are satisfied, the number of permutations (or
sequences) of r items selected from n available items (without replacement) is
$${}_nP_r = \frac{n!}{(n-r)!}$$
If the five letters {a, b, c, d, e} are available and three of them are to be
selected without replacement, the number of different permutations is
as follows:
$${}_nP_r = \frac{n!}{(n-r)!} = \frac{5!}{(5-3)!} = \frac{120}{2} = 60$$
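
The factorial and permutation counts above can be reproduced with Python's standard library. This is a small sketch, assuming Python 3.8 or later (for `math.perm`); the variable names are illustrative.

```python
import math
from itertools import permutations

letters = ["a", "b", "c", "d", "e"]

# Factorial rule: arrangements of all five letters.
all_arrangements = math.factorial(len(letters))

# Permutations rule: ordered selections of three of the five letters.
three_letter_count = math.perm(len(letters), 3)

# Cross-check by listing the ordered selections explicitly.
three_letter_lists = list(permutations(letters, 3))

print(all_arrangements, three_letter_count, len(three_letter_lists))  # 120 60 60
```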
Requirements:
1. There are n items available, and some items are identical to others.
2. We select all of the n items (without replacement).
3. We consider rearrangements of distinct items to be different sequences.
If the preceding requirements are satisfied, and if there are n1 alike, n2 alike,
. . . , nk alike, the number of permutations (or sequences) of all items selected
without replacement is
$$\frac{n!}{n_1!\, n_2! \cdots n_k!}$$
If the 10 letters {a, a, a, a, b, b, c, c, d, e} are available and all 10 of
them are to be selected without replacement, the number of different
permutations is as follows:
$$\frac{n!}{n_1!\, n_2! \cdots n_k!} = \frac{10!}{4!\, 2!\, 2!} = \frac{3{,}628{,}800}{24 \cdot 2 \cdot 2} = 37{,}800$$
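
The same count can be computed in code. The sketch below divides n! by the factorial of each group of identical items; the function name `distinct_arrangements` is a hypothetical helper introduced here for illustration.

```python
import math
from collections import Counter
from itertools import permutations


def distinct_arrangements(items):
    """Number of distinguishable orderings when some items are identical."""
    result = math.factorial(len(items))
    for repeat in Counter(items).values():
        result //= math.factorial(repeat)  # divide out orderings of identical items
    return result


print(distinct_arrangements("aaaabbccde"))  # 37800

# Brute-force cross-check on a small case: {a, a, b} has 3 distinct orderings.
small_check = len(set(permutations("aab")))
```

Items that appear only once (d and e here) contribute a factor of 1! = 1, so they do not change the denominator.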
Requirements:
1. There are n different items available.
2. We select r of the n items (without replacement).
3. We consider rearrangements of the same items to be the same. (The
combination of ABC is the same as CBA.)
If the preceding requirements are satisfied, the number of combinations of r
items selected from n different items is
$${}_nC_r = \frac{n!}{(n-r)!\, r!}$$
In the Pennsylvania Match 6 Lotto, winning the jackpot requires that you
select six different numbers from 1 to 49. The winning numbers may be
drawn in any order. Find the probability of winning if one ticket is
purchased.
Number of combinations:
$${}_nC_r = \frac{n!}{(n-r)!\, r!} = \frac{49!}{43!\, 6!} = 13{,}983{,}816$$
$$P(\text{winning}) = \frac{1}{13{,}983{,}816}$$
When different orderings of the same items are to be counted
separately, we have a permutation problem, but when different
orderings are not to be counted separately, we have a combination
problem.
Permutations are for lists (order matters) and combinations are for
groups (order doesn’t matter).
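
The lottery count, and the difference between treating it as a permutation or a combination problem, can be verified with a short Python sketch. It assumes Python 3.8 or later for `math.comb` and `math.perm`.

```python
import math
from fractions import Fraction

# Order does not matter for a lottery ticket, so this is a combination problem.
n_combinations = math.comb(49, 6)
p_win = Fraction(1, n_combinations)

# If order mattered, each group of 6 numbers would be counted 6! times.
n_ordered = math.perm(49, 6)

print(n_combinations)                # 13983816
print(n_ordered // n_combinations)   # 720, which is 6!
```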
• Data – collections of observations, such as measurements, genders,
or survey responses
• Population – the complete collection of all individuals to be studied
• Sample – a subcollection of the population, from which the data come
• Census – the collection of data from every member of the population
Statistics is the science of:
• planning studies, designing experiments, and
obtaining data
• organizing, summarizing, analyzing, interpreting,
drawing conclusions about, and presenting data
The Gallup corporation collected data from 1013 adults in the United
States. Results showed that 66% of the respondents worried about
identity theft.
 The population consists of all 241,472,385 adults in the United States.
 The sample consists of the 1013 polled adults.
 The objective is to use the sample data as a basis for drawing a
conclusion about the whole population.
• Simple random sample: a sample of n subjects is selected in such a way that
every possible sample of the same size n has the same chance of being chosen.
• Random sample: members from the population are selected in such a way that
each individual member in the population has an equal chance of being
selected.
• Systematic sampling: select some starting point and then select every kth
element in the population.
• Convenience sampling: use results that are easy to get.
• Stratified sampling: subdivide the population into at least two different
subgroups (or strata) whose members share the same characteristics, then draw
a sample from each subgroup.
• Cluster sampling: divide the population area into sections (or clusters), then
randomly select some of those clusters and choose all members from the
selected clusters.
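
To make the differences between these methods concrete, here is a minimal Python sketch of four of them. The population (the integers 0 to 99), the grouping functions, and all function names are hypothetical and chosen only for illustration; convenience sampling is omitted because it has no random mechanism to code.

```python
import random


def simple_random_sample(population, n, rng):
    """Every possible sample of size n is equally likely."""
    return rng.sample(population, n)


def systematic_sample(population, k, rng):
    """Pick a random starting point, then take every kth element."""
    start = rng.randrange(k)
    return population[start::k]


def group_by(population, key):
    groups = {}
    for member in population:
        groups.setdefault(key(member), []).append(member)
    return groups


def stratified_sample(population, stratum_of, per_stratum, rng):
    """Draw a random sample from each stratum."""
    sample = []
    for members in group_by(population, stratum_of).values():
        sample.extend(rng.sample(members, per_stratum))
    return sample


def cluster_sample(population, cluster_of, n_clusters, rng):
    """Randomly choose whole clusters and keep all of their members."""
    clusters = group_by(population, cluster_of)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for c in chosen for member in clusters[c]]


rng = random.Random(0)          # fixed seed so the run is repeatable
population = list(range(100))   # hypothetical population of 100 subjects

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(population, lambda m: m % 2, 5, rng)
clustered = cluster_sample(population, lambda m: m // 10, 2, rng)
```

Note the contrast between the last two: stratified sampling takes some members from every subgroup, while cluster sampling takes every member from some subgroups.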
When working with large data sets, it is often helpful to
organize and summarize data by constructing a table called
a frequency distribution.
 Shows how a data set is partitioned among all of several
categories (or classes) by listing all of the categories along
with the number (frequency) of data values in each of them
 All categories/classes and the number of observations in
that given category/class
IQ Score    Frequency
50-69       2
70-89       33
90-109      35
110-129     7
130-149     1

Lower class limits are the smallest numbers that can actually belong to
the different classes: 50, 70, 90, 110, 130.

Upper class limits are the largest numbers that can actually belong to
the different classes: 69, 89, 109, 129, 149.

Class boundaries are the numbers used to separate classes, but without
the gaps created by class limits: 49.5, 69.5, 89.5, 109.5, 129.5, 149.5.

Class midpoints are the values in the middle of the classes and can be
found by adding the lower class limit to the upper class limit and
dividing the sum by 2: 59.5, 79.5, 99.5, 119.5, 139.5.
$$\text{class midpoint} = \frac{\text{lower limit} + \text{upper limit}}{2}$$

Class width is the difference between two consecutive lower class limits
or two consecutive lower class boundaries: 20 for every class.

A relative frequency distribution includes the same class limits as a
frequency distribution, but the frequency of each class is replaced with a
relative frequency (a proportion) or a percentage frequency (a percent).
$$\text{relative frequency} = \frac{\text{class frequency}}{\text{sum of all frequencies}}$$
$$\text{percentage frequency} = \frac{\text{class frequency}}{\text{sum of all frequencies}} \times 100\%$$
IQ Score Frequency Relative Frequency
50-69 2 2.6%
70-89 33 42.3%
90-109 35 44.9%
110-129 7 9.0%
130-149 1 1.3%
Cumulative Frequencies
IQ Score Frequency Cumulative Frequency
50-69 2 2
70-89 33 35
90-109 35 70
110-129 7 77
130-149 1 78
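
The quantities derived from the IQ score table (midpoints, boundaries, width, percentage frequencies, and cumulative frequencies) can be recomputed with a few lines of Python. This is a sketch built on the class limits and frequencies shown in the tables; the variable names are illustrative.

```python
from itertools import accumulate

classes = [(50, 69), (70, 89), (90, 109), (110, 129), (130, 149)]
frequencies = [2, 33, 35, 7, 1]

total = sum(frequencies)

# Midpoint: (lower limit + upper limit) / 2
midpoints = [(low + high) / 2 for low, high in classes]

# Width: difference between consecutive lower class limits
width = classes[1][0] - classes[0][0]

# Boundaries sit halfway across the gap between adjacent classes.
gap = classes[1][0] - classes[0][1]
boundaries = [classes[0][0] - gap / 2] + [high + gap / 2 for _, high in classes]

# Percentage frequency: class frequency / sum of all frequencies * 100
percentages = [round(100 * f / total, 1) for f in frequencies]

# Cumulative frequency: running total of the class frequencies
cumulative = list(accumulate(frequencies))

print(midpoints)    # [59.5, 79.5, 99.5, 119.5, 139.5]
print(boundaries)   # [49.5, 69.5, 89.5, 109.5, 129.5, 149.5]
print(percentages)  # [2.6, 42.3, 44.9, 9.0, 1.3]
print(cumulative)   # [2, 35, 70, 77, 78]
```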
 The frequencies start low, then
increase to higher frequencies until
reaching a maximum, and then
decrease to low again.
 The distribution is approximately
symmetric
• frequencies preceding the
maximum being roughly a mirror
image of those that follow the
maximum
 Numerical in nature
 Consists of numbers representing counts or measurements
 Have a unit and can be used arithmetically
 Quantitative data can be further described by distinguishing
between discrete and continuous types.
Examples:
• The weights of supermodels
• The ages of respondents
Discrete data: the number of possible values is either a finite number or a
‘countable’ number (i.e., the number of possible values is 0,
1, 2, 3, . . .).
Example:
The number of eggs that a hen lays
Continuous data: infinitely many possible values that correspond to some
continuous scale that covers a range of values without gaps,
interruptions, or jumps.
Example:
The amount of milk that a cow produces;
e.g., 2.343115 gallons per day
Categorical (qualitative) data: consists of names or labels (representing categories).
Examples:
• The gender (male/female) of professional athletes
• Shirt numbers on professional athletes' uniforms, which substitute for names
• Uses bars of equal width to show
frequencies of categorical, or
qualitative, data
• Vertical scale represents frequencies or
relative frequencies.
• Horizontal scale identifies the different
categories of qualitative data.
A multiple bar graph has two or more sets of bars and is used to
compare two or more data sets.
Pareto chart: a bar graph for qualitative data, with the bars arranged in
descending order according to frequencies.
Pie chart: a graph depicting qualitative data as slices of a circle, in which
the size of each slice is proportional to the frequency count.
• Random variable: a variable (typically represented by x) that has a single
numerical value, determined by chance, for each outcome of a given
procedure
• Can be discrete or continuous – just like data
• Discrete random variable: has either a finite number of values or a countable
number of values, where “countable” refers to the fact that there might be
infinitely many values, but that they result from a counting process
• Continuous random variable: has infinitely many values, and those values can
be associated with measurements on a continuous scale without gaps or
interruptions
• Probability distribution: a description that gives the probability for each
value of the random variable
• Often expressed in the format of a graph, table, or formula
Note:
If a probability is very small, it is represented as 0+ in tables
(i.e., it is very small, yet positive).
1. There is a numerical random variable x and its values are
associated with corresponding probabilities.
2. The sum of all probabilities must be 1: $\sum P(x) = 1$
3. Each probability value must be between 0 and 1 inclusive: $0 \le P(x) \le 1$
The probability histogram is very similar to a relative frequency
histogram, but the vertical scale shows probabilities.
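
The numerical requirements for a probability distribution (probabilities summing to 1, each between 0 and 1) are easy to check in code. The sketch below is illustrative only: the function name is hypothetical, and the example distribution for the number of girls in two births assumes boys and girls are equally likely, which the slides do not state explicitly.

```python
import math


def is_valid_distribution(probabilities, tol=1e-9):
    """Check that every P(x) is in [0, 1] and that the P(x) values sum to 1."""
    values = list(probabilities.values())
    each_in_range = all(0 <= p <= 1 for p in values)
    sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
    return each_in_range and sums_to_one


# Assumed distribution: number of girls in two births, equally likely sexes.
girls_in_two_births = {0: 0.25, 1: 0.50, 2: 0.25}

# A table that fails requirement 2 (the probabilities sum to 1.1).
not_a_distribution = {0: 0.5, 1: 0.6}

print(is_valid_distribution(girls_in_two_births))  # True
print(is_valid_distribution(not_a_distribution))   # False
```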
 According to the range rule of thumb, most values should lie within 2
standard deviations of the mean.
 We can therefore identify “unusual” values by determining if they lie
outside these limits:
Maximum usual value = $\mu + 2\sigma$
Minimum usual value = $\mu - 2\sigma$
We found for families with two children, the mean number of girls is 1.0
and the standard deviation is 0.7 girls.
Use those values to find the maximum and minimum usual values for the
number of girls.
Solution:
maximum usual value = $\mu + 2\sigma = 1.0 + 2(0.7) = 2.4$
minimum usual value = $\mu - 2\sigma = 1.0 - 2(0.7) = -0.4$
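
The mean, standard deviation, and usual-value limits can be reproduced with a short Python sketch. It assumes the distribution P(0) = 0.25, P(1) = 0.50, P(2) = 0.25 for the number of girls (equally likely sexes), which is an assumption consistent with the stated mean of 1.0 and standard deviation of 0.7 but is not given in the slides.

```python
import math

# Assumed distribution for the number of girls among two children.
distribution = {0: 0.25, 1: 0.50, 2: 0.25}

mean = sum(x * p for x, p in distribution.items())
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
sd = math.sqrt(variance)


def usual_limits(mu, sigma):
    """Range rule of thumb: usual values lie within 2 standard deviations."""
    return mu - 2 * sigma, mu + 2 * sigma


# Round to one decimal place first, as the slide does (1.0 and 0.7).
low, high = usual_limits(round(mean, 1), round(sd, 1))
print(round(low, 1), round(high, 1))  # -0.4 2.4
```

Since the number of girls can only be 0, 1, or 2, none of the possible values falls outside these limits.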
Rare Event Rule for Inferential Statistics
If, under a given assumption (such as the assumption that a coin is fair), the
probability of a particular observed event (such as 992 heads in 1000 tosses of
a coin) is extremely small, we conclude that the assumption is probably not
correct.
Using Probabilities to Determine When Results Are Unusual
• Unusually high number of successes: x successes among n trials is an
unusually high number of successes if $P(x \text{ or more}) \le 0.05$.
• Unusually low number of successes: x successes among n trials is an
unusually low number of successes if $P(x \text{ or fewer}) \le 0.05$.
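
One way to apply these two criteria is sketched below in Python. It assumes the successes follow a binomial model (independent trials with the same success probability p), as in the fair-coin example; the slides do not name a model, and the function names are hypothetical.

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials), assuming a binomial model."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_x_or_more(x, n, p):
    return sum(binomial_pmf(k, n, p) for k in range(x, n + 1))


def prob_x_or_fewer(x, n, p):
    return sum(binomial_pmf(k, n, p) for k in range(0, x + 1))


def classify(x, n, p, cutoff=0.05):
    """Label x successes using the 0.05 criteria stated above."""
    if prob_x_or_more(x, n, p) <= cutoff:
        return "unusually high"
    if prob_x_or_fewer(x, n, p) <= cutoff:
        return "unusually low"
    return "usual"


print(classify(9, 10, 0.5))      # unusually high: P(9 or more) = 11/1024
print(classify(5, 10, 0.5))      # usual
print(classify(992, 1000, 0.5))  # unusually high, as in the rare event rule example
```

The tail probability "x or more" is used rather than the probability of exactly x, because with many trials any single exact count has a small probability.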
A density curve is the graph of a continuous probability
distribution. It must satisfy the following properties:
1. The total area under the curve must equal 1.
2. Every point on the curve must have a vertical height that is 0 or
greater. (That is, the curve cannot fall below the x-axis.)
Because the total area under the density curve is equal to 1, there is a
correspondence between area and probability.
A continuous random variable has a uniform distribution if its values are
spread evenly over the range of possibilities. The graph of a uniform
distribution results in a rectangular shape.
Given the uniform distribution illustrated, find the probability that a
randomly selected voltage level is greater than 124.5 volts.
[Figure: uniform distribution of voltage levels; the shaded area represents voltage levels greater than 124.5 volts.]
A continuous random variable has a normal distribution if it has a graph that
is symmetric and bell-shaped and if the random variable can be described by the
following equation:
$$f(x) = \frac{e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}}{\sigma\sqrt{2\pi}}$$
The standard normal distribution is a normal probability distribution with
μ = 0 and σ = 1. The total area under its density curve is equal to 1.
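
The density formula can be coded directly and compared against Python's built-in `statistics.NormalDist`. The sketch below also approximates the total area under the standard normal curve with a simple Riemann sum; the integration range of -8 to 8 and the step count are arbitrary choices made for illustration.

```python
import math
from statistics import NormalDist


def normal_density(x, mu=0.0, sigma=1.0):
    """Normal density f(x) written out from the formula above."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Height of the standard normal curve at its center.
peak = normal_density(0.0)

# Approximate the total area under the standard normal curve on [-8, 8].
n_steps = 16000
step = 16 / n_steps
area = sum(normal_density(-8 + i * step) * step for i in range(n_steps))

print(round(peak, 4))  # 0.3989
print(round(area, 4))  # 1.0
```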
• A z-score represents how much a given value, x, deviates from the center of a
set of data.
• This value can help to assess how “extreme” a particular data value is, based
on the distribution the value is supposed to follow.
• This score can also be used to convert sample data (sample statistics) to a
measure of relative standing so that we can compare samples to
one another.
• Basic idea behind the formulas for z-scores:
$$z = \frac{\text{the value} - \text{the center of the data}}{\text{a deviation measure}}$$
 If the z-score is positive (+), the specific value falls above the center
value.
 If the z-score is negative (-), the specific value falls below the center
value.
 “Usual” values have z-scores between -2 and 2.
 “Unusual” values have z-scores less than -2 or greater than 2.
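
A z-score calculation and the usual/unusual cutoff can be sketched in a few lines of Python. Here the mean is used as the center and the standard deviation as the deviation measure. The first example reuses the number-of-girls values from earlier (mean 1.0, standard deviation 0.7); the second uses hypothetical numbers (a score of 135 with mean 100 and standard deviation 15) that do not come from the slides.

```python
def z_score(value, center, deviation):
    """(the value - the center of the data) / a deviation measure"""
    return (value - center) / deviation


def is_unusual(z):
    """Unusual values have z-scores less than -2 or greater than 2."""
    return z < -2 or z > 2


z_girls = z_score(2, 1.0, 0.7)             # two girls among two children
z_hypothetical = z_score(135, 100, 15)     # hypothetical score, mean, and SD

print(round(z_girls, 2), is_unusual(z_girls))                # 1.43 False
print(round(z_hypothetical, 2), is_unusual(z_hypothetical))  # 2.33 True
```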
 We can find areas (probabilities) for different regions under
a normal model using StatCrunch.
A bone mineral density test can be helpful in identifying the presence of
osteoporosis.
The result of the test is commonly measured as a z score, which has a
normal distribution with a mean of 0 and a standard deviation of 1.
A randomly selected adult undergoes a bone density test.
Find the probability that the result is a reading less than 1.27.
The probability of a randomly selected adult having a bone
density reading less than 1.27 is 0.8980:
$P(z < 1.27) = 0.8980$
Using the same bone density test, find the probability that a randomly
selected person has a result above –1.00 (which is considered to be in
the “normal” range of bone density readings).
The probability of a randomly
selected adult having a bone
density above –1 is 0.8413.
A bone density reading between –1.00 and –2.50 indicates the subject has
osteopenia. Find this probability.
The probability of a randomly selected adult having osteopenia is 0.1525.
$P(a < z < b)$ denotes the probability that the z score is between a and b.
$P(z > a)$ denotes the probability that the z score is greater than a.
$P(z < a)$ denotes the probability that the z score is less than a.
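
The slides use StatCrunch for these areas; as an alternative sketch, the same three bone density probabilities can be computed with Python's standard `statistics.NormalDist`. The exact values can differ from table-based answers in the fourth decimal place because the table entries are rounded before subtracting.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# P(z < 1.27): area to the left of 1.27
p_less_than = std_normal.cdf(1.27)

# P(z > -1.00): one minus the area to the left of -1.00
p_greater_than = 1 - std_normal.cdf(-1.00)

# P(-2.50 < z < -1.00): difference of two left-tail areas
p_between = std_normal.cdf(-1.00) - std_normal.cdf(-2.50)

print(round(p_less_than, 4), round(p_greater_than, 4), round(p_between, 4))
```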
Finding the 95th percentile: the z score with 5% (0.05) of the area to its right is 1.645 (the z score will be positive).
Using the same bone density test, find the bone density score that
separates the bottom 2.5% and the score that separates the top
2.5%.
For the standard normal distribution, a critical value is a z score
separating unlikely values from those that are likely to occur.
Notation:
The expression $z_\alpha$ denotes the z score with an area of α to its right.
Find the value of $z_{0.025}$.
The notation $z_{0.025}$ is used to represent the z score with an area of
0.025 to its right.
Referring back to the bone density example,
$z_{0.025} = 1.96$.
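
Going from an area back to a z score is the inverse of the earlier calculations. The sketch below uses `NormalDist.inv_cdf` from Python's standard library, which takes the area to the left, so an area of α to the right corresponds to `inv_cdf(1 - alpha)`. The helper name `critical_value` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def critical_value(alpha):
    """z score with an area of alpha to its right."""
    return std_normal.inv_cdf(1 - alpha)


z_95th_percentile = std_normal.inv_cdf(0.95)   # 5% of the area lies to the right
z_top = critical_value(0.025)                  # separates the top 2.5%
z_bottom = std_normal.inv_cdf(0.025)           # separates the bottom 2.5%

print(round(z_95th_percentile, 3), round(z_top, 2), round(z_bottom, 2))  # 1.645 1.96 -1.96
```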
• Complete HW1 and HW2 on MLP

Data sampling and probability

  • 1. DATA SAMPLING AND PROBABILITY Avjinder Singh Kaler and Kristi Mai
  • 2.  Multiplication Rule: Complements and Conditional Probability  Counting  Types of Sampling Methods  Summarizing Data  Statistical Graphs  Probability Distributions  Normal and Standard Normal Distribution
  • 3.  A conditional probability of an event is a probability obtained with the additional information that some other event has already occurred.  denotes the conditional probability of event B occurring, given that event A has already occurred, and it can be found by dividing the probability of events A and B both occurring by the probability of event A: ( | )P B A ( and ) ( | ) ( )  P A B P B A P A
  • 4. Refer to Table 4-1 to find the following: a) If 1 of the 1000 test subjects is randomly selected, find the probability that the subject had a positive test result, given that the subject actually uses drugs. That is, find 𝑷(𝒑𝒐𝒔𝒊𝒕𝒊𝒗𝒆 𝒕𝒆𝒔𝒕 𝒓𝒆𝒔𝒖𝒍𝒕|𝒔𝒖𝒃𝒋𝒆𝒄𝒕 𝒖𝒔𝒆𝒔 𝒅𝒓𝒖𝒈𝒔). a) If 1 of the 1000 test subjects is randomly selected, find the probability that the subject actually uses drugs, given that the he/she had a positive test result. That is, find 𝑷(𝒔𝒖𝒃𝒋𝒆𝒄𝒕 𝒖𝒔𝒆𝒔 𝒅𝒓𝒖𝒈𝒔|𝒑𝒐𝒔𝒊𝒕𝒊𝒗𝒆 𝒕𝒆𝒔𝒕 𝒓𝒆𝒔𝒖𝒍𝒕).
  • 5. Solution: a) P positive test result subject uses drugs = P subject uses drugs and had a positive test result P(subject uses drugs) P positive test result subject uses drugs = 44 100 50 100 = 44 50 = 0.88 b) P subject uses drugs positive test result = P subject uses drugs and had a positive test result P(positive test result) 𝑃 𝑠𝑢𝑏𝑗𝑒𝑐𝑡 𝑢𝑠𝑒𝑠 𝑑𝑟𝑢𝑔𝑠 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒 𝑡𝑒𝑠𝑡 𝑟𝑒𝑠𝑢𝑙𝑡 = 44 134 = 0.328 Table 4-1 Pre-Employment Drug Screening Results Positive Test Result Negative Test Result Subject Uses Drugs 44 (True Positive) 6 (False Negative) Subject Is Not a Drug User 90 (False Positive) 860 (True Negative)
  • 6.  For a sequence of two events in which the first event can occur 𝑚 ways and the second event can occur 𝑛 ways, the events together can occur a total of 𝑚 ∗ 𝑛 ways. Example: For a two-character code consisting of a letter followed by a digit, the number of different possible codes is 26 ∗ 10 = 260.
  • 7.  The factorial symbol ! denotes the product of decreasing positive whole numbers.  For example,  By special definition, 0! = 1. 4! 4 3 2 1 24    
  • 8.  n! = Number of different permutations (order counts) of n different items can be arranged when all n of them are selected. (This factorial rule reflects the fact that the first item may be selected in n different ways, the second item may be selected in n – 1 ways, and so on.) Example: The number of ways that the five letters {a, b, c, d, e} can be arranged is as follows: 5! = 5 ∙ 4 ∙ 3 ∙ 2 ∙ 1 = 120
  • 9. Requirements:
    1. There are n different items available. (This rule does not apply if some of the items are identical to others.)
    2. We select r of the n items (without replacement).
    3. We consider rearrangements of the same items to be different sequences. (The permutation of ABC is different from CBA and is counted separately.)
    If the preceding requirements are satisfied, the number of permutations (or sequences) of r items selected from n available items (without replacement) is $_nP_r = \dfrac{n!}{(n-r)!}$
  • 10. If the five letters {a, b, c, d, e} are available and three of them are to be selected without replacement, the number of different permutations is as follows: $_nP_r = \dfrac{n!}{(n-r)!} = \dfrac{5!}{(5-3)!} = 60$
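The factorial rule and the permutations rule for the five letters can be verified with Python's standard library. This is a small sketch, assuming Python 3.8 or later for `math.perm`; the variable names are illustrative.

```python
import math
from itertools import permutations

letters = ["a", "b", "c", "d", "e"]

# Factorial rule: arrangements of all 5 distinct letters.
all_orderings = math.factorial(len(letters))

# Permutations rule: ordered selections of 3 of the 5 letters, n!/(n-r)!.
n_p_r = math.perm(5, 3)

# Cross-check by listing every ordered selection of 3 letters.
enumerated = len(list(permutations(letters, 3)))

print(all_orderings, n_p_r, enumerated)
```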
  • 11. Requirements:
    1. There are n items available, and some items are identical to others.
    2. We select all of the n items (without replacement).
    3. We consider rearrangements of distinct items to be different sequences.
    If the preceding requirements are satisfied, and if there are $n_1$ alike, $n_2$ alike, . . . , $n_k$ alike, the number of permutations (or sequences) of all items selected without replacement is $\dfrac{n!}{n_1!\, n_2! \cdots n_k!}$
  • 12. If the 10 letters {a, a, a, a, b, b, c, c, d, e} are available and all 10 of them are to be selected without replacement, the number of different permutations is as follows: $\dfrac{n!}{n_1!\, n_2! \cdots n_k!} = \dfrac{10!}{4!\, 2!\, 2!} = \dfrac{3{,}628{,}800}{24 \cdot 2 \cdot 2} = 37{,}800$
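The count for the 10 letters can be reproduced by dividing n! by the factorial of each group size. A short sketch follows; the function name `distinct_arrangements` is a hypothetical helper, not something from the slides.

```python
import math
from collections import Counter

def distinct_arrangements(items):
    """Number of distinguishable orderings of items when some items are identical."""
    total = math.factorial(len(items))
    # Divide out the orderings of each group of identical items.
    for group_size in Counter(items).values():
        total //= math.factorial(group_size)
    return total

letters = list("aaaabbccde")
result = distinct_arrangements(letters)
print(result)
```

Groups of size 1 (here d and e) contribute a factor of 1! = 1, so they do not change the result.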
  • 13. Requirements:
    1. There are n different items available.
    2. We select r of the n items (without replacement).
    3. We consider rearrangements of the same items to be the same. (The combination of ABC is the same as CBA.)
    If the preceding requirements are satisfied, the number of combinations of r items selected from n different items is $_nC_r = \dfrac{n!}{(n-r)!\, r!}$
  • 14. In the Pennsylvania Match 6 Lotto, winning the jackpot requires that you select six different numbers from 1 to 49. The winning numbers may be drawn in any order. Find the probability of winning if one ticket is purchased. Number of combinations: $_nC_r = \dfrac{n!}{(n-r)!\, r!} = \dfrac{49!}{43!\, 6!} = 13{,}983{,}816$, so $P(\text{winning}) = \dfrac{1}{13{,}983{,}816}$
  • 15. When different orderings of the same items are to be counted separately, we have a permutation problem, but when different orderings are not to be counted separately, we have a combination problem. Permutations are for lists (order matters) and combinations are for groups (order doesn’t matter).
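The lottery count, and the relationship between permutations and combinations, can be checked as follows. This is a sketch assuming Python 3.8 or later for `math.comb` and `math.perm`; the variable names are illustrative.

```python
import math
from fractions import Fraction

# Combinations: six numbers chosen from 49, order ignored.
n_tickets = math.comb(49, 6)

# One ticket covers exactly one of the equally likely combinations.
p_win = Fraction(1, n_tickets)

# If order counted, each combination would appear 6! times as a permutation.
ordered = math.perm(49, 6)

print(n_tickets, float(p_win), ordered == n_tickets * math.factorial(6))
```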
  • 16.
    • Data – collections of observations, such as measurements, genders, or survey responses
    • Population – the complete collection of all individuals to be studied
    • Sample – sub-collection of the population that the data comes from
    • Census – the collection of data from every member of the population
  • 17.
    • planning studies, designing experiments, and obtaining data
    • organizing, summarizing, analyzing, interpreting, drawing conclusions about, and presenting data
  • 18. The Gallup corporation collected data from 1013 adults in the United States. Results showed that 66% of the respondents worried about identity theft.
    • The population consists of all 241,472,385 adults in the United States.
    • The sample consists of the 1013 polled adults.
    • The objective is to use the sample data as a basis for drawing a conclusion about the whole population.
  • 19.
    • Simple random sample
    • Random sample
    • Systematic sampling
    • Convenience sampling
    • Stratified sampling
    • Cluster sampling
  • 20. Simple random sample: A sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen.
  • 21. Random sample: Members from the population are selected in such a way that each individual member in the population has an equal chance of being selected.
  • 22. Systematic sampling: Select some starting point and then select every kth element in the population.
  • 23. Convenience sampling: Use results that are easy to get.
  • 24. Stratified sampling: Subdivide the population into at least two different subgroups that share the same characteristics, then draw a sample from each subgroup (or stratum).
  • 25. Cluster sampling: Divide the population area into sections (or clusters). Then randomly select some of those clusters. Now choose all members from selected clusters.
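The sampling methods described above can be sketched with Python's `random` module. The population of 100 ID numbers, the strata, the cluster size, and the sample sizes below are all hypothetical choices made for illustration.

```python
import random

rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 ID numbers

# Simple random sample: every possible sample of size 10 is equally likely.
simple = rng.sample(population, 10)

# Systematic sampling: random starting point, then every k-th element.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sampling: split into subgroups (strata), then sample from each one.
strata = {"low": population[:50], "high": population[50:]}
stratified = [x for group in strata.values() for x in rng.sample(group, 5)]

# Cluster sampling: split into clusters, randomly pick some clusters,
# then keep every member of the chosen clusters.
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
cluster_sample = [x for cluster in rng.sample(clusters, 2) for x in cluster]

print(len(simple), len(systematic), len(stratified), len(cluster_sample))
```

Convenience sampling has no algorithm to sketch: it simply takes whatever data is easiest to obtain, which is why it tends to be the least trustworthy of the methods listed.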
  • 26. When working with large data sets, it is often helpful to organize and summarize data by constructing a table called a frequency distribution.
  • 27.
    • Shows how a data set is partitioned among all of several categories (or classes) by listing all of the categories along with the number (frequency) of data values in each of them
    • All categories/classes and the number of observations in that given category/class
  • 28. Lower Class Limits are the smallest numbers that can actually belong to different classes. In the table below they are 50, 70, 90, 110, and 130.
    IQ Score    Frequency
    50-69       2
    70-89       33
    90-109      35
    110-129     7
    130-149     1
  • 29. Upper Class Limits are the largest numbers that can actually belong to different classes. For the IQ score classes 50-69, 70-89, 90-109, 110-129, and 130-149, they are 69, 89, 109, 129, and 149.
  • 30. Class Boundaries are the numbers used to separate classes, but without the gaps created by class limits. For the same IQ score classes, the class boundaries are 49.5, 69.5, 89.5, 109.5, 129.5, and 149.5.
  • 31. Class Midpoints are the values in the middle of the classes and can be found by adding the lower class limit to the upper class limit and dividing the sum by 2: $\text{class midpoint} = \dfrac{\text{lower limit} + \text{upper limit}}{2}$. For the same IQ score classes, the class midpoints are 59.5, 79.5, 99.5, 119.5, and 139.5.
  • 32. Class Width is the difference between two consecutive lower class limits or two consecutive lower class boundaries. For the same IQ score classes, the class width is 20 for every class.
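The class limits, boundaries, midpoints, and width for the IQ score table can be computed from the class limits alone. A minimal sketch follows; the variable names are illustrative.

```python
# Class limits from the IQ score frequency distribution.
classes = [(50, 69), (70, 89), (90, 109), (110, 129), (130, 149)]

lower_limits = [lo for lo, _ in classes]
upper_limits = [hi for _, hi in classes]

# Midpoint = (lower limit + upper limit) / 2.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

# Width = difference between two consecutive lower class limits.
width = lower_limits[1] - lower_limits[0]

# Boundaries sit halfway across the gap between one class's upper limit
# and the next class's lower limit.
gap = lower_limits[1] - upper_limits[0]
boundaries = [lower_limits[0] - gap / 2] + [hi + gap / 2 for hi in upper_limits]

print(midpoints, width, boundaries)
```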
  • 33. A relative frequency distribution includes the same class limits as a frequency distribution, but the frequency of each class is replaced with a relative frequency (a proportion) or a percentage frequency (a percent).
    $\text{relative frequency} = \dfrac{\text{class frequency}}{\text{sum of all frequencies}}$
    $\text{percentage frequency} = \dfrac{\text{class frequency}}{\text{sum of all frequencies}} \times 100\%$
  • 34.
    IQ Score    Frequency    Relative Frequency
    50-69       2            2.6%
    70-89       33           42.3%
    90-109      35           44.9%
    110-129     7            9.0%
    130-149     1            1.3%
  • 35. Cumulative Frequencies
    IQ Score    Frequency    Cumulative Frequency
    50-69       2            2
    70-89       33           35
    90-109      35           70
    110-129     7            77
    130-149     1            78
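The relative and cumulative frequency columns follow mechanically from the class frequencies. A short sketch, with illustrative variable names:

```python
from itertools import accumulate

freqs = [2, 33, 35, 7, 1]  # class frequencies from the IQ score table
total = sum(freqs)         # sum of all frequencies

# Percentage frequency = class frequency / sum of all frequencies * 100%.
relative_pct = [round(100 * f / total, 1) for f in freqs]

# Cumulative frequency = running total of the class frequencies.
cumulative = list(accumulate(freqs))

print(total, relative_pct, cumulative)
```

The last cumulative frequency always equals the total number of observations, which is a quick sanity check on the table.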
  • 36.
    • The frequencies start low, then increase to higher frequencies until reaching a maximum, and then decrease to low again.
    • The distribution is approximately symmetric, with the frequencies preceding the maximum being roughly a mirror image of those that follow the maximum.
  • 37.
    • Numerical in nature
    • Consists of numbers representing counts or measurements
    • Has a unit and can be used arithmetically
    • Quantitative data can be further described by distinguishing between discrete and continuous types.
    Examples:
    • The weights of supermodels
    • The ages of respondents
  • 38. Discrete data: the number of possible values is either a finite number or a ‘countable’ number (i.e. the number of possible values is 0, 1, 2, 3, . . .). Example: The number of eggs that a hen lays
  • 39. Continuous data: infinitely many possible values that correspond to some continuous scale that covers a range of values without gaps, interruptions, or jumps. Example: The amount of milk that a cow produces; e.g. 2.343115 gallons per day
  • 40. Categorical (qualitative) data: consists of names or labels (representing categories). Examples: • The gender (male/female) of professional athletes. • Shirt numbers on professional athletes' uniforms, which are substitutes for names.
  • 41. • Uses bars of equal width to show frequencies of categorical, or qualitative, data • Vertical scale represents frequencies or relative frequencies. • Horizontal scale identifies the different categories of qualitative data.
  • 42. A multiple bar graph has two or more sets of bars and is used to compare two or more data sets.
  • 43. Pareto chart: a bar graph for qualitative data, with the bars arranged in descending order according to frequencies
  • 44. Pie chart: a graph depicting qualitative data as slices of a circle, in which the size of each slice is proportional to the frequency count
  • 45. Random variable:
    • a variable (typically represented by x) that has a single numerical value, determined by chance, for each outcome of a given procedure
    • Can be discrete or continuous, just like data
  • 46.
    • Discrete Random Variable: either a finite number of values or countable number of values, where “countable” refers to the fact that there might be infinitely many values, but that they result from a counting process
    • Continuous Random Variable: has infinitely many values, and those values can be associated with measurements on a continuous scale without gaps or interruptions.
  • 47. Probability distribution:
    • a description that gives the probability for each value of the random variable
    • often expressed in the format of a graph, table, or formula
    Note: If a probability is very small, it is represented as 0+ in tables (i.e. it is very small, yet positive)
  • 48.
    1. There is a numerical random variable x and its values are associated with corresponding probabilities.
    2. The sum of all probabilities must be 1: $\sum P(x) = 1$
    3. Each probability value must be between 0 and 1 inclusive: $0 \le P(x) \le 1$
  • 49. The probability histogram is very similar to a relative frequency histogram, but the vertical scale shows probabilities.
  • 50.
    • According to the range rule of thumb, most values should lie within 2 standard deviations of the mean.
    • We can therefore identify “unusual” values by determining if they lie outside these limits:
    Maximum usual value = $\mu + 2\sigma$
    Minimum usual value = $\mu - 2\sigma$
  • 51. We found that for families with two children, the mean number of girls is 1.0 and the standard deviation is 0.7 girls. Use those values to find the maximum and minimum usual values for the number of girls. Solution:
    maximum usual value = $\mu + 2\sigma = 1.0 + 2(0.7) = 2.4$
    minimum usual value = $\mu - 2\sigma = 1.0 - 2(0.7) = -0.4$
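The requirements for a probability distribution and the range rule of thumb can be checked together on the two-children example. The slides give only the mean and standard deviation, so the table of probabilities below is an assumption (boys and girls equally likely, births independent).

```python
import math

# Assumed distribution of the number of girls among two children
# (equally likely, independent births); not stated on the slides.
dist = {0: 0.25, 1: 0.50, 2: 0.25}

# Requirements: probabilities sum to 1 and each lies between 0 and 1.
sums_to_one = math.isclose(sum(dist.values()), 1.0)
all_in_range = all(0 <= p <= 1 for p in dist.values())

# Mean and standard deviation of a discrete random variable.
mu = sum(x * p for x, p in dist.items())
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))

# Range rule of thumb.
max_usual = mu + 2 * sigma
min_usual = mu - 2 * sigma

print(round(mu, 1), round(sigma, 1), round(max_usual, 1), round(min_usual, 1))
```

Under this assumed distribution, 0, 1, and 2 girls all fall inside the usual range, so none of the possible outcomes would be flagged as unusual.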
  • 52. Rare Event Rule for Inferential Statistics If, under a given assumption (such as the assumption that a coin is fair), the probability of a particular observed event (such as 992 heads in 1000 tosses of a coin) is extremely small, we conclude that the assumption is probably not correct.
  • 53. Using Probabilities to Determine When Results Are Unusual
    • Unusually high number of successes: x successes among n trials is an unusually high number of successes if $P(x \text{ or more}) \le 0.05$.
    • Unusually low number of successes: x successes among n trials is an unusually low number of successes if $P(x \text{ or fewer}) \le 0.05$.
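For coin tosses, the "x or more" and "x or fewer" probabilities can be computed exactly by summing binomial terms. The sketch below assumes independent tosses with a fixed success probability; the helper names and the 10-toss cases are hypothetical illustrations, while the 992-in-1000 case comes from the rare event rule above.

```python
from fractions import Fraction
from math import comb

def p_at_least(x, n, p=Fraction(1, 2)):
    """P(x or more successes in n independent trials with success probability p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))

def p_at_most(x, n, p=Fraction(1, 2)):
    """P(x or fewer successes in n independent trials with success probability p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(0, x + 1))

# Hypothetical small cases with a fair coin.
high_8_of_10 = p_at_least(8, 10)  # 56/1024, above 0.05, so not unusually high
high_9_of_10 = p_at_least(9, 10)  # 11/1024, at most 0.05, so unusually high

# 992 heads in 1000 tosses of a fair coin is astronomically unlikely.
rare = p_at_least(992, 1000)

print(float(high_8_of_10), float(high_9_of_10), rare <= 0.05)
```

Using `Fraction` keeps the arithmetic exact, which matters when the tail probability is far too small for a naive floating-point sum to represent reliably.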
  • 54. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties: 1. The total area under the curve must equal 1. 2. Every point on the curve must have a vertical height that is 0 or greater. (That is, the curve cannot fall below the x-axis.)
  • 55. Because the total area under the density curve is equal to 1, there is a correspondence between area and probability.
  • 56. A continuous random variable has a uniform distribution if its values are spread evenly over the range of possibilities. The graph of a uniform distribution results in a rectangular shape.
  • 57. Given the uniform distribution illustrated, find the probability that a randomly selected voltage level is greater than 124.5 volts. Shaded area represents voltage levels greater than 124.5 volts.
  • 58. A continuous R.V. has a normal distribution if it has a graph that is symmetric and bell-shaped and if the R.V. can be described by the following equation: $f(x) = \dfrac{e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}}{\sigma\sqrt{2\pi}}$
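The density formula can be written out directly and compared against the standard library's implementation. This is a sketch; the function name `normal_pdf` and the example values (mean 100, standard deviation 15, x = 110) are hypothetical.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal density curve at x for mean mu and standard deviation sigma."""
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical example values, compared with statistics.NormalDist.
ours = normal_pdf(110, mu=100, sigma=15)
reference = NormalDist(mu=100, sigma=15).pdf(110)

# The curve is symmetric about the mean.
left, right = normal_pdf(-1.5), normal_pdf(1.5)

print(ours, reference, left == right)
```

Note that the function returns a curve height, not a probability; probabilities correspond to areas under the curve.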
  • 59. The standard normal distribution is a normal probability distribution with μ = 0 and σ = 1. The total area under its density curve is equal to 1.
  • 60. z score:
    • Represents how much a given value, x, deviates/varies from the center of a set of data
    • This value can help to assess how “extreme” a particular data value is based on the distribution the value is supposed to follow
    • This score can also be used to convert sample data (sample statistics) to a measure of relative standing so that we may be able to compare samples to one another.
    • Basic “Idea” Behind Formulas for Z-Scores: $Z = \dfrac{\text{the value} - \text{the center of the data}}{\text{a deviation measure}}$
  • 61.
    • If the z-score is positive (+), the specific value falls above the center value.
    • If the z-score is negative (-), the specific value falls below the center value.
    • “Usual” values have z-scores between -2 and 2.
    • “Unusual” values have z-scores less than -2 or greater than 2.
  • 62. We can find areas (probabilities) for different regions under a normal model using StatCrunch.
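The z-score idea and the usual/unusual cutoffs translate into two small functions. This is a sketch with hypothetical helper names; the example reuses the mean of 1.0 and standard deviation of 0.7 from the two-children example.

```python
def z_score(value, center, deviation):
    """(the value - the center of the data) / a deviation measure."""
    return (value - center) / deviation

def is_unusual(z):
    """Unusual values have z scores less than -2 or greater than 2."""
    return z < -2 or z > 2

# Two girls among two children, with mean 1.0 and standard deviation 0.7.
z_two_girls = z_score(2, 1.0, 0.7)

print(round(z_two_girls, 2), is_unusual(z_two_girls))
```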
  • 63. A bone mineral density test can be helpful in identifying the presence of osteoporosis. The result of the test is commonly measured as a z score, which has a normal distribution with a mean of 0 and a standard deviation of 1. A randomly selected adult undergoes a bone density test. Find the probability that the result is a reading less than 1.27.
  • 64. The probability of a randomly selected adult having a bone density reading less than 1.27 is 0.8980: $P(z < 1.27) = 0.8980$
  • 65. Using the same bone density test, find the probability that a randomly selected person has a result above –1.00 (which is considered to be in the “normal” range of bone density readings). The probability of a randomly selected adult having a bone density above –1 is 0.8413.
  • 66. A bone density reading between –1.00 and –2.50 indicates the subject has osteopenia. Find this probability. The probability of a randomly selected adult having osteopenia is 0.1525.
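The slides use StatCrunch for these areas; as a stand-in, the same three probabilities can be computed with Python's `statistics.NormalDist` (Python 3.8 or later). The variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area to the left of 1.27.
p_below = z.cdf(1.27)

# Area to the right of -1.00 is the complement of the area to its left.
p_above = 1 - z.cdf(-1.00)

# Area between -2.50 and -1.00 is the difference of two left-tail areas.
p_between = z.cdf(-1.00) - z.cdf(-2.50)

print(round(p_below, 4), round(p_above, 4), round(p_between, 4))
```

The unrounded area between –2.50 and –1.00 is slightly below 0.1525; the slide's figure appears to come from subtracting table values that were already rounded to four decimal places (0.1587 – 0.0062).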
  • 67.
    $P(a < z < b)$ denotes the probability that the z score is between a and b.
    $P(z > a)$ denotes the probability that the z score is greater than a.
    $P(z < a)$ denotes the probability that the z score is less than a.
  • 68. Finding the 95th Percentile: the z score with an area of 5% (or 0.05) to its right is z = 1.645 (the z score will be positive).
  • 69. Using the same bone density test, find the bone density score that separates the bottom 2.5% and the score that separates the top 2.5%.
  • 70. For the standard normal distribution, a critical value is a z score separating unlikely values from those that are likely to occur. Notation: The expression $z_\alpha$ denotes the z score with an area of $\alpha$ to its right.
  • 71. Find the value of $z_{0.025}$. The notation $z_{0.025}$ is used to represent the z score with an area of 0.025 to its right. Referring back to the bone density example, $z_{0.025} = 1.96$.
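Percentiles and critical values go the other way, from an area back to a z score. A sketch using `statistics.NormalDist.inv_cdf` (Python 3.8 or later) follows; the helper name `critical_value` is hypothetical.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# 95th percentile: the z score with 95% of the area to its left.
p95 = z.inv_cdf(0.95)

def critical_value(alpha):
    """z score with an area of alpha to its right."""
    return z.inv_cdf(1 - alpha)

# Scores separating the top 2.5% and the bottom 2.5%.
z_top = critical_value(0.025)
z_bottom = z.inv_cdf(0.025)

print(round(p95, 3), round(z_top, 2), round(z_bottom, 2))
```

Because `inv_cdf` works with area to the left, an area of α to the right corresponds to a left area of 1 − α.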
  • 72. • Complete HW1 and HW2 on MLP