Jennifer Siegel
Statistical background
Z-Test
T-Test
ANOVAs
 Science tries to predict the future
 Genuine effect?
 Attempt to strengthen predictions with stats
 Use p-value to indicate our level of certainty that the result reflects a
genuine effect in the whole population (more on this later…)
 Develop an experimental hypothesis
 H0 = null hypothesis
 H1 = alternative hypothesis
 Statistically significant result
 P-value = probability of obtaining the observed result by chance if the null hypothesis is true
 Significance level (α) = .05 or 5%
 If p < .05, we can be 95% certain our experimental effect is genuine
 Type 1 = false positive
 Type 2 = false negative
 Power = 1 – probability of a Type 2 error
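Not from the original slides: a minimal simulation sketch (Python, made-up values) showing why α = .05 corresponds to roughly a 5% Type 1 error rate when the null hypothesis is true:

```python
# Not from the slides: simulate many experiments in which H0 is true
# and count how often p < .05 -- the Type 1 (false positive) rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n = 10_000, 30
false_positives = 0
for _ in range(n_experiments):
    # both samples come from the SAME population, so H0 is true by construction
    a = rng.normal(loc=100, scale=15, size=n)
    b = rng.normal(loc=100, scale=15, size=n)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"Type 1 error rate: {false_positives / n_experiments:.3f}")  # ~0.05
```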
 Let’s pretend you came up with the following theory…
Having a baby increases brain volume (associated with
possible structural changes)
Z-test
T-test
 Population: when the population mean (μ) and standard deviation (σ) are known, a value x can be standardised as a z-score:
z = (x − μ) / σ
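A tiny illustrative sketch of this standardisation (not from the deck; all numbers are hypothetical):

```python
# Hypothetical numbers, for illustration only: standardising a value
# as a z-score when the population parameters are known.
population_mean = 1200.0  # assumed population mean (made up)
population_sd = 115.0     # assumed population standard deviation (made up)
x = 1271.0                # observed value

z = (x - population_mean) / population_sd
print(f"z = {z:.2f}")     # distance from the mean in standard deviations
```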
 Cost
 Not able to include everyone
 Too time consuming
 Ethical right to privacy
Realistically, researchers can only do sample-based
studies
 t = difference between sample means / estimated standard error of the
difference between sample means
 Degrees of freedom = sample size - 1
t = (x̄1 − x̄2) / s(x̄1 − x̄2)
  = difference between sample means / estimated standard error of the difference between means
where the estimated standard error is
s(x̄1 − x̄2) = √( s1²/n1 + s2²/n2 )
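As a rough check of the formula above, a short Python sketch (made-up data) that computes the two-sample t statistic directly:

```python
# Made-up data: compute the two-sample t statistic from the formula above.
import numpy as np

x1 = np.array([12.1, 11.4, 13.0, 12.7, 11.9])
x2 = np.array([10.2, 10.9, 11.1, 10.5, 10.8])

# estimated standard error of the difference between means
se = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
t = (x1.mean() - x2.mean()) / se
print(f"t = {t:.3f}")
```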
 H0 = There is no difference in brain size before or after
giving birth
 H1 = The brain is significantly smaller or significantly
larger after giving birth (difference detected)
Before Delivery 6 Weeks After Delivery Difference
1437.4 1494.5 57.1
1089.2 1109.7 20.5
1201.7 1245.4 43.7
1371.8 1383.6 11.8
1207.9 1237.7 29.8
1150.7 1180.1 29.4
1221.9 1268.8 46.9
1208.7 1248.3 39.6
Sum 9889.3 10168.1 278.8
Mean 1236.1625 1271.0125 34.85
SD 113.8544928 119.0413426 5.18685 (the value in the Difference column is the standard error of the mean difference)
t = mean difference / standard error = 34.85 / 5.18685
t 6.718914454
df 7
http://www.danielsoper.com/statcalc/calc08.aspx
Women have a significantly larger brain after giving birth
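For reference, a sketch re-running this paired comparison with scipy (data taken from the table above); scipy's t may differ slightly from the slide's 6.72 depending on how the standard error of the differences is computed:

```python
# Values taken from the table above; scipy computes the standard error
# from the differences themselves, so its t may differ slightly from
# the slide's 6.72.
from scipy import stats

before = [1437.4, 1089.2, 1201.7, 1371.8, 1207.9, 1150.7, 1221.9, 1208.7]
after  = [1494.5, 1109.7, 1245.4, 1383.6, 1237.7, 1180.1, 1268.8, 1248.3]

t, p = stats.ttest_rel(after, before)  # paired (repeated-measures) t-test
print(f"t = {t:.3f}, p = {p:.5f}")     # significant at the .05 level
```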
 One-sample (sample vs. hypothesized mean)
 Independent groups (2 separate groups)
 Repeated measures (same group, measured more than
once)
 ANalysis Of VAriance
 Factor = what is being compared (type of pregnancy)
 Levels = different elements of a factor (age of mother)
 F-Statistic
 Post hoc testing
 One-way ANOVA
 1 factor with more than 2 levels
 Factorial ANOVA
 More than 1 factor
 Mixed-design ANOVAs
 Some factors are independent, others are related
 There is a significant difference somewhere between
groups
 NOT where the difference lies
 Finding exactly where the difference lies requires
further statistical analysis = post hoc analysis
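A minimal one-way ANOVA sketch (made-up data, assuming scipy is available):

```python
# Made-up data for one factor with three levels; f_oneway reports only
# the overall F and p -- post hoc tests are still needed to locate the
# difference.
from scipy import stats

group_a = [5.1, 4.9, 5.4, 5.0, 5.2]
group_b = [5.6, 5.8, 5.5, 5.9, 5.7]
group_c = [4.8, 4.7, 5.0, 4.6, 4.9]

f, p = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f:.2f}, p = {p:.4f}")
```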
 Z-Tests for populations
 T-Tests for samples
 ANOVAs compare more than 2 groups in more
complicated scenarios
Varun V. Sethi
Objective
Correlation
Linear Regression
Take Home Points
Correlation
- How linear is the relationship between two
variables? (descriptive)
Regression
- How well does a linear model explain my data?
(inferential)
Correlation
Correlation reflects the noisiness and direction of a linear relationship (top row),
but not the slope of that relationship (middle), nor many aspects of nonlinear
relationships (bottom).
 Strength and direction of the relationship between
variables
 Scattergrams
[Scatterplots of Y against X illustrating positive correlation, negative correlation, and no correlation]
Measures of Correlation
1) Covariance
2) Pearson Correlation Coefficient (r)
1) Covariance
- The covariance is a statistic representing the degree to which 2
variables vary together
cov(x, y) = Σi (xi − x̄)(yi − ȳ) / n
{Note that sx² = cov(x,x)}
 A statistic representing the degree to which 2 variables
vary together
 Covariance formula
 cf. variance formula
cov(x, y) = Σi (xi − x̄)(yi − ȳ) / n
sx² = Σi (xi − x̄)² / n
2) Pearson correlation coefficient (r)
- r is a kind of ‘normalised’ (dimensionless) covariance
- r takes values from -1 (perfect negative correlation) to 1 (perfect
positive correlation); r = 0 means no correlation
r = cov(x, y) / (sx sy)    (s = standard deviation of the sample)
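A short sketch (made-up data) computing covariance and r from the formulas above and checking the result against numpy's built-in:

```python
# Made-up data: covariance and Pearson's r from the formulas above,
# checked against numpy's built-in.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = ((x - x.mean()) * (y - y.mean())).mean()  # divide by n, as on the slide
r = cov_xy / (x.std() * y.std())                   # r = cov(x,y) / (sx * sy)

print(f"cov = {cov_xy:.3f}, r = {r:.4f}")
print(f"numpy r = {np.corrcoef(x, y)[0, 1]:.4f}")  # should match
```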
Limitations:
Sensitive to extreme values
Describes a relationship, not a prediction
Does not imply causality
Regression: Prediction of one variable from knowledge of one or
more other variables
How well does a linear model (y = ax + b) explain the relationship of two
variables?
- If there is such a relationship, we can ‘predict’ the value of y for a given x,
e.g. the point (25, 7.498)
Linear dependence between 2 variables
Two variables are linearly dependent when the increase of one variable
is proportional to the increase of the other one
Examples:
- Energy needed to boil water
- Money needed to buy coffeepots
Fitting data to a straight line (or vice versa):
Here, ŷ = ax + b
– ŷ : predicted value of y
– a: slope of regression line
– b: intercept
Residual error (εi): difference between obtained and predicted values of y (i.e. εi = yi − ŷi)
Best fit line (values of a and b) is the one that minimises the sum of squared errors
SSerror = Σ (yi − ŷi)²
[Plot: observed values yi, predicted values ŷi on the fitted line ŷ = ax + b, with residuals εi]
Adjusting the straight line to data:
• Minimise Σ(yi − ŷi)², which is Σ(yi − axi − b)²
• Minimum SSerror is at the bottom of the curve, where the gradient is zero
– and this can be found with calculus
• Take partial derivatives of Σ(yi − axi − b)² with respect to parameters a and b and
solve for 0 as simultaneous equations, giving:
a = r sy / sx
b = ȳ − a x̄
• This can always be done
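A sketch (made-up data) applying the closed-form solution above and checking it against numpy's least-squares fit:

```python
# Made-up data: the closed-form solution a = r*sy/sx, b = mean(y) - a*mean(x),
# checked against numpy's least-squares fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
a = r * y.std() / x.std()     # slope
b = y.mean() - a * x.mean()   # intercept

print(f"a = {a:.4f}, b = {b:.4f}")
print(np.polyfit(x, y, 1))    # [slope, intercept] -- should agree
```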
 We can calculate the regression line for any data, but how well does it fit the
data?
 Total variance = predicted variance + error variance
sy² = sŷ² + ser²
 Also, it can be shown that r² is the proportion of the variance in y that is
explained by our regression model:
r² = sŷ² / sy²
 Substituting r²·sy² for sŷ² in sy² = sŷ² + ser² and rearranging gives:
ser² = sy² (1 – r²)
From this we can see that the greater the correlation
the smaller the error variance, so the better our
prediction
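A quick numerical check of this decomposition on made-up data:

```python
# Made-up data: check that total variance = predicted + error variance,
# and that r^2 equals the proportion of variance explained.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

a, b = np.polyfit(x, y, 1)
y_hat = a * x + b
err = y - y_hat

print(np.isclose(y.var(), y_hat.var() + err.var()))           # True
print(np.corrcoef(x, y)[0, 1] ** 2, y_hat.var() / y.var())    # equal
```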
 Do we get a significantly better prediction of y
from our regression equation than by just
predicting the mean?
F-statistic
 Prediction / Forecasting
 Quantify the strength of the relationship between y and the predictors Xj (X1, X2, X3, …)
 A General Linear Model is just any model that
describes the data in terms of a straight line
 Linear regression is actually a form of the General
Linear Model where the parameters are b, the slope of
the line, and a, the intercept.
y = bx + a + ε
 Multiple regression is used to determine the effect of a
number of independent variables, x1, x2, x3 etc., on a single
dependent variable, y
 The different x variables are combined in a linear way and
each has its own regression coefficient:
y = b0 + b1x1 + b2x2 + … + bnxn + ε
 The b parameters (regression coefficients) reflect the independent
contribution of each independent variable, x, to the value of the dependent
variable, y.
 i.e. the amount of variance in y that is accounted for by each x
variable after all the other x variables have been accounted for
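A minimal multiple-regression sketch (synthetic data, hypothetical coefficients) fitting y = b0 + b1x1 + b2x2 + ε by ordinary least squares:

```python
# Synthetic data with hypothetical coefficients: fit y = b0 + b1*x1 + b2*x2 + e
# by ordinary least squares.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.5 + 2.0 * x1 - 0.7 * x2 + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), x1, x2])      # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")  # ~1.5, 2.0, -0.7
```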
Take Home Points
- Correlation doesn’t mean the variables are genuinely related.
e.g. any two variables increasing or decreasing over time will show a
nice correlation: CO2 concentration in the air over Antarctica and lodging rental costs
in London. Beware in longitudinal studies!
- A relationship between two variables doesn’t mean causality
(e.g. leaves on the forest floor and hours of sun)
 Linear regression is a GLM that models the effect of one
independent variable, x, on one dependent variable, y
 Multiple Regression models the effect of several
independent variables, x1, x2 etc, on one dependent
variable, y
 Both are types of General Linear Model
Thank You
Editor's Notes

  • #4: We can use information about distributions to decide how probable it is that the results of an experiment looking at variable x support a particular hypothesis about the distribution of variable y in the population. = central aim of experimental science This is how statistical tests work: test a sample distribution (our experimental results) against a hypothesised distribution, resulting in a ‘p’ value for how likely it is that we would obtain our results under the null hypothesis (null hypothesis = there is no effect or difference between conditions) – i.e. how likely it is that our results were a fluke!
  • #5: Normal distribution. The x-axis represents the values of a particular variable; the y-axis represents the proportion of members of the population that have each value of the variable; the area under the curve represents probability. Mean and standard deviation tell you the basic features of a distribution: mean = average value of all members of the group; standard deviation = a measure of how much the values of individual members vary in relation to the mean. The normal distribution is symmetrical about the mean, and 68% of the normal distribution lies within 1 s.d. of the mean. It’s important to remember that not all data has this distribution: normality is an assumption of t-tests, and if your data doesn’t fit the normal distribution you will have to use another type of test (like Chi-squared).
  • #6: A hypothesis is a prediction that you have about a specific group. H1 = the experimental hypothesis: there is a statistically significant difference between sample and population (or between samples). H0 = the null hypothesis: your experimental group is no different from the rest of the population. To get a statistically significant result, you need to show that your experimental group falls at the extreme end of the probability distribution. One-sided test: a one-sided test is like it sounds – the region for rejecting the null hypothesis is an entire section on one side of the distribution. You would use this if you only wanted to know whether your experimental group is significantly greater than another group. A one-sided test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in one tail of the probability distribution. In other words, the critical region for a one-sided test is the set of values less than the critical value of the test, or the set of values greater than the critical value of the test. A one-sided test is also referred to as a one-tailed test of significance. Two-sided test: a two-sided test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located in both tails of the probability distribution. You would use this if you wanted to know whether your experimental group is greater than or less than another group. In other words, the critical region for a two-sided test is the set of values less than a first critical value of the test and the set of values greater than a second critical value of the test. A two-sided test is also referred to as a two-tailed test of significance. The choice between a one-sided and a two-sided test is determined by the purpose of the investigation or prior reasons for using a one-sided test.
  • #7: The p-value is the probability that the observed result was obtained by chance if the null hypothesis is true. (This is distinct from power: the power of a statistical hypothesis test measures the test’s ability to reject the null hypothesis when it is actually false – that is, to make a correct decision.) The α level is set a priori (usually 0.05). With α = 0.05, we can be 95% certain that a significant experimental effect is genuine. If p is less than the set α level, you reject the null hypothesis and accept the experimental hypothesis; if, however, p > α, we reject the experimental hypothesis and retain the null hypothesis.
  • #8: Type I error = false positive: where we incorrectly accept the alternative/experimental hypothesis. An α level of 0.05 means that there is a 5% risk that a Type I error will be encountered. Type II error = false negative: where we incorrectly reject the alternative/experimental hypothesis when it should be accepted. Scientists care more about accepting a false result than rejecting a true one. Power = 1 – probability of a Type II error. The maximum power a test can have is 1, the minimum is 0. Ideally we want a test to have high power, close to 1.
  • #9: Here is an example of how these errors work. Imagine you just started dating someone. On your first date they mentioned that their birthday was coming up this week, but you can’t remember the exact day. It might be today, or maybe not. Embarrassed to admit it, you decide to make a guess. You have two choices: when you see them today you can say “Happy Birthday!”, or you can say nothing, hoping that today isn’t their birthday. The reality behind the situation is pretty simple: either today is their birthday or it isn’t. Saying “Happy Birthday!” when it is not their birthday is like a Type 1 error – a false positive: you’re saying it is their birthday when, in fact, it isn’t. Conversely, staying quiet when today is their birthday is like a Type 2 error – a false negative: it was their birthday, you said nothing, and you missed it.
  • #10: Example of a repeated measures T-test. Take same group and test brain volume before birth and after birth.
  • #11: It’s important to understand the difference between these two. A population is the entire group with everyone included; an example of this would be the US census. If you already know the variance of the general population, you can use the z-test. Realistically, researchers can’t sample everyone, so they have to use the t-test when the variance of the general population is unknown and only a subgroup is available. However, I should point out here that the strength of this test is dependent on the number of participants.
  • #12: This is the formula for a one-sample z-test. Basically, it is a formula set up to compare groups in a standard way through a linear transformation. Again, you use this formula when you have the variance of the entire population. Variance describes how far values lie from the mean. So, you plug in the mean for the group you are looking at, the mean for the population, and the standard deviation for the population to get a z-score. One way to make distributions directly comparable is to standardise them by computing a linear transformation; the standardised normal distribution does exactly that. This can be thought of as expressing your data in the same ‘units’. Therefore, if you remember from the previous slide, the range of 2 standard deviations around the mean covers approximately 95%; because the standard deviation of a standardised normal distribution is 1, a z-score of +2 or –2, i.e. 2 s.d., gives the boundary for our confidence interval (only for two-tailed tests! – compare the distribution around the mean versus the area from –infinity to z = 2.0).
  • #14: Often you don’t know the s.d. of the hypothesised or comparison population, and so you use a t-test. This uses the sample s.d. or variance instead. This test introduces a source of error, which decreases as your sample size increases. Therefore, the t statistic is distributed differently depending on the size of the sample, like a family of normal curves. The degrees of freedom (df = sample size – 1) represents which of these curves you are relating your t-value to. There are different tables of p-values for different degrees of freedom. larger sample = more ‘squashed’ t-statistic distribution = easier to get significance
  • #15: So, you would use a two-sample t-test if you wanted to determine whether two samples are different. So, we’ll look at our previous hypothesis: does having a baby increase your brain size?
  • #16: Because one wants to know whether brain size after giving birth is increased or decreased relative to before birth, this will be a two-tailed t-test. If you only wanted to find out whether brain size is greater after giving birth, this would be a one-tailed t-test.
  • #17: This is our sample, taken directly from a paper.
  • #19: This is particularly important because when using SPSS you will have to specify which type of statistical test you are using. I’m putting this up to show that one can take a group and try to find out if it is different from a hypothesized population. In this formula, you have (sample mean – hypothesized population mean) / standard error. This is when you compare the mean of one sample to a given value. You might use this if you were testing sleep behaviour and you thought the sample that you had was not the norm: you could hypothesize that the population sleeps 10 hours a night to determine whether your sample is significantly different from this. A one-sample t-test is a hypothesis test for answering questions about the mean where the data are a random sample of independent observations from an underlying normal distribution N(µ, σ²), where σ² is unknown.
  • #20: So, let’s say you want to compare more than 2 groups – for example, normal pregnancies vs. preeclamptic pregnancies, or several different times at which a brain scan was taken.
  • #21: In an experiment with more than 2 samples or more than 2 tasks (or 2 samples and 2 tasks), one could do lots of t-tests and compare all the different groups with each other this way, but you actually increase the possibility of accepting the experimental hypothesis when it’s wrong. You’ll remember this as the false positive situation (this is referred to as the familywise/experimentwise error rate). It is much better to use ANOVA. In its simplest form ANOVA provides a statistical test of whether or not the means of several groups are all equal, and therefore generalises the t-test to more than two groups. ANOVA is concerned with differences between means of groups, not differences between variances; the name analysis of variance comes from the way the analysis uses variances to decide whether the means are different. The way it works is simple: the statistical procedure looks to see what the variation (variance) is within the groups, then works out how that variation would translate into variation (i.e. differences) between the groups, taking into account how many subjects there are in the groups. If the observed differences are a lot bigger than what you’d expect by chance, you have statistical significance. (So if the patterns of data spread are similar in your different samples, the means won’t be much different, i.e. the samples are probably from the same population; conversely, if the pattern of variance differs between groups, so will the means, and the samples are likely to be drawn from different populations.) Terminology: Factors = the overall “things” being compared (type of pregnancy); Levels = the different elements of a factor (young vs. old, time to pregnancy). ANOVA tests for one overall effect only, so it can tell us if the experimental manipulation was generally successful, but it doesn’t provide specific information about which specific groups were affected – hence the need for post hoc testing! ANOVA produces an F-statistic or F-ratio, which is similar to the t-score in that it compares the amount of systematic variance in the data to the amount of unsystematic variance. As such, it is the ratio of the experimental effect to the individual differences in performance. If the F-ratio’s value is less than 1, it must represent a non-significant effect (so you always want an F-ratio greater than 1, indicating that the experimental manipulation had some effect above and beyond the effect of individual differences in performance). To test for significance, compare the obtained F-ratio against the maximum value one would expect to get by chance alone in an F-distribution with the same degrees of freedom. The p-value associated with F is the probability that the differences between groups could occur by chance if the null hypothesis is correct.
  • #22: The type that I have described is referred to as a one-way ANOVA because it has one factor (with more than 2 levels). You can also have two-way and three-way ANOVAs; these are factorial ANOVAs, which allow for possible interactions between factors as well as main effects. For example, you could have 2 factors with 2 levels each; this would be a 2 x 2 factorial design. You can also have related or independent designs, or a mixture.
  • #23: ANOVA tells you that there is a significant difference between the groups, NOT where this difference lies; finding exactly where the differences lie requires further statistical analyses. So when you are running a particular statistical test, you’ll specify that you want to have post hoc values listed, but you’ll need to make sure the overall value is significant.
  • #25: T-tests assess whether two group means differ significantly; they can compare two samples, or one sample to a given value. ANOVAs compare more than two groups or more complicated scenarios.