Regression Analysis
Regression analysis is the appropriate statistical method
when the response variable and all explanatory variables
are continuous. Here, we only discuss linear regression,
the simplest and most common form.
The purpose of this lesson on correlation and linear
regression is to provide guidance on how R can be used to
determine the association between two variables and
then to use this degree of association to predict future
outcomes.
Past behavior is the best predictor of future behavior.
Linear Model
Regression analysis is a statistical technique that can be used
to develop a mathematical equation showing how variables
are related.
The basic function for fitting ordinary multiple regression
models is lm(), and a streamlined version of the call is as follows:
> fitted.model <- lm(formula, data = data.frame)
> fit <- lm(y ~ x1 + x2 + x3, data = mydata)      # with intercept
> fit <- lm(y ~ x1 + x2 + x3 - 1, data = mydata)  # omitting the intercept
> summary(fit)                                    # show results
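As a minimal runnable sketch, assuming the built-in mtcars data set (the variables chosen here are illustrative, not from the slides):
# Sketch: fit a multiple regression on the built-in mtcars data
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)  # mpg modeled on weight, horsepower, displacement
summary(fit)       # coefficients, R-squared, overall F-test
coef(fit)          # extract the estimated coefficients
fitted(fit)[1:5]   # first five fitted values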
Categorical independent variables / creating dummy variables
Example
Suppose the variable x is coded 1, 2, 3 and we want to label
the values low, medium, and high:
mydata$x <- factor(mydata$x, levels = c(1, 2, 3),
                   labels = c("low", "medium", "high"))
By default R chooses the reference (baseline) category as the
'first' level, which is decided alphabetically or numerically
(if coded as 1, 2, 3, ...). So if you had a factor with the four
values 'married', 'divorced', 'widowed', 'single', R will use
'divorced' as the reference category.
varx <- relevel(varx, ref = "wanted ref")  # compare table(varx)
# for the original and the releveled factor
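A small self-contained sketch of both steps (the data values are made up for illustration):
x <- c(1, 2, 3, 2, 1, 3)                  # numeric codes
x <- factor(x, levels = c(1, 2, 3),
            labels = c("low", "medium", "high"))
table(x)                                  # counts per level; "low" is the reference
x <- relevel(x, ref = "high")             # make "high" the reference category
table(x)                                  # same counts, new level order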
Creating dummy variables in R
Dummy variables are always binary, but they can also
be created from categorical variables with more than
two categories (one dummy per category, leaving one
out as the reference).
For instance, you might consider the geographic region
of respondents. You can use the region variable to this
end, but this is a categorical variable with four values:
data$cat1 <- ifelse(data$var == "value", 1, 0)
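A hedged sketch for a four-level region variable (the variable and value names are hypothetical):
# Hypothetical region variable with four values
data <- data.frame(region = c("north", "south", "east", "west", "north"))
data$north <- ifelse(data$region == "north", 1, 0)
data$south <- ifelse(data$region == "south", 1, 0)
data$east  <- ifelse(data$region == "east",  1, 0)
# "west" is omitted and serves as the reference category
head(data)
Alternatively, model.matrix(~ region, data) builds treatment-coded dummies automatically, dropping the reference level for you.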
Regression assumptions
Linearity of the data. The relationship between the predictor (x) and the
outcome (y) is assumed to be linear.
Normality of residuals. The residual errors are assumed to be normally
distributed.
Homogeneity of residuals variance. The residuals are assumed to have a
constant variance (homoscedasticity); plotting residuals versus fitted values
is a good test (see the sketch after this list).
Independence of the residual error terms.
You should check whether or not these assumptions hold true. Potential
problems include:
Non-linearity of the outcome-predictor relationships.
Heteroscedasticity: non-constant variance of the error terms.
Presence of influential values in the data, which can be:
Outliers: extreme values in the outcome (y) variable.
High-leverage points: extreme values in the predictor (x) variables.
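One way to eyeball these assumptions in R, as a sketch (mtcars and the chosen predictors are illustrative assumptions):
fit <- lm(mpg ~ wt + hp, data = mtcars)
plot(fitted(fit), resid(fit))           # residuals vs fitted: look for constant spread around 0
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))  # Q-Q plot: points near the line suggest normal residuals
shapiro.test(resid(fit))                # formal normality test (useful for small samples)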
Regression diagnostics
Diagnostic plots
Regression diagnostic plots can be created using the
R base function plot() or the autoplot() function from
the ggfortify package, which creates ggplot2-based
graphics.
Create the diagnostic plots with the R base function:
par(mfrow = c(2, 2))
plot(model)
Or with ggfortify:
library(ggfortify)
autoplot(model)
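Putting it together as a sketch (the model and data are illustrative assumptions):
model <- lm(mpg ~ wt + hp, data = mtcars)
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(model)            # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))   # reset the plotting layout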