Regression analysis
   Week no 2 - 19th to 23rd Sept, 2011
Course Map
Introduction to Quantitative Analysis, Ch1, RSH (1 Week)

Regression Models, Ch4 (1 Week)

Decision Analysis, Ch3, RSH (2 Weeks)

Linear Programming Models: Graphical & Computer Methods, Ch7, RSH (2 Weeks)

Linear Programming Modeling Applications: With Computer Analyses in Excel, Ch8, RSH (2 Weeks)

Simulation Modeling, Ch15, RSH (2 Weeks)

Forecasting, Ch5, RSH. (2 Weeks)

Waiting Lines and Queuing Theory Models, Ch14, RSH. (2 Weeks)
regression analysis
A very valuable tool for today’s manager.
Regression Analysis is used to:

Understand the relationship between variables.

Predict the value of one variable based on
another variable.

A regression model has:

a dependent, or response, variable (Y axis)

an independent, or predictor, variable (X axis)
How to perform regression analysis
regression analysis
Triple A Construction Company renovates old homes in Albany. The company has found that its dollar volume of renovation work depends on the Albany area payroll.

Local Payroll ($100,000,000's)    Triple A Sales ($100,000's)
3                                 6
4                                 8
6                                 9
4                                 5
2                                 4.5
5                                 9.5
Scatter plot
[Figure: scatter plot of Triple A sales ($100,000's) against local payroll ($100,000,000's).]
regression analysis model
Regression: Understand & Predict
Create a Scatter Plot
Perform Regression Analysis

$Y = \beta_0 + \beta_1 X + \epsilon$

Where,
$Y$ = dependent variable (response)
$X$ = independent variable (predictor)
$\beta_0$ = intercept (value of Y when X = 0)
$\beta_1$ = slope
$\epsilon$ = some random error that cannot be predicted
regression analysis model
Sample data are used to estimate the true values for the intercept and slope.

$\hat{Y} = b_0 + b_1 X$

Where,
$\hat{Y}$ = predicted value of Y
$b_0$ = estimate of the intercept, based on sample data
$b_1$ = estimate of the slope, based on sample data

The difference between the actual value of Y and the predicted value (using sample data) is known as the error.

Error = (actual value) – (predicted value)

$e = Y - \hat{Y}$
regression analysis model
Sales (Y)    Payroll (X)    $(X - \bar{X})^2$    $(X - \bar{X})(Y - \bar{Y})$
6            3              1                    1
8            4              0                    0
9            6              4                    4
5            4              0                    0
4.5          2              4                    5
9.5          5              1                    2.5
Summations for each column:
42           24             10                   12.5

$\bar{Y} = 42/6 = 7$, $\bar{X} = 24/6 = 4$

Calculating the required parameters:

$b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{12.5}{10} = 1.25$

$b_0 = \bar{Y} - b_1 \bar{X} = 7 - (1.25)(4) = 2$

So,

$\hat{Y} = 2 + 1.25X$
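The slope and intercept calculation for Triple A Construction can be reproduced in a few lines. The following is a minimal Python sketch using only the six data points from the slide; the function name `fit_simple_regression` is made up for illustration and is not part of the course material.

```python
# Triple A Construction data from the slides.
payroll = [3, 4, 6, 4, 2, 5]        # X, local payroll ($100,000,000's)
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y, sales ($100,000's)


def fit_simple_regression(x, y):
    """Return (b0, b1) for the least-squares line Y-hat = b0 + b1 * X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared X deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1


b0, b1 = fit_simple_regression(payroll, sales)
print(f"Y-hat = {b0:g} + {b1:g} X")   # prints "Y-hat = 2 + 1.25 X"
```

The two sums inside the function correspond to the last two column totals of the table (12.5 and 10).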
Measuring the Fit of the Linear Regression Model
To understand how well X predicts Y, we evaluate:

Variability in the Y variable
  SSR –> Regression variability that is explained by the relationship between X & Y
  + SSE –> Unexplained variability, due to factors other than the regression
  = SST –> Total variability about the mean

Correlation Coefficient
  r: strength of the relationship between the Y and X variables

Coefficient of Determination
  R Sq: proportion of explained variation

Standard Error
  Standard deviation of the error around the regression line

Residual Analysis
  Validation of the model

Test for Linearity
  Significance of the regression model, i.e. the linear regression model
Variability
[Figure: regression line y = 1.25x + 2 (R² = 0.6944) drawn over the sales versus local payroll ($100,000,000's) scatter plot, marking SST, SSE and SSR (explained variability) relative to the mean $\bar{Y}$.]
Variability
Errors (deviations) may be positive or negative. Summing the errors would be misleading, thus we square the terms prior to summing.

- Sum of Squares Total (SST) measures the total variability in Y.
  $SST = \sum (Y - \bar{Y})^2$
- Sum of the Squared Error (SSE) is less than the SST because the regression line reduced the variability.
  $SSE = \sum e^2 = \sum (Y - \hat{Y})^2$
- Sum of Squares due to Regression (SSR) indicates how much of the total variability is explained by the regression model.
  $SSR = \sum (\hat{Y} - \bar{Y})^2$

For Triple A Construction:
$SST = \sum (Y - \bar{Y})^2 = 22.5$
$SSE = \sum e^2 = \sum (Y - \hat{Y})^2 = 6.875$
$SSR = \sum (\hat{Y} - \bar{Y})^2 = 15.625$

Note: SST = SSR + SSE, where SSR is the explained variability and SSE is the unexplained variability.
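The three sums of squares can be checked directly against the Triple A data. This is a small Python sketch that assumes the fitted line from the earlier slide (intercept 2, slope 1.25); the variable names are illustrative.

```python
payroll = [3, 4, 6, 4, 2, 5]        # X, local payroll ($100,000,000's)
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y, sales ($100,000's)
b0, b1 = 2.0, 1.25                  # fitted line quoted on the slides

y_bar = sum(sales) / len(sales)
predicted = [b0 + b1 * x for x in payroll]

# Total, unexplained and explained variability.
sst = sum((y - y_bar) ** 2 for y in sales)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)

print(sst, sse, ssr)                # prints "22.5 6.875 15.625"
```

Adding the last two numbers gives the first, which is the SST = SSR + SSE identity.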
Coefficient of Determination
The coefficient of determination ($r^2$) is the proportion of the variability in Y that is explained by the regression equation.

$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

SST, SSR and SSE by themselves provide little direct interpretation. $r^2$ measures the usefulness of the regression.

For Triple A Construction:

$r^2 = \frac{15.625}{22.5} = 0.6944$

About 69% of the variability in sales is explained by the regression based on payroll.

Note: $0 \le r^2 \le 1$
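Both forms of the formula give the same value, which is quick to confirm. This short Python sketch uses the Triple A sums of squares quoted on the slides as plain constants.

```python
sst, ssr, sse = 22.5, 15.625, 6.875   # sums of squares quoted on the slides

r2_from_ssr = ssr / sst               # explained share of the variability
r2_from_sse = 1 - sse / sst           # one minus the unexplained share

print(round(r2_from_ssr, 4), round(r2_from_sse, 4))
```

Both expressions round to 0.6944, as on the slide.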
Correlation Coefficient
The correlation coefficient (r) measures the strength of the linear relationship. It is shown as Multiple R in the Excel output.

$r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}}$

For Triple A Construction, r = 0.8333

Note: $-1 \le r \le 1$

Correlation Coefficient
[Figure: possible scatter diagrams for values of r.]
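The computational formula for r can be applied to the Triple A data as a check. Below is a minimal Python sketch; the function name `correlation` is illustrative, and only the standard `math` module is used.

```python
import math

payroll = [3, 4, 6, 4, 2, 5]        # X, local payroll ($100,000,000's)
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y, sales ($100,000's)


def correlation(x, y):
    """Correlation coefficient r from raw sums (computational formula)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation(payroll, sales)
print(round(r, 4), round(r ** 2, 4))
```

For simple linear regression, squaring r reproduces the coefficient of determination (0.6944 here).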
Standard error
The mean squared error (MSE) is the estimate of the error variance of the regression equation.

$s^2 = MSE = \frac{SSE}{n - k - 1}$

Where,
 n = number of observations in the sample
 k = number of independent variables

The standard error, $s = \sqrt{MSE}$, is the standard deviation of the error around the regression line. Just like the standard deviation (which measures variation around the mean), it measures the variation of Y around the regression line, and it has the same units as Y.

For Triple A Construction, $s^2 = MSE = 6.875 / 4 = 1.7188$, so $s = 1.31$. This means an error in prediction of roughly ±1.3 × $100,000 of sales.
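As a worked check, the MSE and standard error follow from SSE = 6.875, n = 6 and k = 1. Note that 1.31 is the square root of the MSE; the MSE itself is 1.7188, the figure used later in the F test. A short Python sketch:

```python
import math

sse = 6.875   # sum of squared errors quoted on the slides
n = 6         # observations
k = 1         # independent variables

mse = sse / (n - k - 1)   # estimate of the error variance, s squared
s = math.sqrt(mse)        # standard error, in the same units as Y

print(mse, round(s, 2))
```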
Test for linearity
An F-test is used to statistically test the null hypothesis that there is no linear relationship between the X and Y variables (i.e. the slope $\beta_1 = 0$). If the significance level for the F test is low, we reject Ho and conclude there is a linear relationship.

$F = \frac{MSR}{MSE}$, where $MSR = \frac{SSR}{k}$

The p value is the significance level of the test. Alpha is the chosen level of significance, i.e. 1 minus the confidence level. If p < alpha, reject the null hypothesis that there is no linear relationship between X & Y.

For Triple A Construction:

$MSR = \frac{15.625}{1} = 15.625$

$F = \frac{15.625}{1.7188} = 9.0909$

The significance level for F = 9.0909 is 0.0394, indicating we reject Ho and conclude a linear relationship exists between sales and payroll.
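The F statistic for Triple A Construction can be recomputed from the sums of squares. This Python sketch takes the p value (0.0394) from the slide rather than computing it from the F distribution, and the alpha of 0.05 is an assumed choice for illustration.

```python
ssr, sse = 15.625, 6.875   # sums of squares quoted on the slides
n, k = 6, 1                # observations, independent variables

msr = ssr / k              # mean square due to regression
mse = sse / (n - k - 1)    # mean squared error
f_stat = msr / mse

p_value = 0.0394           # significance of F as reported on the slide
alpha = 0.05               # assumed level of significance

reject_h0 = p_value < alpha
print(round(f_stat, 4), reject_h0)
```

With these inputs the decision is to reject Ho, matching the slide's conclusion that a linear relationship exists.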
Computer Software for Regression
In Excel, use Tools/Data Analysis. This is an ‘add-in’ option.

[Figure: Excel regression output for Triple A Construction, including the Anova table, with the following annotations.]
- Multiple R is the correlation coefficient.
- The adjusted R Sq takes into account the number of independent variables in the model.
- Standard Error is the standard deviation of the error around the regression line, in the same units as Y. Here it means roughly ±1.3 × $100,000 of sales error in prediction.
- p Value < Alpha (0.05 or 0.1) means the relationship between X & Y is linear.
Residual Analysis: to verify regression assumptions are correct
Assumptions of the Regression Model
We make certain assumptions about the errors in a regression model which allow for statistical testing.

Assumptions:
- Errors are independent.
- Errors are normally distributed.
- Errors have a mean of zero.
- Errors have a constant variance.

A plot of the errors (real value minus predicted value of Y), also called residuals in Excel, may highlight problems with the model.

PITFALLS:
- Prediction beyond the range of X values in the sample can be misleading, including interpretation of the intercept (X=0).
- A linear regression model may not be the best model, even in the presence of a significant F test.
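The residuals that these checks rely on are simple to compute. The sketch below, in Python, lists the Triple A residuals against X using the fitted line from the earlier slides (intercept 2, slope 1.25) and confirms that their mean is zero; plotting is left out to keep it dependency-free.

```python
payroll = [3, 4, 6, 4, 2, 5]        # X, local payroll ($100,000,000's)
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y, sales ($100,000's)
b0, b1 = 2.0, 1.25                  # fitted line quoted on the slides

# Residual = actual value minus predicted value.
residuals = [y - (b0 + b1 * x) for x, y in zip(payroll, sales)]
mean_residual = sum(residuals) / len(residuals)

# List residuals in order of X, as one would read them off a residual plot.
for x, e in sorted(zip(payroll, residuals)):
    print(f"X = {x}: residual = {e:+.2f}")
print("mean residual:", mean_residual)
```

A least-squares line with an intercept always produces residuals that sum to zero, so the zero-mean check mainly guards against calculation mistakes; the pattern of the residuals is what reveals model problems.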
Constant variance
Errors have constant variance assumption: plot the residuals against the X values. The pattern should be random!
[Figure: residual plot for Triple A Construction.]
[Figure: residual plot showing non-constant variation in error, a violation.]
Normal distribution
Histogram of residuals: should look like a bell curve.
[Figure: histogram of residuals for Triple A Construction.]
It is not possible to see the bell curve with just 6 observations. More samples are needed.
zero mean
Errors have zero mean.
[Figure: residuals for Triple A Construction plotted against X, with a reference line at 0.]
independent errors
If samples are collected over a period of time and not at the same time, then plot the residuals against time to see if any pattern (autocorrelation) exists.

If there is substantial autocorrelation, the validity of the regression model becomes doubtful.

Autocorrelation can also be checked using the Durbin–Watson statistic.

Example: The manager of a package delivery store wants to predict weekly sales based on the number of customers making purchases for a period of 100 days. Data is collected over a period of time, so check for an autocorrelation (pattern) effect.

[Figure: residuals plotted against time showing a cyclical pattern, a violation.]
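The slides name the Durbin–Watson statistic without giving its formula. The standard definition is the sum of squared differences between successive residuals divided by the sum of squared residuals; values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The Python sketch below applies that definition to two residual series invented purely for illustration; they are not data from the course.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual series, invented for illustration only.
slow_drift = [1.0, 0.8, 0.5, 0.1, -0.4, -0.8, -1.0, -0.7, -0.2, 0.4]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(durbin_watson(slow_drift))    # well below 2: positive autocorrelation
print(durbin_watson(alternating))   # well above 2: negative autocorrelation
```

The slowly drifting series resembles the cyclical pattern described on the slide, where neighbouring residuals tend to share the same sign.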
Residual analysis for validating assumptions
[Figure: nonlinear residual plot, a violation.]
multiple regression
Multiple regression models are similar to simple linear regression models except they include more than one X variable.

$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$

where $b_1, \dots, b_n$ are the slopes and $X_1, \dots, X_n$ are the independent variables.

Wilson Realty wants to develop a model to determine the suggested listing price for a house based on size and age.

Price       Sq. Feet    Age         Condition
35000       1926        30          Good
47000       2069        40          Excellent
49900       1720        30          Excellent
55000       1396        15          Good
58900       1706        32          Mint
60000       1847        38          Mint
67000       1950        27          Mint
70000       2323        30          Excellent
78500       2285        26          Mint
79000       3752        35          Good
87500       2300        18          Good
93000       2525        17          Good
95000       3800        40          Excellent
97000       1740        12          Mint
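The course fits this model in Excel. As an alternative illustration, the sketch below fits price on size and age in plain Python by solving the two normal equations on mean-centered data with Cramer's rule. The function name `fit_two_predictors` is hypothetical, and this hand-rolled approach only suits the two-variable case; it is not how Excel computes the result.

```python
# Wilson Realty data from the slide: (price, square feet, age).
houses = [
    (35000, 1926, 30), (47000, 2069, 40), (49900, 1720, 30),
    (55000, 1396, 15), (58900, 1706, 32), (60000, 1847, 38),
    (67000, 1950, 27), (70000, 2323, 30), (78500, 2285, 26),
    (79000, 3752, 35), (87500, 2300, 18), (93000, 2525, 17),
    (95000, 3800, 40), (97000, 1740, 12),
]


def fit_two_predictors(rows):
    """Least squares for Y-hat = b0 + b1*X1 + b2*X2 via normal equations."""
    n = len(rows)
    y_bar = sum(r[0] for r in rows) / n
    x1_bar = sum(r[1] for r in rows) / n
    x2_bar = sum(r[2] for r in rows) / n
    # Deviations from the means: (y, x1, x2) for each house.
    d = [(y - y_bar, x1 - x1_bar, x2 - x2_bar) for y, x1, x2 in rows]
    s11 = sum(x1 * x1 for _, x1, _ in d)
    s22 = sum(x2 * x2 for _, _, x2 in d)
    s12 = sum(x1 * x2 for _, x1, x2 in d)
    s1y = sum(x1 * y for y, x1, _ in d)
    s2y = sum(x2 * y for y, _, x2 in d)
    # Cramer's rule for the 2x2 system in b1 and b2.
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = y_bar - b1 * x1_bar - b2 * x2_bar
    return b0, b1, b2


b0, b1, b2 = fit_two_predictors(houses)
print(f"b0 = {b0:.2f}, b1 (size) = {b1:.2f}, b2 (age) = {b2:.2f}")
```

The coefficients should come out close to the intercept of about 60815, size coefficient of about 21.91 and age coefficient of about -1449 reported for Wilson Realty.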
multiple regression
Y = 60815.45 + 21.91(size) – 1449.34 (age)

[Figure: Excel regression output for Wilson Realty, with the following annotations.]
- 67% of the variation in sales price is explained by size and age.
- Ho: No linear relationship is rejected.
- Ho: $\beta_1 = 0$ is rejected.
- Ho: $\beta_2 = 0$ is rejected.

Wilson Realty has found a linear relationship between price and size and age. The coefficient for size indicates each additional square foot increases the value by $21.91, while each additional year in age decreases the value by $1449.34.

For a 1900 square foot house that is 10 years old, the following prediction can be made:

$87,951 = 60815.45 + 21.91(1900) – 1449.34(10)
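The prediction is the full equation evaluated at size = 1900 and age = 10, including the intercept and the negative sign on age. A minimal Python sketch, with a hypothetical function name `predict_price`:

```python
def predict_price(size_sqft, age_years):
    """Wilson Realty equation from the slide: price from size and age."""
    return 60815.45 + 21.91 * size_sqft - 1449.34 * age_years


# 1900 square foot house that is 10 years old.
print(round(predict_price(1900, 10)))   # prints 87951
```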
binary or dummy variables
dummy variables
Binary (or dummy) variables are special variables that are created for qualitative data.

- A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.
- The number of dummy variables must equal one less than the number of categories of the qualitative variable.

Return to Wilson Realty, and let’s evaluate how to use property condition in the regression model. There are three categories: Mint, Excellent, and Good.

$X_3$ = 1 if the house is in excellent condition
      = 0 otherwise
$X_4$ = 1 if the house is in mint condition
      = 0 otherwise

Note: If both $X_3$ and $X_4$ = 0 then the house is in good condition.
dummy variables
As more variables are added to the model, the $r^2$ usually increases.

Y = 48329.23 + 28.21 (size) – 1981.41(age) + 23684.62 (if mint) + 16581.32 (if excellent)
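To make the dummy coding concrete, the sketch below encodes the three condition categories as two 0/1 variables and plugs them into the Wilson Realty equation above. The function names and the example house (1900 square feet, 10 years old) are chosen for illustration.

```python
def condition_dummies(condition):
    """Return (x3, x4): x3 = 1 for excellent, x4 = 1 for mint, both 0 for good."""
    condition = condition.lower()
    if condition not in ("good", "excellent", "mint"):
        raise ValueError(f"unknown condition: {condition}")
    x3 = 1 if condition == "excellent" else 0
    x4 = 1 if condition == "mint" else 0
    return x3, x4


def predict_price(size_sqft, age_years, condition):
    """Wilson Realty equation with the two condition dummies."""
    x3, x4 = condition_dummies(condition)
    return (48329.23 + 28.21 * size_sqft - 1981.41 * age_years
            + 23684.62 * x4 + 16581.32 * x3)


for cond in ("Good", "Excellent", "Mint"):
    print(cond, round(predict_price(1900, 10, cond), 2))
```

Good condition is the baseline category, so the mint and excellent coefficients read as price premiums over an otherwise identical house in good condition.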
model building
adjusted r-Square
The best model is a statistically significant model with a high $r^2$ and a few variables.

- As more variables are added to the model, the $r^2$ usually increases.
- The adjusted $r^2$ takes into account the number of independent variables in the model.

Note: When variables are added to the model, the value of $r^2$ can never decrease; however, the adjusted $r^2$ may decrease.
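The slides do not give the adjusted $r^2$ formula. The usual definition, assumed here, is 1 − [SSE/(n − k − 1)] / [SST/(n − 1)], which penalizes each added variable through the degrees of freedom. A short Python sketch using the Triple A Construction figures (SSE = 6.875, SST = 22.5, n = 6, k = 1):

```python
def adjusted_r2(sse, sst, n, k):
    """Adjusted r-squared: compares error variance to total variance."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


sse, sst = 6.875, 22.5          # Triple A Construction sums of squares
r2 = 1 - sse / sst
adj = adjusted_r2(sse, sst, n=6, k=1)

print(round(r2, 4), round(adj, 4))
```

The adjusted value is lower than the plain $r^2$, and the gap widens as k grows relative to n.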
multicollinearity
Collinearity or multicollinearity exists when an independent variable is correlated with another independent variable. Duplication of information occurs.

- Collinearity and multicollinearity create problems in the coefficients.
- The overall model prediction is still good; however, individual interpretation of the variables is questionable.

When multicollinearity exists, the overall F test is still valid, but the hypothesis tests related to the individual coefficients are not. A variable may appear to be significant when it is insignificant, or a variable may appear to be insignificant when it is significant.
non-linear regression
Engineers at Colonel Motors want to use regression analysis to improve fuel efficiency. They are studying the impact of weight on miles per gallon (MPG).

Linear regression model:

MPG = 47.8 – 8.2 (weight)

F significance = .0003
$r^2$ = .7446
non-linear regression
Nonlinear (transformed variable) regression model

$MPG = 79.8 - 30.2(\text{weight}) + 3.4(\text{weight})^2$

F significance = .0002
$r^2$ = .8478
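To see how the two Colonel Motors models differ, both equations can be evaluated side by side. In the Python sketch below the weights 2, 3 and 4 are hypothetical inputs chosen for illustration; the slides do not state the unit of weight, so the values are simply in whatever unit the model was fitted with.

```python
def mpg_linear(weight):
    """Linear model from the slides."""
    return 47.8 - 8.2 * weight


def mpg_quadratic(weight):
    """Transformed-variable model with a weight-squared term."""
    return 79.8 - 30.2 * weight + 3.4 * weight ** 2


# Hypothetical weights, in the (unstated) unit used on the slides.
for weight in (2.0, 3.0, 4.0):
    print(weight, round(mpg_linear(weight), 1), round(mpg_quadratic(weight), 1))
```

The quadratic model predicts a drop in MPG that flattens as weight increases, which a straight line cannot represent.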
non-linear regression
 We should not try to interpret the coefficients of the variables
 due to the correlation between (weight) and (weight squared).
 Normally we would interpret the coefficient for as the change
 in Y that results from a 1-unit change in X1, while holding all
 other variables constant.
 Obviously holding one variable constant while changing the
 other is impossible in this example since If changes, then must
 change also.
 This is an example of a problem that exists when
 multicollinearity is present.
chapter assignments on LMS
quiz in next class
Case studies

Regression Analysis

  • 1. Regression analysis Week no 2 - 19th to 23rd Sept, 2011
  • 2. Course Map Introduction to Quantitative Analysis, Ch1, RSH (1 Week) Regression Models Ch4 (1week) Decision Analysis, Ch3, RSH (2 Weeks) Linear Programming Models: Graphical & Computer Methods, Ch7, RSH (2 Weeks) Linear Programming Modeling Applications: With Computer Analyses in Excel, Ch8, RSH (2 Weeks) Simulation Modeling, Ch15, RSH (2 Weeks) Forecasting, Ch5, RSH. (2 Weeks) Waiting Lines and Queuing Theory Models, Ch14, RSH. (2 Weeks)
  • 3. regression analysis A very valuable tool for today’s manager. Regression Analysis is used to: Understand the relationship between variables. Predict the value of one variable based on another variable. A regression model has: dependent, or response, variable - Y axis an independent, or predictor, variable - X axis
  • 5. regression analysis Triple A Construction Company renovates old homes in Albany. They have found that its dollar volume of renovation work is dependent on the Albany area payroll. Local Payroll Triple A Sales ($100,000,000's) ($100,000's) 3 6 4 8 6 9 4 5 2 4.5 5 9.5
  • 6. Scatter plot 10 8 6 100,000 Sales 4 2 0 0 1 2 3 4 5 6 Local Payroll ($100,000,000's)
  • 7. regression analysis model Regression: Understand & Predict Create a Scatter Plot Perform Regression Analysis some random error that cannot be predicted. Dependent Variable, Slope Response Independent Variable, Predictor Intercept (Value of Y when X=0)
  • 8. regression analysis model Sample data are used to estimate the true values for the intercept and slope. Y = b0+ b 1X Where, Y = predicted value of Y The difference between the actual value of Y and the predicted value (using sample data) is known as the error. Error = (actual value) – (predicted value) e=Y-Y
  • 9. regression analysis model _ 2 _ _ Sales (Y) Payroll (X) (X - X) (X-X)(Y-Y) Calculating the required 6 3 1 1 parameters: 8 4 0 0 b 1= !(X-X)(Y-Y) = 12.5 = 1.25 ! (X-X) 2 10 9 6 4 4 5 4 0 0 bo= Y – b1X = 7 – (1.25)(4) = 2 4.5 2 4 5 So, 9.5 5 1 2.5 Y = 2 + 1.25 X Summations for each column: 42 24 10 12.5 _ _ Y = 42/6 = 7 X = 24/6 = 4
  • 10. Measuring the Fit of the linear Regression Model
  • 11. Measuring the Fit of the linear Regression Model To understand how well the X predicts the Y, we evaluate Variability in the Y Correlation Standard Residual variable Coefficient Error Analysis SSR –> Regression Variability St Deviation r – Strength of the Validation of that is explained by the of error relationship Model relationship b/w X & Y around the between Y and X + Regression variables SSE –> Unexplained Line Variability, due to factors then the regression Coefficient of Test for Linearity ------------------------------------ Determination Significance of the SST –> Total variability about R Sq - Proportion of Regression Model i.e. the mean explained variation Linear Regression Model
  • 12. Variability: chart of the regression line y = 1.25x + 2 (R² = 0.6944) against Local Payroll ($100,000,000s), marking SST, SSE, and SSR (explained variability) relative to the mean Ȳ.
  • 13. Variability: Errors (deviations) may be positive or negative. Summing the errors would be misleading, so we square the terms prior to summing.
        - Sum of Squares Total (SST) measures the total variability in Y: SST = Σ(Y - Ȳ)²
        - Sum of the Squared Error (SSE) is less than the SST because the regression line reduced the variability: SSE = Σe² = Σ(Y - Ŷ)²
        - Sum of Squares due to Regression (SSR) indicates how much of the total variability is explained by the regression model: SSR = Σ(Ŷ - Ȳ)²
     For Triple A Construction: SST = 22.5, SSE = 6.875, SSR = 15.625.
     Note: SST = SSR + SSE (explained variability + unexplained variability).
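The three sums of squares for Triple A Construction can be checked numerically. The sketch below assumes the fitted line Ŷ = 2 + 1.25X from the slides; the variable names are illustrative only.

```python
payroll = [3, 4, 6, 4, 2, 5]        # X
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y
b0, b1 = 2.0, 1.25                  # fitted line from the slides

y_bar = sum(sales) / len(sales)
predicted = [b0 + b1 * x for x in payroll]   # Y-hat for each observation

# Total, unexplained, and explained variability.
sst = sum((y - y_bar) ** 2 for y in sales)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)

print(sst, sse, ssr)                # prints "22.5 6.875 15.625"
```

The values satisfy SST = SSR + SSE, as the note on the slide states.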
  • 14. Coefficient of Determination: The coefficient of determination (r²) is the proportion of the variability in Y that is explained by the regression equation.
        r² = SSR/SST = 1 - SSE/SST
     SST, SSR and SSE by themselves provide little direct interpretation; r² measures the usefulness of the regression.
     For Triple A Construction: r² = 15.625/22.5 = 0.6944, so 69% of the variability in sales is explained by the regression based on payroll.
     Note: 0 ≤ r² ≤ 1
  • 15. Correlation Coefficient: The correlation coefficient (r) measures the strength of the linear relationship. It is shown as Multiple R in the Excel output.
        r = (nΣXY - ΣXΣY) / √([nΣX² - (ΣX)²][nΣY² - (ΣY)²])
     For Triple A Construction, r = 0.8333.
     Note: -1 ≤ r ≤ 1
     Figure: possible scatter diagrams for values of r.
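The computational formula for r can be applied directly to the Triple A data. This is a small sketch using only the standard library; the intermediate names (`sum_xy`, `sum_x2`, and so on) are illustrative.

```python
import math

payroll = [3, 4, 6, 4, 2, 5]        # X
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y
n = len(payroll)

sum_x = sum(payroll)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(payroll, sales))
sum_x2 = sum(x * x for x in payroll)
sum_y2 = sum(y * y for y in sales)

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 4), round(r ** 2, 4))   # prints "0.8333 0.6944"
```

Squaring r gives back the coefficient of determination, 0.6944, which holds for simple (one-variable) linear regression.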
  • 17. Standard error: The mean squared error (MSE) is the estimate of the error variance of the regression equation.
        s² = MSE = SSE / (n - k - 1)
     where n = number of observations in the sample and k = number of independent variables.
     The standard error of the estimate, s = √MSE, is the standard deviation of the error around the regression line. Just like the standard deviation (which is measured around the mean), it measures the variation of Y around the regression line, in the same units as Y.
     For Triple A Construction, s² = MSE = 6.875/4 = 1.7188 and s = 1.31, i.e. a prediction error of about ±1.31 × $100,000 in sales.
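As a quick check of the arithmetic, the MSE and the standard error can be computed from the SSE given earlier in the slides. This is a minimal sketch; the variable names are illustrative.

```python
import math

sse = 6.875     # sum of squared errors for Triple A Construction
n = 6           # number of observations
k = 1           # number of independent variables (payroll only)

mse = sse / (n - k - 1)     # estimate of the error variance, s squared
s = math.sqrt(mse)          # standard error of the estimate

print(f"MSE = {mse}, s = {s:.2f}")   # prints "MSE = 1.71875, s = 1.31"
```

The value 1.71875 is the same MSE (rounded to 1.7188) that appears in the denominator of the F statistic.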
  • 18. Test for linearity: An F-test is used to statistically test the null hypothesis that there is no linear relationship between the X and Y variables (i.e. β1 = 0). If the significance level for the F test is low, we reject H0 and conclude there is a linear relationship.
        F = MSR / MSE, where MSR = SSR / k
     The p value is the significance level of the test; alpha is the chosen level of significance (alpha = 1 - confidence level). If p < alpha, reject the null hypothesis that there is no linear relationship between X and Y.
     For Triple A Construction: MSR = 15.625/1 = 15.625 and F = 15.625/1.7188 = 9.0909. The significance level for F = 9.0909 is 0.0394, indicating we reject H0 and conclude a linear relationship exists between sales and payroll.
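The F statistic and its significance level can be reproduced without a statistics package in this particular case. The sketch below relies on a special-case shortcut that is not in the slides: with 1 and 4 degrees of freedom, F equals t² and the t distribution with 4 degrees of freedom has a closed-form CDF. For other degrees of freedom a statistics library or an F table would be needed.

```python
import math

ssr, sse = 15.625, 6.875    # from the Triple A Construction example
n, k = 6, 1                 # observations, independent variables

msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse

# Special case only (assumption of this sketch): k = 1 and n - k - 1 = 4.
# Then F = t^2, and the t distribution with 4 degrees of freedom has CDF
#   0.5 + (3/8) * sqrt(u) * (1 - u/12), with u = t^2 / (1 + t^2/4).
t_squared = f_stat
u = t_squared / (1 + t_squared / 4)
cdf = 0.5 + 0.375 * math.sqrt(u) * (1 - u / 12)
p_value = 2 * (1 - cdf)     # two-sided t test = upper tail of the F test

alpha = 0.05
print(round(f_stat, 4), round(p_value, 4), p_value < alpha)
# prints "9.0909 0.0394 True"
```

Because p is below alpha = 0.05, the null hypothesis of no linear relationship is rejected, matching the slide.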
  • 19. Computer Software for Regression In Excel, use Tools/ Data Analysis. This is an ‘add-in’ option.
  • 20. Computer Software for Regression
  • 21. Computer Software for Regression (annotations on the Excel output):
        - Multiple R is the correlation coefficient.
        - The adjusted R Sq takes into account the number of independent variables in the model.
        - Standard Error: just like the standard deviation (which is measured around the mean), it measures the variation of Y around the regression line, i.e. the standard deviation of error around the regression line. It is in the same units as Y, so here it means a prediction error of about ±1.3 × $100,000 in sales.
        - p value < alpha (0.05 or 0.1) means the linear relationship between X and Y is statistically significant.
  • 23. Residual Analysis: to verify regression assumptions are correct
  • 24. Assumptions of the Regression Model: We make certain assumptions about the errors in a regression model which allow for statistical testing.
     Assumptions:
        - Errors are independent.
        - Errors are normally distributed.
        - Errors have a mean of zero.
        - Errors have a constant variance.
     A plot of the errors (actual value minus predicted value of Y), called residuals in Excel, may highlight problems with the model.
     PITFALLS: Prediction beyond the range of X values in the sample can be misleading, including interpretation of the intercept (X = 0). A linear regression model may not be the best model, even in the presence of a significant F test.
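The residuals that these assumptions refer to can be listed for the Triple A data. This is a minimal sketch assuming the fitted line Ŷ = 2 + 1.25X from the slides; plotting is left out so that the example needs only the standard library.

```python
payroll = [3, 4, 6, 4, 2, 5]        # X
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y
b0, b1 = 2.0, 1.25                  # fitted line from the slides

# Residual = actual value minus predicted value.
residuals = [y - (b0 + b1 * x) for x, y in zip(payroll, sales)]
mean_residual = sum(residuals) / len(residuals)

for x, e in zip(payroll, residuals):
    print(f"X = {x}: residual = {e:+.2f}")
print("mean residual:", mean_residual)   # prints "mean residual: 0.0"
```

Plotting these residuals against X (or against time, for data collected over time) is the visual check described on the following slides.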
  • 25. Constant variance: Assumption: errors have constant variance. Plot the residuals against the X values; the pattern should be random. Figures: Triple A Construction residual plot; residual plot showing non-constant variation in error (a violation).
  • 26. Normal distribution: A histogram of the residuals should look like a bell curve. For Triple A Construction, the bell curve cannot be seen with just 6 observations; more samples are needed.
  • 27. Zero mean: Assumption: errors have a mean of zero. Figure: Triple A Construction residual plot, with residuals scattered around 0 along the X axis.
  • 28. Independent errors: If samples are collected over a period of time and not at the same time, plot the residuals against time to see if any pattern (autocorrelation) exists. If there is substantial autocorrelation, the validity of the regression model becomes doubtful. Autocorrelation can also be checked using the Durbin–Watson statistic.
     Example: The manager of a package delivery store wants to predict weekly sales based on the number of customers making purchases for a period of 100 days. The data is collected over a period of time, so check for an autocorrelation (pattern) effect.
     Figure: residuals plotted against time, showing a cyclical pattern (a violation).
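The slides mention the Durbin–Watson statistic without giving its formula. The sketch below uses the standard definition, d = Σ(e_t - e_{t-1})² / Σe_t², which is supplied here as background rather than taken from the slides. The two residual series are made-up illustrations, not data from the package delivery example.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    squared_diffs = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    squared_errors = sum(e * e for e in residuals)
    return squared_diffs / squared_errors

# Hypothetical residual series, for illustration only.
drifting = [1, 1, 1, -1, -1, -1]       # neighbours tend to share a sign
alternating = [1, -1, 1, -1, 1, -1]    # neighbours tend to flip sign

print(round(durbin_watson(drifting), 3))      # prints 0.667
print(round(durbin_watson(alternating), 3))   # prints 3.333
```

The statistic lies between 0 and 4. Values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.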
  • 29. Residual analysis for validating assumptions: Figure: nonlinear residual plot (a violation).
  • 31. Multiple regression: Multiple regression models are similar to simple linear regression models except they include more than one X variable.
        Ŷ = b0 + b1X1 + b2X2 + … + bkXk
     where b1, …, bk are the slopes and X1, …, Xk are the independent variables.
     Wilson Realty wants to develop a model to determine the suggested listing price for a house based on size and age.
        Price     Sq. Feet    Age    Condition
        35000     1926        30     Good
        47000     2069        40     Excellent
        49900     1720        30     Excellent
        55000     1396        15     Good
        58900     1706        32     Mint
        60000     1847        38     Mint
        67000     1950        27     Mint
        70000     2323        30     Excellent
        78500     2285        26     Mint
        79000     3752        35     Good
        87500     2300        18     Good
        93000     2525        17     Good
        95000     3800        40     Excellent
        97000     1740        12     Mint
  • 32. Multiple regression: Wilson Realty has found a linear relationship between price and size and age. The coefficient for size indicates each additional square foot increases the value by $21.91, while each additional year in age decreases the value by $1449.34.
        Ŷ = 60815.45 + 21.91(size) - 1449.34(age)
     67% of the variation in sales price is explained by size and age. H0: no linear relationship is rejected. H0: β1 = 0 is rejected. H0: β2 = 0 is rejected.
     For a 1900 square foot house that is 10 years old, the following prediction can be made:
        Ŷ = 60815.45 + 21.91(1900) - 1449.34(10) = $87,951
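The prediction for the 1900 square foot, 10-year-old house can be checked by evaluating the fitted equation. The function name `predict_price` is hypothetical; the coefficients are the ones given on the slide.

```python
def predict_price(size_sqft, age_years):
    """Suggested listing price from the Wilson Realty size-and-age model."""
    return 60815.45 + 21.91 * size_sqft - 1449.34 * age_years

price = predict_price(1900, 10)
print(f"${price:,.0f}")     # prints "$87,951"
```

Note that the age term is subtracted: each additional year lowers the predicted price.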
  • 33. binary or dummy variables
  • 34. Dummy variables: Binary (or dummy) variables are special variables that are created for qualitative data.
        - A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.
        - The number of dummy variables must equal one less than the number of categories of the qualitative variable.
     Return to Wilson Realty, and let's evaluate how to use property condition in the regression model. There are three categories: Mint, Excellent, and Good.
        X3 = 1 if the house is in excellent condition, = 0 otherwise
        X4 = 1 if the house is in mint condition, = 0 otherwise
     Note: If both X3 and X4 = 0 then the house is in good condition.
  • 35. Dummy variables: As more variables are added to the model, the r² usually increases.
        Ŷ = 48329.23 + 28.21(size) - 1981.41(age) + 23684.62(if mint) + 16581.32(if excellent)
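The dummy-variable coding and the resulting prediction can be sketched as follows. The function names are hypothetical; the coding (one dummy for excellent, one for mint, with good as the baseline) and the coefficients follow the slides. The 1900 square foot, 10-year-old house is reused purely as an illustration.

```python
def condition_dummies(condition):
    """Return (x3, x4): x3 = 1 if excellent, x4 = 1 if mint, both 0 if good."""
    condition = condition.lower()
    if condition not in {"good", "excellent", "mint"}:
        raise ValueError(f"unknown condition: {condition}")
    x3 = 1 if condition == "excellent" else 0
    x4 = 1 if condition == "mint" else 0
    return x3, x4

def predict_price(size_sqft, age_years, condition):
    """Listing price from the Wilson Realty model with condition dummies."""
    x3, x4 = condition_dummies(condition)
    return (48329.23 + 28.21 * size_sqft - 1981.41 * age_years
            + 23684.62 * x4 + 16581.32 * x3)

for condition in ("Good", "Excellent", "Mint"):
    print(condition, round(predict_price(1900, 10, condition), 2))
```

Three categories need only two dummy variables: a house in good condition is represented by both dummies being 0, so its prediction uses the intercept, size, and age terms alone.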
  • 37. Adjusted r-square: The best model is a statistically significant model with a high r² and only a few variables.
        - As more variables are added to the model, the r² usually increases.
        - The adjusted r² takes into account the number of independent variables in the model.
     Note: When variables are added to the model, the value of r² can never decrease; however, the adjusted r² may decrease.
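The slides do not give the formula for the adjusted r². The sketch below uses the standard definition, adjusted r² = 1 - (1 - r²)(n - 1)/(n - k - 1), supplied here as background. The second call is a hypothetical scenario (a second variable that adds nothing to r²), not a result from the slides.

```python
def adjusted_r2(r2, n, k):
    """Adjusted r-square for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = 15.625 / 22.5      # Triple A Construction: SSR / SST

one_variable = adjusted_r2(r2, n=6, k=1)
# Hypothetical: a second variable is added but r2 does not improve.
two_variables = adjusted_r2(r2, n=6, k=2)

print(round(one_variable, 4), round(two_variables, 4))
# prints "0.6181 0.4907"
```

With r² held fixed, raising k lowers the adjusted r², which is how it penalizes variables that do not improve the fit.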
  • 38. Multicollinearity: Collinearity or multicollinearity exists when an independent variable is correlated with another independent variable. Duplication of information occurs.
        - Collinearity and multicollinearity create problems in the coefficients.
        - The overall model prediction is still good; however, individual interpretation of the variables is questionable.
     When multicollinearity exists, the overall F test is still valid, but the hypothesis tests related to the individual coefficients are not. A variable may appear to be significant when it is insignificant, or a variable may appear to be insignificant when it is significant.
  • 40. Non-linear regression: Engineers at Colonel Motors want to use regression analysis to improve fuel efficiency. They are studying the impact of weight on miles per gallon (MPG).
     Linear regression model: MPG = 47.8 - 8.2(weight); F significance = 0.0003; r² = 0.7446
  • 41. Non-linear regression: Nonlinear (transformed variable) regression model: MPG = 79.8 - 30.2(weight) + 3.4(weight)²; F significance = 0.0002; r² = 0.8478
  • 42. Non-linear regression: We should not try to interpret the coefficients of the variables, due to the correlation between (weight) and (weight squared). Normally we would interpret the coefficient for X1 as the change in Y that results from a 1-unit change in X1, while holding all other variables constant. Holding one variable constant while changing the other is impossible in this example, since (weight squared) is computed from (weight): if (weight) changes, then (weight squared) must change also. This is an example of a problem that exists when multicollinearity is present.
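How strongly a variable and its square move together can be seen by computing their correlation. The weights below are hypothetical values chosen for illustration (the Colonel Motors data is not reproduced in the slides), and the helper name `correlation` is likewise illustrative.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical weights, for illustration only.
weight = [1, 2, 3, 4, 5]
weight_squared = [w ** 2 for w in weight]

r = correlation(weight, weight_squared)
print(round(r, 2))      # prints 0.98
```

A correlation this close to 1 between two independent variables is the multicollinearity the slide warns about: the overall model can still predict well, but the individual coefficients should not be interpreted separately.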
  • 44. quiz in next class