Class Outline
• Incorporating Discrete Variables in Regression
Analysis Using Dummy Variables
• Incorporating Discrete Variables with 3+
Categories in Regression Analysis Using Dummy
Variables
• Application Exercise #1: Impact of Competition on
Airfare
• Application Exercise #2: Movie Box Office Revenue
Forecast
Example – Sales Data Continued
Market ID Sales Price Advertising
1 214 2.2 Yes
2 163 2.7 No
3 201 2.4 Yes
4 152 2.9 No
5 157 2.8 No
6 213 2.2 Yes
7 226 2 Yes
8 187 2.3 No
9 219 2.1 Yes
10 163 2.7 No
11 157 2.8 No
12 189 2.6 Yes
13 169 2.6 No
14 152 2.9 No
15 189 2.6 Yes
Advertising: discrete variable with 2 categories
Regression Analysis Using Dummy Variables
• We can represent a discrete variable using dummy
variables
• Dummy variable: takes the value 0 or 1 to indicate
the absence or presence of a categorical effect
• Procedure (two category case)
1. Select a baseline category: e.g. No advertising
2. Generate a dummy variable for the non-baseline
category (DumAdYes)
3. Use the variable in regression analysis
Sales = a + b1*Price + b2*DumAdYes + ε
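Step 2 of the procedure can be sketched in Python as follows. This is only an illustration: it assumes the Advertising column is held as a list of "Yes"/"No" strings, and the names `advertising`, `BASELINE` and `dum_ad_yes` are hypothetical, not part of the Excel workbook.

```python
# Advertising column for the first nine markets of the sales example
advertising = ["Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes"]

# Step 1: select the baseline category
BASELINE = "No"

# Step 2: the dummy is 1 for the non-baseline category and 0 for the baseline
dum_ad_yes = [0 if a == BASELINE else 1 for a in advertising]

print(dum_ad_yes)  # [1, 0, 1, 0, 0, 1, 1, 0, 1]
```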
Regression Analysis Using Dummy Variables
Market ID Sales Price Advertising DumAdYes
1 214 2.2 Yes 1
2 163 2.7 No 0
3 201 2.4 Yes 1
4 152 2.9 No 0
5 157 2.8 No 0
6 213 2.2 Yes 1
7 226 2.0 Yes 1
8 187 2.3 No 0
9 219 2.1 Yes 1
: : : : :
Use “Dummy Variable Examples.xlsx – Dummy Variable Example 1”
Regression Analysis Using Dummy Variables
No Ad
Sales = a + b1*Price + ε
Ad
Sales = a + b1*Price + b2 + ε
Influence of Ad on Sales: b2
Sales = a + b1*Price + b2*DumAdYes + ε
Use “Dummy Variable Examples.xlsx – Dummy Variable Example 1”
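[Figure: Sales (vertical axis) versus Price (horizontal axis). Two parallel lines with slope b1: the No Ad line, Sales = a + b1*Price + ε, with intercept a, and the Ad line, Sales = a + b1*Price + b2 + ε, shifted up by b2.]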
SUMMARY OUTPUT
Regression Statistics
Multiple R 1.00
R Square 1.00
Adjusted R Square 1.00
Standard Error 0.49
Observations 15.00
ANOVA
df SS MS F Significance F
Regression 2 9682.672455 4841.336 19844.63 7.62539E-22
Residual 12 2.927544735 0.243962
Total 14 9685.6
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 325.36 1.73 188.50 0.00 321.60 329.12
Price -60.04 0.63 -94.84 0.00 -61.42 -58.66
DumAdYes 20.02 0.37 54.78 0.00 19.22 20.81
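The regression above can be approximately reproduced outside Excel. The sketch below is a dependency-free illustration, not the method used to create the slides: `ols` is a hypothetical helper that solves the normal equations by Gauss-Jordan elimination, and the data are the 15 markets from the sales table. The estimates should land close to the coefficients in the summary output.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

sales = [214, 163, 201, 152, 157, 213, 226, 187, 219, 163, 157, 189, 169, 152, 189]
price = [2.2, 2.7, 2.4, 2.9, 2.8, 2.2, 2.0, 2.3, 2.1, 2.7, 2.8, 2.6, 2.6, 2.9, 2.6]
advertising = ["Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes",
               "No", "No", "Yes", "No", "No", "Yes"]

# Design matrix columns: constant, Price, DumAdYes (baseline = "No")
X = [[1.0, p, 1.0 if ad == "Yes" else 0.0] for p, ad in zip(price, advertising)]

intercept, b_price, b_ad = ols(X, sales)
print(round(intercept, 2), round(b_price, 2), round(b_ad, 2))
```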
Regression Analysis Using Dummy Variables
1. Prediction / Forecasting
e.g. Price = 3, Ad: Expected Sales = 165.26
e.g. Price = 3, No Ad: Expected Sales = 145.24
2. Relationship between variables
Influence of Ad on Sales: 20.02
Sales = 325.36 - 60.04*Price + 20.02*DumAdYes + ε
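The two forecasts can be checked by plugging values into the fitted equation. The function name `expected_sales` below is illustrative only; the coefficients are the rounded values from the summary output.

```python
def expected_sales(price, ad):
    """Fitted equation: Sales = 325.36 - 60.04*Price + 20.02*DumAdYes."""
    dum_ad_yes = 1 if ad else 0
    return 325.36 - 60.04 * price + 20.02 * dum_ad_yes

with_ad = expected_sales(3, ad=True)
without_ad = expected_sales(3, ad=False)

print(round(with_ad, 2))               # 165.26
print(round(without_ad, 2))            # 145.24
print(round(with_ad - without_ad, 2))  # 20.02, the coefficient on DumAdYes
```

At any given price, the gap between the two forecasts equals the dummy coefficient, which is why b2 is read as the influence of advertising on sales.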
Incorporating Discrete Variables with 3+ Categories
Example – Sales Data Continued
Market ID Sales Price Advertising
1 221 2.2 TV
2 171 2.7 No
3 210 2.4 Radio
4 153 2.9 No
5 163 2.8 No
6 224 2.2 TV
7 236 2.0 Radio
8 191 2.3 No
9 233 2.1 TV
10 171 2.7 No
11 163 2.8 No
12 192 2.6 Radio
13 174 2.6 No
14 156 2.9 No
15 201 2.6 TV
Advertising: discrete variable with 3 categories
Regression Analysis Using Dummy Variables
• We can always represent a discrete variable with K
categories using K-1 dummy variables.
• Procedure
1. Select a baseline category
2. Define K-1 Dummy variables for non-baseline
categories
3. Include them in regression analysis
Use “Dummy Variable Examples.xlsx – Dummy Variable Example 2”
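The K-1 rule can be sketched as a small Python helper. The function `make_dummies` and the "Dum" + category naming are assumptions made for illustration; the input is the Advertising column for the first seven markets of the three-category example.

```python
def make_dummies(values, baseline):
    """Return one 0/1 column per non-baseline category (K-1 columns in total)."""
    # dict.fromkeys keeps the categories in order of first appearance
    categories = [c for c in dict.fromkeys(values) if c != baseline]
    return {f"Dum{c}": [1 if v == c else 0 for v in values] for c in categories}

advertising = ["TV", "No", "Radio", "No", "No", "TV", "Radio"]
dummies = make_dummies(advertising, baseline="No")

for name, column in dummies.items():
    print(name, column)
# DumTV [1, 0, 0, 0, 0, 1, 0]
# DumRadio [0, 0, 1, 0, 0, 0, 1]
```

Baseline rows are 0 in every dummy column, so the baseline category needs no dummy of its own.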
Regression Analysis Using Dummy Variables
Advertising DumTV DumRadio
TV 1 0
No 0 0
Radio 0 1
No 0 0
No 0 0
TV 1 0
Radio 0 1
Sales = a + b1*Price
+ b2*DumTV + b3*DumRadio + ε
Use “Dummy Variable Examples.xlsx – Dummy Variable Example 2”
Regression Analysis Using Dummy Variables
No Ad (baseline)
Sales = a + b1*Price + ε
TV Ad
Sales = a + b1*Price + b2 + ε
Radio Ad
Sales = a + b1*Price + b3 + ε
Sales = a + b1*Price + b2*DumTV + b3*DumRadio + ε
Regression Statistics
Multiple R 1.00
R Square 0.99
Adjusted R Square 0.99
Standard Error 2.57
Observations 15.00
ANOVA
df SS MS F Significance F
Regression 3 11490.89388 3830.298 579.5011 2.193E-12
Residual 11 72.7061161 6.6096469
Total 14 11563.6
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 341 9.03 37.75 0.00 320.94 360.68
Price -64 3.31 -19.27 0.00 -71.09 -56.51
DumTV 24 2.14 11.26 0.00 19.38 28.80
DumRadio 21 2.15 9.66 0.00 16.00 25.45
No Ad
Sales = 341 - 64*Price + ε
TV Ad
Sales = 341 - 64*Price + 24 + ε
Radio Ad
Sales = 341 - 64*Price + 21 + ε
Sales = 341 - 64*Price + 24*DumTV + 21*DumRadio + ε
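As a cross-check, the three-category regression can be refitted from the 15 markets in the sales table. This is a dependency-free sketch under the same assumption as any hand-rolled solver: `ols` is a hypothetical helper that solves the normal equations, standing in for Excel's regression tool. The estimates should come out close to the rounded coefficients 341, -64, 24 and 21.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

sales = [221, 171, 210, 153, 163, 224, 236, 191, 233, 171, 163, 192, 174, 156, 201]
price = [2.2, 2.7, 2.4, 2.9, 2.8, 2.2, 2.0, 2.3, 2.1, 2.7, 2.8, 2.6, 2.6, 2.9, 2.6]
advertising = ["TV", "No", "Radio", "No", "No", "TV", "Radio", "No", "TV",
               "No", "No", "Radio", "No", "No", "TV"]

# Design matrix columns: constant, Price, DumTV, DumRadio (baseline = "No")
X = [[1.0, p, float(ad == "TV"), float(ad == "Radio")]
     for p, ad in zip(price, advertising)]

intercept, b_price, b_tv, b_radio = ols(X, sales)
print(round(intercept, 1), round(b_price, 1), round(b_tv, 1), round(b_radio, 1))
```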
Application Exercise #1:
Impact of Competition on Airfare
[Diagram: Airfare is influenced by Competition, Distance, Location Characteristics, and Other factors.]
Impact of Competition on Airfare
Use “Airfare.xlsx”
• Origin: Airport code for the origin
• Destination: Airport code for the destination
• Average Fare: Average non-stop fare for the route
• # of Airlines: Number of Airlines providing direct service
between O & D
• Distance: Distance between O & D
• South_West: Whether Southwest provides a Direct Service
• Holiday_O: Whether Origin is a holiday market
• Holiday_D: Whether Destination is a holiday market
• Traffic_O: Annual airport traffic at Origin → city size
• Traffic_D: Annual airport traffic at Destination → city size
Impact of Competition on Airfare
Q1: Generate “DumHO” using Holiday_O
Q2: Generate “DumHD” using Holiday_D
Q3: Generate “DumSW” using South_West
Q4: Perform a regression analysis
Airfare = a + b1* (# of Airlines) + b2* Distance
+ b3* Traffic_O + b4* Traffic_D
+ b5* DumHO + b6* DumHD
+ b7* DumSW + ε
Impact of Competition on Airfare
Q5: One more airline in the market will
(decrease/increase) average fare by $( ).
Q6: Presence of Southwest in the market will
(decrease/increase) average fare by $( ).
Q7: One mile longer in distance will (decrease/increase)
average fare by $( ).
Q8: If destination is holiday market, average fare will
(decrease/increase) by $( ).
Impact of Competition on Airfare
R Square 0.42
ANOVA P-Val. = 0.00
Coefficients Standard Error t Stat P-value
Intercept -43.05 350.11 -0.12 0.90
# of Airlines -25.88 6.73 -3.85 0.00
Distance 0.22 0.01 17.71 0.00
Traffic_O -8.19 14.03 -0.58 0.56
Traffic_D 25.80 14.34 1.80 0.07
DumHO -34.72 22.42 -1.55 0.12
DumHD -74.85 23.53 -3.18 0.00
DumSW -65.11 17.98 -3.62 0.00
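One way to read the output for Q5 to Q8 is to look at the sign, size and p-value of each coefficient. The sketch below copies the coefficients and p-values from the table above. The 5% significance level and the helper name `describe` are assumptions, not part of the exercise, and each effect is read holding the other variables fixed.

```python
# (coefficient, p-value) pairs copied from the regression output
results = {
    "# of Airlines": (-25.88, 0.00),
    "Distance": (0.22, 0.00),
    "Traffic_O": (-8.19, 0.56),
    "Traffic_D": (25.80, 0.07),
    "DumHO": (-34.72, 0.12),
    "DumHD": (-74.85, 0.00),
    "DumSW": (-65.11, 0.00),
}
ALPHA = 0.05  # assumed significance level

def describe(name):
    """Summarise direction, size and significance of one coefficient."""
    coef, p_value = results[name]
    direction = "decrease" if coef < 0 else "increase"
    verdict = "significant" if p_value < ALPHA else "not significant"
    return f"{name}: {direction} average fare by ${abs(coef):.2f} ({verdict} at {ALPHA})"

# Q5, Q6, Q7, Q8 in order
for name in ["# of Airlines", "DumSW", "Distance", "DumHD"]:
    print(describe(name))
```

For a dummy such as DumSW, the coefficient is the fare difference relative to the baseline (routes without Southwest); for a continuous variable such as Distance, it is the change per one-unit increase.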
Application Exercise #2:
Box Office Revenue Forecast
• Suppose you are helping Warner Bros. in developing
a model for forecasting Box Office revenues
• You are provided the opening week revenues (in
millions of $) for various past movies along with
several independent (explanatory) variables
Movie             Opening_Week_Revenue  # of Theaters  Overall Rating  Genre
The Dark Knight   158.4                 4366           82              Action
Iron Man          98.6                  4105           79              Action
Sex and the City  57                    3285           53              Comedy
Mamma Mia!        27.8                  2976           51              Comedy
21                24.1                  2648           48              Drama
Constantine       29.8                  3006           50              Horror
The Grudge        39.1                  3245           49              Horror
WALL-E            63.1                  3992           93              Kids
Kung Fu Panda     60.2                  4114           73              Kids
Movie Revenue Forecast
[Diagram: Revenue is influenced by # of Theaters, Rating, Genre, and Other Factors.]
Use “Movie.xlsx”
1. Generate dummy variables to represent Genre. Use “Kids”
genre as the baseline category.
• Genre: {Action, Comedy, Drama, Horror, Kids}
• DumA = 1, if Genre = Action
= 0, Otherwise
• DumC = 1, if Genre = ( )
= 0, Otherwise
• DumD = 1, if Genre = ( )
= 0, Otherwise
• DumH = 1, if Genre = ( )
= 0, Otherwise
In Excel, use =IF(E2="Action",1,0)
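The same 0/1 logic as the Excel IF formula can be sketched in Python. The genres are those of the nine movies listed earlier. The mapping of DumC, DumD and DumH to Comedy, Drama and Horror is inferred from the variable names and should be treated as an assumption, as should the names `genres`, `dummy_for` and `dummies`.

```python
# Genre column for the nine sample movies, in table order
genres = ["Action", "Action", "Comedy", "Comedy", "Drama",
          "Horror", "Horror", "Kids", "Kids"]

# One dummy per non-baseline genre; "Kids" is the baseline and gets no dummy.
# Assumed mapping, inferred from the dummy names.
dummy_for = {"DumA": "Action", "DumC": "Comedy", "DumD": "Drama", "DumH": "Horror"}

# Equivalent of =IF(E2="Action",1,0), applied to every row and every dummy
dummies = {name: [1 if g == genre else 0 for g in genres]
           for name, genre in dummy_for.items()}

for name, column in dummies.items():
    print(name, column)
```

The two Kids rows are 0 in all four columns, which is what makes Kids the reference category.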
3. Describe the relationship among the variables.
Revenue = -126.66 + 0.04*#Theater + 0.27*Rating
+ 16.20*DumA + 5.99*DumC
+ 19.56*DumD + 14.43*DumH + ε
Continuous variables:
If we increase XXX by one unit, Revenue ( ).
This is statistically significant/insignificant.
Dummy variables:
Compared to the revenue of baseline category (i.e.
Kids), XXX genre has ( ).
This is statistically significant/insignificant.
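To see how the fitted equation combines continuous and dummy variables, it can be written as a small function. This is a sketch using the rounded coefficients shown above, so the outputs are rough. The function name `predicted_revenue` and the genre lookup table are illustrative assumptions.

```python
# Genre effects relative to the baseline (Kids), from the fitted equation
GENRE_EFFECT = {"Action": 16.20, "Comedy": 5.99, "Drama": 19.56,
                "Horror": 14.43, "Kids": 0.0}

def predicted_revenue(theaters, rating, genre):
    """Revenue = -126.66 + 0.04*#Theater + 0.27*Rating + genre dummy effect."""
    return -126.66 + 0.04 * theaters + 0.27 * rating + GENRE_EFFECT[genre]

# Same theaters and rating, different genre: the gap is the dummy coefficient
action = predicted_revenue(4000, 80, "Action")
kids = predicted_revenue(4000, 80, "Kids")
print(round(action - kids, 2))  # 16.2

# One more theater, everything else fixed: the gap is the Theater coefficient
print(round(predicted_revenue(4001, 80, "Kids") - kids, 2))  # 0.04
```

Whether each of these effects is statistically significant depends on the p-values in the regression output, which are not reproduced here.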

Dummy Variable Regression Analysis
