Final Take Home
Michael Perhats
May 16, 2016
Necessary Code to run before the analysis:
MLBattend <- read.table("http://www.amstat.org/publications/jse/datasets/MLBattend.dat.txt",
quote = """, comment.char = "")
names(MLBattend) <- c("Team", "League", "Division", "Season", "Attendance",
"Scored", "Allowed", "Wins", "Losses", "GamesBack")
library(scales)
## Warning: package 'scales' was built under R version 3.2.5
library(readxl)
## Warning: package 'readxl' was built under R version 3.2.5
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.5
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.5
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
1
library(cowplot)
## Warning: package 'cowplot' was built under R version 3.2.5
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
CENT <- MLBattend %>% filter(`Division`=="CENT")
EAST <- MLBattend %>% filter(`Division`=="EAST")
WEST <- MLBattend %>% filter(`Division`=="WEST")
Season <- MLBattend$Season + 1900 + 100 * (MLBattend$Season == 0)
LDiv <- paste(as.character(MLBattend$League), as.character(MLBattend$Division))
Attendance <- MLBattend$Attendance
GamesBack <- MLBattend$GamesBack
League <- MLBattend$League
Division <- MLBattend$Division
Scored <- MLBattend$Scored
Allowed <- MLBattend$Allowed
SeasonCentral <- CENT$Season + 1900 + 100 * (CENT$Season == 0)
SeasonWest <- WEST$Season + 1900 + 100 * (WEST$Season == 0)
SeasonEast <- EAST$Season + 1900 + 100 * (EAST$Season == 0)
AttendanceCentral <- CENT$Attendance
AttendanceWest <- WEST$Attendance
AttendanceEast <- EAST$Attendance
1) Did attendance increase over this time period? If so, at what rate?
The following chunk of code is a visualization depicting the increase over time of Attendance in the MLB.
The rate of this change is shown via a linear model that prints the slope on the graph:
MLBattend %>% ggplot(mapping=aes(Season, Attendance))+
stat_smooth(color = "black",method="lm")+
xlab("Season")+
ylab("Attendance")+
ggtitle("MLB Attendance Overtime")+
geom_point(aes(color="Red"))+
xlim(69,100)+
annotate("text", x = 76, y = 4e+06, label = paste0("Slope=",lm(Season~Attendance)$coefficient[2]))
## Warning: Removed 30 rows containing non-finite values (stat_smooth).
## Warning: Removed 30 rows containing missing values (geom_point).
2
Slope=6.32471821709783e−06
1e+06
2e+06
3e+06
4e+06
70 80 90 100
Season
Attendance
colour
Red
MLB Attendance Overtime
2) Is there a connection between the number of runs scored and the number of
runs allowed? If so, what?
For this particular question, The Null hypothesis that is made is that there is a slope equal to 0 and that the
values of runs scored and runs allowed are independent of one another.
If we are testing to see if the null hypothesis can be rejected at a 95% confidence level, then we reject the
null hypothesis if we get p < 0.05
When we run the following code, we get a p-value: <2e-16, which is very small and lower than 0.05 which
means that we can reject the Null hypothesis that there is no connection between the two variables and
accept the Alternative Hypothesis that there is a statistically significant relationship here between runs scored
and runs allowed.
The following code displays this p-value:
summary(lm(MLBattend$Scored~MLBattend$Allowed))
##
## Call:
## lm(formula = MLBattend$Scored ~ MLBattend$Allowed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -255.813 -58.995 -3.745 57.938 292.031
##
3
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 302.3735 19.9599 15.15 <2e-16 ***
## MLBattend$Allowed 0.5649 0.0284 19.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86.7 on 836 degrees of freedom
## Multiple R-squared: 0.3213, Adjusted R-squared: 0.3205
## F-statistic: 395.7 on 1 and 836 DF, p-value: < 2.2e-16
The following visualization is a depiction of this relationship with Runs Scored on the x-axis and runs allowed
on the y-axis. Altough we rejected the null hypothesis, we can see, by the annotation on the visualization
that there is an R squared value of only 32 percent:
MLBattend %>% ggplot(mapping=aes(Scored, Allowed))+
geom_smooth(color = "black",method="lm")+xlab("Runs Scored")+
ylab("Runs Allowed")+ggtitle("Relation between Runs Scored and Allowed")+
geom_point(aes(color="Red"))+
annotate("text", x = 485, y = 900, label = summary(lm(Scored~Allowed))$r.squared)+
annotate("text", x = 350, y = 900, label = "R-squared= ")
0.321281179344147R−squared=
300
500
700
900
1100
300 500 700 900
Runs Scored
RunsAllowed
colour
Red
Relation between Runs Scored and Allowed
#Yes there is, the rquared value is in the visual which shows the positive increase
4
3)Is there a connection between attendance and the number games back of first
a team is? If so, what?
The Null Hypothesis for this analysis is that there is no relationship between attendance and the number of
games teams are away from winning their respective division and that these two variables are independent of
each other in all circumstances.
If we are testing to see if the null hypothesis can be rejected at a 95% confidence level, then we reject the
null hypothesis if we get p < 0.05
When we run the following code, we get a p-value: <2e-16, which is very small and lower than 0.05 which
means that we can reject the Null hypothesis that there is no connection between the two variables and accept
the Alternative Hypothesis that there is a statistically significant relationship here between the variables of
Games Back and Attendance. The Lower the GamesBack, the more attendance there is. This makes intuitive
sense when thinking about the problem. If a team is in first place with a GamesBack value of 0, we would
expect attendance to be higher, and vice versa for a GamesBack value of say 50.
The following Chunbk of code shows this Pvalue:
summary(lm(MLBattend$Attendance~MLBattend$GamesBack))
##
## Call:
## lm(formula = MLBattend$Attendance ~ MLBattend$GamesBack)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1310981 -491954 -101932 417801 3300243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2156539 37704 57.20 <2e-16 ***
## MLBattend$GamesBack -26309 2030 -12.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 690200 on 836 degrees of freedom
## Multiple R-squared: 0.1673, Adjusted R-squared: 0.1663
## F-statistic: 168 on 1 and 836 DF, p-value: < 2.2e-16
The following chunk of code depicts this statistical test in a visual manner. It also displays the r-squared value
for these two variables. R-squared is a statistical measure of how close the data are to the fitted regression
line. It is also known as the coefficient of determination. Although we accepted the alternative hypothesis
above and state that there is stastical significance in the relationship in these variables, the r-squared value is
only 17% which means that GamesBack is not an entirely reliable source for determining attendance. This
also makes sense intuitively. Some teams have a larger fan base and/or a stadium that attracts more fans.
MLBattend %>% ggplot(mapping=aes(GamesBack, Attendance))+
geom_smooth(color = "black",method="lm")+xlab("Games Back")+
ylab("Attendance")+ggtitle("Relation between Attendance and Games Back")+
geom_point(aes(color="Red"))+
annotate("text", x = 50, y = 4e+06, label = summary(lm(GamesBack~Attendance))$r.squared)+
annotate("text", x = 40, y = 4e+06, label = "R-squared= ")+
scale_y_continuous(labels = comma)
5
0.167292463179872R−squared=
1,000,000
2,000,000
3,000,000
4,000,000
0 10 20 30 40 50
Games Back
Attendance
colour
Red
Relation between Attendance and Games Back
Yes there is a negative relationship between these variables. R squared is printed on the screen # The
relationship is pretty insignificant
4) Is there a connection between league and division? If so, what?
The Null hypothesis in the following chunk of code is that the two variables are independent and do not have
any statistically significant correlation.
Here, the calculated p-value exceeds 0.05, which can be shown below in the chunk of code, so the observation
is consistent with the null hypothesis, as it falls within the range of what would happen 95% of the time. We
can then reject the alternative hypothesis that we set out to discover, that there is a statistically significant
correlation between League and Division in the MLB. (I use a Chi-squared test because we are comparing
two categorical variables)
tbl <- table(MLBattend$League, MLBattend$Division)
chisq.test(tbl)
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 1.1535, df = 2, p-value = 0.5617
6
5) Is there a connection between attendance and division with league (use LDiv)?
If so, what?
The Null Hypothesis for the question above is that there is no difference between Attendance and the
respective divisions of (NL WEST, NL EAST, NL CENTRAL, AL WEST, AL EAST, AL CENTRAL) and
that the means and medians of attendance for all of these divisions are equivalent to one another. When
Looking at the results:
aov(MLBattend$Attendance~as.factor(LDiv))
## Call:
## aov(formula = MLBattend$Attendance ~ as.factor(LDiv))
##
## Terms:
## as.factor(LDiv) Residuals
## Sum of Squares 1.169481e+13 4.665199e+14
## Deg. of Freedom 5 832
##
## Residual standard error: 748813.1
## Estimated effects may be unbalanced
anova <- aov(MLBattend$Attendance~as.factor(LDiv))
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(LDiv) 5 1.169e+13 2.339e+12 4.171 0.000949 ***
## Residuals 832 4.665e+14 5.607e+11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our p-value is less than 0.05. Hence we can conclude that for our confidence interval, the Alternative
Hypothesis: not all means are equal and that there is a relationship between Attendance and the divisions
and We can Reject the null hypothesis that all 6 Divisions have equal means.
This can be depicted numerically with the following code, displaying the media
NLWEST <- WEST %>% filter(`League`=="NL")
mean(NLWEST$Attendance)
## [1] 1860060
NLEAST <- EAST %>% filter(`League`=="NL")
mean(NLEAST$Attendance)
## [1] 1778405
NLCENT <- CENT %>% filter(`League`=="NL")
mean(NLCENT$Attendance)
## [1] 2084059
7
ALWEST <- WEST %>% filter(`League`=="AL")
mean(ALWEST$Attendance)
## [1] 1593557
ALEAST <- EAST %>% filter(`League`=="AL")
mean(ALEAST$Attendance)
## [1] 1817818
ALCENT <- CENT %>% filter(`League`=="AL")
mean(ALCENT$Attendance)
## [1] 1800183
And Visually displayed with a box and whisker plot, showing the values as categorized by division:
ggplot(MLBattend, aes(x=LDiv, y=Attendance, fill=LDiv))+
geom_boxplot(outlier.colour="red", outlier.shape=16,outlier.size=2, notch=FALSE)+
coord_flip()+
scale_y_continuous(labels = comma)+
scale_fill_brewer(palette="Dark2")+
theme(legend.position="top")+
ggtitle("Attendance by LDiv")
AL CENT
AL EAST
AL WEST
NL CENT
NL EAST
NL WEST
1,000,000 2,000,000 3,000,000 4,000,000
Attendance
LDiv
LDiv
AL CENT
AL EAST
AL WEST
NL CENT
NL EAST
NL WEST
Attendance by LDiv
8
Does not seem like there is a real big difference
Summary statistics given in the graph in order that the graph is
listing the LDiv variables
6) Is there a difference in runs scored between the two leagues? If so, what?
For this analyis, we are going to state the Null Hypothesis: there is no difference between runs scored between
the two leagues. After Running the following code:
AL <- MLBattend %>% filter(`League`=="AL")
NL <- MLBattend %>% filter(`League`=="NL")
ALScored <- AL$Scored
NLScored <- NL$Scored
sum(ALScored)-sum(NLScored)
## [1] 33926
We can reject the Null Hypothesis above and accept the Alternative Hypothesis that there is a difference in
runs between the two leagues. To be exact, this difference is 33926, Favoring the Americal League.
7)Is there a difference in runs allowed between the two leagues? If so, what?
The Null hypothesis for this analysis is that there is no difference within the two leagues in the MLB in
regards to runs allowed.
When Looking at the result from the following code:
AL <- MLBattend %>% filter(`League`=="AL")
NL <- MLBattend %>% filter(`League`=="NL")
ALAllowed <- AL$Allowed
NLAllowed <- NL$Allowed
sum(ALAllowed)-sum(NLAllowed)
## [1] 34137
We can reject the null hypothesis that there is no difference in runs allowed in the two leagues and accept the
alternative hyopthesis that there is a difference between runs allowed between the two leagues Specifically
there is a difference of 34137 runs.
8)Describe how runs allowed and runs scored are distributed.
The following Dashboard displays the distribution of the two variables while also displaying the Center
(Median), Spread (SD), and Skew for both runs scored and runs allowed in the MLB:
9
Allowedhist <- ggplot(data=MLBattend, aes(MLBattend$Allowed))+
geom_histogram(aes(y =..density.., fill=..count..),alpha = .9)+
geom_density(col="black") +labs(title="Runs Allowed Distribution")+
labs(x="Runs Allowed", y="Count")+
scale_fill_gradientn("Count", colours = heat.colors(16, alpha = .8))+
xlim(300,1100)+
annotate("text", x = 420, y = 0.004, color = "RED", label = median(Allowed))+
annotate("text", x = 350, y = 0.004, color = "RED", label = "Median: ")+
annotate("text", x = 480, y = 0.003, color = "BLUE", label = sd(Allowed))+
annotate("text", x = 330, y = 0.003, color = "BLUE", label = "SD: ")+
annotate("text", x = 330, y = 0.002, color = "BLACK", label = "Skew: NA ")
Scoredhist <- ggplot(data=MLBattend, aes(MLBattend$Scored))+
geom_histogram(aes(y =..density.., fill=..count..),alpha = .9)+
geom_density(col="black") +labs(title="Runs Scored Distribution")+
labs(x="Runs Allowed", y="Count")+
scale_fill_gradientn("Count", colours = heat.colors(16, alpha = .8))+
xlim(300,1100)+
annotate("text", x = 420, y = 0.004, color = "RED", label = median(Scored))+
annotate("text", x = 350, y = 0.004, color = "RED", label = "Median: ")+
annotate("text", x = 480, y = 0.003, color = "BLUE", label = sd(Scored))+
annotate("text", x = 330, y = 0.003, color = "BLUE", label = "SD: ")+
annotate("text", x = 330, y = 0.002, color = "BLACK", label = "Skew: NA ")
grid.arrange (Allowedhist, Scoredhist, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_density).
## Warning: Removed 1 rows containing missing values (geom_bar).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing missing values (geom_bar).
10
693Median:
105.524366934333SD:
Skew: NA
0.000
0.001
0.002
0.003
0.004
500 750 1000
Runs Allowed
Count
0
25
50
75
100
Count
Runs Allowed Distribution
691.5Median:
105.173606361848SD:
Skew: NA
0.000
0.001
0.002
0.003
0.004
0.005
500 750 1000
Runs Allowed
Count
0
25
50
75
100
Count
Runs Scored Distribution
#CENTER: MEDIAN. Displayed on Graph
#SPREAD: SD. Displayed on Graph
9) Describe how attendance is distributed
The following Dashboard displays the distribution of Attendance while also displaying the Center (Median),
Spread (SD), and Skew:
ggplot(data=MLBattend, aes(MLBattend$Attendance))+
geom_histogram(aes(y =..density.., fill=..count..),alpha = .9)+
geom_density(col="black") +labs(title="Attendance Distribution")+
labs(x="Attendance", y="Count")+
scale_fill_gradientn("Count", colours = heat.colors(16, alpha = .8))+
scale_x_continuous(labels = comma)+
annotate("text", x = 3500000, y = 4.2e-07, color = "RED", label = median(Attendance))+
annotate("text", x = 3000000, y = 4.2e-07, color = "RED", label = "Median: ")+
annotate("text", x = 3500000, y = 3.7e-07, color = "BLUE", label = sd(Attendance))+
annotate("text", x = 2900000, y = 3.7e-07, color = "BLUE", label = "SD: ")+
annotate("text", x = 3050000, y = 3.2e-07, color = "BLACK", label = "SKew: --> ")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
11
1681896Median:
755872.81336223SD:
SKew: −−>
0e+00
2e−07
4e−07
0 1,000,000 2,000,000 3,000,000 4,000,000
Attendance
Count
0
20
40
60
Count
Attendance Distribution
#CENTER: Median. displayed on graph
#SPREAD: SD, displayed on graph
#SKEW: Displayed on Graph
10) #Is there a difference in the proportion of times that the Cubs and the
White Sox won their respective divisions over this time period?
The Null Hypothesis for this particular Analysis is that there is no difference in the proportion of times that
the cubs and white sox win their respective divisions.
When Running the following code:
CUBSdata <- MLBattend %>% filter(`Team`=="CHIN")
CUBSWins <- CUBSdata %>% filter (`GamesBack`=="0")
SOXdata <- MLBattend %>% filter(`Team`=="CHIA")
SOXWins <- SOXdata %>% filter (`GamesBack`=="0")
nrow(SOXWins)/nrow(CUBSWins)
## [1] 2
nrow(CUBSWins)/nrow(SOXWins)
## [1] 0.5
12
When looking at the results, we discover that we can reject the null hypothesis that there is no difference in
proportion of times that the cubs and sox won their respective divisions and accept the alternative hyopthesis
that there is a difference in this proportion. Specifically, the sox won their division twice as many times as
the cubs and that the cubs won their division half as many times as the sox within this time period.
13

FINAL_TAKE_HOME

  • 1.
    Final Take Home MichaelPerhats May 16, 2016 Necessary Code to run before the analysis: MLBattend <- read.table("http://www.amstat.org/publications/jse/datasets/MLBattend.dat.txt", quote = """, comment.char = "") names(MLBattend) <- c("Team", "League", "Division", "Season", "Attendance", "Scored", "Allowed", "Wins", "Losses", "GamesBack") library(scales) ## Warning: package 'scales' was built under R version 3.2.5 library(readxl) ## Warning: package 'readxl' was built under R version 3.2.5 library(dplyr) ## Warning: package 'dplyr' was built under R version 3.2.5 ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union library(ggplot2) ## Warning: package 'ggplot2' was built under R version 3.2.5 library(gridExtra) ## Warning: package 'gridExtra' was built under R version 3.2.5 ## ## Attaching package: 'gridExtra' ## The following object is masked from 'package:dplyr': ## ## combine 1
  • 2.
    library(cowplot) ## Warning: package'cowplot' was built under R version 3.2.5 ## ## Attaching package: 'cowplot' ## The following object is masked from 'package:ggplot2': ## ## ggsave CENT <- MLBattend %>% filter(`Division`=="CENT") EAST <- MLBattend %>% filter(`Division`=="EAST") WEST <- MLBattend %>% filter(`Division`=="WEST") Season <- MLBattend$Season + 1900 + 100 * (MLBattend$Season == 0) LDiv <- paste(as.character(MLBattend$League), as.character(MLBattend$Division)) Attendance <- MLBattend$Attendance GamesBack <- MLBattend$GamesBack League <- MLBattend$League Division <- MLBattend$Division Scored <- MLBattend$Scored Allowed <- MLBattend$Allowed SeasonCentral <- CENT$Season + 1900 + 100 * (CENT$Season == 0) SeasonWest <- WEST$Season + 1900 + 100 * (WEST$Season == 0) SeasonEast <- EAST$Season + 1900 + 100 * (EAST$Season == 0) AttendanceCentral <- CENT$Attendance AttendanceWest <- WEST$Attendance AttendanceEast <- EAST$Attendance 1) Did attendance increase over this time period? If so, at what rate? The following chunk of code is a visualization depicting the increase over time of Attendance in the MLB. The rate of this change is shown via a linear model that prints the slope on the graph: MLBattend %>% ggplot(mapping=aes(Season, Attendance))+ stat_smooth(color = "black",method="lm")+ xlab("Season")+ ylab("Attendance")+ ggtitle("MLB Attendance Overtime")+ geom_point(aes(color="Red"))+ xlim(69,100)+ annotate("text", x = 76, y = 4e+06, label = paste0("Slope=",lm(Season~Attendance)$coefficient[2])) ## Warning: Removed 30 rows containing non-finite values (stat_smooth). ## Warning: Removed 30 rows containing missing values (geom_point). 2
  • 3.
    Slope=6.32471821709783e−06 1e+06 2e+06 3e+06 4e+06 70 80 90100 Season Attendance colour Red MLB Attendance Overtime 2) Is there a connection between the number of runs scored and the number of runs allowed? If so, what? For this particular question, The Null hypothesis that is made is that there is a slope equal to 0 and that the values of runs scored and runs allowed are independent of one another. If we are testing to see if the null hypothesis can be rejected at a 95% confidence level, then we reject the null hypothesis if we get p < 0.05 When we run the following code, we get a p-value: <2e-16, which is very small and lower than 0.05 which means that we can reject the Null hypothesis that there is no connection between the two variables and accept the Alternative Hypothesis that there is a statistically significant relationship here between runs scored and runs allowed. The following code displays this p-value: summary(lm(MLBattend$Scored~MLBattend$Allowed)) ## ## Call: ## lm(formula = MLBattend$Scored ~ MLBattend$Allowed) ## ## Residuals: ## Min 1Q Median 3Q Max ## -255.813 -58.995 -3.745 57.938 292.031 ## 3
  • 4.
    ## Coefficients: ## EstimateStd. Error t value Pr(>|t|) ## (Intercept) 302.3735 19.9599 15.15 <2e-16 *** ## MLBattend$Allowed 0.5649 0.0284 19.89 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 86.7 on 836 degrees of freedom ## Multiple R-squared: 0.3213, Adjusted R-squared: 0.3205 ## F-statistic: 395.7 on 1 and 836 DF, p-value: < 2.2e-16 The following visualization is a depiction of this relationship with Runs Scored on the x-axis and runs allowed on the y-axis. Altough we rejected the null hypothesis, we can see, by the annotation on the visualization that there is an R squared value of only 32 percent: MLBattend %>% ggplot(mapping=aes(Scored, Allowed))+ geom_smooth(color = "black",method="lm")+xlab("Runs Scored")+ ylab("Runs Allowed")+ggtitle("Relation between Runs Scored and Allowed")+ geom_point(aes(color="Red"))+ annotate("text", x = 485, y = 900, label = summary(lm(Scored~Allowed))$r.squared)+ annotate("text", x = 350, y = 900, label = "R-squared= ") 0.321281179344147R−squared= 300 500 700 900 1100 300 500 700 900 Runs Scored RunsAllowed colour Red Relation between Runs Scored and Allowed #Yes there is, the rquared value is in the visual which shows the positive increase 4
  • 5.
    3)Is there aconnection between attendance and the number games back of first a team is? If so, what? The Null Hypothesis for this analysis is that there is no relationship between attendance and the number of games teams are away from winning their respective division and that these two variables are independent of each other in all circumstances. If we are testing to see if the null hypothesis can be rejected at a 95% confidence level, then we reject the null hypothesis if we get p < 0.05 When we run the following code, we get a p-value: <2e-16, which is very small and lower than 0.05 which means that we can reject the Null hypothesis that there is no connection between the two variables and accept the Alternative Hypothesis that there is a statistically significant relationship here between the variables of Games Back and Attendance. The Lower the GamesBack, the more attendance there is. This makes intuitive sense when thinking about the problem. If a team is in first place with a GamesBack value of 0, we would expect attendance to be higher, and vice versa for a GamesBack value of say 50. The following Chunbk of code shows this Pvalue: summary(lm(MLBattend$Attendance~MLBattend$GamesBack)) ## ## Call: ## lm(formula = MLBattend$Attendance ~ MLBattend$GamesBack) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1310981 -491954 -101932 417801 3300243 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2156539 37704 57.20 <2e-16 *** ## MLBattend$GamesBack -26309 2030 -12.96 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 690200 on 836 degrees of freedom ## Multiple R-squared: 0.1673, Adjusted R-squared: 0.1663 ## F-statistic: 168 on 1 and 836 DF, p-value: < 2.2e-16 The following chunk of code depicts this statistical test in a visual manner. It also displays the r-squared value for these two variables. R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination. Although we accepted the alternative hypothesis above and state that there is stastical significance in the relationship in these variables, the r-squared value is only 17% which means that GamesBack is not an entirely reliable source for determining attendance. This also makes sense intuitively. Some teams have a larger fan base and/or a stadium that attracts more fans. MLBattend %>% ggplot(mapping=aes(GamesBack, Attendance))+ geom_smooth(color = "black",method="lm")+xlab("Games Back")+ ylab("Attendance")+ggtitle("Relation between Attendance and Games Back")+ geom_point(aes(color="Red"))+ annotate("text", x = 50, y = 4e+06, label = summary(lm(GamesBack~Attendance))$r.squared)+ annotate("text", x = 40, y = 4e+06, label = "R-squared= ")+ scale_y_continuous(labels = comma) 5
  • 6.
    0.167292463179872R−squared= 1,000,000 2,000,000 3,000,000 4,000,000 0 10 2030 40 50 Games Back Attendance colour Red Relation between Attendance and Games Back Yes there is a negative relationship between these variables. R squared is printed on the screen # The relationship is pretty insignificant 4) Is there a connection between league and division? If so, what? The Null hypothesis in the following chunk of code is that the two variables are independent and do not have any statistically significant correlation. Here, the calculated p-value exceeds 0.05, which can be shown below in the chunk of code, so the observation is consistent with the null hypothesis, as it falls within the range of what would happen 95% of the time. We can then reject the alternative hypothesis that we set out to discover, that there is a statistically significant correlation between League and Division in the MLB. (I use a Chi-squared test because we are comparing two categorical variables) tbl <- table(MLBattend$League, MLBattend$Division) chisq.test(tbl) ## ## Pearson's Chi-squared test ## ## data: tbl ## X-squared = 1.1535, df = 2, p-value = 0.5617 6
  • 7.
    5) Is therea connection between attendance and division with league (use LDiv)? If so, what? The Null Hypothesis for the question above is that there is no difference between Attendance and the respective divisions of (NL WEST, NL EAST, NL CENTRAL, AL WEST, AL EAST, AL CENTRAL) and that the means and medians of attendance for all of these divisions are equivalent to one another. When Looking at the results: aov(MLBattend$Attendance~as.factor(LDiv)) ## Call: ## aov(formula = MLBattend$Attendance ~ as.factor(LDiv)) ## ## Terms: ## as.factor(LDiv) Residuals ## Sum of Squares 1.169481e+13 4.665199e+14 ## Deg. of Freedom 5 832 ## ## Residual standard error: 748813.1 ## Estimated effects may be unbalanced anova <- aov(MLBattend$Attendance~as.factor(LDiv)) summary(anova) ## Df Sum Sq Mean Sq F value Pr(>F) ## as.factor(LDiv) 5 1.169e+13 2.339e+12 4.171 0.000949 *** ## Residuals 832 4.665e+14 5.607e+11 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Our p-value is less than 0.05. Hence we can conclude that for our confidence interval, the Alternative Hypothesis: not all means are equal and that there is a relationship between Attendance and the divisions and We can Reject the null hypothesis that all 6 Divisions have equal means. This can be depicted numerically with the following code, displaying the media NLWEST <- WEST %>% filter(`League`=="NL") mean(NLWEST$Attendance) ## [1] 1860060 NLEAST <- EAST %>% filter(`League`=="NL") mean(NLEAST$Attendance) ## [1] 1778405 NLCENT <- CENT %>% filter(`League`=="NL") mean(NLCENT$Attendance) ## [1] 2084059 7
  • 8.
    ALWEST <- WEST%>% filter(`League`=="AL") mean(ALWEST$Attendance) ## [1] 1593557 ALEAST <- EAST %>% filter(`League`=="AL") mean(ALEAST$Attendance) ## [1] 1817818 ALCENT <- CENT %>% filter(`League`=="AL") mean(ALCENT$Attendance) ## [1] 1800183 And Visually displayed with a box and whisker plot, showing the values as categorized by division: ggplot(MLBattend, aes(x=LDiv, y=Attendance, fill=LDiv))+ geom_boxplot(outlier.colour="red", outlier.shape=16,outlier.size=2, notch=FALSE)+ coord_flip()+ scale_y_continuous(labels = comma)+ scale_fill_brewer(palette="Dark2")+ theme(legend.position="top")+ ggtitle("Attendance by LDiv") AL CENT AL EAST AL WEST NL CENT NL EAST NL WEST 1,000,000 2,000,000 3,000,000 4,000,000 Attendance LDiv LDiv AL CENT AL EAST AL WEST NL CENT NL EAST NL WEST Attendance by LDiv 8
  • 9.
    Does not seemlike there is a real big difference Summary statistics given in the graph in order that the graph is listing the LDiv variables 6) Is there a difference in runs scored between the two leagues? If so, what? For this analyis, we are going to state the Null Hypothesis: there is no difference between runs scored between the two leagues. After Running the following code: AL <- MLBattend %>% filter(`League`=="AL") NL <- MLBattend %>% filter(`League`=="NL") ALScored <- AL$Scored NLScored <- NL$Scored sum(ALScored)-sum(NLScored) ## [1] 33926 We can reject the Null Hypothesis above and accept the Alternative Hypothesis that there is a difference in runs between the two leagues. To be exact, this difference is 33926, Favoring the Americal League. 7)Is there a difference in runs allowed between the two leagues? If so, what? The Null hypothesis for this analysis is that there is no difference within the two leagues in the MLB in regards to runs allowed. When Looking at the result from the following code: AL <- MLBattend %>% filter(`League`=="AL") NL <- MLBattend %>% filter(`League`=="NL") ALAllowed <- AL$Allowed NLAllowed <- NL$Allowed sum(ALAllowed)-sum(NLAllowed) ## [1] 34137 We can reject the null hypothesis that there is no difference in runs allowed in the two leagues and accept the alternative hyopthesis that there is a difference between runs allowed between the two leagues Specifically there is a difference of 34137 runs. 8)Describe how runs allowed and runs scored are distributed. The following Dashboard displays the distribution of the two variables while also displaying the Center (Median), Spread (SD), and Skew for both runs scored and runs allowed in the MLB: 9
  • 10.
    Allowedhist <- ggplot(data=MLBattend,aes(MLBattend$Allowed))+ geom_histogram(aes(y =..density.., fill=..count..),alpha = .9)+ geom_density(col="black") +labs(title="Runs Allowed Distribution")+ labs(x="Runs Allowed", y="Count")+ scale_fill_gradientn("Count", colours = heat.colors(16, alpha = .8))+ xlim(300,1100)+ annotate("text", x = 420, y = 0.004, color = "RED", label = median(Allowed))+ annotate("text", x = 350, y = 0.004, color = "RED", label = "Median: ")+ annotate("text", x = 480, y = 0.003, color = "BLUE", label = sd(Allowed))+ annotate("text", x = 330, y = 0.003, color = "BLUE", label = "SD: ")+ annotate("text", x = 330, y = 0.002, color = "BLACK", label = "Skew: NA ") Scoredhist <- ggplot(data=MLBattend, aes(MLBattend$Scored))+ geom_histogram(aes(y =..density.., fill=..count..),alpha = .9)+ geom_density(col="black") +labs(title="Runs Scored Distribution")+ labs(x="Runs Allowed", y="Count")+ scale_fill_gradientn("Count", colours = heat.colors(16, alpha = .8))+ xlim(300,1100)+ annotate("text", x = 420, y = 0.004, color = "RED", label = median(Scored))+ annotate("text", x = 350, y = 0.004, color = "RED", label = "Median: ")+ annotate("text", x = 480, y = 0.003, color = "BLUE", label = sd(Scored))+ annotate("text", x = 330, y = 0.003, color = "BLUE", label = "SD: ")+ annotate("text", x = 330, y = 0.002, color = "BLACK", label = "Skew: NA ") grid.arrange (Allowedhist, Scoredhist, ncol=1) ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## Warning: Removed 1 rows containing non-finite values (stat_bin). ## Warning: Removed 1 rows containing non-finite values (stat_density). ## Warning: Removed 1 rows containing missing values (geom_bar). ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## Warning: Removed 1 rows containing missing values (geom_bar). 10
  • 11.
    693Median: 105.524366934333SD: Skew: NA 0.000 0.001 0.002 0.003 0.004 500 7501000 Runs Allowed Count 0 25 50 75 100 Count Runs Allowed Distribution 691.5Median: 105.173606361848SD: Skew: NA 0.000 0.001 0.002 0.003 0.004 0.005 500 750 1000 Runs Allowed Count 0 25 50 75 100 Count Runs Scored Distribution #CENTER: MEDIAN. Displayed on Graph #SPREAD: SD. Displayed on Graph 9) Describe how attendance is distributed The following Dashboard displays the distribution of Attendance while also displaying the Center (Median), Spread (SD), and Skew: ggplot(data=MLBattend, aes(MLBattend$Attendance))+ geom_histogram(aes(y =..density.., fill=..count..),alpha = .9)+ geom_density(col="black") +labs(title="Attendance Distribution")+ labs(x="Attendance", y="Count")+ scale_fill_gradientn("Count", colours = heat.colors(16, alpha = .8))+ scale_x_continuous(labels = comma)+ annotate("text", x = 3500000, y = 4.2e-07, color = "RED", label = median(Attendance))+ annotate("text", x = 3000000, y = 4.2e-07, color = "RED", label = "Median: ")+ annotate("text", x = 3500000, y = 3.7e-07, color = "BLUE", label = sd(Attendance))+ annotate("text", x = 2900000, y = 3.7e-07, color = "BLUE", label = "SD: ")+ annotate("text", x = 3050000, y = 3.2e-07, color = "BLACK", label = "SKew: --> ") ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. 11
  • 12.
    1681896Median: 755872.81336223SD: SKew: −−> 0e+00 2e−07 4e−07 0 1,000,0002,000,000 3,000,000 4,000,000 Attendance Count 0 20 40 60 Count Attendance Distribution #CENTER: Median. displayed on graph #SPREAD: SD, displayed on graph #SKEW: Displayed on Graph 10) #Is there a difference in the proportion of times that the Cubs and the White Sox won their respective divisions over this time period? The Null Hypothesis for this particular Analysis is that there is no difference in the proportion of times that the cubs and white sox win their respective divisions. When Running the following code: CUBSdata <- MLBattend %>% filter(`Team`=="CHIN") CUBSWins <- CUBSdata %>% filter (`GamesBack`=="0") SOXdata <- MLBattend %>% filter(`Team`=="CHIA") SOXWins <- SOXdata %>% filter (`GamesBack`=="0") nrow(SOXWins)/nrow(CUBSWins) ## [1] 2 nrow(CUBSWins)/nrow(SOXWins) ## [1] 0.5 12
  • 13.
    When looking atthe results, we discover that we can reject the null hypothesis that there is no difference in proportion of times that the cubs and sox won their respective divisions and accept the alternative hyopthesis that there is a difference in this proportion. Specifically, the sox won their division twice as many times as the cubs and that the cubs won their division half as many times as the sox within this time period. 13