Chapter 2
REDUCING MULTI-COLLINEARITY, OVERFITTING AND LINEARIZING with R
2.1 Cal Housing Data
In the previous Decision Tree report I was able to predict house values with a good SSE, but in my view those predictions were not reliable: the results showed overfitting, the variables carried too much noise and variance, and most of the variables were correlated with each other, which leads to serious multicollinearity.
So I will apply some unsupervised statistical methods to the data in order to refine it.
What I tried to do in this project:
• First I manipulated the data and checked for skewness.
• Then I performed a Box-Cox transformation, in order to make the variables more linear.
• Then I performed ordinary least squares regression on the transformed data, and used ridge estimates to check for multicollinearity.
• Then, using principal component analysis, I performed regression with the optimal number of components to reduce the residual error.
• Then, to further improve the model, I applied the partial least squares technique to further reduce correlation and co-dependence, so the model is less prone to OVERFITTING.
Chapter 3
Research
3.1 Scatter-plot matrix to see Linearity among Variables
Figure 3.1: Appendix Reference-1
Matrix variable sequence: longitude, latitude, housing median age, total rooms, total bedrooms, population, households, median income, median house value, ocean proximity.
From the scatter-plot matrix we can see that many independent variables are linearly related and correlated with each other, like total rooms and total bedrooms.
For house values we see linearity only with median income; with the rest of the variables there is a lot of variance that needs to be fixed.
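As a quick check of that claim, the strongest pairwise relationship can be quantified directly. A minimal sketch, assuming the data frame loaded in the appendix and the standard column names total_rooms and total_bedrooms:

library(lattice)

# scatter-plot matrix of all variables (mirrors the appendix call)
splom(data)

# quantify the total rooms / total bedrooms relationship seen in the matrix
cor(data$total_rooms, data$total_bedrooms, use = "complete.obs")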
3.1.1 Splitting Data into Training Sets and Applying a BOX COX Transformation to
Introduce Linearity among Variables
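For reference, the Box-Cox family (Figure 1.1) transforms a strictly positive variable y as

y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log y, & \lambda = 0
\end{cases}

where caret's preProcess(method = "BoxCox") estimates a separate \lambda for each eligible variable by maximum likelihood.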
Figure 3.2: Before BOX COX
Figure 3.3: After BOX COX
Figure 3.4: Comparison before and after BOXCOX of Population Variable
Figure 3.5: Before BOX COX
Figure 3.6: After BOX COX
Figure 3.7: Comparison before and after BOXCOX of median house value Variable
Chapter 4
Research
4.1 Training on Transformed Data and Performing Linear Regression
Figure 4.1: Linear Regression on Power Transformed Trained Data
Compared to the linear regression performed in the previous report, after transforming the data to make it more linear the results have improved, with R squared increasing from 0.24 to 0.63.
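A minimal sketch of the fit, using the transformed training set built in the appendix:

# OLS on the Box-Cox-transformed training data
model <- lm(median_house_value ~ ., data = dataTrainXtrans)
summary(model)$r.squared   # about 0.63 here, vs. 0.24 before transformation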
Figure 4.2: Linear Regression Performed on the data in Decision tree report
4.2 Prediction and Cross Validation over Transformed Data
Figure 4.3: Prediction on Training Data Results
Figure 4.4: Cross Validation Results
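A minimal sketch of the cross-validation setup, mirroring the appendix (note that trainControl takes the fold count through its number argument):

library(caret)

set.seed(100)
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
predictors <- subset(dataTrainXtrans, select = -median_house_value)
modcv <- train(x = predictors, y = dataTrainXtrans$median_house_value,
               method = "lm", trControl = ctrl)
modcv$results   # cross-validated RMSE and R squared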
Chapter 5
Research
5.1 Multicollinearity and VIF
Further, I will try to improve the model by checking for multicollinearity and, if
it is found, performing PCA and PLS to remove the collinearity.
5.2 Variance Inflation Factor on Transformed Data
Figure 5.1: VIF results
The VIF levels show collinearity among the variables.
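A minimal sketch of the check, mirroring the appendix; the usual rule of thumb is that VIF values above 5-10 signal problematic collinearity:

library(car)

# variance inflation factor for each predictor in the OLS fit
vif(model)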
5.2.1 Performing PCA to Reduce Dimensions and Variables
Figure 5.2: PCA COMPONENTS IMPORTANCE
Running the principal component analysis shows that by the 5th component the model already accounts for 97% (0.96947) of the variance we were expecting, so I can run the revised model with only five components and still get significant results for our multiple linear regression model. The option scale. = T (Appendix) normalizes the data, which is important where some features are on very different scales.
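A minimal sketch of the decomposition, mirroring the appendix:

# PCA on the standardized predictors; scale. = TRUE puts features on a common scale
x <- subset(data, select = -median_house_value)
comp <- prcomp(na.omit(x), scale. = TRUE)
summary(comp)   # cumulative proportion of variance reaches ~0.97 by PC5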
Figure 5.3: Real Data VS Predictions After PCA
The model has improved significantly, but it is still overestimating the cheaper
houses. Let's try partial least squares, which is an extension of the PCA concept and
can be better at selecting weights; it may remove this shortcoming of the PCA model
(overestimating the cheaper houses).
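For the predictions above, a minimal sketch of the principal component regression step, mirroring the appendix:

library(pls)

# principal component regression with cross-validation
pcrFit <- pcr(median_house_value ~ ., data = dataTrainXtrans, validation = "CV")

# predict with the first five components and compare to the observed values
xTest <- subset(datatestXtrans, select = -median_house_value)
pcrPred <- predict(pcrFit, xTest, ncomp = 5)
plot(dataTestX$median_house_value, pcrPred,
     xlab = "Observed", ylab = "Predicted (PCR, 5 components)")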
Chapter 6
Research
6.1 Partial Least Squares
According to the partial least squares cross-validation, the optimal number of components for the predictions is 4.
Figure 6.1: PLS COMPONENTS
6.2 Results of PLS
The results have improved over the PCA predictions, with reduced RMSE and
increased R squared. But there are still some problems and shortcomings of UNSUPERVISED
MODELS, which I will discuss in the Conclusions.
Figure 6.2: PLS Results - RMSE IS REDUCED SIGNIFICANTLY
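A minimal sketch of the PLS fit and component selection, mirroring the appendix:

library(pls)

# PLS regression with cross-validation (kernel algorithm by default)
plsFit <- plsr(median_house_value ~ ., data = dataTrainXtrans, validation = "CV")

# pick the number of components at the minimum of the CV RMSEP curve (4 here)
validationplot(plsFit, val.type = "RMSEP")

# final predictions on the test set with 4 components
pls.pred <- predict(plsFit, datatestXtrans, ncomp = 4)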
Figure 6.3: Real data VS PLS predictions
Chapter 7
Conclusions
Though I was able to improve the model significantly and remove much of the variance and
multicollinearity from the data, which led to improved results, still:
• The model could not correctly evaluate expensive areas
• The model could not correctly evaluate very cheap areas
In my view, the causes of these problems in the model could be:
• There is a data-entry mistake in the data: studying it, I found the value 500001 for
median house value randomly assigned to multiple areas whose other variables resemble
both cheap and expensive areas (see the sketch after this list).
• I may need to add other variables that come not from supervised or unsupervised
learning but from empirical study. For example, areas that are on "ISLAND" may need
a bias term, because only the exclusively rich or celebrities live on that island, and
that is why house values there are so expensive.
• Or maybe some variable like CRIME RATE is missing, and I need to add a bias or
research it to explain why house prices differ between areas with similar variable
values.
• Maybe I can try TensorFlow deep learning to predict the areas with different house
prices but similar other variables, which could be a better option than K NEAREST
NEIGHBOURS.
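A quick check of the suspected data-entry cap, sketched against the raw data frame from the appendix:

# how many districts carry the suspicious capped value?
sum(data$median_house_value == 500001, na.rm = TRUE)

# refitting on the uncapped rows would show whether the errors
# at the price extremes are driven by this artifact
uncapped <- data[data$median_house_value < 500001, ]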
Appendix 1
library(caret)
library(AppliedPredictiveModeling)
library(pls)
library(e1071)
library(lattice)
library(MASS)
library(lars)
library(elasticnet)
library(car)
library(glmnet)
library(plyr)
### read data
data <- read.csv("cal-housing (1).csv")
head(data)
### recode ocean_proximity as numeric codes
data$ocean_proximity <- revalue(data$ocean_proximity, c("NEAR BAY" = "1", "INLAND" = "2", "ISLAND" = "3", "NEAR OCEAN" = "4", "<1H OCEAN" = "5"))
data$ocean_proximity <- as.numeric(data$ocean_proximity)
### drop leftover index columns
data <- data[, !names(data) %in% c('X', 'X.1')]
### do a scatterplot matrix to see linearity
splom(data)
nrow(data)
head(data)
### create training and test data sets
### use 75 percent of rows for training and the rest for testing
bound <- floor(0.75 * nrow(data))
data.train <- data[1:bound, ]
data.test <- data[(bound + 1):nrow(data), ]
nrow(data.test)
nrow(data.train)
dataTrainX <- data.train
dataTestX <- data.test
### apply the Box-Cox transformation
boxcox <- preProcess(dataTrainX, method = "BoxCox")
dataTrainXtrans <- predict(boxcox, dataTrainX)
head(dataTrainXtrans)
hist(dataTrainXtrans$population)
hist(dataTrainX$population)
datatestXtrans <- predict(boxcox, dataTestX)
head(datatestXtrans)
hist(datatestXtrans$median_house_value)
hist(dataTestX$median_house_value)
### create training data from the transformed set
trainingData <- dataTrainXtrans
head(trainingData)
### fit the model - OLS
model <- lm(median_house_value ~ ., data = trainingData)
summary(model)
par(mfrow = c(2, 2))
### predict values
pred <- predict(model, datatestXtrans)
### create obs, pred data frame
df <- data.frame(obs = datatestXtrans$median_house_value, pred = pred)
df
defaultSummary(df)
### cross-validation (trainControl takes the fold count via number)
ctrl <- trainControl(method = "cv", number = 10)
set.seed(100)
tmp <- subset(dataTrainXtrans, select = -median_house_value)
head(tmp)
modcv <- train(x = tmp, y = dataTrainXtrans$median_house_value, method = "lm", trControl = ctrl)
### check for multicollinearity
vif(model)
### VIF levels show collinearity in the dataset
### pca analysis
pca <- data
### standardize independent variables
x <- subset(pca, select = -median_house_value)
head(x)
x <- scale(x)
### center the dependent variable
y <- pca$median_house_value
y <- scale(y, scale = F)
### do pca on independent variables
comp <- prcomp(na.omit(x))
comp
plot(comp)
biplot(comp)
summary(comp)
# 5 principal components explain 97% of the total variance
pcrFit <- pcr(median_house_value ~ ., data = trainingData, validation = "CV")
summary(pcrFit)
### choose five components for prediction
xpcr <- subset(datatestXtrans, select = -median_house_value)
pcrpred <- predict(pcrFit, xpcr, ncomp = 5)
pcrdf1 <- data.frame(obs = dataTestX$median_house_value, Predictions = pcrpred)
pcrdf1
### PLS regression is a better variation of PCR: it accounts for the variation in the response when selecting weights
### use the pls package, plsr function
### the default algorithm is the Dayal and MacGregor kernel algorithm
plsFit <- plsr(median_house_value ~ ., data = trainingData, validation = "CV")
### predict the first 100 median_house_value values using 1 and 2 components
pls.pred <- predict(plsFit, datatestXtrans[1:100, ], ncomp = 1:2)
summary(plsFit)
validationplot(plsFit, val.type = "RMSEP")
pls.RMSEP <- RMSEP(plsFit, estimate = "CV")
plot(pls.RMSEP, main = "RMSEP PLS", xlab = "Components")
min <- which.min(pls.RMSEP$val)
points(min, min(pls.RMSEP$val), pch = 1, col = "red")
plot(plsFit, ncomp = 4, asp = 1)
### use 4 components
pls.pred2 <- predict(plsFit, datatestXtrans, ncomp = 4)
pls.eval <- data.frame(obs = dataTestX$median_house_value, pred = pls.pred2[, 1, 1])
defaultSummary(pls.eval)
List of Figures
1.1 BOX COX FORMULA
1.2 PCA ALGORITHM
1.3 PLS ON VARIABLES VS SIMPLE LINEAR REGRESSION
3.1 Appendix Reference-1
3.2 Before BOX COX
3.3 After BOX COX
3.4 Comparison before and after BOXCOX of Population Variable
3.5 Before BOX COX
3.6 After BOX COX
3.7 Comparison before and after BOXCOX of median house value Variable
4.1 Linear Regression on Power Transformed Trained Data
4.2 Linear Regression Performed on the data in Decision tree report
4.3 Prediction on Training Data Results
4.4 Cross Validation Results
5.1 VIF results
5.2 PCA COMPONENTS IMPORTANCE
5.3 Real Data VS Predictions After PCA
6.1 PLS COMPONENTS
6.2 PLS Results - RMSE IS REDUCED SIGNIFICANTLY
6.3 Real data VS PLS predictions
