Chapter 2
REDUCING MULTI-COLLINEARITY, OVERFITTING AND LINEARIZING with R
2.1 Cal Housing Data
In the previous Decision Tree report I was able to predict house values with a good SSE, but in my view those predictions were not reliable: the results showed overfitting, the variables carried too much noise and variance, and most of the variables were correlated with each other, which leads to serious multicollinearity.
So I will apply some unsupervised statistical methods to the data in order to refine it.
What I tried to do in this project:
• First I manipulated the data and checked for skewness.
• Then I performed a Box-Cox transformation, in order to make the variables more linear.
• Then I performed ordinary least squares regression on the transformed data, and used ridge estimates to check for multicollinearity.
• Then, using principal component analysis, I performed regression with the optimal number of components to reduce the residual error.
• Then, to further improve the model, I applied the partial least squares technique to further reduce correlation and co-dependence, so the model is less prone to OVERFITTING.
Chapter 3
Research
3.1 Scatter-plot matrix to see Linearity among Variables
Figure 3.1: Appendix Reference-1
Matrix variable sequence: longitude, latitude, housing median age, total rooms, total bedrooms, population, households, median income, median house value, ocean proximity.
From the scatter-plot matrix we can see that many independent variables are linearly related and correlated with each other, like total rooms and total bedrooms.
For house values we see linearity only with median income; with the rest of the variables there is a lot of variance that needs to be fixed.
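As a quick check of that claim, the strongest pairwise relationship can be quantified directly. A minimal sketch, assuming the data frame loaded in the appendix and the standard column names total_rooms and total_bedrooms:

library(lattice)

# scatter-plot matrix of all variables (mirrors the appendix call)
splom(data)

# quantify the total rooms / total bedrooms relationship seen in the matrix
cor(data$total_rooms, data$total_bedrooms, use = "complete.obs")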
3.1.1 Splitting Data into Training Sets and Applying a BOX COX Transformation to
Introduce Linearity among Variables
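For reference, the Box-Cox family (Figure 1.1) transforms a strictly positive variable y as

y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log y, & \lambda = 0
\end{cases}

where caret's preProcess(method = "BoxCox") estimates a separate \lambda for each eligible variable by maximum likelihood.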
Figure 3.2: Before BOX COX
Figure 3.3: After BOX COX
Figure 3.4: Comparison before and after BOXCOX of Population Variable
Figure 3.5: Before BOX COX
Figure 3.6: After BOX COX
Figure 3.7: Comparison before and after BOXCOX of median house value Variable
Chapter 4
Research
4.1 Training on Transformed Data and Performing Linear Regression
Figure 4.1: Linear Regression on Power Transformed Trained Data
Compared to the linear regression performed in the previous report, after transforming the data to make it more linear the results have improved, with R squared increasing from 0.24 to 0.63.
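A minimal sketch of the fit, using the transformed training set built in the appendix:

# OLS on the Box-Cox-transformed training data
model <- lm(median_house_value ~ ., data = dataTrainXtrans)
summary(model)$r.squared   # about 0.63 here, vs. 0.24 before transformation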
Figure 4.2: Linear Regression Performed on the data in Decision tree report
4.2 Prediction and Cross Validation over Transformed Data
Figure 4.3: Prediction on Training Data Results
Figure 4.4: Cross Validation Results
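A minimal sketch of the cross-validation setup, mirroring the appendix (note that trainControl takes the fold count through its number argument):

library(caret)

set.seed(100)
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
predictors <- subset(dataTrainXtrans, select = -median_house_value)
modcv <- train(x = predictors, y = dataTrainXtrans$median_house_value,
               method = "lm", trControl = ctrl)
modcv$results   # cross-validated RMSE and R squared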
Chapter 5
Research
5.1 Multicollinearity and VIF
Further, I will try to improve the model by checking for multicollinearity and, if
it is found, performing PCA and PLS to remove the collinearity.
5.2 Variance Inflation Factor on Transformed Data
Figure 5.1: VIF results
The VIF levels show collinearity among the variables.
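A minimal sketch of the check, mirroring the appendix; the usual rule of thumb is that VIF values above 5-10 signal problematic collinearity:

library(car)

# variance inflation factor for each predictor in the OLS fit
vif(model)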
5.2.1 Performing PCA to Reduce Dimensions and Variables
Figure 5.2: PCA COMPONENTS IMPORTANCE
Running the principal component analysis shows that by the 5th component the model already accounts for 97% (0.96947) of the variance we were expecting, so I can run the revised model with only five components and still get significant results for our multiple linear regression model. The option scale. = T (Appendix) normalizes the data, which is important where some features are on very different scales.
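A minimal sketch of the decomposition, mirroring the appendix:

# PCA on the standardized predictors; scale. = TRUE puts features on a common scale
x <- subset(data, select = -median_house_value)
comp <- prcomp(na.omit(x), scale. = TRUE)
summary(comp)   # cumulative proportion of variance reaches ~0.97 by PC5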
Figure 5.3: Real Data VS Predictions After PCA
The model has improved significantly, but it is still overestimating the cheaper
houses. Let's try partial least squares, which is an extension of the PCA concept and
can be better at selecting weights; it may remove this shortcoming of the PCA model
(overestimating the cheaper houses).
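For the predictions above, a minimal sketch of the principal component regression step, mirroring the appendix:

library(pls)

# principal component regression with cross-validation
pcrFit <- pcr(median_house_value ~ ., data = dataTrainXtrans, validation = "CV")

# predict with the first five components and compare to the observed values
xTest <- subset(datatestXtrans, select = -median_house_value)
pcrPred <- predict(pcrFit, xTest, ncomp = 5)
plot(dataTestX$median_house_value, pcrPred,
     xlab = "Observed", ylab = "Predicted (PCR, 5 components)")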
Chapter 6
Research
6.1 Partial Least Squares
According to the partial least squares cross-validation, the optimal number of components for the predictions is 4.
Figure 6.1: PLS COMPONENTS
6.2 Results of PLS
The results have improved over the PCA predictions, with reduced RMSE and
increased R squared. But there are still some problems and shortcomings of UNSUPERVISED
MODELS, which I will discuss in the Conclusions.
Figure 6.2: PLS Results - RMSE IS REDUCED SIGNIFICANTLY
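A minimal sketch of the PLS fit and component selection, mirroring the appendix:

library(pls)

# PLS regression with cross-validation (kernel algorithm by default)
plsFit <- plsr(median_house_value ~ ., data = dataTrainXtrans, validation = "CV")

# pick the number of components at the minimum of the CV RMSEP curve (4 here)
validationplot(plsFit, val.type = "RMSEP")

# final predictions on the test set with 4 components
pls.pred <- predict(plsFit, datatestXtrans, ncomp = 4)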
Figure 6.3: Real data VS PLS predictions
Chapter 7
Conclusions
Though I was able to improve the model significantly and remove much of the variance and
multicollinearity from the data, which led to improved results, still:
• The model could not correctly evaluate expensive areas
• The model could not correctly evaluate very cheap areas
In my view, the causes of these problems in the model could be:
• There is a data-entry mistake in the data: studying it, I found the value 500001 for
median house value randomly assigned to multiple areas whose other variables resemble
both cheap and expensive areas (see the sketch after this list).
• I may need to add other variables that come not from supervised or unsupervised
learning but from empirical study. For example, areas that are on "ISLAND" may need
a bias term, because only the exclusively rich or celebrities live on that island, and
that is why house values there are so expensive.
• Or maybe some variable like CRIME RATE is missing, and I need to add a bias or
research it to explain why house prices differ between areas with similar variable
values.
• Maybe I can try TensorFlow deep learning to predict the areas with different house
prices but similar other variables, which could be a better option than K NEAREST
NEIGHBOURS.
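A quick check of the suspected data-entry cap, sketched against the raw data frame from the appendix:

# how many districts carry the suspicious capped value?
sum(data$median_house_value == 500001, na.rm = TRUE)

# refitting on the uncapped rows would show whether the errors
# at the price extremes are driven by this artifact
uncapped <- data[data$median_house_value < 500001, ]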
Appendix 1
library(caret)
library(AppliedPredictiveModeling)
library(pls)
library(e1071)
library(lattice)
library(MASS)
library(lars)
library(elasticnet)
library(car)
library(glmnet)
library(plyr)
### read data
data <- read.csv("cal-housing (1).csv")
head(data)
### recode ocean_proximity as numeric codes
data$ocean_proximity <- revalue(data$ocean_proximity, c("NEAR BAY" = "1", "INLAND" = "2", "ISLAND" = "3", "NEAR OCEAN" = "4", "<1H OCEAN" = "5"))
data$ocean_proximity <- as.numeric(data$ocean_proximity)
### drop leftover index columns
data <- data[, !names(data) %in% c('X', 'X.1')]
### do a scatterplot matrix to see linearity
splom(data)
nrow(data)
head(data)
### create training and test data sets
### use 75 percent of rows for training and the rest for testing
bound <- floor(0.75 * nrow(data))
data.train <- data[1:bound, ]
data.test <- data[(bound + 1):nrow(data), ]
nrow(data.test)
nrow(data.train)
dataTrainX <- data.train
dataTestX <- data.test
### apply the Box-Cox transformation
boxcox <- preProcess(dataTrainX, method = "BoxCox")
dataTrainXtrans <- predict(boxcox, dataTrainX)
head(dataTrainXtrans)
hist(dataTrainXtrans$population)
hist(dataTrainX$population)
datatestXtrans <- predict(boxcox, dataTestX)
head(datatestXtrans)
hist(datatestXtrans$median_house_value)
hist(dataTestX$median_house_value)
### create training data from the transformed set
trainingData <- dataTrainXtrans
head(trainingData)
### fit the model - OLS
model <- lm(median_house_value ~ ., data = trainingData)
summary(model)
par(mfrow = c(2, 2))
### predict values
pred <- predict(model, datatestXtrans)
### create obs, pred data frame
df <- data.frame(obs = datatestXtrans$median_house_value, pred = pred)
df
defaultSummary(df)
### cross-validation (trainControl takes the fold count via number)
ctrl <- trainControl(method = "cv", number = 10)
set.seed(100)
tmp <- subset(dataTrainXtrans, select = -median_house_value)
head(tmp)
modcv <- train(x = tmp, y = dataTrainXtrans$median_house_value, method = "lm", trControl = ctrl)
### check for multicollinearity
vif(model)
### VIF levels show collinearity in the dataset
### pca analysis
pca <- data
### standardize independent variables
x <- subset(pca, select = -median_house_value)
head(x)
x <- scale(x)
### center the dependent variable
y <- pca$median_house_value
y <- scale(y, scale = F)
### do pca on independent variables
comp <- prcomp(na.omit(x))
comp
plot(comp)
biplot(comp)
summary(comp)
# 5 principal components explain 97% of the total variance
pcrFit <- pcr(median_house_value ~ ., data = trainingData, validation = "CV")
summary(pcrFit)
### choose five components for prediction
xpcr <- subset(datatestXtrans, select = -median_house_value)
pcrpred <- predict(pcrFit, xpcr, ncomp = 5)
pcrdf1 <- data.frame(obs = dataTestX$median_house_value, Predictions = pcrpred)
pcrdf1
### PLS regression is a better variation of PCR: it accounts for the variation in the response when selecting weights
### use the pls package, plsr function
### the default algorithm is the Dayal and MacGregor kernel algorithm
plsFit <- plsr(median_house_value ~ ., data = trainingData, validation = "CV")
### predict the first 100 median_house_value values using 1 and 2 components
pls.pred <- predict(plsFit, datatestXtrans[1:100, ], ncomp = 1:2)
summary(plsFit)
validationplot(plsFit, val.type = "RMSEP")
pls.RMSEP <- RMSEP(plsFit, estimate = "CV")
plot(pls.RMSEP, main = "RMSEP PLS", xlab = "Components")
min <- which.min(pls.RMSEP$val)
points(min, min(pls.RMSEP$val), pch = 1, col = "red")
plot(plsFit, ncomp = 4, asp = 1)
### use 4 components
pls.pred2 <- predict(plsFit, datatestXtrans, ncomp = 4)
pls.eval <- data.frame(obs = dataTestX$median_house_value, pred = pls.pred2[, 1, 1])
defaultSummary(pls.eval)
List of Figures
1.1 BOX COX FORMULA
1.2 PCA ALGORITHM
1.3 PLS ON VARIABLES VS SIMPLE LINEAR REGRESSION
3.1 Appendix Reference-1
3.2 Before BOX COX
3.3 After BOX COX
3.4 Comparison before and after BOXCOX of Population Variable
3.5 Before BOX COX
3.6 After BOX COX
3.7 Comparison before and after BOXCOX of median house value Variable
4.1 Linear Regression on Power Transformed Trained Data
4.2 Linear Regression Performed on the data in Decision tree report
4.3 Prediction on Training Data Results
4.4 Cross Validation Results
5.1 VIF results
5.2 PCA COMPONENTS IMPORTANCE
5.3 Real Data VS Predictions After PCA
6.1 PLS COMPONENTS
6.2 PLS Results - RMSE IS REDUCED SIGNIFICANTLY
6.3 Real data VS PLS predictions
