---
title: "House Price in LA"
author: "SM"
date: "April 1, 2020"
output:
  pdf_document: default
  html_document: default
  word_document: default
---
## Section 1
Load the data
```{r}
library(readr)
house_price <- read_csv("./dataset1.csv")
summary(house_price)
str(house_price)
```
```{r}
#plotting histogram with fitted normal density
#install.packages("rcompanion")
library(rcompanion)
#house_price$ValuePerSQFT <- house_price$LandValue/house_price$SQFTmain
plotNormalHistogram(house_price$ValuePerSQFT)
#testing for normality of distribution
shapiro.test(house_price$ValuePerSQFT)
```
Since the p-value is small, the Shapiro test rejects the null hypothesis that the data are normally distributed. Hence, the response variable needs to be transformed to achieve normality.
Now, we perform a Box-Cox transformation.
```{r}
#finding optimal lambda for Box-Cox transformation
#install.packages("MASS")
library(MASS)
BoxCox.fit<- boxcox(ValuePerSQFT ~ age+Bedrooms+Bathrooms+LandValue100K,
data=house_price, lambda = seq(-3,3,1/4), interp=FALSE)
BoxCox.data<- data.frame(BoxCox.fit$x, BoxCox.fit$y)
ordered.data<- BoxCox.data[with(BoxCox.data, order(-BoxCox.fit.y)),]
ordered.data[1,]
lambda = ordered.data[1, "BoxCox.fit.x"] #0.5
```
Based on the second output, the optimal lambda value to use for the transformation is 0.5. Now, let's create a function that performs a Box-Cox transformation with a given lambda. The function also has an inverse argument to revert the values after prediction.
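For reference, the transformation implemented in the function below and its inverse are
$$y^{(\lambda)} = \frac{y^{\lambda} - 1}{\lambda}, \qquad y = \left(\lambda\, y^{(\lambda)} + 1\right)^{1/\lambda},$$
which for $\lambda = 0.5$ reduces to $2(\sqrt{y} - 1)$ and its inverse.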
```{r}
#Box Cox Transformation with lambda
transform_data <- function(var, lambda, inverse=F){
if(inverse==F){
new_variable <- (1/lambda)*(var^(lambda)-1)
}
else{
new_variable <- (lambda*var + 1)^(1/lambda)
}
return(new_variable)
}
```
With the Box Cox transformation created, we will transform the response variable ValuePerSQFT and test to see if the function works correctly.
```{r}
#applying Box-Cox transformation with lambda=0.5
tt <- transform_data(var=house_price$ValuePerSQFT, lambda = lambda)
identical(tt, 2*(sqrt(house_price$ValuePerSQFT)-1))
```
The function works, and the next step is to test the normality of the transformed data using the Shapiro test.
```{r}
#plotting histogram for transformed response
plotNormalHistogram(tt)
#testing for normality of distribution
shapiro.test(tt)
```
The normality assumption has been met, and now we apply the generalized linear model. The first step is to prepare the data by changing the data types of several features and splitting the data into training and testing sets.
```{r}
#convert Bedrooms and Bathrooms into integers
house_price$Bedrooms <- as.integer(house_price$Bedrooms)
house_price$Bathrooms <- as.integer(house_price$Bathrooms)
attr(house_price, "spec") <- NULL
set.seed(5)
N = nrow(house_price)
train_idx <- sample(1:N, size = round(0.8*N))
eval_idx <- setdiff(1:N, train_idx)
train_data <- house_price[train_idx, ]
eval_data <- house_price[eval_idx, ]
#transformed values of response variable
train_data$target <- tt[train_idx]
```
With the data prepared, let's perform some data visualizations.
```{r}
#install.packages("ggplot2")
library(ggplot2)
#library(devtools)
#install_github("ggobi/ggally")
#install.packages("colorspace")
#install.packages('Rcpp')
#install.packages('GGally')
library(GGally)
#install.packages('digest')
#library('digest')
```
```{r}
ggplot(train_data, aes(x=factor(Bedrooms), y=target, fill=factor(Bedrooms))) +
geom_boxplot()
```
The target variable decreases as the number of Bedrooms increases, consistent with economies of scale or agglomeration: more bedrooms require more space, which increases the area of the house, and since the target variable is a transformation of price per square foot, the price per square foot falls as Bedrooms increases. Other than 6 Bedrooms, the variability of the target variable is similar across Bedrooms, so Bedrooms is a good candidate to include in the model.
```{r}
ggplot(train_data, aes(x=factor(Bathrooms), y=target, fill=factor(Bathrooms))) +
geom_boxplot()
```
Unlike Bedrooms, the target does not have similar variability across Bathrooms, and it decreases only from 1 to 4 Bathrooms.
```{r}
ggplot(train_data, aes(x=LandValue100K, y=target, color=factor(Bathrooms))) +
geom_point() +
geom_smooth(method='lm', se=F) +
facet_wrap(~factor(Bathrooms))
ggplot(train_data, aes(x=LandValue100K, y=target, color=factor(Bedrooms))) +
geom_point() +
geom_smooth(method = 'lm', se=F) +
facet_wrap(~factor(Bedrooms))
ggplot(train_data, aes(x=LandValue100K, y=target)) +
geom_point() +
geom_smooth(method = 'lm', se=F)
```
Before a candidate feature is included in the GLM, it must have a linear relationship with the target variable. If it does not, either apply a transformation to obtain a linear relationship or drop the feature as a candidate for the GLM.
The target variable moves consistently with LandValue100K across both the Bathrooms and Bedrooms facets, and the third output shows a linear trend between LandValue100K and target.
```{r fig.height=7, fig.width=7}
ggplot(train_data, aes(x=age, y=target, color=factor(Bedrooms))) +
geom_point() +
geom_smooth(method = 'lm', se=F) +
facet_wrap(~factor(Bedrooms))
ggplot(train_data, aes(x=age, y=target, color=factor(Bathrooms))) +
geom_point() +
geom_smooth(method = 'lm', se=F) +
facet_wrap(~factor(Bathrooms))
ggplot(train_data, aes(x=age, y=target)) +
geom_point() +
geom_smooth(method = 'lm', se=F)
```
Note: low age means a recently developed house, while high age means an old house. The slope of target with respect to age is positive for every number of Bedrooms except 5. For Bathrooms, the slopes are not as consistent, especially for 4, 5, 6, and 7 Bathrooms. The 6- and 7-Bathroom groups have too few observations to indicate a trend between age and target, while 5 Bathrooms has a large positive slope compared to 1-3 Bathrooms, and 4 Bathrooms has a negative slope.
In the third output, the overall trend between age and target has a positive slope.
```{r}
ggplot(train_data, aes(x=SQFTmain, y=target, color=factor(Bedrooms))) +
geom_point() +
geom_smooth(method = 'lm', se=F) +
facet_wrap(~factor(Bedrooms))
ggplot(train_data, aes(x=SQFTmain, y=target, color=factor(Bathrooms))) +
geom_point() +
geom_smooth(method = 'lm', se=F) +
facet_wrap(~factor(Bathrooms))
ggplot(train_data, aes(x=SQFTmain, y=target)) +
geom_point() +
geom_smooth(method = 'lm', fullrange=T, se=F, aes(colour="lm"), linetype=1) +
geom_smooth(se=F, fullrange=T, aes(colour="loess"), linetype=2) +
scale_colour_manual(name="legend", values=c("green", "blue")) +
scale_linetype_discrete(name = "legend")
```
In the first and second outputs there seems to be a nonlinear trend between SQFTmain and target. In the third output, the loess fit and the linear fit differ, again suggesting a possible nonlinear association between SQFTmain and target; SQFTmain may have a quadratic relationship with the target. To compare performance, let's add a quadratic term to the model.
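As a quick aside (a minimal sketch, not part of the original pipeline), a quadratic term can be introduced either through `poly(SQFTmain, 2)` or through an explicit squared column; the two parameterizations span the same column space, so the fitted values agree even though the coefficients differ.
```{r}
# Orthogonal-polynomial vs. raw quadratic parameterization of SQFTmain (sketch only)
fit_poly <- lm(target ~ age + poly(SQFTmain, 2) + LandValue100K, data = train_data)
fit_raw <- lm(target ~ age + SQFTmain + I(SQFTmain^2) + LandValue100K, data = train_data)
all.equal(fitted(fit_poly), fitted(fit_raw))  # TRUE: same fit, different coefficients
```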
```{r fig.height=10, fig.width=10}
#ggpairs(train_data, columns = c(2,3,4,5,6,8))
```
```{r}
#sqft_tr = (5200 - SQFTmain)^2
library(dplyr)
#creates a new variable that gets the total number of Bathrooms and Bedrooms
#the other new variable gets the squared value of SQFTmain
train_data1 <- train_data %>%
mutate(Total_household = Bathrooms + Bedrooms, SQFT_2 = SQFTmain^2) #training set
eval_data1 <- eval_data %>%
mutate(Total_household = Bathrooms + Bedrooms, SQFT_2 = SQFTmain^2) #testing set
```
```{r fig.height=18, fig.width=18}
ggpairs(train_data1, columns = c(2,3,6,4,5,9,10,8))
```
Looking at age, older properties usually have larger square footage; as time goes on land becomes scarcer, resulting in denser, newer developments. There is a high correlation between SQFTmain and the variables Bedrooms and Bathrooms, which is a multicollinearity issue that needs to be addressed. Variable selection therefore suggests using either SQFTmain without Bedrooms and Bathrooms, or Bedrooms/Bathrooms without SQFTmain.
There are potential multicollinearity issues among Bedrooms, Bathrooms, SQFTmain, and Total_household. Since Total_household is the sum of Bathrooms and Bedrooms, either Total_household or Bedrooms/Bathrooms should be used, not both. Bathrooms is strongly correlated with the predictors age, SQFTmain, Bedrooms, and Total_household, so Bathrooms is not expected to be a good predictor to include in the model. The remaining options are Bedrooms, SQFTmain, or Total_household. Based on the previous plots, age and LandValue100K are consistently good predictors to use.
```{r}
#install.packages('caret')
#install.packages('ipred')
```
```{r warning=FALSE}
library(caret)
control <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
train(target ~ age + poly(SQFTmain,2) + LandValue100K, data = train_data1, method='lm', trControl = control) #prefer this model
train(target ~ age + Total_household + LandValue100K, data = train_data1, method='lm', trControl = control)
train(target ~ age + Bedrooms + LandValue100K, data = train_data1, method='lm', trControl = control)
train(target ~ age + Bathrooms + LandValue100K, data = train_data1, method='lm', trControl = control)
```
Note: in k-fold cross-validation the data are randomly split into k parts; each part is held out in turn for validation while the remaining k-1 parts are used for training. Repeated k-fold cross-validation repeats this random split n times, giving n different k-fold partitions. Using repeated k-fold cross-validation, the 2nd-order polynomial of SQFTmain is the preferred feature to add alongside age and LandValue100K.
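Conceptually (a hand-rolled sketch for illustration only, not how caret implements `trainControl(method = "repeatedcv", number = 5, repeats = 5)`), each candidate model is refit 5 x 5 = 25 times on different fold splits:
```{r}
# Sketch of repeated k-fold CV: 5 folds x 5 repeats = 25 fits per model
set.seed(1)
k <- 5
repeats <- 5
n <- nrow(train_data1)
cv_rmse <- c()
for (r in 1:repeats) {
  folds <- sample(rep(1:k, length.out = n))  # new random fold assignment each repeat
  for (f in 1:k) {
    fit <- lm(target ~ age + poly(SQFTmain, 2) + LandValue100K,
              data = train_data1[folds != f, ])
    pred <- predict(fit, newdata = train_data1[folds == f, ])
    cv_rmse <- c(cv_rmse, sqrt(mean((train_data1$target[folds == f] - pred)^2)))
  }
}
mean(cv_rmse)  # average held-out RMSE over the 25 fits
```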
```{r}
library(car)
print('model with age, LandValue100K, and quadratic form of SQFTmain:')
vif(lm(target ~ age + poly(SQFTmain, 2) + LandValue100K, data = train_data1))
print('model with age, LandValue100K, and Bedrooms:')
vif(lm(target ~ age + Bedrooms + LandValue100K, data = train_data1))
print('model with age, LandValue100K, and Bathrooms:')
vif(lm(target ~ age + Bathrooms + LandValue100K, data = train_data1))
print('model with age, LandValue100K, and Total_household:')
vif(lm(target ~ age + Total_household + LandValue100K, data = train_data1))
```
The variance inflation factors for all models are adequate since the values are less than 4, and the VIF for the 2nd-order polynomial of SQFTmain is acceptable. The next things to check are the residual assumptions: homogeneity of variance (constant scale), mean zero, low leverage, low influence, and normality.
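For reference, the variance inflation factor of predictor $j$ is
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$
where $R_j^2$ is the $R^2$ obtained by regressing predictor $j$ on the remaining predictors; values close to 1 indicate little collinearity, and the rule of thumb used here flags values above about 4.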
```{r}
model2 <- lm(target ~ age + SQFTmain + SQFT_2 + LandValue100K, data = train_data1)
```
We will display Cook's distance and find the most influential data points in the model to remove. Once those observations are dropped, the residuals are checked again and the process is repeated until the remaining influence is small. Note: the 3 largest Cook's distance values are handled at a time, so the data are filtered 3 observations at a time until few values stand out in the Cook's distance plot (output 1) and the standardized residuals fall within the Cook's distance contours on the leverage plot (output 2).
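As a reminder, Cook's distance for observation $i$ is
$$D_i = \frac{\sum_{j=1}^{n}\left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\,\hat{\sigma}^2} = \frac{r_i^2}{p}\cdot\frac{h_{ii}}{1 - h_{ii}},$$
where $\hat{y}_{j(i)}$ is the fitted value when observation $i$ is left out, $p$ is the number of estimated coefficients, $r_i$ is the standardized residual, and $h_{ii}$ is the leverage; the cutoff computed below follows the common $4/n$-style rule of thumb.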
```{r}
cutoff <- 4/((nrow(train_data1) - length(model2$coefficients) - 2))
plot(model2, which = 4, cook.levels = cutoff)
plot(model2, which = 5, cook.levels = cutoff)
```
Looking at Cook's distance (output 1) and the standardized residuals versus leverage (output 2), some observations need to be removed to reduce the extreme cases in the data. The pattern is to remove the 3 data points with the highest Cook's distance, naming the first filtered dataframe filter_data1, and then repeat this process until the largest influential points no longer stand out from the rest. For this data, 30 outliers are removed (filter_data1 through filter_data10); the code below is the finished version of this strategy. Note: the procedure starts by running the last 4 lines to display the Cook's distance plot, which identifies the 3 largest influential points. Those 3 points are removed, producing filter_data1 in the first line of code. Next, the data argument of lm() is changed from train_data1 to filter_data1, and the last 4 lines are rerun to find and remove the next 3 largest influential points (filter_data2). This is repeated until the 3 largest influential points are no longer much larger than the other points and the leverage plot (output 2) shows the curve within the boundary.
```{r}
filter_data1 <- train_data1[-which(rownames(train_data1) %in% c(173, 480, 515)), ]
filter_data2 <- filter_data1[-which(rownames(filter_data1) %in% c(271, 309, 703)), ]
filter_data3 <- filter_data2[-which(rownames(filter_data2) %in% c(132, 145, 370)), ]
filter_data4 <- filter_data3[-which(rownames(filter_data3) %in% c(119, 250, 422)), ]
filter_data5 <- filter_data4[-which(rownames(filter_data4) %in% c(115, 366, 713)), ]
filter_data6 <- filter_data5[-which(rownames(filter_data5) %in% c(226, 595, 699)), ]
filter_data7 <- filter_data6[-which(rownames(filter_data6) %in% c(391, 589, 666)), ]
filter_data8 <- filter_data7[-which(rownames(filter_data7) %in% c(353, 589, 691)), ]
filter_data9 <- filter_data8[-which(rownames(filter_data8) %in% c(300, 358, 366)), ]
filter_data10 <- filter_data9[-which(rownames(filter_data9) %in% c(108, 341, 392)), ] #best value
#filter_data11 <- filter_data10[-which(rownames(filter_data10) %in% c(55, 509, 703)), ]
#filter_data12 <- filter_data11[-which(rownames(filter_data11) %in% c(402, 539, 655)), ]
#filter_data13 <- filter_data12[-which(rownames(filter_data12) %in% c(76, 125, 526)), ]
#filter_data14 <- filter_data13[-which(rownames(filter_data13) %in% c(97, 198, 375)), ]
#filter_data15 <- filter_data14[-which(rownames(filter_data14) %in% c(196, 646, 649)), ]
#filter_data16 <- filter_data15[-which(rownames(filter_data15) %in% c(267, 616, 673)), ]
#filter_data17 <- filter_data16[-which(rownames(filter_data16) %in% c(15, 343, 452)), ]
model <- lm(target ~ age + SQFTmain + SQFT_2 + LandValue100K, data = filter_data10)
cutoff <- 4/((nrow(filter_data10) - length(model$coefficients) - 2))
plot(model, which = 4, cook.levels = cutoff)
plot(model, which = 5, cook.levels = cutoff)
```
In the first output, the largest influential points no longer stand out from the other points, unlike the earlier output 1 before the outliers were removed. In the second output, the standardized residuals and the leverage curve are inside the boundary, unlike the earlier output 2.
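The manual filtering above could also be automated. The helper below is a hypothetical sketch (drop_top_cooks is not part of the original analysis, and the rows it removes may not match the hand-picked row names exactly, since refitting after each round changes the ranking):
```{r}
# Sketch: repeatedly drop the 3 observations with the largest Cook's distance
drop_top_cooks <- function(data, n_rounds = 10, n_drop = 3) {
  for (i in seq_len(n_rounds)) {
    fit <- lm(target ~ age + SQFTmain + SQFT_2 + LandValue100K, data = data)
    worst <- order(cooks.distance(fit), decreasing = TRUE)[seq_len(n_drop)]
    data <- data[-worst, ]  # remove the n_drop most influential rows this round
  }
  data
}
#filtered_auto <- drop_top_cooks(train_data1)  # 10 rounds x 3 points = 30 rows removed
```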
```{r warning=FALSE}
train(target ~ SQFTmain + SQFT_2 + LandValue100K, data = filter_data7, method='lm', trControl = control)
train(target ~ age + SQFTmain + SQFT_2 + LandValue100K, data = filter_data10, method='lm', trControl = control)
```
The first training model above filtered 21 outliers (7 x 3), while the second filtered 30 outliers (10 x 3). The model improved once the Cook's distance criterion was met, based on RMSE (root mean squared error), R squared, and MAE (mean absolute error): RMSE went down, R squared went up, and MAE went down.
```{r}
#library(gridExtra)
#install.packages('gridExtra')
library(gridExtra)
```
```{r}
temp1 <- ggplot(filter_data10, aes(x=LandValue100K, y=target)) +
geom_point() +
geom_smooth(method = 'lm', se=F, color='red')
temp2 <- ggplot(train_data, aes(x=LandValue100K, y=target)) +
geom_point() +
geom_smooth(method = 'lm', se=F)
grid.arrange(temp1, temp2, ncol=1, nrow=2)
```
There is still a linear relationship between LandValue100K and target, but the filtered data (top) looks smoother. The bottom plot has outliers of LandValue100K > 6 that are far away from the blue line.
```{r fig.height=7, fig.width=7}
temp1 <- ggplot(filter_data10, aes(x=age, y=target)) +
geom_point() +
geom_smooth(method = 'lm', se=F, color='red')
temp2 <- ggplot(train_data, aes(x=age, y=target)) +
geom_point() +
geom_smooth(method = 'lm', se=F)
grid.arrange(temp1, temp2, ncol=1, nrow=2)
```
Unlike the previous comparison, there is not much difference between the top plot (filtered data) and the bottom plot (training data), other than the removal of the points farthest from the trend.
```{r}
temp1 <- ggplot(filter_data10, aes(x=SQFTmain, y=target)) +
geom_point() +
geom_smooth(method = 'lm', fullrange=T, se=F, aes(colour="lm"), linetype=1, color='red') +
geom_smooth(se=F, fullrange=T, aes(colour="loess"), linetype=2) +
scale_colour_manual(name="legend", values=c("blue", "red")) +
scale_linetype_discrete(name = "legend")
temp2 <- ggplot(train_data, aes(x=SQFTmain, y=target)) +
geom_point() +
geom_smooth(method = 'lm', fullrange=T, se=F, aes(colour="lm"), linetype=1) +
geom_smooth(se=F, fullrange=T, aes(colour="loess"), linetype=2) +
scale_colour_manual(name="legend", values=c("green", "blue")) +
scale_linetype_discrete(name = "legend")
grid.arrange(temp1, temp2, ncol=1, nrow=2)
```
By removing the outliers, the relationship between SQFTmain and target in the filtered data looks closer to linear. In terms of coefficient interpretation, treating SQFTmain as a linear term rather than a quadratic one is easier; however, the quadratic form is useful for prediction, and for this thesis we want to maximize predictive performance.
There is improvement when filtering the data 10 times versus 7, with lower root mean squared error and mean absolute error and a higher R squared.
```{r}
library(MuMIn)
Metric <- function(model){
num_var = length(model$coefficients) - 1
print(paste0('BIC : ', BIC(model)))
print(paste0('AIC : ', AIC(model)))
print(paste0('Sigma : ', round(sigma(model), 3)))
}
```
We use two helper functions: Metric (defined above) reports the BIC, AIC, and sigma of a model, while metrics_mod (below) computes the root mean squared error, R squared, and mean absolute error on the holdout/evaluation set.
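For reference, the holdout metrics computed by the function below are
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert, \qquad R^2 = \mathrm{cor}(y, \hat{y})^2,$$
where $y_i$ is the observed ValuePerSQFT in the evaluation set and $\hat{y}_i$ is the prediction (back-transformed to the original scale when invert = TRUE).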
```{r}
library(Metrics)
metrics_mod <- function(model, test_data, target_name="ValuePerSQFT", lambda=0.5, include_metric=F, invert=F){
if(include_metric){
Metric(model)
}
if(invert){
pred1a <- transform_data(predict(model, test_data), lambda=lambda, inverse=T)
#target = test_data[[target_name]]
}
else{
pred1a <- predict(model, test_data)
}
#residual <- test_data[[target_name]] - pred1a
print(paste0('RMSE: ', rmse(actual = test_data[[target_name]], predicted = pred1a)))
print(paste0("R squared: ", cor(pred1a, test_data[[target_name]])^2 ))
#print(paste0('R squared: ', 1 - var(test_data[[target_name]] - pred1a)/var(test_data[[target_name]]) ))
print(paste0('MAE: ', mae(actual = test_data[[target_name]], predicted = pred1a)))
}
```
The next function predicts a single example to show how the predictors relate to the response variable. Note: the user must know whether the Box-Cox transformation was used in the model via the invert argument; TRUE means the Box-Cox transformation is reversed on the prediction in order to get price per SQFT. lambda is the parameter used for the transformation.
```{r}
sample_predict <- function(model, invert=F, lambda=0.5, age=0, Bedrooms=0, Bathrooms=0, LandValue100K=0, SQFTmain=0){
pred.tr.ValuePerSQFT <- predict(model, data.frame(age = age, Bedrooms = Bedrooms, Bathrooms = Bathrooms, LandValue100K = LandValue100K, SQFTmain = SQFTmain, SQFT_2 = SQFTmain^2, Total_household = Bedrooms + Bathrooms))
if(invert){
print(paste0('predict sample: ', pred.ValuePerSQFT <- transform_data(pred.tr.ValuePerSQFT, lambda=lambda, inverse=T)))
}
else{
print(paste0('predict sample: ', pred.ValuePerSQFT <- pred.tr.ValuePerSQFT))
}
}
sample_predict(model, invert=T, age=3, Bedrooms=2, LandValue100K = 2.5, SQFTmain = 1700)
sample_predict(model, invert=T, age=10, Bedrooms=2, LandValue100K = 2.5, SQFTmain = 1700)
```
Install and load ggResidpanel to analyze the residuals of the model.
```{r}
#install.packages("devtools")
#devtools::install_github("goodekat/ggResidpanel")
library(ggResidpanel)
```
```{r fig.height=8, fig.width=10}
resid_panel(model, plots="all", qqbands = T)
resid_xpanel(model, yvar="response")
```
The top-left figure is the residual plot, which should have values centered around zero. The top-center figure plots the observation index against the residuals and should behave similarly to the residual plot. The top-right plot is response vs. predicted, which should follow the trend. The center-left is the QQ plot, whose points should lie close to the line to satisfy the normality assumption. The middle plot is a histogram, which should corroborate the QQ plot. The middle-right is a boxplot of the residuals, which should be centered around zero with fairly even whiskers (little skew) and few outliers. The bottom-right (residual-leverage plot) and bottom-left (Cook's D plot) are the same as the earlier plots used for detecting outliers. The bottom-center is the location-scale plot, which should show an even spread across the predicted values so that the residuals meet the assumption of equal variance (homoscedasticity). The assumptions are met fairly well.
```{r fig.height=18, fig.width=18}
ggpairs(filter_data10, columns = c(2,3,6,10,7))
```
Now that the model for the Box-Cox approach is complete, we will apply Gamma regression. Instead of transforming the data to be normally distributed, let's assume the target variable (ValuePerSQFT) follows a gamma distribution. We will use the same features as the previous model since the plots above indicate relationships similar to those of the transformed variable.
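Under this assumption the model is a Gamma GLM:
$$y_i \sim \mathrm{Gamma}(\mu_i, \phi), \qquad \mathbb{E}(y_i) = \mu_i, \qquad \mathrm{Var}(y_i) = \phi\,\mu_i^2, \qquad g(\mu_i) = \mathbf{x}_i^{\top}\boldsymbol{\beta},$$
with the identity link $g(\mu) = \mu$, matching `family = Gamma(link = identity)` in the call below.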
```{r}
#fitting gamma regression
summary(fitted.Gmodel<- glm(ValuePerSQFT ~ age + SQFTmain + SQFT_2 + LandValue100K, data=train_data1,
family=Gamma(link=identity) ))
#get the BIC, AIC, and Sigma value of the model
Metric(fitted.Gmodel)
print("")
#get the performance metrics of the testing data to see how well the model did
metrics_mod(fitted.Gmodel, test_data = eval_data1)
```
The model seems to perform reasonably well considering the amount of variation it accounts for, and the deviations relative to ValuePerSQFT (RMSE and MAE) do not look too bad.
```{r fig.height=8, fig.width=10}
resid_panel(fitted.Gmodel, plots = "all", qqbands = T)
```
However, the residual assumptions do not look good. The residual plot deviates from zero at high and low predicted values, which is also reflected in the Response vs. Predicted plot. The QQ plot and boxplot clearly show many outliers. The Cook's D plot has at least 6 very influential points, and the location-scale plot is high for extreme predicted values. We have to detect the outliers using Cook's distance, similarly to the method used on the previous model.
Note: filter_data overwrites the previous filter_data from the earlier model, and the filtered data points are not necessarily at the same index locations.
```{r fig.height=8, fig.width=10}
filter_data1 <- train_data1[-which(rownames(train_data1) %in% c(85, 480, 706)), ]
filter_data2 <- filter_data1[-which(rownames(filter_data1) %in% c(271, 427, 513)), ]
filter_data3 <- filter_data2[-which(rownames(filter_data2) %in% c(172, 186, 682)), ]
filter_data4 <- filter_data3[-which(rownames(filter_data3) %in% c(144, 281, 527)), ]
filter_data5 <- filter_data4[-which(rownames(filter_data4) %in% c(131, 227, 304)), ]
filter_data6 <- filter_data5[-which(rownames(filter_data5) %in% c(209, 247, 698)), ]
filter_data7 <- filter_data6[-which(rownames(filter_data6) %in% c(118, 362, 593)), ]
filter_data8 <- filter_data7[-which(rownames(filter_data7) %in% c(339, 395, 513)), ]
filter_data9 <- filter_data8[-which(rownames(filter_data8) %in% c(107, 364, 746)), ]
filter_data10 <- filter_data9[-which(rownames(filter_data9) %in% c(354, 541, 689)), ]
Gmodel <- glm(ValuePerSQFT ~ age + SQFTmain + SQFT_2 + LandValue100K, data = filter_data10, family = Gamma("identity"))
#cutoff <- 4/((nrow(filter_data10) - length(Gmodel$coefficients) - 2))
#plot(Gmodel, which = 4, cook.levels = cutoff)
#plot(Gmodel, which = 5, cook.levels = cutoff)
resid_panel(Gmodel, plots="all", qqbands = T)
```
The Cook's distance plot now has fewer points that stand out as influential relative to the others, and the residuals-vs-leverage plot has the curve within the boundary, which is good. The residual diagnostics now match the assumptions necessary for Gamma regression.
Note: the first two plots (commented out above) can be skipped since they duplicate the bottom-left and bottom-right panels of resid_panel.
The predictions from the Gaussian (Box-Cox) model and the Gamma GLM are similar.
```{r}
print("Generalized linear model:")
metrics_mod(model, test_data = eval_data1, invert = T)
print("")
print("Gamma regression:")
metrics_mod(Gmodel, test_data = eval_data1)
```
The GLM (Box-Cox) model performed better than the Gamma regression model: its root mean squared error was about 4 cents lower, its R squared about 3 percentage points higher, and its mean absolute error about 3 cents lower.
## Section 2
The next data set to analyze is from Zillow, where a sequence of month-year pairs records a price per square foot for each city. Later in the analysis we discovered that there are multiple sequences for the same city. This makes sense, since price per square foot can follow different patterns within the same city because the market varies within the city.
```{r}
#library(readr)
Monthly_rent <- read_csv("./dataset2.csv")
longform_MR <- read_csv("./longformdataset2.csv")
# print('Columns:')
# print(colnames(longform_MR))
# print('Cities:')
# print(unique(longform_MR$City))
```
```{r}
length(unique(longform_MR$City))
length(unique(Monthly_rent$City))
head(longform_MR)
```
We want to look at the time series of monthly rent for some cities.
```{r}
library(dplyr)
library(ggplot2)
#City, MonthYear, PricePERSQFT
longform_MR %>%
filter(City %in% c('Homestead', 'Los Angeles', 'Delaware', 'Middletown', 'West New York', 'Riverdale', 'Brooklyn Park', 'Denver', 'Houston')) %>%
ggplot(aes(x=MonthYear, y=PricePerSQFT, col=City)) +
geom_line() +
facet_wrap(.~City) +
theme(legend.position='none')
```
The sample plot shows the time series for several cities. The goal is to group these trends using k-means clustering and feed the resulting clusters into a GEE model and a generalized mixed-effects model.
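For a chosen $k$, k-means picks cluster centers $\mathbf{c}_1, \dots, \mathbf{c}_k$ to minimize the total within-cluster sum of squares,
$$\mathrm{tot.withinss} = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \lVert \mathbf{x}_i - \mathbf{c}_j \rVert^2,$$
which is the quantity (`model$tot.withinss`) plotted against $k$ in the elbow plot below.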
```{r}
#next exercise
#purrr provides the map functions used to run k-means for several values of k
library(purrr)
# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10, function(k){
model <- kmeans(x = Monthly_rent[, 2:76], nstart = 1, iter.max = 15, centers = k)
model$tot.withinss
})
# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
k = 1:10 ,
tot_withinss = tot_withinss
)
# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)
```
Clusters of 2-5 look reasonable, and the elbow method suggests two clusters should be used. Now we want to see how similar each observation is to the rest of its cluster. This is done with a silhouette plot, which shows how well each point is classified within its cluster.
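For each observation $i$, the silhouette width is
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$
where $a(i)$ is the average distance from $i$ to the other members of its own cluster and $b(i)$ is the smallest average distance from $i$ to the members of any other cluster; $s(i)$ near 1 means the point is well matched to its cluster, while negative values suggest it may belong to a neighboring cluster.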
```{r fig.height=16, fig.width=16}
library(cluster)
# Generate a k-means model using the pam() function with a k = 2
monthly_rent <- Monthly_rent[, 2:76]
pam_k <- pam(monthly_rent, k = 2)
# Plot the silhouette visual for the pam_k model
plot(silhouette(pam_k))
# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10, function(k){
model <- pam(x = Monthly_rent[, 2:76], k = k)
model$silinfo$avg.width
})
# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
k = 2:10,
sil_width = sil_width
)
# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)
```
The first plot shows the silhouette width of every observation within its cluster (here we used 2 clusters). Generally, silhouette widths between 0.5 and 1 are good, widths between 0 and 0.5 are questionable, and anything below 0 is bad. Several chunks of observations in cluster 1 are off (negative values), while cluster 2 did very well. The second plot shows the average silhouette width for each candidate number of clusters (2-10); based on it, the recommended k is 2, and the largest k that could be tolerated is perhaps 6, since the width peaks again there.
The next step is to incorporate the clusters into the data and compare the trends for each cluster.
```{r}
library(tidyr)
N2 <- nrow(Monthly_rent)
#Monthly_Rent1 <- spread(longform_MR, key = MonthYear, value = PricePerSQFT, -City)
# Build a kmeans model
model_km2 <- kmeans(monthly_rent, centers = 2)
model_km3 <- kmeans(monthly_rent, centers = 3)
model_km4 <- kmeans(monthly_rent, centers = 4)
model_km6 <- kmeans(monthly_rent, centers = 6)
# Extract the cluster assignment vector from the kmeans model
clust_km2 <- model_km2$cluster
clust_km3 <- model_km3$cluster
clust_km4 <- model_km4$cluster
clust_km6 <- model_km6$cluster
# Create a new dataframe appending the cluster assignment
Monthly_rent_km2 <- Monthly_rent %>%
mutate(clusterTwo = clust_km2, clusterThree = clust_km3, clusterFour = clust_km4, clusterSix = clust_km6, index = 1:N2)
```
When cities appear with different cluster values, it indicates different behavior within a city, because the real-estate market varies by location. This occurred only with two clusters, while with 4 and 6 clusters no city spans multiple clusters. There is therefore an issue when combining cluster and city: some cities are duplicated in the Monthly_rent dataframe, which is normal, since price patterns within a city vary between locations. The code below resolves this by using an index placeholder to keep track of each cluster/city pair and then using the gather function to reshape the data from wide to long format. In other words, we transform the Monthly_rent dataframe into long form manually to address the issue of joining clusters to their respective cities when some cities have multiple clusters.
```{r }
longform_km2 <- gather(Monthly_rent_km2, key = "MonthYear", value = "PricePerSQFT", -clusterTwo, -clusterThree, -clusterFour, -clusterSix, -City, -index)
months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
library(car)
longform_km2_new <- longform_km2 %>%
separate(MonthYear, c("Month", "Year"), "_") %>%
mutate(Year = as.numeric(Year), Month_num = Recode(Month, " 'Jan' = 1; 'Feb' = 2; 'Mar' = 3; 'Apr' = 4;
'May' = 5; 'Jun' = 6; 'Jul' = 7; 'Aug' = 8; 'Sep' = 9;
'Oct' = 10; 'Nov' = 11; 'Dec' = 12 "), month = Month_num + 12*(Year - 10) -10)
```
```{r}
longform_km2_new %>%
ggplot(aes(x = month, y = PricePerSQFT, color=factor(clusterTwo), group=index)) +
geom_line() +
facet_wrap(~clusterTwo)
longform_km2_new %>%
ggplot(aes(x = month, y = PricePerSQFT, color=factor(clusterThree), group=index)) +
geom_line() +
facet_wrap(~clusterThree)
longform_km2_new %>%
ggplot(aes(x = month, y = PricePerSQFT, color=factor(clusterFour), group=index)) +
geom_line() +
facet_wrap(~clusterFour)
longform_km2_new %>%
ggplot(aes(x = month, y = PricePerSQFT, color=factor(clusterSix), group=index)) +
geom_line() +
facet_wrap(~clusterSix)
```
```{r fig.height=12, fig.width=20}
p1<-longform_km2_new %>%
ggplot(aes(x = month, y = PricePerSQFT, color=factor(clusterTwo), group=index)) +
geom_line() +
facet_wrap(~clusterTwo)
p2<-longform_km2_new %>%
ggplot(aes(x = month, y = PricePerSQFT, color=factor(clusterThree), group=index)) +
geom_line() +
facet_wrap(~clusterThree)
p3<-longform_km2_new %>%
ggplot(aes(x = month, y = PricePerSQFT, color=factor(clusterFour), group=index)) +
geom_line() +
facet_wrap(~clusterFour)
p4<-longform_km2_new %>%
ggplot(aes(x = month, y = PricePerSQFT, color=factor(clusterSix), group=index)) +
geom_line() +
facet_wrap(~clusterSix)
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow=2)
```
The first plot shows PricePerSQFT vs. month for each index (city/cluster pair) across two clusters. The remaining plots are the same but with 3, 4, and 6 clusters. There appears to be a clear separation in PricePerSQFT in the first plot, while there is a good amount of overlap in the others. Hence, we have more assurance that 2 clusters is the preferred choice to implement in the model.
******************************************************************************************************************************************************************
WE NOW BEGIN GENERAL MIXED EFFECTS MODEL STEPS / INVESTIGATION
The next step is to figure out what distribution the target variable PricePerSQFT belongs to.
```{r}
library(MuMIn)
#library(rcompanion)
#check what distribution is for response variable PricePerSQFT
plotDensityHistogram(longform_km2_new$PricePerSQFT)
```
There is right skewness in the data, which may indicate the distribution is not normal. Possible candidate distributions can be identified more easily using the fitdistrplus package.
```{r}
library(fitdistrplus)
descdist(longform_km2_new$PricePerSQFT, discrete = FALSE)
```
```{r}
#seems like the response variable belongs to the gamma distribution
gamma_dist <- fitdist(longform_km2_new$PricePerSQFT, "gamma")
#gamma_log_dist <- fitdist(log(longform_km2_new$PricePerSQFT), "gamma")
plot(gamma_dist)
#plot(gamma_log_dist)
```
Other than the QQ plot being slightly questionable, all the figures above suggest the response follows a gamma distribution. Next, we use this knowledge by supplying a Gamma family as the family argument of the glmer function.
The next chunk of code adds the clusters to the dataframe and splits the data into training and testing sets by keeping the first 65 months of each index (city/cluster pair) for training.
```{r}
#fitting random slope and intercept Gamma model
#install.packages("lme4")
library(lme4)
longform_km2_new$MonthYear <- longform_km2_new$month/100
longform_km2_new$clusterTwo <- relevel(as.factor(longform_km2_new$clusterTwo), ref=1)
longform_km2_new$clusterThree <- relevel(as.factor(longform_km2_new$clusterThree), ref=1)
longform_km2_new$clusterFour <- relevel(as.factor(longform_km2_new$clusterFour), ref=1)
longform_km2_new$clusterSix <- relevel(as.factor(longform_km2_new$clusterSix), ref=1)
#set.seed(3)
#N1 = nrow(longform_km2_new)
#train_idx2 <- sample(1:N1, size = round(0.9*N1))
#eval_idx2 <- setdiff(1:N1, train_idx2)
#train_data2 <- longform_km2_new[train_idx2, ]
#eval_data2 <- longform_km2_new[eval_idx2, ]
filter2 <- longform_km2_new$month %in% 1:65
train_data2 <- longform_km2_new[filter2, ]
eval_data2 <- longform_km2_new[!filter2, ]
#saveRDS(train_data1, 'train_data1.Rds')
#saveRDS(eval_data1, 'eval_data1.Rds')
train_data3 <- train_data2 %>%
arrange(index, MonthYear)
#summary(glmer(PricePerSQFT ~ cluster + MonthYear + (1 + MonthYear|City),
# data=train_data1, family=Gamma(link = "inverse")))
```
We now begin writing the solution code for the generalized mixed-effects regression.
```{r}
#packages to install
#install.packages('lme4')
library(lme4)
library(tidyr)
library(dplyr)
library(ggplot2)
library(MuMIn)
library(Metrics)
```
The details of the generalized mixed-effects model are on page 4 of the notes. Basically, the data are collected longitudinally over time, and the model combines fixed coefficients with random-effect coefficients/terms.
```{r}
#generalized mixed-effect regression
summary(rm_model <- glmer(PricePerSQFT ~ clusterTwo + MonthYear + (MonthYear|index), data=train_data3, family=Gamma(link = "identity")))
```
The prediction of PricePerSQFT is
$$g(E(y_{ij})) = \beta_{0} + \beta_{1}x_{1ij} +\beta_{2}t_{j} + \mu_{1i} + \mu_{2i}t_{j}$$
where $\beta_{0}$ is the intercept, $\beta_{1}$ is the coefficient of the cluster (clusterTwo), $\beta_{2}$ is the slope of time (MonthYear), $\mu_{1i}$ is the random intercept drawn from a normal distribution, and $\mu_{2i}$ is the random slope drawn from a normal distribution. Note: $\mu_{1i} \sim N(0, \sigma_{\mu_{1}}^2)$ and $\mu_{2i} \sim N(0, \sigma_{\mu_{2}}^2)$.
Then,
$$g(E(y_{ij})) = 0.91456 + 0.46910*x_{1ij} + 0.40133*t_{j}$$
where $g(\cdot)$ is the link function used with the Gamma family (the identity link here).
The deviance reported for the model is -251073.9, which seems reasonable.
$$\sigma_{\mu_{1}}^2=0.00279$$
$$\sigma_{\mu_{2}}^2=0.01176$$
$$\sigma^2=0.00146$$
$$\sigma_{\mu_{1}\mu_{2}}=-0.18$$
Note: $\mu_{1i}$ ~ $N(0, 0.00279)$ and $\mu_{2i}$ ~ $N(0, 0.01176)$.
Now, we'll look at the residuals of the model
```{r}
library(ggResidpanel)
resid_panel(rm_model, plots = "all", qqbands = T, qqline = T, type = "response")
```
There are some issues in the tails of the QQ plot. The Response vs. Predicted plot shows weaker predictions for larger values, based on the right side of the plot. The QQ plot and boxplot both have many outliers on both ends.
```{r}
#summary(rm_model2 <- glmer(PricePerSQFT ~ clusterFour + MonthYear + (MonthYear|index), data=train_data3, family=Gamma(link = "identity")))
```
```{r}
#summary(rm_model3 <- glmer(PricePerSQFT ~ clusterSix + MonthYear + (MonthYear|index), data=train_data3, family=Gamma(link = "identity")))
```
```{r}
resid_xpanel(rm_model, yvar = "response", type = "response")
resid_xpanel(rm_model, type = "response")
```
There is smaller variation in the residuals for cluster 1, which makes sense: the silhouette widths had many high values, so many of the observations in cluster 1 were compatible with the model. The MonthYear vs. residuals plot shows a cyclical trend, which may violate the residual assumptions since the residuals appear to be time dependent. It could be addressed with time-series methods, but we will set it aside given the scope of this report.
```{r}