---
title: |
  <center>\vspace{7cm}**Uber & Lyft Price Predictions**</center>
author: "Hai Vu"
date: 'December 14, 2022'
# date: '`r format(Sys.time(),"%d %B, %Y")`'
output:
pdf_document:
latex_engine: xelatex
# pandoc_args: --listings
# pandoc_args: --wrap=auto
toc: true
toc_depth: 3
number_sections: true
# fontsize: 11pt
indent: true
header-includes:
- \usepackage{indentfirst}
- \usepackage{ragged2e}
- \usepackage[labelformat=empty]{caption}
- \usepackage{fvextra}
- \DefineVerbatimEnvironment{Highlighting}{Verbatim}{breaklines,commandchars=\\\{\}}
- \usepackage[fontsize=12pt]{scrextend}
# - \usepackage{geometry}
# - \geometry{top=0.75in,left=3in,bottom=0.75in,right=0.80in}
# - \usepackage{listings}
# - \lstset{breaklines=true,basicstyle=\ttfamily}
---
\pagenumbering{gobble}
```{r message=FALSE, warning=FALSE, results = FALSE, include=FALSE}
# Libraries installation
install_load_package <- function(pkg){
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE, quiet = TRUE)
  sapply(pkg, function(p) suppressPackageStartupMessages(require(p, character.only = TRUE)))
}
list.of.packages <- c("ggplot2","pls","glmnet","Matrix","smbinning","ROCR","cvms","gbm","hrbrthemes",
"RCurl","curl","httr","InformationValue","car","caTools","randomForest","stringr",
"ISLR","plyr","knitr","dplyr","readxl","scales","reshape2","rpart.plot","gridExtra",
"tidyverse","tibble","RColorBrewer","kableExtra","mctest","rpart","viridis","arules",
"formatR","DescTools","ggpubr","psych","caret","Metrics","reshape","lubridate",
"Information","MASS","e1071","class","pROC","corrplot","mltools","rattle","tinytex")
install_load_package(list.of.packages)
```
```{r wrap-hook, include=F}
hook_output = knit_hooks$get('output')
knit_hooks$set(output = function(x, options) {
# this hook is used only when the linewidth option is not NULL
if (!is.null(n <- options$linewidth)) {
x = knitr:::split_lines(x)
# any lines wider than n should be wrapped
if (any(nchar(x) > n)) x = strwrap(x, width = n)
x = paste(x, collapse = '\n')
}
hook_output(x, options)
})
```
```{r setup, include=FALSE}
# knitr::opts_chunk$set(echo = TRUE, tidy.opts = list(width.cutoff = 80), tidy = TRUE)
knitr::opts_chunk$set(echo = F, tidy = F, message=F, warning=F, fig.align="center", tinytex.verbose = TRUE)#, linewidth=90)
# options(width = 100)
```
\newpage
\pagenumbering{arabic}
\thispagestyle{plain}
\mbox{}
\setcounter{tocdepth}{3}
\renewcommand{\contentsname}{TABLE OF CONTENTS}
\tableofcontents
\newpage
# I. INTRODUCTION
In this project, I utilize various statistical methodologies to perform exploratory and predictive analyses on a ride-share data set containing ride-sharing records from two companies, Uber and Lyft, in the city of Boston during the months of November and December 2018, along with the weather information at the time of each transaction. Primarily, I focus on answering 3 main questions:

* Can we predict the fare prices of Uber & Lyft accurately based on the given information?
* What sort of factors influence the changes in price, and how strong are these influences relative to one another?
* What are the main factors influencing the surge in prices, as represented by the "surge_multiplier" feature in the data set?

# II. ANALYSIS
## 1. Data cleaning & feature engineering
First, let's import the ride sharing data set. Below is the overview of the data set:
```{r, echo=T}
# Import data set and treat blank cells as NA
df_ride <- read.csv("rideshare_kaggle.csv", na.strings=c(""," ","NA"))
str(df_ride, width = 70, strict.width = "cut")
```
The imported data set contains ``r nrow(df_ride)`` observations and ``r ncol(df_ride)`` variables. At first glance, we can see that there are 2 main groups of information presented, where one is about the information of the ride (starting location, ending location, price, type of cab, etc.) and the other describes the weather conditions surrounding that specific ride. This makes sense since weather status may be a factor in determining the fare price for Uber and Lyft. Below are the descriptions of the variables in this data set:
```{r}
# Variables description
desc_names <- c(
'Hour & day & month',
'source',
'destination',
'cab_type',
'product_id',
'name',
'price',
'distance',
'surge_multiplier',
'latitude',
'longitude',
'temperature',
'apparentTemperature',
'short_summary',
'long_summary',
'precipIntensity',
'precipProbability',
'Other weather measurements',
'Timestamps of specific events')
desc_detail <- c(
'All are integer variables used to record the specific time when each transaction occurred.',
'Starting location of the ride',
'Ending location of the ride',
'Strings used to record the transaction platform, including Uber and Lyft.',
'Unique identification of the ride type',
'Strings used to record the ride type (e.g: Uber XL from Uber and Lyft XL from Lyft)',
'Fare price of the ride',
'Total distance of the ride',
'Values determining how many times the price is surged',
'Latitude of each transaction',
'Longitude of each transaction',
'Temperature at time of transaction',
'Apparent temperature at time of transaction',
'Short description of weather condition at time of transaction',
'Long description of weather condition at time of transaction',
'Indicates the amount of precipitation expected within one hour at the current intensity',
'Probability that the forecast timeframe and location will receive at least 0.01" of rain',
'humidity, wind speed, wind gust, visibility, UV index, etc.',
'highest & lowest temperature, wind gust, sunrise, sunset, etc.')
# Description table
df_info <- as.data.frame(cbind(desc_names,desc_detail))
colnames(df_info) <- c("Summarized list of variables","Description of variables")
knitr::kable(df_info, format="latex",
booktabs = T,longtable=T,linesep = "",
caption="Table 1. Description of the data set variables") %>%
kable_styling(position = "center", font_size = 10,
latex_options = c("striped","repeat_header")) %>%
column_spec(2, width = "11cm")
#df_ride2 <- df_ride %>% mutate_all(na_if,"") # remove blank cells
```
While importing the data, I replace all blank cells (if any) by NA values so that they can be examined together. I also trim any unnecessary spaces for the categorical variables:
```{r, echo=T}
# Remove blank and white space
df_ride2 <- df_ride %>% mutate_if(is.character, str_trim) # remove white space
```
Next, I check to see if there are any variables containing NA in their records:
```{r}
# NA check
col_names <- colnames(df_ride2)
df_na <- sapply(df_ride2, FUN=function(col_names){
table(is.na(col_names))
})
df_na <- t(as.data.frame(df_na))
df_na <- as.data.frame(df_na[seq(2,nrow(df_na),2),])
df_na$V2 <- ifelse(df_na$V2 == max(df_na$V1),0,df_na$V2)
df_na <- df_na[order(df_na$V2),]
df_na <- rownames_to_column(df_na)
df_na$rowname <- gsub('.Freq','',df_na$rowname)
colnames(df_na) <- c("Variables","No blank counts","Blank counts")
rownames(df_na) <- NULL
df_na$`No blank counts` <- as.numeric(df_na$`No blank counts`)
df_na$`Blank counts` <- as.numeric(df_na$`Blank counts`)
df_na[df_na$`Blank counts` > 0,]
```
After checking, it seems that "price" is the only variable containing blank values. However, the missing observations account for only ``r round(df_na[1,3]/df_na[1,2]*100,2)`` percent of the entire data set, so the invalid amount is not significant. I also want to check how these missing values are distributed across other variables, such as the type of cab:
```{r, echo=T}
# Check distribution of NA values by product type
table(df_ride2$name, is.na(df_ride2$price))
```
It seems that Taxi is the only type of ride that has no fare price information in this data set. Thus, we can go ahead and remove all of the missing observations. Using the distinct function, I also remove any potential duplicates and rename some columns to be more comprehensible:
```{r, echo=T}
# Remove blank rows
df_ride3 <- na.omit(df_ride2)
# Remove duplicates
df_ride3 <- distinct(df_ride3)
# Rename columns
df_ride3 <- df_ride3 %>%
dplyr::rename("product" = "name",
"weather" = "icon")
```
```{r}
# Duplicate check (another way)
if(nrow(df_ride3) == nrow(distinct(df_ride3))){
paste("There are no repeated values in the data set")
} else {
paste("There are",nrow(df_ride3) - nrow(distinct(df_ride3)),
"duplicated records in the data set")
}
```
Moving on, I want to check the number of unique classes within each categorical variable:
```{r}
# Check unique values
char_vars <- c(colnames(select_if(df_ride3,is.character)),"day","month")
df_char <- df_ride3[char_vars]
var_uniq <- as.data.frame(lengths(lapply(df_char,unique)))
var_uniq <- rownames_to_column(var_uniq)
colnames(var_uniq) <- c("Variable Name","Unique Count")
var_uniq <- var_uniq[order(-var_uniq$`Unique Count`),]
rownames(var_uniq) <- NULL
knitr::kable(var_uniq, "latex",
booktabs = T,longtable=T,linesep = "",
caption="Table 2. Unique counts of categorical variables") %>%
kable_styling(position = "center",
latex_options = c("striped","repeat_header"))
```
Looking at the number of unique classes, I am able to identify a number of irrelevant columns, such as id and timezone. Furthermore, I also notice several redundant columns (timestamp, product_id, and visibility.1), which are replaceable by datetime, product, and visibility, respectively. Thus, my next step is to remove these variables:
```{r, echo=T}
# Remove redundant columns
col_remove <- c("id","day","timestamp","timezone","product_id","visibility.1")
df_ride4 <- df_ride3[,!names(df_ride3) %in% col_remove]
```
After that, I want to convert the variables containing timestamp values into readable time format. I decide to use the decimal hours format, where the hours and minutes are converted into continuous decimal numbers representing a 24-hour time range; for example, 4:30 PM would become 16.5 hours. Below is the function I used to do the transformation and an overview of the data set after the time format transformation was done:
```{r, echo=T}
# Convert time stamp to decimal hours
as_dechour <- function(y){
y <- format(as.POSIXct(as_datetime(y)), format = "%H:%M")
sapply(strsplit(y,":"),
function(x) {
x <- as.numeric(x)
x[1]+x[2]/60
}
)
}
```
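As a quick illustration, the helper maps a Unix timestamp to decimal hours in the session's local time zone, so a clock time of 16:30 becomes 16.5. The chunk is left unevaluated since the result is machine-dependent, and the timestamp below is a hypothetical example:

```{r, eval=FALSE}
# Hypothetical Unix timestamp; a local time of 16:30 would yield 16.5
as_dechour(1543593600)
```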
```{r, cache=T}
# Datetime transforming
#df_ride4$datetime <- df_ride3$datetime
df_ride4$datetime <- as.POSIXlt.character(df_ride3$datetime,format="%Y-%m-%d %H:%M:%S")
df_ride4$date <- as.Date.character(df_ride4$datetime,format="%Y-%m-%d")
df_ride4$month <- unclass(df_ride4$datetime)$mon + 1 # POSIXlt months are 0-based (Jan = 0)
df_ride4$day_of_week <- weekdays(df_ride4$datetime) #(df_ride2$datetime)$wday
df_ride4$day_of_month <- unclass(df_ride4$datetime)$mday
#df_ride4$timestamp <- as_datetime(df_ride3$timestamp)
df_ride4$hour_of_day <- as_dechour(df_ride3$timestamp)
df_ride4$windGustTime <- as_dechour(df_ride3$windGustTime)
df_ride4$temperatureHighTime <- as_dechour(df_ride3$temperatureHighTime)
df_ride4$temperatureLowTime <- as_dechour(df_ride3$temperatureLowTime)
df_ride4$temperatureMaxTime <- as_dechour(df_ride3$temperatureMaxTime)
df_ride4$temperatureMinTime <- as_dechour(df_ride3$temperatureMinTime)
df_ride4$apparentTemperatureHighTime <- as_dechour(df_ride3$apparentTemperatureHighTime)
df_ride4$apparentTemperatureLowTime <- as_dechour(df_ride3$apparentTemperatureLowTime)
df_ride4$apparentTemperatureMaxTime <- as_dechour(df_ride3$apparentTemperatureMaxTime)
df_ride4$apparentTemperatureMinTime <- as_dechour(df_ride3$apparentTemperatureMinTime)
df_ride4$sunriseTime <- as_dechour(df_ride3$sunriseTime)
df_ride4$sunsetTime <- as_dechour(df_ride3$sunsetTime)
df_ride4$uvIndexTime <- as_dechour(df_ride3$uvIndexTime)
df_ride4$day_of_week <- factor(df_ride4$day_of_week,
levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
```
```{r, echo=T}
# Check current look
str(df_ride4, width = 70, strict.width = "cut")
```
## 2. Predict fare price using linear regression model
### 2.1. Feature selection
Since our target variable is the fare price, I want to examine its distribution for both Uber and Lyft:
```{r}
boxplot(df_ride4$price~df_ride4$cab_type,col=c("lightblue","pink"),
main="Figure 1. Boxplots of fare price by company",
xlab="Company",ylab="Fare Price")
```
From figure 1, we can see that there are multiple outliers towards the upper (or right) tail of the distribution. Since linear regression results can be skewed by outliers, I will remove these outliers using the Tukey Method (for each group Uber and Lyft separately) with the hope of improving the model accuracy:
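Under the Tukey method, an observation is treated as an outlier when it falls outside the fences

$$\left[\,Q_1 - 1.5\times\mathrm{IQR},\;\; Q_3 + 1.5\times\mathrm{IQR}\,\right], \qquad \mathrm{IQR} = Q_3 - Q_1,$$

where $Q_1$ and $Q_3$ are the first and third quartiles of the fare price, computed separately for each company: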
```{r, echo=T}
# Outliers imputation
df_uber <- subset(df_ride4,df_ride4$cab_type == "Uber")
df_lyft <- subset(df_ride4,df_ride4$cab_type == "Lyft")
uber_q25 <- as.numeric(quantile(df_uber$price,.25))
uber_q75 <- as.numeric(quantile(df_uber$price,.75))
uber_iqr <- uber_q75 - uber_q25
lyft_q25 <- as.numeric(quantile(df_lyft$price,.25))
lyft_q75 <- as.numeric(quantile(df_lyft$price,.75))
lyft_iqr <- lyft_q75 - lyft_q25
df_uber$price <- ifelse(df_uber$price > uber_q75 + uber_iqr*1.5 | df_uber$price < uber_q25 - uber_iqr*1.5, NA, df_uber$price)
df_lyft$price <- ifelse(df_lyft$price > lyft_q75 + lyft_iqr*1.5 | df_lyft$price < lyft_q25 - lyft_iqr*1.5, NA, df_lyft$price)
df_uber2 <- na.omit(df_uber)
df_lyft2 <- na.omit(df_lyft)
```
After removing the outliers, the fare price distribution looks like the following:
```{r}
# Plot distribution changes
df_combine <- rbind(df_uber2,df_lyft2)
#df_combine$trip <- paste0(df_combine$source," => ",df_combine$destination)
boxplot(df_combine$price~df_combine$cab_type,col=c("lightblue","pink"),
main="Figure 2. Boxplots of fare price by company (without outliers)",
xlab="Company",ylab="Fare Price")
```
Next, I want to examine the correlation level between some of the numerical variables. The following correlation matrix below presents that information:
```{r}
# Correlation matrix
all_num <- names(select_if(df_combine,is.numeric))
num_vars <- c("price","distance","precipIntensity","precipProbability","temperature","apparentTemperature",
"humidity","windSpeed","visibility","cloudCover","dewPoint","hour_of_day","ozone")
num_vars2 <- c("price","distance","temperature","windSpeed","visibility","cloudCover","ozone","hour_of_day")
remove <- c("surge_multiplier","hour","month","latitude","longitude","day_of_month","windGust","windBearing","uvIndex",
num_vars[!num_vars %in% num_vars2],
all_num[grepl(c("Time"), all_num, fixed = TRUE) |
grepl(c("apparent"), all_num, fixed = TRUE) |
grepl(c("Min"), all_num, fixed = TRUE) |
grepl(c("Max"), all_num, fixed = TRUE) |
grepl(c("High"), all_num, fixed = TRUE) |
grepl(c("Low"), all_num, fixed = TRUE) ])
pick_num <- all_num[!all_num %in% remove]
cors <- cor(df_ride4[,c(num_vars,"surge_multiplier")], use="pairwise")
corrplot(cors,type="upper",mar=c(0,0,1.5,0),method="color",
tl.cex=0.8,
#col=brewer.pal(8,"RdYlBu"),
addCoef.col = 1,number.digits = 1,number.cex=0.55,
title="Figure 3. Correlation matrix of numerical values")
```
From figure 3, I was able to spot several highly correlated groups of variables:

* Group 1: humidity, dewPoint, visibility, precipIntensity, precipProbability
* Group 2: temperature, apparentTemperature, dewPoint

To avoid multicollinearity, I want to pick only one feature from each group, such that the picked features are not highly correlated with one another. Thus, I pick visibility and temperature, which I also believe may have an impact on the pricing of Uber & Lyft rides. The remaining numerical features are not strongly correlated with one another, so I decide to use all of them.
As for the categorical variables, I believe the starting location of the trip influences the fare price more than the destination, because the supply of Uber/Lyft drivers around the pick-up location can greatly affect pricing (fewer drivers mean more expensive rides), whereas the traffic around the destination matters less. The other information that "destination" could offer is an indication of how far the ride is, but we already have "distance" as one of our features; thus, I disregard the "destination" feature in this case. I also notice that there are 3 different features describing the weather conditions, so I want to pick only one of them. Based on table 2, "weather" has the fewest sub-categories, so each sub-category may have a more significant influence on the fare price. Apart from weather, the product type also greatly determines the fare price, as I already know from my own experience using both companies' ride-share services. Lastly, different days of the week may also tend to have different prices; for instance, rides may be less popular on weekends because people stay home more. In the end, my chosen categorical variables are: cab_type, product, source, weather, and day_of_week.
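As a cross-check on the manual grouping above, caret's findCorrelation() can flag numerical columns whose pairwise correlations exceed a chosen cutoff; a sketch, where the 0.8 cutoff is my own illustrative choice rather than part of the analysis:

```{r, eval=FALSE}
# Flag candidate redundant features at |correlation| > 0.8;
# df_combine and num_vars are defined in the chunks above
cor_mat <- cor(df_combine[, num_vars], use = "pairwise")
caret::findCorrelation(cor_mat, cutoff = 0.8, names = TRUE)
```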
After picking my features, I convert the categorical variables into the factor data type and prepare to partition the data. The following is the current structure of the data set to be used for model construction:
```{r}
# Pick relevant features
char_vars <- c(names(select_if(df_combine,is.character)))
char_vars <- char_vars[!char_vars %in% c("short_summary","long_summary","trip","destination")]
df_combine2 <- df_combine[,c(char_vars,pick_num,"day_of_week")]
# Convert categorical to factor
for (i in char_vars){
df_combine2[,i] <- as.factor(df_combine2[,i])
}
str(df_combine2, width = 70, strict.width = "cut")
```
I also want to subset the data set into 2 sets, each containing data on only one company (either Uber or Lyft). I will then construct a separate linear regression model for each of them to predict the fare price of each company independently. Therefore, the "cab_type" variable is no longer needed.
```{r, echo=T}
# Subset by cab_type
df_uber3 <- subset(df_combine2, df_combine2$cab_type == "Uber")
df_uber3$cab_type <- NULL
df_lyft3 <- subset(df_combine2, df_combine2$cab_type == "Lyft")
df_lyft3$cab_type <- NULL
```
### 2.2. Data partitioning and standardization
Using the 80/20 ratio, I split the data set into the training and testing sets for both Uber and Lyft records:
```{r, echo=T}
# Partition data set using 80/20 split
set.seed(888)
# Method 1:
uber_trainRatio <- createDataPartition(df_uber3$price,p=0.8,list=F,times=1)
uber_train <- df_uber3[uber_trainRatio,]
uber_test <- df_uber3[-uber_trainRatio,]
lyft_trainRatio <- createDataPartition(df_lyft3$price,p=0.8,list=F,times=1)
lyft_train <- df_lyft3[lyft_trainRatio,]
lyft_test <- df_lyft3[-lyft_trainRatio,]
```
The following table shows the size of each data set:
```{r}
# Dimension of each set
dim_traintest <- data.frame(dim(uber_train),dim(uber_test),dim(lyft_train),dim(lyft_test))
row.names(dim_traintest) <- c("No. of Records","No. of Features")
colnames(dim_traintest) <- c("Uber Training Set","Uber Testing Set","Lyft Training Set","Lyft Testing Set")
knitr::kable(dim_traintest, format="latex",
booktabs = T,longtable=T,linesep = "",
caption="Table 3. Dimension of training and testing set for the linear regression model") %>%
kable_styling(position = "center", font_size = 10,
latex_options = c("striped","repeat_header")) #%>%
#column_spec(2, width = "11cm")
```
Since we will later compare the regression coefficients across predictors, it is useful to standardize the numerical values so they stay relatively within the same range. Thus, I scale the numerical values using the scale() function provided in R.
```{r, echo=T}
# Normalize numerical values in each set
num_vars3 <- pick_num[!pick_num %in% c("price")]
uber_train[num_vars3] <- scale(uber_train[num_vars3])
lyft_train[num_vars3] <- scale(lyft_train[num_vars3])
```
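One caveat: the chunk above scales only the training sets. To score the test sets on the same footing without leaking test information, the test columns should be centered and scaled using the means and standard deviations learned from the training data. A sketch, left unevaluated so the original results are unchanged:

```{r, eval=FALSE}
# Apply the training sets' centering/scaling parameters to the test sets
scale_with_train <- function(test_df, train_df) {
  sc <- scale(train_df)
  scale(test_df,
        center = attr(sc, "scaled:center"),
        scale  = attr(sc, "scaled:scale"))
}
uber_test[num_vars3] <- scale_with_train(uber_test[num_vars3],
                                         df_uber3[uber_trainRatio, num_vars3])
lyft_test[num_vars3] <- scale_with_train(lyft_test[num_vars3],
                                         df_lyft3[lyft_trainRatio, num_vars3])
```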
### 2.3. Construct linear regression models to forecast Lyft & Uber prices
I create the first linear regression model to predict the Uber prices using all of my chosen features in the previous part:
```{r, echo=T, cache=T}
# Construct linear model to predict Uber prices
lm_uber1 <- lm(data=uber_train,price~.)
summary(lm_uber1)
vif(lm_uber1)
```
The R-squared of the Uber training model is ``r round(summary(lm_uber1)$r.squared,4)``, which indicates a reasonably good fit. Examining the Variance Inflation Factors (VIF), "weather" has the highest VIF values; thus, I want to remove this feature to avoid multicollinearity.
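For reference, the VIF of predictor $j$ is derived from the $R_j^2$ obtained by regressing that predictor on all the other predictors:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

so a VIF above 5 corresponds to $R_j^2 > 0.8$, a common warning threshold for multicollinearity.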
Next, I build another regression model using stepwise regression because I want to keep only the significant predictors and disregard the others:
```{r, echo=T, cache=T}
# Stepwise regression for Uber
step_uber1 <- step(lm(data=uber_train,price~.-weather), direction = "both", trace = FALSE)
summary(step_uber1)
vif(step_uber1)
```
The Uber stepwise regression model returns a similar R-squared of ``r round(summary(step_uber1)$r.squared,4)``; however, the VIF values are now all below 5, indicating no multicollinearity. Since the predictor values were scaled beforehand, we can compare the regression coefficients among the predictors to explore how big an influence each of them has on the fare price. The following table shows the variable importance for determining the Uber fare price:
```{r}
# Variable importance for Uber prices
uber_coef <- cbind(as.data.frame(summary(step_uber1)$coefficients[,1]),
as.data.frame(summary(step_uber1)$coefficients[,4]))
colnames(uber_coef) <- c("Coefficients","P-values")
uber_coef <- uber_coef[order(abs(uber_coef$Coefficients),decreasing=T),,drop=F]
uber_coef <- rownames_to_column(uber_coef, var="Predictors")
colnames(uber_coef) <- c("Predictors","Coefficients","P-values")
uber_coef <- uber_coef[uber_coef$Predictors != "(Intercept)",]
#uber_coef$Significance <- ifelse(uber_coef$`P-values` < 0.05,"Yes","No")
rownames(uber_coef) <- c(1:nrow(uber_coef))
knitr::kable(uber_coef, format="latex",
booktabs = T,longtable=T,linesep = "",
caption="Table 4. Features regression coefficients from the linear model predicting Uber prices") %>%
kable_styling(position = "center",
latex_options = c("striped","repeat_header"))
```
From here, it seems that product, unsurprisingly, has the biggest influence on the Uber fare prices, followed by distance and source. On the other hand, at a confidence level of 0.95 (alpha = 0.05), cloudCover (an index representing the density and thickness of clouds) is the only insignificant factor in this case.
```{r}
# Generate predictions for Uber prices
pred_uber_train <- predict(step_uber1, uber_train, interval="predict",level=0.95)
pred_uber_test <- predict(step_uber1, uber_test, interval="predict",level=0.95)
# R-squared:
RSQUARE = function(y_actual,y_predict){
a <- cor(y_actual,y_predict)^2
return(a)
}
uber_train_rmse <- Metrics::rmse(pred_uber_train[,1], uber_train$price) # RMSE
uber_train_mae <- Metrics::mae(pred_uber_train[,1], uber_train$price) # MAE
uber_train_mape <- Metrics::mape(uber_train$price, pred_uber_train[,1]) # MAPE (actual, predicted)
uber_train_r_squared <- RSQUARE(uber_train$price,pred_uber_train[,1])
uber_test_rmse <- Metrics::rmse(pred_uber_test[,1], uber_test$price) # RMSE
uber_test_mae <- Metrics::mae(pred_uber_test[,1], uber_test$price) # MAE
uber_test_mape <- Metrics::mape(uber_test$price, pred_uber_test[,1]) # MAPE (actual, predicted)
uber_test_r_squared <- RSQUARE(uber_test$price,pred_uber_test[,1])
```
Similarly, I create another linear model to predict the Lyft fare prices:
```{r, echo=T, cache=T}
# Construct linear model to predict Lyft prices
lm_lyft1 <- lm(data=lyft_train,price~.)
summary(lm_lyft1)
vif(lm_lyft1)
```
The Lyft linear regression model returns a higher R-squared of ``r round(summary(lm_lyft1)$r.squared,4)`` compared to the Uber model, indicating a better fit. However, "weather" once again attains a high VIF value and thus needs to be removed. Using the stepwise regression method, I remove weather along with other insignificant factors.
```{r, echo=T, cache=T}
# Stepwise regression for Lyft
step_lyft1 <- step(lm(data=lyft_train,price~.-weather), direction = "both", trace = FALSE)
summary(step_lyft1)
vif(step_lyft1)
```
Again, the R-squared value of the stepwise model barely changes, but now only the significant variables are kept and their VIF values are all below 5. Next, I compare the regression coefficients of all features to see their impacts on the Lyft prices:
```{r}
# Variable importance for Lyft prices
lyft_coef <- cbind(as.data.frame(summary(step_lyft1)$coefficients[,1]),
as.data.frame(summary(step_lyft1)$coefficients[,4]))
colnames(lyft_coef) <- c("Coefficients","P-values")
lyft_coef <- lyft_coef[order(abs(lyft_coef$Coefficients),decreasing=T),,drop=F]
lyft_coef <- rownames_to_column(lyft_coef, var="Predictors")
colnames(lyft_coef) <- c("Predictors","Coefficients","P-values")
lyft_coef <- lyft_coef[lyft_coef$Predictors != "(Intercept)",]
#lyft_coef$Significance <- ifelse(lyft_coef$`P-values` < 0.05,"Yes","No")
rownames(lyft_coef) <- c(1:nrow(lyft_coef))
knitr::kable(lyft_coef, format="latex",
booktabs = T,longtable=T,linesep = "",
caption="Table 5. Features regression coefficients from the linear model predicting Lyft prices") %>%
kable_styling(position = "center",
latex_options = c("striped","repeat_header"))
```
Similar to table 4, product, distance, and source are once again determined to be the top 3 significant factors influencing fare prices by the stepwise linear regression model.
```{r}
# Generate predictions for Lyft prices
pred_lyft_train <- predict(step_lyft1, lyft_train, interval="predict",level=0.95)
pred_lyft_test <- predict(step_lyft1, lyft_test, interval="predict",level=0.95)
lyft_train_rmse <- Metrics::rmse(pred_lyft_train[,1], lyft_train$price) # RMSE
lyft_train_mae <- Metrics::mae(pred_lyft_train[,1], lyft_train$price) # MAE
lyft_train_mape <- Metrics::mape(lyft_train$price, pred_lyft_train[,1]) # MAPE (actual, predicted)
lyft_train_r_squared <- RSQUARE(lyft_train$price,pred_lyft_train[,1])
lyft_test_rmse <- Metrics::rmse(pred_lyft_test[,1], lyft_test$price) # RMSE
lyft_test_mae <- Metrics::mae(pred_lyft_test[,1], lyft_test$price) # MAE
lyft_test_mape <- Metrics::mape(lyft_test$price, pred_lyft_test[,1]) # MAPE (actual, predicted)
lyft_test_r_squared <- RSQUARE(lyft_test$price,pred_lyft_test[,1])
```
### 2.4. Linear model performance
After constructing the regression models, I generate the price predictions using both the training and testing set. The following table presents the performances of those models using different metrics:
```{r}
# Compare table:
lm_compare <- as.data.frame(cbind(rbind(uber_train_rmse,uber_train_mae,uber_train_mape,uber_train_r_squared),
rbind(uber_test_rmse,uber_test_mae,uber_test_mape,uber_test_r_squared),
rbind(lyft_train_rmse,lyft_train_mae,lyft_train_mape,lyft_train_r_squared),
rbind(lyft_test_rmse,lyft_test_mae,lyft_test_mape,lyft_test_r_squared)))
lm_compare <-
lm_compare %>%
mutate_all(funs(round(.,3)))
colnames(lm_compare) <- c("Uber Training","Uber Testing","Lyft Training","Lyft Testing")
rownames(lm_compare) <- c("RMSE","MAE","MAPE","R-Squared")
knitr::kable(lm_compare, "latex",
booktabs = T,longtable=T,linesep = "",
caption="Table 6. Linear regression model predictive performance between training and testing") %>%
kable_styling(position = "center", font_size = 10,
latex_options = c("striped","repeat_header"))
```
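For reference, with actual prices $y_i$, predictions $\hat{y}_i$, and $n$ observations, the error metrics reported in table 6 are defined as:

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2},\qquad \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|,\qquad \mathrm{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right|.$$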
From table 6, we can see that the Mean Absolute Percentage Errors (MAPE) are still rather high on the testing sets (around 26% to 28% error) compared to the training sets (around 10% to 13% error). The R-squared of the Uber price model is worse on the testing set than on the training set; however, the R-squared of the Lyft model actually improves slightly, indicating that the model does not overfit. The following scatter plots help visualize the goodness of fit between our model predictions and the actual fare prices of both companies:
```{r}
# Visualize prediction accuracy
par(mfcol=c(1,2),mai=c(1,1.2,0.7,0.3))
base::plot(x=pred_uber_test[,1],y=uber_test$price,col="blue",
ylab="Uber price observations",xlab=paste("Uber price predictions","\n"),
main="",
#sub=expression(italic("Unit: Thousand USD")),
xlim=c(0,30),ylim=c(0,30))
abline(a=0,b=1,col="red",lwd=2)
plot(x=pred_lyft_test[,1],y=lyft_test$price,col="darkred",
ylab="Lyft price observations",xlab=paste("Lyft price predictions","\n"),
main="",
#sub=expression(italic("Unit: Thousand USD")),
xlim=c(0,30),ylim=c(0,30))
abline(a=0,b=1,col="red",lwd=2)
mtext(expression(bold("Figure 4. Rideshare fare price observations vs linear predictions")),
side = 3, line = -1.8, outer = TRUE, cex=1.1)
mtext(expression(italic("Unit: USD")),
side = 1, line = -1.5, outer = TRUE, cex=1)
```
From figure 4, we can see that the linear models do not fit this data set very well, and large margins of error remain despite the removal of outliers. For that reason, I try other predictive methods in the hope of predicting fare prices with higher accuracy.
## 3. Predict fare price using decision tree model
### 3.1. Decision tree model construction
Decision tree models are robust to outliers, so I no longer remove the fare price outliers when training this model. Using a similar 80/20 split, I create training and testing data sets for both Uber and Lyft. The following table shows the size of each set:
```{r, cache=F}
# Partition unscaled data
set.seed(888)
df_uber4 <- subset(df_ride4,df_ride4$cab_type == "Uber")[,c(char_vars,pick_num,"destination","day_of_week")]
df_lyft4 <- subset(df_ride4,df_ride4$cab_type == "Lyft")[,c(char_vars,pick_num,"destination","day_of_week")]
uber_trainRatio2 <- createDataPartition(df_uber4$price,p=0.8,list=F,times=1)
lyft_trainRatio2 <- createDataPartition(df_lyft4$price,p=0.8,list=F,times=1)
uber_train2 <- df_uber4[uber_trainRatio2,]
uber_test2 <- df_uber4[-uber_trainRatio2,]
lyft_train2 <- df_lyft4[lyft_trainRatio2,]
lyft_test2 <- df_lyft4[-lyft_trainRatio2,]
# Dimension of each set
dim_traintest2 <- data.frame(dim(uber_train2),dim(uber_test2),dim(lyft_train2),dim(lyft_test2))
row.names(dim_traintest2) <- c("No. of Records","No. of Features")
colnames(dim_traintest2) <- c("Uber Training Set","Uber Testing Set","Lyft Training Set","Lyft Testing Set")
knitr::kable(dim_traintest2, "latex",
booktabs = T,longtable=T,linesep = "",
caption="Table 7. Dimension of training and testing set for the decision tree model") %>%
kable_styling(position = "center", font_size = 10,
latex_options = c("striped","repeat_header"))
```
First, I create a decision tree to predict Uber prices, with a complexity parameter (CP) value of 0.001:
```{r, echo=T, cache=T}
# Regression Trees for Uber
uber_tree1 <- rpart(price ~., data = uber_train2, method = "anova", cp = 0.001)
printcp(uber_tree1)
```
The tree created for Uber used only two variables, distance and product, to make 18 splits, yielding 19 terminal nodes. Similarly, I create another tree to predict Lyft prices with CP = 0.001:
```{r, echo=T, cache=T}
# Regression Trees for Lyft
lyft_tree1 <- rpart(price ~., data = lyft_train2, method = "anova", cp = 0.001)
printcp(lyft_tree1)
```
Again, the decision tree model chooses only the distance and product features, but for Lyft it creates 20 splits and 21 terminal nodes. The following figure shows how the cross-validated error of both models declines as the value of CP decreases:
```{r}
# Plot of cp vs errors
par(mfcol=c(1,2),mai=c(1,0.9,1.5,0.4))
plotcp(uber_tree1, main=paste("Uber Tree","\n","\n"),
col="red",lty=2,cex.axis=0.8)
plotcp(lyft_tree1, main=paste("Lyft Tree","\n","\n"),
col="red",lty=2,cex.axis=0.8)
mtext(expression(bold("Figure 6. Cross-validated errors and complexity parameter")),
side = 3, line = -1.8, outer = TRUE, cex=1.1)
```
```{r, echo=T}
# Pruned tree
best_cp_uber <- uber_tree1$cptable[which.min(uber_tree1$cptable[,"xerror"]),"CP"]
best_cp_lyft <- lyft_tree1$cptable[which.min(lyft_tree1$cptable[,"xerror"]),"CP"]
best_cp_uber==best_cp_lyft
```
The complexity parameter value that generates the lowest cross-validated error is the same for both the Uber and Lyft models: `r best_cp_uber`.
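The trees can then be pruned back to this CP with `rpart::prune()`. The sketch below is self-contained and illustrative (not evaluated here), using a built-in data set; the rideshare trees above would be pruned the same way:

```{r, eval=FALSE}
library(rpart)
# Fit a regression tree, find the CP minimizing cross-validated error,
# and prune the tree back to that CP
fit <- rpart(mpg ~ ., data = mtcars, method = "anova", cp = 0.001)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit, cp = best_cp)
```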
Next, I examine the relative importance of the features as determined by the decision tree model. In decision trees, a variable may appear in the tree many times, either as the primary variable in a split or as a surrogate variable in a decision node. "An overall measure of variable importance is the sum of the goodness of split measures for each split for which it was the primary variable, plus goodness * (adjusted agreement) for all splits in which it was a surrogate" (Therneau & Atkinson, 2022). In regression trees, the goodness of a split is the reduction in the sum of squared errors it achieves. This measure is then scaled to sum to 100%, representing the proportion of influence each variable has on the dependent variable. The figure below visualizes these importance measures across features:
```{r}
# Feature Importance
uber_imp <- as.data.frame(uber_tree1$variable.importance)
uber_imp$Perc.Imp <- uber_imp$`uber_tree1$variable.importance`/sum(uber_imp$`uber_tree1$variable.importance`)
lyft_imp <- as.data.frame(lyft_tree1$variable.importance)
lyft_imp$Perc.Imp <- lyft_imp$`lyft_tree1$variable.importance`/sum(lyft_imp$`lyft_tree1$variable.importance`)
uber_imp <- uber_imp %>%
tibble::rownames_to_column() %>%
dplyr::rename("Variables" = rowname) %>%
dplyr::arrange(Perc.Imp) %>%
dplyr::mutate(Variables = forcats::fct_inorder(Variables))
colnames(uber_imp)[2] <- "Variable Importance"
lyft_imp <- lyft_imp %>%
tibble::rownames_to_column() %>%
dplyr::rename("Variables" = rowname) %>%
dplyr::arrange(Perc.Imp) %>%
dplyr::mutate(Variables = forcats::fct_inorder(Variables))
colnames(lyft_imp)[2] <- "Variable Importance"
```
```{r}
# Feature importance plot
plot1 <- ggplot(uber_imp) +
geom_col(aes(x = Variables, y = Perc.Imp*100, fill=Variables), show.legend = F) +
coord_flip() + #scale_fill_brewer(palette="Spectral")
labs(title=paste0("Uber fare price")) +
ylab("Relative Importance (%)") + xlab(paste("")) + theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
plot2 <- ggplot(lyft_imp) +
geom_col(aes(x = Variables, y = Perc.Imp*100, fill=Variables), show.legend = F) +
coord_flip() + scale_fill_viridis(discrete = TRUE) +
labs(title=paste0("Lyft fare price")) +
ylab("Relative Importance (%)") + xlab(paste("")) + theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
plot3 <- ggarrange(plot1, plot2, ncol=2, nrow=1)
annotate_figure(plot3, top = text_grob("Figure 7. Variable importance from regression trees",
color = "black", face = "bold", size = 14))
```
From figure 7, we can see that the regression trees identify product as the most important factor affecting fare prices, followed by distance, destination and source, for both Uber and Lyft. Other features appear to have a negligible impact on ride-share prices.
### 3.2. Decision tree model performance
After constructing the regression trees, I use these models to generate price predictions on both the training and testing sets for Uber and Lyft. The table below summarizes the predictive accuracy of the model outputs:
```{r}
# Fit with training and testing set
uber_tree_train <- as.data.frame(predict(uber_tree1,newdata = uber_train2))
uber_tree_test <- as.data.frame(predict(uber_tree1,newdata = uber_test2))
uber_tree_train <- uber_tree_train[,1]
uber_tree_test <- uber_tree_test[,1]
lyft_tree_train <- as.data.frame(predict(lyft_tree1,newdata = lyft_train2))
lyft_tree_test <- as.data.frame(predict(lyft_tree1,newdata = lyft_test2))
lyft_tree_train <- lyft_tree_train[,1]
lyft_tree_test <- lyft_tree_test[,1]
# Calculate accuracy metrics
uber_tree_train_rmse <- Metrics::rmse(uber_train2$price, uber_tree_train) # RMSE
uber_tree_train_mae <- Metrics::mae(uber_train2$price, uber_tree_train) # MAE
uber_tree_train_mape <- Metrics::mape(uber_train2$price, uber_tree_train) # MAPE; Metrics expects (actual, predicted)
uber_tree_train_r_squared <- RSQUARE(uber_train2$price, uber_tree_train)
uber_tree_test_rmse <- Metrics::rmse(uber_test2$price, uber_tree_test) # RMSE
uber_tree_test_mae <- Metrics::mae(uber_test2$price, uber_tree_test) # MAE
uber_tree_test_mape <- Metrics::mape(uber_test2$price, uber_tree_test) # MAPE
uber_tree_test_r_squared <- RSQUARE(uber_test2$price, uber_tree_test)
lyft_tree_train_rmse <- Metrics::rmse(lyft_train2$price, lyft_tree_train) # RMSE
lyft_tree_train_mae <- Metrics::mae(lyft_train2$price, lyft_tree_train) # MAE
lyft_tree_train_mape <- Metrics::mape(lyft_train2$price, lyft_tree_train) # MAPE
lyft_tree_train_r_squared <- RSQUARE(lyft_train2$price, lyft_tree_train)
lyft_tree_test_rmse <- Metrics::rmse(lyft_test2$price, lyft_tree_test) # RMSE
lyft_tree_test_mae <- Metrics::mae(lyft_test2$price, lyft_tree_test) # MAE
lyft_tree_test_mape <- Metrics::mape(lyft_test2$price, lyft_tree_test) # MAPE
lyft_tree_test_r_squared <- RSQUARE(lyft_test2$price, lyft_tree_test)
```
```{r}
# Compare table:
tree_compare <- as.data.frame(
cbind(rbind(uber_tree_train_rmse,uber_tree_train_mae,uber_tree_train_mape,uber_tree_train_r_squared),
rbind(uber_tree_test_rmse,uber_tree_test_mae,uber_tree_test_mape,uber_tree_test_r_squared),
rbind(lyft_tree_train_rmse,lyft_tree_train_mae,lyft_tree_train_mape,lyft_tree_train_r_squared),
rbind(lyft_tree_test_rmse,lyft_tree_test_mae,lyft_tree_test_mape,lyft_tree_test_r_squared)))
tree_compare <- tree_compare %>%
  mutate(across(everything(), ~ round(.x, 3)))
colnames(tree_compare) <- c("Uber Training","Uber Testing","Lyft Training","Lyft Testing")
rownames(tree_compare) <- c("RMSE","MAE","MAPE","R-Squared")
knitr::kable(tree_compare, "latex",
booktabs = T,longtable=T,linesep = "",
caption="Table 8. Regression tree model predictive performance between training and testing") %>%
kable_styling(position = "center", font_size=10,
latex_options = c("striped","repeat_header"))
```
From table 8, we can observe that the decision tree models achieve higher R-squared values when predicting both Uber and Lyft prices, along with lower overall RMSE and MAPE. The testing predictions are also slightly more accurate than the training ones: R-squared increases slightly while MAPE remains similar or lower. The following scatter plots help visualize how well the predictions fit the observed fare prices:
```{r}
# Visualize prediction accuracy
par(mfcol=c(1,2),mai=c(1,1.2,0.7,0.3))
base::plot(x=uber_tree_test,y=uber_test2$price,col="blue",
ylab="Uber price observations",xlab=paste("Uber price predictions","\n"),
main="",
#sub=expression(italic("Unit: Thousand USD")),
xlim=c(0,60),ylim=c(0,60))
abline(a=0,b=1,col="red",lwd=2)
plot(x=lyft_tree_test,y=lyft_test2$price,col="darkred",
ylab="Lyft price observations",xlab=paste("Lyft price predictions","\n"),
main="",
#sub=expression(italic("Unit: Thousand USD")),
xlim=c(0,60),ylim=c(0,60))
abline(a=0,b=1,col="red",lwd=2)
mtext(expression(bold("Figure 8. Rideshare fare price observations vs regression tree predictions")),
side = 3, line = -1.8, outer = TRUE, cex=1.1)
mtext(expression(italic("Unit: USD")),
side = 1, line = -1.5, outer = TRUE, cex=1)
```
From figure 8, we see that the points are still not tightly clustered around the diagonal red lines, similar to what we saw in figure 4. This indicates that the price predictions made by the regression trees, despite improving on the linear regression models, remain quite far from the actual observations.
## 4. Predict whether price will surge using logistic regression model
In this section we focus on a new variable of interest, "surge_multiplier". From my understanding, actual price = standard price * surge multiplier. Thus, this variable determines whether the actual price is surged (multiplier > 1) or not (multiplier = 1). First, we examine the distribution of "surge_multiplier" across the two companies, Lyft and Uber:
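As a quick illustration of that relationship (the figures below are hypothetical, not drawn from the data set):

```{r, eval=FALSE}
standard_price <- 10       # hypothetical base fare in USD
surge_multiplier <- 1.5    # 1 means no surge; > 1 means the fare is surged
actual_price <- standard_price * surge_multiplier  # 15 USD
```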
```{r}
# Distribution of surge multiplier
# str(df_ride4, give.attr = F, width = 70, strict.width = "cut")
table(df_ride4$cab_type,df_ride4$surge_multiplier)
```
From here, we can see that surge multipliers greater than 1 only appear for Lyft, so we focus solely on predicting price surges for Lyft with this data set. In addition, "surge_multiplier" = 1 accounts for the vast majority of the data, while the highest surge is 3 times the standard price. Next, we subset the original data set into a smaller set containing only Lyft's data and create a new categorical target feature "price_surge" based on the numerical values of "surge_multiplier":
```{r, echo=T}
# Subset original data for only Lyft's data
df_lyft_surge <- df_ride4[df_ride4$cab_type=="Lyft",c(char_vars,pick_num,
"day_of_week",
"surge_multiplier")]
df_lyft_surge <- df_lyft_surge[,!names(df_lyft_surge) %in% "cloudCover"]
df_lyft_surge$cab_type <- NULL
# Create new target variable lyft_surge and remove irrelevant ones
df_lyft_surge$price_surge <- if_else(df_lyft_surge$surge_multiplier > 1,1,0)
char_vars_lyft <- names(select_if(df_lyft_surge,is.character))
df_lyft_surge$surge_multiplier <- NULL
df_lyft_surge$price <- NULL
# Convert categorical to factor
for (i in char_vars_lyft){
df_lyft_surge[,i] <- as.factor(df_lyft_surge[,i])
}
# Scale numerical features
num_vars_lyft <- names(select_if(df_lyft_surge,is.numeric))
num_vars_lyft <- num_vars_lyft[!num_vars_lyft %in% c("price_surge")]
df_lyft_surge2 <- df_lyft_surge
df_lyft_surge2[num_vars_lyft] <- scale(df_lyft_surge[num_vars_lyft])
# Overview of current data set
str(df_lyft_surge2, width = 70, strict.width = "cut")
```
### 4.1. Remove class bias with undersampling
The current distribution of the new target feature "price_surge" is as follows:
```{r}
table(df_lyft_surge2$price_surge)
```
We can see that the classes of the dependent variable are currently unbalanced, with "0" (no surge) heavily outnumbering "1" (surge). Thus, we use undersampling to balance the classes:
```{r, echo=T}
# Balance classes of dependent variable
df_surge_freq <- as.data.frame(t(as.data.frame(table(df_lyft_surge2$price_surge))))
rownames(df_surge_freq) <- c("Price-Surge categories","Frequency")
colnames(df_surge_freq) <- c("Not-surge","Surge")
df_surge_freq$`Not-surge` <- as.numeric(df_surge_freq$`Not-surge`)
df_surge_freq$Surge <- as.numeric(df_surge_freq$Surge)
df_surge <- df_lyft_surge2[df_lyft_surge2$price_surge == 1,]
df_nosurge <- df_lyft_surge2[df_lyft_surge2$price_surge == 0,]
no_bias_row <- min(nrow(df_surge),nrow(df_nosurge))
df_nosurge_sampled <- head(df_nosurge, no_bias_row)
df_col_equal <- rbind(df_nosurge_sampled, df_surge)
table(df_col_equal$price_surge)
```
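Note that `head()` keeps the first rows of the majority class, so the retained observations depend on row order. Random undersampling avoids that bias. A minimal, self-contained sketch on a toy data frame (not evaluated here; the same pattern applies to the Lyft data above):

```{r, eval=FALSE}
# Toy illustration of random undersampling of the majority class
set.seed(888)
toy <- data.frame(price_surge = c(rep(0, 100), rep(1, 20)), x = rnorm(120))
minority_n <- min(table(toy$price_surge))
idx0 <- sample(which(toy$price_surge == 0), minority_n)
toy_balanced <- rbind(toy[idx0, ], toy[toy$price_surge == 1, ])
table(toy_balanced$price_surge)  # 20 of each class
```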
### 4.2. Logistic regression model construction
Now that the classes are balanced, we can partition the data set for training and testing and fit a stepwise logistic regression model to predict whether price will surge:
```{r, echo=T, cache=T}
# Partition unscaled data
set.seed(888)
surge_trainRatio <- createDataPartition(df_col_equal$price_surge,p=0.8,list=F,times=1)
surge_train <- df_col_equal[surge_trainRatio,]
surge_test <- df_col_equal[-surge_trainRatio,]
# Stepwise logistic regression model
logit_model <- glm(price_surge~., data=surge_train, family=binomial(link="logit")) %>% stepAIC(trace = FALSE)
```
```{r}
# Coefficient table of the stepwise logistic model
coef_tbl <- summary(logit_model)$coef
coef_tbl[,4] <- round(coef_tbl[,4],5)
coef_tbl[,c(1,2,4)]
```
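Because logistic coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are easier to interpret. A sketch (shown on a toy model so it is self-contained; `logit_model` above can be transformed the same way, and none of this is evaluated here):

```{r, eval=FALSE}
# Odds ratios: exponentiate log-odds coefficients and their Wald CIs
toy_fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
odds_ratios <- exp(cbind(OddsRatio = coef(toy_fit), confint.default(toy_fit)))
odds_ratios
# Same idea for the surge model:
# exp(cbind(OddsRatio = coef(logit_model), confint.default(logit_model)))
```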
Similar to linear regression, we check the VIF values to see if there are any signs of multicollinearity issues.
```{r, echo=T}
# Check VIF values
vif(logit_model)
```
Since all VIFs are less than 5, we can be fairly confident that there is no severe multicollinearity among the features and move on with our analysis. Based on the model output, we can observe that most of the independent predictors are significant (based on their p-values) except for the "product" feature. To better visualize this information, the variable importance calculated for each feature is plotted below:
```{r}
# Variable importance for the logistic price-surge model
surge_varImp <- as.data.frame(varImp(logit_model, scale = F))
surge_varImp <- surge_varImp[order(abs(surge_varImp$Overall),decreasing=T),,drop=F]
surge_varImp <- rownames_to_column(surge_varImp, var="Variables")
surge_varImp$RelativeImp <- surge_varImp$Overall/sum(surge_varImp$Overall)
rownames(surge_varImp) <- c(1:nrow(surge_varImp))
ggplot(surge_varImp) +
geom_col(aes(x = reorder(Variables,RelativeImp), y = RelativeImp*100, fill=reorder(Variables,RelativeImp)), show.legend = F) +
coord_flip() +
labs(title=expression(bold("Figure 9. Variable importance of logistic regression model"))) +
ylab("Relative Importance (%)") + xlab(paste("")) + theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
```
From figure 9, we can determine that the source locations of the rides, along with weather conditions, are the most influential factors in whether prices surge. Specifically, rides starting from North End, Haymarket Square, North Station, West End and Financial District are the top 5 locations most likely to experience price surges. This makes sense since all of these locations are often heavily crowded, which can cause the demand for ride-shares to temporarily outnumber the supply of drivers during high-traffic hours. In addition, cloudy and rainy weather may also play a small part in price surges due to unfavorable road conditions.
### 4.3. Logistic regression model performance
After fitting the training set, the next step is to generate predictions for the testing set. The training and testing results from the logistic regression model are presented below:
```{r, echo=T}
# Generate predictions using the training set
logit_predict_train <- predict(logit_model, surge_train, type="response")