-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathWrite-up.Rmd
761 lines (668 loc) · 43.6 KB
/
Write-up.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
---
title: "Predicting EMS Calls for Better Ambulance Allocation"
author: "Yujing Wu & Xiaoran Wang"
date: "12/7/2019"
output:
html_document:
code_folding: hide
toc: true
toc_depth: 3
toc_float: true
theme: spacelab
---
<style>
.superbigimage{
overflow-x:scroll;
white-space: nowrap;
}
.superbigimage img{
max-width: none;
}
</style>
```{r setup}
knitr::opts_chunk$set(echo = TRUE, results=TRUE, echo=TRUE, message = FALSE,
warning=FALSE, fig.align="center", cache=FALSE, tigris_use_cache = TRUE)
```
```{r import libs}
library(kableExtra)
library(pscl)
library(tidyr)
library(tidyverse)
library(sf)
library(RSocrata)
library(lubridate)
library(tigris)
library(tidycensus)
library(gganimate)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
#library(stargazer)
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
palette5 <- c('#fff6f7', '#ffcfca', '#ffa59c', '#ff746e', '#ff263d')
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette3 <- c('#fff6f7','#ffa59c','#ff263d')
palette2 <- c("#6baed6","#08519c")
```
# 1. Introduction
In the emergency medical service (EMS) system, the time of responding to the call is always crucial for saving people’s lives. In Virginia Beach, there are 22 volunteer rescue squads, but more than 10% of EMS calls have been delayed in each year. Though the control center sends the closest ambulance to the scene, the unpredicted locations of calls still make the ambulance driver spend a long time on the way. According to the annual report of Virginia Beach EMS, from receiving calls to arriving at the scene, the drivers’ average response time was more than 9 minutes, and the responding time increased between 2017 and 2018. Based on the above situation, our project aims to minimize the time it takes for a driver to arrive at the scene after being notified. <br />
<br />
In addition, we would like to increase the transparency and efficiency of the EMS dispatch system by empowering the ambulance drivers to have more decision-making power. For our use case, the driver would have a general sense of the current call pattern around the city by checking our predicted risk map at a 1-hour time interval. Instead of staying at the dispatch center to wait for the call, he or she can stay around the high-risk location in advance. Our multi-driver communication system also allows the driver to be aware of the general locations of other drivers.
# 2. Data
We gathered data from multiple sources to explore the potential relationships between the number of EMS calls and other features. The predictors included different types of time lag, demographic features, spatial characteristics, and other relevant external characteristics of Virginia Beach. The main dataset was the Virginia Beach EMS calls between 2010 and 2018, which was acquired from the Virginia Beach Open data portal. Other supplement datasets were from the Tidy Census, Iowa Environment Mesonet, Virginia Road and Esri Open Data.
## 2.1 Feature Engineering
To utilize the gathered datasets in our model, we did feature engineering and transformed the data into different formats. Below is an introduction to the data processing and the logic behind it.
### EMS Call Data
We imported EMS call data from 2010 to 2018 in Virginia Beach. In our project, we used the data of June, July, and August in 2017 as training and testing data. We standardized the call times to hours and stored the data in the column interval60 in the dataset. In terms of spatial unit, we created a fishnet grid using the boundary of the city and excluded the water bodies. Finally, we aggregated call counts in each grid cell by the hour and stored the data in a new column Call_Count. Therefore, each row in our final time/space panel refers to the number of EMS calls by grid cell per hour in each day of the week.
```{r, results="hide"}
EMS_Calls <- st_read('C:\\Users\\Sarah\\Documents\\MUSA_507_Public_Policy_Analytics\\Predict-EMS-call-to-better-allocate-ambulances\\EMS_calls1.geojson') %>%
st_transform(2284)
boundary <- st_read('https://opendata.arcgis.com/datasets/82ada480c5344220b2788154955ce5f0_8.geojson') %>%
st_set_crs(4326) %>%
st_transform(2284)
EMS_calls2 <-
EMS_Calls %>%
mutate(interval60 = floor_date(ymd_hms(call_date_and_time), unit = "hour"),
interval15 = floor_date(ymd_hms(call_date_and_time), unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
```
```{r, fig.width=3}
fishnet <-
st_make_grid(boundary, cellsize = 5000) %>%
st_sf()
fishnet <-
fishnet[boundary,] %>%
mutate(uniqueID = rownames(.)) %>%
dplyr::select(uniqueID)
ggplot() +
geom_sf(data=fishnet) +
mapTheme() +
labs(title = "Fishnet of grid size 5000 ft")
```
### Demographic Characteristics
We downloaded census data at the tract level which included socioeconomic and demographic features of Virginia Beach. We believe the number of EMS calls is associated with these features. For example, we looked at the total population since the population size in each grid cell directly affects the number of calls. Related research indicated that the aged population and single males were more likely to make EMS calls. Therefore, we used the total households with people 65-years old and over to represent the aged population. We also summed up the number of males in the marital status of never married, spouse absent, widowed, and divorced. We assumed the income level is another decisive factor on whether one makes an EMS call in case of an emergency and added the median household income. Finally, people who took public transit to work may not have car ownership or live in congested areas, so this group tends to have a higher likelihood of making EMS calls.
```{r, results='hide'}
vbCensus <-
get_acs(geography = "tract", variables = c("B01003_001", "B19013_001", "B02001_002",
"B08301_001", "B08301_010", "B11007_001", "B12001_003",
"B12001_007", "B12001_009", "B12001_010"),
year = 2017, state = "VA", geometry = TRUE, county=c("Virginia Beach")) %>%
mutate(variable =
case_when(variable == "B01003_001" ~ "Total_Population",
variable == "B19013_001" ~ "Median_Household_Income",
variable == "B02001_002" ~ "Total_White_Population",
variable == "B08301_001" ~ "Means_of_Transportation_to_Work",
variable == "B08301_010" ~ "Total_Public_Trans_excl_Taxi",
variable == "B11007_001" ~ "Total_Households_with_65yrs_and_Over",
variable == "B12001_003" ~ "Never_Married_Male",
variable == "B12001_007" ~ "Spouse_Absent_Male",
variable == "B12001_009" ~ "Widowed_Male",
variable == "B12001_010" ~ "Divorced_Male")) %>%
select(variable, estimate, GEOID, geometry) %>%
spread(variable,estimate) %>%
mutate(Percent_White = Total_White_Population / Total_Population,
Percent_Taking_Public_Trans = Total_Public_Trans_excl_Taxi / Means_of_Transportation_to_Work,
Percent_Single_Male = (Never_Married_Male + Spouse_Absent_Male + Widowed_Male + Divorced_Male)/Total_Population) %>%
gather(Variable,Value, -GEOID, -geometry) %>%
st_transform(2284)
vbCensus_wide <- spread(vbCensus, Variable, Value)
census_fishnet_wide <-
fishnet[c(1)] %>%
st_join(vbCensus_wide[-c(2:18)], st_intersects, largest = TRUE) %>%
left_join(st_set_geometry(vbCensus_wide, NULL), by=c('GEOID')) %>%
st_set_geometry(NULL)
```
### Spatial Characteristics
Neighborhoods and census tracts were spatial characteristics in our prediction. Since grids within the same neighbor or census tract might have similar call pattern, we joined those two spatial features to the fishnet grids. The GEOID refers to the census tracts and NAME refers to the neighborhood.
```{r, results='hide'}
Neighborhoods <- st_read('https://data.opendatasoft.com/explore/dataset/zillow-neighborhoods@public/download/?format=geojson&refine.state=VA&refine.county=Virginia+Beach+City&timezone=America/New_York') %>%
st_transform(2284)
fishnet <-
fishnet %>%
st_join(Neighborhoods[2], st_intersects, largest = TRUE)
for (i in 1: nrow(fishnet)) {
if (is.na(fishnet$name[i])) {
previous <- fishnet$uniqueID[i - 1]
previous_neighborhood <- fishnet[fishnet$uniqueID == previous,]$name
fishnet$name[i] <- previous_neighborhood
}
}
```
```{r, fig.width=8}
grid.arrange(
ggplot() +
geom_sf(data = Neighborhoods) +
mapTheme() +
labs(title = "Neighborhoods in Virginia Beach"),
ggplot() +
geom_sf(data = vbCensus) +
mapTheme() +
labs(title = "Census tracts in Virginia Beach"),
ncol = 2
)
```
### Other
There are other features that might affect the number of EMS calls in an area. In beach cities such as Virginia Beach, the high volume of tourists in high-temperature summer days might suffer from drowning or sunstroke events. Thus, we added weather conditions including high temperature, wind speed, and precipitation as predictors. Besides, we found that car crashes were one of the major causes of EMS calls and aggregated the number of severe car accidents in 2017 into each grid cell.
```{r, results='hide'}
weather.Data <-
riem_measures(station = "NTU", date_start = "2017-05-15", date_end = "2017-09-15")
weather.Panel <-
weather.Data %>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
accident <- st_read("https://opendata.arcgis.com/datasets/1c7c9f723d5947c19c0fc34aaa30ff2a_0.geojson?where=Crash_Severity%20like%20'%25A.Severe%20Injury%25'%20AND%20VDOT_District%20like%20'%255.Hampton%20Roads%25'%20AND%20Crash_Year%20%3E%3D%202017%20AND%20Crash_Year%20%3C%3D%202017") %>%
st_transform(2284)
accident_net <-
accident %>%
dplyr::select() %>%
mutate(countAccident = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countAccident = ifelse(is.na(countAccident), 0, countAccident),
uniqueID = rownames(.))
fishnet <- fishnet %>%
merge(st_set_geometry(accident_net, NULL), by = 'uniqueID')
```
```{r}
fishnet %>%
ggplot() + geom_sf(aes(fill = countAccident)) +
mapTheme() +
labs(title = "Count of accidents per grid in Virgina Beach")
```
```{r}
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() +
labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme(),
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme(),
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme(),
top="Weather Data - Virginia Beach - June, July, August,& September, 2017")
```
### Final Space/time Panel
The dataset for space/time analysis must be in a panel format, which means the dataset must include all possible combinations of space and time in each grid cell. EMS_Call.panel was our final study panel, which included the combination of 24 hours a day * 7 days a week * 9 weeks a * 343 grid cells, and the total number of combinations in the panel was 518616.
```{r, results='hide'}
EMS_fishnet <-
EMS_calls2 %>%
st_join(fishnet, st_intersects) %>%
st_set_geometry(NULL)
EMS.template <-
EMS_fishnet %>%
filter(week >= 22 & week <= 24 | week >= 25 & week <= 30)
study.panel <-
expand.grid(interval60 = seq(floor_date(ymd_hms(min(EMS.template$call_date_and_time)), unit = "hour"),floor_date(ymd_hms(max(EMS.template$call_date_and_time)), unit = "hour"), by = '60 mins'),
uniqueID = unique(fishnet$uniqueID))
EMS_Call.panel <-
EMS.template %>%
mutate(Call_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, uniqueID) %>%
summarize(Call_Count = sum(Call_Counter, na.rm=T)) %>%
left_join(census_fishnet_wide, by = c("uniqueID")) %>%
left_join(weather.Panel) %>%
left_join(fishnet, by=c("uniqueID")) %>%
ungroup() %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
st_sf()
```
### Time Lag
We supposed the number of EMS calls at certain time intervals relates to the number of calls during other time intervals around it, so additional feature engineering was done to mine the data for critical time trends. The time lags were created and included time lags of 1 to 4 hours, a half-day, and a whole day.
```{r}
EMS_Call.panel <-
EMS_Call.panel %>%
arrange(uniqueID, interval60) %>%
mutate(lagHour = dplyr::lag(Call_Count,1),
lag2Hours = dplyr::lag(Call_Count,2),
lag3Hours = dplyr::lag(Call_Count,3),
lag4Hours = dplyr::lag(Call_Count,4),
lag12Hours = dplyr::lag(Call_Count,12),
lag1day = dplyr::lag(Call_Count,24)) %>%
mutate(day = yday(interval60))
```
Finally, we split the data to train the model on first 6 weeks of data and test it on the latter 3 weeks.
```{r}
EMS_Call.Train <- filter(EMS_Call.panel, week < 28)
EMS_Call.Test <- filter(EMS_Call.panel, week >= 28)
```
# 3. Exploratory Analyses
In this section, we will further explore our EMS call dataset and potential predictors. Specifically, we will look at the distribution of EMS calls among fixed effect variables and plot the correlations between calls and continuous variables.
```{r}
week26 <-
EMS_fishnet %>%
filter(week == 26 & dotw == "Mon")
week26.panel <-
expand.grid(interval60 = seq(floor_date(ymd_hms(min(week26$call_date_and_time)), unit = "hour"),floor_date(ymd_hms(max(week26$call_date_and_time)), unit = "hour"), by = '60 mins'),
uniqueID = unique(fishnet$uniqueID))
EMS_Call.animation.data <-
week26 %>%
mutate(Call_Counter = 1) %>%
right_join(week26.panel) %>%
group_by(interval60, uniqueID) %>%
summarize(Call_Count = sum(Call_Counter, na.rm=T)) %>%
left_join(fishnet,by=c("uniqueID")) %>%
st_sf() %>%
mutate(Calls = case_when(Call_Count == 0 ~ "0 calls",
Call_Count > 0 & Call_Count <= 1 ~ "1 call",
Call_Count > 1 & Call_Count <= 3 ~ "2-3 calls",
Call_Count > 3 & Call_Count <= 5 ~ "4-5 calls",
Call_Count > 5 ~ "5+ calls")) %>%
mutate(Calls = factor(Calls, levels=c("0 calls","1 call","2-3 calls","4-5 calls","5+ calls")))
```
```{r}
EMS_Call_animation <-
ggplot() +
geom_sf(data=EMS_Call.animation.data, aes(fill=Calls)) +
scale_fill_manual(values = palette3) +
labs(title = "EMS Calls for one day in June 2018, Virginia Beach",
subtitle = "60 minute intervals: {current_frame}") +
transition_manual(interval60) +
mapTheme()
animate(EMS_Call_animation, duration=24)
```
## 3.1 Call count variation across time
We visualized EMS call counts by hours of the day during 9 weeks in a time series plot. The black lines separate the plot into weeks. The testing and training data are differentiated by colors. <br />
<br />
Since EMS calls are rare and usually caused by accidental events, there is not a clear pattern in the distribution of these calls temporally. According to the plot, the differences between peak and valley are significantly large, which further indicates the irregularity of EMS calls. Though there are many noises within the line, one can still observe that roughly one peak occurs everyday, and calls usually take place during the daytime. There are obvious high peaks in the distribution of the calls except in week 2. In week two, the number of calls stays quite constant.
<div class="superbigimage">
```{r, fig.width= 15, fig.height=3}
mondayMidnight <-
EMS_Call.panel %>%
mutate(mondayMidnight = ifelse(wday(interval60) == 2 & hour(interval60) == 1,
as.POSIXct(interval60),0)) %>%
filter(mondayMidnight != 0)
rbind(
mutate(EMS_Call.Train, label = "Training"),
mutate(EMS_Call.Test, label = "Testing")) %>%
group_by(label, interval60) %>%
summarize(Call_Count = sum(Call_Count)) %>%
ggplot(aes(interval60, Call_Count, colour = label)) +
geom_line() +
ylim(0,20) +
labs(title="EMS calls in Virginia Beach by week: June through August, 2017",
subtitle="Monday demarked in black", x="Day", y="Call Counts") +
plotTheme() + theme(panel.grid.major = element_blank()) +
scale_colour_manual(values = palette2) +
geom_vline(data = mondayMidnight, aes(xintercept = mondayMidnight), colour="black")
```
</div>
Next, we aggregated EMS call counts by the hour for each day of the week and visualized it by a hotspot matrix. The matrix indicates that fewer calls are made during the early morning before 6 A.M. The call counts start to increase from 7 A.M. and reach a peak between10 A.M. and 11 A.M. Then the call counts keep decreasing until midnight. Money, Tuesday and Thursday have around 80 calls during the rush hours. Especially for Monday and Tuesday, the general call counts are larger than that of other days. The rest days still follow the general daily pattern but do not have any extremely high or low counts.<br />
```{r}
EMS_Call.panel$hour <- hour(EMS_Call.panel$interval60)
EMS_Call.panel %>%
group_by(hour, dotw) %>%
summarise(
Call_Count = sum(Call_Count)) %>%
ggplot(aes(x=hour,
y=dotw,
fill=Call_Count)) + geom_tile() + scale_fill_gradient(low = "#E4FFFF", high = "#EF200A", name = 'Count of EMS Calls')
```
We visualized the correlations between EMS call counts and time lags. The absolute values of correlations are around 0.4 to 0.5 except for the 4-hour lag. In fact, for the first four hours, despite the minor fluctuation between the 2-hour lag and the 3-hour lag, the value of the correlation dropped from 0.45 to 0.19. The significant reduction also shows that the correlation tends to decrease with additional hours. The correlations are stronger again in 12-hour lag and 1-day lag. The 12-hour lag is the only lag with a negative correlation, and we believe it corresponds to the pattern that is displayed in the time series plot: the daytime and nighttime have opposite call count patterns.
```{r}
plotData.lag <-
as.data.frame(EMS_Call.panel) %>%
filter(week == 26) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Call_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Call_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))
correlation.lag <-
plotData.lag %>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Call_Count),2))
plotData.lag %>%
ggplot(aes(Value, Call_Count)) +
geom_point() + geom_smooth(method = "lm", se = F, color = '#78DBFF') +
facet_wrap(~Variable) +
geom_text(data=correlation.lag, aes(label=paste("R =", correlation)),colour = "blue",
x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
labs(title = "EMS call count as a function of time lags",
subtitle= "One week in June, 2017", x = "Lag Call Count") +
plotTheme()
```
## 3.2 Call count variation across space
We then explored the spatial pattern of EMS call counts across Virginia Beach. We visualized the call counts of each grid cell to see whether there are any spatial autocorrelations or specific patterns that may potentially affect our prediction. Since the hour is the time interval of our prediction, we made small multiple maps below to show the spatial distribution of the sum of three-month EMS call counts per grid. Generally, the calls take place in North Virginia Beach. In the early morning, the call counts are at a low level and in an even distribution across North Virginia Beach. When time shifts to the daytime, the general call counts increase with no grids with extremely high call counts (Except the 12 P.M.). After 6 P.M., the easternmost grid has significantly high call counts than other grid cells do.
```{r, fig.width=25, fig.height=25}
EMS_Call.panel %>%
mutate(hour = hour(interval60)) %>%
group_by(hour, uniqueID) %>%
summarize(Sum_Call_Count = sum(Call_Count)) %>%
ggplot() + geom_sf(aes(fill = Sum_Call_Count)) +
facet_wrap(~hour, ncol = 6) +
scale_fill_gradient(low = "#E2F2FD", high = "#FF2020", space = "Lab") +
labs(title="Sum of EMS calls by hour of the day") +
mapTheme() + theme(legend.position = "bottom") + theme(strip.text = element_text(size=20))
```
We visualized the aggregated call counts by week during the three months. A large number of call counts form a horizontal line at the northern Virginia Beach, and the call counts gradually expand outward from the line. According to the map, the horizontal line with more call counts is the main road of Virginia Beach, which reflects that the spatial distribution of EMS calls may relate to the road distribution.
```{r, fig.width=25,fig.height=25}
EMS_Call.panel %>%
group_by(week, uniqueID) %>%
summarize(Sum_Call_Count = sum(Call_Count)) %>%
ggplot() + geom_sf(aes(fill = Sum_Call_Count)) +
facet_wrap(~week, ncol = 3) +
scale_fill_gradient(low = "#E2F2FD", high = "#FF2020", space = "Lab") +
labs(title="Sum of EMS calls by week") +
mapTheme() + theme(legend.position = "bottom") + theme(text = element_text(size=14)) + theme(strip.text = element_text(size=20))
```
## 3.3 Weather
Virginia Beach is a beach city, and we select three summer months, so we visualized the weather variables. Considering the season and location, we chose the presence/absence of precipitation and over 86 degrees Fahrenheit temperature as weather variables. The bar plots indicate the presence of rainy days and high-temperature days in each week. After week 25, the end of June, the general precipitation in Virginia Beach started to decline. The high temperature first appeared at week 24. The end of June and beginning of July are the hottest within our studied period.
```{r, fig.width= 16, fig.height=5}
grid.arrange(
as.data.frame(EMS_Call.panel) %>%
group_by(interval60) %>%
summarize(Call_Count = sum(Call_Count)) %>%
left_join(weather.Panel) %>%
mutate(isPercip = ifelse(Precipitation > 0,"Rain", "None")) %>%
group_by(week = week(interval60), isPercip) %>%
summarize(Mean_Call_Count = mean(Call_Count)) %>%
ggplot(aes(isPercip, Mean_Call_Count)) +
geom_bar(stat = "identity") +
facet_wrap(~week,ncol=9) +
labs(title="Does EMS Calls vary when it's raining?",
subtitle="Mean call count by week; June through August, 2017",
x="Precipitation", y="Mean Call Count") +
plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1)),
as.data.frame(EMS_Call.panel) %>%
group_by(interval60) %>%
summarize(Call_Count = sum(Call_Count)) %>%
left_join(weather.Panel) %>%
mutate(isHot = ifelse(Temperature > 86,"Hot", "None")) %>%
group_by(week = week(interval60), isHot) %>%
summarize(Mean_Call_Count = mean(Call_Count)) %>%
ggplot(aes(isHot, Mean_Call_Count)) +
geom_bar(stat = "identity") +
facet_wrap(~week,ncol=9) +
labs(title="Does EMS Call vary when it's too hot?",
subtitle="Mean call count by week; June through August, 2017",
x="Temperature", y="Mean Call Count") +
plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1)),
ncol = 2
)
```
High temperatures may increase the incidence of certain diseases such as sunstroke or cardiovascular diseases. The below plots indicate a higher number of EMS calls tend to occur in weeks with more high-temperature days. High temperatures may increase the incidence of certain diseases such as sunstroke or cardiovascular diseases. The below plots indicate a higher number of EMS calls tend to occur in weeks with more high-temperature days. There seems to be no obvious relationship between the number of calls and precipitation.
```{r}
correlation.temperature <- EMS_Call.panel %>%
group_by(week) %>%
summarize(correlation = round(cor(Temperature, Call_Count),2))
as.data.frame(EMS_Call.panel) %>%
group_by(interval60) %>%
summarize(Call_Count = sum(Call_Count)) %>%
left_join(weather.Panel) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Temperature, Call_Count)) +
geom_point() + facet_wrap(~week) + geom_smooth(method = "lm", se= FALSE,color = '#78DBFF') +
plotTheme() +
geom_text(data=correlation.temperature, aes(label=paste("R =", correlation)),colour = "blue",
x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
labs(title="Call Count as a fuction of Temperature by week",
subtitle="Call count by week; June through August, 2017",
x="Temperature", y="Call Trip Count")
```
## 3.4 Car crashes
Severe car crashes are usually associated with serious injuries, leading to EMS calls. Both car accidents and EMS calls are rare events in people’s daily life. We created a scatterplot to visualize the relationship between the total number of car accidents and EMS calls for one week. We noticed a positive correlation between the two. The correlation between EMS call counts and car crashes is 0.49, higher than correlations between EMS calls and other predictors in this model.
```{r}
correlation.accidents <-
EMS_Call.panel %>%
filter(week == 26) %>%
group_by(uniqueID) %>%
summarize(Call_Count = sum(Call_Count), countAccident = sum(countAccident)) %>%
mutate(correlation = round(cor(countAccident, Call_Count),2))
as.data.frame(EMS_Call.panel) %>%
filter(week == 26) %>%
group_by(uniqueID) %>%
summarize(Call_Count = sum(Call_Count), countAccident = sum(countAccident)) %>%
ggplot(aes(countAccident, Call_Count)) +
geom_point() + geom_smooth(method = "lm", se= FALSE,color = '#78DBFF') +
plotTheme() +
geom_text(data=correlation.accidents, aes(label=paste("R =", correlation)),colour = "blue",
x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
plotTheme() +
labs(title="One week of EMS Calls as a function of number of accidents")
```
## 3.5 Socioeconomic factors
Below, we explored the correlation between our demographic variables and the number of EMS calls in one week. There is a negative correlation between EMS calls and the median household income. The calls are also negatively correlated with the percent of the white population. There is a positive correlation between EMS calls and the percent of the population taking public transportation. In addition, they are also positively correlated with the number of households with people 65 years old and over. The correlation between the number of calls and the percent of single males is 0.01, indicating a very weak correlation.
```{r, fig.width=15, fig.height=10}
census_fishnet <-
fishnet[c(1)] %>%
st_join(vbCensus[-c(2,3)], st_intersects, largest=TRUE) %>%
left_join(st_set_geometry(vbCensus, NULL), by=c('GEOID')) %>%
st_set_geometry(NULL)
plotData.census <-
as.data.frame(EMS_Call.panel) %>%
filter(week == 26) %>%
group_by(uniqueID) %>%
summarize(Call_Count = sum(Call_Count)) %>%
left_join(census_fishnet, by=c("uniqueID")) %>%
filter(Variable == "Median_Household_Income" | Variable == "Mean_Commute_Time_for_Workers" |
Variable == "Percent_Taking_Public_Trans" | Variable == "Percent_White" | Variable == "Percent_Single_Male"| Variable == "Total_Households_with_65yrs_and_Over")
correlation.census <-
plotData.census %>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Call_Count),2))
ggplot(plotData.census, aes(Value,Call_Count)) + geom_point() + geom_smooth(method="lm", se = F, color = '#78DBFF') +
facet_wrap(~Variable, scales="free", ncol=2) +
geom_text(data=correlation.census, aes(label=paste("R =", correlation)),
colour = "blue", x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
plotTheme() + theme(strip.text = element_text(size=20)) +
labs(title="One week of EMS Calls by Census Tract\nas a function of selected Census variables")
```
# 4. Modeling
In this project, we used the zero-inflated Poisson model for several reasons. First, we used a Poisson model instead of the Ordinary Least Square model since we aimed to predict the count of EMS calls. In addition, EMS calls are relatively rare, meaning the distribution of our data is rather similar to the Poisson distribution instead of a normal distribution. <br />
<br />
We used the zero-inflated Poisson model since there is an excess amount of zeros in our EMS call data. The model assumes that the excess zeros and the counts are generated by two independent processes. Here, we believed that the zero EMS calls in south Virginia Beach were determined by different factors from the higher number of calls in North Virginia Beach.
```{r, fig.width=10}
as.data.frame(EMS_Call.panel) %>%
group_by(uniqueID) %>%
summarize(Call_Count = sum(Call_Count)) %>%
ggplot(aes(Call_Count)) +
geom_histogram(binwidth = 1) +
labs(title = "Distribution of EMS Calls by grid cell")
```
Although we gathered and wrangled a variety of different datasets, we only used the hour of day, the neighborhood, the count of calls during the previous hour, the count of calls during the previous 12 hours, and the number of car accidents within the grid cells as our predictors. We chose these predictors by putting our independent variables into the model and examined how the model responded. We did not use variables that led to no difference in the error.
```{r}
regression1 <-
zeroinfl(Call_Count ~ hour(interval60) + name + lagHour + lag12Hours + countAccident | 1, data = EMS_Call.Train)
EMS_Call.Test.weekNest <-
EMS_Call.Test %>%
nest(-week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
week_predictions <-
EMS_Call.Test.weekNest %>%
mutate(Zero_inflated_poisson = map(.x = data, fit = regression1, .f = model_pred))
```
# 5. Validation
To assess the accuracy of our model, we looked at the mean absolute error of our model per hour per grid for each week. To interpret this number, we also calculated the number of EMS calls per hour per grid on average to compare these numbers on the same scale. The MAE are quite similar among the three weeks in our test dataset. OVerall, the mean error for each time/space interval is around 0.02 while the mean call count per week is around 0.01. The result shows that the accuracy of the model still needs to be improved.
```{r}
week_predictions <-
week_predictions %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Call_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean),
sd_AE = map_dbl(Absolute_Error, sd))
week_predictions
```
``` {r}
EMS_Call.Test.weekNest %>%
mutate(Call_Count = map(data, pull, Call_Count),
Mean_Call_Count = map_dbl(Call_Count, mean))
```
Doing cross-validation for a time/space prediction model is out of scope for this project. Here, we decided to validate our model using three methods. First, we held out a different set of weeks from our original dataset to train our model and tested the model on the remaining weeks. Instead of using the first 6 weeks to train the model and test it on the latter 3 weeks, this time we used the first 5 weeks to train the model and tested it on the latter 4 weeks. The results allowed us to assess how our model generalized to new data.
```{r}
EMS_Call.Train1 <- filter(EMS_Call.panel, week < 27)
EMS_Call.Test1 <- filter(EMS_Call.panel, week >= 27)
regression2 <-
zeroinfl(Call_Count ~ hour(interval60) + name + lagHour + lag12Hours + countAccident | 1, data = EMS_Call.Train1)
EMS_Call.Test.weekNest1 <-
EMS_Call.Test1 %>%
nest(-week)
week_predictions1 <-
EMS_Call.Test.weekNest1 %>%
mutate(Zero_inflated_poisson = map(.x = data, fit = regression2, .f = model_pred))
```
```{r}
week_predictions1 <-
week_predictions1 %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Call_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean),
sd_AE = map_dbl(Absolute_Error, sd))
week_predictions1
```
<br />
The results above show our model has a similar mean and standard deviation of absolute errors when trained on a different set of data. For the scope of this project, we only looked at the model results for one different set of training data. In the future, it is worth training the model on data with a larger time frame and holding out training data that is significantly different from the rest (e.g. data for a specific season). <br />
<br />
Second, we examined the temporal and spatial distribution of our error through mapping them across space and time.
```{r, fig.width=15}
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
uniqueID = map(data, pull, uniqueID)) %>%
dplyr::select(interval60, uniqueID, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -uniqueID) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = mean(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
scale_colour_manual(values = palette2) +
labs(title = "Mean Predicted/Observed EMS Calls by hourly interval",
subtitle = "Virginia Beach; A test set of 3 weeks in December", x = "Hour", y= "EMS Calls") +
plotTheme()
```
<br />
From the chart above, it is apparent that our model captures the temporal pattern of the EMS calls. The peaks within the observed and predicted calls occur at roughly the same places. The valleys within the observed and predicted calls also occur at roughly the same places. Throughout the time span of our test dataset, the difference between our predicted and observed calls stay relatively consistent, indicating that our model generalizes well temporally.
```{r, fig.width=10, fig.height=10}
all_prediction <- EMS_Call.Test %>%
mutate(Prediction = predict(regression1, EMS_Call.Test)) %>%
mutate(Absolute_Error = abs(Prediction - Call_Count))
breaks <- hour(hm("00:00", "6:00", "12:00", "18:00", "23:59"))
labels <- c("Night", "Morning", "Afternoon", "Evening")
all_prediction$Time_of_day <- cut(x=hour(all_prediction$interval60), breaks = breaks, labels = labels, include.lowest=TRUE)
all_prediction %>%
group_by(uniqueID, Time_of_day) %>%
summarize(MAE = mean(Absolute_Error)) %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Time_of_day, ncol=2) +
scale_fill_gradient(low = "#E2F2FD", high = "#FF2020", space = "Lab") +
#scale_fill_viridis() +
labs(title="Mean Absolute Error by grid and time of day") +
mapTheme() + theme(legend.position="bottom")
```
<br />
We also looked at the distribution of the model error by the time of the day. Here, we grouped the hours in a day into morning (6:00 - 12:00), afternoon (12:00 - 18:00), evening (18:00 - 23:59) and night (0:00 - 6:00). The above maps indicate the mean absolute error by the grid for each of these time intervals. Overall, our model generalizes well among morning, afternoon, and evening. The MAE is slightly lower during the night than the three other intervals. Spatially, our model predicts better for South Virginia Beach, where there have been in general much fewer calls. The error is quite consistent across grid cells in North Virginia Beach, where most of the EMS calls took place.
```{r}
filter(week_predictions, Regression == "Zero_inflated_poisson") %>%
unnest() %>%
st_sf() %>%
dplyr::select(uniqueID, Absolute_Error, geometry) %>%
gather(Variable, Value, -uniqueID, -geometry) %>%
group_by(Variable, uniqueID) %>%
summarize(MAE = mean(Value)) %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
scale_fill_gradient(low = "#E2F2FD", high = "#FF2020", space = "Lab") +
labs(title="Mean Absolute Error by grid") +
mapTheme() + theme(legend.position="bottom")
```
<br />
The above graph indicates the mean absolute error by the grid for the whole test dataset. As mentioned before, most of the error occurs in Northern Virginia Beach, where most of the EMS calls have taken place. It reveals that there may be other predictors that differentiate the grid cells with higher call volume. Overall, although our model generalizes well temporally, its generalizability regarding space still needs to be improved.
<br />
<br />
Finally, we examined the distribution of errors across neighborhoods and racial context. Below, the plot reveals our model does not generalize well between neighborhoods in the north and neighborhoods in the south. Similar to what's discussed above, the model generates a larger error for neighborhoods with higher call counts in North Virginia Beach. However, among neighborhoods with a higher number of EMS calls, the error stays relatively constant.
```{r}
neighborhood_MAE <- all_prediction %>%
st_set_geometry(NULL) %>%
group_by(name) %>%
summarise(MAE = mean(Absolute_Error))
Neighborhoods %>%
merge(neighborhood_MAE, by = c('name')) %>%
ggplot() + geom_sf(aes(fill = MAE)) +
scale_fill_gradient(low = "#E2F2FD", high = "#FF2020", space = "Lab") +
mapTheme() +
labs(title="Mean Absolute Error by Neighborhood")
```
The plot and the table below show the distribution and the mean absolute error per grid per hour of majority non-white and majority-white tracts. All the majority non-white tracts are located in North Virginia Beach, where more EMS calls occurred. The majority-white tracts are distributed more evenly across the city. There are far more majority-white tracts than majority non-white ones. The mean absolute error of the majority non-white tracts is slightly higher than the majority-white tracts. This shows that the generalizability of the model needs to be improved. This difference may be caused by the difference in performance when our model predicts for cells with higher call counts versus cells with very low call counts.
```{r}
vbCensus_wide[is.na(vbCensus_wide$Percent_White) != TRUE,] %>%
mutate(raceContext = ifelse(Percent_White > .5, "Majority White", "Majority Non-white")) %>%
ggplot() + geom_sf(color = 'white', aes(fill = raceContext)) +
scale_fill_manual(values = palette2) +
mapTheme() +
labs(title="Race Context of Virginia Beach")
```
```{r}
Prediction_race <- all_prediction %>%
mutate(raceContext = ifelse(Percent_White > .5, "Majority White", "Majority Non-white"))
Prediction_race[is.na(Prediction_race$GEOID) == FALSE, ] %>%
st_set_geometry(NULL) %>%
group_by(GEOID, raceContext) %>%
summarize(MAE = mean(Absolute_Error)) %>%
group_by(raceContext) %>%
summarize(MAE = mean(MAE)) %>%
kable(caption = "Mean Absolute Error of Test set by Race Context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = 'white', background = "#E74C3C") %>%
row_spec(2, color = "white", background = "#2E86C1")
```
# 6. Conclusion
In this project, we built a model to predict the count of EMS calls in Virginia Beach per 5000 ft grid cell per hour. There were two challenges that we tried to deal with. First, EMS calls were rare events and most of the grid cells had zero calls at most of the time intervals. Second, since emergencies can often be pure accidents, our EMS call data contains a lot of noise and does not show a very strong pattern in terms of time and space. To deal with these two challenges, we used a zero-inflated Poisson model and explored a variety of potential predictors including the number of accidents, demographic characteristics, and neighborhoods. Our analysis resulted in a model with reasonable accuracy and generalizability. <br />
<br />
Through creating an app based on our model, we aimed to better inform ambulance drivers through identifying grid cells with a higher number of calls predicted. The app would enable them to stay around high-risk zones in advance instead of waiting to be dispatched at the rescue stations. As discussed in the validation section, our model has successfully captured the trend within EMS calls. However, it wasn’t as successful in predicting the exact number of calls. Thus, we do not recommend our users make decisions based on the exact number of calls predicted but instead encourage them to interpret our predictions in a relative way. Essentially, a higher number of calls in our predictions indicates a higher risk of EMS calls while a lower number of calls predicted indicates a lower risk of EMS calls. In addition, it is important to acknowledge the generalizability of the model needs to be improved in terms of space and demographic context. <br />
<br />
```{r, echo=FALSE, out.width = '100%'}
knitr::include_graphics("app.jpg")
```
<br />
We can improve this model in several different ways. For this model, we did not put in spatial lag due to the limit of time. However, there is a high possibility that there is spatial autocorrelation among the number of call counts. In addition, we discovered that the relationships between the number of EMS calls and predictors drastically change spatially in the city of Virginia Beach. Most of the cells in South Virginia Beach did not get one single EMS call in our 9 weeks of data. As a result, we believe ensemble modeling may drastically improve our predictions since different models can be used to predict different relationships between the calls and the predictors. Second, the pattern within EMS calls has been difficult to capture. And we believe the time span of interest may be much longer than 9 weeks. In order to create a better model, we need to include data with a longer time frame, for example, six months or even a whole year. Finally, instead of predicting the number of calls per hour per grid cell, we can change the time unit of our predictions to a larger time interval. For example, using the number of calls per 3 hours per grid cell may lead to better performance of the model. However, we do think there will be a possible tradeoff between the accuracy and applicability of the results.
<br />
FInally, if you are interested in learning more about this project, check out https://youtu.be/kBo98GdzNTM for a video presentation of the project.