Skip to content

Latest commit

 

History

History
48 lines (33 loc) · 4.68 KB

README.md

File metadata and controls

48 lines (33 loc) · 4.68 KB

Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights (ACM SIGSPATIAL 2019)

Generate Input

One important process is to transform raw input data into the form of input for a machine learning model. Here we employed multiple processes as follows:

  • Step 1: Run 1-CreateInputForAccidentPrediction.ipynb from /1-GenerateFeatureVector to generate raw feature vectors. Each vector represents a geographical region of size 5km x 5km (that we call it a geohash) during a 15 minutes time interval. This code uses LSTW dataset for traffic events data, raw weather observation records for weather-related attributes (check data/Sample_Weather.tar.gz for sample data), and daylight information (check data/sample_daylight.csv for sample data).

  • Step 2: Run 2-CreateNaturalLanguageRepresentationForGeoHashes.ipynb to generate description to vector representation for geographical regions. The main inputs for this process are LSTW and GloVe. A sample output can be find as data/geohash_to_text_vec.csv.

  • Step 3: Run 3-DataCleaningAndIntegration.ipynb for data cleaning, and preparation for integration with POI data.

  • Step 4: Run 4-FinalTrainAndTestDataPreparation.ipynb to prepare final train and test data. This includes creating sample entries, and negative sampling for non-accident data samples. There are two versions of the code: single thread vs multi-thread. The multi-thread version uses more system cores but it needs more memory as well. It is more suitable for running on servers. Single-thread is for running on desktop devices for generating smaller train-test sets.

Implementations of these steps can be found in 1-GenerateFeatureVector. Also, note that the sample data and codes are for those cities that we used in the paper (e.g., Atlanta, Austin, Charlotte, Dallas, Houston, and Los Angeles).

Sample Data Files For Train and Test

To train and test our proposed model and the baselines, you can use our pre-generated train and test files for six cities Atlanta, Austin, Charlotte, Dallas, Houston, and Los Angeles. The time frame to generate sample data for these cities is the same as what we described in our paper. You can find these files in data/train_set.7z. Use 7za -e train_set.7z to decompress this file and obtain 4 numpy (.npy) files per city. Two files contain feature vectors for train and test, and two files contain train and test labels. These sample files are the result of the above input generation process.

Deep Accident Prediction (DAP) Model

Our Deep Accident Prediction model comprises several important components including Recurrent Component, Embedding Component, Description-to-Vector Component, Points-Of-Interest Component, and Fully-connected Component. The following image shows a demonstration of this model:

The implementation of this model can be found here: 2-DAP/DAP.ipynb.

Baseline Models

In terms of baselines, we employed the following models:

  • Logistic Regressions (LR): Find sample code in 3-Baselines/Traditional_Models_Sklearn.py.
  • Gradient Boosted classifier (GBC): Find sample code in 3-Baselines/Traditional_Models_Sklearn.py.
  • FeedForward Neural Network Model (DNN): An implementation of this model can be found in 3-Baselines/DNN.ipynb.
  • DAP Without Embedding Component (DAP-NoEmbed): An implementation of this model can be found in 2-DAP/DAP-NoEmbed.ipynb.

System Requirements

We recommend using Python 3.x, and install the following to properly run the code:

pip install tensorflow==1.14.0
pip install keras==2.3.1
pip install keras_metrics
pip install keras_self_attention
pip install scikit-learn==0.20.0

Please note that you may choose to use other versions of tensorflow and/or keras, but make sure that they are compatible.

How to Run the Code?

All implementations are in python, with deep learning models developed in Keras using Tensorflow as backend. For non-deep learning baselines (i.e., LR and GBC) you can run codes on CPU machines. But for deep-learning models, we recommend using GPU machines to speed-up the process.

Acknowledgment