A two-day case study on fraud detection. The goal of this sprint was to create an end-to-end prediction platform.

drewrice2/fraud-detection-case-study-DSI

Fraud Detection Case Study

A two-day case study on fraud detection. The goal of this sprint was to create an end-to-end prediction platform. First, we were broken up into teams of three.

Our team began with feature selection and engineering. Some of the features we engineered were:

  • Count NaNs, or missing data, per column
  • A percentage of uppercase characters for each title
  • Event duration field
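The three features above can be sketched in plain Python as follows. This is a minimal illustration, not the team's code: the function names are hypothetical, records are assumed to be dictionaries (as they would be after parsing JSON), and start and end times are assumed to be `datetime` objects.

```python
from __future__ import division  # keeps division float-valued on Python 2.7 too


def missing_per_column(records):
    """Count missing values (None or NaN) for every key seen in the records."""
    counts = {}
    for record in records:
        for key, value in record.items():
            is_nan = isinstance(value, float) and value != value  # NaN != NaN
            counts[key] = counts.get(key, 0) + int(value is None or is_nan)
    return counts


def uppercase_percentage(title):
    """Percentage of characters in the title that are uppercase."""
    if not title:
        return 0.0
    return 100.0 * sum(ch.isupper() for ch in title) / len(title)


def event_duration_hours(start, end):
    """Event duration in hours, given two datetime objects."""
    return (end - start).total_seconds() / 3600.0
```

Note that `missing_per_column` only counts keys that are present with an empty value; a key that is absent from a record altogether is not counted.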

Based on the assumption that misclassifying a truly fraudulent case costs significantly more than misclassifying a non-fraudulent one, we modeled to minimize false negatives. After a train/test split, we iteratively tested the random forest model and selected the features that gave the best result.
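The README does not say how false negatives were minimized. One common way to act on asymmetric costs is to tune the decision threshold applied to the predicted fraud probabilities, sketched below. The 10:1 cost ratio, the candidate thresholds, and the function names are all assumptions for illustration.

```python
def confusion_counts(y_true, fraud_probs, threshold):
    """Return (false_negatives, false_positives) at a probability threshold."""
    fn = fp = 0
    for label, prob in zip(y_true, fraud_probs):
        predicted = 1 if prob >= threshold else 0
        if label == 1 and predicted == 0:
            fn += 1  # fraud that slipped through
        elif label == 0 and predicted == 1:
            fp += 1  # legitimate case flagged as fraud
    return fn, fp


def pick_threshold(y_true, fraud_probs, fn_cost=10.0, fp_cost=1.0):
    """Pick the candidate threshold with the lowest total misclassification cost."""
    candidates = [i / 20.0 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

    def total_cost(threshold):
        fn, fp = confusion_counts(y_true, fraud_probs, threshold)
        return fn * fn_cost + fp * fp_cost

    # On ties, min() keeps the first (lowest) threshold.
    return min(candidates, key=total_cost)
```

With a high `fn_cost`, the chosen threshold tends to drop below 0.5, trading extra false alarms for fewer missed fraud cases. The probabilities would come from the held-out test split, for example via `predict_proba` on a fitted `RandomForestClassifier`.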

The model was designed to take one instance, classify it as fraud or not with associated probability scores, and then save the results to a Mongo database. We then set up a site on our local machine designed to receive one request and run it through the steps described above.
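The classify-then-store step might look roughly like the sketch below. It assumes `model` exposes scikit-learn's `predict_proba` and `collection` exposes PyMongo's `insert_one`. The function name, the `fraud_probability` and `is_fraud` field names, and the 0.5 default threshold are hypothetical.

```python
def score_and_store(record, features, model, collection, threshold=0.5):
    """Classify one record and persist the result.

    `model` and `collection` are passed in so the function can be exercised
    with simple stand-ins instead of a fitted model and a live database.
    """
    # predict_proba returns one row per sample; with classes ordered [0, 1]
    # the second column is the probability of fraud (an assumption here).
    probabilities = model.predict_proba([features])[0]
    fraud_probability = float(probabilities[1])

    document = dict(record)  # copy so the caller's record is not mutated
    document["fraud_probability"] = fraud_probability
    document["is_fraud"] = fraud_probability >= threshold

    collection.insert_one(document)
    return document
```

Passing the model and collection as arguments keeps the function independent of how the Flask app loads the model or connects to Mongo.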

A server sent out live requests (unseen data in JSON format) to the site we set up. We then classified those new requests and stored them in the Mongo database. We coded up a dashboard on the splash page of the site for a quick view of essential info. Essentially, we wanted to make potentially fraudulent cases accessible at a glance.
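As a rough idea of what an at-a-glance view needs, the sketch below ranks stored predictions so the most suspicious cases come first. The `is_fraud` and `fraud_probability` field names and the function name are hypothetical; the actual dashboard may select and order its rows differently.

```python
def dashboard_rows(documents, limit=10):
    """Return the highest-risk flagged predictions, most suspicious first."""
    flagged = [doc for doc in documents if doc.get("is_fraud")]
    flagged.sort(key=lambda doc: doc["fraud_probability"], reverse=True)
    return flagged[:limit]
```

The `documents` argument could be any iterable of stored records, such as the result of a PyMongo `find()` call.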

Dashboard example:

![Dashboard Example](https://github.com/drewrice2/fraud-detection-case-study-DSI/blob/master/Dashboard_example.png)

Technologies used:

  • Python 2.7
  • scikit-learn's RandomForestClassifier and train_test_split
  • MongoDB, via PyMongo
  • Flask
  • pandas, NumPy

Future steps would include:

  • Grid search to optimize the model
  • Clean up the database
  • Test other models
  • Apply NLP to event title and description
  • Make the dashboard look freakin’ sweet

NOTE: due to the nature of the sprint, some of the code is a bit hacky, so beware...

Scott Contri, Clay Porter, Drew Rice, 2016.
