Healthcare Fraud Detection Analysis

Project Overview

This project aims to develop a model for detecting healthcare fraud using machine learning techniques. The analysis leverages a dataset containing various beneficiary attributes and healthcare reimbursement amounts. The objective is to identify fraudulent claims and profile high-risk beneficiaries.

Technologies Used

Python: Programming language for data analysis and modeling.
Pandas: Library for data manipulation and analysis.
NumPy: Library for numerical computations.
Scikit-learn: Library for machine learning algorithms.
Seaborn & Matplotlib: Libraries for data visualization.
Google Colab: Environment for executing Python code and collaborating.

Data Description

The dataset used in this project is train_bene_final_data.xlsx, which includes the following columns:

BeneID: Unique identifier for beneficiaries.
DOB: Date of birth of beneficiaries.
DOD: Date of death of beneficiaries (if applicable).
Gender: Gender of the beneficiary.
Race: Race of the beneficiary.
RenalDiseaseIndicator: Indicator of renal disease.
State: State of residence.
Country: Country of residence.
NoOfMonths_PartACov: Number of months covered by Part A.
NoOfMonths_PartBCov: Number of months covered by Part B.
ChronicCond_*: Indicators for various chronic conditions.
IPAnnualReimbursementAmt: Annual reimbursement amount for inpatient services.
IPAnnualDeductibleAmt: Annual deductible amount for inpatient services.
OPAnnualReimbursementAmt: Annual reimbursement amount for outpatient services.
OPAnnualDeductibleAmt: Annual deductible amount for outpatient services.
Age: Age of the beneficiary.
ChronicConditionCount: Count of chronic conditions.
fraud: Binary target variable flagging a record as fraudulent. It is derived from reimbursement amounts (see Data Cleaning and Transformation).
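
The Age and ChronicConditionCount columns look like derived features. As a minimal sketch of how such columns could be computed, the example below uses a tiny synthetic table. The reference date, the specific ChronicCond_ suffixes, and the 1 = present / 0 = absent encoding are all assumptions made for illustration and are not taken from the project notebooks.

```python
import pandas as pd

# Tiny synthetic sample standing in for train_bene_final_data.xlsx.
# The ChronicCond_ suffixes below are hypothetical.
bene = pd.DataFrame({
    "BeneID": ["B1", "B2", "B3"],
    "DOB": pd.to_datetime(["1940-01-15", "1955-07-01", "1932-11-30"]),
    "ChronicCond_Diabetes": [1, 0, 1],
    "ChronicCond_Cancer": [0, 0, 1],
    "ChronicCond_Stroke": [1, 0, 1],
})

# Assumed reference date for computing age; the source does not state one.
ref = pd.Timestamp("2009-12-31")

# Age in completed years: subtract one if the birthday has not yet
# occurred in the reference year.
birthday_passed = (
    bene["DOB"].dt.month * 100 + bene["DOB"].dt.day
) <= (ref.month * 100 + ref.day)
bene["Age"] = ref.year - bene["DOB"].dt.year - (~birthday_passed).astype(int)

# Count of chronic conditions, assuming 1 = present and 0 = absent.
chronic_cols = [c for c in bene.columns if c.startswith("ChronicCond_")]
bene["ChronicConditionCount"] = bene[chronic_cols].sum(axis=1)

print(bene[["BeneID", "Age", "ChronicConditionCount"]])
```

If the real indicators use a different encoding (for example 1 and 2), they would need to be mapped to 0/1 before summing.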

Data Cleaning and Transformation

Handling Missing Values: Missing values are identified and handled before modeling.
Feature Engineering: A binary target variable, fraud, is derived from reimbursement amounts. Records in the top 5% by reimbursement are labeled fraudulent.
Categorical Encoding: Categorical variables are transformed into dummy variables for modeling.
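
The three steps above can be sketched as follows. This is an illustration on synthetic data, not the notebooks' actual code: the median fill, the use of inpatient plus outpatient reimbursement as the ranking quantity, and the category values are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the beneficiary table (values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "IPAnnualReimbursementAmt": rng.gamma(2.0, 2000.0, n),
    "OPAnnualReimbursementAmt": rng.gamma(2.0, 500.0, n),
    "Gender": rng.choice(["1", "2"], n),
    "RenalDiseaseIndicator": rng.choice(["0", "Y"], n),
})
df.loc[[3, 17], "OPAnnualReimbursementAmt"] = np.nan  # inject gaps

# 1. Missing values: count them, then fill with the column median
#    (the fill strategy is an assumption).
missing_counts = df.isna().sum()
op_median = df["OPAnnualReimbursementAmt"].median()
df["OPAnnualReimbursementAmt"] = df["OPAnnualReimbursementAmt"].fillna(op_median)

# 2. Target: flag the top 5% by total reimbursement as fraud
#    (summing inpatient and outpatient amounts is an assumption).
total = df["IPAnnualReimbursementAmt"] + df["OPAnnualReimbursementAmt"]
threshold = total.quantile(0.95)
df["fraud"] = (total > threshold).astype(int)

# 3. Categorical encoding: one dummy column per non-reference category.
encoded = pd.get_dummies(
    df, columns=["Gender", "RenalDiseaseIndicator"], drop_first=True
)

print(missing_counts)
print(df["fraud"].value_counts())
print(encoded.columns.tolist())
```

Note that a label built this way is a deterministic function of the reimbursement columns, so any model that sees those columns as features can learn the threshold directly.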

Data Modeling

The project employs several classification models to detect fraud:

Logistic Regression: A baseline model to understand the relationships between features and the target variable.
Random Forest: An ensemble model that improves predictive accuracy by combining multiple decision trees.
Gradient Boosting: Another ensemble method that sequentially builds models to improve predictions.
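
A minimal sketch of fitting these three model families with scikit-learn is shown below. It uses a synthetic imbalanced dataset in place of the project data, and the hyperparameters, the scaling step, and the 75/25 split are illustrative assumptions rather than the settings used in DataModeling.ipynb.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 5% positives, mimicking a top-5% label.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

# Stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    # Scaling helps the logistic regression solver converge.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))
```

With a 5% positive rate, accuracy alone is uninformative (always predicting "not fraud" already scores about 95%), which is why the evaluation relies on precision, recall, and AUC-ROC.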

Model Evaluation

Each model's performance is evaluated using:

Classification reports: Precision, recall, and F1-score for each class.
AUC-ROC score: Measures how well the model distinguishes between the two classes.
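
Both metrics are available in scikit-learn. The sketch below computes them for a single logistic regression on synthetic data; the dataset and model settings are assumptions chosen only to make the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the project dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The classification report uses hard predictions at the default threshold.
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred, digits=3, zero_division=0)

# AUC-ROC uses the predicted probability of the positive class,
# so it is independent of any single decision threshold.
y_score = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_score)

print(report)
print(f"AUC-ROC: {auc:.3f}")
```

Computing both on a held-out test split, as here, avoids reporting scores on data the model was trained on.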

Results

Logistic Regression:
- AUC-ROC: 0.994
Random Forest:
- AUC-ROC: 1.0 (perfect separation of the two classes)
Gradient Boosting:
- AUC-ROC: 1.0 (perfect separation of the two classes)

Anomaly Detection

K-Means Clustering: Groups beneficiaries with similar attributes into clusters.
Isolation Forest: Flags records that differ markedly from the rest of the data as anomalies.
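
As a rough illustration of both techniques, the sketch below runs them on synthetic two-column data that loosely imitates inpatient and outpatient reimbursement amounts. The number of clusters, the contamination rate, and the data itself are assumptions and do not come from the project notebooks.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# 300 typical records plus 5 extreme ones (all values are made up).
rng = np.random.default_rng(0)
typical = rng.normal(loc=[2000.0, 500.0], scale=[300.0, 80.0], size=(300, 2))
extreme = rng.normal(loc=[20000.0, 5000.0], scale=[500.0, 100.0], size=(5, 2))
X = np.vstack([typical, extreme])

# Both methods are distance or split based, so put features on one scale.
X_scaled = StandardScaler().fit_transform(X)

# K-Means: assign each record to one of k clusters (k = 3 is an assumption).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_scaled)

# Isolation Forest: fit_predict returns -1 for anomalies and 1 for inliers.
# contamination is the assumed share of anomalies in the data.
iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
anomaly_flags = iso.fit_predict(X_scaled)
n_anomalies = int((anomaly_flags == -1).sum())

print("cluster sizes:", np.bincount(cluster_labels))
print("records flagged as anomalous:", n_anomalies)
```

Unlike the classifiers, neither method uses the fraud label, so flagged records are candidates for review rather than confirmed cases.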

Conclusion

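The models developed in this project score highly against the fraud label, with Random Forest and Gradient Boosting reaching an AUC-ROC of 1.0. Because that label is derived from reimbursement amounts rather than from confirmed fraud cases, these scores show how well the models recover the labeling rule and should be validated against verified fraud outcomes. The analysis of high-risk groups highlights patterns that can inform further investigations and interventions.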

Future Work

Explore advanced deep learning techniques for enhanced fraud detection.
Implement real-time fraud detection systems.
Integrate additional features for a more comprehensive analysis.

Repository Files

.gitignore
DataModeling.ipynb
README.md
dataClean.ipynb
dataTransform.ipynb

Repository: shivi13102/Healthcare-Fraud-Provider-Detection-Analysis