This code is a machine learning model that uses various classifiers to predict whether a transaction is genuine based on input features.
To run this code, you need to install Python 3.5 or above and the following packages:
- numpy: A Python library for numerical computing
- pandas: A Python library for data manipulation
- matplotlib: A Python library for data visualization
- seaborn: A Python library for statistical data visualization
- scikit-learn: A machine learning library for Python
You can install the required packages using pip by running the following command:
```
pip install numpy pandas matplotlib seaborn scikit-learn
```
The code reads in training and test datasets and applies various classifiers (Random Forest, Gradient Boosting, SVM, and Logistic Regression) to predict the verdict based on input features. The predictions are then saved to a CSV file for submission.
To use this code, you need to follow the steps below:
- Ensure that the required packages are installed (see Installation section).
- Download the training and test datasets (in CSV format) and save them in the "data" directory with the names "train.csv" and "test.csv", respectively.
- Run the code.
- The code loads the data, prints the numerical variables in the data, and plots a histogram for each of them.
- It then splits the data into train and test sets and trains different models, such as Random Forest, Gradient Boosting, SVM, and Logistic Regression, and calculates their accuracy scores.
- Finally, it selects the best-performing model (Random Forest), predicts the verdict for the test data, and saves the results to a file named predictions.csv (a minimal sketch of this workflow follows the list).
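Put together, the workflow might look roughly like the sketch below. This is a minimal, illustrative version rather than the exact source: the data/train.csv and data/test.csv paths and the VERDICT target column come from the descriptions in this README, while the hyperparameters, the validation split, and the absence of preprocessing are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the datasets from the "data" directory (paths as described above).
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")  # assumed to contain only feature columns

# Separate the features from the binary target column.
X = train.drop(columns=["VERDICT"])
y = train["VERDICT"]

# Hold out part of the training data to compare model accuracies.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train each candidate classifier and report its validation accuracy.
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "SVM": SVC(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_val, model.predict(X_val)))

# Predict with the best-performing model (Random Forest) and save the submission.
best = models["Random Forest"]
pd.DataFrame({"VERDICT": best.predict(test)}).to_csv("predictions.csv", index=False)
```

Holding out a validation split keeps the accuracy comparison honest; the model that scores best there (Random Forest, in this case) is the one used for the final submission.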
Random Forest is a popular ensemble learning method for classification, regression, and other tasks. It is an extension of decision trees, where multiple decision trees are trained on randomly selected subsets of the training data and features.
Here's how the Random Forest Classifier works (a from-scratch sketch follows this list):
- Selecting random samples from a given dataset.
- Building a decision tree for each sample and getting a prediction result from each decision tree.
- Performing a vote for each predicted result.
- Selecting the prediction result with the most votes as the final prediction.
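To make those four steps concrete, here is a hedged, from-scratch sketch on synthetic data. It is illustrative only; the tree count and sample sizes are arbitrary choices, not anything taken from the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Steps 1-2: draw random bootstrap samples and fit one decision tree per sample.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Step 3: collect one prediction per tree for every input row.
votes = np.array([tree.predict(X) for tree in trees])  # shape: (n_trees, n_samples)

# Step 4: the final prediction is the majority vote across trees.
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the hand-rolled forest:", (majority == y).mean())
```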
The key concept behind Random Forest is to combine the outputs of multiple decision trees to create a more accurate and stable prediction. This helps to reduce overfitting and improve generalization performance.
During training, the Random Forest algorithm creates multiple decision trees by randomly selecting subsets of data and features. The number of trees and the size of the subsets are hyperparameters that can be tuned for optimal performance.
During prediction, each decision tree in the Random Forest makes a prediction, and the class with the most votes across all trees is selected as the final prediction.
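In scikit-learn these ideas map onto concrete parameters: n_estimators controls the number of trees, while max_samples and max_features control the subset sizes; the fitted trees are exposed via estimators_, so the vote can be inspected directly. One caveat: scikit-learn's RandomForestClassifier actually averages per-tree class probabilities rather than counting hard votes, though the two usually agree. A small sketch on synthetic data (all parameter values here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The tunable hyperparameters described above.
rf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features="sqrt",  # features considered at each split
    max_samples=0.8,      # fraction of rows drawn for each bootstrap sample
    random_state=0,
).fit(X, y)

# Collect each tree's hard vote for the first five rows.
votes = np.array(
    [rf.classes_[tree.predict(X[:5]).astype(int)] for tree in rf.estimators_]
)
print("per-class vote counts:", [np.bincount(col) for col in votes.T])
print("forest prediction:    ", rf.predict(X[:5]))
```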
The Random Forest Classifier is based on two key concepts: bagging and random feature selection.
Bagging is a technique that involves sampling the training data with replacement to create multiple subsets of the data. The decision tree is then trained on each of these subsets, and the final prediction is obtained by combining the predictions of all decision trees. Bagging helps to reduce overfitting and improve the stability of the model.
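As a quick illustration of why bagging diversifies the trees: a bootstrap sample of size n drawn with replacement leaves out roughly (1 - 1/n)^n ≈ 1/e ≈ 37% of the original rows, so each tree sees a different ~63% of the data. A small numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Draw one bootstrap sample: n row indices, sampled with replacement.
sample = rng.integers(0, n, size=n)

# The fraction of distinct rows that made it into the sample is about 63%.
print(len(np.unique(sample)) / n)  # ~0.632, i.e. 1 - 1/e
```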
Random feature selection is a technique that involves randomly selecting a subset of features for each decision tree. This helps to reduce the correlation between decision trees and ensures that each decision tree makes a different set of decisions.
The math behind the Random Forest Classifier algorithm involves constructing decision trees and aggregating their predictions using majority voting. Each decision tree is constructed recursively by selecting the best feature to split the data at each node based on a metric such as information gain or Gini impurity. The process continues until the data is fully partitioned, or a stopping criterion is met. The final prediction is obtained by aggregating the predictions of all decision trees using majority voting.
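For a node whose classes occur with proportions p_1, ..., p_k, the Gini impurity is 1 - (p_1^2 + ... + p_k^2), and a candidate split is scored by how much it reduces the size-weighted impurity of the two children. A minimal sketch of that computation:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(parent, left, right):
    """Impurity decrease from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1])
print(gini(parent))                                # 0.5 (maximally mixed)
print(split_gain(parent, parent[:3], parent[3:]))  # 0.5 (a perfect split)
```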
The dataset used in this code contains various transaction features, such as PARAMETER_1, PARAMETER_2, PARAMETER_3, etc. It also includes a binary target variable "VERDICT" indicating whether the transaction was genuine: ‘1’ stands for a genuine transaction, while ‘0’ stands for a fraudulent one.
Finally, the predictions were made using the Random Forest Classifier, which achieved an accuracy of 94.8734%.