
README for the Beyond the Buzz recruitment task submission

This code trains several machine learning classifiers to predict whether a transaction is genuine or fraudulent based on its input features.

Installation

To run this code, you need to install Python 3.5 or above and the following packages:

  • numpy: A Python library for numerical computing
  • pandas: A Python library for data manipulation
  • matplotlib: A Python library for data visualization
  • seaborn: A Python library for statistical data visualization
  • scikit-learn: A machine learning library for Python

You can install the required packages using pip by running the following command:

    pip install numpy pandas matplotlib seaborn scikit-learn

Usage

The code reads in the training and test datasets and applies several classifiers (Random Forest, Gradient Boosting, SVM, and Logistic Regression) to predict the verdict from the input features. The predictions are then saved to a CSV file for submission.

To use this code, you need to follow the steps below:

  1. Ensure that the required packages are installed (see Installation section).
  2. Download the training and test datasets (in CSV format) and save them in the "data" directory with the names "train.csv" and "test.csv", respectively.
  3. Run the code.

Working

  1. The code loads the data, lists the numerical variables, and plots a histogram for each of them.
  2. It then splits the training data into train and validation sets, trains several models (Random Forest, Gradient Boosting, SVM, and Logistic Regression), and calculates their accuracy scores.
  3. Finally, it selects the best-performing model (Random Forest), predicts the verdict for the test data, and saves the predictions to a file named predictions.csv, as sketched below.
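
The sketch below is a minimal reconstruction of this pipeline, not the submitted script: it assumes the target column is named VERDICT (see the Dataset section) and that test.csv contains exactly the feature columns; the actual code may differ in preprocessing and model settings.

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    train = pd.read_csv("data/train.csv")
    test = pd.read_csv("data/test.csv")

    # Step 1: list the numerical variables and plot a histogram for each.
    num_cols = train.select_dtypes("number").columns
    print(list(num_cols))
    train[num_cols].hist(figsize=(12, 8))
    plt.show()

    # Step 2: hold out part of the training data and compare the classifiers.
    X = train.drop(columns=["VERDICT"])
    y = train["VERDICT"]
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    models = {
        "Random Forest": RandomForestClassifier(random_state=42),
        "Gradient Boosting": GradientBoostingClassifier(random_state=42),
        "SVM": SVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, accuracy_score(y_val, model.predict(X_val)))

    # Step 3: Random Forest performed best, so use it for the final predictions.
    best = models["Random Forest"]
    pd.DataFrame({"VERDICT": best.predict(test)}).to_csv("predictions.csv", index=False)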

Random Forest Classifier

Random Forest is a popular ensemble learning method for classification, regression, and other tasks. It is an extension of decision trees, where multiple decision trees are trained on randomly selected subsets of the training data and features.

Here's how Random Forest Classifier works:

  1. Selecting random samples from a given dataset.
  2. Building a decision tree for each sample and getting a prediction result from each decision tree.
  3. Performing a vote for each predicted result.
  4. Selecting the prediction result with the most votes as the final prediction.

The key concept behind Random Forest is to combine the outputs of multiple decision trees to create a more accurate and stable prediction. This helps to reduce overfitting and improve generalization performance.

During training, the Random Forest algorithm creates multiple decision trees by randomly selecting subsets of data and features. The number of trees and the size of the subsets are hyperparameters that can be tuned for optimal performance.
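
For instance, both hyperparameters can be tuned with scikit-learn's GridSearchCV; the grid values below are purely illustrative and are not taken from the submission.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Grid over the two hyperparameters named above: the number of trees
    # and the size of the feature subset considered at each split.
    param_grid = {
        "n_estimators": [100, 300, 500],
        "max_features": ["sqrt", "log2"],
    }
    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)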

During prediction, each decision tree in the Random Forest makes a prediction, and the class with the most votes across all trees is selected as the final prediction.
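
This voting can be observed directly in scikit-learn, where the fitted trees are exposed via estimators_. Below is a self-contained sketch on synthetic data (note that scikit-learn actually averages the trees' class probabilities rather than counting hard votes, which for typical trees gives the same result).

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, n_features=8, random_state=0)
    forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

    sample = X[:1]
    # Each tree casts one vote for the sample...
    votes = np.array([tree.predict(sample)[0] for tree in forest.estimators_])
    print("votes per class:", np.bincount(votes.astype(int)))
    # ...and the forest returns the class with the most votes.
    print("forest prediction:", forest.predict(sample)[0])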

The Random Forest Classifier is based on two key concepts: bagging and random feature selection.

Bagging (Bootstrap Aggregating):

Bagging is a technique that involves sampling the training data with replacement to create multiple subsets of the data. The decision tree is then trained on each of these subsets, and the final prediction is obtained by combining the predictions of all decision trees. Bagging helps to reduce overfitting and improve the stability of the model.
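
As a toy illustration (NumPy only, the numbers are arbitrary): a bootstrap sample draws n rows with replacement, so some rows repeat while others are left out entirely.

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples = 10

    # Draw row indices with replacement: duplicates are expected, and the
    # omitted rows form the "out-of-bag" set for that tree.
    boot = rng.choice(n_samples, size=n_samples, replace=True)
    oob = np.setdiff1d(np.arange(n_samples), boot)
    print("bootstrap indices:", sorted(boot))
    print("out-of-bag indices:", oob)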

Random Feature Selection:

Random feature selection is a technique that involves randomly selecting a subset of features for each decision tree. This helps to reduce the correlation between decision trees and ensures that each decision tree makes a different set of decisions.
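
In scikit-learn this is controlled by the max_features hyperparameter (a common default for classification is the square root of the number of features, drawn afresh at each split rather than once per tree). A toy sketch of drawing one such subset:

    import numpy as np

    rng = np.random.default_rng(0)
    n_features = 16

    # Consider only sqrt(n_features) randomly chosen features for this split.
    k = int(np.sqrt(n_features))
    feature_subset = rng.choice(n_features, size=k, replace=False)
    print("features considered:", feature_subset)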

The math behind the Random Forest Classifier involves constructing decision trees and aggregating their predictions by majority voting. Each decision tree is built recursively: at each node, the feature split that best separates the data is chosen according to a metric such as information gain or Gini impurity, and splitting continues until the data is fully partitioned or a stopping criterion is met.
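
For concreteness, the Gini impurity of a node with class proportions p_i is 1 - sum_i p_i^2, and a candidate split is scored by the weighted impurity of its children. A minimal sketch (the example labels are arbitrary):

    import numpy as np

    def gini(labels):
        """Gini impurity: 1 - sum of squared class proportions."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def split_score(left, right):
        """Weighted Gini impurity of a split's two children (lower is better)."""
        n = len(left) + len(right)
        return len(left) / n * gini(left) + len(right) / n * gini(right)

    print(gini([1, 1, 1, 1]))                 # pure node -> 0.0
    print(gini([0, 1, 0, 1]))                 # 50/50 node -> 0.5
    print(split_score([0, 0, 0], [1, 1, 0]))  # ~0.222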

Dataset

The dataset used in this code contains various features of transactions, such as PARAMETER_1, PARAMETER_2, PARAMETER_3, etc. It also includes a binary target variable, VERDICT, indicating whether the transaction was genuine: '1' stands for a genuine transaction, while '0' stands for a fraudulent one.

Finally, the predictions were made using the Random Forest Classifier, which achieved an accuracy of 94.8734%.
