sachinshrmaa edited this page Mar 22, 2020 · 2 revisions

Introduction

In today’s world, we are on the express train to a cashless society. It’s expected that in future years there will be a steady growth of non-cash transactions.

While this is exciting, fraudulent transactions are on the rise as well. Even with EMV smart chips in use, a very large amount of money is still lost to credit card fraud.

What can we do to reduce the risk? There are many methods to limit losses and prevent fraud; here we walk you through our process and show you our findings. To address this problem, we are building a “Credit Card Fraud Analysis Model using Machine Learning”. The data comes from a Kaggle dataset that contains about 285,000 rows and 31 columns.

Problem Definition

Credit card fraud is an inclusive term for fraud committed using a payment card, such as a credit card or debit card. The purpose may be to obtain goods or services or to make payments to another account which is controlled by a criminal.

There are two kinds of card fraud: card-present fraud (not so common nowadays) and card-not-present fraud (more common). The compromise can occur in a number of ways and often happens without the knowledge of the cardholder. The internet has made database security lapses particularly costly; in some cases, millions of accounts have been compromised.

Solution Strategy

A machine learning module will be programmed to determine whether a given transaction is fraudulent or not.

This module will use anomaly detection algorithms such as the Isolation Forest algorithm and the Local Outlier Factor (LOF) algorithm.

The programming language used for this module will be Python, together with open-source libraries such as scikit-learn.
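As a rough sketch of the "fraud or not" decision, the snippet below converts the labels that scikit-learn's outlier detectors return (+1 for inliers, -1 for outliers) into 0/1 fraud labels and compares them with known labels. The helper name `to_fraud_labels` and the six example transactions are hypothetical and only serve as an illustration; they are not taken from the project's dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def to_fraud_labels(outlier_predictions):
    """Map scikit-learn outlier labels (+1 inlier, -1 outlier) to 0/1 fraud labels."""
    predictions = np.asarray(outlier_predictions)
    return np.where(predictions == -1, 1, 0)


# Hypothetical detector output and ground truth for six transactions.
raw_predictions = [1, 1, -1, 1, -1, 1]
true_labels = np.array([0, 0, 1, 0, 0, 1])

fraud_predictions = to_fraud_labels(raw_predictions)

# Number of transactions where the prediction disagrees with the true label.
n_errors = int((fraud_predictions != true_labels).sum())

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
matrix = confusion_matrix(true_labels, fraud_predictions, labels=[0, 1])

print("predicted fraud labels:", fraud_predictions)
print("number of errors:", n_errors)
print(matrix)
```

Because fraud is rare, the confusion matrix is usually more informative than plain accuracy: a model that never flags anything can still score a very high accuracy.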

Proposed Methodology

The algorithms we are going to use for anomaly detection on this dataset are as follows:

Isolation Forest Algorithm

Isolation Forest is a comparatively recent technique for detecting anomalies. The algorithm is based on the fact that anomalies are data points that are few and different. As a result of these properties, anomalies are susceptible to a mechanism called isolation: when the data is split repeatedly at random, an anomaly is separated from the rest after fewer splits than a normal point, so a short path in the tree indicates an anomaly. This differs fundamentally from most earlier methods, which rely on basic distance and density measures, and it can be a more effective and efficient means to detect anomalies. Moreover, the algorithm has linear time complexity with a low constant and a small memory requirement. It builds a well-performing model with a small number of trees using small sub-samples of fixed size, regardless of the size of the data set.
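A minimal sketch of the idea with scikit-learn's `IsolationForest` is shown below. It runs on synthetic two-dimensional data rather than the Kaggle dataset, and the parameter values (100 trees, sub-samples of 256 points, the contamination rate) are assumptions chosen for illustration, not the project's tuned settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Synthetic stand-in for transaction features: 500 ordinary points near the
# origin plus 10 anomalies placed far away from them.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
anomalies = rng.uniform(low=8.0, high=10.0, size=(10, 2))
X = np.vstack([normal, anomalies])

model = IsolationForest(
    n_estimators=100,           # number of isolation trees
    max_samples=256,            # fixed sub-sample size drawn for each tree
    contamination=10 / len(X),  # assumed share of anomalies in the data
    random_state=42,
)

labels = model.fit_predict(X)        # +1 = inlier, -1 = outlier
scores = model.decision_function(X)  # lower score = more anomalous

print("points flagged as outliers:", int((labels == -1).sum()))
print("mean score of ordinary points:", scores[:500].mean())
print("mean score of injected anomalies:", scores[-10:].mean())
```

On the real dataset, the contamination value would more naturally be estimated from the observed share of fraudulent transactions.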

Local Outlier Factor (LOF) Algorithm

The LOF algorithm is an unsupervised outlier detection method that computes the local density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a substantially lower density than their neighbors.

The number of neighbors considered (parameter n_neighbors) is typically chosen 1) greater than the minimum number of objects a cluster has to contain, so that other objects can be local outliers relative to this cluster, and 2) smaller than the maximum number of close-by objects that can potentially be local outliers. In practice, such information is generally not available, and taking n_neighbors = 20 appears to work well in general.
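The following sketch applies scikit-learn's `LocalOutlierFactor` with n_neighbors = 20 to synthetic data: one dense cluster plus five scattered points. The data and the contamination value are assumptions made for illustration only.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)

# Synthetic stand-in: a dense cluster of 300 points plus 5 scattered points.
cluster = rng.normal(loc=0.0, scale=0.5, size=(300, 2))
scattered = np.array(
    [[6.0, 6.0], [-7.0, 5.0], [8.0, -6.0], [-6.0, -8.0], [10.0, 0.0]]
)
X = np.vstack([cluster, scattered])

lof = LocalOutlierFactor(
    n_neighbors=20,            # neighborhood size discussed above
    contamination=5 / len(X),  # assumed share of outliers in the data
)

labels = lof.fit_predict(X)  # +1 = inlier, -1 = outlier

# scikit-learn stores the negated LOF; flip the sign so that values near 1
# mean "as dense as the neighbors" and larger values mean "more isolated".
lof_scores = -lof.negative_outlier_factor_

print("points flagged as outliers:", int((labels == -1).sum()))
print("median LOF inside the cluster:", np.median(lof_scores[:300]))
print("smallest LOF among scattered points:", lof_scores[-5:].min())
```

In this default mode, LOF only scores the data it was fitted on through `fit_predict`; scoring unseen transactions requires constructing the estimator with `novelty=True`.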

Software Requirements

Jupyter Notebook

Project Jupyter is a non-profit organization created to "develop open-source software, open-standards, and services for interactive computing across dozens of programming languages".

Spun off from IPython in 2014 by Fernando Pérez, Project Jupyter supports execution environments in several dozen languages. Project Jupyter's name is a reference to the three core programming languages supported by Jupyter, which are Julia, Python, and R, and also a homage to Galileo's notebooks recording the discovery of the moons of Jupiter.

Project Jupyter has developed and supported the interactive computing products Jupyter Notebook, JupyterHub, and JupyterLab, the next-generation version of Jupyter Notebook.

Scikit-learn

Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

The scikit-learn project started as scikits.learn, a Google Summer of Code project by David Cournapeau. Its name stems from the notion that it is a "SciKit" (SciPy Toolkit), a separately developed and distributed third-party extension to SciPy. The original codebase was later rewritten by other developers. In 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel, all from the French Institute for Research in Computer Science and Automation in Rocquencourt, France, took leadership of the project and made the first public release on 1 February 2010. Of the various scikits, scikit-learn and scikit-image were described as "well-maintained and popular" in November 2012. Scikit-learn is one of the most popular machine learning libraries on GitHub.

Kaggle

Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and artificial intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was the founding chair, succeeded by Max Levchin.

Python

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.

Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.

Flask

Flask is a micro web framework written in Python. It is classified as a microframework because it does not require particular tools or libraries. It has no database abstraction layer, form validation, or any other components where pre-existing third-party libraries provide common functions.

However, Flask supports extensions that can add application features as if they were implemented in Flask itself. Extensions exist for object-relational mappers, form validation, upload handling, various open authentication technologies and several common framework-related tools. Extensions are updated far more frequently than the core Flask program.
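This page does not describe how Flask is used in the project, so the following is only a sketch under one assumption: that a trained detector is exposed through a small HTTP endpoint. The route name `/predict`, the JSON field `features`, the factory `create_app`, and the stand-in class `AmountThresholdModel` are all hypothetical. Any object with a scikit-learn style `predict()` method could be passed in instead.

```python
from flask import Flask, jsonify, request


def create_app(model):
    """Build a Flask app around any object exposing a scikit-learn style predict()."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expect a JSON body such as {"features": [2500.0, 0.3]}.
        payload = request.get_json(silent=True) or {}
        features = payload.get("features")
        if not isinstance(features, list) or not features:
            return jsonify(error="'features' must be a non-empty list"), 400
        # scikit-learn outlier detectors return -1 for outliers and +1 for inliers.
        label = model.predict([features])[0]
        return jsonify(fraud=bool(label == -1))

    return app


class AmountThresholdModel:
    """Hypothetical stand-in: flags a row when its first feature exceeds 1000."""

    def predict(self, rows):
        return [-1 if row[0] > 1000 else 1 for row in rows]


app = create_app(AmountThresholdModel())

# Exercise the endpoint in-process with Flask's built-in test client.
client = app.test_client()
response = client.post("/predict", json={"features": [2500.0, 0.3]})
print(response.get_json())
```

Passing the model into a factory function keeps the web layer independent of the detector, so the stand-in can be swapped for a fitted scikit-learn estimator without touching the route.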

Conclusion

Finding fraudulent credit card transactions is really important, especially in today’s society. There are many methods to capture these instances; the one presented here uses machine learning in Python, with the Isolation Forest and Local Outlier Factor algorithms.