From 09824044ded74d42c80fc3dd040d9e3937fa2469 Mon Sep 17 00:00:00 2001
From: Claudio Salvatore Arcidiacono <22871978+ClaudioSalvatoreArcidiacono@users.noreply.github.com>
Date: Wed, 26 Jul 2023 16:00:19 +0200
Subject: [PATCH] Improve documentation

---
 README.md               |  3 +--
 docs/reference/RFE.md   |  2 ++
 docs/reference/drift.md |  3 +++
 felimination/drift.py   | 29 ++++++++++++++++++++++++++++-
 mkdocs.yml              |  2 ++
 5 files changed, 36 insertions(+), 3 deletions(-)
 create mode 100644 docs/reference/drift.md

diff --git a/README.md b/README.md
index 2e8dd5a..0911421 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,12 @@
 [![pytest](https://github.com/ClaudioSalvatoreArcidiacono/felimination/workflows/Tests/badge.svg)](https://github.com/ClaudioSalvatoreArcidiacono/felimination/actions?query=workflow%3A%22Tests%22)
 [![PyPI](https://img.shields.io/pypi/v/felimination)](#)
+[![documentation](https://img.shields.io/badge/docs-mkdocs%20material-blue.svg?style=flat)](https://claudiosalvatorearcidiacono.github.io/felimination/)
 
 # felimination
 
 This library contains some useful scikit-learn compatible classes for feature selection.
 
-## [Check out documentation here](https://claudiosalvatorearcidiacono.github.io/felimination/)
-
 ## Features
 
 - [Recursive Feature Elimination with Cross Validation using Permutation Importance](reference/RFE.md#felimination.rfe.PermutationImportanceRFECV)
diff --git a/docs/reference/RFE.md b/docs/reference/RFE.md
index 1f0734c..61733d8 100644
--- a/docs/reference/RFE.md
+++ b/docs/reference/RFE.md
@@ -1 +1,3 @@
 ::: felimination.rfe
+    options:
+      inherited_members: true
diff --git a/docs/reference/drift.md b/docs/reference/drift.md
new file mode 100644
index 0000000..4915bc1
--- /dev/null
+++ b/docs/reference/drift.md
@@ -0,0 +1,3 @@
+::: felimination.drift
+    options:
+      inherited_members: true
diff --git a/felimination/drift.py b/felimination/drift.py
index 67c84e2..f90f7be 100644
--- a/felimination/drift.py
+++ b/felimination/drift.py
@@ -1,4 +1,31 @@
-"""Module with tools to perform drift-based feature selection.
+"""The idea behind this module comes from the conjunction of two concepts:
+
+- [1] [Classifier Two-Sample Test](https://arxiv.org/abs/1610.06545)
+- [2] [Recursive Feature Elimination](\
+https://scikit-learn.org/stable/modules/generated/\
+sklearn.feature_selection.RFE.html)
+
+In [1], classifier performance is used to determine how similar two samples are. More
+specifically, imagine having two samples: `reference` and `test`. In order to assess
+whether `reference` and `test` have been drawn from the same distribution, we could
+train a classifier to predict which sample each instance belongs to. If the
+model easily distinguishes instances from the two samples, then the two samples
+have probably been drawn from two different distributions. Conversely, if the
+classifier struggles to distinguish them, then it is likely that the samples have
+been drawn from the same distribution.
+
+In the context of drift detection, the classifier two-sample test can be used to
+assess whether drift has occurred between the reference and the test set, and to
+what degree.
+
+The classes of this module take this idea one step further and attempt
+to reduce the drift using recursive feature selection. After a classifier
+is trained to distinguish between `reference` and `test`, the feature
+importance of the classifier is used to determine which features contribute
+the most to distinguishing between the two sets. The most important features
+are then eliminated and the procedure is repeated until the classifier is no
+longer able to distinguish between the two samples, or until a certain number
+of features has been removed.
 
 This module contains the following classes:
 - `SampleSimilarityDriftRFE`: base class for drift-based sample similarity
diff --git a/mkdocs.yml b/mkdocs.yml
index 395a650..b438819 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -29,6 +29,8 @@ plugins:
         python:
           options:
             docstring_style: numpy
+          import:
+            - https://scikit-learn.org/stable/objects.inv
 
 markdown_extensions:
   - pymdownx.highlight:
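
Note on the new `felimination/drift.py` docstring: the procedure it describes (train a classifier to separate `reference` from `test`, drop the most important feature, repeat) can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the code or API of `felimination`: the function names, the `auc_threshold` and `min_features` parameters, and the use of a diagonal-LDA-style linear scorer with a validation AUC are all hypothetical stand-ins for whatever classifier and importance measure the library actually uses.

```python
import random
from statistics import mean, pvariance


def fit_diagonal_lda(ref_rows, test_rows, features):
    """Return {feature: (weight, importance)} for a simple linear scorer.

    Stand-in classifier (an assumption, not the library's model): each feature
    is weighted by its mean shift divided by the pooled variance.
    """
    model = {}
    for j in features:
        ref_col = [row[j] for row in ref_rows]
        test_col = [row[j] for row in test_rows]
        pooled_var = (pvariance(ref_col) + pvariance(test_col)) / 2 or 1e-12
        shift = mean(test_col) - mean(ref_col)
        # Importance is the standardized mean shift of the feature.
        model[j] = (shift / pooled_var, abs(shift) / pooled_var ** 0.5)
    return model


def auc(ref_scores, test_scores):
    """Fraction of (test, reference) pairs ranked correctly; ties count half."""
    wins = sum((t > r) + 0.5 * (t == r) for t in test_scores for r in ref_scores)
    return wins / (len(ref_scores) * len(test_scores))


def drift_rfe_sketch(reference, test, auc_threshold=0.6, min_features=1):
    """Drop the most drifting feature until the two samples look alike."""
    # Alternate rows between a training half and a validation half.
    ref_train, ref_val = reference[::2], reference[1::2]
    test_train, test_val = test[::2], test[1::2]
    kept = list(range(len(reference[0])))
    dropped = []
    while True:
        model = fit_diagonal_lda(ref_train, test_train, kept)

        def score(row):
            return sum(model[j][0] * row[j] for j in kept)

        val_auc = auc([score(r) for r in ref_val], [score(t) for t in test_val])
        # Stop when the classifier can no longer tell the samples apart,
        # or when the minimum number of features is reached.
        if val_auc <= auc_threshold or len(kept) <= min_features:
            return kept, dropped, val_auc
        most_drifting = max(kept, key=lambda j: model[j][1])
        kept.remove(most_drifting)
        dropped.append(most_drifting)


# Synthetic data: four Gaussian features, only feature 2 is shifted in `test`.
rng = random.Random(0)
reference_sample = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(400)]
test_sample = [
    [rng.gauss(3 if j == 2 else 0, 1) for j in range(4)] for _ in range(400)
]
kept, dropped, final_auc = drift_rfe_sketch(reference_sample, test_sample)
print("kept:", kept, "dropped:", dropped, "validation AUC:", round(final_auc, 3))
```

In this toy setup the shifted feature is the first one to be eliminated. A real use would more likely rely on a scikit-learn estimator with cross-validation rather than the single train/validation split assumed here.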