Improve readme, add plots of readme examples' results
AnesBenmerzoug committed Dec 9, 2023
1 parent dc2d8ec commit 7fd4750

**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.

**Data Valuation** is the task of estimating the intrinsic value of a data point
with respect to the training set, the model and a scoring function.

**Influence Functions** compute the effect that individual training points have
on a model or estimator.

# Installation

To install the latest release use:

```shell
$ pip install pyDVL
```

You can also install the latest development version from
[TestPyPI](https://test.pypi.org/project/pyDVL/):

```shell
pip install pyDVL --index-url https://test.pypi.org/simple/
```

pyDVL also has optional extra dependencies for certain functionalities (e.g. influence functions).
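For example, to pull in the optional influence-functions dependencies (assuming the extra is named `influence`; the installation documentation lists the available extras):

```shell
# Install pyDVL with the extra dependencies for influence functions.
# The extra name "influence" is an assumption here; check the
# installation documentation for the authoritative list of extras.
pip install "pyDVL[influence]"
```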

For more instructions and information refer to [Installing pyDVL
](https://pydvl.org/stable/getting-started/installation/) in the
documentation.

# Usage

The following subsections showcase how to use pyDVL for Data Valuation
and Influence Functions with simple examples.

For more instructions and information refer to [Getting
Started](https://pydvl.org/stable/getting-started/first-steps/) in
the documentation.
We provide several examples for data valuation
(e.g. [Shapley Data Valuation](https://pydvl.org/stable/examples/shapley_basic_spotify/))
and for influence functions
(e.g. [Influence Functions for Neural Networks](https://pydvl.org/stable/examples/influence_imagenet/))
with details on the algorithms and their applications.

## Influence Functions

For influence computation, follow these steps:

1. Import the necessary packages (the exact packages depend on your use case).

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from pydvl.reporting.plots import plot_influence_distribution
from pydvl.influence import compute_influences, InversionMethod
from pydvl.influence.torch import TorchTwiceDifferentiable
```

2. Create PyTorch data loaders for your train and test splits.

```python
torch.manual_seed(16)

input_dim = (5, 5, 5)
output_dim = 3

train_data_loader = DataLoader(
    TensorDataset(torch.rand((10, *input_dim)), torch.rand((10, output_dim))),
    batch_size=2,
)
test_data_loader = DataLoader(
    TensorDataset(torch.rand((5, *input_dim)), torch.rand((5, output_dim))),
    batch_size=1,
)
```

3. Instantiate your neural network model.

```python
nn_architecture = nn.Sequential(
    nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
    nn.Flatten(),
    nn.Linear(27, 3),
)
nn_architecture.eval()
```

4. Define your loss:

```python
loss = nn.MSELoss()
```

5. Wrap your model and loss in a `TorchTwiceDifferentiable` object.

```python
model = TorchTwiceDifferentiable(nn_architecture, loss)
```

6. Compute the influences by providing the data loaders and an inversion method.
Using the conjugate gradient algorithm, this looks like:

```python
influences = compute_influences(
    model,
    training_data=train_data_loader,
    test_data=test_data_loader,
    inversion_method=InversionMethod.Cg,
    hessian_regularization=1e-1,
    maxiter=200,
    progress=True,
)
```
The result is a tensor of shape `(training samples, test samples)`
that contains at index `(i, j)` the influence of training sample `i` on
test sample `j`.

7. Visualize the results.

```python
plot_influence_distribution(influences, index=1, title_extra="Example")
```

![Influence Functions Example](docs/assets/influence_functions_example.svg)

The higher the absolute value of the influence of a training sample
on a test sample, the more influential it is for the chosen test sample, model
and data loaders. The sign of the influence determines whether it is
useful (positive) or harmful (negative).
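As a minimal sketch of how such a matrix can be inspected (using a random stand-in tensor rather than real influence values, so the numbers are purely illustrative):

```python
import torch

# Hypothetical stand-in for the output of compute_influences: a
# (n_train, n_test) = (10, 5) matrix with values in [-1, 1).
torch.manual_seed(16)
influences = torch.rand(10, 5) * 2 - 1

# Influence of training sample i on test sample j is influences[i, j].
j = 1
column = influences[:, j]

# Rank training samples by magnitude of influence on test sample j.
ranking = torch.argsort(column.abs(), descending=True)

# Split into useful (positive) and harmful (negative) training points.
useful = torch.nonzero(column > 0).flatten()
harmful = torch.nonzero(column < 0).flatten()
```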

> **Note** pyDVL currently only supports PyTorch for Influence Functions.
> We are planning to add support for JAX and perhaps TensorFlow or even Keras.

## Data Valuation

The steps required to compute data values for your samples are:

1. Import the necessary packages (the exact packages depend on your use case).

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from pydvl.reporting.plots import plot_shapley
from pydvl.utils import Dataset, Scorer, Utility
from pydvl.value import (
    compute_shapley_values,
    ShapleyMode,
    MaxUpdates,
)
```

2. Create a `Dataset` object with your train and test splits.

```python
data = Dataset.from_sklearn(
    load_breast_cancer(),
    train_size=10,
    stratify_by_target=True,
    random_state=16,
)
```

3. Create an instance of a `SupervisedModel` (essentially any scikit-learn
compatible predictor).

```python
model = LogisticRegression()
```

4. Create a `Utility` object to wrap the Dataset, the model and a scoring
function.

```python
u = Utility(
    model,
    data,
    Scorer("accuracy", default=0.0),
)
```

5. Use one of the methods defined in the library to compute the values.
In our example, we use *Permutation Monte Carlo Shapley*,
an approximate method for computing Data Shapley values.

```python
values = compute_shapley_values(
    u,
    mode=ShapleyMode.PermutationMontecarlo,
    done=MaxUpdates(100),
    seed=16,
    progress=True,
)
```
The result is a `ValuationResult` object that contains the indices and
their values, as well as other attributes.
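For instance, one might look up the most and least valuable indices like this (a sketch using NumPy arrays as stand-ins for the result's `indices` and `values` attributes, with made-up numbers rather than real Shapley estimates):

```python
import numpy as np

# Hypothetical stand-ins for ValuationResult.indices / .values.
indices = np.arange(10)
values = np.array([0.02, -0.01, 0.15, 0.07, -0.03, 0.11, 0.0, 0.05, 0.09, -0.02])

# Sort indices from least to most valuable.
order = np.argsort(values)
least_valuable = indices[order[0]]
most_valuable = indices[order[-1]]
```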

6. Convert the valuation result to a dataframe and visualize the values.

```python
df = values.to_dataframe(column="data_value")
plot_shapley(df, title="Data Valuation Example", xlabel="Index", ylabel="Value")
plt.show()
```

![Data Valuation Example Plot](docs/assets/data_valuation_example.svg)

The higher the value for an index, the more important it is for the chosen
model, dataset and scorer.
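One common use of such values is data pruning: drop the lowest-valued training points and retrain. A sketch on the same dataset, with random stand-in numbers instead of real Shapley values:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=16)

# Hypothetical stand-in for computed data values, one per training point.
rng = np.random.default_rng(16)
values = rng.normal(size=len(X_train))

# Keep the 80% most valuable training points and retrain on them only.
keep = np.argsort(values)[int(0.2 * len(values)):]
model = LogisticRegression(max_iter=5000).fit(X_train[keep], y_train[keep])
accuracy = model.score(X_test, y_test)
```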

## Caching

pyDVL can cache certain intermediate results to speed up repeated
computations. It uses [Memcached](https://memcached.org/) for that.

You can run Memcached either locally or using
[Docker](https://www.docker.com/):

```shell
docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest
```

You can read more in the
[documentation](https://pydvl.org/stable/getting-started/first-steps/#caching).

# Contributing

Please open new issues for bugs, feature requests and extensions. You can read
about the structure of the project, the toolchain and workflow in the [guide for
contributions](CONTRIBUTING.md).

# Papers

## Data Valuation

We currently implement the following papers:

- Castro, Javier, Daniel Gómez, and Juan Tejada. [Polynomial Calculation of the
Shapley Value Based on Sampling](https://doi.org/10.1016/j.cor.2008.04.004).
  Computers & Operations Research 36, no. 5 (2009): 1726–30.

## Influence Functions

We currently implement the following papers:

- Koh, Pang Wei, and Percy Liang. [Understanding Black-Box Predictions via
Influence Functions](http://proceedings.mlr.press/v70/koh17a.html). In
  Proceedings of the 34th International Conference on Machine Learning, PMLR 70, 2017.
- Schioppa, Andrea, Polina Zablotskaia, David Vilar, and Artem Sokolov.
  [Scaling Up Influence Functions](http://arxiv.org/abs/2112.03052).
In Proceedings of the AAAI-22. arXiv, 2021.

# License

pyDVL is distributed under the LGPL-3.0 license.