Improve readme, add plots of readme examples' results
AnesBenmerzoug committed Dec 9, 2023
1 parent dc2d8ec commit 7fd4750

**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.

**Data Valuation** is the task of estimating the intrinsic value of a data point
with respect to the training set, the model and a scoring function.

**Influence Functions** compute the effect that individual training points have
on a model or estimator.

# Installation

To install the latest release use:

```shell
$ pip install pyDVL
```

You can also install the latest development version from
[TestPyPI](https://test.pypi.org/project/pyDVL/):

```shell
pip install pyDVL --index-url https://test.pypi.org/simple/
```

pyDVL also has optional extra dependencies for certain functionalities (e.g. influence functions).
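For example, to pull in the optional influence-functions dependencies (assuming the extra is named `influence`; the installation documentation lists the available extras):

```shell
# Install pyDVL with the extra dependencies for influence functions.
# The extra name "influence" is an assumption here; check the
# installation documentation for the authoritative list of extras.
pip install "pyDVL[influence]"
```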

For more instructions and information refer to [Installing pyDVL
](https://pydvl.org/stable/getting-started/installation/) in the
documentation.

# Usage

The following subsections showcase how to use pyDVL for Data Valuation
and Influence Functions with simple examples.

For more instructions and information refer to [Getting
Started](https://pydvl.org/stable/getting-started/first-steps/) in
the documentation.
We provide several examples for data valuation
(e.g. [Shapley Data Valuation](https://pydvl.org/stable/examples/shapley_basic_spotify/))
and for influence functions
(e.g. [Influence Functions for Neural Networks](https://pydvl.org/stable/examples/influence_imagenet/))
with details on the algorithms and their applications.

## Influence Functions

For influence computation, follow these steps:

1. Import the necessary packages (the exact packages depend on your use case).

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from pydvl.reporting.plots import plot_influence_distribution
from pydvl.influence import compute_influences, InversionMethod
from pydvl.influence.torch import TorchTwiceDifferentiable
```

2. Create PyTorch data loaders for your train and test splits.

```python
torch.manual_seed(16)

input_dim = (5, 5, 5)
output_dim = 3

train_data_loader = DataLoader(
    TensorDataset(torch.rand((10, *input_dim)), torch.rand((10, output_dim))),
    batch_size=2,
)
test_data_loader = DataLoader(
    TensorDataset(torch.rand((5, *input_dim)), torch.rand((5, output_dim))),
    batch_size=1,
)
```

3. Instantiate your neural network model.

```python
nn_architecture = nn.Sequential(
    nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
    nn.Flatten(),
    nn.Linear(27, 3),
)
nn_architecture.eval()
```

4. Define your loss:

```python
loss = nn.MSELoss()
```

5. Wrap your model and loss in a `TorchTwiceDifferentiable` object.

```python
model = TorchTwiceDifferentiable(nn_architecture, loss)
```

6. Compute the influences by providing the data loaders and an inversion method.
Using the conjugate gradient algorithm, this looks like:

```python
influences = compute_influences(
    model,
    training_data=train_data_loader,
    test_data=test_data_loader,
    inversion_method=InversionMethod.Cg,
    hessian_regularization=1e-1,
    maxiter=200,
    progress=True,
)
```
The result is a tensor of shape `(training samples, test samples)`
that contains at index `(i, j)` the influence of training sample `i` on
test sample `j`.

7. Visualize the results.

```python
plot_influence_distribution(influences, index=1, title_extra="Example")
```

![Influence Functions Example](docs/assets/influence_functions_example.svg)

The higher the absolute value of the influence of a training sample
on a test sample, the more influential it is for the chosen test sample, model
and data loaders. The sign of the influence determines whether it is
useful (positive) or harmful (negative).
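As a minimal sketch of how such a matrix can be inspected (using a random stand-in tensor rather than real influence values, so the numbers are purely illustrative):

```python
import torch

# Hypothetical stand-in for the output of compute_influences: a
# (n_train, n_test) = (10, 5) matrix with values in [-1, 1).
torch.manual_seed(16)
influences = torch.rand(10, 5) * 2 - 1

# Influence of training sample i on test sample j is influences[i, j].
j = 1
column = influences[:, j]

# Rank training samples by magnitude of influence on test sample j.
ranking = torch.argsort(column.abs(), descending=True)

# Split into useful (positive) and harmful (negative) training points.
useful = torch.nonzero(column > 0).flatten()
harmful = torch.nonzero(column < 0).flatten()
```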

> **Note** pyDVL currently only supports PyTorch for Influence Functions.
> We are planning to add support for JAX and perhaps TensorFlow or even Keras.

## Data Valuation

The steps required to compute data values for your samples are:

1. Import the necessary packages (the exact packages depend on your use case).

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from pydvl.reporting.plots import plot_shapley
from pydvl.utils import Dataset, Scorer, Utility
from pydvl.value import (
    compute_shapley_values,
    ShapleyMode,
    MaxUpdates,
)
```

2. Create a `Dataset` object with your train and test splits.

```python
data = Dataset.from_sklearn(
    load_breast_cancer(),
    train_size=10,
    stratify_by_target=True,
    random_state=16,
)
```

3. Create an instance of a `SupervisedModel` (essentially any scikit-learn
compatible predictor).

```python
model = LogisticRegression()
```

4. Create a `Utility` object to wrap the Dataset, the model and a scoring
function.

```python
u = Utility(
    model,
    data,
    Scorer("accuracy", default=0.0),
)
```

5. Use one of the methods defined in the library to compute the values.
In our example, we use *Permutation Monte Carlo Shapley*,
an approximate method for computing Data Shapley values.

```python
values = compute_shapley_values(
    u,
    mode=ShapleyMode.PermutationMontecarlo,
    done=MaxUpdates(100),
    seed=16,
    progress=True,
)
```
The result is a `ValuationResult` object that contains the indices and
their values, as well as other attributes.
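For instance, one might look up the most and least valuable indices like this (a sketch using NumPy arrays as stand-ins for the result's `indices` and `values` attributes, with made-up numbers rather than real Shapley estimates):

```python
import numpy as np

# Hypothetical stand-ins for ValuationResult.indices / .values.
indices = np.arange(10)
values = np.array([0.02, -0.01, 0.15, 0.07, -0.03, 0.11, 0.0, 0.05, 0.09, -0.02])

# Sort indices from least to most valuable.
order = np.argsort(values)
least_valuable = indices[order[0]]
most_valuable = indices[order[-1]]
```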

6. Convert the valuation result to a dataframe and visualize the values.

```python
df = values.to_dataframe(column="data_value")
plot_shapley(df, title="Data Valuation Example", xlabel="Index", ylabel="Value")
plt.show()
```

![Data Valuation Example Plot](docs/assets/data_valuation_example.svg)

The higher the value for an index, the more important it is for the chosen
model, dataset and scorer.
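One common use of such values is data pruning: drop the lowest-valued training points and retrain. A sketch on the same dataset, with random stand-in numbers instead of real Shapley values:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=16)

# Hypothetical stand-in for computed data values, one per training point.
rng = np.random.default_rng(16)
values = rng.normal(size=len(X_train))

# Keep the 80% most valuable training points and retrain on them only.
keep = np.argsort(values)[int(0.2 * len(values)):]
model = LogisticRegression(max_iter=5000).fit(X_train[keep], y_train[keep])
accuracy = model.score(X_test, y_test)
```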

## Caching

pyDVL can cache certain intermediate results to speed up repeated
computations. It uses [Memcached](https://memcached.org/) for that.

You can run Memcached either locally or using
[Docker](https://www.docker.com/):

```shell
docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest
```

You can read more in the
[documentation](https://pydvl.org/stable/getting-started/first-steps/#caching).

# Contributing

Please open new issues for bugs, feature requests and extensions. You can read
about the structure of the project, the toolchain and workflow in the [guide for
contributions](CONTRIBUTING.md).

# Papers

## Data Valuation

We currently implement the following papers:

- Castro, Javier, Daniel Gómez, and Juan Tejada. [Polynomial Calculation of the
Shapley Value Based on Sampling](https://doi.org/10.1016/j.cor.2008.04.004).
  Computers & Operations Research 36, no. 5 (2009): 1726–30.

## Influence Functions

We currently implement the following papers:

- Koh, Pang Wei, and Percy Liang. [Understanding Black-Box Predictions via
Influence Functions](http://proceedings.mlr.press/v70/koh17a.html). In
  Proceedings of the 34th International Conference on Machine Learning, PMLR 70, 2017.
- Schioppa, Andrea, Polina Zablotskaia, David Vilar, and Artem Sokolov.
  [Scaling Up Influence Functions](http://arxiv.org/abs/2112.03052).
In Proceedings of the AAAI-22. arXiv, 2021.

# License

pyDVL is distributed under the LGPL-3.0 license.