diff --git a/README.md b/README.md
index 57bf56d33..101516860 100644
--- a/README.md
+++ b/README.md
@@ -30,11 +30,242 @@

-pyDVL collects algorithms for Data Valuation and Influence Function computation.
+**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.
 
-Data Valuation is the task of estimating the intrinsic value of a data point
-wrt. the training set, the model and a scoring function. We currently implement
-methods from the following papers:
+**Data Valuation** is the task of estimating the intrinsic value of a data point
+w.r.t. the training set, the model and a scoring function.
+
+**Influence Functions** compute the effect that single points have on an estimator /
+model.
+
+# Installation
+
+To install the latest release use:
+
+```shell
+$ pip install pyDVL
+```
+
+You can also install the latest development version from
+[TestPyPI](https://test.pypi.org/project/pyDVL/):
+
+```shell
+pip install pyDVL --index-url https://test.pypi.org/simple/
+```
+
+pyDVL also has extra dependencies for certain functionalities (e.g. influence functions).
+
+For more instructions and information refer to
+[Installing pyDVL](https://pydvl.org/stable/getting-started/installation/) in the
+documentation.
+
+# Usage
+
+In the following subsections, we will showcase the usage of pyDVL
+for Data Valuation and Influence Functions using simple examples.
+
+For more instructions and information refer to [Getting
+Started](https://pydvl.org/stable/getting-started/first-steps/) in
+the documentation.
+We provide several examples for data valuation
+(e.g. [Shapley Data Valuation](https://pydvl.org/stable/examples/shapley_basic_spotify/))
+and for influence functions
+(e.g. [Influence Functions for Neural Networks](https://pydvl.org/stable/examples/influence_imagenet/))
+with details on the algorithms and their applications.
+
+## Influence Functions
+
+For influence computation, follow these steps:
+
+1. Import the necessary packages (the exact packages depend on your specific use case).
+
+   ```python
+   import torch
+   from torch import nn
+   from torch.utils.data import DataLoader, TensorDataset
+   from pydvl.reporting.plots import plot_influence_distribution
+   from pydvl.influence import compute_influences, InversionMethod
+   from pydvl.influence.torch import TorchTwiceDifferentiable
+   ```
+
+2. Create PyTorch data loaders for your train and test splits.
+
+   ```python
+   torch.manual_seed(16)
+
+   input_dim = (5, 5, 5)
+   output_dim = 3
+
+   train_data_loader = DataLoader(
+       TensorDataset(torch.rand((10, *input_dim)), torch.rand((10, output_dim))),
+       batch_size=2,
+   )
+   test_data_loader = DataLoader(
+       TensorDataset(torch.rand((5, *input_dim)), torch.rand((5, output_dim))),
+       batch_size=1,
+   )
+   ```
+
+3. Instantiate your neural network model.
+
+   ```python
+   nn_architecture = nn.Sequential(
+       nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
+       nn.Flatten(),
+       nn.Linear(27, 3),
+   )
+   nn_architecture.eval()
+   ```
+
+4. Define your loss:
+
+   ```python
+   loss = nn.MSELoss()
+   ```
+
+5. Wrap your model and loss in a `TorchTwiceDifferentiable` object.
+
+   ```python
+   model = TorchTwiceDifferentiable(nn_architecture, loss)
+   ```
+
+6. Compute influence factors by providing training data and inversion method.
+   Using the conjugate gradient algorithm, this would look like:
+
+   ```python
+   influences = compute_influences(
+       model,
+       training_data=train_data_loader,
+       test_data=test_data_loader,
+       inversion_method=InversionMethod.Cg,
+       hessian_regularization=1e-1,
+       maxiter=200,
+       progress=True,
+   )
+   ```
+   The result is a tensor of shape `(training samples x test samples)`
+   that contains at index `(i, j)` the influence of training sample `i` on
+   test sample `j`.
+
+7. Visualize the results.
+
+   ```python
+   plot_influence_distribution(influences, index=1, title_extra="Example")
+   ```
+
+   ![Influence Functions Example](docs/assets/influence_functions_example.svg)
+
+   The higher the absolute value of the influence of a training sample
+   on a test sample, the more influential it is for the chosen test sample, model
+   and data loaders. The sign of the influence determines whether it is
+   useful (positive) or harmful (negative).
+
+> **Note** pyDVL currently only supports PyTorch for Influence Functions.
+> We are planning to add support for JAX and perhaps TensorFlow or even Keras.
+
+## Data Valuation
+
+The steps required to compute data values for your samples are:
+
+1. Import the necessary packages (the exact packages depend on your specific use case).
+
+   ```python
+   import matplotlib.pyplot as plt
+   from sklearn.datasets import load_breast_cancer
+   from sklearn.linear_model import LogisticRegression
+   from pydvl.reporting.plots import plot_shapley
+   from pydvl.utils import Dataset, Scorer, Utility
+   from pydvl.value import (
+       compute_shapley_values,
+       ShapleyMode,
+       MaxUpdates,
+   )
+   ```
+
+2. Create a `Dataset` object with your train and test splits.
+
+   ```python
+   data = Dataset.from_sklearn(
+       load_breast_cancer(),
+       train_size=10,
+       stratify_by_target=True,
+       random_state=16,
+   )
+   ```
+
+3. Create an instance of a `SupervisedModel` (basically any sklearn-compatible
+   predictor).
+
+   ```python
+   model = LogisticRegression()
+   ```
+
+4. Create a `Utility` object to wrap the Dataset, the model and a scoring
+   function.
+
+   ```python
+   u = Utility(
+       model,
+       data,
+       Scorer("accuracy", default=0.0)
+   )
+   ```
+
+5. Use one of the methods defined in the library to compute the values.
+   In our example, we will use *Permutation Montecarlo Shapley*,
+   an approximate method for computing Data Shapley values.
+
+   ```python
+   values = compute_shapley_values(
+       u,
+       mode=ShapleyMode.PermutationMontecarlo,
+       done=MaxUpdates(100),
+       seed=16,
+       progress=True
+   )
+   ```
+   The result is a variable of type `ValuationResult` that contains
+   the indices and their values as well as other attributes.
+
+6. Convert the valuation result to a dataframe and visualize the values.
+
+   ```python
+   df = values.to_dataframe(column="data_value")
+   plot_shapley(df, title="Data Valuation Example", xlabel="Index", ylabel="Value")
+   plt.show()
+   ```
+
+   ![Data Valuation Example Plot](docs/assets/data_valuation_example.svg)
+
+   The higher the value for an index, the more important it is for the chosen
+   model, dataset and scorer.
+
+## Caching
+
+pyDVL can cache certain results to speed up computation. It uses
+[Memcached](https://memcached.org/) for that.
+
+You can run it either locally or using
+[Docker](https://www.docker.com/):
+
+```shell
+docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest
+```
+
+You can read more in the
+[documentation](https://pydvl.org/stable/getting-started/first-steps/#caching).
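+
+As a quick, minimal check that the Memcached server started above is reachable,
+you can talk to it directly with the `pymemcache` client. This sketch assumes
+`pymemcache` is installed and does not use pyDVL's own caching API:
+
+```python
+from pymemcache.client.base import Client
+
+# Connect to the Memcached instance started with the Docker command above.
+client = Client(("localhost", 11211))
+
+# Store and read back a value to verify the round trip works.
+client.set("pydvl-cache-check", b"ok")
+assert client.get("pydvl-cache-check") == b"ok"
+```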
+
+# Contributing
+
+Please open new issues for bugs, feature requests and extensions. You can read
+about the structure of the project, the toolchain and workflow in the [guide for
+contributions](CONTRIBUTING.md).
+
+# Papers
+
+## Data Valuation
+
+We currently implement the following papers:
 
 - Castro, Javier, Daniel Gómez, and Juan Tejada.
   [Polynomial Calculation of the Shapley Value Based on Sampling](https://doi.org/10.1016/j.cor.2008.04.004).
@@ -80,8 +311,9 @@ methods from the following papers:
   Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS).
   New Orleans, Louisiana, USA, 2022.
 
-Influence Functions compute the effect that single points have on an estimator /
-model. We implement methods from the following papers:
+## Influence Functions
+
+We currently implement the following papers:
 
 - Koh, Pang Wei, and Percy Liang.
   [Understanding Black-Box Predictions via Influence Functions](http://proceedings.mlr.press/v70/koh17a.html). In
@@ -94,132 +326,6 @@ model. We implement methods from the following papers:
   [Scaling Up Influence Functions](http://arxiv.org/abs/2112.03052). In
   Proceedings of the AAAI-22. arXiv, 2021.
 
-# Installation
-
-To install the latest release use:
-
-```shell
-$ pip install pyDVL
-```
-
-You can also install the latest development version from
-[TestPyPI](https://test.pypi.org/project/pyDVL/):
-
-```shell
-pip install pyDVL --index-url https://test.pypi.org/simple/
-```
-
-For more instructions and information refer to [Installing pyDVL
-](https://pydvl.org/stable/getting-started/installation/) in the
-documentation.
-
-# Usage
-
-### Influence Functions
-
-For influence computation, follow these steps:
-
-1. Wrap your model and loss in a `TorchTwiceDifferentiable` object
-2. Compute influence factors by providing training data and inversion method
-
-Using the conjugate gradient algorithm, this would look like:
-```python
-import torch
-from torch import nn
-from torch.utils.data import DataLoader, TensorDataset
-
-from pydvl.influence import TorchTwiceDifferentiable, compute_influences, InversionMethod
-
-nn_architecture = nn.Sequential(
-    nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
-    nn.Flatten(),
-    nn.Linear(27, 3),
-)
-loss = nn.MSELoss()
-model = TorchTwiceDifferentiable(nn_architecture, loss)
-
-input_dim = (5, 5, 5)
-output_dim = 3
-
-train_data_loader = DataLoader(
-    TensorDataset(torch.rand((10, *input_dim)), torch.rand((10, output_dim))),
-    batch_size=2,
-)
-test_data_loader = DataLoader(
-    TensorDataset(torch.rand((5, *input_dim)), torch.rand((5, output_dim))),
-    batch_size=1,
-)
-
-influences = compute_influences(
-    model,
-    training_data=train_data_loader,
-    test_data=test_data_loader,
-    progress=True,
-    inversion_method=InversionMethod.Cg,
-    hessian_regularization=1e-1,
-    maxiter=200,
-)
-```
-
-
-### Shapley Values
-The steps required to compute values for your samples are:
-
-1. Create a `Dataset` object with your train and test splits.
-2. Create an instance of a `SupervisedModel` (basically any sklearn compatible
-   predictor)
-3. Create a `Utility` object to wrap the Dataset, the model and a scoring
-   function.
-4. Use one of the methods defined in the library to compute the values.
-
-This is how it looks for *Truncated Montecarlo Shapley*, an efficient method for
-Data Shapley values:
-
-```python
-from sklearn.datasets import load_breast_cancer
-from sklearn.linear_model import LogisticRegression
-from pydvl.value import *
-
-data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
-model = LogisticRegression()
-u = Utility(model, data, Scorer("accuracy", default=0.0))
-values = compute_shapley_values(
-    u,
-    mode=ShapleyMode.TruncatedMontecarlo,
-    done=MaxUpdates(100) | AbsoluteStandardError(threshold=0.01),
-    truncation=RelativeTruncation(u, rtol=0.01),
-)
-```
-
-For more instructions and information refer to [Getting
-Started](https://pydvl.org/stable/getting-started/first-steps/) in
-the documentation. We provide several examples for data valuation
-(e.g. [Shapley Data Valuation](https://pydvl.org/stable/examples/shapley_basic_spotify/))
-and for influence functions
-(e.g. [Influence Functions for Neural Networks](https://pydvl.org/stable/examples/influence_imagenet/))
-with details on the algorithms and their applications.
-
-## Caching
-
-pyDVL offers the possibility to cache certain results and
-speed up computation. It uses [Memcached](https://memcached.org/) For that.
-
-You can run it either locally or, using
-[Docker](https://www.docker.com/):
-
-```shell
-docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest
-```
-
-You can read more in the
-[documentation](https://pydvl.org/stable/getting-started/first-steps/#caching).
-
-# Contributing
-
-Please open new issues for bugs, feature requests and extensions. You can read
-about the structure of the project, the toolchain and workflow in the [guide for
-contributions](CONTRIBUTING.md).
-
 # License
 
 pyDVL is distributed under
diff --git a/docs/assets/data_valuation_example.svg b/docs/assets/data_valuation_example.svg
new file mode 100644
index 000000000..21c0f885d
--- /dev/null
+++ b/docs/assets/data_valuation_example.svg
@@ -0,0 +1,876 @@
+[SVG image data omitted: Matplotlib v3.7.2 figure, created 2023-12-09T21:47:38]
diff --git a/docs/assets/influence_functions_example.svg b/docs/assets/influence_functions_example.svg
new file mode 100644
index 000000000..a7040e1c3
--- /dev/null
+++ b/docs/assets/influence_functions_example.svg
@@ -0,0 +1,993 @@
+[SVG image data omitted: Matplotlib v3.7.2 figure, created 2023-12-09T22:22:51]