
Commit

Merge pull request #529 from aai-institute/cleanup/parallel-backend
Cleanup Parallel Backend
AnesBenmerzoug authored Mar 26, 2024
2 parents db6dfc2 + dc7f738 commit 3705f0d
Showing 27 changed files with 1,015 additions and 737 deletions.
1,222 changes: 678 additions & 544 deletions .test_durations

Large diffs are not rendered by default.

190 changes: 190 additions & 0 deletions docs/getting-started/advanced-usage.md
@@ -0,0 +1,190 @@
---
title: Advanced usage
alias:
name: advanced-usage
text: Advanced usage
---

# Advanced usage

Besides the dos and don'ts of data valuation itself, which are the subject of
the examples and the documentation of each method, there are two main things to
keep in mind when using pyDVL: parallelization and caching.

## Parallelization { #setting-up-parallelization }

pyDVL uses parallelization to scale and speed up computations. It does so
using one of Dask, Ray or Joblib. The first is used in
the [influence][pydvl.influence] package whereas the other two
are used in the [value][pydvl.value] package.

### Data valuation

For data valuation, pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
parallelization (within one machine) and supports using
[Ray](https://ray.io) for distributed parallelization (across multiple machines).

The former works out of the box, but for the latter you will need to install
additional dependencies (see [Extras][installation-extras])
and provide a running cluster (or run Ray in local mode).

!!! info

    As of v0.9.0 pyDVL does not allow requesting resources per task sent to the
    cluster, so you will need to make sure that each worker has enough resources to
    handle the tasks it receives. A data valuation task using game-theoretic methods
    will typically make a copy of the whole model and dataset to each worker, even
    if the re-training only happens on a subset of the data. This means that you
    should make sure that each worker has enough memory to handle the whole dataset.
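
To make the sizing advice above concrete, here is a back-of-the-envelope sketch.
It assumes, as described in the note, that every worker holds a full copy of the
dataset and the model. The function names and the `overhead` factor are
hypothetical helpers for illustration and are not part of pyDVL.

```python
def min_memory_per_worker(
    dataset_bytes: int, model_bytes: int, overhead: float = 1.5
) -> int:
    """Rough lower bound on the memory each worker needs, in bytes.

    Assumes every worker receives a full copy of the dataset and the model.
    `overhead` is a hypothetical safety factor for temporary arrays created
    during re-training; tune it for your own workload.
    """
    return int((dataset_bytes + model_bytes) * overhead)


def total_cluster_memory(n_workers: int, per_worker_bytes: int) -> int:
    """Memory needed across the cluster when all workers are busy at once."""
    return n_workers * per_worker_bytes


GiB = 1024**3
per_worker = min_memory_per_worker(dataset_bytes=2 * GiB, model_bytes=GiB // 2)
print(f"per worker: {per_worker / GiB:.2f} GiB")
print(f"8 workers:  {total_cluster_memory(8, per_worker) / GiB:.2f} GiB")
```

The point of the estimate is that memory scales with the number of workers, not
with the size of the subsets being re-trained on.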

#### Joblib

Please refer to Joblib's documentation
for all configuration options that you can pass to the
[parallel_config][joblib.parallel_config] context manager.

To compute exact Shapley values using the joblib parallel backend with joblib's
`loky` backend and verbosity set to `100`, you would use:

```python
import joblib
from pydvl.parallel import ParallelConfig
from pydvl.value.shapley import combinatorial_exact_shapley
from pydvl.utils.utility import Utility

config = ParallelConfig(backend="joblib")
u = Utility(...)

with joblib.parallel_config(backend="loky", verbose=100):
    combinatorial_exact_shapley(u, config=config)
```

#### Ray

Please follow the instructions in Ray's documentation to
[set up a remote cluster](https://docs.ray.io/en/latest/cluster/key-concepts.html).
You could alternatively use a local cluster and in that case you don't have to set
anything up.

Before starting a computation, you should initialize Ray by calling
[`ray.init`][ray.init] with the appropriate parameters.

To set up and start a local Ray cluster with 4 CPUs you would use:

```python
import ray

ray.init(num_cpus=4)
```

For a remote Ray cluster you would instead use:

```python
import ray

address = "<Hypothetical Ray Cluster IP Address>"
ray.init(address)
```

To use the Ray parallel backend to compute exact Shapley values you would use:

```python
import ray
from pydvl.parallel import ParallelConfig
from pydvl.value.shapley import combinatorial_exact_shapley
from pydvl.utils.utility import Utility

ray.init()
config = ParallelConfig(backend="ray")
u = Utility(...)
combinatorial_exact_shapley(u, config=config)
```

### Influence functions

Refer to the [[scaling-influence-computation]] page for explanations
about parallelization for influence function computations.

## Caching { #getting-started-cache }

pyDVL can cache (memoize) evaluations of the utility function
to speed up some data valuation computations.
Caching is disabled by default.
When enabled, the cache takes into account the data indices passed as arguments
and the utility function wrapped in the
[Utility][pydvl.utils.utility.Utility] object. This means that
care must be taken when reusing the same utility function with different data;
see the documentation for the [caching package][pydvl.utils.caching] for more
information.
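
The caveat about reusing a utility with different data can be illustrated with a
toy memoizer. This is a minimal sketch and not pyDVL's implementation:
`ToyCachedUtility` is a hypothetical class that keys its cache on the set of
indices only, so it cannot notice when the underlying data changes.

```python
from typing import Callable, Dict, FrozenSet, Iterable


class ToyCachedUtility:
    """Memoizes a utility on the set of data indices only (toy illustration)."""

    def __init__(self, score: Callable[[FrozenSet[int]], float]):
        self._score = score
        self._cache: Dict[FrozenSet[int], float] = {}
        self.misses = 0

    def __call__(self, indices: Iterable[int]) -> float:
        key = frozenset(indices)  # the order of the indices does not matter
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._score(key)
        return self._cache[key]


data = [1.0, 2.0, 3.0]
u = ToyCachedUtility(lambda idx: sum(data[i] for i in idx))

first = u([0, 2])   # computed from the data
second = u([2, 0])  # same index set, served from the cache
data[0] = 100.0     # the underlying data changes...
stale = u([0, 2])   # ...but the previously cached value is returned
print(first, second, stale, u.misses)
```

In this sketch the last call returns a stale value, which is the kind of
surprise the warning above is about.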

In general, caching won't play a major role in the computation of Shapley values
because the probability of sampling the same subset twice, and hence needing
the same utility function computation, is very low. However, it can be very
useful when comparing methods that use the same utility function, or when
running multiple experiments with the same data.
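
To get a feel for how unlikely repeated subsets are, the following sketch uses
the birthday-problem approximation. It assumes subsets are drawn uniformly at
random from all `2**n` subsets of the data, which is a simplification: the
samplers in pyDVL may use other distributions. The function name is
hypothetical.

```python
import math


def prob_repeated_subset(n_points: int, n_samples: int) -> float:
    """Approximate probability that at least two of `n_samples` subsets coincide.

    Uses the birthday-problem approximation 1 - exp(-m(m-1) / (2N)) with
    N = 2**n_points equally likely subsets and m = n_samples.
    """
    n_subsets = 2**n_points
    exponent = -n_samples * (n_samples - 1) / (2 * n_subsets)
    return -math.expm1(exponent)  # numerically stable 1 - exp(exponent)


for n in (10, 30, 100):
    print(n, prob_repeated_subset(n_points=n, n_samples=10_000))
```

Under this assumption, repeats are near certain for a handful of data points
and become vanishingly rare once the dataset has a few dozen points.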

pyDVL supports 3 different caching backends:

- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
an in-memory cache backend that uses a dictionary to store and retrieve
cached values. This is used to share cached values between threads
in a single process.

- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
a disk-based cache backend that uses pickled values written to and read from disk.
This is used to share cached values between processes in a single machine.
- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
a [Memcached](https://memcached.org/)-based cache backend that uses pickled values written to
and read from a Memcached server. This is used to share cached values
between processes across multiple machines.

??? info "Memcached extras"

    The Memcached backend requires optional dependencies.
    See [Extras][installation-extras] for more information.

As an example, here's how one would use the disk-based cache backend
with a utility:

```python
from pydvl.utils.caching.disk import DiskCacheBackend
from pydvl.utils.utility import Utility

cache_backend = DiskCacheBackend()
u = Utility(..., cache_backend=cache_backend)
```

Please refer to the documentation and examples of each backend class for more details.

!!! tip "When is the cache really necessary?"
    Crucially, semi-value computations with the
    [PermutationSampler][pydvl.value.sampler.PermutationSampler] require caching
    to be enabled, or they will take twice as long as the direct implementation
    in [compute_shapley_values][pydvl.value.shapley.compute_shapley_values].
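
One plausible way to see where a factor of two can come from: when marginal
contributions `u(S ∪ {i}) - u(S)` are computed along a permutation, the value
`u(S ∪ {i})` of one step is the `u(S)` of the next. The toy counter below is a
sketch under that assumption and not pyDVL's code; `count_utility_calls` is a
hypothetical function and the utility is a cheap stand-in for a model fit.

```python
import random
from typing import Dict, FrozenSet, List


def count_utility_calls(
    n_points: int, n_permutations: int, use_cache: bool, seed: int = 0
) -> int:
    """Counts utility evaluations needed for permutation-based marginals."""
    rng = random.Random(seed)
    cache: Dict[FrozenSet[int], float] = {}
    calls = 0

    def utility(subset: FrozenSet[int]) -> float:
        nonlocal calls
        if use_cache and subset in cache:
            return cache[subset]
        calls += 1
        value = float(len(subset))  # stand-in for an expensive model fit
        if use_cache:
            cache[subset] = value
        return value

    indices: List[int] = list(range(n_points))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        prefix: FrozenSet[int] = frozenset()
        for i in indices:
            with_i = prefix | {i}
            # Marginal contribution of i: u(prefix + i) - u(prefix).
            # After the first step, u(prefix) was already evaluated as the
            # u(prefix + i) of the previous step.
            _ = utility(with_i) - utility(prefix)
            prefix = with_i
    return calls


print(count_utility_calls(20, 1, use_cache=False))
print(count_utility_calls(20, 1, use_cache=True))
```

For a single permutation of `n` points, the uncached version evaluates the
utility `2n` times, while the cached version needs only `n + 1` evaluations.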

!!! tip "Using the cache"
    Continue reading about the cache in the documentation
    for the [caching package][pydvl.utils.caching].

### Setting up the Memcached cache { #setting-up-memcached }

[Memcached](https://memcached.org/) is an in-memory key-value store accessible
over the network. pyDVL can use it to cache the computation of the utility function
and speed up some computations (in particular, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
may benefit as well).

You can either install it as a package or run it inside a Docker container (the
simplest option). For installation instructions, refer to the [Getting
started](https://github.com/memcached/memcached/wiki#getting-started) section in
memcached's wiki. Then you can run it with:

```shell
memcached -u user
```

To run memcached inside a container in daemon mode instead, use:

```shell
docker container run -d --rm -p 11211:11211 memcached:latest
```
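
After starting the server with either command, a quick check that something is
listening on the default port `11211` can save debugging time later. The helper
below is a hypothetical sketch using only the standard library. It only tests
whether a TCP connection can be opened and does not verify that the listener is
actually memcached.

```python
import socket


def can_connect(host: str = "localhost", port: int = 11211, timeout: float = 1.0) -> bool:
    """Returns True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if not can_connect():
    print("Nothing seems to be listening on localhost:11211")
```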
2 changes: 1 addition & 1 deletion docs/getting-started/applications.md
@@ -23,7 +23,7 @@ comprehensive overview, along with concrete examples, please refer to the
[Transferlab blog post]({{ transferlab.website }}blog/data-valuation-applications/)
on this topic.

## Data Engineering
## Data engineering

Some of the promising applications in data engineering include:

105 changes: 4 additions & 101 deletions docs/getting-started/first-steps.md
@@ -1,11 +1,11 @@
---
title: First Steps
title: First steps
alias:
name: first-steps
text: First Steps
---

# First Steps
# First steps

!!! Warning
Make sure you have read [[getting-started#installation]] before using the library.
@@ -36,102 +36,5 @@ by browsing our worked-out examples illustrating pyDVL's capabilities either:

## Advanced usage

Besides the dos and don'ts of data valuation itself, which are the subject of
the examples and the documentation of each method, there are two main things to
keep in mind when using pyDVL.

### Caching { #getting-started-cache }

PyDVL can cache (memoize) the computation of the utility function
and speed up some computations for data valuation.
It is however disabled by default.
When it is enabled it takes into account the data indices passed as argument
and the utility function wrapped into the
[Utility][pydvl.utils.utility.Utility] object. This means that
care must be taken when reusing the same utility function with different data,
see the documentation for the [caching package][pydvl.utils.caching] for more
information.

In general, caching won't play a major role in the computation of Shapley values
because the probability of sampling the same subset twice, and hence needing
the same utility function computation, is very low. However, it can be very
useful when comparing methods that use the same utility function, or when
running multiple experiments with the same data.

pyDVL supports 3 different caching backends:

- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
an in-memory cache backend that uses a dictionary to store and retrieve
cached values. This is used to share cached values between threads
in a single process.
- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
a disk-based cache backend that uses pickled values written to and read from disk.
This is used to share cached values between processes in a single machine.
- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
a [Memcached](https://memcached.org/)-based cache backend that uses pickled
values written to and read from a Memcached server. This is used to share
cached values between processes across multiple machines. Note that this
backend requires optional dependencies, see [Extras][installation-extras].

!!! tip "When is the cache really necessary?"
Crucially, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] require caching
to be enabled, or they will take twice as long as the direct implementation
in [compute_shapley_values][pydvl.value.shapley.compute_shapley_values].

!!! tip "Using the cache"
Continue reading about the cache in the documentation
for the [caching package][pydvl.utils.caching].

#### Setting up the Memcached cache { #setting-up-memcached }

[Memcached](https://memcached.org/) is an in-memory key-value store accessible
over the network. pyDVL can use it to cache the computation of the utility function
and speed up some computations (in particular, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
may benefit as well).

You can either install it as a package or run it inside a docker container (the
simplest). For installation instructions, refer to the [Getting
started](https://github.com/memcached/memcached/wiki#getting-started) section in
memcached's wiki. Then you can run it with:

```shell
memcached -u user
```

To run memcached inside a container in daemon mode instead, use:

```shell
docker container run -d --rm -p 11211:11211 memcached:latest
```

### Parallelization { #setting-up-parallelization }

pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
parallelization (within one machine) and supports using
[Ray](https://ray.io) for distributed parallelization (across multiple machines).

The former works out of the box but for the latter you will need to install
additional dependencies (see [Extras][installation-extras] )
and to provide a running cluster (or run ray in local mode).

As of v0.8.1 pyDVL does not allow requesting resources per task sent to the
cluster, so you will need to make sure that each worker has enough resources to
handle the tasks it receives. A data valuation task using game-theoretic methods
will typically make a copy of the whole model and dataset to each worker, even
if the re-training only happens on a subset of the data. This means that you
should make sure that each worker has enough memory to handle the whole dataset.

#### Ray

Please follow the instructions in Ray's documentation to set up a cluster.
Once you have a running cluster, you can use it by passing the address
of the head node to parallel methods via [ParallelConfig][pydvl.parallel.config.ParallelConfig].

For a local ray cluster you would use:

```python
from pydvl.parallel.config import ParallelConfig
config = ParallelConfig(backend="ray")
```
Refer to the [[advanced-usage]] page for explanations on how to enable
and use parallelization and caching.
6 changes: 3 additions & 3 deletions docs/getting-started/index.md
@@ -1,8 +1,8 @@
---
title: Getting Started
title: Getting started
alias:
name: getting-started
title: Getting Started
title: Getting started
---

# Getting started
@@ -104,4 +104,4 @@ pip install pyDVL[memcached]

This installs [pymemcache](https://github.com/pinterest/pymemcache)
additionally. Be aware that you still have to start a memcached server manually.
See [Setting up the Memcached cache](first-steps/#setting-up-memcached).
See [Setting up the Memcached cache][setting-up-memcached].
4 changes: 2 additions & 2 deletions docs/getting-started/methods.md
@@ -7,7 +7,7 @@ alias:

We currently implement the following methods:

## Data Valuation
## Data valuation

- [**LOO**][pydvl.value.loo.compute_loo].

@@ -44,7 +44,7 @@ We currently implement the following methods:
- [**Data-OOB**][pydvl.value.oob.compute_data_oob]
[@kwon_dataoob_2023].

## Influence Functions
## Influence functions

- [**CG Influence**][pydvl.influence.torch.CgInfluence].
[@koh_understanding_2017].
7 changes: 7 additions & 0 deletions docs/influence/scaling_computation.md
@@ -1,3 +1,10 @@
---
title: Scaling computation
alias:
name: scaling-influence-computation
text: Scaling computation
---

The implementations of [InfluenceFunctionModel][pydvl.influence.base_influence_function_model.InfluenceFunctionModel]
provide a convenient way to calculate influences for
in memory tensors.
2 changes: 1 addition & 1 deletion docs/value/index.md
@@ -224,7 +224,7 @@ from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.datasets import load_iris

dataset = Dataset.from_sklearn(load_iris())
u = Utility(LogisticRegression(), dataset, enable_cache=False)
u = Utility(LogisticRegression(), dataset)
training_budget = 3
wrapped_u = DataUtilityLearning(u, training_budget, LinearRegression())

3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -14,6 +14,7 @@ nav:
- Applications: getting-started/applications.md
- Benchmarking: getting-started/benchmarking.md
- Methods: getting-started/methods.md
- Advanced usage: getting-started/advanced-usage.md
- Glossary: getting-started/glossary.md
- Data Valuation:
- value/index.md
@@ -74,7 +75,6 @@ plugins:
canonical_version: stable
- section-index
- alias:
use_relative_link: true
verbose: true
- gen-files:
scripts:
@@ -110,6 +110,7 @@ plugins:
- https://joblib.readthedocs.io/en/stable/objects.inv
- https://docs.dask.org/en/latest/objects.inv
- https://distributed.dask.org/en/latest/objects.inv
- https://docs.ray.io/en/latest/objects.inv
paths: [ src ] # search packages in the src folder
options:
heading_level: 1