Merge pull request #529 from aai-institute/cleanup/parallel-backend

Cleanup Parallel Backend
aai-institute · Mar 26, 2024 · 3705f0d · 3705f0d
2 parents db6dfc2 + dc7f738
commit 3705f0d
Showing 27 changed files with 1,015 additions and 737 deletions.
diff --git a/.test_durations b/.test_durations
diff --git a/docs/getting-started/advanced-usage.md b/docs/getting-started/advanced-usage.md
@@ -0,0 +1,190 @@
+---
+title: Advanced usage
+alias: 
+  name: advanced-usage
+  text: Advanced usage
+---
+
+# Advanced usage
+
+Besides the dos and don'ts of data valuation itself, which are the subject of
+the examples and the documentation of each method, there are two main things to
+keep in mind when using pyDVL: parallelization and caching.
+
+## Parallelization { #setting-up-parallelization }
+
+pyDVL uses parallelization to scale and speed up computations. It does so
+using one of Dask, Ray or Joblib. Dask is used in
+the [influence][pydvl.influence] package, whereas Ray and Joblib
+are used in the [value][pydvl.value] package.
+
+### Data valuation
+
+For data valuation, pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
+parallelization (within one machine) and supports using
+[Ray](https://ray.io) for distributed parallelization (across multiple machines).
+
+Joblib works out of the box. For Ray you will need to install
+additional dependencies (see [Extras][installation-extras])
+and to provide a running cluster (or run Ray in local mode).
+
+!!! info
+
+    As of v0.9.0 pyDVL does not allow requesting resources per task sent to the
+    cluster, so you will need to make sure that each worker has enough resources to
+    handle the tasks it receives. A data valuation task using game-theoretic methods
+    will typically copy the whole model and dataset to each worker, even
+    if the re-training only happens on a subset of the data. This means that
+    each worker needs enough memory to hold the whole dataset.
+
+#### Joblib
+
+Please refer to Joblib's documentation
+for all possible configuration options that you can pass to the
+[parallel_config][joblib.parallel_config] context manager.
+
+To compute exact Shapley values using pyDVL's joblib parallel backend, with
+joblib's `loky` backend and verbosity set to `100`, you would use:
+
+```python
+import joblib
+from pydvl.parallel import ParallelConfig
+from pydvl.value.shapley import combinatorial_exact_shapley
+from pydvl.utils.utility import Utility
+
+config = ParallelConfig(backend="joblib") 
+u = Utility(...)
+
+with joblib.parallel_config(backend="loky", verbose=100):
+    combinatorial_exact_shapley(u, config=config)
+```
+
+#### Ray
+
+Please follow the instructions in Ray's documentation to
+[set up a remote cluster](https://docs.ray.io/en/latest/cluster/key-concepts.html).
+Alternatively, you can use a local cluster, in which case you don't have to set
+anything up.
+
+Before starting a computation, you should initialize Ray by calling
+[`ray.init`][ray.init] with the appropriate parameters.
+
+To set up and start a local Ray cluster with 4 CPUs you would use:
+
+```python
+import ray
+
+ray.init(num_cpus=4)
+```
+
+For a remote Ray cluster you would instead use:
+
+```python
+import ray
+
+address = "<Hypothetical Ray Cluster IP Address>"
+ray.init(address)
+```
+
+To compute exact Shapley values with the Ray parallel backend you would use:
+
+```python
+import ray
+from pydvl.parallel import ParallelConfig
+from pydvl.value.shapley import combinatorial_exact_shapley
+from pydvl.utils.utility import Utility
+
+ray.init()
+config = ParallelConfig(backend="ray")
+u = Utility(...)
+combinatorial_exact_shapley(u, config=config)
+```
+
+### Influence functions
+
+Refer to the [[scaling-influence-computation]] page for explanations
+about parallelization for influence function computations.
+
+## Caching { #getting-started-cache }
+
+pyDVL can cache (memoize) the computation of the utility function
+and speed up some computations for data valuation.
+Caching is, however, disabled by default.
+When enabled, it takes into account the data indices passed as argument
+and the utility function wrapped into the
+[Utility][pydvl.utils.utility.Utility] object. This means that
+care must be taken when reusing the same utility function with different data.
+See the documentation for the [caching package][pydvl.utils.caching] for more
+information.
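The keying described above, and the pitfall it creates, can be illustrated with a minimal sketch. `MemoizedUtility` is a hypothetical stand-in written only for this note and is not pyDVL's implementation: it assumes one cache per utility object and a key built from the set of indices alone.

```python
from typing import Callable, Dict, FrozenSet, Iterable


class MemoizedUtility:
    """Hypothetical cached utility, keyed only on the set of data indices."""

    def __init__(self, score: Callable[[FrozenSet[int]], float]):
        self.score = score
        self.cache: Dict[FrozenSet[int], float] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, indices: Iterable[int]) -> float:
        key = frozenset(indices)  # the order of the indices does not matter
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.score(key)
        self.cache[key] = value
        return value


data = [1.0, 2.0, 4.0, 8.0]
u = MemoizedUtility(lambda idx: sum(data[i] for i in idx))

first = u([0, 2])   # computed from the data
second = u([2, 0])  # same subset, served from the cache

# Pitfall: the key contains no information about the data itself, so
# swapping the data behind the same utility returns a stale value.
data = [10.0, 20.0, 40.0, 80.0]
stale = u([0, 2])
```

After the data is replaced, `stale` still holds the value computed on the old data, which is why a cache should not be shared across datasets unless the key distinguishes them.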
+
+In general, caching won't play a major role in the computation of Shapley values
+because the probability of sampling the same subset twice, and hence needing
+the same utility function computation, is very low. However, it can be very
+useful when comparing methods that use the same utility function, or when
+running multiple experiments with the same data.
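How low the probability of a repeated subset is depends on the dataset size. The sketch below uses the birthday approximation and assumes, purely for illustration, that subsets are drawn uniformly from all `2**n_points` possibilities. Actual samplers are not uniform, so the numbers are only indicative, and `repeat_probability` is a hypothetical helper, not part of pyDVL.

```python
import math


def repeat_probability(n_points: int, n_samples: int) -> float:
    """Approximate probability that at least two of `n_samples` uniformly
    drawn subsets of `n_points` items coincide (birthday approximation)."""
    n_subsets = 2.0 ** n_points
    pairs = n_samples * (n_samples - 1) / 2
    # 1 - exp(-pairs / n_subsets), computed accurately for tiny arguments
    return -math.expm1(-pairs / n_subsets)


for n in (10, 20, 50):
    print(n, repeat_probability(n, n_samples=1000))
```

Under this assumption, repeats are common for very small datasets and become negligible once the dataset has more than a few dozen points.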
+
+pyDVL supports three different caching backends:
+
+- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
+  an in-memory cache backend that uses a dictionary to store and retrieve
+  cached values. This is used to share cached values between threads
+  in a single process.
+
+- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
+  a disk-based cache backend that uses pickled values written to and read from disk.
+  This is used to share cached values between processes in a single machine.
+
+- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
+  a [Memcached](https://memcached.org/)-based cache backend that uses pickled values written to
+  and read from a Memcached server. This is used to share cached values
+  between processes across multiple machines.
+
+    ??? info "Memcached extras"
+
+         The Memcached backend requires optional dependencies.
+         See [Extras][installation-extras] for more information.
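To make the idea of sharing pickled values through the file system concrete, here is a minimal sketch. `TinyDiskCache` is a hypothetical class written for this note, not pyDVL's `DiskCacheBackend`, and the string keys are an assumption chosen for readability.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


class TinyDiskCache:
    """Hypothetical pickle-on-disk cache, one file per key."""

    def __init__(self, folder: Path):
        self.folder = folder
        self.folder.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Hash the key so that it is always a valid file name.
        return self.folder / hashlib.sha256(key.encode()).hexdigest()

    def get(self, key: str) -> Optional[Any]:
        path = self._path(key)
        if not path.exists():
            return None
        with path.open("rb") as f:
            return pickle.load(f)

    def set(self, key: str, value: Any) -> None:
        with self._path(key).open("wb") as f:
            pickle.dump(value, f)


folder = Path(tempfile.mkdtemp())
writer = TinyDiskCache(folder)
writer.set("indices=0,2", 0.93)

# A second object, for example one created in another process, sees the
# same entries as long as it points to the same folder.
reader = TinyDiskCache(folder)
cached = reader.get("indices=0,2")
missing = reader.get("indices=1,3")
```

The sketch ignores concurrent writers. A cache meant for real multi-process use would additionally need atomic writes or locking.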
+
+As an example, here's how one would use the disk-based cache backend
+with a utility:
+
+```python
+from pydvl.utils.caching.disk import DiskCacheBackend
+from pydvl.utils.utility import Utility
+
+cache_backend = DiskCacheBackend()
+u = Utility(..., cache_backend=cache_backend)
+```
+
+Please refer to the documentation and examples of each backend class for more details.
+
+!!! tip "When is the cache really necessary?"
+    Crucially, semi-value computations with the
+    [PermutationSampler][pydvl.value.sampler.PermutationSampler] require caching
+    to be enabled, or they will take twice as long as the direct implementation
+    in [compute_shapley_values][pydvl.value.shapley.compute_shapley_values].
+
+!!! tip "Using the cache"
+    Continue reading about the cache in the documentation
+    for the [caching package][pydvl.utils.caching].
+
+### Setting up the Memcached cache { #setting-up-memcached }
+
+[Memcached](https://memcached.org/) is an in-memory key-value store accessible
+over the network. pyDVL can use it to cache the computation of the utility function
+and speed up some computations (in particular, semi-value computations with the
+[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
+may benefit as well).
+
+You can either install it as a package or run it inside a Docker container (the
+simplest option). For installation instructions, refer to the [Getting
+started](https://github.com/memcached/memcached/wiki#getting-started) section in
+memcached's wiki. Then you can run it with:
+
+```shell
+memcached -u user
+```
+
+To run memcached inside a container in daemon mode instead, use:
+
+```shell
+docker container run -d --rm -p 11211:11211 memcached:latest
+```
diff --git a/docs/getting-started/applications.md b/docs/getting-started/applications.md
@@ -23,7 +23,7 @@ comprehensive overview, along with concrete examples, please refer to the
 [Transferlab blog post]({{ transferlab.website }}blog/data-valuation-applications/)
 on this topic.
 
-## Data Engineering
+## Data engineering
 
 Some of the promising applications in data engineering include:
 

diff --git a/docs/getting-started/first-steps.md b/docs/getting-started/first-steps.md
@@ -1,11 +1,11 @@
 ---
-title: First Steps
+title: First steps
 alias: 
   name: first-steps
   text: First Steps
 ---
 
-# First Steps
+# First steps
 
 !!! Warning
     Make sure you have read [[getting-started#installation]] before using the library. 
@@ -36,102 +36,5 @@ by browsing our worked-out examples illustrating pyDVL's capabilities either:
 
 ## Advanced usage
 
-Besides the dos and don'ts of data valuation itself, which are the subject of
-the examples and the documentation of each method, there are two main things to
-keep in mind when using pyDVL.
-
-### Caching { #getting-started-cache }
-
-PyDVL can cache (memoize) the computation of the utility function
-and speed up some computations for data valuation.
-It is however disabled by default.
-When it is enabled it takes into account the data indices passed as argument
-and the utility function wrapped into the
-[Utility][pydvl.utils.utility.Utility] object. This means that
-care must be taken when reusing the same utility function with different data,
-see the documentation for the [caching package][pydvl.utils.caching] for more
-information.
-
-In general, caching won't play a major role in the computation of Shapley values
-because the probability of sampling the same subset twice, and hence needing
-the same utility function computation, is very low. However, it can be very
-useful when comparing methods that use the same utility function, or when
-running multiple experiments with the same data.
-
-pyDVL supports 3 different caching backends:
-
-- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
-  an in-memory cache backend that uses a dictionary to store and retrieve
-  cached values. This is used to share cached values between threads
-  in a single process.
-- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
-  a disk-based cache backend that uses pickled values written to and read from disk.  
-  This is used to share cached values between processes in a single machine.
-- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
-  a [Memcached](https://memcached.org/)-based cache backend that uses pickled
-  values written to and read from a Memcached server. This is used to share
-  cached values between processes across multiple machines. Note that this
-  backend requires optional dependencies, see [Extras][installation-extras].
-
-!!! tip "When is the cache really necessary?"
-    Crucially, semi-value computations with the
-    [PermutationSampler][pydvl.value.sampler.PermutationSampler] require caching
-    to be enabled, or they will take twice as long as the direct implementation
-    in [compute_shapley_values][pydvl.value.shapley.compute_shapley_values].
-
-!!! tip "Using the cache"
-    Continue reading about the cache in the documentation
-    for the [caching package][pydvl.utils.caching].
-
-#### Setting up the Memcached cache { #setting-up-memcached }
-
-[Memcached](https://memcached.org/) is an in-memory key-value store accessible
-over the network. pyDVL can use it to cache the computation of the utility function
-and speed up some computations (in particular, semi-value computations with the
-[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
-may benefit as well).
-
-You can either install it as a package or run it inside a docker container (the
-simplest). For installation instructions, refer to the [Getting
-started](https://github.com/memcached/memcached/wiki#getting-started) section in
-memcached's wiki. Then you can run it with:
-
-```shell
-memcached -u user
-```
-
-To run memcached inside a container in daemon mode instead, use:
-
-```shell
-docker container run -d --rm -p 11211:11211 memcached:latest
-```
-
-### Parallelization { #setting-up-parallelization }
-
-pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
-parallelization (within one machine) and supports using
-[Ray](https://ray.io) for distributed parallelization (across multiple machines).
-
-The former works out of the box but for the latter you will need to install
-additional dependencies (see [Extras][installation-extras] )
-and to provide a running cluster (or run ray in local mode).
-
-As of v0.8.1 pyDVL does not allow requesting resources per task sent to the
-cluster, so you will need to make sure that each worker has enough resources to
-handle the tasks it receives. A data valuation task using game-theoretic methods
-will typically make a copy of the whole model and dataset to each worker, even
-if the re-training only happens on a subset of the data. This means that you
-should make sure that each worker has enough memory to handle the whole dataset.
-
-#### Ray
-
-Please follow the instructions in Ray's documentation to set up a cluster.
-Once you have a running cluster, you can use it by passing the address
-of the head node to parallel methods via [ParallelConfig][pydvl.parallel.config.ParallelConfig].
-
-For a local ray cluster you would use:
-
-```python
-from pydvl.parallel.config import ParallelConfig
-config = ParallelConfig(backend="ray") 
-```
+Refer to the [[advanced-usage]] page for explanations on how to enable
+and use parallelization and caching.
diff --git a/docs/getting-started/index.md b/docs/getting-started/index.md
@@ -1,8 +1,8 @@
 ---
-title: Getting Started
+title: Getting started
 alias:
   name: getting-started
-  title: Getting Started
+  title: Getting started
 ---
 
 # Getting started
@@ -104,4 +104,4 @@ pip install pyDVL[memcached]
 
 This installs [pymemcache](https://github.com/pinterest/pymemcache)
 additionally. Be aware that you still have to start a memcached server manually.
-See [Setting up the Memcached cache](first-steps/#setting-up-memcached).
+See [Setting up the Memcached cache][setting-up-memcached].
diff --git a/docs/getting-started/methods.md b/docs/getting-started/methods.md
@@ -7,7 +7,7 @@ alias:
 
 We currently implement the following methods:
 
-## Data Valuation
+## Data valuation
 
 - [**LOO**][pydvl.value.loo.compute_loo].
 
@@ -44,7 +44,7 @@ We currently implement the following methods:
 - [**Data-OOB**][pydvl.value.oob.compute_data_oob]
   [@kwon_dataoob_2023].
 
-## Influence Functions
+## Influence functions
 
 - [**CG Influence**][pydvl.influence.torch.CgInfluence].
   [@koh_understanding_2017].

diff --git a/docs/influence/scaling_computation.md b/docs/influence/scaling_computation.md
@@ -1,3 +1,10 @@
+---
+title: Scaling computation
+alias: 
+  name: scaling-influence-computation
+  text: Scaling computation
+---
+
 The implementations of [InfluenceFunctionModel][pydvl.influence.base_influence_function_model.InfluenceFunctionModel]
 provide a convenient way to calculate influences for
 in memory tensors. 

diff --git a/docs/value/index.md b/docs/value/index.md
@@ -224,7 +224,7 @@ from sklearn.linear_model import LinearRegression, LogisticRegression
 from sklearn.datasets import load_iris
 
 dataset = Dataset.from_sklearn(load_iris())
-u = Utility(LogisticRegression(), dataset, enable_cache=False)
+u = Utility(LogisticRegression(), dataset)
 training_budget = 3
 wrapped_u = DataUtilityLearning(u, training_budget, LinearRegression())
 

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -14,6 +14,7 @@ nav:
     - Applications: getting-started/applications.md
     - Benchmarking: getting-started/benchmarking.md
     - Methods: getting-started/methods.md
+    - Advanced usage: getting-started/advanced-usage.md
     - Glossary: getting-started/glossary.md
   - Data Valuation:
     - value/index.md
@@ -74,7 +75,6 @@ plugins:
       canonical_version: stable
   - section-index
   - alias:
-      use_relative_link: true
       verbose: true
   - gen-files:
       scripts:
@@ -110,6 +110,7 @@ plugins:
             - https://joblib.readthedocs.io/en/stable/objects.inv
             - https://docs.dask.org/en/latest/objects.inv
             - https://distributed.dask.org/en/latest/objects.inv
+            - https://docs.ray.io/en/latest/objects.inv
           paths: [ src ]  # search packages in the src folder
           options:
             heading_level: 1