Simplify parallel backend config #549

Merged · 45 commits · Apr 5, 2024
d7143be
Allow instantiating parallel backend classes directly
AnesBenmerzoug Mar 27, 2024
75dbca0
Deprecate use of config argument in JoblibParallelBackend
AnesBenmerzoug Mar 27, 2024
cf90870
Deprecate use of config argument in RayParallelBackend
AnesBenmerzoug Mar 27, 2024
96b7ec9
Deprecate use of config argument in RayExecutor
AnesBenmerzoug Mar 27, 2024
dfeeadd
Deprecated use of config argument in RayExecutor
AnesBenmerzoug Mar 27, 2024
01f2c2c
Rename BaseParallelBackend to ParallelBackend
AnesBenmerzoug Mar 27, 2024
6597a9e
Deprecate use of config argument for future executor classes
AnesBenmerzoug Mar 27, 2024
bf12873
Deprecate use of init_executor and config in compute_loo
AnesBenmerzoug Mar 27, 2024
3111533
Docstring changes
AnesBenmerzoug Mar 27, 2024
5a09aa4
Make maybe_init_parallel_backend a private function
AnesBenmerzoug Mar 27, 2024
2a64ad4
Improve type annotations
AnesBenmerzoug Mar 27, 2024
e640eac
Deprecate use of config in MapReduceJob
AnesBenmerzoug Mar 27, 2024
2f66acf
Fixes
AnesBenmerzoug Mar 27, 2024
c570880
Create new parallel backend fixture and use it in loo test
AnesBenmerzoug Mar 27, 2024
31683fb
Deprecate use of config in montecarlo least core
AnesBenmerzoug Mar 29, 2024
77297ac
Deprecate use of config in combinatorial exact shapley
AnesBenmerzoug Mar 29, 2024
f32455b
Deprecate use of config in montecarlo shapley
AnesBenmerzoug Mar 29, 2024
ff57e55
Deprecate use of config in owen shapley
AnesBenmerzoug Mar 29, 2024
0d3c26d
Deprecate use of config in classwise shapley
AnesBenmerzoug Mar 29, 2024
993e6ee
Deprecate use of config in group testing shapley
AnesBenmerzoug Mar 29, 2024
866e754
Update docstrings
AnesBenmerzoug Mar 29, 2024
d5ae8ea
Deprecate use of config in semivalues
AnesBenmerzoug Mar 29, 2024
acfc3e6
Remove effective_n_jobs function
AnesBenmerzoug Mar 29, 2024
7108087
Update tests to use parallel_backend fixture instead of parallel_conf…
AnesBenmerzoug Mar 29, 2024
d905fd2
Hardcode wait timeout in permutation_montecarlo_shapley
AnesBenmerzoug Mar 29, 2024
71b0a47
Ignore return type of functions wrapped with @deprecated
AnesBenmerzoug Mar 29, 2024
bee00fa
Fix arguments to joblib's Parallel
AnesBenmerzoug Mar 29, 2024
59230f6
Deprecate use of config in lc_solve_problems
AnesBenmerzoug Mar 29, 2024
832cd3a
Update parallel backend documentation
AnesBenmerzoug Mar 29, 2024
4b4a258
Replace parallel_config fixture
AnesBenmerzoug Mar 29, 2024
fd25ce7
Fix tests
AnesBenmerzoug Mar 29, 2024
67342d3
Fix effective_n_jobs methods in joblib parallel backend
AnesBenmerzoug Mar 29, 2024
f5c6ee0
Fix tests
AnesBenmerzoug Mar 29, 2024
31dcbc9
Disable mkdocs social plugin
AnesBenmerzoug Mar 29, 2024
863bc39
Update changelog
AnesBenmerzoug Mar 30, 2024
a6d9633
Use cast instead of colon syntax for joblib executor
AnesBenmerzoug Apr 1, 2024
270038a
Use new union and optional syntax for type annotations
AnesBenmerzoug Apr 2, 2024
324143f
Only allow use of JoblibParallelBackend with MapReduceJob
AnesBenmerzoug Apr 2, 2024
6a7313c
Deprecate only config argument of init_parallel_backend
AnesBenmerzoug Apr 2, 2024
55e06af
Define _joblib_backend_name attribute for parallel backend classes an…
AnesBenmerzoug Apr 4, 2024
65313e7
Document mapreducejob and futures executor interfaces
AnesBenmerzoug Apr 4, 2024
371ef79
Merge branch 'develop' into feature/simplify-parallel-backend-config
AnesBenmerzoug Apr 4, 2024
b0a7a49
Reenable the mkdocs social plugin
AnesBenmerzoug Apr 4, 2024
e23256b
Add instructions about init_parallel_backend
AnesBenmerzoug Apr 4, 2024
61f69c1
Minor tweaks
mdbenito Apr 5, 2024
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -36,6 +36,8 @@
- Documentation improvements and cleanup
[PR #521](https://github.com/aai-institute/pyDVL/pull/521),
[PR #522](https://github.com/aai-institute/pyDVL/pull/522)
- Simplified parallel backend configuration
[PR #549](https://github.com/aai-institute/pyDVL/pull/549)

## 0.8.1 - 🆕 🏗 New method and notebook, Games with exact shapley values, bug fixes and cleanup

90 changes: 83 additions & 7 deletions docs/getting-started/advanced-usage.md
@@ -16,7 +16,7 @@ keep in mind when using pyDVL namely Parallelization and Caching.
pyDVL uses parallelization to scale and speed up computations. It does so
using one of Dask, Ray or Joblib. The first is used in
the [influence][pydvl.influence] package whereas the other two
are used in the [value][pydvl.value] package.

### Data valuation

@@ -37,6 +37,33 @@ and to provide a running cluster (or run ray in local mode).
if the re-training only happens on a subset of the data. This means that you
should make sure that each worker has enough memory to handle the whole dataset.

We provide backend classes for both Joblib and Ray, as well as two kinds of
executors for the different algorithms: the first uses a map-reduce pattern,
as implemented in the [MapReduceJob][pydvl.parallel.map_reduce.MapReduceJob] class,
and the second implements the futures executor interface from [concurrent.futures][].

As a convenience, you can also instantiate a parallel backend class
by using the [init_parallel_backend][pydvl.parallel.init_parallel_backend]
function:

```python
from pydvl.parallel import init_parallel_backend
parallel_backend = init_parallel_backend(backend_name="joblib")
```

!!! info

    The executor classes are not meant to be instantiated and used by users
    of pyDVL. They are used internally as part of the computations of the
    different methods.

!!! danger "Deprecation notice"

    We are currently planning to deprecate
    [MapReduceJob][pydvl.parallel.map_reduce.MapReduceJob] in favour of the
    futures executor interface, because it allows for more diverse computation
    patterns with interruptions.

#### Joblib

Please follow the instructions in Joblib's documentation
Expand All @@ -48,19 +75,24 @@ to compute exact shapley values you would use:

```python
import joblib
from pydvl.parallel import JoblibParallelBackend
from pydvl.value.shapley import combinatorial_exact_shapley
from pydvl.utils.utility import Utility

parallel_backend = JoblibParallelBackend()
u = Utility(...)

with joblib.parallel_config(backend="loky", verbose=100):
    values = combinatorial_exact_shapley(u, parallel_backend=parallel_backend)
```

#### Ray

!!! warning "Additional dependencies"

    The Ray parallel backend requires optional dependencies.
    See [Extras][installation-extras] for more information.

Please follow the instructions in Ray's documentation to
[set up a remote cluster](https://docs.ray.io/en/latest/cluster/key-concepts.html).
You could alternatively use a local cluster and in that case you don't have to set
Expand Down Expand Up @@ -90,14 +122,58 @@ To use the ray parallel backend to compute exact shapley values you would use:

```python
import ray
from pydvl.parallel import RayParallelBackend
from pydvl.value.shapley import combinatorial_exact_shapley
from pydvl.utils.utility import Utility

ray.init()
parallel_backend = RayParallelBackend()
u = Utility(...)
values = combinatorial_exact_shapley(u, parallel_backend=parallel_backend)
```

#### Futures executor

For the futures executor interface, we have implemented an executor
class for ray in [RayExecutor][pydvl.parallel.futures.ray.RayExecutor]
and rely on joblib's loky [get_reusable_executor][loky.get_reusable_executor]
function to instantiate an executor for local parallelization.

Both are compatible with the builtin
[ThreadPoolExecutor][concurrent.futures.ThreadPoolExecutor]
and [ProcessPoolExecutor][concurrent.futures.ProcessPoolExecutor]
classes.
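Because all of these implement the same [Executor][concurrent.futures.Executor] interface, the usage pattern is identical across them. The following sketch uses only the standard library to show the shared `map` pattern:

```python
from concurrent.futures import ThreadPoolExecutor

# Any Executor implementation supports the same submit/map interface,
# so code written against it is portable across backends.
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(lambda x: x + 1, range(3)))
print(results)  # [1, 2, 3]
```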

```pycon
>>> from pydvl.parallel import JoblibParallelBackend
>>> parallel_backend = JoblibParallelBackend()
>>> with parallel_backend.executor() as executor:
... results = list(executor.map(lambda x: x + 1, range(3)))
...
>>> results
[1, 2, 3]
```

#### Map-reduce

The map-reduce interface is older and more limited in the patterns
it allows us to use.

To reproduce the previous example using
[MapReduceJob][pydvl.parallel.map_reduce.MapReduceJob], we would use:

```pycon
>>> from pydvl.parallel import JoblibParallelBackend, MapReduceJob
>>> parallel_backend = JoblibParallelBackend()
>>> map_reduce_job = MapReduceJob(
... list(range(3)),
... map_func=lambda x: x[0] + 1,
... parallel_backend=parallel_backend,
... )
>>> results = map_reduce_job()
>>> results
[1, 2, 3]
```
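Note that in the example above `map_func` receives a whole chunk of inputs, hence the `x[0]`. The chunk-then-reduce flow can be sketched with standard-library tools only; the `map_reduce` function and its parameters here are illustrative, not part of pyDVL's API:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def map_reduce(inputs, map_func, reduce_func, n_chunks=3):
    # Split inputs into n_chunks roughly equal chunks, map each chunk
    # in parallel, then reduce the per-chunk results.
    it = iter(inputs)
    k, r = divmod(len(inputs), n_chunks)
    chunks = [list(islice(it, k + (i < r))) for i in range(n_chunks)]
    with ThreadPoolExecutor() as executor:
        partials = list(executor.map(map_func, chunks))
    return reduce_func(partials)

result = map_reduce(
    list(range(3)),
    map_func=lambda chunk: [x + 1 for x in chunk],
    reduce_func=lambda parts: [y for p in parts for y in p],
)
print(result)  # [1, 2, 3]
```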

### Influence functions
1 change: 1 addition & 0 deletions mkdocs.yml
Expand Up @@ -108,6 +108,7 @@ plugins:
- https://pytorch.org/docs/stable/objects.inv
- https://pymemcache.readthedocs.io/en/latest/objects.inv
- https://joblib.readthedocs.io/en/stable/objects.inv
- https://loky.readthedocs.io/en/stable/objects.inv
- https://docs.dask.org/en/latest/objects.inv
- https://distributed.dask.org/en/latest/objects.inv
- https://docs.ray.io/en/latest/objects.inv
43 changes: 27 additions & 16 deletions src/pydvl/parallel/__init__.py
@@ -1,37 +1,48 @@
"""
This module provides a common interface to parallelization backends. The list of
supported backends is [here][pydvl.parallel.backends]. Backends should be
instantiated directly and passed to the respective valuation method.

We use executors that implement the [Executor][concurrent.futures.Executor]
interface to submit tasks in parallel.
The basic high-level pattern is:

```python
from pydvl.parallel import JoblibParallelBackend

parallel_backend = JoblibParallelBackend()
with parallel_backend.executor(max_workers=2) as executor:
    future = executor.submit(lambda x: x + 1, 1)
    result = future.result()
    assert result == 2
```

Running a map-style job is also easy:

```python
from pydvl.parallel import JoblibParallelBackend

parallel_backend = JoblibParallelBackend()
with parallel_backend.executor(max_workers=2) as executor:
    results = list(executor.map(lambda x: x + 1, range(5)))
    assert results == [1, 2, 3, 4, 5]
```

!!! tip "Passing large objects"

    When running tasks which accept heavy inputs, it is important
    to first use `put()` on the object and pass the returned reference
    as argument to the callable within `submit()`. For example:

    ```python
    u_ref = parallel_backend.put(u)
    ...
    executor.submit(task, utility=u_ref)
    ```

    Note that `task()` does not need to be changed in any way:
    the backend will `get()` the object and pass it to the function
    upon invocation.
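The `put()`/reference mechanism can be illustrated with a toy in-memory object store. This is purely illustrative: the real backends return joblib or ray object references, and `_store` and `run_task` are not pyDVL names.

```python
# Toy object store illustrating the put()/reference pattern.
_store: dict = {}

def put(obj):
    """Store an object and return a lightweight reference to it."""
    ref = f"ref-{id(obj)}"
    _store[ref] = obj
    return ref

def get(ref):
    """Resolve a reference back to the stored object."""
    return _store[ref]

def run_task(task, **kwargs):
    # The backend resolves references before invoking the task, so the
    # task itself never needs to know about references.
    resolved = {
        k: get(v) if isinstance(v, str) and v in _store else v
        for k, v in kwargs.items()
    }
    return task(**resolved)

data_ref = put(list(range(1000)))
total = run_task(lambda utility: sum(utility), utility=data_ref)
print(total)  # 499500
```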
There is an alternative map-reduce implementation
[MapReduceJob][pydvl.parallel.map_reduce.MapReduceJob] which internally
uses joblib's higher level API with `Parallel()`, which then indirectly also
supports the use of Dask and Ray.
"""
# HACK to avoid circular imports
from ..utils.types import * # pylint: disable=wrong-import-order
Expand All @@ -41,5 +52,5 @@
from .futures import *
from .map_reduce import *

if len(ParallelBackend.BACKENDS) == 0:
    raise ImportError("No parallel backend found. Please install ray or joblib.")