Introducing dask-based parallel backend for the AnalysisBase.run() #4158

Comments
An initial curveball here - |
@IAlibay I think it is possible to configure imports and dependencies this way, yes.

```bash
# default option -- without dask
$ pip install --upgrade MDAnalysis

# more advanced option -- with it, perhaps supporting fewer python versions/platforms/etc
$ pip install --upgrade MDAnalysis[dask]
```

As for the |
Very exciting! |
Could you create a table to show e.g.
|
This is very cool! One suggestion though about:
Similar to the suggestion in #4259, I think it would be good to keep |
I am not quite sure how we would do ... So if all goes according to plan then ...
Please feel free to comment on the PR, too! |
Specific comment on marking AnalysisBase as parallelizable or not in #4162 (comment) on the PR. |
My 2 cents

What I would prefer is to:
I would prefer a usage pattern where analysis class authors opt in on backends, rather than a blanket statement that "it's parallelisable" or not. This is why my main ask was that the default AnalysisBase should always only have a 'local' / 'serial' backend as its available option (see: https://github.com/MDAnalysis/mdanalysis/pull/4162/files#r1278286718). Analysis authors then have to make a conscious decision as to whether or not a given backend will work for them.

Caveat

One of the things I've not had a chance to review in #4162 yet is: a) how this will look in terms of documentation - it might be that this is far too complicated for users to get behind |
Quick comment: conceptually, the question of whether an algorithm is parallelizable is different from which backend can be used to run it. Given that backends are “just” implementation details, I really don’t think that user code should need to track what happens in the background. As a user I want to be able to choose and try out any backend that becomes available. If the underlying algorithm can be parallelized with split-apply-combine then I should be able to select any backend and, in the worst case, it crashes. |
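As a toy illustration of this point (nothing here is MDAnalysis API; the per-block function and the data are made up), the same split-apply-combine computation can be driven by interchangeable "backends":

```python
import multiprocessing


def per_block_work(block):
    # the "apply" step: independent work on one block of frames
    return sum(block)


blocks = [range(0, 50), range(50, 100)]                    # the "split" step

if __name__ == "__main__":
    serial = sum(map(per_block_work, blocks))              # serial backend: builtin map
    with multiprocessing.Pool(2) as pool:                  # multiprocessing backend: same algorithm
        parallel = sum(pool.map(per_block_work, blocks))
    assert serial == parallel                              # the "combine" step gives identical results
```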
I agree that docs from both the user and developer perspective are still lacking. For the user side, have a look at @marinegor's final blog post https://marinegor.github.io/posts/2023/06/gsoc-summary (scroll down). |
@orbeckst I'm not sure I follow / agree here - if a developer knows that something is a bad idea, they should be able to dictate the allowed parameters of their method. I.e. ring fencing users as a design choice is ok, not an implementation detail? |
Yeah that's a fair point - it could increase the maintenance burden. But from looking at the existing code, something like the following could work:

```python
class ParallelizableAnalysisBase(AnalysisBase):
    def __init__(self, trajectory, verbose=False, **kwargs):
        super().__init__(trajectory, verbose, **kwargs)

    def run(
        self,
        start: int = None,
        stop: int = None,
        step: int = None,
        frames: Iterable = None,
        verbose: bool = None,
        n_workers: int = None,
        n_parts: int = None,
        backend: str = None,
        *,
        client: object = None,
        progressbar_kwargs={},
    ):
        if backend == "local":
            return super().run(
                start,
                stop,
                step,
                frames,
                verbose,
                progressbar_kwargs=progressbar_kwargs,
            )
        # Otherwise handle some parallel stuff
```

Although I get that you may want to avoid having additional classes for various reasons, as @IAlibay mentioned, which is completely understandable.
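For context, the kind of user-facing call being discussed throughout this thread would look roughly like this (an illustrative sketch of the proposed interface, not a finalised API; RMSD and the test files only stand in for any parallelizable analysis):

```python
import MDAnalysis as mda
from MDAnalysis.analysis.rms import RMSD
from MDAnalysis.tests.datafiles import PSF, DCD

u = mda.Universe(PSF, DCD)

# serial execution, identical to the current behaviour
RMSD(u.atoms, u.atoms).run(backend="local")

# parallel execution would differ only in the arguments passed to run()
RMSD(u.atoms, u.atoms).run(backend="multiprocessing", n_workers=4)
```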
That sounds great! I hadn't realised that by default only |
It's quite probable that I am missing some things about the differences between using multiprocessing, dask, and dask distributed, but I agree with @orbeckst in that if an algorithm can safely be run in parallel with e.g. |
Sure, but what happens when you throw in mpi as a backend? |
It's on us to provide an implementation of ...

I don't see a way to automate complicated parallelization (e.g., within a trajectory frame) within the generic AnalysisBase framework. But by restricting ourselves to a simple parallel algorithm we can separate the implementation details from the algorithm via @marinegor's ...

@marinegor please voice your opinion, too, especially if I am misunderstanding how the current approach is supposed to work. (@yuxuanzhuang and @RMeli I'd very much value your views, too, of course.) |
@orbeckst if I'm understanding this correctly, you're saying that the onus on ensuring that all backends work for all methods / future subclasses, etc.. is squarely on the MDA core library? This seems like we're undertaking a massive maintenance burden here. |
Maybe I'm just completely misunderstanding what you mean here, but it looks like we have a difference in opinion between the idea of how a method should work and its eventual application. My take is that whilst it should always work, we know that edge cases exist; they could be anything from not handling file writing in a certain way to some arbitrary race condition that only happens in certain backends. If that happens, a developer should have the right to ring fence users into a set of backends that they know will work for them. edit: an additional consideration here is if someone creates, say, an mdakit that is just fundamentally incompatible with a backend, due to whatever complicated dependency hell folks have gotten themselves into. |
If we enable parallelization for analysis classes in MDAnalysis.analysis then we should be able to test them. In principle, the current testing framework can do that. It just seems a heavy lift for CI so we might not be able to do this for every commit. That's a point for discussion. We can't guarantee that it will work for code that we didn't write but I think that's the same for every piece of code that we publish. The current parallel backends are the ones we'd expect people to use — multiprocessing, dask, dask.distributed. What's the alternative to offering these? |
I think that's where my current thought pattern is circling around - because we can't assume that everything will work with what we provide, the onus of ensuring a downstream analysis class works with all backends is on the developer, not on the core library. Hence my view that by consequence, the decision to expose / restrict a given set of backends is down to the analysis class author? Edit: it's not immediately clear that we're talking about the same scope - I'm mostly talking about the downstream world. |
That's fair. And they can certainly change
Yes, that's complicated. I want to say that it's the MDAKit author's responsibility to test all functionality, including parallel tests. However, that seems unreasonable if we promise "magic parallelization of anything that uses 'AnalysisBase'". We also can't just run their serial tests "in parallel" as this requires specialized tests. I don't think that dependency hell will be the issue. multiprocessing is part of the standard library, and if people choose dask as a backend they should have installed it. At the end of the day, parallel execution is opt-in (via argument processing of ...).

From the discussion I take away
|
@p-j-smith thanks for analysing the proposed code. I can see the appeal of leaving ...

Assuming that it is possible to rewrite PR #4162 so that

then I see the following pros and cons for splitting AnalysisBase (serial) and ParallelAnalysisBase (parallel)

cons

pro
|
Thanks @orbeckst for pretty thoroughly covering the pros and cons of having separate classes for serial and parallel analysis. I think the most important con - and one I hadn't considered - is your point:
I don't know if there's an easy solution to this. I've added a proof-of-concept of how a
This is pretty similar to the changes required to enable parallelisation in #4162, which instead requires:
But you're still right that people may not be aware that |
Regarding available backends - I think it would be reasonable to say to users that MDAnalysis has some official backends for its core analysis classes and:
In terms of 3., in #4269 I've changed the way backends are defined. Now a class (or ...) needs to define:

```python
def apply(self, func: Callable, computations: list, n_workers: int) -> list:
```

This ...

cons

pros

Edit: Added 'pro' about preventing the |
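For concreteness, a backend satisfying that apply interface could be as small as the following sketch built on the standard library (the class name and everything beyond the quoted signature are illustrative, not taken from #4269):

```python
import multiprocessing
from typing import Callable


class MultiprocessingBackend:
    """Illustrative backend: runs each computation in its own worker process."""

    def apply(self, func: Callable, computations: list, n_workers: int) -> list:
        with multiprocessing.Pool(processes=n_workers) as pool:
            return pool.map(func, computations)
```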
Thanks for putting up PR #4269. It's an interesting approach and there are quite a few things in your code that I liked. I only had time for a cursory glance, though. I hope others also have a look. @p-j-smith correctly points out that even with the approach in PR #4162 we will not be able to automagically add parallel capability to downstream analysis classes purely via inheritance — as shown for the |
@p-j-smith Thanks so much for your suggestions! At the moment, I won't be able to offer my thoughts, but I'm definitely looking forward to participating at a later time. |
Hi everyone, sorry for being silent for the last few days -- rain travels took the soul out of me. I'm following the discussion closely, and will respond in more detail later this week. |
Hi everyone, I feel like the discussion above now has condensed to this list of questions:

principal:

and design:

My point of view on them -- sorry for being too wordy sometimes :)
I agree with @orbeckst here: "regardless what we do, users/developers will have to explicitly upgrade their code to make use of parallelization". As the MDAnalysis library, we must maintain the following: a) all core objects are properly serializable; b)
Although @p-j-smith proposed an interesting solution with ... For instance, I imagine we could introduce a ParallelExecutor class:

```python
from typing import Callable


class ParallelExecutor:
    def _options(self):
        options = {
            self._compute_with_client: self.client is not None,
            self._compute_with_dask: self.backend == "dask",
            self._compute_with_multiprocessing: self.backend == "multiprocessing",
            self._compute_with_local: self.backend == "local",
        }
        return options

    def apply(self, func: Callable, computations: list) -> list:
        for map_, condition in self._options().items():
            if condition:
                return map_(func, computations)
        raise ValueError(f"Backend or client is not set properly: {self.backend=}, {self.client=}")


class AnalysisBase:
    ...
    def _configure_client(self, backend: str, n_workers: int, client, *, executor_class: type = ParallelExecutor):
        if executor_class is not ParallelExecutor:
            return executor_class(backend, n_workers)
        ...
```

By that, introducing a new backend would work like this -- very similarly to @p-j-smith's suggestion:

```python
class CustomExecutor(ParallelExecutor):
    def _compute_with_SOMETHING(self, func: Callable, computations: list):
        ...

    def _options(self):
        options = super()._options()
        options.update({self._compute_with_SOMETHING: self.backend == 'SOMETHING'})
        return options


class SomeCustomClass(AnalysisBase):
    def _configure_client(self, backend: str, n_workers: int, client):
        super()._configure_client(backend, n_workers, client, executor_class=CustomExecutor)

    def available_backends(self):
        return (..., 'SOMETHING')
```

with like 4 lines of boilerplate and the actual code to generate a new backend. Of course, the example above must be documented and also put into the "Contributing" section of the documentation.
I agree with @IAlibay that we should avoid class proliferation, and design our code in a way that focuses on a class as something that does one thing but does it well. Ultimately,
Since in pt. 1 I already followed @orbeckst on being explicit and defining allowed backends for each class (otherwise it's only ...

Miscellaneous:
Following "composition over inheritance", I'd rather introduce a separate little class that iterates over the trajectory depending on transformation, and then feed it into If we now introduce a two-class approach (twice already!), we'll end up with 4 classes that do the same thing in 4 different ways, and it's reeeeeally unclear for a developer which one to use in which case. Also, we have to decide from which of them to inherit each of the existing subclasses every time, and modify tests accordingly. Both seem like a huge maintenance burden. |
Thanks for summarising the conversation @marinegor - would organising a call to discuss this all further be wise here? I feel like there's a lot here and it might be better suited to synchronous communication.

One extra note on maintenance

One thing I don't see explicitly in the summary, but I want to make sure is at the forefront of this discussion, is not only "downstream users/packages", but also the maintenance of this current library. Unfortunately the core library already has a medium to large maintenance burden spread across a very small number of part-time developers. I know that all new features will involve their own additional maintenance burdens, and "more work" shouldn't be a reason to shy away from improvements. However, as part of our discussions we should aim to assign a relative maintenance cost to all proposed solutions - including what backends we expose etc... |
Thanks for the detailed reply @marinegor ! I think @IAlibay is right and it would perhaps be easier to have a call to discuss some of these points. I'll just try to clarify one point I perhaps didn't explain very well. You mentioned that if a user wants to implement a new backend:
This is how a user could implement their own backend for their own analysis with the approach you've taken in #4162, which at first doesn't seem like too much effort. However, if a user wants to use their backend with any of the analyses in ..., they would need to subclass each of them:

```python
class HBondsWithAnotherBackend(HydrogenBondsAnalysis):
    def _configure_client(self, backend: str, n_workers: int, client):
        super()._configure_client(backend, n_workers, client, executor_class=CustomExecutor)

    def available_backends(self):
        return (..., 'another_backend')


class RMSDWithAnotherBackend(RMSD):
    def _configure_client(self, backend: str, n_workers: int, client):
        super()._configure_client(backend, n_workers, client, executor_class=CustomExecutor)

    def available_backends(self):
        return (..., 'another_backend')
```

etc. However, with the approach in #4269 it is much simpler to add a new backend. You would define your backend:

```python
class MyBackend:
    def __init__(self):
        # do some initialisation (and accept input parameters) if needed
        ...

    def apply(self, func: Callable, computations: list, n_workers: int) -> list:
        # do something and return the list of results
        return results
```

and then you can pass this backend to any parallel analysis class - your own analysis as well as MDAnalysis ones - without the need to subclass them. The benefit of this is that MDAnalysis could decide to implement e.g. only the |
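To illustrate, usage of such an instance-based backend might then look roughly like this (a hedged sketch assuming the #4269-style run() accepts a backend instance and forwards n_workers to its apply(); the trivial backend and the test trajectory are only for demonstration):

```python
import MDAnalysis as mda
from MDAnalysis.analysis.rms import RMSD
from MDAnalysis.tests.datafiles import PSF, DCD


class SerialListBackend:
    """Trivial illustrative backend: runs every computation in the current process."""

    def apply(self, func, computations, n_workers):
        return [func(task) for task in computations]


u = mda.Universe(PSF, DCD)

# the same backend instance can drive different analysis classes without subclassing them
RMSD(u.atoms, u.atoms).run(backend=SerialListBackend(), n_workers=1)
```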
I think that's a great idea |
@p-j-smith ah, I see now, thanks! Let me rephrase that to ensure I understand it correctly: you're basically suggesting to move from the combination "backend/client + ParallelExecutor class" to a "Backend class", and pass this class as an argument instead of ... If yes, I would imagine we could have it the following way (without introducing a new class):

```python
from typing import Callable


class BackendExecutor:
    def __init__(self, n_workers: int):
        self.n_workers = n_workers

    def apply(self, func: Callable, computations: list) -> list:
        raise NotImplementedError


class MyExecutor(BackendExecutor):
    def apply(self, func: Callable, computations: list) -> list:
        ...  # do some real work here
```

then the current |
Regarding synchronous discussion -- I suggest we move it to |
Hi everyone, I implemented what I meant here: marinegor#1 It introduces a |
Here's the summary of our discussion today (@marinegor @RMeli @p-j-smith); sorry you couldn't be there @yuxuanzhuang but please have a look at the outcomes below:

Overall goal

Provide a simple approach to parallelize code following the AnalysisBase structure with the split-apply-combine algorithm. This will not happen automagically for any AnalysisBase-derived class because developers need to ensure that their specific code still produces correct results in parallel (we cannot guarantee that) but required changes to code are minimal.

General guiding principles
Specific decisions
|
Is this issue solved? Is dask integrated into MDAnalysis? Also, is dask-cudf integrated into MDAnalysis? Does MDAnalysis have GPU support? |
This issue is currently being worked on in a late stage PR #4162. In general we do not use pandas so pandas-cudf was not an optimisation target of ours. MDAnalysis does not have GPU support. |
@shafayetrahat, as @hmacdope mentioned this is WIP in #4162. The PR needs some cleanup and review, but it's in a usable state if you want to give it a try. We would very much welcome early feedback! Unfortunately only a handful of analysis methods are currently parallelized in the PR (RMSD/RMSF). For more information you can have a look at @marinegor's blog posts, describing his work during GSoC. In particular, the GSoC Summary. There is also a gist with some preliminary analysis. |
Do you guys have future plans for GPU support? I am asking this because without this feature my analysis took a long time only to load the trajectory into RAM.
|
Though my analysis is not RMSF/RMSD, I surely want to look at the code to check how it's done. Can you mention which branch contains it?
|
The PR provides support to add parallelization to your own analysis. The code is available in PR #4162, which you can check out locally.
There is no such plan at this time. |
Issue MDAnalysis#4158
Motivation
This project focuses on improving MDAnalysis speed and scalability. I am planning to implement a parallel backend for the MDAnalysis library. The backend is supposed to be implemented using the dask library, allowing users to seamlessly run their analysis either on powerful local machines or on various clusters, such as SLURM-managed ones. There was a proof-of-concept fork of the MDAnalysis library, pmda, which implemented this idea for a subset of the analysis methods available in MDAnalysis. Basically, the idea of this proposal is to refactor pmda methods into the current version of MDAnalysis.

Proposed solution
A key component of the MDAnalysis library is the AnalysisBase class, from which all objects that allow a user to run an analysis of a trajectory are inherited. Namely, it implements a run method that looks somewhat like this:
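(A simplified, paraphrased sketch of that run() protocol, with details omitted; not the verbatim implementation:)

```python
def run(self, start=None, stop=None, step=None, frames=None, verbose=None):
    # select the requested frames of the trajectory
    self._setup_frames(self._trajectory, start=start, stop=stop, step=step, frames=frames)
    # set up per-run data structures
    self._prepare()
    # perform the per-frame computation
    for i, ts in enumerate(self._sliced_trajectory):
        self._frame_index = i
        self._ts = ts
        self._single_frame()
    # aggregate the per-frame results
    self._conclude()
    return self
```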
and consists of three steps: preparing temporary data structures (_prepare()), iterating over the selected trajectory frames and computing per-frame quantities (_single_frame()), and finally aggregating the results (_conclude()).
For a setup with multiple worker processes, this protocol will require an additional step of first separating a trajectory into blocks. Each block will be processed by a separate process, and the results from different blocks will potentially be concluded separately:
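(Schematically, this could look like the following toy sketch; the class and method names here are placeholders for illustration only, not the API proposed in this issue, and the per-frame computation is a dummy:)

```python
import multiprocessing

import numpy as np


class BlockAnalysisSketch:
    """Toy illustration of a block-based (split-apply-combine) run()."""

    def __init__(self, n_frames: int):
        self.n_frames = n_frames
        self.results = None

    def _single_frame(self, frame_index: int):
        # placeholder for the real per-frame computation
        return frame_index ** 2

    def _compute_block(self, frame_indices):
        # a worker runs the ordinary per-frame method over its own block of frames
        return [self._single_frame(int(i)) for i in frame_indices]

    def _conclude_blocks(self, block_results):
        # merge the per-block partial results into one flat result
        self.results = [value for block in block_results for value in block]

    def run(self, n_workers: int = 1):
        # split the frame indices into one contiguous block per worker
        blocks = np.array_split(np.arange(self.n_frames), n_workers)
        with multiprocessing.Pool(n_workers) as pool:
            block_results = pool.map(self._compute_block, blocks)
        self._conclude_blocks(block_results)
        return self


if __name__ == "__main__":
    assert BlockAnalysisSketch(100).run(n_workers=4).results == [i ** 2 for i in range(100)]
```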
Which requires introducing the following methods for the AnalysisBase class:
Which is similar to the protocol implemented in pmda. Such a modular design has the following advantages:
Alternatives

If you're looking into speeding up your MD trajectory analysis, I'd recommend looking into pmda. It is an older fork of MDAnalysis (before 2.0) that explores the same idea that I want to implement here, but for a limited number of AnalysisBase subclasses.

Additional context
This issue is a part of a GSoC project, and most of the code-related communication is supposed to happen here.