-
Notifications
You must be signed in to change notification settings - Fork 610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lets umap run in parallel #3295
base: main
Are you sure you want to change the base?
Conversation
I didn't add release note yet because I'm not sure if this should already go into 1.10.4 |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #3295 +/- ##
==========================================
+ Coverage 76.95% 77.04% +0.09%
==========================================
Files 109 110 +1
Lines 12465 12494 +29
==========================================
+ Hits 9592 9626 +34
+ Misses 2873 2868 -5
|
@@ -146,6 +147,8 @@ def umap( | |||
:attr:`~anndata.AnnData.obsp`\\ ``[.uns[neighbors_key]['connectivities_key']]`` for connectivities. | |||
copy | |||
Return a copy instead of writing to adata. | |||
parallel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has this been tested? It seems like if the underlying implementation would error out every time a random seed is set, and we set a random seed, then setting this argument as true won't work?
I don't know where I got that this should error, but then we should definitely warn users about this sort of thing. Reproducibility is very important
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand:
sc.tl.umap(adata, parallel=True, random_state=42)
works so I think this needs a test + updated comment to reflect whatever is supposed to be going on here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As does
sc.tl.umap(adata, parallel=True, random_state=np.random.RandomState(42))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test to see if it errors (it doesn't)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but we should also warn users about this random state business. And check that the warning is raised every time parallel
is True because this is very important!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok! But then we should definitely warn in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a warning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok there is in issue with check_random_state
as it transforms None
into a seed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok the function is bugged on the umap side. If you force None
as the random seed, it errors. I'll make an issue with umap for clarification. I'll but the PR on ice while i wait for an update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -56,6 +56,7 @@ def umap( | |||
key_added: str | None = None, | |||
neighbors_key: str = "neighbors", | |||
copy: bool = False, | |||
parallel: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have the n_jobs: int | None
convention for this (with None
meaning sc.settings.N_JOBS
), but simplicial_set_embedding
just passes parallel
on to numba.
We should think about how the two integrate before we add this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is numba parallel, we can only set it to use everything you got or nothing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that’s why we should talk about the parameter name. IIRC that would be the first parallelization parameter not called n_jobs
.
if parallel and random_state is not None: | ||
warnings.warn( | ||
"Parallel execution was expected to be disabled when both `parallel=True` and `random_state` are set, " | ||
"to ensure reproducibility. However, parallel execution still seems to occur, which may lead to " | ||
"non-deterministic results." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link to the umap repo to give user more context. Something like "UMAP reports that parallel execution should error with a random seed to ensure reproducibility. Parallel execution may occur nonetheless, which can lead to non-deterministic results." with a link to their docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And then make sure this warning is raised in tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update this once i have clarification from lmcinnes/umap#1155
tests/test_embedding.py
Outdated
def test_umap_parallel_randomstate(): | ||
pbmc = pbmc68k_reduced()[:100, :].copy() | ||
sc.tl.umap(pbmc, parallel=True, random_state=42) | ||
sc.tl.umap(pbmc, parallel=True, random_state=np.random.RandomState(42)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only one action per test. This should be parametrized. Also, we don't need both of these given what I said above. We should just check for the warning with random_state
as None
or not-None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I updated the test already to catch the warning
Closes #3211
Lets you run all parts of UMAP in Parallel. Default will still be False for reproducibility.
Benchmarks (95k Cells AMD5950X)
parallel = False
33 sparallel = True
18 sI copied the doc string from UMAP to explain the impact of parallel execution.