lets umap run in parallel #3295
base: main
Changes from all commits
fbc2e49
9ce1770
812e630
ba6538d
d2ab85d
4eba4a8
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -56,6 +56,7 @@ def umap( | |
key_added: str | None = None, | ||
neighbors_key: str = "neighbors", | ||
copy: bool = False, | ||
parallel: bool = False, | ||
) -> AnnData | None: | ||
"""\ | ||
Embed the neighborhood graph using UMAP :cite:p:`McInnes2018`. | ||
|
@@ -146,7 +147,8 @@ def umap( | |
:attr:`~anndata.AnnData.obsp`\\ ``[.uns[neighbors_key]['connectivities_key']]`` for connectivities. | ||
copy | ||
Return a copy instead of writing to adata. | ||
|
||
parallel | ||
I don't know where I got that this should error, but then we should definitely warn users about this sort of thing. Reproducibility is very important

On the other hand: `sc.tl.umap(adata, parallel=True, random_state=42)` works so I think this needs a test + updated comment to reflect whatever is supposed to be going on here.

As does `sc.tl.umap(adata, parallel=True, random_state=np.random.RandomState(42))`

I added a test to see if it errors (it doesn't)

Yes, but we should also warn users about this random state business. And check that the warning is raised every time

Ok! But then we should definitely warn in that case.

I added a warning

ok there is an issue with

ok the function is bugged on the umap side.
If you force
||
Whether to run the computation using numba parallel. Running in parallel is non-deterministic. | ||
Returns | ||
------- | ||
Returns `None` if `copy=False`, else returns an `AnnData` object. Sets the following fields: | ||
|
@@ -214,6 +216,12 @@ def umap( | |
# for the init condition in the UMAP embedding | ||
default_epochs = 500 if neighbors["connectivities"].shape[0] <= 10000 else 200 | ||
n_epochs = default_epochs if maxiter is None else maxiter | ||
if parallel and random_state is not None: | ||
warnings.warn( | ||
"Parallel execution was expected to be disabled when both `parallel=True` and `random_state` are set, " | ||
"to ensure reproducibility. However, parallel execution still seems to occur, which may lead to " | ||
"non-deterministic results." | ||
) | ||
Comment on lines +219 to +224

Link to the umap repo to give the user more context. Something like "UMAP reports that parallel execution should error with a random seed to ensure reproducibility. Parallel execution may occur nonetheless, which can lead to non-deterministic results." with a link to their docs

And then make sure this warning is raised in tests

I'll update this once I have clarification from lmcinnes/umap#1155
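A minimal sketch of the check and the test this thread asks for. It uses a hypothetical stand-in, `warn_if_parallel_with_seed`, instead of the real `umap()` function, and the message wording is adapted from the suggestion above rather than taken from the final PR. The helper `collect_warnings` is also hypothetical. Python's default warning filter reports a given warning only once per call site, so the test switches the filter to `"always"` to verify that the warning fires on every call.

```python
import warnings

# Hypothetical wording, adapted from the reviewer's suggestion;
# the final text is pending clarification in lmcinnes/umap#1155.
PARALLEL_SEED_MSG = (
    "UMAP reports that parallel execution should error with a random seed "
    "to ensure reproducibility. Parallel execution may occur nonetheless, "
    "which can lead to non-deterministic results (see lmcinnes/umap#1155)."
)


def warn_if_parallel_with_seed(parallel: bool, random_state) -> bool:
    """Hypothetical stand-in for the check added to `umap()` in this diff."""
    if parallel and random_state is not None:
        # stacklevel=2 attributes the warning to the caller, not this helper.
        warnings.warn(PARALLEL_SEED_MSG, UserWarning, stacklevel=2)
        return True
    return False


def collect_warnings(n_calls: int, *, parallel: bool, random_state) -> list:
    """Call the check `n_calls` times and return every warning message seen."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" disables the once-per-location suppression of the default filter.
        warnings.simplefilter("always")
        for _ in range(n_calls):
            warn_if_parallel_with_seed(parallel, random_state)
    return [str(w.message) for w in caught]
```

One thing to watch: if `random_state` has a non-`None` default in the real signature, a condition of the form `random_state is not None` would fire on every call with `parallel=True`, whether or not the user set a seed explicitly.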
||
X_umap, _ = simplicial_set_embedding( | ||
data=X, | ||
graph=neighbors["connectivities"].tocoo(), | ||
|
@@ -232,6 +240,7 @@ def umap( | |
densmap_kwds={}, | ||
output_dens=False, | ||
verbose=settings.verbosity > 3, | ||
parallel=parallel, | ||
) | ||
elif method == "rapids": | ||
msg = ( | ||
|
I think we have the `n_jobs: int | None` convention for this (with `None` meaning `sc.settings.N_JOBS`), but `simplicial_set_embedding` just passes `parallel` on to numba.

We should think about how the two integrate before we add this.

Since this is numba parallel, we can only set it to use everything you got or nothing

Yes, that’s why we should talk about the parameter name. IIRC that would be the first parallelization parameter not called `n_jobs`.
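One way the two conventions could be reconciled is to keep `n_jobs` in the public signature and collapse it to the boolean flag that `simplicial_set_embedding` accepts. The sketch below is only an illustration of that idea: `resolve_parallel` and its `default_n_jobs` argument are hypothetical names, and treating negative values as "use all cores" is an assumption, not something decided in this thread.

```python
from __future__ import annotations


def resolve_parallel(n_jobs: int | None, default_n_jobs: int) -> bool:
    """Hypothetical helper: collapse an `n_jobs` request to an on/off flag.

    `default_n_jobs` stands in for the global setting used when `n_jobs` is None.
    """
    effective = default_n_jobs if n_jobs is None else n_jobs
    if effective == 0:
        raise ValueError("n_jobs must be non-zero")
    # Exactly one job means serial execution; anything else turns the
    # all-or-nothing parallel flag on. Negative values are assumed to
    # mean "use all cores".
    return effective != 1
```

With this mapping, `n_jobs=4` and `n_jobs=-1` would behave identically, which is the integration question raised above: the parameter name would suggest finer control than the boolean flag provides.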