Concat API lacks semantic dimension identifier #1245
Comments
I'm unsure about this. We let people put names on their dimensions. E.g. I don't know that we should have Right now, the |
@ilan-gold, do you have thoughts on this w.r.t. the xarray API? @gtca, do you have thoughts on this w.r.t. MuData? I think one of the strongest cases for allowing dynamic naming of dimensions is multimodal data, e.g. something like:
|
This is a tough point, actually. If your dimension name in An example that hopefully illustrates I know what I'm talking about:
import numpy as np
import xarray as xr

num_obs = 10
obs_names = np.arange(0, num_obs)
col1 = xr.DataArray(np.random.rand(num_obs,), coords=[obs_names], dims=["obs_names"])
col2 = xr.DataArray(np.random.rand(num_obs,), coords=[obs_names], dims=["obs_names"])
obs = xr.Dataset(dict(col1=col1, col2=col2))
# <xarray.Dataset>
# Dimensions: (obs_names: 10)
# Coordinates:
# * obs_names (obs_names) int64 0 1 2 3 4 5 6 7 8 9
# Data variables:
# col1 (obs_names) float64 0.3043 0.4391 0.7602 ... 0.309 0.8719 0.6947
# col2 (obs_names) float64 0.2787 0.5073 0.572 ... 0.6495 0.9579 0.8732
obs['obs_names']
# <xarray.DataArray 'obs_names' (obs_names: 10)>
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# Coordinates:
# * obs_names (obs_names) int64 0 1 2 3 4 5 6 7 8 9
obs['obs_names'].name
# obs_names
obs['obs_names'].name = 'other_name'
obs['obs_names'].name
# obs_names
However, that brings up the point of whether or not things like
What does this mean? |
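A minimal sketch of the behaviour in the example above, assuming a recent xarray: assigning to `.name` on the object returned by indexing has no lasting effect, whereas `Dataset.rename` returns a new dataset with the dimension and its coordinate renamed. The new name `"cell"` is only an illustrative choice.

```python
import numpy as np
import xarray as xr

num_obs = 10
obs_names = np.arange(num_obs)
col1 = xr.DataArray(np.random.rand(num_obs), coords=[obs_names], dims=["obs_names"])
obs = xr.Dataset({"col1": col1})

# Indexing builds a fresh DataArray each time, so this assignment only
# touches a temporary object and is lost immediately.
obs["obs_names"].name = "other_name"
name_after_assignment = obs["obs_names"].name

# Renaming has to go through the Dataset itself; the original is left untouched.
renamed = obs.rename({"obs_names": "cell"})
```

So in xarray the dimension name is a property of the container, not of the coordinate array handed back to the user.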
The core design philosophy of AnnData is “a lot of data has two main dimensions, obs and var”. Of course it makes sense to have metadata describing these axes (e.g. “observations are cells, variables are transcripts”), but
Isaac means that Of course the exact same argument applies to |
I think I am not convinced about dynamic dim naming yet. In the multimodal case, there are already modality names, and a single AnnData object is already concerned with a particular feature set. In the context of this thread, while it makes sense to define operations like concatenation through axes ( In addition to that, there's a concern that codebases would lose some generality that they have now if we allow for dynamic naming. I.e. methods that work on |
Compatibility with xarray
@ilan-gold, @flying-sheep I've put together a short demo of where I think there will be conflicts with xarray.
import anndata as ad, pandas as pd, numpy as np, xarray as xr
M, N = (3, 2)
adata = ad.AnnData(
    np.arange(M * N).reshape((M, N)),
    obs=pd.DataFrame(
        index=pd.Index(["cell_{}".format(i) for i in range(M)], name="cell")
    ),
    var=pd.DataFrame(
        index=pd.Index(["gene_{}".format(i) for i in range(N)], name="gene")
    ),
)
# Creating a dataarray using the same indices
da = xr.DataArray(
    adata.X,
    coords=(adata.obs_names, adata.var_names),
)
# Xarray semantics:
da.sum(dim="cell")
# <xarray.DataArray (gene: 2)>
# array([6, 9])
# Coordinates:
# * gene (gene) object 'gene_0' 'gene_1'
xr.concat([da, da], dim="cell")
# <xarray.DataArray (cell: 6, gene: 2)>
# array([[0, 1],
# [2, 3],
# [4, 5],
# [0, 1],
# [2, 3],
# [4, 5]])
# Coordinates:
# * cell (cell) object 'cell_0' 'cell_1' 'cell_2' 'cell_0' 'cell_1' 'cell_2'
# * gene (gene) object 'gene_0' 'gene_1'
xr.concat([da,da], dim="gene")
# <xarray.DataArray (cell: 3, gene: 4)>
# array([[0, 1, 0, 1],
# [2, 3, 2, 3],
# [4, 5, 4, 5]])
# Coordinates:
# * cell (cell) object 'cell_0' 'cell_1' 'cell_2'
# * gene (gene) object 'gene_0' 'gene_1' 'gene_0' 'gene_1'
For the sake of interoperability I don't think we should have a keyword argument
adata.X = da
adata.X.sum(dim=???)
ad.concat([adata, adata], dim=???)
ds = adata.to_xarray() # Theoretical method that returns a `xr.Dataset` object
xr.concat([ds, ds], dim=???)
So what are the options here?
We basically delete this information out of old files. Also makes operations like
This is basically the case above. I really don't like:
adata.X = da
ad.concat([adata, adata], dim="obs").X.sum(dim="cells")
Especially since using any other permutation of those strings would not work.
I think this makes the most sense. However we do not enforce that Since there is data written which specifies this we'd need to figure out what happens there. Probably we error when it's ambiguous. This requires that there's a way to reference these things unambiguously, which would be
We could allow This increases the number of ambiguous cases.
Another thought: using user-assigned names can lead to repeated dimension names. Xarray is currently not capable of handling these, so we'd need to figure out how to handle that.
MuData
@gtca, when we've talked before about how we can use MuData to represent not just same-dataset different-modality data but even collections of datasets, we got into "how do you indicate which sample set or modality a particular anndata object is part of". I believe one of the approaches that came up was naming the dimensions. I would think that it could be nice to do something like:
.mod = {
    AnnData{dataset_1, genes}
    AnnData{dataset_1, atac}
    AnnData{dataset_2, genes}
    AnnData{dataset_2, immune_receptor}
    AnnData{dataset_3, atac}
}
Though I wonder if instead of
.mod = {
    AnnData{slide_1_cells, immuno_fluorescence},
    AnnData{slice_1_tissues, immuno_fluorescence},
    AnnData{slice_2_cells, immuno_fluorescence},
    AnnData{slice_2_cells, FISH},
}
It looks like you've started to address this with the It looks like users have the option of having a MuData object represent:
Is this correct? And can MuData represent, say, multiple sets of observations, each of which has a variable selection of modalities?
Spatialdata?
@scverse/spatialdata would also be nice to know if I'm forgetting anything about any implications for |
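The "error when it's ambiguous" idea from the options above could be sketched in plain Python roughly as follows. `resolve_axis` and its `obs_name` / `var_name` parameters are hypothetical names used only for illustration; they are not part of the anndata or xarray API, and the accepted identifiers (`0`/`"obs"`, `1`/`"var"`, plus user-assigned index names) are an assumption about what such a resolver might allow.

```python
from typing import Optional, Union


def resolve_axis(
    axis: Union[int, str],
    obs_name: Optional[str] = None,
    var_name: Optional[str] = None,
) -> int:
    """Hypothetical helper: map an axis identifier to 0 (obs) or 1 (var).

    Accepts the positional axes, the fixed semantic names "obs"/"var",
    and the user-assigned index names. Raises if nothing or more than
    one axis matches.
    """
    matches = set()
    if axis == 0 or axis == "obs":
        matches.add(0)
    if axis == 1 or axis == "var":
        matches.add(1)
    # User-assigned names only apply to string identifiers.
    if isinstance(axis, str):
        if obs_name is not None and axis == obs_name:
            matches.add(0)
        if var_name is not None and axis == var_name:
            matches.add(1)
    if not matches:
        raise ValueError(f"Unknown axis identifier: {axis!r}")
    if len(matches) > 1:
        raise ValueError(f"Ambiguous axis identifier: {axis!r}")
    return matches.pop()
```

With `obs_name="cell"` and `var_name="gene"`, both `"obs"` and `"cell"` resolve to axis 0. An obs index that happens to be named `"var"`, or both indices sharing one name, triggers the ambiguity error rather than a silent guess.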
I don't see implications for spatialdata atm. But, unrequested feedback, I think I understand the dynamic naming potentials, and am excited by xarray compatibility, but I must say I don't think I've ever used the current dynamic names and am not familiar with any API that uses it. Therefore I guess I see the potential and don't see a real barrier to it, as adding the dimension semantic identifier would encourage users and developers to use it/think about it. Therefore, I would go for this
I also think that API(dim="obs") is not very intuitive, as I think of them as attributes. |
My intuition comes from the central idea I designed AnnData around: AnnData has two main dimensions that represent Our amount of support for @ivirshup for your examples, this should be the API:
ds = adata.to_xarray() # Theoretical method that returns a `xr.Dataset` object
xr.concat([ds, ds], dim='obs') # or 'var'
Me neither. So we need to make sure that the first line fails with
We have this problem in multiple places. That’s why I didn’t like adding support for So maybe we should just go around to solving that before we allow using xarray in AnnData. |
We could also do |
This makes more sense to me. For internal variables I would always make |
Great! I updated #1244 accordingly |
@gtca, is this something you would adopt in |
I just saw that this matches pandas’ API:
|
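As a general illustration (not necessarily the snippet this comment pointed to), pandas accepts an axis either by number or by name in both reductions and `pd.concat`. The small frame below is made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"gene_0": [0, 2, 4], "gene_1": [1, 3, 5]},
    index=["cell_0", "cell_1", "cell_2"],
)

# Reductions: axis=0 and axis="index" are interchangeable.
sum_by_number = df.sum(axis=0)
sum_by_name = df.sum(axis="index")

# Concatenation: axis=1 and axis="columns" are interchangeable.
concat_by_number = pd.concat([df, df], axis=1)
concat_by_name = pd.concat([df, df], axis="columns")
```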
Makes sense, here's one to track for MuData: scverse/mudata#64. For @ivirshup, if it helps to clarify the comment above #1245 (comment), the way to go there is nested MuData objects, e.g. see scverse/mudata#44 with an example for a nested object with one multimodal dataset:
You should be able to nest multiple datasets like that and so on. I guess the main point here is that this is intentionally not a 2D format (unlike e.g. PostData) so you have to structure your data storage. Supporting deeply nested objects by tools is another great question of course. 😃 |
Report
AnnData was designed around semantic dimension names instead of implementation details.
The concat API lacks support for this basic design goal. While it’s of course not entirely possible to avoid numerical axes like that, all APIs should support semantic axis names.