Running hotspot on integrated data #32
Comments
I am also interested in getting an answer to this.
I think, based on its usage in published papers, that the more common practice is not to input corrected embeddings into Hotspot. These are all datasets where clustering and annotation were performed on batch-corrected/integrated data, but where, when running Hotspot, PCA was done on the raw matrix without correction. The only explanation I can find of why batch correction is not required beforehand is in Moorman et al., where they state: 'Hotspot is based on gene-gene covariance, which better represents a set of genes that work together towards a common function. Moreover, covariance has been shown to be robust to batch effects [75].' The reference paper is https://www.sciencedirect.com/science/article/pii/S0092867418307232?via%3Dihub. It is a bit difficult to see how this reference demonstrates the claim; the only figure and text that might address it are Fig. S2E-F, and it is unclear from that manuscript what removing the mean parameters means in relation to batch effects.

A rare example of where it was run on corrected embeddings is:

It is worth noting that in the totalVI manuscript (https://www.nature.com/articles/s41592-020-01050-x), Hotspot is run within each batch to calculate the expected autocorrelation of genes and proteins, which is then used as a benchmark to assess retention of autocorrelation after totalVI integration. Its usage as a pseudo 'ground truth' there doesn't answer this question, but it clearly shows that batch correction or multi-omic integration can result in loss of autocorrelation. It is also worth noting that Seurat v3 integration seems to perform less well than Harmony and totalVI in that benchmark.

Bearing this in mind, I think the wider question is whether batch-specific autocorrelation is the ground truth, or whether, as dataset size and power increase, autocorrelations change due to larger sampling and better detection of rare cell types/states. In the latter case, batch correction may help reduce false discoveries caused by technical artefacts, or it could itself introduce error. It would be great if the Yosef lab could comment on their own uses of Hotspot in data with significant batch effects.
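To make that per-batch benchmark idea concrete, here is a minimal sketch (mine, not from any of the papers above) of computing Hotspot autocorrelation Z-scores within each batch separately, which could then be compared against a run on the integrated data to check for loss of autocorrelation. It assumes a hypothetical AnnData object `adata` with raw counts in `.layers["counts"]`, log-normalized values in `.X`, and a `"batch"` column in `.obs`; the calls follow the current `hotspotsc` API.

```python
import numpy as np
import scanpy as sc
import hotspot  # pip install hotspotsc

per_batch_z = {}
for batch in adata.obs["batch"].unique():
    sub = adata[adata.obs["batch"] == batch].copy()
    # Total UMI counts per cell, needed by the depth-adjusted model
    sub.obs["total_counts"] = np.asarray(sub.layers["counts"].sum(axis=1)).ravel()
    # PCA within the batch on the uncorrected (log-normalized) data in .X
    sc.pp.pca(sub, n_comps=20)
    hs = hotspot.Hotspot(
        sub,
        layer_key="counts",
        model="danb",
        latent_obsm_key="X_pca",
        umi_counts_obs_key="total_counts",
    )
    hs.create_knn_graph(weighted_graph=False, n_neighbors=30)
    # Per-gene autocorrelation Z-scores: the "expected" values in the benchmark
    per_batch_z[batch] = hs.compute_autocorrelations()["Z"]
```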
Great answer! Also waiting for an opinion from the Yosef lab.
I am looking to run Hotspot to discover transcriptional modules in a dataset containing integrated data from over 20 samples. All of the Hotspot vignettes I see online only analyze data from a single sample, and I am wondering if Hotspot supports integrated data analysis. If so, which gene counts matrix and model would you recommend using?
My data has been integrated using the standard Seurat integration pipeline: (1) normalizing and identifying variable features for each dataset independently, (2) selecting 2,000 variable integration features, and (3) running Seurat's IntegrateData command. I then re-scaled the gene counts for the entire integrated dataset. The integrated data is stored in "data" in the Seurat object, and the scaled data in "scale.data". I then converted the Seurat object to an anndata object, preserving the separation between "data" and "scale.data".
I am currently running Hotspot using "data" (i.e., the original, not re-scaled, integrated data) as my gene counts matrix and "normal" as my model type. Is this correct? Would you recommend a different gene counts matrix or model type? Thank you.
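Not an official answer, but following the common practice described in the earlier comment, here is a minimal sketch of how the run might look after converting to AnnData: raw counts with the "danb" model and a PCA latent space computed on the uncorrected data, rather than Seurat's integrated "data" slot with model="normal". The layer and column names are assumptions about the converted object (raw counts in `.layers["counts"]`, log-normalized values in `.X`), not part of the question above.

```python
import numpy as np
import scanpy as sc
import hotspot  # pip install hotspotsc

# Per-cell sequencing depth for the depth-adjusted negative binomial model
adata.obs["total_counts"] = np.asarray(adata.layers["counts"].sum(axis=1)).ravel()

# Latent space from the uncorrected data (assumes log-normalized values in .X)
sc.pp.pca(adata, n_comps=30)

hs = hotspot.Hotspot(
    adata,
    layer_key="counts",          # raw counts, not the integrated "data" slot
    model="danb",                # depth-adjusted NB, suited to UMI counts
    latent_obsm_key="X_pca",
    umi_counts_obs_key="total_counts",
)
hs.create_knn_graph(weighted_graph=False, n_neighbors=30)
results = hs.compute_autocorrelations()
significant = results.loc[results["FDR"] < 0.05].index  # genes for module detection
```

If you do run on the integrated values instead, "normal" (or "none" for already-standardized data) is the model that matches continuous input, since "danb" and "bernoulli" assume counts. From `significant`, module detection proceeds with `hs.compute_local_correlations(...)` and `hs.create_modules(...)` as in the vignettes.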