Running hotspot on integrated data #32

Open
bbergsneider opened this issue Apr 28, 2023 · 3 comments

I am looking to run Hotspot to discover transcriptional modules in a dataset containing integrated data from over 20 samples. All of the Hotspot vignettes I see online only analyze data from a single sample, and I am wondering if Hotspot supports integrated data analysis. If so, which gene counts matrix and model would you recommend using?

My data has been integrated using the standard Seurat integration pipeline, which includes (1) normalizing and identifying variable features for each dataset independently, (2) selecting 2000 variable integration features, and (3) running Seurat's IntegrateData command. I then re-scaled the gene counts for the entire integrated dataset. The integrated data is stored in the "data" slot of the Seurat object, whereas the scaled data is stored in "scale.data". I then converted the Seurat object to an AnnData object, preserving the separation between "data" and "scale.data".

I am currently running Hotspot using "data" (i.e., the original, not re-scaled, integrated data) as my gene counts matrix and "normal" as my model type. Is this correct? Would you recommend using a different gene counts matrix or model type? Thank you.
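
For concreteness, this is roughly the call I am making (a minimal sketch; the layer and key names are specific to my converted object, and the PCA step is only illustrative):

```python
import scanpy as sc
import hotspot  # pip install hotspotsc

# AnnData converted from the Seurat object; the integrated (non-scaled) values
# live in a layer I named "data" -- this name is specific to my object
adata = sc.read_h5ad("integrated.h5ad")

# a latent space for Hotspot's neighbor graph (illustrative choice)
sc.pp.pca(adata, n_comps=30)

hs = hotspot.Hotspot(
    adata,
    layer_key="data",         # integrated, not re-scaled, expression values
    model="normal",           # these values are continuous, not UMI counts
    latent_obsm_key="X_pca",
)
hs.create_knn_graph(weighted_graph=False, n_neighbors=30)
autocorr = hs.compute_autocorrelations()
hs_genes = autocorr.loc[autocorr.FDR < 0.05].index
local_corr = hs.compute_local_correlations(hs_genes)
modules = hs.create_modules(min_gene_threshold=30, core_only=True, fdr_threshold=0.05)
```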

TdzBAS commented Aug 6, 2024

I am also interested in an answer to this.

Nusob888 commented Oct 2, 2024

Based on how Hotspot is used in published papers, the more common practice seems to be not to feed corrected embeddings into Hotspot.
See:
https://pubmed.ncbi.nlm.nih.gov/37662289/
https://aacrjournals.org/cancerres/article/83/13/2155/727434/A-Rare-Subset-of-Primary-Tumor-Cells-with
https://www.cell.com/cell/pdf/S0092-8674(22)01258-2.pdf

In all of these datasets, clustering and annotation were performed on the batch-corrected/integrated data, but when running Hotspot, PCA was computed on the raw matrix without correction.
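
For illustration, that pattern looks roughly like this (a minimal sketch; file names, layer names and parameters are my own assumptions, not taken from any of these papers):

```python
import scanpy as sc
import hotspot

# merged but NOT batch-corrected object, with raw UMI counts kept in .layers["counts"]
adata = sc.read_h5ad("merged_uncorrected.h5ad")

# log-normalize and run PCA on the uncorrected matrix (no integration step here)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=30)   # -> adata.obsm["X_pca"], uncorrected

# Hotspot on the raw counts, using the uncorrected PCA as its latent space;
# clustering/annotation can still come from the separately integrated object
hs = hotspot.Hotspot(
    adata,
    layer_key="counts",
    model="danb",              # depth-adjusted negative binomial for UMI counts
    latent_obsm_key="X_pca",
)
hs.create_knn_graph(weighted_graph=False, n_neighbors=30)
results = hs.compute_autocorrelations()
```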

The only explanation I can find of why batch integration is not required beforehand is in Moorman et al., where they state 'Hotspot is based on gene-gene covariance, which better represents a set of genes that work together towards a common function. Moreover, covariance has been shown to be robust to batch effects[75]'. The reference is https://www.sciencedirect.com/science/article/pii/S0092867418307232?via%3Dihub. It is a bit difficult to see how this reference demonstrates that claim; the only figure and text that might address it are Fig. S2E-F. However, it is unclear from that manuscript how removing the mean parameters relates to batch effects.

A rare example where Hotspot was run on corrected embeddings is:
https://pubmed.ncbi.nlm.nih.gov/39214097/ (Harmony)

It is worth noting that in the totalVI manuscript (https://www.nature.com/articles/s41592-020-01050-x), Hotspot run within each batch is used to calculate the expected autocorrelation of genes and proteins, which is then used as a benchmark to assess retention of autocorrelation after totalVI integration. Its use as a pseudo 'ground truth' there does not answer the question directly, but it clearly shows that batch correction or multi-omic integration can result in loss of autocorrelation. It is also worth noting that Seurat v3 integration seems to perform less well than Harmony and totalVI in that comparison.
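
Very roughly, that benchmarking idea could be sketched as follows (this is my own illustration of the comparison, not totalVI's actual code; the file name, batch column and embedding key are assumptions):

```python
import scanpy as sc
import hotspot

adata = sc.read_h5ad("cite_seq.h5ad")  # hypothetical file; raw counts in .layers["counts"]

def autocorr_z(ad, latent_key):
    """Hotspot autocorrelation Z-scores for a given latent space."""
    hs = hotspot.Hotspot(ad, layer_key="counts", model="danb", latent_obsm_key=latent_key)
    hs.create_knn_graph(weighted_graph=False, n_neighbors=30)
    return hs.compute_autocorrelations()["Z"]

# 'expected' autocorrelation: Hotspot run within each batch on that batch's own PCA
expected = {}
for b in adata.obs["batch"].unique():
    sub = adata[adata.obs["batch"] == b].copy()
    sc.pp.normalize_total(sub, target_sum=1e4)
    sc.pp.log1p(sub)
    sc.pp.pca(sub, n_comps=30)
    expected[b] = autocorr_z(sub, "X_pca")

# autocorrelation after integration: same call, but on the integrated embedding
observed = autocorr_z(adata, "X_totalVI")  # assumes the embedding is stored under this key
```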

Bearing this in mind, I think the wider question is whether batch-specific autocorrelation is the ground truth, or whether autocorrelations change as dataset size and power increase, owing to larger sampling and better detection of rare cell types/states. In the latter case, batch correction may help reduce false discoveries caused by technical artefacts, or it may itself introduce error.

It would be great if the Yosef lab could comment on their use cases of Hotspot in data with significant batch effects.

SNOL2 commented Oct 31, 2024

Great answer! I am also waiting for the Yosef lab's opinion.
