
Are Zero-Shot Cell Embeddings Dependent on Input Sample Chunk Size & max_length? #281

peterhealx opened this issue Jan 20, 2025 · 0 comments


Dear scGPT Authors (@subercui et al.),

When evaluating the zero-shot cell embeddings produced by scGPT's `embed_data` function (in `scgpt/tasks/cell_emb.py`), I noticed the following behaviour with a test adata containing 100 samples (obs) x 20,000 genes (vars), comparing:

(a) querying `embed_data` with the entire adata file, yielding one output embedding matrix of 100 samples (obs) x 512 dimensions
(b) chunking the adata file into n chunks (e.g. 4 x 25 samples) and concatenating the 4 output embeddings

Observations:

  1. Small max_length (e.g. the default 1200): the output embeddings from (a) and (b) differed, although, strangely, the first ~25 samples were identical.
  2. Larger max_length (e.g. 2000 or 30000): the output embeddings from (a) and (b) were identical.
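For what it's worth, both observations can be reproduced with a toy stand-in for `embed_data` in which the truncation step selects genes based on statistics of the whole input batch. The `embed` function below is my own hypothetical sketch (numpy only, not scGPT's actual code); it assumes a batch-dependent gene selection, which is my guess at the mechanism, not something I have confirmed in scGPT's source:

```python
import numpy as np

def embed(X, max_length):
    """Hypothetical stand-in for embed_data (NOT scGPT's real code).

    It keeps the max_length genes with the highest mean expression
    across the input batch -- a batch-dependent selection step --
    then projects each cell with a fixed random matrix.
    """
    keep = np.argsort(-X.mean(axis=0))[:max_length]  # depends on the whole batch
    W = np.random.default_rng(0).standard_normal((X.shape[1], 512))
    return X[:, keep] @ W[keep]

rng = np.random.default_rng(1)
X = rng.poisson(1.0, size=(100, 2000)).astype(float)  # toy: 100 cells x 2,000 genes

results = {}
for max_length in (1200, 2000):
    full = embed(X, max_length)                       # case (a): whole dataset
    chunked = np.vstack([embed(chunk, max_length)     # case (b): 4 x 25-cell chunks
                         for chunk in np.array_split(X, 4)])
    results[max_length] = bool(np.allclose(full, chunked))

print(results)
```

With max_length below the gene count (1200 < 2,000) the selected gene sets differ between chunks and the full dataset, so (a) and (b) disagree; once max_length covers every gene (2000), the selection is trivially batch-independent and (a) and (b) match, mirroring observations 1 and 2 above.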

It appears that the samples are not entirely independent: chunking the dataset yields different final embeddings if a small max_length "context window" is used, but the discrepancy disappears once a large enough context window is used. The larger the dataset/chunks (number of samples), the larger the context window needs to be for cases (a) and (b) to give identical results. Note: this also holds whether the model is initialised once before both (a) and (b), or once per chunk within (b).

Could the authors please explain this behaviour of scGPT? Perhaps the best way to proceed is to ensure the context window is large enough for the dataset/chunks being used. Please advise.
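If batch-dependent truncation is indeed the cause, a caller-side workaround is to derive max_length from the data rather than using the default. A minimal sketch; the `safe_max_length` helper and its `special_tokens` allowance (e.g. for a prepended <cls> token) are my assumptions, not taken from scGPT's code:

```python
def safe_max_length(n_vars: int, special_tokens: int = 1) -> int:
    """Pick a max_length covering every gene, so truncation can never
    depend on which cells happen to be in the batch.

    special_tokens is an assumed allowance for prepended special
    tokens; adjust it to match the model's actual tokenisation.
    """
    return n_vars + special_tokens

# e.g. for the 20,000-gene adata above:
print(safe_max_length(20000))
```

This trades memory and runtime for reproducibility, which may be acceptable when chunked and whole-dataset embeddings must agree exactly.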

Best regards,

Dr. Peter Wright
