Dear scGPT Authors (@subercui et al.),

When evaluating the zero-shot cell embeddings from scGPT's "embed_data" function (inside scgpt.tasks.cell_emb.py), I noticed the following given a test adata containing 100 samples (obs) x 20,000 genes (vars) and either:
(a) querying embed_data with the entire adata file, yielding one output embedding of 100 samples (obs) x 512 embedding dimensions (vars), or
(b) chunking the adata file into n chunks (e.g. 4 x 25 samples) and concatenating the 4 output embeddings (a minimal repro sketch follows).
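For concreteness, here is a minimal sketch of the two query paths. The file name, checkpoint path, and the use of return_new_adata are my assumptions for illustration; the exact embed_data signature may differ across scGPT versions:

```python
import numpy as np
import scanpy as sc
from scgpt.tasks import embed_data

MODEL_DIR = "path/to/scGPT_human"            # hypothetical checkpoint directory
adata = sc.read_h5ad("test_100x20000.h5ad")  # hypothetical test file: 100 obs x 20,000 vars

# (a) one call over the full AnnData -> 100 x 512 embedding matrix
emb_a = embed_data(adata.copy(), MODEL_DIR, max_length=1200,
                   return_new_adata=True).X

# (b) four chunks of 25 cells, embedded separately, then concatenated
chunks = [adata[i : i + 25].copy() for i in range(0, adata.n_obs, 25)]
emb_b = np.concatenate(
    [embed_data(c, MODEL_DIR, max_length=1200, return_new_adata=True).X
     for c in chunks],
    axis=0,
)

# naive expectation: identical, since each cell should embed independently
print(np.allclose(emb_a, emb_b))  # observed: False at the default max_length
```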
Observations:

- Small max_length (e.g. the default 1200): the output embeddings from (a) and (b) differed, although, strangely, the first ~25 samples were identical.
- Larger max_length (e.g. 2000 or 30000): the output embeddings from (a) and (b) were identical.

It appears that the samples are not entirely independent: chunking the dataset yields different final embeddings if a small max_length "context window" is used, but the discrepancy disappears once the context window is large enough. The larger the dataset/chunks (number of samples), the larger the context window needs to be for (a) and (b) to produce identical results. Note: this holds whether the model is initialised once before (a) and (b), or once per chunk within (b).
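A quick way to see where the discrepancy vanishes is to sweep max_length, reusing the names from the sketch above (again, the call pattern is assumed, not taken from the scGPT docs):

```python
# hypothetical sweep: find where chunked and full-dataset embeddings agree
def embed(a, max_len):
    return embed_data(a.copy(), MODEL_DIR, max_length=max_len,
                      return_new_adata=True).X

for max_len in (1200, 2000, 4000, 8000, 30000):
    full = embed(adata, max_len)
    chunked = np.concatenate([embed(c, max_len) for c in chunks], axis=0)
    print(max_len, np.allclose(full, chunked))
```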
Could the authors please explain this behaviour of scGPT? Perhaps the best way to proceed is to ensure a sufficiently large context window for the dataset/chunks being used. Please advise.
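If the cause is per-cell truncation of the non-zero genes to max_length, one conservative workaround might be to derive max_length from the data before calling embed_data. This is speculation on my part, and the +1 for a <cls> token is an assumption:

```python
import numpy as np
from scipy.sparse import issparse

# choose a max_length that covers every cell's non-zero genes (speculative workaround)
X = adata.X
nnz_per_cell = X.getnnz(axis=1) if issparse(X) else np.count_nonzero(X, axis=1)
safe_max_length = int(nnz_per_cell.max()) + 1  # +1 for a <cls> token (assumption)
emb = embed_data(adata, MODEL_DIR, max_length=safe_max_length,
                 return_new_adata=True).X
```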
Best regards,
Dr. Peter Wright