Running the geneformer example results in KeyError #1312

Open · ddemaeyer opened this issue Nov 13, 2024 · 1 comment
Labels: bug (Something isn't working)

@ddemaeyer commented Nov 13, 2024
Describe the bug

Trying to run the geneformer example on the provided test data, as explained in the tutorials. After installing the latest version of geneformer from the Hugging Face repository and some plumbing to get everything working, I run into the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[92], line 7
      3 # create the trainer
      5 kwargs = {"token_dictionary": tokenizer.gene_token_dict};
      6 trainer = Trainer(model=model,
----> 7                 data_collator=DataCollatorForCellClassification())
      8 # use trainer

File ~/.pyenv/versions/3.10.0/lib/python3.10/site-packages/geneformer/collator_for_classification.py:611, in DataCollatorForGeneClassification.__init__(self, *args, **kwargs)
    610 def __init__(self, *args, **kwargs) -> None:
--> 611     self.token_dictionary = kwargs.pop("token_dictionary")
    612     super().__init__(
    613         tokenizer=PrecollatorForGeneAndCellClassification(
    614             token_dictionary=self.token_dictionary
   (...)
    621         **kwargs,
    622     )

KeyError: 'token_dictionary'

when I try to execute:

# reload pretrained model
model = BertForSequenceClassification.from_pretrained(model_dir)
# create the trainer
trainer = Trainer(model=model,
                data_collator=DataCollatorForCellClassification())

All of the data used is the provided example data; none comes from external sources.

Since the collator's __init__ pops a required token_dictionary from kwargs, I tried overriding it by passing one explicitly:

# reload pretrained model
model = BertForSequenceClassification.from_pretrained(model_dir)
# create the trainer

kwargs = {"token_dictionary": tokenizer.gene_token_dict};
trainer = Trainer(model=model,
                data_collator=DataCollatorForCellClassification(**kwargs))

But this results in the following error during training:

File ~/.pyenv/versions/3.10.0/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:1073, in BertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   1071 if hasattr(self.embeddings, "token_type_ids"):
   1072     buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length]
-> 1073     buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
   1074     token_type_ids = buffered_token_type_ids_expanded
   1075 else:

RuntimeError: The expanded size of the tensor (2377) must match the existing size (2048) at non-singleton dimension 1.  Target sizes: [8, 2377].  Tensor sizes: [1, 2048]
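
My suspicion (speculation on my part): the newer tokenizer's gene_token_dict produces sequences longer than the 2048 position embeddings of the pretrained checkpoint, so the tokenizer and the model probably come from different Geneformer versions. A minimal check of that theory, where model_dir and dataset are the variables already in scope from the tutorial:

# Speculative sanity check: compare the checkpoint's position-embedding
# capacity with the longest tokenized sequence. `model_dir` and `dataset`
# are the tutorial's variables, nothing new.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(model_dir)
print(model.config.max_position_embeddings)            # 2048 for this checkpoint
print(max(len(ids) for ids in dataset["input_ids"]))   # exceeds 2048, matching the error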

To Reproduce

Run the geneformer notebook with the latest version of geneformer installed.

Environment


Mac M1 Pro, running Python 3.10; the most important libraries:

cellxgene-census 1.16.2
geneformer 0.1.0
tiledb 0.32.5
tiledbsoma 1.14.5
torch 2.5.1
transformers 4.46.2
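
For reproducibility, these versions can be read programmatically; a minimal sketch using the standard library's importlib.metadata, assuming the distribution names above:

# Print installed versions of the relevant packages.
from importlib.metadata import version

for pkg in ["cellxgene-census", "geneformer", "tiledb",
            "tiledbsoma", "torch", "transformers"]:
    print(pkg, version(pkg))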

ddemaeyer added the bug (Something isn't working) label on Nov 13, 2024
@mlin (Contributor) commented Nov 15, 2024

@ddemaeyer Can you please try setting geneformer to the specific git revision eb038a6?

pip install git+https://huggingface.co/ctheodoris/Geneformer@eb038a6

Sorry for the road bump. That revision is what we coded against when we created the example; at the time (and possibly still), the Geneformer repository had no tagged releases, which made it a little challenging to track subsequent changes. We'll be doing some work shortly to update the cellxgene_census Geneformer integration to a newer version, but it will take some time to get out the door. Thanks!
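
If it helps, the same revision can also be pinned in a requirements file using pip's standard direct-reference syntax (a sketch; pip freeze will afterwards record the fully resolved commit hash):

# requirements.txt — pin geneformer to the revision the example was coded against
geneformer @ git+https://huggingface.co/ctheodoris/Geneformer@eb038a6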
