Running the geneformer example results in KeyError #1312

Open · ddemaeyer opened this issue Nov 13, 2024 · 1 comment
Labels: bug (Something isn't working)

@ddemaeyer commented Nov 13, 2024
Describe the bug

Trying to run the geneformer example on the provided test data, as explained in the tutorials. After installing the latest version of geneformer from the Hugging Face repository and some plumbing to get everything working, I run into the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[92], line 7
      3 # create the trainer
      5 kwargs = {"token_dictionary": tokenizer.gene_token_dict};
      6 trainer = Trainer(model=model,
----> 7                 data_collator=DataCollatorForCellClassification())
      8 # use trainer

File ~/.pyenv/versions/3.10.0/lib/python3.10/site-packages/geneformer/collator_for_classification.py:611, in DataCollatorForGeneClassification.__init__(self, *args, **kwargs)
    610 def __init__(self, *args, **kwargs) -> None:
--> 611     self.token_dictionary = kwargs.pop("token_dictionary")
    612     super().__init__(
    613         tokenizer=PrecollatorForGeneAndCellClassification(
    614             token_dictionary=self.token_dictionary
   (...)
    621         **kwargs,
    622     )

KeyError: 'token_dictionary'

when I try to execute:

# reload pretrained model
model = BertForSequenceClassification.from_pretrained(model_dir)
# create the trainer
trainer = Trainer(model=model,
                data_collator=DataCollatorForCellClassification())

All of the data used is the provided example data; none comes from external sources.

Since the collator's __init__ pops a required token_dictionary from kwargs, I tried overriding it by passing one explicitly:

# reload pretrained model
model = BertForSequenceClassification.from_pretrained(model_dir)
# create the trainer

kwargs = {"token_dictionary": tokenizer.gene_token_dict};
trainer = Trainer(model=model,
                data_collator=DataCollatorForCellClassification(**kwargs))

But this results in the following error during training:

File ~/.pyenv/versions/3.10.0/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:1073, in BertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   1071 if hasattr(self.embeddings, "token_type_ids"):
   1072     buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length]
-> 1073     buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
   1074     token_type_ids = buffered_token_type_ids_expanded
   1075 else:

RuntimeError: The expanded size of the tensor (2377) must match the existing size (2048) at non-singleton dimension 1.  Target sizes: [8, 2377].  Tensor sizes: [1, 2048]
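
My suspicion (speculation on my part): the newer tokenizer's gene_token_dict produces sequences longer than the 2048 position embeddings of the pretrained checkpoint, so the tokenizer and the model probably come from different Geneformer versions. A minimal check of that theory, where model_dir and dataset are the variables already in scope from the tutorial:

# Speculative sanity check: compare the checkpoint's position-embedding
# capacity with the longest tokenized sequence. `model_dir` and `dataset`
# are the tutorial's variables, nothing new.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(model_dir)
print(model.config.max_position_embeddings)            # 2048 for this checkpoint
print(max(len(ids) for ids in dataset["input_ids"]))   # exceeds 2048, matching the error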

To Reproduce

Run the geneformer notebook with the latest version of geneformer installed.

Environment


Mac M1 Pro, running Python 3.10; the most important libraries:

cellxgene-census 1.16.2
geneformer 0.1.0
tiledb 0.32.5
tiledbsoma 1.14.5
torch 2.5.1
transformers 4.46.2
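
For reproducibility, these versions can be read programmatically; a minimal sketch using the standard library's importlib.metadata, assuming the distribution names above:

# Print installed versions of the relevant packages.
from importlib.metadata import version

for pkg in ["cellxgene-census", "geneformer", "tiledb",
            "tiledbsoma", "torch", "transformers"]:
    print(pkg, version(pkg))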

ddemaeyer added the bug (Something isn't working) label on Nov 13, 2024
@mlin (Contributor) commented Nov 15, 2024

@ddemaeyer Can you please try setting geneformer to the specific git revision eb038a6?

pip install git+https://huggingface.co/ctheodoris/Geneformer@eb038a6

Sorry for the road bump. That revision is what we coded against when we created the example; at the time (and possibly still), the Geneformer repository had no tagged releases, which made it a little challenging to track subsequent changes. We'll be doing some work shortly to update the cellxgene_census Geneformer integration to a newer version, but it will take some time to get out the door. Thanks!
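
If it helps, the same revision can also be pinned in a requirements file using pip's standard direct-reference syntax (a sketch; pip freeze will afterwards record the fully resolved commit hash):

# requirements.txt — pin geneformer to the revision the example was coded against
geneformer @ git+https://huggingface.co/ctheodoris/Geneformer@eb038a6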
