Fix torch.cat() issue when processing a large number of documents with TransformersModelForTokenClassificationNerStep #80
The Issue
I noticed a weird problem in the evaluation script when trying to naively process a large number (365) of Kazu documents with the TransformersModelForTokenClassificationNerStep step using the MPS device. The 365 documents totalled over 14k sections and were being processed with a newly trained 400MB model. Running this on a Mac M3 with MPS, I saw Python's memory usage peak at 18GB.

In the end the step failed to predict any entities, but without any exceptions being thrown. The result was a weird phenomenon inside https://github.com/AstraZeneca/KAZU/blob/main/kazu/steps/ner/hf_token_classification.py, where torch.cat was producing a tensor full of zeros, indicating the model had not found any entities. This is likely due to torch.cat exceeding the allocated memory of the device.
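For illustration, a minimal sketch of the kind of accumulation pattern involved (hypothetical code, not an excerpt from the Kazu step): every batch of logits stays on the MPS device and the final torch.cat is allocated there too, so a long run of sections can quietly exhaust device memory.

```python
# Illustrative sketch only: per-batch logits are kept on the MPS device and
# concatenated there. Over thousands of batches this device-side accumulation
# can exhaust MPS memory, and the concatenated tensor can come back silently
# full of zeros instead of raising an error.
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
num_batches, batch_size, seq_len, num_labels = 1000, 16, 128, 9

all_logits = []
with torch.no_grad():
    for _ in range(num_batches):
        # stand-in for model(**batch).logits computed on the device
        logits = torch.randn(batch_size, seq_len, num_labels, device=device)
        all_logits.append(logits)  # every batch stays resident on the device

results = torch.cat(all_logits)  # the concatenation is also allocated on MPS
```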
The Fix
The fix is in two places. Firstly, the evaluate script now processes the documents in batches through the pipeline. Secondly, to stop a user naively processing many documents with the Kazu pipeline and hitting this issue, there is also a fix inside TransformersModelForTokenClassificationNerStep: it offloads the model logits onto the CPU before concatenation. Sketches of both are shown below.
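A minimal sketch of the step-level fix, reusing the illustrative loop above (hypothetical code, not a verbatim excerpt from TransformersModelForTokenClassificationNerStep): the only change is that each batch of logits is moved to the CPU before being accumulated, so torch.cat runs against host memory rather than the MPS device.

```python
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
num_batches, batch_size, seq_len, num_labels = 1000, 16, 128, 9

all_logits = []
with torch.no_grad():
    for _ in range(num_batches):
        logits = torch.randn(batch_size, seq_len, num_labels, device=device)
        all_logits.append(logits.cpu())  # offload each batch to host memory

results = torch.cat(all_logits)  # concatenation now happens on the CPU
```

And a sketch of the evaluate-script fix, assuming a callable pipeline that accepts a list of documents (run_in_batches and the default batch size are hypothetical):

```python
from typing import Callable, List, Sequence

def run_in_batches(pipeline: Callable, docs: Sequence, batch_size: int = 10) -> List:
    """Push docs through the pipeline in fixed-size batches rather than all at once."""
    processed: List = []
    for i in range(0, len(docs), batch_size):
        processed.extend(pipeline(list(docs[i : i + batch_size])))
    return processed
```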
Testing Performance
Here we repeat the naive call to the step with all the documents at once, as before, and compare the version of TransformersModelForTokenClassificationNerStep before and after the change. Before the change we observe a peak memory usage of 18GB and it takes 690s to process all the documents. With the new implementation we see a peak memory usage of 4GB and it takes 680s to process all the documents, and the weird zero-logits issue is gone. There therefore doesn't seem to be any performance degradation in executing torch.cat on CPU vs MPS. A CUDA device was not tested, however.
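For reference, one way to capture comparable peak-memory and wall-time numbers using only the standard library (this is a generic helper, not necessarily how the figures above were gathered):

```python
import resource
import sys
import time

def profile(fn, *args, **kwargs):
    """Run fn and report the wall time and peak RSS of the current process."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform != "darwin":
        peak *= 1024  # ru_maxrss is bytes on macOS but kilobytes on Linux
    print(f"peak RSS: {peak / 1e9:.1f} GB, wall time: {elapsed:.0f} s")
    return result

# hypothetical usage: profile(pipeline, docs)
```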
General Test for single label classification
A test script with the default model pipeline and Kazu model pack was run as a sanity check. The integration tests will now also be run.
Note
There is also a small refactor moving some functions from train_multilabel_ner to modelling_utils. Individual changes can be seen at the commit level.