
Evaluation metrics don't match with Recall on Predicts. #786

Closed

korchi opened this issue Aug 2, 2024 · Discussed in #736 · 2 comments

Comments

@korchi

korchi commented Aug 2, 2024

Discussed in #736

Originally posted by korchi August 3, 2023
Hi. First of all, thank you for the great tool you are developing. However, I have been puzzled for a week now about how to replicate trainer.evaluate() results by running inference with the model.

My initial idea was to truncate every session by one item (removing the last item_id), call trainer.predict(truncated_sessions), and then compute recall(last_item_ids, predictions[:20]). However, I get a different recall metric. In code, the approach I tried looks roughly like the sketch below.
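A minimal sketch of that approach, assuming trainer.predict() returns the top-k item ids per session; truncate_last and recall_at_k are hypothetical helper names, not library functions:

```python
import numpy as np

def truncate_last(session_item_ids):
    """Drop the last item of each session; it becomes the prediction target."""
    inputs = [ids[:-1] for ids in session_item_ids]
    targets = [ids[-1] for ids in session_item_ids]
    return inputs, targets

def recall_at_k(targets, topk_item_ids, k=20):
    """Fraction of sessions whose target appears among the top-k predicted ids."""
    hits = [t in preds[:k] for t, preds in zip(targets, topk_item_ids)]
    return float(np.mean(hits))

# inputs, targets = truncate_last(sessions)     # hypothetical session data
# predictions = trainer.predict(inputs)         # assumed: top-k item ids per session
# print(recall_at_k(targets, predictions, k=20))
```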

The only way I managed to "replicate" the evaluate() results is by (1) providing non-truncated inputs to trainer.predict() and (2) changing -1 to -2 in

last_item_sessions = non_pad_mask.sum(dim=1) - 1

I am puzzled as to why, but this was the only way I could ensure that the x in

x, _ = self.pre(x)  # type: ignore

(during inference) is the same as the x in

x, y = self.pre(x, targets=y, training=training, testing=testing)  # type: ignore

(during evaluation).
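To illustrate which position that index selects, here is a small sketch assuming right-padded item-id sequences with 0 as the padding id (an assumption for the example; the actual masking logic lives inside the library):

```python
import torch

item_ids = torch.tensor([
    [1, 2, 3, 4, 5, 0, 0],   # session of length 5, right-padded
    [7, 8, 9, 0, 0, 0, 0],   # session of length 3
])
non_pad_mask = item_ids != 0

# Index of the last real item in each session (the default, with -1):
last_item_sessions = non_pad_mask.sum(dim=1) - 1   # tensor([4, 2])

# The "-2" variant instead points at the item before the last one,
# i.e. the position whose hidden state predicts the held-out last item:
second_to_last = non_pad_mask.sum(dim=1) - 2       # tensor([3, 1])
```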

Is it because trainer.evaluate() shifts the inputs to the left by one position? Or what am I doing incorrectly? Could anyone provide me with insights on how to do it "correctly", please?

Thanks a lot.

@rnyak
Contributor

rnyak commented Aug 13, 2024

@korchi Say your original input sequence is the item-id list [1, 2, 3, 4, 5]. If you want to compare trainer.evaluate() and trainer.predict(), the input sequences you feed to the model should be different.
For predict, say your target item is the last item, which is 5; then you should feed [1, 2, 3, 4] as the input.
The model will use the first four entries to predict the fifth one, and you can then compare the prediction with the ground-truth item id (5 in this example):

[1, 2, 3, 4] --> [1, 2, 3, 4, predicted_item_id]

Compare predicted_item_id with the ground truth.

Whereas for trainer.evaluate() you feed the entire sequence [1, 2, 3, 4, 5], and the evaluate function does its job: it generates a predicted item id for the last item in the sequence using the items before it.
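Roughly, as a sketch (the session values are illustrative, and the trainer.predict() call is shown only as a comment because its exact output format depends on the library version and model configuration):

```python
full_session = [1, 2, 3, 4, 5]

# trainer.evaluate() receives the full session and internally holds out
# the last item as the evaluation target:
eval_input = full_session                          # [1, 2, 3, 4, 5]

# trainer.predict() should receive the truncated session, so the model
# predicts the held-out item, which you then compare with the ground truth:
predict_input, ground_truth = full_session[:-1], full_session[-1]

# predictions = trainer.predict([predict_input])   # hypothetical call
# hit = ground_truth in predictions[0][:20]        # recall@20 contribution
```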

Please also see this example here.

@rnyak
Contributor

rnyak commented Oct 8, 2024

@korchi I am closing this ticket due to low activity; please reopen it if you need to.

@rnyak rnyak closed this as completed Oct 8, 2024