
Evaluation metrics don't match with Recall on Predicts. #786

Closed

korchi opened this issue Aug 2, 2024 · Discussed in #736 · 2 comments

Comments

@korchi

korchi commented Aug 2, 2024

Discussed in #736

Originally posted by korchi August 3, 2023
Hi. First of all, thank you for the great tool you are developing. However, I have been puzzled for a week now about how to replicate trainer.evaluate() results by running inference with the model.

My initial idea was to truncate every session by one item (removing the last item_id), call trainer.predict(truncated_sessions), and then compute recall(last_item_ids, predictions[:20]). However, I get a different recall metric. In code, the approach I tried looks roughly like the sketch below.
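A minimal sketch of that approach, assuming trainer.predict() returns the top-k item ids per session; truncate_last and recall_at_k are hypothetical helper names, not library functions:

```python
import numpy as np

def truncate_last(session_item_ids):
    """Drop the last item of each session; it becomes the prediction target."""
    inputs = [ids[:-1] for ids in session_item_ids]
    targets = [ids[-1] for ids in session_item_ids]
    return inputs, targets

def recall_at_k(targets, topk_item_ids, k=20):
    """Fraction of sessions whose target appears among the top-k predicted ids."""
    hits = [t in preds[:k] for t, preds in zip(targets, topk_item_ids)]
    return float(np.mean(hits))

# inputs, targets = truncate_last(sessions)     # hypothetical session data
# predictions = trainer.predict(inputs)         # assumed: top-k item ids per session
# print(recall_at_k(targets, predictions, k=20))
```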

The only way I managed to "replicate" the evaluate() results is by (1) providing non-truncated inputs to trainer.predict() and (2) changing -1 to -2 in

last_item_sessions = non_pad_mask.sum(dim=1) - 1

I am puzzled as to why, but this was the only way I could ensure that the x in

x, _ = self.pre(x)  # type: ignore

(during inference) is the same as the x in

x, y = self.pre(x, targets=y, training=training, testing=testing)  # type: ignore

(during evaluation).
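To illustrate which position that index selects, here is a small sketch assuming right-padded item-id sequences with 0 as the padding id (an assumption for the example; the actual masking logic lives inside the library):

```python
import torch

item_ids = torch.tensor([
    [1, 2, 3, 4, 5, 0, 0],   # session of length 5, right-padded
    [7, 8, 9, 0, 0, 0, 0],   # session of length 3
])
non_pad_mask = item_ids != 0

# Index of the last real item in each session (the default, with -1):
last_item_sessions = non_pad_mask.sum(dim=1) - 1   # tensor([4, 2])

# The "-2" variant instead points at the item before the last one,
# i.e. the position whose hidden state predicts the held-out last item:
second_to_last = non_pad_mask.sum(dim=1) - 2       # tensor([3, 1])
```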

Is it because trainer.evaluate() shifts the inputs to the left by one position? Or what am I doing incorrectly? Could anyone provide me with insights on how to do it "correctly", please?

Thanks a lot.

@rnyak
Contributor

rnyak commented Aug 13, 2024

@korchi Say your original input sequence is the item-id list [1, 2, 3, 4, 5]. If you want to compare trainer.evaluate() and trainer.predict(), the input sequences you feed to the model should be different.
For predict, say your target item is the last item, which is 5; then you should feed [1, 2, 3, 4] as the input.
The model will use the first four entries to predict the fifth one, and you can then compare the prediction with the ground-truth item id (5 in this example):

[1, 2, 3, 4] --> [1, 2, 3, 4, predicted_item_id]

Compare predicted_item_id with the ground truth.

Whereas for trainer.evaluate() you feed the entire sequence [1, 2, 3, 4, 5], and the evaluate function does its job: it generates a predicted item id for the last item in the sequence using the items before it.
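Roughly, as a sketch (the session values are illustrative, and the trainer.predict() call is shown only as a comment because its exact output format depends on the library version and model configuration):

```python
full_session = [1, 2, 3, 4, 5]

# trainer.evaluate() receives the full session and internally holds out
# the last item as the evaluation target:
eval_input = full_session                          # [1, 2, 3, 4, 5]

# trainer.predict() should receive the truncated session, so the model
# predicts the held-out item, which you then compare with the ground truth:
predict_input, ground_truth = full_session[:-1], full_session[-1]

# predictions = trainer.predict([predict_input])   # hypothetical call
# hit = ground_truth in predictions[0][:20]        # recall@20 contribution
```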

Please also see this example here.

@rnyak
Contributor

rnyak commented Oct 8, 2024

@korchi I am closing this ticket due to low activity; please reopen it if you need to.

@rnyak rnyak closed this as completed Oct 8, 2024