-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add tests for meval to replicate paper results #605
base: main
Are you sure you want to change the base?
Conversation
self.assertGreater(len(sys_info.results.analyses), 0) | ||
self.assertAlmostEqual( | ||
overall_score, | ||
0.0946, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find this score in the ROUGE results in the original paper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ROUGE1: Table 4 -> Column: NeR18 -> COH
self.assertGreater(len(sys_info.results.analyses), 0) | ||
self.assertAlmostEqual( | ||
overall_score, | ||
0.3157, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This result didn't be listed in paper, but are stored in their github repo: implementation section: https://github.com/neulab/BARTScore#reproduce
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I still didn't see where the number 0.3157 came from though. The Spearman numbers in the table were in the range 0.49.
Could you add a little bit more detail to the comment about where the number could be found?
Sorry, I don't want to be pedantic, but it'd be nice to keep this information around and if we don't do it know we'll likely forget later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! Overall, I think this is a great initiative.
However, if the numbers are a little bit different because of the stemmer used for ROUGE, actually they aren't really "replicated" I guess? To truly replicate the results we can probably calculate ROUGE with the same stemmer as the original paper. Would that be hard?
class MetaEvalNLGNewsroomTest(unittest.TestCase): | ||
""" | ||
Test the NLG metric on newsroom dataset and replicate the reported results from | ||
the paper: https://arxiv.org/pdf/2106.11520.pdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add some of the details from the PR into this comment here so they remain for later?
Also, I agree with Yusuke's comments below that it'd be a good idea to specify which results were replicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"To truly replicate the results we can probably calculate ROUGE with the same stemmer as the original paper. Would that be hard?"
I won't plan to do this. (1) We need to modify the eaas code. (2) the original paper use a package that will be outdated. no need to support it. Anyway, no plan to do this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's fair enough!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot! I was able to confirm the ROUGE number but still wasn't able to find the BARTScore number.
Also, I reviewed this before you clicked the "re-request review button", so if this was still WIP apologies!
class MetaEvalNLGNewsroomTest(unittest.TestCase): | ||
""" | ||
Test the NLG metric on newsroom dataset and replicate the reported results from | ||
the paper: https://arxiv.org/pdf/2106.11520.pdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's fair enough!
.value | ||
) | ||
self.assertGreater(len(sys_info.results.analyses), 0) | ||
# Replicate the Table 4 result in paper: https://arxiv.org/pdf/2106.11520.pdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Replicate the Table 4 result in paper: https://arxiv.org/pdf/2106.11520.pdf | |
# Replicate the Table 4 result in paper: https://arxiv.org/pdf/2106.11520.pdf | |
# Specifically: ROUGE1: Table 4 -> Column: NeR18 -> COH |
self.assertGreater(len(sys_info.results.analyses), 0) | ||
self.assertAlmostEqual( | ||
overall_score, | ||
0.3157, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I still didn't see where the number 0.3157 came from though. The Spearman numbers in the table were in the range 0.49.
Could you add a little bit more detail to the comment about where the number could be found?
Sorry, I don't want to be pedantic, but it'd be nice to keep this information around and if we don't do it know we'll likely forget later.
Overview
This PR adds tests to verify whether our implemented meta-evaluation processor is able to replicate reported results from existing published papers.
Relevant issue: https://github.com/inspired-co/taskboard/issues/180
Details
References