
add tests for meval to replicate paper results #605

Open · wants to merge 2 commits into main

Conversation

pfliu-nlp (Collaborator) commented Dec 27, 2022

Overview

This PR adds tests that verify whether our meta-evaluation processor can replicate results reported in existing published papers.

Relevant issue: https://github.com/inspired-co/taskboard/issues/180

Details

  • Collect system outputs for two metrics (rouge1 and bartscore) from this repo
  • Use ExplainaBoard to process these outputs and compare the results with those reported in the above repo (see the sketch below)
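
For concreteness, here is a minimal sketch of the test pattern being added. The load_meta_eval_scores helper is hypothetical, standing in for ExplainaBoard's actual loading and processing; only the assertion against the paper's reported value mirrors the real tests.

# Sketch of the replication test pattern (not ExplainaBoard's actual API).
# `load_meta_eval_scores` is a hypothetical helper for illustration only.
import unittest

from scipy.stats import spearmanr


class MetaEvalReplicationSketch(unittest.TestCase):
    def test_rouge1_newsroom_coherence(self):
        # hypothetical helper: per-system rouge1 scores paired with the
        # matching human coherence judgments from the newsroom dataset
        metric_scores, human_scores = load_meta_eval_scores("rouge1", "coherence")
        overall_score, _ = spearmanr(metric_scores, human_scores)
        # reference value: Table 4 -> Column: NeR18 -> COH in
        # https://arxiv.org/pdf/2106.11520.pdf
        self.assertAlmostEqual(overall_score, 0.0946, places=4)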

References

pfliu-nlp marked this pull request as ready for review December 27, 2022 22:23
self.assertGreater(len(sys_info.results.analyses), 0)
self.assertAlmostEqual(
overall_score,
0.0946,
Contributor:

I couldn't find this score in the ROUGE results in the original paper.

Collaborator Author:

ROUGE1: Table 4 -> Column: NeR18 -> COH

self.assertGreater(len(sys_info.results.analyses), 0)
self.assertAlmostEqual(
overall_score,
0.3157,
Contributor:

Ditto

Collaborator Author:

This result isn't listed in the paper, but it is stored in their GitHub repo, in the reproduce section: https://github.com/neulab/BARTScore#reproduce
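
For reference, the system outputs in that repo are produced roughly as in its README; a sketch below, where the checkpoint and the score() signature are recalled from the BARTScore README and should be treated as assumptions:

# Sketch of generating BARTScore outputs, following the README of
# https://github.com/neulab/BARTScore; the checkpoint and signature are
# assumptions from that README, not part of this PR.
from bart_score import BARTScorer

bart_scorer = BARTScorer(device="cuda:0", checkpoint="facebook/bart-large-cnn")
# score each generated summary against its source document
scores = bart_scorer.score(
    ["The source document text ..."],  # srcs
    ["The generated summary ..."],     # tgts
    batch_size=4,
)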

Contributor:

Thanks! I still didn't see where the number 0.3157 came from, though. The Spearman numbers in the table were around 0.49.

Could you add a little bit more detail to the comment about where the number could be found?

Sorry, I don't want to be pedantic, but it'd be nice to keep this information around, and if we don't do it now we'll likely forget later.

@neubig (Contributor) left a comment:

Thanks for this! Overall, I think this is a great initiative.

However, if the numbers are a little bit different because of the stemmer used for ROUGE, then they aren't really "replicated", I guess? To truly replicate the results we could probably calculate ROUGE with the same stemmer as the original paper. Would that be hard?
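
For what it's worth, a minimal sketch of recomputing ROUGE1 with Porter stemming enabled, assuming Google's rouge_score package; whether this matches the stemmer in the original paper's implementation is exactly the open question:

# Sketch: ROUGE1 with the Porter stemmer turned on, via the rouge_score
# package (an assumption; EaaS and the original paper may use a different
# implementation).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
result = scorer.score(
    target="the cats sat on the mats",
    prediction="the cat sits on the mat",
)
# stemming maps cats->cat and mats->mat, so unigram overlap increases
print(result["rouge1"].fmeasure)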

class MetaEvalNLGNewsroomTest(unittest.TestCase):
"""
Test the NLG metric on newsroom dataset and replicate the reported results from
the paper: https://arxiv.org/pdf/2106.11520.pdf
Contributor:

Could you please add some of the details from the PR into this comment here so they remain for later?

Also, I agree with Yusuke's comments below that it'd be a good idea to specify which results were replicated.

Collaborator Author:

"To truly replicate the results we can probably calculate ROUGE with the same stemmer as the original paper. Would that be hard?"

I don't plan to do this: (1) we would need to modify the EaaS code, and (2) the original paper uses a package that will become outdated, so there is no need to support it. In any case, no plan to do this here.

Contributor:

Yeah, that's fair enough!

@neubig (Contributor) left a comment:

Thanks a lot! I was able to confirm the ROUGE number but still wasn't able to find the BARTScore number.

Also, I reviewed this before you clicked the "re-request review" button, so apologies if this was still WIP!


.value
)
self.assertGreater(len(sys_info.results.analyses), 0)
# Replicate the Table 4 result in paper: https://arxiv.org/pdf/2106.11520.pdf
Contributor:

Suggested change:
- # Replicate the Table 4 result in paper: https://arxiv.org/pdf/2106.11520.pdf
+ # Replicate the Table 4 result in paper: https://arxiv.org/pdf/2106.11520.pdf
+ # Specifically: ROUGE1: Table 4 -> Column: NeR18 -> COH

