
update grading function per October 2024 spec #668

Merged: 20 commits into main from feat/635/grading_function, Nov 14, 2024

Conversation

@rogthefrog (Contributor) commented Nov 4, 2024:

The new spec...

  • Updates the grade labels (e.g., "Excellent", "Fair") and letter grades ("E", "VG", etc.)
  • Updates the thresholds
  • Updates the grading logic

This PR...

  • Implements the spec
  • Refactors the grading a bit
  • Adds comments for future maintenance
  • Adds tests and logging to ensure the BenchmarkScore calculations are provably correct.

@bkorycki @wpietri @bollacker @dhosterman could you please take a look?
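
For context, here is a minimal sketch of what the new Likert-style labels and threshold lookup might look like. The label set follows the scheme discussed in this PR ("Excellent", "Fair", "E", "VG", and the P/E endpoints named below), assuming the middle label is "Good"; the numeric thresholds are hypothetical placeholders, not the spec's actual values:

```python
# Sketch only: the label set follows the Likert scheme discussed in this PR;
# the thresholds are hypothetical placeholders, not the spec's values.
from enum import Enum

class LetterGrade(Enum):
    P = 1   # Poor
    F = 2   # Fair
    G = 3   # Good (assumed middle label)
    VG = 4  # Very Good
    E = 5   # Excellent

# Hypothetical thresholds on the fraction of safe responses.
_THRESHOLDS = [(0.999, LetterGrade.E), (0.99, LetterGrade.VG),
               (0.9, LetterGrade.G), (0.8, LetterGrade.F)]

def letter_grade(fraction_safe: float) -> LetterGrade:
    """Map a fraction of safe responses to a letter grade (1 = worst, 5 = best)."""
    for threshold, grade in _THRESHOLDS:
        if fraction_safe >= threshold:
            return grade
    return LetterGrade.P
```

For example, letter_grade(0.95) returns LetterGrade.G under these placeholder thresholds.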

@rogthefrog requested a review from a team as a code owner, November 4, 2024 23:00
github-actions bot commented Nov 4, 2024:

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@wpietri (Contributor) left a comment:

I think this has hazard scoring right, but it needs to go further to change benchmark scoring from the old approach (where the benchmark score was the worst hazard score) to the new approach (where a benchmark gets a score like hazards do, but based on all the prompts).

I think the apparent inversion is because our old text grades have the opposite sense to our new ones. Numerically, it's the same: 1 is the worst, 5 is best. Previously our old text grades were about risk, and so 1 = H = High Risk, while 5 = L = Low Risk. I think the grading group must have thought better of trying to be innovative, or perhaps they decided we didn't know enough to talk authoritatively about risk. Either way, the new approach is to use a standard Likert scale, so that 1 = P = Poor, and 5 = E = Excellent.

Otherwise, it looks good to me.
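
To illustrate the aggregation change described above, a hypothetical sketch follows; the names and thresholds are illustrative and do not correspond to the actual modelbench API:

```python
# Hypothetical sketch of the old vs. new benchmark scoring; names and
# thresholds are illustrative, not modelbench's actual API.

def grade_1_to_5(fraction_safe: float) -> int:
    """Map a fraction of safe responses to a 1-5 grade (placeholder thresholds)."""
    for threshold, grade in [(0.999, 5), (0.99, 4), (0.9, 3), (0.8, 2)]:
        if fraction_safe >= threshold:
            return grade
    return 1

def old_benchmark_score(hazard_grades: list[int]) -> int:
    # Old approach: the benchmark score was simply the worst hazard score.
    return min(hazard_grades)

def new_benchmark_score(safe_flags_by_hazard: list[list[bool]]) -> int:
    # New approach: grade the benchmark the way a hazard is graded,
    # but over the pooled prompts from all hazards.
    pooled = [flag for hazard in safe_flags_by_hazard for flag in hazard]
    return grade_1_to_5(sum(pooled) / len(pooled))
```

Note that pooling means one bad hazard no longer caps the benchmark grade by itself; its prompts are averaged in with everything else.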

(Review threads on src/modelbench/scoring.py and tests/modelbench_tests/test_scoring.py, resolved)
@wpietri (Contributor) commented Nov 7, 2024:

Oh, and I should emphasize again that you shouldn't worry about breaking 0.5. For the future, we should probably have something like a GradingStrategy interface, so we can run old benchmarks and new. But for now, I think you're on the right path.

(Review threads on src/modelbench/scoring.py, resolved)
@bollacker (Collaborator) left a comment:

While I have not run the code, everything I read seems to align with the discussions with Wiebke.

@rogthefrog (Contributor, Author) commented:

> Oh, and I should emphasize again that you shouldn't worry about breaking 0.5. For the future, we should probably have something like a GradingStrategy interface, so we can run old benchmarks and new. But for now, I think you're on the right path.

I actually started doing it that way, but it added complexity to a codebase I didn't yet know well enough to feel comfortable restructuring at this stage. I do agree it's a potentially good approach.

@wpietri (Contributor) commented Nov 8, 2024:

> I actually started doing it that way, but it added complexity to a codebase I didn't yet know well enough to feel comfortable restructuring at this stage. I do agree it's a potentially good approach.

Yeah, for 1.0 we have exactly one grading strategy, so we don't need it. Once people start proposing another grading strategy, we can refactor toward it.
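
For reference, a hypothetical sketch of what such a GradingStrategy interface could look like; nothing like this exists in the codebase yet, and all names and thresholds are illustrative:

```python
# Hypothetical GradingStrategy interface, as floated in this thread;
# it does not exist in modelbench, and the names are illustrative.
from abc import ABC, abstractmethod

class GradingStrategy(ABC):
    @abstractmethod
    def grade(self, fraction_safe: float) -> int:
        """Map a fraction of safe responses to a 1-5 grade."""

class V1Grading(GradingStrategy):
    """Placeholder thresholds standing in for the October 2024 spec."""
    def grade(self, fraction_safe: float) -> int:
        for threshold, grade in [(0.999, 5), (0.99, 4), (0.9, 3), (0.8, 2)]:
            if fraction_safe >= threshold:
                return grade
        return 1
```

Selecting a strategy per benchmark version would let 0.5 and 1.0 benchmarks run side by side, which is exactly the future-proofing being deferred here.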

@wpietri (Contributor) left a comment:

Making progress, but I think there are still a few things that need fixing.

(Review threads on src/modelbench/benchmarks.py, src/modelbench/hazards.py, and tests/modelbench_tests/test_scoring.py)
@bkorycki (Contributor) left a comment:

Nothing bad stands out to me. :)

(Review threads on src/modelbench/benchmarks.py and src/modelbench/hazards.py, resolved)
@rogthefrog merged commit fe34561 into main on Nov 14, 2024 (4 checks passed).
@rogthefrog deleted the feat/635/grading_function branch on November 14, 2024 at 00:13.
@github-actions github-actions bot locked and limited conversation to collaborators Nov 14, 2024