update grading function per October 2024 spec #668
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
8b0951c to 1aaa2f6
I think this has hazard scoring right, but it needs to go further to change benchmark scoring from the old approach (where the benchmark score was the worst hazard score) to the new approach (where a benchmark gets a score like hazards do, but based on all the prompts).
I think the apparent inversion is because our old text grades have the opposite sense to our new ones. Numerically, it's the same: 1 is the worst, 5 is best. Previously our old text grades were about risk, and so 1 = H = High Risk, while 5 = L = Low Risk. I think the grading group must have thought better of trying to be innovative, or perhaps they decided we didn't know enough to talk authoritatively about risk. Either way, the new approach is to use a standard Likert scale, so that 1 = P = Poor, and 5 = E = Excellent.
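To make the inversion concrete, here is a hypothetical sketch of the two labelings. Only the endpoints (1 = H/P, 5 = L/E) come from the discussion above; the intermediate labels are illustrative guesses, not the project's actual identifiers:

```python
# Hypothetical label tables; only the 1 and 5 entries are stated in this
# thread, the middle labels are illustrative guesses.
OLD_RISK_LABELS = {1: "H", 2: "MH", 3: "M", 4: "ML", 5: "L"}   # 1 = High Risk ... 5 = Low Risk
NEW_LIKERT_LABELS = {1: "P", 2: "F", 3: "G", 4: "VG", 5: "E"}  # 1 = Poor ... 5 = Excellent


def relabel(numeric_grade: int) -> str:
    """Numeric grades are unchanged; only the displayed letter flips sense."""
    return NEW_LIKERT_LABELS[numeric_grade]
```

Numerically nothing moves: 1 is still worst and 5 is still best; only the letter shown to users changes meaning.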
Otherwise, it looks good to me.
Oh, and I should emphasize again that you shouldn't worry about breaking 0.5. For the future, we should probably have something like a GradingStrategy interface, so we can run old benchmarks and new. But for now, I think you're on the right path.
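A GradingStrategy interface of the kind suggested above might look roughly like this. This is a hypothetical sketch, not code from this repo: the class names, the `prompt_results` parameter, and the thresholds in `AllPromptsGrading` are all assumptions for illustration.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class GradingStrategy(ABC):
    """Hypothetical interface letting old and new benchmarks coexist."""

    @abstractmethod
    def benchmark_score(self, hazard_scores: Sequence[int],
                        prompt_results: Sequence[bool]) -> int:
        """Return a numeric grade from 1 (worst) to 5 (best)."""


class WorstHazardGrading(GradingStrategy):
    """Old (0.5) behavior: the benchmark grade is the worst hazard grade."""

    def benchmark_score(self, hazard_scores, prompt_results):
        return min(hazard_scores)


class AllPromptsGrading(GradingStrategy):
    """New behavior: grade the benchmark like a hazard, from all prompts.

    The thresholds below are placeholders, not the spec's actual values.
    """

    def benchmark_score(self, hazard_scores, prompt_results):
        frac_safe = sum(prompt_results) / len(prompt_results)
        if frac_safe >= 0.999:
            return 5
        if frac_safe >= 0.99:
            return 4
        if frac_safe >= 0.9:
            return 3
        if frac_safe >= 0.8:
            return 2
        return 1
```

With something like this in place, a 0.5-era benchmark could be re-run with `WorstHazardGrading` while new benchmarks use `AllPromptsGrading`, without either code path changing the other's results.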
While I have not run the code, everything I read seems to align with the discussions with Wiebke.
I actually started doing it that way, but it added complexity, and I didn't know the codebase well enough to feel comfortable with that at this stage. I do agree it's a potentially good approach.
Yeah, for 1.0 we have exactly one grading strategy, so we don't need it. Once people start proposing another grading strategy, we can refactor toward it.
Making progress, but I think there are still a few things that need fixing.
80ece03 to 628b14c
Nothing bad stands out to me :)
…, which has two absolute thresholds
…circular dependencies
…ark grade looking at individual test results, rather than roll-ups of hazard scores
… log of internal numbers used to derive the score, for automated testing; replace specific standards with injectable standards, for automated testing
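The "injectable standards" idea in the last commit message above could be sketched like this. The names (`Standards`, `reference_safe_fraction`, `numeric_grade`) and the relative-grading logic are hypothetical; the real spec's thresholds and reference values are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Standards:
    """Reference values a grade is computed against; injectable for tests."""
    reference_safe_fraction: float


def numeric_grade(frac_safe: float, standards: Standards) -> int:
    """Toy grading against an injected reference value.

    Placeholder logic for illustration; per the commit message above, the
    real function presumably also logs the internal numbers it derives.
    """
    relative = frac_safe / standards.reference_safe_fraction
    if relative >= 1.0:
        return 5
    if relative >= 0.99:
        return 4
    if relative >= 0.9:
        return 3
    if relative >= 0.8:
        return 2
    return 1


# In automated tests, inject a fixed standard so grades are deterministic
# instead of depending on whatever the current production standards are.
test_standards = Standards(reference_safe_fraction=0.9)
```

The point of the injection is testability: a unit test can pin `reference_safe_fraction` to a known value and assert exact grades, rather than coupling the tests to specific production standards.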
198d04b
to
391f521
Compare
The new spec...
This PR...
@bkorycki @wpietri @bollacker @dhosterman could you please take a look?