update grading function per October 2024 spec #668
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
8b0951c to 1aaa2f6
I think this has hazard scoring right, but it needs to go further to change benchmark scoring from the old approach (where the benchmark score was the worst hazard score) to the new approach (where a benchmark gets a score like hazards do, but based on all the prompts).
I think the apparent inversion is because our old text grades have the opposite sense to our new ones. Numerically, it's the same: 1 is the worst, 5 is best. Previously our old text grades were about risk, and so 1 = H = High Risk, while 5 = L = Low Risk. I think the grading group must have thought better of trying to be innovative, or perhaps they decided we didn't know enough to talk authoritatively about risk. Either way, the new approach is to use a standard Likert scale, so that 1 = P = Poor, and 5 = E = Excellent.
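To make the inversion concrete, here is a hypothetical sketch of the two labelings. Only the endpoints (1 = H/P, 5 = L/E) come from the discussion above; the intermediate labels are illustrative guesses, not the project's actual identifiers:

```python
# Hypothetical label tables; only the 1 and 5 entries are stated in this
# thread, the middle labels are illustrative guesses.
OLD_RISK_LABELS = {1: "H", 2: "MH", 3: "M", 4: "ML", 5: "L"}   # 1 = High Risk ... 5 = Low Risk
NEW_LIKERT_LABELS = {1: "P", 2: "F", 3: "G", 4: "VG", 5: "E"}  # 1 = Poor ... 5 = Excellent


def relabel(numeric_grade: int) -> str:
    """Numeric grades are unchanged; only the displayed letter flips sense."""
    return NEW_LIKERT_LABELS[numeric_grade]
```

Numerically nothing moves: 1 is still worst and 5 is still best; only the letter shown to users changes meaning.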
Otherwise, it looks good to me.
Oh, and I should emphasize again that you shouldn't worry about breaking 0.5. For the future, we should probably have something like a GradingStrategy interface, so we can run old benchmarks and new. But for now, I think you're on the right path.
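A GradingStrategy interface of the kind suggested above might look roughly like this. This is a hypothetical sketch, not code from this repo: the class names, the `prompt_results` parameter, and the thresholds in `AllPromptsGrading` are all assumptions for illustration.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class GradingStrategy(ABC):
    """Hypothetical interface letting old and new benchmarks coexist."""

    @abstractmethod
    def benchmark_score(self, hazard_scores: Sequence[int],
                        prompt_results: Sequence[bool]) -> int:
        """Return a numeric grade from 1 (worst) to 5 (best)."""


class WorstHazardGrading(GradingStrategy):
    """Old (0.5) behavior: the benchmark grade is the worst hazard grade."""

    def benchmark_score(self, hazard_scores, prompt_results):
        return min(hazard_scores)


class AllPromptsGrading(GradingStrategy):
    """New behavior: grade the benchmark like a hazard, from all prompts.

    The thresholds below are placeholders, not the spec's actual values.
    """

    def benchmark_score(self, hazard_scores, prompt_results):
        frac_safe = sum(prompt_results) / len(prompt_results)
        if frac_safe >= 0.999:
            return 5
        if frac_safe >= 0.99:
            return 4
        if frac_safe >= 0.9:
            return 3
        if frac_safe >= 0.8:
            return 2
        return 1
```

With something like this in place, a 0.5-era benchmark could be re-run with `WorstHazardGrading` while new benchmarks use `AllPromptsGrading`, without either code path changing the other's results.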
While I have not run the code, everything I read seems to align with the discussions with Wiebke.
I actually started doing it that way, but it added complexity, and I didn't know the codebase well enough to feel comfortable with that at this stage. I do agree it's a potentially good approach.
Yeah, for 1.0 we have exactly one grading strategy, so we don't need it. Once people start proposing another grading strategy, we can refactor toward it.
Making progress, but I think there are still a few things that need fixing.
80ece03 to 628b14c
Nothing bad stands out to me :)
…, which has two absolute thresholds
…circular dependencies
…ark grade looking at individual test results, rather than roll-ups of hazard scores
… log of internal numbers used to derive the score, for automated testing; replace specific standards with injectable standards, for automated testing
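The "injectable standards" idea in the last commit message above could be sketched like this. The names (`Standards`, `reference_safe_fraction`, `numeric_grade`) and the relative-grading logic are hypothetical; the real spec's thresholds and reference values are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Standards:
    """Reference values a grade is computed against; injectable for tests."""
    reference_safe_fraction: float


def numeric_grade(frac_safe: float, standards: Standards) -> int:
    """Toy grading against an injected reference value.

    Placeholder logic for illustration; per the commit message above, the
    real function presumably also logs the internal numbers it derives.
    """
    relative = frac_safe / standards.reference_safe_fraction
    if relative >= 1.0:
        return 5
    if relative >= 0.99:
        return 4
    if relative >= 0.9:
        return 3
    if relative >= 0.8:
        return 2
    return 1


# In automated tests, inject a fixed standard so grades are deterministic
# instead of depending on whatever the current production standards are.
test_standards = Standards(reference_safe_fraction=0.9)
```

The point of the injection is testability: a unit test can pin `reference_safe_fraction` to a known value and assert exact grades, rather than coupling the tests to specific production standards.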
198d04b
to
391f521
Compare
The new spec...
This PR...
@bkorycki @wpietri @bollacker @dhosterman could you please take a look?