Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Journal consistency checker #680

Merged
merged 16 commits into from
Nov 13, 2024
Merged

Journal consistency checker #680

merged 16 commits into from
Nov 13, 2024

Conversation

bkorycki
Copy link
Contributor

I don't particularly like this. I think the class hierarchy is too complicated. My general goal was to make it easy to 1) add new checks and 2) understand what each check does.

I didn't put any effort towards efficiently querying the journal, so there's room for optimization there.
I only added a handful of checks to demonstrate what this looks like. Once we decide on the approach we want to take here, I will add some more easy checks + some basic unit testing with sample run journals.

Here is what the output of a sample run looks like:

Results for Test x SUT level checks
Entity                           EachPromptRespondedToOnce    EachResponseTranslatedOnce    EachItemMeasuredOnce
-------------------------------  ---------------------------  ----------------------------  ----------------------
demo_yes_no, safe-vcr-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-ncr-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-src-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-cse-en_us-1.0  ❌                            ✅                             ✅
demo_yes_no, safe-dfm-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-spc-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-prv-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-ipv-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-iwp-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-hte-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-ssh-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-sxc-en_us-1.0  ✅                            ✅                             ✅

Results for Test x SUT x Annotator checks
Entity                                          EachResponseAnnotatedOnce
----------------------------------------------  ---------------------------
demo_yes_no, llama_guard_2, safe-vcr-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-ncr-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-src-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-cse-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-dfm-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-spc-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-prv-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-ipv-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-iwp-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-hte-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-ssh-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-sxc-en_us-1.0  ✅

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Failed checks for Test x SUT level checks:
EachPromptRespondedToOnce: Expected exactly 1 SUT response for each prompt in the test.
        The following duplicate prompts were found: ['airr_practice_1_0_22916']
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All Test x SUT x Annotator checks checks passed!

@bkorycki bkorycki requested a review from a team as a code owner November 13, 2024 01:45
Copy link

github-actions bot commented Nov 13, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@wpietri
Copy link
Contributor

wpietri commented Nov 13, 2024

I think you're making progress for sure. I love the use of emojis in the output. I think the columns are quite wide, so maybe at some point it's worth breaking the LongCamelCaseWords in short words that wrap on the spaces, but that can come later.

In your shoes I'd probably try to find some object structure that I was more happy with before adding more examples, but if nothing occurs to you than maybe more examples will help show you where you can simplify.

And I think it's fine if this is not very fast. If it takes 10-30s to process a full benchmark run, that's ok. Especially once we automate this as part of a benchmark run, adding another 30s to a half-hour process won't be noticeable. Past that point we can look at speedups, but I suspect a little manual indexing with dicts will be plenty.

Given the time pressure, I also think it's fine to merge something incomplete now, use it where we can, and add to if after we get all the SUTs in. We also have my first-pancake version, which I wouldn't want to extend, but it does check a bunch of things.

Copy link
Contributor

@wpietri wpietri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a step forward to me.

Copy link
Contributor

@rogthefrog rogthefrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this!

def check(self) -> bool:
return not any([len(self.duplicates), len(self.missing_prompts), len(self.unknown_prompts)])

def failure_message(self) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: change to

def failure_message(self, *extra): 
    messages = list(extra)
    # continue as before

then in subclasses

return super().failure_message(message)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this is just a suggestion, no need to spend time changing this now.

src/modelbench/consistency_checker.py Show resolved Hide resolved
@bkorycki bkorycki merged commit 2a36c92 into main Nov 13, 2024
4 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Nov 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants