Journal consistency checker #680

bkorycki · 2024-11-13T01:44:59Z

I don't particularly like this. I think the class hierarchy is too complicated. My general goal was to make it easy to 1) add new checks and 2) understand what each check does.

I didn't put any effort towards efficiently querying the journal, so there's room for optimization there.
I only added a handful of checks to demonstrate what this looks like. Once we decide on the approach we want to take here, I will add some more easy checks + some basic unit testing with sample run journals.

Here is what the output of a sample run looks like:

Results for Test x SUT level checks
Entity                           EachPromptRespondedToOnce    EachResponseTranslatedOnce    EachItemMeasuredOnce
-------------------------------  ---------------------------  ----------------------------  ----------------------
demo_yes_no, safe-vcr-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-ncr-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-src-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-cse-en_us-1.0  ❌                            ✅                             ✅
demo_yes_no, safe-dfm-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-spc-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-prv-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-ipv-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-iwp-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-hte-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-ssh-en_us-1.0  ✅                            ✅                             ✅
demo_yes_no, safe-sxc-en_us-1.0  ✅                            ✅                             ✅

Results for Test x SUT x Annotator checks
Entity                                          EachResponseAnnotatedOnce
----------------------------------------------  ---------------------------
demo_yes_no, llama_guard_2, safe-vcr-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-ncr-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-src-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-cse-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-dfm-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-spc-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-prv-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-ipv-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-iwp-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-hte-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-ssh-en_us-1.0  ✅
demo_yes_no, llama_guard_2, safe-sxc-en_us-1.0  ✅

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Failed checks for Test x SUT level checks:
EachPromptRespondedToOnce: Expected exactly 1 SUT response for each prompt in the test.
        The following duplicate prompts were found: ['airr_practice_1_0_22916']
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All Test x SUT x Annotator checks checks passed!
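
For readers following along, the failed check above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `compare_prompts` and `check`, and the idea of passing plain lists of prompt IDs, are assumptions made for the example.

```python
from collections import Counter


def compare_prompts(expected, responded):
    """Return (duplicates, missing, unknown) prompt IDs.

    `expected` and `responded` are hypothetical lists of prompt IDs: the
    prompts in the test and the prompts that got a SUT response.
    """
    counts = Counter(responded)
    expected_set = set(expected)
    responded_set = set(counts)
    # A prompt that appears more than once was responded to more than once.
    duplicates = sorted(p for p, n in counts.items() if n > 1)
    # Prompts in the test that never got a response.
    missing = sorted(expected_set - responded_set)
    # Responses to prompts that are not part of the test.
    unknown = sorted(responded_set - expected_set)
    return duplicates, missing, unknown


def check(expected, responded) -> bool:
    # The check passes only when all three problem lists are empty.
    return not any(compare_prompts(expected, responded))
```

A run that responded twice to one prompt would fail `check` and list that prompt under duplicates, which matches the kind of failure message shown in the sample output.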

github-actions · 2024-11-13T01:45:12Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

wpietri · 2024-11-13T14:46:17Z

I think you're making progress for sure. I love the use of emojis in the output. I think the columns are quite wide, so maybe at some point it's worth breaking the LongCamelCaseWords into short words that wrap on the spaces, but that can come later.
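
One way the header wrapping could look, as a rough sketch only. The helper name `wrap_check_name` and the default width are made up for illustration; nothing here is taken from the PR.

```python
import re
import textwrap


def wrap_check_name(name: str, width: int = 12) -> list[str]:
    """Split a CamelCase check name into words and wrap them to `width`."""
    # Insert a space before every capital letter except the first one.
    words = re.sub(r"(?<!^)(?=[A-Z])", " ", name)
    # textwrap breaks on the spaces that were just inserted.
    return textwrap.wrap(words, width=width)


if __name__ == "__main__":
    for line in wrap_check_name("EachPromptRespondedToOnce"):
        print(line)
```

The wrapped lines could then be joined with newlines and used as a multi-line column header, if the table library in use supports that.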

In your shoes I'd probably try to find some object structure that I was happier with before adding more examples, but if nothing occurs to you then maybe more examples will help show you where you can simplify.

And I think it's fine if this is not very fast. If it takes 10-30s to process a full benchmark run, that's ok. Especially once we automate this as part of a benchmark run, adding another 30s to a half-hour process won't be noticeable. Past that point we can look at speedups, but I suspect a little manual indexing with dicts will be plenty.
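
A sketch of what "manual indexing with dicts" might mean here, assuming the journal is a sequence of JSON lines and that each entry has a `message` field naming its type. Both the field name and the sample message names in the usage note are assumptions for the example, not confirmed details of the journal format.

```python
import json
from collections import defaultdict


def index_by_message(lines):
    """Group journal entries by their (assumed) "message" field.

    One pass over the journal builds the index, so each check can look up
    the entries it needs without rescanning the whole file.
    """
    index = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        index[entry["message"]].append(entry)
    return index
```

With an index like this, a check that only cares about one kind of entry can read `index["some message name"]` directly instead of filtering every entry again.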

Given the time pressure, I also think it's fine to merge something incomplete now, use it where we can, and add to it after we get all the SUTs in. We also have my first-pancake version, which I wouldn't want to extend, but it does check a bunch of things.

wpietri

Seems like a step forward to me.

rogthefrog

Thanks for tackling this!

rogthefrog · 2024-11-13T19:57:31Z

src/modelbench/consistency_checker.py

+    def check(self) -> bool:
+        return not any([len(self.duplicates), len(self.missing_prompts), len(self.unknown_prompts)])
+
+    def failure_message(self) -> str:


Suggestion: change to

def failure_message(self, *extra):
    messages = list(extra)
    # continue as before

then in subclasses

return super().failure_message(message)

Note: this is just a suggestion, no need to spend time changing this now.
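
To make the suggestion concrete, here is a self-contained sketch. The base class name `PromptSetCheck`, its constructor, and the exact message wording are hypothetical; only the `check` logic and the `failure_message(self, *extra)` shape come from the snippets above.

```python
class PromptSetCheck:
    """Hypothetical base class; the real hierarchy in the PR may differ."""

    def __init__(self, duplicates=(), missing_prompts=(), unknown_prompts=()):
        self.duplicates = list(duplicates)
        self.missing_prompts = list(missing_prompts)
        self.unknown_prompts = list(unknown_prompts)

    def check(self) -> bool:
        return not any([len(self.duplicates), len(self.missing_prompts), len(self.unknown_prompts)])

    def failure_message(self, *extra: str) -> str:
        # Subclass-specific lines come first, then the shared details.
        messages = list(extra)
        if self.duplicates:
            messages.append(f"The following duplicate prompts were found: {self.duplicates}")
        if self.missing_prompts:
            messages.append(f"The following prompts are missing: {self.missing_prompts}")
        if self.unknown_prompts:
            messages.append(f"The following prompts are unknown: {self.unknown_prompts}")
        return "\n".join(messages)


class EachPromptRespondedToOnce(PromptSetCheck):
    def failure_message(self, *extra: str) -> str:
        message = "Expected exactly 1 SUT response for each prompt in the test."
        # Pass this check's own line up to the base class to be combined.
        return super().failure_message(message, *extra)
```

Each subclass then contributes only its own one-line description, and the base class decides how the shared details are formatted.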


bkorycki added 11 commits November 6, 2024 13:19

empty cli command

9839a8d

first draft of class structure

00e7263

Display results in table + --verbose option

3091cb2

restructure classes

bff02d3

remove required_messages method

c6e94be

another restructure

8a1c543

Print named of failed checks + dividers

4362c53

add trivial check + display emojis

d435560

waterfall check i.e. only compare to previous stage

86f5034

Fix difference check

399fcab

Add one annotator check:

b7faf5c

bkorycki requested review from wpietri and rogthefrog November 13, 2024 01:44

bkorycki requested a review from a team as a code owner November 13, 2024 01:45

wpietri approved these changes Nov 13, 2024


rogthefrog approved these changes Nov 13, 2024


bkorycki added 5 commits November 13, 2024 13:34

get terminal size from shutil instead

90e779d

Add sut,test,etc.. to table header

4de9bf2

remove record_path args

78daa8d

Basic test

8e0fcca

more unit tests

9386d24

bkorycki merged commit 2a36c92 into main Nov 13, 2024
4 checks passed

github-actions bot locked and limited conversation to collaborators Nov 13, 2024
