Refactor the dataset generator #335
Conversation
Amazing! Is #333 completely fixed?
No, not yet; I'm actually debugging it now.
Some suggestions.
assert all(
    are_dataset_dicts_identical(raw, origin)
    for (raw, origin) in zip(raw_dataset_dicts, DATASET_DICTS)
)
I personally think that using are_dataset_dicts_identical is clearer and simpler. 🤔
We do not need to iterate with separate if "val" and if "test" checks.
So you've probably heard of the DRY (don't repeat yourself) principle, and the previous code follows the DRY principle. But I changed it for two reasons:
- In the case that the two dicts were different, the error message from are_dataset_dicts_identical did not indicate which elements of the dataset dict were broken; now it does.
- There is an argument that tests should not be DRY but DAMP (descriptive and meaningful phrases), so that it's very clear to anyone reading the test exactly what it does and why it's failing, in a self-contained way. This change moves a little bit in that direction. (A rough sketch of what that looks like is below.)
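For illustration, a more DAMP-style check could look roughly like this (raw_dataset_dicts and DATASET_DICTS are taken from the snippet above; the exact comparison logic here is an assumption, not necessarily the PR's actual code):

for raw, origin in zip(raw_dataset_dicts, DATASET_DICTS):
    # Compare split names first so a mismatch is reported explicitly.
    assert set(raw) == set(origin), f"split names differ: {set(raw)} vs {set(origin)}"
    for split_name in origin:
        # Compare each split's contents; a failure message names the offending split.
        assert (
            raw[split_name].to_dict() == origin[split_name].to_dict()
        ), f"split {split_name!r} differs"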
Wow! Thanks for your clarification.
Another thing is that I am not sure that all the API-based models use ...
Generally I love this PR: removing a lot of code bloat and lengthy comments, simplifying behavior, and consolidating tests.
I just have a few small questions and suggestions about the changes.
    split: DatasetSplit,
) -> datasets.Dataset:
    """Generate data for a single named split of data.

    Args:
        prompt_spec: A prompt spec (containing a system description).
-       expected_num_examples: Expected number of examples in split.
+       num_examples: Expected number of examples in split.
This was intentionally expected_num_examples, because we could not guarantee that this is the exact number generated (due to the way we did batching and retries).
Yeah, and I made sure the result now actually matches the requested number of examples (unless we hit max_api_calls).
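In other words, the behavior being described is roughly the following (a simplified sketch, not the actual implementation; request_batch is a hypothetical stand-in for a single API call):

import datasets

def generate_split(request_batch, num_examples, max_api_calls=None):
    """Request batches until num_examples examples exist or the call budget runs out."""
    examples = []
    api_calls = 0
    while len(examples) < num_examples:
        if max_api_calls is not None and api_calls >= max_api_calls:
            break  # budget exhausted: may return fewer than num_examples
        examples.extend(request_batch())  # one API call returning a list of dicts
        api_calls += 1
    return datasets.Dataset.from_list(examples[:num_examples])

For example, generate_split(lambda: [{"input_col": "x", "output_col": "y"}], num_examples=5, max_api_calls=3) would return a 3-row dataset, because the call budget is hit before 5 examples are generated.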
if output_dir:
    save_dir = Path(output_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    dataset_dict.save_to_disk(str(save_dir))
Do we want to remove this? Saving the dataset to a file allows it to be reused in the future without re-running dataset generation, which is key for iteration.
Yeah, I agree that there's a tradeoff here. But there's a major issue with the current implementation in that it (a) implicitly saves a cache, and (b) doesn't check whether the cache was created with the same parameters as the current function call. If we're going to cache something, we should check that the parameters are the same and invalidate the cache if they are not.
I'd suggest that we just remove it for now, and if we want to add caching in the future we should:
- Make it explicit (the user needs to call a save_cache function), not implicit.
- Check that the parameters of the called function match those of the cache. Zeno Build has some code that I wrote to do this, so we might be able to use it (or check other libraries). A rough sketch of this kind of explicit, parameter-checked cache is below.
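To illustrate the idea (a sketch under assumed names, not Zeno Build's actual code): key the on-disk location by a hash of the call parameters, so a cache written with different parameters is never silently reused.

import hashlib
import json
from pathlib import Path

import datasets


def _cache_key(**params):
    # Deterministic hash of the call parameters.
    blob = json.dumps(params, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode()).hexdigest()


def save_cache(dataset_dict, cache_dir, **params):
    # Explicit: the caller decides when to write the cache.
    save_dir = Path(cache_dir) / _cache_key(**params)
    save_dir.mkdir(parents=True, exist_ok=True)
    dataset_dict.save_to_disk(str(save_dir))


def load_cache(cache_dir, **params):
    # Returns None unless a cache exists for these exact parameters.
    save_dir = Path(cache_dir) / _cache_key(**params)
    if not save_dir.exists():
        return None
    return datasets.DatasetDict.load_from_disk(str(save_dir))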
You're probably right. I do feel a bit of a loss, though, since I wrote careful tests for saving and loading the cache.
if not isinstance(e, API_ERRORS):
    raise e
With this change we'll no longer be logging the errors where we suppress->retry. Without this change the log will have something like:
ERROR:root:Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
Is this what you were intending?
Sorry, I'm not sure I fully understand. The logging line still exists in the line directly above this?
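For reference, the pattern being discussed is roughly this (a sketch only; API_ERRORS here is a stand-in tuple, and call_with_retries/make_api_call are hypothetical names, not the generator's real API):

import logging

API_ERRORS = (ConnectionError, TimeoutError)  # stand-in for the real API error tuple


def call_with_retries(make_api_call, max_retries=3):
    for _ in range(max_retries):
        try:
            return make_api_call()
        except Exception as e:
            logging.error(e)  # logged before the isinstance check, so retried errors still appear
            if not isinstance(e, API_ERRORS):
                raise e  # non-API errors propagate immediately
    return None  # all retries exhausted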
@@ -233,7 +235,7 @@ def test_dataset_processor_with_numerical_column():
            "<task 1>convert to text2text\nExample:\nfoo\nLabel:\n",
            "<task 1>convert to text2text\nExample:\nbar\nLabel:\n",
        ],
-       "model_output": ["foo", "bar", "0", "1"],
+       "model_output": ["baz", "qux", "0", "1"],
Why is the output changing here? Was the test broken before?
Yes, I believe it was broken but the assert was not catching it properly. I haven't looked deeply into this though.
Excellent point @zhaochenyang20. I was thinking of creating an issue for this, but I think for simplicity we can leave this as-is until someone complains about it. Maybe @saum7800 could keep it in mind when we try the non-API-based models though. It might become even more of a problem due to the limited context windows.
One final suggestion on a comment, but LGTM.
-    context_cutoff: If the total length of the prompt exceeds this value,
-        repeat the prompt generation process to generate a shorter one.
+    context_cutoff: If the total length of the prompt in tokens exceeds this
+        value, repeat prompt generation process to generate a shorter one.
Suggestion to make this grammatical (while still fitting within the character limit):
-        value, repeat prompt generation process to generate a shorter one.
+        value, repeat the prompt generation process to generate a shorter one.
Description
This PR does a major refactoring of the dataset generator in an effort to make the API and code cleaner:
References
Blocked by