
Refactor the dataset generator #335

Merged · neubig merged 8 commits into main on Sep 8, 2023

Conversation

@neubig (Collaborator) commented on Sep 7, 2023

Description

This PR does a major refactoring of the dataset generator in an effort to make the API and code cleaner:

  1. Several places converted data back and forth between lists and Hugging Face datasets; I reduced the number of conversions (see the sketch after this list).
  2. Caches were written out as a side effect of the main generation function; I removed these side effects.
  3. I shortened comments, particularly in tests and inline comments that were redundant with the code itself.
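
A minimal sketch of the pattern in point 1 (illustrative only, not the PR's actual code; column names are placeholders): accumulate generated examples as plain dicts and convert to a Hugging Face Dataset once at the end, rather than round-tripping at every step.

```python
import datasets


def examples_to_dataset(examples: list[dict]) -> datasets.Dataset:
    """Convert a list of generated example dicts to a Dataset in one step."""
    return datasets.Dataset.from_list(examples)


# Placeholder column names, for illustration only.
generated = [{"input_col": "foo", "output_col": "bar"}]
dataset = examples_to_dataset(generated)
```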

References

Blocked by

  • NA

@zhaochenyang20 (Collaborator)

Amazing! Does this fix #333 completely?

@neubig (Collaborator, Author) commented on Sep 8, 2023

No, not yet; I'm actually debugging it now.

@neubig mentioned this pull request on Sep 8, 2023

@zhaochenyang20 (Collaborator) left a comment

Some suggestions.

prompt2model/dataset_generator/base.py (resolved thread)
prompt2model/dataset_generator/prompt_based.py (outdated, resolved thread)
prompt2model/dataset_generator/prompt_based.py (outdated, resolved thread)
Comment on lines -131 to -134
assert all(
    are_dataset_dicts_identical(raw, origin)
    for (raw, origin) in zip(raw_dataset_dicts, DATASET_DICTS)
)
Collaborator

I personally think that using are_dataset_dicts_identical is clearer and simpler. 🤔

Collaborator

We do not need to branch separately on if "val" and if "test".

Collaborator (Author)

You've probably heard of the DRY (don't repeat yourself) principle, and the previous code follows it. I changed it anyway, for two reasons:

  1. When the two dicts differed, the error message from are_dataset_dicts_identical did not indicate which elements of the dataset dict were broken; now it does.
  2. There is an argument that tests should not be DRY but DAMP (descriptive and meaningful phrases), so that anyone reading a test can see exactly what it does and why it fails, in a self-contained way. This change moves a little in that direction (see the sketch after this list).
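
A hedged sketch of what such a DAMP-style check might look like, reusing the names from the removed assert (the exact assertions in the PR may differ):

```python
# Compare each split of each dataset dict explicitly, so a failing assertion
# reports which dataset dict and which split actually differ.
for i, (raw, origin) in enumerate(zip(raw_dataset_dicts, DATASET_DICTS)):
    for split_name in origin:
        assert raw[split_name].to_dict() == origin[split_name].to_dict(), (
            f"dataset dict {i} differs in split {split_name!r}"
        )
```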

Collaborator

Wow! Thanks for your clarification.

@zhaochenyang20 (Collaborator)

Another thing: I am not sure that all the API-based models use tiktoken to compute input tokens, so the number of input tokens may differ slightly between these models. But that's minimal.
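
For illustration only (not the project's actual token-counting code, and the encoding name below is an assumption), counting prompt tokens with tiktoken looks roughly like this. Providers that don't use tiktoken may tokenize differently, so the count is only an estimate for those models.

```python
import tiktoken


def count_input_tokens(prompt: str, encoding_name: str = "cl100k_base") -> int:
    """Approximate the number of input tokens using a tiktoken encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(prompt))


print(count_input_tokens("Generate a question-answer pair about geography."))
```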

@viswavi (Collaborator) left a comment

Generally I love this PR: it removes a lot of code bloat and lengthy comments, simplifies behavior, and consolidates tests.

I just have a few small questions and suggestions about the changes.

     split: DatasetSplit,
 ) -> datasets.Dataset:
     """Generate data for a single named split of data.

     Args:
         prompt_spec: A prompt spec (containing a system description).
-        expected_num_examples: Expected number of examples in split.
+        num_examples: Expected number of examples in split.
Collaborator

This was intentionally expected_num_examples, because we could not guarantee that this is the exact number generated (due to the way we did batching and retries).

Collaborator (Author)

Yeah, and I made sure the result now matches the requested number of examples (unless we hit max_api_calls). The intended behavior is sketched below.
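
A rough, hedged sketch of that stopping behavior (the function wrapper and request_batch are hypothetical, not the generator's real API; the real code batches and retries, but the idea is similar):

```python
from typing import Callable


def generate_examples(
    request_batch: Callable[[], list[dict]],
    num_examples: int,
    max_api_calls: int,
) -> list[dict]:
    """Collect examples until num_examples are gathered or the call budget is hit."""
    examples: list[dict] = []
    api_calls = 0
    while len(examples) < num_examples and api_calls < max_api_calls:
        examples.extend(request_batch())  # one batched API request
        api_calls += 1
    # Exactly num_examples, unless max_api_calls was reached first.
    return examples[:num_examples]
```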

Comment on lines -66 to -69
if output_dir:
    save_dir = Path(output_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    dataset_dict.save_to_disk(str(save_dir))
Collaborator

Do we want to remove this? Saving the dataset to disk allows it to be reused later without re-running dataset generation, which is key for iteration.

@neubig (Collaborator, Author) commented on Sep 8, 2023

Yeah, I agree there's a tradeoff here. But the current implementation has a major issue: it (a) implicitly saves a cache, and (b) doesn't check whether the cache was created with the same parameters as the current function call. If we're going to cache something, we should check that the parameters are the same and invalidate the cache when they are not.

I'd suggest that we just remove it for now. If we want to add caching in the future, we should:

  1. Make it explicit (the user needs to call a save_cache function), not implicit.
  2. Check that the parameters of the called function match those of the cache. Zeno Build has some code that I wrote to do this, so we might be able to use it (or check other libraries). A rough sketch of the idea follows after this list.
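
A hedged sketch of that explicit-cache idea (all names here are hypothetical, not an existing prompt2model or Zeno Build API): the caller saves the cache explicitly along with the parameters that produced it, and loading returns nothing if the parameters do not match.

```python
import json
from pathlib import Path
from typing import Optional

import datasets


def save_cache(dataset_dict: datasets.DatasetDict, cache_dir: Path, params: dict) -> None:
    """Explicitly save a dataset cache together with the parameters that produced it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / "params.json").write_text(json.dumps(params, sort_keys=True))
    dataset_dict.save_to_disk(str(cache_dir / "data"))


def load_cache(cache_dir: Path, params: dict) -> Optional[datasets.DatasetDict]:
    """Return the cached dataset only if it was created with the same parameters."""
    params_file = cache_dir / "params.json"
    if not params_file.exists():
        return None
    if json.loads(params_file.read_text()) != params:
        return None  # parameters changed, so the cache is treated as invalid
    return datasets.DatasetDict.load_from_disk(str(cache_dir / "data"))
```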

Collaborator

You're right. Still, it's a bit of a shame, since I wrote careful tests for saving and loading the cache.

prompt2model/dataset_generator/prompt_based.py (outdated, resolved thread)
prompt2model/dataset_generator/prompt_based.py (outdated, resolved thread)
prompt2model/prompt_parser/instr_parser.py (resolved thread)
Comment on lines +185 to +186
if not isinstance(e, API_ERRORS):
    raise e
Collaborator

With this change we'll no longer be logging the errors where we suppress->retry. Without this change the log will have something like:

ERROR:root:Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

Is this what you were intending?

Collaborator (Author)

Sorry, I'm not sure I fully understand: the logging line still exists on the line directly above this one?
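
For context, a hedged reconstruction of the pattern being discussed (placeholder names; not the exact prompt2model code): the exception is logged on the line directly above the isinstance check, so suppressed API errors are still logged before the retry.

```python
import logging

API_ERRORS = (TimeoutError, ConnectionError)  # stand-in for the real API error types


def generate_batch(prompts: list[str]) -> list[str]:
    """Placeholder for the call that actually hits the completion API."""
    raise TimeoutError("simulated transient API error")


prompts = ["example prompt"]
for _ in range(3):  # simplified retry loop
    try:
        responses = generate_batch(prompts)
        break
    except Exception as e:
        logging.error(e)  # the logging line referred to above
        if not isinstance(e, API_ERRORS):
            raise e
        # API errors fall through here and are retried on the next iteration
```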

@@ -233,7 +235,7 @@ def test_dataset_processor_with_numerical_column():
         "<task 1>convert to text2text\nExample:\nfoo\nLabel:\n",
         "<task 1>convert to text2text\nExample:\nbar\nLabel:\n",
     ],
-    "model_output": ["foo", "bar", "0", "1"],
+    "model_output": ["baz", "qux", "0", "1"],
Collaborator

Why is the output changing here? Was the test broken before?

Collaborator (Author)

Yes, I believe it was broken but the assert was not catching it properly. I haven't looked deeply into this though.

@neubig (Collaborator, Author) commented on Sep 8, 2023

> Another thing: I am not sure that all the API-based models use tiktoken to compute input tokens, so the number of input tokens may differ slightly between these models. But that's minimal.

Excellent point @zhaochenyang20. I was thinking of creating an issue for this, but I think for simplicity we can leave this as-is until someone complains about it. Maybe @saum7800 could keep it in mind when we try the non-API-based models though. It might become even more of a problem due to the limited context windows.

neubig requested a review from viswavi on September 8, 2023 at 12:28
neubig enabled auto-merge (squash) on September 8, 2023 at 12:28
@viswavi (Collaborator) left a comment

One final suggestion on a comment, but LGTM.

-    context_cutoff: If the total length of the prompt exceeds this value,
-        repeat the prompt generation process to generate a shorter one.
+    context_cutoff: If the total length of the prompt in tokens exceeds this
+        value, repeat prompt generation process to generate a shorter one.
Collaborator

Suggestion to make this grammatical (while still fitting within the character limit):

Suggested change
-    value, repeat prompt generation process to generate a shorter one.
+    value, repeat the prompt generation process to generate a shorter one.

neubig merged commit 40aee3f into main on Sep 8, 2023
8 checks passed
neubig deleted the refactor_dataset_generator branch on September 8, 2023 at 12:53
Successfully merging this pull request may close these issues: Inconsistent behaviour in prompt parser.