
Refactor the dataset generator #335

Merged · neubig merged 8 commits into main on Sep 8, 2023

Conversation

@neubig (Collaborator) commented on Sep 7, 2023

Description

This PR does a major refactoring of the dataset generator in an effort to make the API and code cleaner:

  1. Several places converted data back and forth between lists and Hugging Face datasets; I reduced the number of conversions (see the sketch after this list).
  2. Caches were written out as a side effect of the main generation function; I removed these side effects.
  3. I shortened comments, particularly in tests and inline comments that were redundant with the code itself.
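
A minimal sketch of the pattern in point 1 (illustrative only, not the PR's actual code; column names are placeholders): accumulate generated examples as plain dicts and convert to a Hugging Face Dataset once at the end, rather than round-tripping at every step.

```python
import datasets


def examples_to_dataset(examples: list[dict]) -> datasets.Dataset:
    """Convert a list of generated example dicts to a Dataset in one step."""
    return datasets.Dataset.from_list(examples)


# Placeholder column names, for illustration only.
generated = [{"input_col": "foo", "output_col": "bar"}]
dataset = examples_to_dataset(generated)
```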

References

Blocked by

  • NA

@zhaochenyang20 (Collaborator)

Amazing! Does this fix #333 completely?

@neubig (Collaborator, Author) commented on Sep 8, 2023

No, not yet; I'm actually debugging it now.

@neubig mentioned this pull request on Sep 8, 2023

@zhaochenyang20 (Collaborator) left a comment

Some suggestions.

prompt2model/dataset_generator/base.py (resolved thread)
prompt2model/dataset_generator/prompt_based.py (outdated, resolved thread)
prompt2model/dataset_generator/prompt_based.py (outdated, resolved thread)
Comment on lines -131 to -134
assert all(
    are_dataset_dicts_identical(raw, origin)
    for (raw, origin) in zip(raw_dataset_dicts, DATASET_DICTS)
)
Collaborator

I personally think that using are_dataset_dicts_identical is clearer and simpler. 🤔

Collaborator

We do not need to branch separately on if "val" and if "test".

Collaborator (Author)

You've probably heard of the DRY (don't repeat yourself) principle, and the previous code follows it. I changed it anyway, for two reasons:

  1. When the two dicts differed, the error message from are_dataset_dicts_identical did not indicate which elements of the dataset dict were broken; now it does.
  2. There is an argument that tests should not be DRY but DAMP (descriptive and meaningful phrases), so that anyone reading a test can see exactly what it does and why it fails, in a self-contained way. This change moves a little in that direction (see the sketch after this list).
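
A hedged sketch of what such a DAMP-style check might look like, reusing the names from the removed assert (the exact assertions in the PR may differ):

```python
# Compare each split of each dataset dict explicitly, so a failing assertion
# reports which dataset dict and which split actually differ.
for i, (raw, origin) in enumerate(zip(raw_dataset_dicts, DATASET_DICTS)):
    for split_name in origin:
        assert raw[split_name].to_dict() == origin[split_name].to_dict(), (
            f"dataset dict {i} differs in split {split_name!r}"
        )
```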

Collaborator

Wow! Thanks for your clarification.

@zhaochenyang20 (Collaborator)

Another thing: I am not sure that all the API-based models use tiktoken to compute input tokens, so the number of input tokens may differ slightly between these models. But that's minimal.
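
For illustration only (not the project's actual token-counting code, and the encoding name below is an assumption), counting prompt tokens with tiktoken looks roughly like this. Providers that don't use tiktoken may tokenize differently, so the count is only an estimate for those models.

```python
import tiktoken


def count_input_tokens(prompt: str, encoding_name: str = "cl100k_base") -> int:
    """Approximate the number of input tokens using a tiktoken encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(prompt))


print(count_input_tokens("Generate a question-answer pair about geography."))
```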

@viswavi (Collaborator) left a comment

Generally I love this PR: it removes a lot of code bloat and lengthy comments, simplifies behavior, and consolidates tests.

I just have a few small questions and suggestions about the changes.

     split: DatasetSplit,
 ) -> datasets.Dataset:
     """Generate data for a single named split of data.

     Args:
         prompt_spec: A prompt spec (containing a system description).
-        expected_num_examples: Expected number of examples in split.
+        num_examples: Expected number of examples in split.
Collaborator

This was intentionally expected_num_examples, because we could not guarantee that this is the exact number generated (due to the way we did batching and retries).

Collaborator (Author)

Yeah, and I made sure the result now matches the requested number of examples (unless we hit max_api_calls). The intended behavior is sketched below.
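
A rough, hedged sketch of that stopping behavior (the function wrapper and request_batch are hypothetical, not the generator's real API; the real code batches and retries, but the idea is similar):

```python
from typing import Callable


def generate_examples(
    request_batch: Callable[[], list[dict]],
    num_examples: int,
    max_api_calls: int,
) -> list[dict]:
    """Collect examples until num_examples are gathered or the call budget is hit."""
    examples: list[dict] = []
    api_calls = 0
    while len(examples) < num_examples and api_calls < max_api_calls:
        examples.extend(request_batch())  # one batched API request
        api_calls += 1
    # Exactly num_examples, unless max_api_calls was reached first.
    return examples[:num_examples]
```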

Comment on lines -66 to -69
if output_dir:
    save_dir = Path(output_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    dataset_dict.save_to_disk(str(save_dir))
Collaborator

Do we want to remove this? Saving the dataset to disk allows it to be reused later without re-running dataset generation, which is key for iteration.

@neubig (Collaborator, Author) commented on Sep 8, 2023

Yeah, I agree there's a tradeoff here. But the current implementation has a major issue: it (a) implicitly saves a cache, and (b) doesn't check whether the cache was created with the same parameters as the current function call. If we're going to cache something, we should check that the parameters are the same and invalidate the cache when they are not.

I'd suggest that we just remove it for now. If we want to add caching in the future, we should:

  1. Make it explicit (the user needs to call a save_cache function), not implicit.
  2. Check that the parameters of the called function match those of the cache. Zeno Build has some code that I wrote to do this, so we might be able to use it (or check other libraries). A rough sketch of the idea follows after this list.
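
A hedged sketch of that explicit-cache idea (all names here are hypothetical, not an existing prompt2model or Zeno Build API): the caller saves the cache explicitly along with the parameters that produced it, and loading returns nothing if the parameters do not match.

```python
import json
from pathlib import Path
from typing import Optional

import datasets


def save_cache(dataset_dict: datasets.DatasetDict, cache_dir: Path, params: dict) -> None:
    """Explicitly save a dataset cache together with the parameters that produced it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / "params.json").write_text(json.dumps(params, sort_keys=True))
    dataset_dict.save_to_disk(str(cache_dir / "data"))


def load_cache(cache_dir: Path, params: dict) -> Optional[datasets.DatasetDict]:
    """Return the cached dataset only if it was created with the same parameters."""
    params_file = cache_dir / "params.json"
    if not params_file.exists():
        return None
    if json.loads(params_file.read_text()) != params:
        return None  # parameters changed, so the cache is treated as invalid
    return datasets.DatasetDict.load_from_disk(str(cache_dir / "data"))
```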

Collaborator

You're right. Still, it's a bit of a shame, since I wrote careful tests for saving and loading the cache.

prompt2model/dataset_generator/prompt_based.py (outdated, resolved thread)
prompt2model/dataset_generator/prompt_based.py (outdated, resolved thread)
prompt2model/prompt_parser/instr_parser.py (resolved thread)
Comment on lines +185 to +186
if not isinstance(e, API_ERRORS):
    raise e
Collaborator

With this change we'll no longer be logging the errors where we suppress->retry. Without this change the log will have something like:

ERROR:root:Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

Is this what you were intending?

Collaborator (Author)

Sorry, I'm not sure I fully understand: the logging line still exists on the line directly above this one?
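
For context, a hedged reconstruction of the pattern being discussed (placeholder names; not the exact prompt2model code): the exception is logged on the line directly above the isinstance check, so suppressed API errors are still logged before the retry.

```python
import logging

API_ERRORS = (TimeoutError, ConnectionError)  # stand-in for the real API error types


def generate_batch(prompts: list[str]) -> list[str]:
    """Placeholder for the call that actually hits the completion API."""
    raise TimeoutError("simulated transient API error")


prompts = ["example prompt"]
for _ in range(3):  # simplified retry loop
    try:
        responses = generate_batch(prompts)
        break
    except Exception as e:
        logging.error(e)  # the logging line referred to above
        if not isinstance(e, API_ERRORS):
            raise e
        # API errors fall through here and are retried on the next iteration
```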

@@ -233,7 +235,7 @@ def test_dataset_processor_with_numerical_column():
         "<task 1>convert to text2text\nExample:\nfoo\nLabel:\n",
         "<task 1>convert to text2text\nExample:\nbar\nLabel:\n",
     ],
-    "model_output": ["foo", "bar", "0", "1"],
+    "model_output": ["baz", "qux", "0", "1"],
Collaborator

Why is the output changing here? Was the test broken before?

Collaborator (Author)

Yes, I believe it was broken but the assert was not catching it properly. I haven't looked deeply into this though.

@neubig (Collaborator, Author) commented on Sep 8, 2023

> Another thing: I am not sure that all the API-based models use tiktoken to compute input tokens, so the number of input tokens may differ slightly between these models. But that's minimal.

Excellent point @zhaochenyang20. I was thinking of creating an issue for this, but I think for simplicity we can leave this as-is until someone complains about it. Maybe @saum7800 could keep it in mind when we try the non-API-based models though. It might become even more of a problem due to the limited context windows.

neubig requested a review from viswavi on September 8, 2023 at 12:28
neubig enabled auto-merge (squash) on September 8, 2023 at 12:28
@viswavi (Collaborator) left a comment

One final suggestion on a comment, but LGTM.

-    context_cutoff: If the total length of the prompt exceeds this value,
-        repeat the prompt generation process to generate a shorter one.
+    context_cutoff: If the total length of the prompt in tokens exceeds this
+        value, repeat prompt generation process to generate a shorter one.
Collaborator

Suggestion to make this grammatical (while still fitting within the character limit):

Suggested change
-    value, repeat prompt generation process to generate a shorter one.
+    value, repeat the prompt generation process to generate a shorter one.

neubig merged commit 40aee3f into main on Sep 8, 2023
8 checks passed
neubig deleted the refactor_dataset_generator branch on September 8, 2023 at 12:53
Successfully merging this pull request may close these issues: Inconsistent behaviour in prompt parser.