Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

language prefixes #20

Merged
merged 3 commits into from
Oct 9, 2023
Merged

language prefixes #20

merged 3 commits into from
Oct 9, 2023

Conversation

Waino
Copy link
Collaborator

@Waino Waino commented Sep 25, 2023

  • Make the prefix transform configurable via config-config
  • Bugfixes to the prefix transform during training
  • Bugfixes to the prefix transform during translation
  • Avoid dying if a rare bug occurs in the look ahead bucketing

@Waino Waino force-pushed the feat/config-config-lang-prefix branch from 9d6fb12 to 0b69de6 Compare October 2, 2023 08:44
@Waino Waino changed the title WIP: language prefixes language prefixes Oct 2, 2023
Copy link
Collaborator

@TimotheeMickus TimotheeMickus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from what I understand: apparently some tests aren't discovered with unittest, so we should switch the workflow to pytest. I think this workflow change can be added in this PR for simplicity (unless you want to create a separate PR to fix, merge that one first and come to this one to have it merged.

The rest looks fine, aside from one change line 165 in dataloader.py.

current_bucket_idx = next(
bucket_idx
for bucket_idx in range(current_bucket_idx, len(self._buckets) + 1)
for bucket_idx in range(len(self._lens))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't that pick the smallest-length bucket?

Copy link
Collaborator Author

@Waino Waino Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this was some change we made while debugging, that I've accidentally not removed from the commit.

I don't remember intending to make any changes beyond the try/except.

Actually, this is the "Doing something stupid instead". It indeed picks the smallest-length populated bucket.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert this line, but it's not major (PR #23 is going to nuke this anyways)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff pairs up the lines in a very unintuitive way. This is easier to read by looking at just the new code:

'Doing something stupid instead. Please check me.'

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙃 well I'll be damned. LGTM then, please wait for @onadegibert 's review

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a better idea for selecting a guaranteed-to-be-nonempty bucket after the two loops above have failed?

This should be dead code, but if you recall, we did manage to trigger the StopIteration when using pathological dummy data. Unless you are sure that you found and fixed that bug, we should have the try/except with some kind of fallback for selecting a bucket. It does not need to be a smart choice.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #20 is going to change the behavior pretty significantly here.

  • On the one hand, the _spiralling() pattern should visit all buckets starting from the one we picked for the batch.
  • On the other hand, we're looking to support genuine cases of empty LAB's with no available batch (when wrapping around the dataset).

So the try catch is reasonable enough for now.

Waino and others added 3 commits October 2, 2023 17:16
Config-config functionality.
StopIteration when trying to pick a bucket in a smart way.
Doing something stupid instead. Please check me.
- Retrieve prefixes in different way during training and translation
- Pass task to translation dataset
- Allow target to be None

Fix failing unit tests

- Prefix transform requires is_train to be correctly set.
- Always raising ValueError on errors, although it is not completely appropriate.
@Waino Waino force-pushed the feat/config-config-lang-prefix branch from 0b69de6 to 9310c4d Compare October 2, 2023 14:22
Copy link
Collaborator

@onadegibert onadegibert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me too :)

@Waino Waino merged commit 165038e into main Oct 9, 2023
@Waino Waino deleted the feat/config-config-lang-prefix branch October 9, 2023 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants