language prefixes #20
Conversation
Waino commented on Sep 25, 2023:
- Make the prefix transform configurable via config-config
- Bugfixes to the prefix transform during training
- Bugfixes to the prefix transform during translation
- Avoid dying if a rare bug occurs in the look ahead bucketing
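To make the "prefix transform" concrete, here is a minimal sketch of the idea: prepend a target-language token to the source sentence so the model knows which language to translate into. The token format and function name are illustrative assumptions, not Mammoth's actual API.

```python
# Hypothetical sketch of a language-prefix transform. The '<to_xx>'
# token format and the function name are assumptions for illustration;
# they are not taken from the Mammoth codebase.
def apply_prefix(src_tokens, tgt_lang):
    """Prepend a target-language prefix token, e.g. '<to_fi>'."""
    return [f"<to_{tgt_lang}>"] + list(src_tokens)

example = apply_prefix(["hello", "world"], "fi")
```

Making such a transform "configurable via config-config" would then amount to reading the prefix token per task from the generated configuration rather than hard-coding it.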
Force-pushed from 9d6fb12 to 0b69de6.
From what I understand, some tests aren't discovered with unittest, so we should switch the workflow to pytest. I think this workflow change can be added in this PR for simplicity (unless you want to create a separate PR for the fix, merge that one first, and then come back to have this one merged).
The rest looks fine, aside from one change at line 165 in dataloader.py.
current_bucket_idx = next(
    bucket_idx
-   for bucket_idx in range(current_bucket_idx, len(self._buckets) + 1)
+   for bucket_idx in range(len(self._lens))
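To see why this line was flagged, here is a toy illustration of what the two generator expressions select. The data and the populated-bucket condition are made up for the example; only the `range()` arguments mirror the diff.

```python
# Toy illustration of the two generator expressions in the diff above.
# `_lens[i]` stands in for "number of examples waiting in bucket i";
# the data and the `> 0` condition are invented for this sketch.
_lens = [0, 0, 3, 0, 5]  # only buckets 2 and 4 are populated

# New code: scans from index 0, so it always returns the first
# (smallest-length) populated bucket, regardless of where we were.
smallest = next(i for i in range(len(_lens)) if _lens[i] > 0)

# Old code: scans onward from the current position instead.
current_bucket_idx = 3
onward = next(
    i for i in range(current_bucket_idx, len(_lens)) if _lens[i] > 0
)
```

With this data, `smallest` is bucket 2 while `onward` is bucket 4, which is the behavioral difference discussed below.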
doesn't that pick the smallest-length bucket?
I wonder if this was some change we made while debugging that I accidentally failed to remove from the commit.
I don't remember intending to make any changes beyond the try/except.
Actually, this is the "Doing something stupid instead". It indeed picks the smallest-length populated bucket.
revert this line, but it's not major (PR #23 is going to nuke this anyways)
The diff pairs up the lines in a very unintuitive way. This is easier to read by looking at just the new code:
mammoth/mammoth/inputters/dataloader.py, line 161 in 0b69de6:
'Doing something stupid instead. Please check me.'
🙃 well I'll be damned. LGTM then, please wait for @onadegibert 's review
Do you have a better idea for selecting a guaranteed-to-be-nonempty bucket after the two loops above have failed?
This should be dead code, but if you recall, we did manage to trigger the StopIteration when using pathological dummy data. Unless you are sure that you found and fixed that bug, we should have the try/except with some kind of fallback for selecting a bucket. It does not need to be a smart choice.
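The shape of the fallback being proposed can be sketched as follows. All names and the populated-bucket condition are illustrative assumptions, not the actual dataloader code.

```python
# Hedged sketch of the try/except fallback under discussion: attempt
# the "smart" bucket choice first, and on StopIteration fall back to
# any populated bucket. Names are illustrative, not Mammoth's.
def pick_bucket(lens, current_idx):
    try:
        # Smart choice: a populated bucket at or after current_idx.
        return next(i for i in range(current_idx, len(lens)) if lens[i] > 0)
    except StopIteration:
        # Dumb but guaranteed fallback: the first populated bucket
        # anywhere. Raises only if every bucket is truly empty.
        return next(i for i in range(len(lens)) if lens[i] > 0)
```

As the comment says, the fallback does not need to be a smart choice; it only needs to terminate with some populated bucket.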
PR #20 is going to change the behavior pretty significantly here.
- On the one hand, the _spiralling() pattern should visit all buckets starting from the one we picked for the batch.
- On the other hand, we're looking to support genuine cases of empty LABs with no available batch (when wrapping around the dataset).
So the try/except is reasonable enough for now.
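A visiting order with the property described above (start from the chosen bucket, eventually reach every bucket) can be sketched like this. This is a guess at the general pattern, not the actual _spiralling() implementation:

```python
# Sketch of a "spiralling" visiting order: start from a chosen bucket
# and fan outward in both directions, so every index in [0, n) is
# eventually visited. A guess at the pattern, not mammoth's _spiralling().
def spiralling(start, n):
    """Yield start, start+1, start-1, start+2, start-2, ... within [0, n)."""
    yield start
    for offset in range(1, n):
        if start + offset < n:
            yield start + offset
        if start - offset >= 0:
            yield start - offset
```

Under such a scheme, a StopIteration fallback becomes unnecessary for bucket selection, since the iteration itself guarantees every bucket is tried.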
Commits:
- Config-config functionality.
- StopIteration when trying to pick a bucket in a smart way. Doing something stupid instead. Please check me.
- Retrieve prefixes in a different way during training and translation. Pass task to the translation dataset. Allow target to be None.
- Fix failing unit tests. Prefix transform requires is_train to be correctly set. Always raising ValueError on errors, although it is not completely appropriate.
Force-pushed from 0b69de6 to 9310c4d.
Looks good to me too :)