language prefixes #20
Conversation
Waino commented on Sep 25, 2023:
- Make the prefix transform configurable via config-config
- Bugfixes to the prefix transform during training
- Bugfixes to the prefix transform during translation
- Avoid dying if a rare bug occurs in the look ahead bucketing
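To make the "prefix transform" concrete, here is a minimal sketch of the idea: prepend a target-language token to the source sentence so the model knows which language to translate into. The token format and function name are illustrative assumptions, not Mammoth's actual API.

```python
# Hypothetical sketch of a language-prefix transform. The '<to_xx>'
# token format and the function name are assumptions for illustration;
# they are not taken from the Mammoth codebase.
def apply_prefix(src_tokens, tgt_lang):
    """Prepend a target-language prefix token, e.g. '<to_fi>'."""
    return [f"<to_{tgt_lang}>"] + list(src_tokens)

example = apply_prefix(["hello", "world"], "fi")
```

Making such a transform "configurable via config-config" would then amount to reading the prefix token per task from the generated configuration rather than hard-coding it.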
Force-pushed from 9d6fb12 to 0b69de6.
From what I understand, some tests aren't discovered with unittest, so we should switch the workflow to pytest. I think this workflow change can be added in this PR for simplicity (unless you want to create a separate PR for the fix, merge that one first, and then come back to have this one merged).
The rest looks fine, aside from one change at line 165 in dataloader.py.
current_bucket_idx = next(
    bucket_idx
-   for bucket_idx in range(current_bucket_idx, len(self._buckets) + 1)
+   for bucket_idx in range(len(self._lens))
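To see why this line was flagged, here is a toy illustration of what the two generator expressions select. The data and the populated-bucket condition are made up for the example; only the `range()` arguments mirror the diff.

```python
# Toy illustration of the two generator expressions in the diff above.
# `_lens[i]` stands in for "number of examples waiting in bucket i";
# the data and the `> 0` condition are invented for this sketch.
_lens = [0, 0, 3, 0, 5]  # only buckets 2 and 4 are populated

# New code: scans from index 0, so it always returns the first
# (smallest-length) populated bucket, regardless of where we were.
smallest = next(i for i in range(len(_lens)) if _lens[i] > 0)

# Old code: scans onward from the current position instead.
current_bucket_idx = 3
onward = next(
    i for i in range(current_bucket_idx, len(_lens)) if _lens[i] > 0
)
```

With this data, `smallest` is bucket 2 while `onward` is bucket 4, which is the behavioral difference discussed below.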
doesn't that pick the smallest-length bucket?
I wonder if this was some change we made while debugging that I accidentally failed to remove from the commit.
I don't remember intending to make any changes beyond the try/except.
Actually, this is the "Doing something stupid instead". It indeed picks the smallest-length populated bucket.
revert this line, but it's not major (PR #23 is going to nuke this anyways)
The diff pairs up the lines in a very unintuitive way. This is easier to read by looking at just the new code:
mammoth/mammoth/inputters/dataloader.py, line 161 in 0b69de6:
'Doing something stupid instead. Please check me.'
🙃 well I'll be damned. LGTM then, please wait for @onadegibert 's review
Do you have a better idea for selecting a guaranteed-to-be-nonempty bucket after the two loops above have failed?
This should be dead code, but if you recall, we did manage to trigger the StopIteration when using pathological dummy data. Unless you are sure that you found and fixed that bug, we should have the try/except with some kind of fallback for selecting a bucket. It does not need to be a smart choice.
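The shape of the fallback being proposed can be sketched as follows. All names and the populated-bucket condition are illustrative assumptions, not the actual dataloader code.

```python
# Hedged sketch of the try/except fallback under discussion: attempt
# the "smart" bucket choice first, and on StopIteration fall back to
# any populated bucket. Names are illustrative, not Mammoth's.
def pick_bucket(lens, current_idx):
    try:
        # Smart choice: a populated bucket at or after current_idx.
        return next(i for i in range(current_idx, len(lens)) if lens[i] > 0)
    except StopIteration:
        # Dumb but guaranteed fallback: the first populated bucket
        # anywhere. Raises only if every bucket is truly empty.
        return next(i for i in range(len(lens)) if lens[i] > 0)
```

As the comment says, the fallback does not need to be a smart choice; it only needs to terminate with some populated bucket.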
PR #20 is going to change the behavior pretty significantly here.
- On the one hand, the _spiralling() pattern should visit all buckets starting from the one we picked for the batch.
- On the other hand, we're looking to support genuine cases of empty LABs with no available batch (when wrapping around the dataset).
So the try/except is reasonable enough for now.
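A visiting order with the property described above (start from the chosen bucket, eventually reach every bucket) can be sketched like this. This is a guess at the general pattern, not the actual _spiralling() implementation:

```python
# Sketch of a "spiralling" visiting order: start from a chosen bucket
# and fan outward in both directions, so every index in [0, n) is
# eventually visited. A guess at the pattern, not mammoth's _spiralling().
def spiralling(start, n):
    """Yield start, start+1, start-1, start+2, start-2, ... within [0, n)."""
    yield start
    for offset in range(1, n):
        if start + offset < n:
            yield start + offset
        if start - offset >= 0:
            yield start - offset
```

Under such a scheme, a StopIteration fallback becomes unnecessary for bucket selection, since the iteration itself guarantees every bucket is tried.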
Commits:
- Config-config functionality.
- StopIteration when trying to pick a bucket in a smart way. Doing something stupid instead. Please check me.
- Retrieve prefixes in a different way during training and translation. Pass task to the translation dataset. Allow target to be None.
- Fix failing unit tests. Prefix transform requires is_train to be correctly set. Always raising ValueError on errors, although it is not completely appropriate.
Force-pushed from 0b69de6 to 9310c4d.
Looks good to me too :)