
Feats/bucket lord #23

Merged
TimotheeMickus merged 13 commits into main from feats/bucket-lord on Oct 23, 2023
Conversation

TimotheeMickus
Collaborator

closes #12

@TimotheeMickus TimotheeMickus self-assigned this Sep 27, 2023
@TimotheeMickus
Collaborator Author

It's implemented, it runs, and it seems to be about as efficient as the previous solution, but I don't know whether it's an actual improvement in terms of padding efficiency.

The buckets are likely very sparsely populated (with n_buckets² buckets total, that's a lot), so maybe we want fewer buckets and binning in comparable groups, rather than exact length matches (maybe we want n_buckets ** ½ or something).

I'll see if I can come up with stats in terms of how much padding we observe with this solution.
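
For concreteness, here is a rough sketch of the exact 2D bucketing idea under discussion (illustrative only, not the code in this PR; the helper name and the n_buckets default are made up):

```python
# Illustrative only: exact 2D bucketing keyed on (src_len, tgt_len).
# With lengths capped at n_buckets, this yields up to n_buckets ** 2 distinct
# buckets, most of which will stay nearly empty on real data.
from collections import defaultdict

def bucket_key(src_len, tgt_len, n_buckets=128):
    # clamp anything longer than n_buckets into the last row/column
    return (min(src_len, n_buckets - 1), min(tgt_len, n_buckets - 1))

buckets = defaultdict(list)
for src, tgt in [("a b c", "x y"), ("a b", "x y z w")]:  # toy examples
    buckets[bucket_key(len(src.split()), len(tgt.split()))].append((src, tgt))
```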

@TimotheeMickus TimotheeMickus marked this pull request as ready for review October 1, 2023 11:05
Collaborator

@Waino Waino left a comment


Check all uses of self._lens.

mammoth/inputters/dataloader.py (review thread, outdated, resolved)
mammoth/inputters/dataloader.py (review thread, outdated, resolved)
@Waino
Collaborator

Waino commented Oct 2, 2023

The buckets are likely very sparsely populated (with n_buckets² buckets total, that's a lot), so maybe we want fewer buckets and binning in comparable groups, rather than exact length matches (maybe we want n_buckets ** ½ or something).

Indeed. I think that the original 1-d buckets were probably already unnecessarily sparse due to being based on exact rather than quantized lengths. There is a tradeoff between minimizing the amount of padding and minimizing the amount of correlation between examples in a batch. Being extremely aggressive with minimizing padding may not be optimal for training, if it results in a very unshuffled data order (and large memory use).

I think that a variable-granularity binning, with fine-grained bins near the mode of the lengths and coarse ones outside it, would be good:

  1. the amount of extra padding is minimized for the very common batches near the mode of the distribution. These have a lot of data to choose from, so waiting for the batch to fill up will not lead to it always containing the exact same sentences.
  2. the outliers are rare enough to not matter, so no sense binning them very exactly.

@TimotheeMickus
Collaborator Author

TimotheeMickus commented Oct 2, 2023

The buckets are likely very sparsely populated (with n_buckets² buckets total, that's a lot), so maybe we want fewer buckets and binning in comparable groups, rather than exact length matches (maybe we want n_buckets ** ½ or something).

Indeed. I think that the original 1-d buckets were probably already unnecessarily sparse due to being based on exact rather than quantized lengths. There is a tradeoff between minimizing the amount of padding and minimizing the amount of correlation between examples in a batch. Being extremely aggressive with minimizing padding may not be optimal for training, if it results in a very unshuffled data order (and large memory use).

Memory use isn't a concern (there's still only pool_size items considered at any point). I think the system at least gives reasonable guarantees for shuffling near the mode of the length distribution. So the problem is likely more specifically the tail of the distribution.

I think that a variable-granularity binning, with fine-grained bins near the mode of the lengths and coarse ones outside it, would be good:

1. the amount of extra padding is minimized for the very common batches near the mode of the distribution. These have a lot of data to choose from, so waiting for the batch to fill up will not lead to it always containing the exact same sentences.

2. the outliers are rare enough to not matter, so no sense binning them very exactly.

That's not easy to pre-compute, and requires different parameters per language or per task. It's not impossible, but it's going to be very involved. My concrete approach to that would be to try to automate that in the config-config step while reading the corpora lengths, e.g. with a process similar to the following:

  1. fit some random variable to the observed length distribution on a subset of the data
  2. compute equiprobable bins through the cdf
  3. use bins from 2. to define lists of bucket sizes,
  4. store in a new task key

We could obviously skip 1 and 2 and rely on hard observed data if we compute the full stats of each corpus. Note that in any case, this does involve configuring bucket sizes in the task.
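
A minimal sketch of what steps 2 and 3 could look like when computed from observed lengths directly (assuming NumPy; the function name, the bin count, and the Poisson stand-in data are illustrative):

```python
import numpy as np

def equiprobable_bin_edges(lengths, n_bins):
    # quantile-based edges: each bin receives roughly the same number of examples
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    return np.unique(np.quantile(lengths, qs).astype(int))

# stand-in for the lengths observed on a subset of one corpus
observed = np.random.poisson(lam=20, size=10_000)
edges = equiprobable_bin_edges(observed, n_bins=8)  # could be stored under a new task key
```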

If, on the other hand, we only care about ensuring that tail items are lumped into a single bucket, then I would consider just defining an outlier bucket by cutting off anything below an expected max length (say 128 or less), which can already be done by setting a lower n_bucket value.

In the present 2D case, we could also rely on length ratios to define how many of the offset diagonals we tolerate before sending an example to the outlier bucket.
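
To illustrate (a sketch, assuming "offset diagonals" means the absolute difference between source and target lengths; both thresholds are made up):

```python
def is_outlier(src_len, tgt_len, max_len=128, max_offset=10):
    # outlier if either side exceeds the expected max length, or if the pair
    # lies too far off the main diagonal (very unbalanced length ratio)
    return src_len > max_len or tgt_len > max_len or abs(src_len - tgt_len) > max_offset
```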

There is a tradeoff between minimizing the amount of padding and minimizing the amount of correlation between examples in a batch.

If we're stressed out about wrapping around the full corpus and constructing batches with very similar examples (which is justifiable for LR langs in my opinion), then we can

  1. remove the infinite cycling from the dataset,
  2. use a flag to mark when the example stream has been exhausted
  3. restart the iteration over the example stream when we've hit self.is_empty()

@Waino
Collaborator

Waino commented Oct 2, 2023

Memory use isn't a concern (there's still only pool_size items considered at any point). I think the system at least gives reasonable guarantees for shuffling near the mode of the length distribution. So the problem is likely more specifically the tail of the distribution.

In order for the rare buckets out there in the tail of the distribution to ever see enough data to be filled, pool_size needs to be increased to compensate. This would have a memory impact.

If pool_size is not increased, then these outlier buckets will just gather dust and never be used?

That's not easy to pre-compute

It doesn't need to be exact. A warmup step that reads the first 1k--10k lines of the file and estimates the mode from that would probably be good enough.

and requires different parameters per language or per task.

Yes, this would need to happen per task. Aggregating over languages would be tricky and not worth it.

try to automate that in the config-config step while reading the corpora lengths

We could compute this during config-config instead of at the beginning of training.
However, this makes config-config kind of mandatory to use, unless you want to estimate the parameter(s) yourself or hazard a guess. If there is only one parameter per task (the mode of the length), then this is still doable. For a full list of bucket sizes, it is a lot to ask.

  1. fit some random variable to the observed length distribution on a subset of the data
  2. compute equiprobable bins through the cdf
  3. use bins from 2. to define lists of bucket sizes,
  4. store in a new task key

Yes, we could do it like this. This would be theoretically very elegant: the bins would be approximately equiprobable.

I'd just go with a rough approach where there are a few small (maybe size 1) bins around the mode and everything else gets chucked in a collector bin (or a few of those, if we want a little bit more granularity). This would not be very principled, but can be defined based on just one parameter.
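
A minimal sketch of that rough approach, assuming the only per-task parameter is an estimate of the mode (the fine radius and the bin layout are illustrative):

```python
def assign_bin(length, mode, fine_radius=8):
    # size-1 bins within fine_radius of the mode; everything else goes into
    # one of two coarse collector bins (shorter / longer than the fine region)
    lo, hi = mode - fine_radius, mode + fine_radius
    if length < lo:
        return 0                      # coarse "short" collector bin
    if length > hi:
        return (hi - lo) + 2          # coarse "long" collector bin
    return 1 + (length - lo)          # fine bins around the mode
```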

We could obviously skip 1 and 2 and rely on hard observed data if we compute the full stats of each corpus.

It would be overkill to use the whole corpus to compute this.

If, on the other hand, we only care about ensuring that tail items are lumped into a single bucket, then I would consider just defining an outlier bucket by cutting off anything below an expected max length (say 128 or less), which can already be done by setting a lower n_bucket value.

I guess this also does the trick. It is the longer side that has the long tail; on the shorter side, there is a minimum value that is quickly reached.

In the present 2D case, we could also rely on length ratios to define how many of the offset diagonals we tolerate before sending an example to the outlier bucket.

Not sure I follow.

There is a tradeoff between minimizing the amount of padding and minimizing the amount of correlation between examples in a batch.

If we're stressed out about wrapping around the full corpus and constructing batches with very similar examples (which is justifiable for LR langs in my opinion), then we can

  1. remove the infinite cycling from the dataset,
  2. use a flag to mark when the example stream has been exhausted
  3. restart the iteration over the example stream when we've hit self.is_empty()

Wouldn't this just throw away the outliers entirely, instead of collecting them to (rare) outlier batches with (potentially several copies of) just the outliers?

@TimotheeMickus
Collaborator Author

In order for the rare buckets out there in the tail of the distribution to ever see enough data to be filled, pool_size needs to be increased to compensate. This would have a memory impact.

If pool_size is not increased, then these outlier buckets will just gather dust and never be used?

In the 1D impl currently on main, anything with a length above n_buckets gets chucked into the last bucket, so self._buckets[-1] is already an outlier bucket.

It doesn't need to be exact. A warmup step that reads the first 1k--10k lines of the file and estimates the mode from that would probably be good enough.

[...] this makes config-config kind of mandatory to use, unless you want to estimate the parameter(s) yourself or hazard a guess.

Reading 10k more lines would slow down the initialization. For simplicity, I'd consider using pool_size items for the warmup, and offloading bucket definition to the init function of the LAB. If we also offload bucket definition to the LAB itself, then we don't need to overcomplicate the config either.

If so, I would suggest implementing something like:

  1. start by computing exact length-quantized bucket arrays
  2. define a merge operation of buckets b1, b2 so that they share a ref to the same list
  3. find the two most underused adjacent buckets and merge them
  4. repeat until we get to the target number of buckets
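
A hypothetical sketch of steps 2 and 3 (the shared-reference merge); the class and method names are made up, not the PR's code:

```python
class MergeableBuckets:
    def __init__(self, n_buckets):
        self._buckets = {i: [] for i in range(n_buckets)}

    def merge(self, b1, b2):
        # make both keys point at the same list, so later appends through
        # either key land in the same bucket
        if self._buckets[b1] is self._buckets[b2]:
            return  # already merged
        self._buckets[b1].extend(self._buckets[b2])
        self._buckets[b2] = self._buckets[b1]

    def merge_most_underused_adjacent(self):
        keys = sorted(self._buckets)
        # adjacent pair with the smallest combined population, ignoring pairs
        # that already share a list
        candidates = [
            (len(self._buckets[a]) + len(self._buckets[b]), a, b)
            for a, b in zip(keys, keys[1:])
            if self._buckets[a] is not self._buckets[b]
        ]
        if candidates:
            _, a, b = min(candidates)
            self.merge(a, b)
```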

cons:

  • It's a perf hit on startup
  • It's a mess to save the structure when resuming training (you also need the merge info)
  • kind of a mess to do weighted sampling (we need to drop duplicate refs first, or sample from the merge info)
  • pretty opaque for outsider devs

pros:

  • makes config-config optional again
  • simplifies the config definition to something human-readable
  • easy to adapt the current behavior (the shared bucket ref does the rerouting)

If, on the other hand, we only care about ensuring that tail items are lumped into a single bucket, then I would consider just defining an outlier bucket by cutting off anything below an expected max length (say 128 or less), which can already be done by setting a lower n_bucket value.

I guess this also does the trick. It is the longer side that has the long tail; on the shorter side, there is a minimum value that is quickly reached.

This is the simpler option, yes...

In the present 2D case, we could also rely on length ratios to define how many of the offset diagonals we tolerate before sending an example to the outlier bucket.

Not sure I follow.

In the current version, we define buckets for (src_len, tgt_len) tuples. What happens if the target has an outlier length but source is in the mode?

There is a tradeoff between minimizing the amount of padding and minimizing the amount of correlation between examples in a batch.

If we're stressed out about wrapping around the full corpus and constructing batches with very similar examples (which is justifiable for LR langs in my opinion), then we can

  1. remove the infinite cycling from the dataset,
  2. use a flag to mark when the example stream has been exhausted
  3. restart the iteration over the example stream when we've hit self.is_empty()

Wouldn't this just throw away the outliers entirely, instead of collecting them to (rare) outlier batches with (potentially several copies of) just the outliers?

no, you'd keep creating batches until self._is_empty(), yield those final batches, and then you'd restart the dataset iteration. Nothing lost, except the last batch might be smaller than expected.
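
A minimal sketch of that flow (hypothetical pool API, not the actual LAB internals):

```python
def batches(example_stream_factory, pool, batch_size):
    while True:                                    # restart the stream once drained
        for example in example_stream_factory():   # one finite pass over the data
            pool.add(example)
            if pool.is_full():
                yield pool.pop_batch(batch_size)
        while not pool.is_empty():                 # stream exhausted: drain the pool;
            yield pool.pop_batch(batch_size)       # the last batch may be smaller
```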

@Waino
Collaborator

Waino commented Oct 2, 2023

can already be done by setting a lower n_bucket value

Ok. After clearing up some of my misunderstandings, I guess we could just lower the n_bucket value, so that the "high granularity" area contains a reasonable number of buckets when we go to 2D. Then everything else would either go into a single outlier bucket, or we could have e.g. 3: {source|target|both} exceeded the max length.
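
For illustration, a sketch of the three-outlier-bucket variant (the names and the max_len parameter are made up):

```python
def outlier_bucket(src_len, tgt_len, max_len):
    src_over, tgt_over = src_len > max_len, tgt_len > max_len
    if src_over and tgt_over:
        return "both_exceeded"
    if src_over:
        return "source_exceeded"
    if tgt_over:
        return "target_exceeded"
    return None  # not an outlier: use the regular quantized bucket
```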

no, you'd keep creating batches until self._is_empty(), yield those final batches, and then you'd restart the dataset iteration. Nothing lost, except the last batch might be smaller than expected.

This also sounds great.

Pick whichever solution you prefer.

@TimotheeMickus
Collaborator Author

Ok. After clearing up some of my misunderstandings, I guess we could just lower the n_bucket value, so that the "high granularity" area contains a reasonable number of buckets when we go to 2D.

That was the idea, yes

Then everything else would either go into a single outlier bucket, or we could have e.g. 3: {source|target|both} exceeded the max length.

I guess having every type of outlier go into a single bucket would make sense, but I don't really know how to make this work with the spiralling bucket search pattern. Let me think on it.

Pick whichever solution you prefer.

They're not mutually exclusive; we can implement both.

And we're dropping the over-engineered solution for equiprobable bins, correct?
Do we want that as an rc2 feature, along with issues #8, #17, #24 and #25?

@TimotheeMickus TimotheeMickus mentioned this pull request Oct 2, 2023
@TimotheeMickus TimotheeMickus marked this pull request as draft October 2, 2023 18:46
@TimotheeMickus
Collaborator Author

Some further comments:

  • there was a bug in the spiralling pattern. This should be fixed, but I definitely need to write unit tests as well
  • there is a possibility that the re-initialization of the example stream and the refurbishing of the buckets end up taking quite some time. Hopefully the comm will be resilient enough if the data pipe ends up too slow; otherwise, we might want to fiddle with the batch queue prefetching
  • I ended up removing self._lens and just calling len(bucket) as appropriate instead, to avoid any discrepancy

I'm also strongly considering using this PR to give a somewhat less nondescript name to the DynamicDatasetIterator, which neither defines the dynamic iteration process nor directly iterates over datasets (a multiplexer of streams of batches from tasks? TaskBatchMux makes for a very fun tongue-twister, at the very least).

@TimotheeMickus TimotheeMickus marked this pull request as ready for review October 16, 2023 09:02
@TimotheeMickus TimotheeMickus requested a review from Waino October 16, 2023 09:08
@TimotheeMickus
Collaborator Author

Re-smoke-tested post merge with main, so should be good to merge.

@TimotheeMickus TimotheeMickus merged commit ed7261c into main Oct 23, 2023
2 of 4 checks passed
@TimotheeMickus TimotheeMickus deleted the feats/bucket-lord branch October 23, 2023 06:45
Successfully merging this pull request may close these issues.

update buckets in LookAheadBucketting