Feats/bucket lord #23
Conversation
It's implemented, it runs, and it seems to be about as efficient as the previous solution, but I don't know if it's an actual improvement in terms of padding efficiency. The buckets are likely very sparsely populated (with …). I'll see if I can come up with stats on how much padding we observe with this solution.
Check all uses of self._lens.
Indeed. I think that the original 1-d buckets were probably already unnecessarily sparse due to being based on exact rather than quantized lengths. There is a tradeoff between minimizing the amount of padding and minimizing the amount of correlation between examples in a batch. Being extremely aggressive about minimizing padding may not be optimal for training if it results in a very unshuffled data order (and large memory use). I think that a variable-granularity binning, with fine-grained bins near the mode of the lengths and coarse bins outside it, would be good.
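A rough sketch of what such variable-granularity binning could look like (the function and parameter names are illustrative, not from this PR):

```python
# Hypothetical sketch: one bucket per exact length near the assumed mode,
# coarse buckets of fixed width for everything else.
def length_to_bucket(length: int, mode: int = 20, fine_radius: int = 8, coarse_width: int = 16) -> int:
    if abs(length - mode) <= fine_radius:
        # Fine-grained region: a dedicated bucket per exact length.
        return length
    # Coarse region: group the tails into buckets of `coarse_width`,
    # using negative ids so they never collide with the fine buckets.
    return -(length // coarse_width) - 1
```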
Memory use isn't a concern (there's still only …).
That's not easy to pre-compute, and it requires different parameters per language or per task. It's not impossible, but it's going to be very involved. My concrete approach would be to try to automate it in config-config while reading the corpora lengths, e.g. with a process similar to the following:
We could obviously skip 1 and 2 and rely on hard observed data if we compute the full stats of each corpus. Note that in any case this does involve configuring bucket sizes in the task. If, on the other hand, we only care about ensuring that tail items are lumped in a single bucket, then I would consider just defining an outlier bucket with a cutoff at an expected max length (say 128 or less), which can already be done by setting a lower …. In the present 2D case, we could also rely on length ratios to define how many of the offset diagonals we tolerate before sending an example to the outlier bucket.
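For the length-ratio idea, a minimal sketch (a hypothetical helper, not part of this PR) of deciding when an example falls too far off the main diagonal of the 2D grid:

```python
# Hypothetical: treat an example as an outlier when its length ratio
# exceeds a tolerance, i.e. it lies outside the tolerated band of
# offset diagonals in the (src_len, tgt_len) grid.
def is_diagonal_outlier(src_len: int, tgt_len: int, max_ratio: float = 1.5) -> bool:
    longer = max(src_len, tgt_len)
    shorter = max(1, min(src_len, tgt_len))
    return longer / shorter > max_ratio
```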
If we're stressed out about wrapping around the full corpus and constructing batches with very similar examples (which is justifiable for low-resource languages in my opinion), then we can …
In order for the rare buckets out there in the tail of the distribution to ever see enough data to be filled, …. If …
It doesn't need to be exact. A warmup step that reads the first 1k-10k lines of the file and estimates the mode from that would probably be good enough.
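A minimal sketch of such a warmup step, assuming a plain-text file and whitespace tokenization (both are assumptions, just for illustration):

```python
from collections import Counter
from itertools import islice

def estimate_length_mode(path: str, n_lines: int = 10_000) -> int:
    # Count token lengths over the first n_lines lines only.
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in islice(fh, n_lines):
            counts[len(line.split())] += 1
    # The most common length in the sample; exactness is not required.
    return counts.most_common(1)[0][0]
```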
Yes, this would need to happen per task. Aggregating over languages would be tricky and not worth it.
We could compute this during config-config instead of at the beginning of training.
Yes, we could do it like this. It would be theoretically very elegant: the bins would be approximately equiprobable. That said, I'd just go with a rough approach where there are a few small (maybe size 1) bins around the mode and everything else gets chucked into a collector bin (or a few of those, if we want a little more granularity). This would not be very principled, but it can be defined based on just one parameter.
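For reference, a hedged sketch of the equiprobable-bins option, with quantile-based edges computed from a sample of observed lengths (names are illustrative):

```python
import numpy as np

def equiprobable_bin_edges(sampled_lengths, n_bins: int = 16):
    # Interior quantiles give edges such that each bin holds roughly
    # 1/n_bins of the sampled lengths.
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    edges = np.quantile(np.asarray(sampled_lengths), qs)
    # Lengths are integers, so duplicate edges can appear around the
    # mode; deduplicate to keep the binning well-defined.
    return np.unique(np.round(edges).astype(int))
```

An example length could then be mapped to a bin index with np.digitize(length, edges).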
It would be overkill to use the whole corpus to compute this.
I guess this also does the trick. It is the longer side that has the long tail; on the shorter side there is a minimum value that is quickly reached.
Not sure I follow.
Wouldn't this just throw away the outliers entirely, instead of collecting them into (rare) outlier batches containing (potentially several copies of) just the outliers?
In the 1D impl currently on main, anything with a length above …
Reading 10k more lines would slow down the initialization. For simplicity, I'd consider using …. If so, I would suggest implementing something like:
cons:
pros:
This is the simpler option, yes...
In the current version, we define buckets for …
No, you'd keep creating batches until …
OK. After clearing up some of my misunderstandings, I guess we could just lower the n_bucket value, so that the "high granularity" area contains a reasonable number of buckets when we go to 2D. Then everything else would either go into a single outlier bucket, or we could have e.g. three: {source|target|both} exceeded the max length.
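A minimal sketch of that three-outlier-bucket variant (the key function and bucket names are hypothetical, not this PR's implementation):

```python
def bucket_key(src_len: int, tgt_len: int, max_length: int = 128):
    # Route anything that exceeds the max length on either side to one
    # of three dedicated outlier buckets; everything else keeps its
    # regular 2D key.
    src_over = src_len > max_length
    tgt_over = tgt_len > max_length
    if src_over and tgt_over:
        return "outlier_both"
    if src_over:
        return "outlier_src"
    if tgt_over:
        return "outlier_tgt"
    return (src_len, tgt_len)
```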
This also sounds great. Pick whichever solution you prefer.
That was the idea, yes
I guess having every type of outlier go into a single bucket would make sense, but I don't really know how to make this work with the spiralling bucket search pattern. Let me think on it.
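For context, a rough guess at the kind of spiralling search being referred to, i.e. looking for the nearest non-empty 2D bucket at increasing Chebyshev distance (purely illustrative, not the actual implementation):

```python
def spiral_offsets(max_radius: int):
    # Yield (0, 0) first, then the perimeter of ever larger squares.
    yield (0, 0)
    for r in range(1, max_radius + 1):
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                if max(abs(dx), abs(dy)) == r:
                    yield (dx, dy)

def find_nonempty_bucket(buckets, key, max_radius: int = 3):
    # `buckets` maps (src_len, tgt_len) keys to lists of examples.
    sx, sy = key
    for dx, dy in spiral_offsets(max_radius):
        candidate = (sx + dx, sy + dy)
        if buckets.get(candidate):
            return candidate
    return None
```

A string-keyed outlier bucket would sit outside this grid search, which is presumably where the difficulty lies.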
They're not mutually exclusive; we can implement both. And we're dropping the over-engineered solution for equiprobable bins, correct?
Some further comments:
I'm also strongly considering using this PR to give a more descriptive name to the DynamicDatasetIterator, which neither defines the dynamic iteration process nor directly iterates over datasets (a multiplexer of streams of batches from tasks?).
Re-smoke-tested post-merge with main, so it should be good to merge.
closes #12