Load unbalance is a very likely candidate for the scaling issues we faced. This PR introduces a couple of new flags to enforce equal load across nodes, which seems to result in reasonable communication performance.
This can be achieved with:
- `--pad_to_max_length`
- `--max_length`, to which the input tensors will be trimmed (hence the sequence dimension of input tensors will be constant)
- `--batch_type "sents"` (hence the batch dimension of input tensors will be constant), along with a low enough batch size

Using a max length of 128 and a batch size of 300, we get 6k tok/sec on a 4-GPU / 1-node Europarl job, with a fairly constant GPU utilization rate (90 to 100%).
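To make the shape-stabilizing idea above concrete, here is a minimal sketch (not the PR's actual implementation; `pad_or_trim`, `fixed_shape_batches`, and the pad id are hypothetical names for illustration) of how trimming/padding to `max_length` plus a fixed sentence count per batch yields constant `(batch_size, max_length)` tensors:

```python
PAD_ID = 0  # assumed padding token id

def pad_or_trim(token_ids, max_length, pad_id=PAD_ID):
    """Return exactly max_length ids: trim long sequences, pad short ones."""
    trimmed = token_ids[:max_length]
    return trimmed + [pad_id] * (max_length - len(trimmed))

def fixed_shape_batches(examples, batch_size, max_length):
    """Yield batches of constant shape (batch_size, max_length)."""
    batch = []
    for ex in examples:
        batch.append(pad_or_trim(ex, max_length))
        if len(batch) == batch_size:
            yield batch
            batch = []
    # The last partial batch is dropped so every node sees identical shapes.

batches = list(fixed_shape_batches(
    [[1, 2, 3], [4, 5], [6], [7, 8]], batch_size=2, max_length=4))
# every batch is 2 x 4, regardless of the raw sequence lengths
```

With every rank producing identically shaped tensors, no rank waits on another's larger batch, which is what keeps GPU utilization roughly constant.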
Todos before undrafting this PR:
- `numel_fn`: allow it to track the max lengths across the batch items
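A rough sketch of what the `numel_fn` todo could look like (this is my reading of the item, not the PR's code; the function name and signature are assumptions): once batches are padded, the element count of the batch tensor is the running max length times the number of items, not the sum of raw token counts.

```python
def padded_numel_fn(item_lengths):
    """Element count of the padded batch tensor for the given item lengths.

    Tracks the max length across batch items: after padding, every item
    occupies max(item_lengths) positions, so numel = max_len * n_items.
    """
    if not item_lengths:
        return 0
    return max(item_lengths) * len(item_lengths)

padded_numel_fn([3, 5, 2])  # 5 * 3 = 15 elements after padding
```

Counting padded elements rather than raw tokens is what makes the per-batch cost estimate match the tensors actually exchanged between nodes.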