Discussion: Update dataloader to skip rows that don't require training #2344

Open · felipemello1 opened this issue Feb 5, 2025 · 4 comments
Labels: best practice, discussion, triage review

felipemello1 (Contributor) commented Feb 5, 2025

#2341

When a) train_on_input=False and b) the message is so long that the output is truncated, a batch may contain no trainable tokens, which raises an error in the loss because of division by zero.

Beyond surfacing an inconvenient bug, this wastes compute, and patching the loss would treat the symptom rather than the root cause.

Should the dataloader skip rows that don't have any trainable tokens?
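A minimal repro of the failure mode, assuming prompt/truncated tokens are masked with the usual -100 ignore index and the loss is normalized by the number of unmasked tokens:

import torch
import torch.nn.functional as F

IGNORE_IDX = -100  # assumed ignore index for non-trainable tokens

logits = torch.randn(1, 8, 32)           # (batch, seq_len, vocab)
labels = torch.full((1, 8), IGNORE_IDX)  # every token masked: nothing to train on

loss_sum = F.cross_entropy(
	logits.view(-1, 32), labels.view(-1),
	ignore_index=IGNORE_IDX, reduction="sum",
)
num_tokens = (labels != IGNORE_IDX).sum()
print(loss_sum / num_tokens)  # tensor(nan): 0 / 0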

RdoubleA (Contributor) commented Feb 5, 2025

It's difficult to know if this is the case from the DataLoader's perspective. You would have to catch this and raise/skip in the recipe, or add extra preprocessing to your dataset to prevent it; we don't have the ability to skip an item within the SFTDataset's __getitem__.

felipemello1 (Contributor, Author) commented Feb 5, 2025

I see. It could be added to the recipe:

for batch in dataloader:
	if not check_batch_requires_grad(batch):
		continue  # skip batches with no trainable tokens
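A hypothetical check_batch_requires_grad (not an existing torchtune helper), assuming labels mask non-trainable tokens with the usual -100 ignore index, could be as simple as:

CROSS_ENTROPY_IGNORE_IDX = -100  # assumed masking convention

def check_batch_requires_grad(batch: dict) -> bool:
	# Trainable iff at least one label survives the masking/truncation.
	return bool((batch["labels"] != CROSS_ENTROPY_IGNORE_IDX).any())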

I am sure @ebsmothers will love the idea!

EugenHotaj (Contributor) commented

Maybe slightly orthogonal, but it would also be great to expose a way to do left truncation instead of right truncation. I think this becomes even more important once the "mask all previous turns" strategy is introduced.

(I'm actually a bit surprised to see that torchtune supports right truncation only, since I thought left truncation was the standard practice for llama models.)
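For reference, a minimal sketch of the two truncation strategies on a list of token ids (truncate_left being the behavior requested here):

def truncate_right(tokens: list[int], max_len: int) -> list[int]:
	# Keep the beginning of the sequence, drop the end.
	return tokens[:max_len]

def truncate_left(tokens: list[int], max_len: int) -> list[int]:
	# Drop the beginning, keep the end (preserves the most recent turns).
	return tokens[-max_len:] if max_len > 0 else []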

felipemello1 added the discussion, best practice, and triage review labels on Feb 5, 2025
pocca2048 commented

> You would have to catch this and raise/skip in the recipe

I agree. In fact, I implemented the skipping logic in the recipe without fixing the loss.

felipemello1 self-assigned this Feb 7, 2025