Hi 👋,
Thank you for continuously adding more features to the Whisper distillation code!
As I reviewed the section on prepending previous text during the preparation of training data, I made the following adjustments based on my interpretation:
- Moved the addition of `decoder_prev_token_id` to the end, so it's always applied even when `prev_ids` aren't cut by the previous two conditions
- Changed the length check to `len(prev_ids + token_ids) + 1`, which now includes `decoder_prev_token_id` since it's always added
- Excluded `prev_ids` from the `trim_length` calculation. For instance, with 3 `prev_ids`, 3 `token_ids`, and a `max_label_length` of 6, we should retain only the last 2 tokens in `prev_ids`, calculated as `max_label_length - len(token_ids) - 1 = 6 - 3 - 1 = 2` (see the sketch below)
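For concreteness, here is a minimal Python sketch of how I read the combined logic. The function name and the single-condition structure are my own simplification rather than the actual code, and the token ids in the example are placeholders:

```python
def prepend_previous_text(prev_ids, token_ids, decoder_prev_token_id, max_label_length):
    # Trim prev_ids so that prev_ids + token_ids, plus the one slot
    # reserved for decoder_prev_token_id, fits within max_label_length.
    if len(prev_ids + token_ids) + 1 > max_label_length:
        keep = max_label_length - len(token_ids) - 1
        prev_ids = prev_ids[-keep:] if keep > 0 else []
    # decoder_prev_token_id is always prepended, even when prev_ids
    # weren't cut by the condition above.
    return [decoder_prev_token_id] + prev_ids + token_ids

# Worked example from the last bullet: 3 prev_ids, 3 token_ids,
# max_label_length of 6 -> only the last 2 prev tokens are retained.
print(prepend_previous_text([1, 2, 3], [10, 11, 12], 0, 6))
# [0, 2, 3, 10, 11, 12] -> 6 tokens total, within max_label_length
```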