[WIP] Updates data processing logic to remove dependency on hardcoded chat templates #428

RobotSail · 2025-03-05T05:15:11Z

In order to use the training library today, we need to manually define the chat template and rely on specifically parsing the special tokens.

This introduces a number of issues for consumers where other models cannot easily be used without some inital effort.

This PR resolves this issue by introducing a new form of data processing which is able to arbitrarily process data in the messages format and apply the appropriate unmasking policy.

Specifically, the new data processing script now reads the unmask field to determine whether to unmask all messages (aside from the system message) or just the assistant responses (skills training).

Signed-off-by: Oleg Silkin <[email protected]>

mergify bot added the ci-failure label Mar 5, 2025

create the ability to unmask from arbitrary templates

8882221

Signed-off-by: Oleg Silkin <[email protected]>

RobotSail force-pushed the arbitrary-tokenizer branch from e78c7ed to 8882221 Compare March 5, 2025 05:20

save aldo methods

9565c38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Updates data processing logic to remove dependency on hardcoded chat templates #428

[WIP] Updates data processing logic to remove dependency on hardcoded chat templates #428

RobotSail commented Mar 5, 2025

[WIP] Updates data processing logic to remove dependency on hardcoded chat templates #428

Are you sure you want to change the base?

[WIP] Updates data processing logic to remove dependency on hardcoded chat templates #428

Conversation

RobotSail commented Mar 5, 2025