Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Updates data processing logic to remove dependency on hardcoded chat templates #428

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

RobotSail
Copy link
Member

In order to use the training library today, we need to manually define the chat template and rely on specifically parsing the special tokens.

This introduces a number of issues for consumers where other models cannot easily be used without some inital effort.

This PR resolves this issue by introducing a new form of data processing which is able to arbitrarily process data in the messages format and apply the appropriate unmasking policy.

Specifically, the new data processing script now reads the unmask field to determine whether to unmask all messages (aside from the system message) or just the assistant responses (skills training).

@mergify mergify bot added the ci-failure label Mar 5, 2025
@RobotSail RobotSail force-pushed the arbitrary-tokenizer branch from e78c7ed to 8882221 Compare March 5, 2025 05:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant