Inconsistent validation data handling in Keras 3 for Language Model fine-tuning #20748
Labels: Gemma (Gemma model specific issues), stat:awaiting response from contributor, type:support (user is asking for help / an implementation question; Stack Overflow would be better suited)
Issue Description
When fine-tuning language models in Keras 3, validation data handling is inconsistent. The documentation suggests validation_data should be provided in (x, y) format, but the actual requirements are unclear, and the accepted input formats differ between the training and validation phases.
Current Behavior & Problems
Issue 1: Raw text arrays are not accepted for validation
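A minimal reproduction sketch of this issue; the preset name and the toy sentences below are placeholders rather than the original report's data:

```python
import keras_nlp

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

train_texts = ["The quick brown fox jumps over the lazy dog."] * 8
val_texts = ["Raw validation sentences provided the same way."] * 4

# Training on raw strings works: the attached preprocessor tokenizes x
# and derives the shifted next-token labels internally.
gemma_lm.fit(train_texts, batch_size=2, epochs=1)

# Providing raw strings the same way for validation is where the report
# says things break: Keras expects validation_data as an (x, y) tuple or
# a dataset, so a bare text array is not handled like the training input.
gemma_lm.fit(
    train_texts,
    validation_data=val_texts,
    batch_size=2,
    epochs=1,
)
```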
Issue 2: Pre-tokenized validation fails
The error suggests the tokenizer is being applied again to data that has already been tokenized. I understand the model can be constructed with preprocessor=None, but then I would also have to preprocess the training data manually, which I want to avoid.
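A sketch of this second case, using the same placeholder setup as above: only the validation split is tokenized up front with the model's own preprocessor, while the training data stays as raw text.

```python
import keras_nlp

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

train_texts = ["The quick brown fox jumps over the lazy dog."] * 8
val_texts = ["Raw validation sentences provided the same way."] * 4

# Manually preprocess only the validation split into (x, y, sample_weight).
val_x, val_y, val_sw = gemma_lm.preprocessor(val_texts)

# Because the preprocessor is still attached to the model, it is applied
# to validation_data again, so the tokenizer receives token IDs rather
# than strings -- matching the error described in the report.
gemma_lm.fit(
    train_texts,
    validation_data=(val_x, val_y),
    batch_size=2,
    epochs=1,
)

# The documented way around this is to detach preprocessing entirely, but
# then the training data must be preprocessed by hand as well:
# gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset(
#     "gemma_2b_en", preprocessor=None
# )
```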
Working Solution (But Needs Documentation)
The working approach is to provide prompt-completion pairs:
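A sketch of what this might look like. The exact data layout the author used is not shown in the issue; pairing prompt strings as x with completion strings as y (and the same tuple shape for validation_data) is an assumption made for illustration.

```python
import keras_nlp

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

# Hypothetical prompt-completion pairs; the original report's data is not shown.
train_prompts = ["Translate to French: hello"] * 8
train_completions = ["bonjour"] * 8
val_prompts = ["Translate to French: goodbye"] * 4
val_completions = ["au revoir"] * 4

gemma_lm.fit(
    x=train_prompts,
    y=train_completions,
    validation_data=(val_prompts, val_completions),
    batch_size=2,
    epochs=1,
)
```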
Expected Behavior
Environment
Additional Context
While there is a working solution using prompt-completion pairs, this differs from traditional causal language model training, where the model predicts the next token at every position of the sequence. The documentation should clarify this architectural choice and explain the proper way to provide validation data.