When I train Qwen2.5-1.5B-Instruct, my training dataset has 324,892 examples, but the trainer reports Num examples = 10,797.

Tokenizing train dataset: 100%|██████████| 324892/324892 [02:39<00:00, 2043.00 examples/s]
Packing train dataset: 100%|██████████| 324892/324892 [01:23<00:00, 3885.00 examples/s]

Config:
model_name_or_path: Qwen2.5-1.5B-Instruct
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
dataset_name: /data/open-r1/dataset/conversations.jsonl
dataset_configs:
- all
preprocessing_num_workers: 8

# SFT trainer config
bf16: true
do_eval: false
eval_strategy: "no"
eval_steps: 100
gradient_accumulation_steps: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: Qwen2.5-1.5B-Open-R1-Distill
hub_strategy: every_save
learning_rate: 2.0e-04
log_level: info
logging_steps: 5
logging_strategy: steps
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr_rate: 0.1
packing: true
max_seq_length: 4096
max_steps: -1
num_train_epochs: 1
output_dir: /data/open-r1/output/Qwen2.5-1.5B-Open-R1
overwrite_output_dir: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 1
push_to_hub: false
report_to:
- none
save_strategy: "epoch"
save_steps: 100
save_total_limit: 5
seed: 42
warmup_ratio: 0.1
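For context, open-r1's SFT script builds a TRL `SFTTrainer` from a config like this. A minimal standalone sketch of the packing-relevant part is below; the exact argument names (e.g. `max_seq_length` vs. `max_length`) vary across `trl` versions, and the model/dataset paths are simply copied from the config above, so treat this as illustrative rather than the project's actual entry point:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Raw conversations (same file as dataset_name in the config above).
dataset = load_dataset(
    "json",
    data_files="/data/open-r1/dataset/conversations.jsonl",
    split="train",
)

# Only the packing-related options are shown here.
config = SFTConfig(
    output_dir="/data/open-r1/output/Qwen2.5-1.5B-Open-R1",
    packing=True,          # concatenate examples into fixed-length blocks
    max_seq_length=4096,   # block size used for packing (name may differ by trl version)
    per_device_train_batch_size=1,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen2.5-1.5B-Instruct",
    args=config,
    train_dataset=dataset,
)

print(len(dataset))                # 324892 raw conversations
print(len(trainer.train_dataset))  # far fewer packed 4096-token sequences
```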
I found the cause: packing: true. With packing enabled, the tokenized examples are concatenated into fixed-length sequences of max_seq_length (4096) tokens, so the 324,892 raw conversations are packed into 10,797 training sequences, which is what Num examples reports.
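A quick sanity check of the numbers (the average length is inferred from the reported counts, not measured on the dataset):

```python
packed_sequences = 10_797
max_seq_length = 4096
raw_examples = 324_892

total_tokens = packed_sequences * max_seq_length        # ~44.2M tokens after packing
avg_tokens_per_example = total_tokens / raw_examples    # ~136 tokens per raw conversation

print(f"{total_tokens:,} tokens -> ~{avg_tokens_per_example:.0f} tokens per example")
```

So the drop from 324,892 to 10,797 is expected whenever the conversations are much shorter than 4096 tokens; the total token count, and hence the optimizer steps per epoch, is unchanged.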