
Fine tuning with llama-7b-model failure #256

Open
Ssukriti opened this issue Nov 1, 2023 · 3 comments

@Ssukriti
Collaborator

Ssukriti commented Nov 1, 2023

Describe the bug

Reported by Alan Braz.

Failures are seen when fine-tuning the llama-7b model with a certain set of parameters:

{
  "modelName": "test-llama2",
  "parameters": {
    "baseModel": "/data/base_models/models/meta-llama/Llama-2-7b",
    "trainStream": {
      "file": {
        "filename": "/data/base_models/input/train_rte_small.json"
      }
    },
    "torchDtype": "float32",
    "batchSize": "1",
    "numEpochs": "1",
    "accumulateSteps": "1",
    "lr": 0.1,
    "maxSourceLength": "128",
    "maxTargetLength": "64",
    "randomSeed": "1"
  }
}

Platform

Please provide details about the environment you are using, including the following:

  • Library version: latest

Sample Code

Run the examples/run_fine_tuning.py script with any dataset and the config above.
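
As a rough sketch, an invocation along these lines corresponds to the config above (the flags mirror the reproduction command Alan Braz posts later in this thread; the output path is a placeholder, and the learning-rate and sequence-length settings from the JSON are not shown):

python examples/run_fine_tuning.py \
  --model_name /data/base_models/models/meta-llama/Llama-2-7b \
  --output ./output \
  --num_epochs 1 \
  --batch_size 1 \
  --torch_dtype float32 \
  --dataset glue/rte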

Expected behavior

Fine tuning succeeds

Observed behavior

return inner_training_loop(
   File "/dccstor/ssharma/caikit_nlp_env_new/lib/python3.9/site-packages/accelerate/utils/memory.py", line 134, in decorator
     raise RuntimeError("No executable batch size found, reached zero.")
 RuntimeError: No executable batch size found, reached zero.
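
For context, this error is raised by accelerate's find_executable_batch_size utility, which catches CUDA out-of-memory errors, halves the batch size, and retries the training loop; once the batch size reaches zero it gives up with the message above. A minimal, self-contained sketch of that retry mechanism (illustration only, not the caikit-nlp training loop; the simulated OOM stands in for a real forward/backward pass):

from accelerate.utils import find_executable_batch_size


def run_one_training_step(batch_size):
    # Stand-in for a real forward/backward pass: a 7B-parameter model held
    # in float32 can exhaust GPU memory even at batch_size=1.
    raise RuntimeError("CUDA out of memory. (simulated)")


# The decorator catches the OOM, halves the batch size, and retries; starting
# from 1 it immediately reaches zero and raises
# RuntimeError("No executable batch size found, reached zero.")
@find_executable_batch_size(starting_batch_size=1)
def inner_training_loop(batch_size):
    run_one_training_step(batch_size)


inner_training_loop()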


@Ssukriti
Collaborator Author

Ssukriti commented Nov 1, 2023

  1. The error seems to go away when switching torch_dtype from float32 to bfloat16.

Sources with similar suggestions to use bfloat16:

  1. https://discourse.julialang.org/t/llama2-7b-difference-in-inference-when-between-float16-and-float32/103826/2
  2. The default precision of the CodeLlama-7b model (https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json) is bfloat16.
  3. People are seeing NaN during inference when using float16, which goes away with bfloat16; the Llama 2 weights from FB (released in bfloat16) are perhaps accidentally cast to float16 in the conversion script, see huggingface/transformers#25446.

Besides that, the same error can also occur when memory is exceeded with large batch sizes, even with bfloat16. In that case, we can try smaller batch sizes and shorter input sequence lengths.
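
For reference, a minimal sketch of loading the base model in bfloat16 with Hugging Face transformers (an illustration of the dtype switch only, not the caikit-nlp code path; the model path is the one from the config above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path taken from the config in the issue description; adjust as needed.
base_model_path = "/data/base_models/models/meta-llama/Llama-2-7b"

tokenizer = AutoTokenizer.from_pretrained(base_model_path)
# bfloat16 halves the memory footprint relative to float32 and matches the
# precision the Llama 2 weights were released in.
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.bfloat16,
)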

@chakrn
Collaborator

chakrn commented Nov 1, 2023

@alanbraz can you verify with bfloat16 and let us know if you run into any more issues?

@alanbraz

alanbraz commented Jan 5, 2024

Running in a pod on the same cluster as the Model Tuner, with the same resources and an NVIDIA-A100-SXM4-80GB GPU. Using the same caikit-nlp as the caikit-nlp-service-trainer:0.7.11 image, from commit https://github.com/caikit/caikit-nlp/tree/2a30fc94e3bb66acb6e2eda8d2ce34f3bd86be60

limits:
  cpu: '10'
  memory: 96000Mi
  nvidia.com/gpu: '1'
requests:
  cpu: '2'
  memory: 5000Mi
  nvidia.com/gpu: '1'

Still an error when running the command:

TOKENIZERS_PARALLELISM=false python run_fine_tuning.py --model_name /mnt/pvc-mount/cais-base-0/data/models/meta-llama/llama-2-7b  --output /mnt/pvc-mount/cais-shared-1/data/caikit-nlp/examples/output/  --num_epochs 1 --batch_size 1 --torch_dtype bfloat16 --dataset glue/rte
2024-01-05T13:19:40.133424 [TXT_G:DBUG] Number of inferred steps: [2490]
2024-01-05T13:19:40.133623 [TRCH_:INFO] Cuda devices available! Using 1 devices.
[2024-01-05 13:19:40,133] torch.distributed.launcher.api: [WARNING] config has no run_id, generated a random run_id: 255837245335697231841058382140573645167
Bus error (core dumped)
sh-5.1$ /opt/miniconda/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 9 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
