
Fine tuning with llama-7b-model failure #256

Open
Ssukriti opened this issue Nov 1, 2023 · 3 comments

@Ssukriti
Collaborator

Ssukriti commented Nov 1, 2023

Describe the bug

Reported by Alan Braz.

Failures are seen when fine-tuning the llama-7b model with a certain set of parameters:

{
  "modelName": "test-llama2",
  "parameters": {
    "baseModel": "/data/base_models/models/meta-llama/Llama-2-7b",
    "trainStream": {
      "file": {
        "filename": "/data/base_models/input/train_rte_small.json"
      }
    },
    "torchDtype": "float32",
    "batchSize": "1",
    "numEpochs": "1",
    "accumulateSteps": "1",
    "lr": 0.1,
    "maxSourceLength": "128",
    "maxTargetLength": "64",
    "randomSeed": "1"
  }
}

Platform

Please provide details about the environment you are using, including the following:

  • Library version: latest

Sample Code

Run the examples/run_fine_tuning.py script with any dataset and the config above.
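
As a rough sketch, an invocation along these lines corresponds to the config above (the flags mirror the reproduction command Alan Braz posts later in this thread; the output path is a placeholder, and the learning-rate and sequence-length settings from the JSON are not shown):

python examples/run_fine_tuning.py \
  --model_name /data/base_models/models/meta-llama/Llama-2-7b \
  --output ./output \
  --num_epochs 1 \
  --batch_size 1 \
  --torch_dtype float32 \
  --dataset glue/rte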

Expected behavior

Fine tuning succeeds

Observed behavior

return inner_training_loop(
   File "/dccstor/ssharma/caikit_nlp_env_new/lib/python3.9/site-packages/accelerate/utils/memory.py", line 134, in decorator
     raise RuntimeError("No executable batch size found, reached zero.")
 RuntimeError: No executable batch size found, reached zero.
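
For context, this error is raised by accelerate's find_executable_batch_size utility, which catches CUDA out-of-memory errors, halves the batch size, and retries the training loop; once the batch size reaches zero it gives up with the message above. A minimal, self-contained sketch of that retry mechanism (illustration only, not the caikit-nlp training loop; the simulated OOM stands in for a real forward/backward pass):

from accelerate.utils import find_executable_batch_size


def run_one_training_step(batch_size):
    # Stand-in for a real forward/backward pass: a 7B-parameter model held
    # in float32 can exhaust GPU memory even at batch_size=1.
    raise RuntimeError("CUDA out of memory. (simulated)")


# The decorator catches the OOM, halves the batch size, and retries; starting
# from 1 it immediately reaches zero and raises
# RuntimeError("No executable batch size found, reached zero.")
@find_executable_batch_size(starting_batch_size=1)
def inner_training_loop(batch_size):
    run_one_training_step(batch_size)


inner_training_loop()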


@Ssukriti
Collaborator Author

Ssukriti commented Nov 1, 2023

  1. The error seems to go away when switching torch_dtype from float32 to bfloat16.

Sources with similar suggestions to use bfloat16:

  1. https://discourse.julialang.org/t/llama2-7b-difference-in-inference-when-between-float16-and-float32/103826/2
  2. The default precision of the CodeLlama-7b model (https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json) is bfloat16.
  3. People are seeing NaN during inference when using float16, which goes away with bfloat16; the Llama 2 weights from FB (released in bfloat16) are perhaps accidentally cast to float16 in the conversion script, see huggingface/transformers#25446.

Besides that, the same error can also occur when memory is exceeded with large batch sizes, even with bfloat16. In that case, we can try smaller batch sizes and shorter input sequence lengths.
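
For reference, a minimal sketch of loading the base model in bfloat16 with Hugging Face transformers (an illustration of the dtype switch only, not the caikit-nlp code path; the model path is the one from the config above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path taken from the config in the issue description; adjust as needed.
base_model_path = "/data/base_models/models/meta-llama/Llama-2-7b"

tokenizer = AutoTokenizer.from_pretrained(base_model_path)
# bfloat16 halves the memory footprint relative to float32 and matches the
# precision the Llama 2 weights were released in.
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.bfloat16,
)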

@chakrn
Collaborator

chakrn commented Nov 1, 2023

@alanbraz can you verify with bfloat16 and let us know if you run into any more issues?

@alanbraz

alanbraz commented Jan 5, 2024

Running in a pod on the same cluster as the Model Tuner, with the same resources and an NVIDIA-A100-SXM4-80GB GPU. Using the same caikit-nlp as the caikit-nlp-service-trainer:0.7.11 image, from commit https://github.com/caikit/caikit-nlp/tree/2a30fc94e3bb66acb6e2eda8d2ce34f3bd86be60

limits:
  cpu: '10'
  memory: 96000Mi
  nvidia.com/gpu: '1'
requests:
  cpu: '2'
  memory: 5000Mi
  nvidia.com/gpu: '1'

Still an error when running the command:

TOKENIZERS_PARALLELISM=false python run_fine_tuning.py --model_name /mnt/pvc-mount/cais-base-0/data/models/meta-llama/llama-2-7b  --output /mnt/pvc-mount/cais-shared-1/data/caikit-nlp/examples/output/  --num_epochs 1 --batch_size 1 --torch_dtype bfloat16 --dataset glue/rte
2024-01-05T13:19:40.133424 [TXT_G:DBUG] Number of inferred steps: [2490]
2024-01-05T13:19:40.133623 [TRCH_:INFO] Cuda devices available! Using 1 devices.
[2024-01-05 13:19:40,133] torch.distributed.launcher.api: [WARNING] config has no run_id, generated a random run_id: 255837245335697231841058382140573645167
Bus error (core dumped)
sh-5.1$ /opt/miniconda/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 9 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
