WIP [fine-tuning]: Gather more results #608

Open · wants to merge 54 commits into base: main

Conversation

albertoperdomo2 (Collaborator)

No description provided.

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1719

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1720

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 09 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1720

🔴 Test of 'rhoai test export_artifacts /logs/artifacts' failed after 00 hours 00 minutes 06 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test export_artifacts /logs/artifacts
PR_POSITIONAL_ARGS: gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1721

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 09 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: metal gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: metal
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1721

🔴 Test of 'rhoai test export_artifacts /logs/artifacts' failed after 00 hours 00 minutes 08 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test export_artifacts /logs/artifacts
PR_POSITIONAL_ARGS: metal gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: metal
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1722

🟢 Test of 'rhoai test test_ci' succeeded after 07 hours 24 minutes 44 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1723

🔴 Test of 'rhoai test test_ci' failed after 00 hours 03 minutes 19 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/mistral-7b-v0.3-gptq', 'storage_dir': '/model', 'name': 'mistral-7b-v0.3-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/granite-8b-code-instruct-gptq', 'storage_dir': '/model', 'name': 'granite-8b-code-instruct-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/002__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/allam-beta-13b-chat-gptq', 'storage_dir': '/model', 'name': 'allam-beta-13b-chat-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/003__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/granite-34b-code-base-gptq', 'storage_dir': '/model', 'name': 'granite-34b-code-base-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/mixtral-8x7b-instruct-v0.1-gptq', 'storage_dir': '/model', 'name': 'mixtral-8x7b-instruct-v0.1-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/002__plots/FAILURE | An error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/000_test_ci/003__prom_plots/FAILURE | An error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/000_test_ci/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 394, in test

[...]

[Test ran on the internal Perflab CI]
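The failure indicator entries above follow a fixed "artifact path | message" layout. A minimal sketch for splitting such lines when triaging a run (the helper name `parse_failure_indicator` is hypothetical, not part of topsail):

```python
# Hypothetical helper, only for illustration: split a failure-indicator
# line like "/logs/artifacts/.../FAILURE | MatrixBenchmark benchmark failed."
# into the artifact path and the error message.
def parse_failure_indicator(line):
    path, _, message = line.partition(" | ")
    return path.strip(), message.strip()

line = ("/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE"
        " | MatrixBenchmark benchmark failed.")
path, message = parse_failure_indicator(line)
print(path)     # /logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE
print(message)  # MatrixBenchmark benchmark failed.
```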

topsail-bot bot commented Dec 5, 2024

Jenkins Job #1724

🔴 Test of 'rhoai test test_ci' failed after 01 hours 11 minutes 23 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'model_name': 'mistral-7b-v0.3-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'pod_count': 1, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'qlora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'model_name': 'mistral-7b-v0.3-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'pod_count': 1, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'qlora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 143, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]
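The traceback above shows topsail's `run()` helper (run.py:105) handing the assembled shell command to `subprocess.run()`. A minimal stand-alone sketch, assuming `check=True` semantics and not the actual topsail source, of why a non-zero exit code surfaces as the `CalledProcessError: ... returned non-zero exit status 2.` captured in the failure indicator:

```python
# Hedged sketch, NOT the real run.py: with check=True, subprocess.run()
# raises CalledProcessError for any non-zero exit code, which is the
# exception text visible in the failure indicators.
import subprocess

def run(command, check=True):
    args = dict(shell=True, check=check)
    # mirrors "proc = subprocess.run(command, **args)" from the traceback
    return subprocess.run(command, **args)

try:
    run("exit 2")  # stand-in for a failing ./run_toolbox.py invocation
except subprocess.CalledProcessError as e:
    print(f"returned non-zero exit status {e.returncode}.")
```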

topsail-bot bot commented Dec 6, 2024

Jenkins Job #1725

🟢 Test of 'rhoai test test_ci' succeeded after 10 hours 14 minutes 41 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 7, 2024

Jenkins Job #1726

🔴 Test of 'rhoai test test_ci' failed after 02 hours 23 minutes 48 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]
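The `--extra="{...}"` flags in these commands carry Python-dict reprs of the per-test configuration. A hypothetical sketch of how a `_dict_to_run_toolbox_args`-style function (name taken from the traceback; the real implementation may differ) could render keyword arguments into such flags:

```python
# Assumption-laden sketch, not the real topsail helper: str() of a dict
# reproduces the Python-repr style seen in the logs, e.g.
# --extra="{'name': 'qlora', 'gpu': 4}"
def dict_to_run_toolbox_args(kwargs):
    return " ".join(f'--{key}="{value}"' for key, value in kwargs.items())

print(dict_to_run_toolbox_args({"extra": {"name": "qlora", "gpu": 4}}))
# --extra="{'name': 'qlora', 'gpu': 4}"
```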

topsail-bot bot commented Dec 9, 2024

Jenkins Job #1727

🔴 Test of 'rhoai test test_ci' failed after 02 hours 23 minutes 41 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/001_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 9, 2024

Jenkins Job #1728

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 07 minutes 54 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/001_prepare_ci/000__prepare1/000__install_rhoai/000__rhods__deploy_ods/FAILURE | [000__rhods__deploy_ods] ./run_toolbox.py from_config rhods deploy_ods --extra={} --> 2
/logs/artifacts/001_prepare_ci/000__prepare1/000__install_rhoai/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001_prepare_ci/000__prepare1/000__install_rhoai" ./run_toolbox.py from_config rhods deploy_ods --extra="{}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/rhods/library/prepare_rhoai.py", line 58, in install
    run.run_toolbox_from_config("rhods", "deploy_ods")
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 9, 2024

Jenkins Job #1729

🔴 Test of 'rhoai test test_ci' failed after 01 hours 48 minutes 31 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/001_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 9, 2024

Jenkins Job #1731

🔴 Test of 'rhoai test test_ci' failed after 01 hours 52 minutes 36 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 10, 2024

Jenkins Job #1732

🟢 Test of 'rhoai test test_ci' succeeded after 03 hours 54 minutes 37 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 10, 2024

Jenkins Job #1733

🟢 Test of 'rhoai test test_ci' succeeded after 00 hours 07 minutes 13 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 10, 2024

Jenkins Job #1734

🟢 Test of 'rhoai test test_ci' succeeded after 00 hours 09 minutes 15 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 10, 2024

Jenkins Job #1735

🔴 Test of 'rhoai test test_ci' failed after 08 hours 07 minutes 49 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 11, 2024

Jenkins Job #1736

🔴 Test of 'rhoai test test_ci' failed after 01 hours 56 minutes 16 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 11, 2024

Jenkins Job #1737

🟢 Test of 'rhoai test test_ci' succeeded after 03 hours 59 minutes 37 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 11, 2024

Jenkins Job #1738

🔴 Test of 'rhoai test test_ci' failed after 08 hours 05 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'pod_count': 1, 'model_name': 'mixtral-8x7b-instruct-v0.1-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'pod_count': 1, 'model_name': 'mixtral-8x7b-instruct-v0.1-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

openshift-ci bot commented Dec 12, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from albertoperdomo2. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

topsail-bot bot commented Dec 12, 2024

Jenkins Job #1739

🔴 Test of 'rhoai test test_ci' failed after 08 hours 05 minutes 04 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 13, 2024

Jenkins Job #1740

🟢 Test of 'rhoai test test_ci' succeeded after 10 hours 09 minutes 20 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 16, 2024

Jenkins Job #1741

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 08 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: cluster_rhoai_2x8h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_2x8h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 17, 2024

Jenkins Job #1744

🟢 Test of 'rhoai test test_ci' succeeded after 04 hours 25 minutes 49 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_2x8h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_2x8h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 18, 2024

Jenkins Job #1746

🔴 Test of 'rhoai test test_ci' failed after 00 hours 02 minutes 48 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 402, in test
    failed = _run_test_and_visualize()
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 281, in _run_test_and_visualize
    if not prepare_rhoai_mod.is_rhoai_installed():
  File "/opt/topsail/src/projects/rhods/library/prepare_rhoai.py", line 40, in is_rhoai_installed
    installed_csv_cmd = run.run(f"oc get csv -loperators.coreos.com/{RHODS_OPERATOR_MANIFEST_NAME}.{RHODS_NAMESPACE}"
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

[Test ran on the internal Perflab CI]
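This failure happens before any benchmark runs: `_run_test_and_visualize` first calls `is_rhoai_installed`, which shells out to `oc get csv` with a label selector for the RHOAI operator's ClusterServiceVersion, and the `oc` invocation itself failed. A minimal sketch of that pre-flight check, with the manifest name and namespace values assumed for illustration and the command runner injectable so it can be exercised without a cluster:

```python
import subprocess

# Assumed values for illustration; the real constants live in
# projects/rhods/library/prepare_rhoai.py.
RHODS_OPERATOR_MANIFEST_NAME = "rhods-operator"
RHODS_NAMESPACE = "redhat-ods-operator"

def is_rhoai_installed(runner=subprocess.run):
    # Simplified sketch of the check from the traceback: query the operator's
    # CSV by its OLM label. If `oc` itself fails (no cluster access, expired
    # kubeconfig), the test aborts here, as in the log above.
    cmd = (f"oc get csv -loperators.coreos.com/"
           f"{RHODS_OPERATOR_MANIFEST_NAME}.{RHODS_NAMESPACE} "
           f"-n {RHODS_NAMESPACE} -oname")
    proc = runner(cmd, shell=True, capture_output=True, text=True)
    return proc.returncode == 0 and bool(proc.stdout.strip())
```

With an injected fake runner, the function returns `True` only when the query succeeds and names at least one CSV, which matches the "is RHOAI installed" semantics implied by the call site.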

topsail-bot bot commented Dec 19, 2024

Jenkins Job #1747

🔴 Test of 'rhoai test test_ci' failed after 00 hours 00 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

@albertoperdomo2 albertoperdomo2 force-pushed the fine-tuning-blog branch 3 times, most recently from 4cc8b18 to 78fca5b Compare March 4, 2025 09:54
Labels: do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress)
Projects: None yet
2 participants