WIP [fine-tuning]: Gather more results #608

Open · wants to merge 54 commits into base: main

Conversation

albertoperdomo2 (Collaborator)

No description provided.

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1719

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1720

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 09 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1720

🔴 Test of 'rhoai test export_artifacts /logs/artifacts' failed after 00 hours 00 minutes 06 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test export_artifacts /logs/artifacts
PR_POSITIONAL_ARGS: gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1721

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 09 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: metal gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: metal
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1721

🔴 Test of 'rhoai test export_artifacts /logs/artifacts' failed after 00 hours 00 minutes 08 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test export_artifacts /logs/artifacts
PR_POSITIONAL_ARGS: metal gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: metal
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1722

🟢 Test of 'rhoai test test_ci' succeeded after 07 hours 24 minutes 44 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 4, 2024

Jenkins Job #1723

🔴 Test of 'rhoai test test_ci' failed after 00 hours 03 minutes 19 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/mistral-7b-v0.3-gptq', 'storage_dir': '/model', 'name': 'mistral-7b-v0.3-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/granite-8b-code-instruct-gptq', 'storage_dir': '/model', 'name': 'granite-8b-code-instruct-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/002__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/allam-beta-13b-chat-gptq', 'storage_dir': '/model', 'name': 'allam-beta-13b-chat-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/003__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/granite-34b-code-base-gptq', 'storage_dir': '/model', 'name': 'granite-34b-code-base-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 'dmf://rhoai/mixtral-8x7b-instruct-v0.1-gptq', 'storage_dir': '/model', 'name': 'mixtral-8x7b-instruct-v0.1-gptq', 'image': 'quay.io/modh/fms-hf-tuning:v2.0.1', 'creds': '/tmp/secrets/dmf.token'} --> 2
/logs/artifacts/000_test_ci/002__plots/FAILURE | An error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/000_test_ci/003__prom_plots/FAILURE | An error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/000_test_ci/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 394, in test

[...]

[Test ran on the internal Perflab CI]
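The failure indicator entries above follow a fixed "artifact path | message" layout. A minimal sketch for splitting such lines when triaging a run (the helper name `parse_failure_indicator` is hypothetical, not part of topsail):

```python
# Hypothetical helper, only for illustration: split a failure-indicator
# line like "/logs/artifacts/.../FAILURE | MatrixBenchmark benchmark failed."
# into the artifact path and the error message.
def parse_failure_indicator(line):
    path, _, message = line.partition(" | ")
    return path.strip(), message.strip()

line = ("/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE"
        " | MatrixBenchmark benchmark failed.")
path, message = parse_failure_indicator(line)
print(path)     # /logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE
print(message)  # MatrixBenchmark benchmark failed.
```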

topsail-bot bot commented Dec 5, 2024

Jenkins Job #1724

🔴 Test of 'rhoai test test_ci' failed after 01 hours 11 minutes 23 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'model_name': 'mistral-7b-v0.3-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'pod_count': 1, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'qlora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/000__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'model_name': 'mistral-7b-v0.3-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'pod_count': 1, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'qlora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 143, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]
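The traceback above shows topsail's `run()` helper (run.py:105) handing the assembled shell command to `subprocess.run()`. A minimal stand-alone sketch, assuming `check=True` semantics and not the actual topsail source, of why a non-zero exit code surfaces as the `CalledProcessError: ... returned non-zero exit status 2.` captured in the failure indicator:

```python
# Hedged sketch, NOT the real run.py: with check=True, subprocess.run()
# raises CalledProcessError for any non-zero exit code, which is the
# exception text visible in the failure indicators.
import subprocess

def run(command, check=True):
    args = dict(shell=True, check=check)
    # mirrors "proc = subprocess.run(command, **args)" from the traceback
    return subprocess.run(command, **args)

try:
    run("exit 2")  # stand-in for a failing ./run_toolbox.py invocation
except subprocess.CalledProcessError as e:
    print(f"returned non-zero exit status {e.returncode}.")
```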

topsail-bot bot commented Dec 6, 2024

Jenkins Job #1725

🟢 Test of 'rhoai test test_ci' succeeded after 10 hours 14 minutes 41 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 7, 2024

Jenkins Job #1726

🔴 Test of 'rhoai test test_ci' failed after 02 hours 23 minutes 48 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]
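The `--extra="{...}"` flags in these commands carry Python-dict reprs of the per-test configuration. A hypothetical sketch of how a `_dict_to_run_toolbox_args`-style function (name taken from the traceback; the real implementation may differ) could render keyword arguments into such flags:

```python
# Assumption-laden sketch, not the real topsail helper: str() of a dict
# reproduces the Python-repr style seen in the logs, e.g.
# --extra="{'name': 'qlora', 'gpu': 4}"
def dict_to_run_toolbox_args(kwargs):
    return " ".join(f'--{key}="{value}"' for key, value in kwargs.items())

print(dict_to_run_toolbox_args({"extra": {"name": "qlora", "gpu": 4}}))
# --extra="{'name': 'qlora', 'gpu': 4}"
```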

topsail-bot bot commented Dec 9, 2024

Jenkins Job #1727

🔴 Test of 'rhoai test test_ci' failed after 02 hours 23 minutes 41 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/001_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/000__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'llama-2-13b-hf', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'lora_alpha': 16, 'max_seq_length': 512, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'use_flash_attn': True, 'target_modules': ['q_proj', 'k_proj']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 9, 2024

Jenkins Job #1728

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 07 minutes 54 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/001_prepare_ci/000__prepare1/000__install_rhoai/000__rhods__deploy_ods/FAILURE | [000__rhods__deploy_ods] ./run_toolbox.py from_config rhods deploy_ods --extra={} --> 2
/logs/artifacts/001_prepare_ci/000__prepare1/000__install_rhoai/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001_prepare_ci/000__prepare1/000__install_rhoai" ./run_toolbox.py from_config rhods deploy_ods --extra="{}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/rhods/library/prepare_rhoai.py", line 58, in install
    run.run_toolbox_from_config("rhods", "deploy_ods")
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 9, 2024

Jenkins Job #1729

🔴 Test of 'rhoai test test_ci' failed after 01 hours 48 minutes 31 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/001_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 9, 2024

Jenkins Job #1731

🔴 Test of 'rhoai test test_ci' failed after 01 hours 52 minutes 36 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 10, 2024

Jenkins Job #1732

🟢 Test of 'rhoai test test_ci' succeeded after 03 hours 54 minutes 37 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 10, 2024

Jenkins Job #1733

🟢 Test of 'rhoai test test_ci' succeeded after 00 hours 07 minutes 13 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 10, 2024

Jenkins Job #1734

🟢 Test of 'rhoai test test_ci' succeeded after 00 hours 09 minutes 15 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 10, 2024

Jenkins Job #1735

🔴 Test of 'rhoai test test_ci' failed after 08 hours 07 minutes 49 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 11, 2024

Jenkins Job #1736

🔴 Test of 'rhoai test test_ci' failed after 01 hours 56 minutes 16 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/fine-tuning/001__fine-tuning/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'fine-tuning', 'pod_count': 1, 'model_name': 'meta-llama-3.1-70b', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'gradient_accumulation_steps': 1, 'max_seq_length': 512, 'peft_method': 'none', 'per_device_train_batch_size': 1, 'use_flash_attn': True}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 11, 2024

Jenkins Job #1737

🟢 Test of 'rhoai test test_ci' succeeded after 03 hours 59 minutes 37 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 11, 2024

Jenkins Job #1738

🔴 Test of 'rhoai test test_ci' failed after 08 hours 05 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'pod_count': 1, 'model_name': 'mixtral-8x7b-instruct-v0.1-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/004__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'pod_count': 1, 'model_name': 'mixtral-8x7b-instruct-v0.1-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

openshift-ci bot commented Dec 12, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from albertoperdomo2. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

topsail-bot bot commented Dec 12, 2024

Jenkins Job #1739

🔴 Test of 'rhoai test test_ci' failed after 08 hours 05 minutes 04 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'} --> 2
/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/000_test_ci/001__matbenchmarking/qlora/001__qlora/003__fms_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'qlora', 'pod_count': 1, 'model_name': 'granite-8b-code-instruct-gptq', 'dataset_name': 'alpaca_data.json', 'gpu': 4, 'dataset_replication': 0.2, 'hyper_parameters': {'fp16': True, 'gradient_accumulation_steps': 4, 'gradient_checkpointing': True, 'lora_alpha': 16, 'max_seq_length': 512, 'max_steps': -1, 'num_train_epochs': 1, 'packing': False, 'peft_method': 'lora', 'per_device_train_batch_size': 1, 'r': 4, 'torch_dtype': 'float16', 'use_flash_attn': True, 'warmup_ratio': 0.03, 'auto_gptq': ['triton_v2'], 'target_modules': ['all-linear']}, 'dataset_transform': 'convert_alpaca.py', 'dataset_response_template': '\n### Label:'}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 148, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 49, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)

[...]

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 13, 2024

Jenkins Job #1740

🟢 Test of 'rhoai test test_ci' succeeded after 10 hours 09 minutes 20 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_4h100 gating_dgx40gb_qlora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_4h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_qlora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 16, 2024

Jenkins Job #1741

🔴 Test of 'rhoai test prepare_ci' failed after 00 hours 00 minutes 08 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test prepare_ci
PR_POSITIONAL_ARGS: cluster_rhoai_2x8h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_2x8h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 17, 2024

Jenkins Job #1744

🟢 Test of 'rhoai test test_ci' succeeded after 04 hours 25 minutes 49 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: cluster_rhoai_2x8h100 gating_dgx40gb_lora
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: cluster_rhoai_2x8h100
PR_POSITIONAL_ARG_2: gating_dgx40gb_lora

• Link to the Rebuild page.

[Test ran on the internal Perflab CI]

topsail-bot bot commented Dec 18, 2024

Jenkins Job #1746

🔴 Test of 'rhoai test test_ci' failed after 00 hours 02 minutes 48 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator:

/logs/artifacts/000_test_ci/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 402, in test
    failed = _run_test_and_visualize()
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 281, in _run_test_and_visualize
    if not prepare_rhoai_mod.is_rhoai_installed():
  File "/opt/topsail/src/projects/rhods/library/prepare_rhoai.py", line 40, in is_rhoai_installed
    installed_csv_cmd = run.run(f"oc get csv -loperators.coreos.com/{RHODS_OPERATOR_MANIFEST_NAME}.{RHODS_NAMESPACE}"
  File "/opt/topsail/src/projects/core/library/run.py", line 105, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

[Test ran on the internal Perflab CI]
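This failure happens before any benchmark runs: `_run_test_and_visualize` first calls `is_rhoai_installed`, which shells out to `oc get csv` with a label selector for the RHOAI operator's ClusterServiceVersion, and the `oc` invocation itself failed. A minimal sketch of that pre-flight check, with the manifest name and namespace values assumed for illustration and the command runner injectable so it can be exercised without a cluster:

```python
import subprocess

# Assumed values for illustration; the real constants live in
# projects/rhods/library/prepare_rhoai.py.
RHODS_OPERATOR_MANIFEST_NAME = "rhods-operator"
RHODS_NAMESPACE = "redhat-ods-operator"

def is_rhoai_installed(runner=subprocess.run):
    # Simplified sketch of the check from the traceback: query the operator's
    # CSV by its OLM label. If `oc` itself fails (no cluster access, expired
    # kubeconfig), the test aborts here, as in the log above.
    cmd = (f"oc get csv -loperators.coreos.com/"
           f"{RHODS_OPERATOR_MANIFEST_NAME}.{RHODS_NAMESPACE} "
           f"-n {RHODS_NAMESPACE} -oname")
    proc = runner(cmd, shell=True, capture_output=True, text=True)
    return proc.returncode == 0 and bool(proc.stdout.strip())
```

With an injected fake runner, the function returns `True` only when the query succeeds and names at least one CSV, which matches the "is RHOAI installed" semantics implied by the call site.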

topsail-bot bot commented Dec 19, 2024

Jenkins Job #1747

🔴 Test of 'rhoai test test_ci' failed after 00 hours 00 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

# RHOAI: run fine_tuning test test_ci
PR_POSITIONAL_ARGS: gating_dgx40gb_full
PR_POSITIONAL_ARG_0: fine_tuning-perf-ci
PR_POSITIONAL_ARG_1: gating_dgx40gb_full

• Link to the Rebuild page.

Failure indicator: Empty. (See run.log)

[Test ran on the internal Perflab CI]

@albertoperdomo2 albertoperdomo2 force-pushed the fine-tuning-blog branch 3 times, most recently from 4cc8b18 to 78fca5b Compare March 4, 2025 09:54
Labels: do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress)
Projects: None yet
2 participants