
testing new h100 cluster #673

Open · wants to merge 46 commits into base: main

Conversation

@tosokin (Collaborator) commented Feb 11, 2025

No description provided.

openshift-ci (bot) commented Feb 11, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from tosokin. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tosokin changed the title from "testing mew h100 cluster" to "testing new h100 cluster" on Feb 11, 2025
@tosokin (Collaborator, Author) commented Feb 11, 2025

/test jump-ci 2x8xh100 fine_tuning ilab_2x8xh100_pod_network
/only test_ci

topsail-bot (bot) commented Feb 11, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 00 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_2x8xh100_pod_network

Failure indicator: Empty. (See run.log)

@tosokin (Collaborator, Author) commented Feb 11, 2025

/test jump-ci 2x8xh100 fine_tuning ilab_2x8xh100_pod_network
/only test_ci

topsail-bot (bot) commented Feb 11, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 13 minutes 36 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_2x8xh100_pod_network

Failure indicator:

/logs/artifacts/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001__matbenchmarking/ilab/000__ilab/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 's3://instructlab-standalone/data/ilab_large_10000samples_skills_data.jsonl', 'storage_dir': '/dataset', 'name': 'ilab_large_10000samples_skills_data.jsonl', 'creds': '/run/secrets/PSAP_ODS_SECRET_PATH/.awscred'} --> 2
/logs/artifacts/002__plots/FAILURE | A fatal error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/003__prom_plots/FAILURE | A fatal error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 394, in test
    failed = _run_test_and_visualize()
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 322, in _run_test_and_visualize
    generate_visualization(do_matbenchmarking, test_artifact_dir_p[0])
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 373, in generate_visualization

[...]
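The `--extra={…}` token in the failure line above shows the toolbox passing a whole Python dict of overrides as a single CLI argument. A minimal sketch of that kind of serialization (the helper name and quoting scheme are assumptions for illustration, not TOPSAIL's actual `_dict_to_run_toolbox_args`):

```python
import shlex


def dict_to_extra_arg(overrides: dict) -> str:
    """Render config overrides as one --extra=... CLI token.

    Hypothetical helper mirroring the --extra={...} form in the CI log:
    repr() produces the Python-literal dict syntax, and shlex.quote()
    keeps the whole thing a single shell token.
    """
    return "--extra=" + shlex.quote(repr(overrides))


print(dict_to_extra_arg({"storage_dir": "/dataset", "gpu": 2}))
```

After shell word-splitting, the receiving script sees one `--extra=` argument whose value can be parsed back into a dict.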

@tosokin (Collaborator, Author) commented Feb 12, 2025

/test jump-ci instruct-l40s fine_tuning ilab_l40s_scale
/only test_ci

topsail-bot (bot) commented Feb 12, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 01 hours 21 minutes 58 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_l40s_scale

Failure indicator:

/logs/artifacts/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001__matbenchmarking/ilab/001__ilab/003__ilab_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'ilab', 'pod_count': 1, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'max_batch_len': 40000, 'num_epochs': 1}} --> 2
/logs/artifacts/001__matbenchmarking/ilab/001__ilab/003__ilab_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001__matbenchmarking/ilab/001__ilab/003__ilab_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'ilab', 'pod_count': 1, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'max_batch_len': 40000, 'num_epochs': 1}}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 151, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 65, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 121, in run
    proc = subprocess.run(command, **args)

[...]
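The traceback above bottoms out in `subprocess.run`, and the `CalledProcessError: Command '…' returned non-zero exit status 2` in the failure indicator is exactly what `subprocess.run` raises when called with `check=True` and the child exits nonzero. A minimal reproduction (the child command here is an illustrative stand-in, not the toolbox invocation):

```python
import subprocess

try:
    # Stand-in child process that exits with status 2, like the failing
    # run_toolbox.py command in the CI log.
    subprocess.run(["python3", "-c", "raise SystemExit(2)"], check=True)
except subprocess.CalledProcessError as exc:
    # Mirrors the log's "returned non-zero exit status 2."
    print(f"returned non-zero exit status {exc.returncode}.")
```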

@tosokin (Collaborator, Author) commented Feb 12, 2025

/test jump-ci instruct-l40s fine_tuning ilab_l40s_shard
/only test_ci

topsail-bot (bot) commented Feb 12, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 00 minutes 04 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_l40s_shard

Failure indicator: Empty. (See run.log)

@tosokin (Collaborator, Author) commented Feb 12, 2025

/test jump-ci instruct-l40s fine_tuning ilab_l40s_shard
/only test_ci

topsail-bot (bot) commented Feb 12, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 00 minutes 05 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_l40s_shard

Failure indicator: Empty. (See run.log)

@tosokin (Collaborator, Author) commented Feb 12, 2025

/test jump-ci instruct-l40s fine_tuning ilab_l40s_shard
/only test_ci

topsail-bot (bot) commented Feb 12, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 03 minutes 56 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_l40s_shard

Failure indicator:

/logs/artifacts/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001__matbenchmarking/ilab/000__ilab/003__ilab_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'ilab', 'pod_count': 4, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'fsdp_sharding_strategy': 'SHARD_GRAD_OP', 'max_batch_len': 35000, 'num_epochs': 1}, 'use_secondary_nic': ['network-port-01', 'network-port-02', 'network-port-03', 'network-port-04']} --> 2
/logs/artifacts/001__matbenchmarking/ilab/000__ilab/003__ilab_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001__matbenchmarking/ilab/000__ilab/003__ilab_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'ilab', 'pod_count': 4, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'fsdp_sharding_strategy': 'SHARD_GRAD_OP', 'max_batch_len': 35000, 'num_epochs': 1}, 'use_secondary_nic': ['network-port-01', 'network-port-02', 'network-port-03', 'network-port-04']}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 151, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 65, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 121, in run
    proc = subprocess.run(command, **args)

[...]

@tosokin (Collaborator, Author) commented Feb 12, 2025

/test jump-ci instruct-l40s fine_tuning ilab_l40s_shard
/only test_ci

topsail-bot (bot) commented Feb 12, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 26 minutes 49 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_l40s_shard

Failure indicator:

/logs/artifacts/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001__matbenchmarking/ilab/000__ilab/003__ilab_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'ilab', 'pod_count': 4, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'fsdp_sharding_strategy': 'SHARD_GRAD_OP', 'max_batch_len': 35000, 'num_epochs': 1}, 'use_secondary_nic': ['single-subnet-01', 'single-subnet-02', 'single-subnet-03', 'single-subnet-04']} --> 2
/logs/artifacts/001__matbenchmarking/ilab/000__ilab/003__ilab_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001__matbenchmarking/ilab/000__ilab/003__ilab_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'ilab', 'pod_count': 4, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'fsdp_sharding_strategy': 'SHARD_GRAD_OP', 'max_batch_len': 35000, 'num_epochs': 1}, 'use_secondary_nic': ['single-subnet-01', 'single-subnet-02', 'single-subnet-03', 'single-subnet-04']}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 151, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 65, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 121, in run
    proc = subprocess.run(command, **args)

[...]

@albertoperdomo2 (Collaborator) commented:

/test jump-ci rhoai-4xh100 fine_tuning gating_dgx40gb_full
/only test_ci

@albertoperdomo2 (Collaborator) commented:

/test jump-ci 2x8xh100 fine_tuning gating_dgx40gb_full
/var tests.fine_tuning.test_extra_settings.model_name: ibm-granite/granite-3b-code-instruct
/only test_ci
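The `/var key.path: value` line above overrides a single nested test setting by dotted path. A minimal sketch of how such an override can be applied to a config tree (hypothetical helper; TOPSAIL's actual `/var` handling may differ):

```python
def set_dotted(config: dict, dotted_key: str, value):
    """Apply a '/var a.b.c: value'-style override to a nested config dict."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        # Create intermediate levels as needed so the leaf can be set.
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


cfg = set_dotted({}, "tests.fine_tuning.test_extra_settings.model_name",
                 "ibm-granite/granite-3b-code-instruct")
```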

topsail-bot (bot) commented Feb 24, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 01 hours 47 minutes 08 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_l40s_shard

Failure indicator:

/logs/artifacts/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001__matbenchmarking/ilab/004__ilab/003__ilab_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'ilab', 'pod_count': 4, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'fsdp_sharding_strategy': 'FULL_SHARD', 'max_batch_len': 60000, 'num_epochs': 1}, 'use_secondary_nic': ['single-subnet-01', 'single-subnet-02', 'single-subnet-03', 'single-subnet-04']} --> 2
/logs/artifacts/001__matbenchmarking/ilab/004__ilab/003__ilab_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001__matbenchmarking/ilab/004__ilab/003__ilab_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'ilab', 'pod_count': 4, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'fsdp_sharding_strategy': 'FULL_SHARD', 'max_batch_len': 60000, 'num_epochs': 1}, 'use_secondary_nic': ['single-subnet-01', 'single-subnet-02', 'single-subnet-03', 'single-subnet-04']}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 151, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 65, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 121, in run
    proc = subprocess.run(command, **args)

[...]

openshift-ci (bot) commented Feb 24, 2025

@tosokin: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/jump-ci · Commit: dc4a822 · Required: true · Rerun command: /test jump-ci

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


2 participants