
testing new h100 cluster #673

Open · wants to merge 46 commits into base: main

Conversation

@tosokin (Collaborator) commented Feb 11, 2025

No description provided.

openshift-ci (bot) commented Feb 11, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from tosokin. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tosokin changed the title from "testing mew h100 cluster" to "testing new h100 cluster" on Feb 11, 2025
@tosokin (Collaborator, Author) commented Feb 11, 2025

/test jump-ci 2x8xh100 fine_tuning ilab_2x8xh100_pod_network
/only test_ci

topsail-bot (bot) commented Feb 11, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 00 minutes 07 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_2x8xh100_pod_network

Failure indicator: Empty. (See run.log)

@tosokin (Collaborator, Author) commented Feb 11, 2025

/test jump-ci 2x8xh100 fine_tuning ilab_2x8xh100_pod_network
/only test_ci

topsail-bot (bot) commented Feb 11, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 13 minutes 36 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_2x8xh100_pod_network

Failure indicator:

/logs/artifacts/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001__matbenchmarking/ilab/000__ilab/001__prepare_data/000__storage__download_to_pvc/FAILURE | [000__storage__download_to_pvc] ./run_toolbox.py from_config storage download_to_pvc --extra={'source': 's3://instructlab-standalone/data/ilab_large_10000samples_skills_data.jsonl', 'storage_dir': '/dataset', 'name': 'ilab_large_10000samples_skills_data.jsonl', 'creds': '/run/secrets/PSAP_ODS_SECRET_PATH/.awscred'} --> 2
/logs/artifacts/002__plots/FAILURE | A fatal error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/003__prom_plots/FAILURE | A fatal error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 394, in test
    failed = _run_test_and_visualize()
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 322, in _run_test_and_visualize
    generate_visualization(do_matbenchmarking, test_artifact_dir_p[0])
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 373, in generate_visualization

[...]
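The `--extra={…}` token in the failure line above shows the toolbox passing a whole Python dict of overrides as a single CLI argument. A minimal sketch of that kind of serialization (the helper name and quoting scheme are assumptions for illustration, not TOPSAIL's actual `_dict_to_run_toolbox_args`):

```python
import shlex


def dict_to_extra_arg(overrides: dict) -> str:
    """Render config overrides as one --extra=... CLI token.

    Hypothetical helper mirroring the --extra={...} form in the CI log:
    repr() produces the Python-literal dict syntax, and shlex.quote()
    keeps the whole thing a single shell token.
    """
    return "--extra=" + shlex.quote(repr(overrides))


print(dict_to_extra_arg({"storage_dir": "/dataset", "gpu": 2}))
```

After shell word-splitting, the receiving script sees one `--extra=` argument whose value can be parsed back into a dict.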

@tosokin (Collaborator, Author) commented Feb 12, 2025

/test jump-ci instruct-l40s fine_tuning ilab_l40s_scale
/only test_ci

topsail-bot (bot) commented Feb 12, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 01 hours 21 minutes 58 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_l40s_scale

Failure indicator:

/logs/artifacts/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001__matbenchmarking/ilab/001__ilab/003__ilab_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'ilab', 'pod_count': 1, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'max_batch_len': 40000, 'num_epochs': 1}} --> 2
/logs/artifacts/001__matbenchmarking/ilab/001__ilab/003__ilab_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001__matbenchmarking/ilab/001__ilab/003__ilab_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'ilab', 'pod_count': 1, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'max_batch_len': 40000, 'num_epochs': 1}}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 151, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 65, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 121, in run
    proc = subprocess.run(command, **args)

[...]
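The traceback above bottoms out in `subprocess.run`, and the `CalledProcessError: Command '…' returned non-zero exit status 2` in the failure indicator is exactly what `subprocess.run` raises when called with `check=True` and the child exits nonzero. A minimal reproduction (the child command here is an illustrative stand-in, not the toolbox invocation):

```python
import subprocess

try:
    # Stand-in child process that exits with status 2, like the failing
    # run_toolbox.py command in the CI log.
    subprocess.run(["python3", "-c", "raise SystemExit(2)"], check=True)
except subprocess.CalledProcessError as exc:
    # Mirrors the log's "returned non-zero exit status 2."
    print(f"returned non-zero exit status {exc.returncode}.")
```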

@tosokin (Collaborator, Author) commented Feb 12, 2025

/test jump-ci instruct-l40s fine_tuning ilab_l40s_shard
/only test_ci

topsail-bot (bot) commented Feb 12, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 00 minutes 04 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_l40s_shard

Failure indicator: Empty. (See run.log)

@tosokin (Collaborator, Author) commented Feb 12, 2025

/test jump-ci instruct-l40s fine_tuning ilab_l40s_shard
/only test_ci

topsail-bot (bot) commented Feb 12, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 00 minutes 05 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_l40s_shard

Failure indicator: Empty. (See run.log)

@tosokin (Collaborator, Author) commented Feb 12, 2025

/test jump-ci instruct-l40s fine_tuning ilab_l40s_shard
/only test_ci

topsail-bot (bot) commented Feb 12, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 03 minutes 56 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_l40s_shard

Failure indicator:

/logs/artifacts/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001__matbenchmarking/ilab/000__ilab/003__ilab_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'ilab', 'pod_count': 4, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'fsdp_sharding_strategy': 'SHARD_GRAD_OP', 'max_batch_len': 35000, 'num_epochs': 1}, 'use_secondary_nic': ['network-port-01', 'network-port-02', 'network-port-03', 'network-port-04']} --> 2
/logs/artifacts/001__matbenchmarking/ilab/000__ilab/003__ilab_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001__matbenchmarking/ilab/000__ilab/003__ilab_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'ilab', 'pod_count': 4, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'fsdp_sharding_strategy': 'SHARD_GRAD_OP', 'max_batch_len': 35000, 'num_epochs': 1}, 'use_secondary_nic': ['network-port-01', 'network-port-02', 'network-port-03', 'network-port-04']}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 151, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 65, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 121, in run
    proc = subprocess.run(command, **args)

[...]

@tosokin (Collaborator, Author) commented Feb 12, 2025

/test jump-ci instruct-l40s fine_tuning ilab_l40s_shard
/only test_ci

topsail-bot (bot) commented Feb 12, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 26 minutes 49 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_l40s_shard

Failure indicator:

/logs/artifacts/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001__matbenchmarking/ilab/000__ilab/003__ilab_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'ilab', 'pod_count': 4, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'fsdp_sharding_strategy': 'SHARD_GRAD_OP', 'max_batch_len': 35000, 'num_epochs': 1}, 'use_secondary_nic': ['single-subnet-01', 'single-subnet-02', 'single-subnet-03', 'single-subnet-04']} --> 2
/logs/artifacts/001__matbenchmarking/ilab/000__ilab/003__ilab_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001__matbenchmarking/ilab/000__ilab/003__ilab_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'ilab', 'pod_count': 4, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'fsdp_sharding_strategy': 'SHARD_GRAD_OP', 'max_batch_len': 35000, 'num_epochs': 1}, 'use_secondary_nic': ['single-subnet-01', 'single-subnet-02', 'single-subnet-03', 'single-subnet-04']}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 151, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 65, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 121, in run
    proc = subprocess.run(command, **args)

[...]

@albertoperdomo2 (Collaborator) commented:

/test jump-ci rhoai-4xh100 fine_tuning gating_dgx40gb_full
/only test_ci

@albertoperdomo2 (Collaborator) commented:

/test jump-ci 2x8xh100 fine_tuning gating_dgx40gb_full
/var tests.fine_tuning.test_extra_settings.model_name: ibm-granite/granite-3b-code-instruct
/only test_ci
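The `/var key.path: value` line above overrides a single nested test setting by dotted path. A minimal sketch of how such an override can be applied to a config tree (hypothetical helper; TOPSAIL's actual `/var` handling may differ):

```python
def set_dotted(config: dict, dotted_key: str, value):
    """Apply a '/var a.b.c: value'-style override to a nested config dict."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        # Create intermediate levels as needed so the leaf can be set.
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


cfg = set_dotted({}, "tests.fine_tuning.test_extra_settings.model_name",
                 "ibm-granite/granite-3b-code-instruct")
```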

topsail-bot (bot) commented Feb 24, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 01 hours 47 minutes 08 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ilab_l40s_shard

Failure indicator:

/logs/artifacts/001__matbenchmarking/FAILURE | MatrixBenchmark benchmark failed.
/logs/artifacts/001__matbenchmarking/ilab/004__ilab/003__ilab_fine_tuning/000__fine_tuning__run_fine_tuning_job/FAILURE | [000__fine_tuning__run_fine_tuning_job] ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra={'name': 'ilab', 'pod_count': 4, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'fsdp_sharding_strategy': 'FULL_SHARD', 'max_batch_len': 60000, 'num_epochs': 1}, 'use_secondary_nic': ['single-subnet-01', 'single-subnet-02', 'single-subnet-03', 'single-subnet-04']} --> 2
/logs/artifacts/001__matbenchmarking/ilab/004__ilab/003__ilab_fine_tuning/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/001__matbenchmarking/ilab/004__ilab/003__ilab_fine_tuning" ./run_toolbox.py from_config fine_tuning run_fine_tuning_job --extra="{'name': 'ilab', 'pod_count': 4, 'model_name': 'granite-3.0-8b-instruct', 'dataset_name': 'ilab_large_10000samples_skills_data.jsonl', 'gpu': 2, 'shared_memory': 20, 'ephemeral_output_pvc_size': '500Gi', 'hyper_parameters': {'cpu_offload_optimizer': True, 'cpu_offload_params': True, 'fsdp_sharding_strategy': 'FULL_SHARD', 'max_batch_len': 60000, 'num_epochs': 1}, 'use_secondary_nic': ['single-subnet-01', 'single-subnet-02', 'single-subnet-03', 'single-subnet-04']}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 151, in _run_test
    run.run_toolbox_from_config("fine_tuning", "run_fine_tuning_job",
  File "/opt/topsail/src/projects/core/library/run.py", line 65, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 121, in run
    proc = subprocess.run(command, **args)

[...]

openshift-ci (bot) commented Feb 24, 2025

@tosokin: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/jump-ci · Commit: dc4a822 · Required: true · Rerun command: /test jump-ci

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


2 participants