[fine_tuning] ilab: try some changes #657

Open · wants to merge 1 commit into main

Conversation

@tosokin (Collaborator) commented on Jan 27, 2025

No description provided.

openshift-ci bot commented on Jan 27, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign drewrip for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tosokin (Collaborator, Author) commented on Jan 27, 2025

/test jump-ci 2x8xh100 fine_tuning ray_bench__iperf
/only test_ci

topsail-bot (bot) commented on Jan 27, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 04 minutes 14 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ray_bench__iperf

Failure indicator:

/logs/artifacts/003__ray__ray-benchmark/000__fine_tuning__ray_fine_tuning_job/FAILURE | [000__fine_tuning__ray_fine_tuning_job] ./run_toolbox.py from_config fine_tuning ray_fine_tuning_job --extra={'name': 'ray', 'pod_count': 2, 'gpu': 0, 'ephemeral_output_pvc_size': '500Gi', 'node_selector_key': 'nvidia.com/gpu.present', 'node_selector_value': 'true', 'hyper_parameters': {'flavor': 'iperf'}} --> 2
/logs/artifacts/003__ray__ray-benchmark/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/003__ray__ray-benchmark" ./run_toolbox.py from_config fine_tuning ray_fine_tuning_job --extra="{'name': 'ray', 'pod_count': 2, 'gpu': 0, 'ephemeral_output_pvc_size': '500Gi', 'node_selector_key': 'nvidia.com/gpu.present', 'node_selector_value': 'true', 'hyper_parameters': {'flavor': 'iperf'}}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 154, in _run_test
    run.run_toolbox_from_config(
  File "/opt/topsail/src/projects/core/library/run.py", line 65, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 121, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]
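The truncated traceback shows the invocation path: run_toolbox_from_config renders its keyword arguments into a ./run_toolbox.py from_config command line and hands it to subprocess.run, so a non-zero exit from the toolbox surfaces as the CalledProcessError recorded in the FAILURE file. A minimal sketch of that pattern (only the function names come from projects/core/library/run.py; the bodies are assumptions for illustration, not the actual TOPSAIL implementation):

    # Sketch of the wrapper pattern visible in the traceback above.
    # Only the function names come from projects/core/library/run.py;
    # the bodies are assumptions for illustration.
    import subprocess

    def _dict_to_run_toolbox_args(kwargs):
        # Render each keyword argument as a --key="value" flag, matching the
        # --extra="{...}" form shown in the failure indicator.
        return " ".join(f'--{k}="{v}"' for k, v in kwargs.items())

    def run(command):
        # The 'set -o errexit; ...' prefix in the log suggests the command runs
        # through a shell with strict error handling; check=True makes
        # subprocess.run raise CalledProcessError on a non-zero exit status.
        strict = "set -o errexit; set -o pipefail; set -o nounset; set -o errtrace; "
        return subprocess.run(strict + command, shell=True, check=True)

    def run_toolbox_from_config(group, command, **kwargs):
        return run(f"./run_toolbox.py from_config {group} {command} "
                   f"{_dict_to_run_toolbox_args(kwargs)}")

    # e.g. run_toolbox_from_config("fine_tuning", "ray_fine_tuning_job",
    #                              extra={"name": "ray", "pod_count": 2, "gpu": 0})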

@kpouget (Contributor) commented on Jan 27, 2025

image pull took too long ... retrying

NAME                                    READY   STATUS              RESTARTS   AGE     IP       NODE               NOMINATED NODE   READINESS GATES
ray-raycluster-nmpgh-head-f8bn5         0/1     ContainerCreating   0          3m14s   <none>   gx3d-8h100-kp6hs   <none>           <none>
ray-raycluster-nmpgh-worker-ray-zhnxz   0/1     Init:0/1            0          3m14s   <none>   gx3d-8h100-5kncp   <none>           <none>

/test jump-ci 2x8xh100 fine_tuning ray_bench__iperf
/only test_ci
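Since this failure is a race against the image pull, one option is to poll the Ray pods until they report Ready before re-triggering the benchmark. A minimal sketch, assuming a kubectl-based wait (the namespace and label selector below are assumptions; only the pod states come from the output above):

    # Hypothetical readiness poll before re-triggering the test. The namespace
    # and label selector are assumptions; only the Ray pod states
    # (ContainerCreating / Init:0/1) come from the kubectl output above.
    import json, subprocess, time

    def wait_for_ray_pods(namespace="fine-tuning",
                          selector="ray.io/cluster=ray-raycluster-nmpgh",
                          timeout_s=900, interval_s=15):
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            out = subprocess.run(
                ["kubectl", "get", "pods", "-n", namespace, "-l", selector,
                 "-o", "json"],
                check=True, capture_output=True, text=True).stdout
            pods = json.loads(out)["items"]
            ready = [p for p in pods
                     if any(c.get("type") == "Ready" and c.get("status") == "True"
                            for c in p["status"].get("conditions", []))]
            if pods and len(ready) == len(pods):
                return True  # all pods Ready, safe to start the benchmark
            time.sleep(interval_s)
        return False  # still pulling images after the timeout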

topsail-bot (bot) commented on Jan 27, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 04 minutes 17 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ray_bench__iperf

Failure indicator:

/logs/artifacts/003__ray__ray-benchmark/000__fine_tuning__ray_fine_tuning_job/FAILURE | [000__fine_tuning__ray_fine_tuning_job] ./run_toolbox.py from_config fine_tuning ray_fine_tuning_job --extra={'name': 'ray', 'pod_count': 2, 'gpu': 0, 'ephemeral_output_pvc_size': '500Gi', 'node_selector_key': 'nvidia.com/gpu.present', 'node_selector_value': 'true', 'hyper_parameters': {'flavor': 'iperf'}} --> 2
/logs/artifacts/003__ray__ray-benchmark/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/003__ray__ray-benchmark" ./run_toolbox.py from_config fine_tuning ray_fine_tuning_job --extra="{'name': 'ray', 'pod_count': 2, 'gpu': 0, 'ephemeral_output_pvc_size': '500Gi', 'node_selector_key': 'nvidia.com/gpu.present', 'node_selector_value': 'true', 'hyper_parameters': {'flavor': 'iperf'}}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 154, in _run_test
    run.run_toolbox_from_config(
  File "/opt/topsail/src/projects/core/library/run.py", line 65, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 121, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

@tosokin (Collaborator, Author) commented on Jan 27, 2025

/test jump-ci 2x8xh100 fine_tuning ray_bench__iperf

topsail-bot (bot) commented on Jan 27, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 02 minutes 59 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ray_bench__iperf

Failure indicator:

/logs/artifacts/004__plots/FAILURE | A fatal error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 394, in test
    failed = _run_test_and_visualize()
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 322, in _run_test_and_visualize
    generate_visualization(do_matbenchmarking, test_artifact_dir_p[0])
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 357, in generate_visualization
    raise exc
  File "/opt/topsail/src/projects/core/library/run.py", line 194, in run_and_catch
    fct(*args, **kwargs)

[...]
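This run failed later than the previous two: the benchmark itself ran, but the results-parsing step aborted the visualization (0_matbench_parse.log) and run_and_catch re-raised the exception. A sketch of that catch-and-re-raise pattern as suggested by the traceback (an illustration only, not the actual run.py code):

    # Illustration of the run_and_catch / re-raise pattern suggested by the
    # traceback above; not the actual projects/core/library/run.py code.
    def run_and_catch(exc, fct, *args, **kwargs):
        # Run the step, keep the first exception instead of aborting immediately,
        # and hand it back to the caller.
        try:
            fct(*args, **kwargs)
        except Exception as e:
            exc = exc or e
        return exc

    def parse_results(results_dir):
        # Placeholder standing in for the MatrixBenchmarking parsing step
        # that wrote 0_matbench_parse.log before failing.
        raise RuntimeError(f"fatal error while parsing {results_dir}")

    def generate_visualization(results_dir):
        exc = None
        exc = run_and_catch(exc, parse_results, results_dir)
        if exc:
            raise exc  # surfaces as the /logs/artifacts/FAILURE indicator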

openshift-ci bot commented on Jan 27, 2025

@tosokin: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name        Commit   Details  Required  Rerun command
ci/prow/jump-ci  cf825f4  link     true      /test jump-ci

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kpouget changed the title from "make some change" to "[fine_tuning] ilab: try some changes" on Jan 27, 2025