[fine_tuning] ilab: try some changes #657

Open · wants to merge 1 commit into main

Conversation

@tosokin (Collaborator) commented on Jan 27, 2025

No description provided.

openshift-ci bot commented on Jan 27, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign drewrip for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tosokin (Collaborator, Author) commented on Jan 27, 2025

/test jump-ci 2x8xh100 fine_tuning ray_bench__iperf
/only test_ci

topsail-bot (bot) commented on Jan 27, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 04 minutes 14 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ray_bench__iperf

Failure indicator:

/logs/artifacts/003__ray__ray-benchmark/000__fine_tuning__ray_fine_tuning_job/FAILURE | [000__fine_tuning__ray_fine_tuning_job] ./run_toolbox.py from_config fine_tuning ray_fine_tuning_job --extra={'name': 'ray', 'pod_count': 2, 'gpu': 0, 'ephemeral_output_pvc_size': '500Gi', 'node_selector_key': 'nvidia.com/gpu.present', 'node_selector_value': 'true', 'hyper_parameters': {'flavor': 'iperf'}} --> 2
/logs/artifacts/003__ray__ray-benchmark/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/003__ray__ray-benchmark" ./run_toolbox.py from_config fine_tuning ray_fine_tuning_job --extra="{'name': 'ray', 'pod_count': 2, 'gpu': 0, 'ephemeral_output_pvc_size': '500Gi', 'node_selector_key': 'nvidia.com/gpu.present', 'node_selector_value': 'true', 'hyper_parameters': {'flavor': 'iperf'}}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 154, in _run_test
    run.run_toolbox_from_config(
  File "/opt/topsail/src/projects/core/library/run.py", line 65, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 121, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]
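The truncated traceback shows the invocation path: run_toolbox_from_config renders its keyword arguments into a ./run_toolbox.py from_config command line and hands it to subprocess.run, so a non-zero exit from the toolbox surfaces as the CalledProcessError recorded in the FAILURE file. A minimal sketch of that pattern (only the function names come from projects/core/library/run.py; the bodies are assumptions for illustration, not the actual TOPSAIL implementation):

    # Sketch of the wrapper pattern visible in the traceback above.
    # Only the function names come from projects/core/library/run.py;
    # the bodies are assumptions for illustration.
    import subprocess

    def _dict_to_run_toolbox_args(kwargs):
        # Render each keyword argument as a --key="value" flag, matching the
        # --extra="{...}" form shown in the failure indicator.
        return " ".join(f'--{k}="{v}"' for k, v in kwargs.items())

    def run(command):
        # The 'set -o errexit; ...' prefix in the log suggests the command runs
        # through a shell with strict error handling; check=True makes
        # subprocess.run raise CalledProcessError on a non-zero exit status.
        strict = "set -o errexit; set -o pipefail; set -o nounset; set -o errtrace; "
        return subprocess.run(strict + command, shell=True, check=True)

    def run_toolbox_from_config(group, command, **kwargs):
        return run(f"./run_toolbox.py from_config {group} {command} "
                   f"{_dict_to_run_toolbox_args(kwargs)}")

    # e.g. run_toolbox_from_config("fine_tuning", "ray_fine_tuning_job",
    #                              extra={"name": "ray", "pod_count": 2, "gpu": 0})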

@kpouget (Contributor) commented on Jan 27, 2025

image pull took too long ... retrying

NAME                                    READY   STATUS              RESTARTS   AGE     IP       NODE               NOMINATED NODE   READINESS GATES
ray-raycluster-nmpgh-head-f8bn5         0/1     ContainerCreating   0          3m14s   <none>   gx3d-8h100-kp6hs   <none>           <none>
ray-raycluster-nmpgh-worker-ray-zhnxz   0/1     Init:0/1            0          3m14s   <none>   gx3d-8h100-5kncp   <none>           <none>

/test jump-ci 2x8xh100 fine_tuning ray_bench__iperf
/only test_ci
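Since this failure is a race against the image pull, one option is to poll the Ray pods until they report Ready before re-triggering the benchmark. A minimal sketch, assuming a kubectl-based wait (the namespace and label selector below are assumptions; only the pod states come from the output above):

    # Hypothetical readiness poll before re-triggering the test. The namespace
    # and label selector are assumptions; only the Ray pod states
    # (ContainerCreating / Init:0/1) come from the kubectl output above.
    import json, subprocess, time

    def wait_for_ray_pods(namespace="fine-tuning",
                          selector="ray.io/cluster=ray-raycluster-nmpgh",
                          timeout_s=900, interval_s=15):
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            out = subprocess.run(
                ["kubectl", "get", "pods", "-n", namespace, "-l", selector,
                 "-o", "json"],
                check=True, capture_output=True, text=True).stdout
            pods = json.loads(out)["items"]
            ready = [p for p in pods
                     if any(c.get("type") == "Ready" and c.get("status") == "True"
                            for c in p["status"].get("conditions", []))]
            if pods and len(ready) == len(pods):
                return True  # all pods Ready, safe to start the benchmark
            time.sleep(interval_s)
        return False  # still pulling images after the timeout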

topsail-bot (bot) commented on Jan 27, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 04 minutes 17 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ray_bench__iperf

Failure indicator:

/logs/artifacts/003__ray__ray-benchmark/000__fine_tuning__ray_fine_tuning_job/FAILURE | [000__fine_tuning__ray_fine_tuning_job] ./run_toolbox.py from_config fine_tuning ray_fine_tuning_job --extra={'name': 'ray', 'pod_count': 2, 'gpu': 0, 'ephemeral_output_pvc_size': '500Gi', 'node_selector_key': 'nvidia.com/gpu.present', 'node_selector_value': 'true', 'hyper_parameters': {'flavor': 'iperf'}} --> 2
/logs/artifacts/003__ray__ray-benchmark/FAILURE | CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/003__ray__ray-benchmark" ./run_toolbox.py from_config fine_tuning ray_fine_tuning_job --extra="{'name': 'ray', 'pod_count': 2, 'gpu': 0, 'ephemeral_output_pvc_size': '500Gi', 'node_selector_key': 'nvidia.com/gpu.present', 'node_selector_value': 'true', 'hyper_parameters': {'flavor': 'iperf'}}"' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 154, in _run_test
    run.run_toolbox_from_config(
  File "/opt/topsail/src/projects/core/library/run.py", line 65, in run_toolbox_from_config
    return run(f'{cmd_env} ./run_toolbox.py from_config {group} {command} {_dict_to_run_toolbox_args(kwargs)}', **run_kwargs)
  File "/opt/topsail/src/projects/core/library/run.py", line 121, in run
    proc = subprocess.run(command, **args)
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run

[...]

@tosokin (Collaborator, Author) commented on Jan 27, 2025

/test jump-ci 2x8xh100 fine_tuning ray_bench__iperf

topsail-bot (bot) commented on Jan 27, 2025

🔴 Test of 'fine_tuning test test_ci' failed after 00 hours 02 minutes 59 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: ray_bench__iperf

Failure indicator:

/logs/artifacts/004__plots/FAILURE | A fatal error happened during the results parsing, aborting the visualization (0_matbench_parse.log).
/logs/artifacts/FAILURE | Traceback (most recent call last):
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 394, in test
    failed = _run_test_and_visualize()
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 322, in _run_test_and_visualize
    generate_visualization(do_matbenchmarking, test_artifact_dir_p[0])
  File "/opt/topsail/src/projects/fine_tuning/testing/test_finetuning.py", line 357, in generate_visualization
    raise exc
  File "/opt/topsail/src/projects/core/library/run.py", line 194, in run_and_catch
    fct(*args, **kwargs)

[...]
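This run failed later than the previous two: the benchmark itself ran, but the results-parsing step aborted the visualization (0_matbench_parse.log) and run_and_catch re-raised the exception. A sketch of that catch-and-re-raise pattern as suggested by the traceback (an illustration only, not the actual run.py code):

    # Illustration of the run_and_catch / re-raise pattern suggested by the
    # traceback above; not the actual projects/core/library/run.py code.
    def run_and_catch(exc, fct, *args, **kwargs):
        # Run the step, keep the first exception instead of aborting immediately,
        # and hand it back to the caller.
        try:
            fct(*args, **kwargs)
        except Exception as e:
            exc = exc or e
        return exc

    def parse_results(results_dir):
        # Placeholder standing in for the MatrixBenchmarking parsing step
        # that wrote 0_matbench_parse.log before failing.
        raise RuntimeError(f"fatal error while parsing {results_dir}")

    def generate_visualization(results_dir):
        exc = None
        exc = run_and_catch(exc, parse_results, results_dir)
        if exc:
            raise exc  # surfaces as the /logs/artifacts/FAILURE indicator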

openshift-ci bot commented on Jan 27, 2025

@tosokin: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name        Commit   Details  Required  Rerun command
ci/prow/jump-ci  cf825f4  link     true      /test jump-ci

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kpouget changed the title from "make some change" to "[fine_tuning] ilab: try some changes" on Jan 27, 2025