
[WIP] Kserve: RHOAI 2.18.0 RC performance gating #685

Open
wants to merge 2 commits into base: main
Conversation

mcharanrm (Collaborator)

The email didn't include a tag, so I used a custom tag together with the valid SHA digest that was provided. Clients always pull the image by its SHA digest, so if a valid tag is given but the digest is invalid, the image pull will fail.

When both a tag and a digest are provided, only the SHA digest is used to resolve the image.
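As a minimal sketch of that resolution rule (the digest below is the real one from the failure log further down; the tag `custom-tag` is a hypothetical placeholder), a registry client that sees both a tag and a digest in an image reference pulls by the digest and treats the tag as informational only:

```python
# Sketch of how a client resolves an image reference that carries both
# a tag and a digest: the digest wins, the tag is informational.
def resolve_pull_target(ref: str) -> str:
    """Return the part of the reference a client actually pulls by."""
    if "@" in ref:
        # Tag (if any) is ignored once a digest is present.
        repo_and_tag, digest = ref.rsplit("@", 1)
        repo = repo_and_tag.split(":")[0]
        return f"{repo}@{digest}"
    return ref  # tag-only reference: pull by tag

ref = ("quay.io/modh/vllm:custom-tag"
       "@sha256:4f1f6b5738b311332b2bc786ea71259872e570081807592d97b4bd4cb65c4be1")
print(resolve_pull_target(ref))
# -> quay.io/modh/vllm@sha256:4f1f6b5738b311332b2bc786ea71259872e570081807592d97b4bd4cb65c4be1
```

This is why an invalid digest breaks the pull even when the tag is valid: the tag never enters the lookup once a digest is attached.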

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 20, 2025

openshift-ci bot commented Feb 20, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign drewrip for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mcharanrm (Collaborator, Author)

Going ahead with a full test, as I believe all the key changes related to the OpenShift AI operator were already released in 2.17.0, so I don't expect any surprises.

/test jump-ci rhoai-4xh100 kserve vllm_cpt_single_model_gating


topsail-bot bot commented Feb 20, 2025

🔴 Test of 'kserve test test_ci' failed after 06 hours 34 minutes 08 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: vllm_cpt_single_model_gating

Failure indicator:

/logs/artifacts/000__local_ci__run_multi_e2e_perf_test/FAILURE | [000__local_ci__run_multi_e2e_perf_test] ./run_toolbox.py from_config local_ci run_multi --suffix=deploy_and_test_sequentially --extra={} --> 2
/logs/artifacts/000__local_ci__run_multi_e2e_perf_test/artifacts/ci-pod-0/004__mpt-7b-instruct2/002__kserve__deploy_model/FAILURE | [002__kserve__deploy_model] ./run_toolbox.py kserve deploy_model --namespace=kserve-e2e-perf --runtime=vllm --model_name=mpt-7b-instruct2 --sr_name=vllm --sr_kserve_image=quay.io/modh/vllm@sha256:4f1f6b5738b311332b2bc786ea71259872e570081807592d97b4bd4cb65c4be1 --inference_service_name=mpt-7b-instruct2 --delete_others=True --raw_deployment=True --> 2
/logs/artifacts/000__local_ci__run_multi_e2e_perf_test/artifacts/ci-pod-0/004__mpt-7b-instruct2/FAILURE | mpt-7b-instruct2 failed: CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/004__mpt-7b-instruct2" ./run_toolbox.py kserve deploy_model --namespace='kserve-e2e-perf' --runtime='vllm' --model_name='mpt-7b-instruct2' --sr_name='vllm' --sr_kserve_image='quay.io/modh/vllm@sha256:4f1f6b5738b311332b2bc786ea71259872e570081807592d97b4bd4cb65c4be1' --inference_service_name='mpt-7b-instruct2' --delete_others='True' --raw_deployment='True'' returned non-zero exit status 2.
/logs/artifacts/003__plots/000__projects.kserve.visualizations.kserve-llm_plots/FAILURE | An error happened during the visualization post-processing ... (regression detected in /logs/artifacts/003__plots/000__projects.kserve.visualizations.kserve-llm_plots)
RuntimeError: An error happened during the visualization post-processing ... (regression detected in /logs/artifacts/003__plots/000__projects.kserve.visualizations.kserve-llm_plots)
Traceback (most recent call last):
  File "/opt/topsail/src/projects/kserve/testing/test.py", line 98, in test_ci
    test_e2e.test_ci()
  File "/opt/topsail/src/projects/kserve/testing/test_e2e.py", line 110, in test_ci
    single_model_deploy_and_test_sequentially(locally=False)

[...]

@mcharanrm (Collaborator, Author)

Removed the mpt-7b-instruct2 model from the kserve model-serving performance gating test for 2.18.0 RC1, as it is quite outdated. We are also discussing the deployment issues we hit with this model on the latest vLLM version, 'V0 LLM engine (v0.7.3.dev291+gd7637aaec)', to determine whether a bug report needs to be filed.

Re-launching the kserve model-serving gating test to regenerate the plots and the regression-analysis report for the Prometheus data, which previously failed to generate due to errors.

/test jump-ci rhoai-4xh100 kserve vllm_cpt_single_model_gating
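For context on what "regression detected" means in the failure indicators above, here is a hypothetical sketch of a threshold-based regression check over a baseline metric; the real matrix_benchmarking post-processing logic is more involved and may differ:

```python
# Hypothetical sketch: flag a regression when the current value of a
# "higher is worse" metric (e.g. latency) exceeds the baseline by more
# than a fractional tolerance. Illustrative only; not the actual
# matrix_benchmarking implementation.
def is_regression(baseline: float, current: float, tolerance: float = 0.10) -> bool:
    """Return True when `current` is more than `tolerance` worse than `baseline`."""
    return current > baseline * (1 + tolerance)

print(is_regression(100.0, 105.0))  # within the 10% tolerance -> False
print(is_regression(100.0, 120.0))  # 20% worse than baseline -> True
```

In the gating run above, a check of this kind failing is what turns the visualization post-processing step red even when the plots themselves render.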


topsail-bot bot commented Feb 25, 2025

🔴 Test of 'kserve test test_ci' failed after 06 hours 20 minutes 44 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: vllm_cpt_single_model_gating

Failure indicator:

/logs/artifacts/003__plots/000__projects.kserve.visualizations.kserve-llm_plots/FAILURE | An error happened during the visualization post-processing ... (regression detected in /logs/artifacts/003__plots/000__projects.kserve.visualizations.kserve-llm_plots)
RuntimeError: An error happened during the visualization post-processing ... (regression detected in /logs/artifacts/003__plots/000__projects.kserve.visualizations.kserve-llm_plots)
Traceback (most recent call last):
  File "/opt/topsail/src/projects/kserve/testing/test.py", line 237, in generate_plots
    visualize.generate_from_dir(str(results_dirname))
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 73, in wrapper
    fct(*args, **kwargs)
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 472, in generate_from_dir
    generate_visualizations(results_dirname, generate_lts=generate_lts)
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 73, in wrapper

[...]


openshift-ci bot commented Feb 25, 2025

@mcharanrm: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/jump-ci · Commit: 65aaa4f · Details: link · Required: true · Rerun command: /test jump-ci

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
1 participant