
[WIP] Kserve: RHOAI 2.18.0 RC performance gating #685

Open
wants to merge 2 commits into base: main
Conversation

mcharanrm (Collaborator)

The email didn't include a tag, so I used a custom tag together with the valid SHA digest that was provided. Clients always pull the image by its SHA digest, so if a valid tag is given but the digest is invalid, the image pull will fail.

When both a tag and a digest are provided, only the SHA digest is used to resolve the image.
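As a minimal sketch of that resolution rule (the digest below is the real one from the failure log further down; the tag `custom-tag` is a hypothetical placeholder), a registry client that sees both a tag and a digest in an image reference pulls by the digest and treats the tag as informational only:

```python
# Sketch of how a client resolves an image reference that carries both
# a tag and a digest: the digest wins, the tag is informational.
def resolve_pull_target(ref: str) -> str:
    """Return the part of the reference a client actually pulls by."""
    if "@" in ref:
        # Tag (if any) is ignored once a digest is present.
        repo_and_tag, digest = ref.rsplit("@", 1)
        repo = repo_and_tag.split(":")[0]
        return f"{repo}@{digest}"
    return ref  # tag-only reference: pull by tag

ref = ("quay.io/modh/vllm:custom-tag"
       "@sha256:4f1f6b5738b311332b2bc786ea71259872e570081807592d97b4bd4cb65c4be1")
print(resolve_pull_target(ref))
# -> quay.io/modh/vllm@sha256:4f1f6b5738b311332b2bc786ea71259872e570081807592d97b4bd4cb65c4be1
```

This is why an invalid digest breaks the pull even when the tag is valid: the tag never enters the lookup once a digest is attached.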

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 20, 2025

openshift-ci bot commented Feb 20, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign drewrip for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mcharanrm (Collaborator, Author)

Going ahead with a full test, as I believe all the key changes related to the OpenShift AI operator were already released in 2.17.0, so I don't expect any surprises.

/test jump-ci rhoai-4xh100 kserve vllm_cpt_single_model_gating


topsail-bot bot commented Feb 20, 2025

🔴 Test of 'kserve test test_ci' failed after 06 hours 34 minutes 08 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: vllm_cpt_single_model_gating

Failure indicator:

/logs/artifacts/000__local_ci__run_multi_e2e_perf_test/FAILURE | [000__local_ci__run_multi_e2e_perf_test] ./run_toolbox.py from_config local_ci run_multi --suffix=deploy_and_test_sequentially --extra={} --> 2
/logs/artifacts/000__local_ci__run_multi_e2e_perf_test/artifacts/ci-pod-0/004__mpt-7b-instruct2/002__kserve__deploy_model/FAILURE | [002__kserve__deploy_model] ./run_toolbox.py kserve deploy_model --namespace=kserve-e2e-perf --runtime=vllm --model_name=mpt-7b-instruct2 --sr_name=vllm --sr_kserve_image=quay.io/modh/vllm@sha256:4f1f6b5738b311332b2bc786ea71259872e570081807592d97b4bd4cb65c4be1 --inference_service_name=mpt-7b-instruct2 --delete_others=True --raw_deployment=True --> 2
/logs/artifacts/000__local_ci__run_multi_e2e_perf_test/artifacts/ci-pod-0/004__mpt-7b-instruct2/FAILURE | mpt-7b-instruct2 failed: CalledProcessError: Command 'set -o errexit;set -o pipefail;set -o nounset;set -o errtrace;ARTIFACT_DIR="/logs/artifacts/004__mpt-7b-instruct2" ./run_toolbox.py kserve deploy_model --namespace='kserve-e2e-perf' --runtime='vllm' --model_name='mpt-7b-instruct2' --sr_name='vllm' --sr_kserve_image='quay.io/modh/vllm@sha256:4f1f6b5738b311332b2bc786ea71259872e570081807592d97b4bd4cb65c4be1' --inference_service_name='mpt-7b-instruct2' --delete_others='True' --raw_deployment='True'' returned non-zero exit status 2.
/logs/artifacts/003__plots/000__projects.kserve.visualizations.kserve-llm_plots/FAILURE | An error happened during the visualization post-processing ... (regression detected in /logs/artifacts/003__plots/000__projects.kserve.visualizations.kserve-llm_plots)
RuntimeError: An error happened during the visualization post-processing ... (regression detected in /logs/artifacts/003__plots/000__projects.kserve.visualizations.kserve-llm_plots)
Traceback (most recent call last):
  File "/opt/topsail/src/projects/kserve/testing/test.py", line 98, in test_ci
    test_e2e.test_ci()
  File "/opt/topsail/src/projects/kserve/testing/test_e2e.py", line 110, in test_ci
    single_model_deploy_and_test_sequentially(locally=False)

[...]

@mcharanrm (Collaborator, Author)

Removed the mpt-7b-instruct2 model from the kserve model-serving performance gating test for 2.18.0 RC1, as it is quite outdated. We are also discussing the deployment issues we hit with this model on the latest vLLM version, 'V0 LLM engine (v0.7.3.dev291+gd7637aaec)', to determine whether a bug report needs to be filed.

Re-launching the kserve model-serving gating test to regenerate the plots and the regression-analysis report for the Prometheus data, which previously failed to generate due to errors.

/test jump-ci rhoai-4xh100 kserve vllm_cpt_single_model_gating
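For context on what "regression detected" means in the failure indicators above, here is a hypothetical sketch of a threshold-based regression check over a baseline metric; the real matrix_benchmarking post-processing logic is more involved and may differ:

```python
# Hypothetical sketch: flag a regression when the current value of a
# "higher is worse" metric (e.g. latency) exceeds the baseline by more
# than a fractional tolerance. Illustrative only; not the actual
# matrix_benchmarking implementation.
def is_regression(baseline: float, current: float, tolerance: float = 0.10) -> bool:
    """Return True when `current` is more than `tolerance` worse than `baseline`."""
    return current > baseline * (1 + tolerance)

print(is_regression(100.0, 105.0))  # within the 10% tolerance -> False
print(is_regression(100.0, 120.0))  # 20% worse than baseline -> True
```

In the gating run above, a check of this kind failing is what turns the visualization post-processing step red even when the plots themselves render.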


topsail-bot bot commented Feb 25, 2025

🔴 Test of 'kserve test test_ci' failed after 06 hours 20 minutes 44 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_1: vllm_cpt_single_model_gating

Failure indicator:

/logs/artifacts/003__plots/000__projects.kserve.visualizations.kserve-llm_plots/FAILURE | An error happened during the visualization post-processing ... (regression detected in /logs/artifacts/003__plots/000__projects.kserve.visualizations.kserve-llm_plots)
RuntimeError: An error happened during the visualization post-processing ... (regression detected in /logs/artifacts/003__plots/000__projects.kserve.visualizations.kserve-llm_plots)
Traceback (most recent call last):
  File "/opt/topsail/src/projects/kserve/testing/test.py", line 237, in generate_plots
    visualize.generate_from_dir(str(results_dirname))
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 73, in wrapper
    fct(*args, **kwargs)
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 472, in generate_from_dir
    generate_visualizations(results_dirname, generate_lts=generate_lts)
  File "/opt/topsail/src/projects/matrix_benchmarking/library/visualize.py", line 73, in wrapper

[...]


openshift-ci bot commented Feb 25, 2025

@mcharanrm: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/jump-ci · Commit: 65aaa4f · Details: link · Required: true · Rerun command: /test jump-ci

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
1 participant