[CI][sharktank] Move Sharktank Data-Dependent Tests to OSSCI Cluster #932

Draft
renxida wants to merge 1 commit into main from sharktank-datadependent-tests-to-ossci

Conversation

renxida
Contributor

@renxida renxida commented Feb 7, 2025

We've been getting memory-allocation-related errors, possibly due to oversaturation on shark-mi300x-3.

@renxida renxida requested a review from ScottTodd February 7, 2025 17:03
@renxida renxida marked this pull request as ready for review February 7, 2025 17:03
@renxida renxida marked this pull request as draft February 7, 2025 17:03
@renxida
Contributor Author

renxida commented Feb 7, 2025

Needs #926; currently failing because there is no directory to cache Hugging Face downloads.

@sogartar
Contributor

These tests require some data to be available on the runner.

The premise that we don't have enough resources on the machine seems wrong.
What we need is a proper allocation of GPUs to jobs/runners, plus a mandatory reservation system that users must adhere to: if you don't request a GPU, you don't see any, for example when running iree-run-module --list_devices.
I sometimes catch myself not specifying the correct GPU.
Maybe we could build a service that uses the ROCR_VISIBLE_DEVICES env var, or use Linux groups to control access.
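
A rough sketch of the ROCR_VISIBLE_DEVICES idea, assuming a job already knows which GPU indices it reserved. The helper names `reserved_env` and `run_reserved` are hypothetical and not part of any existing tool; how reservations are actually tracked is left open.

```python
import os
import subprocess
from typing import Mapping, Optional, Sequence


def reserved_env(
    gpu_ids: Sequence[int],
    base_env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a copy of the environment that exposes only the reserved GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    # ROCR_VISIBLE_DEVICES takes a comma-separated list of device indices.
    # Assumption: an empty value hides every GPU; verify this on the runner.
    env["ROCR_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_reserved(cmd: Sequence[str], gpu_ids: Sequence[int]) -> int:
    """Run cmd so that it only sees the GPUs it reserved (hypothetical helper)."""
    result = subprocess.run(list(cmd), env=reserved_env(gpu_ids), check=False)
    return result.returncode


# Example (not executed here): list devices while holding a reservation for GPU 3.
# run_reserved(["iree-run-module", "--list_devices"], [3])
```

This only restricts what a cooperating process sees. Enforcing it for every user would still need something like the Linux-groups approach.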

@sogartar
Contributor

sogartar commented Feb 10, 2025

I would also like to move these tests to nightly soon, since we should strive to run only fast tests on every PR.

@renxida renxida marked this pull request as ready for review February 10, 2025 16:58
@renxida renxida marked this pull request as draft February 10, 2025 16:58
@renxida renxida force-pushed the sharktank-datadependent-tests-to-ossci branch from c7bb337 to 7a809a9 Compare February 12, 2025 20:15
@ScottTodd ScottTodd requested review from Eliasj42 and removed request for ScottTodd February 17, 2025 18:58
@renxida renxida force-pushed the sharktank-datadependent-tests-to-ossci branch from 762330e to 31a09e1 Compare February 20, 2025 16:59