Implement caching for onnx/models git LFS files. #71
Conversation
```
validated/vision/classification/
  mnist/
    model/
      mnist-12.onnx
```
One thing at a time, but I'll probably want to actually fetch the .tar.gz
files like https://github.com/onnx/models/blob/main/validated/vision/classification/mnist/model/mnist-12.tar.gz and then unpack them, since those archives include inputs and outputs already generated and validated for us. Could just have the archives cached and then unpack at test time.... orrrr let the cache be responsible for that too. See what huggingface_hub has for extra files ("assets") here: https://huggingface.co/docs/huggingface_hub/guides/manage-cache#caching-assets.
There will be file name conflicts if I extract all of those files into the same flat directory though, in either the cache or working directory 🤔. May want to ditch the current "subdirectory" concept and instead create one artifacts directory per test case.
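A minimal sketch of the "one artifacts directory per test case" idea, assuming the standard `tarfile` module; the function and directory names here are hypothetical, not the actual implementation:

```python
import tarfile
from pathlib import Path


def extract_archive_for_test_case(
    archive_path: Path, artifacts_root: Path, test_case_name: str
) -> Path:
    """Extract a model .tar.gz into a directory dedicated to one test case.

    One directory per test case avoids the file name conflicts that a single
    flat directory would cause, since every archive contains similarly named
    files (model.onnx, input_0.pb, output_0.pb, ...).
    """
    test_case_dir = artifacts_root / test_case_name
    test_case_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=test_case_dir)
    return test_case_dir
```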
Brainstorming for this:

Maybe you could have the `CacheManager` handle this by pluralizing to `get_files_in_working_directory`, which could return a dict with keys `['model', 'metadata', 'inputs', 'golden_outputs']` in case those are available in the cache.
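A rough sketch of what that pluralized API could look like; the `CacheManager` internals and the artifact file names are assumptions for illustration only:

```python
from pathlib import Path


class CacheManager:
    """Hypothetical sketch of the pluralized lookup discussed above."""

    def __init__(self, working_directory: Path):
        self.working_directory = working_directory

    def get_files_in_working_directory(self, test_case: str) -> dict[str, Path]:
        """Return whichever well-known artifacts exist for a test case.

        Keys are only present when the corresponding file is available,
        so callers can check `'golden_outputs' in files` before comparing.
        """
        # Assumed canonical file names inside each test case directory.
        candidates = {
            "model": "model.onnx",
            "metadata": "metadata.json",
            "inputs": "input_0.pb",
            "golden_outputs": "output_0.pb",
        }
        case_dir = self.working_directory / test_case
        return {
            key: case_dir / name
            for key, name in candidates.items()
            if (case_dir / name).exists()
        }
```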
Leaning towards one directory per test case, with the test runner (`conftest.py`) or some onnx util class handling extracting from `.tar.gz`, so the cache only needs to download files. Teaching the cache how to extract files and then remove the original archives could save on disk space used, but I'd rather keep the implementation simple right now.
Simple cache, duplicate storage:

```
cache_dir/
  test_case_1_archive.tar.gz   (git lfs understands how to version this)
working_dir/
  test_case_1/
    archive.tar.gz   (symlink)
    model.onnx       (from archive)
    input.pb         (from archive)
```

Minimal storage:

```
cache_dir/
  [deleted after extraction] test_case_1_archive.tar.gz
  test_case_1/
    model.onnx   (from archive)
    input.pb     (from archive)
working_dir/
  test_case_1/
    model.onnx   (symlink)
    input.pb     (symlink)
```
With the minimal storage approach, the cache would need to understand that the extracted files don't need to be redownloaded. That isn't really an issue with the onnx models, or even HF models, since the files are pretty stable and we don't need the absolute latest versions all the time, but I like how the default approach automatically gets the latest versions with no extra bookkeeping on our end.
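The "simple cache, duplicate storage" flow above could be sketched roughly like this, assuming the archive has already been downloaded into the cache (the download step and all names here are hypothetical):

```python
import tarfile
from pathlib import Path


def materialize_test_case(cache_dir: Path, working_dir: Path, test_case: str) -> Path:
    """Simple-cache flow: the archive lives in the cache; the working
    directory gets a symlink to it plus the extracted contents.

    Assumes the cache already holds '<test_case>_archive.tar.gz'
    (downloading/versioning via git lfs is out of scope for this sketch).
    """
    archive = cache_dir / f"{test_case}_archive.tar.gz"
    case_dir = working_dir / test_case
    case_dir.mkdir(parents=True, exist_ok=True)
    link = case_dir / "archive.tar.gz"
    if not link.exists():
        # Symlink so the archive bytes are stored only once, in the cache.
        link.symlink_to(archive)
    with tarfile.open(link, "r:gz") as tar:
        tar.extractall(path=case_dir)
    return case_dir
```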
Yeah, the simple cache approach seems best to me (and seems to likely have minimal storage in cache when not running the tests, assuming the compressed file is actually reasonably compressed).
I like the direction this is heading. Factoring out cache management is something e2eshark could benefit from too, since `OnnxModelInfo` (https://github.com/nod-ai/SHARK-TestSuite/blob/4797eae93979854df5ef0bf8cefeb4f2f6c215b6/alt_e2eshark/e2e_testing/framework.py#L45) currently handles too much in one class.
All of my comments are brainstorming/nits. I hope some of them are helpful. Let me know if you want me to re-review. I'm going to leave this review as a comment for now to discuss some of the ideas and hear your feedback.
`onnx_models/cache.py` (Outdated)
```python
class CacheScope(abc.ABC):
    """Abstract base class for a cache scope."""
```
It might be useful to put more info in the docstring about what a child class needs to implement. It's not clear from reading this class definition that a child class needs to implement not only the retrieval of warm cache file paths, but also the downloading/extracting of such files. Perhaps this can be resolved by adding something like:

```python
@abc.abstractmethod
def setup_file(self, relative_path: str):
    """Download or generate the cache file if not present."""
```

However, I'm not sure how to enforce compatibility between this function and the output of `get_file`. Maybe it would just be better to keep this base class simple and add a bit more documentation to indicate that `get_file` should also handle initializing the cache file when not present.
Sounds good, will write more docs.

> It's not clear from reading this class definition that a child class needs to implement not only the retrieval of warm cache file paths, but also the downloading/extracting of such files.

The interface is that a cache scope returns a path on a local disk for a given [remote] path. As part of that, the scope may need to download a file, authenticate with a server, or perform some other operations.
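A minimal sketch of that contract written into the base class docstring; this is an illustration of the documentation idea, not the actual `cache.py` code:

```python
import abc
from pathlib import Path


class CacheScope(abc.ABC):
    """Abstract base class for a cache scope.

    Contract for subclasses: given a [remote] relative path, get_file()
    returns a path on local disk. If the file is not already cached, the
    scope is responsible for initializing it first -- downloading it,
    authenticating with a server, extracting an archive, etc.
    """

    @abc.abstractmethod
    def get_file(self, relative_path: str) -> Path:
        """Return a local path for `relative_path`, fetching it if needed."""
        raise NotImplementedError
```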
Added some more docs. How's this look now?
Thanks! Addressed some of the surface level comments. Want to re-review? I think I'm just about ready to get these tests running on presubmit in https://github.com/iree-org/iree now.

I have more ideas/goals about the "test configs", XFAILs, and other ways that developers and users interface with test suites. I would like to be reporting current status in an easy to understand way. This repository and SHARK-TestSuite should be feeding into release notes and public project status trackers so users can see which programs are known to be working at which versions. We have plenty of the building blocks for that here and in https://github.com/nod-ai/e2eshark-reports/ already.
This looks good to me! Thanks for looking through and addressing the comments so quickly.
Progress on #6.

This adds a caching layer which allows developers and persistent CI runners to avoid needing to redownload source `.onnx` files.

Details

- Cache files live under `${IREE_TEST_FILES}/iree-test-suites` if `IREE_TEST_FILES` is set, or `~/.cache/iree-test-suites/` otherwise. This can be overridden with the custom `--cache-dir=/path/to/cache` pytest option. Several of our persistent CI machines set the `IREE_TEST_FILES` environment variable already.
- For onnx/models files, the cache runs `git lfs pull` in the local clone to fetch the latest version of the file and then creates a symlink from the cache directory to the test working directory. This usage should be pretty similar to what huggingface_hub provides: https://huggingface.co/docs/huggingface_hub/guides/manage-cache.

Testing
Tested in iree-org/iree on persistent runners here:
Cold cache: https://github.com/iree-org/iree/actions/runs/12838451019/job/35804050925#step:8:22
Warm cache: https://github.com/iree-org/iree/actions/runs/12838451019/job/35804127583#step:8:22
(The rest of the logs are currently the same)
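The cache-directory resolution described in the details above could be sketched as follows; the function name is hypothetical and the `--cache-dir` pytest override is omitted:

```python
import os
from pathlib import Path


def default_cache_dir() -> Path:
    """Resolve the default cache directory:
    ${IREE_TEST_FILES}/iree-test-suites if IREE_TEST_FILES is set,
    otherwise ~/.cache/iree-test-suites/.

    A --cache-dir=/path/to/cache pytest option would take precedence
    over both (not shown in this sketch).
    """
    iree_test_files = os.environ.get("IREE_TEST_FILES")
    if iree_test_files:
        return Path(iree_test_files).expanduser() / "iree-test-suites"
    return Path.home() / ".cache" / "iree-test-suites"
```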