
Implement caching for onnx/models git LFS files. #71

Merged: 9 commits into iree-org:main on Jan 21, 2025

Conversation

@ScottTodd (Member) commented on Jan 17, 2025:

Progress on #6.

This adds a caching layer that lets developers and persistent CI runners avoid redownloading source .onnx files.

Details

  • The cache location defaults to ${IREE_TEST_FILES}/iree-test-suites if IREE_TEST_FILES is set, or ~/.cache/iree-test-suites/ otherwise. It can be overridden with the custom --cache-dir=/path/to/cache pytest option. Several of our persistent CI machines already set the IREE_TEST_FILES environment variable.
  • The cache is implemented as a local git clone of the https://github.com/onnx/models repository, which uses Git Large File Storage (LFS) to store large files. When a test requests a file, the cache layer runs git lfs pull in the local clone to fetch the latest version of the file, then creates a symlink from the cache directory into the test working directory (see the sketch below). This usage is similar to what huggingface_hub provides: https://huggingface.co/docs/huggingface_hub/guides/manage-cache.
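
For illustration, the clone + git lfs pull + symlink flow could be sketched roughly as below. This is a sketch only: the helper names, the GIT_LFS_SKIP_SMUDGE detail, and the lack of error handling are assumptions, not necessarily what onnx_models/cache.py actually does.

import os
import subprocess
from pathlib import Path

ONNX_MODELS_URL = "https://github.com/onnx/models.git"


def default_cache_dir() -> Path:
    # $IREE_TEST_FILES takes precedence; fall back to ~/.cache.
    if "IREE_TEST_FILES" in os.environ:
        return Path(os.environ["IREE_TEST_FILES"]).expanduser() / "iree-test-suites"
    return Path.home() / ".cache" / "iree-test-suites"


def fetch_onnx_model(relative_path: str, working_dir: Path) -> Path:
    """Fetch one LFS file into the local clone, then symlink it into working_dir."""
    repo_dir = default_cache_dir() / "onnx_models"
    if not repo_dir.exists():
        # Skip the LFS smudge filter so the initial clone stays small;
        # individual files are pulled on demand below.
        env = dict(os.environ, GIT_LFS_SKIP_SMUDGE="1")
        subprocess.run(["git", "clone", ONNX_MODELS_URL, str(repo_dir)], env=env, check=True)
    # Pull only the requested LFS object (also updates it if upstream changed).
    subprocess.run(["git", "lfs", "pull", "--include", relative_path], cwd=repo_dir, check=True)

    destination = working_dir / Path(relative_path).name
    destination.parent.mkdir(parents=True, exist_ok=True)
    if destination.is_symlink() or destination.exists():
        destination.unlink()
    destination.symlink_to(repo_dir / relative_path)
    return destination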

Testing

Tested in iree-org/iree on persistent runners here:

  • Cold cache: https://github.com/iree-org/iree/actions/runs/12838451019/job/35804050925#step:8:22

    ---------------------------- live log sessionstart -----------------------------
    INFO     onnx_models.conftest:conftest.py:96 Using cache directory: '/home/esaimana/iree_tests_cache/iree-test-suites'
    INFO     onnx_models.cache:cache.py:115 Setting up GitHub repository 'onnx/models'
    INFO     onnx_models.cache:cache.py:117 Checking for working 'git lfs' (https://git-lfs.com/)
    INFO     onnx_models.cache:cache.py:136 Cloning https://github.com/onnx/models.git into '/home/esaimana/iree_tests_cache/iree-test-suites/onnx_models'
    Cloning into '/home/esaimana/iree_tests_cache/iree-test-suites/onnx_models'...
    
  • Warm cache: https://github.com/iree-org/iree/actions/runs/12838451019/job/35804127583#step:8:22

    ---------------------------- live log sessionstart -----------------------------
    INFO     onnx_models.conftest:conftest.py:96 Using cache directory: '/home/esaimana/iree_tests_cache/iree-test-suites'
    INFO     onnx_models.cache:cache.py:115 Setting up GitHub repository 'onnx/models'
    INFO     onnx_models.cache:cache.py:117 Checking for working 'git lfs' (https://git-lfs.com/)
    INFO     onnx_models.cache:cache.py:122 Directory '/home/esaimana/iree_tests_cache/iree-test-suites/onnx_models' already exists
    

(The rest of the logs are currently the same)

@ScottTodd ScottTodd marked this pull request as ready for review January 17, 2025 23:31
validated/vision/classification/
  mnist/
    model/
      mnist-12.onnx
ScottTodd (Member Author):
One thing at a time, but I'll probably want to actually fetch the .tar.gz files like https://github.com/onnx/models/blob/main/validated/vision/classification/mnist/model/mnist-12.tar.gz and then unpack them, since those archives include inputs and outputs that are already generated and validated for us. We could just cache the archives and unpack them at test time... or let the cache be responsible for that too. See what huggingface_hub has for extra files ("assets") here: https://huggingface.co/docs/huggingface_hub/guides/manage-cache#caching-assets.

There will be file name conflicts if I extract all of those files into the same flat directory though, in either the cache or working directory 🤔. May want to ditch the current "subdirectory" concept and instead create one artifacts directory per test case.

zjgarvey (Collaborator):

Brainstorming for this:

Maybe you could have the CacheManager handle this by pluralizing to get_files_in_working_directory, which could return a dict with keys ['model', 'metadata', 'inputs', 'golden_outputs'] in case those are available in the cache.
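
(Editorial sketch, not code from this PR: one possible shape for that pluralized lookup. The key names come from the suggestion above; the file names and the standalone signature are hypothetical.)

from pathlib import Path


def get_files_in_working_directory(test_case_dir: Path) -> dict[str, Path | None]:
    """Return whichever artifacts exist for a test case, keyed by role."""
    candidates = {
        "model": test_case_dir / "model.onnx",
        "metadata": test_case_dir / "metadata.json",
        "inputs": test_case_dir / "input_0.pb",
        "golden_outputs": test_case_dir / "output_0.pb",
    }
    # Missing entries map to None so callers can tell what the cache provided.
    return {key: (path if path.exists() else None) for key, path in candidates.items()}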

ScottTodd (Member Author):
Leaning towards one directory per test case, with the test runner (conftest.py) or some onnx util class handling extraction from .tar.gz so the cache only needs to download files. Teaching the cache how to extract files and then remove the original archives could save disk space, but I'd rather keep the implementation simple right now.

Simple cache, duplicate storage:

cache_dir
  test_case_1_archive.tar.gz (git lfs understands how to version this)

working_dir
  test_case_1
    archive.tar.gz (symlink)
    model.onnx (from archive)
    input.pb (from archive)

Minimal storage:

cache_dir
  [deleted after extraction] test_case_1_archive.tar.gz
  test_case_1
    model.onnx (from archive)
    input.pb (from archive)

working_dir
  test_case_1
    model.onnx (symlink)
    input.pb (symlink)

With the minimal storage approach, the cache would need to track that the extracted files don't need to be redownloaded. That isn't really an issue for the onnx models, or even HF models, since the files are pretty stable and we don't need the absolute latest versions all the time, but I like how the default approach automatically picks up the latest versions with no extra bookkeeping on our end.
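
(Editorial sketch, not code from this PR: with the simple-cache layout above, the extraction step in the test runner could look roughly like this. The function name and the flattening behavior are assumptions.)

import tarfile
from pathlib import Path


def extract_test_case_archive(archive_symlink: Path, test_case_dir: Path) -> None:
    """Unpack a cached .tar.gz into a per-test-case working directory."""
    test_case_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_symlink, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            # Flatten the archive's internal folder so each test case directory
            # directly contains model.onnx, input .pb files, etc.
            member.name = Path(member.name).name
            archive.extract(member, path=test_case_dir)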

zjgarvey (Collaborator):
Yeah, the simple cache approach seems best to me (and it likely keeps cache storage minimal when the tests aren't running, assuming the compressed file is actually reasonably compressed).

zjgarvey (Collaborator) left a review:

I like the direction this is heading. Factoring out cache management is something e2eshark could benefit from too, since OnnxModelInfo https://github.com/nod-ai/SHARK-TestSuite/blob/4797eae93979854df5ef0bf8cefeb4f2f6c215b6/alt_e2eshark/e2e_testing/framework.py#L45 currently handles too much in one class.

All of my comments are brainstorming/nits. I hope some of them are helpful. Let me know if you want me to re-review. I'm going to leave this review as a comment for now to discuss some of the ideas and hear your feedback.

(Resolved review threads on onnx_models/cache.py and onnx_models/conftest.py.)


class CacheScope(abc.ABC):
    """Abstract base class for a cache scope."""
zjgarvey (Collaborator):

It might be useful to put more info in the docstring about what a child class needs to implement. It's not clear from reading this class definition that a child class needs to implement not only the retrieval of warm cache file paths, but also the downloading/extracting of such files. Perhaps this can be resolved by adding something like:

@abc.abstractmethod
def setup_file(self, relative_path: str):
    """Download or generate the cache file if not present"""

However, I'm not sure how to enforce compatibility between this function and the output of get_file. Maybe it would just be better to keep this base class simple and add a bit more documentation to indicate that get_file should also handle initializing the cache file when not present.

ScottTodd (Member Author):

Sounds good, will write more docs.

> It's not clear from reading this class definition that a child class needs to implement not only the retrieval of warm cache file paths, but also the downloading/extracting of such files.

The interface is that a cache scope returns a path on a local disk for a given [remote] path. As part of that, the scope may need to download a file, authenticate with a server, or perform some other operations.
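
(Editorial sketch, not the PR's code: one way a more documented CacheScope base class could spell out that contract. Only get_file is mentioned in this thread; everything else here is hypothetical.)

import abc
from pathlib import Path


class CacheScope(abc.ABC):
    """Abstract base class for a cache scope.

    A cache scope maps a (possibly remote) relative path to a path on local
    disk. Implementations must do whatever is needed to make the file
    available (downloading, authenticating with a server, extracting an
    archive, etc.) before returning from get_file().
    """

    @abc.abstractmethod
    def get_file(self, relative_path: str) -> Path:
        """Return a local path for relative_path, fetching it first if needed."""
        ...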

ScottTodd (Member Author):
Added some more docs. How's this look now?


@ScottTodd ScottTodd requested a review from zjgarvey January 21, 2025 21:40
@ScottTodd (Member Author):

> All of my comments are brainstorming/nits. I hope some of them are helpful. Let me know if you want me to re-review. I'm going to leave this review as a comment for now to discuss some of the ideas and hear your feedback.

Thanks! Addressed some of the surface-level comments. Want to re-review? I think I'm just about ready to get these tests running on presubmit in https://github.com/iree-org/iree now.

I have more ideas/goals about the "test configs", XFAILs, and other ways that developers and users interface with test suites. I would like to report current status in an easy-to-understand way. This repository and SHARK-TestSuite should feed into release notes and public project status trackers so users can see which programs are known to work at which versions. We have plenty of the building blocks for that here and in https://github.com/nod-ai/e2eshark-reports/ already.

zjgarvey (Collaborator) left a review:

This looks good to me! Thanks for looking through and addressing the comments so quickly.

@ScottTodd ScottTodd merged commit 5650966 into iree-org:main Jan 21, 2025
2 checks passed
@ScottTodd ScottTodd deleted the onnx-model-cache branch January 21, 2025 22:02