
Implement caching for onnx/models git LFS files. #71

Merged: 9 commits into iree-org:main on Jan 21, 2025

Conversation

@ScottTodd (Member) commented on Jan 17, 2025:

Progress on #6.

This adds a caching layer that lets developers and persistent CI runners avoid redownloading source .onnx files.

Details

  • The cache location defaults to ${IREE_TEST_FILES}/iree-test-suites if IREE_TEST_FILES is set, or ~/.cache/iree-test-suites/ otherwise. It can be overridden with the custom --cache-dir=/path/to/cache pytest option. Several of our persistent CI machines already set the IREE_TEST_FILES environment variable.
  • The cache is implemented as a local git clone of the https://github.com/onnx/models repository, which uses Git Large File Storage (LFS) to store large files. When a test requests a file, the cache layer runs git lfs pull in the local clone to fetch the latest version of the file, then creates a symlink from the cache directory into the test working directory (see the sketch below). This usage is similar to what huggingface_hub provides: https://huggingface.co/docs/huggingface_hub/guides/manage-cache.
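
For illustration, the clone + git lfs pull + symlink flow could be sketched roughly as below. This is a sketch only: the helper names, the GIT_LFS_SKIP_SMUDGE detail, and the lack of error handling are assumptions, not necessarily what onnx_models/cache.py actually does.

import os
import subprocess
from pathlib import Path

ONNX_MODELS_URL = "https://github.com/onnx/models.git"


def default_cache_dir() -> Path:
    # $IREE_TEST_FILES takes precedence; fall back to ~/.cache.
    if "IREE_TEST_FILES" in os.environ:
        return Path(os.environ["IREE_TEST_FILES"]).expanduser() / "iree-test-suites"
    return Path.home() / ".cache" / "iree-test-suites"


def fetch_onnx_model(relative_path: str, working_dir: Path) -> Path:
    """Fetch one LFS file into the local clone, then symlink it into working_dir."""
    repo_dir = default_cache_dir() / "onnx_models"
    if not repo_dir.exists():
        # Skip the LFS smudge filter so the initial clone stays small;
        # individual files are pulled on demand below.
        env = dict(os.environ, GIT_LFS_SKIP_SMUDGE="1")
        subprocess.run(["git", "clone", ONNX_MODELS_URL, str(repo_dir)], env=env, check=True)
    # Pull only the requested LFS object (also updates it if upstream changed).
    subprocess.run(["git", "lfs", "pull", "--include", relative_path], cwd=repo_dir, check=True)

    destination = working_dir / Path(relative_path).name
    destination.parent.mkdir(parents=True, exist_ok=True)
    if destination.is_symlink() or destination.exists():
        destination.unlink()
    destination.symlink_to(repo_dir / relative_path)
    return destination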

Testing

Tested in iree-org/iree on persistent runners here:

  • Cold cache: https://github.com/iree-org/iree/actions/runs/12838451019/job/35804050925#step:8:22

    ---------------------------- live log sessionstart -----------------------------
    INFO     onnx_models.conftest:conftest.py:96 Using cache directory: '/home/esaimana/iree_tests_cache/iree-test-suites'
    INFO     onnx_models.cache:cache.py:115 Setting up GitHub repository 'onnx/models'
    INFO     onnx_models.cache:cache.py:117 Checking for working 'git lfs' (https://git-lfs.com/)
    INFO     onnx_models.cache:cache.py:136 Cloning https://github.com/onnx/models.git into '/home/esaimana/iree_tests_cache/iree-test-suites/onnx_models'
    Cloning into '/home/esaimana/iree_tests_cache/iree-test-suites/onnx_models'...
    
  • Warm cache: https://github.com/iree-org/iree/actions/runs/12838451019/job/35804127583#step:8:22

    ---------------------------- live log sessionstart -----------------------------
    INFO     onnx_models.conftest:conftest.py:96 Using cache directory: '/home/esaimana/iree_tests_cache/iree-test-suites'
    INFO     onnx_models.cache:cache.py:115 Setting up GitHub repository 'onnx/models'
    INFO     onnx_models.cache:cache.py:117 Checking for working 'git lfs' (https://git-lfs.com/)
    INFO     onnx_models.cache:cache.py:122 Directory '/home/esaimana/iree_tests_cache/iree-test-suites/onnx_models' already exists
    

(The rest of the logs are currently the same)

@ScottTodd ScottTodd marked this pull request as ready for review January 17, 2025 23:31
validated/vision/classification/
  mnist/
    model/
      mnist-12.onnx
ScottTodd (Member Author):
One thing at a time, but I'll probably want to actually fetch the .tar.gz files like https://github.com/onnx/models/blob/main/validated/vision/classification/mnist/model/mnist-12.tar.gz and then unpack them, since those archives include inputs and outputs that are already generated and validated for us. We could just cache the archives and unpack them at test time... or let the cache be responsible for that too. See what huggingface_hub has for extra files ("assets") here: https://huggingface.co/docs/huggingface_hub/guides/manage-cache#caching-assets.

There will be file name conflicts if I extract all of those files into the same flat directory though, in either the cache or working directory 🤔. May want to ditch the current "subdirectory" concept and instead create one artifacts directory per test case.

zjgarvey (Collaborator):

Brainstorming for this:

Maybe you could have the CacheManager handle this by pluralizing to get_files_in_working_directory, which could return a dict with keys ['model', 'metadata', 'inputs', 'golden_outputs'] in case those are available in the cache.
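
(Editorial sketch, not code from this PR: one possible shape for that pluralized lookup. The key names come from the suggestion above; the file names and the standalone signature are hypothetical.)

from pathlib import Path


def get_files_in_working_directory(test_case_dir: Path) -> dict[str, Path | None]:
    """Return whichever artifacts exist for a test case, keyed by role."""
    candidates = {
        "model": test_case_dir / "model.onnx",
        "metadata": test_case_dir / "metadata.json",
        "inputs": test_case_dir / "input_0.pb",
        "golden_outputs": test_case_dir / "output_0.pb",
    }
    # Missing entries map to None so callers can tell what the cache provided.
    return {key: (path if path.exists() else None) for key, path in candidates.items()}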

ScottTodd (Member Author):
Leaning towards one directory per test case, with the test runner (conftest.py) or some onnx util class handling extraction from .tar.gz so the cache only needs to download files. Teaching the cache how to extract files and then remove the original archives could save disk space, but I'd rather keep the implementation simple right now.

Simple cache, duplicate storage:

cache_dir
  test_case_1_archive.tar.gz (git lfs understands how to version this)

working_dir
  test_case_1
    archive.tar.gz (symlink)
    model.onnx (from archive)
    input.pb (from archive)

Minimal storage:

cache_dir
  [deleted after extraction] test_case_1_archive.tar.gz
  test_case_1
    model.onnx (from archive)
    input.pb (from archive)

working_dir
  test_case_1
    model.onnx (symlink)
    input.pb (symlink)

With the minimal storage approach, the cache would need to track that the extracted files don't need to be redownloaded. That isn't really an issue for the onnx models, or even HF models, since the files are pretty stable and we don't need the absolute latest versions all the time, but I like how the default approach automatically picks up the latest versions with no extra bookkeeping on our end.
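
(Editorial sketch, not code from this PR: with the simple-cache layout above, the extraction step in the test runner could look roughly like this. The function name and the flattening behavior are assumptions.)

import tarfile
from pathlib import Path


def extract_test_case_archive(archive_symlink: Path, test_case_dir: Path) -> None:
    """Unpack a cached .tar.gz into a per-test-case working directory."""
    test_case_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_symlink, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            # Flatten the archive's internal folder so each test case directory
            # directly contains model.onnx, input .pb files, etc.
            member.name = Path(member.name).name
            archive.extract(member, path=test_case_dir)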

zjgarvey (Collaborator):
Yeah, the simple cache approach seems best to me (and it likely keeps cache storage minimal when the tests aren't running, assuming the compressed file is actually reasonably compressed).

zjgarvey (Collaborator) left a review:

I like the direction this is heading. Factoring out cache management is something e2eshark could benefit from too, since OnnxModelInfo https://github.com/nod-ai/SHARK-TestSuite/blob/4797eae93979854df5ef0bf8cefeb4f2f6c215b6/alt_e2eshark/e2e_testing/framework.py#L45 currently handles too much in one class.

All of my comments are brainstorming/nits. I hope some of them are helpful. Let me know if you want me to re-review. I'm going to leave this review as a comment for now to discuss some of the ideas and hear your feedback.

(Resolved review threads on onnx_models/cache.py and onnx_models/conftest.py.)


class CacheScope(abc.ABC):
    """Abstract base class for a cache scope."""
zjgarvey (Collaborator):

It might be useful to put more info in the docstring about what a child class needs to implement. It's not clear from reading this class definition that a child class needs to implement not only the retrieval of warm cache file paths, but also the downloading/extracting of such files. Perhaps this can be resolved by adding something like:

@abc.abstractmethod
def setup_file(self, relative_path: str):
    """Download or generate the cache file if not present"""

However, I'm not sure how to enforce compatibility between this function and the output of get_file. Maybe it would just be better to keep this base class simple and add a bit more documentation to indicate that get_file should also handle initializing the cache file when not present.

ScottTodd (Member Author):

Sounds good, will write more docs.

> It's not clear from reading this class definition that a child class needs to implement not only the retrieval of warm cache file paths, but also the downloading/extracting of such files.

The interface is that a cache scope returns a path on a local disk for a given [remote] path. As part of that, the scope may need to download a file, authenticate with a server, or perform some other operations.
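
(Editorial sketch, not the PR's code: one way a more documented CacheScope base class could spell out that contract. Only get_file is mentioned in this thread; everything else here is hypothetical.)

import abc
from pathlib import Path


class CacheScope(abc.ABC):
    """Abstract base class for a cache scope.

    A cache scope maps a (possibly remote) relative path to a path on local
    disk. Implementations must do whatever is needed to make the file
    available (downloading, authenticating with a server, extracting an
    archive, etc.) before returning from get_file().
    """

    @abc.abstractmethod
    def get_file(self, relative_path: str) -> Path:
        """Return a local path for relative_path, fetching it first if needed."""
        ...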

ScottTodd (Member Author):
Added some more docs. How's this look now?


@ScottTodd ScottTodd requested a review from zjgarvey January 21, 2025 21:40
@ScottTodd (Member Author):

> All of my comments are brainstorming/nits. I hope some of them are helpful. Let me know if you want me to re-review. I'm going to leave this review as a comment for now to discuss some of the ideas and hear your feedback.

Thanks! Addressed some of the surface-level comments. Want to re-review? I think I'm just about ready to get these tests running on presubmit in https://github.com/iree-org/iree now.

I have more ideas/goals about the "test configs", XFAILs, and other ways that developers and users interface with test suites. I would like to report current status in an easy-to-understand way. This repository and SHARK-TestSuite should feed into release notes and public project status trackers so users can see which programs are known to work at which versions. We have plenty of the building blocks for that here and in https://github.com/nod-ai/e2eshark-reports/ already.

zjgarvey (Collaborator) left a review:

This looks good to me! Thanks for looking through and addressing the comments so quickly.

@ScottTodd ScottTodd merged commit 5650966 into iree-org:main Jan 21, 2025
2 checks passed
@ScottTodd ScottTodd deleted the onnx-model-cache branch January 21, 2025 22:02