
implemented a cache for already downloaded models #39

Merged 7 commits on Apr 25, 2024

Conversation

@shivam6862 (Contributor) commented Apr 22, 2024

@raphaelDkhn
Implemented a cache for already downloaded models by using the diskcache package.
Closes #37

@Gonmeso (Contributor) left a comment

Nice usage of diskcache!

There are some things that we need to change, and we should also review the formatting changes that should have been handled by black prior to the commit.

Also, I think that we could move this logic into a class in giza_actions/utils.py and just initialise this class in the GizaModel.__init__() method.

With a specific class that handles the download and the cache, we can create tests for it without the complexity of GizaModel.
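A minimal sketch of what such a class could look like (the ModelCache name and its get/set methods are illustrative, not an existing giza_actions API; a plain dict stands in for diskcache.Cache so the snippet runs without extra dependencies):

```python
from pathlib import Path
from typing import Dict, Optional


class ModelCache:
    """Sketch of a download-cache helper that could live in giza_actions/utils.py.

    A dict stands in for diskcache.Cache here; the real class would wrap
    diskcache.Cache and persist entries to disk.
    """

    def __init__(self) -> None:
        self._cache: Dict[str, str] = {}

    @staticmethod
    def _key(model_id: int, version_id: int) -> str:
        # Same key scheme used in the PR: "<model>_<version>_model"
        return f"{model_id}_{version_id}_model"

    def get(self, model_id: int, version_id: int) -> Optional[Path]:
        path = self._cache.get(self._key(model_id, version_id))
        return Path(path) if path is not None else None

    def set(self, model_id: int, version_id: int, file_path: str) -> None:
        self._cache[self._key(model_id, version_id)] = str(file_path)


cache = ModelCache()
cache.set(7, 2, "/tmp/7_2_model.onnx")
print(cache.get(7, 2))
```

Testing a class like this only needs a temporary directory and a fake file path, with no GizaModel instance involved.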

@@ -1,6 +1,8 @@
import logging
from pathlib import Path
from typing import Dict, Optional
from diskcache import Cache
import os
Contributor

This import os shouldn't be placed here.

This should have been handled by pre-commit; make sure that you have it installed and run pre-commit run --files giza_actions/model.py

@@ -54,13 +56,15 @@ def __init__(
output_path: Optional[str] = None,
):
if model_path is None and id is None and version is None:
raise ValueError("Either model_path or id and version must be provided.")
raise ValueError(
Contributor

Why are you changing this formatting? This is handled by black, and every commit should trigger it through pre-commit.

@@ -85,6 +89,7 @@ def __init__(
self.endpoint_id = self._get_endpoint_id()
if output_path:
self._download_model(output_path)
self.cache = Cache(os.getcwd() + '/tmp/cachedir')
Contributor

Better to use os.path.join or pathlib, as the separator depends on the OS (/ on Unix, \ on Windows).
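For illustration, either approach yields a portable path (assuming the tmp/cachedir layout under the working directory, as in the PR):

```python
import os
from pathlib import Path

# os.path.join inserts the OS-specific separator for us
cache_dir_join = os.path.join(os.getcwd(), "tmp", "cachedir")

# pathlib builds the same path with the "/" operator
cache_dir_path = Path.cwd() / "tmp" / "cachedir"

print(cache_dir_join)
print(cache_dir_path)
```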

cache_str = f"{self.model.id}_{self.version.version}_model"
if cache_str in self.cache:
file_path = self.cache.get(cache_str)
file_path = Path(file_path)
Contributor

I feel that this is not needed, as open accepts strings as well; or you could just do it directly:

file_path = Path(self.cache.get(cache_str))

@@ -85,6 +89,7 @@ def __init__(
self.endpoint_id = self._get_endpoint_id()
if output_path:
self._download_model(output_path)
self.cache = Cache(os.getcwd() + '/tmp/cachedir')
Contributor

I would also make it "private" as this cache is not something that we want the user to use directly.

self._cache = Cache(...)

self.model.id, self.version.version
)
cache_str = f"{self.model.id}_{self.version.version}_model"
if cache_str in self.cache:
Contributor

I like this approach for the cache very much 🚀

cache_str = f"{self.model.id}_{self.version.version}_model"
if cache_str in self.cache:
file_path = self.cache.get(cache_str)
file_path = Path(file_path)
Contributor

Same as the previous comment: it could be the string only, or a single line to create the path.

pyproject.toml Outdated
@@ -13,6 +13,7 @@ license = "MIT"

[tool.poetry.dependencies]
python = ">=3.11,<4.0"
diskcache == "5.6.3"
Contributor

We should not directly pin this dependency; make sure to add it by just running

poetry add diskcache

@shivam6862 (Contributor Author) commented Apr 22, 2024

It worked after downgrading the Python version to 3.11.9.

with open(file_path, "rb") as f:
onnx_model = f.read()
else:
onnx_model = self.version_client.download_original(
Contributor

Let's change this to use _download_model and then read it, so we have the benefits of the cache.

with open(file_path, "rb") as f:
file = f.read()
else:
file = self.version_client.download_original(
Contributor

We should change this to use _download_model, which handles the usage of the cache. Currently, if this is called multiple times, the model is downloaded each time, as it is not reflected in the cache.

Let's use _download_model and then read the file.

@shivam6862 (Contributor Author) commented Apr 22, 2024

@Gonmeso @raphaelDkhn The requested changes have been made.

self.framework = self.version.framework
self.uri = self._retrieve_uri()
self.endpoint_id = self._get_endpoint_id()
if output_path:
Contributor

This if will make the prediction fail when verifiable=False if output_path is not provided, making it mandatory, which is not what we are aiming for.

self.framework = self.version.framework
self.uri = self._retrieve_uri()
self.endpoint_id = self._get_endpoint_id()
if output_path:
self.session = self._set_session(output_path)
Contributor

output_path shouldn't be a mandatory argument for set_session

self.framework = self.version.framework
self.uri = self._retrieve_uri()
self.endpoint_id = self._get_endpoint_id()
if output_path:
self.session = self._set_session(output_path)
if output_path:
Contributor

Having to pass output_path to _download_model and _get_output_dtype will effectively make output_path mandatory, but this is not the intention, so we need another path where, if output_path is not provided, we handle it for the user:

  • Let's make the class have an attribute self._output_path
  • If output_path is present when the user creates the instance, we will use that path; if not, we will handle it for them
  • Remove the output_path argument from _download_model and use self._output_path instead
  • Remove output_path from _set_session and _get_output_dtype as it is no longer needed

An example of the init could be:

    def __init__(
        self,
        model_path: Optional[str] = None,
        id: Optional[int] = None,
        version: Optional[int] = None,
        output_path: Optional[str] = None,
    ):
       ...

        if model_path:
            self.session = ort.InferenceSession(model_path)
        elif id and version:
            self.model_id = id
            self.version_id = version
            self.model_client = ModelsClient(API_HOST)
            self.version_client = VersionsClient(API_HOST)
            self.api_client = ApiClient(API_HOST)
            self.endpoints_client = EndpointsClient(API_HOST)
            self._get_credentials()
            self.model = self._get_model(id)
            self.version = self._get_version(version)
            self.session = self._set_session()
            self.framework = self.version.framework
            self.uri = self._retrieve_uri()
            self.endpoint_id = self._get_endpoint_id()
            if output_path is not None:
                self._output_path = output_path
            else:
                self._output_path = os.path.join(tempfile.gettempdir(), f"{self.model_id}_{self.version_id}_{self.model.name}")
            # Now this internally uses self._output_path
            # As we are using the cache hitting this function should not be problematic
            self._download_model()

onnx_model = self.version_client.download_original(
self.model.id, self.version.version
)
cache_str = f"{self.model.id}_{self.version.version}_model"
Contributor

Let's remove this and use self._output_path so cache keys are more consistent

cache_str = f"{self.model.id}_{self.version.version}_model"
self._download_model(output_path)

if cache_str in self._cache:
Contributor

With the proposed changes

          if self._output_path in self._cache:
                file_path = Path(self._cache.get(self._output_path))
                with open(file_path, "rb") as f:
                    onnx_model = f.read()

onnx_model = self.version_client.download_original(
self.model.id, self.version.version
)
cache_str = f"{self.model.id}_{self.version.version}_model"
Contributor

As in the previous comment, let's remove this and use self._output_path so cache keys are more consistent

save_path = Path(f"{output_path}/{self.model.name}.onnx")
logger.info("ONNX model is ready, downloading! ✅")

if ".onnx" in output_path:
Contributor

With the proposed changes this should be self._output_path

@@ -221,6 +236,7 @@ def predict(
custom_output_dtype: Optional[str] = None,
job_size: str = "M",
dry_run: bool = False,
output_path: Optional[str] = None,
Contributor

Remove this as it is not needed

@@ -272,7 +288,7 @@ def predict(
logger.info("Serialized: %s", serialized_output)

if custom_output_dtype is None:
output_dtype = self._get_output_dtype()
output_dtype = self._get_output_dtype(output_path)
Contributor

This should be as it was before

file = self.version_client.download_original(
self.model.id, self.version.version
)
cache_str = f"{self.model.id}_{self.version.version}_model"
Contributor

With the proposed changes this should be self._output_path

@Gonmeso (Contributor) commented Apr 23, 2024

Testing is still missing; that is why I proposed creating a cache class, to test it easily.

@shivam6862 (Contributor Author) commented Apr 23, 2024

@Gonmeso @raphaelDkhn
Implemented the cache test, and the requested changes have been made.
I'm looking for some guidance.

@Gonmeso (Contributor) left a comment

Really close to being done!

We just need to fix the tests and this will be ready, as CI is failing. Make sure everything works beforehand by running pytest locally.

f"{self.model_id}_{self.version_id}_{self.model.name}",
)
self._download_model()
self._cache = Cache(os.path.join(os.getcwd(), "tmp", "cachedir"))
Contributor

The cache should be initialized before _set_session and _download_model because, if the user does not provide output_path, _set_session will try to hit self._cache, which has not been initialized.

This is making the two previously existing tests fail.

@patch("giza_actions.model.GizaModel._get_output_dtype")
@patch("giza_actions.model.GizaModel._retrieve_uri")
@patch("giza_actions.model.GizaModel._get_endpoint_id", return_value=1)
def test_cache_implementation(*args):
Contributor

The test is failing because some mocks are missing; the current error is due to missing credentials. For that, just add:

@patch("giza_actions.model.GizaModel._get_credentials")
def test_cache_implementation(*args):

Also, some other patches may be missing as well; for example, _get_model and _get_version might be needed.
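For illustration, the same patch-stacking pattern on a toy stand-in class (patch.object is used here so the sketch runs without giza_actions installed; the real test would patch the giza_actions.model paths shown above):

```python
from unittest.mock import MagicMock, patch


class GizaModel:
    """Toy stand-in: the real __init__ needs credentials and API access."""

    def __init__(self):
        self._get_credentials()
        self.model = self._get_model()

    def _get_credentials(self):
        raise RuntimeError("needs a real API key")

    def _get_model(self):
        raise RuntimeError("needs a real API call")


@patch.object(GizaModel, "_get_model", return_value=MagicMock())
@patch.object(GizaModel, "_get_credentials")
def test_cache_implementation(*args):
    # With both methods mocked, construction no longer needs credentials.
    model = GizaModel()
    assert model.model is not None


test_cache_implementation()
print("test passed")
```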

@shivam6862 (Contributor Author)

@Gonmeso Added the patches for the test and made the required changes to pass CI.
I'm looking for some guidance.

@shivam6862 (Contributor Author)

@Gonmeso The CI failed due to "Exception: Token expired".
Please guide me on how to deal with this situation.

@Gonmeso (Contributor) commented Apr 24, 2024

Hi!

One little note: making the changes and just waiting for CI to pass will make this process longer; it is encouraged to run the tests locally and push the changes once they pass.

The error is the following:

/home/runner/work/actions-sdk/actions-sdk/giza_actions/model.py:206: in _download_model
    onnx_model = self.version_client.download_original(
E   Exception: Token expired or not set. API Key not available. Log in again.
        self       = <giza_actions.model.GizaModel object at 0x7fcfd4865890>

Here we can see that _download_model is being executed, and the next call in the stack is self.version_client.download_original, which is trying to get the model from the API and fails to do so without credentials; that means we need to patch this.

With the recent changes, _download_model is executed every single time, as we now always derive an output path to use with the cache.

To solve this we could start by patching the self.version_client.download_original function. Checking this function, the dependency comes from the __init__ method:

            self.model_client = ModelsClient(API_HOST)
            self.version_client = VersionsClient(API_HOST) # < This one 
            self.api_client = ApiClient(API_HOST)
            self.endpoints_client = EndpointsClient(API_HOST)

In order to patch this, we need to patch the imported client from the script (docs here: https://docs.python.org/3/library/unittest.mock.html#where-to-patch):

@patch("giza_actions.model.VersionsClient", return_value=b"some bytes")

We add a return value of bytes because that is what self.version_client.download_original returns.

Hope this helps @shivam6862

@Gonmeso Gonmeso self-assigned this Apr 24, 2024
@shivam6862 (Contributor Author) commented Apr 24, 2024

[Screenshot: test run showing all tests passing]

@Gonmeso All test cases are passing now.
Please check whether these are the required changes.

@shivam6862 (Contributor Author)

@Gonmeso Done: ran pre-commit run --all-files on all files, which was previously done only for .gitignore.

@Gonmeso (Contributor) left a comment

Thanks for your input!

LGTM!!!!

Great work @shivam6862

@Gonmeso Gonmeso merged commit f5b84eb into gizatechxyz:main Apr 25, 2024
1 check passed
Successfully merging this pull request may close these issues.

Add a cache for already downloaded models