Allow flexible caching policies #299

Open
delgadom opened this issue May 18, 2017 · 1 comment


The Problem

  • Most user machines can't use a cache (some individual files are too large)
  • Users without a cache will likely still want to cache specific files that are used frequently
  • Specific operations could benefit from temporary caching
  • Changes to the caching system cannot be disruptive to users down the line

I'm thinking the best way forward here is a cache policy object rather than a string, but I'm open to other options.
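For a rough sense of the difference (all names here are illustrative, not the actual datafs API): with a bare string, every call site has to repeat the dispatch logic, while a policy object owns the decision and can weigh file size, version, or other context:

```python
def should_cache_string(policy_string):
    # with a bare string, every call site repeats this dispatch
    return policy_string in ('latest', 'all')


class SizeAwarePolicy(object):
    # a policy object can encapsulate richer rules (hypothetical example)
    def __init__(self, max_size):
        self.max_size = max_size

    def get_cache_action(self, file_size):
        # cache only if the file fits under the size limit
        return file_size <= self.max_size


print(should_cache_string('never'))               # False
print(SizeAwarePolicy(100).get_cache_action(50))  # True
print(SizeAwarePolicy(100).get_cache_action(500)) # False
```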

@delgadom delgadom added this to the 1.0 milestone May 18, 2017

delgadom commented May 18, 2017

Proposed Solution: the CachePolicy class

When working with files in data_file.py, pass a cache_policy object instead of a string; the policy object decides on the fly how to handle each file object:

# in datafs/core/data_archive.py:

class DataArchive(object):

    ...

    def get_local_path(self, ...):

        ...

        data_file.get_local_path(
            self.authority,
            self.api.cache,
            self.api.cache_policy,
            ...)

The cache_policy object's methods are invoked with the archive being accessed, and could draw on the full set of API, DataArchive, and current-context settings:

# somewhere in the refactored version of datafs/core/data_file.py:

    def handle_file_version_caching(self, ...): # or something

        # decide on caching behavior for current file
        cache_file = api.cache_policy.get_cache_action(archive, archive_version)

This cache_policy object could thus decide what to do based on a number of factors, and we could easily allow users to override configuration-based policies in real time:

>>> # cache latest version of all archives
... api.cache_policy.set_policy('latest')

>>> # limit cached items to 100 MB
... api.cache_policy.set_size_limit(100*1024*1024)

>>> # small archives are now cached
... archive = api.get_archive('ACP_climate_gcm-modelweights.csv')
>>> api.cache_policy.get_cache_action(archive, 'latest')
True

>>> # large archives are not cached
>>> archive = api.get_archive(
...     'GCP/climate/SMME/tasmin/grid025/daily/rcp85/pattern31/2046/SON.nc')
>>> api.cache_policy.get_cache_action(archive, 'latest')
False

We could also allow temporary cache policies:

>>> # temporarily cache all versions of archives
... with api.cache_policy.temporary_policy('all', size_limit=None):
...     archive = api.get_archive(
...         'GCP/climate/SMME/tasmin/grid025/daily/rcp85/pattern31/2046/SON.nc')
...
...     api.cache_policy.get_cache_action(archive, version='0.0.1')
...
True
>>> api.cache_policy.get_cache_action(archive, version='0.0.1')
False

Implementation notes

Most importantly, we wouldn't have to roll out all of these features immediately, as new CachePolicy features would not be API-breaking and should be backwards-compatible.

We could start with a very simple class that pulls the cache from the config:

# in datafs/core/cache.py

from contextlib import contextmanager

class CachePolicy(object):
    '''
    Parameters
    ----------
    policy : str, optional
        Global caching policy, e.g. from api config (default 'never')
    max_size : int, optional
        Maximum size limit on downloads (default `None`)
    '''

    def __init__(self, policy='never', max_size=None):
        self._policy = policy
        self._max_size = max_size

    @property
    def policy(self):
        ''' could be overridden later to provide temporary policies, etc '''
        return self._policy

    @property
    def max_size(self):
        ''' ditto '''
        return self._max_size

    @contextmanager
    def temporary_policy(self, policy=None, max_size=None):
        '''
        Set a temporary policy and max_size for caching

        These settings only affect actions taken using this api instance
        while in this context manager. Changes to the cache policy while
        in this context will be rolled back on exit.
        '''

        _prev_policy = self.policy
        _prev_max_size = self.max_size

        if policy is not None:
            self._policy = policy

        if max_size is not None:
            self._max_size = max_size

        try:
            yield

        finally:
            self._policy = _prev_policy
            self._max_size = _prev_max_size

    def get_cache_action(self, archive, version):
        '''
        Determines whether to cache the object on write

        Parameters
        ----------
        archive : DataArchive
            the archive under consideration
        version : str or version
            the version of the archive under consideration

        Returns
        -------
        cache_action : bool
            Value indicating whether the file should be cached
        '''

        if self.max_size is not None:
            archive_size = archive.authority.getinfo(version).get('size')

            if (archive_size is None) or (archive_size > self.max_size):
                return False

        # 'never' means no caching; any other policy caches
        # (returns a bool, per the docstring)
        return self.policy != 'never'
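To sanity-check the behavior, here's a quick, self-contained run of the class above against stub archive/authority objects. The stubs, the bool-returning get_cache_action, and the getinfo return shape are my additions for illustration; note also that in this sketch max_size=None means "leave unchanged" in temporary_policy, so lifting the limit temporarily needs an explicit sentinel like float('inf'):

```python
from contextlib import contextmanager

class CachePolicy(object):
    def __init__(self, policy='never', max_size=None):
        self._policy = policy
        self._max_size = max_size

    @property
    def policy(self):
        return self._policy

    @property
    def max_size(self):
        return self._max_size

    @contextmanager
    def temporary_policy(self, policy=None, max_size=None):
        _prev_policy, _prev_max_size = self._policy, self._max_size
        if policy is not None:
            self._policy = policy
        if max_size is not None:
            self._max_size = max_size
        try:
            yield
        finally:
            self._policy = _prev_policy
            self._max_size = _prev_max_size

    def get_cache_action(self, archive, version):
        if self.max_size is not None:
            archive_size = archive.authority.getinfo(version).get('size')
            if (archive_size is None) or (archive_size > self.max_size):
                return False
        return self.policy != 'never'


# minimal stand-ins for the authority/archive objects
class StubAuthority(object):
    def __init__(self, size):
        self._size = size
    def getinfo(self, version):
        return {'size': self._size}

class StubArchive(object):
    def __init__(self, size):
        self.authority = StubAuthority(size)


policy = CachePolicy(policy='latest', max_size=100 * 1024 * 1024)

small = StubArchive(10 * 1024 * 1024)    # 10 MB: under the limit
large = StubArchive(500 * 1024 * 1024)   # 500 MB: over the limit

print(policy.get_cache_action(small, 'latest'))   # True
print(policy.get_cache_action(large, 'latest'))   # False

# float('inf') stands in for "no limit"; max_size=None would leave the
# existing 100 MB limit in place
with policy.temporary_policy('all', max_size=float('inf')):
    print(policy.get_cache_action(large, 'latest'))  # True inside the context

print(policy.get_cache_action(large, 'latest'))      # False again on exit
```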

Limitations

  • This does not fix any of the problems associated with using large datafs files from the cli/other programs
  • This does not in itself provide the ability to do archive-specific caching, though presumably this feature could be added to the CachePolicy class without too much trouble. Perhaps a cache policy file that is stored in the same directory as the config file could specify custom caching policies at the archive level? This could be added down the line without breaking backwards- and forwards-compatibility.
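One possible shape for that archive-level policy file, sketched as a glob-pattern lookup (the patterns, file format, and policy_for helper are all invented for illustration; a real version would load the mapping from a file next to the config):

```python
import fnmatch

# hypothetical per-archive overrides, e.g. loaded from a file alongside
# the datafs config
ARCHIVE_POLICIES = {
    'ACP_*': {'policy': 'all'},            # small, frequently used: cache all versions
    'GCP/climate/*': {'policy': 'never'},  # huge climate files: never cache
}

def policy_for(archive_name, default='latest'):
    # first matching glob pattern wins; fall back to the global policy
    for pattern, override in ARCHIVE_POLICIES.items():
        if fnmatch.fnmatchcase(archive_name, pattern):
            return override['policy']
    return default

print(policy_for('ACP_climate_gcm-modelweights.csv'))  # all
print(policy_for(
    'GCP/climate/SMME/tasmin/grid025/daily/rcp85/pattern31/2046/SON.nc'))  # never
print(policy_for('some_other_archive.csv'))  # latest
```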

thoughts?
