Allow caches to opt-in to granular cleanup #863

jakearchibald · 2016-03-31T14:57:55Z

Something like:

cache.settings({
  removalStrategy: "least-matched"
});

…where the developer can allow the browser to auto-delete items from the cache based on some kind of strategy.

Apologies for the really rough idea, just wanted to add something before I forgot 😄

kornelski · 2016-03-31T15:03:08Z

The old world browser cache (non-SW) does not require any cleanup or special care, and that's very convenient.

In SW I'd also like to have the same type of cache, where I can carelessly keep putting things in forever and have the browser worry about managing disk usage and cleanup. I'd rather not specify any maximum size or cleanup method; I'm OK with it being left up to the implementation.

In my app I have maybe 5 files for which I need a guarantee that they'll stay (HTML + JS), and an ever-growing list of thousands of assets (images, subpages) that can come and go. The app will recover if they suddenly go missing, so a leaky cache would be great for them.

wanderview · 2016-03-31T15:05:58Z

I think this is covered by the v2 storage spec proposal. It had a way of marking a box such that individual items could be removed based on LRU or frecency. Doing it at the storage spec level would in theory create a consistent system for other storage APIs as well.

jakearchibald · 2016-03-31T16:04:01Z

I'm worried that the storage spec won't be able to provide the required granularity, but if it can, great!

annevk · 2016-04-01T12:09:09Z

Yeah, the way we could do this is by filling out the details of whatwg/storage#18. We have a box that can be used by dozens of APIs to store stuff on. We should probably define what kind of items go in a box and what kind of metadata they carry.

delapuente · 2016-04-01T16:26:49Z

As I don't know the details of the storage API, what I'm about to say may already be covered, but it would be convenient to have some kind of functional event, triggered under memory pressure, that allows us to implement a resource release strategy or to opt for a built-in one.

wanderview · 2016-04-01T16:52:36Z

> As I don't know the details about storage API perhaps what I'm going to comment is already covered but it would be convenient to have some kind of functional event triggered under memory pressure that allow us to implement a resource release strategy or to opt for a built-in one.

I believe this is what @slightlyoff has favored in the past (and now?), but gecko storage folks don't like it. Some of our reasons:

Spinning up potentially hundreds or thousands of origin service workers to fire an event, which may or may not remove enough space, is a non-trivial thing to do in terms of memory/cpu.
Origins don't have enough information to reason about what to remove. Has this game level been accessed more recently than resources in other origins? It has no idea.

I feel like I'm missing one. Maybe @sicking remembers.

So from the gecko team's perspective we prefer an API that lets an origin declare the relative persistence of data and then let the browser reason about those resources in aggregate. The browser can make better decisions with a master list of frecency. Right now that list just operates at the origin level, but the proposed spec would let entries in the frecency list refer to individual boxes or individual items within a box.

annevk · 2016-04-01T17:05:12Z

I think that is correct, https://wiki.whatwg.org/wiki/Storage#Why_no_eviction_event.3F captures it too.

delapuente · 2016-04-01T17:17:17Z

I see. Thank you for the clarifications. What about allowing client code to annotate usage metadata to improve global reasoning... ? I think I'm going to read the storage spec.

annevk · 2016-04-01T17:23:53Z

@delapuente yeah, I think that's something everyone is comfortable with, though how exactly that should be done is still unclear. It's very much a post-v1 feature.

WebReflection · 2016-05-18T12:20:59Z

Wouldn't simply reflecting an eventual expiration header be useful? The server could give a hint about how long an item should be kept, and the cache would be aware of it. It would also simplify grateful engagement. Is this a bad idea? Has this approach been considered already? Is this already the case? Thanks

WebReflection · 2016-05-18T12:37:12Z

Enhancement *

indolering · 2016-10-08T06:15:10Z

@wanderview I agree that this wouldn't be very useful for compaction on a global level, but what about at a local level? You can't rely on local storage anyway, so you end up checking if something is in the cache, fetching it if it isn't, and pushing it into the cache. Managing your resource quota entails purging according to a TTL index or managing increasingly complex accounting on what has been used. I would love to be able to specify the size of the box and let the browser handle the rest.
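The check, fetch, and store loop described above can be sketched as follows. This is only an illustration: `matchOrFetch` is a hypothetical helper name, and `fetchFn` is injected (an assumption made so the helper can run outside a service worker; in a worker you would normally pass the global `fetch`).

```javascript
// Hypothetical helper: serve from the cache when possible, otherwise fetch and store.
async function matchOrFetch(cache, request, fetchFn) {
  const cached = await cache.match(request);
  if (cached) {
    return cached;
  }

  const response = await fetchFn(request);
  if (response.ok) {
    // A Response body can only be read once, so store a clone.
    await cache.put(request, response.clone());
  }
  return response;
}
```

Note that nothing here bounds the size of the cache; that accounting is the part the comment above would like the browser to take over.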

tomayac · 2017-12-05T14:26:00Z

Cache.keys() returns results in insertion order, so the cache replacement policies FIFO and LIFO are straightforward.
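As a rough illustration of the FIFO case, the sketch below trims a cache down to a maximum number of entries by deleting the oldest keys first. `trimFifo` is a hypothetical helper name; the only Cache methods it uses are `keys()` and `delete()`.

```javascript
// Hypothetical helper: keep at most `maxItems` entries, dropping the oldest first.
// Relies on cache.keys() resolving in insertion order.
async function trimFifo(cache, maxItems) {
  const keys = await cache.keys();
  const excess = Math.max(0, keys.length - maxItems);

  // The first `excess` keys are the oldest insertions.
  await Promise.all(keys.slice(0, excess).map((key) => cache.delete(key)));

  return excess; // number of entries removed
}
```

For LIFO, the newest entries sit at the end of the same array, so the slice would be taken from the tail instead.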

It would, however, be convenient if there was a direct way to get the "last accessed time" (and maybe even the "added time") in a fictive Cache.has(request, {options}) method (that would not return a Response like Cache.match(request, {options}), but rather a likewise fictive CacheItem object with the timeAdded and timeLastAccessed timestamps and a matches boolean) for more straightforward LRU/MRU.

Note that this is already possible now, but there is quite some overhead—via @wanderview's comment:

var cache;
// Open the cache through the global CacheStorage object (`caches`).
caches.open('foo').then(function(foo) {
  cache = foo;
  return cache.match(request);
}).then(function(response) {
  if (response) {
    // update order of entries in keys()
    cache.put(request, response.clone());  // <== overhead happens here
    return response;
  }

  var maxItems = 100;
  return addToLRU(cache, maxItems, request);
});

function addToLRU(cache, maxItems, request) {
  return cache.keys().then(function(keys) {
    if (keys.length < maxItems) {
      return cache.add(request);
    }

    return cache.delete(keys[0]).then(function() {
      return cache.add(request);
    });
  });
}

This would not take care of eviction (why no eviction), but it would simplify common cache replacement policies, without having to resort to IndexedDB. What do you think?

gauntface · 2017-12-05T22:42:40Z

I like the idea of this, but I'm not 100% certain we know which common cleanup approaches are desirable or what config they would need. (Workbox has some of this, but illustrating the behavior is difficult and it causes a lot of developer confusion, possibly due to implementation problems and/or debugging troubles for end users during local testing.)

Aside: If IndexedDB wasn't so painful to use, I don't think this feature would be as desirable as it seems on the surface.

tomayac · 2017-12-07T12:51:00Z

Thanks, @gauntface! Any further thoughts from the Workbox team or others in general? Should I fork this (this being a Cache.has(request, {options}) method that returns a CacheItem object) out into its own Issue, @jakearchibald? Please advise. Merci!

jakearchibald · 2017-12-19T16:48:36Z

The problem with "last accessed time" is the difference between "accessed" and "used".

If I do cache.keys() have I just made the access time equal across all items of my cache? What if I get an item out of the cache but don't actually use it?

tomayac · 2017-12-19T17:01:52Z

Probably only calls to cache.match() and caches.match() should be counted as “accessed”, but not cache.keys(). How does that sound?

jakearchibald · 2017-12-19T17:24:57Z

"last matched" then? I guess that's useful. Could you show some code developers would write to clear up a cache, using the proposed API?

delapuente · 2017-12-20T10:37:25Z

And what about explicitly marking them for disposal? What if we provide a general disposal queue and the developer can simply mark a response as disposable:

let response = await caches.match(request);
if (response) {
  handle(response);
  response.isDisposable(true); // adds to the disposal queue
}

It gives control over semantics to the developer. We can provide extra parameters to common methods to ease batch operations:

let keys = await cache.keys(request, { disposable: true });
keys[keys.length - 1].isDisposable(false); // all matches are disposable except the last one

jakearchibald · 2017-12-20T10:56:01Z

If we want auto-cleanup within a cache I think it'd be more convenient if the settings applied to a cache rather than an individual response.

await cache.setEvictionBehaviour(…);

That way you're applying caching behaviour to the cache.

annevk · 2017-12-20T11:01:42Z

@jakearchibald I'd prefer it if we can sort out if we can do this on top of buckets somehow. Where you create a bucket with a policy and then associate cache items with that bucket. Maybe that's too complex though and we need buckets to contain an entire cache or nothing, but it's worth exploring a bit. In particular as you might want to share the eviction logic with items stored in other places, such as IDB.

annevk · 2017-12-20T11:03:48Z

(Note, the above comment doesn't apply to exposing "last matched time". Just to any kind of storage policy.)

tomayac · 2017-12-20T14:00:07Z

@jakearchibald, as a response to your comment, here is a still raw sketch of an API of a fictive Cache.has(request, {options}) that would return a likewise fictive CacheItem object:

// Create `CacheItem` foo at timestamp 1000
await cache.add('foo');

// Get statistics about the never matched `CacheItem`
await cache.has('foo');
// returns `{key: 'foo', timeCached: 1000, timeMatched: Infinity}`

// Get statistics about a non-existent `CacheItem`
await cache.has('bar');
// returns `undefined`

// Use `CacheItem` foo at timestamp 2000
await cache.match('foo');

// Get statistics about the now matched (=used) `CacheItem`
await cache.has('foo');
// returns `{key: 'foo', timeCached: 1000, timeMatched: 2000}`

This would allow for relatively simple to implement cache replacement policies that are independent of Indexed Database, like, for example, Least Recently Used:

const MAX_CACHE_ITEMS = 5;
const cache = await caches.open('lru-cache');

const addToCache = async (cache, url) => {
  const keys = await cache.keys();
  if (keys.length < MAX_CACHE_ITEMS) {
    return await cache.add(url);
  }
  const cacheItems = await Promise.all(
    keys.map(async key => await cache.has(key))
  );
  const lruCacheItem = cacheItems.sort(
    (a, b) => a.timeMatched - b.timeMatched
  )[0];
  await cache.delete(lruCacheItem.key);
  return await cache.add(url);
};

Something I'm unsure about is the appropriateness of Infinity for never matched CacheItems, but it makes the sorting easier. I'm definitely overlooking a number of corner cases, so looking for your thoughts. Thanks!
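One corner case with `Infinity`: for two never-matched items the comparator computes `Infinity - Infinity`, which is `NaN`, so the sort order between them is not well defined. A possible workaround, sketched below under the assumption that items look like the fictive `CacheItem` above (`{key, timeCached, timeMatched}`), is to fall back to `timeCached` when an item has never been matched. `pickEvictionCandidate` is a hypothetical helper name.

```javascript
// Works on plain objects shaped like the fictive CacheItem:
// { key, timeCached, timeMatched }, with timeMatched === Infinity when never matched.
function pickEvictionCandidate(cacheItems) {
  if (cacheItems.length === 0) {
    return undefined;
  }

  // Treat "never matched" as "last touched when it was cached".
  const lastUsed = (item) =>
    Number.isFinite(item.timeMatched) ? item.timeMatched : item.timeCached;

  // A single pass avoids sorting and never subtracts Infinity from Infinity.
  return cacheItems.reduce((oldest, item) =>
    lastUsed(item) < lastUsed(oldest) ? item : oldest
  );
}
```

With this fallback, an item that was cached long ago and never matched becomes the first eviction candidate rather than the last, which may or may not be the desired policy.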

jakearchibald · 2017-12-20T16:35:56Z

@annevk do you think the buckets approach would allow granular eviction within the bucket, or just control when the whole bucket can be evicted?

annevk · 2017-12-20T20:51:37Z

@jakearchibald I'm not sure, I was thinking per bucket, but we really haven't explored it much yet. I just think this is a problem that spans across storage APIs, so it'd be nice if we figured out a way to tackle it for all of them, so you don't end up having to put stuff in the Cache API that doesn't really fit there because of the eviction feature.

jakearchibald · 2018-10-25T13:50:49Z

Adding metadata seems like a good idea. We should look at what workbox stores, and help them do what they do, but faster.

tomayac · 2018-10-25T13:56:49Z

As an additional data point, Workbox, which is widely used in the industry, handles cache expiration via IDB timestamps:

jakearchibald · 2018-10-25T14:03:26Z

Seems like it would be better to store request/response into IDB.

tomayac · 2018-10-29T09:41:49Z

FYI: @jeffposnick, @philipwalton for Workbox requirements.

The tl;dr of this issue is:

There is a super rough proposal to add metadata ("added to cache" timestamp, "last accessed" timestamp) to cached items, so you no longer have to resort to IDB for this information in order to implement cache expiration strategies.
The (preferred?!) alternative would be to just allow IDB to store Requests and Responses.
At the TPAC F2F, we (IIRC) agreed that it was not desirable to implement these strategies as part of the Cache API itself (as initially suggested when opening the issue).

jeffposnick · 2018-10-29T14:35:08Z

(Some random thoughts, with the caveat that I might be missing context from the F2F discussion.)

If there aren't going to be any changes to the relevant standards' status quo, I am not sure that it makes sense to move away from Workbox's current model (IDB for timestamp metadata + the Cache Storage API for the Response bodies), given that it's a mature solution at this point.
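To make the shape of that model concrete, here is a minimal sketch of timestamps kept separately from the cached responses. It is not Workbox's code: `metadata` is a plain `Map` standing in for an IndexedDB object store, and the function names and option names (`maxEntries`, `maxAgeMs`) are assumptions chosen for illustration.

```javascript
// `metadata` stands in for an IndexedDB object store; a Map keeps the sketch short.
// It maps a URL to the timestamp (in ms) at which the entry was stored.
async function putWithTimestamp(cache, metadata, url, response, now = Date.now()) {
  await cache.put(url, response);
  metadata.set(url, now);
}

// Removes entries that exceed `maxEntries` (oldest first) or are older than `maxAgeMs`.
async function expireEntries(
  cache,
  metadata,
  { maxEntries = Infinity, maxAgeMs = Infinity } = {},
  now = Date.now()
) {
  const oldestFirst = [...metadata.entries()].sort((a, b) => a[1] - b[1]);
  const overflow = Math.max(0, oldestFirst.length - maxEntries);

  const expired = oldestFirst
    .filter(([, timestamp], index) => index < overflow || now - timestamp > maxAgeMs)
    .map(([url]) => url);

  for (const url of expired) {
    await cache.delete(url);
    metadata.delete(url);
  }
  return expired;
}
```

The two stores are updated in separate steps, so a failure between them can leave the metadata and the cache disagreeing.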

My main concern going into this is that it locks folks into using Workbox (or ngsw, which is the only other "service worker-y" framework that I'm aware of that's implemented cache expiration).

But perhaps if the official guidance is that IndexedDB should be used for this sort of thing, someone from the community will see that as an opportunity to write a standalone helper library to implement storage and expiration. That could then end up being an alternative for folks who would rather not opt in to using a framework like Workbox or ngsw.

philipwalton · 2018-10-29T16:07:32Z

Some thoughts as well:

IMO it's not great that when responding to a single request, developers wanting to do any sort of expiration need to check both the Cache API and IDB. Having data for the same request stored in multiple places makes it much easier for that data to get out of sync (e.g. the cache gets cleared but IDB doesn't, or vice versa), and then we need extra logic to handle those sorts of cases.
In addition to cache expiration, there are other use cases for making Request/Response objects structured cloneable and easily storable in IDB. Right now we have to do a fair amount of work to extract all the data from a request (including reading the body as a Blob) so it can be stored in IDB and retried during a sync event.
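The extraction work mentioned in the second point might look roughly like the sketch below. It is a simplified illustration, not Workbox's implementation: `serializeRequest` and `deserializeRequest` are hypothetical names, and properties such as `mode`, `credentials` and `referrer` are left out for brevity.

```javascript
// Turns a Request into a plain record that can be stored in IndexedDB.
async function serializeRequest(request) {
  const record = {
    url: request.url,
    method: request.method,
    // Headers is iterable as [name, value] pairs.
    headers: Object.fromEntries(request.headers),
  };

  if (request.method !== 'GET' && request.method !== 'HEAD') {
    // Clone first so the original request body stays readable.
    record.body = await request.clone().blob();
  }
  return record;
}

// Rebuilds a Request from a stored record, e.g. inside a sync event handler.
function deserializeRequest(record) {
  const { url, ...init } = record;
  return new Request(url, init);
}
```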

tomayac · 2018-10-30T07:50:51Z

> [F]or Workbox requirements

When I wrote this, I meant functional requirements to make cache expiration work (as in: "What data does Workbox need for its cache expiration logic?"); and not so much the way it's currently implemented and working. Sorry for not formulating the question more clearly.

@jeffposnick, I get that the current model is battle-tested and working well, yet as @philipwalton writes in his first bullet, the current experience for developers if they want to implement cache expiration themselves (i.e., not using Workbox) is not great.

So with your "Workbox implementor glasses" off :-) B, but your "experience gained through implementing Workbox glasses" on B-), would you (i) rather want IDB to be able to store Responses and Requests (including the aspects brought up in @philipwalton's second bullet), or (ii) to have more metadata primitives in the Cache API?

jakearchibald · 2018-10-30T10:05:43Z

@philipwalton

> Having data for the same request stored in multiple places makes it much easier for that data to get out of sync

Allowing arbitrary data to be stored with a response doesn't help this much. The data can easily get out of sync, although it's true both will be cleared at the same time.

> In addition to cache expiration, there are other use cases for making Request/Response objects structured cloneable and easily storable in IDB. Right now we have to do a fair amount of work to extract all the data from a request (including reading the body as a Blob) so it can be stored in IDB and retried during a sync event.

Agreed. This would be useful for background fetch too.

jakearchibald · 2018-10-30T10:19:40Z

@tomayac

> or (ii) to have more metadata primitives in the Cache API?

I think it's pretty obvious that arbitrary metadata storage along with responses would have made workbox easier. I don't think anyone's saying otherwise.

You could equally say "implementing Ember would have been easier if the whole of Ember was already in the platform".

The issues we raised in the F2F are:

Is it the right primitive to be adding to the platform?
We keep adding extra arbitrary storage along with APIs. Is this a pattern we should continue? Right now it's odd that some APIs have it and some don't.
Would it be better to do something lower level, like associating data across stores?
Would a simpler (and faster) storage make this not-a-problem?

jeffposnick · 2018-10-30T14:27:19Z

I'm at a bit of a loss whether there's specific feedback (if any) folks are looking for—it sounds like all of the relevant points were already discussed during the F2F, and I trust that the folks involved understand the developer needs and are balancing that against platform complexity.

philipwalton · 2018-10-30T14:43:04Z

> Allowing arbitrary data to be stored with a response doesn't help this much. The data can easily get out of sync, although it's true both will be cleared at the same time.

Right, my specific concern was this: given that some browsers apparently reserve the right to clear entries from the cache at their discretion, developers storing request metadata in IDB will need to write extra code to account for that possibility.
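A sketch of what that extra code could look like, assuming metadata is keyed by URL: walk the metadata and drop any record whose cached response has disappeared. `pruneOrphanedMetadata` is a hypothetical name, and `metadata` is a `Map` standing in for an IndexedDB object store.

```javascript
// `metadata` is a Map standing in for an IndexedDB store keyed by URL (an assumption).
async function pruneOrphanedMetadata(cache, metadata) {
  const orphaned = [];

  // Copy the keys first so entries can be deleted while looping.
  for (const url of [...metadata.keys()]) {
    // cache.match() resolves to undefined when the entry is gone.
    if ((await cache.match(url)) === undefined) {
      metadata.delete(url);
      orphaned.push(url);
    }
  }
  return orphaned;
}
```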

asakusuma · 2018-10-30T17:51:48Z

> Right, my specific concern was: given that some browsers apparently reserve the right to clear entries from the cache at their discretion.

Also, I believe there are different "Clear browsing data" options available to the user which allow a user to clear Cache but not IDB.

When does Workbox check expiration dates? Our experience has been that IDB latency is not reliably low enough to block time-sensitive operations, like serving requests. We put our expiration dates as a header in the cached response, and check the header when serving, to make sure the asset isn't expired. The IDB latency issue is discussed more in #1331.
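The header-based approach described above could be sketched like this. The header name `x-sw-expires-at` and both function names are made up for illustration; the comment does not say which header is actually used. Note that an opaque response (status 0) cannot be rebuilt this way, because the `Response` constructor rejects that status.

```javascript
// Hypothetical header name; any name that cannot clash with real response headers would do.
const EXPIRES_AT_HEADER = 'x-sw-expires-at';

// Returns a copy of `response` stamped with an absolute expiry time (ms since epoch).
function withExpiry(response, ttlMs, now = Date.now()) {
  const headers = new Headers(response.headers);
  headers.set(EXPIRES_AT_HEADER, String(now + ttlMs));
  return new Response(response.body, {
    status: response.status,
    statusText: response.statusText,
    headers,
  });
}

// Checks the stamp when serving; no IndexedDB lookup is needed.
function isExpired(response, now = Date.now()) {
  const raw = response.headers.get(EXPIRES_AT_HEADER);
  if (raw === null) {
    return false; // no stamp, so nothing to enforce
  }
  const expiresAt = Number(raw);
  // A malformed stamp is treated as expired.
  return !Number.isFinite(expiresAt) || now >= expiresAt;
}
```

The stamped copy would be what gets passed to `cache.put()`, and `isExpired()` would run on whatever `cache.match()` returns before it is served.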

Are there any other known use cases for Cache metadata besides the following?

Informing cache cleanup
Implementing cache entry expiration dates

jeffposnick · 2018-10-30T18:07:02Z

> When does Workbox check expiration dates? Our experience has been that IDB latency is not reliably low enough to block time-sensitive operations, like serving requests. We put our expiration dates as a header in the cached response, and check the header when serving, to make sure the asset isn't expired. The IDB latency issue is discussed more in #1331.

Workbox has a freshness sanity check prior to calling respondWith() that goes against the Date header in the cached Response object. We are doing that so as not to block on IndexedDB.

It expires entries based on IndexedDB metadata (maximum entries and/or maximum age) via a separate codepath that runs after respondWith() has been called, with a guard in place to prevent simultaneous clean-ups from happening at once.
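The two pieces described above, a quick check against the `Date` header and a guarded cleanup that runs off the critical path, can be sketched as follows. This is an illustration of the pattern only, not Workbox's code; `isFreshByDateHeader` and `singleFlight` are hypothetical names.

```javascript
// Freshness check against the standard `Date` response header.
function isFreshByDateHeader(response, maxAgeSeconds, now = Date.now()) {
  const dateHeader = response.headers.get('date');
  if (!dateHeader) {
    return true; // nothing to compare against
  }
  const sentAt = Date.parse(dateHeader);
  if (Number.isNaN(sentAt)) {
    return true; // unparseable date, so do not reject the response
  }
  return now - sentAt <= maxAgeSeconds * 1000;
}

// Wraps an async cleanup so that overlapping calls share a single run.
function singleFlight(cleanup) {
  let running = null;
  return () => {
    if (!running) {
      running = Promise.resolve()
        .then(cleanup)
        .finally(() => {
          running = null; // allow the next run once this one settles
        });
    }
    return running;
  };
}
```

In a fetch handler, the wrapped cleanup would typically be handed to `event.waitUntil()` after `event.respondWith()` has been called, so the response is not delayed by the IndexedDB work.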

wanderview · 2018-10-30T18:09:43Z

> Also, I believe there are different "Clear browsing data" options available to the user which allows a user to clear Cache but not IDB.

I don't think this is quite accurate, at least not in chrome and firefox. The separate "cached" data that can be cleared is http cache. I believe quota-managed storage is always cleared atomically.

This also matches the clear-site-data mechanism, which can clear the http cache separately, but always clears cache API and IDB together under "storage":

https://w3c.github.io/webappsec-clear-site-data/#grammardef-storage

philipwalton · 2018-10-30T18:40:22Z

@wanderview, not sure if you were referencing my comment or just @asakusuma's, but I was basing my claim on a post from the webkit.org blog:

> To keep only the stored information that is useful to the user, WebKit will remove unused service worker registrations after a period of a few weeks. Caches that do not get opened after a few weeks will also be removed. Web Applications must be resilient to any individual cache, cache entry or service worker being removed.

I'm not sure how often this actually happens in the wild though.

wanderview · 2018-10-30T19:01:38Z

@philipwalton Yea, I'm aware of that, although I've never understood how the webkit team expects sites to manage that kind of unpredictable platform behavior. There are many web platform features built around the concept of quota storage persisting together or being wiped all at once. Expecting sites to handle partial state due to unpredictable partial wipes seems quite challenging.

tomayac · 2020-01-08T22:39:31Z

@aarongustafson has brought this topic up again in a new MSEdgeExplainer document.

jakearchibald added the enhancement label Mar 31, 2016

jakearchibald added this to the Version 2 milestone Mar 31, 2016

jakearchibald mentioned this issue Apr 1, 2016

Allow all storage types to expire, not just cookies whatwg/storage#11

Open

jakearchibald mentioned this issue Nov 3, 2017

Create F2F agenda - 7 November 2017 #1206

Open

bsittler mentioned this issue Feb 3, 2018

The Cache objects do not expire unless authors *or users* delete the entries. #1276

Open

jatindersmann mentioned this issue Apr 23, 2018

Create F2F agenda - 25 October 2018 #1303

Closed

jeffposnick mentioned this issue Dec 14, 2018

Declarative routing #1373

Open

asutherland mentioned this issue Jan 23, 2019

Disallow starting readwrite transactions while readonly transactions are running? w3c/IndexedDB#253

Closed

asakusuma mentioned this issue Feb 19, 2019

Immediate Service Worker #1389

Open

inexorabletash mentioned this issue Sep 20, 2019

[IndexedDB] WebAppsWG TPAC F2F agenda (Fukuoka, Sep 19-20 2019) w3c/IndexedDB#288

Closed

inexorabletash mentioned this issue Sep 30, 2019

Allow indexing on Request/Response properties w3c/IndexedDB#305

Open

asutherland mentioned this issue Mar 12, 2021

Add maxCount for StorageBuckets WICG/storage-buckets#36

Closed

SteveBeckerMSFT mentioned this issue Oct 1, 2024

Should IndexedDB allow indexing on Request/Response properties #1730

Open
