Making cachalot caches faster by partitioning data to avoid invalidation #164
@Andrew-Chen-Wang
No. Currently cachalot only seems to be useful if your model, just that model, runs a single type of SELECT query. I imagine the reason is that the original author's intention is for you to constantly do … What I'm going to do is allow caching of ALL queries rather than just one per model. It'll be an experimental feature, I suppose. I'll post my draft proposal at some point next week; I've just been busy lately.
I like the idea of making cache invalidation more conservative, though it might be tricky to get right. Cachalot's simplicity might also be its greatest strength: I've tried several of the other caching apps for Django and keep coming back to cachalot as the only 'install it and forget about it' option that delivers reliable cache hits. The others all seem to make compromises that can produce surprising results, or require the user to do fairly hands-on, active cache management themselves anyway. Some examples from other projects' docs might help to illustrate the challenges of implementing finer-grained invalidation:
I'm not saying that cachalot does not suffer from some of the above issues as well (though I am not aware of any), but it does seem to avoid large categories of invalidation-related issues by focusing solely on the table level. So I guess what I am saying is that there are already various options out there that try to do cache invalidation at a finer level of granularity with some degree of success. But in my experience cachalot is the only one that produces query results that are not stale and that are based on the data as it actually exists in the database at any given point in time.
Everything mentioned above is probably already captured more succinctly in the feature comparison table included in django-cachalot's project documentation. @Andrew-Chen-Wang, do you think it would be possible to improve cache invalidation while still maintaining all of the features / benefits as outlined in that table?
Edit: sorry, to address those concerns listed above: we catch everything, including .update(). The original author was pretty smart (or, to others, really crazy) because the author monkey patched the shite out of the main Django code that compiles all the SQL, and we just hash the SQL and cache the result. That's why cachalot doesn't really have issues like:
- Update event isn't triggering
- .values() isn't working
Other concerns, like `.count()`, were covered by the original author, but it's super easy to mend cachalot's code base since everything's monkey patched, so debugging is pretty easy.
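In case it helps to picture what "hash the SQL and cache the result" means, here is a minimal sketch of that idea. It is not cachalot's actual code (cachalot's real key generation lives in cachalot/utils.py); the helper names are made up for illustration:

```python
import hashlib

from django.core.cache import cache


def query_cache_key(queryset):
    # Hypothetical helper: compile the queryset to SQL + params and hash it.
    # This only illustrates the "hash the SQL" idea; it is not cachalot's
    # real key generator.
    sql, params = queryset.query.get_compiler(queryset.db).as_sql()
    raw = "%s:%r" % (sql, params)
    return "example:" + hashlib.sha1(raw.encode()).hexdigest()


def cached_list(queryset, timeout=300):
    # Hypothetical helper: return the cached rows if present, otherwise
    # evaluate the queryset and cache the result under the SQL hash.
    key = query_cache_key(queryset)
    rows = cache.get(key)
    if rows is None:
        rows = list(queryset)
        cache.set(key, rows, timeout)
    return rows
```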
Regarding this proposal in general, it's just a proposal. I'm probably going to do a versioning system: choose how you want to cache everything, either by selecting this new feature, the old one, or both at the same time.
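Purely as an illustration of that idea (the setting name below is hypothetical, not an existing cachalot option), opting in might look something like this:

```python
# settings.py (hypothetical setting, for illustration only)
# "table"  = today's whole-table invalidation
# "column" = the proposed column-partitioned invalidation
CACHALOT_INVALIDATION_MODES = ["table", "column"]
```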
This is my current draft proposal (note that it was written before I found out that every new query on a table replaces another query's cache). Once I finish writing it, it'll make a little more sense, and I'll hide this comment.

Cachalot should specifically find which values are grabbed per query. Let's say we have a table like this:

```python
from django.db import models


class Book(models.Model):
    id = models.BigAutoField(primary_key=True)
    title = models.CharField(max_length=255)
    content = models.TextField(max_length=1000000)
```

Follow these insertions and SELECTs:

```python
book1 = Book.objects.create(title="hello1", content="content")
book2 = Book.objects.create(title="hello2", content="content")

# Make a SELECT and force Django to evaluate it.
query1 = list(Book.objects.values_list("id", "title").filter(title__startswith="hello"))
```

The query is cached. It's cached by finding the actual SQL query (see django-cachalot/cachalot/utils.py, lines 63 to 95 in d0b5213).
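To see roughly what SQL gets cached for that query, you can print it from the queryset (the exact text depends on your app label and database backend):

```python
qs = Book.objects.values_list("id", "title").filter(title__startswith="hello")
print(qs.query)
# Roughly:
#   SELECT "app_book"."id", "app_book"."title"
#   FROM "app_book"
#   WHERE "app_book"."title" LIKE hello%
```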
Example 1: specifying which columns to grab

So here's what I'm proposing. Let's say we update a specific column:

```python
# Grabs only id and title, with a specific title parameter:
query1 = list(Book.objects.values_list("id", "title").filter(title__startswith="hello"))
# Grabs all columns
query2 = list(Book.objects.filter(title__startswith="hello"))

# Update attribute "content"
book1.content = "New Content 1"
book1.save(update_fields=["content"])
"""
You must specify update_fields. Otherwise, cachalot
(in this proposal) will invalidate all queries. In general,
it's also good practice, since the UPDATE query uses
the update_fields list to decide which fields should be updated.
If none are specified, all fields are updated.
"""

# Let's re-evaluate the two queries:
cached = list(Book.objects.values_list("id", "title").filter(title__startswith="hello"))
not_cached = list(Book.objects.filter(title__startswith="hello"))
```

What happened? The first query still has a cache, but the second query does not. The cache invalidation method is dependent on the columns that were grabbed. The first query only grabbed the `id` and `title` columns, so the update to `content` doesn't affect it; the second query grabbed all columns, including `content`, so its cache is invalidated.

Example 2: specifying which columns to filter on

Other worries
Method of Invalidation

We'll need to change how the cache keys are generated. Currently, every time a table is updated, regardless of whether many records, only one record, or just a specific column is being updated, all queries related to that table are invalidated. What I'm currently thinking (I've mostly forgotten about this; sorry, I underestimated the college workload, with a startup in tow too):
This way, when cachalot tries to grab the cache for a specific query, it first checks for the specific table (and which database alias was used), then it ALSO checks which columns are involved. If the …

Recall that cachalot is not a per-object caching system; there may be a better method, for example knowing whether a SELECT statement only grabbed a single object, but that is overcomplicating this and, design-wise, very difficult to determine with changing database schemas.

But why did I say "I'm stuck"? Let's recall the original purpose: individual column attribution. The current format could look something like: … Unlike a table name, which is a single "thing," the columns specified could be in any order (of course we can sort them alphabetically, but besides that), changing based on schema redesigns or SELECT ordering. Hashing isn't really an option: if we specify columns …

References

This is where we cache all queries:

django-cachalot/cachalot/monkey_patch.py, line 84 in d0b5213
django-cachalot/cachalot/monkey_patch.py, line 35 in d0b5213
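For what it's worth, one way around the column-ordering problem described above is to normalise the column set before it goes into the key. A minimal sketch, assuming we can collect the column names a query selects or filters on; these helpers are hypothetical and are not cachalot code:

```python
def columns_key_part(columns):
    # Sort and de-duplicate so ("title", "id") and ("id", "title")
    # produce the same key fragment regardless of SELECT ordering.
    return ",".join(sorted(set(columns)))


def partitioned_cache_key(db_alias, table, columns, query_hash):
    # <db/table part> : <columns part> : <cachalot-style query hash>
    return "%s.%s:%s:%s" % (db_alias, table, columns_key_part(columns), query_hash)


# Example: both orderings map to the same key.
partitioned_cache_key("default", "book", ["title", "id"], "abc123")
partitioned_cache_key("default", "book", ["id", "title"], "abc123")
# -> 'default.book:id,title:abc123' in both cases
```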
@jcass77 Sorry about the long proposal. You don't need to read it; it's just something to get out there for now since I'm busy with school. This proposal (and the other solution I mentioned) will still be in the code base, but that huge proposal I just wrote is mainly to fix this: "Useful for > 50 modifications per minute: X" in that table. Cachalot does per-table caching, but each table query is completely different and there's only one cache per table. UNLESS I'm reading the code wrong. I didn't have time to test it out, unfortunately, so I'm looking to do that next week and deduce whether it's true that there's only a single cache per table.
I wrote a quick test and it seems like django-cachalot maintains a cache for multiple queries on the same table, as expected:

```python
def test_cache_maintained_for_multiple_queries_on_same_table(self):
    # Create cached entry for first query
    with self.assertNumQueries(1):
        Test.objects.get(name='test1')
    with self.assertNumQueries(0):
        Test.objects.get(name='test1')

    # Create cached entry for second query
    with self.assertNumQueries(1):
        Test.objects.get(name='test2')
    with self.assertNumQueries(0):
        Test.objects.get(name='test2')

    # Verify both queries are still cached
    with self.assertNumQueries(0):
        Test.objects.get(name='test1')
    with self.assertNumQueries(0):
        Test.objects.get(name='test2')
```

Not sure if that covers the use case you are referring to, but it seems fine.
Yup, as expected :) Sorry, I was just not reading the code right. Edit: ugh, my mistake. I didn't see this (my brain was not functioning correctly last night, sorry): django-cachalot/cachalot/monkey_patch.py, line 58 in d0b5213
The per-query caching is correct, but this issue was more about looking into the values selected and filtered on per query. Instead of invalidating an entire table, only invalidate caches that involve certain columns. So if the column/attribute title was updated for a single object or for many, then invalidate any cache that filters on or selects the attribute title. If a query didn't, then don't invalidate that cache. In other words, more refined invalidation of caching.
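As a rough sketch of that rule, assuming a hypothetical cache key that records which columns a query touched, e.g. a "table:col1,col2:queryhash" layout (not anything cachalot produces today):

```python
def should_invalidate(cache_key, updated_columns):
    # cache_key is assumed to look like "table:col1,col2,...:queryhash".
    columns_part = cache_key.split(":")[1]
    cached_columns = set(columns_part.split(","))
    # Invalidate only if the cached query touched at least one updated column.
    return bool(cached_columns & set(updated_columns))


# An update that only changed "content" keeps the id/title query alive:
should_invalidate("book:id,title:abc123", ["content"])          # False
should_invalidate("book:id,title,content:def456", ["content"])  # True
```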
Could you point me to the code I need to patch to achieve "adding a prefix to cache keys"? The main goal is to add a prefix by hostname. Thanks
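Not an official answer, but if a per-host prefix is all that's needed, Django's cache framework already supports a KEY_PREFIX on the backend that cachalot uses, so no patching is required. A sketch, assuming the django-redis backend; adjust the backend and LOCATION to your setup:

```python
# settings.py
import socket

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379/1",
        # Every key written through this cache, including cachalot's,
        # is prefixed with the current hostname.
        "KEY_PREFIX": socket.gethostname(),
    },
}
```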
Description
Besides the new PR which allows for a context manager to disable cachalot, there needs to be a better way to avoid constant cache invalidation.
The following proposal is based on django-cache-machine's per object caching. Although we're caching every query, we can include some extra information inside that cache key. Inspired by some work with Simple JWT and PRNGs, I believe what we can do is the following:
Each cache key is a three-part key. Let's say there's a modification to a single row; we don't want to invalidate all caches related to that table. Instead, we can use cache key wildcards, e.g.
`cache.delete("beginning:TABLE CACHE KEY:CACHALOT ORIGINAL GENERATED CACHE KEY")`,
to find any data that pertains to the primary key of the table. In other words, all objects' IDs need to be fed into some kind of hashing algorithm that outputs a digest of their IDs for the first part. Then, using the table cache key generator, we can use that for the second part (i.e. in the format of ASD:ASD:ASD). This way, every time a certain primary key (or several) is modified, our algorithm can filter the cache using that aforementioned format.
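As a rough illustration of that wildcard idea: `delete_pattern` is a django-redis extension, not part of Django's generic cache API, and the key layout here is just the proposed three-part format, not anything cachalot produces today:

```python
from django.core.cache import cache  # assumes the django-redis backend


def three_part_key(pk_hash, table_key, query_key):
    # "<hash of affected IDs>:<table cache key>:<cachalot's original query key>"
    return "%s:%s:%s" % (pk_hash, table_key, query_key)


def invalidate_rows(pk_hash, table_key):
    # Drop only the cached queries whose key starts with this IDs/table pair.
    cache.delete_pattern("%s:%s:*" % (pk_hash, table_key))
```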
Rationale
Django-cachalot has this single issue: it caches everything. All queries are cached, and once a modification is made on a table, all caches linked to that table are invalidated. Not good, especially with the resulting new cache misses adding heavy latency that could eventually crash the server.