Prevent duplicate entries using unique index #128
base: master
Conversation
Addresses #89

Using a generated column to make a hash of cron task uniqueness lets us add a unique index to the table. Changing the `$wpdb->insert()` call to use `replace()` means that an existing record will be updated rather than causing an error.

Need to determine if this is really the right way to define uniqueness. There's also potential for improving lookup performance by using a generated hash column for `args` as well, since that's more reasonable to add an index for.
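Roughly, the schema change is shaped like this (a sketch only; `uniq_hash` is a placeholder name, and the hashed column list is assumed from the discussion below rather than copied from the actual migration):

```sql
-- Sketch: a stored generated column hashing the fields that define a job's
-- uniqueness, with a unique index over it (works on MySQL 5.7+).
ALTER TABLE `wp_cavalcade_jobs`
  ADD COLUMN `uniq_hash` BINARY(32)
    GENERATED ALWAYS AS (
      UNHEX(SHA2(CONCAT_WS(
        '-', `site`, `hook`, `args`, `nextrun`, `status`, `schedule`
      ), 256))
    ) STORED,
  ADD UNIQUE INDEX `uniq` (`uniq_hash`);
```

With the unique index in place, switching `$wpdb->insert()` to `$wpdb->replace()` turns a would-be duplicate-key error into an overwrite of the existing row.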
Additional note: I went with a generated column rather than a functional index because a functional index would require MySQL 8.0.13 as a minimum requirement. Maybe we're fine with that, but we'd need to add a guard against older versions.
Hm, unschedule test failing; it can't find the record...
So from memory, not having this is basically intentional. Effectively, the uniqueness constraint would be on every column (a minimum of hook, args, site, nextrun, and status), so the index basically duplicates the entire table. This leads to two problems: the cardinality of the index is basically the entire table, so the storage requirements are very high, and updating or adding any item slows down writes. Using a generated column here will also increase the storage requirements, since stored generated columns are materialized (basically as a cache). In terms of enforcing uniqueness, simply adding a UNIQUE key will just make any additional calls fail anyway, which would surface the duplicates as errors rather than handling them.

The root cause of this really is that WordPress isn't designed for concurrency, so there are tradeoffs no matter what you do; ideally, we'd lift it further up the stack so that the check-before-scheduling pattern is handled there.
It's not a massive amount of extra storage per row (one 32-byte column), and the index only needs to be against the hash in this case, so it's not duplicating the entire table at all. This isn't an uncommon pattern. There could be a further saving if we were to require MySQL 8.0.13+, as then you wouldn't need the intermediary column, e.g.

```sql
ALTER TABLE `wp_cavalcade_jobs` ADD UNIQUE INDEX `uniq` ((
  UNHEX(SHA2(CONCAT_WS(
    '-', `site`, `hook`, `args`, `nextrun`, `status`, `schedule`
  ), 256))
));
```

I get that WP isn't designed for concurrency, but Cavalcade definitely is, so while core is stuck on the current system (see this attempt to do exactly what you suggest from 4+ years ago), why not fix it in Cavalcade, where the negative impact of duplicates is amplified? I agree regarding …
I think you misunderstood my point about cardinality; it's not that you're duplicating the table per se, but rather that you have the same number of buckets (cardinality) as just using the whole table, so you don't gain anything in terms of query performance.

There are two parts to indexes: improving performance and enforcing uniqueness. For performance, indexes are basically about being able to discard as much data as possible before you do an actual query, and this doesn't help with that; in fact, because it has such high cardinality, it may hurt (and will almost certainly hurt write performance). (e.g. an index on status when you're querying status = waiting means you instantly discard 75% of the data before querying.) For example, an index on status would look like:
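Something like this (illustrative; the row IDs and proportions are made up):

```
status = 'completed' -> rows 1, 3, 4, 6, ...  (~75% of the table)
status = 'waiting'   -> rows 2, 7, ...
status = 'failed'    -> rows 5, ...
```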
This means that the query for `status = 'waiting'` can discard every other bucket before it ever touches the rows.

For this index, if you have e.g. 100k rows, you're going to effectively have 100k buckets, where each bucket looks like:
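Roughly (made-up hash values; one bucket per row):

```
hash(row 1's columns) -> row 1
hash(row 2's columns) -> row 2
...                      (100k rows means 100k buckets)
```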
This isn't a good tradeoff for performance, as it might even be worse than the original; MySQL's query planner may even discard the index entirely without using it.

Using it purely as a uniqueness constraint doesn't really work either, because the nextrun needs to be part of the index, and that is (basically by definition) unique for each schedule call. For example, if you have a daily job which fails and restarts multiple times, you would have rows like (simplified):
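(Illustrative values; the hook name and timestamps are invented, and the other columns are omitted.)

| id  | hook         | nextrun             |
|-----|--------------|---------------------|
| 101 | my_daily_job | 2020-06-01 00:00:00 |
| 102 | my_daily_job | 2020-06-01 00:05:00 |
| 103 | my_daily_job | 2020-06-01 00:10:00 |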
The only difference between these rows is the `nextrun`, so a unique index that includes it never actually treats them as duplicates.

The only valid unique constraint for the table, with the way that both Cavalcade and WordPress are designed, really is the ID, which is already the primary key and hence unique. Changing this would change the design.
The problem is making an assumption at a low level about what the higher level is doing. Without handling and surfacing the problem correctly, you're baking in assumptions that may not hold, which then leads to a system that is difficult to reason about.
Hmm, so in the interests of something that is easy to reason about, perhaps we could add a duplicate-entry clean-up routine instead.
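Something along these lines, run periodically (a sketch only; it assumes duplicates are rows identical in every scheduling column, and the table/column names follow the earlier examples):

```sql
-- Sketch: delete any row that has an older twin with identical
-- scheduling columns, keeping the lowest id of each group.
DELETE newer FROM `wp_cavalcade_jobs` AS newer
JOIN `wp_cavalcade_jobs` AS older
  ON  newer.`site`    = older.`site`
  AND newer.`hook`    = older.`hook`
  AND newer.`args`    = older.`args`
  AND newer.`nextrun` = older.`nextrun`
  AND newer.`status`  = older.`status`
  AND newer.`id`      > older.`id`;
```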