Prevent duplicate entries using unique index #128

Open
wants to merge 3 commits into base: master

Conversation

@roborourke (Contributor) commented Nov 21, 2024

Addresses #89

Using a generated column to make a hash of cron task uniqueness lets us add a unique index to the table.

Changing the $wpdb->insert() call to use replace() means that an existing record will be updated rather than causing an error.

Need to determine if this is really the right way to define uniqueness.

There's also potential for improving lookup performance by using a generated hash column for args as well, as that's more reasonable to add an index for.
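For illustration, a minimal sketch of the approach described above might look like the following, assuming the default-prefix jobs table wp_cavalcade_jobs and an illustrative column name `uniqueness`; this is not the PR's exact schema change.

-- Stored generated column hashing the fields that define a job's uniqueness,
-- plus a unique index on that hash (works on MySQL 5.7+).
ALTER TABLE `wp_cavalcade_jobs`
  ADD COLUMN `uniqueness` BINARY(32)
    GENERATED ALWAYS AS (
      UNHEX(SHA2(CONCAT_WS('-', `site`, `hook`, `args`, `nextrun`, `status`, `schedule`), 256))
    ) STORED,
  ADD UNIQUE INDEX `uniq` (`uniqueness`);

-- With the unique index in place, REPLACE updates the clashing row instead of
-- raising a duplicate-key error on INSERT (other columns omitted for brevity).
REPLACE INTO `wp_cavalcade_jobs` (`site`, `hook`, `args`, `nextrun`, `status`, `schedule`)
VALUES (1, 'my_hook', 'a:0:{}', '2024-11-22 00:00:00', 'waiting', 'daily');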

@roborourke (Contributor, Author)

Additional note - I went with a generated column rather than a functional index because that would require MySQL 8.0.13 as a minimum requirement.

Maybe we're fine with that but would need to add a guard against it.

@roborourke (Contributor, Author)

Hm, unschedule test failing, can't find the record...

@rmccue (Member) commented Nov 22, 2024

So from memory, not having this is basically intentional.

Effectively, the uniqueness constraint would be on every column (a minimum of hook, args, site, nextrun, and status), so the index basically duplicates the entire table. This leads to two problems: the cardinality of the index is essentially that of the entire table, so the storage requirements are very high, and updating or adding any item slows down writes.

Using a generated column here will also increase the storage requirements since those are stored (basically as a cache).

In terms of enforcing uniqueness, simply adding a UNIQUE key will just make any additional insert fail anyway; switching this to replace means it "silently" swallows the conflict, but the behaviour also becomes harder to reason about.

The root cause of this really is that WordPress isn't designed for concurrency, so there are tradeoffs no matter what you do; ideally, we'd lift it further up the stack so that the pattern of if ( ! wp_next_scheduled() ) wp_schedule_event() would be a single atomic operation.
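As an illustration of what an atomic check-and-schedule could look like at the SQL level, here is a sketch using table and column names assumed from the discussion; it narrows the check-then-insert race window but is not something Cavalcade currently does, and fully closing the race would still need a unique constraint or explicit locking.

-- Insert the job only if an equivalent waiting job does not already exist,
-- combining the existence check and the insert into a single statement.
INSERT INTO `wp_cavalcade_jobs` (`site`, `hook`, `args`, `nextrun`, `status`)
SELECT 1, 'my_hook', 'a:0:{}', '2024-11-23 00:00:00', 'waiting'
FROM DUAL
WHERE NOT EXISTS (
  SELECT 1 FROM `wp_cavalcade_jobs`
  WHERE `site` = 1 AND `hook` = 'my_hook' AND `args` = 'a:0:{}' AND `status` = 'waiting'
);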

@roborourke (Contributor, Author)

It's not a massive amount of extra storage per row, one 32-bit column, and the index only needs to be against the hash in this case, so it's not duplicating the entire table at all. This isn't an uncommon pattern.

There could be a further saving if we were to require MySQL 8.0.13+, as then you wouldn't need the intermediary column, e.g.:

ALTER TABLE `wp_cavalcade_jobs` -- jobs table, assuming the default `wp_` prefix
  ADD UNIQUE INDEX `uniq` ((
    UNHEX(SHA2(CONCAT_WS(
      '-', `site`, `hook`, `args`, `nextrun`, `status`, `schedule`
    ), 256))
  ));

I get that WP isn't designed for concurrency, but Cavalcade definitely is, so while core is stuck on the current system (see this attempt to do exactly what you suggest from 4+ years ago), why not fix it in Cavalcade, where the negative impact of duplicates is amplified?

I agree regarding replace; I think it would be better as an INSERT IGNORE query, or to just handle the duplicate error, but I don't get why we need to shy away from it because it's harder to reason about. Some things are complicated because they have to be.
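For comparison, a sketch of the INSERT IGNORE variant, assuming the unique index sketched earlier is in place; values and names are illustrative, not the PR's code.

-- With a unique index present, INSERT IGNORE downgrades the duplicate-key
-- error to a warning and skips the conflicting row instead of replacing it.
INSERT IGNORE INTO `wp_cavalcade_jobs` (`site`, `hook`, `args`, `nextrun`, `status`, `schedule`)
VALUES (1, 'my_hook', 'a:0:{}', '2024-11-22 00:00:00', 'waiting', 'daily');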

@rmccue (Member) commented Nov 25, 2024

> It's not a massive amount of extra storage per row, one 32-bit column, and the index only needs to be against the hash in this case, so it's not duplicating the entire table at all. This isn't an uncommon pattern.

I think you misunderstood my point about cardinality; it's not that you're duplicating the table per se, but rather that the index has the same number of buckets (its cardinality) as the table has rows, so you don't gain anything in terms of query performance.

There are two parts to indexes: improving performance and enforcing uniqueness.

For performance, indexes are basically about being able to discard as much data as possible before the query touches the table, and this doesn't help with that; in fact, because it has such high cardinality, it may hurt read performance (and will almost certainly hurt write performance). (E.g. an index on status when you're querying status = 'waiting' means you instantly discard, say, 75% of the data before querying.)

For example, an index on status would look like:

|-----------|------------------|
| key       | items            |
|-----------|------------------|
| failed    | 1,2,3,4,8,11,... |
| completed | 5,6,7,9,10,12... |
| waiting   | 48,49            |
|-----------|------------------|

This means that the query for status = 'waiting' can effectively discard most of the data before it even needs to touch the table's data.

With the proposed hash index, if you have e.g. 100k rows, you're effectively going to have 100k buckets, where each bucket looks like:

|--------|-------|
| key    | items |
|--------|-------|
| abc123 | 1     |
| def456 | 2     |
| ghi789 | 3     |
...

This isn't a good tradeoff for performance, as it might even be worse than having no index at all; MySQL's query planner may simply discard it entirely without using it (e.g. see the EXPLAIN output).
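For instance, whether the planner would ever pick the hash index for a typical Cavalcade lookup can be checked with EXPLAIN; the query below is illustrative only.

EXPLAIN SELECT * FROM `wp_cavalcade_jobs`
WHERE `hook` = 'my_hook' AND `status` = 'waiting';
-- The `possible_keys`/`key` columns in the output would not include the hash
-- index, since the hash can only be matched by recomputing it for an exact row.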

Using it purely as a uniqueness constraint doesn’t really work either because the nextrun needs to be part of the index, and that is (basically by definition) unique for each schedule call.

For example, if you have a daily job which fails and restarts multiple times, you would have rows like (simplified):

|----|-----------|--------|---------------|---------|
| id | hook      | args   | nextrun       | status  |
|----|-----------|--------|---------------|---------|
|  1 | run_thing | a:0:{} | 2024-11-01... | failed  |
|  2 | run_thing | a:0:{} | 2024-11-02... | failed  |
|  3 | run_thing | a:0:{} | 2024-11-03... | failed  |
|  4 | run_thing | a:0:{} | 2024-11-04... | failed  |
|  5 | run_thing | a:0:{} | 2024-11-26... | waiting |
...

The only difference between these rows is the nextrun - but technically, that could be the same too, if you aren't just scheduling jobs to start at time(). For example, let's say a job runs at midnight daily, and fails. You might then reschedule it on the same day for the start of the day to have it "catch up" (which Cavalcade does by design), and so nextrun would be the same there.

Given the way that both Cavalcade and WordPress are designed, the only valid unique constraint for the table really is the ID, which is already the primary key and hence unique. Changing this would change the design.

> I get that WP isn't designed for concurrency, but Cavalcade definitely is, so while core is stuck on the current system (see this attempt to do exactly what you suggest from 4+ years ago), why not fix it in Cavalcade, where the negative impact of duplicates is amplified?

The problem is making assumptions at a low level about what the higher level is doing. Without handling and surfacing the conflict correctly, those assumptions may not hold, which leads to a system that is difficult to reason about.

@roborourke (Contributor, Author)

Hmm, so in the interest of something that is easy to reason about, perhaps we could add a duplicate-entry clean-up routine instead.
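A sketch of what such a clean-up could look like, keeping the lowest id among otherwise identical waiting jobs; table and column names are assumed from the discussion, and this is illustrative rather than a proposed implementation.

-- Delete the newer of any two waiting jobs that share the same identity fields,
-- keeping the row with the lowest id.
DELETE `j2`
FROM `wp_cavalcade_jobs` AS `j1`
JOIN `wp_cavalcade_jobs` AS `j2`
  ON  `j2`.`site`    = `j1`.`site`
  AND `j2`.`hook`    = `j1`.`hook`
  AND `j2`.`args`    = `j1`.`args`
  AND `j2`.`nextrun` = `j1`.`nextrun`
  AND `j1`.`status`  = 'waiting'
  AND `j2`.`status`  = 'waiting'
  AND `j2`.`id`      > `j1`.`id`;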
