Prevent duplicate entries using unique index #128

Open
wants to merge 3 commits into base: master

Conversation

@roborourke (Contributor) commented Nov 21, 2024

Addresses #89

Using a generated column to make a hash of cron task uniqueness lets us add a unique index to the table.

Changing the $wpdb->insert() call to use replace() means that an existing record will be updated rather than causing an error.

Need to determine if this is really the right way to define uniqueness.

There's also potential for improving lookup performance by using a generated hash column for args as well, as that's more reasonable to add an index for.
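For illustration, a minimal sketch of the approach described above might look like the following, assuming the default-prefix jobs table wp_cavalcade_jobs and an illustrative column name `uniqueness`; this is not the PR's exact schema change.

-- Stored generated column hashing the fields that define a job's uniqueness,
-- plus a unique index on that hash (works on MySQL 5.7+).
ALTER TABLE `wp_cavalcade_jobs`
  ADD COLUMN `uniqueness` BINARY(32)
    GENERATED ALWAYS AS (
      UNHEX(SHA2(CONCAT_WS('-', `site`, `hook`, `args`, `nextrun`, `status`, `schedule`), 256))
    ) STORED,
  ADD UNIQUE INDEX `uniq` (`uniqueness`);

-- With the unique index in place, REPLACE updates the clashing row instead of
-- raising a duplicate-key error on INSERT (other columns omitted for brevity).
REPLACE INTO `wp_cavalcade_jobs` (`site`, `hook`, `args`, `nextrun`, `status`, `schedule`)
VALUES (1, 'my_hook', 'a:0:{}', '2024-11-22 00:00:00', 'waiting', 'daily');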

@roborourke (Contributor, Author)

Additional note - I went with a generated column rather than a functional index because that would require MySQL 8.0.13 as a minimum requirement.

Maybe we're fine with that but would need to add a guard against it.

@roborourke (Contributor, Author)

Hm, unschedule test failing, can't find the record...

@rmccue (Member) commented Nov 22, 2024

So from memory, not having this is basically intentional.

Effectively, the uniqueness constraint would be on every column (a minimum of hook, args, site, nextrun, and status), so the index basically duplicates the entire table. This leads to two problems: the cardinality of the index is essentially that of the entire table, so the storage requirements are very high, and updating or adding any item slows down writes.

Using a generated column here will also increase the storage requirements since those are stored (basically as a cache).

In terms of enforcing uniqueness, simply adding a UNIQUE key will just make any additional insert fail anyway; switching this to replace means it "silently" swallows the conflict, but the behaviour also becomes harder to reason about.

The root cause of this really is that WordPress isn't designed for concurrency, so there are tradeoffs no matter what you do; ideally, we'd lift it further up the stack so that the pattern of if ( ! wp_next_scheduled() ) wp_schedule_event() would be a single atomic operation.
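As an illustration of what an atomic check-and-schedule could look like at the SQL level, here is a sketch using table and column names assumed from the discussion; it narrows the check-then-insert race window but is not something Cavalcade currently does, and fully closing the race would still need a unique constraint or explicit locking.

-- Insert the job only if an equivalent waiting job does not already exist,
-- combining the existence check and the insert into a single statement.
INSERT INTO `wp_cavalcade_jobs` (`site`, `hook`, `args`, `nextrun`, `status`)
SELECT 1, 'my_hook', 'a:0:{}', '2024-11-23 00:00:00', 'waiting'
FROM DUAL
WHERE NOT EXISTS (
  SELECT 1 FROM `wp_cavalcade_jobs`
  WHERE `site` = 1 AND `hook` = 'my_hook' AND `args` = 'a:0:{}' AND `status` = 'waiting'
);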

@roborourke (Contributor, Author)

It's not a massive amount of extra storage per row, one 32-bit column, and the index only needs to be against the hash in this case, so it's not duplicating the entire table at all. This isn't an uncommon pattern.

There could be a further saving if we were to require MySQL 8.0.13+, as then you wouldn't need the intermediary column, e.g.:

ALTER TABLE `wp_cavalcade_jobs` -- jobs table, assuming the default `wp_` prefix
  ADD UNIQUE INDEX `uniq` ((
    UNHEX(SHA2(CONCAT_WS(
      '-', `site`, `hook`, `args`, `nextrun`, `status`, `schedule`
    ), 256))
  ));

I get that WP isn't designed for concurrency, but Cavalcade definitely is, so while core is stuck on the current system (see this attempt to do exactly what you suggest from 4+ years ago), why not fix it in Cavalcade, where the negative impact of duplicates is amplified?

I agree regarding replace; I think it would be better as an INSERT IGNORE query, or to just handle the duplicate error, but I don't get why we need to shy away from it because it's harder to reason about. Some things are complicated because they have to be.
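For comparison, a sketch of the INSERT IGNORE variant, assuming the unique index sketched earlier is in place; values and names are illustrative, not the PR's code.

-- With a unique index present, INSERT IGNORE downgrades the duplicate-key
-- error to a warning and skips the conflicting row instead of replacing it.
INSERT IGNORE INTO `wp_cavalcade_jobs` (`site`, `hook`, `args`, `nextrun`, `status`, `schedule`)
VALUES (1, 'my_hook', 'a:0:{}', '2024-11-22 00:00:00', 'waiting', 'daily');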

@rmccue (Member) commented Nov 25, 2024

> It's not a massive amount of extra storage per row, one 32-bit column, and the index only needs to be against the hash in this case, so it's not duplicating the entire table at all. This isn't an uncommon pattern.

I think you misunderstood my point about cardinality; it's not that you're duplicating the table per se, but rather that the index has the same number of buckets (its cardinality) as the table has rows, so you don't gain anything in terms of query performance.

There are two parts to indexes: improving performance and enforcing uniqueness.

For performance, indexes are basically about being able to discard as much data as possible before the query touches the table, and this doesn't help with that; in fact, because it has such high cardinality, it may hurt read performance (and will almost certainly hurt write performance). (E.g. an index on status when you're querying status = 'waiting' means you instantly discard, say, 75% of the data before querying.)

For example, an index on status would look like:

|-----------|------------------|
| key       | items            |
|-----------|------------------|
| failed    | 1,2,3,4,8,11,... |
| completed | 5,6,7,9,10,12... |
| waiting   | 48,49            |
|-----------|------------------|

This means that the query for status = 'waiting' can effectively discard most of the data before it even needs to touch the table's data.

With the proposed hash index, if you have e.g. 100k rows, you're effectively going to have 100k buckets, where each bucket looks like:

|--------|-------|
| key    | items |
|--------|-------|
| abc123 | 1     |
| def456 | 2     |
| ghi789 | 3     |
...

This isn't a good tradeoff for performance, as it might even be worse than having no index at all; MySQL's query planner may simply discard it entirely without using it (e.g. see the EXPLAIN output).
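For instance, whether the planner would ever pick the hash index for a typical Cavalcade lookup can be checked with EXPLAIN; the query below is illustrative only.

EXPLAIN SELECT * FROM `wp_cavalcade_jobs`
WHERE `hook` = 'my_hook' AND `status` = 'waiting';
-- The `possible_keys`/`key` columns in the output would not include the hash
-- index, since the hash can only be matched by recomputing it for an exact row.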

Using it purely as a uniqueness constraint doesn’t really work either because the nextrun needs to be part of the index, and that is (basically by definition) unique for each schedule call.

For example, if you have a daily job which fails and restarts multiple times, you would have rows like (simplified):

|----|-----------|--------|---------------|---------|
| id | hook      | args   | nextrun       | status  |
|----|-----------|--------|---------------|---------|
|  1 | run_thing | a:0:{} | 2024-11-01... | failed  |
|  2 | run_thing | a:0:{} | 2024-11-02... | failed  |
|  3 | run_thing | a:0:{} | 2024-11-03... | failed  |
|  4 | run_thing | a:0:{} | 2024-11-04... | failed  |
|  5 | run_thing | a:0:{} | 2024-11-26... | waiting |
...

The only difference between these rows is the nextrun - but technically, that could be the same too, if you aren't just scheduling jobs to start at time(). For example, let's say a job runs at midnight daily, and fails. You might then reschedule it on the same day for the start of the day to have it "catch up" (which Cavalcade does by design), and so nextrun would be the same there.

Given the way that both Cavalcade and WordPress are designed, the only valid unique constraint for the table really is the ID, which is already the primary key and hence unique. Changing this would change the design.

> I get that WP isn't designed for concurrency, but Cavalcade definitely is, so while core is stuck on the current system (see this attempt to do exactly what you suggest from 4+ years ago), why not fix it in Cavalcade, where the negative impact of duplicates is amplified?

The problem is making assumptions at a low level about what the higher level is doing. Without handling and surfacing the conflict correctly, those assumptions may not hold, which leads to a system that is difficult to reason about.

@roborourke (Contributor, Author)

Hmm, so in the interest of something that is easy to reason about, perhaps we could add a duplicate-entry clean-up routine instead.
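A sketch of what such a clean-up could look like, keeping the lowest id among otherwise identical waiting jobs; table and column names are assumed from the discussion, and this is illustrative rather than a proposed implementation.

-- Delete the newer of any two waiting jobs that share the same identity fields,
-- keeping the row with the lowest id.
DELETE `j2`
FROM `wp_cavalcade_jobs` AS `j1`
JOIN `wp_cavalcade_jobs` AS `j2`
  ON  `j2`.`site`    = `j1`.`site`
  AND `j2`.`hook`    = `j1`.`hook`
  AND `j2`.`args`    = `j1`.`args`
  AND `j2`.`nextrun` = `j1`.`nextrun`
  AND `j1`.`status`  = 'waiting'
  AND `j2`.`status`  = 'waiting'
  AND `j2`.`id`      > `j1`.`id`;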
