Impending DST cutover caused jobs to run continuously #74

mackensen · 2018-11-04T13:33:11Z

Last evening, at exactly 9 PM EDT (2 AM GMT), hourly jobs started running continuously--instead of being scheduled for the next hour. This caused a major load event and we had to stop Cavalcade everywhere. Sample log showing the issue:

20413   wp_privacy_delete_old_export_files      2018-11-03 19:14:32     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 20:14:32     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:34     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:36     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:37     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:39     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:40     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:41     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:42     completed

This morning, it seems like we can get cavalcade happy by running wp site list --field=url | xargs -n 1 -I % wp cron event run --due-now --url=% to bring all jobs into the present. The WordPress site in question uses UTC internally; the host servers (both application and database) are on EDT/EST.

The text was updated successfully, but these errors were encountered:

discoinfiltrator · 2018-11-04T17:38:00Z

@mackensen , similar thing happened here, took down our sites. Running the command you listed now - thanks for sharing that. Everything looking good for you since?

mackensen · 2018-11-04T18:01:25Z

@discoinfiltrator more or less (so far); noticed a related issue where jobs persisted for deleted sites and I'm clearing those out manually.

rmccue · 2018-11-05T00:57:56Z

Fascinating! Cavalcade should be using UTC timestamps throughout the entire system, as does WordPress, so unsure why this is occurring.

What is your timezone in PHP's settings (i.e. php.ini) set to?

mackensen · 2018-11-05T01:21:21Z

It's set to America/New_York in our environments.

discoinfiltrator · 2018-11-05T03:01:03Z

America/Toronto in my case. The solution of running:

wp site list --field=url | xargs -n 1 -I % wp cron event run --due-now --url=%

Didn't quite work out for me. There were too many duplicate jobs since the time switchover, and:

add_filter( 'pre_update_option_cron', __NAMESPACE__ . '\\update_cron_array', 10, 2 );
add_filter( 'pre_option_cron',        __NAMESPACE__ . '\\get_cron_array' );

from inc/connector/namespace.php was causing problems somehow. I need to investigate it further.

In the end, I backed up + truncated wp_cavalcade_logs and deleted every row from wp_cavalcade_jobs where interval wasn't null or status was completed or failed (e.g. I only kept the one-off events; the rest could reschedule themselves). The tables were getting pretty large, about 4.5 million rows in the logs table.

sirilyan · 2019-01-10T16:03:04Z

The problem is that Cavalcade-Runner sets a completed job's next run to nextrun + interval, even if that timestamp is now in the past. The job immediately gets picked up again and run. The DST changeover exposed this behaviour, but it'd also happen if the runner crashed and got restarted after a few hours. I'll submit a pull request on Cavalcade-Runner that seems to fix the problem.

rmccue · 2019-01-10T16:05:49Z

The problem is that Cavalcade-Runner sets a completed job's next run to nextrun + interval, even if that timestamp is now in the past.

This is intentional, so that if the Runner crashes (e.g.), the same number of jobs are run as it "catches up".

So long as everything is in UTC, there should be no issue with this, as the clock (generally) monotonically increases.

sirilyan · 2019-01-15T18:42:28Z

I've done a bit more investigation and it seems that under heavy load wp_next_scheduled() can return incorrect results, which makes wp_schedule_single_event() incorrectly queue multiple jobs.

I started up a new WP instance, created a simple hook I called "cavalcade_load" that just slept for a few seconds, and ran six copies of loadtest.sh in parallel:

#!/bin/bash
while true; do
    wp --path=/usr/share/wordpress eval "wp_schedule_single_event(time()+5, 'cavalcade_load');"
done

Stock WordPress refuses to populate the cron table with more than one job per second, no matter how many loadtest.sh I was running in parallel. This makes sense, because the epoch time is the array key.

Next step: install Cavalcade and Cavalcade-Runner, and spawn six copies of loadtest.sh. This produced

mysql> select id,start from wp_cavalcade_jobs where hook = 'cavalcade_load';
+------+---------------------+
| id   | start               |
+------+---------------------+
| 1797 | 2019-01-15 01:19:43 |
| 1798 | 2019-01-15 01:20:21 |
| 1799 | 2019-01-15 01:20:43 |
| 1800 | 2019-01-15 01:20:54 |
| 1801 | 2019-01-15 01:20:54 |
| 1802 | 2019-01-15 01:21:05 |
| 1803 | 2019-01-15 01:21:05 |
| 1804 | 2019-01-15 01:21:16 |
| 1805 | 2019-01-15 01:21:26 |
| 1806 | 2019-01-15 01:21:26 |
| 1807 | 2019-01-15 01:21:38 |
| 1808 | 2019-01-15 01:21:38 |
| 1809 | 2019-01-15 01:21:50 |
| 1810 | 2019-01-15 01:21:51 |
+------+---------------------+

Note there are eight rows where the start time is the same as the start time in another row. This should be impossible - even if we didn't have WP's implementation indirectly promising that only one hook per second can be created, the process that created id 1801 should still have seen the existing hook from id 1800. You can increase the number of collisions by running more simultaneous loadtest.sh - at around 20 or so I can get three at once.

Here's another interesting excerpt from the table, later on:

| 1862 | 2019-01-15 01:55:05 | completed |
| 1863 | 2019-01-15 01:55:05 | completed |
| 1864 | 2019-01-15 01:55:06 | completed |
| 1865 | 2019-01-15 01:55:05 | completed |
| 1866 | 2019-01-15 01:55:06 | completed |
| 1867 | 2019-01-15 01:55:06 | completed |
| 1868 | 2019-01-15 01:55:06 | completed |
| 1869 | 2019-01-15 01:56:05 | completed |

The start time should be increasing monotonically the same way that id is (all of these jobs are running on the same server, so they all share the same system clock), but we've got some that start at 6 seconds after in the middle of a set that start 5 seconds after.

This might be an issue with the cavalcade-jobs cache. I've got a local change where I query the database in is_created() for the job we're about to insert instead of assuming the cache is trustworthy, and it doesn't schedule duplicate jobs in the same second. I'll fork and create a branch if you're interested in seeing it.

discoinfiltrator · 2019-01-15T18:51:45Z

Yeah, this might not be directly responsible for the DST bug but it could be a factor. In wp-includes/cron.php, wp_schedule_single_event should prevent the same job from being scheduled if it's already scheduled within the next 10 minutes:

// Don't schedule a duplicate if there's already an identical event due within 10 minutes of it
$next = wp_next_scheduled($hook, $args);
if ( $next && abs( $next - $timestamp ) <= 10 * MINUTE_IN_SECONDS ) {
    return false;
}

I would expect that cavalcade be consistent with default WP behaviour.

peterwilsoncc · 2019-04-20T05:51:52Z

@discoinfiltrator There was a longstanding core bug in that code that will be fixed in WP 5.2, Cavalcade will inherit the bug fix.

discoinfiltrator · 2019-11-03T15:28:23Z

@rmccue , this happened again last night have I have good explanation for the cause and a fix.

In the Job class in Cavalcade-Runner, we have:

$this->nextrun = date( MYSQL_DATE_FORMAT, strtotime( $this->nextrun ) + $this->interval );

$this->nextrun is from the DB and is in UTC. However, PHP might be set with a different timezone, meaning the PHP date and strtotime functions could be using a timezone that has a daylight savings switchover. So we end up with:

php > echo date_default_timezone_get();
America/Toronto
php > echo date('Y-m-d H:i:s', strtotime( '2019-11-03 01:08:45' ) + 3600);
2019-11-03 01:08:45

Note that the resulting time string is the exact same as the input string. This causes the infinite loop where jobs scheduled between 1:00am and 2:00am with an interval of less than 3600s will be rescheduled for the exact time that they initially had.

We can fix this by making sure that this calculation is done with UTC in mind.

$date = new DateTime( '2019-11-03 01:08:45', new DateTimeZone( 'UTC' ) );
$date->add( new DateInterval( 'PT3600S' ) );
$nextrun = $date->format( 'Y-m-d H:i:s' );

This will correctly result in a nextrun string of 2019-11-03 02:08:45. I've created a PR to address this:

humanmade/Cavalcade-Runner#64

archon810 · 2021-03-18T04:31:52Z

Aaand we just got bitten by this a few days ago. Found 16k jobs and counting, with the top dozen recurring jobs not updating the date.

maciejmackowiak · 2021-03-22T12:11:33Z

This is because of how php DateTime works and Cavalcade for converting the dates is using mysql2date which is using DateTime.
To replicate it let say the site is in Los Angeles time zone and you have a task in DB with saved date as 2021-03-14 02:10:10.

When quering the job by Cavalcade it is converted by to_instance method with mysql2date to timestamp and in get_jobs_by_query back to date.

Try this in wp instance that is using Los Angeles time zone

echo '2021-03-14 02:10:10<br>';
echo mysql2date('G', '2021-03-14 02:10:10') .'<br>';
echo date( 'Y-m-d H:i:s', mysql2date('G', '2021-03-14 02:10:10') );

results in:

2021-03-14 02:10:10
1615691410
2021-03-14 03:10:10

So when it is queried back in pre_reschedule_event it cannot be found.

mackensen mentioned this issue Nov 7, 2018

Multisite cleanup #16

Open

discoinfiltrator mentioned this issue Nov 3, 2019

Force the nextrun calculation to use UTC to be consistent with the DB humanmade/Cavalcade-Runner#64

Merged

discoinfiltrator mentioned this issue Nov 3, 2019

set database timezone as php one humanmade/Cavalcade-Runner#59

Closed

boonebgorges mentioned this issue Apr 9, 2021

Improve logic for setting next run time in Job::reschedule(). humanmade/Cavalcade-Runner#51

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Impending DST cutover caused jobs to run continuously #74

Impending DST cutover caused jobs to run continuously #74

mackensen commented Nov 4, 2018

discoinfiltrator commented Nov 4, 2018 •

edited

Loading

mackensen commented Nov 4, 2018

rmccue commented Nov 5, 2018

mackensen commented Nov 5, 2018

discoinfiltrator commented Nov 5, 2018 •

edited

Loading

sirilyan commented Jan 10, 2019

rmccue commented Jan 10, 2019

sirilyan commented Jan 15, 2019 •

edited

Loading

discoinfiltrator commented Jan 15, 2019

peterwilsoncc commented Apr 20, 2019

discoinfiltrator commented Nov 3, 2019 •

edited

Loading

archon810 commented Mar 18, 2021

maciejmackowiak commented Mar 22, 2021

Impending DST cutover caused jobs to run continuously #74

Impending DST cutover caused jobs to run continuously #74

Comments

mackensen commented Nov 4, 2018

discoinfiltrator commented Nov 4, 2018 • edited Loading

mackensen commented Nov 4, 2018

rmccue commented Nov 5, 2018

mackensen commented Nov 5, 2018

discoinfiltrator commented Nov 5, 2018 • edited Loading

sirilyan commented Jan 10, 2019

rmccue commented Jan 10, 2019

sirilyan commented Jan 15, 2019 • edited Loading

discoinfiltrator commented Jan 15, 2019

peterwilsoncc commented Apr 20, 2019

discoinfiltrator commented Nov 3, 2019 • edited Loading

archon810 commented Mar 18, 2021

maciejmackowiak commented Mar 22, 2021

discoinfiltrator commented Nov 4, 2018 •

edited

Loading

discoinfiltrator commented Nov 5, 2018 •

edited

Loading

sirilyan commented Jan 15, 2019 •

edited

Loading

discoinfiltrator commented Nov 3, 2019 •

edited

Loading