Duplicate cavalcade job entries #89

owaincuvelier · 2020-01-07T20:28:35Z

Seeing a potential issue where (in this example) scheduled post cavalcade jobs are duplicated causing a race condition and posts don't publish.

Will update upon further investigation

r-a-y · 2020-01-10T00:13:27Z

I noticed the same thing, which is why I opened #88 to help clean up job entries via WP-CLI, but it would be great to know what is causing these duplicate entries to occur.

svandragt · 2020-01-14T16:32:18Z

Does this project have multiple webservers? My approach was going to be to put an key constraint on hook + args + site so that a second instance of the same event wouldn't get into the database. The issue with that is that the args column is a longtext instead of varchar(max) which doesn't support indexes. (https://github.com/humanmade/Cavalcade/blob/master/inc/namespace.php#L70-L73) The issue can also occur when using intervals because each worker will check if an event needs to be rescheduled.

The acquire_lock method of the runner isn't adequate: https://github.com/humanmade/Cavalcade-Runner/blob/master/inc/class-job.php#L55-L66 When 4 workers get started and they call https://github.com/humanmade/Cavalcade-Runner/blob/master/inc/class-runner.php#L236-L239 at a similar time then try and update we see the race condition here.

Last time I spoke with @rmccue about this he wanted to look at database locking if I recall correctly. Looking into this something like https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-reads.html might be an option.

archon810 · 2020-02-28T15:21:52Z

Would love to hear if there's a solution as we're looking to implement Cavalcade right now. If duplicates are being created, the solution would be (at the moment) no better than WP's broken built-in cron.

roborourke · 2020-03-05T15:11:53Z

@archon810 you could try Cavalcade 2.0. Worth noting this doesn't happen all the time - Cavalcade is running on all our production sites and wordpress.org too.

dd32 · 2020-05-19T04:49:43Z

WordPress.org has this running as a daily cron task to solve this issue:

/**
 * Mark any duplicate jobs as duplicates.
 * 
 * When the table cannot be read, the query times out, or a new crontask is added
 * it's common to end up with duplicate entries in the cavalcade tables.
 * 
 * This simply marks them as duplicates which prevents `get_cron_array()` having many extra
 * duplicate tasks that slows things down.
 *
 * NOTE: For non-recurring tasks, this will clean out any duplicate jobs even if queued at different times
 *       hopefully that'll be compatible with future core/w.org cron uses.
 */
function mark_jobs_as_duplicates() {
	global $wpdb;

	$jobs_with_duplicates = $wpdb->get_results(
		"SELECT
			min(`id`) as `id_to_keep`,
			`site`, `hook`, `args`, `interval`
		FROM `wp_cavalcade_jobs`
		WHERE status IN ('waiting', 'running')
		GROUP BY `site`, `hook`, `args`, `interval`
		HAVING COUNT(*) > 1"
	);

	foreach ( $jobs_with_duplicates as $j ) {
		$sql = $wpdb->prepare(
			"UPDATE `wp_cavalcade_jobs`
			SET `status` = 'duplicate'
			WHERE `id` != %d AND status = 'waiting'
				AND `site` = %d AND `hook` = %s
				AND `args` = %s AND `interval` = %d",
			$j->id_to_keep,
			$j->site,
			$j->hook,
			$j->args,
			$j->interval
		);

		// Hack for non-recurring jobs. wpdb::prepare() doesn't support NULL fields, but we can force it in.
		if ( is_null( $j->interval ) ) {
			$sql = str_replace( '`interval` = 0', '`interval` IS NULL', $sql );
		}

		$wpdb->query( $sql );
	}

}

It's not ideal, but it's been working for us for quite some time.

Duplicate jobs are a little more common with Cavalcade over the usual WP cron storage, as Cavalcade inserts multiple rows where as WP cron just overwrites the previous cron array with a new one (which will only include one cron entry).

Using a table lock would help, but Cavalcade would also have to reload the DB cron entries for the current site after locking, which could cause a lot of table locking on a high usage site like WordPress.org.

For reference, Cavalcade on WordPress.org in numbers:

~10k-20k cron executions an hour over 20 threads spread over 5 servers
400-600 cron entries added per hour spiking to ~800-1000/hour now and then

archon810 · 2021-02-26T18:40:39Z

Now that we've been running Cavalcade for several months, I just checked and we're also seeing a ton of dupes, including for WP's own functions.

Are there any plans to attempt to root cause and fix the issue?

archon810 · 2021-02-26T18:43:34Z

Maybe change the logic to confirm 100% that the query that looks for existing jobs returned correctly, and if it errors, don't schedule a potential dupe? Error check?

roborourke · 2021-03-02T15:01:46Z

Hey, thanks for these recent updates. Yes is the answer that we want to understand and resolve this issue for good. Thanks @dd32 for the code that checks for duplicates too. We did have some initial discovery work internally but we didn't reach any direct conclusions yet.

We will have a few QA sprints coming up next month so I will try and get this addressed then.

germanoronoz · 2022-05-12T15:17:13Z

Hello,

Any news regarding this issue?

We implemented @dd32 duplicate marking solution as a cron job and it's working, but we still see a lot of duplicates (around 30,000 dupes currently).

It's a multisite network with 400+ sites, and three workers (runners on AWS instances) with 25 max-workers each. The three of them are %99 CPU all the time, and we think it is due to the underlying dupes problem.

We modified the runner a bit to change events with less than 15 minutes interval, to make the nextrun the current execution time plus the interval, so they never get stuck, and changed also the nextrun of the events with a one minute interval to be two minutes instead:

Hopefully you cant sort this out!
Best regards

doradod · 2024-11-16T20:09:40Z

It seems this piece of software is abandoned... I'm running on issues with several multisite installations where cavalcade is opening and closing a lot of connections putting down a t4g.large aurora database server really quick

roborourke · 2024-11-16T22:29:53Z

@doradod not abandoned, it’s in production on large installations including Wordpress.org with smaller instance sizes, but this is a difficult issue to consistently reproduce and test for.

Is your problem the same as the one described here?

doradod · 2024-11-16T22:44:22Z

My problem looks related to a very high number of queries

as an example

redacted logs

there hundreds of these ones related to different multisite installations where we have from just 2-3 sites on each installation to no more than 30 sites

i had to run a script to iterate over the 'cavalcade jobs' tables that i have across my installations to remove duplicates, but anyhow looks like since i've disabled the supervidor processes that i have set up to run cavalcade in the background the server has not been restarted.

I have the logs of a period of 15 minutes when the issue last happened and beside WPML doing it's bullshit cavalcade is the next heavy actor querying the database really intensive

doradod · 2024-11-16T23:22:47Z

Specially the PUBLISH FUTURE POST is duplicated even though i removed them with my script

roborourke · 2024-11-16T23:45:02Z

It’s not clear those are definitely duplicates, the args could be different, and they’re scheduled for very different times. What’s the plugin or code adding the transient cleaner task? Maybe there’s something to investigate there

doradod · 2024-11-17T00:16:49Z

But this is just an example of a task i'm also not sure what is exactly doing the database open a lot of new connections, but definitely after shutting down cavalcade didn't happen still monitoring

<?php
/**
 * Plugin Name: PJ Transient Cleaner
 * Description: Cleans expired transients behind the scenes.
 * Plugin URI: http://pressjitsu.com
 */

class Pj_Transient_Cleaner {
	public static function load() {
		add_action( 'init', array( __CLASS__, 'schedule_events' ) );
	}

	/**
	 * Schedule cron events, runs during init.
	 */
	public static function schedule_events() {
		if ( ! wp_next_scheduled( 'pj_transient_cleaner' ) )
			wp_schedule_event( time(), 'daily', 'pj_transient_cleaner' );

		add_action( 'pj_transient_cleaner', array( __CLASS__, 'cleaner' ) );
	}

	/**
	 * Runs in a wp-cron intsance.
	 */
	public static function cleaner() {
		global $wpdb;

		$timestamp = time() - 24 * HOUR_IN_SECONDS; // expired x hours ago.
		$time_start = time();
		$time_limit = 30;
		$batch = 100;

		// @todo Look at site transients too.
		// Don't take longer than $time_limit seconds.
		while ( time() < $time_start + $time_limit ) {
			$option_names = $wpdb->get_col( "SELECT `option_name` FROM {$wpdb->options} WHERE `option_name` LIKE '\_transient\_timeout\_%'
				AND CAST(`option_value` AS UNSIGNED) < {$timestamp} LIMIT {$batch};" );

			if ( empty( $option_names ) )
				break;

			// Add transient keys to transient timeout keys.
			foreach ( $option_names as $key => $option_name )
				$option_names[] = '_transient_' . substr( $option_name, 19 );

			// Create a list to use with MySQL IN().
			$options_in = implode( ', ', array_map( function( $item ) use ( $wpdb ) {
				return $wpdb->prepare( '%s', $item );
			}, $option_names ) );

			// Delete transient and transient timeout fields.
			$wpdb->query( "DELETE FROM {$wpdb->options} WHERE `option_name` IN ({$options_in});" );

			// Break if no more deletable options available.
			if ( count( $option_names ) < $batch * 2 )
				break;
		}
	}
}

Pj_Transient_Cleaner::load();

i think i did something not very good, i cleaned all records @ the jobs tables... i see the jobs are recreated now but the next run shows that are already late

roborourke · 2024-11-17T12:12:57Z

They should be recreated and rescheduled, that should be ok. Are you using InnoDB, and possibly any kind of dual read only & write db set up?

doradod · 2024-11-17T21:21:32Z

They are recreated, but not rescheduled, not a single cron is executing now. so far over the last two days the php has not been restarted, so looks lke cavalcave was the responsible of those php-fpm processes hanging the database opening too many connections

I have single aurora instance on my aurora im not using two instances for read and write tables are set to InnoDB

Any recommendations?

roborourke · 2024-11-18T14:44:58Z

Did the cavalcade runner stop as well? Cavalcade works by executing WP CLI commands so it shouldn't be creating additional php-fpm processes itself.

This could be a general scalability problem, but it's hard to advise without knowing what the stack looks like or how much traffic you're handling. You might need to look into adding Redis or other object cache if you haven't already to help alleviate the number of requests, as well as a page caching solution via batcache & cloudfront for example.

doradod · 2024-11-18T15:11:09Z

I have object cache pro, nginx fastcgi cache for the whole server, caching scripts that iterates over all my urls in ~10m caching the sites for a week and cloudfront on the most heavy sites...

I've stopped the cavalcade processes but still i saw some entries on the aurora logs cheking the tables, which seems weird...

But again the php-fpm hanged, and now looks like it's not cavalcade itself doing it...

Anyhow i'm not sure how to make the crons to run again... they got late on the database and i cannot make them to run ahead in time

roborourke · 2024-11-18T15:26:30Z

You would need to deactivate the cavalcade plugin too to switch off cavalcade fully.

The plugin just hooks into WordPress and tells it to interact with the cavalcade tables rather than the standard array in the options table.

The cavalcade runner is a separate server process that runs every second (or more, depending how you configured the runner), so if the cavalcade runner is stopped then no WordPress cron tasks will be run.

You'll need to profile your application, could be too many / long running blocking HTTP requests for a missing resource or some plugin or other code that's trying to update / write to the db on every request is another common one.

Anyway, I think your issue is not the one described here, you could confirm it by checking the stored args and start time of the jobs match. There will be multiple entries for a given hook if different arguments adn times are passed for it.

citrika · 2024-11-20T05:00:07Z

@roborourke What do you think about preventing duplicate events as suggested here: https://core.trac.wordpress.org/ticket/49693 ?

roborourke · 2024-11-20T14:19:04Z

@citrika I think we follow the same logic as core, in theory it shouldn't happen but we have an opportunity here that core does not have which is using MySQL features. I would look into adding a UNIQUE key to solve this issue I think.

In the meantime there's #89 (comment)

Addresses #89 Using a generated column to make a hash of cron task uniqueness lets us add a unique index to the table. Changing the $wpdb->insert() call to use replace() means that an existing record will be updated rather than causing an error. Need to determine if this is really the right way to define uniqueness. Potential for improving lookup performance using a generated hash column for args as well as thats more reasonable to add an index for.

svandragt added the Bug label Jan 28, 2020

svandragt self-assigned this Feb 25, 2020

roborourke mentioned this issue Nov 21, 2024

Prevent duplicate entries using unique index #128

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicate cavalcade job entries #89

Duplicate cavalcade job entries #89

owaincuvelier commented Jan 7, 2020

r-a-y commented Jan 10, 2020

svandragt commented Jan 14, 2020 •

edited

Loading

archon810 commented Feb 28, 2020

roborourke commented Mar 5, 2020

dd32 commented May 19, 2020

archon810 commented Feb 26, 2021

archon810 commented Feb 26, 2021

roborourke commented Mar 2, 2021

germanoronoz commented May 12, 2022

doradod commented Nov 16, 2024

roborourke commented Nov 16, 2024

doradod commented Nov 16, 2024 •

edited by roborourke

Loading

doradod commented Nov 16, 2024

roborourke commented Nov 16, 2024

doradod commented Nov 17, 2024 •

edited

Loading

roborourke commented Nov 17, 2024

doradod commented Nov 17, 2024 •

edited

Loading

roborourke commented Nov 18, 2024

doradod commented Nov 18, 2024

roborourke commented Nov 18, 2024

citrika commented Nov 20, 2024 •

edited

Loading

roborourke commented Nov 20, 2024

Duplicate cavalcade job entries #89

Duplicate cavalcade job entries #89

Comments

owaincuvelier commented Jan 7, 2020

r-a-y commented Jan 10, 2020

svandragt commented Jan 14, 2020 • edited Loading

archon810 commented Feb 28, 2020

roborourke commented Mar 5, 2020

dd32 commented May 19, 2020

archon810 commented Feb 26, 2021

archon810 commented Feb 26, 2021

roborourke commented Mar 2, 2021

germanoronoz commented May 12, 2022

doradod commented Nov 16, 2024

roborourke commented Nov 16, 2024

doradod commented Nov 16, 2024 • edited by roborourke Loading

doradod commented Nov 16, 2024

roborourke commented Nov 16, 2024

doradod commented Nov 17, 2024 • edited Loading

roborourke commented Nov 17, 2024

doradod commented Nov 17, 2024 • edited Loading

roborourke commented Nov 18, 2024

doradod commented Nov 18, 2024

roborourke commented Nov 18, 2024

citrika commented Nov 20, 2024 • edited Loading

roborourke commented Nov 20, 2024

svandragt commented Jan 14, 2020 •

edited

Loading

doradod commented Nov 16, 2024 •

edited by roborourke

Loading

doradod commented Nov 17, 2024 •

edited

Loading

doradod commented Nov 17, 2024 •

edited

Loading

citrika commented Nov 20, 2024 •

edited

Loading