Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data Liberation] Re-entrant WP_Stream_Importer #2004

Merged
merged 12 commits into from
Nov 22, 2024

Conversation

adamziel
Copy link
Collaborator

@adamziel adamziel commented Nov 18, 2024

Adds re-entrancy semantics to the importer API to enable pausing and resuming data imports:

$wxr_path = __DIR__ . '/tests/fixtures/wxr-simple.xml';
$importer = WP_Stream_Importer::create_for_wxr_file( $wxr_path );
// Do some work
for($i = 0;$i<10;$i++) {
	$importer->next_step();
}
// Save our progress
$cursor = $importer->get_reentrancy_cursor();

// Continue where we left off later on
$new_importer = WP_Stream_Importer::create_for_wxr_file( $wxr_path, [], $cursor );
$new_importer->next_step();

Motivation

Most WordPress importers fail because they assume a happy path: we have enough memory, we have enough time, all the assets will be available, and so on.

In Data Liberation, I want to assume the worst possible path through thorny quicksand in full sun with venomous wasps stinging us. We'll run out of memory after the first post, all the assets will be 40GB large, and half of them won't be possible to download.

Pausing, resuming, and recovering from errors should be a basic primitive of the system. The first step to supporting that is the ability to suspend the import operation and restart it from the same spot later on. And that's exactly what this PR adds.

Re-entrancy interface

This PR doesn't store any information in the database yet. It merely adds the plumbing for pausing and resuming the WP_Stream_Importer instance.

WP_Byte_Stream re-entrancy

The WP_Byte_Stream interface directly exposes a tell(): int and seek($offset) methods. There's no need for anything fancier than that – we're only interested in an offset in the stream. It seems to work well for simple byte streams.

My only worry is we may need to revisit this interface later on to support fetching fixed-size chunks from large files using byte ranges.

WP_XML_Processor re-entrancy

WP_XML_Processor supports exporting state via:

  • A get_reentrancy_cursor() method
  • Resuming via a static create($xml, $options, $cursor=null).
  • Seeking the input stream to the correct location via get_token_byte_offset_in_the_input_stream()

No method in the XML processor API will ever accept the cursor or the byte offset as a way of moving to another location in the document. You can only create a new XML processor at $cursor.

This is a measure to:

  • Discourage using the byte offsets for manual string operations on the XML document. It's a footgun and most API consumers who would try that would just introduce bugs into their codebase.
  • Make it impossible to misuse the re-entrancy API for seek()-ing. We already have named bookmarks for that.

Usage:

$xml = WP_XML_Processor::create_from_string( $xml_bytes );
for($i = 0;$i<10;$i++) {
	$xml->next_step();
}

$cursor = $xml->get_reentrancy_cursor();

$unparsed_xml = substr(
	$xml_bytes,
	$xml->get_token_byte_offset_in_the_input_stream()
);
$xml2 = WP_XML_Processor::create_from_string( $unparsed_xml, $cursor );
$xml2->next_step();

WP_WXR_Reader re-entrancy

The WP_WXR_Reader class uses the same get_reentrancy_cursor() interface as WP_XML_Processor.

WP_Stream_Importer re-entrancy

The WP_Stream_Importer class uses the same get_reentrancy_cursor() interface as WP_XML_Processor. See the example at the top of this description.

Testing instructions

TBD. We don't yet have a good way of running PHPUnit in the WordPress context yet. @zaerl is working on running import in CLI, we may need to wait for that before adding tests to this PR and shipping it.

Exploratory PR to keep track of the import state so that, upon crash,
the next run may seamlessly resume where the previous one left off.
@adamziel adamziel marked this pull request as ready for review November 20, 2024 16:09
@adamziel
Copy link
Collaborator Author

adamziel commented Nov 20, 2024

cc @zaerl @brandonpayton @dmsnell @sirreal @ellatrix – any thoughts or comments?

@adamziel adamziel changed the title [Data Liberation] Make WP_Stream_Importer re-entrant [Data Liberation] Re-entrant WP_Stream_Importer Nov 20, 2024
@adamziel adamziel mentioned this pull request Nov 20, 2024
47 tasks
@sirreal
Copy link
Member

sirreal commented Nov 21, 2024

I've given this a once-over and it seems well thought out. A lot of this (streaming, wxr, etc.) is outside my day-to-day so I'm sorry I don't have a lot of insights to share.

@zaerl
Copy link
Collaborator

zaerl commented Nov 21, 2024

I like the resume_at_entity one. We used something similar for Jetpack. The cursor's granularity must be subject to the phase currently running. Some phases are an order of magnitude, or more, slower than others (AKA images download).

It is ok now; there is always room for improvement. Great work.

SQLite has a relatively strict anomaly testing that a modern PHP app should follow. While the I/O one is pretty impossible to generate by a server not made of wood, the other two are more frequent than one thinks with a default php.ini.

@adamziel
Copy link
Collaborator Author

I like the resume_at_entity one. We used something similar for Jetpack. The cursor's granularity must be subject to the phase currently running. Some phases are an order of magnitude, or more, slower than others (AKA images download).

Yes! I'd like to keep tabs on the download progress, too. It seems like a cursor problem, too, doesn't it? As in, a downloading a 300GB file might take multiple sessions and the ability to pause and resume it matters.

It is ok now; there is always room for improvement. Great work.

Thanks!

SQLite has a relatively strict anomaly testing that a modern PHP app should follow. While the I/O one is pretty impossible to generate by a server not made of wood, the other two are more frequent than one thinks with a default php.ini.

Oh that's a great idea in context of importers! Would you mind starting a new issue to track that and link to it in the tracking issue? #1894

@adamziel adamziel merged commit e4ffd2d into trunk Nov 22, 2024
7 of 10 checks passed
@adamziel adamziel deleted the reentrant-WP_Stream_Importer branch November 22, 2024 11:27
@zaerl
Copy link
Collaborator

zaerl commented Nov 22, 2024

@adamziel I have rebased the CLI added the new re-entrance semantics and replaced the temporary test code, so it is ready to be reviewed. #2012

@zaerl
Copy link
Collaborator

zaerl commented Nov 22, 2024

Oh that's a great idea in context of importers! Would you mind starting a new issue to track that and link to it in the tracking issue? #1894

Created with what I think are the most critical one and added to the tracking issue: #2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants