StreamChain: An API for streams-processing data (e.g. HTTP → ZIP → XML → HTML) #1
Brings together a few explorations to stream-rewrite site URLs in a WXR file coming from a remote server. All of that with no curl, DOMDocument, or other PHP dependencies. It's just a few small libraries built with WordPress core in mind:

* [AsyncHttp\Client](WordPress/blueprints#52)
* [WP_XML_Processor](WordPress/wordpress-develop#6713)
* [WP_Block_Markup_Url_Processor](https://github.com/adamziel/site-transfer-protocol)
* [WP_HTML_Tag_Processor](https://developer.wordpress.org/reference/classes/wp_html_tag_processor/)

Here's what the rewriter looks like:

```php
$wxr_url = "https://raw.githubusercontent.com/WordPress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/woo-products.wxr";
$xml_processor = new WP_XML_Processor('', [], WP_XML_Processor::IN_PROLOG_CONTEXT);
foreach( stream_remote_file( $wxr_url ) as $chunk ) {
    $xml_processor->stream_append_xml($chunk);
    foreach ( xml_next_content_node_for_rewriting( $xml_processor ) as $text ) {
        $string_new_site_url = 'https://mynew.site/';
        $parsed_new_site_url = WP_URL::parse( $string_new_site_url );

        $current_site_url = 'https://raw.githubusercontent.com/wordpress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/wxr-assets/';
        $parsed_current_site_url = WP_URL::parse( $current_site_url );

        $base_url = 'https://playground.internal';
        $url_processor = new WP_Block_Markup_Url_Processor( $text, $base_url );

        foreach ( html_next_url( $url_processor, $current_site_url ) as $parsed_matched_url ) {
            $updated_raw_url = rewrite_url(
                $url_processor->get_raw_url(),
                $parsed_matched_url,
                $parsed_current_site_url,
                $parsed_new_site_url
            );
            $url_processor->set_raw_url( $updated_raw_url );
        }

        $updated_text = $url_processor->get_updated_html();
        if ($updated_text !== $text) {
            $xml_processor->set_modifiable_text($updated_text);
        }
    }
    echo $xml_processor->get_processed_xml();
}
echo $xml_processor->get_unprocessed_xml();
```
### Architecture

The rewriter explored here pipes and stream-processes data as follows:
The layers of data at play are:
### Remaining work

This PR explores a Streaming / Pipes API to make the streams easy to compose and visualize. While the implementation may change, the goal is to pipe chunks of data as far as possible from upstream to downstream while supporting both blocking and non-blocking streams.
### Open Questions

Passing bytes around is great for a consistent interface and byte-oriented operations. However, an HTTP request yields response headers before the body. Reading from a ZIP file produces a series of metadata and data streams – one for every decoded file. How can we use pipes with these more complex data structures? Should we even try? If yes, what would be the API? Would there be multiplexing? Or returning other data types? Or would it be a different interface?
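To make the metadata question concrete, here's one possible shape – a minimal sketch only, not code from this branch; the `Stream_Chunk` class and its fields are hypothetical:

```php
// Hypothetical sketch: a byte chunk with optional, stage-specific metadata
// (HTTP headers, ZIP entry info, etc.) attached alongside the bytes.
class Stream_Chunk {
	public $bytes;
	public $metadata;

	public function __construct( string $bytes, array $metadata = array() ) {
		$this->bytes    = $bytes;
		$this->metadata = $metadata;
	}
}

// Downstream stages can ignore the metadata entirely, or branch on it,
// e.g. only process .xml entries produced by a ZIP reader:
function wants_chunk( Stream_Chunk $chunk ): bool {
	$filename = $chunk->metadata['filename'] ?? '';
	return str_ends_with( $filename, '.xml' );
}
```

A later comment in this thread lands on a similar idea, with `$metadata->get_filename()` and a `DemultiplexerStream` keyed by resource.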
I've been exploring a Pipe-based API for easily composing all those data transformations. Here's what I came up with:
Pipe::run( [
new RequestStream( new Request( 'https://raw.githubusercontent.com/WordPress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/woo-products.wxr' ) ),
new XMLProcessorStream(function (WP_XML_Processor $processor) {
if(is_wxr_content_node($processor)) {
$text = $processor->get_modifiable_text();
$updated_text = Pipe::run([
new BlockMarkupURLRewriteStream(
$text,
[
'from_url' => 'https://raw.githubusercontent.com/wordpress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/wxr-assets/',
'to_url' => 'https://mynew.site/',
]
),
]);
if ( $updated_text !== $text ) {
$processor->set_modifiable_text( $updated_text );
}
}
}),
new EchoStream(),
] );
It's based on the following two interfaces (that are likely to keep changing for now):
interface ReadableStream {
public function read(): bool;
public function is_finished(): bool;
public function consume_output(): ?string;
public function get_error(): ?string;
}
interface WritableStream {
public function write( string $data ): bool;
public function get_error(): ?string;
}
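For illustration, here's what a minimal stream satisfying both interfaces could look like – a sketch only, assuming chunks are plain strings; the `UppercaseStream` class is made up for this example:

```php
// Sketch: a stream that is both writable and readable. It buffers written
// chunks and emits them uppercased when the pipe reads from it.
class UppercaseStream implements ReadableStream, WritableStream {
	private $buffer   = '';
	private $finished = false;
	private $error    = null;

	public function write( string $data ): bool {
		$this->buffer .= strtoupper( $data );
		return true;
	}

	public function read(): bool {
		// True when there is output ready to consume.
		return '' !== $this->buffer;
	}

	public function consume_output(): ?string {
		if ( '' === $this->buffer ) {
			return null;
		}
		$chunk        = $this->buffer;
		$this->buffer = '';
		return $chunk;
	}

	public function is_finished(): bool {
		return $this->finished;
	}

	public function get_error(): ?string {
		return $this->error;
	}
}
```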
Here are a few more streams I would like to have. That way we'll be able to put together pipes like this:
Pipe::run( [
new RequestStream( new Request( 'https://site.com/export.wxr.zip' ) ),
new ZipReaderStream( '/export.wxr' ),
new XMLProcessorStream(function (WP_XML_Processor $processor) use ($assets_downloader) {
if(is_wxr_content_node($processor)) {
$text = $processor->get_modifiable_text();
// Download the missing assets files
$assets_downloader->process( $text );
if(!$assets_downloader->everything_already_downloaded()) {
// Don't import content that has pending downloads
return;
}
// Update the URLs in the text
$updated_text = Pipe::run([
new BlockMarkupURLRewriteStream(
$text,
[ 'from_url' => $from_site, 'to_url' => $to_site ]
),
]);
if ( $updated_text !== $text ) {
$processor->set_modifiable_text( $updated_text );
}
}
})
] );
or this:
Pipe::run( [
new GitSparseCheckoutStream( 'https://github.com/WordPress/gutenberg.git', [
'docs/**/*.md'
] ),
new MarkdownToBlockMarkupStream(),
new BlockMarkupURLRewriteStream(
$text,
[ 'from_url' => $from_site, 'to_url' => $to_site ]
),
new CreatePageStream()
] );
I've played with ideas like this:
graph TD
A[HttpClient] -->|runs 10 concurrent requests| B[Pipeline]
B -->|filter ZIP files| C[ZipPipeline]
B -->|filter XML files| D[XmlPipeline]
C -->|decode ZIP files| E[ZipDecoder]
E -->|output XML entries| F[ZipXmlFilter]
F -->|filter XML files| G[XmlProcessor]
D -->|passthrough| G
G -->|find WXR content nodes| H[XmlProcessor]
H -->|parse as HTML| I[BlockMarkupURLProcessor]
I -->|rewrite URLs| J[HTML string]
J -->|write to local files| K[LocalFileWriter]
classDef blue fill:#bbf,stroke:#f66,stroke-width:2px;
class B,C,D,E,F,G,H,I,J,K blue;
Sadly, the best result I got was a complex DSL you couldn't use without spending time with the documentation:
<?php
// Create the main pipeline
$pipeline = HttpClient::pipeline([
"http://example.com/file1.zip",
"http://example.com/file2.zip",
"http://example.com/file3.zip",
"http://example.com/file4.zip",
"http://example.com/file5.zip",
"http://example.com/file6.xml",
"http://example.com/file7.xml",
"http://example.com/file8.xml",
"http://example.com/file9.xml",
"http://example.com/file10.xml"
]);
[$zipPipeline, $xmlPipeline] = $pipeline->split(HttpClient::filterContentType('application/zip'));
$zipPipeline
->flatMap(ZipDecoder::create())
->filter(Pipeline::filterFileName('.xml$'))
->combineWith($xmlPipeline)
->map(new WXRRewriter())
->map(Pipeline::defaultFilename('output.xml'))
->map(new LocalFileWriter('./'));
The alternative is the following imperative code:
$zips = [
"http://example.com/file1.zip",
"http://example.com/file2.zip",
"http://example.com/file3.zip",
"http://example.com/file4.zip",
"http://example.com/file5.zip",
];
$zip_decoders = [];
$xmls = [
"http://example.com/file6.xml",
"http://example.com/file7.xml",
"http://example.com/file8.xml",
"http://example.com/file9.xml",
"http://example.com/file10.xml"
];
$local_paths = [];
$xml_rewriters = [];
$client = new Client();
$client->enqueue( [ ...$zips, ...$xmls ] );
while ( $client->await_next_event() ) {
$request = $client->get_request();
$original_url = $request->original_request()->url;
switch ( $client->get_event() ) {
case Client::EVENT_HEADERS_RECEIVED:
if ( in_array( $original_url, $zips ) ) {
$zip_decoders[$original_url] = new ZipStreamReader();
} else {
$xml_rewriters[$original_url] = new XmlRewriter();
}
break;
case Client::EVENT_BODY_CHUNK_AVAILABLE:
if ( in_array( $original_url, $zips ) ) {
$zip_decoders[$original_url]->write( $request->get_response_body_chunk() );
} else {
$xml_rewriters[$original_url]->write( $request->get_response_body_chunk() );
}
break;
case Client::EVENT_FAILED:
case Client::EVENT_FINISHED:
unset( $zip_decoders[ $original_url ] );
continue 2;
}
foreach( $zip_decoders as $url => $zip ) {
if ( $zip->is_file_finished() ) {
$zip->next_file();
}
while ( $zip->read() ) {
if( $zip->get_last_error() ) {
// TODO: Handle error
continue 2;
}
$file = $zip->get_file_name();
if(!isset($xml_rewriters[$file])) {
$xml_rewriters[$file] = new XmlRewriter();
}
$xml_rewriters[$file]->write( $zip->get_content_chunk() );
}
}
foreach ( $xml_rewriters as $url => $rewriter ) {
while ( $rewriter->read() ) {
file_put_contents(
$local_paths[$url],
$rewriter->get_response_body_chunk(),
FILE_APPEND
);
}
}
}
It is longer, sure, but there are far fewer ideas in it, you have more control, and it can also be encapsulated similarly as:
public function next_chunk() {
$this->await_response_bytes();
$this->process_zip_chunks();
$this->process_xml_chunks();
$this->write_output_bytes();
}
It's not declarative but it's simple.
One option might be to do something like a … I was wondering if something modeled after JavaScript Promises might be more flexible in providing branching abilities.
Noodling on that idea, we'd need a new type category for multiple data flows:
Here's one way they could combine:
$client = new Client();
$client->enqueue( [ ...$zips, ...$xmls ] );
MultiPipeline::run([
// This produces multiple Request[] streams
$client->demultiplex(),
MultiPipeline::branch(
fn ( $request ) => is_zip( $request ),
// ZipStreamDemultiplexer is a bytes -> File[] array transformer. It's not
// a demultiplexer because each file is fully produced before the next
// one, so there is no concurrent processing here. We could, perhaps, implement
// it as a demultiplexer anyway to reduce the number of ideas in the codebase.
[ fn () => new ZipStreamReader( '*.xml' ) ]
),
// XmlRewriter is a regular bytes -> bytes stream. In here,
// we support multiple concurrent XML streams.
// We can skip the new MultiTransformer() call and have MultiPipeline backfill it for us.
fn () => new XmlRewriter(),
// And now we're gathering all the File objects into a single File stream.
new Multiplexer(),
fn () => new ZipStreamEncoder(),
// Let's write to a local file.
// At this point we only have a single stream id, but we're still
// in a multi-stream world so we have to wrap with a MultiTransformer.
fn () => new LocalFileWriter( 'out.zip' )
]);
This looks much better than the bloat I outlined in my previous comment. Perhaps it can be simplified even further. Although, I guess it's not that different from:
$client = new Client();
$client->enqueue( [ ...$zips, ...$xmls ] );
$client
->demultiplex()
->branch(
fn ( $request ) => is_zip( $request ),
fn ( $branch ) => $branch->pipeTo( fn () => new ZipStreamReader( '*.xml' ) )
)
->pipeTo( fn () => new XmlRewriter() )
->multiplex()
->pipeTo( new ZipStreamEncoder() )
->pipeTo( new LocalFileWriter( 'out.zip' ) );
One thing I'm not sure about is passing bytes vs …
I don't have anything against callbacks, but I'd rather keep the data flow here as linear as possible and err on the side of simplicity over allowing multiple forks, splitting the data into success streams and error streams, etc.
I just realized piping … Therefore, we can pipe HTTP responses, ZIP files, etc. with almost no additional complexity. We would pipe bytes as we do now, and then we'd also support moving an optional … To support multiplexing, I introduced a … A Demultiplexer is just a regular TransformStream that:
A Multiplexer isn't even needed, as every pipe is a linear stream of … Here's a snippet of code that actually works with the latest version of this branch:
Pipe::run( [
new RequestStream( [
new Request( 'https://raw.githubusercontent.com/WordPress/blueprints-library/trunk/php.ini' ),
new Request( 'https://raw.githubusercontent.com/WordPress/blueprints-library/trunk/phpcs.xml' ),
new Request( 'https://raw.githubusercontent.com/WordPress/blueprints/trunk/blueprints/stylish-press/site-content.wxr' ),
] ),
// Filter response chunks as a flat list
new FilterStream( fn ($metadata) => (
str_ends_with( $metadata->get_filename(), '.xml' ) ||
str_ends_with( $metadata->get_filename(), '.wxr' )
) ),
// This demultiplexer pipes each response through a separate
// XMLProcessor so that each parser only deals with a single
// XML document.
new DemultiplexerStream(fn () => $wxr_rewriter()),
// We're back to a flat list, let's strtoupper() each data chunk
new UppercaseTransformer(),
// A Pipe is also a TransformStream and allows us to compose multiple streams for demultiplexing
new DemultiplexerStream(fn () => Pipe::from([
new EchoTransformer(),
new LocalFileStream(fn ($metadata) => __DIR__ . '/output/' . $metadata->get_resource_id() . '.chunk'),
])),
] );
With this design, we could easily add a fluent API if needed and also add support for ZIP files and other data types. Some open questions are:
I have found the loop-orientation of the HTML API useful and more concrete than abstract types and interfaces. To that end, I also like the way bookmarks get a user-defined name. In these pipelines it seems like they could be added with a name, and a context object could provide stage-specific metadata and control through the entire stack. For example, I could write something like this.
Pipe::run( [
'http' => new RequestStream( new Request( 'https://site.com/export.wxr.zip' ) ),
'zip' => new ZipReaderStream( '/export.wxr' ),
'xml' => new XMLProcessorStream(function (WP_XML_Processor $processor, $context) use ($assets_downloader) {
if(!str_ends_with($context['zip']->filename, '.wxr')) {
return $context['zip']->skip_file();
}
if(is_wxr_content_node($processor)) {
$text = $processor->get_modifiable_text();
// Download the missing assets files
$assets_downloader->process( $text );
if(!$assets_downloader->everything_already_downloaded()) {
// Don't import content that has pending downloads
return;
}
// Update the URLs in the text
$updated_text = Pipe::run([
new BlockMarkupURLRewriteStream(
$text,
[ 'from_url' => $from_site, 'to_url' => $to_site ]
),
]);
if ( $updated_text !== $text ) {
$processor->set_modifiable_text( $updated_text );
}
}
})
] );
In fact this whole stack could build a generator which can then be called in a loop.
$pipe = Pipe::run( [ … ] );
while ( $context = $pipe->next() ) {
list( 'xml' => $xml, 'zip' => $zip ) = $context;
if ( ! str_ends_with( $zip->get_filename(), '.wxr' ) ) {
$zip->skip_file();
continue;
}
// start processing.
}
@dmsnell I love the idea, but I'm confused about the details. Would the loop run for every stage of the pipeline? Or just for the final outcome? In the latter scenario, the filtering would happen after the chunks have already been processed. Also, what would this look like for the "demultiplexing" (streaming 5 concurrent requests) and "branching" (only unzip zip files) use-cases?
no idea @adamziel 😄 but I think it relates to the need for requesting more. for example, the loop could execute as soon as any and every stage has something ready to process. in the case of XML, it could sit there in the loop and, as long as it doesn't have enough data to process, could say … for demultiplexing I would assume that the multiplexed stream would provide a way to access the contents of each sub-stream.
I like reducing nesting @dmsnell. While demuxing is powerful, it's also complex and feels like solving an overly general problem instead of tailoring something simple to WordPress use-cases. Here's a take on processing multiple XML files using a flat stream structure:
Pipe::run( [
'http' => new RequestStream( [ /* ... */ ] ),
'zip' => new ZipReaderStream( function ($context) {
if(!str_ends_with($context['http']->url, '.zip')) {
return $context->skip();
}
$context['zip']->set_processed_resource( $context['http']->url );
} ),
'xml' => new XMLProcessorStream(function ($context) {
if(
! str_ends_with($context['zip']->filename, '.wxr') &&
! str_ends_with($context['http']->url, '.wxr')
) {
return $context->skip();
}
$context['xml']->set_processed_resource( $context['zip']->filename );
$xml_processor = $context['xml']->get_processor( );
while(WXR_Processor::next_content_node($xml_processor)) {
// Migrate URLs and download assets
}
}),
] );
if we want this, it would seem like each callback should potentially have access to the context of all stages above and below it, plus space for shared state. in the case of …
I think that's a must, otherwise we'd need buffer size / backpressure semantics. By processing each incoming chunk right away we may sometimes go too granular or do too many checks, but perhaps it wouldn't be too bad – especially when networking and not CPU is the bottleneck.
Shared data and context lookaheads sound like trouble, though. I was hoping that read-only access to context from all the stages above would suffice.
these are valid concerns. I share them. still, I think that undoubtedly, someone will want to do something like conditionally skip a file in the ZIP based on something in the WXR processor, and being able to interact with that from below seems much more useful. this is maybe the challenge that separate callback functions creates, because the flat model doesn't separate the layers.
Agreed! The challenge is we may only get the information necessary to reject a file after processing 10 or 1000 chunks from that file. I can only see three solutions here:
I realized one more gotcha: Imagine requesting 5 WXR exports, rewriting URLs, and saving them all to a local ZIP file. The ZIP writer needs to write data sequentially, so write all the chunks of the first file, write all the chunks of the second file after that, and so on. However, sourcing data from HTTP would interleave chunks from different files. Simply piping those chunks to … We could turn it into a constraint-solving problem. Stream classes would declare whether they:
On mismatch, the entire pipe would error out without even starting.
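A rough sketch of what such a declaration could look like (all names below are hypothetical, not code from this branch): each stream advertises its ordering behavior, and the pipe validates the combination before it starts.

```php
// Hypothetical sketch: streams declare ordering constraints up front and the
// pipe refuses to start on an impossible combination.
interface DeclaresOrdering {
	// True if this stream may interleave chunks from different resources.
	public function produces_interleaved_chunks(): bool;

	// True if this stream must receive one resource's chunks contiguously.
	public function requires_sequential_input(): bool;
}

function assert_pipe_is_runnable( array $stages ): void {
	$upstream_interleaves = false;
	foreach ( $stages as $stage ) {
		if ( ! $stage instanceof DeclaresOrdering ) {
			continue;
		}
		if ( $upstream_interleaves && $stage->requires_sequential_input() ) {
			throw new InvalidArgumentException(
				'A stage requires sequential input but an upstream stage interleaves chunks.'
			);
		}
		$upstream_interleaves = $upstream_interleaves || $stage->produces_interleaved_chunks();
	}
}
```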
to me this reads as a statement of the problem, not an impediment to the problem. if we have to wait for 1000 chunks before knowing whether to process a file, that's a sunk cost.
maybe it's just me but I'm lost in all of this. these examples are complicated, but are they likely? are they practical? where is the scope of what we're doing?
I'm hoping for a simpler code structure and clearer data flows.
There's much more I'll do to review and think through this, but off the top of my head one question arises: how does it look to be re-entrant here? Perhaps in the Playground this isn't a big problem, with unlimited execution time, but on any real PHP server we're dealing with … Without asking you to instantly solve this, do you see a way to persist the in-transit state of the pipeline so that it can be resumed later? Could we put a pause button in here that someone clicks on and then can resume later?
@adamziel monumental work here. of the three pipes I like the controller version the best because of how it seems like the processing steps are a little more global in those cases.
but I noticed something in all formulations: the pipeline doesn't seem to be where the complexity lies. it seems like the examples focus on pipelining the download of files, which I think involves files that get queued while processing.
what would this look like if instead of this processing pipeline we had a main loop where each stage was exposed directly, without the pipeline abstraction, but the files could be downloaded still in parallel?
what could that look like? would it be worse? I think I'm puzzled on how to abstract a universal interface for streaming things, apart from calling everything a token, but your example of the WXR rewriter demonstrates how in many cases the individual token is not the right step function. in many cases, we will process many bytes all at once, and one production from an earlier stage might create many tokens for the next stage.
I'm also thinking more about re-entrancy and how to wrap the indices throughout the pipeline. in this system I suppose we could add new methods exposing the current bookmark, the start and end of the current token for a given stage. this might be critical for being able to pause and resume progress.
at this point I think I have some feel for the design, so I'd like to ask you for some leading questions if you have any. I know this is inherently a very complicated task; the code itself also seems very complicated.
$this->set_modifiable_html_text(
	$html,
	substr($text, 0, $at) . json_encode($new_attributes, JSON_HEX_TAG | JSON_HEX_AMP)
); |
My block comment delimiter finder might help here.
foreach($attributes as $key => $value) {
	$new_attributes[$key] = $this->process_block_attributes($value);
}
return $new_attributes; |
array_walk_recursive might be of help here. your code is working fine, but presumably this could perform better, if it does.
I suppose there's no practical concern here about stack overflow, since this is only processing block attributes, but I'm on the lookout for any non-tail-recursive recursion (and I think that no user-space PHP code is, even if it's in tail-recursive form, which this isn't).
Alternatively there's also the approach of adding values to a stack to process, where the initial search runs over the stack, adding new items for each one that it finds that's an array.
This is not important; I just noticed it.
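For reference, here's roughly what that could look like – a sketch assuming the per-value processing only needs to touch the scalar leaves; `process_block_attribute_value()` is a hypothetical leaf-level helper, not a method from this PR:

```php
// Sketch: rewrite leaf attribute values in place without explicit recursion.
// array_walk_recursive() visits only non-array leaves, so intermediate arrays
// never reach the callback.
$new_attributes = $attributes;
array_walk_recursive(
	$new_attributes,
	function ( &$value ) {
		// Hypothetical helper that processes a single scalar value.
		$value = $this->process_block_attribute_value( $value );
	}
);
return $new_attributes;
```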
* @TODO: Investigate how bad this is – would it stand the test of time, or do we need
* a proper URL-matching state machine?
*/
const URL_REGEXP = '\b((?:(https?):\/\/|www\.)[-a-zA-Z0-9@:%._\+\~#=]+(?:\.[a-zA-Z0-9]{2,})+[-a-zA-Z0-9@:%_\+.\~#?&//=]*)\b'; |
check out the extended flag x
If this modifier is set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl's /x modifier, and makes it possible to include commentary inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern.
https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
this can help make long and confusing regexes clearer, with comments to annotate
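For example, the URL_REGEXP above could be annotated roughly like this – same pattern, just reflowed with the x modifier; the ~ delimiters are added here for illustration since the original constant stores the pattern without delimiters:

```php
// Sketch: the same URL pattern, reformatted with the extended (x) modifier so
// each part can carry an inline comment.
const URL_REGEXP = '~
	\b
	(
		(?: (https?) :\/\/ | www\. )         # scheme, or a bare "www." prefix
		[-a-zA-Z0-9@:%._\+\~#=]+             # host-ish characters
		(?: \.[a-zA-Z0-9]{2,} )+             # at least one dot-separated label
		[-a-zA-Z0-9@:%_\+.\~#?&\/\/=]*       # path, query, and fragment characters
	)
	\b
~x';
```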
I would imagine that this review is more about the pipeline, but I think for URLs, if we're using a WHAT-WG compliant URL parser, we can probably jump to \b(?:[a-z-]+://|www\.|/) and start checking if those base points can produce a valid parse. it looks like this code isn't using what you've done in other explorations, so this comment may not be valid
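A rough sketch of that two-step idea – find cheap candidate start points first, then let a spec-compliant parser decide; the parse step is left as a comment because the exact parser API isn't settled in this thread:

```php
// Sketch: cheaply locate possible URL start points, then confirm each one
// with a WHATWG-compliant parser instead of trusting the regex.
$text = 'Visit https://example.com/a or www.example.org/b for details.';
preg_match_all( '~\b(?:[a-z-]+://|www\.|/)~i', $text, $matches, PREG_OFFSET_CAPTURE );
foreach ( $matches[0] as list( $candidate_prefix, $offset ) ) {
	// Attempt a real parse starting at $offset (e.g. with the WP_URL class used
	// earlier in this thread) and keep the candidate only if it parses.
}
```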
}
if(
	$p->get_token_type() === '#cdata-section' &&
strpos($new_value, '>') !== false |
if it's #cdata-section then it's a real CDATA section and we should check for ]]>. if it's #comment and WP_HTML_Tag_Processor::COMMENT_AS_CDATA_LOOKALIKE === $p->get_comment_type() then it's a lookalike and > is the closer.
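A sketch of that branching, using the WP_HTML_Tag_Processor methods named above (the surrounding $p and $new_value variables are from the diff under review):

```php
// Sketch: pick the closing sequence to guard $new_value against, depending on
// whether this token is a real CDATA section or a comment that looks like one.
if ( '#cdata-section' === $p->get_token_type() ) {
	$closer = ']]>';
} elseif (
	'#comment' === $p->get_token_type() &&
	WP_HTML_Tag_Processor::COMMENT_AS_CDATA_LOOKALIKE === $p->get_comment_type()
) {
	$closer = '>';
} else {
	$closer = null;
}

if ( null !== $closer && false !== strpos( $new_value, $closer ) ) {
	// The new value would prematurely close the section: reject or escape it.
}
```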
$this->xml = $new_xml;
$this->stack_of_open_elements = $breadcrumbs;
$this->parser_context = $parser_context;
$this->had_previous_chunks = true; |
with the HTML API's extend() I've planned on ensuring that we only cut off as much from the front of the document until the first bookmark.
I've considered two modes: one simply extends (which is what #5050 does), and the other extends and forgets.
The major difference is what comes out of get_updated_html(). Here for XML this may be easier, but for HTML it's not as easy as resetting the stack of open elements. There's a lot more state to track and modify, so right now in trunk it will reset to the start and crawl forward until it reaches the bookmark again if the bookmark is before the cursor.
@dmsnell My thinking is the processor has no idea whether the input stream is finished or not. It can make an assumption that an unclosed tag means we're paused at incomplete input, but the input stream may be in fact exhausted. The reverse is also problematic – we may have enough input to infer parsing is finished when in fact more input is coming. Perhaps these processors need to be explicitly told "we're still waiting for more data" or "no more input data will come".
with the HTML API's extend() I've planned on ensuring that we only cut off as much from the front of the document until the first bookmark.
Are there any system-level bookmarks that are implicitly created? As in, is there a chance we'd never forget any bytes because we've seen the <body> tag and we'll keep track of it indefinitely?
A memory limit also crossed my mind, as in "never buffer more than 1MB of data", although that seems more complex and maybe not worth it.
Perhaps these processors need to be explicitly told "we're still waiting for more data" or "no more input data will come".
Yes I believe this is going to be the demand. At some point I think we will probably add some method like $processor->get_incomplete_end_of_document() but it's not there because I have no idea what that should be right now, or if it's truly necessary.
Only the caller will be able to know if the document was truncated or if more chunks are inbound. This is also true for cases where we have everything in memory, e.g. we got truncated HTML as input and don't know where it came from - "that's it, that's all!"
Are there any system-level bookmarks that are implicitly created? As in, is there a chance we'd never forget any bytes because we've seen the tag and we'll keep track of it indefinitely?
In the HTML Processor there are for sure, though in the case of the fragment parser, since the context element never exists on the stack of open elements this shouldn't be a problem. We should be able to eject portions of the string that are closed.
pipes-controller.php
* to the second ZIP file, and so on.
*
* This way we can maintain a predictable $context variable that carries upstream
* metadata and exposes methods like skip_file(). |
good comment
I explored that in bd19ad7. I like that it's less code overall. Here's what I don't like:
I explored inlining the loop cascade into a single loop with switch-based stage management in daaba8a. It's more readable, but the other pain points still stand.
You may be pointing at this already with your choice of words – I'm noticing a lot of similarities between this work and the MySQL parser explorations. We're ingesting "tokens" in the form of bytes, XML tags, etc., identifying the next non-terminal processing rule, and moving them there. If we squint and forget about sourcing data from the network, disk, etc., we're just composing parsers here. At an abstract level, the entire process could be driven by a grammar declaration – I now think the pipeline definition is just that.
My initial thinking is we could store the cursor as follows:
Upon resuming, each processor would restore the frozen state and skip over to the relevant byte in the stream. On the upside, it seems simple. On the downside:
I didn't phrase much of this comment as questions, but it's all me asking for your thoughts.
Highly relevant PR from @dmsnell: WordPress/wordpress-develop#6883
… WebApp Redesign (#1731) ## Description Implements a large part of the [website redesign](#1561): ![CleanShot 2024-09-14 at 10 24 57@2x](https://github.com/user-attachments/assets/f245c7ac-cb8c-4e5a-b90a-b4aeff802e7b) High-level changes shipped in this PR: * Multiple Playgrounds. Every temporary Playground can be saved either in the browser storage (OPFS) or in a local directory (Chrome desktop only for now). * New Playground settings options: Name name, language, multisite * URL as the source of truth for the application state * State management via Redux This work is a convergence of 18+ months of effort and discussions. The new UI opens relieves the users from juggling ephemeral Playgrounds and losing their work. It opens up space for long-lived site configurations and additional integrations. We could bring over all the [PR previewers and demos](https://playground.wordpress.net/demos/) right into the Playground app. Here's just a few features unblocked by this PR: * #1438 – no more losing your work by accident 🎉 * #797 – with multiple sites we can progressively build features we'll eventually propose for WordPress core: * A Playground export and import feature, pioneering the standard export format for WordPress sites. * A "Clone this Playground" feature, pioneering the [Site Transfer Protocol](https://core.trac.wordpress.org/ticket/60375). * A "Sync two Playgrounds" feature, pioneering the Site Sync Protocol * #1445 – better git support is in top 5 most highly requested features. With multiple Playgrounds, we can save your work and get rid of the "save your work before connecting GitHub or you'll lose it" and cumbersome "repo setup" forms on every interaction. Instead, we can make git operations like Pull, Commit, etc. very easy and even enable auto-syncing with a git repository. * #1025 – as we bring in more PHP plumbing into this repository, we'll replace the TypeScript parts with PHP parts to create a WordPress core-first Blueprints engine * #1056 – Site transfer protocol will unlocks seamlessly passing Playgrounds between the browser and a local development environment * #1558 – we'll integrate [the Blueprints directory] and offer single-click Playground setups, e.g. an Ecommerce store or a Slide deck editor. #718. * #539 – the recorded Blueprints would be directly editable in Playground and perhaps saved as a new Playground template * #696 – the new interaction model creates space for additional integrations. * #707 – you could create a "GitHub–synchronized" Playground * #760 – we can bootstrap one inside Playground using a Blueprint and benefit the users immediately, and then gradually work towards enabling it on WordPress.org * #768 – the new UI has space for a "new in Playground" section, similar to what Chrome Devtools do * #629 * #32 * #104 * #497 * #562 * #580 ### Remaining work - [ ] Write a release note for https://make.wordpress.org/playground/ - [x] Make sure GitHub integration is working. Looks like OAuth connection leads to 404. - [x] Fix temp site "Edit Settings" functionality to actually edit settings (forking a temp site can come in a follow-up PR) - [x] Fix style issue with overlapping site name label with narrow site info views - [x] Fix style issue with bottom "Open Site" and "WP Admin" buttons missing for mobile viewports - [x] Make sure there is a path for existing OPFS sites to continue to load - [x] Adjust E2E tests. 
- [x] Reflect OPFS write error in UI when saving temp site fails - [x] Find a path forward for [try-wordpress](https://github.com/WordPress/try-wordpress) to continue working after this PR - [x] Figure out why does the browser get so choppy during OPFS save. It looks as if there was a lot of synchronous work going on. Shouldn't all the effort be done by a worker a non-blocking way? - [x] Test with Safari and Firefox. Might require a local production setup as FF won't work with the Playground dev server. - [x] Fix Safari error: `Unhandled Promise Rejection: UnknownError: Invalid platform file handle` when saving a temporary Playground to OPFS. - [x] Fix to allow deleting site that fails to boot. This is possible when saving a temp site fails partway through. - [x] Fix this crash: ```ts /** * @todo: Fix OPFS site storage write timeout that happens alongside 2000 * "Cannot read properties of undefined (reading 'apply')" errors here: * I suspect the postMessage call we do to the safari worker causes it to * respond with another message and these unexpected exchange throws off * Comlink. We should make Comlink ignore those. */ // redirectTo(PlaygroundRoute.site(selectSiteBySlug(state, siteSlug))); ``` - [x] Test different scenarios manually, in particular those involving Blueprints passed via hash - [x] Ensure we have all the aria, `name=""` etc. accessibility attributes we need, see AXE tools for Chrome. - [x] Update developer documentation on the `storage` query arg (it's removed in this PR) - [x] Go through all the `TODOs` added in this PR and decide whether to solve or punt them - [x] Handle errors like "site not found in OPFS", "files missing from a local directory" - [x] Disable any `Local Filesystem` UI in browsers that don't support them. Don't just hide them, though. Provide a help text to explain why are they disabled. - [x] Reduce the naming confusion, e.g. `updateSite` in redux-store.ts vs `updateSite` in `site-storage.ts`. What would an unambiguous code pattern look like? - [x] Find a reliable and intuitive way of updating these deeply nested redux state properties. Right now we do an ad-hoc recursive merge that's slightly different for sites and clients. Which patterns used in other apps would make it intuitive? - [x] Have a single entrypoint for each logical action such as "Create a new site", "Update site", "Select site" etc. that will take care of updating the redux store, updating OPFS, and updating the URL. My ideal scenario is calling something like `updateSite(slug, newConfig)` in a React Component and being done without thinking "ughh I still need to update OPFS" or "I also have to adjust that .json file over there" - [x] Fix all the tiny design imperfections, e.g. cut-off labels in the site settings form. ### Follow up work - [ ] Mark all the related blocked issues as unblocked on the project board, e.g. #1703, #1731, and more – [see the All Tasks view](https://github.com/orgs/WordPress/projects/180/views/2?query=sort%3Aupdated-desc+is%3Aopen&filterQuery=status%3A%22Up+next%22%2C%22In+progress%22%2C%22Needs+review%22%2C%22Reviewed%22%2C%22Done%22%2CBlocked) - [ ] Update WordPress/Learn#1583 with info that the redesign is now in and we're good to record a video tutorial. - [ ] #1746 - [ ] Write a note in [What's new for developers? 
(October 2024)](WordPress/developer-blog-content#309) - [ ] Document the new site saving flow in `packages/docs/site/docs/main/about/build.md` cc @juanmaguitar - [ ] Update all the screenshots in the documentation cc @juanmaguitar - [ ] When the site fails to load via `.list()`, still return that site's info but make note of the error. Not showing that site on a list could greatly confuse the user ("Hey, where did my site go?"). Let's be explicit about problems. - [ ] Introduce notifications system to provide feedback about outcomes of various user actions. - [ ] Add non-minified WordPress versions to the "New site" modal. - [ ] Fix `console.js:288 TypeError: Cannot read properties of undefined (reading 'apply') at comlink.ts:314:51 at Array.reduce (<anonymous>) at callback (comlink.ts:314:29)` – it seems to happen at trunk, too. - [ ] Attribute log messages to the site that triggered them. - [ ] Take note of any interactions that we find frustrating or confusing. We can perhaps adjust them in a follow-up PR, but let's make sure we notice and document them here. - [ ] Solidify the functional tooling for transforming between `URL`, `runtimeConfiguration`, `Blueprint`, and `site settings form state` for both OPFS sites and in-memory sites. Let's see if we can make it reusable in Playground CLI. - [ ] Speed up OPFS interactions, saving a site can take quite a while. - [ ] A mobile-friendly modal architecture that doesn't stack modals, allows dismissing, and understands some modals (e.g. fatal error report) might have priority over other modals (e.g. connect to GitHub). Discuss whether modals should be declared at the top level, like here, or contextual to where the "Show modal" button is rendered. - [ ] Discuss the need to support strong, masked passwords over a simple password that's just `"password"`. - [ ] Duplicate site feature implemented as "Export site + import site" with the new core-first PHP tools from adamziel/wxr-normalize#1 and https://github.com/adamziel/site-transfer-protocol - [x] Retain temporary sites between site changes. Don't just trash their iframe and state when the user switches to another site. Closes #1719 cc @brandonpayton --------- Co-authored-by: Brandon Payton <[email protected]> Co-authored-by: Bero <[email protected]> Co-authored-by: Bart Kalisz <[email protected]>
Doodling - this is probably all a disaster.
$pipeline->add( 'http', $client );
$pipeline->add( 'zip', $zip_decoder );
$pipeline->add( 'xml', $xml_processor );
$xml_processor->auto_feeder = array( $zip_decoder, 'read_chunk' );
$zip_decoder->auto_feeder = array( $client, 'next_file' );
$client->new_item = fn ( $filename, $chunk ) => $zip_decoder->new_stream( $chunk );
$zip_decoder->new_item = fn ( $filename, $chunk ) => $xml_processor->new_stream( $chunk );
while ( $pipeline->keep_going() ) {
if ( $zip_decoder->get_file_path() !== 'export.xml' ) {
$zip_decoder->next_file();
continue;
}
if ( ! $xml_processor->next_token() ) {
wp_insert_post( $post );
continue;
}
$post = new WP_Post();
$token = $xml_processor->get_token_name();
…
}
so maybe this more or less mirrors work you did in the …
Can we find a simple expression of pipe events without requiring the creation of new classes and without exposing all of the nitty-gritty internals? Maybe not. Maybe the verbose approach is best and largely, code using these streams will be highly-specialized and complicated, and the verbosity is fine because these complicated flows require paying attention to them. 🤔
I have some thoughts about reentrancy unrelated to @dmsnell's last comment: Pausing a pipe may require saving the current state and the data buffer of every parser in the pipe. Imagine the following pipe:
Local file > zip reader > xml parser > WXR importer
Now imagine we failed to import post number 10472. Here's what we need to consider:
Every parser must maintain its internal state in such a way that we could destroy and recreate all its internal resources at any time. For example, the ZIP parser's buffer should never start mid-gzip-block because that would prevent it from recreating the deflate handle.
We'll need to set checkpoints after each meaningful downstream task, e.g. when a post is imported. A checkpoint would be a serialized pipe state at that point in time. The downstream WXR parser may import 100 posts from a single zip chunk, and then it may need 100 zip chunks to import 1 post. We need to export all the upstream states and buffers to correctly resume the downstream parser and allow it to pull the next upstream chunk.
We can only set checkpoints after the last task OR at the first chunk of the next task, but not right before the next task. Why? Because we can't know we're about to enter the next WP post without peeking, and peek() isn't supported in the current streaming API.
Later on we may try to optimize the state serialization and:
Both should be possible upstream from the ZIP parser, but I'm not sure about downstream. It would require synchronizing parser byte offsets, compressed/uncompressed offsets, and gzip block offsets between the piped parsers.
Streaming ZIP files has one more complexity. We may need two cursors — one to parse the central directory index, and one to go through the actual files. This could be a higher-order stream with two inputs, but that smells like complexity and adding a lot of ideas to the streaming architecture. Maybe a custom pipe class that knows how to request new input streams and has a single output?
Cc @sirreal
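To make the checkpoint idea a bit more concrete, here's a rough sketch (my illustration only – it assumes each stage exposes the pause()/resume() methods prototyped in the next comment, and the stage names are arbitrary):

```php
// Sketch: persist a checkpoint after each successfully imported post so the
// whole pipe can be re-created and resumed in a later request.
$checkpoint_path = './import-checkpoint.json';

$save_checkpoint = function ( array $stages ) use ( $checkpoint_path ) {
	$state = array();
	foreach ( $stages as $name => $stage ) {
		// pause() is assumed to return a JSON-serializable snapshot of
		// buffers and offsets, as in the prototype below.
		$state[ $name ] = $stage->pause();
	}
	file_put_contents( $checkpoint_path, json_encode( $state ) );
};

// ...after one post has been imported successfully:
$save_checkpoint( array( 'file' => $file_stream, 'xml' => $xml_stream ) );
```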
### We've got the first prototype of re-entrant streams!

In 3c07f99 I've prototyped the …
$file_stream = new File_Byte_Stream('./test.txt', 100);
// Read bytes 0-99
$file_stream->next_bytes();
// Pause the processing
file_put_contents('paused_state.txt', json_encode($file_stream->pause()));
// Resume the processing in another request
$file_stream = new File_Byte_Stream('./test.txt', 100);
$paused_state = json_decode(file_get_contents('paused_state.txt'));
$file_stream->resume($paused_state);
// Read the bytes 100 - 199
$file_stream->next_bytes(); It seems to be working quite well! What did not workAt first, I tried the following approach: $file_stream = new File_Byte_Stream('./test.txt', 100);
$file_stream->next_bytes();
$file_stream_2 = File_Byte_Stream::resume( $file_stream->pause() );
It worked well for simple streams, but there's no way to generalize it to callback-based streams like ProcessorByteStream – we can't serialize the callbacks as JSON:
class ZIP_Reader
{
static public function stream()
{
return ProcessorByteStream::demuxed(
function () { return new ZipStreamReader(); },
function (ZipStreamReader $zip_reader, ByteStreamState $state) {
while ($zip_reader->next()) {
switch ($zip_reader->get_state()) {
case ZipStreamReader::STATE_FILE_ENTRY:
$state->file_id = $zip_reader->get_file_path();
$state->output_bytes = $zip_reader->get_file_body_chunk();
return true;
}
}
return false;
}
);
}
}
Therefore, I stuck with the approach of creating a stable stream (or stream chain) instance from "schema", and then exporting/importing its internal state:
function create_stream_chain($paused_state=null) {
$chain = new StreamChain(
[
'file' => new File_Byte_Stream('./export.wxr', 100),
'xml' => XML_Processor::stream(function () { }),
]
);
if($paused_state) {
$chain->resume($paused_state);
}
return $chain;
}
We could, in theory, provide an interface such as:
class StrtoupperStream extends TransformStream {
protected function transform($chunk) {
return strtoupper( $chunk );
}
}
StreamApi::register(StrtoupperStream::class);
class RewriteLinksInWXRStream extends ProcessorTransformStream {
protected function transform(WP_XML_Processor $processor) {
// ...
}
}
StreamApi::register(RewriteLinksInWXRStream::class);
However, you can see how requiring a class registration for every simple transform would unnecessarily increase the complexity and balloon the number of classes, files, dependencies, inheritance hierarchies, etc. Having spent a few years with Java, I have to say hard pass.
The API needs more thought and polish here, but we're in a good place to start wrapping up v1 for content imports and exports in the WordPress Playground repo. We'll keep iterating and rebuilding it there to serve the real use-cases well.
### Zip re-entrancy challenge

Pausing ZIP parsing in the middle of a gzip-compressed file might require a custom GZip deflater and so, at least at first, we may not support resuming ZIP parsing. GZip has a variable block size and PHP doesn't expose the current block size or boundaries, meaning there's no obvious place where we could split the data.
We could work around that by exporting the entire deflater's internal state. This would also solve the sliding window problem. The nth block may refer to any previous block within a 32kb sliding window. However, that previous block might also refer to something in the previous 32kb. We're effectively maintaining a dictionary that's initialized at byte 0 and keeps evolving throughout the entire stream, and for re-entrancy we'd need to export that dictionary. Some deflaters cut ties to the previous 32kb every now and then by performing an occasional "full flush". This would reduce the paused context size.

### Local ZIP file re-entrancy

PHP has a set of functions called …
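For the stored (uncompressed) entries of a local ZIP at least, resuming is mostly a matter of remembering byte offsets. A sketch with plain PHP file functions – my illustration, not code from this branch; deflated entries remain the hard part described above because the inflate context itself would also need restoring:

```php
// Sketch: resume reading a *stored* (uncompressed) local ZIP entry by byte offset.
$handle = fopen( './export.wxr.zip', 'rb' );

// ...read some data, then pause and remember where we were:
$paused = array( 'offset' => ftell( $handle ) );
fclose( $handle );

// ...later, in another request:
$handle = fopen( './export.wxr.zip', 'rb' );
fseek( $handle, $paused['offset'] );
$next_chunk = fread( $handle, 8192 );
```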
### WXR + re-entrancy next steps

It seems like the …
@dmsnell it's not too different from the current proposal in this PR:
$pipeline = new StreamChain(
[
'http' => HTTP_Client::stream([
new Request('http://127.0.0.1:9864/export.wxr.zip'),
// new Request('http://127.0.0.1:9864/export.wxr.zip'),
// Bad request, will fail:
new Request('http://127.0.0.1:9865')
]),
'zip' => ZIP_Reader::stream(),
Byte_Stream::map(function($bytes, $context) {
if($context['zip']->get_file_id() === 'export.wxr') {
$context['zip']->skip_file();
return null;
}
return $bytes;
}),
'xml' => XML_Processor::stream(function () { }),
Byte_Stream::map(function($bytes) { return strtoupper($bytes); }),
]
);
foreach($pipeline as $chunk) {
$post = new WP_Post();
// ...
}
With a bit of augmentation, we could move …
Note your example above involves the same number of classes as this PR. There's a class to represent the Pipeline, there's one class per decoder, it seems like there's a class to represent the stream.
In b7102b7 I've prototyped a reentrant ZipStreamReaderLocal. I initially tried implementing it via PHP stream filters, but every time I called …
There are a few rough edges to polish, e.g. the DemultiplexerStream doesn't understand that the streaming has ended. Overall it works pretty well, though, and it seems like we can start with … Thinking about the …
### The last blocking problem with the API design

Doodling on processing zipped WXR files, I found myself writing this code:
$chain = new StreamChain(
[
'zip' => ZIP_Reader_Local::stream('./export.wxr.zip'),
'xml' => XML_Processor::stream(function ($processor) {
$breadcrumbs = $processor->get_breadcrumbs();
if (
'#cdata-section' === $processor->get_token_type() &&
end($breadcrumbs) === 'content:encoded'
) {
echo '<content:encoded>'.substr(str_replace("\n", "", $processor->get_modifiable_text()), 0, 100)."...</content:encoded>\n\n";
}
}),
]
);
foreach($chain as $chunk) {
echo $chunk->get_bytes();
}
This feels weird! The … Encoding pull parser semantics into the system would make this feel a lot more natural:
$pipeline = new StreamChain([
'zip' => ZIP_Reader_Local::stream('./export.wxr.zip'),
'xml' => WP_XML_Processor::consume(),
]);
while($pipeline->keep_going()) {
if($pipeline['zip']->get_file_extension() !== '.wxr') {
$pipeline['zip']->next_file();
continue;
}
$processor = $pipeline['xml']->get_processor();
// next_tag() automatically pulls more data from the "zip" stage
// when the current buffer is exhausted
while($processor->next_tag()) {
}
}
The problem is, the inner … The only solution I can think of for the parallelization case is making the import process re-entrant. Not only that, but we'd need to be ready for a context switch at any point in time – we might run out of data 30 times before processing a single post. The code would look something like this:
$pipeline = new StreamChain([
'zip' => ZIP_Reader_Local::stream('./export.wxr.zip'),
'xml' => WP_XML_Processor::consume(),
]);
while($pipeline->keep_going()) {
if($pipeline['zip']->get_file_extension() !== '.wxr') {
$pipeline['zip']->next_file();
continue;
}
$processor = $pipeline['xml']->get_processor();
if(!$pipeline['wxr_import']->state) {
$pipeline['wxr_import']->state = '#scanning-for-post';
}
// next_token() doesn't pull anything automatically. It only works with the
// information it has available at a moment.
while($processor->next_token()) {
if($pipeline['wxr_import']->state === '#scanning-for-post') {
if(
$processor->get_tag() === 'item' &&
$processor->breadcrumbs_match('item')
) {
$pipeline['wxr_import']->state = '#post';
$pipeline['wxr_import']->post = array();
}
} else if($pipeline['wxr_import']->state === '#post') {
if (
$processor->breadcrumbs_match('content:encoded') &&
$processor->get_type() === '#cdata-section'
) {
$pipeline['wxr_import']->post['post_content'] = $processor->get_modifiable_text();
} else if // ...
}
}
}
Doesn't it look like another stateful streaming processor? This makes me think the pipe could perhaps look as follows:
$pipeline = new StreamChain([
'zip' => ZIP_Reader_Local::stream('./export.wxr.zip'),
'wxr' => new WP_WXR_Stream_Importer()
]);
while($pipeline->keep_going()) {
$paused_state = $pipeline->pause();
// ...
}
// or:
$importer = new StreamChain([
HTTP_Client::stream(
'https://mysite.com/export-1.wxr',
'https://mysite.com/export-2.wxr',
),
new WP_WXR_Stream_Importer()
]);
while($importer->import_next_entity()) {
$paused_state = $importer->pause();
// ...
}
I'm now having second thoughts about the …
On the up side, it centralizes the stream state management logic, cannot be extended with new streams after being declared, and it frees each stream from implementing a method like …
On the down side, the developer in me would rather use this API:
$pipeline = Zip_Reader::from_local_file('./export.wxr.zip')->connect_to(new WXR_Importer());
while($pipeline->keep_going()) {
// ... twiddle our thumbs ...
}
$pipeline_state = $pipeline->pause();
// ... later ...
$pipeline = Zip_Reader::from_local_file('./export.wxr.zip')->connect_to(new WXR_Importer());
$pipeline->resume($pipeline_state); What I don't like about it is that each stream class would have to implement a method such as |
### A potential pivot away from pipelines?

Uh-oh:
This wasn't clear when I focused on rewriting the URLs in the WXR file, but became apparent when I started exploring an importer. This makes me question other use-cases discussed in this PR. Do we actually need to build arbitrary pipes? Perhaps we'll only ever work with two streams, like a data source and a data target, each of them potentially being a composition of two streams in itself? In that scenario, we'd have specialized classes such as …

### This work is now unblocked, let's start putting the code explored in this PR to use in Playground

Let's stop hypothesizing and start bringing the basic building blocks (URL parser, XML parser, etc.) into Playground to use them for feature development. This should reveal much better answers about the API design than going through more thinking exercises here.
…ools (#1888) Let's officially kickoff [the Data Liberation](https://wordpress.org/data-liberation/) efforts under the Playground umbrella and unlock powerful new use cases for WordPress. ## Rationale ### Why work on Data Liberation? WordPress core _really_ needs reliable data migration tools. There's just no reliable, free, open source solution for: - Content import and export - Site import and export - Site transfer and bulk transfers, e.g. mass WordPress -> WordPress, or Tumblr -> WordPress - Site-to-site synchronization Yes, there's the WXR content export. However, it won't help you backup a photography blog full of media files, plugins, API integrations, and custom tables. There are paid products out there, but nothing in core. At the same time, so many Playground use-cases are **all about moving your data**. Exporting your site as a zip archive, migrating between hosts with the [Data Liberation browser extension](https://github.com/WordPress/try-wordpress/), creating interactive tutorials and showcasing beautiful sites using [the Playground block](https://wordpress.org/plugins/interactive-code-block/), previewing Pull Requests, building new themes, and [editing documentation](#1524) are just the tip of the iceberg. ### Why the existing data migration tools fall short? Moving data around seems easy, but it's a complex problem – consider migrating links. Imagine you're moving a site from [https://my-old-site.com](https://playground-site-1.com) to [https://my-new-site.com/blog/](https://my-site-2.com). If you just moved the posts, all the links would still point to the old domain so you'll need an importer that can adjust all the URLs in your entire database. However, the typical tools like `preg_replace` or `wp search_replace` can only replace some URLs correctly. They won't reliably adjust deeply encoded data, such as this URL inside JSON inside an HTML comment inside a WXR export: The only way to perform a reliable replacement here is to carefully parse each and every data format and replace the relevant parts of the URL at the bottom of it. That requires four parsers: an XML parser, an HTML parser, a JSON parser, a WHATWG URL parser. Most of those tools don't exist in PHP. PHP provides `json_encode()`, which isn't free of issues, and that's it. You can't even rely on DOMDocument to parse XML because of its limited availability and non-streaming nature. ### Why build this in Playground? Playground gives us a lot for free: - **Customer-centric environment.** The need to move data around is so natural in Playground. So many people asked for reliable WXR imports, site exports, synchronization with git, and the ability to share their Playground. Playground allows us to get active users and customer feedback every step of the way. - **Free QA**. Anyone can share a testing link and easily report any problems they found. Playground is the perfect environment to get ample, fast moving feedback. - **Space to mature the API**. Playground doesn’t provide the same backward compatibility guarantees as WordPress core. It's easy to prototype a parser, find a use case where the design breaks down, and start over. - **Control over the runtime.** Playground can lean on PHP extensions to validate our ideas, test them on a simulated slow hardware, and ship them to a tablet to see how they do when the app goes into background and the internet is flaky. Playground enables methodically building spec-compliant software to create the solid foundation WordPress needs. 
## The way there ### What needs to be built? There's been a lot of [gathering information, ideas, and tools](https://core.trac.wordpress.org/ticket/60375). This writeup is based on 10 years worth of site transfer problems, WordPress synchronization plugins, chats with developers, analyzing existing codebases, past attempts at data importing, non-WordPress tools, discussions, and more. WordPress needs parsers. Not just any parsers, they must be streaming, re-entrant, fast, standard compliant, and tested using a large body of possible inputs. The data synchronization tools must account for data conflicts, WordPress plugins, invalid inputs, and unexpected power outages. The errors must be non-fatal, retryable, and allow manual resolution by the user. No data loss, ever. The transfer target site should be usable as early as possible and show no broken links or images during the transfer. That's the gist of it. A number of parsers have already been prototyped. There's even [a draft of reliable URL rewriting library](https://github.com/adamziel/site-transfer-protocol). Here's a bunch of early drafts of specific streaming use-cases: - [A URL parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_URL.php) - [A block markup parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_Block_Markup_Processor.php) - [An XML parser](WordPress/wordpress-develop#6713), also explored by @dmsnell and @jonsurrell - [A Zip archive parser](https://github.com/WordPress/blueprints-library/blob/87afea1f9a244062a14aeff3949aae054bf74b70/src/WordPress/Zip/ZipStreamReader.php) - [A multihandle HTTP client](https://github.com/WordPress/blueprints-library/blob/trunk/src/WordPress/AsyncHttp/Client.php) without curl dependency - [A MySQL query parser](WordPress/sqlite-database-integration#157) started by @zieladam and now explored by @JanJakes - [A stream chaining API](adamziel/wxr-normalize#1) to connect all these pieces On top of that, WordPress core now has an HTML parser, and @dmsnell have been exploring a [UTF-8](WordPress/wordpress-develop#6883) decoder that would to enable fast and regex-less URL detection in long data streams. There are still technical challenges to figure out, such as how to pause and resume the data streaming. As this work progresses, you'll start seeing incremental improvements in Playground. One possible roadmap is shipping a reliable content importer, then reliable site zip importer and exporter, then cloning a site, and then extends towards full-featured site transfers and synchronization. ### How soon can it be shipped? Three points: * No dates. * Let's keep building on top of prior work and ship meaningful user flows often. * Let's not ship any stable public APIs until the design is mature. For example, the [Try WordPress extension](https://github.com/WordPress/try-wordpress/) can already give you a Playground site, even if you cannot migrate it to another WordPress site just yet. **Shipping matters. At the same time, taking the time required to build rigorous, reliable software is also important**. An occasional early version of this or that parser may be shipped once its architecture seems alright, but the architecture and the stable API won't be rushed. That would jeopardize the entire project. This project aims for a solid design that will serve WordPress for years. The progress will be communicated in the open, while maintaining feedback loops and using the work to ship new Playground features. 
## Plans, goals, details

### Next steps

Let's start with building a tool to export and import _a single WordPress post_. Yes! Just one post. The tricky part is that all the URLs will have to be preserved. From there, let's explore the breadth and depth of the problem, e.g.:

* Rewriting links
* Frontloading media files
* Preserving dependent data (post meta, custom tables, etc.)
* Exporting/importing a WXR file using the above
* Pausing and resuming a WXR export/import
* Exporting/importing a full WordPress site as a zip file

Ideally, each milestone will result in a small, readily reusable tool. For example: "paste a WordPress post, paste a new site URL, get your post migrated."

There's an ample body of existing work. Let's keep the existing codebases (e.g. WXR, site migration plugins) and discussions open in a browser window during this work. Let's involve the authors of these tools, ask them questions, ask them for reviews. Let's publish the progress and the challenges encountered on the way.

### Design goals

- **Fault tolerance** – all the data tools should be able to start, stop, resume, tolerate errors, and accept alternative data from the user, e.g. media files, posts, etc.
- **WordPress-first** – let's build everything in PHP using WordPress naming conventions.
- **Compatibility** – every WordPress version, PHP version (7.2+, CLI), and Playground runtime (web, CLI, browser extension, desktop app, CI, etc.) should be supported.
- **Dependency-free** – no PHP extensions required. If this means we can't rely on curl, then let's build an HTTP client from scratch. Only minimal Composer dependencies are allowed, and only when absolutely necessary.
- **Simplicity** – no advanced OOP patterns. Our role model is [WP_HTML_Processor](https://developer.wordpress.org/reference/classes/wp_html_processor/) – a **single class** that can parse nearly all HTML. There are no "Node", "Element", or "Attribute" classes. Let's aim for the same here (see the sketch after this list).
- **Extensibility** – Playground should be able to benefit from, say, a WASM markdown parser even if core WordPress cannot.
- **Reusability** – each library should be framework-agnostic and usable outside of WordPress. We should be able to use them in WordPress core, WP-CLI, Blueprint steps, Drupal, Symfony bundles, non-WordPress tools like https://github.com/adamziel/playground-content-converters, and even in Next.js via PHP.wasm.
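For a feel of the single-class, cursor-style API the Simplicity goal points to, here is a minimal sketch using the existing WP_HTML_Tag_Processor. The naive `str_replace()` is only there to keep the example short; reliable URL rewriting is exactly what the rest of this project is about:

```php
<?php
// A single class and no DOM tree: the processor is a cursor over the markup.
$processor = new WP_HTML_Tag_Processor(
	'<p>Visit <a href="https://my-old-site.com/about/">the about page</a>.</p>'
);

while ( $processor->next_tag( 'a' ) ) {
	$href = $processor->get_attribute( 'href' );
	if ( is_string( $href ) ) {
		// Deliberately naive replacement, shown only to illustrate the API shape.
		$processor->set_attribute(
			'href',
			str_replace( 'https://my-old-site.com', 'https://my-new-site.com/blog', $href )
		);
	}
}

echo $processor->get_updated_html();
```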
### Prior art

Here are a few codebases that need to be reviewed at a minimum, and brought into this project at a maximum:

- URL rewriter: https://github.com/adamziel/site-transfer-protocol
- URL detector: WordPress/wordpress-develop#7450
- WXR rewriter: https://github.com/adamziel/wxr-normalize/
- Stream Chain: adamziel/wxr-normalize#1
- WordPress/wordpress-develop#5466
- WordPress/wordpress-develop#6666
- XML parser: WordPress/wordpress-develop#6713
- Streaming PHP parsers: https://github.com/WordPress/blueprints-library/tree/trunk/src/WordPress
- Zip64 support (in the JS ZIP parser): #1799
- Local Zip file reader in PHP (seeks to the central directory, seeks back as needed): https://github.com/adamziel/wxr-normalize/blob/rewrite-remote-xml/zip-stream-reader-local.php
- WordPress/wordpress-develop#6883
- Blocky formats – a Markdown <-> block markup WordPress plugin: https://github.com/dmsnell/blocky-formats
- Sandbox Site plugin that exports and imports WordPress to/from a zip file: https://github.com/WordPress/playground-tools/tree/trunk/packages/playground
- WordPress + Playground CLI setup to import, convert, and export data: https://github.com/adamziel/playground-content-converters
- Markdown -> Playground workflow _and WordPress plugins_: https://github.com/adamziel/playground-docs-workflow
- _Edit Visually_ browser extension for bringing data in and out of Playground: WordPress/playground-tools#298
- _Try WordPress_ browser extension that imports existing WordPress and non-WordPress sites to Playground: https://github.com/WordPress/try-wordpress/
- Humanmade WXR importer designed by @rmccue: https://github.com/humanmade/WordPress-Importer

### Related resources

- [Site transfer protocol](https://core.trac.wordpress.org/ticket/60375)
- [Existing data migration plugins](https://core.trac.wordpress.org/ticket/60375#comment:32)
- WordPress/data-liberation#74
- #1524
- WordPress/gutenberg#65012

### The project structure

The structure of the `data-liberation` package is an open exploration and will change multiple times. Here's what it aims to achieve.

**Structural goals:**

- Publish each library as a separate Composer package
- Publish each WordPress plugin separately (perhaps a single plugin would be the most useful?)
- No duplication of libraries between WordPress plugins
- Easy installation in Playground via Blueprints, e.g. no `composer install` required
- Compatibility with different Playground runtimes (web, CLI) and versions of WordPress and PHP

**Logical parts:**

- First-party libraries, e.g. streaming parsers
- WordPress plugins where those libraries are used, e.g. content importers
- Third-party libraries installed via Composer, e.g. a URL parser

**Ideas:**

- Use the Composer dependency graph to automatically resolve dependencies between libraries and WordPress plugins
- or use the WordPress "required plugins" feature to manage dependencies
- or use Blueprints to manage dependencies

cc @brandonpayton @bgrgicak @mho22 @griffbrad @akirk @psrpinto @ashfame @ryanwelcher @justintadlock @azaozz @annezazu @mtias @schlessera @swissspidy @eliot-akira @sirreal @obenland @rralian @ockham @youknowriad @ellatrix @mcsf @hellofromtonya @jsnajdr @dawidurbanski @palmiak @JanJakes @luisherranz @naruniec @peterwilsoncc @priethor @zzap @michalczaplinski @danluu
A part of #1894. Follows up on #1893.

This PR brings in a few more PHP APIs that were initially explored outside of Playground so that they can be incubated in Playground. See the linked descriptions for more details about each API:

* XML Processor from WordPress/wordpress-develop#6713
* Stream chain from adamziel/wxr-normalize#1
* A draft of a WXR URL Rewriter class capable of rewriting URLs in WXR files

## Testing instructions

* Confirm the PHPUnit tests pass in CI
* Confirm the test suite looks reasonable
* That's it for now! It's all new code that's not actually used anywhere in Playground yet. I just want to merge it to keep iterating and improving.
This new ZipStreamReader opens its own file handles, which means it can be paused and resumed, and it is more reliable. The original implementation was built as a part of adamziel/wxr-normalize#1. This is all new code, so there are no testing instructions. Eventually, this implementation will replace the existing ZipStreamReader.
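For context, here is a generic sketch of the pause/resume mechanism that owning the file handle makes possible. This is plain PHP, not the ZipStreamReader API, and the file name is made up:

```php
<?php
// Owning the handle lets the reader remember a byte offset, stop,
// and continue later from that exact offset.
$handle = fopen( 'export.zip', 'rb' );   // hypothetical local archive
$chunk  = fread( $handle, 8192 );        // ...decode this chunk...
$offset = ftell( $handle );              // remember how far we got
fclose( $handle );                       // pause; the process may even exit

// Later, possibly in a different request, resume from the saved offset.
$handle = fopen( 'export.zip', 'rb' );
fseek( $handle, $offset );
$next_chunk = fread( $handle, 8192 );
fclose( $handle );
```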
This PR explores a generic Stream interface that allows piping data through different format processors, e.g. HTTP request → ZIP decoder → XML reader → HTML Processor → WordPress Database.
Jump to the last status update and feedback request
It brings together all the stream processing explorations in WordPress to enable stream-rewriting site URLs in a WXR file coming from a remote server. All of that with no curl, DOMDocument, or other PHP dependencies. It's just a few small libraries built with WordPress core in mind:
The rewriter is easy to extend. It could, for example, stream-rewrite data from a zipped XML file, re-zip it on the fly, and return it as an HTTP response.
FYI @dmsnell @akirk @brandonpayton @bgrgicak @jordesign @mtias @griffbrad – this is exploratory for now, but it will likely become relevant for production use sooner rather than later.
Related to:
Historically, this PR started as an exploration of rewriting URLs in a remote WXR file.