
[Data Liberation] Tracking issue #1894

Open
17 of 47 tasks
adamziel opened this issue Oct 14, 2024 · 2 comments · Fixed by #1960
Labels
[Aspect] Data Liberation [Type] Project [Type] Tracking Tactical breakdown of efforts across the codebase and/or tied to Overview issues.

Comments

@adamziel
Collaborator

adamziel commented Oct 14, 2024

Let's use this issue to track Data Liberation: Let's Build WordPress-first Data Migration Tools

Technical plumbing

Preliminary roadmap by use-case

  • WXR importer
    • Fork https://github.com/humanmade/WordPress-Importer. Give attribution to the original team, ping them and start a conversation
    • Port it to WP_XML_Tag_Processor
    • Start using that fork for importing WXR files in Playground
    • Rewrite the imported site URLs
    • Use AsyncHTTP\Client for fetching assets
    • Make it resumable if it fails halfway through
    • Report progress information to the user
    • Surface errors to the user, ask how to handle them
    • Publish it as a standalone plugin to start gathering feedback and bug reports
    • Include extension points to enable custom treatment of any block attribute, database row etc. See one of the GitHub discussions referenced in [Data liberation] wp_rewrite_urls() #1893
  • Static block markup editor
    1. Build a simple plugin to import and export .html files representing specific WordPress pages from GitHub.
    2. Ship a Blueprint that loads Playground Docs into Playground
    3. We need to have a real use-case for interacting with data liberation on a daily basis and this is one. It's a super low-friction way of maintaining the Playground documentation and WordPress-on-GitHub-pages in general. (cc @bph @akirk)
  • Reliable Playground ZIP export / import
    1. Fork the Sandbox Site plugin
    2. Improve the SQL export to make it streamable and ensure there are absolutely no issues with escaping
    3. Rewrite the exported and imported site URLs
    4. Include extension points to enable custom treatment of any block attribute, database row etc. See one of the GitHub discussions referenced in Kickoff Data Liberation: Let's Build WordPress-first Data Migration Tools #1888
    5. Consider shipping .sql files with the export to potentially enable importing the resulting .zip in a regular MySQL-based server environment
    6. ...anything else actually?
  • "Duplicate Playground" feature
    1. Iteration 1: Pipe the ZIP export to ZIP import
    2. Iteration 2: Mount /wordpress-new in the duplicated Playground instance, run the PHP export/import code to migrate the site from /wordpress there
    3. Iteration 3: Keep track of progress, make it resumable regardless of when the process is interrupted. This would enable exporting really big sites
  • Direct WordPress <-> WordPress transfer
    1. Conceptually, this is like running Duplicate Playground over the internet
    2. Important to keep track of progress and resource versions using a vector clock
    3. Export / Import UI with scope (users? posts? etc.), error info (image.jpg couldn't be fetched after 3 retries), and error resolution mechanism (specify a different url? upload that image? retry 4th time?)
  • Live WordPress <-> WordPress data sync
    1. Run the WordPress <-> WordPress transfer in a continuous way.
    2. This is not about collaborative editing in the block editor, although there is likely an overlap around data synchronization.
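The WordPress <-> WordPress transfer item above mentions tracking resource versions with a vector clock. As a rough illustration of the idea only (the `Vector_Clock` class and its method names are hypothetical, not an existing Playground API):

```php
<?php
// Hypothetical sketch: one vector clock per synced resource, with one
// counter per participating site. Nothing here is a committed API.
class Vector_Clock {
	private $clock = array();

	// Record a local change on $site_id.
	public function tick( $site_id ) {
		$this->clock[ $site_id ] = ( $this->clock[ $site_id ] ?? 0 ) + 1;
	}

	// Absorb knowledge from a remote clock (take the max per site).
	public function merge( Vector_Clock $other ) {
		foreach ( $other->clock as $site_id => $counter ) {
			$this->clock[ $site_id ] = max( $this->clock[ $site_id ] ?? 0, $counter );
		}
	}

	// True when $this happened strictly before $other; if neither
	// happened before the other, the versions are concurrent and the
	// transfer UI would need to surface a conflict.
	public function happened_before( Vector_Clock $other ) {
		$strictly_smaller = false;
		$all_sites = array_unique( array_merge(
			array_keys( $this->clock ),
			array_keys( $other->clock )
		) );
		foreach ( $all_sites as $site_id ) {
			$mine   = $this->clock[ $site_id ] ?? 0;
			$theirs = $other->clock[ $site_id ] ?? 0;
			if ( $mine > $theirs ) {
				return false;
			}
			if ( $mine < $theirs ) {
				$strictly_smaller = true;
			}
		}
		return $strictly_smaller;
	}
}
```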

Here's a few more use-cases we'll likely tackle along the way, but they're not key milestones on their own:

  • Markdown workflow for editing existing documentation sites from GitHub
    • Markdown importer
    • Markdown exporter – migrate @dmsnell's Markdown <-> Block markup TypeScript converter from https://github.com/dmsnell/blocky-formats to PHP
    • Discuss using Playground to edit Playground docs, Gutenberg docs, and potentially all WordPress docs
    • Discuss using it as a drop-in static site generator replacement (e.g. Jekyll)
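For the Markdown importer, the core of the work is mapping Markdown constructs to block markup, which blocky-formats currently does in TypeScript. A deliberately tiny PHP sketch of that mapping (the function name and the two-rule grammar are illustrative only; a real port would cover far more syntax):

```php
<?php
// Illustrative sketch of a Markdown -> block markup mapping. Only
// headings and paragraphs are handled; everything else is hypothetical
// future work.
function markdown_to_blocks( $markdown ) {
	$blocks = array();
	// Split on blank lines into coarse paragraphs.
	foreach ( preg_split( '/\n{2,}/', trim( $markdown ) ) as $chunk ) {
		if ( preg_match( '/^(#{1,6})\s+(.*)$/', $chunk, $m ) ) {
			$level    = strlen( $m[1] );
			$blocks[] = sprintf(
				"<!-- wp:heading {\"level\":%d} --><h%d>%s</h%d><!-- /wp:heading -->",
				$level, $level, $m[2], $level
			);
		} else {
			$blocks[] = "<!-- wp:paragraph --><p>{$chunk}</p><!-- /wp:paragraph -->";
		}
	}
	return implode( "\n", $blocks );
}
```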
@adamziel adamziel added [Aspect] Data Liberation [Type] Project [Type] Tracking Tactical breakdown of efforts across the codebase and/or tied to Overview issues. labels Oct 14, 2024
@bgrgicak bgrgicak moved this from Inbox to In progress in Playground Board Oct 15, 2024
@adamziel adamziel moved this from In progress to Project: In Progress in Playground Board Oct 16, 2024
adamziel added a commit that referenced this issue Oct 28, 2024
A part of #1894.
Follows up on
#1893.

This PR brings in a few more PHP APIs that were initially explored
outside of Playground so that they can be incubated in Playground. See
the linked descriptions for more details about each API:

* XML Processor from
WordPress/wordpress-develop#6713
* Stream chain from adamziel/wxr-normalize#1
* A draft of a WXR URL Rewriter class capable of rewriting URLs in WXR
files

## Testing instructions

* Confirm the PHPUnit tests pass in CI
* Confirm the test suite looks reasonable
* That's it for now! It's all new code that's not actually used anywhere
in Playground yet. I just want to merge it to keep iterating and
improving.
adamziel added a commit that referenced this issue Oct 30, 2024
A part of #1894.

Adds https://github.com/WordPress/blueprints-library as a git submodule
to the data-liberation package to enable easy code reuse between the
projects. I'm not yet sure, but perhaps moving all the PHP libraries to
the blueprints-library would make sense? TBD

No testing instructions. This is just a new submodule. No code changes
are involved.
adamziel added a commit that referenced this issue Oct 31, 2024
…essor (#1960)

Merge `WP_XML_Tag_Processor` and `WP_XML_Processor` into a single
`WP_XML_Processor` class. This reduces abstractions, enables keeping
more properties as private, and simplifies the code.

Related to #1894
and WordPress/wordpress-develop#6713

## Testing instructions

Confirm the CI tests pass.
@github-project-automation github-project-automation bot moved this from Project: In Progress to Done in Playground Board Oct 31, 2024
@brandonpayton
Member

@adamziel I think this may have been accidentally closed when #1960 was merged because it was "Related to" this one. There are a good number of tasks left unfinished, and this closing looks automated rather than intentional.

I'll reopen, and you can close again if it was intentional.

@adamziel
Collaborator Author

adamziel commented Nov 2, 2024

Let's also review Automattic's VIP WXR importer for going from WXR reading to importing:

https://github.com/search?q=repo%3AAutomattic%2Fvip-go-mu-plugins%20wxr&type=code

adamziel added a commit that referenced this issue Nov 2, 2024
This PR introduces the `WP_WXR_Reader` class for parsing WordPress
eXtended RSS (WXR) files, along with supporting improvements to the XML
processing infrastructure.

**Note: `WP_WXR_Reader` is just a reader. It won't actually import the
data into WordPress** – that part is coming soon.

A part of #1894

## Motivation

There is no WordPress importer that would check all these boxes:

* Supports 100GB+ WXR files without running out of memory
* Can pause and resume along the way 
* Can resume even after a fatal error
* Can run without libxml and mbstring
* Is really fast

`WP_WXR_Reader` is a step in that direction. It cannot pause and resume
yet, but the next few PRs will add that feature.

## Implementation

`WP_WXR_Reader` uses the `WP_XML_Processor` to find XML tags
representing meaningful WordPress entities. The reader knows the WXR
schema and only looks for relevant elements. For example, it knows that
posts are stored in `rss > channel > item` and comments are stored in
`rss > channel > item > wp:comment`.

The `$wxr->next_entity()` method stream-parses the next entity from the
WXR document and exposes it to the API consumer via
`$wxr->get_entity_type()` and `$wxr->get_entity_data()`. The next call
to `$wxr->next_entity()` remembers where the parsing has stopped and
parses the next entity after that point.

```php
$fp = fopen('my-wxr-file.xml', 'r');

$wxr_reader = WP_WXR_Reader::from_stream();
while(true) {
    if($wxr_reader->next_entity()) {
        switch ( $wxr_reader->get_entity_type() ) {
            case 'post':
                // ... process post ...
                break;

            case 'comment':
                // ... process comment ...
                break;

            case 'site_option':
                // ... process site option ...
                break;

            // ... process other entity types ...
        }
        continue;
    }

    // Next entity not found – we ran out of data to process.
    // Let's feed another chunk of bytes to the reader.

    if(feof($fp)) {
        break;
    }

    $chunk = fread($fp, 8192);
    if(false === $chunk) {
        $wxr_reader->input_finished();
        continue;
    }
    $wxr_reader->append_bytes($chunk);
}
```

Similarly to `WP_XML_Processor`, the `WP_WXR_Reader` enters a paused
state when it doesn't have enough XML bytes to parse the entire entity.

The _next_entity() -> fread -> break_ usage pattern may seem a bit
tedious. This is expected. Even if the WXR parsing part of the
`WP_WXR_Reader` offers a high-level API, working with byte streams
requires reasoning on a much lower level. The `StreamChain` class
shipped in this repository will make the API consumption easier with its
transformation-oriented API for chaining data processors.

### Supported WordPress entities

* posts – sourced from `<item>` tags
* comments – sourced from `<wp:comment>` tags
* comment meta – sourced from `<wp:commentmeta>` tags
* users – sourced from `<wp:author>` tags
* post meta – sourced from `<wp:postmeta>` tags
* terms – sourced from `<wp:term>` tags
* tags – sourced from `<wp:tag>` tags
* categories – sourced from `<wp:category>` tags

## Caveats

### Extensibility

`WP_WXR_Reader` ignores any XML elements it doesn't recognize. The WXR
format is extensible, so the reader may eventually support registering
custom handlers for unknown tags.

### Nested entities intertwined with data

`WP_WXR_Reader` flushes the current entity whenever another entity
starts. The upside is simplicity and a tiny memory footprint. The
downside is that it's possible to craft a WXR document where some
information would be lost. For example:

```xml
<rss>
	<channel>
		<item>
		  <title>Page with comments</title>
		  <link>http://wpthemetestdata.wordpress.com/about/page-with-comments/</link>
		  <wp:postmeta>
		    <wp:meta_key>_wp_page_template</wp:meta_key>
		    <wp:meta_value><![CDATA[default]]></wp:meta_value>
		  </wp:postmeta>
		  <wp:post_id>146</wp:post_id>
		</item>
	</channel>
</rss>
```
`WP_WXR_Reader` would accumulate post data until the `<wp:postmeta>` tag.
Then it would emit a `post` entity and accumulate the meta information
until the `</wp:postmeta>` closer. Then it would advance to
`<wp:post_id>` and **ignore it**.

This has not been a problem in any of the `.wxr` files I've seen. Still, it
is important to note this limitation. There may be a `.wxr` generator out
there that intertwines post fields with post meta and comments. If this
ever comes up, we could:

* Emit the `post` entity first, then all the nested entities, and then
emit a special `post_update` entity.
* Do multiple passes over the WXR file – one for each level of nesting,
e.g. 1. Insert posts, 2. Insert Comments, 3. Insert comment meta

Buffering all the post meta and comments seems like a bad idea – there
might be gigabytes of data.
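
To make the first option concrete, here is a rough sketch of how an importer could consume a hypothetical `post_update` entity (neither the entity type nor `apply_entity()` exist yet; `$posts` stands in for whatever storage the importer actually uses):

```php
<?php
// Sketch of the first mitigation: fields that appear after nested
// entities (like <wp:post_id> above) get emitted as a separate
// `post_update` entity and merged into the already-inserted post.
function apply_entity( array &$posts, $type, array $data ) {
	switch ( $type ) {
		case 'post':
			// First sighting of the post: insert it as-is.
			$posts[ $data['guid'] ] = $data;
			break;
		case 'post_update':
			// Late fields: merge them into the existing record.
			$posts[ $data['guid'] ] = array_merge( $posts[ $data['guid'] ], $data );
			break;
	}
}
```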

## Future Plans

The next phase will add pause/resume functionality to handle timeout
scenarios:

- Save the parser state after each entity, or only every `n` entities for
speed. In the latter case, also save `n` for a quick rewind after resuming.
- Resume parsing from saved state.
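
A minimal sketch of what such a saved state could look like, assuming a plain JSON snapshot (none of these function names or keys are a committed API; the point is just "byte offset + entity count", as described above):

```php
<?php
// Hypothetical pause/resume snapshot for the reader.
function save_reader_state( $path, $byte_offset, $entities_emitted ) {
	file_put_contents( $path, json_encode( array(
		'byte_offset'      => $byte_offset,      // where to fseek() on resume
		'entities_emitted' => $entities_emitted, // the `n` used for a quick rewind
	) ) );
}

function load_reader_state( $path ) {
	return json_decode( file_get_contents( $path ), true );
}
```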

## Testing Instructions

Read the tests and ponder whether they make sense. Confirm the PHPUnit
test suite passes on CI. The test suite includes coverage for various
WXR formats and streaming behaviors.