This project has been fully ported to https://github.com/WordPress/wordpress-playground as part of the Data Liberation project.
The ported code is now available at https://github.com/WordPress/wordpress-playground/tree/e17cbbdbf9340ef18a1404d70f1ad9d1d431a3d3/packages/playground/data-liberation
This is an exploration of what could become the WordPress Site Transfer Protocol:
https://core.trac.wordpress.org/ticket/60375
The current version focuses on finding and rewriting URLs. Downloading the related assets is also in scope, but it is not implemented yet.
Inspect tricky-input.html and tricky-output.html to see what this repo can do today.
The next steps here would be to flesh out this README more, start some issues and discussions, and define the minimal v1 to ship and stress-test in Playground.
- URL rewriting works, perhaps better than it ever has. There are still corner cases to discuss and performance optimizations to make before any of this becomes a public API.
- The URL parser requires PHP 8.1. This is fine for some Playground applications, but we'll need more compatibility to get any of this into WordPress core. Also, that parser uses dozens of classes when we could likely rework it into something as self-contained as WP_HTML_Tag_Processor.
- Downloading the assets isn't implemented yet. It feels like there's no way to even start on it without a state tracking table, as there might be far more images to download than the PHP time limit allows. Perhaps that doesn't need to be solved before we start using this with Markdown.
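The state tracking concern in the last bullet can be illustrated with a minimal sketch. It assumes a JSON file as a stand-in for the state tracking table, and the function name, the status strings, and the `$download` callback are all hypothetical; nothing like this exists in the repository yet. Each run works until its time budget is spent, saves progress after every item, and a later run picks up whatever is not yet `done`.

```php
<?php
// Hypothetical sketch: a resumable download queue persisted to a JSON file.
// A real implementation would more likely use a database table.
function process_download_queue(
    string $state_file,
    array $urls,
    callable $download,
    float $time_budget_seconds
): array {
    // Load the state saved by a previous run, if there is one.
    $state = is_file($state_file)
        ? json_decode(file_get_contents($state_file), true)
        : [];
    if (!is_array($state)) {
        $state = [];
    }

    // Register URLs we have not seen before.
    foreach ($urls as $url) {
        if (!isset($state[$url])) {
            $state[$url] = 'pending';
        }
    }

    $deadline = microtime(true) + $time_budget_seconds;
    foreach ($state as $url => $status) {
        if ($status === 'done') {
            continue; // Finished in an earlier run.
        }
        if (microtime(true) >= $deadline) {
            break; // Out of time; the next run resumes from here.
        }
        // "failed" items are not "done", so a later run retries them.
        $state[$url] = $download($url) ? 'done' : 'failed';
        // Persist after every item so an abrupt timeout loses little work.
        file_put_contents($state_file, json_encode($state));
    }

    file_put_contents($state_file, json_encode($state));
    return $state;
}
```

Calling the function repeatedly, for example once per request or cron tick, would drain the queue without any single run exceeding the PHP time limit.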
When this code is fed into the migrator:
```html
<!-- wp:paragraph -->
<p>
<!-- Inline URLs are migrated -->
🚀-science.com/science has the best scientific articles on the internet! We're also
available via the punycode URL:
<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
https://xn---science-7f85g.com/%73%63ience/.
<!-- Correctly ignores similar–but–different URLs -->
This isn't migrated: https://🚀-science.comcast/science <br>
Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->
<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="https://xn---science-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->
<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```
The following output is produced:
```html
<!-- wp:paragraph -->
<p>
<!-- Inline URLs are migrated -->
science.wordpress.com has the best scientific articles on the internet! We're also
available via the punycode URL:
<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
https://science.wordpress.com/.
<!-- Correctly ignores similar–but–different URLs -->
This isn't migrated: https://🚀-science.comcast/science <br>
Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->
<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src":"https:\/\/science.wordpress.com\/wp-content\/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="https://science.wordpress.com/wp-content/image.png">
<!-- /wp:image -->
<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```
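Two details of the example above can be sketched in a few lines of PHP. This is an illustration under stated assumptions, not the code this repository uses: both helper names are hypothetical, hosts are assumed to already be in punycode form (Unicode-to-punycode conversion is left out), and URLs are assumed to carry a scheme so that `parse_url()` can find the host. The first helper shows why block attributes must be JSON-decoded before matching; the second shows why `.comcast` and `super-` hosts are left alone while a percent-encoded path still matches.

```php
<?php
// Hypothetical helpers for illustration; they are not this repository's API.

// Block attributes are JSON, so escapes such as \u0073 or \/ have to be
// decoded before a URL inside them can be compared with the site URL.
function decode_block_attributes(string $json): array {
    $attrs = json_decode($json, true);
    return is_array($attrs) ? $attrs : [];
}

// Returns true when $candidate lives under $site_url. The host must match
// exactly and the path must match on a whole-segment boundary.
function url_belongs_to_site(string $candidate, string $site_url): bool {
    $c = parse_url($candidate);
    $s = parse_url($site_url);
    if (!is_array($c) || !is_array($s) || !isset($c['host'], $s['host'])) {
        return false;
    }
    // "example.comcast" and "super-example.com" are different hosts.
    if (strtolower($c['host']) !== strtolower($s['host'])) {
        return false;
    }
    // Decode percent-encoding so "/%73%63ience" equals "/science".
    $c_path = rtrim(rawurldecode($c['path'] ?? ''), '/');
    $s_path = rtrim(rawurldecode($s['path'] ?? ''), '/');
    if ($s_path === '') {
        return true;
    }
    // Appending "/" prevents "/sciencefiction" from matching "/science".
    return strncmp($c_path . '/', $s_path . '/', strlen($s_path) + 1) === 0;
}
```

For instance, decoding `{"src": "https:\/\/xn---science-7f85g.com/%73%63ience/wp-content/image.png"}` and passing the resulting `src` to `url_belongs_to_site()` with `https://xn---science-7f85g.com/science` as the site URL returns `true`, while `https://xn---science-7f85g.comcast/science` returns `false`. Re-encoding the rewritten attributes with `json_encode()` escapes slashes by default, which is consistent with the `\/` sequences in the output above.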
- WordPress/wordpress-playground#1524
- https://github.com/adamziel/wxr-normalize
- WordPress/blueprints#52
- https://github.com/adamziel/playground-docs-workflow
- https://github.com/WordPress/blueprints-library/blob/trunk/src/WordPress/AsyncHttp/Client.php
- WordPress/wordpress-develop#6713
To migrate domain names from tricky-input.html, run:
```shell
php bin/rewrite-urls.php migrate_urls --file ./tricky-input.html --current-site-url https://🚀-science.com/science --new-site-url https://science.wordpress.com
```
To list all the URLs found in that file, run:
```shell
php bin/rewrite-urls.php list_urls --file ./tricky-input.html
```