Replace hpricot with nokogiri in wordpressdotcom (#555)
Merge pull request 555
parkr authored Jan 21, 2025
1 parent 0e3a6f1 commit 693652d
Showing 6 changed files with 1,035 additions and 53 deletions.
7 changes: 3 additions & 4 deletions docs/_importers/wordpressdotcom.md
@@ -17,10 +17,9 @@ Their default values are what you see above.

 ### Further WordPress migration alternatives

-While the above method works, it does not import much of the metadata that is
-usually stored in WordPress posts and pages. If you need to export things like
-pages, tags, custom fields, image attachments and so on, the following resources
-might be useful to you:
+While the above method works, it doesn't import absolutely every piece of
+metadata. If you need to import custom fields from your pages and posts,
+the following resources might be useful to you:

 - [Exitwp](https://github.com/thomasf/exitwp) is a configurable tool written in
 Python for migrating one or more WordPress blogs into Jekyll (Markdown) format
1 change: 0 additions & 1 deletion jekyll-import.gemspec
@@ -51,7 +51,6 @@ Gem::Specification.new do |s|

# importer dependencies:
# s.add_development_dependency("behance", "~> 0.3") # uses outdated dependencies
-s.add_development_dependency("hpricot", "~> 0.8")
s.add_development_dependency("htmlentities", "~> 4.3")
s.add_development_dependency("mysql2", "~> 0.3")
s.add_development_dependency("open_uri_redirections", "~> 0.2")
68 changes: 34 additions & 34 deletions lib/jekyll-import/importers/wordpressdotcom.rb
@@ -8,7 +8,7 @@ def self.require_deps
rubygems
fileutils
safe_yaml
-hpricot
+nokogiri
time
open-uri
open_uri_redirections
@@ -22,16 +22,16 @@ def self.specify_options(c)
end

# Will modify post DOM tree
-def self.download_images(title, post_hpricot, assets_folder)
-images = (post_hpricot / "img")
+def self.download_images(title, post_doc, assets_folder)
+images = post_doc.css("img")
return if images.empty?

-Jekyll.logger.info "Downloading images for ", title
+Jekyll.logger.info "Downloading:", "images for #{title}"
images.each do |i|
uri = URI::DEFAULT_PARSER.escape(i["src"])

dst = File.join(assets_folder, File.basename(uri))
-i["src"] = File.join("{{ site.baseurl }}", dst)
+i["src"] = File.join("{{site.baseurl}}", dst)
Jekyll.logger.info uri
if File.exist?(dst)
Jekyll.logger.info "Already in cache. Clean assets folder if you want a redownload."
@@ -54,15 +54,18 @@ def self.download_images(title, post_hpricot, assets_folder)

class Item
def initialize(node)
+raise "Node is nil" if node.nil?
+
@node = node
end

def text_for(path)
-@node.at(path).inner_text
+subnode = @node.at_xpath("./#{path}") || @node.at(path) || @node.children.find { |child| child.name == path }
+subnode.text
end

def title
-@title ||= text_for(:title).strip
+@title ||= text_for("title").strip
end

def permalink_title
@@ -76,12 +79,10 @@ def permalink_title
end

def permalink
-# Hpricot thinks "link" is a self closing tag so it puts the text of the link after the tag
-# but sometimes it works right! I think it's the xml declaration
@permalink ||= begin
uri = text_for("link")
-uri = @node.at("link").following[0] if uri.empty?
-URI(uri.to_s).path
+uri = @node.at("link").next_sibling.text if uri.empty?
+URI(uri.to_s.strip).path
end
end

@@ -127,12 +128,8 @@ def published?

def excerpt
@excerpt ||= begin
-text = Hpricot(text_for("excerpt:encoded")).inner_text
-if text.empty?
-nil
-else
-text
-end
+text = Nokogiri::HTML(text_for("excerpt:encoded")).text
+text.empty? ? nil : text
end
end
end
@@ -144,29 +141,32 @@ def self.process(options)
FileUtils.mkdir_p(assets_folder)

import_count = Hash.new(0)
-doc = Hpricot::XML(File.read(source))
+doc = Nokogiri::XML(File.read(source))
# Fetch authors data from header
authors = Hash[
-(doc / :channel / "wp:author").map do |author|
-[author.at("wp:author_login").inner_text.strip, {
-"login" => author.at("wp:author_login").inner_text.strip,
-"email" => author.at("wp:author_email").inner_text,
-"display_name" => author.at("wp:author_display_name").inner_text,
-"first_name" => author.at("wp:author_first_name").inner_text,
-"last_name" => author.at("wp:author_last_name").inner_text,
-},]
+doc.xpath("//channel/wp:author").map do |author|
+[
+author.xpath("./wp:author_login").text.strip,
+{
+"login" => author.xpath("./wp:author_login").text.strip,
+"email" => author.xpath("./wp:author_email").text,
+"display_name" => author.xpath("./wp:author_display_name").text,
+"first_name" => author.xpath("./wp:author_first_name").text,
+"last_name" => author.xpath("./wp:author_last_name").text,
+},
+]
end
] rescue {}

-(doc / :channel / :item).each do |node|
+doc.css("channel > item").each do |node|
item = Item.new(node)
-categories = node.search('category[@domain="category"]').map(&:inner_text).reject { |c| c == "Uncategorized" }.uniq
-tags = node.search('category[@domain="post_tag"]').map(&:inner_text).uniq
+categories = node.css('category[domain="category"]').map(&:text).reject { |c| c == "Uncategorized" }.uniq
+tags = node.css('category[domain="post_tag"]').map(&:text).uniq

metas = {}
-node.search("wp:postmeta").each do |meta|
-key = meta.at("wp:meta_key").inner_text
-value = meta.at("wp:meta_value").inner_text
+node.xpath("./wp:postmeta").each do |meta|
+key = meta.at_xpath("./wp:meta_key").text
+value = meta.at_xpath("./wp:meta_value").text
metas[key] = value
end

@@ -189,7 +189,7 @@ def self.process(options)
}

begin
-content = Hpricot(item.text_for("content:encoded"))
+content = Nokogiri::HTML(item.text_for("content:encoded"))
header["excerpt"] = item.excerpt if item.excerpt

if fetch
@@ -221,7 +221,7 @@ def self.process(options)
end

import_count.each do |key, value|
-Jekyll.logger.info "Imported #{value} #{key}s"
+Jekyll.logger.info "Imported", "#{value} #{Util.pluralize(key, value)}"
end
end

8 changes: 8 additions & 0 deletions lib/jekyll-import/util.rb
@@ -73,5 +73,13 @@ def self.wpautop(pee, br = true)
end
pee
end

+def self.pluralize(word, count)
+return word if count <= 1
+
+return word if word.end_with?("s")
+
+"#{word}s"
+end
end
end
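
Note on the new `pluralize` helper: the sketch below copies its body into a standalone module so its behavior can be checked without loading jekyll-import. The module name `PluralizeSketch` and the sample words are assumptions made for illustration; only the method body comes from the diff above.

```ruby
# Standalone copy of the helper added in lib/jekyll-import/util.rb.
# PluralizeSketch is a hypothetical wrapper name, not part of the gem.
module PluralizeSketch
  def self.pluralize(word, count)
    # Counts of 1 or less (including 0) keep the singular form.
    return word if count <= 1

    # Words already ending in "s" are left untouched.
    return word if word.end_with?("s")

    "#{word}s"
  end
end

puts PluralizeSketch.pluralize("post", 1) # post
puts PluralizeSketch.pluralize("post", 3) # posts
puts PluralizeSketch.pluralize("news", 3) # news
puts PluralizeSketch.pluralize("post", 0) # post
```

As written, the helper returns the singular form for a count of zero because of the `count <= 1` guard, so a log line built with it would read "0 post" rather than "0 posts". Whether that is intended is not stated in the commit.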