-
Notifications
You must be signed in to change notification settings - Fork 2
Parsing gmane with factor part 2a
Part 2a of the Parsing Gmane with Factor tutorial.
In this part, we'll get acquainted with how to parse HTML in Factor and also how to manage state.
Factor's HTML parsing vocab exists at html.parser
. Here is a minimal usage example:
IN: scratchpad USE: html.parser
IN: scratchpad "<div>Hello!</div>" parse-html .
V{
T{ tag { name "div" } { attributes H{ } } }
T{ tag { name text } { text "Hello!" } }
T{ tag { name "div" } { attributes H{ } } { closing? t } }
}
parse-html
is a very straightforward verb that takes a html string and outputs a vector consisting of tag tuples. Each tag is either an opening tag, a text node or a closing tag.
Suppose we want to use this verb to convert random html to plain text (here the text to enter is split over multiple lines to avoid horizontal scrolling but you can enter the whole expressions on one line just aswell):
IN: scratchpad "<p>Hello <b>World!</b></p>"
IN: scratchpad parse-html [ text>> dup "" ? ] map concat print
Hello World!
Works fine so far! Let's try a more difficult example.
IN: scratchpad "<p>Hello \n<b>World!</b></p>"
IN: scratchpad parse-html [ text>> dup "" ? ] map concat print
Hello
World!
See what I did there? I added the newline character to the end of "Hello " which of course just get printed verbatim. That's not right; newline characters should not be rendered as line breaks according to the HTML specification. Can we fix the problem? Sure, just remove the \n
character from the resulting string:
IN: scratchpad USE: splitting
IN: scratchpad "<div>Hello \n<b>World!</b></div>"
IN: scratchpad parse-html [ text>> dup "" ? ] map concat "\n" "" replace print
Hello World!
Let's try a third example, this time involving the <pre>
tag:
IN: scratchpad "<pre>One\nTwo</pre>" parse-html
IN: scratchpad [ text>> dup "" ? ] map concat "\n" "" replace print
OneTwo
Here the line break should be preserved. The output should have been:
One
Two
If we leave the line breaks in, text in div tags aren't displayed correctly. But if we take them out the pre tags look bad. To solve it we need to know whether the text is written inside a pre
tag or not.
The tag vector parse-html
spits out isn't enough to answer the question of which tags surrounds a given tag. Since we don't know it's place in the html hierarchy, we can't say if it is inside a pre
tag or not.
The solution is to for each tag, record which ancestors or path it has in the tree. For example, say you have the following html:
<div>
<p>
<span>
hi
</span>
foo
</p>
</div>
Then each tag (opening, closing and text element) would have the following paths:
TAG PATH
<div> (empty path)
<p> div
<span> div->p
hi div->p->span
</span> div->p
foo div->p
</p> div
</div> (empty path)
The algorithm would work by keeping track of of the current position in the html tree and emit that position for each tag. Something like this:
TAG OLD-PATH NEW-PATH TAG-PATH
<div> { } { "div" } { }
<p> { "div" } { "div" "p" } { "div" }
<span> { "div" "p" } { "div" "p" "span" } { "div" "p" }
hi { "div" "p" "span" } { "div" "p" "span" } { "div" "p" "span" }
</span> { "div" "p" "span" } { "div" "p" } { "div" "p" }
foo { "div" "p" } { "div" "p" } { "div" "p" }
</p> { "div" "p" } { "div" } { "div" }
</div> { "div" } { } { }
I'm sure you see the general pattern. An opening block tag adds to the NEW-PATH while a closing block tag removes the last item from it. Text can't have any children so those tags neither adds nor removes from it. The interesting thing about this is that to correctly determine the values of NEW-PATH and TAG-PATH requires both TAG and OLD-PATH.
The full vocabulary to accomplish this task is not very big:
USING: accessors html.parser html.parser.analyzer kernel namespaces sequences ;
IN: gmane.html2text.tagpath
SYMBOL: path
CONSTANT: childless-tags { comment text dtd "br" "img" "link" }
: blocktag-path ( path tag -- path' tag-path )
[ name>> ] [ closing?>> ] [ "/" attribute? ] tri or
[ over ?last = [ but-last ] when dup ] [ dupd suffix ] if ;
: tag-path ( tag -- path )
path get swap dup name>> childless-tags member?
[ drop ] [ blocktag-path path set ] if ;
: tag-paths ( tags -- tags' )
{ } path [ [ tag-path ] map ] with-variable ;
Using the tag-paths
word we can construct another word that parses text inside pre
tags correctly:
: tags>string ( tags -- string )
dup tag-paths [
"pre" swap member? [ text>> dup "" ? ] dip
[ "\n" " " replace ] unless
] 2map concat
IN: scratchpad "<div>Hello\nThere</div> <pre>One\nTwo</pre>"
IN: scratchpad parse-html tags>string print
Hello There One
Two
But HTML parsing is more convoluted than that so there are a few more wrinkles we need to add before putting it all together.
Proceed to part 2b.