Skip to content

Parsing gmane with factor part 2a

Björn Lindqvist edited this page Aug 19, 2013 · 23 revisions

Part 2a of the Parsing Gmane with Factor tutorial.

In this part, we'll get acquainted with how to parse HTML in Factor and also how to manage state.

Factor's HTML parsing vocab exists at html.parser. Here is a minimal usage example:

IN: scratchpad USE: html.parser
IN: scratchpad "<div>Hello!</div>" parse-html .
V{
    T{ tag { name "div" } { attributes H{ } } }
    T{ tag { name text } { text "Hello!" } }
    T{ tag { name "div" } { attributes H{ } } { closing? t } }
}

parse-html is a very straightforward verb that takes a html string and outputs a vector consisting of tag tuples. Each tag is either an opening tag, a text node or a closing tag.

Suppose we want to use this verb to convert random html to plain text (here the text to enter is split over multiple lines to avoid horizontal scrolling but you can enter the whole expressions on one line just aswell):

IN: scratchpad "<p>Hello <b>World!</b></p>"
IN: scratchpad parse-html [ text>> dup "" ? ] map concat print
Hello World!

Works fine so far! Let's try a more difficult example.

IN: scratchpad "<p>Hello \n<b>World!</b></p>"
IN: scratchpad parse-html [ text>> dup "" ? ] map concat print
Hello
World!

See what I did there? I added the newline character to the end of "Hello " which of course just get printed verbatim. That's not right; newline characters should not be rendered as line breaks according to the HTML specification. Can we fix the problem? Sure, just remove the \n character from the resulting string:

IN: scratchpad USE: splitting
IN: scratchpad "<div>Hello \n<b>World!</b></div>"
IN: scratchpad parse-html [ text>> dup "" ? ] map concat "\n" "" replace print
Hello World!

Let's try a third example, this time involving the <pre> tag:

IN: scratchpad "<pre>One\nTwo</pre>" parse-html
IN: scratchpad [ text>> dup "" ? ] map concat "\n" "" replace print
OneTwo

Here the line break should be preserved. The output should have been:

One
Two

If we leave the line breaks in, text in div tags aren't displayed correctly. But if we take them out the pre tags look bad. To solve it we need to know whether the text is written inside a pre tag or not.

Recording Tag Paths

The tag vector parse-html spits out isn't enough to answer the question of which tags surrounds a given tag. Since we don't know it's place in the html hierarchy, we can't say if it is inside a pre tag or not.

The solution is to for each tag, record which ancestors or path it has in the tree. For example, say you have the following html:

<div>
  <p>
    <span>
      hi
    </span>
    foo
  </p>
</div>

Then each tag (opening, closing and text element) would have the following paths:

TAG       PATH
<div>     (empty path)
<p>       div
<span>    div->p
hi        div->p->span
</span>   div->p
foo       div->p
</p>      div
</div>    (empty path)

Reducing

The algorithm would work by keeping track of of the current position in the html tree and emit that position for each tag. Something like this:

TAG      OLD-PATH             NEW-PATH             TAG-PATH
<div>    { }                  { "div" }            { }
<p>      { "div" }            { "div" "p" }        { "div" }
<span>   { "div" "p" }        { "div" "p" "span" } { "div" "p" }
hi       { "div" "p" "span" } { "div" "p" "span" } { "div" "p" "span" }
</span>  { "div" "p" "span" } { "div" "p" }        { "div" "p" }
foo      { "div" "p" }        { "div" "p" }        { "div" "p" }
</p>     { "div" "p" }        { "div" }            { "div" }
</div>   { "div" }            { }                  { }

I'm sure you see the general pattern. An opening block tag adds to the NEW-PATH while a closing block tag removes the last item from it. Text can't have any children so those tags neither adds nor removes from it. The interesting thing about this is that to correctly determine the values of NEW-PATH and TAG-PATH requires both TAG and OLD-PATH.

Implementation

The full vocabulary to accomplish this task is not very big:

USING: accessors html.parser html.parser.analyzer kernel namespaces sequences ;
IN: gmane.html2text.tagpath

SYMBOL: path

CONSTANT: childless-tags { comment text dtd "br" "img" "link" }

: blocktag-path ( path tag -- path' tag-path )
    [ name>> ] [ closing?>> ] [ "/" attribute? ] tri or
    [ over ?last = [ but-last ] when dup ] [ dupd suffix ] if ;

: tag-path ( tag -- path )
    path get swap dup name>> childless-tags member?
    [ drop ] [ blocktag-path path set ] if ;

: tag-paths ( tags -- tags' )
    { } path [ [ tag-path ] map ] with-variable ;

Using the tag-paths word we can construct another word that parses text inside pre tags correctly:

: tags>string ( tags -- string )
    dup tag-paths [
        "pre" swap member? [ text>> dup "" ? ] dip
        [ "\n" " " replace ] unless
    ] 2map concat

IN: scratchpad "<div>Hello\nThere</div> <pre>One\nTwo</pre>"
IN: scratchpad parse-html tags>string print
Hello There One
Two

But HTML parsing is more convoluted than that so there are a few more wrinkles we need to add before putting it all together.

Proceed to part 2b.