Parsing Gmane with Factor part 3

Part 3 of the Parsing Gmane with Factor tutorial.

The scraper we will build in this part will scrape URLs such as http://article.gmane.org/gmane.comp.lang.factor.general/6020 and return the mail tuples we defined in part 1.

About pipelines

But first, a small recap. One of the words we wrote in part 1 was insert-mail, which let us write:

IN: scratchpad random-mail insert-mail

Note how we haven't declared a single variable anywhere, and yet, this piece of code is perfectly readable! Generate a random mail, then insert it. The reason is that Factor makes it incredibly easy to combine words and create data pipelines. Each part of a Factor program can be seen as a pipe through which data flows:

ø >> random-mail >> mail
mail >> insert-mail >> ø

Here, ø means empty. random-mail is the start of the pipeline and generates one mail, which insert-mail consumes, returning nothing, or ø. The scraper function we are going to build will take something that identifies a gmane mail and produce one mail tuple:

mail-key >> scrape-mail >> mail

What is mail-key? It is the pieces of data we need to uniquely identify a message sent to a newsgroup on Gmane. I'm choosing to define it as the combination of the mid (an integer) and group (a string) of the mail. So the above can be restated as:

mid group >> scrape-mail >> mail

Then just insert this "pipe" into the old pipeline:

1234 "newsgroup here" scrape-mail insert-mail

So, the task of scraping gmane can be restated as merely consuming a mail key, an integer and a string, and producing a mail tuple.
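
Expressed as a Factor stack effect, that is the whole contract. Purely to pin down the signature, here is a do-nothing stub (you don't need to type this in; the finished word, defined at the end of this part, is called parse-mail, and as we will see it can also return f):

: parse-mail ( n str -- mail/f )
    2drop f ; ! placeholder body; the real definition comes later in this part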

Some Utility Functions

Create an empty vocabulary skeleton located in gmane/scraper/scraper.factor:

USING: ;
IN: gmane.scraper

Every vocabulary is strictly required to have a directory path that matches its name. The exception is private vocabularies, which this tutorial won't cover. So, if the vocabulary is named foo.bar.baz, the file must be put in foo/bar/baz/baz.factor.
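
If you'd rather not create the directory and file by hand, Factor's scaffolding tool can generate the skeleton for you. A minimal sketch, assuming the tools.scaffold vocabulary that ships with Factor and that you keep the project in your work directory:

IN: scratchpad USE: tools.scaffold
IN: scratchpad "gmane.scraper" scaffold-work

This should create work/gmane/scraper/scraper.factor with the IN: line already filled in.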

From this point forward, I will not list which vocabularies you need to add to USING: because Factor will tell you about it when you compile the file. You can also search for a word in the Factor handbook to find which vocabulary it belongs to.

Generating URLs

: mail-url ( n str -- str )
    swap "http://article.gmane.org/gmane.%s/%d" sprintf ;

This trivial function creates URLs from mail specifications.

IN: scratchpad 33 "comp.lang.factor.general" mail-url
"http://article.gmane.org/gmane.comp.lang.factor.general/33"

Why is the function declared as n str instead of str n? If it were the latter, the swap stack shuffling word wouldn't be needed. The reason is that str is the "dominant" parameter, and keeping it last makes it easy to use in quotations:

IN: scratchpad 10 [1,b] [ "comp.lang.factor.general" mail-url ] map .
{
    "http://article.gmane.org/gmane.comp.lang.factor.general/1"
    "http://article.gmane.org/gmane.comp.lang.factor.general/2"
    "http://article.gmane.org/gmane.comp.lang.factor.general/3"
    "http://article.gmane.org/gmane.comp.lang.factor.general/4"
    "http://article.gmane.org/gmane.comp.lang.factor.general/5"
    "http://article.gmane.org/gmane.comp.lang.factor.general/6"
    "http://article.gmane.org/gmane.comp.lang.factor.general/7"
    "http://article.gmane.org/gmane.comp.lang.factor.general/8"
    "http://article.gmane.org/gmane.comp.lang.factor.general/9"
    "http://article.gmane.org/gmane.comp.lang.factor.general/10"
}

Usually, when you are unsure what order to give the parameters to a word, put the dominant parameter, or the one that changes the least, last, because that makes the word easier to call. It's a good rule of thumb, but there are many, many exceptions.
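
One concrete payoff of putting the dominant parameter last: instead of writing a literal quotation as in the map example above, you can attach the fixed argument at runtime with curry from the standard library and get the same pipeline. A small sketch:

IN: scratchpad 3 [1,b] "comp.lang.factor.general" [ mail-url ] curry map .
{
    "http://article.gmane.org/gmane.comp.lang.factor.general/1"
    "http://article.gmane.org/gmane.comp.lang.factor.general/2"
    "http://article.gmane.org/gmane.comp.lang.factor.general/3"
}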

Replacing HTML Entities

The two accepted text formats for email are plain text and HTML. Plain text is simple and easy and can be stored and displayed verbatim. But we won't bother rendering stupid HTML and will instead convert it to plain text before storing it in the SQLite database. One particular annoyance is HTML entities like &nbsp; and &amp; that must be replaced with their single-character equivalents, which is what the word below does:

: replace-entities ( html-str -- str )
    '[ _ string>xml-chunk ] with-html-entities first ;

Test it in the listener:

IN: scratchpad "&lt;test&amp;test&gt;" replace-entities .
"<test&test>"

Downloading and Parsing Mails

Scraping HTML is done with the scrape-html word in the html.parser.analyzer library. It takes an HTTP URL as input and produces a response tuple and a vector representing the parsed HTML page:

IN: scratchpad "http://article.gmane.org/gmane.comp.lang.factor.general/33" scrape-html

--- Data stack:
T{ response f "1.1" 200 "OK" H{ ~array~ ~array~ ~array~ ~array~...
V{ ~tag~ ~tag~ ~tag~ ~tag~ ~tag~ ~tag~ ~tag~ ~tag~ ~tag~ ~tag~...

We can then further process the vector to extract the required bits of information such as the mail subject and body text. But before we do that, let's introduce another utility function:

: tag-vector>string ( vector -- string )
    ! Turn br tags into text nodes holding a newline.
    [ dup name>> "br" = [ text >>name "\n" >>text ] when ] map
    ! Keep only nodes that have text and are not comments.
    [ [ text>> ] [ name>> comment = not ] bi and ] filter
    ! Join the text, decode entities, collapse blank lines, trim.
    [ text>> ] map concat
    replace-entities
    R/ \n\n+/ "\n\n" re-replace
    [ blank? ] trim ;

This function takes a vector containing tags in the same format that scrape-html emits and produces a string with all superfluous HTML markup stripped and br tags replaced with newlines. We want to display mails in plain text format, not littered with redundant HTML tags, when showing them in the listener.

To compile it, add the html, html.parser, html.parser.analyzer and regexp vocabularies.

Ensure that the defined word works:

IN: scratchpad "http://article.gmane.org/gmane.comp.lang.factor.general/33" scrape-html nip tag-vector>string print
...lots of text here...
IN: scratchpad
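
You can also exercise it without touching the network: the parse-html word in html.parser builds the same kind of tag vector from a plain string. A small sketch with made-up markup:

IN: scratchpad "one<br>two<!-- a comment -->three" parse-html tag-vector>string print
one
twothree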

Note that the function doesn't strip everything that it probably should -- the contents of script tags, for example, are considered "plain text", which is clearly wrong. But we won't concern ourselves too much with those flaws at the moment, because we will only use the function to find the body text of HTML mails. Scraping HTML is always a messy affair, so good enough is the name of the game here.

Parsing the Mail Body

Take a look at an example of an HTML-formatted mail archived in the comp.lang.factor.general group in Firefox's or Chrome's DOM explorer, for example http://article.gmane.org/gmane.comp.lang.factor.general/5681. Note how the content we're interested in is structured on the page:

<td class="bodytd">
  <div class="headers">
    <div class="face">...</div>
    ...
  </div>
  ... mail body is here! ...
  <script type="text/javascript">
    document.domain = 'gmane.org';
    document.title = 'radical';
  </script>
</td>

Using this information, we can construct an algorithm to get all the content within the tag with the bodytd class, then "subtract" the headers div and the script tag; what we're left with is the mail body, which we convert to plain text.

: remove-all ( seq subseqs -- seq )
    ! Remove every occurrence of each subsequence from seq.
    swap [ { } replace ] reduce ;

: parse-mail-body ( html -- text )
    ! Everything inside the tag with class "bodytd"...
    "bodytd" find-by-class-between dup
    [
        ! ...minus the script tags and the headers div.
        [ name>> "script" = ] [ "headers" html-class? ] bi or
    ] find-between-all remove-all tag-vector>string ;

The "trick" to understanding parse-mail-body is to recognize that remove-all works as a subtraction and that find-between-all only finds nodes within the nodes find-by-class-between found.

Again, check that the function does what it's supposed to:

IN: scratchpad 5681 "comp.lang.factor.general" mail-url scrape-html nip parse-mail-body print
... mail text here ...
IN: scratchpad

Word replace missing

No word named “replace” found in current vocabulary search path

If you get this error, you have an old version of Factor. The easiest way to fix it is to upgrade your Factor installation. If you are able to, it's probably best to build from source directly from git.

Parsing the Mail Headers

In addition to the mail body, we want to extract the date when the mail was sent, the subject and who sent it. By looking at the source of some pages on gmane, we can conclude that the relevant information is contained in lines like these:

From: ....\n
Date: ....\n
Subject: ....\n

It's trivial to parse that using regular expressions:

: parse-mail-header ( html header -- text )
    [ tag-vector>string ] dip    ! render the tag vector as plain text
    ": " append                  ! "From" becomes "From: "
    dup "[^\n]+" append <regexp> ! regexp matching the whole header line
    swapd first-match            ! find "From: ..." in the text
    swap "" replace ;            ! strip the "From: " prefix, keep the value

As always, try it out in the listener to see how it works:

IN: scratchpad 5681 "comp.lang.factor.general" mail-url scrape-html nip "From" parse-mail-header .
"L N <leonardnemoi@...>"

Remember the mail tuple we defined in part 1? If we use parse-mail-header to get the date, sender and subject fields and parse-mail-body to get the body, then, together with the mid and group we are given, we have enough pieces to generate complete mail tuples.

Making Mail Tuples

Here is a first bad version of a word to parse mail tuples using the helpers previously defined:

: parse-mail-bad ( n str -- mail )    
    f -rot 2dup
    mail-url scrape-html nip
    {        
        [ "Date" parse-mail-header ymdhms>timestamp ]
        [ "From" parse-mail-header ]
        [ "Subject" parse-mail-header ]
        [ parse-mail-body ]
    } cleave mail boa ;

This function serves the same purpose as the random-mail function we created in part 1: it produces a mail tuple. It uses cleave to put the necessary data in the correct order on the stack and calls boa to build the tuple. Remember the pipeline analogy introduced earlier? Here, cleave forks the data flow into N streams, one for each quotation in the sequence, and then joins their results on the stack.
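
If cleave is new to you, here is a minimal listener illustration of that fork-and-join behavior, collecting the three results into an array just to print them:

IN: scratchpad 5 { [ 1 + ] [ 2 * ] [ sq ] } cleave 3array .
{ 6 10 25 }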

Why is this function bad? Although it works superbly when the parameters reference an existing mail, there is no error checking in case they don't:

IN: scratchpad 25 "comp.lang.factor.general" parse-mail-bad ! this works

--- Data stack:
T{ mail f f 25 "comp.lang.factor.general" ~timestamp~...
IN: scratchpad 25 "bork.fork" parse-mail-bad ! this doesn't
Bad store to specialized slot
value f
class string

Type :help for debugging help.

Ouch! This is not good and we definitely need to handle that error. The only question is what the most efficient way to do it is. Take a look at the page the scraper accesses when we type a command like the one above: http://article.gmane.org/gmane.bork.fork/25. Unfortunately, gmane doesn't set the response code to 404 as it absolutely should, so we have to use another method to detect missing mails. What we can do instead is check whether the parsed page's tag vector consists of only one element and return f in that case. It is not as clean as looking at the response code would have been, but it gets the job done.

: parse-mail ( n str -- mail/f )
    2dup mail-url scrape-html nip dup length 1 =
    [ 3drop f ]
    [
        [ f -rot ] dip
        {
            [ "Date" parse-mail-header ymdhms>timestamp ]
            [ "From" parse-mail-header ]
            [ "Subject" parse-mail-header ]
            [ parse-mail-body ]
        } cleave mail boa
    ] if ;
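
As a quick check that the guard works, the failing call from before now comes back as f instead of raising an error:

IN: scratchpad 25 "bork.fork" parse-mail .
f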

In the next part of the tutorial, we will write some text formatting words for easy and tasteful display of tabular data.