Parsing Gmane with Factor, part 3
Part 3 of the Parsing Gmane with Factor tutorial.
The scraper we will build in this part will scrape urls such as http://article.gmane.org/gmane.comp.lang.factor.general/6020 and return mail tuples that we defined in part 1.
But first, a small recap. One of the words we wrote in part 1 was insert-mail, which made it possible to write:
IN: scratchpad random-mail insert-mail
Note how we haven't declared a single variable anywhere, and yet, this piece of code is perfectly readable! Generate a random mail, then insert it. The reason is that Factor makes it incredibly easy to combine words and create data pipelines. Each part of a Factor program can be seen as a pipe through which data flows:
ø >> random-mail >> mail
mail >> insert-mail >> ø
Here, ø means empty. random-mail
is the start of the pipeline and generates one mail which insert-mail
consumes and returns nothing or ø. The scraper function we are going to build will take something that identifies a gmane mail and produce one mail tuple:
mail-key >> scrape-mail >> mail
What is mail-key? It is the data we need to uniquely identify a message sent to a newsgroup on Gmane. I'm choosing to define it as the combination of the mid (an integer) and group (a string) of the mail. So the above can be restated as:
mid group >> scrape-mail >> mail
Then just insert this "pipe" into the old pipeline:
1234 "newsgroup here" scrape-mail insert-mail
So, the task of scraping gmane can be restated as merely consuming a mail key, an int and a string, to produce a mail tuple.
Create an empty vocabulary skeleton located in gmane/scraper/scraper.factor:
USING: ;
IN: gmane.scraper
Every vocabulary is strictly required to have a directory path that matches its name. The exception is private vocabularies, which this tutorial won't cover. So if a vocabulary is named foo.bar.baz, its file must be put in foo/bar/baz/baz.factor.
From this point forward, I will not list which vocabularies you need to add to USING: because Factor will tell you when you compile the file. You can also search for a word in the Factor handbook to find out which vocabulary it belongs to.
: mail-url ( n str -- str )
swap "http://article.gmane.org/gmane.%s/%d" sprintf ;
This trivial function creates urls from mail specifications.
IN: scratchpad 33 "comp.lang.factor.general" mail-url
"http://article.gmane.org/gmane.comp.lang.factor.general/33"
Why is the function declared as n str instead of str n? If it were the latter, the swap stack shuffling word wouldn't be needed. The reason is that str is the "dominant" parameter, and keeping it last makes the word easy to use in quotations:
IN: scratchpad 10 [1,b] [ "comp.lang.factor.general" mail-url ] map .
{
"http://article.gmane.org/gmane.comp.lang.factor.general/1"
"http://article.gmane.org/gmane.comp.lang.factor.general/2"
"http://article.gmane.org/gmane.comp.lang.factor.general/3"
"http://article.gmane.org/gmane.comp.lang.factor.general/4"
"http://article.gmane.org/gmane.comp.lang.factor.general/5"
"http://article.gmane.org/gmane.comp.lang.factor.general/6"
"http://article.gmane.org/gmane.comp.lang.factor.general/7"
"http://article.gmane.org/gmane.comp.lang.factor.general/8"
"http://article.gmane.org/gmane.comp.lang.factor.general/9"
"http://article.gmane.org/gmane.comp.lang.factor.general/10"
}
Usually, when you are unsure what order to give the parameters of a word, put the dominant parameter, the one that changes least often, last, because that makes the word easier to call. It's a good rule of thumb, but there are many, many exceptions.
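As a small aside illustrating this rule (the example uses nothing beyond the mail-url word defined above and the standard curry combinator), a dominant-last parameter can also be fixed in advance with curry, which prepends a value to a quotation:
IN: scratchpad 3 [1,b] "comp.lang.factor.general" [ mail-url ] curry map .
{
"http://article.gmane.org/gmane.comp.lang.factor.general/1"
"http://article.gmane.org/gmane.comp.lang.factor.general/2"
"http://article.gmane.org/gmane.comp.lang.factor.general/3"
}
Had n been the last parameter instead, we would have had to shuffle the stack inside the quotation.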
The two accepted text formats for email are plain text and html. Plain text is simple and easy and can be stored and displayed verbatim. But we won't bother rendering stupid html and will instead convert it to plain text before storing it in the SQLite database. One particular annoyance is html entities like &lt; and &amp; that must be replaced with their single-character equivalents, which is what the word below does:
: replace-entities ( html-str -- str )
'[ _ string>xml-chunk ] with-html-entities first ;
Test it in the listener:
IN: scratchpad "&lt;test&amp;test&gt;" replace-entities .
"<test&test>"
Scraping html is done with the scrape-html
word in the html.parser.analyzer
library. It takes an http url as input and produces a tuple of response headers and a vector representing the parsed html page:
IN: scratchpad "http://article.gmane.org/gmane.comp.lang.factor.general/33" scrape-html
--- Data stack:
T{ response f "1.1" 200 "OK" H{ ~array~ ~array~ ~array~ ~array~...
V{ ~tag~ ~tag~ ~tag~ ~tag~ ~tag~ ~tag~ ~tag~ ~tag~ ~tag~ ~tag~...
We can then further process the vector to extract the required bits of information such as the mail subject and body text. But before we do that, let's introduce another utility function:
: tag-vector>string ( vector -- string )
[ dup name>> "br" = [ text >>name "\n" >>text ] when ] map
[ [ text>> ] [ name>> comment = not ] bi and ] filter
[ text>> ] map concat
replace-entities
R/ \n\n+/ "\n\n" re-replace
[ blank? ] trim ;
This function takes a vector of tags in the same format that scrape-html
emits and produces a string with all superfluous html markup stripped and br tags replaced with newlines. We want to display mails in plain text, not with redundant html tags, when showing them in the listener.
To compile it, add the html
, html.parser
and html.parser.analyzer
vocabularies.
Ensure that the defined word works:
IN: scratchpad "http://article.gmane.org/gmane.comp.lang.factor.general/33" scrape-html nip tag-vector>string print
...lots of text here...
IN: scratchpad
Note that the function doesn't strip everything that it probably should -- contents of script tags, for example, are considered "plain text" which is clearly wrong. But we won't concern ourselves too much with those flaws at the moment because we will only use the function to find the body text of html mails. Scraping html is always a messy affair, so good enough is the name of the game here.
Take a look at an example of an html formatted mail archived in the comp.lang.factor.general group in Firefox's or Chrome's DOM explorer, for example http://article.gmane.org/gmane.comp.lang.factor.general/5681. Note how the content we're interested in is structured on the page:
<td class="bodytd">
<div class="headers">
<div class="face">...</div>
...
</div>
... mail body is here! ...
<script type="text/javascript">
document.domain = 'gmane.org';
document.title = 'radical';
</script>
</td>
Using this information, we can construct an algorithm to get all the content within the bodytd
class, then "subtract" the headers
class and the script
tag and what we're left with is the mail body which we convert to plain text.
: remove-all ( seq subseqs -- seq )
swap [ { } replace ] reduce ;
: parse-mail-body ( html -- text )
"bodytd" find-by-class-between dup
[
[ name>> "script" = ] [ "headers" html-class? ] bi or
] find-between-all remove-all tag-vector>string ;
The "trick" to understanding parse-mail-body
is to recognize that remove-all
works as a subtraction and that find-between-all
only finds nodes within the nodes find-by-class-between
found.
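Before moving on, it may help to see remove-all on its own. Here is a small illustrative example (the strings are made up; each subsequence in turn is replaced with the empty sequence):
IN: scratchpad "abcXdefXabc" { "abc" "X" } remove-all
The result is the input with every occurrence of "abc" and "X" removed, leaving only the def part. In parse-mail-body, the same subtraction happens on vectors of tags instead of strings.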
Again, check that the function does what it's supposed to:
IN: scratchpad 5681 "comp.lang.factor.general" mail-url scrape-html nip parse-mail-body print
... mail text here ...
IN: scratchpad
If you get the error “No word named replace found in current vocabulary search path”, you have an old version of Factor. The easiest way to fix it is to upgrade your Factor installation. If you are able to, it's probably best to build directly from the git source.
In addition to the mail body, we want to extract the date when the mail was sent, the subject and who sent it. By looking at the source of some pages on gmane, we can conclude that the relevant information is contained in lines like these:
From: ....\n
Date: ....\n
Subject: ....\n
It's trivial to parse that using regular expressions:
: parse-mail-header ( html header -- text )
[ tag-vector>string ] dip
": " append dup "[^\n]+" append <regexp> swapd first-match
swap "" replace ;
As always, try it out in the listener to see how it works:
IN: scratchpad 5681 "comp.lang.factor.general" mail-url scrape-html nip "From" parse-mail-header .
"L N <leonardnemoi@...>"
Remember the mail tuple we defined in part 1? If we use parse-mail-header
to get the mid, group, date, sender and subject field and parse-mail-body
to get the body then we have enough pieces to generate complete mail tuples.
Here is a first bad version of a word to parse mail tuples using the helpers previously defined:
: parse-mail-bad ( n str -- mail )
f -rot 2dup
mail-url scrape-html nip
{
[ "Date" parse-mail-header ymdhms>timestamp ]
[ "From" parse-mail-header ]
[ "Subject" parse-mail-header ]
[ parse-mail-body ]
} cleave mail boa ;
This function has the same signature and purpose as the random-mail
function we created in part 1. It uses cleave
to put the necessary data in the correct order on the stack and calls boa
to build the tuple. Remember the pipeline analogy introduced earlier? Here, cleave
forks the data flow into N streams, one for each quote in the sequence, and then joins their results on the stack.
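To see cleave in isolation, try this minimal listener example, which uses only standard math words. One value is fed to three quotations, and all three results end up on the stack:
IN: scratchpad 5 { [ 1 + ] [ 2 * ] [ 10 swap - ] } cleave
--- Data stack:
6
10
5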
Why is this function bad? Although it works superbly when the parameters reference an existing mail, there is no error checking in case they don't:
IN: scratchpad 25 "comp.lang.factor.general" parse-mail-bad ! this works
--- Data stack:
T{ mail f f 25 "comp.lang.factor.general" ~timestamp~...
IN: scratchpad 25 "bork.fork" parse-mail-bad ! this doesn't
Bad store to specialized slot
value f
class string
Type :help for debugging help.
Ouch! This is not good, and we definitely need to handle that error. The only question is what the most efficient way to do it is. Take a look at the page the scraper accesses when we type a command like the above: http://article.gmane.org/gmane.bork.fork/25. Unfortunately, gmane doesn't set the response code to 404 like it absolutely should, so we have to use another method to detect missing mails. What we can do instead is check whether the parsed page's tag vector consists of only one element and return f in that case. It is not as clean as looking at the response code would have been, but it gets the job done.
: parse-mail ( n str -- mail/f )
2dup mail-url scrape-html nip dup length 1 =
[ 3drop f ]
[
[ f -rot ] dip
{
[ "Date" parse-mail-header ymdhms>timestamp ]
[ "From" parse-mail-header ]
[ "Subject" parse-mail-header ]
[ parse-mail-body ]
} cleave mail boa
] if ;
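Verify that the failure path now behaves as intended (assuming gmane still serves the placeholder page for missing mails, as in the error example above):
IN: scratchpad 25 "bork.fork" parse-mail .
f
Existing mails are parsed into tuples exactly as with parse-mail-bad before.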
In the next part of the tutorial, we will write some text formatting words to make it easy to display tabular data tastefully.