Parsing Gmane with Factor Part 1

Björn Lindqvist edited this page May 27, 2013 · 49 revisions

Part 1 of the Parsing Gmane with Factor tutorial.

Much of the information for this section comes from the Factor/DB page on the Concatenative Wiki, which you probably ought to read before this.

In this part, we'll write a vocabulary (that's Factor lingo for module) to interact with a simple SQLite database. We'll also define a mapping between the database table and Factor's tuple objects.

Here I'd like to mention that while this tutorial presents the construction of this program as a discrete series of steps that neatly takes you from nothing to complete program, that is not at all how it was developed. :) I tried various approaches, changed the schema several times, had to go back and forth and ran into many dead ends while figuring out how to best write the code. All that is omitted from this text because otherwise it would be far too long. But it is worth keeping in mind when you develop your own code that it doesn't have to be perfect the first time.

Making sure SQLite works

Create a new vocabulary with the path gmane/db/db.factor with the following contents:

USING:
    db db.sqlite
    kernel ;
IN: gmane.db

: with-mydb ( quot -- )
    "gmane.db" <sqlite-db> swap with-db ; inline

If you are using Emacs and FUEL, you can compile the vocabulary by first typing C-c C-z to run Factor and then C-c C-k to compile. You should get a message in the mini-buffer that indicates whether the compilation succeeded or not.

After you've ensured that it compiles, switch to the listener with C-c C-z and type:

IN: scratchpad USE: gmane.db
IN: scratchpad [ ] with-mydb
IN: scratchpad    

Missing sqlite3_open

It is possible that Factor throws an error at this point complaining about a missing sqlite3_open. On Windows, fix the problem by downloading the SQLite DLL file for your architecture from here or here and putting it in the same directory that factor.exe lives in. On Linux you probably need to install SQLite's development package, like this:

$ sudo apt-get install libsqlite3-dev

Modify the command to something appropriate for your distro if it isn't an apt-based one.

Defining the Data

Next, you want to define the columns for the table in the SQLite database that will store the scraped mails from Gmane. Edit gmane/db/db.factor so that it looks like:

USING:
    db db.sqlite db.tuples db.types 
    kernel ;
IN: gmane.db

TUPLE: mail id mid group date sender subject body ;

mail "mail" {
    { "id" "id" +db-assigned-id+ }
    { "mid" "mid" INTEGER }
    { "group" "group_" TEXT }
    { "date" "date" DATETIME }
    { "sender" "sender" TEXT }
    { "subject" "subject" TEXT }
    { "body" "body" TEXT }
} define-persistent

: with-mydb ( quot -- )
    "gmane.db" <sqlite-db> swap with-db ; inline

A tuple is like a record or struct in other programming languages. The code that ends with define-persistent maps the slots in the tuple to columns in the database table. Each entry gives the slot name, the column name and the column type; the group slot is stored in a column named group_, presumably because GROUP is a reserved word in SQL. I think most of the columns are self-explanatory, except for mid, which is the message id of the mail set by Gmane, and group, which is the name of the newsgroup the mail was sent to. This message, for example, has group "comp.lang.factor.general" and mid 3020. According to Gmane the newsgroup is "gmane.comp.lang.factor.general", but since every newsgroup Gmane knows about starts with the "gmane." prefix, it is redundant and can be omitted.
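To get a feel for how tuples behave on their own, before any database is involved, here is a minimal sketch you can try in a fresh listener session. It assumes only the accessors vocabulary, which defines the slot readers (such as mid>>) and setters (such as >>mid); the tuple definition is repeated so that the snippet stands alone, and the values are the ones from the example message.

```factor
USING: accessors kernel prettyprint ;
IN: scratchpad

! Same tuple as in db.factor, repeated so the snippet stands alone.
TUPLE: mail id mid group date sender subject body ;

! new creates an instance with every slot set to f.
! Each setter fills in one slot and leaves the tuple on the stack.
mail new
    3020 >>mid
    "comp.lang.factor.general" >>group

! Read two of the slots back and print them.
[ mid>> . ] [ group>> . ] bi
```

This should print 3020 followed by the group name. The slots that were never set (id, date and so on) simply remain f.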

Generating Garbage Mails

Next, we'll add some functions that create random data to store in the database, just so that we have something to play with before continuing on to building the scraper. Add the three functions below to the bottom of db.factor:

: ascii-lower ( -- seq )
    CHAR: a CHAR: z [a,b] ;

: random-ascii ( n -- str )
    [ ascii-lower random ] replicate >string ;

: random-mail ( -- mail )
    f
    5000 random
    10 random-ascii
    now 1000 random 500 - days time+ 
    { 15 30 80 } [ random-ascii ] map first3
    mail boa ;

Try compiling with C-c C-k, and Factor will alert you about lots of missing vocabularies. So add math, math.ranges, random, sequences, strings and calendar to the vocabulary's USING: statement. Then Factor should happily compile it.
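For reference, with those vocabularies added the top of db.factor might look like the sketch below. The ordering is only a matter of taste. One caveat, stated as an assumption about your setup: this tutorial dates from 2013, and as far as I know more recent Factor releases have renamed math.ranges to ranges and [a,b] to [a..b], so adjust accordingly if the compiler can't find them.

```factor
! Header of gmane/db/db.factor after adding the vocabularies
! needed by ascii-lower, random-ascii and random-mail.
USING:
    calendar
    db db.sqlite db.tuples db.types
    kernel
    math math.ranges
    random
    sequences strings ;
IN: gmane.db
```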

Now you can insert some garbage data and then read it back in the listener:

IN: scratchpad [ mail create-table ] with-mydb
... each time you run the line below, a new mail gets inserted 
IN: scratchpad [ random-mail insert-tuple ] with-mydb
IN: scratchpad [ T{ mail } select-tuples ] with-mydb .

Factor always tells you when it can't find a word and helpfully suggests which vocabulary to import. You will probably need to import a whole bunch of them to get the lines above running. As a last step, let's add a function to insert mails so that we don't have to call insert-tuple and with-mydb directly:

: insert-mail ( mail -- id )
    '[ _ insert-tuple last-insert-id ] with-mydb ;

To compile it, you need to add the fry vocabulary to enable fried quotations. What happens here is that the fried quotation binds the mail instance and creates a new quotation that inserts the mail tuple and then returns the database-generated id of the inserted row. Right now we have no use for the database id, but it will become useful to us later on. That whole quotation is then shipped to with-mydb, which can be seen as a context manager in which code that operates on a database can be run.
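If fried quotations are new to you, the small sketch below shows the same mechanism without a database. The word make-adder is hypothetical and exists only for illustration: the _ hole in the fried quotation is filled with whatever is on top of the stack at the moment the quotation is built, just as the mail instance is in insert-mail.

```factor
USING: fry kernel math prettyprint ;
IN: scratchpad

! Hypothetical word, for illustration only: takes a number n and
! returns a quotation that adds n to whatever it is called with.
: make-adder ( n -- quot )
    '[ _ + ] ;

! Build the quotation with 5 bound to the hole, then call it on 10.
10 5 make-adder call .

! The same thing written without fry, using curry from kernel.
10 5 [ + ] curry call .
```

Both lines should print 15. Fry is mostly a convenience: with a single hole it does what curry does, but it stays readable when there are several holes or when the hole sits in the middle of the quotation.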

The code to insert mails in the listener can now be written as:

IN: scratchpad random-mail insert-mail

--- Data stack:
1

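Finally, we add a simple utility function to check if a given mail exists in the database. It uses the slot setters >>mid and >>group, so add the accessors vocabulary to the USING: statement as well: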

: mail-exists? ( mid group -- mail/f )
    '[ T{ mail } _ >>mid _ >>group select-tuple ] with-mydb ;

Continue on to Part 2 of the tutorial, where we will build a scraper.