Parsing Gmane with Factor Part 1

Marco Rimoldi edited this page Dec 14, 2015 · 49 revisions

Part 1 of the Parsing Gmane with Factor tutorial.

Much of the information for this section comes from the Factor/DB page on the Concatenative Wiki, which you should read first.

In this part, we'll write a vocabulary (that's Factor lingo for module) to interact with a simple SQLite database. We'll also define a mapping between the database table and Factor's tuple objects.

Here I'd like to mention that, while this tutorial presents the construction of this program as a discrete series of steps that neatly takes you from nothing to a complete program, that is not at all how it was developed. :) I tried various approaches, changed the schema several times, went back and forth, and ran into many dead ends while figuring out how best to structure the code. All of that is omitted from this text, which would otherwise have been far too long. But it is worth keeping in mind when you develop your own code that it doesn't have to be perfect the first time.

I also recommend that you actually type the code. Computers these days all come with handy copy-paste functionality which you could use to write the code -- or just clone this repository and be done with it, then you don't have to write anything! But it is my firm belief that the act of typing code is extremely important to learning programming languages. It has something to do with committing the idioms of the language to the muscle memory of the hands, and being able to appreciate succinctly expressed algorithms because they require less effort to type.

Enough with the chatting, let's get started!

Making sure SQLite works

Create a new vocabulary with the path gmane/db/db.factor with the following contents:

USING:
    db db.sqlite
    kernel ;
IN: gmane.db

: with-mydb ( quot -- )
    "gmane.db" <sqlite-db> swap with-db ; inline

If you are using Emacs and FUEL, you can compile the vocabulary by first typing C-c C-z to start Factor and then C-c C-k to compile the buffer. You should get a message in the minibuffer that indicates whether the compilation succeeded or not.

After you've ensured that it compiles, switch to the listener by pressing C-c C-z and type:

IN: scratchpad USE: gmane.db
IN: scratchpad [ ] with-mydb
IN: scratchpad
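If the empty quotation runs without an error, Factor was able to open the database. A slightly stronger check, sketched here on the assumption that the db vocabulary in your Factor version provides the sql-query word, is to ask SQLite for its own version from inside with-mydb:

```factor
USING: db gmane.db prettyprint ;

! Run a raw SQL statement on the connection that with-mydb opens
! and print the rows that come back.
[ "select sqlite_version()" sql-query ] with-mydb .
```

This should print a nested array holding a single version string. The exact value depends on the SQLite library installed on your machine.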

Missing sqlite3_open

Factor may throw an error at this point complaining about a missing sqlite3_open. On Windows, fix the problem by downloading the SQLite dll file for your architecture from here or here and putting it in the same directory as factor.exe. On Linux, you likely need to install SQLite's development package, like this:

$ sudo apt-get install libsqlite3-dev

Modify the command to something appropriate for your distro if it isn't an apt-based one.

Defining the Data

Next, we want to define the columns for the table in the SQLite database that will store the scraped mails from Gmane. Edit gmane/db/db.factor so that it has the following contents:

USING:
    db db.sqlite db.tuples db.types
    kernel ;
IN: gmane.db

TUPLE: mail id mid group date sender subject body ;

mail "mail" {
    { "id" "id" +db-assigned-id+ }
    { "mid" "mid" INTEGER }
    ! The column is called group_, presumably because GROUP is an SQL keyword.
    { "group" "group_" TEXT }
    { "date" "date" DATETIME }
    { "sender" "sender" TEXT }
    { "subject" "subject" TEXT }
    { "body" "body" TEXT }
} define-persistent

: with-mydb ( quot -- )
    "gmane.db" <sqlite-db> swap with-db ; inline

A tuple is like a record or struct in other programming languages. I think most of the columns are self-explanatory, except for mid, which is the message id of the mail set by Gmane, and group, which is the name of the newsgroup the email was sent to. This message, for example, has group "comp.lang.factor.general" and mid 3020. According to Gmane the newsgroup is "gmane.comp.lang.factor.general", but since all newsgroups Gmane knows about start with the "gmane." prefix, it is redundant and can be omitted.

The define-persistent call is what associates the tuple with a table in the SQLite database. It is a form of the active record pattern and serves the same purpose as, say, SQLAlchemy's declarative extension does in Python.
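To make the tuple side of that mapping concrete, here is a minimal sketch that builds a mail by hand and prints it without touching the database. The mid and group come from the example message mentioned above; the sender, subject and body are made-up sample values.

```factor
USING: accessors calendar gmane.db kernel prettyprint ;

! Build a mail tuple slot by slot. The id slot stays f, since the
! database assigns it when the row is inserted.
mail new
    3020 >>mid
    "comp.lang.factor.general" >>group
    now >>date
    "someone@example.com" >>sender    ! made-up sample value
    "Hello" >>subject                 ! made-up sample value
    "Just testing." >>body            ! made-up sample value
.
```

Each slot name in the first position of a define-persistent row corresponds to the column named in the second position, so the group slot of this tuple would end up in the group_ column.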

Generating Garbage Mails

Now that we have the database table in place, let's write some functions to create random data to store in the database -- just so that we have something to play with before continuing on with building the scraper. Add all three functions below to the bottom of db.factor:

: ascii-lower ( -- seq )
    CHAR: a CHAR: z [a,b] ;

: random-ascii ( n -- str )
    [ ascii-lower random ] "" replicate-as ;

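: random-mail ( -- mail )
    f                                        ! id: left unset, the database assigns it
    5000 random                              ! mid
    10 random-ascii                          ! group
    now 1000 random 500 - days time+         ! date: now shifted by -500 to +499 days
    { 15 30 80 } [ random-ascii ] map first3 ! sender, subject, body
    mail boa ;                               ! fill the slots in declaration order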

Try compiling with C-c C-k. Factor will alert you about a few missing vocabularies, so add math, math.ranges, random, sequences, strings and calendar to the vocabulary's USING: statement. Factor should then happily compile it.
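For reference, with those vocabularies added the top of db.factor could look like the sketch below. The ordering and line breaks are only a matter of taste. Newer Factor releases appear to have renamed math.ranges to ranges, so adjust that entry if the compiler cannot find it.

```factor
USING:
    calendar
    db db.sqlite db.tuples db.types
    kernel
    math math.ranges
    random
    sequences strings ;
IN: gmane.db
```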

Let's try this code out in the listener and insert some mails:

IN: scratchpad USE: gmane.db
IN: scratchpad USE: db.tuples
IN: scratchpad [ mail create-table ] with-mydb
! Each time you run the line below, a new mail gets inserted
IN: scratchpad [ random-mail insert-tuple ] with-mydb
! List contents of the table
IN: scratchpad [ T{ mail } select-tuples ] with-mydb .

If you forgot any of the required USE statements, Factor will tell you which word it cannot find and helpfully suggest vocabularies to import. In fact, Factor's missing vocabulary hints are so good that I won't bother with explicitly listing which imports you need. It's really easy to figure out by compiling the code.
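The empty T{ mail } exemplar used above matches every row because all of its slots are f. As a sketch of how the same mechanism narrows a query, setting a slot on the exemplar restricts the result to rows with that value. The mid 3020 here is only an illustration and will most likely not match any of your random rows.

```factor
USING: db.tuples gmane.db prettyprint sequences ;

! Slots left as f are ignored; slots holding a value become query conditions.
! This prints how many stored mails have mid 3020.
[ T{ mail { mid 3020 } } select-tuples ] with-mydb length .
```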

As a last step, let's add a function to insert mails so that we don't have to call insert-tuple and with-mydb directly:

: insert-mail ( mail -- id )
    '[ _ insert-tuple last-insert-id ] with-mydb ;

To compile it, you need to add the fry vocabulary, which enables fried quotations.
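If fried quotations are new to you, this small listener sketch, which is independent of the database code, shows what they expand to. Each _ is replaced by a value taken from the stack when the quotation is built.

```factor
USING: fry kernel math prettyprint ;

! The leftmost _ takes the deepest of the consumed stack values.
42 '[ _ 1 + ] .          ! prints [ 42 1 + ]
42 '[ _ 1 + ] call .     ! prints 43
1 2 '[ _ _ ] .           ! prints [ 1 2 ]
```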

What happens in the word is that the fried quotation binds the mail instance on the stack and emits a new quotation that inserts the mail tuple and then returns the database-generated id of the inserted row. Right now, we have no use for the database id, but it will become useful to us later on. That whole quotation is then shipped to with-mydb, which can be seen as a context manager in which code that operates on a database is run.

Using this word, we can reformulate the expression to insert mails to:

IN: scratchpad random-mail insert-mail

--- Data stack:
27

Finally, we add a simple utility function to check if a given mail exists in the database:

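: mail-exists? ( mid group -- mail/f )
    ! mail new makes a fresh exemplar on every call; setting slots on a
    ! T{ mail } literal would mutate one shared object instead.
    '[ mail new _ >>mid _ >>group select-tuple ] with-mydb ;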

This word requires the accessors vocabulary. Continue on to Part 2 of the tutorial, where we will build a vocabulary to parse HTML formatted mails.