Parsing Gmane with Factor Part 1
Part 1 of the Parsing Gmane with Factor tutorial.
Much of the information in this section comes from the Factor/DB page on the Concatenative Wiki, which you should probably read first.
In this part, we'll write a vocabulary (that's Factor lingo for module) to interact with a simple SQLite database. We'll also define a mapping between the database table and Factor's tuple objects.
Here I'd like to mention that while this tutorial presents the construction of this program as a discrete series of steps that neatly takes you from nothing to a complete program, that is not at all how it was developed. :) I tried various approaches, changed the schema several times, went back and forth and ran into many dead ends while figuring out how best to write the code. All that is omitted from this text because otherwise it would be far too long. But it is worth keeping in mind when you develop your own code that it doesn't have to be perfect the first time.
Create a new vocabulary at the path gmane/db/db.factor with the following contents:
USING:
    db db.sqlite
    kernel ;
IN: gmane.db

: with-mydb ( quot -- )
    "gmane.db" <sqlite-db> swap with-db ; inline
If you are using Emacs and FUEL, you can compile the vocabulary by first typing C-c C-z to run Factor and then C-c C-k to compile. You should get a message in the minibuffer that indicates whether the compilation succeeded or not.
After you've ensured that it compiles, switch to the listener with C-c C-z and type:
IN: scratchpad USE: gmane.db
IN: scratchpad [ ] with-mydb
IN: scratchpad
It is possible that Factor throws an error at this point, complaining that sqlite3_open is missing. On Windows, fix the problem by downloading the SQLite DLL file for your architecture from here or here and putting it in the same directory as factor.exe. On Linux, you probably need to install SQLite's development package, like this:
$ sudo apt-get install libsqlite3-dev
Modify the command to something appropriate for your distro if it isn't an apt-based one.
Next, you want to define the columns for the table in the SQLite database that will store the scraped mails from Gmane. Edit gmane/db/db.factor so that it looks like:
USING:
    db db.sqlite db.tuples db.types
    kernel ;
IN: gmane.db

! One mail tuple corresponds to one row in the "mail" table.
TUPLE: mail id mid group date sender subject body ;

! Each entry is { slot-name column-name column-type }.
mail "mail" {
    { "id" "id" +db-assigned-id+ }
    { "mid" "mid" INTEGER }
    { "group" "group_" TEXT }
    { "date" "date" DATETIME }
    { "sender" "sender" TEXT }
    { "subject" "subject" TEXT }
    { "body" "body" TEXT }
} define-persistent

! Run quot with gmane.db opened as the current database.
: with-mydb ( quot -- )
    "gmane.db" <sqlite-db> swap with-db ; inline
A tuple is like a record or struct in other programming languages. The code that ends with define-persistent maps the slots in the tuple to columns in the database table. I think most of the columns are self-explanatory, except for mid, which is the message id of the mail set by Gmane, and group, which is the name of the newsgroup the mail was sent to. This message, for example, has group "comp.lang.factor.general" and mid 3020. According to Gmane the newsgroup is "gmane.comp.lang.factor.general", but since every newsgroup Gmane knows about starts with the "gmane." prefix, the prefix is redundant and can be omitted.
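To make the slot names concrete, here is a small listener sketch that builds a mail tuple by hand from the example values above. It assumes the gmane.db vocabulary compiles as shown and touches no database; the setter and getter words come from Factor's accessors vocabulary.

```factor
USING: accessors gmane.db kernel prettyprint ;

! Build a mail tuple by hand; slots that are not set stay f.
mail new
    3020 >>mid
    "comp.lang.factor.general" >>group
dup mid>> .     ! the Gmane message id
group>> .       ! the newsgroup name without the "gmane." prefix
```

The id slot is left as f here because it is declared +db-assigned-id+, so the database fills it in when the tuple is inserted.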
Next, we'll add some functions that create random data to store in the database, just so that we have some data to play with before we continue on to building the scraper. Add the three functions below to the bottom of db.factor:
: ascii-lower ( -- seq )
    CHAR: a CHAR: z [a,b] ;

: random-ascii ( n -- str )
    [ ascii-lower random ] replicate >string ;

: random-mail ( -- mail )
    f
    5000 random
    10 random-ascii
    now 1000 random 500 - days time+
    { 15 30 80 } [ random-ascii ] map first3
    mail boa ;
Try compiling with C-c C-k, and Factor will alert you about lots of missing vocabularies. So add math, math.ranges, random, sequences, strings and calendar to the vocabulary's USING: statement. Then Factor should happily compile it.
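For reference, with those six vocabularies added, the header of db.factor might look like the sketch below. The alphabetical ordering is only a convention, and the names are taken as given in this tutorial. If your Factor release reports that one of them cannot be found, it may live under a different vocabulary name in that version.

```factor
USING:
    calendar db db.sqlite db.tuples db.types
    kernel math math.ranges random sequences strings ;
IN: gmane.db
```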
Now you can insert some garbage data and then read it back in the listener:
IN: scratchpad [ mail create-table ] with-mydb
... each time you run the line below, a new mail gets inserted
IN: scratchpad [ random-mail insert-tuple ] with-mydb
IN: scratchpad [ T{ mail } select-tuples ] with-mydb .
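Two variations on the lines above, as a hedged sketch. First, create-table will presumably complain if the table already exists, and db.tuples also provides ensure-table, which only creates the table when it is missing. Second, the empty T{ mail } exemplar matches every row, while an exemplar with some slots filled in selects only the rows that match those slots. The mid value 3020 is just an illustrative number, and with random data the result may well be empty.

```factor
USING: db.tuples gmane.db prettyprint sequences ;

! Safe to run repeatedly: the table is only created if it is missing.
[ mail ensure-table ] with-mydb

! Slots left as f in the exemplar are ignored, so this selects
! only the mails whose mid column equals 3020.
[ T{ mail { mid 3020 } } select-tuples ] with-mydb length .
```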
Factor always tells you when it can't find a word and helpfully suggests which vocabulary to import. You will probably need to import a whole bunch of them to get the above lines running. As a last step, let's add a function to insert mails so that we don't have to call insert-tuple and with-mydb directly:
: insert-mail ( mail -- )
    '[ _ insert-tuple ] with-mydb ;
To compile it, you need to add the fry vocabulary to the USING: statement to enable fried quotations. The code to insert mails in the listener can now be written as:
IN: scratchpad random-mail insert-mail
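The same fried-quotation pattern works for reading. The word below, mails-in-group, is a hypothetical helper that is not part of the tutorial's code: a minimal sketch that fills in the group slot of an exemplar tuple and hands it to select-tuples. It assumes it is appended to the bottom of db.factor, and the extra USING: line pulls in accessors for the >>group setter.

```factor
USING: accessors db.tuples fry kernel ;

! Hypothetical helper: fetch every stored mail for one newsgroup.
: mails-in-group ( group -- mails )
    mail new swap >>group
    '[ _ select-tuples ] with-mydb ;
```

In the listener, "comp.lang.factor.general" mails-in-group length . would then print how many stored mails belong to that group.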
Now you can continue on to Part 2 of the tutorial where we will build a scraper.