Home

Table of Contents

 * KrdWrd
 * Mission Statement
 * For People who want to tag Pages
 * For People who want to clean or download Pages, or take Screenshots
 * For [http://wacky.sslmit.unibo.it WaCky] People
 * For Developers
 * System Components
 * Server Setup
 * Subdirectories
 * KrdWrd Annotation Format
 * DOM
 * Propagation
 * Links to other Sites
 * Some Overviews
 * Proxy Things (improvements over wwwoffle)

KrdWrd

This is the KrdWrd Project's original TracWiki content that used to be available at https://krdwrd.org/trac.

Mission Statement

Provide tools and infrastructure for acquisition, visual annotation, merging and storage of web pages as parts of bigger corpora.

Develop a classification engine that learns to annotate pages automatically, and provide visual tools for inspecting the results.

For People who want to tag Pages

Download the Firefox Add-on for tagging data.

Find help on usage of the Add-on and annotation guidelines in the Add-on manual. The privacy policy explains to which extent private data is processed and stored.

Try the search function of this site to find help when you run into problems - or write an email to the mailing list at [email protected].

For People who want to clean or download Pages, or take Screenshots

Find out about the [App]: how to get it, get it running, test it, and use it.

For WaCky People

Have a look at the corpus page to learn how to build a corpus, tag a corpus, use it for training, etc.

For Developers

System Components

The different components of the KrdWrd system are:

 * Firefox [[AddOn|Add-on]] for interactive visual annotation and retrieval of
   auto-tagging results
 * [[XulRunner]] application for batch processing of web pages
 * [[WebProxy]] and [[CgiBin]] infrastructure for providing access to corpora and
   storing annotation results
 * ClPipeline and JamfPipeline feed the SvmClassifier

Server Setup

Description of the infrastructure for deployment and the central data pool.

Subdirectories

(path relative to site root)

 * trac/ Project management
 * addon/ Firefox [[AddOn|Add-on]] deployment
 * tutorial/ The tutorial in html and pdf
 * insecure/ Warning page concerning CAcert install - this is the root on
   http://krdwrd.org
 * pages/bin CGI for page download and upload
 * pages/dat Data 
 * pages/dat/input clean pages, ready for annotation
 * pages/dat/tagged previously tagged pages

AddonDevelopment explains how to get an SVN checkout and use it directly as a Firefox Add-on, without installing it.

KrdWrd Annotation Format

AnnotationFormat specifies how KrdWrd stores annotations in the DOM.

DOM

KrdWrd stores annotations by injecting krdwrd-tag-X CSS classes into document blocks, appending them to the block's className (just class in HTML), where X is one of:

 * 1 - spam (aka. boilerplate)
 * 2 - neutral (captions, etc.)
 * 3 - ham (running text)
 * none - cleared (obsoleted by r651 but can still be present in older
   documents)

The original classNames of a document are preserved. However, krdwrd classes set their background color with the !important flag, so they override any other specification of this property.
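
The class handling described above can be sketched in a few lines of Python. This is only an illustration, not the Add-on's code: the function names are hypothetical, and replacing an earlier krdwrd-tag-X class (rather than stacking a second one) is an assumption.

```python
import re

# Tag numbers and meanings as listed above.
KRDWRD_TAGS = {1: "spam", 2: "neutral", 3: "ham"}
_TAG_RE = re.compile(r"^krdwrd-tag-\S+$")


def set_krdwrd_tag(class_name: str, tag: int) -> str:
    """Return class_name with exactly one krdwrd-tag-X class appended."""
    if tag not in KRDWRD_TAGS:
        raise ValueError(f"unknown tag: {tag!r}")
    # Keep the original classes; drop any earlier krdwrd tag (assumption).
    kept = [c for c in class_name.split() if not _TAG_RE.match(c)]
    kept.append(f"krdwrd-tag-{tag}")
    return " ".join(kept)


def get_krdwrd_tag(class_name: str):
    """Return the X of a krdwrd-tag-X class as a string, or None if absent."""
    for c in class_name.split():
        if _TAG_RE.match(c):
            return c[len("krdwrd-tag-"):]
    return None
```

For example, set_krdwrd_tag("post body", 3) keeps the two original classes and appends krdwrd-tag-3, while get_krdwrd_tag also reports the legacy value "none" found in older documents.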

Propagation

Upon a propagation run, those tags are moved downwards in the DOM to the blocks immediately preceding the actual text blocks. Only the closest original tag is used here, allowing overriding of tags set in more outer blocks.
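
A minimal sketch of such a propagation pass, assuming a simplified tree instead of a real DOM. The Node class and its fields are hypothetical stand-ins, and treating "moved" as "removed from the outer blocks" is an interpretation of the description above, not a statement about the actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element."""
    name: str
    tag: Optional[str] = None   # the X of krdwrd-tag-X, if set
    has_text: bool = False      # True if the element directly holds text
    children: List["Node"] = field(default_factory=list)


def propagate(node: Node, inherited: Optional[str] = None) -> None:
    """Push tags down to the elements that directly contain text."""
    # The closest tag wins: a node's own tag overrides the inherited one.
    current = node.tag if node.tag is not None else inherited
    # Only elements that directly hold text keep a tag afterwards.
    node.tag = current if node.has_text else None
    for child in node.children:
        propagate(child, current)
```

With a body tagged 1 that contains a div tagged 3, text inside the div ends up tagged 3, while text directly under the body keeps tag 1; this is the overriding of outer tags mentioned above.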

Propagation is available in the Add-on via the Utils/Propagate and /Sidebar functionality as well as in the app, by merging a document with itself:

./krdwrd -out /tmp/propagated.html -merge /tmp/input.html /tmp/input.html
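
When many documents need propagating, the self-merge call above can be assembled programmatically. The helper below is a hypothetical convenience wrapper; it only builds the argument list from the flags shown in the command above and assumes the app is reachable as ./krdwrd.

```python
from typing import List


def propagate_command(input_html: str, output_html: str,
                      app: str = "./krdwrd") -> List[str]:
    """Build the argument list for merging input_html with itself."""
    # Merging a document with itself triggers propagation (see above).
    return [app, "-out", output_html, "-merge", input_html, input_html]
```

The resulting list can be passed to subprocess.run(..., check=True) to run the app for one document.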

Links to other Sites

a(nother) WAC community site to consolidate efforts and results - among others...

http://webascorpus.sourceforge.net

a somewhat similarly spirited approach with respect to page analysis...

http://www.cs.uiuc.edu/homes/dengcai2/VIPS/VIPS.html

visual development of Web data extraction programs

http://www.lixto.com/lixto_visual_developer/

Some Overviews

Proxy Things (improvements over wwwoffle)

the downloaded data on the file system stresses the file utilities when dealing with a huge number of sites, i.e. it is not the file system itself that is stressed but the tools operating on it (find, ls, etc.).

why not use a well-developed tool to build up a local cache? (well, because, for the time being, wwwoffle does the job and was easier to set up. but using archive.org's stuff should be considered...)

http://webteam.archive.org/confluence/display/Heritrix/Home

in incubation stage - but ambitious... data back-end could be file, partition, etc... however, off-line use seems unclear.

http://incubator.apache.org/trafficserver/
