Home
This is the KrdWrd Project's original TracWiki content that used to be available at https://krdwrd.org/trac.
Provide tools and infrastructure for acquisition, visual annotation, merging and storage of web pages as parts of bigger corpora.
Develop a classification engine that learns to automatically annotate pages, provide visual tools for inspection of results.
Download the Firefox Add-on for tagging data.
Find help on using the Add-on, along with the annotation guidelines, in the Add-on manual. The privacy policy explains to what extent private data is processed and stored.
Try the search function of this site to find help when you run into problems - or write an email to the mailing list at [email protected].
Find out about the [App]: how to get it, get it running, test it, and use it.
For WaCky People
Have a look at the corpus page to learn how to build a corpus, tag a corpus, use it for training, etc.
The different components of the KrdWrd system are:
* Firefox [[AddOn|Add-on]] for interactive visual annotation and retrieval of auto-tagging results
* [[XulRunner]] application for batch processing of web pages
* [[WebProxy]] and [[CgiBin]] infrastructure for providing access to corpora and storing annotation results
* ClPipeline and JamfPipeline feed the SvmClassifier
Description of the infrastructure for deployment and the central data pool.
(path relative to site root)
* trac/ Project management
* addon/ Firefox [[AddOn|Add-on]] deployment
* tutorial/ The tutorial in html and pdf
* insecure/ Warning page concerning CAcert install - this is the root on http://krdwrd.org
* pages/bin CGI for page download and upload
* pages/dat Data
* pages/dat/input clean, ready for annotation
* pages/dat/tagged previously tagged pages
AddonDevelopment explains how to get an SVN checkout and use it directly as a Firefox Add-on, without installing it.
AnnotationFormat specifies how KrdWrd stores annotations in the DOM.
KrdWrd stores annotations by injecting krdwrd-tag-X css classes into document blocks by appending to their className (just class in html), where X is one of:
* 1 - spam (aka. boilerplate)
* 2 - neutral (captions, etc.)
* 3 - ham (running text)
* none - cleared (obsoleted by r651 but can still be present in older documents)
The original classNames of a document are preserved. However, krdwrd classes set their background color with the !important flag, so we override any other specification on this attribute.
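The class-appending scheme described above can be sketched on a plain class attribute string. This is a minimal illustration, not the Add-on's actual code: the helper name `set_krdwrd_tag` and the label-to-number mapping object are hypothetical, and replacing an earlier krdwrd tag (rather than stacking a second one) is an assumption.

```python
# Tag numbers as listed above; the dict and function names are hypothetical.
KRDWRD_TAGS = {"spam": 1, "neutral": 2, "ham": 3}
TAG_PREFIX = "krdwrd-tag-"


def set_krdwrd_tag(class_attr: str, label: str) -> str:
    """Return the class attribute with one krdwrd-tag-X class appended.

    All original class names are preserved. Assumption: an earlier
    krdwrd-tag-* class (including krdwrd-tag-none) is dropped first.
    """
    kept = [c for c in class_attr.split() if not c.startswith(TAG_PREFIX)]
    kept.append(f"{TAG_PREFIX}{KRDWRD_TAGS[label]}")
    return " ".join(kept)


print(set_krdwrd_tag("nav menu", "spam"))
print(set_krdwrd_tag("content krdwrd-tag-1", "ham"))
```

Because the tag is appended rather than substituted for the whole attribute, a page's own styling hooks remain intact and only the background color is overridden by the krdwrd stylesheet.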
Upon a propagation run, those tags are moved downwards in the DOM to the blocks immediately preceding the actual text blocks. Only the closest original tag is used here, allowing overriding of tags set in more outer blocks.
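The propagation rule can be illustrated on a toy tree. This is a simplified sketch, not the KrdWrd implementation: the `Node` class and `propagate` function are hypothetical, and a "block immediately preceding the text" is modeled here as any node that directly holds text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element."""
    name: str
    tag: Optional[str] = None   # e.g. "krdwrd-tag-1"; None if untagged
    text: str = ""              # non-empty if the node directly holds text
    children: List["Node"] = field(default_factory=list)


def propagate(node: Node, inherited: Optional[str] = None) -> None:
    """Move tags down to the nodes that directly hold text.

    The closest tagged ancestor-or-self wins, so an inner tag
    overrides a tag set on a more outer block.
    """
    effective = node.tag or inherited
    # Only text-holding nodes keep a tag after propagation.
    node.tag = effective if node.text.strip() else None
    for child in node.children:
        propagate(child, effective)


p = Node("p", text="Running text")
span = Node("span", text="Menu entry")
div = Node("div", tag="krdwrd-tag-3", children=[p])
body = Node("body", tag="krdwrd-tag-1", children=[div, span])

propagate(body)
print(p.tag, span.tag, div.tag, body.tag)
```

In this example the paragraph inherits the tag of its closest tagged ancestor (the div), while the span falls back to the tag set on the body.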
Propagation is available in the Add-on via the Utils/Propagate and /Sidebar functionality as well as in the app, by merging a document with itself:
./krdwrd -out /tmp/propagated.html -merge /tmp/input.html /tmp/input.html
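For batch use, the self-merge invocation above can be assembled from a script. The following is a small sketch under assumptions: the function name `propagate_command` is hypothetical, the `-out` and `-merge` flags are taken from the command line shown above, and the app is assumed to live at ./krdwrd.

```python
import shlex
from typing import List


def propagate_command(src: str, dst: str, app: str = "./krdwrd") -> List[str]:
    """Build the argument list that merges a document with itself."""
    # The same input file is passed twice, as in the command above.
    return [app, "-out", dst, "-merge", src, src]


cmd = propagate_command("/tmp/input.html", "/tmp/propagated.html")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to actually run the app, one call per input page.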
a(nother) WAC community site to consolidate efforts and results - among others...
a somewhat similarly spirited approach with respect to page analysis...
visual development of Web data extraction programs
- http://reasoningweb.org/2005/teaching-material/baumgartner-robert_information-extraction-for-the-semantic-web.pdf
- http://www.cs.uic.edu/~liub/Web-Content-Mining-2.pdf
the downloaded data on the file system stresses the file-utils when dealing with a huge number of sites, i.e. it is not the file system itself that is stressed but the tools operating on it (find, ls, etc.).
why not use a well-developed tool to build up a local cache? (well, because - for the time being - wwwoffle does the job and was easier to set up. but using archive.org's stuff should be considered...)
in incubation stage - but ambitious... data back-end could be file, partition, etc... however, off-line use seems unclear.