KrdWrd

The KrdWrd Project ran from 2008 to 2011. Its mission statement was:

Provide tools and infrastructure for acquisition, visual annotation, merging and storage of web pages as parts of bigger corpora.

Develop a classification engine that learns to automatically annotate pages, and provide visual tools for inspection of results.

In short, it was an infrastructure for research into web page cleaning. The paper gives a good overview and the master's thesis an extensive description (see further down for both).

Remnants

The annotation guidelines and the Firefox add-on manual are still available online and as a PDF file.
The CANOLA Corpus

System Components

The system consisted of:

Firefox Add-on for interactive visual annotation and retrieval of tagging results
XULRunner application for batch processing of web pages
Web Proxy and additional server-side infrastructure for providing access to corpora and storing annotation results
Server-side Machine Learning infrastructure for experiments with cleaning models

This repository is part of the server-side infrastructure for harvesting web corpora.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KrdWrd

Remnants

System Components

About

Releases

Packages

Languages

krdwrd/src_harvest

Folders and files

Latest commit

History

Repository files navigation

KrdWrd

Remnants

System Components

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages