KrdWrd

The KrdWrd Project ran from 2008 to 2011. Its mission statement was:

Provide tools and infrastructure for acquisition, visual annotation, merging and storage of web pages as parts of bigger corpora.

Develop a classification engine that learns to automatically annotate pages, and provide visual tools for inspection of results.

In short, it was an infrastructure for research into web page cleaning. The paper gives a good overview, and the master's thesis gives an extensive description (see further down for both).

KrdWrd CANOLA Corpus

The CANOLA Corpus is a visually annotated English web corpus for training classification engines to remove boilerplate from unseen web pages. It was harvested, annotated and evaluated with the tools and infrastructure of the KrdWrd Project.

The corpus consists of 216 files (web pages), 208 of which constitute the main corpus.

See https://github.com/krdwrd/doc_CANOLA/releases/download/v1.1/canola.pdf for more information.
