This repository has been archived by the owner on Apr 12, 2024. It is now read-only.
Documentation for the KrdWrd CANOLA corpus

KrdWrd

The KrdWrd Project ran from 2008 to 2011. Its mission statement was:

Provide tools and infrastructure for acquisition, visual annotation, merging and storage of web pages as parts of bigger corpora.

Develop a classification engine that learns to automatically annotate pages, and provide visual tools for inspection of results.

In short, it was an infrastructure for research into web page cleaning. The paper gives a good overview, and the master's thesis gives an extensive description (for both, see further down).

KrdWrd CANOLA Corpus Documentation

See https://github.com/krdwrd/doc_CANOLA/releases/download/v1.1/canola.pdf