This repository has been archived by the owner on Aug 14, 2019. It is now read-only.

The KrdWrd harvest environment for a new corpus

krdwrd/src_harvest

KrdWrd

The KrdWrd Project ran from 2008 to 2011. Its mission statement was:

Provide tools and infrastructure for acquisition, visual annotation, merging and storage of web pages as parts of bigger corpora.

Develop a classification engine that learns to automatically annotate pages, and provide visual tools for inspection of results.

In essence, it was an infrastructure for research into web page cleaning. The paper gives a good overview, and the master's thesis an extensive description (see further down for both).

Remnants

  1. The annotation guidelines and the Firefox add-on manual are still available online and as a PDF file.

  2. The CANOLA Corpus

System Components

The system consisted of:

  • Firefox Add-on for interactive visual annotation and retrieval of tagging results
  • XULRunner application for batch processing of web pages
  • Web Proxy and additional server-side infrastructure for providing access to corpora and storing annotation results
  • Server-side Machine Learning infrastructure for experiments with cleaning models

This repository is part of the server-side infrastructure for harvesting web corpora.
