(Late commit) Add info about current version of .plan-file for CANOLA corpus analyses
iiegn committed Aug 14, 2019
1 parent ca8463f commit 847f700
Showing 3 changed files with 24 additions and 2 deletions.
1 change: 1 addition & 0 deletions LICENSE.txt
@@ -0,0 +1 @@
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
21 changes: 21 additions & 0 deletions README.md
@@ -0,0 +1,21 @@
# KrdWrd

The KrdWrd Project ran from 2008 to 2011. Its mission statement was:
> Provide tools and infrastructure for acquisition, visual annotation, merging and storage of web pages as parts of bigger corpora.
>
> Develop a classification engine that learns to automatically annotate pages, and provide visual tools for inspection of results.

In short, it was an infrastructure for research into web page cleaning. A good
overview can be found in the paper and an extensive description in the master's thesis (for both, see [further down](#cite-work)).

# KrdWrd CANOLA Corpus

The CANOLA Corpus is a visually annotated English web corpus for training
classification engines to remove boilerplate from unseen web pages. It was
harvested, annotated and evaluated with the tools and infrastructure of the
KrdWrd Project.

The corpus consists of 216 files (web pages), 208 of which constitute the main
corpus.

See https://github.com/krdwrd/doc_CANOLA/releases/download/v1.1/canola.pdf for more information.
4 changes: 2 additions & 2 deletions canola/README.txt
@@ -19,8 +19,8 @@ $ ./agreement.py stats/ plan.canola UIDs
-weighted: the respective calculations but weighted by the number of tokens

stats/, stats.20100908/
-output of an app's merge run with -stats; for all considered pageIDs there are
-two files:
+output of an app's merge run with -stats and plan.appErrsRmvd; for all
+considered pageIDs there are two files:
pageID and pageID.stats
- pageID: this is the HTML page after merging the different votes;
-- the agreement on each tag is ( users-voted-for-the-tag / votes-on-the-tag )
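
The per-tag agreement ratio and the token-weighted variant mentioned in this hunk can be illustrated with a minimal sketch. It is not the code of `agreement.py`: the function names, the in-memory representation of votes, and the reading of "users-voted-for-the-tag" as the number of votes for the winning (merged) tag are all assumptions made for illustration.

```python
from collections import Counter


def tag_agreement(votes):
    """Return (winning_tag, agreement) for one annotated node.

    `votes` is a hypothetical list with one tag per user vote.
    Agreement follows the README's ratio:
    users-voted-for-the-tag / votes-on-the-tag.
    """
    if not votes:
        raise ValueError("need at least one vote")
    # Assumption: the merged tag is the one with the most votes.
    tag, count = Counter(votes).most_common(1)[0]
    return tag, count / len(votes)


def weighted_mean_agreement(nodes):
    """Mean agreement over nodes, weighted by each node's token count.

    `nodes` is a hypothetical iterable of (votes, n_tokens) pairs.
    """
    nodes = list(nodes)
    total_tokens = sum(n_tokens for _, n_tokens in nodes)
    if total_tokens == 0:
        raise ValueError("need at least one token")
    weighted_sum = sum(tag_agreement(votes)[1] * n_tokens
                       for votes, n_tokens in nodes)
    return weighted_sum / total_tokens
```

For example, a node with three votes for one tag and one vote for another has an agreement of 3/4; in the weighted mean, a node with many tokens contributes proportionally more than a short one.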