diff --git a/LICENSE.txt b/LICENSE.txt
new file mode 100644
index 0000000..4cc9cb5
--- /dev/null
+++ b/LICENSE.txt
@@ -0,0 +1 @@
+This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..0a37faf
--- /dev/null
+++ b/README.md
@@ -0,0 +1,21 @@
+# KrdWrd
+
+The KrdWrd Project ran from 2008 to 2011. The mission statement was
+> Provide tools and infrastructure for acquisition, visual annotation, merging and storage of web pages as parts of bigger corpora.
+>
+> Develop a classification engine that learns to automatically annotate pages, and provide visual tools for inspection of results.
+
+Basically, it was an infrastructure for research into web page cleaning. A good
+overview can be found in the paper and an extensive description in the master's thesis (both, see [further down](#cite-work)).
+
+# KrdWrd CANOLA Corpus
+
+The CANOLA Corpus is a visually annotated English web corpus for training
+classification engines to remove boiler plate on unseen web pages. It was
+harvested, annotated and evaluated by the tools and infrastructure of the
+KrdWrd Project.
+
+The corpus consists of 216 files (Web pages) - 208 of which constitute the main
+corpus.
+
+See https://github.com/krdwrd/doc_CANOLA/releases/download/v1.1/canola.pdf for more information.
diff --git a/canola/README.txt b/canola/README.txt
index 9257444..ee249e8 100755
--- a/canola/README.txt
+++ b/canola/README.txt
@@ -19,8 +19,8 @@ $ ./agreement.py stats/ plan.canola UIDs
 -weighted: the respective calculations but weighted by the number of tokens
 
 stats/, stats.20100908/
-output of an app's merge run with -stats; for all considered pageIDs there are
-two files:
+output of an app's merge run with -stats and plan.appErrsRmvd; for all
+considered pageIDs there are two files:
 pageID and pageID.stats
 - pageID: this is the HTML page after merging the different votes;
 -- the agreement on each tag is ( users-voted-for-the-tag / votes-on-the-tag )
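
Review note: the per-tag agreement ratio and the `-weighted` variant described in the hunk above can be illustrated with a small sketch. This is not the code in `agreement.py`; the function names `tag_agreement` and `weighted_agreement`, the tuple layout, and the reading of "weighted by the number of tokens" as a token-weighted mean are all assumptions made for illustration.

```python
from fractions import Fraction


def tag_agreement(users_voted_for_tag: int, votes_on_tag: int) -> Fraction:
    """Agreement on one tag: users-voted-for-the-tag / votes-on-the-tag."""
    if votes_on_tag <= 0:
        raise ValueError("a tag needs at least one vote")
    return Fraction(users_voted_for_tag, votes_on_tag)


def weighted_agreement(tags) -> Fraction:
    """Token-weighted mean agreement (assumed meaning of -weighted).

    `tags` holds hypothetical (users_voted_for_tag, votes_on_tag, n_tokens)
    tuples, one per annotated tag.
    """
    tags = list(tags)  # allow generators; the data is traversed twice
    total_tokens = sum(n_tokens for _, _, n_tokens in tags)
    if total_tokens == 0:
        raise ValueError("no tokens to weight by")
    weighted_sum = sum(
        tag_agreement(voted_for, votes) * n_tokens
        for voted_for, votes, n_tokens in tags
    )
    return weighted_sum / total_tokens


if __name__ == "__main__":
    # Two tags: 3 of 4 users agree on a 10-token tag, 1 of 2 on a 30-token tag.
    print(tag_agreement(3, 4))
    print(weighted_agreement([(3, 4, 10), (1, 2, 30)]))
```

Under this reading, long text blocks dominate the weighted figure, so disagreement on a short navigation link moves the result less than disagreement on a paragraph of body text.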