(Late commit) Add CANOLA corpus description

krdwrd · Aug 14, 2019 · 543f2d5
1 parent 699ba2c
commit 543f2d5

Showing 23 changed files with 2,859 additions and 0 deletions.
diff --git a/000inputs.tex b/000inputs.tex
@@ -0,0 +1,35 @@
+\input{010texfoo}
+
+\title{\mytitle}
+
+\begin{document}
+  \maketitle
+  \input{060abstract}
+
+  % set the TOC with larger line spacing
+  \begin{spacing}{1.1}
+  \tableofcontents
+  \end{spacing}
+  \newpage
+  \pagestyle{plain}
+
+  \input{300}
+
+  	\section{\label{sec:systemoverview}System Overview} 
+  	\input{310}
+
+	\section{\label{sec:preprocessing}Pre-Processing: Harvesting Web Pages}
+	\input{320}
+
+	\section{\label{sec:manualannotation}Manual Annotation: Classification of Web-Page Content by Human Annotators}
+	\input{330}
+
+	\section{\label{sec:goldstandard}The Gold Standard: Compilation and Analysis of Manually Annotated Data}
+	\input{340}
+
+	%\section{Summary}
+	%\input{350}
+
+    \clearpage
+    \input{900bib}
+\end{document}
diff --git a/010texfoo.tex b/010texfoo.tex
@@ -0,0 +1,174 @@
+\documentclass[11pt,a4paper,oneside,liststotoc,listsleft,abstract=true]{scrartcl}
+
+% selectively in/exclude pieces of text
+\usepackage{comment}
+\includecomment{longversion}
+
+% have koma type headings in rm font
+%\addtokomafont{sectioning}{\rmfamily}
+
+
+%%%% package imports (order matters)
+\usepackage{lineno}
+
+% specifies the encoding of this file (and \include or \input files)
+\usepackage[latin1]{inputenc}
+
+% in pdfTeX an active font can only refer to 256 glyphs at a time;
+% select the std. T1 mapping for this document
+\usepackage[T1]{fontenc}
+
+% activate hyphenation
+\usepackage[english]{babel}
+
+% activate character protruding for margin kerning, 
+% i.e. try to get a 'smoother' margin
+%\usepackage[activate]{pdfcprot}
+\usepackage[protrusion=true,expansion,kerning=true]{microtype}
+
+% activate some symbols, e.g. \textmusicalnote (and more 'important' ones...)
+\usepackage{textcomp}
+
+% activate 'pretty' code listings
+\usepackage{listings}
+
+% activate IPA (International Phonetic Alphabet) symbols
+\usepackage{tipa}
+
+% activate the Almost European computer modern font (cf. http://www.ctan.org/tex-archive/fonts/ae/)
+%\usepackage{ae}
+% XOR
+% TeX Gyre (cf. http://www.tug.dk/FontCatalogue)
+\usepackage{tgheros}
+\usepackage{tgtermes}
+% XOR
+% activate springer's minion and myriad font
+%\pdfmapfile{+springer.map}
+%\renewcommand{\sfdefault}{fmy}
+%\renewcommand{\rmdefault}{fmnx}
+%\renewcommand{\ttdefault}{lmtt}
+
+% allow for inclusion of pdf documents
+\usepackage{pdfpages}
+
+%
+%\usepackage[right=7cm,left=2.5cm,top=2cm,bottom=3.5cm]{geometry}
+\usepackage[top=3.0cm,bottom=4.0cm]{geometry}
+\usepackage{setspace}
+
+\usepackage{epic,eepic}
+\usepackage{graphicx}
+\graphicspath{{./}{./images/}}
+% this will produce a warning:
+% LaTeX Warning: Command \@makecol has changed.
+% seems to occur in combination with the setspace package.
+\usepackage[stable, bottom]{footmisc}
+%\usepackage{fullpage}
+\usepackage{url}
+\usepackage{amsmath}
+\usepackage{amssymb}
+\usepackage{tabularx}
+%\usepackage[pdftex]{color}
+\usepackage[
+	pdftex,
+	final=true,
+	pdfstartview=FitH
+	]{hyperref}
+
+%%%% hyper & options
+\definecolor{fuchsia}{rgb}{1,0,1}
+\definecolor{myblue}{rgb}{0.25,0.25,0.75}
+\definecolor{darkblue}{rgb}{0,0,0.75}
+\definecolor{darkred}{rgb}{0.4,0,0}
+\hypersetup{%
+colorlinks=true,
+bookmarks=true,
+bookmarksnumbered=true,
+bookmarksopen=true,
+bookmarksopenlevel=2,
+pdftitle={\mypdftitle},
+pdfauthor={\myauthor},
+pdfsubject={\mytitle},
+pdfkeywords={\mykeywords},
+pdfproducer={pdflatex, inkscape, gnuplot},
+frenchlinks=true,
+pdfborder=0 0 0,
+linkcolor=myblue,
+%pagecolor=darkblue,
+urlcolor=myblue,
+citecolor=darkred,
+setpagesize=true
+}
+
+%
+% align numbering in TOC on the left side, i.e.
+% 1
+% 1.1
+% 1.1.1
+% ...
+%\usepackage{tocloft}
+%\usepackage{chngcntr}
+%\setlength{\cftchapnumwidth}{\cftsubsubsecnumwidth}
+%\setlength{\cftsecnumwidth}{\cftsubsubsecnumwidth}
+%\setlength{\cftsubsecnumwidth}{\cftsubsubsecnumwidth}
+%\setlength{\cftsubsubsecnumwidth}{\cftsubsubsecnumwidth}
+%\setlength{\cftsecindent}{0pt}
+%\setlength{\cftsubsecindent}{0pt}
+%\setlength{\cftsubsubsecindent}{0pt}
+
+%%%% fancy & options
+%\usepackage{fancyhdr}
+%\pagestyle{fancy}
+%\renewcommand{\footrulewidth}{0.5pt}
+%\renewcommand{\headrulewidth}{0.5pt}
+%\setlength{\headheight}{25pt}
+%\setlength{\headsep}{20pt}
+%\renewcommand{\chaptermark}[1]{\markboth{\quad #1}{\quad #1}}
+%\renewcommand{\sectionmark}[1]{\markright{#1}}
+%\fancyhf{}
+%\fancyfoot[CE]{\myauthor}
+%\fancyfoot[CO]{\myStitle}
+%\fancyhead[LE,RO]{\bfseries\thepage}
+%\fancyhead[RE]{\bfseries\leftmark }
+%\fancyhead[LO]{\bfseries\rightmark }
+
+% do not reset the footnote counter for every chapter
+%\counterwithout*{footnote}{chapter}
+
+%%%% indexing options
+\setcounter{tocdepth}{3}
+\setcounter{secnumdepth}{3}
+%\newcounter{lofdepth}
+%\setcounter{lofdepth}{3}
+
+%%%% new commands
+\DeclareMathOperator{\project}{project}
+\DeclareMathOperator{\CC}{CC}
+\DeclareMathOperator{\NCC}{NCC}
+\newcommand{\src}[1]{\texttt{#1}}
+\newcommand{\fref}[1]{\src{#1} (cf.~\ref{#1})}
+\newcommand{\email}[1]{\href{mailto:#1}{#1}}
+\newcommand{\grad}[0]{^\circ}
+\newcommand{\fig}[4]
+{%
+ \begin{figure}[h]
+  \centering
+  \includegraphics[width=#1\textwidth]{#2}
+  \caption{#3}
+  \label{#4}
+ \end{figure}
+}
+
+%\renewcommand*{\raggedsection}{}
+
+\renewenvironment{abstract}{%
+ \addsec*{\abstractname}
+}{\par}
+
+%%%% typesetting options
+\unitlength10mm
+\renewcommand*{\tabularxcolumn}[1]{>{\small}m{#1}}
+
+
+%\usepackage{natbib}
+%\bibliographystyle{plainnat}
diff --git a/060abstract.tex b/060abstract.tex
@@ -0,0 +1,6 @@
+\begin{abstract}
+This document describes the KrdWrd CANOLA Corpus.
+
+The CANOLA Corpus is a visually annotated English web corpus for training the KrdWrd classification engine to remove boilerplate from unseen web pages.
+It was harvested, annotated, and evaluated with the tools and infrastructure of the KrdWrd Project.
+\end{abstract}
diff --git a/300.tex b/300.tex
@@ -0,0 +1,16 @@
+\begin{longversion}
+The KrdWrd Project~\cite{krdwrd.org} deals with the design of an abstract architecture for
+A)~the unified treatment of Web data for automatic processing, \emph{without} neglecting visual information, on both the annotation and the processing side, and
+B)~an appropriate annotation tool to gather data for supervised processing of such data.
+
+The Project comprises an implementation appropriate for pre-processing and cleaning of Web pages, where users are provided with accurate Web page presentations and annotation utilities in a typical browsing environment, while machine learning (ML) algorithms also operate on representations of the visual rendering of Web pages.
+The system also preserves the original Web documents and all the additional information contained therein to make different approaches comparable on identical data.
+
+The system is sketched in \cite{StegerStemle2009}.
+
+For training the KrdWrd ML Engine, a substantial amount of hand-annotated data, viz.~Web pages, is needed.
+In the following, we present the parts of the system that cover the acquisition of training data, i.e.~the steps before training data can be fed into an ML Engine.
+
+Section~\ref{sec:systemoverview} gives an overview of the sequence of steps needed to gather new training data, Section~\ref{sec:preprocessing} describes in depth the processing steps \emph{before} Web pages can be presented to annotators, Section~\ref{sec:manualannotation} presents the actual tool annotators use, and Section~\ref{sec:goldstandard} covers the compilation of their submitted results; after that, we will be ready to feed the KrdWrd Gold Standard to an ML Engine.
+%An exemplification, the KrdWrd ML Engine, is covered in \ref{cha:krdwrdsys2}.
+\end{longversion}
diff --git a/310.tex b/310.tex
@@ -0,0 +1,50 @@
+\begin{longversion}
+%
+%
+Two fundamental ideas underlie this part of the system.
+First, Web pages have a textual representation, namely the text they contain; a structural representation, namely their DOM tree; and a visual representation, namely their rendered view. All three representations should be considered when automatically cleaning Web pages, and consequently all three should be annotated during the acquisition of training data for ML tasks.
+Second, data acquisition for the training of supervised ML algorithms should preserve pristine, unmodified versions of Web pages; this helps to reproduce results \emph{and} to compare those of different architectures.
+
+% What
+\subsection{Functional Walk-Through}
+
+%What, Who, How, Result
+Tagging new data begins with gathering a set of sample pages.
+The process needs to be coordinated by the administrators of the system, i.e.~server-level access is needed to make new corpora available for later tagging by users.
+The process starts with a list of seed terms, which are used to construct an ad-hoc corpus of Web pages; the result is a list of Uniform Resource Locators (URLs\footnote{See \cite{URL} for details, but also \cite{w3.org/Addressing}.}).
+
+%What, Who, How, Result
+The URL list is then \emph{harvested}, i.e.~the corresponding Web pages are downloaded and saved for further processing.
+This process is coordinated by the administrators of the system and is started as an automated batch job on the server; its input is the URL list, and its result is the set of downloaded Web pages and their content.
+
+%What, Who, How, Result
+These Web pages are then available online to users for tagging, i.e.~there are no constraints on who is able to access these pages;
+however, keeping track of \emph{who tagged what} requires differentiating between users and hence registration with the system, viz.~logging in.
+The Web pages are accessible via the KrdWrd Add-on in combination\footnotemark~with the Web Services hosted on \cite[Web Site]{krdwrd.org}.
+\footnotetext{Indeed, the data is accessible with \emph{any} browser -- but the KrdWrd Add-on enhances the experience.}
+
+%What, Who, How, Result
+Users can tag new Web pages, and alter or redisplay previously tagged ones, with the help of the KrdWrd Add-on.
+The KrdWrd Add-on builds upon and extends the functionality of the Firefox~\cite{firefox} browser and facilitates the visual tagging of Web pages, i.e.~users are provided with an accurate Web page presentation and annotation utility in a typical browsing environment.
+Completely (or partly) tagged pages are sent directly back to the server for storage in the KrdWrd Corpus data pool and for further processing.
+
+%What, Who, How, Result
+Updated or newly submitted tagging results are regularly merged, i.e.~submitted results from different users for the same content are processed and compiled into a majority-driven uniform view.
+This automated process uses a \emph{winner-takes-all} strategy and runs regularly on the server without manual intervention.
+The \emph{merged} content is stored in the KrdWrd data pool and is hence available for browsing, viewing, and analysis with the KrdWrd Add-on\footnotemark[\value{footnote}]; furthermore, it can be used as training data for Machine Learning algorithms.
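+
+% Illustrative sketch (not taken from the KrdWrd sources): U_n, t_u(n), and T are hypothetical symbols.
+One possible reading of the winner-takes-all merge, given here as a sketch under stated assumptions rather than as the actual server implementation, is a majority vote per annotated unit.
+Assume, for illustration only, that $U_n$ denotes the set of users who submitted a tag for unit $n$ of a page, $t_u(n)$ the tag assigned to $n$ by user $u$, and $T$ the set of available tags; the merged tag $\hat{t}(n)$ is then the most frequent one:
+\begin{equation*}
+  \hat{t}(n) = \operatorname*{arg\,max}_{t \in T} \, \bigl| \{\, u \in U_n : t_u(n) = t \,\} \bigr|
+\end{equation*}
+For example, if three users tag the same unit as $a$, $a$, and $b$, the merged view records $a$; how ties are resolved is not specified in this description.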
+
+
+% What
+\subsection{Implementation Survey}
+
+The KrdWrd Infrastructure consists of several components that together provide the overall functionality of the system.
+They either run on the KrdWrd Server or are part of the KrdWrd Add-on and hence build upon and extend the functionality of the Firefox browser.
+The Server components are hosted on a machine running Debian GNU/Linux~\cite{debian.org}.
+However, the requirements\footnote{These include sed, awk, python, bash, subversion, XULRunner, wwwoffle, apache, R.} are rather limited: many other standard Linux or Linux-like systems should easily suffice, and even other platforms should be able to host the system.
+Nevertheless, the KrdWrd Add-on runs only as an extension of the Firefox browser, version~3\footnote{But it could be converted into a self-contained XULRunner application.}.
+
+Access to the system is provided as an HTTP service hosted on \url{krdwrd.org}, an SSL-certified virtual host running on an Apache Web Server~\cite{httpd.apache.org}, accompanied by mailing services, a dedicated Trac instance serving as Wiki and issue tracking system for software development (extended with a mailing extension), and Subversion~\cite{subversion} as version control system.
+The interfacing between the KrdWrd Add-on and the Web Server is done via CGI~\cite{cgi} scripts, which are themselves mostly written in the Python programming language~\cite{python}.
+%
+%
+\end{longversion}