This repository has been archived by the owner on Apr 12, 2024. It is now read-only.
(Late commit) Add CANOLA corpus description
Showing 23 changed files with 2,859 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
\input{010texfoo} | ||
|
||
\title{\mytitle} | ||
|
||
\begin{document} | ||
\maketitle | ||
\input{060abstract} | ||
|
||
% set the TOC with larger line spacing | ||
\begin{spacing}{1.1} | ||
\tableofcontents | ||
\end{spacing} | ||
\newpage | ||
\pagestyle{plain} | ||
|
||
\input{300} | ||
|
||
\section{\label{sec:systemoverview}System Overview} | ||
\input{310} | ||
|
||
\section{\label{sec:preprocessing}Pre-Processing: Harvesting Web Pages} | ||
\input{320} | ||
|
||
\section{\label{sec:manualannotation}Manual Annotation: Classification of Web-Page Content by Human Annotators} | ||
\input{330} | ||
|
||
\section{\label{sec:goldstandard}The Gold Standard: Compilation and Analysis of Manually Annotated Data} | ||
\input{340} | ||
|
||
%\section{Summary} | ||
%\input{350} | ||
|
||
\clearpage | ||
\input{900bib} | ||
\end{document} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,174 @@ | ||
\documentclass[11pt,a4paper,oneside,liststotoc,listsleft,abstract=true]{scrartcl} | ||
|
||
% selectively in/exclude pieces of text | ||
\usepackage{comment} | ||
\includecomment{longversion} | ||
|
||
% have koma type headings in rm font | ||
%\addtokomafont{sectioning}{\rmfamily} | ||
|
||
|
||
%%%% package imports (order matters) | ||
\usepackage{lineno} | ||
|
||
% specifies the encoding of this file (and \include or \input files) | ||
\usepackage[latin1]{inputenc} | ||
|
||
% in pdfTeX an active font can only refer to 256 glyphs at a time; | ||
% select the std. T1 mapping for this document | ||
\usepackage[T1]{fontenc} | ||
|
||
% activate hyphenation | ||
\usepackage[english]{babel} | ||
|
||
% activate character protruding for margin kerning, | ||
% i.e. try to get a 'smoother' margin | ||
%\usepackage[activate]{pdfcprot} | ||
\usepackage[protrusion=true,expansion,kerning=true]{microtype} | ||
|
||
% activate some symbols, e.g. \textmusicalnote (and more 'important' ones...) | ||
\usepackage{textcomp} | ||
|
||
% activate 'pretty' code listings | ||
\usepackage{listings} | ||
|
||
% activate the IPA (International Phonetic Alphabet) symbols | ||
\usepackage{tipa} | ||
|
||
% activate the Almost European computer modern font (cf. http://www.ctan.org/tex-archive/fonts/ae/) | ||
%\usepackage{ae} | ||
% XOR | ||
% TeX Gyre (cf. http://www.tug.dk/FontCatalogue) | ||
\usepackage{tgheros} | ||
\usepackage{tgtermes} | ||
% XOR | ||
% activate springer's minion and myriad font | ||
%\pdfmapfile{+springer.map} | ||
%\renewcommand{\sfdefault}{fmy} | ||
%\renewcommand{\rmdefault}{fmnx} | ||
%\renewcommand{\ttdefault}{lmtt} | ||
|
||
% allow for inclusion of pdf documents | ||
\usepackage{pdfpages} | ||
|
||
% | ||
%\usepackage[right=7cm,left=2.5cm,top=2cm,bottom=3.5cm]{geometry} | ||
\usepackage[top=3.0cm,bottom=4.0cm]{geometry} | ||
\usepackage{setspace} | ||
|
||
\usepackage{epic,eepic} | ||
\usepackage{graphicx} | ||
\graphicspath{{./}{./images/}} | ||
% this will produce a warning: | ||
% LaTeX Warning: Command \@makecol has changed. | ||
% seems to occur in combination with the setspace package. | ||
\usepackage[stable, bottom]{footmisc} | ||
%\usepackage{fullpage} | ||
\usepackage{url} | ||
\usepackage{amsmath} | ||
\usepackage{amssymb} | ||
\usepackage{tabularx} | ||
%\usepackage[pdftex]{color} | ||
\usepackage[ | ||
pdftex, | ||
final=true, | ||
pdfstartview=FitH | ||
]{hyperref} | ||
|
||
%%%% hyper & options | ||
\definecolor{fuchsia}{rgb}{1,0,1} | ||
\definecolor{myblue}{rgb}{0.25,0.25,0.75} | ||
\definecolor{darkblue}{rgb}{0,0,0.75} | ||
\definecolor{darkred}{rgb}{0.4,0,0} | ||
\hypersetup{% | ||
colorlinks=true, | ||
bookmarks=true, | ||
bookmarksnumbered=true, | ||
bookmarksopen=true, | ||
bookmarksopenlevel=2, | ||
pdftitle={\mypdftitle}, | ||
pdfauthor={\myauthor}, | ||
pdfsubject={\mytitle}, | ||
pdfkeywords={\mykeywords}, | ||
pdfproducer={pdflatex, inkscape, gnuplot}, | ||
frenchlinks=true, | ||
pdfborder=0 0 0, | ||
linkcolor=myblue, | ||
%pagecolor=darkblue, | ||
urlcolor=myblue, | ||
citecolor=darkred, | ||
setpagesize=true | ||
} | ||
|
||
% | ||
% align numbering in TOC on the left side, i.e. | ||
% 1 | ||
% 1.1 | ||
% 1.1.1 | ||
% ... | ||
%\usepackage{tocloft} | ||
%\usepackage{chngcntr} | ||
%\setlength{\cftchapnumwidth}{\cftsubsubsecnumwidth} | ||
%\setlength{\cftsecnumwidth}{\cftsubsubsecnumwidth} | ||
%\setlength{\cftsubsecnumwidth}{\cftsubsubsecnumwidth} | ||
%\setlength{\cftsubsubsecnumwidth}{\cftsubsubsecnumwidth} | ||
%\setlength{\cftsecindent}{0pt} | ||
%\setlength{\cftsubsecindent}{0pt} | ||
%\setlength{\cftsubsubsecindent}{0pt} | ||
|
||
%%%% fancy & options | ||
%\usepackage{fancyhdr} | ||
%\pagestyle{fancy} | ||
%\renewcommand{\footrulewidth}{0.5pt} | ||
%\renewcommand{\headrulewidth}{0.5pt} | ||
%\setlength{\headheight}{25pt} | ||
%\setlength{\headsep}{20pt} | ||
%\renewcommand{\chaptermark}[1]{\markboth{\quad #1}{\quad #1}} | ||
%\renewcommand{\sectionmark}[1]{\markright{#1}} | ||
%\fancyhf{} | ||
%\fancyfoot[CE]{\myauthor} | ||
%\fancyfoot[CO]{\myStitle} | ||
%\fancyhead[LE,RO]{\bfseries\thepage} | ||
%\fancyhead[RE]{\bfseries\leftmark } | ||
%\fancyhead[LO]{\bfseries\rightmark } | ||
|
||
% do not reset the footnote count on every chapter | ||
%\counterwithout*{footnote}{chapter} | ||
|
||
%%%% indexing options | ||
\setcounter{tocdepth}{3} | ||
\setcounter{secnumdepth}{3} | ||
%\newcounter{lofdepth} | ||
%\setcounter{lofdepth}{3} | ||
|
||
%%%% new commands | ||
\DeclareMathOperator{\project}{project} | ||
\DeclareMathOperator{\CC}{CC} | ||
\DeclareMathOperator{\NCC}{NCC} | ||
\newcommand{\src}[1]{\texttt{#1}} | ||
\newcommand{\fref}[1]{\src{#1} (cf.~\ref{#1})} | ||
\newcommand{\email}[1]{\href{mailto:#1}{#1}} | ||
\newcommand{\grad}[0]{^\circ} | ||
\newcommand{\fig}[4] | ||
{% | ||
\begin{figure}[h] | ||
\centering | ||
\includegraphics[width=#1\textwidth]{#2} | ||
\caption{#3} | ||
\label{#4} | ||
\end{figure} | ||
} | ||
|
||
%\renewcommand*{\raggedsection}{} | ||
|
||
\renewenvironment{abstract}{% | ||
\addsec*{\abstractname} | ||
}{\par} | ||
|
||
%%%% typesetting options | ||
\unitlength10mm | ||
\renewcommand*{\tabularxcolumn}[1]{>{\small}m{#1}} | ||
|
||
|
||
%\usepackage{natbib} | ||
%\bibliographystyle{plainnat} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
\begin{abstract} | ||
This document describes the KrdWrd CANOLA Corpus. | ||
|
||
The CANOLA Corpus is a visually annotated English web corpus for training the KrdWrd classification engine to remove boilerplate from unseen web pages. | ||
It was harvested, annotated, and evaluated with the tools and infrastructure of the KrdWrd Project. | ||
\end{abstract} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
\begin{longversion} | ||
The KrdWrd Project\cite{krdwrd.org} deals with the design of an abstract architecture for | ||
A)~the unified treatment of Web data for automatic processing, \emph{without} neglecting visual information, on both the annotation and the processing side, and | ||
B)~the appropriate annotation tool to gather data for supervised processing of such data. | ||
|
||
The Project comprises an implementation appropriate for pre-processing and cleaning of Web pages, where users are provided with accurate Web page presentations and annotation utilities in a typical browsing environment, while machine learning (ML) algorithms also operate on representations of the visual rendering of Web pages. | ||
The system also preserves the original Web documents and all the additional information contained therein to make different approaches comparable on identical data. | ||
|
||
The system is sketched in \cite{StegerStemle2009}. | ||
|
||
For training the KrdWrd ML Engine, a substantial amount of hand-annotated data, viz.~Web pages, is needed. | ||
In the following, we present the parts of the system that cover the acquisition of training data, i.e.~the steps before training data can be fed into an ML Engine. | ||
|
||
The remainder of this document proceeds as follows: Section~\ref{sec:systemoverview} gives an overview of the sequence of steps needed to gather new training data, Section~\ref{sec:preprocessing} describes in depth the processing steps required \emph{before} Web pages can be presented to annotators, Section~\ref{sec:manualannotation} presents the actual tool annotators use, and Section~\ref{sec:goldstandard} covers the compilation of their submitted results. After these steps, we will be ready to feed the KrdWrd Gold Standard to an ML Engine. | ||
%An exemplification, the KrdWrd ML Engine, is covered in \ref{cha:krdwrdsys2}. | ||
\end{longversion} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
\begin{longversion} | ||
% | ||
% | ||
Two fundamental ideas behind this part of the system are: | ||
firstly, Web pages have a textual representation, namely the text they contain, a structural representation, namely their DOM tree, and a visual representation, namely their rendered view -- all representations should be considered when automatically cleaning Web pages, and consequently, all should be annotated during acquisition of training data for ML tasks. | ||
Secondly, data acquisition for training of supervised ML algorithms should preserve pristine, unmodified versions of Web pages -- this will help to reproduce results \emph{and} to compare those of different architectures. | ||
|
||
% What | ||
\subsection{Functional Walk-Through} | ||
|
||
%What, Who, How, Result | ||
Tagging new data begins with gathering a set of sample pages. | ||
The process needs to be coordinated by the administrators of the system, i.e.~server-level access is needed to make new corpora available for later tagging by users. | ||
The process starts with a list of seed terms, which are used to construct an ad-hoc corpus of Web pages; the result is a list of Uniform Resource Locators (URLs\footnote{See \cite{URL} for details, but also \cite{w3.org/Addressing}.}). | ||
|
||
%What, Who, How, Result | ||
The URL list is then \emph{harvested}, i.e.~the corresponding Web pages are downloaded and saved for further processing. | ||
This process is coordinated by the administrators of the system and is started as an automated batch job on the server; its input is the URL list, and its result is the set of downloaded Web pages and their content. | ||
|
||
%What, Who, How, Result | ||
These Web pages are then available online to users for tagging, i.e.~there are no constraints on who is able to access these pages; | ||
however, keeping track of \emph{who tagged what} requires differentiating between users and, hence, registration with the system, viz.~logging in. | ||
The Web pages are accessible via the KrdWrd Add-on in combination\footnotemark~with the Web Services hosted on \cite[Web Site]{krdwrd.org}. | ||
\footnotetext{Indeed, the data is accessible with \emph{any} browser -- but the KrdWrd Add-on enhances the experience.} | ||
|
||
%What, Who, How, Result | ||
Users can tag new Web pages, or alter or redisplay previously tagged ones, with the help of the KrdWrd Add-on. | ||
The KrdWrd Add-on builds upon and extends the functionality of the Firefox \cite{firefox} browser and facilitates the visual tagging of Web pages, i.e.~users are provided with an accurate Web page presentation and annotation utility in a typical browsing environment. | ||
Completely (or partly) tagged pages are sent directly back to the server for storage in the KrdWrd Corpus data pool and for further processing. | ||
|
||
%What, Who, How, Result | ||
Updated or newly submitted tagging results are regularly merged, i.e.~submitted results from different users for the same content are processed and compiled into a majority-driven uniform view. | ||
This automated process uses a \emph{winner-takes-all} strategy and runs regularly on the server without manual intervention. | ||
The \emph{merged} content is stored in the KrdWrd data pool and is hence available for browsing, viewing, and analysis with the KrdWrd Add-on\footnotemark[\value{footnote}]; furthermore, it can be used as training data for Machine Learning algorithms. | ||
|
||
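% Illustrative sketch only: the symbols below are introduced for this example and are not part of the system description. | ||
The merge step can be sketched, under some assumptions, as a plurality vote per unit of content. | ||
Assume that tags are merged per unit $n$ (e.g.~a text node of a page; this granularity is an assumption), let $U_n$ be the hypothetical set of users who submitted a tag for $n$, let $t_u(n)$ be the tag that user $u$ assigned to $n$, and let $T$ be the set of available tags. | ||
A winner-takes-all merge would then select | ||
\begin{equation*} | ||
\hat{t}(n) = \operatorname*{arg\,max}_{t \in T} \bigl|\{\, u \in U_n : t_u(n) = t \,\}\bigr| . | ||
\end{equation*} | ||
For instance, if three users tag the same unit with $a$, $a$, and $b$, respectively, this yields $\hat{t}(n) = a$. | ||
How ties are resolved is not covered by this sketch; any tie-breaking rule would be a further assumption. | ||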
|
||
% What | ||
\subsection{Implementation Survey} | ||
|
||
The KrdWrd Infrastructure consists of several components that together provide the overall functionality of the system. | ||
They either run on the KrdWrd Server or are part of the KrdWrd Add-on and hence build upon and extend the functionality of the Firefox browser. | ||
The Server components are hosted on a Debian GNU/Linux \cite{debian.org} powered machine. | ||
However, the requirements\footnote{These include sed, awk, python, bash, subversion, XULRunner, wwwoffle, apache, R.} are rather limited: many other standard Linux, or Linux-like, systems should easily suffice, and even other platforms should be able to host the system. | ||
Nevertheless, the KrdWrd Add-on strictly runs only as an extension of the Firefox browser, version 3\footnote{But it could be converted into a self-contained XULRunner application.}. | ||
|
||
Access to the system is provided as an HTTP service hosted on \url{krdwrd.org}, an SSL-certified virtual host running on an Apache Web Server \cite{httpd.apache.org} accompanied by mailing services, a dedicated trac as Wiki and issue tracking system for software development (extended with a mailing extension), and subversion \cite{subversion} as version control system. | ||
The interfacing between the KrdWrd Add-on and the Web Server is done via CGI \cite{cgi} scripts, which are themselves mostly written in the Python programming language \cite{python}. | ||
% | ||
% | ||
\end{longversion} |