%
% File naaclhlt2018.tex
%
%% Based on the style files for NAACL-HLT 2018, which were
%% Based on the style files for ACL-2015, with some improvements
%% taken from the NAACL-2016 style
%% Based on the style files for ACL-2014, which were, in turn,
%% based on ACL-2013, ACL-2012, ACL-2011, ACL-2010, ACL-IJCNLP-2009,
%% EACL-2009, IJCNLP-2008...
%% Based on the style files for EACL 2006 by
%% and that of ACL 08 by Joakim Nivre and Noah Smith
\documentclass[11pt,a4paper]{article}
% Added language, encoding and TODOs support
% 201803 by [email protected]
\usepackage[english]{babel}
\usepackage[utf8x]{inputenc}
% NOTES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage[usenames]{xcolor}
\usepackage[inline,nomargin,index]{fixme}
\fxsetup{theme=color,mode=multiuser}
\FXRegisterAuthor{alex}{al}{\color{violet!60}Alex}
\FXRegisterAuthor{egon}{eg}{\color{red!60}Egon}
\newcommand{\egon}[1]{\egonnote{#1}}
\newcommand{\alex}[1]{\alexnote{#1}}
% table packages %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{graphicx}
\usepackage{multirow}
\usepackage{siunitx}
\sisetup{input-ignore={,},group-separator={,},input-decimal-markers={.}}
% CHANGES
\usepackage{changes}
\colorlet{Changes@Color}{green}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage[hyperref]{naaclhlt2018}
\usepackage{times}
\usepackage{latexsym}
% Added
% 201804 by [email protected]
% 'Solve' error: \pdfendlink ended up in different nesting level than \pd %%%%%
%\let\oldhref\href
%\renewcommand{\href}[2]{\oldhref{#1}{\hbox{#2}}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{url}
\aclfinalcopy % Uncomment this line for the final submission
%\def\aclpaperid{***} % Enter the acl Paper ID here
%\setlength\titlebox{5cm}
% You can expand the titlebox if you need extra space
% to show all the authors. Please do not make the titlebox
% smaller than 5cm (the original size); we will check this
% in the camera-ready version and ask you to change it back.
\newcommand\BibTeX{B{\sc ib}\TeX}
\usepackage{xspace}
\newcommand\fT{\texttt{fastText}\xspace}
\title{Using Language Learner Data for Metaphor Detection}
\author{Egon W.~Stemle \\
Eurac Research \\
Bolzano-Bozen, Italy \\
{\tt [email protected]} \\\And
Alexander Onysko \\
Alpen-Adria-Universität \\
Klagenfurt a.W., Austria \\
{\tt [email protected]} \\}
\date{}
\begin{document}
\maketitle
\begin{abstract}
This article describes the system that participated in the shared task (ST) on metaphor detection \cite{Leong2018} on the Vrije Universiteit Amsterdam Metaphor Corpus (VUA).
The ST was part of the workshop on processing figurative language at the 16th annual conference of the \emph{North American Chapter of the Association for Computational Linguistics} (NAACL2018).
The system combines a small assortment of trending techniques that implement mature methods from NLP and ML;
in particular, the system uses word embeddings from standard corpora and from corpora representing different proficiency levels of language learners in an LSTM BiRNN architecture.
The system is available under the APLv2 open-source license.
\end{abstract}
\section{Introduction} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:intro}
Ever since conceptual metaphor theory was laid out in \newcite{LakoffJohnson80}, the most vexing question has remained a methodological one: how can conceptual metaphors be reliably identified in language use? Although manual identification was put on a stronger methodological footing with the Metaphor Identification Procedure (MIP) \cite{doi:10.1080/10926480709336752} and its elaboration into MIPVU \cite{Steen2010}, fuzzy areas remain due to the fact that conceptual metaphors can vary between primary metaphors and complex metaphors \cite[cf.][]{Grady1997}.
Furthermore, highly conventionalized metaphorical expressions might not be processed in the same way as novel metaphors. The core process of manual metaphor identification is not completely unproblematic either since it can be difficult to establish whether the meaning of a lexical unit in its context deviates from its basic meaning or not.
In the face of that slippery terrain, automatic metaphor identification emerges as an extremely challenging task.
An increasing volume of research since the start of annual workshops at NAACL in 2013 has shown first promising results using different methods of automated metaphor identification (see for example %\newcite{W13-0900}, \newcite{W14-2300},
\newcite{W15-1400} and
\newcite{W16-1100} for previous events).
The current shared task of metaphor identification provided a further opportunity to put the computational spotting of metaphors to the test.
Our bid for this task combines (cf.~Section~\ref{sec:design}) \fT word embeddings (WEs) with a single-layer long short-term memory bidirectional recurrent neural network (BiRNN) architecture.
The input, sequences of WE representations of words, is fed into the BiRNN, which predicts metaphorical usage for each word.
The WEs were trained (cf.~Section \ref{sec:wemodels}) on different large corpora (BNC, Wikipedia, enTenTen13, ukWaC) and on the Vienna-Oxford International Corpus of English (VOICE) as well as on the TOEFL11 Corpus of Non-Native English. The latter corpus was used, among others, in the First Native Language Identification Shared Task \cite{tetreault-blanchard-cahill:2013:BEA} held at the \emph{8th Workshop on Innovative Use of NLP for Building Educational Applications} as part of NAACL-HLT 2013.
We were guided by the idea (cf.~Section \ref{subsec:lld}) that metaphorical language use changes as learners gain proficiency in a language, and so we hoped to be able to utilise the information contained in corpora of different proficiency levels.
The paper is organised as follows: We present our system design with related work in
Section~\ref{sec:design}, the implementation in
Section~\ref{sec:implementation}, and the experimental setup with an evaluation in
Section~\ref{sec:results}.
Section~\ref{sec:conclusion} concludes with an outlook on possible next steps.
\section{Design} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:design}
Generally, our design builds upon the foundation laid out by~\newcite{collobert:2011b} for a neural network (NN) architecture and learning algorithm that can be applied to various natural language processing tasks.
The most closely related task-specific design is that of \newcite{W16-1104}, who used an NN in combination with WEs to detect metaphors.
% they demonstrate positive influence of part-of-speech (POS) features.
In contrast to our study, they used a dense multi-layer NN while we adapted the design of \newcite{stemle:2016:WAC-X,
stemle:2016:evalita}, who combined WEs with a recurrent NN (RNN) to predict part-of-speech (PoS) tags of computer-mediated communication (CMC) and Web corpora for German and Italian.
RNNs are usually considered to be more suitable for labelling sequential data such as text.
\subsection{Word Embeddings} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{subsec:we}
Recently, state-of-the-art results on various linguistic tasks have been achieved by architectures using neural-network-based WEs.
\newcite{baroni-dinu-kruszewski:2014:P14-1} conducted a set of experiments comparing the popular word2vec~\cite{DBLP:journals/corr/abs-1301-3781,arXiv:1310.4546} implementation for creating WEs with other well-known distributional methods across various (semantic) tasks.
These results suggest that the WEs substantially outperform the other architectures on semantic similarity and analogy detection tasks.
Subsequently,~\newcite{TACL570} conducted a comprehensive set of experiments that suggest that much of the improved results are due to the system design and parameter optimizations, rather than the selected method.
They conclude that ``there does not seem to be a consistent significant advantage to one approach over the other''.
WEs provide high-quality, low-dimensional vector representations of words derived from large corpora of unlabelled data. The representations, typically computed using NNs, encode many linguistic regularities and patterns~\cite{arXiv:1310.4546}.
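As a minimal illustration of such regularities (a sketch, not part of our pipeline; the path is a placeholder and a model in word2vec text format is assumed), the well-known word analogy test can be run with gensim:
\begin{verbatim}
from gensim.models import KeyedVectors

# Load a pre-trained embedding model in
# word2vec text format (placeholder path).
wv = KeyedVectors.load_word2vec_format(
    "embeddings.vec")

# "king" - "man" + "woman" should rank
# "queen" among the nearest neighbours.
print(wv.most_similar(
    positive=["king", "woman"],
    negative=["man"], topn=3))
\end{verbatim}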
\subsection{Bidirectional Recurrent Neural Network} %%%%%%%%%%%%%%%%%%%%%%%%%%
\label{subsec:nn}
NNs consist of a large number of simple, highly interconnected processing nodes in an architecture loosely inspired by the structure of the cerebral cortex of the brain~\cite{oreilly2000}.
The nodes receive weighted inputs through their connections on one side and \emph{fire} according to the individual thresholds of their shared activation function.
A firing node passes on an activation to all connected nodes on the other side.
During learning, the input is propagated through the network and the actual output is compared to the desired output.
Then, the weights of the connections (and the thresholds) are adjusted step-wise so as to more closely resemble a configuration that would produce the desired output.
After all training data have been presented, the process typically starts over, and the learned output values will usually be closer to the desired values.
Recurrent NNs (RNNs), introduced by \newcite{Elman1990}, are NNs in which the connections between elements form directed cycles, i.e.~the networks have loops, and this enables the NN to model sequential dependencies of the input.
However, regular RNNs have fundamental difficulties learning long-term dependencies, and special kinds of RNNs need to be used~\cite{Hochreiter1991};
a very popular one is the so called long short-term memory (LSTM) network proposed by~\newcite{Hochreiter:1997:LSM:1246443.1246450}.
Bidirectional RNNs (BiRNNs), introduced by \newcite{Schuster1997}, extend unidirectional RNNs by introducing a second layer in which the directed cycles let the input flow in the opposite sequential order.
While processing text, this means that for any given word the network not only considers the text leading up to the word but also the text thereafter.
Overall, with this design we benefit not only from the available labelled data but also from large amounts of available unlabelled data.
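Schematically, for an input sequence $(x_1, \ldots, x_T)$ a BiRNN computes forward and backward hidden states and combines them at every time step (the notation here is ours, for illustration only):
\[
\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1}), \quad
\overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1}), \quad
y_t = g\big([\overrightarrow{h}_t ; \overleftarrow{h}_t]\big),
\]
so that the label $y_t$ for word $x_t$ depends on the text before and after the word.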
\subsection{Language Learner Data} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{subsec:lld}
Our experimental design also utilizes data from language learner corpora.
This is based on the intuition that metaphor use might vary depending on learner proficiency.
\newcite{W13-0902} %p.18
indeed found a correlation between higher proficiency ratings of learner texts and a higher density of metaphors in these texts. Their study is also one of the few in the field of automated metaphor detection that are concerned with learner language. Their aim, however, is quite different from that of the current study, as they try to establish annotations for metaphoric language use that can help train an automated classifier of metaphors in test-taker essays. The current study, by contrast, uses learner corpus data, alongside corpora representing written standard language, to build WEs. Learner language could be a particularly helpful source of information for automated metaphor detection via WEs because it exhibits usage patterns that differ from those found in standard language corpora.
\begin{table*}[t]
\setlength{\tabcolsep}{3pt}
\begin{center}
%\small
\begin{tabular}{|l|S[table-format=6.1]|r|r||c|c|c|c|c|c|c|c|c|c||r|rr|}
\hline
\multicolumn{1}{|c}{} &
\multicolumn{1}{|c| }{\shortstack[c]{\textbf{Tokens} \\ \textbf{(M)}}} &
\multicolumn{1}{ c| }{\shortstack[c]{\textbf{min} \\ \textbf{Cnt}}} &
\multicolumn{1}{ c||}{\shortstack[c]{\textbf{dim} }} &
\multicolumn{1}{ c| }{\rotatebox{90}{T11 (low)}} &
\multicolumn{1}{ c| }{\rotatebox{90}{T11 (med)}} &
\multicolumn{1}{ c| }{\rotatebox{90}{T11 (high)}} &
\multicolumn{1}{ c| }{\rotatebox{90}{T11 (l+m+h)}} &
\multicolumn{1}{ c| }{\rotatebox{90}{VOICE}} &
\multicolumn{1}{ c| }{\rotatebox{90}{BNC}} &
\multicolumn{1}{ c| }{\rotatebox{90}{enTenTen13}} &
\multicolumn{1}{ c| }{\rotatebox{90}{ukWaC}} &
\multicolumn{1}{ c| }{\rotatebox{90}{ukWaC T11-size\ }} &
\multicolumn{1}{ c||}{\rotatebox{90}{Wikipedia17}} &
\multicolumn{1}{ c| }{\rotatebox{90}{F1-score on Test Set\ }} &
\multicolumn{2}{ c| }{\shortstack[c]{10-fold CV \\ Accuracy \\ on Training Set \\ $\mu$ | $\sigma$ }} \\
\hline
\hline
%T11 (low) & $0.3$ & 1 & 50 \\ % 245.130
%T11 (med) & $1.8$ & 1 & 50 \\ % 1.819.407
%T11 (high) & $1.4$ & 1 & 50 \\ % 1.388.260
%%\hline
%T11 (low+med+high) & $3.5$ & 1 & 50 \\ % 3.452.797 tokens
\multirow[t]{2}{*}{T11 (low)} % 245.130
% NLI_2013_low_min1_dim50-10fold 0.207 91.7 +/- 1.64
& 0.3 & 1 & 50 & X & & & & & & & & & & 0.207 & 0.917 & 0.016 \\
\hline
\multirow[t]{2}{*}{T11 (med)} % 1.819.407
% NLI_2013_med_min1_dim50-10fold 0.526 92.4 +/- 1.06
& 1.8 & 1 & 50 & & X & & & & & & & & & 0.526 & 0.924 & 0.011 \\
\hline
\multirow[t]{3}{*}{T11 (high)} % 1.388.260
% NLI_2013_high_min1_dim50-10fold 0.514 93.0 +/- 0.74
& 1.4 & 1 & 50 & & & X & & & & & & & & 0.514 & 0.930 & 0.007 \\
\hline
\multirow[t]{2}{*}{T11 (l+m+h) } % 3.452.797
% NLI_2013_all_min1_dim50-10fold 0.541 92.8 +/- 0.82
& 3.5 & 1 & 50 & & & & X & & & & & & & 0.541 & 0.928 & 0.008 \\
\hline
\multirow[t]{2}{*}{VOICE} % Voice 1-million-word, 1.003.696
% voice_min1_dim50-10fold 0.495 92.3 +/- 0.95
& 1 & 1 & 50 & & & & & X & & & & & & 0.495 & 0.923 & 0.010 \\
\hline
% https://embeddings.sketchengine.co.uk/static/index.html
\multirow[t]{2}{*}{BNC} % BNC 100-million-word, 100.000.000
% bnc-10fold 0.597 94.2 +/- 0.54
& 100 & 5 & 100 & & & & & & X & & & & & 0.597 & 0.942 & 0.005 \\
\hline
\multirow[t]{2}{*}{enTenTen13} %
% ententen-10fold 0.594 94.7 +/- 0.40
& 19,000 & 5 & 100 & & & & & & & X & & & & 0.594 & 0.947 & 0.004 \\
\hline
\multirow[t]{2}{*}{ukWaC} % 2.119.891.296 tokens
% ukwac-10fold 0.598 94.5 +/- 0.41
& 2,100 & 5 & 100 & & & & & & & & X & & & \textbf{0.598} & 0.945 & 0.004 \\
\hline % \cline{2-17}
\multirow[t]{2}{*}{ukWaC T11-size} %
% ukwac-NLI-size_sample_min1_dim50-10fold 0.564 93.3 +/- 0.93
& 3.5 & 1 & 50 & & & & & & & & & X & & 0.564 & 0.933 & 0.009 \\
\hline
%% https://fasttext.cc/docs/en/pretrained-vectors.html
\multirow[t]{2}{*}{Wikipedia17} %
% wiki-10fold 0.586 94.7 +/- 0.32
& ca 2,300 & 5 & 300 & & & & & & & & & & X & 0.586 & \textbf{0.947} & 0.003 \\
\hline
\hline
% NLIs_low_med_high__ukwac-NLI-size_sample_min1_dim50-10fold 0.576 94.1 +/- 0.34
& 7 & & & X & X & X & & & & & & X & & 0.576 & 0.941 & 0.003 \\
\cline{2-17}
% NLIs_all__ukwac-NLI-size_sample_min1_dim50-10fold 0.5669893020886398 93.55 +/- 0.83
& 7 & & & & & & X & & & & & X & & 0.567 & 0.936 & 0.008 \\
\cline{2-17}
%% NLI_all__voice_min1_dim50-10fold 0.5243175834834419 93.64 +/- 0.52
%& 4.5 & & & & & & X & X & & & & & & 0.524 & 0.936 & 0.005 \\
%\cline{2-17}
% NLIs_low_med_high__bnc_min1_dim50-10fold 0.5964377559767596 94.40 +/- 0.83
& 103.5 & & & X & X & X & & & X & & & & & 0.596 & 0.944 & 0.008 \\
\cline{2-17}
% NLIs_all__bnc_min1_dim50-10fold 0.613 94.5 +/- 0.53
& 103.5 & & & & & & X & & X & & & & & \textbf{0.613} & 0.945 & 0.005 \\
\cline{2-17}
% ukwac-NLI-size__bnc_min1_dim50-10fold 0.597 94.8 +/- 0.31
& 103.5 & & & & & & & & X & & & X & & 0.597 & 0.948 & 0.003 \\
\cline{2-17}
% NLIs_low_med_high__voice__bnc-10fold 0.601 95.0 +/- 0.36
& 104.5 & & & X & X & X & & X & X & & & & & 0.601 & 0.950 & 0.004 \\
\cline{2-17}
% NLIs_all__bnc__ukwac-NLI-size_sample_min1_dim50-10fold 0.586087420042644 95.11 +/- 0.23
& 107 & & & & & & X & & X & & & X & & 0.586 & 0.951 & 0.002 \\
\cline{2-17}
% NLIs_all__voice__bnc__ukwac-NLI-size_sample_min1_dim50-10fold 0.5495198079231692 94.87 +/- 0.33
& 108 & & & & & & X & X & X & & & X & & 0.550 & 0.948 & 0.003 \\
\cline{2-17}
% NLIs_low_med_high__voice__ententen-10fold 0.603 94.7 +/- 0.64
& 19,004.5 & & & X & X & X & & X & & X & & & & 0.603 & 0.947 & 0.006 \\
\cline{2-17}
% bnc__wiki__ententen-10fold 0.605 95.1 +/- 0.34
& 21,400 & & & & & & & & X & X & & & X & 0.605 & 0.951 & 0.003 \\
\cline{2-17}
% voice__bnc__wiki__ententen-10fold 0.594320757232369 95.27 +/- 0.26
& 21,401 & & & & & & & X & X & X & & & X & 0.594 & \textbf{0.953} & 0.003 \\
\cline{2-17}
% NLIs_low_med_high__voice__bnc__wiki__ententen-10fold 0.597 95.2 +/- 0.31
& 21,404.5 & & & X & X & X & & X & X & X & & & X & 0.597 & 0.952 & 0.003 \\
\hline
\end{tabular}
\end{center}
\caption{\label{tab:models}Overview of the word embedding models we used, and evaluation results for individual models and some combinations on the metaphor prediction track for \emph{all content part-of-speech}. \\ Columns: number of tokens in the original corpus (in millions); the \texttt{minCount} and \texttt{dim} parameters used when training the \fT models; our F1-scores on the official labelled test set (these should coincide with the organisers' results); and the mean accuracy and its standard deviation over 10-fold cross-validation runs on the training set.}
\end{table*}
\section{Implementation} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:implementation}
We maintain the implementation in a source code repository\footnote{\url{https://github.com/bot-zen/}}.
Our system uses sequences of word features as input to a BiRNN with an LSTM architecture.
\subsection{Word Embeddings} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We use gensim\footnote{\url{https://radimrehurek.com/gensim/}}, a Python tool for unsupervised semantic modelling from plain text, to load pre-computed WE models and to compute embedding-vector representations of words.
Words missing from a WE model, i.e.~out-of-vocabulary (OOV) words, are first estimated from a fixed-size context window of surrounding non-OOV words.
If this fails, each OOV word is mapped to its own randomly generated vector representation.
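A minimal sketch of this OOV handling (simplified; \texttt{wv} is assumed to behave like a gensim \texttt{KeyedVectors} object, and the window size and random range are illustrative):
\begin{verbatim}
import numpy as np

def embed(tokens, wv, window=10):
    # Map tokens to vectors; handle OOV words.
    rng = np.random.RandomState(42)
    oov = {}  # one fixed random vector per OOV type
    out = []
    for i, tok in enumerate(tokens):
        if tok in wv:
            out.append(wv[tok])
            continue
        # First try: average the in-vocabulary
        # words in a fixed context window.
        lo, hi = max(0, i - window), i + window + 1
        ctx = [t for t in tokens[lo:hi] if t in wv]
        if ctx:
            out.append(np.mean([wv[t] for t in ctx],
                               axis=0))
        else:
            # Fallback: individual random vector.
            if tok not in oov:
                oov[tok] = rng.uniform(
                    -0.25, 0.25, wv.vector_size)
            out.append(oov[tok])
    return np.array(out)
\end{verbatim}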
\subsection{Neural Network} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Our implementation uses Keras \cite{chollet2015}, a high-level neural network library written in Python, on top of TensorFlow~\cite{tensorflow2016}, an open source software library for numerical computation.
The number of input layers corresponds to the number of employed feature sets.
For multiple feature sets, e.g.~multiple WE models or additional PoS tags, sequences are concatenated on the word level such that the number of features for an individual word grows.
%\footnote{For an overview of combining multiple WE models see, e.g.~\newcite{Turian:2010:WRS:1858681.1858721}.}
Input sequences have a pre-defined length and represent original textual sentence segments.
If a sentence is longer than the sequence length, the input is split into multiple segments; if a segment is shorter than the sequence length, the remaining slots are padded, i.e.~filled with identical dummy values.
Each input layer feeds into a masking layer such that the padded values from the input sequence will be skipped in all downstream layers.\footnote{This is considered good practice and speeds up processing with long sequences and many padded values -- with our rather short sequences it did not help much.}
The masked input is fed into a bidirectional LSTM layer that, in turn, projects to a fully connected output layer that is activated by a softmax function.
The output is a single sequence of matching length with labels indicating whether the corresponding word is used metaphorically or not.
During training, we use dropout for the linear transformation of the recurrent state, i.e.~the network drops a fraction of recurrent connections,
which helps prevent overfitting~\cite{Srivastava2014};
and we use a weighted categorical cross-entropy loss function to counteract the fact that far fewer words in our sequences are labelled as metaphorical than non-metaphorical, which usually hampers classification performance \cite[cf.][]{Kotsiantis2006}.
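A condensed sketch of this setup (simplified, shown with the \texttt{tensorflow.keras} API; the sequence length, embedding dimension, and number of LSTM units are example values, and the loss weighting described above is applied separately via per-token sample weights):
\begin{verbatim}
from tensorflow.keras.layers import (
    Bidirectional, Dense, Input, LSTM,
    Masking, TimeDistributed)
from tensorflow.keras.models import Model

SEQ_LEN, EMB_DIM, N_LABELS = 15, 100, 2

# Sequences of word embedding vectors.
inp = Input(shape=(SEQ_LEN, EMB_DIM))
# Skip padded time steps downstream.
x = Masking(mask_value=0.0)(inp)
# Single bidirectional LSTM layer with
# dropout on the recurrent state.
x = Bidirectional(
    LSTM(64, return_sequences=True,
         recurrent_dropout=0.25))(x)
# Per-word softmax over the two labels
# (metaphorical vs. non-metaphorical).
out = TimeDistributed(
    Dense(N_LABELS, activation="softmax"))(x)

model = Model(inp, out)
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy")
\end{verbatim}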
\section{Experiments and Results} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:results}
Participants in the ST could enter the metaphor prediction track for verbs only, the track for all content parts of speech, or both.
For a given text in the VUA, and for each sentence, the task was to predict metaphoricity for each verb or content word, respectively, and to submit the result to CodaLab\footnote{\url{http://codalab.org}} for evaluation.
Results were calculated as the harmonic mean of precision and recall (F1-score) for the metaphoricity label.
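That is, with precision $P$ and recall $R$ computed for the metaphorical class,
\[
F_1 = \frac{2 \cdot P \cdot R}{P + R}.
\]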
We participated with our system in both tasks.
The remainder of this section introduces the official data set, our WE models and describes our fixed hyper-parameters.
The results of different combinations of WE models are shown in Table \ref{tab:models}.
Also note that \emph{all results} in this paper refer only to the all-content-part-of-speech track.
\subsection{Shared Task Data} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The VUA, the corpus used in the shared task, originates from the British National Corpus (BNC).
Altogether, it comprises 117 texts covering four genres (academic, conversation, fiction, news).
For the ST, the VUA was pre-divided by the organisers into a training set and a test set.
The training set was labelled and could be used to train classifiers, while participants had to label the test set and submit their predictions.
The distribution of metaphorical vs.~non-metaphorical labels was imbalanced with a ratio of roughly 1:6 ($11044:61567$).
\subsection{Word Embedding Models} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:wemodels}
We use pre-built WE models for the following corpora: the \emph{BNC} and the \emph{enTenTen13} web corpus \cite{1120431} from SketchEngine\footnote{\url{https://embeddings.sketchengine.co.uk/static/index.html}}, as well as \emph{Wikipedia17}\footnote{\url{https://fasttext.cc/docs/en/pretrained-vectors.html}} from \fT \cite{DBLP:journals/corr/BojanowskiGJM16}.
We trained WE models using \fT's SkipGram model with the default parameters\footnote{\url{https://github.com/facebookresearch/fastText/archive/v0.1.0.zip}} except for the two parameters \texttt{-minCount} (the minimal number of word occurrences) and \texttt{-dim} (size of word vectors).
The two parameters were altered to take the smaller sizes of our corpora into account. See Table \ref{tab:models} for details.
Three individual models were trained on the low, medium, and high proficiency levels of the training subset of the \emph{TOEFL11} \cite{ETS2:ETS202331}; another model was trained on the full training set comprising all three proficiency levels. One model was trained on the \emph{VOICE} \cite{Seidlhofer2013}, a corpus of English as it is spoken by a majority of non-native users in different contexts.
Two models were trained on ukWaC \cite{wackycorpora2009}, a corpus constructed from the Web using medium-frequency words from the BNC as seeds:
the first on the full corpus and the second on a random sample of documents approximating the token count of the full TOEFL11 training set.
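We trained these models with the original \fT binary; a roughly equivalent call using gensim's \texttt{FastText} implementation (a sketch with gensim~4 parameter names and a placeholder corpus path) looks as follows:
\begin{verbatim}
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# One tokenised sentence per line
# (placeholder path).
sents = LineSentence("toefl11_all.txt")

# sg=1 selects the SkipGram model; min_count
# and vector_size mirror the -minCount and
# -dim settings in Table 1.
model = FastText(sentences=sents, sg=1,
                 min_count=1, vector_size=50)
model.wv.save_word2vec_format(
    "toefl11_all.vec")
\end{verbatim}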
\subsection{Hyper-Parameter Tuning} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Hyper-parameter tuning is important for good performance. The parameters of our system were optimised via an ad-hoc grid search in 3-fold cross validation (CV) runs.
Parameters were (the values used in our final configuration are set in italics):
NN optimizer (\emph{rmsprop}, adadelta, adam), recurrent dropout rate for the LSTM layer (0.1, \emph{0.25}, 0.5), dropout for the input layer (\emph{0}, 0.1, 0.2), sequence length (5, 10, \emph{15}, 50), learning epochs (3, 5, \emph{20}, 32), batch size (16, \emph{32}, 64), and the network architecture, e.g.~introducing a second LSTM abstraction layer or using a Gated Recurrent Unit (GRU) layer instead of the LSTM layer.
Furthermore, we trained WE models with different values for the \texttt{dim} (25, \emph{50}, 100, 150, 200, 250) and \texttt{minCount} (\emph{1}, 2, 5, 10) parameters.
The weight for the categorical cross-entropy loss function is calculated as the logarithm of the ratio of the total number of words to the number of metaphorical labels. The context window for estimating OOV words was set to 10.
Once set, we used the same configuration for all experiments.
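For illustration, with the VUA training-set label counts given above and assuming the natural logarithm (the base is an implementation detail), the weight works out roughly as follows:
\begin{verbatim}
import math

n_metaphor, n_literal = 11044, 61567
n_words = n_metaphor + n_literal
# Weight applied to metaphorical labels:
# log(72611 / 11044) ~= 1.88
weight = math.log(n_words / n_metaphor)
\end{verbatim}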
\section{Conclusion \& Outlook} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\label{sec:conclusion}
The combination of WEs with a BiRNN is capable of recognising metaphorical usage of words better than many previously tested approaches.
More importantly, our design does not rely on WordNet or VerbNet information and, unlike many successful architectures from previous annual workshops at NAACL, does not need concreteness or abstractness information.
Apart from the VUA data, our system only needs running text.
The best result on the test set was achieved with a combination of TOEFL11 learner data and data from the BNC.
So far, the results are encouraging---but also mixed---regarding our initial idea that differences in metaphorical language use across proficiency levels could be utilised for recognising metaphorical usage of words.
To this end, we are looking forward to output from the \emph{European Network for Combining Language Learning with Crowdsourcing Techniques}\footnote{\url{http://www.cost.eu/COST_Actions/ca/CA16105}}, where potentially more and more fine-grained language learner data will be collected and made available.
\section*{Acknowledgements} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The computational results presented have been achieved in part using the \href{http://vsc.ac.at}{Vienna
Scientific Cluster (VSC)}.
\medskip
%\cleardoublepage
%\newpage
\bibliographystyle{acl_natbib}
\bibliography{naaclhlt2018}
\end{document}