\documentclass[11pt]{article}
%\usepackage[top=20mm,left=20mm,right=20mm,bottom=15mm,a4paper]{geometry} % see geometry.pdf on how to lay out the page. There's lots.
\usepackage[top=20mm,left=20mm,right=20mm,bottom=15mm,headsep=15pt,footskip=15pt,a4paper]{geometry} % see geometry.pdf on how to lay out the page. There's lots.
%\geometry{a4paper} % or letter or a5paper or ... etc
% \geometry{landscape} % rotated page geometry
\usepackage[round]{natbib}
\setlength{\bibsep}{0.0pt}
\usepackage{color}
\usepackage{times}
%\usepackage[T1]{fontenc}
%\usepackage{mathptmx}
\usepackage{tikz-dependency}
\usepackage{enumitem}
%\usepackage{times}
\usepackage[procnames]{listings}
\usepackage{todonotes}
% See the ``Article customise'' template for some common customisations
\newcommand{\refeq}[1]{Equation~\ref{eq:#1}}
\newcommand{\reffig}[1]{Figure~\ref{fig:#1}}
\newcommand{\reftab}[1]{Table~\ref{tab:#1}}
\newcommand{\refsec}[1]{\textsection\ref{sec:#1}}
\newcommand{\newsec}[2]{\section{#1}\label{sec:#2}\noindent}
\newcommand{\newsubsec}[2]{\subsection{#1}\label{sec:#2}\noindent}
\newcommand{\argmax}{\operatornamewithlimits{argmax}}
\newcommand{\argmin}{\operatornamewithlimits{argmin}}
\makeatletter
\def\@maketitle{ % custom maketitle
\begin{center}%
{\bfseries \@title}%
{\bfseries \@author}%
\end{center}%
\smallskip \hrule \bigskip }
% custom section
\renewcommand{\section}{\@startsection
{section}% % the name
{1}% % the level
{0mm}% % the indent
{-0.8\baselineskip}% % the before skip
{0.3\baselineskip}% % the after skip
{\bfseries\large}}% the style
% custom subsection
\renewcommand{\subsection}{\@startsection
{subsection}% % the name
{2}% % the level
{0mm}% % the indent
{-0.8\baselineskip}% % the before skip
{0.3\baselineskip}% % the after skip
{\bfseries\large}}% the style
\renewcommand{\paragraph}{%
\@startsection{paragraph}{4}%
{\z@}{1.5ex \@plus 1ex \@minus .2ex}{-1em}%
{\normalfont\normalsize\bfseries}%
}\makeatother
%\title{{\LARGE Universal Parser (UP)}\\[-8mm]
%\includegraphics[height=8mm]{RUPA}~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\includegraphics[height=8mm]{RUPA}}
\title{{\LARGE Natural Language Processing}\\[1.5mm]{\large Lab 4: Language Modeling}}
\author{}
\date{} % delete this line to display the current date
%%% BEGIN DOCUMENT
\begin{document}
\definecolor{keywords}{RGB}{255,0,90}
\definecolor{comments}{RGB}{0,0,113}
\definecolor{red}{RGB}{160,0,0}
\definecolor{green}{RGB}{0,150,0}
\lstset{language=Python,
basicstyle=\ttfamily\small,
keywordstyle=\color{keywords},
commentstyle=\color{comments},
stringstyle=\color{red},
showstringspaces=false,
identifierstyle=\color{green},
procnamekeys={def,class}}
\maketitle
%\tableofcontents
%\vspace{3mm}
\vspace{-2mm}
\newsec{Introduction}{intro}%
In this lab, we are going to build and evaluate $n$-gram models %(for various values of $n$)
for the task of probabilistic language modeling, or predicting the next word in a text. We will make use
of a standard toolkit called SRILM, and we will explore different orders of $n$ as well as different smoothing techniques.
\paragraph{Acknowledgment:} Thanks to Emily Bender for letting us reuse
and modify an older lab.
\newsec{Data and preprocessing}{data}%
To train our language models, we will use the same collection of
Sherlock Holmes stories as in the previous exercises ({\tt
holmes.txt}). To evaluate the models, we will use three different
texts:
\begin{enumerate}[noitemsep,topsep=0.2cm]
\item \emph{His Last Bow}, a collection of Sherlock Holmes stories by
Conan Doyle (not included in {\tt holmes.txt})
\item \emph{The Lost World}, a novel by Conan Doyle
\item \emph{Stories by English Authors: London}, short stories written
around the same time as the other works
\end{enumerate}
These texts are in the files {\tt his-last-bow.txt}, {\tt
lost-world.txt}, and {\tt other-authors.txt}, respectively.
\paragraph{Note on tokenization:} The SRILM toolkit requires tokenized
text in a different format from the one we have been using so far.
Instead of having one token per line and blank lines to separate
sentences, we should now have one sentence per line and spaces to
separate tokens. To tokenize the texts, you should use {\tt
tokenizer6.py}, which performs exactly the same tokenization as {\tt
tokenizer5.py} but produces its output in the format required by
SRILM.\\
You should tokenize the three texts as well as {\tt holmes.txt}.
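To make the expected format concrete, here is a made-up two-sentence
snippet in the old one-token-per-line format and in the SRILM format:
\begin{verbatim}
old format:          SRILM format:
Holmes               Holmes smiled .
smiled               Watson nodded .
.

Watson
nodded
.
\end{verbatim}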
\newsec{Train a language model}{train}%
To train a language model on {\tt holmes.txt}, we need to run a
command like the following:
\begin{verbatim}
ngram-count -order 2 -text holmes-tok.txt -addsmooth 0.01 -lm bigram-add
\end{verbatim}
This command tells SRILM to count bigrams ({\tt -order~2}) in the training file ({\tt -text~holmes-tok.txt})
and to create a language model with additive smoothing ({\tt -addsmooth~0.01}), saving it in a file called {\tt bigram-add} ({\tt -lm~bigram-add}).
By varying parameters such as the $n$-gram order and the smoothing method, we can create different language models.
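For example, keeping the additive smoothing but moving from bigrams to
trigrams only requires changing the order (the output file name {\tt
trigram-add}, like {\tt bigram-add} above, is ours to choose):
\begin{verbatim}
ngram-count -order 3 -text holmes-tok.txt -addsmooth 0.01 -lm trigram-add
\end{verbatim}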
\newsec{Evaluate a language model}{eval}%
To evaluate a language model created as above on {\tt
his-last-bow.txt}, we run:
\begin{verbatim}
ngram -order 2 -lm bigram-add -ppl his-last-bow-tok.txt
\end{verbatim}
This command tells SRILM to use a bigram model ({\tt -order~2}) based
on the previously stored model ({\tt -lm~bigram-add}) to estimate the
perplexity of the test file ({\tt -ppl~his-last-bow-tok.txt}). The
output generated by this command should be something like:
\begin{verbatim}
file his-last-bow-tok.txt: 5080 sentences, 81393 words, 2729 OOVs
0 zeroprobs, logprob= -177261 ppl= 130.829 ppl1= 179.224
\end{verbatim}
This tells us that the test file contained 5080 sentences and 81393
words, of which 2729 were out-of-vocabulary (OOV), meaning that they
did not occur in the training set. The model did not assign zero
probability to any token (thanks to the smoothing); the log probability was
$-$177261, and the perplexity (ppl) was 130.829.\footnote{The second
perplexity measure (ppl1) is computed without end-of-sentence tokens
and can be ignored for now.} Remember that perplexity, as well as
the related entropy measure, tells us how surprised or confused the
model is when seeing the test data, so lower perplexity is better.
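If you want to check the arithmetic: SRILM reports base-10 log
probabilities, and for {\tt ppl} it appears to normalize by the number of
scored tokens, i.e.\ words minus OOVs plus end-of-sentence tokens, which is
consistent with the output above:
\[
\mathrm{ppl} = 10^{-\frac{\log_{10} P(\mathrm{test})}{81393 - 2729 + 5080}}
             = 10^{177261/83744} \approx 130.8
\]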
\newsec{Explore different test sets}{sets}%
Our first test set is drawn from the same author and the same genre as
our training set. What happens if we apply the model to the other two
test sets? What happens if we apply the model to the training set
itself? Before running anything, think about what will happen and discuss
it with your classmates. Write down your expectations, e.g.\ ``I think
perplexity will be lower on test set X than on test set Y.''
Then run the experiments. Do the results match your expectations?
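Concretely, each comparison is the same evaluation command with a
different {\tt -ppl} file; assuming the tokenized files are named
analogously to {\tt his-last-bow-tok.txt}, for example:
\begin{verbatim}
ngram -order 2 -lm bigram-add -ppl lost-world-tok.txt
ngram -order 2 -lm bigram-add -ppl other-authors-tok.txt
ngram -order 2 -lm bigram-add -ppl holmes-tok.txt
\end{verbatim}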
\newsec{Explore different language models}{models}%
Our first language model was a bigram model with simple additive
smoothing. What happens if we change the $n$-gram order? Explore at
least unigrams and trigrams on all three test sets with the same
smoothing method. Decide which model works best and switch the
smoothing method for this model to Kneser-Ney smoothing (a type of
backoff smoothing).
Collect the different perplexities in a table to make the results easier
to compare.
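If you are typesetting your write-up in \LaTeX, one possible skeleton for
such a table (models as rows, test sets as columns; the row labels are
just suggestions, and the cells are yours to fill in) is:
\begin{verbatim}
\begin{tabular}{l|rrr}
model             & his-last-bow & lost-world & other-authors \\
\hline
unigram, add 0.01 &              &            &               \\
bigram, add 0.01  &              &            &               \\
trigram, add 0.01 &              &            &               \\
best model, KN    &              &            &               \\
\end{tabular}
\end{verbatim}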
\paragraph{Note on SRILM options:} To change the $n$-gram order,
simply change {\tt -order~2} to {\tt -order~$n$} (for whatever $n$
you want to try). Please note that the $n$-gram order used for
testing cannot be higher than the $n$-gram order used for training
(but it can be lower). To change the smoothing method, replace {\tt
-addsmooth~0.01} with {\tt -kndiscount} (for Kneser-Ney
discounting). If you want to explore further options, consult the
SRILM manual at {\small {\tt
http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html}}.
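Putting this together, training and evaluating a trigram model with
Kneser-Ney smoothing would look something like this ({\tt trigram-kn} is
again just an illustrative file name):
\begin{verbatim}
ngram-count -order 3 -text holmes-tok.txt -kndiscount -lm trigram-kn
ngram -order 3 -lm trigram-kn -ppl his-last-bow-tok.txt
\end{verbatim}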
\end{document}