Paraphrase extraction using bilingual parallel corpora was proposed by
\newcite{Callison-Burch2005}, who induced paraphrases using techniques
from {\it phrase-based} statistical machine translation
\cite{Koehn2003}. Once a bilingual phrase table has been extracted,
English paraphrases can be obtained by pivoting through
foreign-language phrases.
The phrase table contains phrase pairs $(e, f)$ (where $e$ and $f$
stand for English and foreign phrases, respectively) as well as the
bi-directional translation model probabilities $p(f|e)$ and $p(e|f)$.
Since many paraphrases can be extracted for a given phrase,
\newcite{Callison-Burch2005} rank them using a paraphrase probability
defined in terms of these translation model probabilities:
\begin{eqnarray}
p(e_2|e_1) &=& \sum_f p(e_2,f|e_1)\\
&=& \sum_f p(e_2|f,e_1) p(f|e_1) \\
&\approx& \sum_f p(e_2|f) p(f|e_1)
\label{paraphrase_prob_eqn}
\end{eqnarray}
The resulting paraphrases are further re-ranked using contextual
features such as a language model score.
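To make the pivot computation concrete, the following sketch scores
candidate paraphrases according to
Equation~(\ref{paraphrase_prob_eqn}). It is an illustrative toy
rather than the original system: the phrase-table encoding (a
dictionary mapping $(e, f)$ pairs to the probability pair
$(p(f|e), p(e|f))$) and all identifiers are our own.
\begin{verbatim}
# Illustrative sketch of pivot-based paraphrase scoring; not the
# original implementation. The phrase table maps (e, f) pairs to
# the probability pair (p(f|e), p(e|f)).
from collections import defaultdict

def paraphrase_probs(phrase_table, e1):
    """Score candidates e2 by summing p(e2|f) * p(f|e1) over f."""
    by_foreign = defaultdict(list)        # f -> [(e, p(e|f)), ...]
    for (e, f), (p_f_e, p_e_f) in phrase_table.items():
        by_foreign[f].append((e, p_e_f))
    scores = defaultdict(float)
    for (e, f), (p_f_e, _) in phrase_table.items():
        if e != e1:
            continue
        for e2, p_e2_f in by_foreign[f]:
            if e2 != e1:
                scores[e2] += p_e2_f * p_f_e
    return dict(scores)

table = {("under control", "unter kontrolle"): (0.6, 0.5),
         ("in check",      "unter kontrolle"): (0.4, 0.3)}
print(paraphrase_probs(table, "under control"))
# {'in check': 0.18}, i.e. p(e2|f) * p(f|e1) = 0.3 * 0.6
\end{verbatim}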
\newcite{Callison-Burch2008} improves on this method by requiring
paraphrases to be of the same syntactic constituent type as the
original phrase, resulting in less noisy and more grammatical
paraphrases.
\newcite{Madnani2007} apply the pivot technique to {\it hierarchical}
phrase-based machine translation \cite{Chiang2005}. Hierarchical
phrases contain a variable ``X'' that allows for slot-fillers.
\newcite{Madnani2007}'s paraphrase table contains these slotted
patterns as well. Because hierarchical phrase-based machine
translation is formally based on a synchronous context-free grammar
(SCFG), \newcite{Madnani2007}'s paraphrase table can be thought of as
a {\it paraphrase grammar}, which can paraphrase (or ``decode'')
input sentences using an SCFG decoder such as the Hiero MT system.
\newcite{Madnani2007} mirror Hiero's log-linear model and its
feature set. The parameters for the log-linear model are estimated
using minimum error rate training, maximizing the BLEU metric on a set
of parallel English sentences. The authors report significant gains in
translation quality when using additional references generated by
paraphrasing to tune a machine translation system.
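As a rough illustration of how the pivot extends to slotted rules,
the sketch below pairs hierarchical patterns that share a foreign
side. The rule encoding and the example patterns are invented for
exposition and do not reproduce \newcite{Madnani2007}'s grammar or
its feature scores.
\begin{verbatim}
# Illustrative sketch of pivoting slotted (hierarchical) patterns.
# Rules are (english_pattern, foreign_pattern) pairs that share the
# slot variables X1, X2; the example rules below are invented.
from collections import defaultdict

def pivot_patterns(rules):
    """Pair English patterns that translate to the same foreign
    pattern, yielding candidate paraphrase rules."""
    by_foreign = defaultdict(set)
    for e_pat, f_pat in rules:
        by_foreign[f_pat].add(e_pat)
    for e_pats in by_foreign.values():
        for e1 in e_pats:
            for e2 in e_pats:
                if e1 != e2:
                    yield (e1, e2)

rules = [("the X1 of X2", "le X1 de X2"),
         ("X2 's X1",     "le X1 de X2")]
print(sorted(pivot_patterns(rules)))
# [("X2 's X1", 'the X1 of X2'), ('the X1 of X2', "X2 's X1")]
\end{verbatim}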
\newcite{Zhao2008} further enrich the pivot-based paraphrasing
approach with syntactic information by extracting partial subtrees
from the dependency-parsed English side of a bitext and pivoting over
the corresponding Chinese phrases to obtain paraphrases. The slots in the
resulting patterns are labeled with part-of-speech tags (but not
larger syntactic constituents). Their system also employs a log-linear
model that combines translation and lexical probabilities and is tuned
to maximize precision over a hand-labeled set of paraphrases.
Several research efforts have leveraged parallel monolingual corpora;
however, they all suffer from the scarcity and noisiness of such
corpora. \newcite{Dolan2004} work around this issue by extracting
parallel sentences from the vast amount of freely available comparable
English text and applying machine translation techniques to create a
paraphrasing system \cite{Quirk2004}. However, the word-based
translation model and monotone decoder they use result in a
substantial number of identity paraphrases or single-word
substitutions.
Relying on small data sets of semantically equivalent translations,
\newcite{Pang2003} create finite-state automata by syntax-aligning
parallel sentences, enabling the generation of additional reference
translations.
Both \newcite{Barzilay2001} and \newcite{Ibrahim2003} sentence-align
existing noisy parallel monolingual corpora such as translations of
the same novels. While \newcite{Ibrahim2003} employ a set of
heuristics that rely on anchor words identified by textual identity
or matching linguistic features such as gender, number, or semantic
class, \newcite{Barzilay2001} use a co-training approach that
leverages context similarity to identify viable paraphrases.
Semantic parallelism is well-established as a strong basis for the
extraction of correspondences such as paraphrases. However, there are
notable efforts that forgo it in favor of clustering approaches based
on distributional characteristics. The well-known DIRT method by
\newcite{Lin2001} relies entirely on distributional similarity
features for paraphrase extraction: patterns extracted from paths in
dependency graphs are clustered based on the similarity of the
observed fillers of their slots.
Similarly, \newcite{Bhagat2008} argue that vast amounts of text can
be leveraged to make up for the relative weakness of distributional
features compared to parallelism. They also forgo complex annotations
such as syntactic or dependency parses, relying only on
part-of-speech tags to inform their approach. In their work,
relations are learned by finding pattern clusters that are seeded
with already-known patterns. However, this method cannot produce
syntactic paraphrases. \mnote{Need better tie-in with the overall
theme of structural paraphrases.}
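As a rough illustration of such distributional clustering, the sketch
below compares two dependency paths by the similarity of their
observed slot fillers. It is a simplification: plain cosine over
slot-filler counts stands in for DIRT's mutual-information-based
measure, and the example counts are invented.
\begin{verbatim}
# Illustrative sketch of DIRT-style slot similarity (simplified:
# cosine over slot-filler counts instead of DIRT's pointwise
# mutual information). All counts below are invented.
import math
from collections import Counter

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def path_similarity(p1, p2):
    """Average the similarity of the X slots and of the Y slots."""
    return 0.5 * (cosine(p1["X"], p2["X"]) + cosine(p1["Y"], p2["Y"]))

solves = {"X": Counter({"committee": 3, "government": 2}),
          "Y": Counter({"problem": 4, "crisis": 1})}
resolves = {"X": Counter({"government": 3, "court": 1}),
            "Y": Counter({"problem": 2, "dispute": 2})}
print(path_similarity(solves, resolves))   # approx. 0.606
\end{verbatim}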