PAPERS2017.md
Table of Contents

Articles

2017-01

A Simple and Accurate Syntax-Agnostic Neural Model for Dependency-based Semantic Role Labeling

Authors: Diego Marcheggiani, Anton Frolov, Ivan Titov

Abstract: We introduce a simple and accurate neural model for dependency-based semantic role labeling. Our model predicts predicate-argument dependencies relying on states of a bidirectional LSTM encoder. The semantic role labeler achieves respectable performance on English even without any kind of syntactic information and only using local inference. However, when automatically predicted part-of-speech tags are provided as input, it substantially outperforms all previous local models and approaches the best reported results on the CoNLL-2009 dataset. Syntactic parsers are unreliable on out-of-domain data, so standard (i.e. syntactically-informed) SRL models are hindered when tested in this setting. Our syntax-agnostic model appears more robust, resulting in the best reported results on the standard out-of-domain test set.

URL: https://arxiv.org/abs/1701.02593

Notes: New paper from Russian guys on semantic role labeling (PoS tags are only used as input features); could be useful in dialog tracking, maybe.

Reinforcement Learning via Recurrent Convolutional Neural Networks

Authors: Tanmay Shankar, Santosha K. Dwivedy, Prithwijit Guha

Abstract: Deep Reinforcement Learning has enabled the learning of policies for complex tasks in partially observable environments, without explicitly learning the underlying model of the tasks. While such model-free methods achieve considerable performance, they often ignore the structure of the task. We present a natural representation of Reinforcement Learning (RL) problems using Recurrent Convolutional Neural Networks (RCNNs), to better exploit this inherent structure. We define 3 such RCNNs, whose forward passes execute an efficient Value Iteration, propagate beliefs of state in partially observable environments, and choose optimal actions respectively. Backpropagating gradients through these RCNNs allows the system to explicitly learn the Transition Model and Reward Function associated with the underlying MDP, serving as an elegant alternative to classical model-based RL. We evaluate the proposed algorithms in simulation, considering a robot planning problem. We demonstrate the capability of our framework to reduce the cost of replanning, learn accurate MDP models, and finally re-plan with learnt models to achieve near-optimal policies.

URL: https://arxiv.org/abs/1701.02392

Notes: Recurrent CNNs are a new trend; they have proven to be a useful tool in NLP and now in RL, so I couldn't miss this one.

Visualizing Residual Networks

Authors: Brian Chu, Daylen Yang, Ravi Tadinada

Abstract: Residual networks are the current state of the art on ImageNet. Similar work in the direction of utilizing shortcut connections has been done extremely recently with derivatives of residual networks and with highway networks. This work potentially challenges our understanding that CNNs learn layers of local features that are followed by increasingly global features. Through qualitative visualization and empirical analysis, we explore the purpose that residual skip connections serve. Our assessments show that the residual shortcut connections force layers to refine features, as expected. We also provide alternate visualizations that confirm that residual networks learn what is already intuitively known about CNNs in general.

URL: https://arxiv.org/abs/1701.02362

Notes: The title speaks for itself; could be useful since residual connections are now used everywhere: RNNs, CNNs, etc.

Multi-level Representations for Fine-Grained Typing of Knowledge Base Entities

Authors: Yadollah Yaghoobzadeh, Hinrich Schütze

Abstract: Entities are essential elements of natural language. In this paper, we present methods for learning multi-level representations of entities on three complementary levels: character (character patterns in entity names extracted, e.g., by neural networks), word (embeddings of words in entity names) and entity (entity embeddings). We investigate state-of-the-art learning methods on each level and find large differences, e.g., for deep learning models, traditional ngram features and the subword model of fasttext (Bojanowski et al., 2016) on the character level; for word2vec (Mikolov et al., 2013) on the word level; and for the order-aware model wang2vec (Ling et al., 2015a) on the entity level. We confirm experimentally that each level of representation contributes complementary information and a joint representation of all three levels improves the existing embedding based baseline for fine-grained entity typing by a large margin. Additionally, we show that adding information from entity descriptions further improves multi-level representations of entities.

URL: https://arxiv.org/abs/1701.02025

Notes: Fresh paper from Schütze; the triad of character, word, and entity seems to be part of the NLP Holy Grail.

Neural Personalized Response Generation as Domain Adaptation

Authors: Weinan Zhang, Ting Liu, Yifa Wang, Qingfu Zhu

Abstract: In this paper, we focus on the personalized response generation for conversational systems. Based on the sequence to sequence learning, especially the encoder-decoder framework, we propose a two-phase approach, namely initialization then adaptation, to model the responding style of human and then generate personalized responses. For evaluation, we propose a novel human aided method to evaluate the performance of the personalized response generation models by online real-time conversation and offline human judgement. Moreover, the lexical divergence of the responses generated by the 5 personalized models indicates that the proposed two-phase approach achieves good results on modeling the responding style of human and generating personalized responses for the conversational systems.

URL: https://arxiv.org/abs/1701.02073

Notes: Personalized answers are a really important part of seamless conversation; learning the style from responses is a nice idea.

Task-Specific Attentive Pooling of Phrase Alignments Contributes to Sentence Matching

Authors: Wenpeng Yin, Hinrich Schütze

Abstract: This work studies comparatively two typical sentence matching tasks: textual entailment (TE) and answer selection (AS), observing that weaker phrase alignments are more critical in TE, while stronger phrase alignments deserve more attention in AS. The key to reach this observation lies in phrase detection, phrase representation, phrase alignment, and more importantly how to connect those aligned phrases of different matching degrees with the final classifier. Prior work (i) has limitations in phrase generation and representation, or (ii) conducts alignment at word and phrase levels by handcrafted features or (iii) utilizes a single framework of alignment without considering the characteristics of specific tasks, which limits the framework's effectiveness across tasks. We propose an architecture based on Gated Recurrent Unit that supports (i) representation learning of phrases of arbitrary granularity and (ii) task-specific attentive pooling of phrase alignments between two sentences. Experimental results on TE and AS match our observation and show the effectiveness of our approach.

URL: https://arxiv.org/abs/1701.02149

Notes: attentive pooling for NLP tasks is very hot

Structural Attention Neural Networks for improved sentiment analysis

Authors: Filippos Kokkinos, Alexandros Potamianos

Abstract: We introduce a tree-structured attention neural network for sentences and small phrases and apply it to the problem of sentiment classification. Our model expands the current recursive models by incorporating structural information around a node of a syntactic tree using both bottom-up and top-down information propagation. Also, the model utilizes structural attention to identify the most salient representations during the construction of the syntactic tree. To our knowledge, the proposed models achieve state of the art performance on the Stanford Sentiment Treebank dataset.

URL: https://arxiv.org/abs/1701.01811

Notes: one more attention type

Self-Taught Convolutional Neural Networks for Short Text Clustering

Authors: Jiaming Xu, Bo Xu, Peng Wang, Suncong Zheng, Guanhua Tian, Jun Zhao, Bo Xu

Abstract: Short text clustering is a challenging problem due to its sparseness of text representation. Here we propose a flexible Self-Taught Convolutional neural network framework for Short Text Clustering (dubbed STC^2), which can flexibly and successfully incorporate more useful semantic features and learn non-biased deep text representation in an unsupervised manner. In our framework, the original raw text features are firstly embedded into compact binary codes by using an existing unsupervised dimensionality reduction method. Then, word embeddings are explored and fed into convolutional neural networks to learn deep feature representations, meanwhile the output units are used to fit the pre-trained binary codes in the training process. Finally, we get the optimal clusters by employing K-means to cluster the learned representations. Extensive experimental results demonstrate that the proposed framework is effective, flexible and outperforms several popular clustering methods when tested on three public short text datasets.

URL: https://arxiv.org/abs/1701.00185

Notes: unsupervised clustering by CNNs!

Textual Entailment with Structured Attentions and Composition

Authors: Kai Zhao, Liang Huang, Mingbo Ma

Abstract: Deep learning techniques are increasingly popular in the textual entailment task, overcoming the fragility of traditional discrete models with hard alignments and logics. In particular, the recently proposed attention models (Rocktäschel et al., 2015; Wang and Jiang, 2015) achieve state-of-the-art accuracy by computing soft word alignments between the premise and hypothesis sentences. However, there remains a major limitation: this line of work completely ignores syntax and recursion, which is helpful in many traditional efforts. We show that it is beneficial to extend the attention model to tree nodes between premise and hypothesis. More importantly, this subtree-level attention reveals information about entailment relation. We study the recursive composition of this subtree-level entailment relation, which can be viewed as a soft version of the Natural Logic framework (MacCartney and Manning, 2009). Experiments show that our structured attention and entailment composition model can correctly identify and infer entailment relations from the bottom up, and bring significant improvements in accuracy.

URL: https://arxiv.org/abs/1701.01126

Notes: yet another attention

Unsupervised neural and Bayesian models for zero-resource speech processing

Authors: Herman Kamper

Abstract: In settings where only unlabelled speech data is available, zero-resource speech technology needs to be developed without transcriptions, pronunciation dictionaries, or language modelling text. There are two central problems in zero-resource speech processing: (i) finding frame-level feature representations which make it easier to discriminate between linguistic units (phones or words), and (ii) segmenting and clustering unlabelled speech into meaningful units. In this thesis, we argue that a combination of top-down and bottom-up modelling is advantageous in tackling these two problems. To address the problem of frame-level representation learning, we present the correspondence autoencoder (cAE), a neural network trained with weak top-down supervision from an unsupervised term discovery system. By combining this top-down supervision with unsupervised bottom-up initialization, the cAE yields much more discriminative features than previous approaches. We then present our unsupervised segmental Bayesian model that segments and clusters unlabelled speech into hypothesized words. By imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, our system outperforms several others on multi-speaker conversational English and Xitsonga speech data. Finally, we show that the clusters discovered by the segmental Bayesian model can be made less speaker- and gender-specific by using features from the cAE instead of traditional acoustic features. In summary, the different models and systems presented in this thesis show that both top-down and bottom-up modelling can improve representation learning, segmentation and clustering of unlabelled speech data.

URL: https://arxiv.org/abs/1701.00851

Notes: Unsupervised neural vs. Bayesian approaches to speech processing.

Dense Associative Memory is Robust to Adversarial Inputs

Authors: Dmitry Krotov, John J Hopfield

Abstract: Deep neural networks (DNN) trained in a supervised way suffer from two known problems. First, the minima of the objective function used in learning correspond to data points (also known as rubbish examples or fooling images) that lack semantic similarity with the training data. Second, a clean input can be changed by a small, and often imperceptible for human vision, perturbation, so that the resulting deformed input is misclassified by the network. These findings emphasize the differences between the ways DNN and humans classify patterns, and raise a question of designing learning algorithms that more accurately mimic human perception compared to the existing methods. Our paper examines these questions within the framework of Dense Associative Memory (DAM) models. These models are defined by the energy function, with higher order (higher than quadratic) interactions between the neurons. We show that in the limit when the power of the interaction vertex in the energy function is sufficiently large, these models have the following three properties. First, the minima of the objective function are free from rubbish images, so that each minimum is a semantically meaningful pattern. Second, artificial patterns poised precisely at the decision boundary look ambiguous to human subjects and share aspects of both classes that are separated by that decision boundary. Third, adversarial images constructed by models with small power of the interaction vertex, which are equivalent to DNN with rectified linear units (ReLU), fail to transfer to and fool the models with higher order interactions. This opens up a possibility to use higher order models for detecting and stopping malicious adversarial attacks. The presented results suggest that DAM with higher order energy functions are closer to human visual perception than DNN with ReLUs.

URL: https://arxiv.org/abs/1701.00939

Notes: The memory paper from that Hopfield (yes, the Hopfield of Hopfield networks).

A K-fold Method for Baseline Estimation in Policy Gradient Algorithms

Authors: Nithyanand Kota, Abhishek Mishra, Sunil Srinivasa, Xi (Peter) Chen, Pieter Abbeel

Abstract: The high variance issue in unbiased policy-gradient methods such as VPG and REINFORCE is typically mitigated by adding a baseline. However, the baseline fitting itself suffers from the underfitting or the overfitting problem. In this paper, we develop a K-fold method for baseline estimation in policy gradient algorithms. The parameter K is the baseline estimation hyperparameter that can adjust the bias-variance trade-off in the baseline estimates. We demonstrate the usefulness of our approach via two state-of-the-art policy gradient algorithms on three MuJoCo locomotive control tasks.

URL: https://arxiv.org/abs/1701.00867

Notes: Simple baseline for policy gradient
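
The mechanics are simple enough to sketch. A rough numpy version of the K-fold idea, assuming a linear baseline over state features; function names and the toy data are mine, not from the paper:

```python
import numpy as np

def kfold_baseline_advantages(states, returns, K=5):
    """Estimate advantages with a K-fold-fitted linear baseline.

    states:  (N, d) array of state features for N collected timesteps
    returns: (N,) array of empirical returns from those timesteps
    The baseline for each fold is fit only on the other K-1 folds,
    so no advantage estimate uses a baseline fit on its own data.
    """
    N = len(returns)
    idx = np.random.permutation(N)
    folds = np.array_split(idx, K)
    advantages = np.empty(N)
    X = np.hstack([states, np.ones((N, 1))])  # add a bias feature
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        # least-squares linear baseline V(s) ~ w^T phi(s)
        w, *_ = np.linalg.lstsq(X[train], returns[train], rcond=None)
        advantages[test] = returns[test] - X[test] @ w
    return advantages

# toy usage
states = np.random.randn(1000, 8)
returns = states @ np.random.randn(8) + 0.1 * np.random.randn(1000)
adv = kfold_baseline_advantages(states, returns, K=5)
print(adv.std())
```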

Generating Long and Diverse Responses with Neural Conversation Models

Authors: Louis Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, Ray Kurzweil

Abstract: Building general-purpose conversation agents is a very challenging task, but necessary on the road toward intelligent agents that can interact with humans in natural language. Neural conversation models — purely data-driven systems trained end-to-end on dialogue corpora — have shown great promise recently, yet they often produce short and generic responses. This work presents new training and decoding methods that improve the quality, coherence, and diversity of long responses generated using sequence-to-sequence models. Our approach adds self-attention to the decoder to maintain coherence in longer responses, and we propose a practical approach, called the glimpse-model, for scaling to large datasets. We introduce a stochastic beam-search algorithm with segment-by-segment reranking which lets us inject diversity earlier in the generation process. We trained on a combined data set of over 2.3B conversation messages mined from the web. In human evaluation studies, our method produces longer responses overall, with a higher proportion rated as acceptable and excellent as length increases, compared to baseline sequence-to-sequence models with explicit length-promotion. A back-off strategy produces better responses overall, in the full spectrum of lengths.

URL: https://arxiv.org/abs/1701.03185

Notes: More diversity for responses; we should look into this work, since plain beam search isn't satisfying.

Simplified Gating in Long Short-term Memory (LSTM) Recurrent Neural Networks

Authors: Yuzhen Lu, Fathi M. Salem

Abstract: The standard LSTM recurrent neural networks while very powerful in long-range dependency sequence applications have highly complex structure and relatively large (adaptive) parameters. In this work, we present empirical comparison between the standard LSTM recurrent neural network architecture and three new parameter-reduced variants obtained by eliminating combinations of the input signal, bias, and hidden unit signals from individual gating signals. The experiments on two sequence datasets show that the three new variants, called simply as LSTM1, LSTM2, and LSTM3, can achieve comparable performance to the standard LSTM model with less (adaptive) parameters.

URL: https://arxiv.org/abs/1701.03441

Notes: That's a pity, I had a similar idea; you need to be quick about trying out ideas these days!
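
A quick numpy sketch of what the reduced gating could look like; the variant names and their exact definitions are my reading of the abstract, so treat them as approximate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, p, variant="standard"):
    """One LSTM step; `variant` controls what the gates are allowed to see.

    p holds parameters: W* act on the input x, U* on the hidden state h,
    b* are biases. The cell candidate always uses x and h; only the three
    gates are simplified, which is where most of the parameters live.
    """
    def gate(W, U, b):
        if variant == "standard":
            return sigmoid(W @ x + U @ h + b)
        if variant == "no_input":        # roughly the paper's LSTM1
            return sigmoid(U @ h + b)
        if variant == "hidden_only":     # roughly LSTM2
            return sigmoid(U @ h)
        if variant == "bias_only":       # roughly LSTM3
            return sigmoid(b)
        raise ValueError(variant)

    i = gate(p["Wi"], p["Ui"], p["bi"])               # input gate
    f = gate(p["Wf"], p["Uf"], p["bf"])               # forget gate
    o = gate(p["Wo"], p["Uo"], p["bo"])               # output gate
    g = np.tanh(p["Wg"] @ x + p["Ug"] @ h + p["bg"])  # cell candidate
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# toy usage
d = 4
p = {k: np.random.randn(d, d) for k in ["Wi", "Ui", "Wf", "Uf", "Wo", "Uo", "Wg", "Ug"]}
p.update({k: np.zeros(d) for k in ["bi", "bf", "bo", "bg"]})
h, c = lstm_step(np.random.randn(d), np.zeros(d), np.zeros(d), p, variant="bias_only")
```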

Modularized Morphing of Neural Networks

Authors: Tao Wei, Changhu Wang, Chang Wen Chen

Abstract: In this work we study the problem of network morphism, an effective learning scheme to morph a well-trained neural network to a new one with the network function completely preserved. Different from existing work where basic morphing types on the layer level were addressed, we target at the central problem of network morphism at a higher level, i.e., how a convolutional layer can be morphed into an arbitrary module of a neural network. To simplify the representation of a network, we abstract a module as a graph with blobs as vertices and convolutional layers as edges, based on which the morphing process is able to be formulated as a graph transformation problem. Two atomic morphing operations are introduced to compose the graphs, based on which modules are classified into two families, i.e., simple morphable modules and complex modules. We present practical morphing solutions for both of these two families, and prove that any reasonable module can be morphed from a single convolutional layer. Extensive experiments have been conducted based on the state-of-the-art ResNet on benchmark datasets, and the effectiveness of the proposed solution has been verified.

URL: https://arxiv.org/abs/1701.03281

Notes: Modularization is a fresh idea; I can't yet tell morphing apart from brain damage (zeroing small weights).

A Copy-Augmented Sequence-to-Sequence Architecture Gives Good Performance on Task-Oriented Dialogue

Authors: Mihail Eric, Christopher D. Manning

Abstract: Task-oriented dialogue focuses on conversational agents that participate in user-initiated dialogues on domain-specific topics. In contrast to chatbots, which simply seek to sustain open-ended meaningful discourse, existing task-oriented agents usually explicitly model user intent and belief states. This paper examines bypassing such an explicit representation by depending on a latent neural embedding of state and learning selective attention to dialogue history together with copying to incorporate relevant prior context. We complement recent work by showing the effectiveness of simple sequence-to-sequence neural architectures with a copy mechanism. Our model outperforms more complex memory-augmented models by 7% in per-response generation and is on par with the current state-of-the-art on DSTC2.

URL: https://arxiv.org/abs/1701.04024

Notes: Seq2Seq still has some tricks up its sleeve; copying from the context is a bright idea.
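
Not necessarily how Eric & Manning wire it exactly, but the generic copy/generation mixture (pointer-generator style) looks roughly like this numpy sketch; all names here are mine:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def copy_augmented_distribution(vocab_logits, attn_scores, src_token_ids,
                                p_gen, vocab_size):
    """Mix a generation distribution with a copy distribution.

    vocab_logits:  (V,) decoder logits over the vocabulary
    attn_scores:   (T,) attention scores over the dialogue-history tokens
    src_token_ids: (T,) vocabulary ids of those history tokens
    p_gen:         scalar in (0, 1), probability of generating vs. copying
    """
    p_vocab = softmax(vocab_logits)
    p_attn = softmax(attn_scores)
    p_copy = np.zeros(vocab_size)
    np.add.at(p_copy, src_token_ids, p_attn)   # scatter-add attention mass
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy

# toy usage
V = 10
dist = copy_augmented_distribution(np.random.randn(V), np.random.randn(4),
                                   np.array([2, 5, 5, 7]), p_gen=0.7,
                                   vocab_size=V)
print(dist.sum())  # ~1.0
```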

Dialog Context Language Modeling with Recurrent Neural Networks

Authors: Bing Liu, Ian Lane

Abstract: In this work, we propose contextual language models that incorporate dialog level discourse information into language modeling. Previous works on contextual language model treat preceding utterances as a sequence of inputs, without considering dialog interactions. We design recurrent neural network (RNN) based contextual language models that specially track the interactions between speakers in a dialog. Experiment results on Switchboard Dialog Act Corpus show that the proposed model outperforms conventional single turn based RNN language model by 3.3% on perplexity. The proposed models also demonstrate advantageous performance over other competitive contextual language models.

URL: https://arxiv.org/abs/1701.04056

Notes: Another context incorporation with RNN

Neural Models for Sequence Chunking

Authors: Feifei Zhai, Saloni Potdar, Bing Xiang, Bowen Zhou

Abstract: Many natural language understanding (NLU) tasks, such as shallow parsing (i.e., text chunking) and semantic slot filling, require the assignment of representative labels to the meaningful chunks in a sentence. Most of the current deep neural network (DNN) based methods consider these tasks as a sequence labeling problem, in which a word, rather than a chunk, is treated as the basic unit for labeling. These chunks are then inferred by the standard IOB (Inside-Outside-Beginning) labels. In this paper, we propose an alternative approach by investigating the use of DNN for sequence chunking, and propose three neural models so that each chunk can be treated as a complete unit for labeling. Experimental results show that the proposed neural sequence chunking models can achieve state-of-the-art performance on both the text chunking and slot filling tasks.

URL: https://arxiv.org/abs/1701.04027

Notes: Sequence chunking is a common task for Asian languages in the first place, but since we are going character-level, it matters for European ones too.

Understanding the Effective Receptive Field in Deep Convolutional Neural Networks

Authors: Wenjie Luo, Yujia Li, Raquel Urtasun, Richard Zemel

Abstract: We study characteristics of receptive fields of units in deep convolutional networks. The receptive field size is a crucial issue in many visual tasks, as the output must respond to large enough areas in the image to capture information about large objects. We introduce the notion of an effective receptive field, and show that it both has a Gaussian distribution and only occupies a fraction of the full theoretical receptive field. We analyze the effective receptive field in several architecture designs, and the effect of nonlinear activations, dropout, sub-sampling and skip connections on it. This leads to suggestions for ways to address its tendency to be too small.

URL: https://arxiv.org/abs/1701.04128

Notes: A topic I was always curious about; should read it carefully, since CNNs are now on the rise in NLP.
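
A tiny PyTorch experiment (my own toy setup, not the paper's) makes the effect visible: take the gradient of a single centre output unit with respect to the input and see how concentrated it is:

```python
import torch
import torch.nn as nn

# Stack of 3x3 convs; the theoretical RF of a centre output unit is 2*n_layers + 1.
n_layers = 10
net = nn.Sequential(*[nn.Conv2d(1, 1, 3, padding=1) for _ in range(n_layers)])

x = torch.zeros(1, 1, 64, 64, requires_grad=True)
y = net(x)
# Gradient of the single centre output with respect to every input pixel.
y[0, 0, 32, 32].backward()
erf = x.grad.abs()[0, 0]

# Compare the gradient mass in a small central patch against the whole
# 21x21 theoretical receptive field; most of it sits near the centre.
inside = erf[32 - 2:32 + 3, 32 - 2:32 + 3].sum() / erf.sum()
print("fraction of gradient mass in the central 5x5 patch:", float(inside))
```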

Agent-Agnostic Human-in-the-Loop Reinforcement Learning

Authors: David Abel, John Salvatier, Andreas Stuhlmüller, Owain Evans

Abstract: Providing Reinforcement Learning agents with expert advice can dramatically improve various aspects of learning. Prior work has developed teaching protocols that enable agents to learn efficiently in complex environments; many of these methods tailor the teacher's guidance to agents with a particular representation or underlying learning scheme, offering effective but specialized teaching procedures. In this work, we explore protocol programs, an agent-agnostic schema for Human-in-the-Loop Reinforcement Learning. Our goal is to incorporate the beneficial properties of a human teacher into Reinforcement Learning without making strong assumptions about the inner workings of the agent. We show how to represent existing approaches such as action pruning, reward shaping, and training in simulation as special cases of our schema and conduct preliminary experiments on simple domains.

URL: https://arxiv.org/abs/1701.04079

Notes: Next step for the FAIR human-in-the-loop approach.

Minimally Naturalistic Artificial Intelligence

Authors: Steven Stenberg Hansen

Abstract: The rapid advancement of machine learning techniques has re-energized research into general artificial intelligence. While the idea of domain-agnostic meta-learning is appealing, this emerging field must come to terms with its relationship to human cognition and the statistics and structure of the tasks humans perform. The position of this article is that only by aligning our agents' abilities and environments with those of humans do we stand a chance at developing general artificial intelligence (GAI). A broad reading of the famous 'No Free Lunch' theorem is that there is no universally optimal inductive bias or, equivalently, bias-free learning is impossible. This follows from the fact that there are an infinite number of ways to extrapolate data, any of which might be the one used by the data generating environment; an inductive bias prefers some of these extrapolations to others, which lowers performance in environments using these adversarial extrapolations. We may posit that the optimal GAI is the one that maximally exploits the statistics of its environment to create its inductive bias; accepting the fact that this agent is guaranteed to be extremely sub-optimal for some alternative environments. This trade-off appears benign when thinking about the environment as being the physical universe, as performance on any fictive universe is obviously irrelevant. But, we should expect a sharper inductive bias if we further constrain our environment. Indeed, we implicitly do so by defining GAI in terms of accomplishing tasks that humans consider useful. One common version of this is the need for 'common-sense reasoning', which implicitly appeals to the statistics of the physical universe as perceived by humans.

URL: https://arxiv.org/abs/1701.03868

Notes: The name seems a little bit too loud, but we should check what's inside.

Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks

Authors: Lars Mescheder, Sebastian Nowozin, Andreas Geiger

Abstract: Variational Autoencoders (VAEs) are expressive latent variable models that can be used to learn complex probability distributions from training data. However, the quality of the resulting model crucially relies on the expressiveness of the inference model used during training. We introduce Adversarial Variational Bayes (AVB), a technique for training Variational Autoencoders with arbitrarily expressive inference models. We achieve this by introducing an auxiliary discriminative network that allows to rephrase the maximum-likelihood-problem as a two-player game, hence establishing a principled connection between VAEs and Generative Adversarial Networks (GANs). We show that in the nonparametric limit our method yields an exact maximum-likelihood assignment for the parameters of the generative model, as well as the exact posterior distribution over the latent variables given an observation. Contrary to competing approaches which combine VAEs with GANs, our approach has a clear theoretical justification, retains most advantages of standard Variational Autoencoders and is easy to implement.

URL: https://arxiv.org/abs/1701.04722

Notes: Convergence of autoencoders & GANs, neat!

Adversarial Learning for Neural Dialogue Generation

Authors: Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, Dan Jurafsky

Abstract: In this paper, drawing intuition from the Turing test, we propose using adversarial training for open-domain dialogue generation: the system is trained to produce sequences that are indistinguishable from human-generated dialogue utterances. We cast the task as a reinforcement learning (RL) problem where we jointly train two systems, a generative model to produce response sequences, and a discriminator, analogous to the human evaluator in the Turing test, to distinguish between the human-generated dialogues and the machine-generated ones. The outputs from the discriminator are then used as rewards for the generative model, pushing the system to generate dialogues that mostly resemble human dialogues. In addition to adversarial training we describe a model for adversarial evaluation that uses success in fooling an adversary as a dialogue evaluation metric, while avoiding a number of potential pitfalls. Experimental results on several metrics, including adversarial evaluation, demonstrate that the adversarially-trained system generates higher-quality responses than previous baselines.

URL: https://arxiv.org/abs/1701.06547

Notes: HOT! GAN-RL!
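
The core training signal is easy to write down. A minimal PyTorch sketch of the REINFORCE-style generator update, assuming the reward is the discriminator's probability that the sampled response is human; the per-token reward shaping and Monte Carlo rollouts from the paper are omitted:

```python
import torch

def generator_rl_loss(token_logprobs, disc_prob_human, baseline=0.5):
    """REINFORCE-style loss for the response generator.

    token_logprobs:  (T,) log-probabilities the generator assigned to the
                     tokens of one sampled response
    disc_prob_human: scalar in (0, 1), discriminator's belief that the
                     sampled response is human-generated (the reward)
    baseline:        constant baseline to reduce variance
    """
    reward = disc_prob_human - baseline
    # Maximize the reward-weighted log-likelihood of the sampled response.
    return -(reward * token_logprobs.sum())

# toy usage with made-up numbers
logp = torch.log(torch.tensor([0.2, 0.5, 0.1]))
loss = generator_rl_loss(logp, disc_prob_human=torch.tensor(0.8))
print(loss)
```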

A Multichannel Convolutional Neural Network For Cross-language Dialog State Tracking

Authors: Hongjie Shi, Takashi Ushio, Mitsuru Endo, Katsuyoshi Yamagami, Noriaki Horii

Abstract: The fifth Dialog State Tracking Challenge (DSTC5) introduces a new cross-language dialog state tracking scenario, where the participants are asked to build their trackers based on the English training corpus, while evaluating them with the unlabeled Chinese corpus. Although the computer-generated translations for both English and Chinese corpus are provided in the dataset, these translations contain errors and careless use of them can easily hurt the performance of the built trackers. To address this problem, we propose a multichannel Convolutional Neural Networks (CNN) architecture, in which we treat English and Chinese language as different input channels of one single CNN model. In the evaluation of DSTC5, we found that such multichannel architecture can effectively improve the robustness against translation errors. Additionally, our method for DSTC5 is purely machine learning based and requires no prior knowledge about the target language. We consider this a desirable property for building a tracker in the cross-language context, as not every developer will be familiar with both languages.

URL: https://arxiv.org/abs/1701.06247

Notes: CNN for dialog state tracking

Learning to Decode for Future Success

Authors: Jiwei Li, Will Monroe, Dan Jurafsky

Abstract: We introduce a general strategy for improving neural sequence generation by incorporating knowledge about the future. Our decoder combines a standard sequence decoder with a 'soothsayer' prediction function Q that estimates the outcome in the future of generating a word in the present. Our model draws on the same intuitions as reinforcement learning, but is both simpler and higher performing, avoiding known problems with the use of reinforcement learning in tasks with enormous search spaces like sequence generation. We demonstrate our model by incorporating Q functions that incrementally predict what the future BLEU or ROUGE score of the completed sequence will be, its future length, and the backwards probability of the source given the future target sequence. Experimental results show that future prediction yields improved performance in abstractive summarization and conversational response generation and the state-of-the-art in machine translation, while also enabling the decoder to generate outputs that have specific properties.

URL: https://arxiv.org/abs/1701.06549

Notes: future prediction with Q-learning

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Authors: Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean

Abstract: The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.

URL: https://arxiv.org/abs/1701.06538

Notes: dropout analog from Dean & Hinton
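
A numpy sketch of my reading of the noisy top-k gating; the softplus noise scale follows the abstract's description of a trainable gating network, but the rest of the names are mine:

```python
import numpy as np

def noisy_top_k_gating(x, W_gate, W_noise, k=2):
    """Sparse gating: keep only the top-k experts per example.

    x:       (d,) input vector
    W_gate:  (d, n_experts) clean gating weights
    W_noise: (d, n_experts) weights controlling per-expert noise scale
    Returns a length-n_experts vector of weights, zero outside the top k.
    """
    clean = x @ W_gate
    noise = np.random.randn(clean.shape[0]) * np.log1p(np.exp(x @ W_noise))
    logits = clean + noise
    gates = np.full_like(logits, -np.inf)
    top = np.argsort(logits)[-k:]           # indices of the k largest logits
    gates[top] = logits[top]
    gates = np.exp(gates - gates[top].max())
    return gates / gates.sum()              # softmax over the kept experts only

def moe_layer(x, experts, W_gate, W_noise, k=2):
    g = noisy_top_k_gating(x, W_gate, W_noise, k)
    # Only experts with a non-zero gate actually need to be evaluated.
    return sum(g[i] * experts[i](x) for i in np.nonzero(g)[0])

# toy usage: 4 tiny linear "experts"
d, n = 8, 4
experts = [lambda v, M=np.random.randn(d, d): M @ v for _ in range(n)]
out = moe_layer(np.random.randn(d), experts, np.random.randn(d, n),
                np.random.randn(d, n), k=2)
print(out.shape)
```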

Regularizing Neural Networks by Penalizing Confident Output Distributions

Authors: Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, Geoffrey Hinton

Abstract: We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence. We exhaustively evaluate the proposed confidence penalty and label smoothing on 6 common benchmarks: image classification (MNIST and Cifar-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). We find that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.

URL: https://arxiv.org/abs/1701.06548

Notes: smart regularization from Hinton
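
The regularizer itself is a one-liner; a numpy sketch, assuming the simple "cross-entropy minus beta times output entropy" form described in the abstract:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loss_with_confidence_penalty(logits, labels, beta=0.1):
    """Cross-entropy minus beta * entropy of the model's output distribution.

    Penalizing low-entropy (over-confident) outputs is the regularizer;
    beta trades off data fit against confidence.
    logits: (N, C), labels: (N,) integer class ids
    """
    p = softmax(logits)
    n = len(labels)
    cross_entropy = -np.log(p[np.arange(n), labels] + 1e-12).mean()
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1).mean()
    return cross_entropy - beta * entropy

# toy usage
logits = np.random.randn(32, 10)
labels = np.random.randint(0, 10, size=32)
print(loss_with_confidence_penalty(logits, labels, beta=0.1))
```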

Discriminative Neural Topic Models

Authors: Gaurav Pandey, Ambedkar Dukkipati

Abstract: We propose a neural network based approach for learning topics from text and image datasets. The model makes no assumptions about the conditional distribution of the observed features given the latent topics. This allows us to perform topic modelling efficiently using sentences of documents and patches of images as observed features, rather than limiting ourselves to words. Moreover, the proposed approach is online, and hence can be used for streaming data. Furthermore, since the approach utilizes neural networks, it can be implemented on GPU with ease, and hence it is very scalable.

URL: https://arxiv.org/abs/1701.06796

Notes: I don't like topic modeling, but one should stay in touch with advances these days.

Hierarchical Recurrent Attention Network for Response Generation

Authors: Chen Xing, Wei Wu, Yu Wu, Ming Zhou, Yalou Huang, Wei-Ying Ma

Abstract: We study multi-turn response generation in chatbots where a response is generated according to a conversation context. Existing work has modeled the hierarchy of the context, but does not pay enough attention to the fact that words and utterances in the context are differentially important. As a result, they may lose important information in context and generate irrelevant responses. We propose a hierarchical recurrent attention network (HRAN) to model both aspects in a unified framework. In HRAN, a hierarchical attention mechanism attends to important parts within and among utterances with word level attention and utterance level attention respectively. With the word level attention, hidden vectors of a word level encoder are synthesized as utterance vectors and fed to an utterance level encoder to construct hidden representations of the context. The hidden vectors of the context are then processed by the utterance level attention and formed as context vectors for decoding the response. Empirical studies on both automatic evaluation and human judgment show that HRAN can significantly outperform state-of-the-art models for multi-turn response generation.

URL: https://arxiv.org/abs/1701.07149

Notes: Response generation with a hierarchical network is not that fresh an idea, but maybe these guys have the results.

Wasserstein GAN

Authors: Martin Arjovsky, Soumith Chintala, Léon Bottou

Abstract: We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to other distances between distributions.

URL: https://arxiv.org/abs/1701.07875

Notes: New enhancement for GANs, could be useful.
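
A minimal PyTorch sketch of the WGAN training loop as I understand it: the critic maximizes the difference of mean scores on real vs. fake samples, with weight clipping standing in for the Lipschitz constraint; the networks and data here are toys:

```python
import torch
import torch.nn as nn

# Tiny critic and generator for illustration; real models would be deeper.
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
gen = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(gen.parameters(), lr=5e-5)
clip, n_critic = 0.01, 5

def sample_real(n):                # stand-in for the data distribution
    return torch.randn(n, 2) + torch.tensor([2.0, 0.0])

for step in range(100):
    # Train the critic several times per generator step.
    for _ in range(n_critic):
        real, z = sample_real(64), torch.randn(64, 8)
        loss_c = -(critic(real).mean() - critic(gen(z).detach()).mean())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():      # enforce the Lipschitz constraint
            p.data.clamp_(-clip, clip)
    # The generator tries to maximize the critic's score on fake samples.
    z = torch.randn(64, 8)
    loss_g = -critic(gen(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```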

Reinforced backpropagation improves test performance of deep networks: a toy-model study

Authors: Haiping Huang, Taro Toyoizumi

Abstract: Standard error backpropagation is used in almost all modern deep network training. However, it typically suffers from proliferation of saddle points in high-dimensional parameter space. Therefore, it is highly desirable to design an efficient algorithm to escape from these saddle points and reach a good parameter region of better generalization capabilities, especially based on rough insights about the landscape of the error surface. Here, we propose a simple extension of the backpropagation, namely reinforced backpropagation, which simply adds previous first-order gradients in a stochastic manner with a probability that increases with learning time. Extensive numerical simulations on a toy deep learning model verify its excellent performance. The reinforced backpropagation can significantly improve test performance of the deep network training, especially when the data are scarce. The performance is even better than that of state-of-the-art stochastic optimization algorithm called Adam, with an extra advantage of less computer memory required.

URL: https://arxiv.org/abs/1701.07974

Notes: Seems to be an analogue of momentum in SGD; could be helpful, maybe.
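
A heavily guessed numpy sketch of the update rule: with some probability that grows with training time, the previous gradient gets added on top of the current one. The probability schedule here is my own placeholder, not the paper's exact form:

```python
import numpy as np

def reinforced_sgd_update(w, grad, prev_grad, t, lr=0.01, t0=1000.0):
    """One parameter update of the 'reinforced backpropagation' flavour."""
    p = t / (t + t0)                 # increases from 0 towards 1 over time
    step = grad.copy()
    if np.random.rand() < p:
        step += prev_grad            # stochastically reinforce with the old gradient
    return w - lr * step, grad       # new weights, gradient to remember

# toy usage inside a training loop (grad would come from backprop)
w, prev = np.zeros(10), np.zeros(10)
for t in range(1, 5001):
    grad = 2 * (w - 1.0)             # gradient of ||w - 1||^2
    w, prev = reinforced_sgd_update(w, grad, prev, t)
print(np.round(w[:3], 3))
```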

CommAI: Evaluating the first steps towards a useful general AI

Authors: Marco Baroni, Armand Joulin, Allan Jabri, Germàn Kruszewski, Angeliki Lazaridou, Klemen Simonic, Tomas Mikolov

Abstract: With machine learning successfully applied to new daunting problems almost every day, general AI starts looking like an attainable goal. However, most current research focuses instead on important but narrow applications, such as image classification or machine translation. We believe this to be largely due to the lack of objective ways to measure progress towards broad machine intelligence. In order to fill this gap, we propose here a set of concrete desiderata for general AI, together with a platform to test machines on how well they satisfy such desiderata, while keeping all further complexities to a minimum.

URL: https://arxiv.org/abs/1701.08954

Notes: Fresh article from Mikolov about an actual path to general AI.

Adversarial Learning for Neural Dialogue Generation

Authors: Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky

Abstract: In this paper, drawing intuition from the Turing test, we propose using adversarial training for open-domain dialogue generation: the system is trained to produce sequences that are indistinguishable from human-generated dialogue utterances. We cast the task as a reinforcement learning (RL) problem where we jointly train two systems, a generative model to produce response sequences, and a discriminator, analogous to the human evaluator in the Turing test, to distinguish between the human-generated dialogues and the machine-generated ones. The outputs from the discriminator are then used as rewards for the generative model, pushing the system to generate dialogues that mostly resemble human dialogues. In addition to adversarial training we describe a model for adversarial evaluation that uses success in fooling an adversary as a dialogue evaluation metric, while avoiding a number of potential pitfalls. Experimental results on several metrics, including adversarial evaluation, demonstrate that the adversarially-trained system generates higher-quality responses than previous baselines.

URL: https://arxiv.org/abs/1701.06547

Notes: GAN for dialog generation with seq2seq, with a few tricks to make it work.

Memory Augmented Neural Networks with Wormhole Connections

Authors: Caglar Gulcehre, Sarath Chandar, Yoshua Bengio

Abstract: Recent empirical results on long-term dependency tasks have shown that neural networks augmented with an external memory can learn the long-term dependency tasks more easily and achieve better generalization than vanilla recurrent neural networks (RNN). We suggest that memory augmented neural networks can reduce the effects of vanishing gradients by creating shortcut (or wormhole) connections. Based on this observation, we propose a novel memory augmented neural network model called TARDIS (Temporal Automatic Relation Discovery in Sequences). The controller of TARDIS can store a selective set of embeddings of its own previous hidden states into an external memory and revisit them as and when needed. For TARDIS, memory acts as a storage for wormhole connections to the past to propagate the gradients more effectively and it helps to learn the temporal dependencies. The memory structure of TARDIS has similarities to both Neural Turing Machines (NTM) and Dynamic Neural Turing Machines (D-NTM), but both read and write operations of TARDIS are simpler and more efficient. We use discrete addressing for read/write operations which helps to substantially to reduce the vanishing gradient problem with very long sequences. Read and write operations in TARDIS are tied with a heuristic once the memory becomes full, and this makes the learning problem simpler when compared to NTM or D-NTM type of architectures. We provide a detailed analysis on the gradient propagation in general for MANNs. We evaluate our models on different long-term dependency tasks and report competitive results in all of them.

URL: https://arxiv.org/abs/1701.08718

Notes: NN with external memory and an additional path for gradient flow.

2017-02

Low-Dose CT with a Residual Encoder-Decoder Convolutional Neural Network (RED-CNN)

Authors: Hu Chen, Yi Zhang, Mannudeep K. Kalra, Feng Lin, Peixi Liao, Jiliu Zhou, Ge Wang

Abstract: Given the potential X-ray radiation risk to the patient, low-dose CT has attracted a considerable interest in the medical imaging field. The current main stream low-dose CT methods include vendor-specific sinogram domain filtration and iterative reconstruction, but they need to access original raw data whose formats are not transparent to most users. Due to the difficulty of modeling the statistical characteristics in the image domain, the existing methods for directly processing reconstructed images cannot eliminate image noise very well while keeping structural details. Inspired by the idea of deep learning, here we combine the autoencoder, the deconvolution network, and shortcut connections into the residual encoder-decoder convolutional neural network (RED-CNN) for low-dose CT imaging. After patch-based training, the proposed RED-CNN achieves a competitive performance relative to the-state-of-art methods in both simulated and clinical cases. Especially, our method has been favorably evaluated in terms of noise suppression, structural preservation and lesion detection.

URL: https://arxiv.org/abs/1702.00288

Notes: A similar architecture to the one we've used for the sentence representation task.

On orthogonality and learning recurrent networks with long term dependencies

Authors: Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, Chris Pal

Abstract: It is well known that it is challenging to train deep neural networks and recurrent neural networks for tasks that exhibit long term dependencies. The vanishing or exploding gradient problem is a well known issue associated with these challenges. One approach to addressing vanishing and exploding gradients is to use either soft or hard constraints on weight matrices so as to encourage or enforce orthogonality. Orthogonal matrices preserve gradient norm during backpropagation and can therefore be a desirable property; however, we find that hard constraints on orthogonality can negatively affect the speed of convergence and model performance. This paper explores the issues of optimization convergence, speed and gradient stability using a variety of different methods for encouraging or enforcing orthogonality. In particular we propose a weight matrix factorization and parameterization strategy through which we can bound matrix norms and therein control the degree of expansivity induced during backpropagation.

URL: https://arxiv.org/abs/1702.00071

Notes: Long-term dependencies in text are really important, so I should check this out.
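
One common form of the soft orthogonality constraint they discuss is a Frobenius-norm penalty on W^T W - I; a numpy sketch (the exact parameterization the paper ends up recommending may differ):

```python
import numpy as np

def soft_orthogonality_penalty(W, lam=1e-3):
    """Soft constraint pushing W towards orthogonality: lam * ||W^T W - I||_F^2."""
    d = W.shape[1]
    M = W.T @ W - np.eye(d)
    penalty = lam * np.sum(M ** 2)
    grad = lam * 4.0 * W @ M      # d/dW ||W^T W - I||_F^2 = 4 W (W^T W - I)
    return penalty, grad

# toy usage: add `grad` to the recurrent weight matrix's gradient during training
W = np.random.randn(16, 16)
pen, g = soft_orthogonality_penalty(W)
print(pen, g.shape)
```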

Understanding trained CNNs by indexing neuron selectivity

Authors: Ivet Rafegas, Maria Vanrell, Luis A. Alexandre

Abstract: The impressive performance and plasticity of convolutional neural networks to solve different vision problems are shadowed by their black-box nature and its consequent lack of full understanding. To reduce this gap we propose to describe the activity of individual neurons by quantifying their inherent selectivity to specific properties. Our approach is based on the definition of feature selectivity indexes that allow the ranking of neurons according to specific properties. Here we report the results of exploring selectivity indexes for: (a) an image feature (color); and (b) an image label (class membership). Our contribution is a framework to seek or classify neurons by indexing on these selectivity properties. It helps to find color selective neurons, such as a red-mushroom neuron in layer conv4 or class selective neurons such as dog-face neurons in layer conv5, and establishes a methodology to derive other selectivity properties. Indexing on neuron selectivity can statistically draw how features and classes are represented through layers at a moment when the size of trained nets is growing and automatic tools to index can be helpful.

URL: https://arxiv.org/abs/1702.00382

Notes: The first of today's articles about understanding CNNs.

Design, Analysis and Application of A Volumetric Convolutional Neural Network

Authors: Xiaqing Pan, Yueru Chen, C.-C. Jay Kuo

Abstract: The design, analysis and application of a volumetric convolutional neural network (VCNN) are studied in this work. Although many CNNs have been proposed in the literature, their design is empirical. In the design of the VCNN, we propose a feed-forward K-means clustering algorithm to determine the filter number and size at each convolutional layer systematically. For the analysis of the VCNN, the cause of confusing classes in the output of the VCNN is explained by analyzing the relationship between the filter weights (also known as anchor vectors) from the last fully-connected layer to the output. Furthermore, a hierarchical clustering method followed by a random forest classification method is proposed to boost the classification performance among confusing classes. For the application of the VCNN, we examine the 3D shape classification problem and conduct experiments on a popular ModelNet40 dataset. The proposed VCNN offers the state-of-the-art performance among all volume-based CNN methods.

URL: https://arxiv.org/abs/1702.00158

Notes: The second article about understanding CNNs.

On SGD's Failure in Practice: Characterizing and Overcoming Stalling

Authors: Vivak Patel

Abstract: Stochastic Gradient Descent (SGD) is widely used in machine learning problems to efficiently perform empirical risk minimization, yet, in practice, SGD is known to stall before reaching the actual minimizer of the empirical risk. SGD stalling has often been attributed to its sensitivity to the conditioning of the problem; however, as we demonstrate, SGD will stall even when applied to a simple linear regression problem with unity condition number for standard learning rates. Thus, in this work, we numerically demonstrate and mathematically argue that stalling is a crippling and generic limitation of SGD and its variants in practice. Once we have established the problem of stalling, we introduce a framework for hedging against its effects, which (1) deters SGD and its variants from stalling, (2) still provides convergence guarantees, and (3) makes SGD and its variants more practical methods for minimization.

URL: https://arxiv.org/abs/1702.00317

Notes: SGD's failure modes should be understood if one wants to work with it.

Multilingual Multi-modal Embeddings for Natural Language Processing

Authors: Iacer Calixto, Qun Liu, Nick Campbell

Abstract: We propose a novel discriminative model that learns embeddings from multilingual and multi-modal data, meaning that our model can take advantage of images and descriptions in multiple languages to improve embedding quality. To that end, we introduce a modification of a pairwise contrastive estimation optimisation function as our training objective. We evaluate our embeddings on an image-sentence ranking (ISR), a semantic textual similarity (STS), and a neural machine translation (NMT) task. We find that the additional multilingual signals lead to improvements on both the ISR and STS tasks, and the discriminative cost can also be used in re-ranking n-best lists produced by NMT models, yielding strong improvements.

URL: https://arxiv.org/abs/1702.01101

Notes: Yet another embedding; I should hurry up and publish mine.

Doubly-Attentive Decoder for Multi-modal Neural Machine Translation

Authors: Iacer Calixto, Qun Liu, Nick Campbell

Abstract: We introduce a Multi-modal Neural Machine Translation model in which a doubly-attentive decoder naturally incorporates spatial visual features obtained using pre-trained convolutional neural networks, bridging the gap between image description and translation. Our decoder learns to attend to source-language words and parts of an image independently by means of two separate attention mechanisms as it generates words in the target language. We find that our model can efficiently exploit not just back-translated in-domain multi-modal data but also large general-domain text-only MT corpora. We also report state-of-the-art results on the Multi30k data set.

URL: https://arxiv.org/abs/1702.01287

Notes: double attention for NMT

Opinion Recommendation using Neural Memory Model

Authors: Zhongqing Wang, Yue Zhang

Abstract: We present opinion recommendation, a novel task of jointly predicting a custom review with a rating score that a certain user would give to a certain product or service, given existing reviews and rating scores to the product or service by other users, and the reviews that the user has given to other products and services. A characteristic of opinion recommendation is the reliance of multiple data sources for multi-task joint learning, which is the strength of neural models. We use a single neural network to model users and products, capturing their correlation and generating customised product representations using a deep memory network, from which customised ratings and reviews are constructed jointly. Results show that our opinion recommendation system gives ratings that are closer to real user ratings on Yelp.com data compared with Yelp's own ratings, and our methods give better results compared to several pipelines baselines using state-of-the-art sentiment rating and summarization systems.

URL: https://arxiv.org/abs/1702.01517

Notes: Opinion recommendation, a task I've never seen before.

Search Intelligence: Deep Learning For Dominant Category Prediction

Authors: Zeeshan Khawar Malik, Mo Kobrosli, Peter Maas

Abstract: Deep Neural Networks, and specifically fully-connected convolutional neural networks are achieving remarkable results across a wide variety of domains. They have been trained to achieve state-of-the-art performance when applied to problems such as speech recognition, image classification, natural language processing and bioinformatics. Most of these deep learning models when applied to classification employ the softmax activation function for prediction and aim to minimize cross-entropy loss. In this paper, we have proposed a supervised model for dominant category prediction to improve search recall across all eBay classifieds platforms. The dominant category label for each query in the last 90 days is first calculated by summing the total number of collaborative clicks among all categories. The category having the highest number of collaborative clicks for the given query will be considered its dominant category. Second, each query is transformed to a numeric vector by mapping each unique word in the query document to a unique integer value; all padded to equal length based on the maximum document length within the pre-defined vocabulary size. A fully-connected deep convolutional neural network (CNN) is then applied for classification. The proposed model achieves very high classification accuracy compared to other state-of-the-art machine learning techniques.

URL: https://arxiv.org/abs/1702.01717

Notes: CNN for classification is a classic approach by now, but this one works with click streams.

Syntax-aware Neural Machine Translation Using CCG

Authors: Maria Nadejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, Alexandra Birch

Abstract: Neural machine translation (NMT) models are able to partially learn syntactic information from sequential lexical information. Still, some complex syntactic phenomena such as prepositional phrase attachment are poorly modeled. This work aims to answer two questions: 1) Does explicitly modeling source or target language syntax help NMT? 2) Is tight integration of words and syntax better than multitask training? We introduce syntactic information in the form of CCG supertags either in the source as an extra feature in the embedding, or in the target, by interleaving the target supertags with the word sequence. Our results on WMT data show that explicitly modeling syntax improves machine translation quality for English-German, a high-resource pair, and for English-Romanian, a low-resource pair and also several syntactic phenomena including prepositional phrase attachment. Furthermore, a tight coupling of words and syntax improves translation quality more than multitask training.

URL: https://arxiv.org/abs/1702.01147

Notes: NMT with syntax awareness from Philipp Koehn

Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks

Authors: Guy Katz, Clark Barrett, David Dill, Kyle Julian, Mykel Kochenderfer

Abstract: Deep neural networks have emerged as a widely used and effective means for tackling complex, real-world problems. However, a major obstacle in applying them to safety-critical systems is the great difficulty in providing formal guarantees about their behavior. We present a novel, scalable, and efficient technique for verifying properties of deep neural networks (or providing counter-examples). The technique is based on the simplex method, extended to handle the non-convex Rectified Linear Unit (ReLU) activation function, which is a crucial ingredient in many modern neural networks. The verification procedure tackles neural networks as a whole, without making any simplifying assumptions. We evaluated our technique on a prototype deep neural network implementation of the next-generation Airborne Collision Avoidance System for unmanned aircraft (ACAS Xu). Results show that our technique can successfully prove properties of networks that are an order of magnitude larger than the largest networks verified using existing methods.

URL: https://arxiv.org/abs/1702.01135

Notes: an SMT solver for verifying neural networks, evaluated on a collision avoidance system - that is something new

Neural Semantic Parsing over Multiple Knowledge-bases

Authors: Jonathan Herzig, Jonathan Berant

Abstract: A fundamental challenge in developing semantic parsers is the paucity of strong supervision in the form of language utterances annotated with logical form. In this paper, we propose to exploit structural regularities in language in different domains, and train semantic parsers over multiple knowledge-bases (KBs), while sharing information across datasets. We find that we can substantially improve parsing accuracy by training a single sequence-to-sequence model over multiple KBs, when providing an encoding of the domain at decoding time. Our model achieves state-of-the-art performance on the Overnight dataset (containing eight domains), improves performance over a single KB baseline from 75.6% to 79.6%, while obtaining a 7x reduction in the number of model parameters.

URL: https://arxiv.org/abs/1702.01569

Notes: KBs are generally not my area, but this is worth checking out since they propose a simplification

All-but-the-Top: Simple and Effective Postprocessing for Word Representations

Authors: Jiaqi Mu, Suma Bhat, Pramod Viswanath

Abstract: Real-valued word representations have transformed NLP applications, popular examples are word2vec and GloVe, recognized for their ability to capture linguistic regularities. In this paper, we demonstrate a very simple, and yet counter-intuitive, postprocessing technique — eliminate the common mean vector and a few top dominating directions from the word vectors — that renders off-the-shelf representations even stronger. The postprocessing is empirically validated on a variety of lexical-level intrinsic tasks (word similarity, concept categorization, word analogy) and sentence-level extrinsic tasks (semantic textual similarity) on multiple datasets and with a variety of representation methods and hyperparameter choices in multiple languages, in each case, the processed representations are consistently better than the original ones. Furthermore, we demonstrate quantitatively in downstream applications that neural network architectures "automatically learn" the postprocessing operation.

URL: https://arxiv.org/abs/1702.01417

Notes: an interesting whistle for word vectors
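
The postprocessing itself is only a few lines: subtract the common mean vector, then project out the top principal directions. A minimal numpy sketch; the number of removed directions is a hyperparameter (the paper suggests on the order of dim/100).

```python
import numpy as np

def all_but_the_top(vectors, n_components):
    """Remove the common mean vector and the top principal directions.
    vectors: (num_words, dim) matrix of word embeddings."""
    centered = vectors - vectors.mean(axis=0)
    # Top principal directions via SVD of the centered matrix.
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    top = components[:n_components]                  # (n_components, dim)
    return centered - centered @ top.T @ top

emb = np.random.randn(1000, 300)    # stand-in for word2vec / GloVe vectors
post = all_but_the_top(emb, n_components=3)
```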

Deep Learning with Dynamic Computation Graphs

Authors: Moshe Looks, Marcello Herreshoff, DeLesley Hutchins & Peter Norvig

Abstract: Neural networks that compute over graph structures are a natural fit for problems in a variety of domains, including natural language (parse trees) and cheminformatics (molecular graphs). However, since the computation graph has a different shape and size for every input, such networks do not directly support batched training or inference. They are also difficult to implement in popular deep learning libraries, which are based on static data-flow graphs. We introduce a technique called dynamic batching, which not only batches together operations between different input graphs of dissimilar shape, but also between different nodes within a single input graph. The technique allows us to create static graphs, using popular libraries, that emulate dynamic computation graphs of arbitrary shape and size. We further present a high-level library of compositional blocks that simplifies the creation of dynamic graph models. Using the library, we demonstrate concise and batch-wise parallel implementations for a variety of models from the literature.

URL: https://openreview.net/pdf?id=ryrGawqex

Notes: Fresh paper from Google about TF dynamic batching. HOT!

A Knowledge-Grounded Neural Conversation Model

Authors: Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, Michel Galley

Abstract: Neural network models are capable of generating extremely natural sounding conversational interactions. Nevertheless, these models have yet to demonstrate that they can incorporate content in the form of factual information or entity-grounded opinion that would enable them to serve in more task-oriented conversational applications. This paper presents a novel, fully data-driven, and knowledge-grounded neural conversation model aimed at producing more contentful responses without slot filling. We generalize the widely-used Seq2Seq approach by conditioning responses on both conversation history and external "facts", allowing the model to be versatile and applicable in an open-domain setting. Our approach yields significant improvements over a competitive Seq2Seq baseline. Human judges found that our outputs are significantly more informative.

URL: https://arxiv.org/abs/1702.01932

Notes: seq2seq with external facts

Comparative Study of CNN and RNN for Natural Language Processing

Authors: Wenpeng Yin, Katharina Kann, Mo Yu, Hinrich Schütze

Abstract: Deep neural networks (DNN) have revolutionized the field of natural language processing (NLP). Convolutional neural network (CNN) and recurrent neural network (RNN), the two main types of DNN architectures, are widely explored to handle various NLP tasks. CNN is supposed to be good at extracting position-invariant features and RNN at modeling units in sequence. The state of the art on many NLP tasks often switches due to the battle between CNNs and RNNs. This work is the first systematic comparison of CNN and RNN on a wide range of representative NLP tasks, aiming to give basic guidance for DNN selection.

URL: https://arxiv.org/abs/1702.01923

Notes: comparison of RNN and CNN (LeCun vs Schmidhuber) for NLP, neat!

Neural Discourse Structure for Text Categorization

Authors: Yangfeng Ji, Noah Smith

Abstract: We show that discourse structure, as defined by Rhetorical Structure Theory and provided by an existing discourse parser, benefits text categorization. Our approach uses a recursive neural network and a newly proposed attention mechanism to compute a representation of the text that focuses on salient content, from the perspective of both RST and the task. Experiments consider variants of the approach and illustrate its strengths and weaknesses.

URL: https://arxiv.org/abs/1702.01829

Notes: neural discourse structure

Living a discrete life in a continuous world: Reference with distributed representations

Authors: Gemma Boleda, Sebastian Padó, Nghia The Pham, Marco Baroni

Abstract: Reference is the crucial property of language that allows us to connect linguistic expressions to the world. Modeling it requires handling both continuous and discrete aspects of meaning. Data-driven models excel at the former, but struggle with the latter, and the reverse is true for symbolic models. We propose a fully data-driven, end-to-end trainable model that, while operating on continuous multimodal representations, learns to organize them into a discrete-like entity library. We also introduce a referential task to test it, cross-modal tracking. Our model beats standard neural network architectures, but is outperformed by some parametrizations of Memory Networks, another model with external memory.

URL: https://arxiv.org/abs/1702.01815

Notes: continuous representations with discrete meanings!

Beam Search Strategies for Neural Machine Translation

Authors: Markus Freitag, Yaser Al-Onaizan

Abstract: The basic concept in Neural Machine Translation (NMT) is to train a large Neural Network that maximizes the translation performance on a given parallel corpus. NMT is then using a simple left-to-right beam-search decoder to generate new translations that approximately maximize the trained conditional probability. The current beam search strategy generates the target sentence word by word from left-to-right while keeping a fixed amount of active candidates at each time step. First, this simple search is less adaptive as it also expands candidates whose scores are much worse than the current best. Secondly, it does not expand hypotheses if they are not within the best scoring candidates, even if their scores are close to the best one. The latter one can be avoided by increasing the beam size until no performance improvement can be observed. While you can reach better performance, this has the drawback of a slower decoding speed. In this paper, we concentrate on speeding up the decoder by applying a more flexible beam search strategy whose candidate size may vary at each time step depending on the candidate scores. We speed up the original decoder by up to 43% for the two language pairs German-English and Chinese-English without losing any translation quality.

URL: https://arxiv.org/abs/1702.01806

Notes: beam search for NMT
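
The core pruning idea (shrink the set of active candidates whenever their scores fall too far behind the current best) can be sketched independently of the NMT model. A small, scoring-agnostic helper; the relative threshold value is an arbitrary illustration, not the paper's tuned setting.

```python
import math

def prune_beam(candidates, beam_size, rel_threshold=math.log(0.1)):
    """candidates: list of (hypothesis, log_prob) pairs at one time step.
    Keep at most beam_size items, and only those whose score is within
    rel_threshold of the best one (relative pruning)."""
    candidates = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_score = candidates[0][1]
    kept = [c for c in candidates if c[1] >= best_score + rel_threshold]
    return kept[:beam_size]

step = [("the cat", -1.2), ("a cat", -1.4), ("the dog", -4.9)]
print(prune_beam(step, beam_size=5))   # the weak "the dog" hypothesis is dropped
```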

Ensemble Distillation for Neural Machine Translation

Authors: Markus Freitag, Yaser Al-Onaizan, Baskaran Sankaran

Abstract: Knowledge distillation describes a method for training a student network to perform better by learning from a stronger teacher network. In this work, we run experiments with different kinds of teacher networks to enhance the translation performance of a student Neural Machine Translation (NMT) network. We demonstrate techniques based on an ensemble and a best BLEU teacher network. We also show how to benefit from a teacher network that has the same architecture and dimensions of the student network. Furthermore, we introduce a data filtering technique based on the dissimilarity between the forward translation (obtained during knowledge distillation) of a given source sentence and its target reference. We use TER to measure dissimilarity. Finally, we show that an ensemble teacher model can significantly reduce the student model size while still getting performance improvements compared to the baseline student network.

URL: https://arxiv.org/abs/1702.01802

Notes: knowledge distillation for NMT
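
Word-level knowledge distillation for NMT is usually implemented as a mix of the ordinary cross-entropy on the reference and a KL term towards the teacher's (possibly ensembled) distribution. A hedged PyTorch sketch of such a loss; the temperature and mixing weight are assumed hyperparameters, not the paper's.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold, alpha=0.5, T=1.0):
    """student_logits, teacher_logits: (batch, vocab); gold: (batch,) token ids.
    Mixes cross-entropy on the reference with KL towards the teacher."""
    ce = F.cross_entropy(student_logits, gold)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1.0 - alpha) * kl

student = torch.randn(8, 30000, requires_grad=True)
teacher = torch.randn(8, 30000)            # e.g. averaged ensemble logits
gold = torch.randint(0, 30000, (8,))
loss = distillation_loss(student, teacher, gold)
loss.backward()
```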

Multi-task Coupled Attentions for Category-specific Aspect and Opinion Terms Co-extraction

Authors: Wenya Wang, Sinno Jialin Pan, Daniel Dahlmeier

Abstract: In aspect-based sentiment analysis, most existing methods either focus on aspect/opinion terms extraction or aspect terms categorization. However, each task by itself only provides partial information to end users. To generate more detailed and structured opinion analysis, we propose a finer-grained problem, which we call category-specific aspect and opinion terms extraction. This problem involves the identification of aspect and opinion terms within each sentence, as well as the categorization of the identified terms. To this end, we propose an end-to-end multi-task attention model, where each task corresponds to aspect/opinion terms extraction for a specific category. Our model benefits from exploring the commonalities and relationships among different tasks to address the data sparsity issue. We demonstrate its state-of-the-art performance on three benchmark datasets.

URL: https://arxiv.org/abs/1702.01776

Notes: multi-task for opinion extraction

Fast and Accurate Sequence Labeling with Iterated Dilated Convolutions

Authors: Emma Strubell, Patrick Verga, David Belanger, Andrew McCallum

Abstract: Bi-directional LSTMs have emerged as a standard method for obtaining per-token vector representations serving as input to various token labeling tasks (whether followed by Viterbi prediction or independent classification). This paper proposes an alternative to Bi-LSTMs for this purpose: iterated dilated convolutional neural networks (ID-CNNs), which have better capacity than traditional CNNs for large context and structured prediction. We describe a distinct combination of network structure, parameter sharing and training procedures that is not only more accurate than Bi-LSTM-CRFs, but also 8x faster at test time on long sequences. Moreover, ID-CNNs with independent classification enable a dramatic 14x test-time speedup, while still attaining accuracy comparable to the Bi-LSTM-CRF. We further demonstrate the ability of ID-CNNs to combine evidence over long sequences by demonstrating their improved accuracy on whole-document (rather than per-sentence) inference. Unlike LSTMs whose sequential processing on sentences of length N requires O(N) time even in the face of parallelism, IDCNNs permit fixed-depth convolutions to run in parallel across entire documents. Today when many companies run basic NLP on the entire web and large-volume traffic, faster methods are paramount to saving time and energy costs.

URL: https://arxiv.org/abs/1702.02098

Notes: seq labeling with dilated CNNs
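
A minimal PyTorch sketch of a dilated-convolution block over token representations, with exponentially growing dilation so the receptive field covers long contexts in few layers. Channel counts, depth and kernel size here are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Stack of 1-D convolutions with dilations 1, 2, 4, ... over a token sequence."""
    def __init__(self, channels=128, num_layers=4, kernel_size=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            dilation = 2 ** i
            padding = dilation * (kernel_size - 1) // 2   # keeps sequence length
            self.layers.append(nn.Conv1d(channels, channels, kernel_size,
                                         dilation=dilation, padding=padding))

    def forward(self, x):              # x: (batch, channels, seq_len)
        for conv in self.layers:
            x = torch.relu(conv(x))
        return x

tokens = torch.randn(2, 128, 50)       # embedded sentence of 50 tokens
features = DilatedBlock()(tokens)      # per-token features for a tag classifier
print(features.shape)                  # torch.Size([2, 128, 50])
```

In the paper the same block is applied several times with shared parameters (hence "iterated"); the sketch shows a single pass.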

Learning similarity preserving representations with neural similarity encoders

Authors: Franziska Horn, Klaus-Robert Müller

Abstract: Many dimensionality reduction or manifold learning algorithms optimize for retaining the pairwise similarities, distances, or local neighborhoods of data points. Spectral methods like Kernel PCA (kPCA) or isomap achieve this by computing the singular value decomposition (SVD) of some similarity matrix to obtain a low dimensional representation of the original data. However, this is computationally expensive if a lot of training examples are available and, additionally, representations for new (out-of-sample) data points can only be created when the similarities to the original training examples can be computed. We introduce similarity encoders (SimEc), which learn similarity preserving representations by using a feed-forward neural network to map data into an embedding space where the original similarities can be approximated linearly. The model optimizes the same objective as kPCA but in the process it learns a linear or non-linear embedding function (in the form of the tuned neural network), with which the representations of novel data points can be computed - even if the original pairwise similarities of the training set were generated by an unknown process such as human ratings. By creating embeddings for both image and text datasets, we demonstrate that SimEc can, on the one hand, reach the same solution as spectral methods, and, on the other hand, obtain meaningful embeddings from similarities based on human labels.

URL: https://arxiv.org/abs/1702.01824

Notes: neural PCA
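
The objective is easy to state: map inputs through a small network into an embedding space whose inner products approximate a given similarity matrix. A minimal PyTorch sketch under that reading; the layer sizes, similarity matrix and training loop are stand-ins, not the authors' setup.

```python
import torch
import torch.nn as nn

class SimilarityEncoder(nn.Module):
    """Feed-forward map whose embedding dot products approximate target similarities."""
    def __init__(self, in_dim=784, embed_dim=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 100), nn.Tanh(),
                                 nn.Linear(100, embed_dim))

    def forward(self, x):
        return self.net(x)

x = torch.randn(64, 784)                 # raw data points
target_sim = x @ x.t()                   # stand-in for any similarity matrix
model = SimilarityEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    z = model(x)
    loss = ((z @ z.t() - target_sim) ** 2).mean()   # approximate S by Z Z^T
    opt.zero_grad(); loss.backward(); opt.step()
```

Unlike kPCA, the trained network embeds new points directly, even when the original similarities came from an unknown process such as human ratings.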

Semi-Supervised QA with Generative Domain-Adaptive Nets

Authors: Zhilin Yang, Junjie Hu, Ruslan Salakhutdinov, William W. Cohen

Abstract: We study the problem of semi-supervised question answering: utilizing unlabeled text to boost the performance of question answering models. We propose a novel training framework, the Generative Domain-Adaptive Nets. In this framework, we train a generative model to generate questions based on the unlabeled text, and combine model-generated questions with human-generated questions for training question answering models. We develop novel domain adaptation algorithms, based on reinforcement learning, to alleviate the discrepancy between the model-generated data distribution and the human-generated data distribution. Experiments show that our proposed framework obtains substantial improvement from unlabeled text.

URL: https://arxiv.org/abs/1702.02206

Notes: Salakhutdinov's fresh article about QA, domain adaptation

Trainable Greedy Decoding for Neural Machine Translation

Authors: Jiatao Gu, Kyunghyun Cho, Victor O.K. Li

Abstract: Recent research in neural machine translation has largely focused on two aspects; neural network architectures and end-to-end learning algorithms. The problem of decoding, however, has received relatively little attention from the research community. In this paper, we solely focus on the problem of decoding given a trained neural machine translation model. Instead of trying to build a new decoding algorithm for any specific decoding objective, we propose the idea of trainable decoding algorithm in which we train a decoding algorithm to find a translation that maximizes an arbitrary decoding objective. More specifically, we design an actor that observes and manipulates the hidden state of the neural machine translation decoder and propose to train it using a variant of deterministic policy gradient. We extensively evaluate the proposed algorithm using four language pairs and two decoding objectives and show that we can indeed train a trainable greedy decoder that generates a better translation (in terms of a target decoding objective) with minimal computational overhead.

URL: https://arxiv.org/abs/1702.02429

Notes: Cho's fresh article; decoding is really the painful part

Automatic Rule Extraction from Long Short Term Memory Networks

Authors: W. James Murdoch, Arthur Szlam

Abstract: Although deep learning models have proven effective at solving problems in natural language processing, the mechanism by which they come to their conclusions is often unclear. As a result, these models are generally treated as black boxes, yielding no insight of the underlying learned patterns. In this paper we consider Long Short Term Memory networks (LSTMs) and demonstrate a new approach for tracking the importance of a given input to the LSTM for a given output. By identifying consistently important patterns of words, we are able to distill state of the art LSTMs on sentiment analysis and question answering into a set of representative phrases. This representation is then quantitatively validated by using the extracted phrases to construct a simple, rule-based classifier which approximates the output of the LSTM.

URL: https://arxiv.org/abs/1702.02540

Notes: rule extraction from RNNs is really interesting - it could cost a few linguists their jobs

A Hybrid Convolutional Variational Autoencoder for Text Generation

Authors: Stanislau Semeniuta, Aliaksei Severyn, Erhardt Barth

Abstract: In this paper we explore the effect of architectural choices on learning a Variational Autoencoder (VAE) for text generation. In contrast to the previously introduced VAE model for text where both the encoder and decoder are RNNs, we propose a novel hybrid architecture that blends fully feed-forward convolutional and deconvolutional components with a recurrent language model. Our architecture exhibits several attractive properties such as faster run time and convergence, ability to better handle long sequences and, more importantly, it helps to avoid some of the major difficulties posed by training VAE models on textual data.

URL: https://arxiv.org/abs/1702.02390

Notes: VAE for texts! With a recurrent flavour as well

Iterative Multi-document Neural Attention for Multiple Answer Prediction

Authors: Claudio Greco, Alessandro Suglia, Pierpaolo Basile, Gaetano Rossiello, Giovanni Semeraro

Abstract: People have information needs of varying complexity, which can be solved by an intelligent agent able to answer questions formulated in a proper way, eventually considering user context and preferences. In a scenario in which the user profile can be considered as a question, intelligent agents able to answer questions can be used to find the most relevant answers for a given user. In this work we propose a novel model based on Artificial Neural Networks to answer questions with multiple answers by exploiting multiple facts retrieved from a knowledge base. The model is evaluated on the factoid Question Answering and top-n recommendation tasks of the bAbI Movie Dialog dataset. After assessing the performance of the model on both tasks, we try to define the long-term goal of a conversational recommender system able to interact using natural language and to support users in their information seeking processes in a personalized way.

URL: https://arxiv.org/abs/1702.02367

Notes: iterative attention could get more attention

Question Answering through Transfer Learning from Large Fine-grained Supervision Data

Authors: Sewon Min, Minjoon Seo, Hannaneh Hajishirzi

Abstract: We show that the task of question answering (QA) can significantly benefit from the transfer learning of models trained on a different large, fine-grained QA dataset. We achieve the state of the art in two well-studied QA datasets, WikiQA and SemEval-2016 (Task 3A), through a basic transfer learning technique from SQuAD. For WikiQA, our model outperforms the previous best model by more than 8%. We demonstrate that finer supervision provides better guidance for learning lexical and syntactic information than coarser supervision, through quantitative results and visual analysis. We also show that a similar transfer learning procedure achieves the state of the art on an entailment task.

URL: https://arxiv.org/abs/1702.02171

Notes: transfer learning for QA

Neural Machine Translation with Source-Side Latent Graph Parsing

Authors: Kazuma Hashimoto, Yoshimasa Tsuruoka

Abstract: This paper presents a novel neural machine translation model which jointly learns translation and source-side latent graph representations of sentences. Unlike existing pipelined approaches using syntactic parsers, our end-to-end model learns a latent graph parser as part of the encoder of an attention-based neural machine translation model, so the parser is optimized according to the translation objective. Experimental results show that our model significantly outperforms the previous best results on the standard English-to-Japanese translation dataset.

URL: https://arxiv.org/abs/1702.02265

Notes: NMT on graphs - a long-time dream, and this is one step towards it

Exploiting Domain Knowledge via Grouped Weight Sharing with Application to Text Categorization

Authors: Ye Zhang, Matthew Lease, Byron C. Wallace

Abstract: A fundamental advantage of neural models for NLP is their ability to learn representations from scratch. However, in practice this often means ignoring existing external linguistic resources, e.g., WordNet or domain specific ontologies such as the Unified Medical Language System (UMLS). We propose a general, novel method for exploiting such resources via weight sharing. Prior work on weight sharing in neural networks has considered it largely as a means of model compression. In contrast, we treat weight sharing as a flexible mechanism for incorporating prior knowledge into neural models. We show that this approach consistently yields improved performance on classification tasks compared to baseline strategies that do not exploit weight sharing.

URL: https://arxiv.org/abs/1702.02535

Notes: weight sharing for categorization could be interesting

How to evaluate word embeddings? On importance of data efficiency and simple supervised tasks

Authors: Stanisław Jastrzebski, Damian Leśniak, Wojciech Marian Czarnecki

Abstract: Maybe the single most important goal of representation learning is making subsequent learning faster. Surprisingly, this fact is not well reflected in the way embeddings are evaluated. In addition, recent practice in word embeddings points towards importance of learning specialized representations. We argue that focus of word representation evaluation should reflect those trends and shift towards evaluating what useful information is easily accessible. Specifically, we propose that evaluation should focus on data efficiency and simple supervised tasks, where the amount of available data is varied and scores of a supervised model are reported for each subset (as commonly done in transfer learning). In order to illustrate significance of such analysis, a comprehensive evaluation of selected word embeddings is presented. Proposed approach yields a more complete picture and brings new insight into performance characteristics, for instance information about word similarity or analogy tends to be non-linearly encoded in the embedding space, which questions the cosine-based, unsupervised, evaluation methods. All results and analysis scripts are available online.

URL: https://arxiv.org/abs/1702.02170

Notes: evaluation of word embeddings really should be grounded; for now it remains an open question, although virtually everyone uses embeddings

Character-level Deep Conflation for Business Data Analytics

Authors: Zhe Gan, P. D. Singh, Ameet Joshi, Xiaodong He, Jianshu Chen, Jianfeng Gao, Li Deng

Abstract: Connecting different text attributes associated with the same entity (conflation) is important in business data analytics since it could help merge two different tables in a database to provide a more comprehensive profile of an entity. However, the conflation task is challenging because two text strings that describe the same entity could be quite different from each other for reasons such as misspelling. It is therefore critical to develop a conflation model that is able to truly understand the semantic meaning of the strings and match them at the semantic level. To this end, we develop a character-level deep conflation model that encodes the input text strings from character level into finite dimension feature vectors, which are then used to compute the cosine similarity between the text strings. The model is trained in an end-to-end manner using back propagation and stochastic gradient descent to maximize the likelihood of the correct association. Specifically, we propose two variants of the deep conflation model, based on long-short-term memory (LSTM) recurrent neural network (RNN) and convolutional neural network (CNN), respectively. Both models perform well on a real-world business analytics dataset and significantly outperform the baseline bag-of-character (BoC) model.

URL: https://arxiv.org/abs/1702.02640

Notes: character-level semantics, another try

Convolutional Neural Network for Humor Recognition

Authors: Lei Chen, Chong Min Lee

Abstract: For the purpose of automatically evaluating speakers' humor usage, we build a presentation corpus containing humorous utterances based on TED talks. Compared to previous data resources supporting humor recognition research, ours has several advantages, including (a) both positive and negative instances coming from a homogeneous data set, (b) containing a large number of speakers, and (c) being open. Focusing on using lexical cues for humor recognition, we systematically compare a newly emerging text classification method based on Convolutional Neural Networks (CNNs) with a well-established conventional method using linguistic knowledge. The CNN method shows its advantages on both higher recognition accuracies and being able to learn essential features automatically.

URL: https://arxiv.org/abs/1702.02584

Notes: humor recognition is one of the research areas of, e.g., OpenAI

Parallel Long Short-Term Memory for Multi-stream Classification

Authors: Mohamed Bouaziz, Mohamed Morchid, Richard Dufour, Georges Linarès, Renato De Mori

Abstract: Recently, machine learning methods have provided a broad spectrum of original and efficient algorithms based on Deep Neural Networks (DNN) to automatically predict an outcome with respect to a sequence of inputs. Recurrent hidden cells allow these DNN-based models to manage long-term dependencies such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM). Nevertheless, these RNNs process a single input stream in one (LSTM) or two (Bidirectional LSTM) directions. But most of the information available nowadays is from multistreams or multimedia documents, and require RNNs to process these information synchronously during the training. This paper presents an original LSTM-based architecture, named Parallel LSTM (PLSTM), that carries out multiple parallel synchronized input sequences in order to predict a common output. The proposed PLSTM method could be used for parallel sequence classification purposes. The PLSTM approach is evaluated on an automatic telecast genre sequences classification task and compared with different state-of-the-art architectures. Results show that the proposed PLSTM method outperforms the baseline n-gram models as well as the state-of-the-art LSTM approach.

URL: https://arxiv.org/abs/1702.03402

Notes: parallel RNNs could be very useful

Batch Policy Gradient Methods for Improving Neural Conversation Models

Authors: Kirthevasan Kandasamy, Yoram Bachrach, Ryota Tomioka, Daniel Tarlow, David Carter

Abstract: We study reinforcement learning of chatbots with recurrent neural network architectures when the rewards are noisy and expensive to obtain. For instance, a chatbot used in automated customer service support can be scored by quality assurance agents, but this process can be expensive, time consuming and noisy. Previous reinforcement learning work for natural language processing uses on-policy updates and/or is designed for on-line learning settings. We demonstrate empirically that such strategies are not appropriate for this setting and develop an off-policy batch policy gradient method (BPG). We demonstrate the efficacy of our method via a series of synthetic experiments and an Amazon Mechanical Turk experiment on a restaurant recommendations dataset.

URL: https://arxiv.org/abs/1702.03334

Notes: offline RL training for chatbots; these guys have invented a batch RL approach, very helpful in NLP; they also used importance sampling for more gradual parameter updates
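
The central computation is an off-policy policy-gradient update on logged conversations, with importance weights between the current policy and the behaviour policy that produced the batch. A heavily simplified sketch (single-step "dialogues", no value baseline, clipped importance ratios); everything here is an assumption-level illustration rather than the paper's exact algorithm.

```python
import torch

def batch_policy_gradient_step(policy_logits, behaviour_probs, actions, rewards, optimizer):
    """policy_logits: (batch, num_actions) from the current policy.
    behaviour_probs: (batch,) probabilities of the logged actions under the
    old policy that generated the data. actions, rewards: (batch,)."""
    log_probs = torch.log_softmax(policy_logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Importance-sampling ratios, detached and clipped for stability.
    weights = (chosen.exp() / behaviour_probs).detach().clamp(max=5.0)
    loss = -(weights * rewards * chosen).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

policy = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(policy.parameters(), lr=0.01)
states = torch.randn(32, 16)
loss = batch_policy_gradient_step(policy(states), torch.full((32,), 0.25),
                                  torch.randint(0, 4, (32,)), torch.randn(32), opt)
```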

Offline bilingual word vectors, orthogonal transformations and the inverted softmax

Authors: Samuel L. Smith, David H. P. Turban, Steven Hamblin, Nils Y. Hammerla

Abstract: Usually bilingual word vectors are trained "online". Mikolov et al. showed they can also be found "offline", whereby two pre-trained embeddings are aligned with a linear transformation, using dictionaries compiled from expert knowledge. In this work, we prove that the linear transformation between two spaces should be orthogonal. This transformation can be obtained using the singular value decomposition. We introduce a novel "inverted softmax" for identifying translation pairs, with which we improve the precision @1 of Mikolov's original mapping from 34% to 43%, when translating a test set composed of both common and rare English words into Italian. Orthogonal transformations are more robust to noise, enabling us to learn the transformation without expert bilingual signal by constructing a "pseudo-dictionary" from the identical character strings which appear in both languages, achieving 40% precision on the same test set. Finally, we extend our method to retrieve the true translations of English sentences from a corpus of 200k Italian sentences with a precision @1 of 68%.

URL: https://arxiv.org/abs/1702.03859

Notes: in my experience, we have also found orthogonal transformations useful for word vectors
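
The offline alignment itself is one SVD: for paired source and target vectors, the orthogonal map minimizing the Frobenius error is W = U V^T from the SVD of X^T Y (orthogonal Procrustes). A numpy sketch of the mapping plus the inverted-softmax retrieval idea; the inverse temperature and the random vectors are assumptions for illustration.

```python
import numpy as np

def orthogonal_map(X, Y):
    """X, Y: (n_pairs, dim) embeddings of dictionary pairs.
    Returns the orthogonal W minimizing ||X W - Y||_F (Procrustes)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def inverted_softmax_translate(src_vectors, tgt_vectors, W, beta=10.0):
    """For each source word, pick the target whose score, normalized over
    *source* words (the 'inverted' direction), is highest.
    Vectors are assumed to be length-normalized."""
    sims = (src_vectors @ W) @ tgt_vectors.T            # (n_src, n_tgt)
    scores = np.exp(beta * sims)
    scores /= scores.sum(axis=0, keepdims=True)         # normalize over sources
    return scores.argmax(axis=1)                        # best target index per source

X = np.random.randn(500, 300); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = np.random.randn(500, 300); Y /= np.linalg.norm(Y, axis=1, keepdims=True)
W = orthogonal_map(X[:100], Y[:100])   # fit on a (pseudo-)dictionary
picks = inverted_softmax_translate(X, Y, W)
```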

A Morphology-aware Network for Morphological Disambiguation

Authors: Eray Yildiz, Caglar Tirkaz, H. Bahadir Sahin, Mustafa Tolga Eren, Ozan Sonmez

Abstract: Agglutinative languages such as Turkish, Finnish and Hungarian require morphological disambiguation before further processing due to the complex morphology of words. A morphological disambiguator is used to select the correct morphological analysis of a word. Morphological disambiguation is important because it generally is one of the first steps of natural language processing and its performance affects subsequent analyses. In this paper, we propose a system that uses deep learning techniques for morphological disambiguation. Many of the state-of-the-art results in computer vision, speech recognition and natural language processing have been obtained through deep learning models. However, applying deep learning techniques to morphologically rich languages is not well studied. In this work, while we focus on Turkish morphological disambiguation we also present results for French and German in order to show that the proposed architecture achieves high accuracy with no language-specific feature engineering or additional resource. In the experiments, we achieve 84.12, 88.35 and 93.78 morphological disambiguation accuracy among the ambiguous words for Turkish, German and French respectively.

URL: https://arxiv.org/abs/1702.03654

Notes: should have a look at the corpora they used; I have my own work on that

Learning to Parse and Translate Improves Neural Machine Translation

Authors: Akiko Eriguchi, Yoshimasa Tsuruoka, Kyunghyun Cho

Abstract: There has been relatively little attention to incorporating linguistic prior to neural machine translation. Much of the previous work was further constrained to considering linguistic prior on the source side. In this paper, we propose a hybrid model, called NMT+RG, that learns to parse and translate by combining the recurrent neural network grammar into the attention-based neural machine translation. Our approach encourages the neural machine translation model to incorporate linguistic prior during training, and lets it translate on its own afterward. Extensive experiments with four language pairs show the effectiveness of the proposed NMT+RG.

URL: https://arxiv.org/abs/1702.03525

Notes: Cho's fresh article about NMT; the grammar here is an additional network

Exploring loss function topology with cyclical learning rates

Authors: Leslie N. Smith, Nicholay Topin

Abstract: We present observations and discussion of previously unreported phenomena discovered while training residual networks. The goal of this work is to better understand the nature of neural networks through the examination of these new empirical results. These behaviors were identified through the application of Cyclical Learning Rates (CLR) and linear network interpolation. Among these behaviors are counterintuitive increases and decreases in training loss and instances of rapid training. For example, we demonstrate how CLR can produce greater testing accuracy than traditional training despite using large learning rates. Files to replicate these results are available at this URL

URL: https://arxiv.org/abs/1702.04283

Notes: I've heard that this could allow us to improve results even after training is over; intriguing
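
The triangular cyclical schedule is only a few lines and can be plugged into any optimizer. A minimal sketch of the policy; the bounds and step size here are arbitrary, not the paper's values.

```python
def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Cyclical learning rate: linearly ramp from base_lr to max_lr and back,
    completing a full cycle every 2 * step_size iterations."""
    cycle_pos = iteration % (2 * step_size)
    x = abs(cycle_pos / step_size - 1.0)        # goes 1 -> 0 -> 1 over a cycle
    return base_lr + (max_lr - base_lr) * (1.0 - x)

for it in [0, 1000, 2000, 3000, 4000]:
    print(it, triangular_clr(it))   # base, mid, max, mid, base
```

PyTorch ships a ready-made version of this policy as torch.optim.lr_scheduler.CyclicLR.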

Frustratingly Short Attention Spans in Neural Language Modeling

Authors: Michał Daniluk, Tim Rocktäschel, Johannes Welbl, Sebastian Riedel

Abstract: Neural language models predict the next token using a latent representation of the immediate token history. Recently, various methods for augmenting neural language models with an attention mechanism over a differentiable memory have been proposed. For predicting the next token, these models query information from a memory of the recent history which can facilitate learning mid- and long-range dependencies. However, conventional attention mechanisms used in memory-augmented neural language models produce a single output vector per time step. This vector is used both for predicting the next token as well as for the key and value of a differentiable memory of a token history. In this paper, we propose a neural language model with a key-value attention mechanism that outputs separate representations for the key and value of a differentiable memory, as well as for encoding the next-word distribution. This model outperforms existing memory-augmented neural language models on two corpora. Yet, we found that our method mainly utilizes a memory of the five most recent output representations. This led to the unexpected main finding that a much simpler model based only on the concatenation of recent output representations from previous time steps is on par with more sophisticated memory-augmented neural language models.

URL: https://arxiv.org/abs/1702.04521

Notes: attention is said to be a sort of memory, and such a short memory is a surprise for language modeling

A Dependency-Based Neural Reordering Model for Statistical Machine Translation

Authors: Christian Hadiwinoto, Hwee Tou Ng

Abstract: In machine translation (MT) that involves translating between two languages with significant differences in word order, determining the correct word order of translated words is a major challenge. The dependency parse tree of a source sentence can help to determine the correct word order of the translated words. In this paper, we present a novel reordering approach utilizing a neural network and dependency-based embeddings to predict whether the translations of two source words linked by a dependency relation should remain in the same order or should be swapped in the translated sentence. Experiments on Chinese-to-English translation show that our approach yields a statistically significant improvement of 0.57 BLEU point on benchmark NIST test sets, compared to our prior state-of-the-art statistical MT system that uses sparse dependency-based reordering features.

URL: https://arxiv.org/abs/1702.04510

Notes: dependency-based neural reordering for statistical MT

Training Language Models Using Target-Propagation

Authors: Sam Wiseman, Sumit Chopra, Marc'Aurelio Ranzato, Arthur Szlam, Ruoyu Sun, Soumith Chintala, Nicolas Vasilache

Abstract: While Truncated Back-Propagation through Time (BPTT) is the most popular approach to training Recurrent Neural Networks (RNNs), it suffers from being inherently sequential (making parallelization difficult) and from truncating gradient flow between distant time-steps. We investigate whether Target Propagation (TPROP) style approaches can address these shortcomings. Unfortunately, extensive experiments suggest that TPROP generally underperforms BPTT, and we end with an analysis of this phenomenon, and suggestions for future work.

URL: https://arxiv.org/abs/1702.04770

Notes: the authors tried to reinvent backprop for RNNs and failed; praise them for sharing this experience

Latent Variable Dialogue Models and their Diversity

Authors: Kris Cao, Stephen Clark

Abstract: We present a dialogue generation model that directly captures the variability in possible responses to a given input, which reduces the 'boring output' issue of deterministic dialogue models. Experiments show that our model generates more diverse outputs than baseline models, and also generates more consistently acceptable output than sampling from a deterministic encoder-decoder model.

URL: https://arxiv.org/abs/1702.05962

Notes: hidden variable adaptation to dialog generation

Reproducing and learning new algebraic operations on word embeddings using genetic programming

Authors: Roberto Santana

Abstract: Word-vector representations associate a high dimensional real-vector to every word from a corpus. Recently, neural-network based methods have been proposed for learning this representation from large corpora. This type of word-to-vector embedding is able to keep, in the learned vector space, some of the syntactic and semantic relationships present in the original word corpus. This, in turn, serves to address different types of language classification tasks by doing algebraic operations defined on the vectors. The general practice is to assume that the semantic relationships between the words can be inferred by the application of a-priori specified algebraic operations. Our general goal in this paper is to show that it is possible to learn methods for word composition in semantic spaces. Instead of expressing the compositional method as an algebraic operation, we will encode it as a program, which can be linear, nonlinear, or involve more intricate expressions. More remarkably, this program will be evolved from a set of initial random programs by means of genetic programming (GP). We show that our method is able to reproduce the same behavior as human-designed algebraic operators. Using a word analogy task as benchmark, we also show that GP-generated programs are able to obtain accuracy values above those produced by the commonly used human-designed rule for algebraic manipulation of word vectors. Finally, we show the robustness of our approach by executing the evolved programs on the word2vec GoogleNews vectors, learned over 3 billion running words, and assessing their accuracy in the same word analogy task.

URL: https://arxiv.org/abs/1702.05624

Notes: genetic programming applied to word embeddings - that's a fresh view: the author evolves programs built from basic operations on word vectors

soc2seq: Social Embedding meets Conversation Model

Authors: Parminder Bhatia, Marsal Gavalda, Arash Einolghozati

Abstract: While liking or upvoting a post on a mobile app is easy to do, replying with a written note is much more difficult, due to both the cognitive load of coming up with a meaningful response as well as the mechanics of entering the text. Here we present a novel textual reply generation model that goes beyond the current auto-reply and predictive text entry models by taking into account the content preferences of the user, the idiosyncrasies of their conversational style, and even the structure of their social graph. Specifically, we have developed two types of models for personalized user interactions: a content-based conversation model, which makes use of location together with user information, and a social-graph-based conversation model, which combines content-based conversation models with social graphs.

URL: https://arxiv.org/abs/1702.05512

Notes: promo article from the Yak-Yak guys, pretty interesting: they combine persona-based style, location, and social graph embeddings to produce possible answers for social networks

An Attention-Based Deep Net for Learning to Rank

Authors: Baiyang Wang, Diego Klabjan

Abstract: In information retrieval, learning to rank constructs a machine-based ranking model which given a query, sorts the search results by their degree of relevance or importance to the query. Neural networks have been successfully applied to this problem, and in this paper, we propose an attention-based deep neural network which better incorporates different embeddings of the queries and search results with an attention-based mechanism. This model also applies a decoder mechanism to learn the ranks of the search results in a listwise fashion. The embeddings are trained with convolutional neural networks or the word2vec model. We demonstrate the performance of this model with image retrieval and text querying data sets.

URL: https://arxiv.org/abs/1702.06106

Notes: attention applied to LtR NNs

Collaborative Deep Reinforcement Learning

Authors: Kaixiang Lin, Shu Wang, Jiayu Zhou

Abstract: Besides independent learning, human learning process is highly improved by summarizing what has been learned, communicating it with peers, and subsequently fusing knowledge from different sources to assist the current learning goal. This collaborative learning procedure ensures that the knowledge is shared, continuously refined, and concluded from different perspectives to construct a more profound understanding. The idea of knowledge transfer has led to many advances in machine learning and data mining, but significant challenges remain, especially when it comes to reinforcement learning, heterogeneous model structures, and different learning tasks. Motivated by human collaborative learning, in this paper we propose a collaborative deep reinforcement learning (CDRL) framework that performs adaptive knowledge transfer among heterogeneous learning agents. Specifically, the proposed CDRL conducts a novel deep knowledge distillation method to address the heterogeneity among different learning tasks with a deep alignment network. Furthermore, we present an efficient collaborative Asynchronous Advantage Actor-Critic (cA3C) algorithm to incorporate deep knowledge distillation into the online training of agents, and demonstrate the effectiveness of the CDRL framework using extensive empirical evaluation on OpenAI gym.

URL: https://arxiv.org/abs/1702.05796

Notes: the guys are using knowledge distillation for collaboration; also really strange: the paper claims to appear in the proceedings of the ACL Woodstock 1997 conference, which seems very unlikely to me

Learning to Repeat: Fine Grained Action Repetition for Deep Reinforcement Learning

Authors: Sahil Sharma, Aravind S. Lakshminarayanan, Balaraman Ravindran

Abstract: Reinforcement Learning algorithms can learn complex behavioral patterns for sequential decision making tasks wherein an agent interacts with an environment and acquires feedback in the form of rewards sampled from it. Traditionally, such algorithms make decisions, i.e., select actions to execute, at every single time step of the agent-environment interactions. In this paper, we propose a novel framework, Fine Grained Action Repetition (FiGAR), which enables the agent to decide the action as well as the time scale of repeating it. FiGAR can be used for improving any Deep Reinforcement Learning algorithm which maintains an explicit policy estimate by enabling temporal abstractions in the action space. We empirically demonstrate the efficacy of our framework by showing performance improvements on top of three policy search algorithms in different domains: Asynchronous Advantage Actor Critic in the Atari 2600 domain, Trust Region Policy Optimization in Mujoco domain and Deep Deterministic Policy Gradients in the TORCS car racing domain.

URL: https://arxiv.org/abs/1702.06054

Notes: action repetition on almost every SotA algo in RL

Collaborative Deep Reinforcement Learning for Joint Object Search

Authors: Xiangyu Kong, Bo Xin, Yizhou Wang, Gang Hua

Abstract: We examine the problem of joint top-down active search of multiple objects under interaction, e.g., person riding a bicycle, cups held by the table, etc.. Such objects under interaction often can provide contextual cues to each other to facilitate more efficient search. By treating each detector as an agent, we present the first collaborative multi-agent deep reinforcement learning algorithm to learn the optimal policy for joint active object localization, which effectively exploits such beneficial contextual information. We learn inter-agent communication through cross connections with gates between the Q-networks, which is facilitated by a novel multi-agent deep Q-learning algorithm with joint exploitation sampling. We verify our proposed method on multiple object detection benchmarks. Not only does our model help to improve the performance of state-of-the-art active localization models, it also reveals interesting co-detection patterns that are intuitively interpretable.

URL: https://arxiv.org/abs/1702.05573

Notes: gated cross-connections between networks with different experience, some kind of knowledge distillation in my opinion

Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks

Authors: Luo Chunjie, Zhan Jianfeng, Wang Lei, Yang Qiang

Abstract: Traditionally, multi-layer neural networks use dot product between the output vector of previous layer and the incoming weight vector as the input to activation function. The result of dot product is unbounded, thus causes large variance. Large variance makes the model sensitive to the change of input distribution, thus results in bad generalization and aggravates the internal covariate shift. To bound dot product, we propose to use cosine similarity instead of dot product in neural network, which we call cosine normalization. Experiments show that cosine normalization in fully connected neural networks can reduce the test error with lower divergence compared with other normalization techniques. Applied to convolutional networks, cosine normalization also significantly enhances the accuracy of classification.

URL: https://arxiv.org/abs/1702.05870

Notes: these guys are saying that cosine normalization is better (in a way) than batch norm - only on MNIST, but that could be useful knowledge
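
The proposed change replaces the dot product w·x in each unit with the cosine between w and x, bounding the pre-activation to [-1, 1]. A minimal PyTorch layer written from that description (a sketch, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineLinear(nn.Module):
    """Linear layer whose output is the cosine similarity between each
    weight vector and the input, instead of the unbounded dot product."""
    def __init__(self, in_features, out_features, eps=1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.eps = eps

    def forward(self, x):                        # x: (batch, in_features)
        x_norm = F.normalize(x, dim=1, eps=self.eps)
        w_norm = F.normalize(self.weight, dim=1, eps=self.eps)
        return x_norm @ w_norm.t()               # values bounded in [-1, 1]

layer = CosineLinear(784, 10)
out = layer(torch.randn(32, 784))
print(out.min().item(), out.max().item())        # stays within [-1, 1]
```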

On Loss Functions for Deep Neural Networks in Classification

Authors: Katarzyna Janocha, Wojciech Marian Czarnecki

Abstract: Deep neural networks are currently among the most commonly used classifiers. Despite easily achieving very good performance, one of the best selling points of these models is their modular design - one can conveniently adapt their architecture to specific needs, change connectivity patterns, attach specialised layers, experiment with a large amount of activation functions, normalisation schemes and many others. While one can find impressively wide spread of various configurations of almost every aspect of the deep nets, one element is, in authors' opinion, underrepresented - while solving classification problems, vast majority of papers and applications simply use log loss. In this paper we try to investigate how particular choices of loss functions affect deep models and their learning dynamics, as well as resulting classifiers robustness to various effects. We perform experiments on classical datasets, as well as provide some additional, theoretical insights into the problem. In particular we show that L1 and L2 losses are, quite surprisingly, justified classification objectives for deep nets, by providing probabilistic interpretation in terms of expected misclassification. We also introduce two losses which are not typically used as deep nets objectives and show that they are viable alternatives to the existing ones.

URL: https://arxiv.org/abs/1702.05659

Notes: a comparison of the most widely used losses, very nice
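
A tiny sketch of the kind of comparison the paper makes: the same network outputs scored with log loss versus L1/L2 applied to the probability vector. This only illustrates the objectives themselves, not the paper's experimental protocol.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, requires_grad=True)
targets = torch.randint(0, 10, (4,))
one_hot = F.one_hot(targets, num_classes=10).float()
probs = torch.softmax(logits, dim=1)

losses = {
    "log_loss": F.cross_entropy(logits, targets),
    "L1": (probs - one_hot).abs().sum(dim=1).mean(),
    "L2": ((probs - one_hot) ** 2).sum(dim=1).mean(),
}
for name, value in losses.items():
    print(name, value.item())
```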

Dataset Augmentation in Feature Space

Authors: Terrance DeVries, Graham W. Taylor

Abstract: Dataset augmentation, the practice of applying a wide array of domain-specific transformations to synthetically expand a training set, is a standard tool in supervised learning. While effective in tasks such as visual recognition, the set of transformations must be carefully designed, implemented, and tested for every new domain, limiting its re-use and generality. In this paper, we adopt a simpler, domain-agnostic approach to dataset augmentation. We start with existing data points and apply simple transformations such as adding noise, interpolating, or extrapolating between them. Our main insight is to perform the transformation not in input space, but in a learned feature space. A re-kindling of interest in unsupervised representation learning makes this technique timely and more effective. It is a simple proposal, but to-date one that has not been tested empirically. Working in the space of context vectors generated by sequence-to-sequence models, we demonstrate a technique that is effective for both static and sequential data.

URL: https://arxiv.org/abs/1702.05538

Notes: augmentation in feature space - interesting; could be used when we have an already trained net (say, AlexNet) and relatively little data
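
The three feature-space transformations described above (adding noise, interpolating, extrapolating between same-class neighbours) are trivial to write down once features are extracted. A numpy sketch; the mixing coefficient and noise scale are assumed values.

```python
import numpy as np

def augment_features(c1, c2, mode="extrapolate", lam=0.5, sigma=0.1):
    """c1, c2: feature/context vectors of two same-class examples.
    Returns a synthetic feature vector built in the learned feature space."""
    if mode == "noise":
        return c1 + np.random.normal(0.0, sigma, size=c1.shape)
    if mode == "interpolate":
        return c1 + lam * (c2 - c1)
    if mode == "extrapolate":
        return c1 + lam * (c1 - c2)
    raise ValueError(mode)

f1, f2 = np.random.randn(128), np.random.randn(128)   # e.g. seq2seq context vectors
new_point = augment_features(f1, f2, mode="extrapolate")
```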

An Extended Framework for Marginalized Domain Adaptation

Authors: Gabriela Csurka, Boris Chidlovski, Stephane Clinchant, Sophia Michel

Abstract: We propose an extended framework for marginalized domain adaptation, aimed at addressing unsupervised, supervised and semi-supervised scenarios. We argue that the denoising principle should be extended to explicitly promote domain-invariant features as well as help the classification task. Therefore we propose to jointly learn the data auto-encoders and the target classifiers. First, in order to make the denoised features domain-invariant, we propose a domain regularization that may be either a domain prediction loss or a maximum mean discrepancy between the source and target data. The noise marginalization in this case is reduced to solving the linear matrix system AX=B which has a closed-form solution. Second, in order to help the classification, we include a class regularization term. Adding this component reduces the learning problem to solving a Sylvester linear matrix equation AX+BX=C, for which an efficient iterative procedure exists as well. We did an extensive study to assess how these regularization terms improve the baseline performance in the three domain adaptation scenarios and present experimental results on two image and one text benchmark datasets, conventionally used for validating domain adaptation methods. We report our findings and comparison with state-of-the-art methods.

URL: https://arxiv.org/abs/1702.05993

Notes: additional loss for transfer learning, some improvements are actually shown

Revisiting Perceptron: Efficient and Label-Optimal Active Learning of Halfspaces

Authors: Songbai Yan, Chicheng Zhang

Abstract: It has been a long-standing problem to efficiently learn a linear separator using as few labels as possible. In this work, we propose an efficient perceptron-based algorithm for actively learning homogeneous linear separators under uniform distribution. Under bounded noise, where each label is flipped with probability at most $\eta$, our algorithm achieves near-optimal $\tilde{O}(\frac{d}{(1-2\eta)^2}\log(\frac{1}{\epsilon}))$ label complexity in time $\tilde{O}(\frac{d^2}{\epsilon(1-2\eta)^2})$, and significantly improves over the best known result (Awasthi et al., 2016). Under adversarial noise, where at most a $\nu$ fraction of labels can be flipped, our algorithm achieves near-optimal $\tilde{O}(\log(\frac{1}{\epsilon}))$ label complexity in time $\tilde{O}(\frac{d^2}{\epsilon})$, which is better than the best known label complexity and time complexity in Awasthi et al. (2014).

URL: https://arxiv.org/abs/1702.05581

Notes: active learning of a perceptron - so 1950s and so 2000s at the same time

Distributed Second-Order Optimization Using Kronecker-Factored Approximations

Authors: Jimmy Ba, Roger Grosse, James Martens

Abstract: As more computational resources become available, machine learning researchers train ever larger neural networks on millions of data points using stochastic gradient descent (SGD). Although SGD scales well in terms of both the size of dataset and the number of parameters of the model, it has rapidly diminishing returns as parallel computing resources increase. Second-order optimization methods have an affinity for well-estimated gradients and large mini-batches, and can therefore benefit much more from parallel computation in principle. Unfortunately, they often employ severe approximations to the curvature matrix in order to scale to large models with millions of parameters, limiting their effectiveness in practice versus well-tuned SGD with momentum. The recently proposed K-FAC method (Martens and Grosse, 2015) uses a stronger and more sophisticated curvature approximation, and has been shown to make much more per-iteration progress than SGD, while only introducing a modest overhead. In this paper, we develop a version of K-FAC that distributes the computation of gradients and additional quantities required by K-FAC across multiple machines, thereby taking advantage of the method’s superior scaling to large mini-batches and mitigating its additional overheads. We provide a Tensorflow implementation of our approach which is easy to use and can be applied to many existing codebases without modification. Additionally, we develop several algorithmic enhancements to K-FAC which can improve its computational performance for very large models. Finally, we show that our distributed K-FAC method speeds up training of various state-of-the-art ImageNet classification models by a factor of two compared to an improved form of Batch Normalization (Ioffe and Szegedy, 2015).

URL: https://jimmylba.github.io/papers/nsync.pdf

Notes: Optimisation which makes nets 50% faster, wow. Second-order methods haven't been used much in the past due to their complexity and seeming slowness.

Mimicking Ensemble Learning with Deep Branched Networks

Authors: Byungju Kim, Youngsoo Kim, Yeakang Lee, Junmo Kim

Abstract: This paper proposes a branched residual network for image classification. It is known that high-level features of deep neural network are more representative than lower-level features. By sharing the low-level features, the network can allocate more memory to high-level features. The upper layers of our proposed network are branched, so that it mimics the ensemble learning. By mimicking ensemble learning with single network, we have achieved better performance on ImageNet classification task.

URL: https://arxiv.org/abs/1702.06376

Notes: really short paper from KAIST, about branching in resnets

Style Transfer Generative Adversarial Networks: Learning to Play Chess Differently

Authors: Muthuraman Chidambaram, Yanjun Qi

Abstract: The idea of style transfer has largely only been explored in image-based tasks, which we attribute in part to the specific nature of loss functions used for style transfer. We propose a general formulation of style transfer as an extension of generative adversarial networks, by using a discriminator to regularize a generator with an otherwise separate loss function. We apply our approach to the task of learning to play chess in the style of a specific player, and present empirical evidence for the viability of our approach.

URL: https://arxiv.org/abs/1702.06762

Notes: next step in style transfer - playing chess; the salt is in the loss function and nothing more

Learning to Draw Dynamic Agent Goals with Generative Adversarial Networks

Authors: Shariq Iqbal, John Pearson

Abstract: We address the problem of designing artificial agents capable of reproducing human behavior in a competitive game involving dynamic control. Given data consisting of multiple realizations of inputs generated by pairs of interacting players, we model each agent's actions as governed by a time-varying latent goal state coupled to a control model. These goals, in turn, are described as stochastic processes evolving according to player-specific value functions depending on the current state of the game. We model these value functions using generative adversarial networks (GANs) and show that our GAN-based approach succeeds in producing sample gameplay that captures the rich dynamics of human agents. The latent goal dynamics inferred and generated by our model has applications to fields like neuroscience and animal behavior, where the underlying value functions themselves are of theoretical interest.

URL: https://arxiv.org/abs/1702.07319

Notes: the GAN here takes the place of RL; I think this could lead to the simultaneous use of GANs and RL in such tasks

Data Distillation for Controlling Specificity in Dialogue Generation

Authors: Jiwei Li, Will Monroe, Dan Jurafsky

Abstract: People speak at different levels of specificity in different situations, depending on their knowledge, interlocutors, mood, etc. A conversational agent should have this ability and know when to be specific and when to be general. We propose an approach that gives a neural network-based conversational agent this ability. Our approach involves alternating between data distillation and model training: removing training examples that are closest to the responses most commonly produced by the model trained in the last round, and then retraining the model on the remaining dataset. Dialogue generation models trained with different degrees of data distillation manifest different levels of specificity. We then train a reinforcement learning system for selecting among this pool of generation models, to choose the best level of specificity for a given input. Compared to the original generative model trained without distillation, the proposed system is capable of generating more interesting and higher-quality responses, in addition to appropriately adjusting specificity depending on the context. Our research constitutes a specific case of a broader approach involving training multiple subsystems from a single dataset distinguished by differences in a specific property one wishes to model. We show that from such a set of subsystems, one can use reinforcement learning to build a system that tailors its output to different input contexts at test time.

URL: https://arxiv.org/abs/1702.06703

Notes: Jurafsky et al. are adding more diversification to dialog-agent training; this could be interpreted as a form of active learning
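
A hedged sketch of the alternating distillation loop described in the abstract. `train_model` is a hypothetical callable standing in for whatever seq2seq trainer one uses, and Jaccard token overlap stands in for the paper's notion of response similarity; both are assumptions for illustration.

```python
from collections import Counter

def token_overlap(a, b):
    """Crude Jaccard token overlap standing in for the paper's response similarity."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(1, len(ta | tb))

def distillation_rounds(pairs, train_model, n_rounds=3, drop_frac=0.1, top_k=5):
    """Alternate training and data distillation, returning one model per round.

    pairs:       list of (context, response) training examples
    train_model: hypothetical callable mapping a dataset to a model exposing .respond(context)
    """
    models, data = [], list(pairs)
    for _ in range(n_rounds):
        model = train_model(data)
        models.append(model)
        # responses the current model produces most often (the "generic" ones)
        frequent = [r for r, _ in Counter(model.respond(c) for c, _ in data).most_common(top_k)]
        # keep the examples least similar to those generic responses, drop the rest
        data.sort(key=lambda cr: max(token_overlap(cr[1], g) for g in frequent))
        data = data[: int(len(data) * (1 - drop_frac))]
    return models
```

The later rounds therefore see progressively more specific data, which is what gives the pool of models different specificity levels for the RL selector to choose among.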

Context-Aware Prediction of Derivational Word-forms

Authors: Ekaterina Vylomova, Ryan Cotterell, Timothy Baldwin, Trevor Cohn

Abstract: Derivational morphology is a fundamental and complex characteristic of language. In this paper we propose the new task of predicting the derivational form of a given base-form lemma that is appropriate for a given context. We present an encoder--decoder style neural network to produce a derived form character-by-character, based on its corresponding character-level representation of the base form and the context. We demonstrate that our model is able to generate valid context-sensitive derivations from known base forms, but is less accurate under a lexicon agnostic setting.

URL: https://arxiv.org/abs/1702.06675

Notes: Katya, you're the best! Seriously, an interesting problem solved with a char-level model

Tackling Error Propagation through Reinforcement Learning: A Case of Greedy Dependency Parsing

Authors: Minh Le, Antske Fokkens

Abstract: Error propagation is a common problem in NLP. Reinforcement learning explores erroneous states during training and can therefore be more robust when mistakes are made early in a process. In this paper, we apply reinforcement learning to greedy dependency parsing which is known to suffer from error propagation. Reinforcement learning improves accuracy of both labeled and unlabeled dependencies of the Stanford Neural Dependency Parser, a high performance greedy parser, while maintaining its efficiency. We investigate the portion of errors which are the result of error propagation and confirm that reinforcement learning reduces the occurrence of error propagation.

URL: https://arxiv.org/abs/1702.06794

Notes: RL applied to dependency parsing

Active One-shot Learning

Authors: Mark Woodward, Chelsea Finn

Abstract: Recent advances in one-shot learning have produced models that can learn from a handful of labeled examples, for passive classification and regression tasks. This paper combines reinforcement learning with one-shot learning, allowing the model to decide, during classification, which examples are worth labeling. We introduce a classification task in which a stream of images are presented and, on each time step, a decision must be made to either predict a label or pay to receive the correct label. We present a recurrent neural network based action-value function, and demonstrate its ability to learn how and when to request labels. Through the choice of reward function, the model can achieve a higher prediction accuracy than a similar model on a purely supervised task, or trade prediction accuracy for fewer label requests.

URL: https://arxiv.org/abs/1702.06559

Notes: RL approach (action-value) applied to active learning
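
A tiny sketch of the predict-or-request reward structure the abstract describes. The concrete reward values below are illustrative assumptions; the point is the accuracy-versus-label-cost trade-off the recurrent action-value function learns to manage.

```python
def one_shot_step_reward(action, prediction, true_label,
                         r_correct=1.0, r_wrong=-1.0, r_request=-0.05):
    """Per-step reward for the active one-shot setting (values are illustrative).

    action: "predict" or "request". Requesting pays a small cost but reveals
    the label; predicting is rewarded or penalised by correctness.
    Returns (reward, observed_label).
    """
    if action == "request":
        return r_request, true_label              # pay to see the correct label
    return (r_correct if prediction == true_label else r_wrong), None
```

Shifting the ratio of `r_request` to `r_wrong` is what trades prediction accuracy against the number of label requests, as in the paper's experiments.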

The Shattered Gradients Problem: If resnets are the answer, then what is the question?

Authors: David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, Brian McWilliams

Abstract: A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. The problem has largely been overcome through the introduction of carefully constructed initializations and batch normalization. Nevertheless, architectures incorporating skip-connections such as resnets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth resulting in gradients that resemble white noise. In contrast, the gradients in architectures with skip-connections are far more resistant to shattering decaying sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new "looks linear" (LL) initialization that prevents shattering. Preliminary experiments show the new initialization allows to train very deep networks without the addition of skip-connections.

URL: https://arxiv.org/abs/1702.08591

Notes: report on resnets and vanishing gradients
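
A minimal numpy sketch of the "looks linear" idea, assuming the common mirrored-block formulation with concatenated ReLU: at initialization the layer computes an exactly linear map, so gradients cannot shatter. The orthogonal base matrix is my choice of initializer, not necessarily the paper's exact recipe.

```python
import numpy as np

def crelu(x):
    """Concatenated ReLU: [relu(x), relu(-x)] along the feature axis."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=-1)

def looks_linear_init(d_in, d_out, rng):
    """Mirrored weight block W = [W0; -W0] for a CReLU layer.

    Since crelu(x) @ W = relu(x) @ W0 - relu(-x) @ W0 = x @ W0,
    the layer is exactly linear at initialization.
    """
    W0 = np.linalg.qr(rng.normal(size=(d_in, d_out)))[0]   # orthonormal base
    return np.concatenate([W0, -W0], axis=0)               # shape (2*d_in, d_out)

# toy usage: at init the nonlinear layer reproduces the linear map x @ W0
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W = looks_linear_init(8, 8, rng)
h = crelu(x) @ W
```

Training then moves the two halves of W apart, so nonlinearity appears gradually instead of being imposed (and shattered) from step one.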

Deep Clustering using Auto-Clustering Output Layer

Authors: Ozsel Kilinc, Ismail Uysal

Abstract: In this paper, we propose a novel method to enrich the representation provided to the output layer of feedforward neural networks in the form of an auto-clustering output layer (ACOL) which enables the network to naturally create sub-clusters under the provided main class labels. In addition, a novel regularization term is introduced which allows ACOL to encourage the neural network to reveal its own explicit clustering objective. While the underlying process of finding the subclasses is completely unsupervised, semi-supervised learning is also possible based on the provided classification objective. The results show that ACOL can achieve a 99.2% clustering accuracy for the semi-supervised case when partial class labels are presented and a 96% accuracy for the unsupervised clustering case. These findings represent a paradigm shift especially when it comes to harnessing the power of deep networks for primary and secondary clustering applications in large datasets.

URL: https://arxiv.org/abs/1702.08648

Notes: could be useful in clustering project

Coherent Dialogue with Attention-Based Language Models

Authors: Hongyuan Mei, Mohit Bansal, Matthew R. Walter

Abstract: We model coherent conversation continuation via RNN-based dialogue models equipped with a dynamic attention mechanism. Our attention-RNN language model dynamically increases the scope of attention on the history as the conversation continues, as opposed to standard attention (or alignment) models with a fixed input scope in a sequence-to-sequence model. This allows each generated word to be associated with the most relevant words in its corresponding conversation history. We evaluate the model on two popular dialogue datasets, the open-domain MovieTriples dataset and the closed-domain Ubuntu Troubleshoot dataset, and achieve significant improvements over the state-of-the-art and baselines on several metrics, including complementary diversity-based metrics, human evaluation, and qualitative visualizations. We also show that a vanilla RNN with dynamic attention outperforms more complex memory models (e.g., LSTM and GRU) by allowing for flexible, long-distance memory. We promote further coherence via topic modeling-based reranking.

URL: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14164

Notes: a detailed look at attention in RNN-based dialogue language models

On the Origin of Deep Learning

Authors: Haohan Wang, Bhiksha Raj

Abstract: This paper is a review of the evolutionary history of deep learning models. It covers from the genesis of neural networks when associationism modeling of the brain is studied, to the models that dominate the last decade of research in deep learning like convolutional neural networks, deep belief networks, and recurrent neural networks. In addition to a review of these models, this paper primarily focuses on the precedents of the models above, examining how the initial ideas are assembled to construct the early models and how these preliminary models are developed into their current forms. Many of these evolutionary paths last more than half a century and have a diversity of directions. For example, CNN is built on prior knowledge of biological vision system; DBN is evolved from a trade-off of modeling power and computation complexity of graphical models and many nowadays models are neural counterparts of ancient linear models. This paper reviews these evolutionary paths and offers a concise thought flow of how these models are developed, and aims to provide a thorough background for deep learning. More importantly, along with the path, this paper summarizes the gist behind these milestones and proposes many directions to guide the future research of deep learning.

URL: https://arxiv.org/abs/1702.07800

Notes: long review of history of deep learning

Bridging the Gap Between Value and Policy Based Reinforcement Learning

Authors: Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans

Abstract: We formulate a new notion of softmax temporal consistency that generalizes the standard hard-max Bellman consistency usually considered in value based reinforcement learning (RL). In particular, we show how softmax consistent action values correspond to optimal policies that maximize entropy regularized expected reward. More importantly, we establish that softmax consistent action values and the optimal policy must satisfy a mutual compatibility property that holds across any state-action subsequence. Based on this observation, we develop a new RL algorithm, Path Consistency Learning (PCL), that minimizes the total inconsistency measured along multi-step subsequences extracted from both on- and off-policy traces. An experimental evaluation demonstrates that PCL significantly outperforms strong actor-critic and Q-learning baselines across several benchmark tasks.

URL: https://arxiv.org/abs/1702.08892

Notes: differentiable RL from Google Brain, or at least it seems to be; could be useful
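
A minimal sketch of the path-consistency error that PCL minimizes over multi-step sub-trajectories, assuming the usual entropy-regularized ("soft") formulation; the discount, temperature, and path length are illustrative, and the real algorithm differentiates this through value and policy networks.

```python
import numpy as np

def path_consistency_loss(values, log_pis, rewards, gamma=0.99, tau=0.01):
    """Squared soft-consistency error for one sub-trajectory (PCL sketch).

    values:  length d+1 array of V(s_i), ..., V(s_{i+d})
    log_pis: length d   array of log pi(a_j | s_j) along the path
    rewards: length d   array of rewards along the path
    """
    d = len(rewards)
    discounts = gamma ** np.arange(d)
    soft_return = np.sum(discounts * (rewards - tau * log_pis))  # entropy-regularized return
    consistency = -values[0] + gamma ** d * values[-1] + soft_return
    return consistency ** 2

# toy usage on a random 5-step path
rng = np.random.default_rng(0)
loss = path_consistency_loss(rng.normal(size=6), -np.abs(rng.normal(size=5)), rng.normal(size=5))
```

Because the same consistency holds on- and off-policy, the loss can be applied to replayed trajectories as well, which is the "bridging" the title refers to.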

Rationalization: A Neural Machine Translation Approach to Generating Natural Language Explanations

Authors: Brent Harrison, Upol Ehsan, Mark O. Riedl

Abstract: We introduce AI rationalization, an approach for generating explanations of autonomous system behavior as if a human had done the behavior. We describe a rationalization technique that uses neural machine translation to translate internal state-action representations of the autonomous agent into natural language. We evaluate our technique in the Frogger game environment. The natural language is collected from human players thinking out loud as they play the game. We motivate the use of rationalization as an approach to explanation generation, show the results of experiments on the accuracy of our rationalization technique, and describe future research agenda.

URL: https://arxiv.org/abs/1702.07826

Notes: uses people's think-aloud commentary during gameplay as a dataset for training a model that transcribes the agent's decisions into natural language

Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning

Authors: Jason D. Williams, Kavosh Asadi, Geoffrey Zweig

Abstract: End-to-end learning of recurrent neural networks (RNNs) is an attractive solution for dialog systems; however, current techniques are data-intensive and require thousands of dialogs to learn simple behaviors. We introduce Hybrid Code Networks (HCNs), which combine an RNN with domain-specific knowledge encoded as software and system action templates. Compared to existing end-to-end approaches, HCNs considerably reduce the amount of training data required, while retaining the key benefit of inferring a latent representation of dialog state. In addition, HCNs can be optimized with supervised learning, reinforcement learning, or a mixture of both. HCNs attain state-of-the-art performance on the bAbI dialog dataset, and outperform two commercially deployed customer-facing dialog systems.

URL: https://arxiv.org/abs/1702.03274

Notes: very similar to the Germans' paper; the architecture is really close

Sequence Modeling via Segmentations

Authors: Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohamed, Dengyong Zhou, Li Deng

Abstract: Segmental structure is a common pattern in many types of sequences such as phrases in human languages. In this paper, we present a probabilistic model for sequences via their segmentations. The probability of a segmented sequence is calculated as the product of the probabilities of all its segments, where each segment is modeled using existing tools such as recurrent neural networks. Since the segmentation of a sequence is usually unknown in advance, we sum over all valid segmentations to obtain the final probability for the sequence. An efficient dynamic programming algorithm is developed for forward and backward computations without resorting to any approximation. We demonstrate our approach on text segmentation and speech recognition tasks. In addition to quantitative results, we also show that our approach can discover meaningful segments in their respective application contexts.

URL: https://arxiv.org/abs/1702.07463

Notes: tricky sequence modeling over small sub-sequences; comes with its own dynamic-programming algorithm for computing gradients in backprop
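
A hedged sketch of the forward dynamic program that sums over all segmentations of a sequence. `seg_logprob` is a hypothetical plug-in segment scorer (an RNN in the paper); the uniform toy model in the usage line is only there so the snippet runs.

```python
import numpy as np

def marginal_over_segmentations(seg_logprob, T, max_len=None):
    """Forward DP summing over every valid segmentation of a length-T sequence.

    seg_logprob(j, t) -> log p(segment covering positions j..t-1).
    Returns the log of the total sequence probability.
    """
    max_len = max_len or T
    alpha = np.full(T + 1, -np.inf)
    alpha[0] = 0.0                                    # empty prefix has log-prob 0
    for t in range(1, T + 1):
        starts = range(max(0, t - max_len), t)
        terms = [alpha[j] + seg_logprob(j, t) for j in starts]
        alpha[t] = np.logaddexp.reduce(terms)         # sum over possible last segments
    return alpha[T]

# toy usage: a uniform segment model over a sequence of length 5
logZ = marginal_over_segmentations(lambda j, t: -1.5 * (t - j), T=5)
```

The same table, run backwards, yields exact gradients for every segment score, which is why no approximation is needed.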

2017-03

End-to-End Task-Completion Neural Dialogue Systems

Authors: Xuijun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao

Abstract: This paper presents an end-to-end learning framework for task-completion neural dialogue systems, which leverages supervised and reinforcement learning with various deep-learning models. The system is able to interface with a structured database, and interact with users for assisting them to access information and complete tasks such as booking movie tickets. Our experiments in a movie-ticket booking domain show the proposed system outperforms a modular-based dialogue system and is more robust to noise produced by other components in the system.

URL: https://arxiv.org/abs/1703.01008

Notes: another RL-based dialog system, presented with little architectural detail and without convincing evidence on datasets

Controllable Text Generation

Authors: Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, Eric P. Xing

Abstract: Generic generation and manipulation of text is challenging and has limited success compared to recent deep generative modeling in visual domain. This paper aims at generating plausible natural language sentences, whose attributes are dynamically controlled by learning disentangled latent representations with designated semantics. We propose a new neural generative model which combines variational auto-encoders and holistic attribute discriminators for effective imposition of semantic structures. With differentiable approximation to discrete text samples, explicit constraints on independent attribute controls, and efficient collaborative learning of generator and discriminators, our model learns highly interpretable representations from even only word annotations, and produces realistic sentences with desired attributes. Quantitative evaluation validates the accuracy of sentence and attribute generation.

URL: https://arxiv.org/abs/1703.00955

Notes: really nice work from Ruslan Salakhutdinov, working sentence-GAN

FeUdal Networks for Hierarchical Reinforcement Learning

Authors: Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, Koray Kavukcuoglu

Abstract: We introduce FeUdal Networks (FuNs): a novel architecture for hierarchical reinforcement learning. Our approach is inspired by the feudal reinforcement learning proposal of Dayan and Hinton, and gains power and efficacy by decoupling end-to-end learning across multiple levels — allowing it to utilise different resolutions of time. Our framework employs a Manager module and a Worker module. The Manager operates at a lower temporal resolution and sets abstract goals which are conveyed to and enacted by the Worker. The Worker generates primitive actions at every tick of the environment. The decoupled structure of FuN conveys several benefits — in addition to facilitating very long timescale credit assignment it also encourages the emergence of sub-policies associated with different goals set by the Manager. These properties allow FuN to dramatically outperform a strong baseline agent on tasks that involve long-term credit assignment or memorisation. We demonstrate the performance of our proposed system on a range of tasks from the ATARI suite and also from a 3D DeepMind Lab environment.

URL: https://arxiv.org/abs/1703.01161

Notes: DeepMind paper on hierarchical RL

Generative and Discriminative Text Classification with Recurrent Neural Networks

Authors: Dani Yogatama, Chris Dyer, Wang Ling, Phil Blunsom

Abstract: We empirically characterize the performance of discriminative and generative LSTM models for text classification. We find that although RNN-based generative models are more powerful than their bag-of-words ancestors (e.g., they account for conditional dependencies across words in a document), they have higher asymptotic error rates than discriminatively trained RNN models. However we also find that generative models approach their asymptotic error rate more rapidly than their discriminative counterparts---the same pattern that Ng & Jordan (2001) proved holds for linear classification models that make more naive conditional independence assumptions. Building on this finding, we hypothesize that RNN-based generative classification models will be more robust to shifts in the data distribution. This hypothesis is confirmed in a series of experiments in zero-shot and continual learning settings that show that generative models substantially outperform discriminative models.

URL: https://arxiv.org/abs/1703.01898

Notes: DeepMind paper on RNN text classification; zero-shot learning as a bonus; unsupervised pre-training, then freezing and fine-tuning on labeled data

All You Need is Beyond a Good Init: Exploring Better Solution for Training Extremely Deep Convolutional Neural Networks with Orthonormality and Modulation

Authors: Di Xie, Jiang Xiong, Shiliang Pu

Abstract: Deep neural network is difficult to train and this predicament becomes worse as the depth increases. The essence of this problem exists in the magnitude of backpropagated errors that will result in gradient vanishing or exploding phenomenon. We show that a variant of regularizer which utilizes orthonormality among different filter banks can alleviate this problem. Moreover, we design a backward error modulation mechanism based on the quasi-isometry assumption between two consecutive parametric layers. Equipped with these two ingredients, we propose several novel optimization solutions that can be utilized for training a specific-structured (repetitive triple modules of Conv-BN-ReLU) extremely deep convolutional neural network (CNN) WITHOUT any shortcuts/identity mappings from scratch. Experiments show that our proposed solutions can achieve 4% improvement for a 44-layer plain network and almost 50% improvement for a 110-layer plain network on the CIFAR-10 dataset. Moreover, we can successfully train plain CNNs to match the performance of the residual counterparts. Besides, we propose new principles for designing network structure from the insights evoked by orthonormality. Combined with residual structure, we achieve comparative performance on the ImageNet dataset.

URL: https://arxiv.org/abs/1703.01827

Notes: orthonormality is a really important property during training, as far as I can see; the results are interesting and could be useful
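
A minimal sketch of the kind of orthonormality regularizer the abstract refers to, assuming the standard squared Frobenius penalty on the Gram matrix of flattened filter banks; the paper's backward error modulation mechanism is not reproduced here.

```python
import numpy as np

def orthonormal_penalty(W):
    """||W^T W - I||_F^2 penalty encouraging orthonormal filter banks.

    W: (d_in, d_out) weight matrix; flatten conv kernels to 2-D first.
    """
    d_out = W.shape[1]
    gram = W.T @ W
    return np.sum((gram - np.eye(d_out)) ** 2)

# toy usage: near zero for an orthonormal matrix, large for a random one
rng = np.random.default_rng(0)
Q = np.linalg.qr(rng.normal(size=(64, 32)))[0]
print(orthonormal_penalty(Q), orthonormal_penalty(rng.normal(size=(64, 32))))
```

Adding this term to the training loss keeps backpropagated error magnitudes roughly isometric across layers, which is the intuition behind training very deep plain networks without shortcuts.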

Multi-step Reinforcement Learning: A Unifying Algorithm

Authors: Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, Richard S. Sutton

Abstract: Unifying seemingly disparate algorithmic ideas to produce better performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, TD(λ) elegantly unifies one-step TD prediction with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter λ. Currently, there are a multitude of algorithms that can be used to perform TD control, including Sarsa, Q-learning, and Expected Sarsa. These methods are often studied in the one-step case, but they can be extended across multiple time steps to achieve better performance. Each of these algorithms is seemingly distinct, and no one dominates the others for all problems. In this paper, we study a new multi-step action-value algorithm called Q(σ) which unifies and generalizes these existing algorithms, while subsuming them as special cases. A new parameter, σ, is introduced to allow the degree of sampling performed by the algorithm at each step during its backup to be continuously varied, with Sarsa existing at one extreme (full sampling), and Expected Sarsa existing at the other (pure expectation). Q(σ) is generally applicable to both on- and off-policy learning, but in this work we focus on experiments in the on-policy case. Our results show that an intermediate value of σ, which results in a mixture of the existing algorithms, performs better than either extreme. The mixture can also be varied dynamically which can result in even greater performance.

URL: https://arxiv.org/abs/1703.01327

Notes: new paper from Sutton about unified RL-algo
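
A one-step sketch of the Q(σ) backup, assuming the standard formulation that interpolates between Sarsa (σ = 1, full sampling) and Expected Sarsa (σ = 0, pure expectation); multi-step returns and off-policy corrections are omitted.

```python
import numpy as np

def q_sigma_target(reward, q_next, pi_next, a_next, sigma, gamma=0.99):
    """One-step Q(sigma) backup target.

    q_next:  array of Q(s', a) for all actions a
    pi_next: target-policy probabilities pi(a | s')
    a_next:  index of the action actually sampled in s'
    """
    sampled = q_next[a_next]               # Sarsa-style bootstrap (full sampling)
    expected = np.dot(pi_next, q_next)     # Expected-Sarsa-style bootstrap
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)

# toy usage: sigma=0.5 mixes the two estimators
target = q_sigma_target(1.0, np.array([0.2, 0.5, 0.1]),
                        np.array([0.3, 0.4, 0.3]), a_next=1, sigma=0.5)
```

The paper's finding is that intermediate σ (or a σ that decays over training) often beats either extreme.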

Understanding Synthetic Gradients and Decoupled Neural Interfaces

Authors: Wojciech Marian Czarnecki, Grzegorz Świrszcz, Max Jaderberg, Simon Osindero, Oriol Vinyals, Koray Kavukcuoglu

Abstract: When training neural networks, the use of Synthetic Gradients (SG) allows layers or modules to be trained without update locking - without waiting for a true error gradient to be backpropagated - resulting in Decoupled Neural Interfaces (DNIs). This unlocked ability of being able to update parts of a neural network asynchronously and with only local information was demonstrated to work empirically in Jaderberg et al (2016). However, there has been very little demonstration of what changes DNIs and SGs impose from a functional, representational, and learning dynamics point of view. In this paper, we study DNIs through the use of synthetic gradients on feed-forward networks to better understand their behaviour and elucidate their effect on optimisation. We show that the incorporation of SGs does not affect the representational strength of the learning system for a neural network, and prove the convergence of the learning system for linear and deep linear models. On practical problems we investigate the mechanism by which synthetic gradient estimators approximate the true loss, and, surprisingly, how that leads to drastically different layer-wise representations. Finally, we also expose the relationship of using synthetic gradients to other error approximation techniques and find a unifying language for discussion and comparison.

URL: https://arxiv.org/abs/1703.00522

Notes: another DeepMind paper, next step in their synthetic gradients research

Evolving Deep Neural Networks

Authors: Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, Babak Hodjat

Abstract: The success of deep learning depends on finding an architecture to fit the task. As deep learning has scaled up to more challenging tasks, the architectures have become difficult to design by hand. This paper proposes an automated method, CoDeepNEAT, for optimizing deep learning architectures through evolution. By extending existing neuroevolution methods to topology, components, and hyperparameters, this method achieves results comparable to best human designs in standard benchmarks in object recognition and language modeling. It also supports building a real-world application of automated image captioning on a magazine website. Given the anticipated increases in available computing power, evolution of deep networks is a promising approach to constructing deep learning applications in the future.

URL: https://arxiv.org/abs/1703.00548

Notes: evolutionarily constructed NNs, pretty neat

Emergence of Grounded Compositional Language in Multi-Agent Populations

Authors: Igor Mordatch, Pieter Abbeel

Abstract: By capturing statistical patterns in large corpora, machine learning has enabled significant advances in natural language processing, including in machine translation, question answering, and sentiment analysis. However, for agents to intelligently interact with humans, simply capturing the statistical patterns is insufficient. In this paper we investigate if, and how, grounded compositional language can emerge as a means to achieve goals in multi-agent populations. Towards this end, we propose a multi-agent learning environment and learning methods that bring about emergence of a basic compositional language. This language is represented as streams of abstract discrete symbols uttered by agents over time, but nonetheless has a coherent structure that possesses a defined vocabulary and syntax. We also observe emergence of non-verbal communication such as pointing and guiding when language communication is unavailable.

URL: https://arxiv.org/abs/1703.04908

Notes: I'm impressed: OpenAI invented a Petri dish for language creation

Evolution Strategies as a Scalable Alternative to Reinforcement Learning

Authors: Tim Salimans, Jonathan Ho, Xi Chen, Ilya Sutskever

Abstract: We explore the use of Evolution Strategies, a class of black box optimization algorithms, as an alternative to popular RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using hundreds to thousands of parallel workers, ES can solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training time. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.

URL: https://arxiv.org/abs/1703.03864

Notes: a fresh paper from Ilya Sutskever / OpenAI; like almost all of OpenAI's papers it's fascinating: evolutionary algorithms are a long-lasting dream of all machine learners, and here they show that on long episodes (in terms of timesteps) they can actually beat classic RL methods such as Q-learning
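
A compact sketch of the antithetic evolution-strategies estimator the paper scales up. Standardizing the returns stands in for the paper's rank-based fitness shaping, and the step size, noise scale, and population size are illustrative; the real system also shares random seeds so workers only exchange scalars.

```python
import numpy as np

def es_step(theta, fitness, alpha=0.02, sigma=0.1, n_pairs=50, rng=None):
    """One antithetic evolution-strategies update (NES-style gradient estimate).

    theta:   parameter vector
    fitness: callable mapping a parameter vector to a scalar score (higher is better)
    """
    rng = rng or np.random.default_rng()
    eps = rng.normal(size=(n_pairs, theta.size))
    # antithetic pairs: evaluate theta + sigma*eps and theta - sigma*eps
    scores = np.array([fitness(theta + sigma * e) - fitness(theta - sigma * e) for e in eps])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)    # crude fitness shaping
    grad_est = (eps * scores[:, None]).sum(axis=0) / (2 * n_pairs * sigma)
    return theta + alpha * grad_est

# toy usage: maximise -||x||^2, optimum at the origin
theta = np.ones(10)
for _ in range(200):
    theta = es_step(theta, lambda p: -np.sum(p ** 2))
```

Because each worker only needs the episode return, not backpropagation through time, the method is indifferent to action frequency and delayed rewards, which is exactly the advantage the abstract highlights.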

Measuring, Predicting and Visualizing Short-Term Change in Word Representation and Usage in VKontakte Social Network

Authors: Ian Stewart, Dustin Arendt, Eric Bell, Svitlana Volkova

Abstract: Language in social media is extremely dynamic: new words emerge, trend and disappear, while the meaning of existing words can fluctuate over time. Such dynamics are especially notable during a period of crisis. This work addresses several important tasks of measuring, visualizing and predicting short term text representation shift, i.e. the change in a word's contextual semantics, and contrasting such shift with surface level word dynamics, or concept drift, observed in social media streams. Unlike previous approaches on learning word representations from text, we study the relationship between short-term concept drift and representation shift on a large social media corpus - VKontakte posts in Russian collected during the Russia-Ukraine crisis in 2014-2015. Our novel contributions include quantitative and qualitative approaches to (1) measure short-term representation shift and contrast it with surface level concept drift; (2) build predictive models to forecast short-term shifts in meaning from previous meaning as well as from concept drift; and (3) visualize short-term representation shift for example keywords to demonstrate the practical use of our approach to discover and track meaning of newly emerging terms in social media. We show that short-term representation shift can be accurately predicted up to several weeks in advance. Our unique approach to modeling and visualizing word representation shifts in social media can be used to explore and characterize specific aspects of the streaming corpus during crisis events and potentially improve other downstream classification tasks including real-time event detection.

URL: https://arxiv.org/abs/1703.07012

Notes: a paper from (mostly) Russian authors describing term drift during the 2014-2015 conflict in Ukraine; unfortunately (aside from the war itself, which is of course a disaster), the project has a weakness: since the beginning of the turmoil many Ukrainian users changed their preferred social network and moved from VKontakte to Facebook, so the authors analyzed a biased sample; still, the results are interesting

Towards Diverse and Natural Image Descriptions via a Conditional GAN

Authors: Bo Dai, Dahua Lin, Raquel Urtasun, Sanja Fidler

Abstract: Despite the substantial progress in recent years, the image captioning techniques are still far from being perfect. Sentences produced by existing methods, e.g. those based on RNNs, are often overly rigid and lacking in variability. This issue is related to a learning principle widely used in practice, that is, to maximize the likelihood of training samples. This principle encourages high resemblance to the "ground-truth" captions while suppressing other reasonable descriptions. Conventional evaluation metrics, e.g. BLEU and METEOR, also favor such restrictive methods. In this paper, we explore an alternative approach, with the aim to improve the naturalness and diversity -- two essential properties of human expression. Specifically, we propose a new framework based on Conditional Generative Adversarial Networks (CGAN), which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content. It is noteworthy that training a sequence generator is nontrivial. We overcome the difficulty by Policy Gradient, a strategy stemming from Reinforcement Learning, which allows the generator to receive early feedback along the way. We tested our method on two large datasets, where it performed competitively against real people in our user study and outperformed other methods on various tasks.

URL: https://arxiv.org/abs/1703.06029

Notes: interesting work on GANs for image descriptions: the closest RL & GAN setup I know of; they also reward diversity and naturalness rather than exact alignment of the generated texts with the ground truth

Massive Exploration of Neural Machine Translation Architectures

Authors: Denny Britz, Anna Goldie, Minh-Thang Luong, Quoc Le

Abstract: Neural Machine Translation (NMT) has shown remarkable progress over the past few years with production systems now being deployed to end-users. One major drawback of current architectures is that they are expensive to train, typically requiring days to weeks of GPU time to converge. This makes exhaustive hyperparameter search, as is commonly done with other neural network architectures, prohibitively expensive. In this work, we present the first large-scale analysis of NMT architecture hyperparameters. We report empirical results and variance numbers for several hundred experimental runs, corresponding to over 250,000 GPU hours on the standard WMT English to German translation task. Our experiments lead to novel insights and practical advice for building and extending NMT architectures. As part of this contribution, we release an open-source NMT framework that enables researchers to easily experiment with novel techniques and reproduce state of the art results.

URL: https://arxiv.org/abs/1703.03906

Notes: a paper full of insights about Google NMT

Optimization as a Model for Few-Shot Learning

Authors: Sachin Ravi and Hugo Larochelle

Abstract: Though deep neural networks have shown great success in the large data domain, they generally perform poorly on few-shot learning tasks, where a classifier has to quickly generalize after seeing very few examples from each class. The general belief is that gradient-based optimization in high capacity classifiers requires many iterative steps over many examples to perform well. Here, we propose an LSTM-based meta-learner model to learn the exact optimization algorithm used to train another learner neural network classifier in the few-shot regime. The parametrization of our model allows it to learn appropriate parameter updates specifically for the scenario where a set amount of updates will be made, while also learning a general initialization of the learner (classifier) network that allows for quick convergence of training. We demonstrate that this meta-learning model is competitive with deep metric-learning techniques for few-shot learning.

URL: https://openreview.net/pdf?id=rJY0-Kcll

Notes: a meta-learning approach applied to few-shot learning; a complex paper, not so simple to understand

Opening the Black Box of Deep Neural Networks via Information

Authors: Ravid Shwartz-Ziv, Naftali Tishby

Abstract: Despite their great success, there is still no comprehensive theoretical understanding of learning with Deep Neural Networks (DNNs) or their inner organization. Previous work [Tishby & Zaslavsky (2015)] proposed to analyze DNNs in the Information Plane; i.e., the plane of the Mutual Information values that each layer preserves on the input and output variables. They suggested that the goal of the network is to optimize the Information Bottleneck (IB) tradeoff between compression and prediction, successively, for each layer. In this work we follow up on this idea and demonstrate the effectiveness of the Information-Plane visualization of DNNs. We first show that the stochastic gradient descent (SGD) epochs have two distinct phases: fast empirical error minimization followed by slow representation compression, for each layer. We then argue that the DNN layers end up very close to the IB theoretical bound, and present a new theoretical argument for the computational benefit of the hidden layers.

URL: https://arxiv.org/abs/1703.00810

Notes: the paper sheds only a little light on the internal behaviour of DNNs during training, but it is still interesting

Learning to Discover Cross-Domain Relations with Generative Adversarial Networks

Authors: Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, Jiwon Kim

Abstract: While humans easily recognize relations between data from different domains without any supervision, learning to automatically discover them is in general very challenging and needs many ground-truth pairs that illustrate the relations. To avoid costly pairing, we address the task of discovering cross-domain relations given unpaired data. We propose a method based on generative adversarial networks that learns to discover relations between different domains (DiscoGAN). Using the discovered relations, our proposed network successfully transfers style from one domain to another while preserving key attributes such as orientation and face identity.

URL: https://arxiv.org/abs/1703.05192

Notes: Another step in Style Transfer
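
A hedged sketch of one direction of a DiscoGAN-style generator objective: an adversarial term plus a reconstruction term through the reverse generator. `g_ab`, `g_ba`, and `d_b` are hypothetical callables standing in for the two generators and the domain-B discriminator, and the equal weighting of the two terms is an assumption.

```python
import numpy as np

def discogan_generator_loss(x_a, g_ab, g_ba, d_b):
    """One direction (A -> B -> A) of the cross-domain generator objective.

    g_ab, g_ba: cross-domain generators (callables on numpy arrays)
    d_b:        discriminator on domain B, returning P(input is a real B sample)
    """
    x_ab = g_ab(x_a)                              # translate A -> B
    x_aba = g_ba(x_ab)                            # translate back B -> A
    recon = np.mean((x_aba - x_a) ** 2)           # reconstruction keeps the relation invertible
    adv = -np.mean(np.log(d_b(x_ab) + 1e-8))      # fool the domain-B discriminator
    return recon + adv
```

The full model adds the mirror-image B -> A -> B terms, and it is the pair of reconstruction losses that lets the relation be discovered without any paired examples.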

Failures of Deep Learning

Authors: Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah

Abstract: In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming state-of-the-art. However, it is important, for both theoreticians and practitioners, to gain a deeper understanding of the difficulties and limitations associated with common approaches and algorithms. We describe four families of problems for which some of the commonly used existing algorithms fail or suffer significant difficulty. We illustrate the failures through practical experiments, and provide theoretical insights explaining their source, and how they might be remedied.

URL: https://arxiv.org/abs/1703.07950

Notes: as the title says - failures of deep learning; I like the part about end-to-end vs. decomposition (the latter is better), but there is more

Mask R-CNN

Authors: Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick

Abstract: We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without tricks, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code will be made available.

URL: https://arxiv.org/abs/1703.06870

Notes: FAIR's paper on instance segmentation; the idea is to predict the class and box offset and, in parallel and independently, a binary RoI mask; the architecture is shown to give better results across the MS COCO challenge tracks
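
A small numpy sketch of the decoupled mask loss: each RoI predicts one m x m mask per class, but only the ground-truth class' mask is penalised with per-pixel sigmoid cross-entropy, so mask prediction does not compete with classification. Shapes and the averaging are assumptions for illustration rather than the released implementation.

```python
import numpy as np

def mask_rcnn_mask_loss(mask_logits, gt_mask, gt_class):
    """Per-RoI mask loss with per-class sigmoid outputs.

    mask_logits: (K, m, m) one m x m mask of logits per class for this RoI
    gt_mask:     (m, m) binary ground-truth mask for the RoI
    gt_class:    index of the RoI's ground-truth class
    """
    logits = mask_logits[gt_class]                 # only the GT class' mask is trained
    p = 1.0 / (1.0 + np.exp(-logits))              # per-pixel sigmoid
    eps = 1e-8
    return -np.mean(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
```

Using a sigmoid per class instead of a softmax across classes is what makes the mask branch independent of the classification branch.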

Dance Dance Convolution

Authors: Chris Donahue, Zachary C. Lipton, Julian McAuley

Abstract: Dance Dance Revolution (DDR) is a popular rhythm-based video game. Players perform steps on a dance platform in synchronization with music as directed by on-screen step charts. While many step charts are available in standardized packs, users may grow tired of existing charts, or wish to dance to a song for which no chart exists. We introduce the task of learning to choreograph. Given a raw audio track, the goal is to produce a new step chart. This task decomposes naturally into two subtasks: deciding when to place steps and deciding which steps to select. For the step placement task, we combine recurrent and convolutional neural networks to ingest spectrograms of low-level audio features to predict steps, conditioned on chart difficulty. For step selection, we present a conditional LSTM generative model that substantially outperforms n-gram and fixed-window approaches.

URL: https://arxiv.org/abs/1703.06891

Notes: as one wise man once said: before actual progress in AI arrives, AI will be applied to a lot of funny but unnecessary things

Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery

Authors: Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Ursula Schmidt-Erfurth, Georg Langs

Abstract: Obtaining models that capture imaging markers relevant for disease progression and treatment monitoring is challenging. Models are typically based on large amounts of data with annotated examples of known markers aiming at automating detection. High annotation effort and the limitation to a vocabulary of known markers limit the power of such approaches. Here, we perform unsupervised learning to identify anomalies in imaging data as candidates for markers. We propose AnoGAN, a deep convolutional generative adversarial network to learn a manifold of normal anatomical variability, accompanying a novel anomaly scoring scheme based on the mapping from image space to a latent space. Applied to new data, the model labels anomalies, and scores image patches indicating their fit into the learned distribution. Results on optical coherence tomography images of the retina demonstrate that the approach correctly identifies anomalous images, such as images containing retinal fluid or hyperreflective foci.

URL: https://arxiv.org/abs/1703.05921

Notes: could be helpful for detecting unusual behaviour in a video stream

Sequence-to-Sequence Models Can Directly Transcribe Foreign Speech

Authors: Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, Zhifeng Chen

Abstract: We present a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another. The model does not explicitly transcribe the speech into text in the source language, nor does it require supervision from the ground truth source language transcription during training. We apply a slightly modified sequence-to-sequence with attention architecture that has previously been used for speech recognition and show that it can be repurposed for this more complex task, illustrating the power of attention-based models. A single model trained end-to-end obtains state-of-the-art performance on the Fisher Callhome Spanish-English speech translation task, outperforming a cascade of independently trained sequence-to-sequence speech recognition and machine translation models by 1.8 BLEU points on the Fisher test set. In addition, we find that making use of the training data in both languages by multi-task training sequence-to-sequence speech translation and recognition models with a shared encoder network can improve performance by a further 1.4 BLEU points.

URL: https://arxiv.org/abs/1703.08581

Notes: the Google folks created an architecture that translates on the fly - without transcribing into text first, fascinating. Its main feature is the CNN at the bottom converting the sound into a vector, followed by two bi-LSTMs that build the embedding - that is the encoder; the decoder is simpler - just 4 LSTM layers with attention feeding a softmax.

A Structured Self-attentive Sentence Embedding

Authors: Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, Yoshua Bengio

Abstract: This paper proposes a new model for extracting an interpretable sentence embedding by introducing self-attention. Instead of using a vector, we use a 2-D matrix to represent the embedding, with each row of the matrix attending on a different part of the sentence. We also propose a self-attention mechanism and a special regularization term for the model. As a side effect, the embedding comes with an easy way of visualizing what specific parts of the sentence are encoded into the embedding. We evaluate our model on 3 different tasks: author profiling, sentiment classification, and textual entailment. Results show that our model yields a significant performance gain compared to other sentence embedding methods in all of the 3 tasks.

URL: https://arxiv.org/abs/1703.03130

Notes: nice work from IBM Watson and Bengio: a matrix embedding built from self-attention, with a Frobenius-norm penalty as regularization on it
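
A minimal numpy sketch of the 2-D attention matrix and the Frobenius-norm penalty from the abstract. The bi-LSTM that produces the hidden states H is assumed to exist elsewhere, the parameter shapes follow the usual description of the model, and batching is ignored.

```python
import numpy as np

def structured_self_attention(H, W1, W2):
    """Matrix sentence embedding A @ H with the redundancy penalty.

    H:  (n, 2u) bi-LSTM hidden states for a sentence of n tokens
    W1: (d_a, 2u), W2: (r, d_a) attention parameters; r attention hops
    Returns (M, penalty) where M is the (r, 2u) embedding matrix.
    """
    scores = W2 @ np.tanh(W1 @ H.T)                        # (r, n) unnormalised attention
    scores -= scores.max(axis=1, keepdims=True)            # numerically stable softmax
    A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    M = A @ H                                               # each row attends to a different aspect
    penalty = np.sum((A @ A.T - np.eye(A.shape[0])) ** 2)   # ||A A^T - I||_F^2
    return M, penalty

# toy usage
rng = np.random.default_rng(0)
H = rng.normal(size=(12, 16))
M, pen = structured_self_attention(H, rng.normal(size=(8, 16)), rng.normal(size=(4, 8)))
```

The penalty pushes the r attention distributions apart, so each row of M encodes a different part of the sentence, which is also what makes the embedding easy to visualize.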

Deep Learning applied to NLP

Authors: Marc Moreno Lopez, Jugal Kalita

Abstract: Convolutional Neural Network (CNNs) are typically associated with Computer Vision. CNNs are responsible for major breakthroughs in Image Classification and are the core of most Computer Vision systems today. More recently CNNs have been applied to problems in Natural Language Processing and gotten some interesting results. In this paper, we will try to explain the basics of CNNs, its different variations and how they have been applied to NLP.

URL: https://arxiv.org/abs/1703.03091

Notes: basic CNNs for NLP, could be useful

Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets

Authors: Zhen Yang, Wei Chen, Feng Wang, Bo Xu

Abstract: This paper proposes a new route for applying the generative adversarial nets (GANs) to NLP tasks (taking the neural machine translation as an instance) and the widespread perspective that GANs can't work well in the NLP area turns out to be unreasonable. In this work, we build a conditional sequence generative adversarial net which comprises of two adversarial sub models, a generative model (generator) which translates the source sentence into the target sentence as the traditional NMT models do and a discriminative model (discriminator) which discriminates the machine-translated target sentence from the human-translated sentence. From the perspective of Turing test, the proposed model is to generate the translation which is indistinguishable from the human-translated one. Experiments show that the proposed model achieves significant improvements than the traditional NMT model. In Chinese-English translation tasks, we obtain up to +2.0 BLEU points improvement. To the best of our knowledge, this is the first time that the quantitative results about the application of GANs in the traditional NLP task is reported. Meanwhile, we present detailed strategies for GAN training. In addition, We find that the discriminator of the proposed model shows great capability in data cleaning.

URL: https://arxiv.org/abs/1703.04887

Notes: the Chinese authors give a nice description of the GAN training process for NLP, namely NMT here

Learning Simpler Language Models with the Delta Recurrent Neural Network Framework

Authors: Alexander G. Ororbia II, Tomas Mikolov, David Reitter

Abstract: Learning useful information across long time lags is a critical and difficult problem for temporal neural models in tasks like language modeling. Existing architectures that address the issue are often complex and costly to train. The Delta Recurrent Neural Network (Delta-RNN) framework is a simple and high-performing design that unifies previously proposed gated neural models. The Delta-RNN models maintain longer-term memory by learning to interpolate between a fast-changing data-driven representation and a slowly changing, implicitly stable state. This requires hardly any more parameters than a classical simple recurrent network. The models outperform popular complex architectures, such as the Long Short Term Memory (LSTM) and the Gated Recurrent Unit (GRU) and achieve state-of-the art performance in language modeling at character and word levels and yield comparable performance at the subword level.

URL: https://arxiv.org/abs/1703.08864

Notes: Delta-RNN, one more type of RNN; seems to be simpler

Exploring Question Understanding and Adaptation in Neural-Network-Based Question Answering

Authors: Junbei Zhang, Xiaodan Zhu, Qian Chen, Lirong Dai, Hui Jiang

Abstract: The last several years have seen intensive interest in exploring neural-network-based models for machine comprehension (MC) and question answering (QA). In this paper, we approach the problems by closely modelling questions in a neural network framework. We first introduce syntactic information to help encode questions. We then view and model different types of questions and the information shared among them as an adaptation task and proposed adaptation models for them. On the Stanford Question Answering Dataset (SQuAD), we show that these approaches can help attain better results over a competitive baseline.

URL: https://arxiv.org/abs/1703.04617

Notes: QA with question-type discrimination/adaptation and syntax-aware embeddings at the bottom

Question Answering from Unstructured Text by Retrieval and Comprehension

Authors: Yusuke Watanabe, Bhuwan Dhingra, Ruslan Salakhutdinov

Abstract: Open domain Question Answering (QA) systems must interact with external knowledge sources, such as web pages, to find relevant information. Information sources like Wikipedia, however, are not well structured and difficult to utilize in comparison with Knowledge Bases (KBs). In this work we present a two-step approach to question answering from unstructured text, consisting of a retrieval step and a comprehension step. For comprehension, we present an RNN based attention model with a novel mixture mechanism for selecting answers from either retrieved articles or a fixed vocabulary. For retrieval we introduce a hand-crafted model and a neural model for ranking relevant articles. We achieve state-of-the-art performance on WIKI MOVIES dataset, reducing the error by 40%. Our experimental results further demonstrate the importance of each of the introduced components.

URL: https://arxiv.org/abs/1703.08885

Notes: retrieval and comprehension of Wikipedia articles to answer questions

FastQA: A Simple and Efficient Neural Architecture for Question Answering

Authors: Dirk Weissenborn, Georg Wiese, Laura Seiffe

Abstract: Recent development of large-scale question answering (QA) datasets triggered a substantial amount of research into end-to-end neural architectures for QA. Increasingly complex systems have been conceived without comparison to a simpler neural baseline system that would justify their complexity. In this work, we propose a simple heuristic that guided the development of FastQA, an efficient end-to-end neural model for question answering that is very competitive with existing models. We further demonstrate, that an extended version (FastQAExt) achieves state-of-the-art results on recent benchmark datasets, namely SQuAD, NewsQA and MsMARCO, outperforming most existing models. However, we show that increasing the complexity of FastQA to FastQAExt does not yield any systematic improvements. We argue that the same holds true for most existing systems that are similar to FastQAExt. A manual analysis reveals that our proposed heuristic explains most predictions of our model, which indicates that modeling a simple heuristic is enough to achieve strong performance on extractive QA datasets. The overall strong performance of FastQA puts results of existing, more complex models into perspective.

URL: https://arxiv.org/abs/1703.04816

Notes: a deliberately simple and lightweight QA architecture!

End-to-end optimization of goal-driven and visually grounded dialogue systems

Authors: Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville, Olivier Pietquin

Abstract: End-to-end design of dialogue systems has recently become a popular research topic thanks to powerful tools such as encoder-decoder architectures for sequence-to-sequence learning. Yet, most current approaches cast human-machine dialogue management as a supervised learning problem, aiming at predicting the next utterance of a participant given the full history of the dialogue. This vision is too simplistic to render the intrinsic planning problem inherent to dialogue as well as its grounded nature, making the context of a dialogue larger than the sole history. This is why only chit-chat and question answering tasks have been addressed so far using end-to-end architectures. In this paper, we introduce a Deep Reinforcement Learning method to optimize visually grounded task-oriented dialogues, based on the policy gradient algorithm. This approach is tested on a dataset of 120k dialogues collected through Mechanical Turk and provides encouraging results at solving both the problem of generating natural dialogues and the task of discovering a specific object in a complex picture.

URL: https://arxiv.org/abs/1703.05423

Notes: DeepMind paper on training dialog systems, RL & RNNs

Learning to Remember Rare Events

Authors: Łukasz Kaiser, Ofir Nachum, Aurko Roy, Samy Bengio

Abstract: Despite recent advances, memory-augmented deep neural networks are still limited when it comes to life-long and one-shot learning, especially in remembering rare events. We present a large-scale life-long memory module for use in deep learning. The module exploits fast nearest-neighbor algorithms for efficiency and thus scales to large memory sizes. Except for the nearest-neighbor query, the module is fully differentiable and trained end-to-end with no extra supervision. It operates in a life-long manner, i.e., without the need to reset it during training. Our memory module can be easily added to any part of a supervised neural network. To show its versatility we add it to a number of networks, from simple convolutional ones tested on image classification to deep sequence-to-sequence and recurrent-convolutional models. In all cases, the enhanced network gains the ability to remember and do life-long one-shot learning. Our module remembers training examples shown many thousands of steps in the past and it can successfully generalize from them. We set new state-of-the-art for one-shot learning on the Omniglot dataset and demonstrate, for the first time, life-long one-shot learning in recurrent neural networks on a large-scale machine translation task.

URL: https://arxiv.org/abs/1703.03129

Notes: memory module architecture from Bengio
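
A heavily simplified sketch of a key-value memory with nearest-neighbour lookup in the spirit of this paper. The write rule (blend the key on a correct hit, otherwise recycle the stalest slot) follows the abstract's description, but the details and hyperparameters are assumptions, and the differentiable training loss is omitted.

```python
import numpy as np

class RareEventMemory:
    """Keys are unit-normalised query vectors, values are labels, ages track staleness."""

    def __init__(self, size, dim, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.keys = self._norm(self.rng.normal(size=(size, dim)))
        self.values = np.full(size, -1)
        self.ages = np.zeros(size)

    @staticmethod
    def _norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    def query(self, q):
        """Return (stored value, slot index) of the nearest key by cosine similarity."""
        q = self._norm(q)
        idx = int(np.argmax(self.keys @ q))
        return self.values[idx], idx

    def update(self, q, label):
        """Blend the matching key on a correct hit, otherwise overwrite the stalest slot."""
        q = self._norm(q)
        pred, idx = self.query(q)
        self.ages += 1
        if pred == label:
            self.keys[idx] = self._norm(self.keys[idx] + q)   # reinforce the matching slot
        else:
            idx = int(np.argmax(self.ages))                   # recycle the least recently used slot
            self.keys[idx], self.values[idx] = q, label
        self.ages[idx] = 0

# toy usage: one write is enough to recall a rare event later
mem = RareEventMemory(size=128, dim=16)
mem.update(np.ones(16), label=3)
print(mem.query(np.ones(16))[0])   # prints 3, the label stored after a single exposure
```

The one-shot behaviour comes from the write rule: a single occurrence of a rare pattern claims a slot and can be retrieved thousands of steps later.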

Generalization and Equilibrium in Generative Adversarial Nets (GANs)

Authors: Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, Yi Zhang

Abstract: This paper makes progress on several open theoretical issues related to Generative Adversarial Networks. A definition is provided for what it means for the training to generalize, and it is shown that generalization is not guaranteed for the popular distances between distributions such as Jensen-Shannon or Wasserstein. We introduce a new metric called neural net distance for which generalization does occur. We also show that an approximate pure equilibrium in the 2-player game exists for a natural training objective (Wasserstein). Showing such a result has been an open problem (for any training objective). Finally, the above theoretical ideas lead us to propose a new training protocol, MIX+GAN, which can be combined with any existing method. We present experiments showing that it stabilizes and improves some existing methods.

URL: https://arxiv.org/abs/1703.00573

Notes: theory behind GANs; a finite mixture of random variables can be enough to approximate an infinite mixture under the Wasserstein loss

The cognitive roots of regularization in language

Authors: Vanessa Ferdinand, Simon Kirby, Kenny Smith

Abstract: Regularization occurs when the output a learner produces is less variable than the linguistic data they observed. In an artificial language learning experiment, we show that there exist at least two independent sources of regularization bias in cognition: a domain-general source based on cognitive load and a domain-specific source triggered by linguistic stimuli. Both of these factors modulate how frequency information is encoded and produced, but only the production-side modulations result in regularization (i.e. cause learners to eliminate variation from the observed input). We formalize the definition of regularization as the reduction of entropy and find that entropy measures are better at identifying regularization behavior than frequency-based analyses. We also use a model of cultural transmission to extrapolate from our experimental data in order to predict the amount of regularization which would develop in each experimental condition if the artificial language was transmitted over several generations of learners. Here we find an interaction between cognitive load and linguistic domain, suggesting that the effect of cognitive constraints can become more complex when put into the context of cultural evolution: although learning biases certainly carry information about the course of language evolution, we should not expect a one-to-one correspondence between the micro-level processes that regularize linguistic datasets and the macro-level evolution of linguistic regularity.

URL: https://arxiv.org/abs/1703.03442

Notes: a paper about human cognition: when people know the words, they perform so-called regularization - for example, over-producing the most frequent word from a sequence of observed words when asked to reproduce that sequence; for random (unknown) words the regularization is smaller, since people have to spend more effort just remembering them; there are other findings in the article, but this seems to me the most important one

Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games

Authors: Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang

Abstract: Real-world artificial intelligence (AI) applications often require multiple agents to work in a collaborative effort. Efficient learning for intra-agent communication and coordination is an indispensable step towards general AI. In this paper, we take StarCraft combat game as the test scenario, where the task is to coordinate multiple agents as a team to defeat their enemies. To maintain a scalable yet effective communication protocol, we introduce a multiagent bidirectionally-coordinated network (BiCNet ['bIknet]) with a vectorised extension of actor-critic formulation. We show that BiCNet can handle different types of combats under diverse terrains with arbitrary numbers of AI agents for both sides. Our analysis demonstrates that without any supervisions such as human demonstrations or labelled data, BiCNet could learn various types of coordination strategies that is similar to these of experienced game players. Moreover, BiCNet is easily adaptable to the tasks with heterogeneous agents. In our experiments, we evaluate our approach against multiple baselines under different scenarios; it shows state-of-the-art performance, and possesses potential values for large-scale real-world applications.

URL: https://arxiv.org/abs/1703.10069

Notes: paper from Alibaba Group (!); the key point is that they use independent agents with independent rewards; nice pictures, by the way

2017-04

It Takes Two to Tango: Towards Theory of AI's Mind

Authors: Arjun Chandrasekaran, Deshraj Yadav, Prithvijit Chattopadhyay, Viraj Prabhu, Devi Parikh

Abstract: Theory of Mind is the ability to attribute mental states (beliefs, intents, knowledge, perspectives, etc.) to others and recognize that these mental states may differ from one's own. Theory of Mind is critical to effective communication and to teams demonstrating higher collective performance. To effectively leverage the progress in Artificial Intelligence (AI) to make our lives more productive, it is important for humans and AI to work well together in a team. Traditionally, there has been much emphasis on research to make AI more accurate, and (to a lesser extent) on having it better understand human intentions, tendencies, beliefs, and contexts. The latter involves making AI more human-like and having it develop a theory of our minds. In this work, we argue that for human-AI teams to be effective, humans must also develop a theory of AI's mind - get to know its strengths, weaknesses, beliefs, and quirks. We instantiate these ideas within the domain of Visual Question Answering (VQA). We find that using just a few examples(50), lay people can be trained to better predict responses and oncoming failures of a complex VQA model. Surprisingly, we find that having access to the model's internal states - its confidence in its top-k predictions, explicit or implicit attention maps which highlight regions in the image (and words in the question) the model is looking at (and listening to) while answering a question about an image - do not help people better predict its behavior.

URL: https://arxiv.org/abs/1704.00717

Notes: people are trying to frame visual question answering as a theory of AI's mind, which is a little pretentious in my opinion, but the approach is interesting

Frames: A Corpus For Adding Memory To Goal-Oriented Dialogue Systems

Authors: Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, Kaheer Suleman

Abstract: This paper presents the Frames dataset, a corpus of 1369 human-human dialogues with an average of 15 turns per dialogue. We developed this dataset to study the role of memory in goal-oriented dialogue systems. Based on Frames, we introduce a task called frame tracking, which extends state tracking to a setting where several states are tracked simultaneously. We propose a baseline model for this task. We show that Frames can also be used to study memory in dialogue management and information presentation through natural language generation.

URL: https://arxiv.org/abs/1704.00057

Notes: a new paper presenting a dataset of human-to-human conversations

Learning to Generate Reviews and Discovering Sentiment

Authors: Alec Radford, Rafal Jozefowicz, Ilya Sutskever

Abstract: We explore the properties of byte-level recurrent language models. When given sufficient amounts of capacity, training data, and compute time, the representations learned by these models include disentangled features corresponding to high-level concepts. Specifically, we find a single unit which performs sentiment analysis. These representations, learned in an unsupervised manner, achieve state of the art on the binary subset of the Stanford Sentiment Treebank. They are also very data efficient. When using only a handful of labeled examples, our approach matches the performance of strong baselines trained on full datasets. We also demonstrate the sentiment unit has a direct influence on the generative process of the model. Simply fixing its value to be positive or negative generates samples with the corresponding positive or negative sentiment.

URL: https://arxiv.org/abs/1704.01444

Notes: a lot of hype around this OpenAI paper; it is nice that there is a single dimension responsible for sentiment, but I don't see any magic here: it looks like a game of chance to me

Beating Atari with Natural Language Guided Reinforcement Learning

Authors: Russell Kaplan, Christopher Sauer, Alexander Sosa

Abstract: We introduce the first deep reinforcement learning agent that learns to beat Atari games with the aid of natural language instructions. The agent uses a multimodal embedding between environment observations and natural language to self-monitor progress through a list of English instructions, granting itself additional reward for completing instructions in addition to increasing the game score. Our agent significantly outperforms Deep-Q Networks, Asynchronous Advantage Actor-Critic (A3C) agents, and the best agents posted to OpenAI Gym [4] on what is often considered the hardest Atari 2600 environment [2]: Montezuma's Revenge.

URL: http://web.stanford.edu/class/cs224n/reports/2762090.pdf

Notes: a paper by Socher's students; they guide the agent with natural language instructions, effectively teaching it natural language, nice

Improved Training of Wasserstein GANs

Authors: Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville

Abstract: Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes significant progress toward stable training of GANs, but can still generate low-quality samples or fail to converge in some settings. We find that these training failures are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to pathological behavior. We propose an alternative method for enforcing the Lipschitz constraint: instead of clipping weights, penalize the norm of the gradient of the critic with respect to its input. Our proposed method converges faster and generates higher-quality samples than WGAN with weight clipping. Finally, our method enables very stable GAN training: for the first time, we can train a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models over discrete data.

URL: https://arxiv.org/abs/1704.00028

Notes: improved training of WGANs from Aaron Courville's group; with code!
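
To make the gradient-penalty idea concrete, a rough PyTorch sketch (my own code, not the authors' implementation; `critic`, the batch tensors and `lambda_gp` are placeholders, and 2-D feature batches are assumed):

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """Penalize the critic when its input-gradient norm deviates from 1
    on random interpolations between real and fake samples."""
    eps = torch.rand(real.size(0), 1, device=real.device)   # one mixing weight per sample
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads, = torch.autograd.grad(scores.sum(), interp, create_graph=True)
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```

During training this term is added to the usual WGAN critic loss in place of weight clipping.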

Bayesian Recurrent Neural Networks

Authors: Meire Fortunato, Charles Blundell, Oriol Vinyals

Abstract: In this work we explore a straightforward variational Bayes scheme for Recurrent Neural Networks. Firstly, we show that a simple adaptation of truncated backpropagation through time can yield good quality uncertainty estimates and superior regularisation at only a small extra computational cost during training. Secondly, we demonstrate how a novel kind of posterior approximation yields further improvements to the performance of Bayesian RNNs. We incorporate local gradient information into the approximate posterior to sharpen it around the current batch statistics. This technique is not exclusive to recurrent neural networks and can be applied more widely to train Bayesian neural networks. We also empirically demonstrate how Bayesian RNNs are superior to traditional RNNs on a language modelling benchmark and an image captioning task, as well as showing how each of these methods improve our model over a variety of other schemes for training them. We also introduce a new benchmark for studying uncertainty for language models so future methods can be easily compared.

URL: https://arxiv.org/abs/1704.02798

Notes: an enhancement for RNNs - additional sharpening of the distributions over parameters; DeepMind kindly provided the community with code, we're lucky!

Unfolding and Shrinking Neural Machine Translation Ensembles

Authors: Felix Stahlberg, Bill Byrne

Abstract: Ensembling is a well-known technique in neural machine translation (NMT). Instead of a single neural net, multiple neural nets with the same topology are trained separately, and the decoder generates predictions by averaging over the individual models. Ensembling often improves the quality of the generated translations drastically. However, it is not suitable for production systems because it is cumbersome and slow. This work aims to reduce the runtime to be on par with a single system without compromising the translation quality. First, we show that the ensemble can be unfolded into a single large neural network which imitates the output of the ensemble system. We show that unfolding can already improve the runtime in practice since more work can be done on the GPU. We proceed by describing a set of techniques to shrink the unfolded network by reducing the dimensionality of layers. On Japanese-English we report that the resulting network has the size and decoding speed of a single NMT network but performs on the level of a 3-ensemble system.

URL: https://arxiv.org/abs/1704.03279

Notes: an interesting practical paper: the authors build one big network from the ensemble and shrink it via SVD plus a tricky linear combination of the activations of the neurons that are left after thinning
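
As an illustration of the SVD part only (a generic low-rank trick, not necessarily the paper's exact procedure), one wide layer of the unfolded network can be split into two thin ones:

```python
import numpy as np

def shrink_layer_svd(W, rank):
    """Approximate a large weight matrix W (out_dim x in_dim) by two thinner
    matrices via truncated SVD, so one wide layer becomes two narrow ones."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    W_out = U[:, :rank] * sqrt_s        # out_dim x rank
    W_in = sqrt_s[:, None] * Vt[:rank]  # rank x in_dim
    return W_out, W_in                  # W is approximated by W_out @ W_in

# toy check on a random "unfolded ensemble" layer
W = np.random.randn(512, 1536)
W_out, W_in = shrink_layer_svd(W, rank=256)
print(np.linalg.norm(W - W_out @ W_in) / np.linalg.norm(W))  # relative error
```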

The reinterpretation of standard deviation concept

Authors: Xiaoming Ye

Abstract: Existing mathematical theory interprets the concept of standard deviation as the dispersion degree. Therefore, in measurement theory, both uncertainty concept and precision concept, which are expressed with standard deviation or times standard deviation, are also defined as the dispersion of measurement result, so that the concept logic is tangled. Through comparative analysis of the standard deviation concept and re-interpreting the measurement error evaluation principle, this paper points out that the concept of standard deviation is actually single error's probability interval value instead of dispersion degree, and that the error with any regularity can be evaluated by standard deviation, corrected this mathematical concept, and gave the correction direction of measurement concept logic. These will bring a global change to measurement theory system.

URL: https://arxiv.org/abs/1704.03812

Notes: a fundamental paper on statistics grounded in practical geodesy work; the key idea is that the standard deviation is just the probability interval of a single measurement error
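
A tiny numeric illustration of that reading, assuming a normal error model (my example, not from the paper):

```python
import math

# Under a normal error model, k standard deviations correspond to the
# probability that a single measurement error falls inside +/- k*sigma.
for k in (1, 2, 3):
    p = math.erf(k / math.sqrt(2.0))  # P(|error| <= k*sigma) for a standard normal
    print(f"P(|error| <= {k}*sigma) = {p:.4f}")
```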

Mobile Keyboard Input Decoding with Finite-State Transducers

Authors: Tom Ouyang, David Rybach, Françoise Beaufays, Michael Riley

Abstract: We propose a finite-state transducer (FST) representation for the models used to decode keyboard inputs on mobile devices. Drawing from learnings from the field of speech recognition, we describe a decoding framework that can satisfy the strict memory and latency constraints of keyboard input. We extend this framework to support functionalities typically not present in speech recognition, such as literal decoding, autocorrections, word completions, and next word predictions. We describe the general framework of what we call for short the keyboard "FST decoder" as well as the implementation details that are new compared to a speech FST decoder. We demonstrate that the FST decoder enables new UX features such as post-corrections. Finally, we sketch how this decoder can support advanced features such as personalization and contextualization.

URL: https://arxiv.org/abs/1704.03987

Notes: my favorite automata, applied to mobile keyboard input prediction

Incremental Skip-gram Model with Negative Sampling

Authors: Nobuhiro Kaji, Hayato Kobayashi

Abstract: This paper explores an incremental training strategy for the skip-gram model with negative sampling (SGNS) from both empirical and theoretical perspectives. Existing methods of neural word embeddings, including SNGS, are multi-pass algorithms and thus cannot perform incremental model update. To address this problem, we present a simple incremental extension of SNGS and provide a thorough theoretical analysis to demonstrate its validity. Empirical experiments demonstrated the correctness of the theoretical analysis as well as the practical usefulness of the incremental algorithm.

URL: https://arxiv.org/abs/1704.03956

Notes: a practical paper on online training of word embeddings; seems useful, e.g. in social network analysis
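
For reference, the basic SGNS update that gets applied online; this is just the standard SGD step for one (center, context) pair with sampled negatives, not the paper's full incremental algorithm (which also adapts the noise distribution and vocabulary as data streams in):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, neg_ids, lr=0.025):
    """One online skip-gram-with-negative-sampling SGD step for a single
    (center, context) pair plus the ids of sampled negative words."""
    v = W_in[center]                            # center-word vector
    ids = np.concatenate(([context], neg_ids))
    labels = np.zeros(len(ids))
    labels[0] = 1.0                             # real context vs. noise words
    u = W_out[ids]                              # (k+1, dim) output vectors
    g = (sigmoid(u @ v) - labels)[:, None]      # log-loss gradient wrt the scores
    W_in[center] -= lr * (g * u).sum(axis=0)
    W_out[ids] -= lr * g * v
```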

Equivalence Between Policy Gradients and Soft Q-Learning

Authors: John Schulman, Pieter Abbeel, Xi Chen

Abstract: Two of the leading approaches for model-free reinforcement learning are policy gradient methods and Q-learning methods. Q-learning methods can be effective and sample-efficient when they work, however, it is not well-understood why they work, since empirically, the Q-values they estimate are very inaccurate. A partial explanation may be that Q-learning methods are secretly implementing policy gradient updates: we show that there is a precise equivalence between Q-learning and policy gradient methods in the setting of entropy-regularized reinforcement learning, that "soft" (entropy-regularized) Q-learning is exactly equivalent to a policy gradient method. We also point out a connection between Q-learning methods and natural policy gradient methods. Experimentally, we explore the entropy-regularized versions of Q-learning and policy gradients, and we find them to perform as well as (or slightly better than) the standard variants on the Atari benchmark. We also show that the equivalence holds in practical settings by constructing a Q-learning method that closely matches the learning dynamics of A3C without using a target network or \epsilon-greedy exploration schedule.

URL: https://arxiv.org/abs/1704.06440

Notes: fresh from ICLR: policy gradients and Q-learning are mathematically equivalent in the entropy-regularized setting, and the experiments show that they converge to the same solutions (unsurprisingly)

Adversarial Training Methods for Semi-Supervised Text Classification

Authors: Takeru Miyato, Andrew M. Dai, Ian Goodfellow

Abstract: Adversarial training provides a means of regularizing supervised learning algorithms while virtual adversarial training is able to extend supervised learning algorithms to the semi-supervised setting. However, both methods require making small perturbations to numerous entries of the input vector, which is inappropriate for sparse high-dimensional inputs such as one-hot word representations. We extend adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself. The proposed method achieves state of the art results on multiple benchmark semi-supervised and purely supervised tasks. We provide visualizations and analysis showing that the learned word embeddings have improved in quality and that while training, the model is less prone to overfitting.

URL: https://openreview.net/forum?id=r1X3g2_xl&noteId=r1X3g2_xl

Notes: really nice work from Google on (virtual) adversarial training for classification; the main idea is to perturb the word embeddings with the noise that maximises the loss (the KL-divergence in the virtual case, the NLL otherwise)
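
A minimal sketch of the supervised adversarial perturbation on the embeddings (my code; the function name and `epsilon` are placeholders, and the virtual-adversarial KL variant is omitted):

```python
import torch

def adversarial_perturbation(loss, embeddings, epsilon=1.0):
    """Return a small perturbation of the embedded input in the direction
    that increases the supervised loss (an L2-normalized gradient step).
    `embeddings` must be the embedded input tensor used to compute `loss`."""
    grad, = torch.autograd.grad(loss, embeddings, retain_graph=True)
    norm = grad.norm(p=2, dim=-1, keepdim=True) + 1e-12
    return epsilon * grad / norm

# the adversarial regularizer is then the loss recomputed on
# (embeddings + perturbation), added to the clean loss
```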

On the universal structure of human lexical semantics

Authors: Hyejin Youn, Logan Sutton, Eric Smith, Cristopher Moore, Jon F. Wilkins, Ian Maddieson, William Croft, and Tanmoy Bhattacharya

Abstract: How universal is human conceptual structure? The way concepts are organized in the human brain may reflect distinct features of cultural, historical, and environmental background in addition to properties universal to human cognition. Semantics, or meaning expressed through language, provides indirect access to the underlying conceptual structure, but meaning is notoriously difficult to measure, let alone parameterize. Here, we provide an empirical measure of semantic proximity between concepts using cross-linguistic dictionaries to translate words to and from languages carefully selected to be representative of worldwide diversity. These translations reveal cases where a particular language uses a single “polysemous” word to express multiple concepts that another language represents using distinct words. We use the frequency of such polysemies linking two concepts as a measure of their semantic proximity and represent the pattern of these linkages by a weighted network. This network is highly structured: Certain concepts are far more prone to polysemy than others, and naturally interpretable clusters of closely related concepts emerge. Statistical analysis of the polysemies observed in a subset of the basic vocabulary shows that these structural properties are consistent across different language groups, and largely independent of geography, environment, and the presence or absence of a literary tradition. The methods developed here can be applied to any semantic domain to reveal the extent to which its conceptual structure is, similarly, a universal attribute of human cognition and language use.

URL: http://www.pnas.org/content/113/7/1766.abstract

Notes: a very important paper in my opinion: it could be that de Saussure is wrong and sound is bound to meaning, which would be a revolution in linguistics, semiotics, psychology and so forth

Improving Neural Language Models with a Continuous Cache

Authors: Edouard Grave, Armand Joulin, Nicolas Usunier

Abstract: We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural network and cache models used with count based language models. We demonstrate on several language model datasets that our approach performs significantly better than recent memory augmented networks.

URL: https://research.fb.com/publications/improving-neural-language-models-with-a-continuous-cache/

Notes: new paper from FAIR at ICLR; the main idea is to fuse the distribution over recently used words into the softmax distribution
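
A rough sketch of the cache interpolation (my code, with made-up names: `theta` for the cache flatness and `lam` for the mixing weight):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cache_interpolated_probs(p_vocab, h_t, cache_h, cache_ids, vocab_size,
                             theta=0.3, lam=0.1):
    """Mix the model's softmax with a cache distribution built from dot
    products between the current hidden state and recent hidden states."""
    weights = softmax(theta * cache_h @ h_t)   # attention over the recent steps
    p_cache = np.zeros(vocab_size)
    np.add.at(p_cache, cache_ids, weights)     # mass goes to recently seen words
    return (1.0 - lam) * p_vocab + lam * p_cache

# toy usage: vocab of 10 words, 4 cached steps
p = cache_interpolated_probs(np.full(10, 0.1), np.random.randn(6),
                             np.random.randn(4, 6), np.array([2, 5, 2, 7]), 10)
print(p.sum())  # still sums to 1
```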

Unsupervised Learning by Predicting Noise

Authors: Piotr Bojanowski, Armand Joulin

Abstract: Convolutional neural networks provide visual features that perform remarkably well in many computer vision applications. However, training these networks requires significant amounts of supervision. This paper introduces a generic framework to train deep networks, end-to-end, with no supervision. We propose to fix a set of target representations, called Noise As Targets (NAT), and to constrain the deep features to align to them. This domain agnostic approach avoids the standard unsupervised learning issues of trivial solutions and collapsing of features. Thanks to a stochastic batch reassignment strategy and a separable square loss function, it scales to millions of images. The proposed approach produces representations that perform on par with state-of-the-art unsupervised methods on ImageNet and Pascal VOC.

URL: https://arxiv.org/abs/1704.05310

Notes: the title is not very descriptive; in fact the authors assign a fixed random target vector to each image and train the features to align with it, thereby learning image representations

Adversarial Generator-Encoder Networks

Authors: Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky

Abstract: We present a new autoencoder-type architecture, that is trainable in an unsupervised mode, sustains both generation and inference, and has the quality of conditional and unconditional samples boosted by adversarial learning. Unlike previous hybrids of autoencoders and adversarial networks, the adversarial game in our approach is set up directly between the encoder and the generator, and no external mappings are trained in the process of learning. The game objective compares the divergences of each of the real and the generated data distributions with the canonical distribution in the latent space. We show that direct generator-vs-encoder game leads to a tight coupling of the two components, resulting in samples and reconstructions of a comparable quality to some recently-proposed more complex architectures.

URL: https://arxiv.org/abs/1704.02304

Notes: a paper from Russian guys who have worked on style transfer for a while; a new step in their architecture, could be useful for us

Translating Neuralese

Authors: Jacob Andreas, Anca Dragan, Dan Klein

Abstract: Several approaches have recently been proposed for learning decentralized deep multiagent policies that coordinate via a differentiable communication channel. While these policies are effective for many tasks, interpretation of their induced communication strategies has remained a challenge. Here we propose to interpret agents' messages by translating them. Unlike in typical machine translation problems, we have no parallel data to learn from. Instead we develop a translation model based on the insight that agent messages and natural language strings mean the same thing if they induce the same belief about the world in a listener. We present theoretical guarantees and empirical evidence that our approach preserves both the semantics and pragmatics of messages by ensuring that players communicating through a translation layer do not suffer a substantial loss in reward relative to players with a common language.

URL: https://arxiv.org/abs/1704.06960

Notes: translating neuralese is actually finding a mapping in an LDA-like topic space, excluding actual words, since neuralese may have no such concept, only the message itself; the authors work with messages and contexts (situations in the world)

Adversarial Neural Machine Translation

Authors: Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, Tie-Yan Liu

Abstract: In this paper, we study a new learning paradigm for Neural Machine Translation (NMT). Instead of maximizing the likelihood of the human translation as in previous works, we minimize the distinction between human translation and the translation given by a NMT model. To achieve this goal, inspired by the recent success of generative adversarial networks (GANs), we employ an adversarial training architecture and name it as Adversarial-NMT. In Adversarial-NMT, the training of the NMT model is assisted by an adversary, which is an elaborately designed Convolutional Neural Network (CNN). The goal of the adversary is to differentiate the translation result generated by the NMT model from that by human. The goal of the NMT model is to produce high quality translations so as to cheat the adversary. A policy gradient method is leveraged to co-train the NMT model and the adversary. Experimental results on English→French and German→English translation tasks show that Adversarial-NMT can achieve significantly better translation quality than several strong baselines.

URL: https://arxiv.org/abs/1704.06933

Notes: this paper has made a lot of noise: an adversarial objective for NMT has been long awaited, and the authors are the first to make it work

Deep Text Classification Can be Fooled

Authors: Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, Wenchang Shi

Abstract: Deep neural networks (DNNs) play a key role in many applications. Current studies focus on crafting adversarial samples against DNN-based image classifiers by introducing some imperceptible perturbations to the input. However, DNNs for natural language processing have not got the attention they deserve. In fact, the existing perturbation algorithms for images cannot be directly applied to text. This paper presents a simple but effective method to attack DNN-based text classifiers. Three perturbation strategies, namely insertion, modification, and removal, are designed to generate an adversarial sample for a given text. By computing the cost gradients, what should be inserted, modified or removed, where to insert and how to modify are determined effectively. The experimental results show that the adversarial samples generated by our method can successfully fool a state-of-the-art model to misclassify them as any desirable classes without compromising their utilities. At the same time, the introduced perturbations are difficult to be perceived. Our study demonstrates that DNN-based text classifiers are also prone to the adversarial sample attack.

URL: https://arxiv.org/abs/1704.08006

Notes: the authors craft character- and word-level insertions, modifications and removals to fool a recent SOTA CNN text classifier, nice

Reading Wikipedia to Answer Open-Domain Questions

Authors: Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes

Abstract: This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.

URL: https://arxiv.org/abs/1704.00051

Notes: DrQA paper, improving AI2 approach: wiki-search + RNN reader, with code! https://github.com/facebookresearch/DrQA

Mapping Instructions and Visual Observations to Actions with Reinforcement Learning

Authors: Dipendra Misra, John Langford, Yoav Artzi

Abstract: We propose to directly map raw visual observations and text input to actions for instruction execution. While existing approaches assume access to structured environment representations or use a pipeline of separately trained models, we learn a single model to jointly reason about linguistic and visual input. We use reinforcement learning in a contextual bandit setting to train a neural network agent. To guide the agent's exploration, we use reward shaping with different forms of supervision. Our approach does not require intermediate representations, planning procedures, or training different models. We evaluate in a simulated environment, and show significant improvements over supervised learning and common reinforcement learning variants.

URL: https://arxiv.org/abs/1704.08795

Notes: nice work on multi-modal reasoning: CNN for vision, LSTM for text, RL for overall task

Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning

Authors: Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, Kam-Fai Wong

Abstract: Building a dialogue agent to fulfill complex tasks, such as travel planning, is challenging because the agent has to learn to collectively complete multiple subtasks. For example, the agent needs to reserve a hotel and book a flight so that there leaves enough time for commute between arrival and hotel check-in. This paper addresses this challenge by formulating the task in the mathematical framework of options over Markov Decision Processes (MDPs), and proposing a hierarchical deep reinforcement learning approach to learning a dialogue manager that operates at different temporal scales. The dialogue manager consists of: (1) a top-level dialogue policy that selects among subtasks or options, (2) a low-level dialogue policy that selects primitive actions to complete the subtask given by the top-level policy, and (3) a global state tracker that helps ensure all cross-subtask constraints be satisfied. Experiments on a travel planning task with simulated and real users show that our approach leads to significant improvements over three baselines, two based on handcrafted rules and the other based on flat deep reinforcement learning.

URL: https://arxiv.org/abs/1704.03084

Notes: a formalization of multi-goal dialog systems as a hierarchical RL task; there are some results, though not very convincing ones

The loss surface of deep and wide neural networks

Authors: Quynh Nguyen, Matthias Hein

Abstract: While the optimization problem behind deep neural networks is highly non-convex, it is frequently observed in practice that training deep networks seems possible without getting stuck in suboptimal points. It has been argued that this is the case as all local minima are close to being globally optimal. We show that this is (almost) true, in fact almost all local minima are globally optimal, for a fully connected network with squared loss and analytic activation function given that the number of hidden units of one layer of the network is larger than the number of training points and the network structure from this layer on is pyramidal.

URL: https://arxiv.org/abs/1704.08045

Notes: a theoretical proof that (almost) all local minima are globally optimal for certain wide feed-forward networks

Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings

Authors: He He, Anusha Balakrishnan, Mihail Eric, Percy Liang

Abstract: We study a symmetric collaborative dialogue setting in which two agents, each with private knowledge, must strategically communicate to achieve a common goal. The open-ended dialogue state in this setting poses new challenges for existing dialogue systems. We collected a dataset of 11K human-human dialogues, which exhibits interesting lexical, semantic, and strategic elements. To model both structured knowledge and unstructured language, we propose a neural model with dynamic knowledge graph embeddings that evolve as the dialogue progresses. Automatic and human evaluations show that our model is both more effective at achieving the goal and more human-like than baseline neural and rule-based models.

URL: https://arxiv.org/abs/1704.07130

Notes: a collaborative conversational dataset about finding mutual friends; a baseline solution built on a knowledge graph; all the code is in a notebook

2017-05

Machine Comprehension by Text-to-Text Neural Question Generation

Authors: Xingdi Yuan, Tong Wang, Caglar Gulcehre, Alessandro Sordoni, Philip Bachman, Sandeep Subramanian, Saizheng Zhang, Adam Trischler

Abstract: We propose a recurrent neural model that generates natural-language questions from documents, conditioned on answers. We show how to train the model using a combination of supervised and reinforcement learning. After teacher forcing for standard maximum likelihood training, we fine-tune the model using policy gradient techniques to maximize several rewards that measure question quality. Most notably, one of these rewards is the performance of a question-answering system. We motivate question generation as a means to improve the performance of question answering systems. Our model is trained and evaluated on the recent question-answering dataset SQuAD.

URL: https://arxiv.org/abs/1705.02012

Notes: nice work from the people at Maluuba: generating questions from text, trained on SQuAD with RL rewards - QA performance (F1) and perplexity

The power of deeper networks for expressing natural functions

Authors: David Rolnick, Max Tegmark

Abstract: It is well-known that neural networks are universal approximators, but that deeper networks tend to be much more efficient than shallow ones. We shed light on this by proving that the total number of neurons m required to approximate natural classes of multivariate polynomials of n variables grows only linearly with n for deep neural networks, but grows exponentially when merely a single hidden layer is allowed. We also provide evidence that when the number of hidden layers is increased from 1 to k, the neuron requirement grows exponentially not with n but with n^(1/k), suggesting that the minimum number of layers required for computational tractability grows only logarithmically with n.

URL: https://arxiv.org/abs/1705.05502

Notes: an ''obvious'' statement is proved in this paper for a wider range of functions #theoreticaljustification

Key-Value Retrieval Networks for Task-Oriented Dialogue

Authors: Mihail Eric, Christopher D. Manning

Abstract: Neural task-oriented dialogue systems often struggle to smoothly interface with a knowledge base. In this work, we seek to address this problem by proposing a new neural dialogue agent that is able to effectively sustain grounded, multi-domain discourse through a novel key-value retrieval mechanism. The model is end-to-end differentiable and does not need to explicitly model dialogue state or belief trackers. We also release a new dataset of 3,031 dialogues that are grounded through underlying knowledge bases and span three distinct tasks in the in-car personal assistant space: calendar scheduling, weather information retrieval, and point-of-interest navigation. Our architecture is simultaneously trained on data from all domains and significantly outperforms a competitive rule-based system and other existing neural dialogue architectures on the provided domains according to both automatic and human evaluation metrics.

URL: https://arxiv.org/abs/1705.05414

Notes: fresh paper from Manning's group: key-value memory as attention for task-oriented dialog system
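
The core key-value attention read, sketched generically (my code; in the paper the attention weights are used to bias the decoder towards knowledge-base entries rather than to produce a plain weighted sum):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def key_value_read(query, keys, values):
    """Score each memory key against the query, then return the
    attention-weighted combination of the corresponding values."""
    weights = softmax(keys @ query)   # (num_slots,)
    return weights @ values           # (value_dim,)

# toy usage: 5 memory slots, key dim 8, value dim 16
out = key_value_read(np.random.randn(8), np.random.randn(5, 8), np.random.randn(5, 16))
print(out.shape)
```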

Convolutional Sequence to Sequence Learning

Authors: Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin

Abstract: The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks. Compared to recurrent models, computations over all elements can be fully parallelized during training and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.

URL: https://arxiv.org/abs/1705.03122

Notes: FAIR's paper on NMT: fully convolutional NMT (and summarization) with multi-layer attention in the decoder; better to read it yourself

Multimodal Word Distributions

Authors: Ben Athiwaratkun, Andrew Gordon Wilson

Abstract: Word embeddings provide point representations of words containing useful semantic information. We introduce multimodal word distributions formed from Gaussian mixtures, for multiple word meanings, entailment, and rich uncertainty information. To learn these distributions, we propose an energy-based max-margin objective. We show that the resulting approach captures uniquely expressive semantic information, and outperforms alternatives, such as word2vec skip-grams, and Gaussian embeddings, on benchmark datasets such as word similarity and entailment.

URL: https://arxiv.org/abs/1704.08424

Notes: m-dimensional Gaussian mixtures as word embeddings; almost word2vec, but with a probabilistic flavour

A Deep Reinforced Model for Abstractive Summarization

Authors: Romain Paulus, Caiming Xiong, Richard Socher

Abstract: Attentional, RNN-based encoder-decoder models for abstractive summarization have achieved good performance on short input and output sequences. However, for longer documents and summaries, these models often include repetitive and incoherent phrases. We introduce a neural network model with intra-attention and a new training method. This method combines standard supervised word prediction and reinforcement learning (RL). Models trained only with the former often exhibit "exposure bias" -- they assume ground truth is provided at each step during training. However, when standard word prediction is combined with the global sequence prediction training of RL the resulting summaries become more readable. We evaluate this model on the CNN/Daily Mail and New York Times datasets. Our model obtains a 41.16 ROUGE-1 score on the CNN/Daily Mail dataset, a 5.7 absolute points improvement over previous state-of-the-art models. It also performs well as the first abstractive model on the New York Times corpus. Human evaluation also shows that our model produces higher quality summaries.

URL: https://arxiv.org/abs/1705.04304

Notes: Socher's fresh paper about summarization; intra-sequence attention for decoder

Recurrent Additive Networks

Authors: Kenton Lee, Omer Levy, Luke Zettlemoyer

Abstract: We introduce recurrent additive networks (RANs), a new gated RNN which is distinguished by the use of purely additive latent state updates. At every time step, the new state is computed as a gated component-wise sum of the input and the previous state, without any of the non-linearities commonly used in RNN transition dynamics. We formally show that RAN states are weighted sums of the input vectors, and that the gates only contribute to computing the weights of these sums. Despite this relatively simple functional form, experiments demonstrate that RANs outperform both LSTMs and GRUs on benchmark language modeling problems. This result shows that many of the non-linear computations in LSTMs and related networks are not essential, at least for the problems we consider, and suggests that the gates are doing more of the computational work than previously understood.

URL: http://www.kentonl.com/pub/llz.2017.pdf

Notes: a RAN is a simplified LSTM: the state update is just a gated sum of the input content and the previous state, with no non-linear transition
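
One RAN step, sketched in NumPy (my code; the parameter names are mine, and the paper also considers an identity output non-linearity instead of tanh):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ran_step(x_t, h_prev, c_prev, params):
    """One step of a Recurrent Additive Network cell: the new cell state is a
    gated sum of a linear content term and the previous state (no non-linear
    transition), and the output is an element-wise non-linearity of it."""
    W_cx, W_ih, W_ix, W_fh, W_fx, b_i, b_f = params
    content = W_cx @ x_t                                # linear content, no tanh here
    i_t = sigmoid(W_ih @ h_prev + W_ix @ x_t + b_i)     # input gate
    f_t = sigmoid(W_fh @ h_prev + W_fx @ x_t + b_f)     # forget gate
    c_t = i_t * content + f_t * c_prev                  # purely additive update
    h_t = np.tanh(c_t)                                  # output non-linearity (tanh variant)
    return h_t, c_t
```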

Adversarial Generation of Natural Language

Authors: Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, Aaron Courville

Abstract: Generative Adversarial Networks (GANs) have gathered a lot of attention from the computer vision community, yielding impressive results for image generation. Advances in the adversarial generation of natural language from noise however are not commensurate with the progress made in generating images, and still lag far behind likelihood based methods. In this paper, we take a step towards generating natural language with a GAN objective alone. We introduce a simple baseline that addresses the discrete output space problem without relying on gradient estimators and show that it is able to achieve state-of-the-art results on a Chinese poem generation dataset. We present quantitative results on generating sentences from context-free and probabilistic context-free grammars, and qualitative language modeling results. A conditional version is also described that can generate sequences conditioned on sentence characteristics.

URL: https://arxiv.org/abs/1705.10929

Notes: synthetic baseline for GANs in text to start with

The Cramer Distance as a Solution to Biased Wasserstein Gradients

Authors: Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, Rémi Munos

Abstract: The Wasserstein probability metric has received much attention from the machine learning community. Unlike the Kullback-Leibler divergence, which strictly measures change in probability, the Wasserstein metric reflects the underlying geometry between outcomes. The value of being sensitive to this geometry has been demonstrated, among others, in ordinal regression and generative modelling. In this paper we describe three natural properties of probability divergences that reflect requirements from machine learning: sum invariance, scale sensitivity, and unbiased sample gradients. The Wasserstein metric possesses the first two properties but, unlike the Kullback-Leibler divergence, does not possess the third. We provide empirical evidence suggesting that this is a serious issue in practice. Leveraging insights from probabilistic forecasting we propose an alternative to the Wasserstein metric, the Cramér distance. We show that the Cramér distance possesses all three desired properties, combining the best of the Wasserstein and Kullback-Leibler divergences. To illustrate the relevance of the Cramér distance in practice we design a new algorithm, the Cramér Generative Adversarial Network (GAN), and show that it performs significantly better than the related Wasserstein GAN.

URL: https://arxiv.org/abs/1705.10743

Notes: the Cramér distance seems to be a reasonable alternative to the Wasserstein metric: it has the unbiased sample gradients that Wasserstein is missing
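
For intuition, the 1-D Cramér distance is just the integrated squared difference of the CDFs; a quick empirical estimate from samples (my code, not the paper's multivariate/GAN formulation):

```python
import numpy as np

def cramer_distance_1d(x, y):
    """Empirical Cramér distance between two 1-D samples: the integral of
    the squared difference of their empirical CDFs."""
    grid = np.sort(np.concatenate([x, y]))
    F_x = np.searchsorted(np.sort(x), grid, side='right') / len(x)
    F_y = np.searchsorted(np.sort(y), grid, side='right') / len(y)
    return np.trapz((F_x - F_y) ** 2, grid)

a = np.random.normal(0.0, 1.0, 1000)
b = np.random.normal(0.5, 1.0, 1000)
print(cramer_distance_1d(a, b))
```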

Second-Order Word Embeddings from Nearest Neighbor Topological Features

Authors: Denis Newman-Griffis, Eric Fosler-Lussier

Abstract: We introduce second-order vector representations of words, induced from nearest neighborhood topological features in pre-trained contextual word embeddings. We then analyze the effects of using second-order embeddings as input features in two deep natural language processing models, for named entity recognition and recognizing textual entailment, as well as a linear model for paraphrase recognition. Surprisingly, we find that nearest neighbor information alone is sufficient to capture most of the performance benefits derived from using pre-trained word embeddings. Furthermore, second-order embeddings are able to handle highly heterogeneous data better than first-order representations, though at the cost of some specificity. Additionally, augmenting contextual embeddings with second-order information further improves model performance in some cases. Due to variance in the random initializations of word embeddings, utilizing nearest neighbor features from multiple first-order embedding samples can also contribute to downstream performance gains. Finally, we identify intriguing characteristics of second-order embedding spaces for further research, including much higher density and different semantic interpretations of cosine similarity.

URL: https://arxiv.org/abs/1705.08488

Notes: a simple paper: train word2vec, then train new embeddings based on the nearest neighbours it produces

Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols

Authors: Serhii Havrylov, Ivan Titov

Abstract: Learning to communicate through interaction, rather than relying on explicit supervision, is often considered a prerequisite for developing a general AI. We study a setting where two agents engage in playing a referential game and, from scratch, develop a communication protocol necessary to succeed in this game. Unlike previous work, we require that messages they exchange, both at train and test time, are in the form of a language (i.e. sequences of discrete symbols). We compare a reinforcement learning approach and one using a differentiable relaxation (straight-through Gumbel-softmax estimator) and observe that the latter is much faster to converge and it results in more effective protocols. Interestingly, we also observe that the protocol we induce by optimizing the communication success exhibits a degree of compositionality and variability (i.e. the same information can be phrased in different ways), both properties characteristic of natural languages. As the ultimate goal is to ensure that communication is accomplished in natural language, we also perform experiments where we inject prior information about natural language into our model and study properties of the resulting protocol.

URL: https://arxiv.org/abs/1705.11192

Notes: Gumbel-softmax works better than RL for the language invention task (yes, this task already exists!)
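
The Gumbel-softmax relaxation itself is easy to write down (my sketch; the straight-through variant used in the paper additionally takes the argmax in the forward pass while letting the softmax gradients flow backward):

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature=1.0, rng=np.random):
    """Draw a relaxed one-hot sample from a categorical distribution
    using the Gumbel-softmax trick."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))               # Gumbel(0, 1) noise
    y = (logits + gumbel) / temperature
    e = np.exp(y - y.max())
    return e / e.sum()                         # approaches one-hot as temperature -> 0

print(gumbel_softmax_sample(np.log([0.7, 0.2, 0.1]), temperature=0.1))
```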

Discourse-Based Objectives for Fast Unsupervised Sentence Representation Learning

Authors: Yacine Jernite, Samuel R. Bowman, David Sontag

Abstract: This work presents a novel objective function for the unsupervised training of neural network sentence encoders. It exploits signals from paragraph-level discourse coherence to train these models to understand text. Our objective is purely discriminative, allowing us to train models many times faster than was possible under prior methods, and it yields models which perform well in extrinsic evaluations.

URL: https://arxiv.org/abs/1705.00557

Notes: siamese sentence embeddings trained in a multi-task setting; claimed to be lightning fast

Emergent Language in a Multi-Modal, Multi-Step Referential Game

Authors: Katrina Evtimova, Andrew Drozdov, Douwe Kiela, Kyunghyun Cho

Abstract: Inspired by previous work on emergent language in referential games, we propose a novel multi-modal, multi-step referential game, where the sender and receiver have access to distinct modalities of an object, and their information exchange is bidirectional and of arbitrary duration. The multi-modal multi-step setting allows agents to develop an internal language significantly closer to natural language, in that they share a single set of messages, and that the length of the conversation may vary according to the difficulty of the task. We examine these properties empirically using a dataset consisting of images and textual descriptions of mammals, where the agents are tasked with identifying the correct object. Our experiments indicate that a robust and efficient communication protocol emerges, where gradual information exchange informs better predictions and higher communication bandwidth improves generalization.

URL: https://arxiv.org/abs/1705.10369

Notes: one agent describes a picture to another, and the other has to guess which object in the picture is being referred to

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

Authors: Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, Antoine Bordes

Abstract: Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however not been so successful. Several attempts at learning unsupervised representations of sentences have not reached satisfactory enough performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks. Our encoder is publicly available.

URL: https://arxiv.org/abs/1705.02364

Notes: a comparison of several architectures (hierarchical conv net, bi-LSTM with pooling, self-attentive net, standard LSTM/GRU) on several sentence-level tasks such as entailment; also nice that the training details for the architectures are provided; with code for evaluation!

2017-06

On Unifying Deep Generative Models

Authors: Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, Eric P. Xing

Abstract: Deep generative models have achieved impressive success in recent years. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), as powerful frameworks for deep generative model learning, have largely been considered as two distinct paradigms and received extensive independent study respectively. This paper establishes formal connections between deep generative modeling approaches through a new formulation of GANs and VAEs. We show that GANs and VAEs are essentially minimizing KL divergences with opposite directions and reversed latent/visible treatments, extending the two learning phases of classic wake-sleep algorithm, respectively. The unified view provides a powerful tool to analyze a diverse set of existing model variants, and enables to exchange ideas across research lines in a principled way. For example, we transfer the importance weighting method in VAE literatures for improved GAN learning, and enhance VAEs with an adversarial mechanism. Quantitative experiments show generality and effectiveness of the imported extensions.

URL: https://arxiv.org/abs/1706.00550

Notes: a fresh paper from Russ Salakhutdinov's group: GANs and VAEs are essentially two views of one approach and are fruitful in combination; both minimize KL-divergences, just in opposite directions

A simple neural network module for relational reasoning

Authors: Adam Santoro, David Raposo, David G.T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, Timothy Lillicrap

Abstract: Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn. In this paper we describe how to use Relation Networks (RNs) as a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning. We tested RN-augmented networks on three tasks: visual question answering using a challenging dataset called CLEVR, on which we achieve state-of-the-art, super-human performance; text-based question answering using the bAbI suite of tasks; and complex reasoning about dynamic physical systems. Then, using a curated dataset called Sort-of-CLEVR we show that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with RNs. Our work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.

URL: https://arxiv.org/abs/1706.01427

Notes: a revolutionary work in a way: just an MLP over pairs of 'objects' pushes VQA to super-human performance
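
The RN read-out really is that simple; a toy sketch (my code, omitting the CNN object extraction and the question embedding that the paper concatenates to each pair):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relation_network(objects, g_params, f_params):
    """Relation Network read-out: run a small MLP g over every ordered pair of
    object vectors, sum the results, then pass the sum through MLP f."""
    (Wg1, Wg2), (Wf1, Wf2) = g_params, f_params
    pair_sum = 0.0
    for o_i in objects:
        for o_j in objects:
            pair = np.concatenate([o_i, o_j])
            pair_sum = pair_sum + relu(Wg2 @ relu(Wg1 @ pair))
    return Wf2 @ relu(Wf1 @ pair_sum)

# toy usage: 4 objects of dimension 8, 10 output classes
objs = [np.random.randn(8) for _ in range(4)]
g = (np.random.randn(32, 16), np.random.randn(32, 32))
f = (np.random.randn(32, 32), np.random.randn(10, 32))
print(relation_network(objs, g, f).shape)
```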

Parameter Space Noise for Exploration

Authors: Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, Marcin Andrychowicz

Abstract: Deep reinforcement learning (RL) methods generally engage in exploratory behavior through noise injection in the action space. An alternative is to add noise directly to the agent's parameters, which can lead to more consistent exploration and a richer set of behaviors. Methods such as evolutionary strategies use parameter perturbations, but discard all temporal structure in the process and require significantly more samples. Combining parameter noise with traditional RL methods allows to combine the best of both worlds. We demonstrate that both off- and on-policy methods benefit from this approach through experimental comparison of DQN, DDPG, and TRPO on high-dimensional discrete action environments as well as continuous control tasks. Our results show that RL with parameter noise learns more efficiently than traditional RL with action space noise and evolutionary strategies individually.

URL: https://arxiv.org/abs/1706.01905

Notes: Bayesian approach to enhance deep RL methods - adding noise to params

UCB and InfoGain Exploration via Q-Ensembles

Authors: Richard Y. Chen, Szymon Sidor, Pieter Abbeel, John Schulman

Abstract: We show how an ensemble of Q∗-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well established algorithms from the bandit setting, and adapt them to the Q-learning setting. First we propose an exploration strategy based on upper-confidence bounds (UCB). Next, we define an "InfoGain" exploration bonus, which depends on the disagreement of the Q-ensemble. Our experiments show significant gains on the Atari benchmark.

URL: https://arxiv.org/abs/1706.01502

Notes: a bagging-style analogue for Q-function learning: train a bunch of Q-functions and combine their votes (UCB / info-gain bonuses) for exploration

Deep reinforcement learning from human preferences

Authors: Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei

Abstract: For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.

URL: https://arxiv.org/abs/1706.03741

Notes: human-in-the-loop for RL tasks: human compares trajectories from time to time

Attention Is All You Need

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

URL: https://arxiv.org/abs/1706.03762

Notes: fresh Google paper: a purely attention-based network (stacked self-attention and position-wise feed-forward layers, no recurrence) gives SOTA in NMT; also curious that they just add positional encodings to the word embeddings to enrich them with order information
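
The positional encoding they add is just fixed sinusoids (my sketch, assuming an even `d_model`):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings:
    PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2), d_model even
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe            # added element-wise to the word embeddings
```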

Self-Normalizing Neural Networks

Authors: Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter

Abstract: Deep Learning has revolutionized vision via convolutional neural networks (CNNs) and natural language processing via recurrent neural networks (RNNs). However, success stories of Deep Learning with standard feed-forward neural networks (FNNs) are rare. FNNs that perform well are typically shallow and, therefore cannot exploit many levels of abstract representations. We introduce self-normalizing neural networks (SNNs) to enable high-level abstract representations. While batch normalization requires explicit normalization, neuron activations of SNNs automatically converge towards zero mean and unit variance. The activation function of SNNs are "scaled exponential linear units" (SELUs), which induce self-normalizing properties. Using the Banach fixed-point theorem, we prove that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance -- even under the presence of noise and perturbations. This convergence property of SNNs allows to (1) train deep networks with many layers, (2) employ strong regularization, and (3) to make learning highly robust. Furthermore, for activations not close to unit variance, we prove an upper and lower bound on the variance, thus, vanishing and exploding gradients are impossible. We compared SNNs on (a) 121 tasks from the UCI machine learning repository, on (b) drug discovery benchmarks, and on (c) astronomy tasks with standard FNNs and other machine learning methods such as random forests and support vector machines. SNNs significantly outperformed all competing FNN methods at 121 UCI tasks, outperformed all competing methods at the Tox21 dataset, and set a new record at an astronomy data set. The winning SNN architectures are often very deep. Implementations are available at: github.com/bioinf-jku/SNNs.

URL: https://arxiv.org/abs/1706.02515

Notes: new paper from the author of the LSTM: a fully connected net with a specific activation (like a leaky ReLU, but better) seems to be a reasonable alternative to SVMs, random forests, etc. on the kinds of tasks where those classical models usually win, e.g. many Kaggle competitions
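
The activation itself, with the fixed constants from the paper:

```python
import numpy as np

# fixed (not learned) constants from the SELU paper
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """Scaled exponential linear unit: pushes activations toward
    zero mean and unit variance across layers."""
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

print(selu(np.array([-2.0, 0.0, 2.0])))
```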

A Mixture Model for Learning Multi-Sense Word Embeddings

Authors: Dai Quoc Nguyen, Dat Quoc Nguyen, Ashutosh Modi, Stefan Thater, Manfred Pinkal

Abstract: Word embeddings are now a standard technique for inducing meaning representations for words. For getting good representations, it is important to take into account different senses of a word. In this paper, we propose a mixture model for learning multi-sense word embeddings. Our model generalizes the previous works in that it allows to induce different weights of different senses of a word. The experimental results show that our model outperforms previous models on standard evaluation tasks.

URL: https://arxiv.org/abs/1706.05111

Notes: multi-sense word embeddings based on topic modeling, short and clear

End-to-End Neural Ad-hoc Ranking with Kernel Pooling

Authors: Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, Russell Power

Abstract: This paper proposes K-NRM, a kernel based neural model for document ranking. Given a query and a set of documents, K-NRM uses a translation matrix that models word-level similarities via word embeddings, a new kernel-pooling technique that uses kernels to extract multi-level soft match features, and a learning-to-rank layer that combines those features into the final ranking score. The whole model is trained end-to-end. The ranking layer learns desired feature patterns from the pairwise ranking loss. The kernels transfer the feature patterns into soft-match targets at each similarity level and enforce them on the translation matrix. The word embeddings are tuned accordingly so that they can produce the desired soft matches. Experiments on a commercial search engine's query log demonstrate the improvements of K-NRM over prior feature-based and neural-based states-of-the-art, and explain the source of K-NRM's advantage: Its kernel-guided embedding encodes a similarity metric tailored for matching query words to document words, and provides effective multi-level soft matches.

URL: https://arxiv.org/abs/1706.06613

Notes: a matrix of cosine similarities between document and query words, with learnable kernels on top, makes for good ranking
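
A rough sketch of the kernel-pooling step as I read it: RBF kernels at different similarity levels turn the cosine-similarity matrix into a handful of soft-match features (the kernel means/widths below are made-up values, and I use log1p instead of a clamped log for simplicity):

```python
import numpy as np

def kernel_pooling(sim_matrix, mus, sigma=0.1):
    """sim_matrix: (n_query_words, n_doc_words) cosine similarities.
    Returns one soft-match feature per kernel, to feed into a learning-to-rank layer."""
    feats = []
    for mu in mus:
        k = np.exp(-(sim_matrix - mu) ** 2 / (2 * sigma ** 2))  # soft term frequency at level mu
        feats.append(np.log1p(k.sum(axis=1)).sum())             # pool over doc words, then query words
    return np.array(feats)

# toy example: random unit vectors for 3 query words and 20 document words
q = np.random.randn(3, 50);  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = np.random.randn(20, 50); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(kernel_pooling(q @ d.T, mus=np.linspace(-0.9, 1.0, 11)))
```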

Learning to Compute Word Embeddings On the Fly

Authors: Dzmitry Bahdanau, Tom Bosc, Stanisław Jastrzębski, Edward Grefenstette, Pascal Vincent, Yoshua Bengio

Abstract: Words in natural language follow a Zipfian distribution whereby some words are frequent but most are rare. Learning representations for words in the "long tail" of this distribution requires enormous amounts of data. Representations of rare words trained directly on end-tasks are usually poor, requiring us to pre-train embeddings on external data, or treat all rare words as out-of-vocabulary words with a unique representation. We provide a method for predicting embeddings of rare words on the fly from small amounts of auxiliary data with a network trained against the end task. We show that this improves results against baselines where embeddings are trained on the end task in a reading comprehension task, a recognizing textual entailment task, and in language modelling.

URL: https://arxiv.org/abs/1706.00286

Notes: creates embeddings on the fly for rare words by reading their dictionary definitions with an LSTM

RelNet: End-to-end Modeling of Entities & Relations

Authors: Trapit Bansal, Arvind Neelakantan, Andrew McCallum

Abstract: We introduce RelNet: a new model for relational reasoning. RelNet is a memory augmented neural network which models entities as abstract memory slots and is equipped with an additional relational memory which models relations between all memory pairs. The model thus builds an abstract knowledge graph on the entities and relations present in a document which can then be used to answer questions about the document. It is trained end-to-end: only supervision to the model is in the form of correct answers to the questions. We test the model on the 20 bAbI question-answering tasks with 10k examples per task and find that it solves all the tasks with a mean error of 0.3%, achieving 0% error on 11 of the 20 tasks.

URL: https://arxiv.org/abs/1706.07179

Notes: new SOTA on bAbI, next step in EntNet arch, key-value memory and attention on the output

Plan, Attend, Generate: Character-level Neural Machine Translation with Planning in the Decoder

Authors: Caglar Gulcehre, Francis Dutil, Adam Trischler, Yoshua Bengio

Abstract: We investigate the integration of a planning mechanism into an encoder-decoder architecture with an explicit alignment for character-level machine translation. We develop a model that plans ahead when it computes alignments between the source and target sequences, constructing a matrix of proposed future alignments and a commitment vector that governs whether to follow or recompute the plan. This mechanism is inspired by the strategic attentive reader and writer (STRAW) model. Our proposed model is end-to-end trainable with fully differentiable operations. We show that it outperforms a strong baseline on three character-level decoder neural machine translation on WMT'15 corpus. Our analysis demonstrates that our model can compute qualitatively intuitive alignments and achieves superior performance with fewer parameters.

URL: https://arxiv.org/abs/1706.05087

Notes: a plan over alignments instead of direct attention; also, Gumbel-softmax is again reported to work better than RL

Do GANs actually learn the distribution? An empirical study

Authors: Sanjeev Arora, Yi Zhang

Abstract: Do GANS (Generative Adversarial Nets) actually learn the target distribution? The foundational paper of (Goodfellow et al 2014) suggested they do, if they were given sufficiently large deep nets, sample size, and computation time. A recent theoretical analysis in Arora et al (to appear at ICML 2017) raised doubts whether the same holds when discriminator has finite size. It showed that the training objective can approach its optimum value even if the generated distribution has very low support ---in other words, the training objective is unable to prevent mode collapse. The current note reports experiments suggesting that such problems are not merely theoretical. It presents empirical evidence that well-known GANs approaches do learn distributions of fairly low support, and thus presumably are not learning the target distribution. The main technical contribution is a new proposed test, based upon the famous birthday paradox, for estimating the support size of the generated distribution.

URL: https://arxiv.org/abs/1706.08224

Notes: GANs learn a distribution whose support is much smaller than that of the original one; the birthday paradox exposes this
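
The test itself is simple: if duplicates appear among s samples with noticeable probability, the support size is on the order of s^2. A toy simulation of that relationship, with a uniform discrete distribution standing in for the generator:

```python
import numpy as np

rng = np.random.default_rng(0)

def collision_rate(support_size, batch_size, trials=2000):
    """Fraction of batches that contain at least one duplicate sample."""
    hits = 0
    for _ in range(trials):
        batch = rng.integers(0, support_size, size=batch_size)
        hits += len(np.unique(batch)) < batch_size
    return hits / trials

# P(collision) ~ 1 - exp(-s^2 / 2N), so a ~50% collision rate at batch size s
# suggests a support size of roughly s^2 / (2 ln 2).
N = 10_000
s = int((2 * N * np.log(2)) ** 0.5)
print(s, collision_rate(N, s))   # ~0.5
```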

Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation

Authors: Shikhar Sharma, Layla El Asri, Hannes Schulz, Jeremie Zumer

Abstract: Automated metrics such as BLEU are widely used in the machine translation literature. They have also been used recently in the dialogue community for evaluating dialogue response generation. However, previous work in dialogue response generation has shown that these metrics do not correlate strongly with human judgment in the non task-oriented dialogue setting. Task-oriented dialogue responses are expressed on narrower domains and exhibit lower diversity. It is thus reasonable to think that these automated metrics would correlate well with human judgment in the task-oriented setting where the generation task consists of translating dialogue acts into a sentence. We conduct an empirical study to confirm whether this is the case. Our findings indicate that these automated metrics have stronger correlation with human judgments in the task-oriented setting compared to what has been observed in the non task-oriented setting. We also observe that these metrics correlate even better for datasets which provide multiple ground truth reference sentences. In addition, we show that some of the currently available corpora for task-oriented language generation can be solved with simple models and advocate for more challenging datasets.

URL: https://arxiv.org/abs/1706.09799

Notes: Maluuba's paper on dialog quality measures; shows weak correlation with human judgement for METEOR; with code!

Sub-domain Modelling for Dialogue Management with Hierarchical Reinforcement Learning

Authors: Pawel Budzianowski, Stefan Ultes, Pei-Hao Su, Nikola Mrksic, Tsung-Hsien Wen, Inigo Casanueva, Lina Rojas-Barahona, Milica Gasic

Abstract: Human conversation is inherently complex, often spanning many different topics/domains. This makes policy learning for dialogue systems very challenging. Standard flat reinforcement learning methods do not provide an efficient framework for modelling such dialogues. In this paper, we focus on the under-explored problem of multi-domain dialogue management. First, we propose a new method for hierarchical reinforcement learning using the option framework. Next, we show that the proposed architecture learns faster and arrives at a better policy than the existing flat ones do. Moreover, we show how pretrained policies can be adapted to more complex systems with an additional set of new actions. In doing that, we show that our approach has the potential to facilitate policy optimisation for more sophisticated multi-domain dialogue systems.

URL: https://arxiv.org/abs/1706.06210

Notes: hierarchical skills for dialog systems, next step for original Cambridge dialog system

Toward Neural Phrase-based Machine Translation

Authors: Po-Sen Huang, Chong Wang, Dengyong Zhou, Li Deng

Abstract: In this paper, we propose Neural Phrase-based Machine Translation (NPMT). Our method explicitly models the phrase structures in output sequences through Sleep-WAke Networks (SWAN), a recently proposed segmentation-based sequence modeling method. To alleviate the monotonic alignment requirement of SWAN, we introduce a new layer to perform (soft) local reordering of input sequences. Our experiments show that NPMT achieves state-of-the-art results on IWSLT 2014 German-English translation task without using any attention mechanisms. We also observe that our method produces meaningful phrases in the output language.

URL: https://arxiv.org/abs/1706.05565

Notes: as in the authors' previous seq2seq segmentation work, with the addition of soft attention over a sliding window on the input sequence

Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog

Authors: Satwik Kottur, José M.F. Moura, Stefan Lee, Dhruv Batra

Abstract: A number of recent works have proposed techniques for end-to-end learning of communication protocols among cooperative multi-agent populations, and have simultaneously found the emergence of grounded human-interpretable language in the protocols developed by the agents, all learned without any human supervision! In this paper, using a Task and Tell reference game between two agents as a testbed, we present a sequence of 'negative' results culminating in a 'positive' one -- showing that while most agent-invented languages are effective (i.e. achieve near-perfect task rewards), they are decidedly not interpretable or compositional. In essence, we find that natural language does not emerge 'naturally', despite the semblance of ease of natural-language-emergence that one may gather from recent literature. We discuss how it is possible to coax the invented languages to become more and more human-like and compositional by increasing restrictions on how two agents may communicate.

URL: https://arxiv.org/abs/1706.08502

Notes: interesting paper on language emergence: a compositional language can emerge in some settings, but it is not guaranteed

Personalization in Goal-Oriented Dialog

Authors: Chaitanya K. Joshi, Fei Mi, Boi Faltings

Abstract: The main goal of modeling human conversation is to create agents which can interact with people in both open-ended and goal-oriented scenarios. End-to-end trained neural dialog systems are an important line of research for such generalized dialog models as they do not resort to any situation-specific handcrafting of rules. However, incorporating personalization into such systems is a largely unexplored topic as there are no existing corpora to facilitate such work. In this paper, we present a new dataset of goal-oriented dialogs which are influenced by speaker profiles attached to them. We analyze the shortcomings of an existing end-to-end dialog system based on Memory Networks and propose modifications to the architecture which enable personalization. We also investigate personalization in dialog as a multi-task learning problem, and show that a single model which shares features among various profiles outperforms separate models for each profile.

URL: https://arxiv.org/abs/1706.07503

Notes: personalization in retrieval-based dialog systems; the dataset is based on bAbI; they use memory networks with the memory additionally split so that dialog history & personal features are kept separate; with code!

2017-07

Deep Semantic Role Labeling: What Works and What’s Next

Authors: Luheng He, Kenton Lee, Mike Lewis, Luke Zettlemoyer

Abstract: We introduce a new deep learning model for semantic role labeling (SRL) that significantly improves the state of the art, along with detailed analyses to reveal its strengths and limitations. We use a deep highway BiLSTM architecture with constrained decoding, while observing a number of recent best practices for initialization and regularization. Our 8-layer ensemble model achieves 83.2 F1 on the CoNLL 2005 test set and 83.4 F1 on CoNLL 2012, roughly a 10% relative error reduction over the previous state of the art. Extensive empirical analysis of these gains show that (1) deep models excel at recovering long-distance dependencies but can still make surprisingly obvious errors, and (2) that there is still room for syntactic parsers to improve these results.

URL: https://homes.cs.washington.edu/~luheng/files/acl2017_hllz.pdf

Notes: A* algo with constraints for beam search in semantic role labelling

Learning to Avoid Errors in GANs by Manipulating Input Spaces

Authors: Alexander B. Jung

Abstract: Despite recent advances, large scale visual artifacts are still a common occurrence in images generated by GANs. Previous work has focused on improving the generator's capability to accurately imitate the data distribution pdata. In this paper, we instead explore methods that enable GANs to actively avoid errors by manipulating the input space. The core idea is to apply small changes to each noise vector in order to shift them away from areas in the input space that tend to result in errors. We derive three different architectures from that idea. The main one of these consists of a simple residual module that leads to significantly less visual artifacts, while only slightly decreasing diversity. The module is trivial to add to existing GANs and costs almost zero computation and memory.

URL: https://arxiv.org/abs/1707.00768

Notes: new step in GAN training: try to optimize input noise for generator

Deep Gaussian Embedding of Attributed Graphs: Unsupervised Inductive Learning via Ranking

Authors: Aleksandar Bojchevski, Stephan Günnemann

Abstract: Methods that learn representations of graph nodes play a critical role in network analysis since they enable many downstream learning tasks. We propose Graph2Gauss - an approach that can efficiently learn versatile node embeddings on large scale (attributed) graphs that show strong performance on tasks such as link prediction and node classification. Unlike most approaches that represent nodes as (point) vectors in a lower-dimensional continuous space, we embed each node as a Gaussian distribution, allowing us to capture uncertainty about the representation. Furthermore, in contrast to previous approaches we propose a completely unsupervised method that is also able to handle inductive learning scenarios and is applicable to different types of graphs (plain, attributed, directed, undirected). By leveraging both the topological network structure and the associated node attributes, we are able to generalize to unseen nodes without additional training. To learn the embeddings we adopt a personalized ranking formulation w.r.t. the node distances that exploits the natural ordering between the nodes imposed by the network structure. Experiments on real world networks demonstrate the high performance of our approach, outperforming state-of-the-art network embedding methods on several different tasks.

URL: https://arxiv.org/abs/1707.03815

Notes: soft hinge loss for comparing Gaussian embeddings, capturing first- and second-order proximity in the graph

Autoencoder-augmented Neuroevolution for Visual Doom Playing

Authors: Samuel Alvernaz, Julian Togelius

Abstract: Neuroevolution has proven effective at many reinforcement learning tasks, but does not seem to scale well to high-dimensional controller representations, which are needed for tasks where the input is raw pixel data. We propose a novel method where we train an autoencoder to create a comparatively low-dimensional representation of the environment observation, and then use CMA-ES to train neural network controllers acting on this input data. As the behavior of the agent changes the nature of the input data, the autoencoder training progresses throughout evolution. We test this method in the VizDoom environment built on the classic FPS Doom, where it performs well on a health-pack gathering task.

URL: https://arxiv.org/abs/1707.03902

Notes: an autoencoder produces a small representation of the input, and an evolutionary technique (CMA-ES) trains the controller on it

Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

Authors: Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta

Abstract: The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) availability of large-scale labeled data. Since 2012, there have been significant advances in representation capabilities of the models and computational capabilities of GPUs. But the size of the biggest dataset has surprisingly remained constant. What will happen if we increase the dataset size by 10x or 100x? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between 'enormous data' and deep learning. By exploiting the JFT-300M dataset which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data was used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that the performance on vision tasks still increases linearly with orders of magnitude of training data size. Second, we show that representation learning (or pre-training) still holds a lot of promise. One can improve performance on any vision tasks by just training a better base model. Finally, as expected, we present new state-of-the-art results for different vision tasks including image classification, object detection, semantic segmentation and human pose estimation. Our sincere hope is that this inspires vision community to not undervalue the data and develop collective efforts in building larger datasets.

URL: https://arxiv.org/abs/1707.02968

Notes: enormous datasets and computational power are the future, at least for CV

The Reversible Residual Network: Backpropagation Without Storing Activations

Authors: Aidan N. Gomez, Mengye Ren, Raquel Urtasun, Roger B. Grosse

Abstract: Deep residual networks (ResNets) have significantly pushed forward the state-of-the-art on image classification, increasing in performance as networks grow both deeper and wider. However, memory consumption becomes a bottleneck, as one needs to store the activations in order to calculate gradients using backpropagation. We present the Reversible Residual Network (RevNet), a variant of ResNets where each layer's activations can be reconstructed exactly from the next layer's. Therefore, the activations for most layers need not be stored in memory during backpropagation. We demonstrate the effectiveness of RevNets on CIFAR-10, CIFAR-100, and ImageNet, establishing nearly identical classification accuracy to equally-sized ResNets, even though the activation storage requirements are independent of depth.

URL: https://arxiv.org/abs/1707.04585

Notes: saves memory by not storing the activations; the authors say it could be extended to RNNs; first paper from Uber for me
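
The reversibility trick is compact enough to sketch: the input is split into two halves, the forward step can be inverted exactly, and activations are recomputed during the backward pass instead of being stored (F and G below are arbitrary stand-in residual functions, not the paper's exact blocks):

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    """Forward pass of a reversible residual block."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    """Reconstruct the inputs exactly from the outputs -- no need to keep x1, x2 in memory."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

F = lambda z: np.tanh(z)      # stand-in residual functions
G = lambda z: 0.5 * z ** 2
x1, x2 = np.random.randn(4), np.random.randn(4)
assert np.allclose((x1, x2), rev_inverse(*rev_forward(x1, x2, F, G), F, G))
```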

Variance Regularizing Adversarial Learning

Authors: Karan Grewal, R Devon Hjelm, Yoshua Bengio

Abstract: We introduce a novel approach for training adversarial models by replacing the discriminator score with a bi-modal Gaussian distribution over the real/fake indicator variables. In order to do this, we train the Gaussian classifier to match the target bi-modal distribution implicitly through meta-adversarial training. We hypothesize that this approach ensures a non-zero gradient to the generator, even in the limit of a perfect classifier. We test our method against standard benchmark image datasets as well as show the classifier output distribution is smooth and has overlap between the real and fake modes.

URL: https://arxiv.org/abs/1707.00309

Notes: training two Gaussians for the real and fake labels enables the generator to produce a distribution more similar to the true one

Be Careful What You Backpropagate: A Case For Linear Output Activations & Gradient Boosting

Authors: Anders Oland, Aayush Bansal, Roger B. Dannenberg, Bhiksha Raj

Abstract: In this work, we show that saturating output activation functions, such as the softmax, impede learning on a number of standard classification tasks. Moreover, we present results showing that the utility of softmax does not stem from the normalization, as some have speculated. In fact, the normalization makes things worse. Rather, the advantage is in the exponentiation of error gradients. This exponential gradient boosting is shown to speed up convergence and improve generalization. To this end, we demonstrate faster convergence and better performance on diverse classification tasks: image classification using CIFAR-10 and ImageNet, and semantic segmentation using PASCAL VOC 2012. In the latter case, using the state-of-the-art neural network architecture, the model converged 33% faster with our method (roughly two days of training less) than with the standard softmax activation, and with a slightly better performance to boot.

URL: https://arxiv.org/abs/1707.04199

Notes: throw the softmax out and use linear output activations with special (exp or pow3) gradient boosting; that's enough

Effective Inference for Generative Neural Parsing

Authors: Mitchell Stern, Daniel Fried, Dan Klein

Abstract: Generative neural models have recently achieved state-of-the-art results for constituency parsing. However, without a feasible search procedure, their use has so far been limited to reranking the output of external parsers in which decoding is more tractable. We describe an alternative to the conventional action-level beam search used for discriminative neural models that enables us to decode directly in these generative models. We then show that by improving our basic candidate selection strategy and using a coarse pruning function, we can improve accuracy while exploring significantly less of the search space. Applied to the model of Choe and Charniak (2016), our inference procedure obtains 92.56 F1 on section 23 of the Penn Treebank, surpassing prior state-of-the-art results for single-model systems.

URL: https://arxiv.org/abs/1707.08976

Notes: effective pruning and a fast-track for some branches give a SotA result in parsing

Adversarial Examples for Evaluating Reading Comprehension Systems

Authors: Robin Jia, Percy Liang

Abstract: Standard accuracy metrics indicate that reading comprehension systems are making rapid progress, but the extent to which these systems truly understand language remains unclear. To reward systems with real language understanding abilities, we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans. In this adversarial setting, the accuracy of sixteen published models drops from an average of $75%$ F1 score to $36%$; when the adversary is allowed to add ungrammatical sequences of words, average accuracy on four models decreases further to $7%$. We hope our insights will motivate the development of new models that understand language more precisely.

URL: https://arxiv.org/abs/1707.07328

Notes: simple adversarial techniques are enough to fool reading comprehension systems; also handy for data augmentation

Unsupervised, Knowledge-Free, and Interpretable Word Sense Disambiguation

Authors: Alexander Panchenko, Fide Marten, Eugen Ruppert, Stefano Faralli, Dmitry Ustalov, Simone Paolo Ponzetto, Chris Biemann

Abstract: Interpretability of a predictive model is a powerful feature that gains the trust of users in the correctness of the predictions. In word sense disambiguation (WSD), knowledge-based systems tend to be much more interpretable than knowledge-free counterparts as they rely on the wealth of manually-encoded elements representing word senses, such as hypernyms, usage examples, and images. We present a WSD system that bridges the gap between these two so far disconnected groups of methods. Namely, our system, providing access to several state-of-the-art WSD models, aims to be interpretable as a knowledge-based system while it remains completely unsupervised and knowledge-free. The presented tool features a Web interface for all-word disambiguation of texts that makes the sense predictions human readable by providing interpretable word sense inventories, sense representations, and disambiguation results. We provide a public API, enabling seamless integration.

URL: https://arxiv.org/abs/1707.06878

Notes: open-source system for word sense disambiguation based on processing large corpora (hey Dima!)

2017-08

Referenceless Quality Estimation for Natural Language Generation

Authors: Ondřej Dušek, Jekaterina Novikova, Verena Rieser

Abstract: Traditional automatic evaluation measures for natural language generation (NLG) use costly human-authored references to estimate the quality of a system output. In this paper, we propose a referenceless quality estimation (QE) approach based on recurrent neural networks, which predicts a quality score for a NLG system output by comparing it to the source meaning representation only. Our method outperforms traditional metrics and a constant baseline in most respects; we also show that synthetic data helps to increase correlation results by 21% compared to the base system. Our results are comparable to results obtained in similar QE tasks despite the more challenging setting.

URL: https://arxiv.org/abs/1708.01759

Notes: a neural net predicting Likert-scale human scores; shows higher correlation with human judgments than classic metrics like BLEU, etc.

Toward Controlled Generation of Text

Authors: Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, Eric P. Xing

Abstract: Generic generation and manipulation of text is challenging and has limited success compared to recent deep generative modeling in visual domain. This paper aims at generating plausible text sentences, whose attributes are controlled by learning disentangled latent representations with designated semantics. We propose a new neural generative model which combines variational auto-encoders (VAEs) and holistic attribute discriminators for effective imposition of semantic structures. The model can alternatively be seen as enhancing VAEs with the wake-sleep algorithm for leveraging fake samples as extra training data. With differentiable approximation to discrete text samples, explicit constraints on independent attribute controls, and efficient collaborative learning of generator and discriminators, our model learns interpretable representations from even only word annotations, and produces short sentences with desired attributes of sentiment and tenses. Quantitative experiments using trained classifiers as evaluators validate the accuracy of sentence and attribute generation.

URL: http://proceedings.mlr.press/v70/hu17e/hu17e.pdf

Notes: the best ICML'17 paper in my opinion; a wake-sleep-style VAE of Enc, Dec & Gen; the temperature is decreased to smooth training

MISC: A data set of information-seeking conversations

Authors: Paul Thomas, Daniel McDuff, Mary Czerwinski, Nick Craswell

Abstract: Conversational interfaces to information retrieval systems, via software agents such as Siri or Cortana, are of commercial and research interest. To build or evaluate these software interfaces it is natural to consider how people act in the same role, but there is little public, fine-grained, data on interactions with intermediaries for web tasks. We introduce the Microsoft Information-Seeking Conversation data (MISC), a set of recordings of information-seeking conversations between human “seekers” and “intermediaries”. MISC includes audio and video signals; transcripts of conversation; affectual and physiological signals; recordings of search and other computer use; and post-task surveys on emotion, success, and effort. We hope that these recordings will support conversational retrieval interfaces both in engineering (how can we make “natural” systems?) and evaluation (what does a “good” conversation look like?).

URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2017/07/Thomas-etal-CAIR17.pdf

Notes: the new Microsoft Information-Seeking Conversations dataset; follows the WOz paradigm

Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation

Authors: Yuhuai Wu, Elman Mansimov, Shun Liao, Roger Grosse, Jimmy Ba

Abstract: In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature. We extend the framework of natural policy gradient and propose to optimize both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region; hence we call our method Actor Critic using Kronecker-Factored Trust Region (ACKTR). To the best of our knowledge, this is the first scalable trust region natural gradient method for actor-critic methods. It is also a method that learns non-trivial tasks in continuous control as well as discrete control policies directly from raw pixel inputs. We tested our approach across discrete domains in Atari games as well as continuous domains in the MuJoCo environment. With the proposed methods, we are able to achieve higher rewards and a 2- to 3-fold improvement in sample efficiency on average, compared to previous state-of-the-art on-policy actor-critic methods. Code is available at URL

URL: https://arxiv.org/abs/1708.05144

Notes: ACKTR is a new method, sample-efficient and distributed, based on Kronecker-factored approximation of natural gradient

Measuring Catastrophic Forgetting in Neural Networks

Authors: Ronald Kemker, Angelina Abitino, Marc McClure, Christopher Kanan

Abstract: Deep multi-layer perceptron neural networks are used in many state-of-the-art systems for machine perception (e.g., speech-to-text, image classification, and object detection). Once a network is trained to do a specific task, e.g., fine-grained bird classification, it cannot easily be trained to do new tasks, e.g., incrementally learning to recognize additional bird species or learning an entirely different task such as fine-grained flower recognition. When new tasks are added, deep neural networks are prone to catastrophically forgetting previously learned information. Catastrophic forgetting has hindered the use of neural networks in deployed applications that require lifelong learning. There have been multiple attempts to develop schemes that mitigate catastrophic forgetting, but these methods have yet to be compared and the kinds of tests used to evaluate individual methods vary greatly. In this paper, we compare multiple mechanisms designed to mitigate catastrophic forgetting in neural networks. Experiments showed that the mechanism(s) that are critical for optimal performance vary based on the incremental training paradigm and type of data being used.

URL: https://arxiv.org/abs/1708.02072

Notes: a framework for measuring forgetting in NNs: permuting data, learning additional classes one by one, adding modalities

Regularizing and Optimizing LSTM Language Models

Authors: Stephen Merity, Nitish Shirish Keskar, Richard Socher

Abstract: Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization. Further, we introduce NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user. Using these and other regularization strategies, we achieve state-of-the-art word level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and 52.0 on WikiText-2.

URL: https://arxiv.org/abs/1708.02182

Notes: at last! effective dropout for LSTMs from Socher's group; plus L2 regularization for the hidden-state activations
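
A sketch of the two activation-regularization terms added to the language-modeling loss, as I understand them (AR penalizes large hidden activations, TAR penalizes large changes between consecutive hidden states); the coefficients here are placeholders:

```python
import numpy as np

def ar_tar_penalty(hidden, alpha=2.0, beta=1.0):
    """hidden: (seq_len, batch, hidden_size) outputs of the final LSTM layer.
    AR = alpha * mean squared activation, TAR = beta * mean squared temporal difference."""
    ar = alpha * np.mean(hidden ** 2)
    tar = beta * np.mean((hidden[1:] - hidden[:-1]) ** 2)
    return ar + tar

h = np.random.randn(35, 20, 400)   # toy hidden states
lm_loss = 0.0                      # stand-in for the cross-entropy LM loss
total_loss = lm_loss + ar_tar_penalty(h)
```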

Twin Networks: Using the Future as a Regularizer

Authors: Dmitriy Serdyuk, Rosemary Nan Ke, Alessandro Sordoni, Chris Pal, Yoshua Bengio

Abstract: Being able to model long-term dependencies in sequential data, such as text, has been among the long-standing challenges of recurrent neural networks (RNNs). This issue is strictly related to the absence of explicit planning in current RNN architectures. More explicitly, the RNNs are trained to predict only the next token given previous ones. In this paper, we introduce a simple way of encouraging the RNNs to plan for the future. In order to accomplish this, we introduce an additional neural network which is trained to generate the sequence in reverse order, and we require closeness between the states of the forward RNN and backward RNN that predict the same token. At each step, the states of the forward RNN are required to match the future information contained in the backward states. We hypothesize that the approach eases modeling of long-term dependencies thus helping in generating more globally consistent samples. The model trained with conditional generation for a speech recognition task achieved 12% relative improvement (CER of 6.7 compared to a baseline of 7.6).

URL: https://arxiv.org/abs/1708.06742

Notes: nice idea: the forward-pass states are tied to the backward, i.e. future, states (in terms of a bidirectional RNN)

A Neural Network Approach for Mixing Language Models

Authors: Youssef Oualil, Dietrich Klakow

Abstract: The performance of Neural Network (NN)-based language models is steadily improving due to the emergence of new architectures, which are able to learn different natural language characteristics. This paper presents a novel framework, which shows that a significant improvement can be achieved by combining different existing heterogeneous models in a single architecture. This is done through 1) a feature layer, which separately learns different NN-based models and 2) a mixture layer, which merges the resulting model features. In doing so, this architecture benefits from the learning capabilities of each model with no noticeable increase in the number of model parameters or the training time. Extensive experiments conducted on the Penn Treebank (PTB) and the Large Text Compression Benchmark (LTCB) corpus showed a significant reduction of the perplexity when compared to state-of-the-art feedforward as well as recurrent neural network architectures.

URL: https://arxiv.org/abs/1708.06989

Notes: neural ensembling; pretty obvious, but I don't remember a previous paper on this

Learned in Translation: Contextualized Word Vectors

Authors: Bryan McCann, James Bradbury, Caiming Xiong, Richard Socher

Abstract: Computer vision has benefited from initializing multiple deep layers with weights pretrained on large supervised training sets like ImageNet. Natural language processing (NLP) typically sees initialization of only the lowest layer of deep models with pretrained word vectors. In this paper, we use a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation (MT) to contextualize word vectors. We show that adding these context vectors (CoVe) improves performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (TREC), entailment (SNLI), and question answering (SQuAD). For fine-grained sentiment analysis and entailment, CoVe improves performance of our baseline models to the state of the art.

URL: https://arxiv.org/abs/1708.00107

Notes: another brilliant idea: just concatenate the context vectors from a pretrained MT biLSTM with the original word embeddings and everything gets better!

Natural Language Processing with Small Feed-Forward Networks

Authors: Jan A. Botha, Emily Pitler, Ji Ma, Anton Bakalov, Alex Salcianu, David Weiss, Ryan McDonald, Slav Petrov

Abstract: We show that small and shallow feed-forward neural networks can achieve near state-of-the-art results on a range of unstructured and structured language processing tasks while being considerably cheaper in memory and computational requirements than deep recurrent models. Motivated by resource-constrained environments like mobile phones, we showcase simple techniques for obtaining such small neural network models, and investigate different tradeoffs when deciding how to allocate a small memory budget.

URL: https://arxiv.org/abs/1708.00214

Notes: small-footprint NNs from Google showing near-SotA results on NLP tasks

$k$-Nearest Neighbor Augmented Neural Networks for Text Classification

Authors: Zhiguo Wang, Wael Hamza, Linfeng Song

Abstract: In recent years, many deep-learning based models are proposed for text classification. This kind of models well fits the training set from the statistical point of view. However, it lacks the capacity of utilizing instance-level information from individual instances in the training set. In this work, we propose to enhance neural network models by allowing them to leverage information from $k$-nearest neighbor (kNN) of the input text. Our model employs a neural network that encodes texts into text embeddings. Moreover, we also utilize $k$-nearest neighbor of the input text as an external memory, and utilize it to capture instance-level information from the training set. The final prediction is made based on features from both the neural network encoder and the kNN memory. Experimental results on several standard benchmark datasets show that our model outperforms the baseline model on all the datasets, and it even beats a very deep neural network model (with 29 layers) in several datasets. Our model also shows superior performance when training instances are scarce, and when the training set is severely unbalanced. Our model also leverages techniques such as semi-supervised training and transfer learning quite well.

URL: https://arxiv.org/abs/1708.07863

Notes: IBM's paper on text classification; they use the kNN of the input as an external memory for an additional attentive embedding of the input

A Continuous Relaxation of Beam Search for End-to-end Training of Neural Sequence Models

Authors: Kartik Goyal, Graham Neubig, Chris Dyer, Taylor Berg-Kirkpatrick

Abstract: Beam search is a desirable choice of test-time decoding algorithm for neural sequence models because it potentially avoids search errors made by simpler greedy methods. However, typical cross entropy training procedures for these models do not directly consider the behaviour of the final decoding method. As a result, for cross-entropy trained models, beam decoding can sometimes yield reduced test performance when compared with greedy decoding. In order to train models that can more effectively make use of beam search, we propose a new training procedure that focuses on the final loss metric (e.g. Hamming loss) evaluated on the output of beam search. While well-defined, this "direct loss" objective is itself discontinuous and thus difficult to optimize. Hence, in our approach, we form a sub-differentiable surrogate objective by introducing a novel continuous approximation of the beam search decoding procedure. In experiments, we show that optimizing this new training objective yields substantially better results on two sequence tasks (Named Entity Recognition and CCG Supertagging) when compared with both cross entropy trained greedy decoding and cross entropy trained beam decoding baselines.

URL: https://arxiv.org/abs/1708.00111

Notes: soft beam search based on a peaked softmax (which acts like a k-max-pool)
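
The relaxation boils down to a "peaked" softmax that smoothly approximates the hard argmax used when extending beams; a tiny illustration (the temperature-style parameter name is mine):

```python
import numpy as np

def soft_argmax(scores, values, alpha=10.0):
    """Differentiable surrogate for picking the value with the highest score.
    As alpha -> infinity this approaches the hard argmax."""
    w = np.exp(alpha * (scores - scores.max()))   # peaked softmax, numerically stable
    w /= w.sum()
    return w @ values

scores = np.array([0.1, 0.7, 0.2])
values = np.array([1.0, 5.0, 3.0])
print(soft_argmax(scores, values, alpha=1.0))    # soft mixture of the candidates
print(soft_argmax(scores, values, alpha=50.0))   # ~5.0, close to the hard choice
```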

2017-09

A Deep Reinforcement Learning Chatbot

Authors: Iulian V. Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, Sai Mudumba, Alexandre de Brebisson, Jose M. R. Sotelo, Dendi Suhubdy, Vincent Michalski, Alexandre Nguyen, Joelle Pineau, Yoshua Bengio

Abstract: We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including template-based models, bag-of-words models, sequence-to-sequence neural network and latent variable neural network models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than competing systems. Due to its machine learning architecture, the system is likely to improve with additional data.

URL: https://arxiv.org/abs/1709.02349

Notes: the architecture of UMontreal's Alexa Prize bot; the interesting part is the parametrization of Q-learning for choosing the base bot

Training RNNs as Fast as CNNs

Authors: Tao Lei, Yu Zhang

Abstract: Recurrent neural networks scale poorly due to the intrinsic difficulty in parallelizing their state computations. For instance, the forward pass computation of $h_t$ is blocked until the entire computation of $h_{t-1}$ finishes, which is a major bottleneck for parallel computing. In this work, we propose an alternative RNN implementation by deliberately simplifying the state computation and exposing more parallelism. The proposed recurrent unit operates as fast as a convolutional layer and 5-10x faster than cuDNN-optimized LSTM. We demonstrate the unit's effectiveness across a wide range of applications including classification, question answering, language modeling, translation and speech recognition. We open source our implementation in PyTorch and CNTK.

URL: https://arxiv.org/abs/1709.02755

Notes: a new RNN type with the hidden-to-hidden dependency dropped; shows SotA on classification and ASR, and is on par with CNNs in computation speed
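
A minimal sketch of the recurrent step as I understand it: the matrix multiplications depend only on x_t (so they can be precomputed for the whole sequence in parallel), and only cheap element-wise operations remain sequential:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_step(x_t, c_prev, W, Wf, bf, Wr, br):
    """One element-wise recurrent step; W @ x_t, Wf @ x_t, Wr @ x_t contain no h_{t-1} term
    and can be batched over all time steps before the recurrence."""
    x_tilde = W @ x_t
    f = sigmoid(Wf @ x_t + bf)              # forget gate
    r = sigmoid(Wr @ x_t + br)              # reset / highway gate
    c = f * c_prev + (1.0 - f) * x_tilde
    h = r * np.tanh(c) + (1.0 - r) * x_t    # highway connection to the input
    return h, c

d = 8
W, Wf, Wr = (np.random.randn(d, d) * 0.1 for _ in range(3))
bf = br = np.zeros(d)
h, c = sru_step(np.random.randn(d), np.zeros(d), W, Wf, bf, Wr, br)
```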

StarSpace: Embed All The Things!

Authors: Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, Jason Weston

Abstract: We present StarSpace, a general-purpose neural embedding model that can solve a wide variety of problems: labeling tasks such as text classification, ranking tasks such as information retrieval/web search, collaborative filtering-based or content-based recommendation, embedding of multi-relational graphs, and learning word, sentence or document level embeddings. In each case the model works by embedding those entities comprised of discrete features and comparing them against each other -- learning similarities dependent on the task. Empirical results on a number of tasks show that StarSpace is highly competitive with existing methods, whilst also being generally applicable to new cases where those methods are not.

URL: https://arxiv.org/abs/1709.03856

Notes: One Embedding to Rule Them All! just a ranking loss over (query, positive, negative) triples for all the different tasks, plus a new open-source framework from FAIR
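
The unifying piece is a single margin-based ranking loss over (query, positive, negative) triples of embedded feature bags; a hedged sketch with cosine similarity (the framework also supports dot product and other sampling schemes):

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def triple_ranking_loss(query, positive, negatives, margin=0.2):
    """Hinge loss pushing sim(query, positive) above sim(query, negative) by a margin.
    The same loss is reused whether the entities are words, documents, labels or graph nodes."""
    return sum(max(0.0, margin - cos(query, positive) + cos(query, neg))
               for neg in negatives)

q, pos = np.random.randn(50), np.random.randn(50)
negs = [np.random.randn(50) for _ in range(5)]
print(triple_ranking_loss(q, pos, negs))
```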

Iterative Policy Learning in End-to-End Trainable Task-Oriented Neural Dialog Models

Authors: Bing Liu, Ian Lane

Abstract: In this paper, we present a deep reinforcement learning (RL) framework for iterative dialog policy optimization in end-to-end task-oriented dialog systems. Popular approaches in learning dialog policy with RL include letting a dialog agent to learn against a user simulator. Building a reliable user simulator, however, is not trivial, often as difficult as building a good dialog agent. We address this challenge by jointly optimizing the dialog agent and the user simulator with deep RL by simulating dialogs between the two agents. We first bootstrap a basic dialog agent and a basic user simulator by learning directly from dialog corpora with supervised training. We then improve them further by letting the two agents to conduct task-oriented dialogs and iteratively optimizing their policies with deep RL. Both the dialog agent and the user simulator are designed with neural network models that can be trained end-to-end. Our experiment results show that the proposed method leads to promising improvements on task success rate and total task reward comparing to supervised training and single-agent RL training baseline models.

URL: https://arxiv.org/abs/1709.06136

Notes: end-to-end learning for user simulator & dialog system at once; some results on DSTC2

Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning

Authors: Victor Zhong, Caiming Xiong, Richard Socher

Abstract: A significant amount of the world's knowledge is stored in relational databases. However, the ability for users to retrieve facts from a database is limited due to a lack of understanding of query languages such as SQL. We propose Seq2SQL, a deep neural network for translating natural language questions to corresponding SQL queries. Our model leverages the structure of SQL queries to significantly reduce the output space of generated queries. Moreover, we use rewards from in-the-loop query execution over the database to learn a policy to generate unordered parts of the query, which we show are less suitable for optimization via cross entropy loss. In addition, we will publish WikiSQL, a dataset of 87726 hand-annotated examples of questions and SQL queries distributed across 26375 tables from Wikipedia. This dataset is required to train our model and is an order of magnitude larger than comparable datasets. By applying policy-based reinforcement learning with a query execution environment to WikiSQL, our model Seq2SQL outperforms attentional sequence to sequence models, improving execution accuracy from 35.9% to 60.3% and logical form accuracy from 23.4% to 49.2%.

URL: https://arxiv.org/abs/1709.00103

Notes: generating SQL from natural language questions (with RL for choosing the operations), from Richard Socher's group

Attention-based Mixture Density Recurrent Networks for History-based Recommendation

Authors: Tian Wang, Kyunghyun Cho

Abstract: The goal of personalized history-based recommendation is to automatically output a distribution over all the items given a sequence of previous purchases of a user. In this work, we present a novel approach that uses a recurrent network for summarizing the history of purchases, continuous vectors representing items for scalability, and a novel attention-based recurrent mixture density network, which outputs each component in a mixture sequentially, for modelling a multi-modal conditional distribution. We evaluate the proposed approach on two publicly available datasets, MovieLens-20M and RecSys15. The experiments show that the proposed approach, which explicitly models the multi-modal nature of the predictive distribution, is able to improve the performance over various baselines in terms of precision, recall and nDCG.

URL: https://arxiv.org/abs/1709.07545

Notes: a RecSys based on the word2vec idea (items as tokens), with a fancy mixture-density prediction of the probability distribution
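
For reference, the mixture-density part reduces to the negative log-likelihood of a Gaussian mixture whose parameters the network predicts; a one-dimensional sketch (the actual model works over item-embedding vectors, and all names/shapes here are mine):

```python
import numpy as np

def mdn_nll(y, pi, mu, sigma):
    """Negative log-likelihood of a scalar target y under a K-component Gaussian mixture.
    pi: (K,) mixture weights summing to 1; mu, sigma: (K,) component means / stds."""
    comp = pi * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(comp.sum() + 1e-12)

pi = np.array([0.5, 0.3, 0.2])
mu = np.array([-1.0, 0.0, 2.0])
sigma = np.array([0.3, 0.5, 1.0])
print(mdn_nll(0.1, pi, mu, sigma))
```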

Generating Sentences by Editing Prototypes

Authors: Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, Percy Liang

Abstract: We propose a new generative model of sentences that first samples a prototype sentence from the training corpus and then edits it into a new sentence. Compared to traditional models that generate from scratch either left-to-right or by first sampling a latent sentence vector, our prototype-then-edit model improves perplexity on language modeling and generates higher quality outputs according to human evaluation. Furthermore, the model gives rise to a latent edit vector that captures interpretable semantics such as sentence similarity and sentence-level analogies.

URL: https://arxiv.org/abs/1709.08878

Notes: fresh paper on sentence rewriting from Stanford: seq2seq (w/ attention) with an additional edit vector as input

2017-10

Word Translation Without Parallel Data

Authors: Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou

Abstract: State-of-the-art methods for learning cross-lingual word embeddings have relied on bilingual dictionaries or parallel corpora. Recent works showed that the need for parallel data supervision can be alleviated with character-level information. While these methods showed encouraging results, they are not on par with their supervised counterparts and are limited to pairs of languages sharing a common alphabet. In this work, we show that we can build a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way. Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs. Our experiments demonstrate that our method works very well also for distant language pairs, like English-Russian or English-Chinese. We finally show that our method is a first step towards fully unsupervised machine translation and describe experiments on the English-Esperanto language pair, on which there only exists a limited amount of parallel data.

URL: https://arxiv.org/abs/1710.04087

Notes: paper on unsupervised mapping of word embeddings between different languages; trained with an adversarial procedure; evaluated on retrieval tasks
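
If I read the method correctly, the adversarially learned mapping is then refined with a closed-form orthogonal Procrustes step on a synthetic dictionary; that refinement is easy to sketch (X and Y below are matched source/target embedding matrices, here just random stand-ins):

```python
import numpy as np

def procrustes(X, Y):
    """Best orthogonal map W (minimizing ||X W^T - Y||_F) is U V^T from the SVD of Y^T X."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

n, d = 1000, 300                               # dictionary size, embedding dimension
X = np.random.randn(n, d)                      # source-language vectors
W_true, _ = np.linalg.qr(np.random.randn(d, d))
Y = X @ W_true.T                               # target vectors = rotated source vectors
W = procrustes(X, Y)
print(np.allclose(X @ W.T, Y, atol=1e-6))      # recovers the rotation
```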

Emergent Translation in Multi-Agent Communication

Authors: Jason Lee, Kyunghyun Cho, Jason Weston, Douwe Kiela

Abstract: While most machine translation systems to date are trained on large parallel corpora, humans learn language in a different way: by being grounded in an environment and interacting with other humans. In this work, we propose a communication game where two agents, native speakers of their own respective languages, jointly learn to solve a visual referential task. We find that the ability to understand and translate a foreign language emerges as a means to achieve shared goals. The emergent translation is interactive and multimodal, and crucially does not require parallel corpora, but only monolingual, independent text and corresponding images. Our proposed translation model achieves this by grounding the source and target languages into a shared visual modality, and outperforms several baselines on both word-level and sentence-level translation tasks. Furthermore, we show that agents in a multilingual community learn to translate better and faster than in a bilingual communication setting.

URL: https://arxiv.org/abs/1710.06922

Notes: FAIR paper on learning translation from a dialog between two agents that speak different languages and describe images

Convolutional Neural Knowledge Graph Learning

Authors: Feipeng Zhao, Martin Renqiang Min, Chen Shen, Amit Chakraborty

Abstract: Previous models for learning entity and relationship embeddings of knowledge graphs such as TransE, TransH, and TransR aim to explore new links based on learned representations. However, these models interpret relationships as simple translations on entity embeddings. In this paper, we try to learn more complex connections between entities and relationships. In particular, we use a Convolutional Neural Network (CNN) to learn entity and relationship representations in knowledge graphs. In our model, we treat entities and relationships as one-dimensional numerical sequences with the same length. After that, we combine each triplet of head, relationship, and tail together as a matrix with height 3. CNN is applied to the triplets to get confidence scores. Positive and manually corrupted negative triplets are used to train the embeddings and the CNN model simultaneously. Experimental results on public benchmark datasets show that the proposed model outperforms state-of-the-art models on exploring unseen relationships, which proves that CNN is effective to learn complex interactive patterns between entities and relationships.

URL: https://arxiv.org/abs/1710.08502

Notes: conv over embeddings of entities & relationships, could be useful for long-awaited automatic ontology creation

ALL-IN-1: Short Text Classification with One Model for All Languages

Authors: Barbara Plank

Abstract: We present ALL-IN-1, a simple model for multilingual text classification that does not require any parallel data. It is based on a traditional Support Vector Machine classifier exploiting multilingual word embeddings and character n-grams. Our model is simple, easily extendable yet very effective, overall ranking 1st (out of 12 teams) in the IJCNLP 2017 shared task on customer feedback analysis in four languages: English, French, Japanese and Spanish.

URL: https://arxiv.org/abs/1710.09589

Notes: short-text classification using multilingual embeddings; interesting that the author dismisses a DL approach due to lack of data

2017-11

Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks

Authors: Brenden M. Lake, Marco Baroni

Abstract: Humans can understand and produce new utterances effortlessly, thanks to their systematic compositional skills. Once a person learns the meaning of a new verb "dax," he or she can immediately understand the meaning of "dax twice" or "sing and dax." In this paper, we introduce the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences. We then test the zero-shot generalization capabilities of a variety of recurrent neural networks (RNNs) trained on SCAN with sequence-to-sequence methods. We find that RNNs can generalize well when the differences between training and test commands are small, so that they can apply "mix-and-match" strategies to solve the task. However, when generalization requires systematic compositional skills (as in the "dax" example above), RNNs fail spectacularly. We conclude with a proof-of-concept experiment in neural machine translation, supporting the conjecture that lack of systematicity is an important factor explaining why neural networks need very large training sets.

URL: https://arxiv.org/abs/1711.00350

Notes: FAIR shows the lack of compositional ability in RNNs; language is entirely compositional, so we need new tools

Learning with Latent Language

Authors: Jacob Andreas, Dan Klein, Sergey Levine

Abstract: The named concepts and compositional operators present in natural language provide a rich source of information about the kinds of abstractions humans use to navigate the world. Can this linguistic background knowledge improve the generality and efficiency of learned classifiers and control policies? This paper aims to show that using the space of natural language strings as a parameter space is an effective way to capture natural task structure. In a pretraining phase, we learn a language interpretation model that transforms inputs (e.g. images) into outputs (e.g. labels) given natural language descriptions. To learn a new concept (e.g. a classifier), we search directly in the space of descriptions to minimize the interpreter's loss on training examples. Crucially, our models do not require language data to learn these concepts: language is used only in pretraining to impose structure on subsequent learning. Results on image classification, text editing, and reinforcement learning show that, in all settings, models with a linguistic parameterization outperform those without.

URL: https://arxiv.org/abs/1711.00482

Notes: Andreas, Klein & Levine show that once we learn a language interpretation model it can be used as a prior for image classification, policy search, etc.

Don't Decay the Learning Rate, Increase the Batch Size

Authors: Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le

Abstract: It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate $\epsilon$ and scaling the batch size $B \propto \epsilon$. Finally, one can increase the momentum coefficient $m$ and scale $B \propto 1/(1-m)$, although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. We train Inception-ResNet-V2 on ImageNet to $77%$ validation accuracy in under 2500 parameter updates, efficiently utilizing training batches of 65536 images.

URL: https://arxiv.org/abs/1711.00489

Notes: you can increase the batch size instead of decaying the learning rate; the open question is where to get the hardware
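
A sketch of the schedule substitution (illustrative numbers, not the paper's exact configuration): wherever a step decay would divide the learning rate by a factor k, multiply the batch size by k instead, which keeps the SGD noise scale (roughly lr * N / B) constant.

```python
# Two equivalent-in-spirit schedules: decay the learning rate, or grow the batch.

def decayed_lr_schedule(epoch, lr0=0.1, k=5, milestones=(30, 60, 80)):
    drops = sum(epoch >= m for m in milestones)
    return lr0 / (k ** drops), 256               # (learning rate, batch size)

def increased_batch_schedule(epoch, lr0=0.1, k=5, milestones=(30, 60, 80),
                             b0=256, b_max=65536):
    drops = sum(epoch >= m for m in milestones)
    return lr0, min(b0 * (k ** drops), b_max)    # learning rate stays fixed

for epoch in (0, 30, 60, 80):
    print(epoch, decayed_lr_schedule(epoch), increased_batch_schedule(epoch))
```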

Unsupervised Machine Translation Using Monolingual Corpora Only

Authors: Guillaume Lample, Ludovic Denoyer, Marc'Aurelio Ranzato

Abstract: Machine translation has recently achieved impressive performance thanks to recent advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet requiring tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate even without any parallel data. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct in both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores up to 32.8, without using even a single parallel sentence at training time.

URL: https://arxiv.org/abs/1711.00043

Notes: machine translation without any parallel corpora; uses an adversarial (GAN-style) approach to make the encoders project both languages into the same latent space
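
The paper combines denoising auto-encoding in each language, an adversarial term aligning the two encoders' latent spaces, and a cross-domain (back-translation-style) reconstruction loss. A toy sketch of only that last ingredient, with a hypothetical stand-in for the current translation model:

```python
# Back-translation-style pseudo-pair creation: translate monolingual sentences
# with the current model, then treat (translation, original) as training pairs
# for the reverse direction.

def make_pseudo_pairs(mono_src, src2tgt):
    """Turn monolingual source sentences into pseudo-parallel pairs for tgt->src."""
    return [(src2tgt(s), s) for s in mono_src]

# Illustrative stand-in for a (bad, early-iteration) fr->en model.
fake_fr_to_en = lambda s: " ".join(reversed(s.split()))
print(make_pseudo_pairs(["le chat noir"], fake_fr_to_en))
```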

Non-Autoregressive Neural Machine Translation

Authors: Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, Richard Socher

Abstract: Existing approaches to neural machine translation condition each output word on previously generated outputs. We introduce a model that avoids this autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference. Through knowledge distillation, the use of input token fertilities as a latent variable, and policy gradient fine-tuning, we achieve this at a cost of as little as 2.0 BLEU points relative to the autoregressive Transformer network used as a teacher. We demonstrate substantial cumulative improvements associated with each of the three aspects of our training strategy, and validate our approach on IWSLT 2016 English-German and two WMT language pairs. By sampling fertilities in parallel at inference time, our non-autoregressive model achieves near-state-of-the-art performance of 29.8 BLEU on WMT 2016 English-Romanian.

URL: https://arxiv.org/abs/1711.02281

Notes: parallel generation for NMT, the next step for the Transformer architecture; to make decoding parallel they use a discrete latent variable, the so-called fertility of each source word - the encoder output for a word is copied as many times as its predicted fertility; an ablation study is provided
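
A toy illustration (with made-up tensors, not the paper's model) of how fertilities turn encoder states into a decoder input whose length is fixed up front, so all target positions can be generated in parallel:

```python
import numpy as np

def expand_by_fertility(encoder_states, fertilities):
    """encoder_states: (src_len, d); fertilities: non-negative ints per token."""
    copies = [np.repeat(encoder_states[i:i + 1], f, axis=0)
              for i, f in enumerate(fertilities)]
    return np.concatenate(copies, axis=0)        # shape: (sum(fertilities), d)

src = np.arange(6, dtype=float).reshape(3, 2)    # 3 source tokens, d=2
print(expand_by_fertility(src, [1, 0, 2]))       # token 0 once, token 1 dropped, token 2 twice
```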

DLPaper2Code: Auto-generation of Code from Deep Learning Research Papers

Authors: Akshay Sethi, Anush Sankaran, Naveen Panwar, Shreya Khare, Senthil Mani

Abstract: With an abundance of research papers in deep learning, reproducibility or adoption of the existing works becomes a challenge. This is due to the lack of open source implementations provided by the authors. Further, re-implementing research papers in a different library is a daunting task. To address these challenges, we propose a novel extensible approach, DLPaper2Code, to extract and understand deep learning design flow diagrams and tables available in a research paper and convert them to an abstract computational graph. The extracted computational graph is then converted into execution ready source code in both Keras and Caffe, in real-time. An arXiv-like website is created where the automatically generated designs are made publicly available for 5,000 research papers. The generated designs could be rated and edited using an intuitive drag-and-drop UI framework in a crowdsourced manner. To evaluate our approach, we create a simulated dataset with over 216,000 valid design visualizations using a manually defined grammar. Experiments on the simulated dataset show that the proposed framework provides more than 93% accuracy in flow diagram content extraction.

URL: https://arxiv.org/abs/1711.03543

Notes: it looks like magic at first sight: the authors generate Keras code from a PDF! In detail, they use pattern matching and OCR to identify layers and the links between them; only a small subset of possible layers and hyperparameters is supported
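
A toy rendition of the final stage of such a pipeline: rendering an abstract layer list (the kind of structure extracted from figures and tables) into Keras-flavoured code. The intermediate representation below is invented for illustration, and the emitted string would still need the usual Keras imports to actually run.

```python
# Render a tiny abstract computational graph as Keras-style code text.

LAYER_TEMPLATES = {
    "conv2d": "Conv2D({filters}, ({kernel}, {kernel}), activation='relu')",
    "dense":  "Dense({units}, activation='{activation}')",
}

def to_keras(layers):
    lines = ["model = Sequential()"]
    for spec in layers:
        layer = LAYER_TEMPLATES[spec["type"]].format(**spec)
        lines.append(f"model.add({layer})")
    return "\n".join(lines)

print(to_keras([
    {"type": "conv2d", "filters": 32, "kernel": 3},
    {"type": "dense", "units": 10, "activation": "softmax"},
]))
```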

Supervised and Unsupervised Transfer Learning for Question Answering

Authors: Yu-An Chung, Hung-Yi Lee, James Glass

Abstract: Although transfer learning has been shown to be successful for tasks like object and speech recognition, its applicability to question answering (QA) has yet to be well-studied. In this paper, we conduct extensive experiments to investigate the transferability of knowledge learned from a source QA dataset to a target dataset using two QA models. The performance of both models on a TOEFL listening comprehension test (Tseng et al., 2016) and MCTest (Richardson et al., 2013) is significantly improved via a simple transfer learning technique from MovieQA (Tapaswi et al., 2016). In particular, one of the models achieves the state-of-the-art on all target datasets; for the TOEFL listening comprehension test, it outperforms the previous best model by 7%. Finally, we show that transfer learning is helpful even in unsupervised scenarios when correct answers for target QA dataset examples are not available.

URL: https://arxiv.org/abs/1711.05345

Notes: transfer learning for question answering; the unsupervised fine-tuning recipe: pretrain supervised on one dataset, predict answers for the unlabelled target dataset, treat those predictions as correct, and train on them in a supervised way; they use CNNs (better) & MemN2N; new SotA on a few QA datasets
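
A minimal sketch of that self-labelling recipe, assuming a generic model object with fit/predict methods (the actual models in the paper are QA-specific networks, not this interface):

```python
# Self-labelling transfer: supervised pretraining on the source QA dataset,
# then fine-tuning on the model's own predictions for the unlabelled target set.

def unsupervised_transfer(model, source_x, source_y, target_x):
    model.fit(source_x, source_y)        # supervised pretraining on the source QA data
    pseudo_y = model.predict(target_x)   # treat the model's answers as "correct"
    model.fit(target_x, pseudo_y)        # fine-tune on the pseudo-labelled target data
    return model
```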

Mastering the Dungeon: Grounded Language Learning by Mechanical Turker Descent

Authors: Zhilin Yang, Saizheng Zhang, Jack Urbanek, Will Feng, Alexander H. Miller, Arthur Szlam, Douwe Kiela, Jason Weston

Abstract: Contrary to most natural language processing research, which makes use of static datasets, humans learn language interactively, grounded in an environment. In this work we propose an interactive learning procedure called Mechanical Turker Descent (MTD) and use it to train agents to execute natural language commands grounded in a fantasy text adventure game. In MTD, Turkers compete to train better agents in the short term, and collaborate by sharing their agents' skills in the long term. This results in a gamified, engaging experience for the Turkers and a better quality teaching signal for the agents compared to static datasets, as the Turkers naturally adapt the training data to the agent's abilities.

URL: https://arxiv.org/abs/1711.07950

Notes: FAIR introduces an environment for solving text quests; they use seq2seq as a baseline; the game uses only predefined actions & a knowledge-graph representation; Turkers generate training data for models competing with each other and share the resulting datasets

Does Higher Order LSTM Have Better Accuracy in Chunking and Named Entity Recognition?

Authors: Yi Zhang, Xu Sun, Yang Yang

Abstract: Current researches usually employ single order setting by default when dealing with sequence labeling tasks. In our work, "order" means the number of tags that a prediction involves at every time step. High order models tend to capture more dependency information among tags. We first propose a simple method that low order models can be easily extended to high order models. To our surprise, the high order models which are supposed to capture more dependency information behave worse when increasing the order. We suppose that forcing neural networks to learn complex structure may lead to overfitting. To deal with the problem, we propose a method which combines low order and high order information together to decode the tag sequence. The proposed method, multi-order decoding (MOD), keeps the scalability to high order models with a pruning technique. MOD achieves higher accuracies than existing methods of single order setting. It results in a 21% error reduction compared to baselines in chunking and an error reduction over 23% in two NER tasks. The code is available at URL

URL: https://arxiv.org/abs/1711.08231

Notes: new SotA on some NER datasets; the idea is simple and brilliant - use single-token tags to also generate tags spanning two or more consecutive tokens (pruning to only the top-scored variants), then apply Viterbi decoding to produce the final sequence; with code!

Go for a Walk and Arrive at the Answer: Reasoning Over Paths in Knowledge Bases using Reinforcement Learning

Authors: Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alex Smola, Andrew McCallum

Abstract: Knowledge bases (KB), both automatically and manually constructed, are often incomplete --- many valid facts can be inferred from the KB by synthesizing existing information. A popular approach to KB completion is to infer new relations by combinatory reasoning over the information found along other paths connecting a pair of entities. Given the enormous size of KBs and the exponential number of paths, previous path-based models have considered only the problem of predicting a missing relation given two entities or evaluating the truth of a proposed triple. Additionally, these methods have traditionally used random paths between fixed entity pairs or more recently learned to pick paths between them. We propose a new algorithm MINERVA, which addresses the much more difficult and practical task of answering questions where the relation is known, but only one entity. Since random walks are impractical in a setting with combinatorially many destinations from a start node, we present a neural reinforcement learning approach which learns how to navigate the graph conditioned on the input query to find predictive paths. Empirically, this approach obtains state-of-the-art results on several datasets, significantly outperforming prior methods.

URL: https://arxiv.org/abs/1711.05851

Notes: question answering over a knowledge graph; the authors use RL to find paths on the graph; interestingly, the graph in this setup is only partially observable
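
A toy sketch of the environment interface this implies: the state is the current entity, the actions are its outgoing edges, and an episode answers (e1, r, ?) if the walk ends at a correct tail entity. A random policy and a made-up graph stand in for the paper's learned, query-conditioned policy.

```python
import random

GRAPH = {  # toy KB: entity -> list of (relation, neighbour)
    "Paris":  [("capital_of", "France"), ("located_in", "Europe")],
    "France": [("located_in", "Europe"), ("capital", "Paris")],
    "Europe": [],
}

def walk(start, max_steps=3, policy=random.choice):
    path, current = [], start
    for _ in range(max_steps):
        edges = GRAPH.get(current, [])
        if not edges:
            break
        relation, current = policy(edges)
        path.append((relation, current))
    return path

random.seed(0)
print(walk("Paris"))   # e.g. [('capital_of', 'France'), ('located_in', 'Europe')]
```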

Zero-Shot Style Transfer in Text Using Recurrent Neural Networks

Authors: Keith Carlson, Allen Riddell, Daniel Rockmore

Abstract: Zero-shot translation is the task of translating between a language pair where no aligned data for the pair is provided during training. In this work we employ a model that creates paraphrases which are written in the style of another existing text. Since we provide the model with no paired examples from the source style to the target style during training, we call this task zero-shot style transfer. Herein, we identify a high-quality source of aligned, stylistically distinct text in Bible versions and use this data to train an encoder/decoder recurrent neural model. We also train a statistical machine translation system, Moses, for comparison. We find that the neural network outperforms Moses on the established BLEU and PINC metrics for evaluating paraphrase quality. This technique can be widely applied due to the broad definition of style which is used. For example, tasks like text simplification can easily be viewed as style transfer. The corpus itself is highly parallel with 33 distinct Bible Versions used, and human-aligned due to the presence of chapter and verse numbers within the text. This makes the data a rich source of study for other natural language tasks.

URL: https://arxiv.org/abs/1711.04731

Notes: a simple Seq2Seq shows really good results on text style transfer with highly parallel data; the authors use aligned Bible versions to present results; with code!

2017-12

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Authors: David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis

Abstract: The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.

URL: https://arxiv.org/abs/1712.01815

Notes: short paper from DeepMind on AlphaZero - a self-play algorithm for chess, shogi, Go, or any similarly formalized game; they use Monte Carlo Tree Search with added exploration noise, plus Bayesian optimization for hyperparameters; training runs on 5000 TPUs; waiting for the full paper
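
For reference, the PUCT-style selection rule used inside the search tree, in the notation published for AlphaGo Zero; the AlphaZero paper is assumed to reuse the same scheme. Q is the mean action value, P the prior from the policy network, N the visit count, and c_puct an exploration constant.

```latex
a_t = \operatorname*{arg\,max}_a \left( Q(s_t, a) + c_{\mathrm{puct}} \, P(s_t, a)\,
      \frac{\sqrt{\sum_b N(s_t, b)}}{1 + N(s_t, a)} \right)
```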

Natural Language Policy Search

Authors: Jacob Andreas, Dan Klein, Sergey Levine

Abstract: The named concepts and compositional operators present in natural language provide a rich source of information about the kinds of abstractions humans use to navigate the world. Can this linguistic background knowledge improve the generality and efficiency of learned control policies? This paper aims to show that using the space of natural language strings as a parameter space is an effective way to capture natural task structure. In a pretraining phase, we learn an instruction-following model that transforms natural language instructions into policies. To learn in a new environment, we search directly in the space of instructions to maximize the instruction-follower’s reward. Crucially, our approach does not require language data for new tasks: language is used only in pretraining to impose structure on subsequent learning. Results on both reinforcement learning and imitation learning show that policies with a linguistic parameterization outperform those without.

URL: https://drive.google.com/file/d/16SS8sfHPX5rgcRFoCjMDy57-heovYK-W/view

Notes: paper from Levine's group; the authors use natural language instructions to parameterize policies, and moreover, in an unknown environment they search for a policy directly in the space of instructions - the same search-in-language-space idea as in "Learning with Latent Language" above, but maximizing reward instead of minimizing a classifier's loss

Deep Learning Scaling is Predictable, Empirically

Authors: Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, Yanqi Zhou

Abstract: Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art. This paper presents a large scale empirical characterization of generalization error and model size growth as training sets grow. We introduce a methodology for this measurement and test four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents---the "steepness" of the learning curve---yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications on deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.

URL: https://arxiv.org/abs/1712.00409

Notes: Baidu's effort to make DL predictable: NMT & LM models show a sublinear growth rate, meaning they are far from saturation and we need much bigger datasets; the growth rate for SGD is significantly lower than that for Adam; the experiments amount to about 50 GPU-years
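
The reported power-law form is convenient to work with: if the generalization error behaves like eps(m) ≈ a·m^b for training-set size m, the exponent b is simply the slope of the learning curve in log-log space. A tiny fit on synthetic numbers (not the paper's measurements):

```python
import numpy as np

m = np.array([1e5, 1e6, 1e7, 1e8])   # training set sizes
eps = 2.0 * m ** -0.35                # synthetic "learning curve", eps = a * m^b

b, log_a = np.polyfit(np.log(m), np.log(eps), deg=1)   # straight line in log-log space
print(f"estimated exponent b = {b:.2f}, a = {np.exp(log_a):.2f}")  # ~ -0.35, ~2.0
```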

Attentive Memory Networks: Efficient Machine Reading for Conversational Search

Authors: Tom Kenter, Maarten de Rijke

Abstract: Recent advances in conversational systems have changed the search paradigm. Traditionally, a user poses a query to a search engine that returns an answer based on its index, possibly leveraging external knowledge bases and conditioning the response on earlier interactions in the search session. In a natural conversation, there is an additional source of information to take into account: utterances produced earlier in a conversation can also be referred to and a conversational IR system has to keep track of information conveyed by the user during the conversation, even if it is implicit. We argue that the process of building a representation of the conversation can be framed as a machine reading task, where an automated system is presented with a number of statements about which it should answer questions. The questions should be answered solely by referring to the statements provided, without consulting external knowledge. The time is right for the information retrieval community to embrace this task, both as a stand-alone task and integrated in a broader conversational search setting. In this paper, we focus on machine reading as a stand-alone task and present the Attentive Memory Network (AMN), an end-to-end trainable machine reading algorithm. Its key contribution is in efficiency, achieved by having an hierarchical input encoder, iterating over the input only once. Speed is an important requirement in the setting of conversational search, as gaps between conversational turns have a detrimental effect on naturalness. On 20 datasets commonly used for evaluating machine reading algorithms we show that the AMN achieves performance comparable to the state-of-the-art models, while using considerably fewer computations.

URL: https://arxiv.org/abs/1712.07229

Notes: paper from de Rijke's group; they introduce Attentive Memory Networks: standard attention over encoder states; requires fewer computations to achieve good (yet not SotA) performance on bAbI; it would be interesting to see it on actual dialog data

Advances in Pre-Training Distributed Word Representations

Authors: Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, Armand Joulin

Abstract: Many Natural Language Processing applications nowadays rely on pre-trained word representations estimated from large text corpora such as news collections, Wikipedia and Web Crawl. In this paper, we show how to train high-quality word vector representations by using a combination of known tricks that are however rarely used together. The main result of our work is the new set of publicly available pre-trained models that outperform the current state of the art by a large margin on a number of tasks.

URL: https://arxiv.org/abs/1712.09405

Notes: short paper from Mikolov's group presenting important details of fastText word-vector training; interesting comparison of pre-trained vectors on sentence-level datasets; also an interesting use of position-dependent weighting of context words
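
A rough sketch of the position-dependent weighting idea: in CBOW, instead of averaging context vectors uniformly, each relative position gets its own reweighting vector applied elementwise before averaging. The weights below are random placeholders, not learned parameters as in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, window = 4, 2
context_vecs = rng.normal(size=(2 * window, d))      # vectors of the 4 context words
position_weights = rng.normal(size=(2 * window, d))  # one weight vector per relative position

plain_cbow = context_vecs.mean(axis=0)
positional_cbow = (position_weights * context_vecs).mean(axis=0)
print(plain_cbow, positional_cbow)
```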

A Deep Network Model for Paraphrase Detection in Short Text Messages

Authors: Basant Agarwal, Heri Ramampiaro, Helge Langseth, Massimiliano Ruocco

Abstract: This paper is concerned with paraphrase detection. The ability to detect similar sentences written in natural language is crucial for several applications, such as text mining, text summarization, plagiarism detection, authorship authentication and question answering. Given two sentences, the objective is to detect whether they are semantically identical. An important insight from this work is that existing paraphrase systems perform well when applied on clean texts, but they do not necessarily deliver good performance against noisy texts. Challenges with paraphrase detection on user generated short texts, such as Twitter, include language irregularity and noise. To cope with these challenges, we propose a novel deep neural network-based approach that relies on coarse-grained sentence modeling using a convolutional neural network and a long short-term memory model, combined with a specific fine-grained word-level similarity matching model. Our experimental results show that the proposed approach outperforms existing state-of-the-art approaches on user-generated noisy social media data, such as Twitter texts, and achieves highly competitive performance on a cleaner corpus.

URL: https://arxiv.org/abs/1712.02820

Notes: a paraphrase-detection architecture: a CNN over word embeddings, an LSTM on the CNN feature vectors, a word-to-word similarity matrix with another CNN over it, plus some classic features like TF-IDF and WordNet; new SotA on tweets
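
A rough PyTorch sketch of the listed components (coarse-grained CNN+LSTM sentence encoding plus a fine-grained word-similarity branch); layer sizes, kernel widths, pooling, and the final classifier are guesses for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParaphraseModel(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, conv_dim=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, conv_dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_dim, hidden, batch_first=True)
        self.sim_conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.clf = nn.Linear(2 * hidden + 8, 2)

    def encode(self, ids):
        e = self.emb(ids)                            # (B, T, emb)
        c = F.relu(self.conv(e.transpose(1, 2)))     # (B, conv, T)
        _, (h, _) = self.lstm(c.transpose(1, 2))     # h: (1, B, hidden)
        return e, h.squeeze(0)

    def forward(self, ids1, ids2):
        e1, h1 = self.encode(ids1)
        e2, h2 = self.encode(ids2)
        sim = torch.bmm(F.normalize(e1, dim=-1),     # (B, T1, T2) cosine matrix
                        F.normalize(e2, dim=-1).transpose(1, 2))
        s = F.relu(self.sim_conv(sim.unsqueeze(1)))  # (B, 8, T1, T2)
        s = s.mean(dim=(2, 3))                       # pool to (B, 8)
        return self.clf(torch.cat([h1, h2, s], dim=-1))

model = ParaphraseModel()
logits = model(torch.randint(0, 10000, (2, 12)), torch.randint(0, 10000, (2, 9)))
print(logits.shape)   # torch.Size([2, 2])
```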