\section{Introduction}
\label{sec:intro}
\BB{The paper does not discuss structured prediction much. We could either make it more prominent in the introduction and throughout, or ignore it (and remove it from the title) until the experimental results, where we say something about how we can also use it for structured prediction.} \drew{I think the paper is supposed to be saying IL > RL for structured prediction, and demonstrating the application of IL to deep networks. I think it's basically a paper teaching people how/when to use IL, particularly to train networks. How do we get it there?}
A fundamental challenge in artificial intelligence, robotics, and language processing is to reason, plan, and make a sequence of decisions, that is, to find a policy that maps states to actions so as to minimize accumulated cost or achieve a long-term goal. A promising approach to this problem is Imitation Learning (IL), which automatically learns a policy from expert demonstrations. IL has become an important tool in a range of fields including robotics \cite{ross2013learning}, statistical inference \cite{ross2011_CVPR,sun2016learning}, and language parsing \cite{chang2015learning_dependency}.
Classical imitation learning \cite{abbeel2004apprenticeship,syed2008apprenticeship,ziebart2008maximum,finn2016guided,ho2016generative} learns a policy from a fixed-size dataset pre-collected from experts. The learned policy aims to match the expert's behaviour in terms of either the expected feature counts of visited states (e.g., MaxEnt \cite{ziebart2008maximum}) or the entire distribution of demonstrated trajectories \cite{ho2016generative}. A pernicious problem with these classical methods is that they require the training and test data to be sampled from the same distribution. This assumption is very difficult to enforce in practice, and, as a result, policies learned by these methods can fail spectacularly in real-world scenarios~\cite{ross2010efficient}.
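For concreteness, one standard way to state the feature-count matching objective (following \cite{abbeel2004apprenticeship,ziebart2008maximum}; the symbols below are introduced here only for illustration and are not used elsewhere) is to seek a policy $\pi$ whose expected feature counts are close to the expert's:
\begin{equation*}
\mu(\pi) = \mathbb{E}\Big[\textstyle\sum_{t=1}^{T} \phi(s_t) \,\Big|\, \pi\Big],
\qquad
\big\lVert \mu(\pi) - \mu(\pi^{\star}) \big\rVert \le \epsilon,
\end{equation*}
where $\phi$ is a feature map over states, $\pi^{\star}$ denotes the expert, and $T$ is the horizon; distribution-matching methods such as \cite{ho2016generative} instead match the induced distribution over trajectories.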
\drew{I feel we are talking a lot about IL, and not about our contribution...}
Interactive learning approaches to IL such as SEARN \cite{daume2009search}, DAgger \cite{Ross2011_AISTATS}, AggreVaTe \cite{ross2014reinforcement} and LOLS \cite{chang2015learning}, interleave learning and testing procedures to overcome the data mismatch issue and, as a result, work well in practical applications.
Interactive approaches also provide strong theoretical guarantees through a reduction to no-regret online learning \cite{Zinkevich2003_ICML,shalev2012online}. For example, under appropriate assumptions, AggreVaTe \cite{ross2014reinforcement} is guaranteed to learn a policy that performs as well as an optimal expert and to degrade gracefully when the learner is imperfect.% and even outperform a sub-optimal expert
\drew{Changed above. See if it helps.}
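Schematically, and glossing over constants and lower-order terms (see \cite{ross2014reinforcement} for the precise statement), guarantees of this kind bound the cost of the learned policy $\hat{\pi}$ by that of the expert $\pi^{\star}$ plus terms that vanish as the online learner's average regret goes to zero:
\begin{equation*}
J(\hat{\pi}) \;\le\; J(\pi^{\star}) + T\big(\epsilon_{\text{class}} + \epsilon_{\text{regret}}\big),
\end{equation*}
where $T$ is the task horizon, $\epsilon_{\text{class}}$ measures how well the policy class can match the expert's cost-to-go, and $\epsilon_{\text{regret}}$ is the average regret of the underlying no-regret learner.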
Interactive IL is often framed as \emph{Data Aggregation}: in every episode, the algorithm first collects data by executing the current learned policy and querying an expert for feedback, then aggregates the new data with its existing dataset, and finally updates the policy by batch supervised learning on the aggregated dataset (e.g., via a cost-sensitive classification oracle). One drawback of these approaches is that they do not seem to scale well to complex tasks. This is frustrating: recent advances in Reinforcement Learning (RL) have solved a range of problems including challenging robotics tasks, video games, and board games \cite{schulman2015trust,duan2016benchmarking,silver2016mastering}. With expert demonstrations, IL should intuitively \emph{outperform} RL methods, yet previous IL algorithms do not seem to scale to these tasks. \BB{It would be nice if the previous sentence were true. Is it? If so, we need to cite, and *ideally* we would come back to this in the experimental results somehow. If it's not true, we need to change the motivation here. It would also lessen the impact of the statement at the end of the introduction about IL $>$ RL, which seems obvious.}
\drew{I'm not sure the sentence is true. I think IL $>$ RL is completely non-obvious to our deep colleagues. Hal and John always use the fully online version of aggrevated for all their examples. I would say that it's a misunderstanding that one has to aggregate, but one that we are correcting here.}
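For readers less familiar with this recipe, one round of data aggregation in its simplest (DAgger-style) form, where the expert provides action labels, can be summarized as follows (AggreVaTe and LOLS instead collect cost-to-go feedback; the notation $d_{\pi_n}$, $\pi^{\star}$, $\Pi$, and $\ell$ is introduced here purely for illustration):
\begin{align*}
\mathcal{D}_{n+1} &= \mathcal{D}_{n} \cup \big\{(s, a^{\star}) : s \sim d_{\pi_n},\; a^{\star} = \pi^{\star}(s)\big\}, \\
\pi_{n+1} &= \arg\min_{\pi \in \Pi} \textstyle\sum_{(s, a^{\star}) \in \mathcal{D}_{n+1}} \ell\big(\pi(s), a^{\star}\big),
\end{align*}
where $d_{\pi_n}$ is the state distribution induced by the current policy $\pi_n$, $\pi^{\star}$ denotes the expert, and $\ell$ is a supervised (or cost-sensitive) surrogate loss minimized over the policy class $\Pi$.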
In this paper we attack the problem of reward-aware imitation learning, similar to the setting in SEARN, AggreVaTe, and LOLS, where one considers the reward-to-go of the expert (not just simple agreement with the expert at each step). \drew{Is this helping Samy Bengio understand?} The final goal of these methods is to train a policy that can achieve maximum total reward. \drew{Given the target, perhaps these should be losses?}
With the goal of extending imitation learning to challenging tasks such as high-dimensional continuous control in robotics, we combine our approach with ideas from \emph{policy search}. We provide a framework called \emph{Differentiable Imitation Learning} (DIL) \drew{Make macro. I don't like DIL.}, which closely follows the reduction framework from AggreVaTe \cite{ross2014reinforcement} to retain its theoretical guarantees to the extent possible. DIL provides two gradient update procedures: a regular gradient update procedure developed from Online Gradient Descent (OGD) \cite{Zinkevich2003_ICML} and a natural gradient update procedure motivated by Exponential Gradient Descent (EG) \cite{shalev2012online}. \drew{This is a novel motivation, but why don't we cite the derivation of natural gradient and then relate it to EG? I think they have fundamentally different theoretical properties, so I'm skeptical to say the least.} We provide detailed derivations of unbiased estimators of policy gradients in IL and their variance-reduced variants \cite{greensmith2004variance}.
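As a point of reference for these two procedures, the generic updates they build on take the following textbook forms \cite{Zinkevich2003_ICML,shalev2012online} (shown here only as background, not as the DIL-specific derivation): given a loss $\ell_t$ at round $t$, OGD updates parameters $\theta$ additively, while EG updates a weight vector $w$ on the simplex multiplicatively,
\begin{align*}
\text{OGD:}\quad \theta_{t+1} &= \theta_{t} - \eta\, \nabla_{\theta} \ell_{t}(\theta_{t}), \\
\text{EG:}\quad w_{t+1,i} &\propto w_{t,i}\, \exp\!\big(-\eta\, [\nabla_{w} \ell_{t}(w_{t})]_{i}\big),
\end{align*}
with step size $\eta$ and the EG weights renormalized to sum to one.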
We demonstrate our approach by learning neural network policies on various robotics simulators with continuous state and action spaces, and on a Dependency Parsing task on Handwritten Algebra with raw image data \cite{duyckpredicting}. We show that DIL can achieve expert-level performance and even \emph{super-expert} performance, which can rarely be achieved by non-interactive IL approaches. We also show that DIL can handle partially observable settings by using LSTM-based policies. Finally, we compare our approach to policy gradient RL algorithms and show that IL can learn much faster than RL in practice. \drew{Again, I think the last point is the key one.}
In addition to providing a set of practical algorithms, we provide a comprehensive theoretical study of IL on discrete MDPs. We design a special MDP structure to demonstrate that imitation learning can be exponentially more sample-efficient than any RL algorithm. \drew{True, but it sounds terrible. We prove that... Design sounds like it's totally make-believe.} For general discrete MDPs, we provide a regret upper bound for AggreVaTe with EG. This upper bound shows that IL can still learn significantly faster than RL in this more general MDP framework. We also provide a regret lower bound for any IL algorithm, which demonstrates that AggreVaTe is near-optimal. Both the experimental and theoretical results indicate that:
\begin{displayquote}
\emph{Imitation Learning $>$ Reinforcement Learning}, when we have access to expert demonstrations.
\end{displayquote}
\todo{That's nice, but it needs rephrasing. Again, put yourself in Samy Bengio's shoes. What are you telling him?}