Paul Christiano - Deep Reinforcement Learning from Human Preferences (2017)

Created: August 5, 2017 / Updated: November 2, 2024 / Status: finished / 4 min read (~635 words)
Machine learning

  • Human evaluators are shown short video clips (trajectory segments) and asked to rate two clips against each other
  • The RL agent is given information as to which trajectory segments are preferred over other trajectory segments
    • This allows it to try to determine which states/actions are better than others given the human feedback

  • In order to practically train deep RL systems with human feedback, we need to decrease the amount of feedback required by several orders of magnitude
  • Our approach is to learn a reward function from human feedback and then to optimize that reward function (a rough sketch of this loop follows this list)
  • We desire a solution to sequential decision problems without a well-specified reward function that
    • enables us to solve tasks for which we can only recognize the desired behavior, but not necessarily demonstrate it
    • allows agents to be taught by non-expert users
    • scales to large problems
    • is economical with user feedback
  • Our work could also be seen as a specific instance of the cooperative inverse reinforcement learning framework
  • This framework considers a two-player game between a human and a robot interacting with an environment with the purpose of maximizing the human's reward function
  • In our setting the human is only allowed to interact with this game by stating their preferences
  • Our key contribution is to scale human feedback up to deep reinforcement learning and to learn much more complex behaviors
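
A rough Python sketch of this learn-then-optimize loop. The component names (`policy`, `reward_model`, `env`, `query_human`) and their interfaces are assumptions made for illustration, not the paper's implementation; the sketch only shows how data flows between the pieces:

```python
def collect_segments(policy, env, n_segments, segment_len):
    """Roll out the current policy and slice the trajectories into
    fixed-length segments of (observation, action) pairs."""
    segments = []
    obs = env.reset()
    for _ in range(n_segments):
        segment = []
        for _ in range(segment_len):
            action = policy.act(obs)
            segment.append((obs, action))
            # The true environment reward is hidden from the agent.
            obs, _hidden_reward, done, _info = env.step(action)
            if done:
                obs = env.reset()
        segments.append(segment)
    return segments


def training_loop(policy, reward_model, env, query_human, iterations):
    """Alternate between collecting human comparisons and optimizing the policy."""
    database = []  # D: triples (sigma1, sigma2, mu), described further below
    for _ in range(iterations):
        segments = collect_segments(policy, env, n_segments=20, segment_len=25)
        # Ask the human overseer to compare randomly paired segments.
        for sigma1, sigma2 in zip(segments[0::2], segments[1::2]):
            mu = query_human(sigma1, sigma2)   # e.g. (1.0, 0.0), (0.5, 0.5), or None
            if mu is not None:                 # incomparable pairs are dropped
                database.append((sigma1, sigma2, mu))
        reward_model.fit(database)             # learn r_hat from human preferences
        policy.update(env, reward_model)       # RL against the learned reward r_hat
```

In the paper itself these pieces (policy optimization, preference collection, and reward fitting) run asynchronously rather than in a strict loop as above.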

  • Instead of assuming that the environment produces a reward signal, we assume that there is a human overseer who can express preferences between trajectory segments
  • A trajectory segment is a sequence of observations and actions, $\sigma = ((o_0, a_0), (o_1, a_1), \dots, (o_{k-1}, a_{k-1})) \in (\mathcal{O} \times \mathcal{A})^k$
  • $\sigma^1 \succ \sigma^2$ indicates that the human prefers trajectory segment $\sigma^1$ over trajectory segment $\sigma^2$
  • Informally, the goal of the agent is to produce trajectories which are preferred by the human, while making as few queries as possible to the human
  • We will evaluate our algorithms' behavior in two ways (the quantitative criterion is sketched in code after this list):
    • Quantitative: We say that preferences $\succ$ are generated by a reward function $r: \mathcal{O} \times \mathcal{A} \rightarrow \mathbb{R}$ if

      $$ ((o_0^1, a_0^1), \dots, (o_{k-1}^1, a_{k-1}^1)) \succ ((o_0^2, a_0^2), \dots, (o_{k-1}^2, a_{k-1}^2))$$

      whenever

      $$ r(o_0^1, a_0^1) + \dots + r(o_{k-1}^1, a_{k-1}^1) > r(o_0^2, a_0^2) + \dots + r(o_{k-1}^2, a_{k-1}^2)$$

    • Qualitative: We qualitatively evaluate how well the agent's behavior satisfies the human's preferences
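
A small Python sketch of the quantitative criterion above, assuming each segment is a list of (observation, action) pairs and `r` is a plain function standing in for the reward; both the data layout and the function names are assumptions of the sketch, not part of the paper:

```python
def segment_return(r, segment):
    """Summed reward r(o_0, a_0) + ... + r(o_{k-1}, a_{k-1}) of a segment."""
    return sum(r(o, a) for (o, a) in segment)


def consistent_with(r, stated_preferences):
    """Check that a set of stated preferences could have been generated by r.

    `stated_preferences` is an iterable of (preferred, other) segment pairs.
    By the definition above, if `other` had a strictly larger summed reward
    than `preferred`, the overseer would have preferred `other` instead, so
    such a pair rules r out; ties leave the preference unconstrained.
    """
    return all(
        segment_return(r, preferred) >= segment_return(r, other)
        for preferred, other in stated_preferences
    )
```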

  • The human overseer is given a visualization of two trajectory segments, in the form of short movie clips
  • The human then indicates which segment they prefer, that the two segments are equally preferable, or that they are unable to compare the two segments
  • The human judgments are recorded in a database $\mathcal{D}$ of triples $(\sigma^1, \sigma^2, \mu)$, where $\sigma^1$ and $\sigma^2$ are the two segments and $\mu$ is a distribution over $\{1, 2\}$ indicating which segment the user preferred
    • If the human selects one segment as preferable, then $\mu$ puts all of its mass on that choice
    • If the human marks the segments as equally preferable, then $\mu$ is uniform
    • If the human marks the segments as incomparable, then the comparison is not included in the database
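
A minimal Python sketch of this bookkeeping; the label strings (`"1"`, `"2"`, `"equal"`, `"incomparable"`) are an encoding assumed for illustration, not the paper's interface:

```python
database = []  # D: list of triples (sigma1, sigma2, mu)


def record_judgment(sigma1, sigma2, label):
    """Store a human comparison as a triple (sigma1, sigma2, mu).

    mu is a distribution over {1, 2}: all of its mass goes on the chosen
    segment, it is uniform if the segments are judged equally preferable,
    and incomparable pairs are not added to the database at all.
    """
    if label == "1":
        mu = (1.0, 0.0)
    elif label == "2":
        mu = (0.0, 1.0)
    elif label == "equal":
        mu = (0.5, 0.5)
    else:  # "incomparable": the comparison is dropped
        return
    database.append((sigma1, sigma2, mu))
```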

  • In general we discovered that asking humans to compare longer clips was significantly more helpful per clip, and significantly less helpful per frame
  • We found that for short clips it took human raters a while just to understand the situation, while for longer clips the evaluation time was a roughly linear function of the clip length
  • We tried to choose the shortest clip length for which the evaluation time was linear

  • Christiano, Paul, et al. "Deep reinforcement learning from human preferences." arXiv preprint arXiv:1706.03741 (2017).