Paul Christiano - Deep Reinforcement Learning from Human Preferences (2017)
Created: August 5, 2017 / Updated: November 2, 2024 / Status: finished / 4 min read (~635 words)
- Human evaluators are shown pairs of short video clips (trajectory segments) and asked to rate the two clips against each other
- The RL agent is given information as to which trajectory segments are preferred over other trajectory segments
- This allows it to try to determine which states/actions are better than others given the human feedback
- In order to practically train deep RL systems with human feedback, we need to decrease the amount of feedback required by several orders of magnitude
- Our approach is to learn a reward function from human feedback and then to optimize that reward function
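- The paper fits the reward estimate $\hat{r}$ with a Bradley-Terry-style model: the predicted probability of preferring a segment is a softmax over the two segments' summed predicted rewards, trained with cross-entropy against the human labels $\mu$. Below is a minimal PyTorch sketch of that step; the `RewardNet` architecture and tensor shapes are illustrative assumptions, not the paper's

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Assumed reward model r_hat(o, a) -> scalar, one value per time step."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # obs: (..., obs_dim), act: (..., act_dim) -> per-step reward (...)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(r_hat, seg1, seg2, mu):
    """Cross-entropy between the human label mu and the predicted preference,
    where each segment's score is its summed predicted reward.
    seg1, seg2: (obs, act) tensors of shape (batch, k, dim); mu: (batch, 2)."""
    score1 = r_hat(*seg1).sum(dim=1)               # (batch,)
    score2 = r_hat(*seg2).sum(dim=1)               # (batch,)
    logits = torch.stack([score1, score2], dim=1)  # (batch, 2)
    log_p = torch.log_softmax(logits, dim=1)
    return -(mu * log_p).sum(dim=1).mean()
```

- Each triple $(\sigma^1, \sigma^2, \mu)$ in the comparison database contributes one term of this loss; the learned $\hat{r}$ is then handed to a standard deep RL algorithm in place of the environment reward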
- We desire a solution to sequential decision problems without a well-specified reward function that
- enables us to solve tasks for which we can only recognize the desired behavior, but not necessarily demonstrate it
- allows agents to be taught by non-expert users
- scales to large problems
- is economical with user feedback
- Our work could also be seen as a specific instance of the cooperative inverse reinforcement learning framework
- This framework considers a two-player game between a human and a robot interacting with an environment with the purpose of maximizing the human's reward function
- In our setting the human is only allowed to interact with this game by stating their preferences
- Our key contribution is to scale human feedback up to deep reinforcement learning and to learn much more complex behaviors
- Instead of assuming that the environment produces a reward signal, we assume that there is a human overseer who can express preferences between trajectory segments
- A trajectory segment is a sequence of observations and actions, $\sigma = ((o_0, a_0), (o_1, a_1), \dots, (o_{k-1}, a_{k-1})) \in (\mathcal{O} \times \mathcal{A})^k$
- $\sigma^1 \succ \sigma^2$ indicates that the human prefers trajectory segment $\sigma^1$ over trajectory segment $\sigma^2$
- Informally, the goal of the agent is to produce trajectories which are preferred by the human, while making as few queries as possible to the human
- We will evaluate our algorithms' behavior in two ways (the quantitative criterion is also sketched in code after this list):
- Quantitative: We say that preferences $\succ$ are generated by a reward function $r: \mathcal{O} \times \mathcal{A} \rightarrow \mathbb{R}$ if
$$ ((o_0^1, a_0^1), \dots, (o_{k-1}^1, a_{k-1}^1)) \succ ((o_0^2, a_0^2), \dots, (o_{k-1}^2, a_{k-1}^2))$$
whenever
$$ r(o_0^1, a_0^1) + \dots + r(o_{k-1}^1, a_{k-1}^1) > r(o_0^2, a_0^2) + \dots + r(o_{k-1}^2, a_{k-1}^2)$$
- Qualitative: evaluate qualitatively how well the agent's behavior satisfies the human's preferences
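- A small sketch of the quantitative criterion: under a candidate reward function $r$, a segment's score is its summed reward, and the preference generated by $r$ is the one whose segment scores higher (function names are illustrative)

```python
def segment_return(r, segment):
    """Sum of r(o, a) over a segment [(o_0, a_0), ..., (o_{k-1}, a_{k-1})]."""
    return sum(r(o, a) for o, a in segment)

def generated_by(r, sigma1, sigma2):
    """True iff the preference sigma1 > sigma2 is the one generated by r,
    i.e. sigma1's total reward exceeds sigma2's."""
    return segment_return(r, sigma1) > segment_return(r, sigma2)
```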
- The human overseer is given a visualization of two trajectory segments, in the form of short movie clips
- The human then indicates which segment they prefer, that the two segments are equally preferable, or that they are unable to compare the two segments
- The human judgments are recorded in a database $\mathcal{D}$ of triples $(\sigma^1, \sigma^2, \mu)$, where $\sigma^1$ and $\sigma^2$ are the two segments and $\mu$ is a distribution over $\{1, 2\}$ indicating which segment the user preferred (see the sketch after the cases below)
- If the human selects one segment as preferable, then $\mu$ puts all of its mass on that choice
- If the human marks the segments as equally preferable, then $\mu$ is uniform
- If the human marks the segments as incomparable, then the comparison is not included in the database
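- This bookkeeping is easy to state in code; a minimal sketch, where the label encoding is an illustrative assumption

```python
def record_judgment(database, sigma1, sigma2, label):
    """Append a triple (sigma1, sigma2, mu) to the comparison database.
    label: "1" or "2" for a strict preference, "equal", or "incomparable"."""
    if label == "1":
        mu = (1.0, 0.0)   # all mass on the first segment
    elif label == "2":
        mu = (0.0, 1.0)   # all mass on the second segment
    elif label == "equal":
        mu = (0.5, 0.5)   # uniform over the two segments
    elif label == "incomparable":
        return            # the comparison is not included in the database
    else:
        raise ValueError(f"unknown label: {label}")
    database.append((sigma1, sigma2, mu))
```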
- In general we discovered that asking humans to compare longer clips was significantly more helpful per clip, and significantly less helpful per frame
- We found that for short clips it took human raters a while just to understand the situation, while for longer clips the evaluation time was a roughly linear function of the clip length
- We tried to choose the shortest clip length for which the evaluation time was linear
- Christiano, Paul, et al. "Deep reinforcement learning from human preferences." arXiv preprint arXiv:1706.03741 (2017).