Paul Christiano - Deep Reinforcement Learning from Human Preferences (2017)
Created: August 5, 2017 / Updated: November 2, 2024 / Status: finished / 4 min read (~635 words)
- Human evaluators are shown pairs of short video clips (trajectory segments) and asked to rate the two clips against each other
- The RL agent is given information as to which trajectory segments are preferred over other trajectory segments
- This allows it to try to determine which states/actions are better than others given the human feedback
- In order to practically train deep RL systems with human feedback, we need to decrease the amount of feedback required by several orders of magnitude
- Our approach is to learn a reward function from human feedback and then to optimize that reward function
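- The paper fits the reward estimate $\hat{r}$ with a Bradley-Terry-style model: the predicted probability of preferring a segment is a softmax over the two segments' summed predicted rewards, trained with cross-entropy against the human labels $\mu$. Below is a minimal PyTorch sketch of that step; the `RewardNet` architecture and tensor shapes are illustrative assumptions, not the paper's

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Assumed reward model r_hat(o, a) -> scalar, one value per time step."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # obs: (..., obs_dim), act: (..., act_dim) -> per-step reward (...)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(r_hat, seg1, seg2, mu):
    """Cross-entropy between the human label mu and the predicted preference,
    where each segment's score is its summed predicted reward.
    seg1, seg2: (obs, act) tensors of shape (batch, k, dim); mu: (batch, 2)."""
    score1 = r_hat(*seg1).sum(dim=1)               # (batch,)
    score2 = r_hat(*seg2).sum(dim=1)               # (batch,)
    logits = torch.stack([score1, score2], dim=1)  # (batch, 2)
    log_p = torch.log_softmax(logits, dim=1)
    return -(mu * log_p).sum(dim=1).mean()
```

- Each triple $(\sigma^1, \sigma^2, \mu)$ in the comparison database contributes one term of this loss; the learned $\hat{r}$ is then handed to a standard deep RL algorithm in place of the environment reward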
- We desire a solution to sequential decision problems without a well-specified reward function that
- enables us to solve tasks for which we can only recognize the desired behavior, but not necessarily demonstrate it
- allows agents to be taught by non-expert users
- scales to large problems
- is economical with user feedback
- Our work could also be seen as a specific instance of the cooperative inverse reinforcement learning framework
- This framework considers a two-player game between a human and a robot interacting with an environment with the purpose of maximizing the human's reward function
- In our setting the human is only allowed to interact with this game by stating their preferences
- Our key contribution is to scale human feedback up to deep reinforcement learning and to learn much more complex behaviors
- Instead of assuming that the environment produces a reward signal, we assume that there is a human overseer who can express preferences between trajectory segments
- A trajectory segment is a sequence of observations and actions, $\sigma = ((o_0, a_0), (o_1, a_1), \dots, (o_{k-1}, a_{k-1})) \in (\mathcal{O} \times \mathcal{A})^k$
- $\sigma^1 \succ \sigma^2$ indicates that the human prefers trajectory segment $\sigma^1$ over trajectory segment $\sigma^2$
- Informally, the goal of the agent is to produce trajectories which are preferred by the human, while making as few queries as possible to the human
- We will evaluate our algorithms' behavior in two ways (the quantitative criterion is also sketched in code after this list):
- Quantitative: We say that preferences $\succ$ are generated by a reward function $r: \mathcal{O} \times \mathcal{A} \rightarrow \mathbb{R}$ if
$$ ((o_0^1, a_0^1), \dots, (o_{k-1}^1, a_{k-1}^1)) \succ ((o_0^2, a_0^2), \dots, (o_{k-1}^2, a_{k-1}^2))$$
whenever
$$ r(o_0^1, a_0^1) + \dots + r(o_{k-1}^1, a_{k-1}^1) > r(o_0^2, a_0^2) + \dots + r(o_{k-1}^2, a_{k-1}^2)$$
- Qualitative: evaluate qualitatively how well the agent's behavior satisfies the human's preferences
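- A small sketch of the quantitative criterion: under a candidate reward function $r$, a segment's score is its summed reward, and the preference generated by $r$ is the one whose segment scores higher (function names are illustrative)

```python
def segment_return(r, segment):
    """Sum of r(o, a) over a segment [(o_0, a_0), ..., (o_{k-1}, a_{k-1})]."""
    return sum(r(o, a) for o, a in segment)

def generated_by(r, sigma1, sigma2):
    """True iff the preference sigma1 > sigma2 is the one generated by r,
    i.e. sigma1's total reward exceeds sigma2's."""
    return segment_return(r, sigma1) > segment_return(r, sigma2)
```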
- The human overseer is given a visualization of two trajectory segments, in the form of short movie clips
- The human then indicates which segment they prefer, that the two segments are equally preferable, or that they are unable to compare the two segments
- The human judgments are recorded in a database $\mathcal{D}$ of triples $(\sigma^1, \sigma^2, \mu)$, where $\sigma^1$ and $\sigma^2$ are the two segments and $\mu$ is a distribution over $\{1, 2\}$ indicating which segment the user preferred (see the sketch after the cases below)
- If the human selects one segment as preferable, then $\mu$ puts all of its mass on that choice
- If the human marks the segments as equally preferable, then $\mu$ is uniform
- If the human marks the segments as incomparable, then the comparison is not included in the database
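- This bookkeeping is easy to state in code; a minimal sketch, where the label encoding is an illustrative assumption

```python
def record_judgment(database, sigma1, sigma2, label):
    """Append a triple (sigma1, sigma2, mu) to the comparison database.
    label: "1" or "2" for a strict preference, "equal", or "incomparable"."""
    if label == "1":
        mu = (1.0, 0.0)   # all mass on the first segment
    elif label == "2":
        mu = (0.0, 1.0)   # all mass on the second segment
    elif label == "equal":
        mu = (0.5, 0.5)   # uniform over the two segments
    elif label == "incomparable":
        return            # the comparison is not included in the database
    else:
        raise ValueError(f"unknown label: {label}")
    database.append((sigma1, sigma2, mu))
```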
- In general we discovered that asking humans to compare longer clips was significantly more helpful per clip, and significantly less helpful per frame
- We found that for short clips it took human raters a while just to understand the situation, while for longer clips the evaluation time was a roughly linear function of the clip length
- We tried to choose the shortest clip length for which the evaluation time was linear
- Christiano, Paul, et al. "Deep reinforcement learning from human preferences." arXiv preprint arXiv:1706.03741 (2017).