Levente Kocsis - Bandit based Monte-Carlo Planning (2006)
Created: December 8, 2017 / Updated: November 2, 2024 / Status: finished / 2 min read (~286 words)
- Can UCT be used to rapidly determine "interesting values" in black box testing?
- We are interested in Monte-Carlo planning algorithms with two important characteristics:
- small error probability if the algorithm is stopped prematurely
- convergence to the best action if enough time is given
- The main idea of the algorithm proposed in this paper is to sample actions selectively
- Let us consider problems with a large number of actions and assume that the lookahead is carried out to a fixed depth $D$
- If sampling can be restricted to, say, half of the actions at all stages, then the overall work reduction is $(1/2)^D$ (see the worked arithmetic after this list)
- If one is able to identify a large subset of the suboptimal actions early in the sampling procedure then huge performance improvements can be expected
- By definition, an action is suboptimal for a given state if its value is less than the best of the action-values for the same state
- A rollout-based algorithm builds its lookahead tree by repeatedly sampling episodes from the initial state
- An episode is a sequence of state-action-reward triplets that are obtained using the domain's generative model (a minimal sampling sketch follows this list)
- A policy is said to resolve the exploration-exploitation tradeoff if its regret growth rate is within a constant factor of the best possible regret rate (the standard regret definition is recalled after this list)
- UCB1 keeps track of the average rewards for all the arms and chooses the arm with the best upper confidence bound
- In UCT the action selection problem is treated as a separate multi-armed bandit for every (explored) internal node (see the UCB1/UCT sketch after this list)
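
A quick check of the pruning arithmetic above, assuming for illustration a uniform branching factor $K$ (a symbol not used in the note itself): exhaustive lookahead to depth $D$ evaluates on the order of $K^D$ action sequences, while restricting attention to half of the actions at every level costs

$$(K/2)^D = (1/2)^D \cdot K^D,$$

i.e. a $(1/2)^D$ fraction of the original work; with $D = 10$ that is roughly $0.1\%$.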
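
A minimal sketch of what episode sampling in a rollout-based algorithm could look like, assuming a generative model that exposes a hypothetical `step(state, action) -> (next_state, reward)` method (all identifiers here are illustrative, not from the paper):

```python
import random

def sample_episode(model, actions, initial_state, depth, policy=None):
    """Sample one episode: a list of (state, action, reward) triplets
    of fixed length `depth`, drawn from the generative model."""
    episode = []
    state = initial_state
    for _ in range(depth):
        # Fall back to uniform random action selection if no policy is given.
        action = policy(state) if policy else random.choice(actions)
        next_state, reward = model.step(state, action)
        episode.append((state, action, reward))
        state = next_state
    return episode
```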
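
For reference, the standard regret definition (notation mine, not from this note): after $n$ plays of a $K$-armed bandit with arm means $\mu_1, \dots, \mu_K$ and $\mu^* = \max_j \mu_j$,

$$R_n = n\,\mu^* - \sum_{j=1}^{K} \mu_j\, \mathbb{E}\big[T_j(n)\big],$$

where $T_j(n)$ counts how often arm $j$ was played. The best achievable growth rate is logarithmic in $n$ (Lai and Robbins), and UCB1 matches it up to a constant factor.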
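
A sketch of UCB1 arm selection and of how UCT could run one such bandit at every explored internal node; the class names and the exploration constant `c` are my own notation, not the paper's:

```python
import math

def ucb1(counts, totals, c=math.sqrt(2)):
    """Pick the arm with the best upper confidence bound.
    counts[i]: times arm i was played; totals[i]: sum of its rewards."""
    n = sum(counts)
    best, best_score = None, -math.inf
    for i, (n_i, t_i) in enumerate(zip(counts, totals)):
        if n_i == 0:
            return i  # play every arm once before comparing bounds
        score = t_i / n_i + c * math.sqrt(math.log(n) / n_i)
        if score > best_score:
            best, best_score = i, score
    return best

class UCTNode:
    """Each explored internal node runs its own UCB1 bandit over its actions."""
    def __init__(self, num_actions):
        self.counts = [0] * num_actions    # visits per action at this node
        self.totals = [0.0] * num_actions  # cumulative return per action

    def select_action(self, c=math.sqrt(2)):
        return ucb1(self.counts, self.totals, c)

    def update(self, action, episode_return):
        self.counts[action] += 1
        self.totals[action] += episode_return
```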
- Kocsis, Levente, and Csaba Szepesvári. "Bandit Based Monte-Carlo Planning." In Machine Learning: ECML 2006. Springer, 2006.