Abstract

A key aspect of artificial intelligence is the ability to learn from experience. If examples of correct solutions exist, supervised learning techniques can be used to predict what the correct solution will be for future observations. However, often such examples are not readily available. The field of reinforcement learning investigates
... read more
methods that can learn from experience when no examples of correct behavior are given, but a reinforcement signal is supplied to the learning entity. Many problems fit this problem description. In games, the reinforcement signal might be whether or not the game was won. In economic settings, the reinforcement can represent the profit or loss that is eventually made. Furthermore, in robotics it is often easier to specify how well the robot is doing than it is to find examples of good behavior beforehand. An advantage of reinforcement learning is that the designer of the system need not know what good solutions to a problem may be. Rather, the system will find good solutions by trial and error. Of particular interest to us are model-free temporal-difference algorithms. These algorithms do not use experiences to build an explicit model of the environment, but construct an approximation of the expected value for each possible action. The values can then be used to construct solutions. These methods are computationally efficient, easy to implement and often find solutions quickly. Additionally, in many settings it is easier to find a good policy to select actions than to model the whole environment and then to use this model to try to determine what to do. In this dissertation, we analyze several existing model-free temporal-difference algorithms. We discuss some problems with these approaches, such as a potentially huge overestimation of the action values by the popular Q-learning algorithm. We discuss ways to prevent these issues and propose a number of new algorithms. We analyze the new algorithms and compare their performance on a number of tasks. We conclude that it depends highly on the characteristics of the problem which algorithm performs best. We give some indications on which algorithms are to be preferred in different problem settings. To solve problems with unknown characteristics, we propose using ensemble methods that combine action-selection policies of a number of different entities. We discuss several approaches to combine these policies and demonstrate empirically that good solutions can reliably be found. Additionally, we extend the idea of model-free temporal-difference algorithms to problems with continuous action spaces. In such problems, conventional approaches are not applicable, because they can not handle the infinite number of possible actions. We propose a new algorithm that is explicitly designed for continuous spaces and show that it compares favorably to the current state of the art.
show less