#### Date of Award

8-2014

#### Document Type

Dissertation

#### Degree Name

Doctor of Philosophy (PhD)

#### Legacy Department

Mathematical Science

#### Committee Chair/Advisor

Dr. Peter Kiesler

#### Committee Member

Dr. Robert Lund

#### Committee Member

Dr. Colin Gallagher

#### Committee Member

Dr. Brian Fralix

#### Abstract

In the field of Reinforcement Learning, Markov Decision Processes with a finite number of states and actions have been well studied, and there exist algorithms capable of producing a sequence of policies which converge to an optimal policy with probability one. Convergence guarantees for problems with continuous states also exist. Until recently, no online algorithm for continuous states and continuous actions has been proven to produce optimal policies. This Dissertation contains the results of research into reinforcement learning algorithms for problems in which both the state and action spaces are continuous. The problems to be solved are introduced formally as Markov Decision Processes. Also introduced is a value-function solution method known as Q-learning. The primary result of this Dissertation is the presentation of a Q-learning type algorithm adapted for continuous states and actions, and the proof that it asymptotically learns an optimal policy with probability one. While the algorithm is intended to advance the theory of continuous domain reinforcement learning, an example is given to show that with appropriate exploration policies, it can produce satisfactory solutions to non-trivial benchmark problems. Kernel regression based algorithms have excellent theoretical properties, but have high computational cost and do not adapt well to high-dimensional problems. A class of batch-mode regression tree-based algorithms is introduced. These algorithms are modular in the sense that different methods for partitioning, performing local regression, and choosing representative actions can be chosen. Experiments demonstrate superior performance over kernel methods. Batch algorithms possess superior computational efficiency, but pay the price of not being able to use past observations to inform exploration. A data structure useful for limited learning during the exploration phase is introduced. It is then demonstrated that this limited learning can outperform batch algorithms using totally random action exploration.

#### Recommended Citation

Carden, Stephen, "Convergence of a Reinforcement Learning Algorithm in Continuous Domains" (2014). *All Dissertations*. 1325.

https://tigerprints.clemson.edu/all_dissertations/1325

#### Included in

Applied Mathematics Commons, Artificial Intelligence and Robotics Commons, Statistics and Probability Commons