Cite this article: SHEN Jing, CHENG Xiao-bei, LIU Hai-bo, GU Guo-chang, ZHANG Guo-yin. Hierarchical reinforcement learning in dynamic environment[J]. Control Theory & Applications, 2008, 25(1): 71-74.
Hierarchical reinforcement learning in dynamic environment
Received: 2005-12-16    Revised: 2007-01-05
DOI: 10.7641/j.issn.1000-8152.2008.1.012
2008, 25(1): 71-74
Keywords: hierarchical reinforcement learning; dynamic environment; Option; strategy update
Foundation: Supported by the China Postdoctoral Science Foundation (20060400809) and the Fundamental Research Foundation of Harbin Engineering University (HEUFT07022, HEUFT05068, HEUFT05021).
Affiliation
SHEN Jing, CHENG Xiao-bei, LIU Hai-bo, GU Guo-chang, ZHANG Guo-yin    College of Computer Science and Technology, Harbin Engineering University, Harbin, Heilongjiang 150001, China
Abstract
      The existing reinforcement learning approaches cannot satisfactorily handle learning in a dynamic environment: the optimal strategy must be re-learned whenever the environment changes, and if the interval between changes is shorter than the time the strategy needs to converge, the learning algorithm cannot converge at all. Building on the Option framework for hierarchical reinforcement learning, this paper presents a hierarchical reinforcement learning approach that adapts to dynamic environments. Exploiting the hierarchical structure of learning, the approach attends only to changes in the sub-goal states of the task hierarchy and in the environment states inside the current Option, so that strategy updating is confined to a small local space or to a low-dimensional high-level space, which speeds up learning. Simulation experiments on shortest-path planning between two points in a two-dimensional dynamic grid world show that the proposed approach updates its strategy markedly faster than existing methods, and that the convergence of the learning algorithm depends less on the frequency of environment changes.
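To make the mechanism concrete, below is a minimal sketch (not the authors' implementation) of localized strategy updating with Options: a small grid is split into two regions, each served by an Option whose internal policy is learned by tabular Q-learning toward a sub-goal; when an obstacle appears, only the Option whose region contains the change re-learns its internal policy, while the other Option's Q-table is reused unchanged. The grid size, region split, doorway sub-goal, and all hyper-parameters are illustrative assumptions, and the high-level plan over Options is hard-coded here rather than learned as in the paper.

```python
# Illustrative sketch only: Option-style sub-policies with localized re-learning.
import random
from collections import defaultdict

SIZE = 8
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


class Option:
    """A sub-policy confined to one region, driving the agent to a sub-goal state."""

    def __init__(self, region, subgoal):
        self.region, self.subgoal = region, subgoal
        self.q = defaultdict(float)  # intra-Option Q-table: (state, action) -> value

    def learn(self, obstacles, episodes=400, alpha=0.5, gamma=0.95, eps=0.2):
        """(Re)learn the internal policy; only states inside self.region are touched."""
        starts = [c for c in self.region if c not in obstacles]
        for _ in range(episodes):
            s = random.choice(starts)
            for _ in range(4 * SIZE):
                if s == self.subgoal:
                    break
                a = (random.randrange(4) if random.random() < eps
                     else max(range(4), key=lambda i: self.q[s, i]))
                nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
                if nxt not in self.region or nxt in obstacles:
                    nxt = s  # blocked move: stay in place
                r = 0.0 if nxt == self.subgoal else -1.0
                best = max(self.q[nxt, i] for i in range(4))
                self.q[s, a] += alpha * (r + gamma * best - self.q[s, a])
                s = nxt

    def run(self, s, obstacles):
        """Execute the learned policy greedily until the sub-goal or a step budget."""
        steps = 0
        while s != self.subgoal and steps < 4 * SIZE:
            a = max(range(4), key=lambda i: self.q[s, i])
            nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
            if nxt in self.region and nxt not in obstacles:
                s = nxt
            steps += 1
        return s, steps


# Two illustrative Options: cross the left half to a doorway cell, then the
# right half to the final goal.  The high-level plan is fixed here for brevity;
# in the paper it is itself learned over Options.
left = {(r, c) for r in range(SIZE) for c in range(SIZE // 2 + 1)}
right = {(r, c) for r in range(SIZE) for c in range(SIZE // 2 - 1, SIZE)}
doorway, goal = (SIZE // 2, SIZE // 2), (SIZE - 1, SIZE - 1)
options = [Option(left, doorway), Option(right, goal)]

obstacles = set()
for opt in options:
    opt.learn(obstacles)  # initial learning of every Option

# Environment change: an obstacle appears inside the right half only, so only
# the affected Option re-learns; the other Q-table is reused unchanged.
obstacles.add((SIZE - 2, SIZE - 2))
for opt in options:
    if opt.region & obstacles:
        opt.learn(obstacles)

state, total = (0, 0), 0
for opt in options:
    state, steps = opt.run(state, obstacles)
    total += steps
print("reached", state, "in", total, "primitive steps")
```

Restricting the update to the affected Option keeps the re-learning cost proportional to the size of the changed region rather than to the whole state space, which is the effect the reported experiments measure.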