Cite this article: SHEN Jing, CHENG Xiao-bei, LIU Hai-bo, GU Guo-chang, ZHANG Guo-yin. Hierarchical reinforcement learning in dynamic environment[J]. Control Theory & Applications, 2008, 25(1): 71-74.
Hierarchical reinforcement learning in dynamic environment
Received: 2005-12-16    Revised: 2007-01-05
DOI: 10.7641/j.issn.1000-8152.2008.1.012
2008, 25(1): 71-74
Keywords: hierarchical reinforcement learning; dynamic environment; Option; strategy update
Foundation: Supported by the China Postdoctoral Science Foundation (20060400809) and the Fundamental Research Foundation of Harbin Engineering University (HEUFT07022, HEUFT05068, HEUFT05021).
Affiliation
SHEN Jing, CHENG Xiao-bei, LIU Hai-bo, GU Guo-chang, ZHANG Guo-yin    College of Computer Science and Technology, Harbin Engineering University, Harbin, Heilongjiang 150001, China
Abstract
      The existing reinforcement learning approaches cannot satisfactorily handle learning in a dynamic environment: the optimal strategy must be re-learned whenever the environment changes, and if the interval between changes is shorter than the time the strategy needs to converge, the learning algorithm cannot converge at all. Building on the Option framework for hierarchical reinforcement learning, this paper presents a hierarchical reinforcement learning approach that adapts to dynamic environments. Exploiting the hierarchical structure of learning, the approach attends only to changes in the sub-goal states of the task hierarchy and in the environment states inside the current Option, so that strategy updating is confined to a small local space or to a low-dimensional high-level space, which speeds up learning. Simulation experiments on shortest-path planning between two points in a two-dimensional dynamic grid world show that the proposed approach updates its strategy markedly faster than existing methods, and that the convergence of the learning algorithm depends less on the frequency of environment changes.
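To make the mechanism concrete, below is a minimal sketch (not the authors' implementation) of localized strategy updating with Options: a small grid is split into two regions, each served by an Option whose internal policy is learned by tabular Q-learning toward a sub-goal; when an obstacle appears, only the Option whose region contains the change re-learns its internal policy, while the other Option's Q-table is reused unchanged. The grid size, region split, doorway sub-goal, and all hyper-parameters are illustrative assumptions, and the high-level plan over Options is hard-coded here rather than learned as in the paper.

```python
# Illustrative sketch only: Option-style sub-policies with localized re-learning.
import random
from collections import defaultdict

SIZE = 8
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


class Option:
    """A sub-policy confined to one region, driving the agent to a sub-goal state."""

    def __init__(self, region, subgoal):
        self.region, self.subgoal = region, subgoal
        self.q = defaultdict(float)  # intra-Option Q-table: (state, action) -> value

    def learn(self, obstacles, episodes=400, alpha=0.5, gamma=0.95, eps=0.2):
        """(Re)learn the internal policy; only states inside self.region are touched."""
        starts = [c for c in self.region if c not in obstacles]
        for _ in range(episodes):
            s = random.choice(starts)
            for _ in range(4 * SIZE):
                if s == self.subgoal:
                    break
                a = (random.randrange(4) if random.random() < eps
                     else max(range(4), key=lambda i: self.q[s, i]))
                nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
                if nxt not in self.region or nxt in obstacles:
                    nxt = s  # blocked move: stay in place
                r = 0.0 if nxt == self.subgoal else -1.0
                best = max(self.q[nxt, i] for i in range(4))
                self.q[s, a] += alpha * (r + gamma * best - self.q[s, a])
                s = nxt

    def run(self, s, obstacles):
        """Execute the learned policy greedily until the sub-goal or a step budget."""
        steps = 0
        while s != self.subgoal and steps < 4 * SIZE:
            a = max(range(4), key=lambda i: self.q[s, i])
            nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
            if nxt in self.region and nxt not in obstacles:
                s = nxt
            steps += 1
        return s, steps


# Two illustrative Options: cross the left half to a doorway cell, then the
# right half to the final goal.  The high-level plan is fixed here for brevity;
# in the paper it is itself learned over Options.
left = {(r, c) for r in range(SIZE) for c in range(SIZE // 2 + 1)}
right = {(r, c) for r in range(SIZE) for c in range(SIZE // 2 - 1, SIZE)}
doorway, goal = (SIZE // 2, SIZE // 2), (SIZE - 1, SIZE - 1)
options = [Option(left, doorway), Option(right, goal)]

obstacles = set()
for opt in options:
    opt.learn(obstacles)  # initial learning of every Option

# Environment change: an obstacle appears inside the right half only, so only
# the affected Option re-learns; the other Q-table is reused unchanged.
obstacles.add((SIZE - 2, SIZE - 2))
for opt in options:
    if opt.region & obstacles:
        opt.learn(obstacles)

state, total = (0, 0), 0
for opt in options:
    state, steps = opt.run(state, obstacles)
    total += steps
print("reached", state, "in", total, "primitive steps")
```

Restricting the update to the affected Option keeps the re-learning cost proportional to the size of the changed region rather than to the whole state space, which is the effect the reported experiments measure.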