Citation: SU Zhang-sheng, JIANG Sheng-long, PENG Gong-zhuang, LIANG Yue-yong. Hierarchical reinforcement learning-based optimization method for crane scheduling [J]. Control Theory & Applications, 2025, 42(11): 2261-2273.
Hierarchical reinforcement learning-based optimization method for crane scheduling
Received: 2025-04-17  Revised: 2025-10-29
DOI: 10.7641/CTA.2025.50167
2025, 42(11): 2261-2273
Keywords: crane scheduling; hierarchical reinforcement learning; task assignment; path planning; action tabu
Fund: Supported by the National Natural Science Foundation of China (92367106, 62273032, 61873042).
Authors, affiliations, and e-mail:
SU Zhang-sheng  School of Materials Science and Engineering, Chongqing University  szs@stu.cqu.edu.cn
JIANG Sheng-long*  School of Automation, Chongqing University  jiang_shl@cqu.edu.cn
PENG Gong-zhuang  International Research Institute for Multidisciplinary Science, Beihang University
LIANG Yue-yong  Shanghai Baosight Software Co., Ltd.
Abstract
      Cranes are key heavy-duty material handling equipment widely used in workshops, warehouses, ports, and other industrial settings, and their scheduling significantly affects transportation efficiency and the achievement of production goals. To address the crane scheduling problem with time windows (CSP-TW), a mixed-integer linear programming model based on spatio-temporal discretization is developed. Based on the characteristics of this model, a hierarchical reinforcement learning (HRL) decision-making framework is designed: the high-level decision network assigns transportation tasks to appropriate cranes, while the low-level decision network plans a path for each crane to complete its assigned task. During learning, action tabu rules are introduced to avoid ineffective actions and guide both decision networks toward the dominant policy space. An external experience pool and the dueling double deep Q-network (D3QN) strategy are then adopted to train the decision networks. Tests were executed on the logistics simulation platform of a company's steel plant. Ablation experiments show that the introduced action tabu rules improve HRL learning efficiency; training comparisons indicate that HRL achieves better convergence than an end-to-end framework; and comparative experiments demonstrate that HRL outperforms multiple method classes, including multi-rule combinations, meta-heuristic algorithms, end-to-end methods, and a plain deep Q-network, while satisfying second-level response-time requirements for practical applications.
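The action-tabu idea in the abstract — excluding invalid actions so the decision networks explore only the dominant policy space — is commonly realized as Q-value masking at action-selection time. The following is a minimal sketch of that mechanism only; the function name, array shapes, and example values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def select_action(q_values, tabu_mask):
    """Greedy action selection that skips tabu (invalid) actions.

    q_values:  1-D array of Q-value estimates, one per action.
    tabu_mask: boolean array of the same length; True marks a tabu action.
    """
    # Setting tabu entries to -inf guarantees they can never be argmax'ed.
    masked = np.where(tabu_mask, -np.inf, q_values)
    return int(np.argmax(masked))

# Example: action 1 has the highest Q-value but is tabu, so action 2 wins.
q = np.array([0.2, 0.9, 0.5])
mask = np.array([False, True, False])
print(select_action(q, mask))  # 2
```

In a D3QN setting, the same mask would typically also be applied when computing the target network's max over next-state actions, so that tabu actions never contribute to bootstrapped targets.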