Citation: SU Zhang-sheng, JIANG Sheng-long, PENG Gong-zhuang, LIANG Yue-yong. Hierarchical reinforcement learning-based optimization method for crane scheduling [J]. Control Theory & Applications, 2025, 42(11): 2261-2273.
Hierarchical reinforcement learning-based optimization method for crane scheduling
Received: 2025-04-17  Revised: 2025-10-29
DOI: 10.7641/CTA.2025.50167
2025, 42(11): 2261-2273
Keywords: crane scheduling; hierarchical reinforcement learning; task assignment; path planning; action tabu
Fund: Supported by the National Natural Science Foundation of China (92367106, 62273032, 61873042).
Authors, affiliations, and e-mail:
SU Zhang-sheng  School of Materials Science and Engineering, Chongqing University  szs@stu.cqu.edu.cn
JIANG Sheng-long*  School of Automation, Chongqing University  jiang_shl@cqu.edu.cn
PENG Gong-zhuang  International Research Institute for Multidisciplinary Science, Beihang University
LIANG Yue-yong  Shanghai Baosight Software Co., Ltd.
Abstract
      Cranes are key heavy-duty material handling equipment widely used in workshops, warehouses, ports, and other industrial settings, and their scheduling significantly affects transportation efficiency and the achievement of production goals. To address the crane scheduling problem with time windows (CSP-TW), a mixed-integer linear programming model based on spatio-temporal discretization is developed. Based on the characteristics of this model, a hierarchical reinforcement learning (HRL) decision-making framework is designed: the high-level decision network assigns transportation tasks to appropriate cranes, while the low-level decision network plans a path for each crane to complete its assigned task. During learning, action tabu rules are introduced to avoid ineffective actions and guide both decision networks toward the dominant policy space. An external experience pool and the dueling double deep Q-network (D3QN) strategy are then adopted to train the decision networks. Tests were executed on the logistics simulation platform of a company's steel plant. Ablation experiments show that the introduced action tabu rules improve HRL learning efficiency; training comparisons indicate that HRL achieves better convergence than an end-to-end framework; and comparative experiments demonstrate that HRL outperforms multiple method classes, including multi-rule combinations, meta-heuristic algorithms, end-to-end methods, and a plain deep Q-network, while satisfying second-level response-time requirements for practical applications.
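The action-tabu idea in the abstract — excluding invalid actions so the decision networks explore only the dominant policy space — is commonly realized as Q-value masking at action-selection time. The following is a minimal sketch of that mechanism only; the function name, array shapes, and example values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def select_action(q_values, tabu_mask):
    """Greedy action selection that skips tabu (invalid) actions.

    q_values:  1-D array of Q-value estimates, one per action.
    tabu_mask: boolean array of the same length; True marks a tabu action.
    """
    # Setting tabu entries to -inf guarantees they can never be argmax'ed.
    masked = np.where(tabu_mask, -np.inf, q_values)
    return int(np.argmax(masked))

# Example: action 1 has the highest Q-value but is tabu, so action 2 wins.
q = np.array([0.2, 0.9, 0.5])
mask = np.array([False, True, False])
print(select_action(q, mask))  # 2
```

In a D3QN setting, the same mask would typically also be applied when computing the target network's max over next-state actions, so that tabu actions never contribute to bootstrapped targets.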