随机需求车辆路径问题的价值逼近在线决策

张晓楠; 张建雄

引用本文:	张晓楠,张建雄.随机需求车辆路径问题的价值逼近在线决策[J].控制理论与应用,2022,39(2):241~254.[点击复制]
	ZHANG Xiao-nan,ZHANG Jian-xiong.Value-approximation-based online policy for vehicle routing problem with stochastic demand[J].Control Theory & Applications,2022,39(2):241~254.[点击复制]

随机需求车辆路径问题的价值逼近在线决策

Value-approximation-based online policy for vehicle routing problem with stochastic demand

摘要点击 2801 全文点击 1069 投稿时间：2021-05-17 修订日期：2022-02-14

查看全文查看/发表评论下载PDF阅读器 HTML

DOI编号 10.7641/CTA.2021.10411

2022,39(2):241-254

中文关键词路径问题随机需求马尔可夫决策强化学习价值逼近算法

英文关键词 routing problems stochastic demand Markov decision process reinforcement learning value approximation iteration

基金项目国家自然科学基金项目(71802120, 71971152), 陕西省创新能力支撑计划(2020KRM024), 陕西省教育厅专项科研计划项目(19JK0125)资助.

作者	单位	E-mail
张晓楠	天津大学,陕西科技大学	WLxn_2010@126.com
张建雄^*	天津大学	jxzhang@tju.edu.cn

中文摘要

随着高效实时物流的发展, 不确定车辆路径问题面临着兼顾决策精度和实时响应能力的新挑战. 本文以应用最为广泛的随机需求车辆路径问题为例, 研究提出一种有效的在线决策方法. 首先, 考虑多车辆同时在线, 以总旅行成本最小化为目标, 建立马尔科夫决策模型, 并引入可信度约束和邻域半径减少策略缩小行动空间, 提高求解效率. 其次, 设计强化学习中的价值逼近算法求解模型, 其中, 采用基函数估计期望未来成本, 并将求解过程分离为离线训练和在线决策两个环节, 基函数的权重被离线训练并用于在线决策以减少在线决策时间, 同时, 在算法中嵌入了邻域半径的动态更新机制. 最后, 测试多组算例验证了本文方法的有效性.

英文摘要

With the development of effective real-time logistic, new challenges of making high-quality and real-time dynamic routing decisions have been brought to uncertain vehicle routing problem (VRP). This paper focuses on the vehicle routing problem with stochastic demand (VRPSD), a well-known uncertain VRP, and proposes an effective online method for solving it. First, considering multiple vehicles, we formulate a multi-vehicle Markov decision process (M-MDP), with the aim of minimizing the total travel cost. In the model, the credibility constraints and the neighborhood radius reduction strategy are introduced to reduce action space, which improves the efficiency. Second, we develop a reinforcement learning technology, namely value approximation iteration including offline training phase and online execution phase, to solve the model. In the method, the expected cost-to-go is estimated by a set of basis functions designed, the weight vector of basis function is trained offline to reduce online calculation time, and also, the value of neighborhood radius is dynamically updated offline. Numerical experiments show that the proposed method has good performance in both solution quality and time efficiency.