Cite this article: WANG Yu, REN Tian-jun, FAN Zi-lin, MENG Guang-lei. Distributed DDPG UAV pursuit decision based on angle feature [J]. Control Theory & Applications, 2025, 42(7): 1356-1366.
Distributed DDPG UAV pursuit decision based on angle feature
Received: 2023-03-05  Revised: 2025-04-05
DOI: 10.7641/CTA.2024.30105
2025, 42(7): 1356-1366
Keywords: pursuit decision-making; reinforcement learning; distributed DDPG algorithm; angle feature
Funding: Supported by the National Natural Science Foundation of China (61906125, 62373261) and the Fundamental Scientific Research Funds for Liaoning Provincial Undergraduate Universities (LJ232410143020, LJ212410143047).
Author  Affiliation  E-mail
WANG Yu*  College of Automation, Shenyang Aerospace University  wangyu@sau.edu.cn
REN Tian-jun  College of Automation, Shenyang Aerospace University
FAN Zi-lin  College of Automation, Shenyang Aerospace University
MENG Guang-lei  College of Automation, Shenyang Aerospace University
Abstract
      The situation in a UAV pursuit mission changes rapidly; an inflexible network update mechanism and a fixed reward function make it difficult for existing decision models to continuously output correct and efficient strategies. To solve this problem, a distributed deep deterministic policy gradient (DDPG) algorithm based on angle features is proposed. First, to avoid vanishing or exploding gradients and stabilize model training, a new Actor network update mechanism is proposed: gradient ascent is used to compute a target value for the Actor network, which is then trained with a mean-square error (MSE) loss. Second, strategy guidance regions are divided according to the angle features of both sides, and a distributed decision-making model based on five DDPG networks is built by assigning different reward-function weights to the regions; dynamically selecting and seamlessly switching among these reward-function weights as the situation changes improves the algorithm's decision-making ability. Simulation results show that, compared with DDPG and the twin delayed deep deterministic policy gradient (TD3) algorithm, the proposed algorithm achieves a higher success rate and higher decision-making efficiency when pursuing either a linearly escaping target or an intelligently escaping target.
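The Actor update mechanism described in the abstract can be illustrated with a minimal PyTorch sketch (not the authors' code): the critic's Q-value is improved by gradient ascent over the action to obtain a target action, and the Actor is then regressed toward that target with an MSE loss instead of back-propagating -Q through the critic. The network sizes, the number of ascent steps, and the ascent step size below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                                 nn.Linear(64, a_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def actor_mse_update(actor, critic, actor_opt, states,
                     ascent_steps=5, ascent_lr=0.1):
    # Step 1: starting from the actor's current output, improve the action
    # by gradient ascent on Q(s, a); the improved action is the target value.
    a = actor(states).detach().requires_grad_(True)
    for _ in range(ascent_steps):
        q = critic(states, a).sum()
        (grad,) = torch.autograd.grad(q, a)
        a = (a + ascent_lr * grad).detach().requires_grad_(True)
    target_action = a.detach()
    # Step 2: train the actor with an MSE loss toward the target action,
    # replacing the usual -Q objective so the actor gradient stays bounded.
    loss = nn.functional.mse_loss(actor(states), target_action)
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()
```

Because the MSE target is held fixed while the Actor trains, the Actor's gradient magnitude is bounded by the regression error, which matches the abstract's stated goal of avoiding vanishing or exploding gradients.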