Computer Integrated Manufacturing Systems, 2025, Vol. 31, Issue (10): 3762-3772. DOI: 10.13196/j.cims.2023.0347


Reward-guided conservative advantage learning path planning method based on RGCAL-TD3

WANG Keping1, LI Hongtao1+, WANG Tian2, YANG Yi1

  1. School of Electrical Engineering and Automation, Henan Polytechnic University
    2. Institute of Artificial Intelligence, Beihang University
  • Online: 2025-10-31  Published: 2025-11-19
  • Supported by:
    Project supported by the National Natural Science Foundation of China (No. 61972016).

  • About the authors:
    WANG Keping (1976-), female, born in Zhangjiakou, Hebei; associate professor, Ph.D.; research interests: reinforcement learning, image clarity enhancement, etc. E-mail: wangkp@hpu.edu.cn;

    +LI Hongtao (1998-), male, born in Nanyang, Henan; master's candidate; research interests: reinforcement learning, path planning, etc.; corresponding author. E-mail: leesincere@163.com;

    WANG Tian (1987-), male, born in Xiaogan, Hubei; associate professor, Ph.D.; research interests: computer vision, reinforcement learning, pattern recognition, etc. E-mail: wangtian@buaa.edu.cn;

    YANG Yi (1980-), male, born in Lichuan, Hubei; associate professor, Ph.D.; research interests: reinforcement learning, path planning, intelligent control, etc. E-mail: yangyi@hpu.edu.cn.

Abstract: To address the low sample efficiency of existing deep-reinforcement-learning-based path planning methods in dynamic scenarios, a Reward Guided Conservative Advantage Learning (RGCAL) approach was proposed on the basis of the Twin Delayed Deep Deterministic policy gradient algorithm (TD3). Owing to the partially observable nature of dynamic scenes, the path planning task was modeled as a Partially Observable Markov Decision Process (POMDP). Rewards were introduced into the conservative advantage learning framework, the advantage operator was redefined accordingly and incorporated into the TD-error update, thereby strengthening the ability to learn a non-linear action gap from the reward values stored in the replayed experience. Multiple dynamic experimental scenarios were designed on the Gazebo platform for comparative experiments against mainstream deep reinforcement learning algorithms. Simulation results demonstrated that the proposed approach achieved higher sample efficiency than the compared algorithms and showed overall advantages in execution time, number of movement steps and navigation success rate. Real-world testing further validated the feasibility and effectiveness of the proposed approach.
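For concreteness, the sketch below shows one plausible way such a reward-guided advantage correction could enter the TD3 critic target: the standard advantage-learning operator subtracts a term proportional to V(s) - Q(s, a) from the Bellman target to widen the action gap, and here that term is weighted by a reward-dependent factor. The abstract does not give the paper's exact operator, so the weight g(r), the function names and the hyperparameter values below are illustrative assumptions rather than the authors' implementation.

import torch

def td3_rgcal_target(critic1_t, critic2_t, actor_t, batch, gamma=0.99,
                     alpha=0.6, policy_noise=0.2, noise_clip=0.5, max_action=1.0):
    """Hypothetical TD3 critic target with a reward-guided advantage-learning term.

    The correction subtracts g(r) * (V(s) - Q(s, a)) from the clipped double-Q
    Bellman target; the form g(r) = alpha * sigmoid(r) is assumed only for illustration.
    """
    s, a, r, s_next, done = batch  # replayed state, action, reward, next state, done flag
    with torch.no_grad():
        # Target policy smoothing, as in standard TD3
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a_next = (actor_t(s_next) + noise).clamp(-max_action, max_action)

        # Clipped double-Q Bellman target
        q_next = torch.min(critic1_t(s_next, a_next), critic2_t(s_next, a_next))
        bellman = r + gamma * (1.0 - done) * q_next

        # Advantage term: V(s) approximated by the target critics at the target policy action
        q_sa = torch.min(critic1_t(s, a), critic2_t(s, a))
        v_s = torch.min(critic1_t(s, actor_t(s)), critic2_t(s, actor_t(s)))

        # Reward-guided weight (assumed form): larger rewards enforce a larger action gap
        g_r = alpha * torch.sigmoid(r)

        target = bellman - g_r * (v_s - q_sa)
    return target

if __name__ == "__main__":
    # Toy smoke test with stand-in networks (illustration only)
    torch.manual_seed(0)
    n, s_dim, a_dim = 8, 4, 2
    critic = lambda s, a: (torch.cat([s, a], dim=1) ** 2).sum(dim=1, keepdim=True)
    actor = lambda s: torch.tanh(s[:, :a_dim])
    batch = (torch.randn(n, s_dim), torch.randn(n, a_dim).clamp(-1, 1),
             torch.randn(n, 1), torch.randn(n, s_dim), torch.zeros(n, 1))
    print(td3_rgcal_target(critic, critic, actor, batch).shape)  # torch.Size([8, 1])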

Key words: dynamic scenarios, path planning, deep reinforcement learning, reward-guided conservative advantage learning, action gap
