Leilei Cui, Zhong-Ping Jiang. A Lyapunov characterization of robust policy optimization [J]. Control Theory and Technology, 2023, 21(3): 374–389.
A Lyapunov characterization of robust policy optimization
Leilei Cui1, Zhong-Ping Jiang1
(1 Department of Electrical and Computer Engineering, New York University, Brooklyn, NY 11201, USA)
Abstract:
In this paper, we study the robustness of policy optimization (in particular, the Gauss–Newton gradient descent algorithm, which is equivalent to policy iteration in reinforcement learning) subject to noise at each iteration. By invoking the concept of input-to-state stability and utilizing Lyapunov’s direct method, it is shown that, if the noise is sufficiently small, the policy iteration algorithm converges to a small neighborhood of the optimal solution even in the presence of noise at each iteration. Explicit expressions for the upper bound on the noise and the size of the neighborhood to which the policies ultimately converge are provided. Based on Willems’ fundamental lemma, a learning-based policy iteration algorithm is proposed. The persistent excitation condition can be readily guaranteed by checking the rank of the Hankel matrix associated with an exploration signal. The robustness of the learning-based policy iteration to measurement noise and unknown system disturbances is theoretically demonstrated via the input-to-state stability of the policy iteration. Several numerical simulations are conducted to demonstrate the efficacy of the proposed method.
Key words:  Policy optimization · Policy iteration (PI) · Input-to-state stability (ISS) · Lyapunov’s direct method
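As an illustrative sketch (not taken from the paper): for the linear-quadratic regulator, the Gauss–Newton policy iteration referred to in the abstract reduces to Kleinman's classical algorithm, which alternates a policy-evaluation step (a Lyapunov equation) with a policy-improvement step. The double-integrator system, cost weights, and initial gain below are assumed for illustration only.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Double-integrator example (illustrative choice, not from the paper).
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

# Any initial stabilizing gain works: A - B @ K must be Hurwitz.
K = np.array([[1.0, 2.0]])

for _ in range(10):
    Ac = A - B @ K
    # Policy evaluation: solve Ac^T P + P Ac = -(Q + K^T R K).
    P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
    # Policy improvement (the Gauss-Newton / Kleinman update).
    K = np.linalg.solve(R, B.T @ P)

# Compare against the optimal gain from the algebraic Riccati equation.
P_star = solve_continuous_are(A, B, Q, R)
K_star = np.linalg.solve(R, B.T @ P_star)
print(np.linalg.norm(K - K_star))
```

In the exact (noise-free) setting the iterates converge quadratically to the optimal gain; the ISS result summarized above concerns the case where each evaluation step is corrupted by noise, in which the iterates instead settle in a neighborhood of `K_star` whose size scales with the noise.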
DOI: https://doi.org/10.1007/s11768-023-00163-w
Funding: This work was supported in part by the National Science Foundation (Nos. ECCS-2210320, CNS-2148304).
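The persistent-excitation check mentioned in the abstract can be sketched as a rank test on a block-Hankel matrix of the exploration input, in the sense of Willems' fundamental lemma: an input of length T is persistently exciting of order L if its Hankel matrix with L block rows has full row rank. The random exploration signal, length, and order below are assumptions for illustration.

```python
import numpy as np

def block_hankel(u, L):
    """Build the Hankel matrix with L block rows from a signal u of shape (T, m)."""
    T, m = u.shape
    cols = T - L + 1
    H = np.zeros((L * m, cols))
    for i in range(L):
        # Row block i holds samples u[i], u[i+1], ..., u[i+cols-1].
        H[i * m:(i + 1) * m, :] = u[i:i + cols, :].T
    return H

rng = np.random.default_rng(0)
u = rng.standard_normal((40, 1))   # random scalar exploration signal (m = 1)
L = 5
H = block_hankel(u, L)
# Persistent excitation of order L <=> full row rank, i.e. rank = m * L.
print(np.linalg.matrix_rank(H) == L * u.shape[1])
```

A generic random signal passes this test with probability one, which is why the condition is "readily guaranteed" by a rank check on the recorded exploration data.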