DOI: https://doi.org/10.1007/s11768-023-00163-w

Funding: This work was supported in part by the National Science Foundation (Nos. ECCS-2210320, CNS-2148304).
A Lyapunov characterization of robust policy optimization
Leilei Cui1, Zhong-Ping Jiang1
(1 Department of Electrical and Computer Engineering, New York University, Brooklyn, NY 11201, USA)
Abstract:
In this paper, we study the robustness of policy optimization (in particular, the Gauss–Newton gradient descent algorithm, which is equivalent to policy iteration in reinforcement learning) subject to noise at each iteration. By invoking the concept of input-to-state stability and utilizing Lyapunov's direct method, it is shown that, if the noise is sufficiently small, the policy iteration algorithm converges to a small neighborhood of the optimal solution even in the presence of noise at each iteration. Explicit expressions for the upper bound on the noise and the size of the neighborhood to which the policies ultimately converge are provided. Based on Willems' fundamental lemma, a learning-based policy iteration algorithm is proposed. The persistent excitation condition can be readily guaranteed by checking the rank of the Hankel matrix associated with an exploration signal. The robustness of the learning-based policy iteration to measurement noise and unknown system disturbances is theoretically demonstrated via the input-to-state stability of the policy iteration. Several numerical simulations are conducted to demonstrate the efficacy of the proposed method.
Key words: Policy optimization · Policy iteration (PI) · Input-to-state stability (ISS) · Lyapunov's direct method
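The abstract notes that persistent excitation can be verified by a rank check on a data Hankel matrix. As a minimal illustrative sketch (not the paper's implementation; the signal dimensions, depth, and random exploration signal here are assumptions for the example), a signal is persistently exciting of order L when its depth-L Hankel matrix has full row rank:

```python
import numpy as np

def hankel_matrix(u, L):
    """Depth-L Hankel matrix of a signal u with shape (T, m).
    Column j stacks the window u[j], u[j+1], ..., u[j+L-1]."""
    T, m = u.shape
    cols = T - L + 1
    H = np.zeros((L * m, cols))
    for j in range(cols):
        H[:, j] = u[j:j + L].reshape(-1)
    return H

# Illustrative check: a random exploration signal of sufficient length
# is (with probability one) persistently exciting of order L, i.e. its
# Hankel matrix has full row rank m * L.
rng = np.random.default_rng(0)
m, L, T = 2, 4, 40                     # input dimension, order, signal length
u = rng.standard_normal((T, m))        # hypothetical exploration signal
H = hankel_matrix(u, L)
is_pe = np.linalg.matrix_rank(H) == m * L
```

Because the condition is a single rank computation on collected input data, it can be verified before running the learning-based policy iteration, rather than assumed.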