Foundational Advances in Reinforcement Learning II – Invited Special Session


Session Type: Lecture
Session Code: B5L-E
Location: Room 5
Date & Time: Thursday, March 23, 2023 (15:20–16:20)
Chair: Alec Koppel
Track: 12
Papers:
Paper ID: 3089
Title: Receding Horizon Policy Gradient for Zero-Sum Mean-Field Type Games
Authors: Muhammad Aneeq Uz Zaman{3}, Mathieu Laurière{2}, Alec Koppel{1}, Tamer Başar{4}
Abstract: In this paper, we propose a Receding Horizon Policy Gradient (RHPG) algorithm for Linear-Quadratic Zero-Sum Mean-Field Type Games (ZS-MFTG) with discounted utilities over an infinite horizon. In a ZS-MFTG, two competing players influence a large number of agents to achieve their respective conflicting objectives; the agents themselves are assumed to be non-decision-making. Our main focus is the design of the RHPG algorithm, which does not require the initial controller estimate to be stabilizing. We prove finite-sample convergence bounds for RHPG with access to (1) a cost oracle standard in the literature, and (2) empirical finite-rollout costs. We also conduct a numerical study of the efficacy of RHPG under both cost oracles.
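The receding-horizon mechanism described in the abstract above — optimizing each stage's gain backward from the terminal stage, so no stabilizing initial controller is needed — can be sketched on a toy scalar LQR problem. This is a hypothetical single-player simplification, not the paper's mean-field game setting, and all constants are illustrative:

```python
# Toy scalar LQR sketch of the receding-horizon policy gradient idea:
# dynamics x' = A x + B u, stage cost Q x^2 + R u^2, feedback u = -K x.
A, B, Q, R = 1.2, 1.0, 1.0, 1.0   # open-loop unstable: |A| > 1
H = 20                            # horizon length

P = Q          # terminal cost-to-go weight
K = 0.0
for _ in range(H):
    K = 0.0    # arbitrary initialization, not necessarily stabilizing
    for _ in range(300):
        # Exact per-stage policy gradient (the "cost oracle" setting):
        # d/dK [ Q + R K^2 + (A - B K)^2 P ]
        grad = 2 * R * K - 2 * B * (A - B * K) * P
        K -= 0.05 * grad
    # Cost-to-go recursion for the next (earlier) stage
    P = Q + R * K**2 + (A - B * K)**2 * P
```

Because each per-stage objective is quadratic in the gain, the inner gradient descent converges from the arbitrary initialization K = 0, even though the open-loop system is unstable.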
Paper ID: 3104
Title: Safe Control of Partially Unknown Systems with Uncertainty Dependent Constraints
Authors: Jafar Abbaszadeh Chekan, Cedric Langbort
Abstract: The problem of safely learning and controlling a dynamical system - i.e., of stabilizing an originally (partially) unknown system while ensuring that it does not leave a prescribed 'safe set' - has recently received tremendous attention in the controls community. Further complexities arise, however, when the structure of the safe set itself depends on the unknown part of the system's dynamics. In particular, a popular approach based on control Lyapunov functions (CLFs), control barrier functions (CBFs), and Gaussian processes (to build a confidence set around the unknown term), which has proved successful in the known-safe-set setting, becomes inefficient as-is, due to the introduction of higher-order terms that must be estimated and bounded with high probability using only system state measurements. In this paper, we build on the recent literature on GPs and reproducing kernels to perform this latter task, and show how to correspondingly modify the CLF-CBF-based approach to obtain safety guarantees. Namely, we derive exponential CLF and second-relative-order exponential CBF constraints whose satisfaction guarantees stability and forward invariance of the partially unknown safe set with high probability. Since verifying these conditions over the continuous domain is intractable, we discretize the state space and use Lipschitz continuity properties of the dynamics to derive equivalent CLF and CBF certificates in the discrete state space. Finally, we present an algorithm for control design using the derived certificates.
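The core CBF mechanism the abstract above builds on — enforcing a barrier constraint robustified by a high-probability bound on the unknown dynamics term — can be sketched as a minimal safety filter. This is a hypothetical scalar example with an assumed known drift f(x) = 0.5x, input gain g(x) = 1, and a margin beta_sigma standing in for the GP confidence bound; the paper's exponential, second-relative-order conditions are more involved:

```python
def cbf_filter(x, u_nom, beta_sigma, alpha=2.0):
    """Minimally modify u_nom so that
         Lf_h(x) + Lg_h(x) * u + alpha * h(x) >= beta_sigma,
    where beta_sigma is a high-probability bound on the unknown term."""
    h = 1.0 - x**2                 # safe set {x : h(x) >= 0}, i.e. |x| <= 1
    Lf_h = -2.0 * x * (0.5 * x)    # dh/dx times the known drift f(x) = 0.5 x
    Lg_h = -2.0 * x                # dh/dx times the input gain g(x) = 1
    slack = Lf_h + Lg_h * u_nom + alpha * h - beta_sigma
    if slack >= 0.0 or abs(Lg_h) < 1e-9:
        return u_nom               # nominal input already satisfies the CBF
    # Closed-form min-norm correction (scalar input): shift u just enough
    # to make the robustified constraint active.
    return u_nom - slack / Lg_h
```

Near the boundary of the safe set, the filter overrides an outward-pushing nominal input; well inside the set, the nominal input passes through unchanged.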
Paper ID: 3198
Title: Information-Directed Policy Search in Sparse-Reward Settings via the Occupancy Information Ratio
Authors: Wesley Suttle{3}, Alec Koppel{1}, Ji Liu{2}
Abstract: We examine a new measure of the exploration/exploitation trade-off in reinforcement learning (RL) called the occupancy information ratio (OIR). We derive the Information-Directed Actor-Critic (IDAC) algorithm for solving the OIR problem, provide an overview of the rich theory underlying IDAC and related OIR policy gradient methods, and experimentally investigate the advantages of such methods. The central contribution of this paper is empirical evidence that, due to the form of the OIR objective, IDAC outperforms vanilla RL methods in sparse-reward environments.
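The empirical quantities behind an occupancy-information-style objective can be sketched on a toy sparse-reward chain MDP. The environment, the policy, and the particular ratio computed below are all illustrative assumptions; the exact OIR definition is the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 5

def rollout(policy, T=1000):
    """Sample one trajectory on a chain MDP; policy[s] = P(move right).
    Returns the empirical state-occupancy measure and average reward."""
    s, visits, total_reward = 0, np.zeros(n_states), 0.0
    for _ in range(T):
        visits[s] += 1
        right = rng.random() < policy[s]
        s = min(s + 1, n_states - 1) if right else max(s - 1, 0)
        total_reward += 1.0 if s == n_states - 1 else 0.0  # sparse reward
    return visits / T, total_reward / T

def occupancy_entropy(d):
    """Shannon entropy of the empirical state-occupancy measure."""
    d = d[d > 0]
    return -np.sum(d * np.log(d))

d, avg_r = rollout(np.full(n_states, 0.9))
# One plausible cost-to-entropy ratio (illustrative only): policies that
# earn reward while maintaining spread-out occupancy score lower.
oir = (1.0 - avg_r) / occupancy_entropy(d)
```

The occupancy-entropy term in the denominator is what rewards exploration: in sparse-reward settings, a policy can improve this kind of ratio by visiting diverse states even before any reward signal is found.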