The Interplay Between Learning, Optimization, & Control II – Invited Special Session

Session Type: Lecture
Session Code: B2L-F
Location: Room 6
Date & Time: Thursday, March 23, 2023 (10:20-11:20)
Chairs: Guannan Qu, Na Li
Track: 12
Paper No.: 3035
Title: Revisiting the Linear-Programming Framework for Offline RL with General Function Approximation
Authors: Asuman Ozdaglar{1}, Sarath Pattathil{1}, Jiawei Zhang{1}, Kaiqing Zhang{2}
Abstract: Offline reinforcement learning (RL) concerns pursuing an optimal policy for sequential decision-making from a pre-collected dataset, without further interaction with the environment. Recent theoretical progress has focused on developing sample-efficient offline RL algorithms under various relaxed assumptions on data coverage and function approximators, especially to handle the case of excessively large state-action spaces. Among them, the framework based on the linear-programming (LP) reformulation of Markov decision processes has shown promise: it enables sample-efficient offline RL with function approximation under only partial data coverage and realizability assumptions on the function classes, with favorable computational efficiency. In this work, we revisit the LP framework for offline RL and advance the existing results in several aspects, relaxing certain assumptions and achieving optimal statistical rates in terms of sample size. Our key enabler is to introduce proper constraints in the reformulation, instead of the regularization used in the literature, in some cases together with careful choices of the function classes and initial state distributions. We hope our insights further motivate the study of the power of the LP framework, as well as the induced primal-dual reformulation and minimax optimization, in offline RL.
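
For context, the linear-programming reformulation of a discounted MDP that this line of work builds on can be written, in a generic form, as an optimization over state-action occupancy measures \(d(s,a)\) (the paper's exact constraints, regularization choices, and function classes may differ):

\[
\max_{d \ge 0} \;\; \sum_{s,a} d(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} d(s,a) \;=\; (1-\gamma)\,\mu_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s,
\]

whose dual minimizes \((1-\gamma)\,\mathbb{E}_{s \sim \mu_0}[V(s)]\) over value functions \(V\) subject to \(V(s) \ge r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot \mid s,a)}[V(s')]\) for all \((s,a)\). This primal-dual pair is what induces the minimax reformulation mentioned in the abstract.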

Paper No.: 3009
Title: Receding-Horizon Policy Gradient: A Generic Model-Free Learning Framework for Linear Quadratic Control and Estimation
Authors: Xiangyuan Zhang{1}, Tamer Başar{2}
Abstract: Policy gradient (PG) methods promise a generic data-driven framework for addressing continuous control tasks. Starting with the linear quadratic regulator (LQR), we now have a good understanding of the optimization landscape that facilitates the global convergence of PG methods in several state-feedback linear control benchmarks. Unfortunately, beyond state-feedback settings, naively applying PG methods forfeits most, if not all, of these favorable landscape properties, which hinders global convergence. Toward developing a generic framework for output-feedback control, we introduce the receding-horizon PG (RHPG) methodology and demonstrate its global convergence and sample complexity in solving LQR and Kalman filtering (KF), two fundamental benchmarks in modern control theory. Specifically, RHPG first approximates the infinite-horizon control problem by a finite-horizon formulation and then decomposes the finite-horizon problem into a sequence of one-step sub-problems using dynamic programming. RHPG then solves each sub-problem efficiently using model-free PG methods. To accommodate the computational errors that are inevitably incurred in solving these sub-problems, we establish a generalized principle of optimality that bounds the accumulated bias by controlling the inaccuracy in solving each sub-problem. Compared to the prior LQR literature, RHPG enjoys a matching sample complexity and does not require knowledge of a stabilizing initial policy. For the KF problem, RHPG is the first globally convergent PG method with a fine-grained sample complexity analysis, matching the sample complexity of solving LQR. Notably, RHPG can be applied to learn the optimal KF for open-loop unstable systems without requiring a stabilizing filter for initialization.
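
As a rough illustration of the receding-horizon decomposition described above, the toy sketch below runs a backward dynamic-programming sweep over a finite-horizon LQR instance and solves each one-step sub-problem with a zeroth-order (model-free) policy-gradient inner loop. All matrices, horizons, and hyperparameters are assumptions chosen for illustration; this is one plausible reading of the idea, not the authors' RHPG algorithm or implementation.

    import numpy as np

    # Toy finite-horizon LQR instance; all quantities below are illustrative
    # assumptions, not taken from the paper.
    rng = np.random.default_rng(0)
    n, m, N = 3, 2, 5                       # state dim, input dim, horizon
    A = 0.3 * rng.normal(size=(n, n))
    B = rng.normal(size=(n, m))
    Q, R = np.eye(n), np.eye(m)

    def cost_to_go(gains, x0):
        """Cost of running u_t = -K_t x_t from state x0 for the given tail of gains."""
        x, cost = x0, 0.0
        for K in gains:
            u = -K @ x
            cost += float(x @ Q @ x + u @ R @ u)
            x = A @ x + B @ u
        return cost + float(x @ Q @ x)      # terminal cost

    def learn_gain(gains, t, iters=300, lr=1e-4, radius=0.05, batch=10):
        """Zeroth-order (model-free) PG on the t-th gain, with later gains held fixed."""
        K = gains[t].copy()
        for _ in range(iters):
            grad = np.zeros_like(K)
            for _ in range(batch):
                U = rng.normal(size=K.shape)
                U /= np.linalg.norm(U)      # perturbation direction on the unit sphere
                x0 = rng.normal(size=n)     # random initial state for this rollout
                tail = [K + radius * U] + gains[t + 1:]
                grad += cost_to_go(tail, x0) * U / radius
            K -= lr * grad / batch
        return K

    # Backward sweep: learn one gain at a time, solving each one-step
    # sub-problem with model-free PG while keeping the later gains fixed.
    gains = [np.zeros((m, n)) for _ in range(N)]
    for t in reversed(range(N)):
        gains[t] = learn_gain(gains, t)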

Paper No.: 3076
Title: Online Switching Control with Stability and Regret Guarantees
Authors: Yingying Li{1}, James Preiss{1}, Na Li{2}, Yiheng Lin{1}, Adam Wierman{1}, Jeff Shamma{3}
Abstract: This paper considers online switching control with a finite pool of candidate controllers, an unknown dynamical system, and unknown cost functions. The candidate controllers may include unstabilizing policies. We require only that at least one candidate controller satisfies certain stability properties, without knowing which one it is. We design an online algorithm that guarantees finite-gain stability throughout its execution. We also provide a sublinear policy regret guarantee relative to the optimal stabilizing candidate controller. Lastly, we numerically test our algorithm on planar quadrotor flights and compare it with a classical switching control algorithm, falsification-based switching, and a classical multi-armed bandit algorithm, Exp3 with batches.
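
For illustration, the sketch below implements the classical falsification-based switching baseline mentioned in the abstract on an assumed toy double integrator: a controller is discarded once the observed state leaves a growing stability envelope, and control switches to the next surviving candidate in the pool. All dynamics, gains, and thresholds are hypothetical; this is not the paper's proposed algorithm or its quadrotor experiments.

    import numpy as np

    # Toy double integrator; the dynamics are "unknown" to the switching logic.
    rng = np.random.default_rng(1)
    A = np.array([[1.0, 0.1],
                  [0.0, 1.0]])
    B = np.array([[0.0],
                  [0.1]])

    # Finite candidate pool of static feedback gains; the learner does not know
    # in advance which candidate (if any) is stabilizing.
    candidates = [np.array([[-5.0, 3.0]]),   # destabilizing for this A, B
                  np.array([[0.0, 0.0]]),    # open loop (marginally unstable)
                  np.array([[8.0, 6.0]])]    # stabilizing for this A, B
    pool = list(range(len(candidates)))
    current = pool[0]

    x = np.array([1.0, 0.0])
    total_cost = 0.0
    for t in range(1, 500):
        K = candidates[current]
        u = -K @ x
        total_cost += float(x @ x + u @ u)
        x = A @ x + B @ u + 0.01 * rng.normal(size=2)
        # Falsification test: if the state leaves a growing stability envelope,
        # discard the current controller and switch to the next surviving one.
        if np.linalg.norm(x) > 5.0 + 0.1 * t and len(pool) > 1:
            pool.remove(current)
            current = pool[0]

    print(f"surviving candidate: {current}, accumulated cost: {total_cost:.1f}")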