# The Interplay Between Learning, Optimization, & Control I – Invited Special Session

Session Type: Lecture

Session Code: B1L-F

Location: Room 6

Date & Time: Thursday March 23, 2023 (09:00 - 10:00)

Chairs: Na Li, Guannan Qu

Track: 12

Paper No. | Paper Name | Authors | Abstract |
---|---|---|---|
3020 | Model-Free Nonlinear Feedback Optimization | Zhiyu He{1}, Jianping He{2}, Xinping Guan{2}, Saverio Bolognani{1}, Florian Dörfler{1} | Feedback optimization is a control paradigm that enables physical systems to autonomously reach efficient operating points. Its central idea is to interconnect optimization iterations in closed loop with the physical plant. Since iterative gradient-based methods are extensively used to achieve optimality, feedback optimization controllers typically require knowledge of the steady-state sensitivity of the plant, which may not be easily accessible in some applications. In contrast, in this paper we develop a model-free feedback controller for efficient steady-state operation of general dynamical systems. The proposed design consists of updating control inputs via gradient estimates constructed from evaluations of the nonconvex objective at the current input and at the measured output. We study the dynamic interconnection of the proposed iterative controller with a stable nonlinear discrete-time plant. For this setup, we characterize the optimality and the stability of the closed-loop behavior as functions of the problem dimension, the number of iterations, and the rate of convergence of the physical plant. To handle general constraints that affect multiple inputs, we enhance the controller with Frank-Wolfe-type updates. |
3103 | Collaborative Linear Bandits with Adversarial Agents: Near-Optimal Regret Bounds | Aritra Mitra, Arman Adibi, George J. Pappas, Hamed Hassani | A major challenge in modern distributed computing paradigms such as federated learning is robustness to adversarial agents. While several recent works have explored this theme for supervised learning, little is known about sequential decision-making problems (e.g., reinforcement learning and bandits) in this regard. Motivated by this gap, we consider a collaborative linear stochastic bandit setting where a fraction of the agents is adversarial and can act arbitrarily. This leads to the following tension: while collaboration can potentially reduce regret, it can also disrupt the process of learning due to adversaries. We provide a fundamental understanding of this tension by designing new algorithms that balance the exploration-exploitation trade-off via carefully constructed robust confidence intervals. When the fraction of adversaries is small, we show that our proposed algorithms provably benefit from collaboration, despite arbitrary adversarial behavior. We also provide an information-theoretic lower bound, revealing that our regret guarantees are minimax optimal in all relevant parameters. Notably, our work provides the first set of tight, near-optimal regret bounds for collaborative linear bandits with adversaries. |
3037 | Multi-Armed Bandit Learning on a Graph | Tianpeng Zhang{1}, Kasper Johansson{2}, Na Li{1} | The multi-armed bandit (MAB) problem is a simple yet powerful framework that has been extensively studied in the context of decision-making under uncertainty. In many real-world applications, such as robotics, selecting an arm corresponds to a physical action that constrains the choices of the next available arms (actions). Motivated by this, we study an extension of MAB called the graph bandit, where an agent travels over a graph to maximize the reward collected from different nodes. The graph defines the agent's freedom in selecting the next available nodes at each step. We assume the graph structure is fully known, but the reward distributions are unknown. Building upon an offline graph-based planning algorithm and the principle of optimism, we design a learning algorithm, G-UCB, that balances long-term exploration and exploitation. We show that our proposed algorithm achieves $O(\sqrt{|S|T\log T} + D|S|\log T)$ learning regret, where $|S|$ is the number of nodes and $D$ is the diameter of the graph, which matches the theoretical lower bound $\Omega(\sqrt{|S|T})$ up to logarithmic factors. To our knowledge, this result is among the first tight regret bounds in non-episodic, undiscounted learning problems with known deterministic transitions. Numerical experiments confirm that our algorithm outperforms several benchmarks. We also present a synthetic robotic application modeled under the graph bandit framework, where a robot moves on a network of rural/suburban locations to provide high-speed internet access using our proposed algorithm. |
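The model-free controller of paper 3020 updates inputs using gradient estimates built purely from objective evaluations, without a plant model. The sketch below illustrates that idea with a generic two-point zeroth-order estimator on a toy linear plant; the plant map, objective, and step sizes are all hypothetical choices for illustration, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def plant_steady_state(u):
    # Hypothetical stable plant: steady-state output y = h(u) (stand-in
    # for measurements of the real physical system).
    A = np.array([[0.5, 0.1], [0.0, 0.8]])
    return A @ u

def objective(u, y):
    # Cost evaluated at the current input and the measured output.
    return float(np.sum(u**2) + np.sum((y - 1.0) ** 2))

def model_free_step(u, eta=0.05, delta=0.01):
    """One controller iteration: estimate the gradient from two objective
    evaluations along a random direction (a generic zeroth-order
    estimator, shown for illustration)."""
    d = u.size
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)                        # random unit direction
    y_pert = plant_steady_state(u + delta * v)    # perturbed measurement
    y = plant_steady_state(u)                     # nominal measurement
    g_hat = d * (objective(u + delta * v, y_pert) - objective(u, y)) / delta * v
    return u - eta * g_hat                        # gradient-descent-style update

u = np.ones(2)
initial_cost = objective(u, plant_steady_state(u))
for _ in range(2000):
    u = model_free_step(u)
final_cost = objective(u, plant_steady_state(u))
```

In expectation the estimator `g_hat` recovers the true gradient, so the closed loop drives the cost down using only input-output evaluations, which is the point of the model-free design.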
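Paper 3103 filters the influence of adversarial agents through robust confidence intervals. A minimal way to see why robustness against a small adversarial fraction is possible is a coordinate-wise trimmed mean, a standard robust aggregator shown here purely for intuition; the paper's construction is more involved.

```python
def trimmed_mean(values, trim):
    """Drop the `trim` smallest and `trim` largest reports, then average
    the rest. With at most `trim` adversaries per side, every surviving
    value is bracketed by honest reports."""
    assert len(values) > 2 * trim, "need more honest agents than trimmed slots"
    ordered = sorted(values)
    core = ordered[trim:len(ordered) - trim]
    return sum(core) / len(core)

# Five honest agents report estimates near 0.5; two adversaries report
# arbitrary extreme values. Trimming two from each side removes them.
reports = [0.50, 0.49, 0.51, 0.52, 0.48, 100.0, -100.0]
robust_estimate = trimmed_mean(reports, trim=2)
```

However extreme the adversarial reports, the aggregate stays close to the honest agents' estimates, which is the kind of guarantee the robust confidence intervals build on.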
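The G-UCB algorithm of paper 3037 rests on the principle of optimism: act greedily with respect to an upper confidence bound on each node's reward. The following sketch shows the classical UCB1 rule on unconstrained arms to illustrate that principle only; it omits the graph travel constraints and planning step that distinguish G-UCB, and the arm means and noise level are made up.

```python
import math
import random

random.seed(1)

def ucb1(means, horizon):
    """Plain UCB1: pull the arm maximizing empirical mean plus a
    confidence radius (optimism in the face of uncertainty)."""
    n_arms = len(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                    # pull each arm once to start
        else:
            # optimistic index: empirical mean + confidence radius
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = means[arm] + random.gauss(0, 0.1)  # Gaussian-perturbed reward
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total

counts, total = ucb1([0.2, 0.5, 0.8], horizon=5000)
```

As the confidence radii shrink, pulls concentrate on the best arm while suboptimal arms receive only logarithmically many pulls, mirroring the $\log T$ terms in the regret bound quoted in the abstract.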