Foundational Advances in Reinforcement Learning I – Invited Special Session

Session Type: Lecture
Session Code: B4L-E
Location: Room 5
Date & Time: Thursday March 23, 2023 (14:00-15:00)
Chair: Alec Koppel
Track: 12
Papers:
Paper No.: 3038
Title: Learning in Low-Rank MDPs with Density Features
Authors: Audrey Huang, Jinglin Chen, Nan Jiang
Abstract: In online reinforcement learning (RL) with large state spaces, MDPs with low-rank transitions (that is, the transition matrix can be factored into the product of a left and a right matrix) are a highly representative structure that enables tractable exploration. When given to the learner, the left matrix enables expressive function approximation for value-based learning, and this setting has been studied extensively (e.g., in linear MDPs). Similarly, the right matrix induces powerful models for state-occupancy densities. However, using such density features to learn in low-rank MDPs has never been studied (to the best of our knowledge), and this setting has interesting connections to leveraging the power of generative models in RL. In this work, we initiate the study of learning in low-rank MDPs with density features. Our algorithm performs reward-free learning and builds an exploratory distribution in a level-by-level manner. It uses the density features for off-policy estimation of the policies' state distributions, and constructs the exploratory data by choosing the barycentric spanner of these distributions. From an analytical point of view, the additive error of distribution estimation is largely incompatible with the multiplicative definition of data coverage (e.g., concentrability). Especially in the absence of strong assumptions such as reachability, this incompatibility may lead to exponential or even infinite errors under standard analysis strategies, which we overcome via novel technical tools.
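
A minimal sketch of the barycentric-spanner selection step mentioned in the abstract, assuming a NumPy implementation of the classic determinant-swap procedure; the function name, the approximation factor C, and the toy Dirichlet "state distributions" are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np


def barycentric_spanner(vectors, C=2.0):
    """Return row indices of a C-approximate barycentric spanner.

    vectors: (n, d) array whose rows span R^d. Maintains d candidate rows
    and swaps in any row that grows |det| by more than a factor of C.
    """
    V = np.asarray(vectors, dtype=float)
    n, d = V.shape
    idx = list(range(d))          # start from the first d rows
    basis = V[idx].copy()

    improved = True
    while improved:
        improved = False
        for j in range(d):
            for i in range(n):
                candidate = basis.copy()
                candidate[j] = V[i]
                if abs(np.linalg.det(candidate)) > C * abs(np.linalg.det(basis)):
                    basis, idx[j] = candidate, i
                    improved = True
    return idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 20 toy state distributions over 4 states, standing in for the
    # estimated state-occupancy distributions of candidate policies.
    dists = rng.dirichlet(np.ones(4), size=20)
    print(barycentric_spanner(dists))
```

Each accepted swap multiplies |det| by more than C, so the loop terminates after finitely many swaps; the returned indices pick out the distributions used to form the exploratory data.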
Paper No.: 3073
Title: Adaptive Learning and Robustness in Stochastic Incentive Designs
Authors: Sina Sanjari{1}, Subhonmesh Bose{2}, Tamer Başar{2}
Abstract: Stochastic incentive designs entail hierarchical stochastic decision-making between a leader and a single (or multiple) follower(s) with possibly different goals. In this presentation we consider a stochastic Stackelberg game with a single follower, where the leader commits to a strategy that takes into account the follower's optimum response. We consider in particular the case where the leader seeks to craft a "soft" policy to incentivize the follower to behave in her best interest; by soft, we mean a policy that varies smoothly with the follower's action. Such problems arise in different contexts, such as designing tax codes and imposing environmental regulations on corporations. Each player may have access to private information that is not shared with the other. These dynamic games with decentralized information structures have been well studied under the assumption that the leader has access to the follower's observations, actions, and cost model; not having access to one of these can result in performance loss for the leader. This talk is on incentive design problems where this assumption is relaxed. In the first part of the talk, we consider a game where the leader has access to the follower's action through a random monitoring channel and learns about the follower's observations through a follower-designed signal. In this setup, we establish the existence of a signaling-based incentive equilibrium strategy for the leader that induces honest reporting and the desired response from the follower. We then discuss the setup where the leader's knowledge about the follower's cost and the distributions of cost-relevant random variables is inaccurate, and we establish the existence of a robust incentive equilibrium strategy that bounds the performance loss due to such inaccurate knowledge. In the second part of the talk, we consider the setup where the leader does not know some of the parameters that characterize the follower's cost structure. Here, the goal is to design an adaptive incentive policy that enables the leader to learn the unknown parameters through repeated interactions with the follower. Specifically, we study a Thompson sampling approach, where the leader maintains a belief over the unknown parameters, samples from that belief to design incentive policies, and refines her belief using the responses from the follower.
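
A minimal sketch of the Thompson sampling loop described in the abstract, under assumed simplifications: a scalar follower cost parameter, a quadratic follower cost whose best response is the parameter plus the incentive, Gaussian observation noise, and a conjugate Gaussian belief update. None of these modeling choices come from the abstract; they only illustrate the sample-design-observe-update cycle.

```python
import numpy as np

rng = np.random.default_rng(1)

theta_true = 2.0   # unknown follower parameter (assumed scalar)
noise_std = 0.3    # observation noise on the follower's action
target = 5.0       # action the leader wants to induce

mu, var = 0.0, 10.0  # leader's Gaussian belief over theta (prior)

for t in range(50):
    # 1) sample a parameter from the current belief
    theta_sample = rng.normal(mu, np.sqrt(var))
    # 2) design the incentive as if the sampled parameter were true
    incentive = target - theta_sample
    # 3) follower best-responds: argmin_a 0.5*a^2 - (theta + incentive)*a, plus noise
    action = theta_true + incentive + rng.normal(0.0, noise_std)
    # 4) refine the belief with the implied noisy observation of theta
    obs = action - incentive
    post_var = 1.0 / (1.0 / var + 1.0 / noise_std**2)
    mu = post_var * (mu / var + obs / noise_std**2)
    var = post_var

print(f"estimated theta: mean={mu:.3f}, std={var**0.5:.3f} (true value {theta_true})")
```

As the belief concentrates around the true parameter, the sampled incentives converge to the one that steers the follower's response to the leader's target.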
Paper No.: 3077
Title: Efficient Exploration in Model-Based Reinforcement Learning
Authors: Souradip Chakraborty{2}, Amrit Singh Bedi{2}, Alec Koppel{1}, Furong Huang{2}
Abstract: To balance exploration and computational efficiency in model-based reinforcement learning (MBRL), we propose an augmentation of posterior sampling that optimizes the ratio of the value-function sub-optimality to the distributional distance between the transition model induced by a model-based RL algorithm and the one induced by the optimal policy. Distributional distance is quantified in terms of an integral probability metric (IPM), which can be computed in closed form with the kernelized Stein discrepancy (KSD) under suitable conditions. The merit of this definition is that it permits us to establish a Bayesian regret bound for the tabular setting that is independent of the prior and improves upon previous dependencies on the cardinality of the state and action spaces, which is the core contribution of this work. Experimentally, we show the effectiveness of the proposed approach on sparse environments, where MBRL without exploration can lead to spurious behavior.
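
A minimal sketch of the kernelized Stein discrepancy referenced in the abstract, assuming an RBF kernel and a standard-normal target whose score is s(x) = -x; the V-statistic estimator, function names, and toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def ksd_rbf(samples, score_fn, bandwidth=1.0):
    """V-statistic estimate of KSD^2 between a sample and a target with score score_fn."""
    X = np.asarray(samples, dtype=float)
    n, d = X.shape
    S = np.apply_along_axis(score_fn, 1, X)        # (n, d) target scores at the samples
    h2 = bandwidth ** 2

    diff = X[:, None, :] - X[None, :, :]           # pairwise x_i - x_j, shape (n, n, d)
    sqdist = np.sum(diff ** 2, axis=-1)            # squared distances, shape (n, n)
    K = np.exp(-sqdist / (2.0 * h2))               # RBF kernel matrix

    term1 = (S @ S.T) * K                          # s(x)^T s(y) k(x, y)
    grad_y_k = K[..., None] * diff / h2            # gradient of k in its second argument
    term2 = np.einsum("id,ijd->ij", S, grad_y_k)   # s(x)^T grad_y k(x, y)
    term3 = np.einsum("jd,ijd->ij", S, -grad_y_k)  # s(y)^T grad_x k(x, y)
    term4 = K * (d / h2 - sqdist / h2**2)          # trace(grad_x grad_y k(x, y))

    return np.mean(term1 + term2 + term3 + term4)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    close = rng.normal(0.0, 1.0, size=(300, 2))    # roughly matches the target
    far = rng.normal(1.5, 1.0, size=(300, 2))      # shifted away from the target
    score = lambda x: -x                            # score of a standard normal target
    print(ksd_rbf(close, score), ksd_rbf(far, score))  # the second value should be larger
```

Because the KSD has this closed form in the sample points and the target's score, it can stand in for the IPM without requiring samples from the target itself.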