Recent Advancements in Optimization Methods for Machine Learning II – Invited Special Session

Session Type: Lecture
Session Code: B2L-D
Location: Room 4
Date & Time: Thursday, March 23, 2023 (10:20-11:20)
Chair: Nicolas Loizou
Track: 12
Papers:
Paper No.: 3067
Paper Name: Single-Call Stochastic Extragradient Methods: Improved Analysis Under Weaker Conditions
Authors: Sayantan Choudhury{1}, Eduard Gorbunov{2}, Nicolas Loizou{1}
Abstract: Single-call stochastic extragradient methods, such as the stochastic past extragradient (SPEG) and stochastic optimistic gradient (SOG) methods, have gained considerable interest in recent years and are among the most efficient algorithms for solving the large-scale min-max optimization and variational inequality problems (VIPs) that appear in various machine learning tasks. However, despite their popularity, current convergence analyses of SPEG and SOG require a bounded-variance assumption. In addition, several important questions regarding the convergence properties of these methods remain open, including mini-batching, efficient step-size selection, and convergence guarantees under different sampling strategies. In this work, we address these questions and provide convergence guarantees for two large classes of structured non-monotone VIPs: (i) quasi-strongly monotone problems (a generalization of strongly monotone problems) and (ii) Minty variational inequalities (a generalization of monotone problems). We introduce the expected residual condition, explain its benefits, and show that it is a strictly weaker assumption than previously used growth conditions, expected co-coercivity, or bounded-variance assumptions. Equipped with this condition, we provide theoretical guarantees for the convergence of single-call extragradient methods under different step-size selections, including constant, decreasing, and step-size-switching rules. Furthermore, our convergence analysis holds under the arbitrary sampling paradigm, which includes importance sampling and various mini-batching strategies as special cases.
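As context for the methods analyzed in this abstract, the sketch below implements the textbook single-call stochastic past extragradient (SPEG) update on a toy strongly monotone min-max problem with a noisy operator oracle. The problem, dimensions, step size, and noise level are illustrative assumptions, not the paper's setting or experiments.

```python
import numpy as np

# Minimal sketch (assumed toy setup): the single-call stochastic past
# extragradient (SPEG) update on the strongly monotone min-max problem
#     min_x max_y  (mu/2)||x||^2 + x^T A y - (mu/2)||y||^2,
# whose operator is F(x, y) = (mu*x + A y, -A^T x + mu*y), solution (0, 0).
rng = np.random.default_rng(0)
d, mu, sigma = 10, 0.1, 0.05
A = rng.standard_normal((d, d))
A /= np.linalg.norm(A, 2)          # normalize so the operator is well conditioned

def F(z):
    """One noisy (stochastic) evaluation of the operator."""
    x, y = z[:d], z[d:]
    g = np.concatenate([mu * x + A @ y, -A.T @ x + mu * y])
    return g + sigma * rng.standard_normal(2 * d)

def speg(z0, gamma=0.1, iters=3000):
    """SPEG: a single new operator call per iteration.
         z_bar_k = z_k - gamma * F(z_bar_{k-1})   # reuses the stored value
         z_{k+1} = z_k - gamma * F(z_bar_k)       # the only new call
    """
    z = z0.copy()
    g_prev = F(z)                  # one extra call to initialize the memory
    for _ in range(iters):
        z_bar = z - gamma * g_prev
        g_prev = F(z_bar)          # single oracle call of this iteration
        z = z - gamma * g_prev
    return z

z = speg(rng.standard_normal(2 * d))
print("distance to the solution (origin):", np.linalg.norm(z))
```

With a constant step size and a noisy oracle, the last iterate settles in a noise-dominated neighborhood of the solution, which is the regime the abstract's constant and step-size-switching rules address.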
Paper No.: 3078
Paper Name: Primal-Dual Proximal Optimization Methods with Bregman Distances
Authors: Xin Jiang{1}, Lieven Vandenberghe{2}
Abstract: We discuss Bregman-distance extensions of the primal-dual three-operator (PD3O) and Condat-Vu proximal algorithms. When used with standard proximal operators, these algorithms include several important methods as special cases. Extensions to generalized Bregman distances are attractive if the complexity per iteration can be reduced by matching the Bregman distance to the structure in the problem. As an example, we apply the proposed method to the centering problem in sparse semidefinite programming. The logarithmic barrier function for the cone of positive semidefinite completable sparse matrices is used as a distance-generating kernel. For this distance, the complexity of evaluating the Bregman proximal operator is shown to be roughly proportional to the cost of a sparse Cholesky factorization. This is much cheaper than the standard proximal operator with Euclidean distances, which requires an eigenvalue decomposition.
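As background for the terminology in this abstract, the display below recalls the standard definition of the Bregman distance generated by a convex kernel and the associated Bregman proximal operator; it is general notation, not a statement of the paper's specific algorithm.

```latex
% Bregman distance generated by a convex kernel \varphi
% (the Euclidean kernel \varphi = \tfrac12\|\cdot\|_2^2 gives d(x,y) = \tfrac12\|x-y\|_2^2):
\[
  d_\varphi(x, y) = \varphi(x) - \varphi(y) - \langle \nabla\varphi(y),\, x - y \rangle .
\]
% Bregman proximal operator of a convex function f with step t > 0
% (it reduces to the usual Euclidean proximal operator for the quadratic kernel):
\[
  \operatorname{prox}_{t f}^{\varphi}(y)
    = \operatorname*{arg\,min}_{x} \bigl\{ t f(x) + d_\varphi(x, y) \bigr\} .
\]
```

The efficiency argument in the abstract rests on choosing the kernel φ (here, a logarithmic barrier) so that this argmin is cheap to evaluate for the problem's structure.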
Paper No.: 3091
Paper Name: Feature Learning in Two-Layer Neural Networks
Authors: Murat Erdogdu
Abstract: We study the effect of optimization on the first-layer parameters of a width-N two-layer neural network. In an idealized student-teacher setting, we show that the first gradient update contains a rank-1 spike, which results in alignment between the first-layer weights and the linear component of the teacher model. When the learning rate is small, based on a recently established Stein-CLT in the feature space, we prove a Gaussian equivalence property for the trained feature map; this allows us to show that the learned kernel improves upon the initial random-features model but cannot defeat the best linear model on the input. In contrast, for a sufficiently large learning rate, we prove that the trained features can go beyond this linear regime and outperform a wide range of random-features models.
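As a rough illustration of the rank-1 spike described in this abstract, the toy sketch below forms the first full-batch gradient with respect to the first-layer weights of a two-layer ReLU network in a simple student-teacher setup and inspects its singular values. The dimensions, scalings, activation, and teacher model are illustrative assumptions and do not reproduce the paper's precise asymptotic setting.

```python
import numpy as np

# Toy sketch (assumed setup): first full-batch gradient on the first-layer
# weights W of a two-layer ReLU network f(x) = a^T relu(W x) in a
# student-teacher setting, and the singular values of that update.
rng = np.random.default_rng(1)
d, N, n = 50, 100, 5000            # input dim, width, number of samples

X = rng.standard_normal((n, d))
beta = rng.standard_normal(d) / np.sqrt(d)        # teacher's linear direction
y = X @ beta + 0.5 * np.tanh(X @ beta)            # teacher with a linear component

W = rng.standard_normal((N, d)) / np.sqrt(d)      # student first layer at init
a = rng.choice([-1.0, 1.0], size=N) / np.sqrt(N)  # fixed second layer

pre = X @ W.T                                     # (n, N) pre-activations
pred = np.maximum(pre, 0.0) @ a                   # student outputs at init

# Gradient of the squared loss (1/(2n)) * sum_i (f(x_i) - y_i)^2 w.r.t. W:
# row j is (1/n) * sum_i resid_i * a_j * 1{w_j^T x_i > 0} * x_i.
resid = pred - y
G = ((resid[:, None] * (pre > 0) * a[None, :]).T @ X) / n   # (N, d)

u, s, vt = np.linalg.svd(G)
print("leading singular values of the first update:", np.round(s[:5], 3))
print("alignment of top right singular vector with the teacher direction:",
      round(abs(vt[0] @ beta) / np.linalg.norm(beta), 3))
# A single dominant singular value reflects the (approximately) rank-1 spike;
# its right singular vector tends to align with the teacher's linear component.
```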