- Improved Analysis of Score-based Generative Modeling: User-Friendly Bounds under Minimal Smoothness Assumptions We give an improved theoretical analysis of score-based generative modeling. Under a score estimate with small L^2 error (averaged across timesteps), we provide efficient convergence guarantees for any data distribution with second-order moment, by either employing early stopping or assuming smoothness condition on the score function of the data distribution. Our result does not rely on any log-concavity or functional inequality assumption and has a logarithmic dependence on the smoothness. In particular, we show that under only a finite second moment condition, approximating the following in reverse KL divergence in epsilon-accuracy can be done in tilde Oleft(d log (1/delta){epsilon}right) steps: 1) the variance-delta Gaussian perturbation of any data distribution; 2) data distributions with 1/delta-smooth score functions. Our analysis also provides a quantitative comparison between different discrete approximations and may guide the choice of discretization points in practice. 3 authors · Nov 3, 2022
- Towards Gradient Free and Projection Free Stochastic Optimization This paper focuses on the problem of constrained stochastic optimization. A zeroth order Frank-Wolfe algorithm is proposed, which in addition to the projection-free nature of the vanilla Frank-Wolfe algorithm makes it gradient free. Under convexity and smoothness assumption, we show that the proposed algorithm converges to the optimal objective function at a rate Oleft(1/T^{1/3}right), where T denotes the iteration count. In particular, the primal sub-optimality gap is shown to have a dimension dependence of Oleft(d^{1/3}right), which is the best known dimension dependence among all zeroth order optimization algorithms with one directional derivative per iteration. For non-convex functions, we obtain the Frank-Wolfe gap to be Oleft(d^{1/3}T^{-1/4}right). Experiments on black-box optimization setups demonstrate the efficacy of the proposed algorithm. 3 authors · Oct 7, 2018
- On User-Level Private Convex Optimization We introduce a new mechanism for stochastic convex optimization (SCO) with user-level differential privacy guarantees. The convergence rates of this mechanism are similar to those in the prior work of Levy et al. (2021); Narayanan et al. (2022), but with two important improvements. Our mechanism does not require any smoothness assumptions on the loss. Furthermore, our bounds are also the first where the minimum number of users needed for user-level privacy has no dependence on the dimension and only a logarithmic dependence on the desired excess error. The main idea underlying the new mechanism is to show that the optimizers of strongly convex losses have low local deletion sensitivity, along with an output perturbation method for functions with low local deletion sensitivity, which could be of independent interest. 6 authors · May 8, 2023
1 Time-varying Signals Recovery via Graph Neural Networks The recovery of time-varying graph signals is a fundamental problem with numerous applications in sensor networks and forecasting in time series. Effectively capturing the spatio-temporal information in these signals is essential for the downstream tasks. Previous studies have used the smoothness of the temporal differences of such graph signals as an initial assumption. Nevertheless, this smoothness assumption could result in a degradation of performance in the corresponding application when the prior does not hold. In this work, we relax the requirement of this hypothesis by including a learning module. We propose a Time Graph Neural Network (TimeGNN) for the recovery of time-varying graph signals. Our algorithm uses an encoder-decoder architecture with a specialized loss composed of a mean squared error function and a Sobolev smoothness operator.TimeGNN shows competitive performance against previous methods in real datasets. 6 authors · Feb 22, 2023
- Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity Adam outperforms SGD when training language models. Yet this advantage is not well-understood theoretically -- previous convergence analysis for Adam and SGD mainly focuses on the number of steps T and is already minimax-optimal in non-convex cases, which are both O(T^{-1/4}). In this work, we argue that the exploitation of nice ell_infty-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under novel assumptions that loss is smooth under ell_infty-geometry rather than the more common ell_2-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable ell_infty-geometry is changed while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions. 3 authors · Oct 10, 2024
- GD doesn't make the cut: Three ways that non-differentiability affects neural network training This paper investigates the distinctions between gradient methods applied to non-differentiable functions (NGDMs) and classical gradient descents (GDs) designed for differentiable functions. First, we demonstrate significant differences in the convergence properties of NGDMs compared to GDs, challenging the applicability of the extensive neural network convergence literature based on L-smoothness to non-smooth neural networks. Next, we demonstrate the paradoxical nature of NGDM solutions for L_{1}-regularized problems, showing that increasing the regularization penalty leads to an increase in the L_{1} norm of optimal solutions in NGDMs. Consequently, we show that widely adopted L_{1} penalization-based techniques for network pruning do not yield expected results. Finally, we explore the Edge of Stability phenomenon, indicating its inapplicability even to Lipschitz continuous convex differentiable functions, leaving its relevance to non-convex non-differentiable neural networks inconclusive. Our analysis exposes misguided interpretations of NGDMs in widely referenced papers and texts due to an overreliance on strong smoothness assumptions, emphasizing the necessity for a nuanced understanding of foundational assumptions in the analysis of these systems. 1 authors · Jan 16, 2024
1 Accelerating Convergence of Score-Based Diffusion Models, Provably Score-based diffusion models, while achieving remarkable empirical performance, often suffer from low sampling speed, due to extensive function evaluations needed during the sampling phase. Despite a flurry of recent activities towards speeding up diffusion generative modeling in practice, theoretical underpinnings for acceleration techniques remain severely limited. In this paper, we design novel training-free algorithms to accelerate popular deterministic (i.e., DDIM) and stochastic (i.e., DDPM) samplers. Our accelerated deterministic sampler converges at a rate O(1/{T}^2) with T the number of steps, improving upon the O(1/T) rate for the DDIM sampler; and our accelerated stochastic sampler converges at a rate O(1/T), outperforming the rate O(1/T) for the DDPM sampler. The design of our algorithms leverages insights from higher-order approximation, and shares similar intuitions as popular high-order ODE solvers like the DPM-Solver-2. Our theory accommodates ell_2-accurate score estimates, and does not require log-concavity or smoothness on the target distribution. 6 authors · Mar 6, 2024
- Statistical guarantees for denoising reflected diffusion models In recent years, denoising diffusion models have become a crucial area of research due to their abundance in the rapidly expanding field of generative AI. While recent statistical advances have delivered explanations for the generation ability of idealised denoising diffusion models for high-dimensional target data, implementations introduce thresholding procedures for the generating process to overcome issues arising from the unbounded state space of such models. This mismatch between theoretical design and implementation of diffusion models has been addressed empirically by using a reflected diffusion process as the driver of noise instead. In this paper, we study statistical guarantees of these denoising reflected diffusion models. In particular, we establish minimax optimal rates of convergence in total variation, up to a polylogarithmic factor, under Sobolev smoothness assumptions. Our main contributions include the statistical analysis of this novel class of denoising reflected diffusion models and a refined score approximation method in both time and space, leveraging spectral decomposition and rigorous neural network analysis. 3 authors · Nov 3, 2024
- Learning Control-Oriented Dynamical Structure from Data Even for known nonlinear dynamical systems, feedback controller synthesis is a difficult problem that often requires leveraging the particular structure of the dynamics to induce a stable closed-loop system. For general nonlinear models, including those fit to data, there may not be enough known structure to reliably synthesize a stabilizing feedback controller. In this paper, we discuss a state-dependent nonlinear tracking controller formulation based on a state-dependent Riccati equation for general nonlinear control-affine systems. This formulation depends on a nonlinear factorization of the system of vector fields defining the control-affine dynamics, which always exists under mild smoothness assumptions. We propose a method for learning this factorization from a finite set of data. On a variety of simulated nonlinear dynamical systems, we empirically demonstrate the efficacy of learned versions of this controller in stable trajectory tracking. Alongside our learning method, we evaluate recent ideas in jointly learning a controller and stabilizability certificate for known dynamical systems; we show experimentally that such methods can be frail in comparison. 4 authors · Feb 5, 2023
- Quantum Lower Bounds for Finding Stationary Points of Nonconvex Functions Quantum algorithms for optimization problems are of general interest. Despite recent progress in classical lower bounds for nonconvex optimization under different settings and quantum lower bounds for convex optimization, quantum lower bounds for nonconvex optimization are still widely open. In this paper, we conduct a systematic study of quantum query lower bounds on finding epsilon-approximate stationary points of nonconvex functions, and we consider the following two important settings: 1) having access to p-th order derivatives; or 2) having access to stochastic gradients. The classical query lower bounds is Omegabig(epsilon^{-1+p{p}}big) regarding the first setting, and Omega(epsilon^{-4}) regarding the second setting (or Omega(epsilon^{-3}) if the stochastic gradient function is mean-squared smooth). In this paper, we extend all these classical lower bounds to the quantum setting. They match the classical algorithmic results respectively, demonstrating that there is no quantum speedup for finding epsilon-stationary points of nonconvex functions with p-th order derivative inputs or stochastic gradient inputs, whether with or without the mean-squared smoothness assumption. Technically, our quantum lower bounds are obtained by showing that the sequential nature of classical hard instances in all these settings also applies to quantum queries, preventing any quantum speedup other than revealing information of the stationary points sequentially. 2 authors · Dec 7, 2022
1 Iterate to Accelerate: A Unified Framework for Iterative Reasoning and Feedback Convergence We introduce a unified framework for iterative reasoning that leverages non-Euclidean geometry via Bregman divergences, higher-order operator averaging, and adaptive feedback mechanisms. Our analysis establishes that, under mild smoothness and contractivity assumptions, a generalized update scheme not only unifies classical methods such as mirror descent and dynamic programming but also captures modern chain-of-thought reasoning processes in large language models. In particular, we prove that our accelerated iterative update achieves an O(1/t^2) convergence rate in the absence of persistent perturbations, and we further demonstrate that feedback (iterative) architectures are necessary to approximate certain fixed-point functions efficiently. These theoretical insights bridge classical acceleration techniques with contemporary applications in neural computation and optimization. 1 authors · Feb 6, 2025
- The Minkowski Billiard Characterization of the EHZ-capacity of Convex Lagrangian Products We rigorously state the connection between the EHZ-capacity of convex Lagrangian products Ktimes TsubsetR^ntimesR^n and the minimal length of closed (K,T)-Minkowski billiard trajectories. This connection was made explicit for the first time by Artstein-Avidan and Ostrover under the assumption of smoothness and strict convexity of both K and T. We prove this connection in its full generality, i.e., without requiring any conditions on the convex bodies K and T. This prepares the computation of the EHZ-capacity of convex Lagrangian products of two convex polytopes by using discrete computational methods. 1 authors · Mar 3, 2022
- Error Feedback Reloaded: From Quadratic to Arithmetic Mean of Smoothness Constants Error Feedback (EF) is a highly popular and immensely effective mechanism for fixing convergence issues which arise in distributed training methods (such as distributed GD or SGD) when these are enhanced with greedy communication compression techniques such as TopK. While EF was proposed almost a decade ago (Seide et al., 2014), and despite concentrated effort by the community to advance the theoretical understanding of this mechanism, there is still a lot to explore. In this work we study a modern form of error feedback called EF21 (Richtarik et al., 2021) which offers the currently best-known theoretical guarantees, under the weakest assumptions, and also works well in practice. In particular, while the theoretical communication complexity of EF21 depends on the quadratic mean of certain smoothness parameters, we improve this dependence to their arithmetic mean, which is always smaller, and can be substantially smaller, especially in heterogeneous data regimes. We take the reader on a journey of our discovery process. Starting with the idea of applying EF21 to an equivalent reformulation of the underlying problem which (unfortunately) requires (often impractical) machine cloning, we continue to the discovery of a new weighted version of EF21 which can (fortunately) be executed without any cloning, and finally circle back to an improved analysis of the original EF21 method. While this development applies to the simplest form of EF21, our approach naturally extends to more elaborate variants involving stochastic gradients and partial participation. Further, our technique improves the best-known theory of EF21 in the rare features regime (Richtarik et al., 2023). Finally, we validate our theoretical findings with suitable experiments. 3 authors · Feb 16, 2024
- Nearest Neighbour Based Estimates of Gradients: Sharp Nonasymptotic Bounds and Applications Motivated by a wide variety of applications, ranging from stochastic optimization to dimension reduction through variable selection, the problem of estimating gradients accurately is of crucial importance in statistics and learning theory. We consider here the classic regression setup, where a real valued square integrable r.v. Y is to be predicted upon observing a (possibly high dimensional) random vector X by means of a predictive function f(X) as accurately as possible in the mean-squared sense and study a nearest-neighbour-based pointwise estimate of the gradient of the optimal predictive function, the regression function m(x)=E[Ymid X=x]. Under classic smoothness conditions combined with the assumption that the tails of Y-m(X) are sub-Gaussian, we prove nonasymptotic bounds improving upon those obtained for alternative estimation methods. Beyond the novel theoretical results established, several illustrative numerical experiments have been carried out. The latter provide strong empirical evidence that the estimation method proposed works very well for various statistical problems involving gradient estimation, namely dimensionality reduction, stochastic gradient descent optimization and quantifying disentanglement. 3 authors · Jun 26, 2020
- AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates The recently proposed Muon optimizer updates weight matrices via orthogonalized momentum and has demonstrated strong empirical success in large language model training. However, it remains unclear how to determine the learning rates for such orthogonalized updates. AdaGrad, by contrast, is a widely used adaptive method that scales stochastic gradients by accumulated past gradients. We propose a new algorithm, AdaGO, which combines a norm-based AdaGrad-type stepsize with an orthogonalized update direction, bringing together the benefits of both approaches. Unlike other adaptive variants of Muon, AdaGO preserves the orthogonality of the update direction, which can be interpreted as a spectral descent direction, while adapting the stepsizes to the optimization landscape by scaling the direction with accumulated past gradient norms. The implementation of AdaGO requires only minimal modification to Muon, with a single additional scalar variable, the accumulated squared gradient norms, to be computed, making it computationally and memory efficient. Optimal theoretical convergence rates are established for nonconvex functions in both stochastic and deterministic settings under standard smoothness and unbiased bounded-variance noise assumptions. Empirical results on CIFAR-10 classification and function regression demonstrate that AdaGO outperforms Muon and Adam. 3 authors · Sep 2, 2025
- Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance This paper provides the first tight convergence analyses for RMSProp and Adam in non-convex optimization under the most relaxed assumptions of coordinate-wise generalized smoothness and affine noise variance. We first analyze RMSProp, which is a special case of Adam with adaptive learning rates but without first-order momentum. Specifically, to solve the challenges due to dependence among adaptive update, unbounded gradient estimate and Lipschitz constant, we demonstrate that the first-order term in the descent lemma converges and its denominator is upper bounded by a function of gradient norm. Based on this result, we show that RMSProp with proper hyperparameters converges to an epsilon-stationary point with an iteration complexity of mathcal O(epsilon^{-4}). We then generalize our analysis to Adam, where the additional challenge is due to a mismatch between the gradient and first-order momentum. We develop a new upper bound on the first-order term in the descent lemma, which is also a function of the gradient norm. We show that Adam with proper hyperparameters converges to an epsilon-stationary point with an iteration complexity of mathcal O(epsilon^{-4}). Our complexity results for both RMSProp and Adam match with the complexity lower bound established in arjevani2023lower. 3 authors · Apr 1, 2024
- IGLU: Efficient GCN Training via Lazy Updates Training multi-layer Graph Convolution Networks (GCN) using standard SGD techniques scales poorly as each descent step ends up updating node embeddings for a large portion of the graph. Recent attempts to remedy this sub-sample the graph that reduces compute but introduce additional variance and may offer suboptimal performance. This paper develops the IGLU method that caches intermediate computations at various GCN layers thus enabling lazy updates that significantly reduce the compute cost of descent. IGLU introduces bounded bias into the gradients but nevertheless converges to a first-order saddle point under standard assumptions such as objective smoothness. Benchmark experiments show that IGLU offers up to 1.2% better accuracy despite requiring up to 88% less compute. 5 authors · Sep 28, 2021