1 Introduction

We consider the nonsmooth convex program, defined as

$$\begin{aligned} \min _{{\textbf {x}}\in {\mathcal {X}}}&\ \left\{ \, f({\textbf {x}}) \, \mid \, g({\textbf {x}})\, \le \, 0 \right\} , \end{aligned}$$
(NSCopt)

where \(f: {\mathcal {X}}\rightarrow \mathbb {R}\) is a real-valued convex function that is possibly nonsmooth (but smoothable), \({\mathcal {X}}\subset \mathbb {R}^{n}\) is a closed and convex set, and \(g(\textbf{x}) = (g_1(\textbf{x}),g_2(\textbf{x}),\ldots ,g_m(\textbf{x}))^\top \), where each \(g_i :{\mathcal {X}}\rightarrow \mathbb {R}, i = 1, 2, \cdots , m,\) is a possibly complicated nonsmooth (but smoothable) convex function. Generally, the presence of such constraints precludes the use of projection-based methods to ensure feasibility of iterates. In deterministic regimes, a host of approaches have been employed for contending with complicated constraints, a subset of which includes sequential quadratic programming [18, 43], interior point methods [8], and augmented Lagrangian (AL) schemes [38, 39]. Of these, AL schemes have proven to be enormously influential in the context of scientific computing [1, 9, 13], and more specifically in nonlinear programming in the form of solvers such as minos [16, 28] and lancelot [10] as well as more refined techniques [15, 17]. There has been significant interest in deriving overall complexity bounds [24, 44] in convex regimes when the Lagrangian subproblem is solved via a first-order method. However, such bounds tend to be poor when constraints are possibly nonsmooth; e.g., standard AL schemes display complexity guarantees of \(\mathcal {O}(\varvec{\varepsilon }^{-5})\) for computing an \(\varvec{\varepsilon }\)-optimal solution in such settings (see Table 1).

Table 1 ALM for deterministic convex optimization

1.1. Related work. Before proceeding, we discuss related prior research. (a) Augmented Lagrangian Methods. The augmented Lagrangian method (ALM) was proposed by Hestenes [19] and Powell [37], with a comprehensive rate analysis subsequently provided by Rockafellar [38]. The ALM framework relies on solving a sequence of unconstrained (or relaxed) problems, requiring the minimization of a suitably defined augmented Lagrangian function \(\mathcal{L}_{\rho }(\textbf{x},\lambda )\) in \(\textbf{x}\), where \(\rho \) and \(\lambda \) denote the penalty parameter and the Lagrange multiplier associated with g, respectively. In high-dimensional settings, the Lagrangian subproblems cannot be solved exactly, leading to the development of variants that allow for inexact resolution of the Lagrangian subproblem. Kang et al. [21] presented an inexact accelerated ALM for strongly convex optimization with linear constraints at a rate of \(\mathcal {O}(1/k^2)\), where k is the iteration counter. Non-ergodic convergence guarantees were provided in [24, 25], where either smoothness of f [24] or a composite structure [25] is assumed. Overall complexity guarantees were first provided by Lan and Monteiro [24], Aybat and Iyengar [4], Necoara et al. [29] and most recently Lu and Zhou [26], where the latter three references allowed for conic settings. In fact, Lu and Zhou [26] showed that in conic convex settings with smooth nonlinear constraints, by introducing a regularization, the overall complexity is improved to \(\mathcal {O}\left( \varvec{\varepsilon }^{-1}\ln (\varvec{\varepsilon }^{-1})\right) \) with a geometrically increasing penalty parameter. Nedelcu et al. [30] considered convex and strongly convex regimes. Notably, Necoara et al. [29] derived an overall complexity of \(\mathcal {O}(\varvec{\varepsilon }^{-\frac{3}{2}})\) and \(\mathcal {O}(\varvec{\varepsilon }^{-1})\) for convex and strongly convex objectives in smooth settings, respectively. More recently, Xu [44] considered nonlinear but smooth regimes in proposing an inexact ALM (under a suitable boundedness requirement) with complexity guarantees of \(\mathcal {O}(\varvec{\varepsilon }^{-1})\) (under convex f) and \(\mathcal {O}(\varvec{\varepsilon }^{-\frac{1}{2}}\log ({\varvec{\varepsilon }^{-1}}))\) (under strongly convex f), respectively. Table 1 compares existing complexity guarantees for AL schemes with our schemes in convex (Sm-AL) and strongly convex (Sm-AL(S)) settings, as well as with standard ALM (N-AL), where \(\tilde{\mathcal {O}}\) suppresses logarithmic terms.

(b) Smoothing techniques. While subgradient methods have proven effective in addressing nonsmooth convex objectives [36], smoothing techniques [6] represent an efficient avenue for a subclass of nonsmooth problems. Moreau [27] introduced the (Moreau)-smoothing \(f_\eta \) of a convex function f, with parameter \(\eta \), defined as

$$\begin{aligned} f_{\eta }(\textbf{x}) \, \triangleq \, \inf _{\textbf{u}}\left\{ f(\textbf{u}) + \tfrac{1}{\eta }\Vert \textbf{u}-\textbf{x}\Vert ^2\right\} . \end{aligned}$$

Nesterov [33] employed a fixed smoothing parameter in developing a smoothing framework for nonsmooth convex optimization problems with a rate of \(\mathcal {O}(\varvec{\varepsilon }^{-1})\), an improvement over \(\mathcal {O}(\varvec{\varepsilon }^{-2})\) attainable by subgradient methods. In related work, Aybat and Iyengar [3] designed a smoothed penalty method for obtaining \(\varvec{\varepsilon }\)-optimal solutions for \(l_1\)-minimization problems with linear equality constraints in \(\tilde{\mathcal {O}}\left( \varvec{\varepsilon }^{-3/2}\right) \) steps. Subsequently, Beck and Teboulle [7] defined an \((\alpha , \beta )\)-smoothing for a nonsmooth convex f satisfying the following two conditions: (i) \(f_{\eta }(\textbf{x}) \, \le \, f(\textbf{x}) \, \le \, f_{\eta }(\textbf{x}) + \eta \beta \) for all \(\textbf{x}\) and (ii) \(\, f_{\eta }\) is \((\alpha /\eta )\)-smooth. For instance, \(f(\textbf{x}) \triangleq \max \{0,\textbf{x}\}\) has a smoothing \(f_{\eta }\), defined as \(f_{\eta }(\textbf{x}) \triangleq \eta \log (1+\exp (\frac{\textbf{x}}{\eta }))-\eta \log 2.\) Analogous approaches have been employed for addressing deterministic [12] and stochastic [20] convex optimization problems.
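To make the \((\alpha , \beta )\)-smoothing concrete, the following Python sketch (our own illustration, not drawn from the cited references) numerically checks the sandwich \(f_{\eta } \le f \le f_{\eta } + \eta \beta \) for the softplus smoothing of \(f(x) = \max \{0,x\}\) quoted above; the constants \(\beta = \log 2\) and \(\alpha = 1/4\) appearing in the comments are our annotations for this particular example.

```python
import numpy as np

# Illustrative check (ours): softplus smoothing of f(x) = max{0, x}.
# We verify f_eta <= f <= f_eta + eta*beta with beta = log 2, and estimate the
# Lipschitz constant of grad f_eta, which should not exceed alpha/eta with alpha = 1/4.

def f(x):
    return np.maximum(0.0, x)

def f_eta(x, eta):
    # eta*log(1 + exp(x/eta)) - eta*log(2), evaluated stably via logaddexp
    return eta * np.logaddexp(0.0, x / eta) - eta * np.log(2.0)

def grad_f_eta(x, eta):
    # derivative of the softplus smoothing: the logistic function sigma(x/eta)
    return 1.0 / (1.0 + np.exp(-x / eta))

eta = 0.1
x = np.linspace(-5.0, 5.0, 20001)
gap = f(x) - f_eta(x, eta)
print("0 <= f - f_eta <= eta*log 2 ?",
      gap.min() >= -1e-12, gap.max() <= eta * np.log(2.0) + 1e-12)

g = grad_f_eta(x, eta)
L_est = np.max(np.abs(np.diff(g)) / np.diff(x))   # crude Lipschitz estimate
print("estimated Lipschitz constant:", L_est, " bound alpha/eta:", 0.25 / eta)
```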

1.2. Applications. We present three applications where nonsmooth convex constraints emerge.

(a) Regression. Lasso regression [40] is a model widely used in variable selection in statistical learning. Assume that the dataset consists of \(\{y_i,X_i\}_{i=1}^{N}\), where \((y_i,X_i)\) denotes the outcome and feature vector for the ith instance. Then an elastic-net model [46] can be articulated as follows, where \(C_1 > 0\).

$$\begin{aligned} \min _{\beta }\, \left\{ \, \Vert y-X\beta \Vert ^2_2 \, \mid \, (1-\alpha )\Vert \beta \Vert _1 + \alpha \Vert \beta \Vert _2 \le C_1 \, \right\} . \end{aligned}$$
(1)

This reduces to standard Lasso [40] when \(\alpha = 0\) and is generalizable to fused Lasso [41] by adding an additional nonsmooth constraint \(\sum _{j = 2}^{p}|\beta _j-\beta _{j-1}| \le C_2\), where \(C_2 > 0\).
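To see how such constraints fit the smoothable template used throughout, the following Python sketch (our own construction; the coordinatewise Huber smoothing of \(\Vert \cdot \Vert _1\) and the smoothing \(\sqrt{\Vert \cdot \Vert ^2+\eta ^2}-\eta \) of \(\Vert \cdot \Vert _2\) are choices of ours, not taken from the paper) evaluates the elastic-net constraint function \(g(\beta ) = (1-\alpha )\Vert \beta \Vert _1 + \alpha \Vert \beta \Vert _2 - C_1\) together with one possible smoothing \(g_{\eta }\), whose gap to g shrinks at the rate \(\mathcal {O}(\eta )\).

```python
import numpy as np

# Illustrative sketch (ours): one possible smoothing of the elastic-net constraint.

def g(beta, alpha, C1):
    return (1 - alpha) * np.sum(np.abs(beta)) + alpha * np.linalg.norm(beta) - C1

def huber(t, eta):
    # smoothing of |t|: quadratic near zero, linear (shifted by eta/2) outside
    return np.where(np.abs(t) <= eta, t**2 / (2 * eta), np.abs(t) - eta / 2)

def g_eta(beta, alpha, C1, eta):
    l1_smooth = np.sum(huber(beta, eta))                      # smooths ||beta||_1
    l2_smooth = np.sqrt(np.dot(beta, beta) + eta**2) - eta    # smooths ||beta||_2
    return (1 - alpha) * l1_smooth + alpha * l2_smooth - C1

beta = np.random.randn(50)
for eta in [1e-1, 1e-2, 1e-3]:
    # nonnegative gap g - g_eta, shrinking like O(eta)
    print(eta, g(beta, 0.5, 1.0) - g_eta(beta, 0.5, 1.0, eta))
```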

(b) Classification. In statistical learning, the Neyman-Pearson (NP) classification [42] is designed to minimize the type II error while maintaining the type I error below a user-specified level \(\alpha \). Consider a labeled training dataset \(\{a_i\}_{i = 1}^N\), where the positive and negative sets are represented by \(\{a_i^{(1)}\}_{i = 1}^{N_{(1)}}\) and \(\{a_i^{(-1)}\}_{i = 1}^{N_{(-1)}}\), respectively. The empirical NP classification problem is given as follows [45]:

$$\begin{aligned} \min _{\textbf{x}} \,\left\{ \, \tfrac{\sum _{i = 1}^{N_{(-1)}}\varvec{\ell }\left( 1, \textbf{x}^\top a_{i}^{(-1)}\right) }{N_{(-1)}} \, \bigg | \, \tfrac{\sum _{i = 1}^{N_{(1)}}\varvec{\ell }\left( -1, \textbf{x}^\top a_{i}^{(1)}\right) }{N_{(1)}}-\alpha \le 0 \, \right\} , \end{aligned}$$

where \(\varvec{\ell }(\bullet )\) denotes the loss function. Choices of the loss function include nonsmooth variants such as mean absolute error (MAE) and hinge loss.

(c) Multiple Kernel learning. Multiple kernel learning (MKL) employs a predefined set of kernels to learn an optimal linear or nonlinear combination of these kernels, defined as follows [22].

$$\begin{aligned} \min _{ w, b, (\theta ,\xi )\ge 0}&\quad \tfrac{1}{2}\sum _{m = 1}^{M}\tfrac{\Vert w_m\Vert _2^2}{\theta _m} + C\Vert \xi \Vert _1 \, \\ \mathop {\mathrm {subject\;to}}\limits&\quad y_i\left( \sum _{m = 1}^{M}w_m'\psi _{m}(\textbf{x}_i) + b\right) \, \ge \, 1-\xi _i, \quad i = 1, \cdots , N \\&\quad \Vert \theta \Vert _{p}^{p}\, \le \, 1, \end{aligned}$$

where \(\psi _m(\bullet ), m = 1, \dots , M,\) are predefined kernel (feature) maps, \(\theta \) is a vector of coefficients for each kernel, \(w = (w_1, \dots , w_M)\) collects the weight vectors of the primal model for learning with multiple kernels, and N denotes the number of training pairs \((\textbf{x}_i, y_i)\).

1.3. Contributions. We present a smoothed AL framework (Sm-AL) where the nonsmooth (but smoothable) objective/constraints are smoothed with a diminishing smoothing parameter \(\eta _k\). Consequently, the AL subproblem (with penalty parameter \(\rho _k\)) is proven to be \(\mathcal {O}(\rho _k/\eta _k)\)-smooth, allowing for (accelerated) computation of an \(\epsilon _k\)-exact solution in finite time. By a careful selection of the sequences \(\{\epsilon _k,\eta _k,\rho _k\}\), we derive rate and complexity guarantees. Our contributions are formalized next.

(i) In Section 2, we derive an ex-ante bound on the optimal multiplier set of the \(\eta \)-smoothed problem. This result, which is of independent interest, allows us to claim that a saddle-point of the \(\eta \)-smoothed problem is an \(\mathcal {O}(\eta )\)-saddle point of the original problem, enabling the derivation of fixed-smoothing schemes.

(ii) In Section 3, we establish a dual suboptimality rate of \(\mathcal {O}(k^{-1})\) and primal infeasibility rate of \(\mathcal {O}(k^{-1/2})\) (constant penalty) while geometric rates of \(\mathcal {O}(1/\rho _k)\) on primal infeasibility and suboptimality are derived under geometrically increasing penalty parameters. In Section 4, by employing an accelerated gradient framework for resolving the \(\eta _k\)-smoothed AL subproblem, the overall complexities of (Sm-AL) in terms of inner projection steps for obtaining an \(\varvec{\varepsilon }\)-optimal solution are proven to be \(\mathcal {O}(\varvec{\varepsilon }^{-(3+\delta )})\) (constant penalty) and \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-3/2})\) (geometrically increasing penalty). Analogous bounds in strongly convex settings are given by \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-(2+\delta )})\) and \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-1})\) for constant and geometrically increasing penalty parameters, respectively. Similar complexity guarantees are available with a fixed smoothing parameter, akin to those developed in [7, 33] for convex programs with nonsmooth objectives.

(iii) We also develop practical termination criteria in Section 2, which when overlaid with our proposed scheme lead to significantly improved empirical complexity in our numerical experiments with little impact on accuracy.

(iv) Preliminary numerical results are provided in Section 5 before concluding in Section 6.

Organization The remainder of the paper is organized as follows. In Section 2, we introduce the smoothed augmented Lagrangian framework, providing the requisite background and the assumptions. Sections 3 and 4 provide the rate and complexity analysis while Section 5 presents a description of our numerical experiments. The paper concludes in Section 6.

Notation. Let \(\Vert \cdot \Vert \) denote the Euclidean norm. Given a closed convex set \(\mathcal {X} \subseteq \mathbb {R}^n\) and \(y\in \mathbb {R}^n\), \(d_{\mathcal {X}}(y)\triangleq {\displaystyle \min _{s\in \mathcal {X}}} \Vert y-s\Vert \), \(d^2_{\mathcal {X}}(y)\triangleq \left( d_{\mathcal {X}}(y)\right) ^2\), and \(\varPi _{\mathcal {X}}(y)\triangleq {\displaystyle \text{ argmin}_{s\in \mathcal{X}}} \Vert y-s\Vert \); hence, \(d_{\mathcal {X}}(y)=\Vert y-\varPi _{\mathcal {X}}(y)\Vert \). Moreover, \(d^2_{\mathcal {X}}(\cdot )\) is differentiable and its gradient is \(\nabla d^2_{\mathcal {X}}(y)=2(y-\varPi _{\mathcal {X}}(y))\). \(d_{-}(u)\) denotes the distance of u to the nonpositive orthant \(\mathbb {R}^n_{-}\), defined as \(d_{-}(u) \, \triangleq \, \Vert u - \varPi _{\mathbb {R}^n_-} [u]\Vert _{2}.\) Further, \(\tilde{\mathcal {O}}(f(n))\) is \(\mathcal {O}(f(n))\) up to a \(\log (n)\) factor. Finally, \(\textbf{1}\) denotes the column of ones in \(\mathbb {R}^n\).
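As a concrete rendering of this notation, the following minimal Python sketch (ours; the Euclidean ball stands in for a generic closed convex set purely for illustration) evaluates \(d_{\mathcal {X}}(y)\), \(\varPi _{\mathcal {X}}(y)\), \(\nabla d^2_{\mathcal {X}}(y) = 2(y-\varPi _{\mathcal {X}}(y))\), and \(d_{-}(u)\).

```python
import numpy as np

# Minimal sketch (ours) of the notation above, with X the Euclidean ball of radius r.

def proj_X(y, r=1.0):
    nrm = np.linalg.norm(y)
    return y if nrm <= r else (r / nrm) * y

def d_X(y, r=1.0):
    return np.linalg.norm(y - proj_X(y, r))

def grad_d2_X(y, r=1.0):
    # gradient of the squared distance: 2*(y - Pi_X(y))
    return 2.0 * (y - proj_X(y, r))

def d_minus(u):
    # distance of u to the nonpositive orthant: ||u - Pi_{R^n_-}[u]|| = ||max(u, 0)||
    return np.linalg.norm(np.maximum(u, 0.0))

y = np.array([2.0, -1.0, 0.5])
print(d_X(y), grad_d2_X(y), d_minus(y))
```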

2 A Smoothed Augmented Lagrangian Framework

In this section, we first provide some background and then analyze the smoothed problem, ending with a relation between a saddle-point of the \(\eta \)-smoothed problem and an \(\eta \)-approximate saddle-point of the original problem.

2.1 Background and Assumptions

Corresponding to problem (NSCopt), we may define the Lagrangian function \(\mathcal {L}_0\) as follows.

$$\begin{aligned} \mathcal {L}_0(\textbf{x},\lambda ) \, \triangleq \, {\left\{ \begin{array}{ll} f(\textbf{x}) + \lambda ^\top g(\textbf{x}), & \lambda \, \ge \, 0 \\ -\infty . & \, \text{ otherwise } \end{array}\right. } \end{aligned}$$

This allows for denoting the set of minimizers of \(\mathcal {L}_0({\bullet },\lambda )\) over the set \(\mathcal{X}\) by \(\mathcal {X}^*(\lambda )\), the dual function by \(\mathcal {D}_0(\lambda )\), and the dual solution set by \(\varLambda ^*\), each of which is defined next.

$$\begin{aligned} \mathcal {X}^*(\lambda ) \, \triangleq \, \arg \min _{\textbf{x}\in \mathcal X} \, \mathcal {L}_0(\textbf{x},\lambda ), \, \mathcal {D}_0(\lambda ) \, \triangleq \, \inf _{\textbf{x}\in \mathcal {X}}\, \mathcal {L}_0(\textbf{x},\lambda ), \text{ and } \varLambda ^* \, \triangleq \, \arg \max _{\lambda \ge 0} \mathcal {D}_0(\lambda ). \end{aligned}$$

By adding a slack variable \(\textbf{v} \in \mathbb {R}^m\), we may recast (NSCopt) as follows.

$$\begin{aligned} \begin{aligned} \min _{\textbf{x}\, \in \, {\mathcal {X}}, \textbf{v} \, \ge \, 0}&\quad f(\textbf{x}) \\ \mathop {\mathrm {subject\;to}}\limits&\quad g(\textbf{x}) + \textbf{v} = 0, \qquad (\lambda ) \end{aligned} \end{aligned}$$

where \(\lambda \in \mathbb {R}^m\) denotes the Lagrange multiplier associated with the constraint \(g(\textbf{x}) + \textbf{v} = 0\). Then the augmented Lagrangian function, denoted by \(\mathcal {L}_{\rho }\), where \(\rho \) denotes the penalty parameter, is defined as follows (cf. [38]).

$$\begin{aligned} \mathcal {L}_{\rho } (\textbf{x},\lambda )&\, \triangleq \, \min _{\textbf{v}\, \ge \, 0} \, \left[ \, f(\textbf{x})+ \lambda ^\top (g(\textbf{x}) + \textbf{v}) + \tfrac{\rho }{2} \left\| \, g(\textbf{x})+ \textbf{v}\right\| ^2\, \right] . \end{aligned}$$
(2)

It has been shown that \((\bar{\textbf{x}},\bar{\lambda })\) is a saddle-point of the augmented Lagrangian \(\mathcal{L}_{\rho }\) for any \(\rho \ge 0\) if and only if \((\bar{\textbf{x}},\bar{\lambda })\) is a saddle-point of \(\mathcal{L}_0\). Further, if \(\bar{\lambda }\) is an optimal dual solution, then \(\bar{\textbf{x}}\) is an optimal solution of (NSCopt) if and only if \(\bar{\textbf{x}}\) minimizes \(\mathcal{L}(\bullet ,\bar{\lambda })\) over \(\mathcal{X}\) [38, Th. 3.5].

If \(d_{-}(u) \, \triangleq \, {\displaystyle \inf _{v \in \mathbb {R}^m_-}} \Vert u-v\Vert \) and \(\varPi _{+}[u]\) denotes the Euclidean projection of u onto \(\mathbb {R}^m_+\), then the AL function \(\mathcal {L}_{\rho }\) and its gradient can be expressed as follows [38, Sec. 2].

Lemma 1

Consider the function \(\mathcal {L}_{\rho }\) for \(\rho > 0\), \(\textbf{x}\in {\mathcal {X}}\) and \(\lambda \ge 0\). Then

$$\begin{aligned} \mathcal {L}_{\rho } ({\textbf {x}},\lambda )&=\left( f({\textbf {x}})+ \tfrac{\rho }{2} \left( d_-\left( \tfrac{\lambda }{\rho } + g({\textbf {x}}) \right) \right) ^2 - \tfrac{1}{2\rho }\Vert \lambda \Vert ^2 \right) , \\ \nabla _{\lambda } \mathcal {L}_{\rho } ({\textbf {x}},\lambda )&= \left( -\tfrac{\lambda }{\rho } + \varPi _{+} \left( \tfrac{\lambda }{\rho } + g({\textbf {x}})\right) \right) , \\ \text { and } \nabla _{{\textbf {x}}} \mathcal {L}_{\rho }({\textbf {x}},\lambda )&= \nabla _{{\textbf {x}}}f({\textbf {x}})+\rho J_g({\textbf {x}}) \left( \tfrac{\lambda }{\rho } + g({\textbf {x}}) - \varPi _{-} \left( \tfrac{\lambda }{\rho } + g({\textbf {x}}) \right) \right) , \end{aligned}$$

where \(J_{g}(\textbf{x})\) is the Jacobian matrix of g. \(\Box \)
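The closed forms in Lemma 1 are straightforward to evaluate. The following Python sketch (a numerical check of ours, on a toy quadratic objective with affine constraints \(g(\textbf{x}) = A\textbf{x}-b\) of our choosing) compares \(\mathcal {L}_{\rho }\) computed from the definition (2), i.e., by carrying out the minimization over \(\textbf{v} \ge 0\) in closed form, against the \(d_{-}\)-based expression, and evaluates \(\nabla _{\lambda }\mathcal {L}_{\rho }\).

```python
import numpy as np

# Numerical check (ours) of the closed forms in Lemma 1 on a toy problem.
rng = np.random.default_rng(0)
n, m, rho = 5, 3, 2.0
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

f = lambda x: 0.5 * np.dot(x, x)      # toy smooth objective
g = lambda x: A @ x - b               # affine constraint map

def L_rho_from_definition(x, lam):
    # the minimizer of (2) over v >= 0 gives w = g(x) + v = max(g(x), -lam/rho)
    w = np.maximum(g(x), -lam / rho)
    return f(x) + lam @ w + 0.5 * rho * np.dot(w, w)

def L_rho_closed_form(x, lam):
    u = lam / rho + g(x)
    d_minus = np.linalg.norm(np.maximum(u, 0.0))      # distance of u to R^m_-
    return f(x) + 0.5 * rho * d_minus**2 - np.dot(lam, lam) / (2 * rho)

def grad_lambda_L_rho(x, lam):
    return -lam / rho + np.maximum(lam / rho + g(x), 0.0)

x, lam = rng.standard_normal(n), np.abs(rng.standard_normal(m))
print(abs(L_rho_from_definition(x, lam) - L_rho_closed_form(x, lam)))   # ~ 0
print(grad_lambda_L_rho(x, lam))
```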

Similarly, the augmented dual function \(\mathcal {D}_{\rho }\), defined as

$$\begin{aligned} \begin{aligned} \mathcal {D}_{\rho }(\lambda )&\, \triangleq \, \inf _{\textbf{x}\in \mathcal {X}} \mathcal {L}_{\rho }(\textbf{x},\lambda ), \end{aligned} \end{aligned}$$
(3)

can be shown to be differentiable [38, Th. 3.2].

Lemma 2

Consider the function \(\mathcal {D}_{\rho }\) defined as (3). Then \(\mathcal {D}_{\rho }\) is a C\(^1\) and concave function over \(\mathbb {R}^m\) and is the Moreau envelope of \(\mathcal {D}_0\), defined as

$$\begin{aligned} \mathcal {D}_{\rho }(\lambda ) = \max _{u \in \mathbb {R}^m} \left[ \mathcal {D}_0(u) - \tfrac{1}{2\rho }\Vert u-\lambda \Vert ^2\right] \text{ and } \nabla _{\lambda } \mathcal {D}_{\rho }{(\lambda )}\, \triangleq \, \tfrac{1}{\rho }\left( q_{\rho }(\lambda ) - \lambda \right) , \end{aligned}$$

where \(q_{\rho }(\lambda ) \, \triangleq \, \arg {\displaystyle \max _{u}} \left[ \mathcal {D}_0(u) - \tfrac{1}{2\rho }\Vert u-\lambda \Vert ^2\right] .\) \(\Box \)

Since \(\mathcal{D}_{\rho }\) is the Moreau envelope of \(\mathcal{D}_0\), \(\mathcal{D}_{\rho }\) has the same set of maximizers as \(\mathcal{D}_0\) for any \(\rho \ge 0\) [38, Th. 3.2]. Our interest lies in nonsmooth, albeit smoothable, convex functions, defined next [7].

Definition 1

A closed, proper, and convex function \(h: \mathbb {R}^n \rightarrow \mathbb {R}\) is \((\alpha ,\beta )\) smoothable if for any \(\eta > 0\), there exists a convex differentiable function \(h_{\eta }\) such that

$$\begin{aligned} \left\| \, \nabla _{\textbf{x}} h_{\eta }(\textbf{x}_1) - \nabla _{\textbf{x}} h_{\eta }(\textbf{x}_2) \, \right\|&\, \le \, \tfrac{\alpha }{\eta } \Vert \textbf{x}_1-\textbf{x}_2\Vert , \quad \forall \textbf{x}_1, \textbf{x}_2 \in \mathbb {R}^n, \\ h_{\eta }(\textbf{x}) \, \le \, h(\textbf{x})&\, \le \, h_{\eta }(\textbf{x}) + \eta \beta , \qquad \forall \textbf{x}\in \mathbb {R}^n. \end{aligned}$$

\(\Box \)

In fact, one may be faced with compositional convex constraints in which the layers may be nonsmooth. In such instances, under suitable conditions, smoothability of the layers implies smoothability of the compositional function, but we postpone such avenues for future work. We leverage the smoothability assumptions in [7] to state our basic assumptions on the objective and constraint functions. In addition, we impose both compactness requirements on \(\mathcal {X}\) as well as a Slater regularity condition. Before stating the required assumptions, we define the \(\epsilon \)-KKT conditions of (NSCopt), which are inspired by the classical KKT conditions.

Definition 2

(\(\epsilon \)-optimal solution) Let \(f^*\) be the optimal value of (NSCopt). Given \(\epsilon \ge 0\), a point \(\tilde{\textbf{x}}\, \in \, \mathcal {X}\) is called an \(\epsilon \)-optimal and \(\epsilon \)-feasible solution to (NSCopt) if

$$\begin{aligned} \, f(\tilde{\textbf{x}})-f^* \, \le \, \epsilon \text{ and } d_{-}\left( g(\tilde{\textbf{x}})\right) \, \le \, \epsilon , \quad \text{ respectively }. \end{aligned}$$
(4)

\(\Box \)

Then the partial KKT conditions corresponding to relaxing the constraint \(g(\textbf{x}) \, \le \, 0\) are defined as follows, where \(\mathcal {L}(\bullet ,\bullet )\) denotes the Lagrangian function and \(\mathcal{N}_\mathcal{X}(x)\) denotes the normal cone of \(\mathcal{X}\) at x.

$$\begin{aligned} 0 \,&\in \, \nabla _{\textbf{x}} \mathcal{L}(\textbf{x}, \lambda ) + \mathcal{N}_\mathcal{X}(\textbf{x}) \end{aligned}$$
(5)
$$\begin{aligned} 0 \,&\le \, \lambda \, \perp \, g(\textbf{x}) \, \le \, 0. \end{aligned}$$
(6)

Consider an optimization problem of the form

$$\begin{aligned} \begin{aligned} \min _{x}&\ f(\textbf{x}) \\ \text{ subject } \text{ to }&\ g(\textbf{x}) \, \le \, 0, \end{aligned} \end{aligned}$$
(C-Opt)

where \(f, g_{i}\) are smooth functions mapping from \(\mathbb {R}^n\) to \(\mathbb {R}\) for \(i = 1, \cdots , m\). Recall that, under a suitable regularity condition, if \(x^*\) is a local minimizer of (C-Opt), then there exists \(\lambda \in \mathbb {R}^m_+\) such that

$$\begin{aligned} \nabla f(\textbf{x}) + \sum _{i=1}^m \lambda _i \nabla g_i(\textbf{x})&= 0 \end{aligned}$$
(7)
$$\begin{aligned} \lambda _i g_i(\textbf{x})&= 0, \quad i = 1, \cdots , m \end{aligned}$$
(8)
$$\begin{aligned} g(\textbf{x})&\, \le 0. \end{aligned}$$
(9)

In fact, (8)–(9), together with \(\lambda \ge 0\), can be compactly stated as

$$\begin{aligned} \lambda \ge 0, \quad \lambda _i g_i(\textbf{x}) = 0, \forall i, \quad g(\textbf{x}) \, \le \, 0. \end{aligned}$$

By leveraging the “perp” notation, we have that \(\lambda \perp g(\textbf{x})\), i.e., \(\lambda _i g_i(\textbf{x}) = 0\) for all i. Therefore, we may compactly represent the KKT conditions as

$$\begin{aligned}&\nabla f(\textbf{x}) + \sum _{i=1}^m \lambda _i \nabla g_i(\textbf{x}) = 0 \end{aligned}$$
(10)
$$\begin{aligned}&0 \le \lambda \, \perp \, g(\textbf{x}) \, \le 0. \end{aligned}$$
(11)

Note that such a notation is common in complementarity theory (see Cottle, Pang, and Stone [11] or Facchinei and Pang [14]). This allows us to define a (partial) \(\epsilon \)-KKT point.

Definition 3

(Partial \(\epsilon \)-KKT condition) Consider the problem (NSCopt). Then (\(\textbf{x}_{\epsilon },\lambda _{\epsilon }\)) is a partial \(\epsilon \)-KKT point if \(\textbf{x}_{\epsilon } \in \mathcal{X}\),

$$\begin{aligned} \mathcal{L}(\textbf{x}_{\epsilon },\lambda _{\epsilon })&\, \le \, \mathcal{L}(\textbf{x}^*,\lambda _{\epsilon }) +\epsilon , \end{aligned}$$
(12)
$$\begin{aligned} 0 \, \le \, \lambda _{\epsilon },&\quad g(\textbf{x}_{\epsilon }) \, \le \, \epsilon \textbf{1}, \text{ and } \lambda _{\epsilon }^\top g(\textbf{x}_{\epsilon }) \, \ge \, -\epsilon , \end{aligned}$$
(13)

where \((\textbf{x}^*,\lambda ^*)\) denotes a KKT point of (NSCopt) satisfying (5)–(6). \(\square \)

This allows us to build a simple relation whereby a partial \(\epsilon \)-KKT point satisfies \(2\epsilon \)-suboptimality and \(m\epsilon \)-infeasibility.

Lemma 3

Consider a tuple \((\textbf{x}_{\epsilon },\lambda _{\epsilon })\) satisfying the partial \(\epsilon \)-KKT conditions given by (12)–(13). Then \((\textbf{x}_{\epsilon },\lambda _{\epsilon })\) satisfies \({2\epsilon }\)-suboptimality and \({m \epsilon }\)-infeasibility, in the sense of (4).

Proof

We observe that \(2\epsilon \)-primal suboptimality (in the sense of (4)) holds by the following sequence of relations.

$$\begin{aligned} f(\textbf{x}_{\epsilon }) -\epsilon&\overset{(13)}{\le } f(\textbf{x}_{\epsilon }) +\lambda _{\epsilon }^\top g(\textbf{x}_{\epsilon }) = \mathcal{L}(\textbf{x}_{\epsilon },\lambda _{\epsilon }) \\&\overset{(12)}{\le } \mathcal{L}(\textbf{x}^*,\lambda _{\epsilon }) + \epsilon = f(\textbf{x}^*)+\underbrace{\lambda _{\epsilon }^\top g(\textbf{x}^*)}_{\, \le \, 0} + \epsilon \\&\le f(\textbf{x}^*)+ \epsilon \\ \implies f(\textbf{x}_{\epsilon })&\le f(\textbf{x}^*) + 2\epsilon . \end{aligned}$$

To show \(m\epsilon \)-feasibility of \(\textbf{x}_{\epsilon }\) (in the sense of (4)), we observe that

$$d_{-}\left( g(\textbf{x}_{\epsilon })\right) \le {\displaystyle \sum _{i=1}^m} \max \{g_i(\textbf{x}_{\epsilon }), 0\} \le m\epsilon ,$$

which completes the proof. \(\square \)

We now present our ground assumption on the problem of interest, which is assumed to hold throughout the paper unless explicitly mentioned otherwise.

[Assumption 1 (ground assumption): f and each \(g_i\), \(i = 1, \cdots , m\), are \((\alpha ,\beta )\)-smoothable in the sense of Definition 1; \(\mathcal {X}\) is compact and convex; and (d) a Slater condition holds, i.e., there exists \(\bar{\textbf{x}} \in \mathcal {X}\) such that \(g(\bar{\textbf{x}}) < 0\).]

Condition (d) allows for bounding the set of optimal dual variables (cf. [23]). We now consider the smoothed counterpart of (NSCopt), defined as

$$\begin{aligned} \min _{{\textbf {x}}\in {\mathcal {X}}}&\ \left\{ \, f_{\eta }({\textbf {x}}) \, \mid \, g_{\eta }({\textbf {x}})\, \le \, 0 \right\} . \end{aligned}$$
(NSCopt\(_{\eta }\))

We note that the solution and multiplier set of (NSCopt\(_{\eta })\) are denoted by \(X^*_{\eta }\) and \(\varLambda ^*_{\eta }\), respectively. Naturally, associated with this problem is the Lagrangian function \(\mathcal{L}_{\eta ,0}\) of the smoothed problem (referred to as the smoothed Lagrangian) as well as the corresponding dual function \(\mathcal{D}_{\eta ,0}\); these objects and their augmented counterparts are defined and analyzed in the next subsection.

2.2 Analysis of Smoothed Lagrangians

We now analyze the smoothed Lagrangian framework where f and g are approximated by smoothings \(f_{\eta }\) and \(g_{\eta }\), where the latter is a vector function with components \(g_{1,\eta }, \cdots , g_{m,\eta }\). The resulting smoothed Lagrangian function \(\mathcal {L}_{\eta ,0}\) and the smoothed dual function \(\mathcal {D}_{\eta ,0}(\lambda )\) are defined as

$$\begin{aligned} \mathcal {L}_{\eta ,0}(\textbf{x},\lambda )&\triangleq \left. {\left\{ \begin{array}{ll} f_{\eta }(\textbf{x}) + \lambda ^\top g_{{\eta }}(\textbf{x}), & \lambda \ge 0 \\ -\infty , & \text{ otherwise } \end{array}\right. } \right\} \text{ and } \mathcal {D}_{\eta ,0}(\lambda ) \triangleq \inf _{\textbf{x}\in \mathcal {X}} \mathcal {L}_{\eta ,0}(\textbf{x},\lambda ). \end{aligned}$$

Then the smoothed augmented Lagrangian function \(\mathcal {L}_{\eta , \rho }\) is defined as

$$\begin{aligned} \mathcal {L}_{\eta ,\rho } (\textbf{x},\lambda )&\triangleq \min _{\textbf{v} \ge 0} \left[ \, f_{\eta }(\textbf{x}) + \lambda ^\top (g_{{\eta }}(\textbf{x}) + \textbf{v}) + \tfrac{\rho }{2} \Vert g_{{\eta }}(\textbf{x}) + \textbf{v}\Vert ^2\, \right] \\&= f_{\eta }(\textbf{x})+ \tfrac{\rho }{2} \left( d_-\left( \tfrac{\lambda }{\rho } + g_{\eta }(\textbf{x}) \right) \right) ^2 - \tfrac{1}{2\rho }\Vert \lambda \Vert ^2. \end{aligned}$$

We may now define \(\mathcal {D}_{\eta ,\rho }\) and \(q_{\eta ,\rho }\) as \(\mathcal {D}_{\eta ,\rho }(\lambda ) \, = \, \max _u [\, \mathcal {D}_{\eta ,0}(u) - \tfrac{1}{2\rho }\Vert u-\lambda \Vert ^2\,]\) and \(\nabla _{\lambda } \mathcal {D}_{\eta ,\rho }(\lambda ) \, = \, \tfrac{1}{\rho }\left( q_{\eta ,\rho }(\lambda ) - \lambda \right) \), where \(q_{\eta , \rho }(\lambda ) \triangleq \textrm{arg}\hspace{-0.02in}\max _{u} [\, \mathcal {D}_{\eta ,0}(u) - \tfrac{1}{2\rho }\Vert u-\lambda \Vert ^2\, ].\) We now relate \(\mathcal {D}_{\rho }\) to \(\mathcal {D}_{\eta ,\rho }\) and \(q_{\rho }\) to \(q_{\eta ,\rho }\) in the next lemma.

Lemma 4

For any \(\lambda \in \mathbb {R}_{+}^m\), the following hold:

(i) \(\left| \mathcal {L}_{0}(\textbf{x},\lambda ) - \mathcal {L}_{\eta ,0}(\textbf{x},\lambda ) \right| \, \le \, \eta (\Vert \lambda \Vert m+1) {\beta };\)

(ii) \(| \mathcal {D}_{\eta ,0}(\lambda ) - \mathcal {D}_{0}(\lambda )| \le \eta (\Vert \lambda \Vert m+1) {\beta } ;\)

(iii)\(| \mathcal {D}_{\eta ,\rho }(\lambda ) - \mathcal {D}_{\rho }(\lambda )| \le \eta (\Vert \lambda \Vert m+1) {\beta }.\) \(\Box \)

Under a Slater regularity condition, the set of optimal multipliers is bounded (cf. [23]). Similar bounds are derived for the \(\eta \)-smoothed problem.

[Proposition 1 (bounds on the optimal multiplier sets): under Assumption 1, (a) the Slater point \(\bar{\textbf{x}}\) of (NSCopt) is also a Slater point of (NSCopt\(_{\eta }\)); (b) \(\varLambda ^*\) admits the Slater-based bound displayed in the proof; (c) \(\varLambda _{\eta }^* \subseteq B_{\lambda ,\eta } \triangleq \{\lambda \ge 0 \mid \sum _{i=1}^{m}\lambda _i \le b_{\lambda ,\eta }\}\).]

Proof

(a) By Assumption 1(d), there exists a vector \(\bar{\textbf{x}} \in \mathcal{X}\) such that \(g(\bar{\textbf{x}}) < 0\), implying that \(g_{\eta }(\bar{\textbf{x}}) < 0\) by the property of smoothability (Def. 1).

(b) By the Slater regularity condition, we directly conclude from [23] that

$$\begin{aligned} \varLambda ^*\,\subseteq \, \left\{ \,\lambda \ge 0\,\bigg |\, \sum _{i = 1}^{m}\lambda _i\,\le \, \tfrac{f(\bar{{\textbf {x}}}) - \mathcal {D}_{0}^*}{\min _{j}\{-g_j(\bar{{\textbf {x}}})\}}\right\} , \text { where } \mathcal {D}_0^* = f^*. \end{aligned}$$

(c) Similarly, \(\varLambda _{\eta }^*\), the dual optimal solution set, is bounded as follows.

$$\begin{aligned} \varLambda _{\eta }^* \, \subseteq \, \left\{ \, \lambda \ge 0 \, \mid \, \sum _{i=1}^m \lambda _i \, \le \, \tfrac{f_{\eta }(\bar{\textbf{x}}) - \mathcal{D}_{0,\eta }^*}{\min _{j} \{ - g_{j,\eta }({\bar{\textbf{x}}})\}} \, \right\} \, \subseteq \, \left\{ \, \lambda \ge 0 \, \mid \, \sum _{i=1}^m \lambda _i \, \le \, \tfrac{f(\bar{\textbf{x}}) - \mathcal{D}_{0,\eta }^*}{\min _{j} \{ - g_{j,\eta }({\bar{\textbf{x}}})\}} \, \right\} . \end{aligned}$$

Recall that \(-g_{j,\eta }({\bar{\textbf{x}}}) \ge -g_j({\bar{\textbf{x}}})\) for \(j = 1, \cdots , m\). Furthermore, \({\displaystyle \min _j}\{ -g_{j,\eta }({\bar{\textbf{x}}})\} \ge {\displaystyle \min _j} \{-g_{j}({\bar{\textbf{x}}})\}\). It follows from (b) that

$$ -\mathcal{D}_{0,\eta }(\lambda _{\eta }^*) \, \overset{\tiny \text{(Optimality } \text{ of } \lambda _{\eta }^*)}{\le } \, -\mathcal{D}_{0,\eta }(\lambda ^*) \, \overset{\tiny \text{(Lemma } 4\hbox {(ii))}}{\le } \, -\mathcal{D}_0(\lambda ^*) + \eta (mb_{\lambda }+1)\beta .$$

Consequently, if \(\mathcal{D}_{0,\eta }^* \triangleq \mathcal{D}_{0,\eta }(\lambda _{\eta }^*)\), \(\mathcal{D}_0^* \triangleq \mathcal{D}_{0}(\lambda ^*)\), then

$$\begin{aligned} \varLambda _{\eta }^*&\, \subseteq \, \left\{ \, \lambda \ge 0 \, \mid \, \sum _{i=1}^m \lambda _i \, \le \, \tfrac{f(\bar{\textbf{x}}) - \mathcal{D}_{0,\eta }^*}{\min _{j} \{ - g_{j,\eta }({\bar{\textbf{x}}})\}} \, \right\} \\&\, \subseteq \, \left\{ \, \lambda \ge 0 \, \mid \, \sum _{i=1}^m \lambda _i \, \le \, \tfrac{f(\bar{\textbf{x}}) - \mathcal{D}_{0,\eta }^*}{\min _{j} \{ - g_{j}({\bar{\textbf{x}}})\}} \, \right\} \\&\, \subseteq \, \left\{ \, \lambda \ge 0 \, \mid \, \sum _{i=1}^m \lambda _i \, \le \, \tfrac{f(\bar{\textbf{x}}) - \mathcal{D}_{0}^* + \eta {(mb_{\lambda }+1)\beta }}{\min _{j} \{ - g_{j}({\bar{\textbf{x}}})\}} \, \right\} \\&\,\subseteq \, B_{\lambda ,\eta } \,\triangleq \, \left\{ \lambda \ge 0\,|\, \sum _{i = 1}^{m} \lambda _i \le b_{\lambda ,\eta }\right\} . \end{aligned}$$

\(\square \)
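As a worked instance of the bound in (b), consider the toy problem \(\min \{x_1+x_2 \mid 1-x_1 \le 0, \, 1-x_2 \le 0\}\) (an example of ours, in which \(\mathcal {X} = \mathbb {R}^2\) and the compactness requirement is ignored purely for illustration); the optimal value is \(f^* = 2\) and the unique optimal multiplier is \(\lambda ^* = (1,1)\), so the Slater-based bound computed below is tight.

```python
import numpy as np

# Toy evaluation (ours) of the Slater-based multiplier bound
#   sum_i lambda_i <= (f(x_bar) - D_0^*) / min_j{ -g_j(x_bar) },   with D_0^* = f^*.
x_bar = np.array([2.0, 2.0])      # Slater point: g(x_bar) = 1 - x_bar = (-1, -1) < 0
f_bar = x_bar.sum()               # f(x_bar) = 4
f_star = 2.0                      # optimal value, attained at (1, 1)
g_bar = 1.0 - x_bar
b_lambda = (f_bar - f_star) / np.min(-g_bar)
print("bound on the sum of multipliers:", b_lambda)   # = 2, matching lambda* = (1, 1)
```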

Both Lemma 4 and Proposition 1 play crucial roles in the convergence analysis presented in Section 3. We now relate a saddle-point \((\textbf{x}^*_{\eta },\lambda _{\eta }^*)\) of (NSCopt\(_{\eta }\)) to an \(\eta \)-saddle-point of (NSCopt), where an \(\eta \)-saddle point satisfies the saddle-point requirements with an \(\mathcal {O}(\eta )\) error, and where the bounds on the multipliers for (NSCopt) and (NSCopt\(_{\eta }\)) are denoted by \({b}_{\lambda }\) and \({b}_{\lambda ,\eta }\), respectively.

[Proposition (relating saddle-points): if \((\textbf{x}^*_{\eta },\lambda ^*_{\eta })\) is a saddle-point of the smoothed Lagrangian \(\mathcal {L}_{\eta ,0}\), then (a) \(d_{-}(g(\textbf{x}^*_{\eta })) \le \eta \beta \Vert \textbf{1}\Vert \) and (b) \((\textbf{x}^*_{\eta },\lambda ^*_{\eta })\) is an \(\mathcal {O}(\eta )\)-saddle-point of \(\mathcal {L}_{0}\).]

Proof

(a) Suppose \(\textbf{x}_{\eta }^* \, \in \, \mathcal{X}\) is a feasible solution of (NSCopt\(_{\eta }\)). Then \(g_{\eta }(\textbf{x}^*_{\eta }) \le 0\). Furthermore, \(g(\textbf{x}_{\eta }^*) \le g_{\eta }(\textbf{x}_{\eta }^*) + \eta \beta \textbf{1} \le \eta \beta \textbf{1}\), implying that \(d_{-}(g(\textbf{x}_{\eta }^*)) \le \eta \beta \Vert \textbf{1}\Vert .\)

(b) The dual optimal set \(\varLambda _{\eta }^*\) is nonempty and bounded as per Proposition 1. Let \((\textbf{x}_{\eta }^*,\lambda _{\eta }^*)\) be a saddle point of \(\mathcal {L}_{\eta ,0}(\cdot ,\cdot )\). We now proceed to show that \((\textbf{x}_{\eta }^*,\lambda _{\eta }^*)\) is an approximate saddle-point of \(\mathcal{L}_0\).

$$\begin{aligned} {\mathcal {L}_{{0}}}&({\textbf {x}}_{\eta }^*,\lambda _{\eta }^*) = f({\textbf {x}}_{\eta }^*) + (\lambda ^*_{\eta })^\top g({\textbf {x}}_{\eta }^*) \le f_{\eta }({\textbf {x}}_{\eta }^*) +\eta {\beta }+ (\lambda ^*_{\eta })^\top g_{\eta }({\textbf {x}}_{\eta }^*) + \eta b_{\lambda ,\eta }{\beta }\Vert {\textbf {1}}\Vert \\ &= {{\mathcal {L}_{0,\eta }}({\textbf {x}}_{\eta }^*,\lambda ^*_{\eta }) + \eta \beta (1+b_{\lambda ,\eta } m)} \le {\mathcal {L}_{0,\eta }}({\textbf {x}},\lambda ^*_{\eta }) + \eta \beta (1+b_{\lambda ,\eta } m) \text { for } \text { all } {\textbf {x}}\in \mathcal {X} \\ &\overset{\tiny {-(\lambda ^*_{\eta })^\top g({\textbf {x}}) \le 0}}{=}\mathcal {L}_{{0}}({\textbf {x}},\lambda ^*_{\eta }) + f_{\eta }({\textbf {x}}) - f({\textbf {x}}) + (\lambda ^*_{\eta })^\top (g_{\eta }({\textbf {x}})-g({\textbf {x}})) + \eta \beta (1+b_{\lambda ,\eta } m)\\ &\le {\mathcal {L}_{{0}}({\textbf {x}},\lambda ^*_{\eta }) + \eta \beta (1+b_{\lambda ,\eta } m)} \text { for } \text { all } {\textbf {x}}\in \mathcal {X}. \end{aligned}$$

The final result follows from the following sequence of inequalities.

$$\begin{aligned} \mathcal {L}_{{0}}({\textbf {x}}_{\eta }^*,\lambda _{\eta }^*)&= f({\textbf {x}}_{\eta }^*) + (\lambda ^*_{\eta })^\top g({\textbf {x}}_{\eta }^*) \ge f_\eta ({\textbf {x}}_{\eta }^*) + (\lambda _{\eta }^*)^\top \left( g_{\eta }({\textbf {x}}_{\eta }^*)\right) \\ &= \mathcal {L}_{0,\eta }({\textbf {x}}_{\eta }^*,\lambda _{\eta }^*) \ge \mathcal {L}_{0,\eta }({\textbf {x}}_{\eta }^*,\lambda ) \quad \text{ for } \text{ all } \lambda \in \mathbb {R}^m_+ \\ &= \mathcal {L}_{{0}}({\textbf {x}}_{\eta }^*,\lambda ) + f_{\eta }({\textbf {x}}_{\eta }^*) - f({\textbf {x}}_{\eta }^*) + \lambda ^\top \left( g_{\eta }({\textbf {x}}_{\eta }^*) - g({\textbf {x}}_{\eta }^*)\right) \quad \\ &\ge \mathcal {L}_{{0}}({\textbf {x}}_{\eta }^*,\lambda ) - \eta \beta (1 + m \Vert \lambda \Vert ) \\ &\ge \mathcal {L}_{{0}}({\textbf {x}}_{\eta }^*,\lambda ) - \eta \beta \big (1 + m \max \{b_{\lambda ,{\eta }}, \Vert \lambda \Vert \}\big ) \quad \forall \lambda \in \mathbb {R}^m_+. \end{aligned}$$

\(\square \)

The following Lemma 5 shows the relation between \(q_{\eta ,\rho }(\bullet )\) and \(q_{\rho }(\bullet )\).

Lemma 5

For any \(\lambda \in \mathbb {R}_{+}^m\), the following hold:

(i) \(\Vert q_{\eta ,\rho }(\lambda ) - q_{\rho }(\lambda )\Vert \le \sqrt{{4}\rho \eta (\Vert \lambda \Vert m+C_m) {\beta }};\)

(ii) \(\Vert \nabla _{\lambda } \mathcal {D}_{\eta ,\rho }(\lambda )-\nabla _{\lambda } \mathcal {D}_{\rho }(\lambda )\Vert = \tfrac{1}{\rho }\Vert q_{\eta ,\rho }(\lambda ) - q_{\rho }(\lambda )\Vert \le \sqrt{\tfrac{{4} \eta (\Vert \lambda \Vert m+C_m) {\beta } }{\rho }}.\) \(\Box \)

We now formally state the smoothed AL scheme. The traditional ALM relies on solving the subproblem exactly or \(\epsilon _k\)-inexactly at epoch k. However, in regimes with nonsmooth constraints, the AL subproblem is nonsmooth, precluding the use of accelerated gradient methods and leading to far poorer performance. Our proposed scheme solves a sequence of \(\eta _k\)-smoothed AL subproblems, each to within an error tolerance of \(\epsilon _k \eta _k^b\), where \(b\ge 0\). A formal statement of the scheme is provided next.

[Algorithm (Sm-AL): given \(\textbf{x}_0 \in \mathcal {X}\), \(\lambda _0 \ge 0\), and sequences \(\{\epsilon _k, \eta _k, \rho _k\}\), at epoch k: [1] compute \(\textbf{x}_{k+1}\), an \(\epsilon _k\eta _k^b\)-minimizer of \(\mathcal {L}_{\eta _k,\rho _k}(\bullet ,\lambda _k)\) over \(\mathcal {X}\); [2] update the multiplier \(\lambda _{k+1}\) via a dual step on the smoothed AL (see Lemma 6 and (14)).]

Observe that step [1] requires that \(\textbf{x}_{k+1}\) is an \(\epsilon _k \eta _k^b\)-minimizer of the AL subproblem, given by

$$\begin{aligned} \min _{\textbf{x} \in \mathcal {X}} \ \mathcal{L}_{\eta _k,\rho _k}(\textbf{x},\lambda _k), \end{aligned}$$

where \(\mathcal{D}_{\eta _k,\rho _k}(\lambda _k) = \min _{\textbf{x} \in \mathcal {X}} \mathcal{L}_{\eta _k,\rho _k}(\textbf{x},\lambda _k).\) Since we have rate guarantees for the accelerated scheme applied to the subproblem, we can determine the minimum number of gradient steps that ensures that \(\epsilon _k \eta _k^b\)-suboptimality holds. The Lagrange multiplier update can be expressed as follows (cf. [2]).

Lemma 6

Consider the smoothed augmented Lagrangian scheme (Sm-AL). Then for any \(k > 0\), step [2] is equivalent to the following equation.

$$\begin{aligned} \lambda _{k+1} = \varPi _+\left[ \, {\lambda _k}+{\rho _k}g_{\eta _k}(\textbf{x}_{k+1})\, \right] . \end{aligned}$$
(14)
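To fix ideas, the following Python sketch gives a minimal rendition of (Sm-AL) under simplifying assumptions: the inner solver is a generic FISTA-type accelerated projected-gradient method (standing in for the scheme of Section 4), the smoothness estimates \(L_k\) and inner step counts \(M_k\) are supplied by the caller rather than derived from the rate guarantees, and `smooth_problem`, `proj_X`, and the parameter sequences are user-supplied placeholders. Step [2] is implemented directly as the update (14).

```python
import numpy as np

def fista(grad_h, proj_X, x0, L, steps):
    """Accelerated projected gradient for min_{x in X} h(x), h convex and L-smooth."""
    x_prev, y, t = x0, x0.copy(), 1.0
    for _ in range(steps):
        x = proj_X(y - grad_h(y) / L)                        # projected gradient step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)          # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev

def sm_al(x0, lam0, proj_X, smooth_problem, rho_seq, eta_seq, L_seq, M_seq):
    """Sketch of the (Sm-AL) outer loop; smooth_problem(eta) must return
    (grad_f_eta, g_eta, jac_g_eta) for the eta-smoothed problem data."""
    x, lam = x0, lam0
    for rho, eta, L, M in zip(rho_seq, eta_seq, L_seq, M_seq):
        grad_f_eta, g_eta, jac_g_eta = smooth_problem(eta)

        def grad_L(z):
            # grad_x of L_{eta,rho}(., lam); cf. Lemma 1 with f, g replaced by f_eta, g_eta
            return grad_f_eta(z) + jac_g_eta(z).T @ np.maximum(lam + rho * g_eta(z), 0.0)

        x = fista(grad_L, proj_X, x, L, M)                   # step [1], eps_k*eta_k^b-inexact
        lam = np.maximum(lam + rho * g_eta(x), 0.0)          # step [2], i.e., update (14)
    return x, lam
```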

The next assumption holds for parameter sequences employed in (Sm-AL). Unless mentioned otherwise, Assumptions 1 and 2 hold throughout.

[Assumption 2: requirements on the parameter sequences \(\{\epsilon _k, \eta _k, \rho _k\}\) employed in (Sm-AL), ensuring in particular that the summations invoked in Lemmas 8 and 9 are finite.]

While our rate guarantees for the schemes responsible for resolving the subproblem as well as the outer (dual) problem allow for defining precise lower bounds on the number of steps required, this computational requirement relies on a worst-case analysis. In addition, we may attempt to check whether the sub-optimality requirement is met at some intermediate step. However, it is not obvious how to check sub-optimality in the current setting, since the optimal value corresponding to either the subproblem or the outer-level problem is unavailable. Instead, we appeal to a residual function and consider such an approach next. We emphasize that such a potential early termination of either the subproblem solver or the outer scheme may have computational benefits.

2.3 Termination Criteria

Our inexact augmented Lagrangian framework relies on utilizing inexact solutions to the Lagrangian subproblem, obtained by taking a finite but increasing number of gradient-based steps and leveraging the rate guarantees for accelerated gradient methods. However, we may well meet the required accuracy prior to taking the prescribed number of gradient steps by checking a suitable condition. Such a condition is by no means immediate, since a naive assessment of accuracy requires knowing the optimal value of the subproblem; instead, we leverage a residual function and present the resulting analysis next for both the inner and outer loops.

(I). Termination criterion for inner loop. The inner loop at iteration k terminates when \(\textbf{x}_{k+1}\) satisfies the following \(\epsilon _k \eta _k^b\)-optimality requirement, where \(\epsilon _k\) is a positive accuracy threshold at iteration k, \(\eta _k\) is the smoothing parameter at iteration k, and b is a nonnegative scalar that is defined subsequently in the complexity analysis.

$$\begin{aligned} \mathcal{L}_{\eta _k,\rho _k}(\textbf{x}_{k+1},\lambda _k) - \mathcal{D}_{\eta _k,\rho _k}(\lambda _k) \, \le \, \epsilon _k \eta _k^b. \end{aligned}$$
(15)

In effect, we may view the minimization of the augmented Lagrangian function as the following convex problem, defined as

$$\begin{aligned} \min _{\textbf{x}\, \in \, \mathcal{X}} \ h(\textbf{x}) \triangleq \mathcal{L}_{\eta _k,\rho _k}(\textbf{x},\lambda _k), \end{aligned}$$
(Opt)

where h is a convex and smooth function on \(\mathcal{X}\), a closed and convex set. We proceed to show that (15) is equivalent to \(\textbf{x}_{k+1}\) approximately satisfying the following variational inequality condition.

$$\begin{aligned} \nabla _{\textbf{x}} \mathcal{L}_{\eta _k,\rho _k}(\textbf{x}_{k+1},\lambda _k)^\top (\textbf{y} - \textbf{x}_{k+1}) \ge -\epsilon _k \eta _k^b \qquad \forall \textbf{y} \, \in \, \mathcal {X}. \end{aligned}$$
(16)

In fact, we now develop a verifiable condition whose satisfaction implies (16).

Lemma 7

Consider the problem (Opt). Suppose \(\Vert \textbf{y}\Vert ^2 \le C\) and \(\Vert \nabla h(\textbf{y})\Vert ^2 \le D\) for any \(\textbf{y}\in \mathcal {X}\) and \(\gamma \) is any positive scalar. Consider the following statements.

(a) \(\textbf{x}^*_{\epsilon }\) is an \(\epsilon \)-optimal solution of (Opt).

(b) \(\nabla h(\textbf{x}^*_{\epsilon })^\top (\textbf{y}-\textbf{x}^*_{\epsilon }) \, \ge \,- \epsilon , \quad \forall \, \textbf{y}\, \in \, \mathcal {X}.\)

(c) There exist \(\textbf{u}\in \mathcal{X}\) and \(\textbf{x}^*_{\epsilon } \in \mathcal{X}\) such that \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X} (\textbf{x}^*_{\epsilon },\textbf{u}) = 0\), where \(F^{{{{\textrm{nat}}}},\tilde{\epsilon }}_\mathcal{X}(\bullet ,\bullet )\) represents the perturbed natural map with a chosen parameter \(\gamma \), defined as

$$\begin{aligned} F^{{{{\textrm{nat}}}},\tilde{\epsilon }}_\mathcal{X}(\textbf{x},\textbf{u})\, \triangleq \, \tilde{\epsilon } \left( \, \textbf{u}- \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] \,\right) - \textbf{x}+ \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] . \end{aligned}$$

Then the following hold.

(i) \((a) \, \iff \, (b)\);

(ii) \((c) \implies (b)\), where \(\tilde{\epsilon } \, = \frac{\gamma \epsilon }{7C + \gamma (C+D)}\) and \(\epsilon <\frac{7C + \gamma (C+D)}{\gamma }\). \(\Box \)

Observe that the perturbed natural map is rooted in the natural map, a residual function for variational inequality problems [14]. When specialized to the setting of the smooth convex optimization problem

$$\begin{aligned} \min _{\textbf{x}\in \mathcal{X}} f(\textbf{x}), \end{aligned}$$
(COpt)

we have that

$$ \left[ \, \textbf{x}^* \text{ solves } \text{(COpt) } \, \right] \, \iff \, \left[ F^{{\textrm{nat}}}_\mathcal{X}(\textbf{x}^*) \triangleq \, \textbf{x}^* - \varPi _X \left[ \textbf{x}^* - \gamma \nabla f(\textbf{x}^*) \, \right] = 0 \, \right] . $$

The lemma above develops a suitably defined \(\tilde{\epsilon }\)-perturbed counterpart of \(F^{{\textrm{nat}}}_\mathcal{X}\), denoted by \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}\). We observe that \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}(\textbf{x},\textbf{x})\) reduces to

$$\begin{aligned} F^{{\textrm{nat}},\tilde{\epsilon }}_X(\textbf{x},\textbf{x})\, \triangleq \, \left( 1-\tilde{\epsilon }\right) \left( \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] -\textbf{x}\right) . \end{aligned}$$
(17)

In other words, for any \(\tilde{\epsilon } < 1\),

$$\begin{aligned} F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}(\textbf{x},\textbf{x}) \, = \, 0 \quad \iff F^{{\textrm{nat}}}_\mathcal{X}(\textbf{x}) \, = \, 0, \end{aligned}$$
(18)

where \(F^{{\textrm{nat}}}_\mathcal{X}(\textbf{x}) \triangleq - \textbf{x}+ \varPi _\mathcal{X}\left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] .\) Based on the aforementioned result, in the kth iteration, this termination criterion reduces to

$$\begin{aligned} (\textbf{T1}) \begin{aligned}\quad&\, \left\| \, \tilde{\epsilon }_k \textbf{u}+ \left( 1-\tilde{\epsilon }_k\right) \varPi _\mathcal{X} \left[ \, \textbf{x}_{k+1} - \gamma \nabla _{\textbf{x}} \mathcal{L}_{\eta _k,\rho _k}(\textbf{x}_{k+1},\lambda _k)\, \right] - \textbf{x}_{k+1} \, \right\| \, = 0, \end{aligned} \end{aligned}$$
(19)

where \(\tilde{\epsilon }_k = \tfrac{\gamma \epsilon _k\eta _k^b}{{7C+ \gamma (C+D)}}\) and \(\textbf{u}\in \mathcal{X}\).
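In practice, the equality in (T1) is checked up to a small numerical tolerance rather than exactly. The following Python sketch (ours; `grad_L`, `proj_X`, and the probe point \(\textbf{u} \in \mathcal {X}\) are supplied by the caller) evaluates the perturbed natural-map residual and the resulting stopping test.

```python
import numpy as np

def t1_residual(x_next, u, grad_L, proj_X, gamma, eps_tilde):
    """Norm of the perturbed natural-map expression appearing in (T1)."""
    p = proj_X(x_next - gamma * grad_L(x_next))          # Pi_X[ x - gamma * grad L ]
    return np.linalg.norm(eps_tilde * u + (1.0 - eps_tilde) * p - x_next)

def t1_satisfied(x_next, u, grad_L, proj_X, gamma, eps_k, eta_k, b, C, D, tol=1e-10):
    # eps_tilde_k as defined above; C and D bound ||y||^2 and ||grad h(y)||^2 over X
    eps_tilde = gamma * eps_k * eta_k**b / (7.0 * C + gamma * (C + D))
    return t1_residual(x_next, u, grad_L, proj_X, gamma, eps_tilde) <= tol
```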

(II) Termination criterion for outer loop. Here we consider two settings.

(a) Constant penalty parameter. In setting (a), the outer scheme terminates when

$$\begin{aligned}&\left| \, f(\bar{\textbf{x}}_K)-f^*\, \right| \, \le \, \tfrac{C_1}{\sqrt{K}}+\eta _K\beta \text{ and } d_{-}\left( g(\bar{\textbf{x}}_K)\right) \, \le \, \tfrac{C_2}{\sqrt{K}} +m\eta _K\beta , \end{aligned}$$

where \(C_1 \triangleq B_5\), \(C_2 \triangleq B_4\), and the constants \(B_3, B_4, B_5, B_6\) are defined in Table 3. Since we have access to \(g(\bullet )\), it is easy to check \(d_{-}\left( g(\textbf{x}_K)\right) \le \sqrt{\epsilon }\). However, evaluating \(f(\bar{\textbf{x}}_K) - f^*\) is not directly possible, since \(f^*\) is unavailable. Since f is nonsmooth, we apply Lemma 7 to the optimality gap of the smoothed problem, \(|f_{\eta _K}(\bar{\textbf{x}}_K)-f_{\eta _K}^*|\), since it is related to the true optimality gap; i.e., by leveraging the property of smoothability of f,

$$\begin{aligned} f(\bar{\textbf{x}}_K) - f(\textbf{x}^*) \, \le \, f_{\eta _K}(\bar{\textbf{x}}_K) - f_{\eta _K}(\textbf{x}^*) + \eta _K B \, \le \, \left| \, f_{\eta _K}(\bar{\textbf{x}}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| + \eta _K B \\ f(\textbf{x}^*) - f(\bar{\textbf{x}}_K) \, \le \, f_{\eta _K}(\textbf{x}^*) - f_{\eta _K}(\bar{\textbf{x}}_K) + \eta _K B\, \le \, \left| \, f_{\eta _K}(\bar{\textbf{x}}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| + \eta _K B \\ \implies \left| \, f(\bar{\textbf{x}}_K) - f(\textbf{x}^*) \,\right| \, \le \left| \, f_{\eta _K}(\bar{\textbf{x}}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| + \eta _K B. \end{aligned}$$

Consequently, it suffices to get a bound on each term on the right. To get a bound on \(\left| \, f_{\eta _K}(\bar{\textbf{x}}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| \) given \(\hat{\textbf{x}} \in \mathcal{X}\), we leverage the following residual function:

$$\begin{aligned} G_{\mathcal {X}}^{\text {nat},\tilde{\epsilon }_K}(\textbf{x}_{K+1},\hat{\textbf{x}}) \triangleq \tilde{\epsilon }_K \hat{\textbf{x}} + (1-\tilde{\epsilon }_K)\varPi _{X}\left[ \textbf{x}_{K+1}-\gamma \nabla \mathcal{L}_{\eta _K}(\textbf{x}_{K+1})\,\right] - \textbf{x}_{K+1}, \end{aligned}$$

where \(\tilde{\epsilon }_K \triangleq \tfrac{\gamma C_1}{({7C + \gamma (C+D)})\sqrt{K}}\), and C and D are as defined in Lemma 7. (We can set the values of \(\eta _K\) such that the overall optimality gap (\(|f-f^*|\)) remains controlled below a tighter error tolerance \(\epsilon ^2\) to ensure consistency with our complexity analysis.) Therefore, we may employ the following termination criterion (T2) at the Kth iterate.

$$\begin{aligned} (\textbf{T2}) \quad \left\| \, G^{{\textrm{nat}},\tilde{\epsilon }_K}_\mathcal{X}(\textbf{x}_{K+1},\hat{\textbf{x}}) \, \right\| \, =\, 0 \text{ and } d_{-}\left( g(\textbf{x}_K)\right) \le \tilde{\epsilon }_K. \end{aligned}$$
(20)

(b) Increasing penalty parameter. In setting (b), the outer scheme terminates when

$$\begin{aligned}&\left| \, f(\textbf{x}_K)-f^*\, \right| \, \le \, \tfrac{C_1}{\rho _K} +\eta _K\beta \text{ and } d_{-}\left( g(\textbf{x}_K)\right) \, \le \, \tfrac{C_2}{\rho _K} + m\eta _K\beta , \end{aligned}$$

where \(C_1\triangleq {B_7}\) and \(C_2 \triangleq {B_8}\) are defined in Table 3. While it is easy to check \(d_{-}\left( g(\textbf{x}_K)\right) \le \epsilon \), since \(f^*\) is unavailable and f is nonsmooth, we apply Lemma 7 to the optimality gap of the smoothed problem, \(|f_{\eta _K}(\textbf{x}_K)-f_{\eta _K}^*|\), since it is related to the true optimality gap; i.e., by leveraging the property of smoothability of f, similar to the previous analysis, \(\left| \, f(\textbf{x}_K) - f(\textbf{x}^*) \,\right| \, \le \left| \, f_{\eta _K}(\textbf{x}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| + \eta _K B.\) Consequently, it suffices to get a bound on both terms on the right. To get a bound on \(\left| \, f_{\eta _K}(\textbf{x}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| \) given \(\hat{\textbf{x}} \in \mathcal {X}\), we leverage the following residual function:

$$\begin{aligned} G_{\mathcal {X}}^{\text {nat},\tilde{\epsilon }_K}(\textbf{x}_{K+1},\hat{\textbf{x}}) \triangleq \tilde{\epsilon }_{{K}} \hat{\textbf{x}} + (1-\tilde{\epsilon }_K)\varPi _{X}\left[ \textbf{x}_{K+1}-\gamma \nabla \mathcal{L}_{\eta _K}(\textbf{x}_{K+1})\,\right] - \textbf{x}_{K+1}, \end{aligned}$$

where \(\tilde{\epsilon }_K \triangleq \tfrac{\gamma C_1}{({7C + \gamma (C+D)})\rho _{K}}\), and CD are as defined in Lemma 7. Akin to earlier, we may set the value of \(\eta _K\) such that the overall optimality gap (\(|f-f^*|\)) remains controlled below \(\epsilon \). Therefore, we may employ the following termination criterion (T2) at the Kth iterate.

$$\begin{aligned} (\textbf{T2}) \quad \left\| \, G^{{\textrm{nat}},\tilde{\epsilon }_K}_\mathcal{X}(\textbf{x}_{K+1},\hat{x}) \, \right\| \, =\, 0 \text{ and } d_{-}\left( g(\textbf{x}_K)\right) \le \tilde{\epsilon }_K. \end{aligned}$$
(21)

The modified algorithm statement should read as follows.

[Algorithm (Sm-AL with termination checks): identical to (Sm-AL), except that the inner solver may exit before the prescribed \(M_k\) steps once (T1) holds, and the outer loop exits once (T2) holds.]

Note that the subproblem solver is essentially an accelerated gradient scheme introduced in Section 4; the minimum number of steps prescribed by the rate guarantees is denoted by \(M_k\) and is derived in Section 4.

3 Rate Analysis

In this section, we analyze the rate of convergence of (Sm-AL). In Subsection 3.1, we provide some preliminaries, and we then derive rate statements for constant and increasing penalty parameters in Subsections 3.2 and 3.3, respectively.

3.1 Preliminary results

We begin by recalling the following bound, an extension of the result proved in [38, Lemma 4.3].

Lemma 8

Let \(\{\textbf{x}_{k},\lambda _{k}\}\) be generated by (Sm-AL). For any \(k \ge 0\), suppose \(\textbf{x}_{k+1}\) satisfies \(\mathcal {L}_{\eta _k,\rho _k}(\textbf{x}_{k+1},\lambda _k) - {\mathcal {D}_{\eta _k,\rho _k}}(\lambda _k) \le \epsilon _k\eta _k^b\) where \(b\ge 0\). Then for \(k \ge 0,\)

$$\begin{aligned} \left\| \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho _k} (x_{k+1},\lambda _k) - \nabla _{\lambda } \mathcal {D}_{\eta _k,\rho _k}(\lambda _k) \right\| ^2 \le \tfrac{2\epsilon _k\eta _k^b}{\rho _k}. \end{aligned}$$
(22)

By choosing appropriate sequences \(\{\epsilon _k,\eta _k,\rho _k\}\), \(\{(2\epsilon _k\eta _k^b)/\rho _k\}\) is diminishing (see Lemma 8). We now derive a uniform bound on the sequence \(\{\lambda _k\}\).

Lemma 9

(Bound on \(\lambda _k\)) Consider \(\{\lambda _k\}\) generated by (Sm-AL).

(a) \(\{ \lambda _k\}\) is a convergent sequence. (b) For any K, we have

$$\Vert \lambda _K - \lambda ^*\Vert \le \sum _{k=0}^{\infty } \left( \sqrt{2\rho _k \epsilon _k{\eta _k^b}} +2\sqrt{\eta _k \rho _k({\Vert \lambda ^*\Vert }m+{C_m})\beta } \right) + \Vert \lambda _0-\lambda ^*\Vert { \, \triangleq \, B_{\lambda }}.$$

3.2 Rate analysis under constant \(\rho _k\)

Next, we derive rate statements for the dual sub-optimality and primal infeasibility when \(\rho _k = \rho \) for all k. Our first result relies on the observation that the augmented dual function \(\mathcal{D}_{\rho }\) has the same set of optimal solutions (and supremum) as the original dual function \(\mathcal{D}_0\) (see [38, Th. 3.2]).

[Proposition (dual suboptimality under constant penalty): with \(\bar{\lambda }_K \triangleq \tfrac{1}{K}\sum _{i=1}^{K}\lambda _i\), \( f(\textbf{x}^*) - \mathcal {D}_{\rho }(\bar{\lambda }_K) \, \le \, \tfrac{1}{2\rho K}\Vert \lambda _0-\lambda ^*\Vert ^2 + \tfrac{B_{\lambda }}{K}\sum _{k=0}^{K-1}\sqrt{\tfrac{2\epsilon _k\eta _k^b}{\rho }} + \tfrac{B_2}{K}\sum _{k=0}^{K-1}\eta _k\).]

Proof

Recall that \(\mathcal {D}_{\eta _k,\rho }\) is the Moreau envelope of \(\mathcal{D}_{\eta _k,0}\). Consequently, \(\nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }\) is \(\tfrac{1}{\rho }\)-Lipschitz. We then have

$$\begin{aligned} -\mathcal{D}_{\eta _k,\rho }&(\lambda _{k+1}) \le -\mathcal{D}_{\eta _k,\rho }(\lambda _k) - \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)^\top (\lambda _{k+1}-\lambda _{k}) + \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2 \\&= -\mathcal{D}_{\eta _k,\rho }(\lambda _k) - \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)^\top {(\lambda _{k+1}-\lambda ^*)}{- \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)^\top (\lambda ^*-\lambda _{k})} \\&+ \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2 \\&\le -\mathcal{D}_{\eta _k,\rho }(\lambda _k) - \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)^\top {(\lambda _{k+1}-\lambda ^*)}{+ (\mathcal{D}_{\eta _k,\rho }(\lambda _k) - \mathcal{D}_{\eta _k,\rho }(\lambda ^*))} \\&+ \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2 \\&{=} -\mathcal{D}_{\eta _k,\rho }(\lambda ^*) - \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)^\top (\lambda _{k+1}-\lambda ^*) + \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2, \end{aligned}$$

where \(-\mathcal{D}_{\eta _k,\rho }(\lambda ^*) \ge -\mathcal{D}_{\eta _k,\rho }(\lambda _k) - \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)^\top (\lambda ^*-\lambda _k).\) By adding and subtracting \(\nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k)^\top (\lambda _{k+1}-\lambda ^*) \), it follows that

$$\begin{aligned} -\mathcal{D}_{\eta _k,\rho }&(\lambda _{k+1}) { \,\le \,} -\mathcal{D}_{\eta _k,\rho }(\lambda ^*) - \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k)^\top (\lambda _{k+1}-\lambda ^*) + \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2 \\&\quad - \left( \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k)\right) ^\top (\lambda _{k+1}-\lambda ^*) \\&{\, = \, } -\mathcal{D}_{\eta _k,\rho }(\lambda ^*) - \tfrac{1}{\rho }(\lambda _{k+1}-\lambda _k)^\top (\lambda _{k+1}-\lambda ^*) + \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2 \\&\quad - \left( \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k)\right) ^\top (\lambda _{k+1}-\lambda ^*) \\&\le -\mathcal{D}_{\eta _k,\rho }(\lambda ^*) - \tfrac{1}{\rho }(\lambda _{k+1}-\lambda _k)^\top (\lambda _{k+1}-\lambda ^*)+ \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2 \\&\quad + \Vert \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \Vert \Vert \lambda _{k+1}-\lambda ^* \Vert \\&= -\mathcal{D}_{\eta _k,\rho }(\lambda ^*) + \tfrac{1}{{2\rho }} (\Vert \lambda _k-\lambda ^*\Vert ^2 - \Vert \lambda _{k+1}-\lambda ^*\Vert ^2) \\&\quad + \Vert \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \Vert \Vert \lambda _{k+1}-\lambda ^* \Vert \\&\le -\mathcal{D}_{{\rho }}(\lambda ^*) + \eta _k({\Vert \lambda ^*\Vert }m+1)\beta + \tfrac{1}{{2\rho }} (\Vert \lambda _k-\lambda ^*\Vert ^2 - \Vert \lambda _{k+1}-\lambda ^*\Vert ^2) \\&\quad + \Vert \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \Vert \Vert \lambda _{k+1}-\lambda ^* \Vert , \end{aligned}$$

where the last inequality follows from Lemma 4(iii). By invoking Lemma 9, and \(\Vert \lambda _k\Vert +\Vert \lambda ^*\Vert \le {\Vert \lambda _k-\lambda ^*\Vert +\Vert \lambda ^*\Vert } +\Vert \lambda ^*\Vert \le B_{\lambda } + 2b_{\lambda } {\, \triangleq \, \tilde{B}_{\lambda }}\), we obtain

$$\begin{aligned} -\mathcal{D}_{\rho }(\lambda _{k+1})&\le -\mathcal{D}_{\rho }(\lambda ^*) +{\eta _k(\Vert \lambda _{k+1}\Vert m+1)\beta }+ {\eta _k(\Vert \lambda ^*\Vert m+1)\beta }\\&\quad + \tfrac{1}{{2\rho }} (\Vert \lambda _k-\lambda ^*\Vert ^2 - \Vert \lambda _{k+1}-\lambda ^*\Vert ^2) \\&\quad + \Vert \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \Vert \Vert \lambda _{k+1}-\lambda ^* \Vert \\&{ \le -\mathcal{D}_{\rho }(\lambda ^*) +\eta _k({\tilde{B}_{\lambda }}m+1)\beta } + \tfrac{1}{2\rho } (\Vert \lambda _k-\lambda ^*\Vert ^2 - \Vert \lambda _{k+1}-\lambda ^*\Vert ^2) \\&\quad + \Vert \nabla _{\lambda } D_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \Vert \Vert \lambda _{k+1}-\lambda ^* \Vert . \end{aligned}$$

By summing from \(k = 0,\cdots ,K-1\) and dividing by K, we obtain

$$\begin{aligned}&-\left( \tfrac{1}{K}{\sum _{i=0}^{K-1}\mathcal{D}_{\rho }(\lambda _{i+1})} - {\mathcal{D}_{\rho }(\lambda ^*)}\right) \nonumber \\&\le \tfrac{1}{2\rho K} (\Vert \lambda _0-\lambda ^*\Vert ^2 - \Vert \lambda _{K}-\lambda ^*\Vert ^2) + \tfrac{1}{K} \sum _{k=0}^{K-1}{\eta _k ({\tilde{B}_{\lambda }} m+1)\beta }\nonumber \\&\quad +\tfrac{1}{K}\sum _{k = 0}^{K-1}\left\| \nabla _{\lambda } D_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \right\| \Vert \lambda _{k+1}-\lambda ^*\Vert \nonumber \\&\le \tfrac{1}{2\rho K}\Vert \lambda _0-\lambda ^*\Vert ^2 + \tfrac{{B_{\lambda }}}{K} \sum _{k=0}^{K-1} \tfrac{\sqrt{2\epsilon _k\eta _k^b}}{\sqrt{\rho }} +\tfrac{{B_2}}{K} \sum _{k=0}^{K-1}\eta _k, \end{aligned}$$
(23)

where boundedness of \(\lambda _k\) follows from Lemma 9 and \(\tilde{B}_{\lambda }, {B_\lambda , B_2}\) are constants. Consequently, by invoking the concavity of \(\mathcal{D}_{\rho }\), we may bound the term on the left to obtain the required inequality, where \(\bar{\lambda }_K = \tfrac{1}{K}\sum _{i=1}^{K} \lambda _i\).

$$\begin{aligned} -\left( {\mathcal{D}_{\rho }(\bar{\lambda }_K) - \mathcal{D}_{\rho }(\lambda ^*)}\right)&\le \tfrac{1}{2\rho K} (\Vert \lambda _0-\lambda ^*\Vert ^2 - \Vert \lambda _{K}-\lambda ^*\Vert ^2) + \tfrac{1}{K} \sum _{k=0}^{K-1}{\eta _k ({\tilde{B}_{\lambda }} m+1)\beta }\nonumber \\&\quad +\tfrac{1}{K}\sum _{k = 0}^{K-1}\left\| \nabla _{\lambda } D_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \right\| \Vert \lambda _{k+1}-\lambda ^*\Vert \nonumber \\&\le \tfrac{1}{2\rho K}\Vert \lambda _0-\lambda ^*\Vert ^2 + \tfrac{{B_{\lambda }}}{K} \sum _{k=0}^{K-1} \tfrac{\sqrt{2\epsilon _k\eta _k^b}}{\sqrt{\rho }} +\tfrac{{B_2}}{K} \sum _{k=0}^{K-1}\eta _k. \end{aligned}$$
(24)

The final result follows by noting that \(\mathcal{D}_{\rho }\) is the Moreau envelope of \(\mathcal{D}_0\) and strong duality holds, implying that \(\mathcal{D}_{\rho }(\lambda ^*) = \mathcal{D}_{0}(\lambda ^*) = f(\textbf{x}^*)\). \(\square \)

Next, we derive a rate statement on the infeasibility.

[Proposition (primal infeasibility under constant penalty): a bound on \(d_{-}(g(\bar{\textbf{x}}_K))\) of order \(\mathcal {O}(1/\sqrt{K})\) under suitable choices of \(\{\epsilon _k,\eta _k\}\), established via (26) and the dual suboptimality bound above.]

Proof

We have that \(g_{\eta _k}(\textbf{x}_{k+1})\) can be expressed as

$$\begin{aligned} g_{\eta _k}(\textbf{x}_{k+1}) = \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _{k}) + \left( \varPi _{-} \left( \tfrac{\lambda _{{k}}}{\rho } + g_{\eta _k}(\textbf{x}_{k+1})\right) \right) . \end{aligned}$$

Recall that \(d_{-}(u+v) \le d_-(u) + \Vert v\Vert \) for any \(u,v \in \mathbb {R}^m\). Consequently,

$$\begin{aligned} d_-(g_{\eta _k}(\textbf{x}_{k+1}))&\le \Vert \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _{{k}})\Vert + \underbrace{d_- \left( \varPi _{-} \left( \tfrac{\lambda _{k}}{\rho } + g_{\eta _k}(\textbf{x}_{k+1}) \right) \right) }_{=0} \nonumber \\&= \Vert \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k)\Vert . \end{aligned}$$
(25)

By definition of \(d_{-}(\bullet )\), convexity of \(\max \{g_j(\bullet ),0\}\), and \(\Vert u\Vert _2 \le \Vert u\Vert _1 \le \sqrt{m}\Vert u\Vert _2\),

$$\begin{aligned}&d_{-}(g(\bar{\textbf{x}}_K)) \, = \, \inf _{u \in \mathbb {R}^m_-} \Vert g(\bar{\textbf{x}}_K) - u \Vert _2 \, \le \, \inf _{u \in \mathbb {R}^m_-} \Vert g(\bar{\textbf{x}}_K) - u\Vert _1 \, = \, \sum _{j=1}^m \inf _{u_j \le 0} \left| g_j(\bar{\textbf{x}}_K) - u_j\right| \nonumber \\&\, = \sum _{j=1}^m \max \{g_j(\bar{\textbf{x}}_K),0\} \, \le \, \tfrac{1}{K} \sum _{i=0}^{K-1} \sum _{j=1}^m \max \{g_j(\textbf{x}_{i+1}),0\} \nonumber \\&\le \tfrac{1}{K} \sum _{i=0}^{K-1} \sum _{j=1}^m \max \{g_{j,\eta _i}(\textbf{x}_{i+1})+\eta _i \beta ,0\} = \, \tfrac{1}{K} \sum _{i=0}^{K-1} \inf _{u \in \mathbb {R}^m_-} \Vert g_{\eta _i}(\textbf{x}_{i+1}) + \eta _i \beta \textbf{1} - u \Vert _1 \nonumber \\&\le \tfrac{1}{K} \sum _{i=0}^{K-1} \inf _{u \in \mathbb {R}^m_-} \sqrt{m} \Vert g_{\eta _i}(\textbf{x}_{i+1}) + \eta _i \beta \textbf{1} - u \Vert _2 \, = \, \tfrac{\sqrt{m}}{K} \sum _{i=0}^{K-1} d_{-} (g_{\eta _i}(\textbf{x}_{i+1})+\eta _i \beta \textbf{1})\nonumber \\&\le \tfrac{\sqrt{m}}{K} \sum _{i=0}^{K-1}\left( d_-(g_{\eta _i}(\textbf{x}_{i+1})) + \eta _i\beta \Vert \textbf{1}\Vert _2\right) \overset{\tiny (25)}{\le } \tfrac{\sqrt{m}}{K}\sum _{i=0}^{K-1}\left( \Vert \nabla _{\lambda } \mathcal {L}_{\eta _i,\rho }(\textbf{x}_{i+1},\lambda _{i})\Vert + \sqrt{m}\eta _i\beta \right) \nonumber \\&\le \tfrac{\sqrt{m}}{K}\sum _{i=0}^{K-1} \left( \Vert \nabla _{\lambda } \mathcal {L}_{\eta _i,\rho }(\textbf{x}_{i+1},\lambda _{i}) - \nabla _{\lambda } \mathcal {D}_{\eta _i,\rho }(\lambda _i)\Vert + \Vert \nabla _{\lambda } \mathcal {D}_{\eta _i,\rho }(\lambda _i)\Vert + \sqrt{m}\eta _i\beta \right) . \end{aligned}$$
(26)

Recall that

$$ \Vert \nabla _{\lambda } \mathcal {D}_{\eta _k,\rho }(\lambda _1)-\nabla _{\lambda } \mathcal {D}_{\eta _k,\rho }(\lambda _2)\Vert \le \tfrac{1}{\rho }\left\| q_{\eta ,\rho }(\lambda _1)-q_{\eta ,\rho }(\lambda _2)\right\| + \tfrac{1}{\rho }\left\| \lambda _1-\lambda _2\right\| \le \tfrac{2}{\rho }\Vert \lambda _1-\lambda _2\Vert ,$$

allowing us to claim that \(\mathcal {D}_{\eta _k,\rho }\) is a \((2/\rho )\)-smooth concave function. Then, by leveraging [32], we have for any \(\lambda \ge 0\) that

$$\begin{aligned} \Vert \nabla _{\lambda }&\mathcal {D}_{\eta _k,\rho }(\lambda ) \Vert \le \sqrt{\tfrac{2}{\rho } \left( \mathcal {D}_{\eta _k,\rho }({\lambda _{\eta _k}^*})-\mathcal{D}_{\eta _k,\rho }(\lambda )\right) } \le \sqrt{\tfrac{2}{\rho } \left( \mathcal {D}_{\rho }({\lambda _{\eta _k}^*})-\mathcal{D}_{\rho }(\lambda )+2\eta _k \beta \tilde{B}_{\lambda }\right) } \\&\le \sqrt{\tfrac{2}{\rho } \left( \mathcal {D}_{\rho }({\lambda ^*})-\mathcal{D}_{\rho }(\lambda )+2\eta _k \beta \tilde{B}_{\lambda }\right) } \le \sqrt{\tfrac{2}{\rho } \left( \mathcal {D}_{\rho }({\lambda ^*})-\mathcal{D}_{\rho }(\lambda )\right) } +{2\sqrt{\tfrac{\eta _k \beta \tilde{B}_{\lambda }}{\rho }}}, \end{aligned}$$

where \({\lambda _{\eta }^*}\) is a maximizer of \(\mathcal {D}_{\eta ,\rho }\). By leveraging the concavity of the square-root function, the prior dual sub-optimality bound, and the subadditivity property \(\sqrt{u+v} \le \sqrt{u}+\sqrt{v}\) for \(u, v \ge 0\) (a consequence of concavity), we have from (26),

$$\begin{aligned} \hspace{-0.1in}&d_-(g(\bar{\textbf{x}}_{K})) \le \tfrac{{\sqrt{m}}}{K}\sum _{i=0}^{K-1}\left( \sqrt{\tfrac{2\epsilon _i{\eta _i^b}}{\rho }}+{\sqrt{m}}\eta _i\beta \right) + \tfrac{{\sqrt{m}}}{K}\sum _{i=0}^{K-1}\sqrt{\tfrac{2}{\rho } \left( \mathcal {D}_{\rho }({\lambda ^*})-\mathcal{D}_{\rho }(\lambda _i)\right) } \\&+ {\tfrac{2\sqrt{m}}{K}\sum _{i=0}^{K-1}\sqrt{\tfrac{\eta _i \beta \tilde{B}_{\lambda }}{\rho }}} \\&\overset{\tiny \text{(Concavity } \text{ of } \sqrt{\cdot })}{\le } \tfrac{{\sqrt{m}}}{K}\sum _{i=0}^{K-1}\left( \sqrt{\tfrac{2\epsilon _i{\eta _i^b}}{\rho }}+{\sqrt{m}}\eta _i\beta \right) + \sqrt{\tfrac{2m}{\rho } \left( \mathcal {D}_{\rho }({\lambda ^*})-\tfrac{1}{K} \sum _{i=0}^{K-1}\mathcal{D}_{\rho }(\lambda _i)\right) } \\&+ {\tfrac{\sqrt{m}}{K}\sum _{i=0}^{K-1}\sqrt{\tfrac{2 \eta _i \beta \tilde{B}_{\lambda }}{\rho }}}. \end{aligned}$$

Recalling (24), it follows that

$$\begin{aligned} \tfrac{1}{K} \sum _{i=0}^{K-1}\left( \mathcal {D}_{\rho }({\lambda ^*})-\mathcal{D}_{\rho }(\lambda _i)\right) \le \tfrac{1}{2\rho K}\Vert \lambda _0-\lambda ^*\Vert ^2 + \tfrac{{B_{\lambda }}}{K} \sum _{k=0}^{K-1} \tfrac{\sqrt{2\epsilon _k\eta _k^b}}{\sqrt{\rho }} +\tfrac{{B_2}}{K} \sum _{k=0}^{K-1}\eta _k = \tfrac{C}{K}, \end{aligned}$$

which implies that

$$\begin{aligned} d_-(g(\bar{\textbf{x}}_{K}))\le \tfrac{{\sqrt{m}}}{K}\sum _{i=0}^{K-1}\left( \sqrt{\tfrac{2\epsilon _i{\eta _i^b}}{\rho }}+{\sqrt{m}}\eta _i\beta + \sqrt{\tfrac{2 \eta _i \beta \tilde{B}_{\lambda }}{\rho }}\right) + \sqrt{\tfrac{2mC}{\rho K} } \end{aligned}$$

where \(C \triangleq \tfrac{\Vert \lambda _0-\lambda ^*\Vert ^2}{2\rho }+\left( B_{\lambda }\sum _{k=0}^{K-1}\tfrac{\sqrt{2\epsilon _k\eta _k^b}}{\sqrt{\rho }}+B_2\sum _{k=0}^{K-1}\eta _k\right) \). \(\square \)

We now derive a rate statement for the primal sub-optimality.

figure k

Proof

Since \(\textbf{x}_k\) may not be feasible with respect to the constraints, we derive both upper and lower bounds on the sub-optimality.

(i) Lower bound. A rate statement for the lower bound is first constructed. Since \(\max _{\lambda } \mathcal {D}_{\rho }(\lambda ) = {\displaystyle \min _{\textbf{x}\in \mathcal {X}}} \ \mathcal {L}_{\rho }(\textbf{x},\lambda ^*) = f^*\), the following sequence of inequalities hold where \(\bar{\textbf{x}}_K = \tfrac{1}{K}\sum _{k = 0}^{K-1}\textbf{x}_k\), \(f_{\eta _K}^* = {\displaystyle \min _{\textbf{x}\in \mathcal X}} \ \mathcal {L}_{\eta _K,{\rho }}\left( \textbf{x}, {\lambda _{\eta _K}^*}\right) \), and \((\textbf{x}_{\eta _K}^*,\lambda _{\eta _K}^*)\) is the saddle point of \(\mathcal {L}_{\eta _K, 0}(\textbf{x},\lambda )\).

$$\begin{aligned} f_{\eta _K}^*&{= \mathcal{L}_{\eta _K,\rho }(\textbf{x}^*_{\eta _K},\lambda ^*_{\eta _K}) } \le \mathcal {L}_{\eta _K,{\rho }}(\bar{\textbf{x}}_K,{\lambda _{\eta _K}^*}) \\&= f_{\eta _K}(\bar{\textbf{x}}_K) +\tfrac{\rho }{2} \left( d_-\left( \tfrac{{{\lambda ^*_{\eta _K}}}}{\rho } + g_{\eta _K}(\bar{\textbf{x}}_K) \right) \right) ^2 - \tfrac{1}{2\rho }\Vert {\lambda ^*_{\eta _K}}\Vert ^2 \\&\le f_{\eta _K}(\bar{\textbf{x}}_K) +\tfrac{\rho }{2} \left( d_-\left( g_{\eta _K}(\bar{\textbf{x}}_K) \right) + \left\| \tfrac{{\lambda _{\eta _K}^*}}{\rho }\right\| \right) ^2 - \tfrac{1}{2\rho }\Vert {\lambda _{\eta _K}^*}\Vert ^2 \\&= f_{\eta _K}(\bar{\textbf{x}}_K) +\tfrac{\rho }{2} \left( d_-\left( g_{\eta _K}(\bar{\textbf{x}}_K) \right) \right) ^2 + \left\| {\lambda _{\eta _K}^*}\right\| d_-\left( g_{\eta _K}(\bar{\textbf{x}}_K) \right) \\&\overset{\tiny \text{ Lem. } 1}{\le } f_{\eta _K}(\bar{\textbf{x}}_K) +\tfrac{\rho }{2} \left( d_-\left( g_{\eta _K}(\bar{\textbf{x}}_K) \right) \right) ^2 + {b_{\lambda ,\eta }} d_-\left( g_{\eta _K}(\bar{\textbf{x}}_K) \right) . \end{aligned}$$

By invoking Proposition 3, we obtain the following inequality.

$$\begin{aligned} f_{\eta _K}^* - f_{\eta _K}(\bar{\textbf{x}}_K) \le \tfrac{{B_5^2}}{K} + \tfrac{{B_5}}{\sqrt{K}}. \end{aligned}$$
(28)

Let \(\textbf{x}^*\in \mathcal {X}^*\) and let \(\textbf{x}_{{\eta _K}}^*\) be a minimizer of \(\mathcal{L}_{\eta _K,\rho }(\cdot , {\lambda _{\eta _K}^*})\). By Lemma 4, we have that

$$\begin{aligned} f(\textbf{x}^*) = \mathcal{L}(\textbf{x}^*,\lambda ^*)&\le \mathcal{L}(\textbf{x}^*_{\eta _K},\lambda ^*) = f(\textbf{x}^*_{\eta _K}) + \sum _{i=1}^m \lambda ^*_i g_i(\textbf{x}^*_{\eta _K}) \nonumber \\&\le f(\textbf{x}^*_{\eta _K}) + \sum _{i=1}^m \lambda ^*_i (g_i(\textbf{x}^*_{\eta _K}) - g_{i,\eta _K}(\textbf{x}_{\eta _K}^*)) \nonumber \\&\le f(\textbf{x}^*_{\eta _K}) + m b_{\lambda } \eta _K \beta , \end{aligned}$$
(29)

implying that \(f(\textbf{x}^*) \le f(\textbf{x}^*_{\eta _K}) + mb_{\lambda } \beta \eta _K.\) By definition of the smoothing, \(f(\textbf{x}_{{\eta _K}}^*)-f_{\eta _K}(\textbf{x}_{{\eta _K}}^*) \le \beta \eta _K \) and \(f_{\eta _K}(\bar{\textbf{x}}_K)-f(\bar{\textbf{x}}_K) \le 0\). Combining these bounds with (28), we obtain

$$\begin{aligned} f(\textbf{x}^*)-f(\bar{\textbf{x}}_K)&= \underbrace{f(\textbf{x}^*) - f(\textbf{x}_{{\eta _K}}^*)}_{\le {mb_{\lambda } \beta \eta _K}} + \underbrace{f(\textbf{x}_{{\eta _K}}^*)-f_{\eta _K}(\textbf{x}_{{\eta _K}}^*)}_{\le \beta \eta _K } + \underbrace{f_{\eta _K}(\textbf{x}_{{\eta _K}}^*) -f_{\eta _K}(\bar{\textbf{x}}_K)}_{(28) } \\&\quad +\underbrace{f_{\eta _K}(\bar{\textbf{x}}_K)-f(\bar{\textbf{x}}_K)}_{\le 0} \le (1+ {mb_{\lambda }}) \eta _K \beta + \tfrac{{B_5^2}}{K} + \tfrac{{B_5}}{\sqrt{K}}. \end{aligned}$$

(ii) Upper bound. Let \(\textbf{x}_{\eta _k,\lambda _k}^* {\in } \arg {\displaystyle \min _{\textbf{x}\in \mathcal X}} \ \mathcal {L}_{\eta _k,{\rho }}\left( \textbf{x}, {\lambda _k}\right) \) and \((\textbf{x}_{\eta _k}^*,\lambda _{\eta _k}^*)\) be the saddle point of \(\mathcal {L}_{\eta _k, 0}(\textbf{x},\lambda )\). Based on the definition of \(\textbf{x}_{\eta _k,\lambda _k}^*\) and \(\textbf{x}_{\eta _k}^*\), the following two inequalities hold.

$$\begin{aligned} \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) - \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k,\lambda _k}^*,{\lambda _k})&\le \epsilon _k{\eta _k^b}\\ \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k,\lambda _k}^*,\lambda _k)&\le \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k}^*,\lambda _k) \end{aligned}$$

By adding the two inequalities, we obtain

$$\begin{aligned}&\mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) - \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k}^*,{\lambda _k})\nonumber \\&= \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k)- \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k,\lambda _k}^*,\lambda _k) + \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k,\lambda _k}^*,\lambda _k)- \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k}^*,{\lambda _k})\nonumber \\&\le \epsilon _k\eta _k^b. \end{aligned}$$
(30)

Consequently, by leveraging (30) and invoking the definition of \(\mathcal{L}_{\eta _k,\rho }(\cdot ,\lambda _k)\), we have that

$$\begin{aligned} f_{\eta _k}(\textbf{x}_{k+1}) - f_{\eta _k}{(\textbf{x}^*_{\eta _k})}&\le \tfrac{\rho }{2} \left( d_{-}\left( \tfrac{\lambda _k}{\rho } +g_{\eta _k}(\textbf{x}_{\eta _k}^*)\right) \right) ^2 - \tfrac{\rho }{2} \left( d_{-}\left( \tfrac{\lambda _k}{\rho } + g_{\eta _k}(\textbf{x}_{k+1}) \right) \right) ^2+ \epsilon _k\eta _k^b. \end{aligned}$$

We observe that

$$\begin{aligned} d_{-} (u) = \left\| \varPi _-(u) - u\right\| = \left\| \varPi _-(u) - (\varPi _-(u) + \varPi _+(u)) \right\| = \left\| - \varPi _+(u) \right\| = \left\| \varPi _+(u)\right\| . \end{aligned}$$

By choosing \(u = g_{\eta _k}(\textbf{x}_{k+1}) + \tfrac{\lambda _k}{\rho }\), it follows from Lemma 6 that

$$\begin{aligned} d_{-}\left( g_{\eta _k}(\textbf{x}_{k+1}) + \tfrac{\lambda _k}{\rho } \right) = \left\| \varPi _+ \left( g_{\eta _k}(\textbf{x}_{k+1}) + \tfrac{\lambda _k}{\rho } \right) \right\| = \left\| \tfrac{\lambda _{k+1}}{\rho }\right\| . \end{aligned}$$

Furthermore, we have that \(g_{\eta _k}(\textbf{x}_{\eta _k}^*) \le 0\) since \(\textbf{x}_{\eta _k}^*\) is feasible with respect to the \(\eta _k\)-smoothed constraints, implying

$$\begin{aligned} d_{-}\left( \tfrac{\lambda _k}{\rho } + g_{\eta _k}(\textbf{x}_{\eta _k}^*) \right) \le \underbrace{d_{-}\left( g_{\eta _k}(\textbf{x}_{\eta _k}^*) \right) }_{\tiny = 0, \text{ since } g_{\eta _k}(\textbf{x}_{\eta _k}^*) \le 0}+ \left\| \tfrac{\lambda _k}{\rho } \right\| , \end{aligned}$$

which implies

$$\begin{aligned}&f_{\eta _k}(\textbf{x}_{k+1}) - {f_{\eta _k}(\textbf{x}_{\eta _k}^*)}\le \tfrac{\rho }{2}\left( \left\| \tfrac{\lambda _k}{\rho }\right\| ^2-\left\| \tfrac{\lambda _{k+1}}{\rho }\right\| ^2\right) + \epsilon _k\eta _k^b. \end{aligned}$$
(31)

We observe that \(g_{\eta _k}(\textbf{x}^*) \le g(\textbf{x}^*) \le 0\), implying that \(\textbf{x}^*\) is feasible for the \(\eta _k\)-smoothed problem and consequently,

$$\begin{aligned} f_{\eta _k}(\textbf{x}^*_{\eta _k}) - f_{\eta _k}(\textbf{x}^*) \le 0. \end{aligned}$$
(32)

Summing from \(k = 0\) to \(K-1\) and leveraging the convexity of \(f_{\eta _k}\), we obtain that

where \({B_6}> 0\) is a constant. \(\square \)
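As a numerical sanity check on the projection identities used in the preceding proof, namely \(d_-(u) = \Vert \varPi _+(u)\Vert \) and the relation from Lemma 6 equating \(d_-\left( g_{\eta _k}(\textbf{x}_{k+1}) + \tfrac{\lambda _k}{\rho }\right) \) with \(\Vert \tfrac{\lambda _{k+1}}{\rho }\Vert \), the following minimal NumPy sketch may be useful; the dimension, the stand-ins for \(g_{\eta _k}(\textbf{x}_{k+1})\) and \(\lambda _k\), and the value of \(\rho \) are arbitrary illustrative choices, and the multiplier update is written in the form stated in the proof of Lemma 10 below.

```python
import numpy as np

def proj_minus(u):
    # Euclidean projection onto the nonpositive orthant R^m_-
    return np.minimum(u, 0.0)

def proj_plus(u):
    # Euclidean projection onto the nonnegative orthant R^m_+
    return np.maximum(u, 0.0)

def d_minus(u):
    # d_-(u) = || u - Pi_-(u) ||, the distance from u to R^m_-
    return np.linalg.norm(u - proj_minus(u))

rng = np.random.default_rng(0)
g_val = rng.normal(size=4)           # stand-in for g_{eta_k}(x_{k+1})
lam_k = np.abs(rng.normal(size=4))   # stand-in for a multiplier lambda_k >= 0
rho = 2.0                            # stand-in penalty parameter

u = lam_k / rho + g_val
# identity d_-(u) = || Pi_+(u) ||
assert np.isclose(d_minus(u), np.linalg.norm(proj_plus(u)))

# multiplier update lambda_{k+1} = lambda_k + rho * grad_lambda L_{eta_k,rho}(x_{k+1}, lambda_k),
# with grad_lambda L = g_{eta_k}(x_{k+1}) - Pi_-(lambda_k/rho + g_{eta_k}(x_{k+1}))
lam_next = lam_k + rho * (g_val - proj_minus(u))
assert np.allclose(lam_next, rho * proj_plus(u))        # equivalently rho * Pi_+(u)
print(d_minus(u), np.linalg.norm(lam_next) / rho)       # equal, as in the display above
```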

3.3 Rate analysis under increasing \(\rho _k\)

We now consider the setting where \(\{\rho _k\}\) is an increasing sequence.

Lemma 10

(Rate on primal infeasibility) Suppose \(\{{(\textbf{x}_k,\lambda _k)}\}\) is generated by (Sm-AL). Then for any \(k \ge 0\), \(d_{-}\left( g(\textbf{x}_{k+1}) \right) \le \left\| \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k}\right\| + m\eta _k {\beta }.\)

Proof

By the update rule, we have that

$$\begin{aligned} \lambda _{k+1}&:= \lambda _k + \rho _k \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho _k}(\textbf{x}_{k+1},\lambda _k) = \lambda _k + \rho _k g_{\eta _k}(\textbf{x}_{k+1}) - \rho _k \varPi _-\left( \tfrac{\lambda _k}{\rho _k}+ g_{\eta _k}(\textbf{x}_{k+1}) \right) . \end{aligned}$$

It follows that \( g_{\eta _k}(\textbf{x}_{k+1}) = \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k} + \varPi _-\left( \tfrac{\lambda _k}{\rho _k}+g_{\eta _k}(\textbf{x}_{k+1}) \right) \), implying

$$\begin{aligned} d_{-}\left( g_{\eta _k}(\textbf{x}_{k+1}) \right)&\le d_{-}\left( \varPi _-\left( \tfrac{\lambda _k}{\rho _k}+g_{\eta _k}(\textbf{x}_{k+1}) \right) \right) + \left\| \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k}\right\| = \left\| \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k}\right\| . \end{aligned}$$

Akin to the proof in Proposition 3, we have

$$\begin{aligned} d_{-}\left( g(\textbf{x}_{k+1}) \right) \le d_{-}\left( g_{\eta _k}(\textbf{x}_{k+1}) \right) +m\eta _{k}{\beta } \le \left\| \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k}\right\| + m\eta _k {\beta }. \end{aligned}$$

\(\square \)

figure l

Proof

(i) Let \(f_{\eta _k}^* \triangleq f_{\eta _k}({\textbf{x}_{\eta _k}^*})\) and \((\textbf{x}_{\eta _k}^*,\lambda _{\eta _k}^*)\) be the saddle point of \(\mathcal {L}_{\eta _k, 0}(\textbf{x},\lambda )\). We have that

$$\begin{aligned} f_{\eta _k}^*&\le \mathcal {L}_{\eta _k,\rho _k}({\textbf{x}_{k+1}},{\lambda _{\eta _k}^*}) = f_{\eta _k}({\textbf{x}_{k+1}}) +\tfrac{\rho _k}{2} \left( d_-\left( \tfrac{{\lambda _{\eta _k}^*}}{{\rho _k}} + g_{\eta _k}({\textbf{x}_{k+1}}) \right) \right) ^2 - \tfrac{1}{2\rho _k}\Vert {\lambda _{\eta _k}^*}\Vert ^2 \nonumber \\&{=} f_{\eta _k}({\textbf{x}_{k+1}}) +\tfrac{\rho _k}{2} \left( d_-\left( \tfrac{{\lambda _{k}}}{{\rho _k}} - \tfrac{{\lambda _{k}}}{{\rho _k}} + \tfrac{{\lambda _{\eta _k}^*}}{{\rho _k}} + g_{\eta _k}({\textbf{x}_{k+1}}) \right) \right) ^2 - \tfrac{1}{2\rho _k}\Vert {\lambda _{\eta _k}^*}\Vert ^2 \nonumber \\&\le f_{\eta _k}({\textbf{x}_{k+1}}) +\tfrac{\rho _k}{2} \left( d_-\left( \tfrac{{\lambda _{k}}}{{\rho _k}} + g_{\eta _k}({\textbf{x}_{k+1}}) \right) + \left\| \tfrac{{\lambda _{k}}}{{\rho _k}} - \tfrac{{\lambda _{\eta _k}^*}}{{\rho _k}} \right\| \right) ^2 - \tfrac{1}{2\rho _k}\Vert {\lambda _{\eta _k}^*}\Vert ^2 \nonumber \\&\le f_{\eta _k}({\textbf{x}_{k+1}}) +\tfrac{\rho _k}{2} \left( \tfrac{\Vert \lambda _{k+1}\Vert }{\rho _k} + \left\| \tfrac{{\lambda _{k}}}{{\rho _k}} - \tfrac{{\lambda _{\eta _k}^*}}{{\rho _k}} \right\| \right) ^2 - \tfrac{1}{2\rho _k}\Vert {\lambda _{\eta _k}^*}\Vert ^2 \nonumber \\&\le f_{\eta _k}({\textbf{x}_{k+1}}) +\tfrac{1}{\rho _k} {\left( \Vert \lambda _{k+1}\Vert ^2 + \left\| \lambda _{k}-\lambda _{\eta _k}^* \right\| ^2\right) }. \end{aligned}$$
(33)

By adding and subtracting \(f(\textbf{x}_{\eta _k}^*), f_{\eta _k}^*\) and \( f_{\eta _k}({\textbf{x}_{k+1}})\), it follows that

$$\begin{aligned} f^*-f({\textbf{x}_{k+1}})&= \underbrace{f^* - f(\textbf{x}_{\eta _k}^*)}_{\le { b_{\lambda } m \beta \eta _k } \tiny { \text{ from } (29)}} + \underbrace{f(\textbf{x}_{\eta _k}^*)-f_{\eta _k}^*}_{\le \eta _k \beta } \\&\quad + \underbrace{f_{\eta _k}^* -f_{\eta _k}({\textbf{x}_{k+1}})}_{(33)} + \underbrace{f_{\eta _k}({\textbf{x}_{k+1}}) -f({\textbf{x}_{k+1}})}_{\le 0}. \end{aligned}$$

Consequently, we have that \(f(\textbf{x}_{k+1}) - f(\textbf{x}^*) \ge -{(1+b_{\lambda } m)}\eta _k \beta -\left( \tfrac{\Vert \lambda _{k+1}\Vert ^2}{\rho _k} + \tfrac{\Vert {\lambda _{\eta _k}^*}-\lambda _k\Vert ^2}{\rho _k}\right) .\)

(ii) Similar to the previous analysis in Theorem 2, we have

$$\begin{aligned}&\mathcal {L}_{\eta _k,\rho _k}(\textbf{x}_{k+1},\lambda _k) - \mathcal {L}_{\eta _k,\rho _k}(\textbf{x}_{\eta _k}^*,\lambda _k) \le \epsilon _k{\eta _k^b}. \end{aligned}$$

which implies

$$\begin{aligned}&f_{\eta _k}(\textbf{x}_{k+1}) - f_{\eta _k}^* \\&\le \tfrac{\rho _k}{2}\left( \left( d_{-}\left( \tfrac{\lambda _k}{\rho _k} +g_{\eta _k}(\textbf{x}_{\eta _k}^*)\right) \right) ^2 - \left( d_{-}\left( \tfrac{\lambda _k}{\rho _k} + g_{\eta _k}(\textbf{x}_{k+1}) \right) \right) ^2\right) +\epsilon _k\eta _k^b\\&\le {\tfrac{\rho _k}{2}\left( \left( d_{-}\left( \tfrac{\lambda _k}{\rho _k} +g_{\eta _k}(\textbf{x}_{\eta _k}^*)\right) \right) ^2 \right) +\epsilon _k\eta _k^b} \le {\tfrac{\rho _k}{2}\left( \left( d_{-}\left( g_{\eta _k}(\textbf{x}_{\eta _k}^*)\right) + \tfrac{\Vert \lambda _k\Vert }{\rho _k} \right) ^2 \right) +\epsilon _k\eta _k^b}\\&= {\left( \tfrac{\Vert \lambda _k\Vert ^2}{2\rho _k} \right) +\epsilon _k\eta _k^b}\\ \implies&f(\textbf{x}_{k+1}) - f^* =\underbrace{ f(\textbf{x}_{k+1}) - f_{\eta _k}(\textbf{x}_{k+1})}_{\le \eta _k {\beta }}+ f_{\eta _k}(\textbf{x}_{k+1})- f_{\eta _k}^*+\underbrace{f_{\eta _k}(\textbf{x}_{\eta _k}^*)-f_{\eta _k}(\textbf{x}^*)}_{\le 0 { \tiny \text{ from } (32)}}\\&+ \underbrace{f_{\eta _k}(\textbf{x}^*)- f^*}_{\le 0} \le \eta _k {\beta }+ \tfrac{\Vert \lambda _k\Vert ^2}{{2}\rho _k} + \epsilon _k\eta _k^b. \end{aligned}$$

\(\square \)

We conclude with an overall rate for sub-optimality and infeasibility.

figure m

Proof

Suppose \(\rho _k = \rho _0 \zeta ^k\) where \(\zeta > 1\). By choosing \(\epsilon _k\eta _k^b = \tfrac{1}{k^{2+\delta }\rho _k}\), we have that

$$\begin{aligned} |&f(\textbf{x}_{k+1}) - f^* |\le \max \left\{ \eta _{k} \beta {(1+b_{\lambda }m)}+ \tfrac{\Vert {\lambda _{k+1}\Vert ^2}}{{\rho _{k}}} + \tfrac{\Vert {\lambda _{\eta _k}^*}-{\lambda _{k}}\Vert ^2}{{\rho _{k}}}, \eta _k {\beta }+ \tfrac{\Vert \lambda _k\Vert ^2}{2\rho _k} + {\epsilon _k\eta _k^b} \right\} \\&\le \eta _k {\beta }{(1+b_{\lambda }m)}+\tfrac{2\Vert \lambda _{k+1}\Vert ^2+5\Vert {\lambda _{k}}\Vert ^2 + {4} \Vert {\lambda _{\eta _k}^*}\Vert ^2}{2\rho _k} + {\epsilon _k\eta _k^b} \le \eta _k {\beta }{(1+b_{\lambda }m)} + \tfrac{\tilde{C}_1}{\rho _k} + \tfrac{1}{k^{2+\delta }\rho _k}\\&\le \eta _k {\beta }{(1+b_{\lambda }m)}+ \tfrac{{B_7}}{\rho _k}. \end{aligned}$$

Next, we derive a rate on the infeasibility. Recall from Lemma 4 that \(g(\textbf{x}_{k+1}) \le g_{\eta _k}(\textbf{x}_{k+1}) + \eta _k {\beta } \textbf{1}\), implying that \(d_-(g(\textbf{x}_{k+1})) \le d_-(g_{\eta _k}(\textbf{x}_{k+1}) + \eta _k {\beta } \textbf{1})\). Therefore,

$$\begin{aligned} d_{-}\left( g(\textbf{x}_{k+1})\right)&\le d_-(g_{\eta _k}(\textbf{x}_{k+1}) + \eta _k {\beta } \textbf{1}) \le \left\| \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k}\right\| + \eta _k {\beta } \Vert \textbf{1}\Vert \\&\le \eta _k {\beta m} + \tfrac{2{B_{\lambda }}}{\rho _k} = \eta _k {\beta m} + \tfrac{B_8}{\rho _k}. \end{aligned}$$

4 Overall Complexity Guarantees

In Section 4.1, we begin with some preliminaries, including the derivation of Lipschitzian properties for the smoothed AL function. This allows for employing an accelerated gradient framework for the inexact resolution of the subproblem, leading to suitable complexity guarantees in Section 4.2 for convex and strongly convex regimes. In Section 4.3, overall complexity guarantees for (Sm-AL) with a fixed smoothing parameter are presented.

4.1 Preliminaries

We first derive the L-smoothness of \(\mathcal {L}_{\eta ,\rho }(\bullet ,\lambda )\) uniformly in \(\lambda \). Our bound necessitates utilizing an upper bound on \(\eta \), which we denote by \(\eta ^u\).

Lemma 11

Suppose \({0 \, < \, \eta \, \le \eta ^u}\) and \(\rho \ge {1}\). Then the following hold.

(a) For any \(\lambda \ge 0\), there exists \(\tilde{C}\) such that \(\mathcal {L}_{\eta ,\rho }(\bullet ,\lambda )\) is \(\tfrac{\tilde{C} \rho }{\eta }\)-smooth.

(b) \(\mathcal{L}_{\eta ,\rho }(\textbf{x},\lambda )\) is convex in \(\textbf{x}\in \mathcal{X}\) and concave in \(\lambda \ge 0\). \(\Box \)

Next, we formally state an accelerated gradient method for resolving the augmented Lagrangian subproblem (ALSub\(_{\eta _k,\rho _k}(\lambda _k)\)), defined as

figure n

Suppose \(\textbf{x}_k^*\) denotes an optimal solution of (ALSub\(_{\eta _k,\rho _k}(\lambda _k)\)). Since \(\mathcal {L}_{\eta _k,\rho _k}(\bullet ,\lambda _k)\) is a convex and \({\tfrac{\tilde{C} \rho _k}{\eta _k}}\)-smooth function, we employ an accelerated gradient method that constructs a sequence \(\{ \textbf{y}_j,\textbf{z}_j \}_{j=0}^{M_k}\) as follows, where \(\textbf{z}_0 = \textbf{y}_0 = \textbf{x}_k\).

$$\begin{aligned}\left\{ \begin{aligned} \textbf{y}_{j+1}&= \varPi _{\mathcal{X}} \left[ \, \textbf{z}_j - \beta _j \nabla _{\textbf{x}} \mathcal {L}_{\eta _k,\rho _k}(\textbf{z}_j,\lambda _k) \, \right] \\ \textbf{z}_{j+1}&= \textbf{y}_{j+1} + \gamma _j \left( \textbf{y}_{j+1}- \textbf{y}_j \right) \end{aligned} \right\} , \quad j \ge 0. \end{aligned}$$
(AG)
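A minimal Python sketch of (AG) applied to a generic \(L\)-smooth convex subproblem is given below. The constant steplength \(\beta _j = 1/L\) and the momentum choice \(\gamma _j = j/(j+3)\) are standard options consistent with the cited accelerated schemes but are assumptions here (the parameter sequences used in Theorem 4 are not restated); the box feasible set and the toy quadratic are likewise illustrative.

```python
import numpy as np

def accelerated_gradient(grad, proj, x0, L, M):
    """Sketch of (AG): y_{j+1} = proj(z_j - (1/L) * grad(z_j)),
       z_{j+1} = y_{j+1} + gamma_j * (y_{j+1} - y_j), for j = 0, ..., M-1."""
    y = x0.copy()
    z = x0.copy()
    for j in range(M):
        y_new = proj(z - grad(z) / L)     # projected gradient step with beta_j = 1/L
        gamma = j / (j + 3.0)             # an assumed standard momentum sequence
        z = y_new + gamma * (y_new - y)   # extrapolation step
        y = y_new
    return y

# illustrative use: a separable L-smooth quadratic over the box [-1, 1]^n
n = 10
d = np.linspace(1.0, 10.0, n)               # diagonal Hessian, so L = 10
b = np.linspace(-2.0, 2.0, n)
grad = lambda x: d * x - b                  # gradient of 0.5 * sum(d * x^2) - b.x
proj = lambda x: np.clip(x, -1.0, 1.0)      # projection onto the box
x_sol = accelerated_gradient(grad, proj, np.zeros(n), L=10.0, M=300)
print(np.max(np.abs(x_sol - np.clip(b / d, -1.0, 1.0))))  # error vs. the exact solution
```

Within (Sm-AL), grad would be \(\nabla _{\textbf{x}} \mathcal {L}_{\eta _k,\rho _k}(\cdot ,\lambda _k)\), the smoothness constant would be \(L_k = \tfrac{\tilde{C}\rho _k}{\eta _k}\) from Lemma 11, and M would be the inner iteration count \(M_k\) prescribed in the complexity results of Section 4.2.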

We now restate the convergence guarantees [6, 31, 32, 34] associated with (AG).

figure o

4.2 Complexity guarantees for convex and strongly convex f

We begin by leveraging Theorem 4 to develop complexity guarantees for an \(\varvec{\varepsilon }\)-optimal solution in convex settings, using the rate statement for dual suboptimality (in constant penalty settings) and primal sub-optimality (in increasing penalty settings). Throughout, we recall that the AL subproblem objective is \(L_k\)-smooth, where \(L_k = \tfrac{\tilde{C} \rho _k}{\eta _k}\) and \(\Vert x-y\Vert \le {C_1}\) for any \(x, y \in \mathcal{X}\). Additionally, complexity guarantees are derived by utilizing the rate guarantees presented in Theorem 2 (constant \(\rho _0\)) or Theorem 3 (increasing \(\rho _k\)) to determine the number of outer iterations K; specifically, by these results, ensuring an \(\varvec{\varepsilon }\)-suboptimal solution requires that \(K = \lceil \tfrac{C}{\varvec{\varepsilon }}\rceil \) (constant \(\rho \)) or \(K = \lceil \tfrac{\ln (C/\varvec{\varepsilon })}{\ln (\zeta )}\rceil \) (increasing \(\rho _k\)) for a suitable constant C.

figure p

Proof

(a) By Theorem 4, \(M_k\) is the smallest integer satisfying

$$\begin{aligned} \mathcal {L}_{\rho _k,\eta _k}({\textbf{x}_{k+1}},\lambda _{k}) - {\mathcal {D}_{\rho _k,\eta _k}}(\lambda _k)&\le {\left( \tfrac{{C_1}L_k}{M_k^2}\right) {\,=\,} \left( \tfrac{{C_1}\tilde{C}\rho _0}{\eta _k M_k^2}\right) \le \epsilon _k\eta _k^b} \\ \implies M_k&= {\bigg \lceil \sqrt{\tfrac{{C_1}\tilde{C} \rho _0}{ \epsilon _k \eta _k^{b+1}}}\bigg \rceil = \bigg \lceil \left( \sqrt{ {C_1} \tilde{C} \rho _0} \right) k^{2(1+\delta )}\bigg \rceil }. \end{aligned}$$

Then the iteration complexity of computing a pair \((\bar{\textbf{x}}_K,\bar{\lambda }_K)\) satisfying \(f^*-\mathcal {D}(\bar{\lambda }_K)\le \varvec{\varepsilon }\) is given by

$$\begin{aligned} \sum _{k = 1}^{K(\varvec{\varepsilon })} M_k= \sum _{k = 1}^{{\lceil C/\varvec{\varepsilon }\rceil }}\bigg \lceil \left( \sqrt{ {C_1} \tilde{C} \rho _0} \right) k^{2(1+\delta )}\bigg \rceil = \mathcal {O}\left( \varvec{\varepsilon }^{-(3+{2}\delta )}\right) . \end{aligned}$$

(b) Proceeding similarly, by Theorem 4, \(M_k\) is defined as follows.

$$\begin{aligned} M_k = {\bigg \lceil \sqrt{\tfrac{{C_1}\tilde{C} \rho _k}{ \epsilon _k \eta _k^{b+1}}} \bigg \rceil } = \bigg \lceil \sqrt{\tfrac{{C_1}\tilde{C} \rho _k^2 \eta _k^b k^{(2+\delta )}}{ \eta _k^{b+1}}}\bigg \rceil =\bigg \lceil \left( \sqrt{{C_1}\tilde{C}}\right) \rho _k^{3/2} k^{2+\delta }\bigg \rceil . \end{aligned}$$

Then the iteration complexity of producing an \(\textbf{x}_K\) satisfying \(|f^*-f({\textbf{x}_K})|\,\le \, \varvec{\varepsilon }\) is bounded as follows.

$$\begin{aligned} \sum _{k=1}^{K(\varvec{\varepsilon })} M_k&= \sum _{k=1}^{\lceil \ln {\frac{C}{\varvec{\varepsilon }}}/\ln {\zeta }\rceil }\bigg \lceil \left( \sqrt{{C_1}\tilde{C}}\right) \rho _k^{\frac{3}{2}}k^{(2+\delta )}\bigg \rceil {\, \le \, } {2}{\left( \sqrt{{C_1}\tilde{C}}\right) \rho _0^{\frac{3}{2}}}\sum _{k=1}^{{\log }_{{\zeta }}{\left( \frac{C}{\varvec{\varepsilon }}\right) +1}}\zeta ^{\frac{3}{2}k}k^{(2+\delta )}\\&\le 2{\left( \sqrt{{C_1}\tilde{C}}\right) \rho _0^{3/2}}\left( \lceil \ln {\left( \tfrac{C}{\varvec{\varepsilon }}\right) }+1\rceil \right) ^{3(1+\delta )}\int _{1}^{\ln _{{\zeta }}{\left( \frac{C}{\varvec{\varepsilon }}\right) }{+2}}\zeta ^{\frac{3}{2}u} du \le \tilde{\mathcal {O}}\left( \varvec{\varepsilon }^{-\frac{3}{2}}\right) . \end{aligned}$$
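To make the bookkeeping in (b) concrete, the following small Python sketch tabulates \(\sum _k M_k\) using the expression \(M_k = \lceil (\sqrt{C_1\tilde{C}})\rho _k^{3/2}k^{2+\delta }\rceil \) and \(K(\varvec{\varepsilon }) = \lceil \ln (C/\varvec{\varepsilon })/\ln (\zeta )\rceil \) derived above; the constants \(C_1\tilde{C} = \rho _0 = C = 1\), \(\zeta = 2\), and \(\delta = 0.1\) are illustrative placeholders rather than values prescribed by the analysis.

```python
import math

def total_inner_iterations(eps, rho0=1.0, zeta=2.0, C=1.0, C1Ctilde=1.0, delta=0.1):
    # K(eps) outer iterations under the geometric penalty rho_k = rho0 * zeta^k
    K = math.ceil(math.log(C / eps) / math.log(zeta))
    total = 0
    for k in range(1, K + 1):
        rho_k = rho0 * zeta ** k
        # M_k = ceil( sqrt(C1 * Ctilde) * rho_k^(3/2) * k^(2 + delta) ), as derived in (b)
        total += math.ceil(math.sqrt(C1Ctilde) * rho_k ** 1.5 * k ** (2 + delta))
    return total

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(eps, total_inner_iterations(eps), total_inner_iterations(eps) * eps ** 1.5)
```

The last column should grow only polylogarithmically in \(1/\varvec{\varepsilon }\), consistent with the \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-3/2})\) guarantee.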

Remark 1

(Constant \(\rho \).) Suppose \(\varvec{\varepsilon }\) is a positive scalar. Let \(K \triangleq \lceil C/\varvec{\varepsilon }\rceil \) where C is defined in Proposition 2. Suppose the Sm-AL scheme is run for K iterations, producing \(\bar{\textbf{x}}_K\) and \(\bar{\lambda }_K\). Then we have that

$$\begin{aligned} f^*-\mathcal {D}(\bar{\lambda }_K)\le \varvec{\varepsilon }, |f^*-f(\bar{\textbf{x}}_K)|\le {\mathcal {O}\left( \sqrt{\varvec{\varepsilon }}\right) }, \text { and } d_{-}\left( g(\bar{\textbf{x}}_K)\right) \le {\mathcal {O}\left( \sqrt{\varvec{\varepsilon }}\right) .} \end{aligned}$$

(Increasing \(\rho _k\)). Suppose \(\varvec{\varepsilon }\) is a positive scalar. Let \(K \triangleq \lceil \ln \left( \tfrac{C}{\varvec{\varepsilon }}\right) /\ln \left( \zeta \right) \rceil \) where C is defined in Theorem 3 and \(\rho _k = \rho _0\zeta ^k\) with \(\zeta >1\). Suppose the Sm-AL scheme is run for K iterations, producing \(\textbf{x}_K\), where

$$\begin{aligned} \left| \, f^*-f(\textbf{x}_K) \, \right| \, \le \, \varvec{\varepsilon }\text { and } d_{-}\left( g(\textbf{x}_K)\right) \, \le \, {\mathcal {O}\left( \varvec{\varepsilon }\right) }. \end{aligned}$$

We now present an extension of these results to strongly convex settings.

figure q

Proof

(a) Suppose \(\rho _k = \rho _0\) for all k and let \(M_k\) denote the least number of inner steps taken at iteration k to achieve \((\epsilon _k \eta _k^b)\)-optimality of the subproblem. By Theorem 4 and \(\ln (x) \ge \tfrac{x-1}{x}\) for \(x > 0\),

$$\begin{aligned} \mathcal {L}_{\rho _k,\eta _k}&(\textbf{x}_{{k+1}},\lambda _{k}) - {\mathcal {D}_{\rho _k,\eta _k}(\lambda _k)}\le {\tilde{C}{\tfrac{\rho _0}{\eta _k}}\left( 1-\tfrac{\sqrt{\mu }}{\sqrt{L_k}}\right) ^{M_k}} \, \le \, \epsilon _k\eta _k^b. \end{aligned}$$
$$\begin{aligned} \implies M_k&= {\bigg \lceil \left( \tfrac{\ln \left( \tfrac{\tilde{C}{\rho _0}}{\epsilon _k\eta _k^{{b{+1}}}}\right) }{\ln \left( \tfrac{\sqrt{L_k}}{\sqrt{L_k}-\sqrt{\mu }}\right) }\right) \bigg \rceil } \le {\bigg \lceil \left( \tfrac{\ln \left( \tilde{C}{\rho _0 {k^{4+2\delta }}}\right) }{\left( 1- \left( \tfrac{\sqrt{L_k}-\sqrt{\mu }}{\sqrt{L_k}}\right) \right) }\right) \bigg \rceil }\\&= {\bigg \lceil \left( \tfrac{\ln \left( \tilde{C}{\rho _0 {k^{4+2\delta }}}\right) }{\left( \tfrac{\sqrt{\mu }}{\sqrt{L_k}}\right) }\right) \bigg \rceil } = {\bigg \lceil \left( \tfrac{\sqrt{\tilde{C}\rho _0}\ln \left( \tilde{C}{\rho _0 {k^{4+2\delta }}}\right) }{\left( \sqrt{\mu \eta _k}\right) }\right) \bigg \rceil }\\&= {\bigg \lceil \left( \tfrac{\sqrt{\tilde{C}\rho _0}\ln \left( {(\hat{C}k)^{4+2\delta }}\right) }{\left( \sqrt{\mu \eta _k}\right) }\right) \bigg \rceil }, \text{ where } \hat{C} = {(\tilde{C} \rho _0})^{1/({4}+2\delta )}. \end{aligned}$$

Consequently, since \(K(\varvec{\varepsilon }) = \lceil C/\varvec{\varepsilon }\rceil \) outer steps are required, the overall complexity is

$$\begin{aligned} \sum _{k = 1}^{K(\varvec{\varepsilon })}M_k&= \sum _{k = 1}^{\lceil C/\varvec{\varepsilon }\rceil }{\bigg \lceil \left( \tfrac{\sqrt{\tilde{C}\rho _0}\ln \left( {(\hat{C}k)^{4+2\delta }}\right) }{\left( \sqrt{\mu \eta _k}\right) }\right) \bigg \rceil } = \sum _{k = 1}^{\lceil C/\varvec{\varepsilon }\rceil }{\bigg \lceil \left( \tfrac{(4+2\delta )k^{1+\delta }\sqrt{\tilde{C}\rho _0}\ln \left( {(\hat{C}k)}\right) }{\left( \sqrt{\mu }\right) }\right) \bigg \rceil }\\&\le \mathcal {O}\left( \tfrac{1}{\varvec{\varepsilon }^{{2+{2}\delta }}}\ln \left( \tfrac{1}{\varvec{\varepsilon }}\right) \right) . \end{aligned}$$

(b) Consider \(\rho _k = \rho _0\zeta ^{k}\) where \(k\ge 0\) and \(\zeta >1\). Proceeding as in (a), by Theorem 4 and the bound \(\ln (x) \ge \tfrac{x-1}{x}\) for \(x > 0\),

$$\begin{aligned} \mathcal {L}_{\rho _k,\eta _k}&(\textbf{x}_{k+1},\lambda _{k}) - {\mathcal {D}_{\rho _k,\eta _k}(\lambda _k)} \le \tilde{C}{\left( \tfrac{\rho _k}{\eta _k}\right) }\left( 1-\tfrac{\sqrt{\mu }}{\sqrt{L_k}}\right) ^{M_k} \, \le \, \epsilon _k\eta _k^b \\ \implies M_k&= \bigg \lceil \left( \tfrac{\ln \left( \tfrac{\tilde{C}{\rho _k}}{\epsilon _k\eta _k^{b+1}}\right) }{\ln \left( \tfrac{\sqrt{L_k}}{\sqrt{L_k}-\sqrt{\mu }}\right) }\right) \bigg \rceil \le \bigg \lceil \left( \tfrac{\ln \left( \tilde{C}k^{({4}+{2}\delta )} {\rho _k^{{3}}} \right) }{\left( 1-\tfrac{\sqrt{L_k}-\sqrt{\mu }}{\sqrt{L_k}}\right) }\right) \bigg \rceil \le {\tfrac{2 \sqrt{\rho _k}\ln \left( \rho _k^{{3}} \tilde{C}k^{(4+2\delta )}\right) }{\sqrt{\mu \eta _k}}}. \end{aligned}$$

Consequently, if \(K(\varvec{\varepsilon }) = \lceil \ln (C/\varvec{\varepsilon })/\ln (\zeta ) \rceil = \lceil {\log }_{\zeta }(C/\varvec{\varepsilon })\rceil \) outer steps are employed, then the overall complexity can be bounded as follows.

$$\begin{aligned} \sum _{k = 1}^{K(\varvec{\varepsilon })}M_k&= \sum _{k = 1}^{\lceil {\log }_{\zeta }(C/\varvec{\varepsilon }) \rceil }2\bigg \lceil \tfrac{\sqrt{\rho _k}}{\sqrt{\mu \eta _k}}\ln \left( \rho _k^{{3}} {\tilde{C}}k^{(4+2\delta )}\right) \bigg \rceil \\&\le \sum _{k = 1}^{\lceil {\log }_{\zeta }(C/\varvec{\varepsilon }) \rceil }{\tilde{C}_1 \bigg \lceil \rho _k k^{(1+\delta )} \ln \left( \rho _k^{{3}} \tilde{C}k^{(4+{2}\delta )}\right) \bigg \rceil } \\&\le \rho _0 \zeta ^{(\lceil {\log }_{\zeta }(C/\varvec{\varepsilon })\rceil )} \left( \lceil {\log }_{\zeta }(C/\varvec{\varepsilon }) \rceil \right) ^{(1+\delta )} \\&\times \ln \left( \rho _0^{{3}} \zeta ^{{{3}}(\lceil {\log _{\zeta }}(C/\varvec{\varepsilon })\rceil )} \tilde{C} \left( \lceil {\log _{\zeta }}(C/\varvec{\varepsilon })\rceil \right) ^{(4+{2}\delta )} \right) \\&\le \tilde{\mathcal {O}}\left( \tfrac{1}{\varvec{\varepsilon }}\right) . \end{aligned}$$

\(\square \)

Remark 2

Sm-AL is designed for convex problems with nonsmooth nonlinear convex constraints, achieving an overall complexity of \(\tilde{\mathcal {O}}\left( \varvec{\varepsilon }^{-3/2}\right) \) under geometric growth of \(\rho _k\); this is slightly worse than the best known complexity of \({\mathcal {O}}(\varvec{\varepsilon }^{-1})\) (up to logarithmic terms) for settings with smooth nonlinear constraints (cf. [26, 44]).

4.3 Complexity analysis for (Sm-AL) with fixed \(\eta \)

Next, we apply (Sm-AL) to (NSCopt\(_{\eta }\)) with a fixed, appropriately chosen \(\eta \), with the goal of finding an \((\bar{\textbf{x}}_K,\bar{\lambda }_K)\) such that either the dual suboptimality is sufficiently small, i.e., \(f_{\eta }^* - \mathcal{D}_{\eta ,0}(\bar{\lambda }_K) \, \le \, \varvec{\varepsilon }\) (constant \(\rho _k = \rho _0\)), or the primal suboptimality is sufficiently small, i.e., \(|f_{\eta }(\textbf{x}_K) - f_{\eta }^*| < \varvec{\varepsilon }\) (geometrically increasing \(\rho _k\)).

(a) (Constant \(\rho \)) Suppose \({\eta \le \tilde{c}\varvec{\varepsilon }}\), where \(\tilde{c}\) is specified below. After K steps of (Sm-AL), \(f_{\eta }^* - \mathcal{D}_{\eta ,0}(\bar{\lambda }_K) \, \le \, \tfrac{\varvec{\varepsilon }}{2}\), where \(K = \bigg \lceil \tfrac{C}{\varvec{\varepsilon }}\bigg \rceil \) for a suitable C. By Lemma 4,

$$\begin{aligned} f(\textbf{x}^*) - \mathcal{D}_0(\bar{\lambda }_K) \,&\le \, f_{\eta }(\textbf{x}^*) + \eta \beta - \mathcal{D}_{\eta ,0}(\bar{\lambda }_K) + \eta (\Vert \bar{\lambda }_K\Vert m + 1) \beta \\ \,&\le \, {\underbrace{f_{\eta }(\textbf{x}_{\eta }^*) - \mathcal{D}_{\eta ,0}(\bar{\lambda }_K)}_{{\, \le \, \tfrac{\varvec{\varepsilon }}{2}}}} + \underbrace{\eta \left( \beta ({\tilde{B}}_{\lambda } m+2)\right) }_{{\, \le \, \tfrac{\varvec{\varepsilon }}{2}}} \, \le \, {\varvec{\varepsilon }}. \end{aligned}$$

To ensure that the second term is less than \(\varvec{\varepsilon }/2\), we select \(\eta \le \tfrac{\varvec{\varepsilon }}{2 \left( \beta (2 + {\tilde{B}}_{\lambda } m)\right) }\).

(b) (Geometrically increasing \(\rho _k\)). Proceeding similarly, suppose \({\eta \le \tilde{c} \varvec{\varepsilon }}\); then, by taking K steps of (Sm-AL), \(|f_{\eta }(\textbf{x}_K) - f_{\eta }^*| \, \le \, \tfrac{\varvec{\varepsilon }}{2}\), where \(K = \lceil \ln \left( \tfrac{C}{\varvec{\varepsilon }}\right) /\ln \left( \zeta \right) \rceil \) for a suitable C. Consequently, if \(\eta \le \tfrac{\varvec{\varepsilon }}{2\beta }\), then \(f(\textbf{x}_K) - f^* \le \varvec{\varepsilon }\), as seen from the following.

$$\begin{aligned} f(\textbf{x}_K) - f^* \, \le \, f_{\eta }(\textbf{x}_K) - f_{\eta }(\textbf{x}^*) + \eta \beta \, \le \, \underbrace{f_{\eta }(\textbf{x}_K) - f_{\eta }(\textbf{x}_{\eta }^*)}_{\le \tfrac{\varvec{\varepsilon }}{2}} + \underbrace{\eta \beta }_{\le \tfrac{\varvec{\varepsilon }}{2}} \, \le \, {\varvec{\varepsilon }}. \end{aligned}$$

Similarly, \(f^* - f(\textbf{x}_K) \le \varvec{\varepsilon }\) under the same choice of \(\eta \), implying that \(| f(\textbf{x}_K)-f^*| \le \varvec{\varepsilon }\) whenever \(\eta \le \tfrac{\varvec{\varepsilon }}{2\beta }\).
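As a simple numerical illustration (with purely hypothetical values \(\beta = 1\), \(\tilde{B}_{\lambda } m = 3\), and \(\varvec{\varepsilon } = 10^{-2}\), not constants arising from the analysis), the prescription in (a) requires

$$\begin{aligned} \eta \, \le \, \tfrac{\varvec{\varepsilon }}{2\left( \beta (2 + \tilde{B}_{\lambda } m)\right) } = \tfrac{10^{-2}}{2\cdot 5} = 10^{-3}, \end{aligned}$$

whereas the prescription in (b) only requires \(\eta \le \tfrac{\varvec{\varepsilon }}{2\beta } = 5\times 10^{-3}\); the dual-gap criterion in (a) thus dictates a smaller smoothing parameter in this instance.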

figure r

Proof

(a) By Theorem 4, \(M_k\) is the smallest integer satisfying

$$\begin{aligned}&\mathcal {L}_{\rho _k,\eta }(\textbf{x}_{k+1},\lambda _k) - {\mathcal {D}_{\rho _k,\eta }(\lambda _k)}\le \left( \tfrac{{C_1}L_k}{M_k^2}\right) \le \left( \tfrac{{C_1}\tilde{C}\rho _0}{\eta M_k^2}\right) \le \epsilon _k\\ \implies&M_k = {\bigg \lceil \sqrt{\tfrac{{C_1}\tilde{C}\rho _0}{\epsilon _k\eta }}\bigg \rceil } = \bigg \lceil \left( \sqrt{ \tfrac{{2{C_1}\tilde{C} \left( \beta (2 + B_{\lambda } m)\right) }\rho _0}{\varvec{\varepsilon }}}\right) k^{1+\delta }\bigg \rceil = \bigg \lceil \left( \sqrt{\tfrac{D\rho _0}{\varvec{\varepsilon }}}\right) k^{1+\delta }\bigg \rceil \end{aligned}$$

where \(C_1, \tilde{C}, \beta , B_{\lambda }\) are constants and \(D \triangleq 2C_1\tilde{C} \left( \beta (2 + B_{\lambda } m)\right) \). Then the complexity of computing a pair \((\bar{\textbf{x}}_K,\bar{\lambda }_K)\) satisfying \(f^* - \mathcal {D}_0(\bar{\lambda }_K) \le \varvec{\varepsilon }\) is bounded as follows.

$$\begin{aligned} \sum _{k = 1}^{K(\varvec{\varepsilon })} M_k = \sum _{k = 1}^{\lceil C/\varvec{\varepsilon }\rceil } \bigg \lceil \left( \sqrt{\tfrac{D\rho _0}{\varvec{\varepsilon }}}\right) k^{1+\delta }\bigg \rceil = {\sqrt{D\rho _0}\varvec{\varepsilon }^{-\tfrac{1}{2}} \sum _{k = 1}^{\lceil C/\varvec{\varepsilon }\rceil }\bigg \lceil k^{1+\delta }\bigg \rceil }\le \mathcal {O}\left( \varvec{\varepsilon }^{-\left( \frac{5}{2} + \delta \right) }\right) . \end{aligned}$$

(b) Consider \(\rho _k = \rho _0\zeta ^{k}\) where \(k\ge 0\) and \(\zeta >1\). Proceeding as in (a) and by invoking Theorem 4,

$$\begin{aligned} M_k = {\bigg \lceil \sqrt{\tfrac{{C_1}\tilde{C}\rho _k}{\epsilon _k\eta }}\bigg \rceil }= \bigg \lceil \sqrt{\tfrac{2C_1\tilde{C}\beta }{\varvec{\varepsilon }}}\rho _kk^{1+\delta }\bigg \rceil = \bigg \lceil \sqrt{\tfrac{D}{\varvec{\varepsilon }}}\rho _kk^{1+\delta }\bigg \rceil \end{aligned}$$

where \(C_1, \tilde{C}, \beta \) are constants and \(D \triangleq 2C_1\tilde{C}\beta \). Then the iteration complexity of producing an \(\textbf{x}_K\) satisfying \(|f^* - f(\textbf{x}_K)| \le \varvec{\varepsilon }\) admits the following bound, where \(C, D > 0\).

$$\begin{aligned}&\sum _{k=1}^{K(\varvec{\varepsilon })} M_k = \sum _{k=1}^{\lceil \ln {\frac{C}{\varvec{\varepsilon }}}/\ln {\zeta }\rceil }\bigg \lceil \left( \sqrt{\tfrac{D}{\varvec{\varepsilon }}}\right) \rho _kk^{(1+\delta )}\bigg \rceil \, \le \, \left( \sqrt{\tfrac{D}{\varvec{\varepsilon }}}\right) \rho _0^2\sum _{k=1}^{\log _{\zeta }{\left( \frac{C}{\varvec{\varepsilon }}\right) }+1}\zeta ^{k}k^{(1+\delta )}\\&\le \sqrt{D}\rho _0^2\varvec{\varepsilon }^{-\tfrac{1}{2}} \left( \lceil \log _{\zeta }{\left( \tfrac{C}{\varvec{\varepsilon }}\right) }+1\rceil \right) ^{2(1+\delta )}\int _{1}^{\log _{\zeta }{\left( \frac{C}{\varvec{\varepsilon }}\right) }+2}\zeta ^{u} du \le \tilde{\mathcal {O}}\left( \varvec{\varepsilon }^{-\frac{3}{2}}\right) . \end{aligned}$$

\(\square \)

Remark 3

We observe that the complexity guarantees are close to those for diminishing \(\eta _k\), with a slight improvement in the constant \(\rho _0\) regime. We recall that Nesterov [33] and Beck and Teboulle [7] adopted different smoothing techniques with fixed \(\eta \) to obtain an \(\varvec{\varepsilon }\)-optimal solution within \(\mathcal {O}(1/\varvec{\varepsilon })\) iterations. Compared to the smoothing schemes in [7, 33], Sm-AL targets problems with nonsmooth constraint functions. Moreover, Sm-AL accommodates both fixed and varying \(\eta \), with an effective complexity rate of \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-3/2})\), matching the complexity of a smoothed penalized scheme [3].

Table 2 summarizes rates and complexities for Sm-AL, Sm-AL(\(\eta \)), Sm-AL(S), and N-AL, where (a) Sm-AL is the smoothed ALM for convex problems; (b) Sm-AL(\(\eta \)) is the \(\eta \)-smoothed ALM; (c) Sm-AL(S) is Sm-AL for strongly convex problems; and (d) N-AL is the original ALM for nonsmooth problems. Additionally, Table 3 collects all of the constants utilized in the results from Sections 3 and 4.

Table 2 Rates & Complexity
Table 3 Constants in Theorems/Propositions

5 Numerical Experiments

5.1 Fused lasso problems

In this section, we apply (Sm-AL) to a fused lasso problem with a dataset \(\left\{ X_i, y_i\right\} _{i = 1}^N\), where \(X_i\) is the d-dimensional feature vector for the ith instance and \(y_i\) is the corresponding response. Consider the \(\eta \)-smoothing of (1).

$$\begin{aligned} \min _{\beta \in {\mathcal {X}}}&\quad \Vert Y-X^\top \beta \Vert ^2 \\ \mathop {\mathrm {subject\;to}}\limits&\quad \sum _{j}\left( \sqrt{\beta _j^2 + \eta ^2} - \eta \right) \le C_1, \sum _{j}\left( \sqrt{(\beta _j-\beta _{j-1})^2+\eta ^2}-\eta \right) \le C_2. \end{aligned}$$
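For concreteness, a minimal NumPy sketch of the two \(\eta \)-smoothed constraint functions and their gradients is provided below. The sample data, the values of \(\eta \), \(C_1\), and \(C_2\), and the convention that the fused term runs over consecutive pairs \((\beta _{j-1},\beta _j)\) are illustrative assumptions rather than the exact experimental setup.

```python
import numpy as np

def smoothed_l1(beta, eta):
    # eta-smoothing of sum_j |beta_j|: sum_j ( sqrt(beta_j^2 + eta^2) - eta )
    return np.sum(np.sqrt(beta ** 2 + eta ** 2) - eta)

def smoothed_l1_grad(beta, eta):
    return beta / np.sqrt(beta ** 2 + eta ** 2)

def smoothed_fused(beta, eta):
    d = np.diff(beta)  # consecutive differences beta_j - beta_{j-1}
    return np.sum(np.sqrt(d ** 2 + eta ** 2) - eta)

def smoothed_fused_grad(beta, eta):
    d = np.diff(beta)
    w = d / np.sqrt(d ** 2 + eta ** 2)
    g = np.zeros_like(beta)
    g[1:] += w    # contribution through beta_j
    g[:-1] -= w   # contribution through beta_{j-1}
    return g

# illustrative simulated data (not the datasets used in Table 4)
rng = np.random.default_rng(1)
n, N = 5, 50
X = rng.normal(size=(n, N))                 # columns are feature vectors X_i
beta_true = np.array([1.0, 1.0, 0.0, 0.0, -1.0])
Y = X.T @ beta_true + 0.1 * rng.normal(size=N)

eta, C1, C2 = 1e-2, 3.0, 2.0                # illustrative parameter choices
beta = np.zeros(n)
obj = np.linalg.norm(Y - X.T @ beta) ** 2   # objective ||Y - X^T beta||^2
g1 = smoothed_l1(beta, eta) - C1            # smoothed constraint g_{1,eta}(beta) <= 0
g2 = smoothed_fused(beta, eta) - C2         # smoothed constraint g_{2,eta}(beta) <= 0
print(obj, g1, g2, smoothed_l1_grad(beta, eta), smoothed_fused_grad(beta, eta))
```

These smoothed constraints, together with the quadratic objective, are the ingredients entering the augmented Lagrangian subproblem solved within (Sm-AL).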

We conducted experiments on simulated datasets with the dimension of \(\beta \) ranging from 5 to 1000; the results are shown in Table 4. The optimal solution for each experiment was obtained using fmincon in Matlab. In Table 4, we compare the results from Sm-AL with those from N-AL. Both Sm-AL and N-AL terminated at 50 outer iterations, except that the \(n = 1000\) case for Sm-AL was stopped at the 30th outer iteration to save time. N-AL was terminated when the overall runtime exceeded two hours for higher dimensional problems. In all cases, Sm-AL outperforms N-AL with respect to primal suboptimality and overall runtime.

Table 4 Numerical results

Next, we compare the results from Sm-AL with AL on an \(\eta \)-smoothed problem for a single instance (\(n = 5\)). We observe that such fixed-smoothing avenues provide relatively coarse approximations compared to their iteratively smoothed counterparts. Finally, we compare empirical rates of Sm-AL in two settings of \(\rho _k\) for a smaller problem (\(n = 5\)) in terms of primal suboptimality in Figure 1 and observe alignment with the theoretical rates, represented by blue lines with triangular markers.

The following insights were derived from the analysis of primal suboptimality, as shown in Figure 1.

  (i) First, employing a constant \(\eta \) leads to a sequence that converges to an approximate solution, while a diminishing \(\eta _k\) allows for asymptotic convergence to a true solution.

  (ii) Second, choosing a very small \(\eta \) may impede early progress of the scheme, since this leads to a large Lipschitz constant L, constraining the steplength. On the other hand, selecting a larger \(\eta \) allows for better early progress, but the sequence will converge to a solution that may differ significantly from the true solution. A diminishing \(\eta _k\) sequence starts with a larger \(\eta \) (allowing for larger steps and greater progress) but comes with a guarantee that the sequence will converge to a true solution. This is reflected in Figure 1.

  (iii) Third, the complexity guarantees for constant \(\eta \) are close to those for diminishing \(\eta _k\), with a slight improvement in the constant \(\rho _0\) regime (see Theorem X and Remark 3). When compared to the results in Proposition 2 with constant \(\rho \), Sm-AL with constant \(\eta \) improves the overall complexity by \(\mathcal {O}\left( \varvec{\varepsilon }^{-1/2}\right) \). The diminishing nature of \(\eta _k\) slows convergence owing to the additional summability requirement on \(\eta _k\).

Fig. 1
figure 1
Primal suboptimality for fused lasso problems under constant (left) and increasing (right) \(\rho _k\)

5.2 Incorporation of termination criteria

Next, we consider the introduction of termination criteria T1 and T2 and examine the impact of potential early termination, measured by \(\sum _{k}N_k\). Table 5 provides a comparison of the Sm-AL scheme with and without termination criteria. It can be observed that the incorporation of these termination criteria leads to significant computational benefits with little (if any) impact on accuracy. A natural question lies in the choice of \(\gamma \) in the definition of the residual function \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}\). We observe that when \(\textbf{x}\in \mathcal{X}\),

$$\begin{aligned} \left\| F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}(\textbf{x},\textbf{u}) \right\|&= \left\| \tilde{\epsilon } \left( \textbf{u}- \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] \right) - \textbf{x}+ \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] \right\| \\&\le \left\| \tilde{\epsilon } \left( \varPi _\mathcal{X}[\textbf{u}] - \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] \right) \right\| + \left\| \varPi _\mathcal{X}[\textbf{x}] - \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] \right\| \\&\le \left\| \tilde{\epsilon } \left( \textbf{u}- \textbf{x}+ \gamma \nabla h(\textbf{x}) \, \right) \right\| + \left\| \gamma \nabla h(\textbf{x}) \, \right\| \\&\le 2\tilde{\epsilon } {C} + \gamma (1+\tilde{\epsilon }) {D}, \end{aligned}$$

where \(\Vert \textbf{x}\Vert \le {C}\) and \(\Vert \nabla h(\textbf{x})\Vert \le {D}\) for any \(\textbf{x}, \textbf{u} \in \mathcal X\). From the above bound, it may be observed that small choices of \(\gamma \) may lead to early satisfaction of conditions T1 or T2, while larger choices of \(\gamma \) may require significantly more iterations. Ideally, since we have already developed convergence guarantees, it would be helpful to relate \(\gamma \) to \(\eta _k\). Some preliminary numerics are provided in Table 6, where the choice of \(\gamma \) is varied in condition T2, leading to some variability in performance. It can be surmised from this table that a constant \(\gamma \) leads to poorer performance, while diminishing choices of \(\gamma \) lead to far superior behavior. This is perhaps unsurprising in that, for larger values of K, \(\gamma \) is smaller and imposes a more modest threshold for satisfying the condition, thereby allowing for earlier termination.
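To illustrate the dependence on \(\gamma \) discussed above, the following minimal sketch evaluates the map inside the displayed norm for several values of \(\gamma \); the box feasible set, the quadratic \(h\), the point \(\textbf{x}\), and the reference point \(\textbf{u}\) are illustrative assumptions, and the precise tests T1 and T2 (defined earlier in the paper) are not reproduced here.

```python
import numpy as np

def proj_X(x, lo=-1.0, hi=1.0):
    # illustrative feasible set: the box [lo, hi]^n
    return np.clip(x, lo, hi)

def nat_residual(x, u, grad_h, gamma, eps_tilde):
    # F^{nat, eps~}_X(x, u) = eps~ * (u - Pi_X[x - gamma * grad h(x)]) - x + Pi_X[x - gamma * grad h(x)]
    p = proj_X(x - gamma * grad_h(x))
    return eps_tilde * (u - p) - x + p

# illustrative data: h(x) = 0.5 * ||x - c||^2, so grad h(x) = x - c
c = np.array([0.3, -2.0, 0.8])
grad_h = lambda x: x - c
x = np.array([0.5, -0.9, 0.2])
u = np.zeros_like(x)

for gamma in [1.0, 0.1, 0.01]:
    r = np.linalg.norm(nat_residual(x, u, grad_h, gamma, eps_tilde=1e-2))
    print(gamma, r)   # smaller gamma yields a smaller residual norm, easing a T2-type check
```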

6 Conclusion

In this paper, we develop a smoothed AL scheme for resolving convex programs with possibly nonsmooth constraints and provide rate and complexity guarantees for convex and strongly convex settings under constant and increasing penalty parameter sequences. The complexity guarantees represent significant improvements over the best available guarantees for AL schemes applied to convex programs with nonsmooth objectives and constraints. A by-product of our analysis is a relationship between saddle points of \(\eta \)-smoothed problems and \(\eta \)-saddle points of the original problem. Moreover, to improve the practical behavior of the proposed Sm-AL scheme, we have developed termination criteria that allow for early termination. Our preliminary numerics suggest that such criteria lead to significant improvements in the complexity of our scheme with modest impacts on the accuracy of the resulting solutions. We believe that our findings represent a foundation for considering extensions to compositional regimes with expectation-valued and possibly nonsmooth constraints.

Table 5 Numerical results with termination criteria
Table 6 Performance vs choice of \(\gamma \)