Abstract
Augmented Lagrangian (AL) methods have proven remarkably useful in solving optimization problems with complicated constraints. The last decade has seen the development of overall complexity guarantees for inexact AL variants. Yet, a crucial gap persists in addressing nonsmooth convex constraints. To this end, we present a smoothed augmented Lagrangian (AL) framework where nonsmooth terms are progressively smoothed with a smoothing parameter \(\eta _k\). The resulting AL subproblems are \(\eta _k\)-smooth, allowing for leveraging accelerated schemes. By a careful selection of the inexactness level \(\epsilon _k\) (for inexact subproblem resolution), the penalty parameter \(\rho _k\), and smoothing parameter \(\eta _k\) at epoch k, we derive rate and complexity guarantees of \(\tilde{\mathcal {O}}(1/\varvec{\varepsilon }^{3/2})\) and \(\tilde{\mathcal {O}}(1/\varvec{\varepsilon })\) in convex and strongly convex regimes for computing an \(\varvec{\varepsilon }\)-optimal solution, when \(\rho _k\) increases at a geometric rate, a significant improvement over the best available guarantees for AL schemes for convex programs with nonsmooth constraints. Analogous guarantees are developed for settings with \(\rho _k = \rho \) as well as \(\eta _k = \eta \). Preliminary numerics on a fused Lasso problem display promise.
1 Introduction
We consider the nonsmooth convex program, defined as
$$\begin{aligned} \min _{\textbf{x}\in \mathcal {X}} \ f(\textbf{x}) \quad \text{ subject to } \quad g(\textbf{x}) \, \le \, 0, \qquad \qquad \text{(NSCopt)} \end{aligned}$$
where \(f: {\mathcal {X}}\rightarrow \mathbb {R}\) is a real-valued convex function and is possibly nonsmooth (but smoothable), \({\mathcal {X}}\subset \mathbb {R}^{n}\) is a closed and convex set, and \(g(\textbf{x}) = (g_1(\textbf{x}),g_2(\textbf{x}),\dots ,g_m(\textbf{x}))^\top \), where each \(g_i :{\mathcal {X}}\rightarrow \mathbb {R}\), \(i = 1, 2, \cdots , m\), is a possibly complicated nonsmooth (but smoothable) convex function. Generally, the presence of such constraints precludes usage of projection-based methods to ensure feasibility of iterates. In deterministic regimes, a host of approaches have been employed for contending with complicated constraints, a subset of which include sequential quadratic programming [18, 43], interior point methods [8], and augmented Lagrangian (AL) schemes [38, 39]. Of these, AL schemes have proven to be enormously influential in the context of scientific computing [1, 9, 13], and more specifically in nonlinear programming in the form of solvers such as minos [16, 28] and lancelot [10] as well as more refined techniques [15, 17]. There has been a significant interest in deriving overall complexity bounds [24, 44] in convex regimes when the Lagrangian subproblem is solved via a first-order method. However, such bounds tend to be poor when constraints are possibly nonsmooth; e.g. standard AL schemes display complexity guarantees of \(\mathcal {O}(\varvec{\varepsilon }^{-5})\) for computing an \(\varvec{\varepsilon }\)-optimal solution in such settings (see Table 1).

1.1. Related work. Before proceeding, we discuss related prior research. (a) Augmented Lagrangian Methods. The augmented Lagrangian method (ALM) was proposed by Hestenes [19] and Powell [37], with a comprehensive rate analysis subsequently provided by Rockafellar [38]. The ALM framework relies on solving a sequence of unconstrained (or relaxed) problems, requiring the minimization of a suitably defined augmented Lagrangian function \(\mathcal{L}_{\rho }(\textbf{x},\lambda )\) in \(\textbf{x}\), where \(\rho \) and \(\lambda \) denote the penalty parameter and the Lagrange multiplier associated with g, respectively. In high-dimensional settings, the Lagrangian subproblems cannot be solved exactly, leading to the development of variants that allow for inexact resolution of the Lagrangian subproblem. Kang et al. [21] presented an inexact accelerated ALM for strongly convex optimization with linear constraints at a rate of \(\mathcal {O}(1/k^2)\), where k is the iteration counter. Non-ergodic convergence guarantees were provided in [24, 25], where either smoothness of f [24] or a composite structure [25] is assumed. Overall complexity guarantees were first provided by Lan and Monteiro [24], Aybat and Iyengar [4], Necoara et al. [29] and most recently Lu and Zhou [26], where the latter three references allowed for conic settings. In fact, Lu and Zhou [26] showed that in conic convex settings with smooth nonlinear constraints, by introducing a regularization, the overall complexity is improved to \(\mathcal {O}\left( \varvec{\varepsilon }^{-1}\ln (\varvec{\varepsilon }^{-1})\right) \) with a geometrically increasing penalty parameter. Nedelcu et al. [30] considered convex and strongly convex regimes. Notably, Necoara et al. [29] derived overall complexities of \(\mathcal {O}(\varvec{\varepsilon }^{-\frac{3}{2}})\) and \(\mathcal {O}(\varvec{\varepsilon }^{-1})\) in smooth settings under convexity and strong convexity of the objective, respectively.
More recently, Xu [44] considered nonlinear but smooth regimes, proposing an inexact ALM (under a suitable boundedness requirement) with complexity guarantees of \(\mathcal {O}(\varvec{\varepsilon }^{-1})\) (under convex f) and \(\mathcal {O}(\varvec{\varepsilon }^{-\frac{1}{2}}\log ({\varvec{\varepsilon }^{-1}}))\) (under strongly convex f), respectively. Table 1 compares existing complexity guarantees for AL schemes with both our schemes in convex (Sm-AL) and strongly convex settings (Sm-AL(S)) and standard ALM (N-AL), where \(\tilde{\mathcal {O}}\) suppresses logarithmic terms.
(b) Smoothing techniques. While subgradient methods have proven effective in addressing nonsmooth convex objectives [36], smoothing techniques [6] represent an efficient avenue for a subclass of nonsmooth problems. Moreau [27] introduced the (Moreau)-smoothing \(f_\eta \) of a convex function f, with parameter \(\eta \), defined as
$$\begin{aligned} f_{\eta }(\textbf{x}) \, \triangleq \, \min _{\textbf{u}} \left[ \, f(\textbf{u}) + \tfrac{1}{2\eta }\Vert \textbf{u}-\textbf{x}\Vert ^2 \, \right] . \end{aligned}$$
Nesterov [33] employed a fixed smoothing parameter in developing a smoothing framework for nonsmooth convex optimization problems with a rate of \(\mathcal {O}(\varvec{\varepsilon }^{-1})\), an improvement over \(\mathcal {O}(\varvec{\varepsilon }^{-2})\) attainable by subgradient methods. In related work, Aybat and Iyengar [3] designed a smoothed penalty method for obtaining \(\varvec{\varepsilon }\)-optimal solutions for \(l_1\)-minimization problems with linear equality constraints in \(\tilde{\mathcal {O}}\left( \varvec{\varepsilon }^{-3/2}\right) \) steps. Subsequently, Beck and Teboulle [7] defined an \((\alpha , \beta )\)-smoothing for a nonsmooth convex f satisfying the following two conditions: (i) \(f_{\eta }(\textbf{x}) \, \le \, f(\textbf{x}) \, \le \, f_{\eta }(\textbf{x}) + \eta \beta \) for all \(\textbf{x}\) and (ii) \(\, f_{\eta }\) is \((\alpha /\eta )\)-smooth. For instance, \(f(\textbf{x}) \triangleq \max \{0,\textbf{x}\}\) has a smoothing \(f_{\eta }\), defined as \(f_{\eta }(\textbf{x}) \triangleq \eta \log (1+\exp (\frac{\textbf{x}}{\eta }))-\eta \log 2.\) Analogous approaches have been employed for addressing deterministic [12] and stochastic [20] convex optimization problems.
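To make the \((\alpha ,\beta )\)-smoothing conditions concrete, the following sketch (our own illustration, using the softplus example above) verifies numerically that \(f_{\eta }(x) = \eta \log (1+e^{x/\eta }) - \eta \log 2\) is a smoothing of \(f(x) = \max \{0,x\}\) with \(\beta = \log 2\) and \(\alpha = 1/4\) (values we compute here; they are not stated in the text):

```python
import numpy as np

def f(x):
    """Nonsmooth function f(x) = max{0, x}."""
    return np.maximum(0.0, x)

def f_eta(x, eta):
    """Softplus smoothing f_eta(x) = eta*log(1+exp(x/eta)) - eta*log(2),
    computed stably via logaddexp(0, x/eta) = log(1 + exp(x/eta))."""
    return eta * np.logaddexp(0.0, x / eta) - eta * np.log(2.0)

eta = 0.05
xs = np.linspace(-3, 3, 2001)
gap = f(xs) - f_eta(xs, eta)

# Condition (i), the smoothing sandwich: 0 <= f - f_eta <= eta * beta, beta = log 2.
assert np.all(gap >= -1e-12)
assert np.all(gap <= eta * np.log(2.0) + 1e-12)

# Condition (ii): the derivative is sigmoid(x/eta), whose slope is bounded by
# (alpha/eta) with alpha = 1/4, i.e. f_eta is (1/(4*eta))-smooth.
grad = 1.0 / (1.0 + np.exp(-xs / eta))
slopes = np.abs(np.diff(grad) / np.diff(xs))
assert np.all(slopes <= 0.25 / eta + 1e-6)
print("max gap:", gap.max(), "<= eta*log2 =", eta * np.log(2.0))
```

As \(\eta \) shrinks, the approximation tightens while the gradient Lipschitz constant \(\alpha /\eta \) grows, which is exactly the trade-off the smoothed AL framework manages through the schedule \(\{\eta _k\}\).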
1.2. Applications. We present three applications where nonsmooth convex constraints emerge.
(a) Regression. Lasso regression [40] is a model widely used in variable selection in statistical learning. Assume that the dataset consists of \(\{y_i,X_i\}_{i=1}^{N}\), where \((y_i,X_i)\) denotes the outcome and feature vector for the ith instance. Then an elastic-net model [46] can be articulated as follows, where \(C_1 > 0\):
$$\begin{aligned} \min _{\beta } \ \sum _{i=1}^{N} \left( y_i - X_i^\top \beta \right) ^2 + \alpha \Vert \beta \Vert ^2 \quad \text{ subject to } \quad \Vert \beta \Vert _1 \, \le \, C_1. \end{aligned}$$
This reduces to standard Lasso [40] when \(\alpha = 0\) and is generalizable to fused Lasso [41] by adding an additional nonsmooth constraint \(\sum _{j = 2}^{p}|\beta _j-\beta _{j-1}| \le C_2\), where \(C_2 > 0\).
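The \(\ell _1\)-type constraints above are smoothable term by term. As an illustration (our own construction, not from the paper), one may replace each absolute value in the fused-Lasso constraint by its Huber smoothing, which is a standard \((1, 1/2)\)-smoothing of \(|t|\), and check the resulting sandwich bound:

```python
import numpy as np

def huber(t, eta):
    """Huber smoothing of |t|: a (1, 1/2)-smoothing, i.e.
    huber(t) <= |t| <= huber(t) + eta/2, with a (1/eta)-Lipschitz gradient."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= eta, t**2 / (2 * eta), np.abs(t) - eta / 2)

def fused_constraint(beta, C2):
    """Nonsmooth fused-Lasso constraint g(beta) = sum_j |beta_j - beta_{j-1}| - C2."""
    return np.sum(np.abs(np.diff(beta))) - C2

def fused_constraint_smoothed(beta, C2, eta):
    """eta-smoothed constraint obtained by smoothing each |.| term."""
    return np.sum(huber(np.diff(beta), eta)) - C2

rng = np.random.default_rng(0)
beta = rng.normal(size=20)
C2, eta = 3.0, 0.1
g = fused_constraint(beta, C2)
g_eta = fused_constraint_smoothed(beta, C2, eta)
# The per-term sandwich aggregates to: 0 <= g - g_eta <= (p-1) * eta / 2.
assert 0.0 <= g - g_eta <= (len(beta) - 1) * eta / 2 + 1e-12
print(g, g_eta)
```

This is precisely the structure exploited later: the smoothed constraint underestimates the nonsmooth one, so smoothed-feasible points are \(\mathcal {O}(\eta )\)-feasible for the original constraint.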
(b) Classification. In statistical learning, the Neyman-Pearson (NP) classification [42] is designed to minimize the type II error while maintaining the type I error below a user-specified level \(\alpha \). Consider a labeled training dataset \(\{a_i\}_{i = 1}^N\), where the positive and negative sets are represented by \(\{a_i^{(1)}\}_{i = 1}^{N_{(1)}}\) and \(\{a_i^{(-1)}\}_{i = 1}^{N_{(-1)}}\), respectively. The empirical NP classification problem is given by [45] as follows
where \(\varvec{\ell }(\bullet )\) denotes the loss function. Choices of the loss function include nonsmooth variants such as mean absolute error (MAE) and hinge loss.
(c) Multiple Kernel learning. Multiple kernel learning (MKL) employs a predefined set of kernels to learn an optimal linear or nonlinear combination of these kernels, defined as follows [22].
where \(\psi _i(\bullet ), i = 1, \dots , m\) are predefined kernels, \(\theta \) is a vector of coefficients for the kernels, and w is the weight vector of the primal model for learning with multiple kernels.
1.3. Contributions. We present a smoothed AL framework (Sm-AL) where the nonsmooth (but smoothable) objective/constraints are smoothed with a diminishing smoothing parameter \(\eta _k\). Consequently, the AL subproblem (with penalty parameter \(\rho _k\)) is proven to be \(\mathcal {O}(\rho _k/\eta _k)\)-smooth, allowing for (accelerated) computation of an \(\epsilon _k\)-exact solution in finite time. By a careful selection of the sequences \(\{\epsilon _k,\eta _k,\rho _k\}\), we derive rate and complexity guarantees. Our contributions are formalized next.
(i) In Section 2, we derive an ex-ante bound on the optimal multiplier set of the \(\eta \)-smoothed problem. This result, which is of independent interest, allows for claiming that a saddle-point of the \(\eta \)-smoothed problem is an \(\mathcal {O}(\eta )\)-saddle point of the original problem, allowing for deriving fixed smoothing schemes.
(ii) In Section 3, we establish a dual suboptimality rate of \(\mathcal {O}(k^{-1})\) and primal infeasibility rate of \(\mathcal {O}(k^{-1/2})\) (constant penalty) while geometric rates of \(\mathcal {O}(1/\rho _k)\) on primal infeasibility and suboptimality are derived under geometrically increasing penalty parameters. In Section 4, by employing an accelerated gradient framework for resolving the \(\eta _k\)-smoothed AL subproblem, the overall complexities of (Sm-AL) in terms of inner projection steps for obtaining an \(\varvec{\varepsilon }\)-optimal solution are proven to be \(\mathcal {O}(\varvec{\varepsilon }^{-(3+\delta )})\) (constant penalty) and \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-3/2})\) (geometrically increasing penalty). Analogous bounds in strongly convex settings are given by \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-(2+\delta )})\) and \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-1})\) for constant and geometrically increasing penalty parameters, respectively. Similar complexity guarantees are available with a fixed smoothing parameter, akin to those developed in [7, 33] for convex programs with nonsmooth objectives.
(iii) We also develop practical termination criteria in Section 2, which when overlaid with our proposed scheme lead to significantly improved empirical complexity in our numerical experiments with little impact on accuracy.
(iv) Preliminary numerical results are provided in Section 5 before concluding in Section 6.
Organization The remainder of the paper is organized as follows. In Section 2, we introduce the smoothed augmented Lagrangian framework, providing the requisite background and the assumptions. Sections 3 and 4 provide the rate and complexity analysis while Section 5 presents a description of our numerical experiments. The paper concludes in Section 6.
Notation. Let \(\Vert \cdot \Vert \) denote the Euclidean norm. Given a closed convex set \(\mathcal {X} \subseteq \mathbb {R}^n\) and \(y\in \mathbb {R}^n\), \(d_{\mathcal {X}}(y)\triangleq {\displaystyle \min _{s\in \mathcal {X}}} \Vert y-s\Vert \), \(d^2_{\mathcal {X}}(y)\triangleq \left( d_{\mathcal {X}}(y)\right) ^2\), and \(\varPi _{\mathcal {X}}(y)\triangleq {\displaystyle \text{ argmin}_{s\in \mathcal{X}}} \Vert y-s\Vert \); hence, \(d_{\mathcal {X}}(y)=\Vert y-\varPi _{\mathcal {X}}(y)\Vert \). Moreover, \(d^2_{\mathcal {X}}(\cdot )\) is differentiable with gradient \(\nabla d^2_{\mathcal {X}}(y)=2(y-\varPi _{\mathcal {X}}(y))\). \(d_{-}(u)\) denotes the distance of u to the nonpositive orthant \(\mathbb {R}^n_{-}\), defined as \(d_{-}(u) \, \triangleq \, \Vert u - \varPi _{\mathbb {R}^n_-} [u]\Vert _{2}.\) Moreover, \(\tilde{\mathcal {O}}(f(n))\) is \(\mathcal {O}(f(n))\) up to a \(\log (n)\) factor. Finally, \(\textbf{1}\) denotes the column of ones in \(\mathbb {R}^n\).
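The distance and projection objects in the notation admit short closed forms for simple sets. A minimal sketch (our own, taking a box for \(\mathcal {X}\) as an assumed concrete example):

```python
import numpy as np

def proj_box(y, lo, hi):
    """Euclidean projection onto the box X = [lo, hi]^n, one closed convex
    set for which Pi_X is available in closed form (componentwise clipping)."""
    return np.clip(y, lo, hi)

def dist(y, lo, hi):
    """d_X(y) = ||y - Pi_X(y)||."""
    return np.linalg.norm(y - proj_box(y, lo, hi))

def grad_sq_dist(y, lo, hi):
    """Gradient of the squared distance: grad d_X^2(y) = 2*(y - Pi_X(y))."""
    return 2.0 * (y - proj_box(y, lo, hi))

def d_minus(u):
    """Distance of u to the nonpositive orthant: projecting onto R^m_- clips
    positive entries to zero, so d_-(u) = ||max(u, 0)||."""
    return np.linalg.norm(np.maximum(u, 0.0))

y = np.array([2.0, -3.0, 0.5])
assert np.allclose(proj_box(y, -1.0, 1.0), [1.0, -1.0, 0.5])
assert np.isclose(dist(y, -1.0, 1.0), np.hypot(1.0, 2.0))
assert np.isclose(d_minus(np.array([3.0, -4.0])), 3.0)  # only positive parts count
print("notation checks passed")
```

In particular, \(d_{-}(g(\textbf{x}))\) measures constraint violation: it vanishes exactly when \(g(\textbf{x}) \le 0\).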
2 A Smoothed Augmented Lagrangian Framework
In this section, we first provide some background and then analyze the smoothed problem, ending with a relation between a saddle-point of the \(\eta \)-smoothed problem and an \(\eta \)-approximate saddle-point of the original problem.
2.1 Background and Assumptions
Corresponding to problem (NSCopt), we may define the Lagrangian function \(\mathcal {L}_0\) as follows.
$$\begin{aligned} \mathcal {L}_0(\textbf{x},\lambda ) \, \triangleq \, f(\textbf{x}) + \lambda ^\top g(\textbf{x}). \end{aligned}$$
This allows for denoting the set of minimizers of \(\mathcal {L}_0({\bullet },\lambda )\) over the set \(\mathcal{X}\) by \(\mathcal {X}^*(\lambda )\), the dual function by \(\mathcal {D}_0(\lambda )\), and the dual solution set by \(\varLambda ^*\), each of which is defined next.
$$\begin{aligned} \mathcal {X}^*(\lambda ) \, \triangleq \, \mathop {\textrm{argmin}}\limits _{\textbf{x}\in \mathcal {X}} \ \mathcal {L}_0(\textbf{x},\lambda ), \quad \mathcal {D}_0(\lambda ) \, \triangleq \, \min _{\textbf{x}\in \mathcal {X}} \ \mathcal {L}_0(\textbf{x},\lambda ), \quad \varLambda ^* \, \triangleq \, \mathop {\textrm{argmax}}\limits _{\lambda \ge 0} \ \mathcal {D}_0(\lambda ). \end{aligned}$$
By adding a slack variable \(\textbf{v} \in \mathbb {R}^m\), we may recast (NSCopt) as follows.
$$\begin{aligned} \min _{\textbf{x}\in \mathcal {X}, \, \textbf{v} \ge 0} \ f(\textbf{x}) \quad \text{ subject to } \quad g(\textbf{x}) + \textbf{v} = 0, \end{aligned}$$
where \(\lambda \in \mathbb {R}^m\) denotes the Lagrange multiplier associated with the constraint \(g(\textbf{x}) + \textbf{v} = 0\). Then the augmented Lagrangian function, denoted by \(\mathcal {L}_{\rho }\), where \(\rho \) denotes the penalty parameter, is defined as follows (cf. [38]).
$$\begin{aligned} \mathcal {L}_{\rho }(\textbf{x},\lambda ) \, \triangleq \, f(\textbf{x}) + \min _{\textbf{v} \ge 0} \left[ \, \lambda ^\top \left( g(\textbf{x})+\textbf{v}\right) + \tfrac{\rho }{2}\Vert g(\textbf{x})+\textbf{v}\Vert ^2 \, \right] . \end{aligned}$$
It has been shown that \((\bar{\textbf{x}},\bar{\lambda })\) is a saddle-point of the augmented Lagrangian \(\mathcal{L}_{\rho }\) for any \(\rho \ge 0\) if and only if \((\bar{\textbf{x}},\bar{\lambda })\) is a saddle-point of \(\mathcal{L}_0\). Further, if \(\bar{\lambda }\) is an optimal dual solution, then \(\bar{\textbf{x}}\) is an optimal solution of (NSCopt) if and only if \(\bar{\textbf{x}}\) minimizes \(\mathcal{L}(\bullet ,\bar{\lambda })\) over \(\mathcal{X}\) [38, Th. 3.5].
If \(d_{-}(u) \, \triangleq \, {\displaystyle \inf _{v \in \mathbb {R}^n_-}} \Vert u-v\Vert \) and \(\varPi _{+}[u]\) denotes the Euclidean projection of u onto \(\mathbb {R}^m_+\), then the AL function \(\mathcal {L}_{\rho }\) and its gradient can be expressed as follows [38, Sec. 2].
Lemma 1
Consider the function \(\mathcal {L}_{\rho }\) for \(\rho > 0\), \(\textbf{x}\in {\mathcal {X}}\) and \(\lambda \ge 0\). Then
$$\begin{aligned} \mathcal {L}_{\rho }(\textbf{x},\lambda ) \, = \, f(\textbf{x}) + \tfrac{\rho }{2}\, d^2_{-}\!\left( g(\textbf{x}) + \tfrac{\lambda }{\rho }\right) - \tfrac{\Vert \lambda \Vert ^2}{2\rho } \quad \text{ and } \quad \nabla _{\textbf{x}} \mathcal {L}_{\rho }(\textbf{x},\lambda ) \, = \, \nabla f(\textbf{x}) + J_{g}(\textbf{x})^\top \varPi _{+}\left[ \lambda + \rho \, g(\textbf{x})\right] , \end{aligned}$$
where \(J_{g}(\textbf{x})\) denotes the Jacobian matrix of g. \(\Box \)
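Assuming the standard closed form of \(\mathcal {L}_{\rho }\) from [38] (expressed via \(d_{-}\) and \(\varPi _{+}\)), the following sketch numerically confirms that it coincides with the slack-variable form minimized over \(\textbf{v} \ge 0\); all function names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 2.5

def al_closed_form(fx, gx, lam, rho):
    """Closed form (cf. [38, Sec. 2]):
    L_rho = f(x) + (rho/2) * d_-(g(x) + lam/rho)^2 - ||lam||^2 / (2*rho),
    with d_-(u)^2 = ||max(u, 0)||^2."""
    return fx + 0.5 * rho * np.sum(np.maximum(gx + lam / rho, 0.0) ** 2) \
              - np.dot(lam, lam) / (2.0 * rho)

def al_slack_form(fx, gx, lam, rho):
    """Slack form: min over v >= 0 of
    f(x) + lam^T (g(x) + v) + (rho/2) ||g(x) + v||^2,
    minimized coordinatewise at v* = max(-(g(x) + lam/rho), 0)."""
    v = np.maximum(-(gx + lam / rho), 0.0)
    w = gx + v
    return fx + np.dot(lam, w) + 0.5 * rho * np.dot(w, w)

for _ in range(100):
    fx = rng.normal()
    gx = rng.normal(size=5)
    lam = np.abs(rng.normal(size=5))
    assert np.isclose(al_closed_form(fx, gx, lam, rho),
                      al_slack_form(fx, gx, lam, rho))
print("closed form matches slack form")
```

The coordinatewise minimizer \(v^* = \max (-(g(\textbf{x})+\lambda /\rho ), 0)\) follows from the first-order condition of the strongly convex inner problem, which is why the two expressions agree identically.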
Similarly, the augmented dual function \(\mathcal {D}_{\rho }\), defined as
$$\begin{aligned} \mathcal {D}_{\rho }(\lambda ) \, \triangleq \, \min _{\textbf{x}\in \mathcal {X}} \ \mathcal {L}_{\rho }(\textbf{x},\lambda ), \end{aligned}$$
can be shown to be differentiable [38, Th. 3.2].
Lemma 2
Consider the function \(\mathcal {D}_{\rho }\) defined as (3). Then \(\mathcal {D}_{\rho }\) is a C\(^1\) and concave function over \(\mathbb {R}^m\) and is the Moreau envelope of \(\mathcal {D}_0\), defined as
$$\begin{aligned} \mathcal {D}_{\rho }(\lambda ) \, = \, \max _{u} \left[ \, \mathcal {D}_0(u) - \tfrac{1}{2\rho }\Vert u-\lambda \Vert ^2 \, \right] , \quad \text{ with } \quad \nabla _{\lambda } \mathcal {D}_{\rho }(\lambda ) \, = \, \tfrac{1}{\rho }\left( q_{\rho }(\lambda ) - \lambda \right) , \end{aligned}$$
where \(q_{\rho }(\lambda ) \, \triangleq \, \arg {\displaystyle \max _{u}} \left[ \mathcal {D}_0(u) - \tfrac{1}{2\rho }\Vert u-\lambda \Vert ^2\right] .\) \(\Box \)
Since \(\mathcal{D}_{\rho }\) is the Moreau envelope of \(\mathcal{D}_0\), \(\mathcal{D}_{\rho }\) has the same set of maximizers as \(\mathcal{D}_0\) for any \(\rho \ge 0\) [38, Th. 3.2]. Our interest lies in nonsmooth, albeit smoothable, convex functions, defined next [7].
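The Moreau-envelope structure of \(\mathcal {D}_{\rho }\) can be checked on a toy concave dual function (our own construction): the envelope dominates \(\mathcal {D}_0\), shares its maximizer, and its gradient obeys \(\nabla \mathcal {D}_{\rho }(\lambda ) = (q_{\rho }(\lambda ) - \lambda )/\rho \):

```python
import numpy as np

rho = 0.5
grid = np.linspace(-4, 8, 120001)   # fine grid for a brute-force inner max

def D0(u):
    """A concave, nonsmooth model dual function with maximizer u* = 2."""
    return -np.abs(u - 2.0)

def D_rho(lam):
    """Moreau envelope of D0 on a grid:
    D_rho(lam) = max_u [ D0(u) - (1/(2*rho)) * (u - lam)^2 ].
    Returns the envelope value and the maximizer q_rho(lam)."""
    vals = D0(grid) - (grid - lam) ** 2 / (2 * rho)
    i = np.argmax(vals)
    return vals[i], grid[i]

for lam in [-1.0, 0.0, 2.0, 5.0]:
    val, q = D_rho(lam)
    assert val >= D0(lam) - 1e-9            # envelope dominates D0
    # Known closed form here: the envelope of -|.-2| is the negative Huber function.
    t = lam - 2.0
    expected = -(t**2 / (2 * rho) if abs(t) <= rho else abs(t) - rho / 2)
    assert abs(val - expected) < 1e-4
    # Gradient formula: grad D_rho(lam) = (q_rho(lam) - lam) / rho.
    h = 1e-4
    num_grad = (D_rho(lam + h)[0] - D_rho(lam - h)[0]) / (2 * h)
    assert abs(num_grad - (q - lam) / rho) < 1e-2
print("Moreau envelope checks passed")
```

Note that both \(\mathcal {D}_0\) and its envelope peak at \(u^* = 2\), illustrating why maximizing the smooth \(\mathcal {D}_{\rho }\) recovers the maximizers of the nonsmooth \(\mathcal {D}_0\).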
Definition 1
A closed, proper, and convex function \(h: \mathbb {R}^n \rightarrow \mathbb {R}\) is \((\alpha ,\beta )\)-smoothable if for any \(\eta > 0\), there exists a convex differentiable function \(h_{\eta }\) such that (i) \(h_{\eta }(\textbf{x}) \, \le \, h(\textbf{x}) \, \le \, h_{\eta }(\textbf{x}) + \eta \beta \) for all \(\textbf{x}\), and (ii) \(h_{\eta }\) is \((\alpha /\eta )\)-smooth.
\(\Box \)
In fact, one may be faced with compositional convex constraints in which the layers may be nonsmooth. In such instances, under suitable conditions, smoothability of the layers implies smoothability of the compositional function, but we postpone such avenues to future work. We leverage the smoothability assumptions in [7] to state our basic assumptions on the objective and constraint functions. In addition, we impose both compactness requirements on \(\mathcal {X}\) as well as a Slater regularity condition. Before stating the required assumptions, we need to define the \(\epsilon \)-KKT conditions of (NSCopt), which are inspired by the KKT conditions.
Definition 2
(\(\epsilon \)-optimal solution) Let \(f^*\) be the optimal value of (NSCopt). Given \(\epsilon \ge 0\), a point \(\tilde{\textbf{x}}\, \in \, \mathcal {X}\) is called an \(\epsilon \)-optimal and \(\epsilon \)-feasible solution to (NSCopt) if
$$\begin{aligned} \left| \, f(\tilde{\textbf{x}}) - f^* \, \right| \, \le \, \epsilon \quad \text{ and } \quad d_{-}\left( g(\tilde{\textbf{x}})\right) \, \le \, \epsilon . \end{aligned}$$
\(\Box \)
Then the partial KKT conditions corresponding to relaxing the constraint \(g(\textbf{x}) \, \le \, 0\) are defined as follows, where \(\mathcal {L}(\bullet ,\bullet )\) denotes the Lagrangian function and \(\mathcal{N}_\mathcal{X}(x)\) denotes the normal cone of \(\mathcal{X}\) at x.
Recall the optimization problem defined as
$$\begin{aligned} \min _{x \in \mathbb {R}^n} \ f(x) \quad \text{ subject to } \quad g_i(x) \, \le \, 0, \quad i = 1, \cdots , m, \qquad \qquad \text{(C-Opt)} \end{aligned}$$
where \(f, g_{i}\) are smooth functions mapping from \(\mathbb {R}^n\) to \(\mathbb {R}\) for \(i = 1, \cdots , m\). Under a suitable regularity condition, if \(x^*\) is a local minimizer of (C-Opt), then there exists \(\lambda \in \mathbb {R}^m_+\) such that
In fact, (8)–(9), together with \(\lambda \ge 0\), can be compactly stated as
By leveraging the “perp” notation, we have that \(\lambda \perp g(x)\) or \(\lambda _i g_i(x) = 0\) for all i. Therefore, we may compactly represent the KKT conditions as
Note that such a notation is common in complementarity theory (see Cottle, Pang, and Stone [11] or Facchinei and Pang [14]). This allows us to define a (partial) \(\epsilon \)-KKT point.
Definition 3
(Partial \(\epsilon \)-KKT condition) Consider the problem (NSCopt). Then (\(\textbf{x}_{\epsilon },\lambda _{\epsilon }\)) is a partial \(\epsilon \)-KKT point if \(\textbf{x}_{\epsilon } \in \mathcal{X}\),
where \((\textbf{x}^*,\lambda ^*)\) denotes a KKT point of (NSCopt) satisfying (5)–(6). \(\square \)
This allows us to build a simple relation whereby an \(\epsilon \)-KKT point satisfies \(\epsilon \)-optimality and \(\epsilon \)-feasibility.
Lemma 3
Consider a tuple \((\textbf{x}_{\epsilon },\lambda _{\epsilon })\) satisfying the \(\epsilon \)-KKT conditions given by (12)–(13). Then \((\textbf{x}_{\epsilon },\lambda _{\epsilon })\) satisfies \({2\epsilon }\)-suboptimality and \({m \epsilon }\)-infeasibility, collectively captured by (4).
Proof
We observe that \(\epsilon \)-primal suboptimality in (4) holds by the following sequence of relations.
To show \(\epsilon \)-feasibility of \(\textbf{x}_{\epsilon }\) as prescribed in (4), we observe that
which completes the proof. \(\square \)
We now present our ground assumption on the problem of interest; it is assumed to hold throughout the paper, unless explicitly mentioned otherwise.

Condition (d) allows for bounding the set of optimal dual variables (cf. [23]). We now consider the smoothed counterpart of (NSCopt), defined as
$$\begin{aligned} \min _{\textbf{x}\in \mathcal {X}} \ f_{\eta }(\textbf{x}) \quad \text{ subject to } \quad g_{\eta }(\textbf{x}) \, \le \, 0. \qquad \qquad (\text{NSCopt}_{\eta }) \end{aligned}$$
We note that the solution and multiplier set of (NSCopt\(_{\eta })\) are denoted by \(X^*_{\eta }\) and \(\varLambda ^*_{\eta }\), respectively. Naturally, associated with this problem is the Lagrangian function \(\mathcal{L}_{\eta ,0}\) of the smoothed problem (referred to as the smoothed Lagrangian) as well as the corresponding dual function \(\mathcal{D}_{\eta ,0}\); these objects and their augmented counterparts are defined and analyzed in the next subsection.
2.2 Analysis of Smoothed Lagrangians
We now analyze the smoothed Lagrangian framework where f and g are approximated by smoothings \(f_{\eta }\) and \(g_{\eta }\), where the latter is a vector function with components \(g_{1,\eta }, \cdots , g_{m,\eta }\). The resulting smoothed Lagrangian function \(\mathcal {L}_{\eta ,0}\) and the smoothed dual function \(\mathcal {D}_{\eta ,0}(\lambda )\) are defined as
$$\begin{aligned} \mathcal {L}_{\eta ,0}(\textbf{x},\lambda ) \, \triangleq \, f_{\eta }(\textbf{x}) + \lambda ^\top g_{\eta }(\textbf{x}) \quad \text{ and } \quad \mathcal {D}_{\eta ,0}(\lambda ) \, \triangleq \, \min _{\textbf{x}\in \mathcal {X}} \ \mathcal {L}_{\eta ,0}(\textbf{x},\lambda ). \end{aligned}$$
Then the smoothed augmented Lagrangian function \(\mathcal {L}_{\eta ,\rho }\) is defined as
$$\begin{aligned} \mathcal {L}_{\eta ,\rho }(\textbf{x},\lambda ) \, \triangleq \, f_{\eta }(\textbf{x}) + \tfrac{\rho }{2}\, d^2_{-}\!\left( g_{\eta }(\textbf{x}) + \tfrac{\lambda }{\rho }\right) - \tfrac{\Vert \lambda \Vert ^2}{2\rho }. \end{aligned}$$
We may now define \(\mathcal {D}_{\eta ,\rho }\) and \(q_{\eta ,\rho }\) as \(\mathcal {D}_{\eta ,\rho }(\lambda ) \, = \, \max _u [\, \mathcal {D}_{\eta ,0}(u) - \tfrac{1}{2\rho }\Vert u-\lambda \Vert ^2\,]\) and \(\nabla _{\lambda } \mathcal {D}_{\eta ,\rho }(\lambda ) \, = \, \tfrac{1}{\rho }\left( q_{\eta ,\rho }(\lambda ) - \lambda \right) \), where \(q_{\eta , \rho }(\lambda ) \triangleq \textrm{arg}\hspace{-0.02in}\max _{u} [\, \mathcal {D}_{\eta ,0}(u) - \tfrac{1}{2\rho }\Vert u-\lambda \Vert ^2\, ].\) We now relate \(\mathcal {D}_{\rho }\) to \(\mathcal {D}_{\eta ,\rho }\) and \(q_{\rho }\) to \(q_{\eta ,\rho }\) in the next lemma.
Lemma 4
For any \(\lambda \in \mathbb {R}_{+}^m\), the following hold:
(i) \(\left| \mathcal {L}_{0}(\textbf{x},\lambda ) - \mathcal {L}_{\eta ,0}(\textbf{x},\lambda ) \right| \, \le \, \eta (\Vert \lambda \Vert m+1) {\beta };\)
(ii) \(| \mathcal {D}_{\eta ,0}(\lambda ) - \mathcal {D}_{0}(\lambda )| \le \eta (\Vert \lambda \Vert m+1) {\beta } ;\)
(iii) \(| \mathcal {D}_{\eta ,\rho }(\lambda ) - \mathcal {D}_{\rho }(\lambda )| \le \eta (\Vert \lambda \Vert m+1) {\beta }.\) \(\Box \)
Under a Slater regularity condition, the set of optimal multipliers is bounded (cf. [23]). Similar bounds are derived for the \(\eta \)-smoothed problem.

Proof
(a) By Assumption 1(d), there exists a vector \(\bar{\textbf{x}} \in \mathcal{X}\) such that \(g(\bar{\textbf{x}}) < 0\), implying that \(g_{\eta }(\bar{\textbf{x}}) < 0\) by the property of smoothability (Def. 1).
(b) By the Slater regularity condition, we directly conclude from [23] that
(c) Similarly, \(\varLambda _{\eta }^*\), the dual optimal solution set, is bounded as follows.
Recall that \(-g_{j,\eta }({\bar{\textbf{x}}}) \ge -g_j({\bar{\textbf{x}}})\) for \(j = 1, \cdots , m\). Furthermore, \({\displaystyle \min _j}\{ -g_{j,\eta }({\bar{\textbf{x}}})\} \ge {\displaystyle \min _j} \{-g_{j}({\bar{\textbf{x}}})\}\). It follows from (b) that
Consequently, if \(\mathcal{D}_{\eta ,0}^* \triangleq \mathcal{D}_{\eta ,0}(\lambda _{\eta }^*)\) and \(\mathcal{D}_0^* \triangleq \mathcal{D}_{0}(\lambda ^*)\), then
\(\square \)
Both Lemma 4 and Proposition 1 play crucial roles in the convergence analysis presented in Section 3. We now relate a saddle-point \((\textbf{x}^*_{\eta },\lambda _{\eta }^*)\) of (NSCopt\(_{\eta }\)) to an \(\eta \)-saddle-point of (NSCopt), where an \(\eta \)-saddle-point satisfies the saddle-point requirements with an \(\mathcal {O}(\eta )\) error; the bounds on the multipliers for (NSCopt) and (NSCopt\(_{\eta }\)) are denoted by \({b}_{\lambda }\) and \({b}_{\lambda ,\eta }\), respectively.

Proof
(a) Suppose \(\textbf{x}_{\eta }^* \, \in \, \mathcal{X}\) is a feasible solution of (NSCopt\(_{\eta }\)). Then \(g_{\eta }(\textbf{x}^*_{\eta }) \le 0\). Furthermore, \(g(\textbf{x}_{\eta }^*) \le g_{\eta }(\textbf{x}_{\eta }^*) + \eta \beta \textbf{1} \le \eta \beta \textbf{1}\), implying that \(d_{-}(g(\textbf{x}_{\eta }^*)) \le \eta \beta \Vert \textbf{1}\Vert .\)
(b) The dual optimal set \(\varLambda _{\eta }^*\) is nonempty and bounded as per Proposition 1. Let \((\textbf{x}_{\eta }^*,\lambda _{\eta }^*)\) be a saddle point of \(\mathcal{L}_{\eta ,0}(\cdot ,\cdot )\). We now proceed to show that \((\textbf{x}_{\eta }^*,\lambda _{\eta }^*)\) is an approximate saddle-point of \(\mathcal{L}_0\).
The final result follows from the sequence of inequalities provided next.
\(\square \)
Lemma 5, presented next, provides a relation between \(q_{\eta ,\rho }(\bullet )\) and \(q_{\rho }(\bullet )\).
Lemma 5
For any \(\lambda \in \mathbb {R}_{+}^m\), the following hold:
(i) \(\Vert q_{\eta ,\rho }(\lambda ) - q_{\rho }(\lambda )\Vert \le \sqrt{{4}\rho \eta (\Vert \lambda \Vert m+C_m) {\beta }};\)
(ii) \(\Vert \nabla _{\lambda } \mathcal {D}_{\eta ,\rho }(\lambda )-\nabla _{\lambda } \mathcal {D}_{\rho }(\lambda )\Vert = \tfrac{1}{\rho }\Vert q_{\eta ,\rho }(\lambda ) - q_{\rho }(\lambda )\Vert \le \sqrt{\tfrac{{4} \eta (\Vert \lambda \Vert m+C_m) {\beta } }{\rho }}.\) \(\Box \)
We now formally state the smoothed AL scheme. The traditional ALM is reliant on solving the subproblem exactly or \(\epsilon _k\)-inexactly at epoch k. However, in regimes with nonsmooth constraints, the AL subproblem is nonsmooth, precluding the usage of accelerated gradient methods and leading to far poorer performance. Our proposed scheme solves a sequence of \(\eta _k\)-smoothed subproblems, each to within an error tolerance of \(\epsilon _k \eta _k^b\), where \(b\ge 0\). A formal statement of the scheme is provided next.

Observe that step [1] requires that \(\textbf{x}_{k+1}\) is an \(\epsilon _k \eta _k^b\)-minimizer of the AL subproblem, given by
where \(\mathcal{D}_{\eta _k,\rho _k}(\lambda _k) = \min _{\textbf{x}\in \mathcal{X}} \mathcal{L}_{\eta _k,\rho _k}(\textbf{x},\lambda _k).\) Since we have rate guarantees for the accelerated scheme applied to the subproblem, we can determine the minimum number of gradient steps that ensures that \(\epsilon _k \eta _k^b\)-suboptimality holds. The Lagrange multiplier update can be expressed as follows (cf. [2]).
Lemma 6
Consider the smoothed augmented Lagrangian scheme (Sm-AL). Then for any \(k > 0\), step [2] is equivalent to the following update:
$$\begin{aligned} \lambda _{k+1} \, = \, \varPi _{+}\left[ \lambda _k + \rho _k \, g_{\eta _k}(\textbf{x}_{k+1})\right] . \end{aligned}$$
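The two steps of (Sm-AL), inexact minimization of the \(\eta _k\)-smoothed AL subproblem followed by the projected multiplier update, can be sketched on a one-dimensional toy instance of our own construction; a bounded scalar solver stands in for the accelerated inner scheme, and the geometric parameter updates are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy instance (ours, not from the paper): min (x-2)^2  s.t.  |x| - 1 <= 0,
# over X = [-5, 5]. Here f is already smooth; only the nonsmooth constraint
# g(x) = |x| - 1 is smoothed, replacing |x| by its Huber smoothing.

def huber(t, eta):
    return t * t / (2 * eta) if abs(t) <= eta else abs(t) - eta / 2

def smoothed_al(x, lam, rho, eta):
    """Smoothed AL for this instance, using the d_- based closed form."""
    g = huber(x, eta) - 1.0                       # eta-smoothed constraint
    return (x - 2.0) ** 2 + 0.5 * rho * max(g + lam / rho, 0.0) ** 2 \
           - lam ** 2 / (2 * rho)

x, lam, rho, eta = 0.0, 0.0, 1.0, 0.5
for k in range(30):
    # Step [1]: inexactly minimize the eta_k-smooth AL subproblem over X
    # (a generic bounded scalar solver stands in for the accelerated scheme).
    x = minimize_scalar(lambda z: smoothed_al(z, lam, rho, eta),
                        bounds=(-5.0, 5.0), method="bounded").x
    # Step [2]: multiplier update lam <- max(lam + rho_k * g_eta(x), 0).
    lam = max(lam + rho * (huber(x, eta) - 1.0), 0.0)
    rho *= 1.3    # geometrically increasing penalty (illustrative rate)
    eta *= 0.7    # diminishing smoothing parameter (illustrative rate)
print(round(x, 3), round(lam, 3))   # tends to the KKT pair (x*, lam*) = (1, 2)
```

On this instance the iterates approach the primal-dual solution \((x^*, \lambda ^*) = (1, 2)\): as \(\rho _k\) grows the constraint is enforced, and as \(\eta _k\) shrinks the smoothed constraint approaches \(|x| - 1\).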
The next assumption holds for parameter sequences employed in (Sm-AL). Unless mentioned otherwise, Assumptions 1 and 2 hold throughout.

While our rate guarantees for the schemes responsible for resolving the subproblem as well as the outer (dual) problem allow for defining precise lower bounds on the number of steps required, this computational requirement is reliant on a worst-case analysis. In addition, we may attempt to check whether the sub-optimality requirement is met at some intermediate step. However, it is not obvious how to check sub-optimality in the current setting since the optimal value of either the subproblem or the outer-level problem is unavailable. Instead, we appeal to a residual function and consider such an approach next. We emphasize that such a potential early termination of either the subproblem solver or the outer scheme may have computational benefits.
2.3 Termination Criteria
Our inexact augmented Lagrangian framework relies on utilizing inexact solutions to the Lagrangian subproblem, obtained by taking finite but increasing number of gradient-based steps and then leveraging the rate guarantees for accelerated gradient methods. However, we may well meet the required accuracy prior to taking the prescribed number of gradient steps by checking a suitable condition. Such a condition is by no means immediate since a naive assessment of accuracy requires knowing the optimal value to the subproblem; instead, we present a new analysis by leveraging a residual function and present such an analysis next for both the inner and outer loops.
(I). Termination criterion for Inner loop. The inner loop at iteration k terminates when \(x_{k+1}\) satisfies the following \(\epsilon _k \eta _k^b\)-optimality requirement, where \(\epsilon _k\) is a positive accuracy threshold at iteration k, \(\eta _k\) is the smoothing parameter at iteration k, and b is a nonnegative scalar that is defined subsequently in the complexity analysis.
In effect, we may view the minimization of the augmented Lagrangian function as an instance of the following convex problem, defined as
$$\begin{aligned} \min _{\textbf{x}\in \mathcal {X}} \ h(\textbf{x}), \qquad \qquad \text{(Opt)} \end{aligned}$$
where h is a convex and smooth function on \(\mathcal{X}\), a closed and convex set. We proceed to show that (15) is equivalent to \(x_{k+1}\) approximately satisfying the variational inequality problem.
In fact, we now develop a verifiable condition whose satisfaction implies (16).
Lemma 7
Consider the problem (Opt). Suppose \(\Vert \textbf{y}\Vert ^2 \le C\) and \(\Vert \nabla h(\textbf{y})\Vert ^2 \le D\) for any \(\textbf{y}\in \mathcal{X}\) and \(\gamma \) is any positive scalar. Consider the following statements.
(a) \(\textbf{x}^*_{\epsilon }\) is an \(\epsilon \)-optimal solution of (Opt).
(b) \(\nabla h(\textbf{x}^*_{\epsilon })^\top (\textbf{y}-\textbf{x}^*_{\epsilon }) \, \ge \,- \epsilon , \quad \forall \, \textbf{y}\, \in \, \mathcal {X}.\)
(c) There exist \(\textbf{u}\in \mathcal{X}\) and \(\textbf{x}^*_{\epsilon } \in \mathcal{X}\) such that \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X} (\textbf{x}^*_{\epsilon },\textbf{u}) = 0\), where \(F^{{{{\textrm{nat}}}},\tilde{\epsilon }}_\mathcal{X}(\bullet ,\bullet )\) represents the perturbed natural map with a chosen parameter \(\gamma \), defined as
$$\begin{aligned} F^{{{{\textrm{nat}}}},\tilde{\epsilon }}_\mathcal{X}(\textbf{x},\textbf{u})\, \triangleq \, \tilde{\epsilon } \left( \, \textbf{u}- \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] \,\right) - \textbf{x}+ \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] . \end{aligned}$$
Then the following hold.
(i) \((a) \, \iff \, (b)\);
(ii) \((c) \implies (b)\), where \(\tilde{\epsilon } \, = \frac{\gamma \epsilon }{7C + \gamma (C+D)}\) and \(\epsilon <\frac{7C + \gamma (C+D)}{\gamma }\). \(\Box \)
Observe that the perturbed natural map is rooted in the natural map, a residual function for variational inequality problems [14]. When specialized to the setting of the smooth convex optimization problem \(\min _{\textbf{x}\in \mathcal{X}} h(\textbf{x})\), the natural map is given by
$$\begin{aligned} F^{{\textrm{nat}}}_\mathcal{X}(\textbf{x}) \, \triangleq \, - \textbf{x}+ \varPi _\mathcal{X}\left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] . \end{aligned}$$
The lemma above develops a suitably defined \(\tilde{\epsilon }\)-perturbed counterpart of \(F^{{\textrm{nat}}}_\mathcal{X}\), denoted by \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}\). We observe that \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}(\textbf{x},\textbf{x})\) reduces to
$$\begin{aligned} F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}(\textbf{x},\textbf{x}) \, = \, (1-\tilde{\epsilon })\left( - \textbf{x}+ \varPi _\mathcal{X}\left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] \right) \, = \, (1-\tilde{\epsilon })\, F^{{\textrm{nat}}}_\mathcal{X}(\textbf{x}). \end{aligned}$$
In other words, for any \(\tilde{\epsilon } < 1\), \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}(\textbf{x},\textbf{x}) = 0\) if and only if \(F^{{\textrm{nat}}}_\mathcal{X}(\textbf{x}) = 0\). Based on the aforementioned result, in the kth iteration, this termination criterion reduces to
where \(\tilde{\epsilon }_k = \tfrac{\gamma \epsilon _k\eta _k^b}{{7C+ \gamma (C+D)}}\) and \(\textbf{u}\in \mathcal{X}\).
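The natural-map residual yields a directly verifiable stopping rule: one monitors \(\Vert F^{{\textrm{nat}}}_\mathcal{X}(\textbf{x})\Vert \) along projected-gradient iterations and stops once it falls below a tolerance. A minimal sketch on a toy box-constrained quadratic (our own construction):

```python
import numpy as np

def nat_map(x, grad, proj, gamma):
    """Natural-map residual F^nat_X(x) = -x + Pi_X[x - gamma * grad(x)];
    it vanishes exactly at the solutions of min_{x in X} h(x)."""
    return -x + proj(x - gamma * grad(x))

# Toy instance: h(x) = 0.5 * ||x - c||^2 over the box X = [0, 1]^n,
# whose solution is the projection of c onto the box.
c = np.array([1.7, -0.3, 0.4])
grad = lambda x: x - c
proj = lambda y: np.clip(y, 0.0, 1.0)
gamma, tol = 0.5, 1e-8

x = np.zeros(3)
for it in range(1000):
    r = nat_map(x, grad, proj, gamma)
    if np.linalg.norm(r) <= tol:   # verifiable termination criterion
        break
    x = x + r                      # equals the projected-gradient step Pi_X[x - gamma*grad(x)]
x_star = proj(c)                   # known solution of the toy instance
assert np.linalg.norm(x - x_star) <= 1e-6
print("stopped at iteration", it, "with residual", np.linalg.norm(r))
```

Unlike a suboptimality check, evaluating this residual requires only a gradient and a projection, which is what makes it practical as an early-termination test for the inner loop.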
(II) Termination criterion for outer loop. Here we consider two settings.
(a) Constant penalty parameter. In setting (a), the outer scheme terminates when
where \(C_1 \triangleq B_5, C_2 \triangleq B_4\), and \(B_3, B_4, B_5, B_6\) are defined in Table 3. Since we have access to \(g(\bullet )\), it is easy to check \(d_{-}\left( g(\textbf{x}_K)\right) \le \sqrt{\epsilon }\). However, evaluating \(f(\bar{\textbf{x}}_K) - f^*\) is not directly possible, since \(f^*\) is unavailable. Since f is nonsmooth, we apply Lemma 7 to the optimality gap of the smoothed problem \(|f_{\eta _K}(\bar{\textbf{x}}_K)-f_{\eta _K}^*|\) since it is related to the true optimality gap, i.e. by leveraging the property of smoothability of f,
$$\begin{aligned} \left| \, f(\bar{\textbf{x}}_K) - f(\textbf{x}^*) \,\right| \, \le \, \left| \, f_{\eta _K}(\bar{\textbf{x}}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| + \eta _K B. \end{aligned}$$
Consequently, it suffices to get a bound on each term on the right. To get a bound on \(\left| \, f_{\eta _K}(\bar{\textbf{x}}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| \) given \(\hat{\textbf{x}} \in \mathcal{X}\), we leverage the following residual function:
where \(\tilde{\epsilon }_K \triangleq \tfrac{\gamma C_1}{({7C + \gamma (C+D)})\sqrt{K}}\), and C, D are as defined in Lemma 7. (We can set the values for \(\eta _K\) such that the overall optimality gap (\(|f-f^*|\)) remains controlled below a tighter error tolerance \(\epsilon ^2\) to ensure the consistency with our complexity analysis.) Therefore, we may employ the following termination criterion (T2) at the Kth iterate.
(b) Increasing penalty parameter. In setting (b), the outer scheme terminates when
where \(C_1\triangleq {B_7}\) and \(C_2 \triangleq {B_8}\) as defined in Table 3. While it is easy to check \(d_{-}\left( g(\textbf{x}_K)\right) \le \epsilon \), since \(f^*\) is unavailable and f is nonsmooth, we apply Lemma 7 to the optimality gap of the smoothed problem \(|f_{\eta _K}(\textbf{x}_K)-f_{\eta _K}^*|\) since it is related to the true optimality gap, i.e. by leveraging the property of smoothability of f, similar to the previous analysis, \(\left| \, f(\textbf{x}_K) - f(\textbf{x}^*) \,\right| \, \le \left| \, f_{\eta _K}(\textbf{x}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| + \eta _K B.\) Consequently, it suffices to get a bound on both terms on the right. To get a bound on \(\left| \, f_{\eta _K}(\textbf{x}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| \) given \(\hat{\textbf{x}} \in \mathcal{X}\), we leverage the following residual function:
where \(\tilde{\epsilon }_K \triangleq \tfrac{\gamma C_1}{({7C + \gamma (C+D)})\rho _{K}}\), and C, D are as defined in Lemma 7. Akin to earlier, we may set the value of \(\eta _K\) such that the overall optimality gap (\(|f-f^*|\)) remains controlled below \(\epsilon \). Therefore, we may employ the following termination criterion (T2) at the Kth iterate.
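In either setting, the criterion thresholds a computable residual rather than the unavailable \(f^*\). As a minimal numerical sketch of such a check, consider a projected-gradient (natural-map) residual for an assumed smooth surrogate h over a box; the set, the function h, and the choice \(\gamma = 0.5\) are illustrative assumptions, not the paper's instantiation:

```python
import numpy as np

def nat_residual(x, grad_h, project, gamma):
    """Natural-map residual ||x - Pi_X(x - gamma * grad_h(x))|| (illustrative form)."""
    return np.linalg.norm(x - project(x - gamma * grad_h(x)))

# Assumed smooth surrogate h(x) = 0.5 * ||x - c||^2 over the box X = [0, 1]^3.
c = np.array([0.3, 0.7, -0.2])
grad_h = lambda x: x - c
project = lambda x: np.clip(x, 0.0, 1.0)

x_star = project(c)                     # minimizer of h over the box
x_far = np.array([1.0, 0.0, 1.0])       # a non-optimal point

res_star = nat_residual(x_star, grad_h, project, gamma=0.5)
res_far = nat_residual(x_far, grad_h, project, gamma=0.5)
print(res_star, res_far)  # ~0 at the minimizer, bounded away from 0 otherwise
```

The residual vanishes exactly at constrained minimizers, which is what makes it usable as a surrogate stopping test when \(f^*\) cannot be evaluated.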
The modified algorithm statement should read as follows.

Note that the subproblem solver is essentially the accelerated gradient scheme introduced in Section 4; the minimum number of steps prescribed by the rate guarantees is denoted by \(M_k\) and derived in Section 4.
3 Rate Analysis
In this section, we analyze the rate of convergence of (Sm-AL). In Subsection 3.1, we provide some preliminaries; we then derive rate statements for constant and increasing penalties in Subsections 3.2 and 3.3, respectively.
3.1 Preliminary results
We begin by recalling the following bound, an extension of the result proved in [38, Lemma 4.3].
Lemma 8
Let \(\{\textbf{x}_{k},\lambda _{k}\}\) be generated by (Sm-AL). For any \(k \ge 0\), suppose \(\textbf{x}_{k+1}\) satisfies \(\mathcal {L}_{\eta _k,\rho _k}(\textbf{x}_{k+1},\lambda _k) - {\mathcal {D}_{\rho _k,\eta _k}}(\lambda _k) \le \epsilon _k\eta _k^b\) where \(b\ge 0\). Then for \(k \ge 0,\)
By choosing appropriate sequences \(\{\epsilon _k,\eta _k,\rho _k\}\), \(\{(2\epsilon _k\eta _k^b)/\rho _k\}\) is diminishing (see Lemma 8). We now derive a uniform bound on the sequence \(\{\lambda _k\}\).
Lemma 9
(Bound on \(\lambda _k\)) Consider \(\{\lambda _k\}\) generated by (Sm-AL).
(a) \(\{ \lambda _k\}\) is a convergent sequence. (b) For any K, we have
3.2 Rate analysis under constant \(\rho _k\)
Next, we derive rate statements for the dual sub-optimality and primal infeasibility when \(\rho _k = \rho \) for all k. Our first result relies on the observation that the augmented dual function \(\mathcal{D}_{\rho }\) has the same set of optimal solutions (and supremum) as the original dual function \(\mathcal{D}_0\) (see [38, Th. 3.2]).

Proof
Recall that \(\mathcal {D}_{\eta _k,\rho }\) is the Moreau envelope of \(\mathcal{D}_{\eta _k,0}\). Consequently, \(\nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }\) is \(\tfrac{1}{\rho }\)-Lipschitz. We then have
where \(-\mathcal{D}_{\eta _k,\rho }(\lambda ^*) \ge -\mathcal{D}_{\eta _k,\rho }(\lambda _k) - \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)^\top (\lambda ^*-\lambda _k).\) By adding and subtracting \(\nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k)^\top (\lambda _{k+1}-\lambda ^*) \), it follows that
where the last inequality follows from Lemma 4(iii). By invoking Lemma 9, and \(\Vert \lambda _k\Vert +\Vert \lambda ^*\Vert \le {\Vert \lambda _k-\lambda ^*\Vert +\Vert \lambda ^*\Vert } +\Vert \lambda ^*\Vert \le B_{\lambda } + 2b_{\lambda } {\, \triangleq \, \tilde{B}_{\lambda }}\), we obtain
By summing from \(k = 0,\cdots ,K-1\) and dividing by K, we obtain
where boundedness of \(\lambda _k\) follows from Lemma 9 and \(\tilde{B}_{\lambda }, {B_\lambda , B_2}\) are constants. Consequently, by invoking the concavity of \(\mathcal{D}_{\rho }\), we may bound the term on the left to obtain the required inequality, where \(\bar{\lambda }_K = \tfrac{1}{K}\sum _{i=1}^{K} \lambda _i\).
The final result follows by noting that \(\mathcal{D}_{\rho }\) is the Moreau envelope of \(\mathcal{D}_0\) and strong duality holds, implying that \(\mathcal{D}_{\rho }(\lambda ^*) = \mathcal{D}_{0}(\lambda ^*) = f(\textbf{x}^*)\). \(\square \)
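The argument above rests on \(\mathcal{D}_{\eta_k,\rho}\) being a Moreau envelope, whose gradient is \((1/\rho)\)-Lipschitz. A one-dimensional sanity check of this envelope property, using the illustrative choice \(f(x) = |x|\), whose envelope with parameter \(\mu\) is the Huber function with \((1/\mu)\)-Lipschitz gradient:

```python
import numpy as np

def moreau_grad_abs(x, mu):
    """Gradient of the Moreau envelope of f(y) = |y| with parameter mu:
    grad e_mu(x) = (x - prox_{mu f}(x)) / mu, with soft-thresholding prox."""
    prox = np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)
    return (x - prox) / mu

mu = 0.25
xs = np.linspace(-2.0, 2.0, 2001)
g = moreau_grad_abs(xs, mu)
# Empirical Lipschitz constant of the envelope gradient: at most 1/mu.
L_emp = np.max(np.abs(np.diff(g)) / np.diff(xs))
print(L_emp, 1.0 / mu)
```

The empirical slope saturates at \(1/\mu\), mirroring the \((1/\rho)\)-smoothness of the augmented dual used in the proof.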
Next, we derive a rate statement on the infeasibility.

Proof
We have that \(g_{\eta _k}(\textbf{x}_{k+1})\) can be expressed as
Recall that \(d_{-}(u+v) \le d_-(u) + \Vert v\Vert \) for any \(u,v \in \mathbb {R}^m\). Consequently,
By definition of \(d_{-}(\bullet )\), convexity of \(\max \{g_j(\bullet ),0\}\), and \(\Vert u\Vert _2 \le \Vert u\Vert _1 \le \sqrt{m}\Vert u\Vert _2\),
Recall that
allowing us to claim that \(\mathcal {D}_{\eta _k,\rho }\) is a \(({2}/\rho )\)-smooth concave function. Then by leveraging [32] for any \(\lambda \ge 0\),
where \({\lambda _{\eta }^*}\) is a maximizer of \(\mathcal {D}_{\eta ,\rho }\). By leveraging the prior dual sub-optimality bounds and the subadditivity of the (concave) square-root function, i.e. \(\sqrt{u+v} \le \sqrt{u}+\sqrt{v}\) for \(u, v \ge 0\), we have from (26),
Recalling (24), it follows that
which implies that
where \(C \triangleq \tfrac{\Vert \lambda _0-\lambda ^*\Vert ^2}{2\rho }+\left( B_{\lambda }\sum _{k=0}^{K-1}\tfrac{2\epsilon _k\eta _k^b}{\sqrt{\rho }}+B_2\sum _{k=0}^{K-1}\eta _k\right) \). \(\square \)
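The proof repeatedly uses the inequality \(d_{-}(u+v) \le d_{-}(u) + \Vert v\Vert\), which is the 1-Lipschitz property of the distance to the nonpositive orthant. A quick numerical verification (dimension and samples are arbitrary):

```python
import numpy as np

def d_minus(u):
    """Distance from u to the nonpositive orthant: d_-(u) = ||max(u, 0)||_2."""
    return np.linalg.norm(np.maximum(u, 0.0))

rng = np.random.default_rng(0)
ok = all(
    d_minus(u + v) <= d_minus(u) + np.linalg.norm(v) + 1e-12
    for u, v in (rng.normal(size=(2, 5)) for _ in range(1000))
)
print(ok)  # True: distance functions to convex sets are 1-Lipschitz
```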
We now derive a rate statement for the primal sub-optimality.

Proof
Recall that since \(\textbf{x}_k\) may not be feasible with respect to the constraints, we derive upper and lower bounds on the sub-optimality.
(i) Lower bound. A rate statement for the lower bound is first constructed. Since \(\max _{\lambda } \mathcal {D}_{\rho }(\lambda ) = {\displaystyle \min _{\textbf{x}\in \mathcal {X}}} \ \mathcal {L}_{\rho }(\textbf{x},\lambda ^*) = f^*\), the following sequence of inequalities hold where \(\bar{\textbf{x}}_K = \tfrac{1}{K}\sum _{k = 0}^{K-1}\textbf{x}_k\), \(f_{\eta _K}^* = {\displaystyle \min _{\textbf{x}\in \mathcal X}} \ \mathcal {L}_{\eta _K,{\rho }}\left( \textbf{x}, {\lambda _{\eta _K}^*}\right) \), and \((\textbf{x}_{\eta _K}^*,\lambda _{\eta _K}^*)\) is the saddle point of \(\mathcal {L}_{\eta _K, 0}(\textbf{x},\lambda )\).
By invoking Proposition 3, we obtain the following inequality.
Let \(\textbf{x}^*\in \mathcal {X}^*\) and let \(\textbf{x}_{{\eta _K}}^*\) be a minimizer of \(\mathcal {L}_{\eta _K,\rho }(\cdot , {\lambda _{\eta _K}^*})\). By Lemma 4, we have that
implying that \(f(\textbf{x}^*) \le f(\textbf{x}^*_{\eta _K}) + mb_{\lambda } \beta \eta _K.\) By definition of the smoothing, \(f(\textbf{x}_{{\eta _K}}^*)-f_{\eta _K}(\textbf{x}_{{\eta _K}}^*) \le \beta \eta _K \) and \(f_{\eta _K}(\bar{\textbf{x}}_K)-f(\bar{\textbf{x}}_K) \le 0\).
(ii) Upper bound. Let \(\textbf{x}_{\eta _k,\lambda _k}^* {\in } \arg {\displaystyle \min _{\textbf{x}\in \mathcal X}} \ \mathcal {L}_{\eta _k,{\rho }}\left( \textbf{x}, {\lambda _k}\right) \) and \((\textbf{x}_{\eta _k}^*,\lambda _{\eta _k}^*)\) be the saddle point of \(\mathcal {L}_{\eta _k, 0}(\textbf{x},\lambda )\). Based on the definition of \(\textbf{x}_{\eta _k,\lambda _k}^*\) and \(\textbf{x}_{\eta _k}^*\), the following two inequalities hold.
By adding the two inequalities, we obtain
Consequently, by leveraging (30) and invoking the definition of \(\mathcal{L}_{\eta _k,\rho }(\cdot ,\lambda _k)\), we have that
We observe that
By choosing \(u = g_{\eta _k}(\textbf{x}_{k+1}) + \tfrac{\lambda _k}{\rho }\), it follows from Lemma 6 that
Furthermore, we have that \(g_{\eta _k}(\textbf{x}_{\eta _k}^*) \le 0\) since \(\textbf{x}_{\eta _k}^*\) is feasible with respect to the \(\eta _k\)-smoothed constraints, implying
which implies
We observe that \(g_{\eta _k}(\textbf{x}^*) \le g(\textbf{x}^*) \le 0\), implying that \(\textbf{x}^*\) is feasible for the \(\eta _k\)-smoothed problem and consequently,
Summing from \(k = 0\) to \(K-1\) and leveraging the convexity of \(f_{\eta _k}\), we obtain that

where \({B_6}> 0\) is a constant. \(\square \)
3.3 Rate analysis under increasing \(\rho _k\)
We now consider the setting where \(\{\rho _k\}\) is an increasing sequence.
Lemma 10
(Rate on primal infeasibility) Suppose \(\{{(\textbf{x}_k,\lambda _k)}\}\) is generated by (Sm-AL). Then for any \(k \ge 0\), \(d_{-}\left( g(\textbf{x}_{k+1}) \right) \le \left\| \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k}\right\| + m\eta _k {\beta }.\)
Proof
By the update rule, we have that
It follows that \( g_{\eta _k}(\textbf{x}_{k+1}) = \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k} + \varPi _-\left( \tfrac{\lambda _k}{\rho _k}+g_{\eta _k}(\textbf{x}_{k+1}) \right) \), implying
Akin to the proof in Proposition 3, we have
\(\square \)
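The identity in the proof follows from the Moreau decomposition \(u = \varPi_+(u) + \varPi_-(u)\). A numerical check, with the multiplier update written as \(\lambda_{k+1} = \rho_k\varPi_+(\lambda_k/\rho_k + g_{\eta_k}(\textbf{x}_{k+1}))\) (our reading of the standard AL multiplier step; the values below are illustrative):

```python
import numpy as np

pi_plus = lambda u: np.maximum(u, 0.0)   # projection onto the nonnegative orthant
pi_minus = lambda u: np.minimum(u, 0.0)  # projection onto the nonpositive orthant

rng = np.random.default_rng(1)
lam = pi_plus(rng.normal(size=4))        # current multiplier (nonnegative)
g = rng.normal(size=4)                   # smoothed constraint value g_eta(x_{k+1})
rho = 10.0

# Multiplier update lambda_{k+1} = rho * Pi_+(lambda_k / rho + g).
lam_next = rho * pi_plus(lam / rho + g)

# Identity from Lemma 10, via the Moreau decomposition u = Pi_+(u) + Pi_-(u):
lhs = g
rhs = (lam_next - lam) / rho + pi_minus(lam / rho + g)
err = np.max(np.abs(lhs - rhs))
print(err)  # ~0
```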

Proof
(i) Let \(f_{\eta _k}^* \triangleq f_{\eta _k}({\textbf{x}_{\eta _k}^*})\) and \((\textbf{x}_{\eta _k}^*,\lambda _{\eta _k}^*)\) be the saddle point of \(\mathcal {L}_{\eta _k, 0}(\textbf{x},\lambda )\). We have that
By adding and subtracting \(f(\textbf{x}_{\eta _k}^*), f_{\eta _k}^*\) and \( f_{\eta _k}({\textbf{x}_{k+1}})\), it follows that
Consequently, we have that \(f(\textbf{x}_{k+1}) - f(\textbf{x}^*) \ge -{(1+b_{\lambda } m)}\eta _k \beta -\left( \tfrac{\Vert \lambda _{k+1}\Vert ^2}{\rho _k} + \tfrac{\Vert {\lambda _{\eta _k}^*}-\lambda _k\Vert ^2}{\rho _k}\right) .\)
(ii) Similar to the previous analysis in Theorem 2, we have
which implies
\(\square \)
We conclude with an overall rate for sub-optimality and infeasibility.

Proof
Suppose \(\rho _k = \rho _0 \zeta ^k\) where \(\zeta > 1\). By choosing \(\epsilon _k\eta _k^b = \tfrac{1}{k^{2+\delta \rho _k}}\), we have that
Next, we derive a rate on the infeasibility. Recall from Lemma 4 that \(g(\textbf{x}_{k+1}) \le g_{\eta _k}(\textbf{x}_{k+1}) + \eta _k {\beta } \textbf{1}\), implying that \(d_-\left(g(\textbf{x}_{k+1})\right) \le d_-\left(g_{\eta _k}(\textbf{x}_{k+1}) + \eta _k {\beta } \textbf{1}\right)\). Therefore,
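A small script can sanity-check the summability that this choice of \(\{\epsilon_k,\eta_k,\rho_k\}\) is meant to deliver; the constants \(\rho_0, \zeta, \delta\) below are assumed values, not the paper's:

```python
# Illustrative parameter schedules; rho_0, zeta, delta are assumed values.
rho0, zeta, delta = 1.0, 2.0, 0.1
K = 40
rho = [rho0 * zeta**k for k in range(K)]
# Inexactness-times-smoothing product: eps_k * eta_k^b = 1 / (k^{2+delta} * rho_k).
prod = [1.0 / (k ** (2 + delta) * rho[k]) for k in range(1, K)]

# Summability of (2 * eps_k * eta_k^b) / rho_k, used in the telescoped bounds:
tail = sum(2.0 * p / r for p, r in zip(prod, rho[1:]))
print(tail)  # finite; dominated by the first few terms
```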
4 Overall Complexity Guarantees
In Subsection 4.1, we begin with some preliminaries, including the derivation of Lipschitzian properties for the smoothed AL function. This allows for employing an accelerated gradient framework for the inexact resolution of the subproblem, leading to suitable complexity guarantees in Subsection 4.2 for convex and strongly convex regimes. In Subsection 4.3, overall complexity guarantees for (Sm-AL) with a fixed smoothing parameter are presented.
4.1 Preliminaries
We first derive L-smoothness of \(\mathcal {L}_{\eta ,\rho }(\bullet ,\lambda )\) uniformly in \(\lambda \). Our bound necessitates utilizing an upper bound on \(\eta \), which we denote by \(\eta ^u\).
Lemma 11
Suppose \({0 \, < \, \eta \, \le \eta ^u}\) and \(\rho \ge {1}\). Then the following hold.
(a) For any \(\lambda \ge 0\), there exists \(\tilde{C}\) such that \(\mathcal {L}_{\eta ,\rho }(\bullet ,\lambda )\) is \(\tfrac{\tilde{C} \rho }{\eta }\)-smooth.
(b) \(\mathcal{L}_{\eta ,\rho }(\textbf{x},\lambda )\) is convex in \(\textbf{x}\in \mathcal{X}\) and concave in \(\lambda \ge 0\). \(\Box \)
Next, we formally state an accelerated gradient method for resolving the augmented Lagrangian subproblem (ALSub\(_{\eta _k,\rho _k}(\lambda _k)\)), defined as

Suppose \(\textbf{x}_k^*\) denotes an optimal solution of (ALSub\(_{\eta _k,\rho _k}(\lambda _k)\)). Since \(\mathcal {L}_{\eta _k,\rho _k}(\bullet ,\lambda _k)\) is a convex and \({\tfrac{\tilde{C} \rho _k}{\eta _k}}\)-smooth function, we employ an accelerated gradient method that constructs a sequence \(\{ \textbf{y}_j,\textbf{z}_j \}_{j=0}^{M_k}\) as follows, where \(\textbf{z}_0 = \textbf{y}_0 = \textbf{x}_k\).
We now restate the convergence guarantees [6, 31, 32, 34] associated with (AG).

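A generic sketch of such an accelerated loop on a smooth convex quadratic (a Nesterov/FISTA-type stand-in for (AG); the test problem and step count are illustrative, not the paper's subproblem):

```python
import math
import numpy as np

def acc_grad(grad, L, x0, M, project=lambda z: z):
    """Nesterov-type accelerated (projected) gradient loop for an L-smooth
    convex objective; a generic stand-in for the (AG) subproblem solver."""
    y, z, t = x0.copy(), x0.copy(), 1.0
    for _ in range(M):
        y_new = project(z - grad(z) / L)               # gradient step at z
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = y_new + ((t - 1.0) / t_new) * (y_new - y)  # momentum extrapolation
        y, t = y_new, t_new
    return y

# Quadratic test problem with known minimizer.
A = np.diag([1.0, 10.0, 100.0])
b = np.array([1.0, 2.0, 3.0])
x_star = np.linalg.solve(A, b)
x_M = acc_grad(lambda x: A @ x - b, L=100.0, x0=np.zeros(3), M=300)
print(np.linalg.norm(x_M - x_star))  # small after M = 300 steps
```

The \(\mathcal{O}(L/M^2)\) optimality-gap guarantee for such schemes is what yields the inner step counts \(M_k\) used throughout this section.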
4.2 Complexity guarantees for convex and strongly convex f
We begin by applying Theorem 4 to develop complexity guarantees in convex settings for an \(\varvec{\varepsilon }\)-optimal solution by leveraging the rate statement for dual suboptimality (in constant penalty settings) and primal sub-optimality (in increasing penalty settings). Throughout, we recall that the AL subproblem objective is \(L_k\)-smooth, where \(L_k = \tfrac{\tilde{C} \rho _k}{\eta _k}\), and that \(\Vert \textbf{x}-\textbf{y}\Vert \le {C_1}\) for any \(\textbf{x}, \textbf{y} \in \mathcal {X}\). Additionally, complexity guarantees are derived by utilizing the rate guarantees presented in Theorem 2 (constant \(\rho \)) or Theorem 3 (increasing \(\rho _k\)) to determine the number of outer iterations K; specifically, by these results, to ensure \(\varvec{\varepsilon }\)-suboptimal solutions, we require that \(K = \lceil \tfrac{C}{\varvec{\varepsilon }}\rceil \) (constant \(\rho \)) or \(K = \lceil \tfrac{\ln (C/\varvec{\varepsilon })}{\ln (\zeta )}\rceil \) (increasing \(\rho _k\)) for a suitable constant C.

Proof
(a) By Theorem 4, \(M_k\) is the smallest integer satisfying
Then the iteration complexity of computing an \((\bar{\textbf{x}}_K,\bar{\lambda }_K)\) such that \(f^*-\mathcal {D}(\bar{\lambda }_K)\le \varvec{\varepsilon }\) requires
(b) Proceeding similarly, by Theorem 4, \(M_k\) is defined as follows.
Then the iteration complexity of producing an \(\textbf{x}_K\) satisfying \(|f^*-f({\textbf{x}_K})|\,\le \, \varvec{\varepsilon }\) requires
Remark 1
(Constant \(\rho \).) Suppose \(\varvec{\varepsilon }\) is a positive scalar. Let \(K \triangleq \lceil C/\varvec{\varepsilon }\rceil \) where C is defined in Proposition 2. Suppose the Sm-AL scheme runs for K iterations and produces \(\bar{\textbf{x}}_K\) and \(\bar{\lambda }_K\). Then we have that
(Increasing \(\rho _k\)). Suppose \(\varvec{\varepsilon }\) is a positive scalar. Let \(K \triangleq \lceil \ln \left( \tfrac{C}{\varvec{\varepsilon }}\right) /\ln \left( \zeta \right) \rceil \) where C is defined in Theorem 3 and \(\rho _k = \rho _0\zeta ^k\) with \(\zeta >1\). Suppose the Sm-AL scheme runs for K iterations and produces \(\bar{\textbf{x}}_K\) and \(\bar{\lambda }_K\), where
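The contrast between the two outer-iteration counts is easy to tabulate; C and \(\zeta\) below are placeholder constants, not the paper's exact values:

```python
import math

# Outer-iteration counts implied by the two penalty regimes.
C, zeta = 10.0, 2.0

def K_constant(eps):   # constant rho: K = ceil(C / eps)
    return math.ceil(C / eps)

def K_geometric(eps):  # rho_k = rho_0 * zeta^k: K = ceil(ln(C/eps) / ln(zeta))
    return math.ceil(math.log(C / eps) / math.log(zeta))

for eps in (1e-1, 1e-2, 1e-3):
    print(eps, K_constant(eps), K_geometric(eps))
```

Geometric penalty growth turns the number of outer epochs from \(\mathcal{O}(1/\varvec{\varepsilon })\) into \(\mathcal{O}(\ln (1/\varvec{\varepsilon }))\), which is the source of the improved overall complexity.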
We now produce an extension of the results for strongly convex settings.

Proof
(a) Suppose \(\rho _k = \rho _0\) for all k. Suppose \(M_k\) represents the least number of steps taken at step k to achieve \((\epsilon _k \eta _k^b)\)-optimality of the subproblem. By Theorem 4 and \(\ln (x) \ge \tfrac{x-1}{x}\) for \(x > 0\),
Consequently, since \(K(\varvec{\varepsilon }) = \lceil C/\varvec{\varepsilon }\rceil \) outer steps are required, the overall complexity is
(b) Consider \(\rho _k = \rho _0\zeta ^{k}\) where \(k\ge 0\) and \(\zeta >1\). Proceeding as in (a) and by Theorem 4 and \(\ln (x) \ge \tfrac{x-1}{x}\) for \(x > 0\),
Consequently, if \(K(\varvec{\varepsilon }) = \lceil \ln (C/\varvec{\varepsilon })/\ln (\zeta ) \rceil = \lceil {\log }_{\zeta }(C/\varvec{\varepsilon })\rceil \) outer steps are employed, then the overall complexity can be bounded as follows.
\(\square \)
Remark 2
Sm-AL is designed for convex problems with nonsmooth nonlinear convex constraints, achieving an overall complexity of \(\tilde{\mathcal {O}}\left( \varvec{\varepsilon }^{-3/2}\right) \) under geometric growth of \(\rho _k\), slightly worse than the best known complexities for contending with smooth nonlinear constraints (cf. [26, 44]), i.e. \({\mathcal {O}}(\varvec{\varepsilon }^{-1})\) (up to logarithmic terms).
4.3 Complexity Analysis for (Sm-AL) with fixed \(\eta \)
Next, we apply (Sm-AL) to (NSCopt\(_{\eta }\)) with a fixed and appropriately chosen \(\eta \) with the overall goal of finding an \((\bar{\textbf{x}}_K,\bar{\lambda }_K)\) such that either dual suboptimality is sufficiently small, i.e. \(f_{\eta }^* - \mathcal{D}_{\eta ,0}(\bar{\lambda }_K) \, \le \, \varvec{\varepsilon }\) (constant \(\rho _k = \rho _0\)) or primal suboptimality is sufficiently small \(|f_{\eta }(\textbf{x}_K) - f_{\eta }^*| < \varvec{\varepsilon }\) (geometrically increasing \(\rho _k\)).
(a) (Constant \(\rho \)) Suppose \({\eta \le \tilde{c}\varvec{\varepsilon }}\), where \(\tilde{c}\) is specified below. After K steps of (Sm-AL), \(f_{\eta }^* - \mathcal{D}_{\eta ,0}(\bar{\lambda }_K) \, \le \, \tfrac{\varvec{\varepsilon }}{2}\), where \(K = \bigg \lceil \tfrac{C}{\varvec{\varepsilon }}\bigg \rceil \) for a suitable C. By Lemma 4,
To ensure that the second term is less than \(\varvec{\varepsilon }/2\), we select \(\eta \le \tfrac{\varvec{\varepsilon }}{2 \left( \beta (2 + {\tilde{B}}_{\lambda } m)\right) }\).
(b) (Geometrically increasing \(\rho _k\)). Proceeding similarly, suppose \({\eta \le \tilde{c} \varvec{\varepsilon }}\); then after K steps of (Sm-AL), \(|f_{\eta }(\textbf{x}_K) - f_{\eta }^*| \, \le \, \tfrac{\varvec{\varepsilon }}{2}\), where \(K = {\lceil \tfrac{C}{\varvec{\varepsilon }} \rceil }\) for a suitable C. Consequently, if \(\eta \le \tfrac{\varvec{\varepsilon }}{2\beta }\), then \(f(\textbf{x}_K) - f^* \le \varvec{\varepsilon }\).
Similarly, if \(\eta \le \tfrac{\varvec{\varepsilon }}{2\beta }\), then \(f^* - f(\textbf{x}_K) \le \varvec{\varepsilon }\); combining the two bounds, \(| f(\textbf{x}_K)-f^*| \le \varvec{\varepsilon }\).

Proof
(a) By Theorem 4, \(M_k\) is the smallest integer satisfying
where \(C_1, \tilde{C}, \beta , B_{\lambda }\) are constants and \(D \triangleq 2C_1\tilde{C} \left( \beta (2 + B_{\lambda } m)\right) \). Then the complexity of computing an \((\bar{\textbf{x}}_K,\bar{\lambda }_K)\) such that \(f^* - \mathcal {D}_0(\bar{\lambda }_K) \le \varvec{\varepsilon }\) requires
(b) Consider \(\rho _k = \rho _0\zeta ^{k}\) where \(k\ge 0\) and \(\zeta >1\). Proceeding as in (a) and by invoking Theorem 4,
where \(C_1, \tilde{C}, \beta \) are constants and \(D \triangleq 2C_1\tilde{C}\beta \). Then the iteration complexity of producing an \(\textbf{x}_K\) satisfying \(|f^* - f(\textbf{x}_K)| \le \varvec{\varepsilon }\) leads to the following bound, where \(C, D > 0\).
\(\square \)
Remark 3
We observe that the complexity guarantees are close to those for diminishing \(\eta _k\), with a slight improvement in the constant \(\rho _0\) regime. We recall that Nesterov [33] and Beck and Teboulle [7] adopted different smoothing techniques with fixed \(\eta \) to get an \(\varvec{\varepsilon }\)-optimal solution within \(\mathcal {O}(1/\varvec{\varepsilon })\). When compared to these smoothing schemes in [7, 33], Sm-AL targets problems with nonsmooth constraint functions. Moreover, Sm-AL accommodates both fixed and varying \(\eta \), with an effective complexity rate \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-3/2})\), matching the complexity of a smoothed penalized scheme [3].
Table 2 summarizes rates and complexities for Sm-AL, Sm-AL(\(\eta \)), Sm-AL(S), and N-AL, where (a) Sm-AL is the smoothed ALM for convex problems; (b) Sm-AL(\(\eta \)) is the \(\eta \)-smoothed ALM; (c) Sm-AL(S) is Sm-AL for strongly convex problems; and (d) N-AL is the original ALM for nonsmooth problems. Additionally, Table 3 captures all of the constants utilized in the results from Sections 3 and 4 in a single table.
5 Numerical Experiments
5.1 Fused Lasso Problems
In this section, we apply (Sm-AL) on a fused lasso problem with datasets \(\left\{ X_i, y_i\right\} _{i = 1}^N\) where \(X_i\) is the d-dimensional feature vector for ith instance and \(y_i\) is the corresponding response. Consider the \(\eta \)-smoothing of (1).
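As an illustration of one possible \(\eta\)-smoothing, the two nonsmooth fused-lasso constraints (\(\ell_1\) and total-variation budgets, an assumed constrained form that may differ from the paper's exact formulation of (1)) can be Huber-smoothed componentwise; the data below are arbitrary:

```python
import numpy as np

def huber(t, eta):
    """Standard eta-smoothing of |t|: t^2/(2 eta) on |t| <= eta, else |t| - eta/2."""
    a = np.abs(t)
    return np.where(a <= eta, t * t / (2.0 * eta), a - eta / 2.0)

def smoothed_constraints(beta, s1, s2, eta):
    """eta-smoothed fused-lasso constraints g_eta(beta) <= 0 (illustrative form:
    ||beta||_1 <= s1 and sum_i |beta_{i+1} - beta_i| <= s2)."""
    g1 = huber(beta, eta).sum() - s1
    g2 = huber(np.diff(beta), eta).sum() - s2
    return np.array([g1, g2])

beta = np.array([0.5, 0.5, -1.0, 0.0])
print(smoothed_constraints(beta, s1=3.0, s2=2.0, eta=0.1))
# Per-term smoothing error: 0 <= |t| - huber(t, eta) <= eta/2 for all t.
```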
We conducted the experiments on simulated datasets with the dimension n of \(\beta \) ranging from 5 to 1000. The results are shown in Table 4. The optimal solutions for each experiment are obtained by using fmincon in Matlab. In Table 4, we compare the results from Sm-AL with those from N-AL. Both Sm-AL and N-AL terminated at 50 outer iterations, except that the \(n = 1000\) case for Sm-AL was stopped at the 30th outer iteration to save time. N-AL was terminated when the overall runtime exceeded two hours for higher dimensional problems. In all cases, Sm-AL outperforms N-AL with respect to primal suboptimality and overall runtime.
Next, we compare the results from Sm-AL with AL on an \(\eta \)-smoothed problem for a single instance (\(n = 5\)). We observe that such fixed-smoothing avenues provide relatively coarse approximations compared to their iteratively smoothed counterparts. Finally, we compare empirical rates of Sm-AL in two settings of \(\rho _k\) for a smaller problem (\(n = 5\)) in terms of primal suboptimality in Figure 1 and observe alignment with the theoretical rates, represented by blue lines with triangular markers.
The following insights were derived from the analysis of primal suboptimality, as shown in Figure 1.
(i) First, employing a constant \(\eta \) leads to a sequence that converges to an approximate solution, while diminishing \(\eta _k\) allows for asymptotic guarantees to a true solution.
(ii) Second, choosing a very small \(\eta \) may impede early progress of the scheme, since this leads to a large Lipschitz constant L, constraining the steplength and limiting the progress. On the other hand, selecting a larger \(\eta \) allows for better early progress, but the sequence will converge to a solution that may differ significantly from the true solution. A diminishing \(\eta _k\) sequence starts with a larger \(\eta \) (allowing for larger steps and greater progress) but comes with a guarantee that the sequence will converge to a true solution. This is reflected in Figure 1.
(iii) Third, the complexity guarantees for constant \(\eta \) are close to those for diminishing \(\eta _k\), with a slight improvement in the constant \(\rho _0\) regime (cf. Remark 3). When compared to the results in Proposition 2 with constant \(\rho \), Sm-AL with constant \(\eta \) improves the overall complexity by \(\mathcal {O}\left( \varvec{\varepsilon }^{-1/2}\right) \). The diminishing nature of \(\eta _k\) slows down the convergence process due to the additional summability requirement on the varying \(\eta _k\).
5.2 Incorporation of termination criteria
Next, we consider the introduction of termination criteria T1 and T2 and examine the impact of potentially early termination, measured by \(\sum _{k}N_k\). Table 5 provides a comparison between the Sm-AL scheme with and without termination criteria. It can be observed that the incorporation of these termination criteria leads to significant computational benefits with little (if any) impact on accuracy. A natural question lies in the choice of \(\gamma \) in the definition of the residual function \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}\). We observe that when \(\textbf{x}\in \mathcal{X}\),
where \(\Vert \textbf{x}\Vert \le {C}\) and \(\Vert \nabla h(\textbf{x})\Vert \le {D}\) for any \(\textbf{x} \in \mathcal X\). From the above bound, it may be observed that small choices of \(\gamma \) may lead to early satisfaction of conditions T1 or T2, while larger choices of \(\gamma \) may require significantly more iterations. Ideally, since we have already developed convergence guarantees, it would be helpful to relate \(\gamma \) to \(\eta _k\). Some preliminary numerics are provided where the choice of \(\gamma \) is varied in condition T2, leading to some variability in performance. It can be surmised from this table that a constant \(\gamma \) leads to poorer performance, while diminishing choices for \(\gamma \) lead to far superior behavior. This is unsurprising in that, for larger values of K, \(\gamma \) is smaller and imposes a more modest threshold for satisfying the condition, thereby allowing for earlier termination.
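The effect of \(\gamma\) described above can be seen directly on an assumed instance of the residual (the function h, the set, and the evaluation point are all illustrative):

```python
import numpy as np

def residual(x, grad_h, project, gamma):
    """||x - Pi_X(x - gamma * grad_h(x))||: the quantity thresholded in T1/T2
    (an illustrative form of the natural-map residual)."""
    return np.linalg.norm(x - project(x - gamma * grad_h(x)))

grad_h = lambda x: x - np.array([2.0, -1.0])  # assumed smooth h; minimizer outside X
project = lambda x: np.clip(x, 0.0, 1.0)      # X = [0, 1]^2
x = np.array([0.5, 0.5])

res = [residual(x, grad_h, project, g) for g in (0.01, 0.1, 1.0)]
print(res)  # nondecreasing in gamma: a smaller gamma makes the test fire earlier
```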
6 Conclusion
In this paper, we develop a smoothed AL scheme for resolving convex programs with possibly nonsmooth constraints and provide rate and complexity guarantees for convex and strongly convex settings under constant and increasing penalty parameter sequences. The complexity guarantees represent significant improvements over the best available guarantees for AL schemes applied to convex programs with nonsmooth objectives and constraints. A by-product of our analysis is a relationship between saddle points of \(\eta \)-smoothed problems and \(\eta \)-saddle points of our original problem. Moreover, to improve the practical behavior of the proposed Sm-AL scheme, we have developed termination criteria that allow for premature termination. Our preliminary numerics suggest that such criteria lead to significant improvements in the complexity of our scheme with modest impact on the accuracy of the resulting solutions. We believe that our findings represent a foundation for considering extensions to compositional regimes with expectation-valued and possibly nonsmooth constraints.
References
Alger, N., Villa, U., Bui-Thanh, T., Ghattas, O.: A data scalable augmented Lagrangian KKT preconditioner for large-scale inverse problems. SIAM J. Sci. Comput. 39(5), A2365–A2393 (2017)
Aybat, N.S., Ahmadi, H., Shanbhag, U.V.: On the analysis of inexact augmented Lagrangian schemes for misspecified conic convex programs. IEEE Transactions on Automatic Control 67(8), 3981–3996 (2021)
Aybat, N.S., Iyengar, G.: A first-order smoothed penalty method for compressed sensing. SIAM Journal on Optimization 21(1), 287–313 (2011)
Aybat, N.S., Iyengar, G.: An augmented Lagrangian method for conic convex programming. arXiv preprint arXiv:1302.6322 (2013)
Beck, A.: Introduction to nonlinear optimization: Theory, algorithms, and applications with MATLAB. SIAM (2014)
Beck, A.: First-order methods in optimization. SIAM (2017)
Beck, A., Teboulle, M.: Smoothing and first order methods: A unified framework. SIAM Journal on Optimization 22(2), 557–580 (2012)
Byrd, R.H., Hribar, M.E., Nocedal, J.: An interior point algorithm for large-scale nonlinear programming. SIAM Journal on Optimization 9(4), 877–900 (1999)
Chang, H., Lou, Y., Ng, M.K., Zeng, T.: Phase retrieval from incomplete magnitude information via total variation regularization. SIAM J. Sci. Comput. 38(6), A3672–A3695 (2016)
Conn, A.R., Gould, N.I.M., Toint, P.L.: LANCELOT: a Fortran package for large-scale nonlinear optimization (Release A), vol. 17. Springer Science & Business Media (2013)
Cottle, R.W., Pang, J.S., Stone, R.E.: The Linear Complementarity Problem. Academic Press Inc, Boston, MA (1992)
Devolder, O., Glineur, F., Nesterov, Y.: Double smoothing technique for large-scale linearly constrained convex optimization. SIAM Journal on Optimization 22(2), 702–727 (2012)
Dong, B., Zhang, Y.: An efficient algorithm for \(\ell _0\) minimization in wavelet frame based image restoration. J. Sci. Comput. 54(2–3), 350–368 (2013)
Facchinei, F., Pang, J.S.: Finite-dimensional variational inequalities and complementarity problems, vol. I. Springer Series in Operations Research. Springer-Verlag, New York (2003)
Friedlander, M.P., Leyffer, S.: Global and finite termination of a two-phase augmented Lagrangian filter method for general quadratic programs. SIAM J. Sci. Comput. 30(4), 1706–1729 (2008)
Friedlander, M.P., Saunders, M.A.: A globally convergent linearly constrained Lagrangian method for nonlinear optimization. SIAM J. Optim. 15(3), 863–897 (2005)
Gao, B., Liu, X., Yuan, Yx.: Parallelizable algorithms for optimization problems with orthogonality constraints. SIAM J. Sci. Comput. 41(3), A1949–A1983 (2019)
Gill, P.E., Murray, W., Saunders, M.A.: SNOPT: an SQP algorithm for large-scale constrained optimization. SIAM Rev. 47(1), 99–131 (2005) (electronic)
Hestenes, M.R.: Multiplier and gradient methods. Journal of Optimization Theory and Applications 4(5), 303–320 (1969)
Jalilzadeh, A., Shanbhag, U.V., Blanchet, J., Glynn, P.W.: Smoothed variable sample-size accelerated proximal methods for nonsmooth stochastic convex programs. Stochastic Systems 12(4), 373–410 (2022)
Kang, M., Kang, M., Jung, M.: Inexact accelerated augmented Lagrangian methods. Computational Optimization and Applications 62(2), 373–404 (2015)
Kloft, M., Brefeld, U., Laskov, P., Müller, K.R., Zien, A., Sonnenburg, S.: Efficient and accurate \({L}_p\)-norm multiple kernel learning. Advances in Neural Information Processing Systems 22 (2009)
Koshal, J., Nedić, A., Shanbhag, U.V.: Multiuser optimization: Distributed algorithms and error analysis. SIAM Journal on Optimization 21(3), 1046–1081 (2011)
Lan, G., Monteiro, R.D.: Iteration-complexity of first-order augmented Lagrangian methods for convex programming. Mathematical Programming 155(1–2), 511–547 (2016)
Liu, Y.F., Liu, X., Ma, S.: On the nonergodic convergence rate of an inexact augmented Lagrangian framework for composite convex programming. Mathematics of Operations Research 44(2), 632–650 (2019)
Lu, Z., Zhou, Z.: Iteration-complexity of first-order augmented Lagrangian methods for convex conic programming. SIAM J. Optim. 33(2), 1159–1190 (2023)
Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bulletin de la Société mathématique de France 93, 273–299 (1965)
Murtagh, B.A., Saunders, M.A.: A projected Lagrangian algorithm and its implementation for sparse nonlinear constraints. Springer (1982)
Necoara, I., Patrascu, A., Glineur, F.: Complexity of first-order inexact Lagrangian and penalty methods for conic convex programming. Optimization Methods and Software 34(2), 305–335 (2019)
Nedelcu, V., Necoara, I., Tran-Dinh, Q.: Computational complexity of inexact gradient augmented Lagrangian methods: application to constrained mpc. SIAM Journal on Control and Optimization 52(5), 3109–3134 (2014)
Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(\cal{O} (1/k^{2})\). Doklady AN USSR 269, 543–547 (1983)
Nesterov, Y.: Introductory lectures on convex optimization: A basic course, vol. 87. Springer Science & Business Media (2003)
Nesterov, Y.: Smooth minimization of non-smooth functions. Mathematical programming 103(1), 127–152 (2005)
Nesterov, Y., et al.: Lectures on convex optimization, vol. 137. Springer (2018)
Patrascu, A., Necoara, I., Tran-Dinh, Q.: Adaptive inexact fast augmented Lagrangian methods for constrained convex optimization. Optimization Letters 11(3), 609–626 (2017)
Polyak, B.T.: Introduction to optimization (1987)
Powell, M.J.: A method for nonlinear constraints in minimization problems. Optimization pp. 283–298 (1969)
Rockafellar, R.T.: A dual approach to solving nonlinear programming problems by unconstrained optimization. Mathematical Programming 5(1), 354–373 (1973)
Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research 1(2), 97–116 (1976)
Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288 (1996)
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K.: Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society: Series B (Stat. Method.) 67(1), 91–108 (2005)
Tong, X., Xia, L., Wang, J., Feng, Y.: Neyman-Pearson classification: parametrics and sample size requirement. Journal of Machine Learning Research 21(1), 380–427 (2020)
Wilson, R.B.: A simplicial algorithm for concave programming. Ph.D. Dissertation, Graduate School of Business Administration (1963)
Xu, Y.: Iteration complexity of inexact augmented Lagrangian methods for constrained convex programming. Mathematical Programming 185(1), 199–244 (2021)
Zhang, L., Zhang, Y., Wu, J., Xiao, X.: Solving stochastic optimization with expectation constraints efficiently by a stochastic augmented Lagrangian-type algorithm. INFORMS Journal on Computing 34(6), 2989–3006 (2022)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: series B (Statistical Methodology) 67(2), 301–320 (2005)
Acknowledgements
We extend our sincere appreciation to Dr. Qi Wang (University of Michigan at Ann Arbor) for her invaluable suggestions and careful reading of a recent draft of this paper. In addition, the second author would like to acknowledge his early collaboration with Dr. N. Serhat Aybat (Pennsylvania State University) that provided some of the seeds for this study.
Funding
P. Zhang and Uday V. Shanbhag are partially supported by ONR Grant N00014-22-1-2589, AFOSR Grant FA9550-24-1-0259, and DOE Grant DE-SC0023303. Ethan X. Fang would like to acknowledge support from NSF Grants DMS-2346292 and DMS-2434666.
Uday V. Shanbhag would like to dedicate this paper to Prof. Michael A. Saunders for his help, mentorship, and guidance as well as his immense and enduring contributions to the theoretical development and large-scale implementation of algorithms for nonlinear programming.
Appendix
1.1 Proof of Lemma 1
Proof
where the last equality follows from \(d_{+}(-v) = d_{-}(v)\) and \(d_\mathcal {K}(u) \triangleq \min _{v \in \mathcal K} \Vert v-u\Vert \). We now derive \(\nabla _{\lambda } \mathcal {L}_{\rho }(\textbf{x},\lambda )\) as follows.
where the second equality is a result of \(\nabla _u d^2_\mathcal{K}(u) = 2(u-\varPi _\mathcal{K}[u])\) for any cone \(\mathcal{K}\), the last equality is a consequence of \(u = \varPi _{-\mathcal {K}}(u) + \varPi _{\mathcal {K}^*}(u)\) and \(\mathcal {K}\triangleq \{u: u \ge 0\}\). Similarly, we derive \(\nabla _{\textbf{x}} \mathcal {L}_{\rho }(\textbf{x},\lambda )\) as follows.
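The two projection identities invoked above can be checked numerically. A minimal sketch for \(\mathcal{K} = \mathbb{R}^m_+\) (so that \(\varPi _\mathcal{K}[u] = \max (u,0)\), \(\varPi _{-\mathcal{K}}[u] = \min (u,0)\), and \(\mathcal{K}^* = \mathcal{K}\) by self-duality); the random test point is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(5)

# Projections onto K = R^m_+ and -K; K* = K since the orthant is self-dual
proj_K = np.maximum(u, 0.0)
proj_negK = np.minimum(u, 0.0)

# Moreau decomposition: u = Pi_{-K}(u) + Pi_{K*}(u)
assert np.allclose(u, proj_negK + proj_K)

# Gradient identity: grad d_K^2(u) = 2 (u - Pi_K[u]), verified by central differences
def d2(v):
    # squared Euclidean distance from v to K = R^m_+
    return np.sum(np.minimum(v, 0.0) ** 2)

grad_analytic = 2.0 * (u - np.maximum(u, 0.0))
eps = 1e-6
grad_fd = np.array([(d2(u + eps * e) - d2(u - eps * e)) / (2 * eps)
                    for e in np.eye(5)])
assert np.allclose(grad_analytic, grad_fd, atol=1e-4)
```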
where \(J_{g}(\textbf{x})\) is the Jacobian matrix of g. \(\square \)
1.2 Proof of Lemma 2
Proof
For completeness, we provide this proof, which is based on that provided in [38]. Let \(u = g(\textbf{x})+v\) and \(p_{\rho }(u)\triangleq \inf _{\textbf{x}\in \mathcal {X}}f(\textbf{x})+\tfrac{\rho }{2}\Vert u\Vert ^2\), where \(p_\rho \) can be regarded as a “perturbation” function. Then we have
The augmented dual function can be expressed as
where the infimal convolution of two functions is defined as
Consequently, by Danskin’s theorem,
\(\square \)
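For the reader's convenience, the infimal convolution used in the preceding proof is the standard one; we recall its definition here in standard notation (paraphrased, not verbatim from the paper):

```latex
(p_1 \,\square\, p_2)(u) \;\triangleq\; \inf_{v} \bigl\{\, p_1(v) + p_2(u-v) \,\bigr\}.
```

With this operation, the augmented dual function is the infimal convolution of the ordinary dual with \(\tfrac{\rho }{2}\Vert \cdot \Vert ^2\), i.e. a Moreau envelope, which is the standard reason it admits a \(\tfrac{1}{\rho }\)-Lipschitz gradient.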
1.3 Proof of Lemma 6
Proof
We observe that (14) can be expressed as
\(\square \)
1.4 Proof of Lemma 4
Proof
(i) Note that for any \(\textbf{x}\, \in \, \mathcal {X}\), we have that
Consequently, for any \(\lambda \ge 0\), by adding (35) to \(\lambda _i \times \) (36) for \(i=1, \cdots , m\),
(ii) Suppose \({{\bar{\textbf{x}}} \, \in } \, {\displaystyle \arg \min _{\textbf{x}\in \mathcal {X}}} \, \mathcal {L}_0(\textbf{x},\lambda )\) and \({\bar{\textbf{x}}}_{\eta } {\, \in \, } {\displaystyle \arg \min _{\textbf{x}\in \mathcal {X}}} \, \mathcal {L}_{\eta ,0}(\textbf{x},\lambda )\). It follows that \(\mathcal {D}_0(\lambda ) = \mathcal {L}_0({{\bar{\textbf{x}}}},\lambda )\) and \(\mathcal {D}_{\eta ,0}(\lambda ) = \mathcal {L}_{\eta ,0}({{\bar{\textbf{x}}}}_{\eta },\lambda )\). Let \(C = (\Vert \lambda \Vert m+1){\beta }\).
Similarly, we have that
This implies that for any \(\lambda \in \mathbb {R}^m_+\), \(| \mathcal {D}_{\eta ,0}(\lambda ) - \mathcal {D}_{0}(\lambda )| \le {\eta C}.\)
(iii) By the prior definitions,
For any \(\lambda \ge 0\), let \(u_1 {\, \in \,} \arg {\displaystyle \max _u} \,\mathcal {D}_{\eta ,\rho }(\lambda )\) and \(u_2 {\, \in \,} \arg {\displaystyle \max _u} \, \mathcal {D}_{\rho }(\lambda )\). Then
Similarly, \(\mathcal {D}_{\rho }(\lambda )-\mathcal {D}_{\eta ,\rho }(\lambda ) \le \eta C\), implying the result. \(\square \)
1.5 Proof of Lemma 5
Proof
(i) By definition, we have that
By strong convexity of \(-\mathcal{D}_0(\bullet ) + \tfrac{1}{2\rho }\Vert \bullet -\lambda \Vert ^2\) and \(-\mathcal{D}_{\eta ,0}(\bullet ) + \tfrac{1}{2\rho }\Vert \bullet -\lambda \Vert ^2\) and by noting that \(q_{\rho }(\lambda )\) and \(q_{\eta ,\rho }(\lambda )\) uniquely minimize (37) and (38), respectively, we obtain that
Consequently, by summing the two inequalities above, we have that
By definitions of \(\lambda _{\eta }^*\) and \(\lambda ^*\), we have \(q_{\eta ,\rho }(\lambda _{\eta }^*) = \lambda _{\eta }^*\) and \(q_{\rho }(\lambda ^*) = \lambda ^*\). Therefore, we have the following bounds on \(\Vert q_{\eta ,\rho }(\lambda )\Vert \) and \(\Vert q_{\rho }(\lambda )\Vert \).
Similarly, \( \Vert q_{\rho }(\lambda )\Vert =\left\| q_{\rho }(\lambda )-q_{\rho }(\lambda ^*)+\lambda ^*\right\| \le \Vert \lambda \Vert +2\Vert \lambda ^*\Vert .\) It follows that for any \(\lambda \ge 0\),
where \(C_{m} \triangleq 1+m(b_{\lambda ,\eta }+b_{\lambda })\) is a constant.
(ii) By recalling the definitions of \(\nabla _{\lambda } \mathcal {D}_{\rho }(\lambda )\) and \(\nabla _{\lambda } \mathcal {D}_{\eta ,\rho }(\lambda )\) from Lemma 2,
\(\square \)
1.6 Proof of Lemma 7
Proof
(a) \(\implies \) (b). Suppose \(\textbf{x}^*_{\epsilon }\) is an \(\epsilon \)-optimal solution of (Opt). Suppose (b) does not hold and there exists \(\textbf{y}\in \mathcal{X}\) such that
Consequently, \(h^\prime (\textbf{x}^*_{\epsilon };d) =\nabla h(\textbf{x}^*_{\epsilon })^\top d\) where \(d = \textbf{y}-\textbf{x}_{\epsilon }^*\). Since d is a descent direction, by [5, Lemma 4.2], we have that for some \(\delta \le 1\), \( h(\textbf{x}^*_{\epsilon }+td) - h(\textbf{x}^*_{\epsilon }) < -\epsilon \) for any \(t \in (0,\delta )\). Note that \(\textbf{x}^*_{\epsilon } + td \in \mathcal{X}\) since \(\mathcal{X}\) is a convex set. It follows that there exists a feasible point \(\textbf{x}^*_{\epsilon }+td \in \mathcal{X}\) such that \( h(\textbf{x}^*_{\epsilon }+td) - h(\textbf{x}^*_{\epsilon }) < -\epsilon \), violating the \(\epsilon \)-optimality of \(\textbf{x}^*_{\epsilon }\).
(b) \(\implies \) (a). By convexity of h, we have that
Consequently, \(h(\textbf{x}^*_{\epsilon }) \, \le \, h(\textbf{x}^*) + \epsilon \, \le \, h(\textbf{x}) + \epsilon \) for any \(\textbf{x}\in \mathcal{X}\), implying that \(\textbf{x}^*_{\epsilon }\) is an \(\epsilon \)-optimal solution.
(c) \(\implies \) (b). Given \(\textbf{u}\in \mathcal{X}\) and \(\textbf{x}^*_{\epsilon } \in \mathcal{X}\), we have that \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}(\textbf{x}_{\epsilon }^*,\textbf{u}) = 0\). Consequently, we have that \(\textbf{x}_{\epsilon }^* = \tilde{\epsilon } v + \varPi _\mathcal{X}\left[ \, \textbf{x}_{\epsilon }^* - \gamma \nabla h(\textbf{x}_{\epsilon }^*)\, \right] \), where \(\tilde{\epsilon } v + \varPi _\mathcal{X}\left[ \, \textbf{x}_{\epsilon }^* - \gamma \nabla h(\textbf{x}_{\epsilon }^*)\, \right] \in \mathcal{X}\) and \(v = \textbf{u}- \varPi _\mathcal{X}\left[ \, \textbf{x}_{\epsilon }^* - \gamma \nabla h(\textbf{x}_{\epsilon }^*)\, \right] \). It is easily seen that the former of these assertions holds as observed next.
since \(\textbf{u}\in \mathcal{X}\), \(\tilde{\epsilon } \in (0,1)\), and \(\mathcal{X}\) is a convex set. For ease of exposition, we denote \(\varPi _\mathcal{X}\left[ \, \textbf{x}_{\epsilon }^* - \gamma \nabla h(\textbf{x}_{\epsilon }^*)\, \right] \) as \(\tilde{\textbf{x}}\). For any \(\textbf{y}\, \in \,\mathcal{X}\),
We first derive a bound on Term 1.
where \(\Vert \textbf{y}\Vert ^2 \le C\) for any \(\textbf{y}\in \mathcal{X}\) and \(\tilde{\textbf{x}} \triangleq \varPi _\mathcal{X}\left[ \, \textbf{x}_{\epsilon }^* - \gamma \nabla h(\textbf{x}_{\epsilon }^* )\, \right] \). Consider Term 2.
Consequently, we have that
\(\square \)
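The natural-map residual appearing in (c) admits a direct numerical illustration: an (approximate) fixed point of the projected-gradient map is (approximately) stationary. A minimal sketch on a box-constrained quadratic; the problem instance, step size \(\gamma \), and tolerance are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Illustrative instance: h(x) = 0.5 ||x - c||^2 over X = [0, 1]^n (a box)
c = np.array([-0.5, 0.3, 1.7])
grad_h = lambda x: x - c
proj_X = lambda x: np.clip(x, 0.0, 1.0)

gamma = 0.5  # step size (assumed)
x = np.zeros_like(c)
for _ in range(200):  # projected-gradient iterations (0.5-contraction here)
    x = proj_X(x - gamma * grad_h(x))

# Natural-map residual: F^nat(x) = x - Pi_X[x - gamma * grad h(x)]
residual = np.linalg.norm(x - proj_X(x - gamma * grad_h(x)))
assert residual < 1e-8  # x is a (near-)fixed point, hence (near-)stationary

# The unconstrained minimizer c is infeasible in two coordinates;
# the constrained solution is its clip onto the box.
assert np.allclose(x, np.clip(c, 0.0, 1.0))
```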
1.7 Proof of Lemma 9
Proof
(a) By adding and subtracting \(q_{\eta _k,\rho _k}(\lambda _{k}),q_{\eta _k,\rho _k}(\lambda ^*) ,q_{\rho _k}(\lambda ^*) \), it follows that
Next, we derive a bound on \(\Vert \lambda _{k+1}-q_{\eta _k,\rho _k}(\lambda _k)\Vert \) as follows.
From Lemma 4, \(\Vert q_{\eta _k,\rho _k}(\lambda ^*)-q_{\rho _k}(\lambda ^*)\Vert \le 2\sqrt{\rho _k\eta _k(\Vert \lambda ^*\Vert m + {C_m})\beta }\), implying that
By leveraging the deterministic form of the Robbins-Siegmund Lemma [36], if
\(\sqrt{2\rho _k \epsilon _k{\eta _k^b}} + 2\sqrt{\rho _k\eta _k ({\Vert \lambda ^*\Vert }m+{C_m})\beta }\) is summable, then \(\{ \Vert \lambda _{k}-\lambda ^*\Vert \}\) converges to a nonnegative value. It follows that \(\{\lambda _k\}\) is convergent.
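The deterministic form of the Robbins–Siegmund lemma invoked here can be recalled as follows (a standard statement, paraphrased rather than quoted from [36]):

```latex
a_{k+1} \;\le\; (1+\theta_k)\, a_k + b_k, \quad a_k, \theta_k, b_k \ge 0, \quad
\sum_{k} \theta_k < \infty, \ \ \sum_{k} b_k < \infty
\;\;\Longrightarrow\;\; \{a_k\} \ \text{converges to a nonnegative limit},
```

applied above with \(a_k = \Vert \lambda _k - \lambda ^*\Vert \), \(\theta _k = 0\), and \(b_k\) equal to the summable term.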
(b) Summing (41) from \(k=0, \cdots , K-1\), we obtain that
\(\square \)
1.8 Proof of Lemma 11
Proof
Recall that \(\mathcal {L}_{\eta ,\rho }(\textbf{x},\lambda )\) and its gradient \(\nabla _{\textbf{x}}\mathcal {L}_{\eta ,\rho }(\textbf{x},\lambda )\) are defined as
where \(\left( \textbf{J}_{g}(\textbf{x})\right) ^\top \triangleq \begin{bmatrix} \nabla _{\textbf{x}}g_{\eta ,1}(\textbf{x})&\nabla _{\textbf{x}}g_{\eta ,2}(\textbf{x})&\dots&\nabla _{\textbf{x}}g_{\eta ,m}(\textbf{x}) \end{bmatrix}\) and \(\textbf{J}_g(\textbf{x})\) denotes the Jacobian matrix of \(g_{\eta }(\textbf{x})\). By Assumption 1 and Definition 1, \(g_{\eta }\) and \(\textbf{J}_g\) are bounded on \(\mathcal {X}\) by \(M_g\) and \(M_G\), respectively. Since \(\textbf{J}_g\) is bounded, \(g_{\eta }\) is Lipschitz continuous on \(\mathcal {X}\) with constant \(L_g\). By Lemma 9, for all \(\textbf{x}_1, \textbf{x}_2 \in \mathcal {X}\), it follows that
Next we show that the second term is Lipschitz continuous in \(\textbf{x}\). By adding and subtracting \(-{\textbf{J}_{g}(\textbf{x}_2)}^\top \left( \tfrac{\lambda }{\rho } + g_{\eta }(\textbf{x}_1)-\varPi _{-}\left[ \tfrac{\lambda }{\rho }+g_{\eta }(\textbf{x}_1)\right] \right) \), we have
Consequently, \(\mathcal {L}_{\eta ,\rho }(\textbf{x},\lambda )\) is \((\tfrac{{\tilde{C}}\rho }{\eta })\)-smooth by observing that
where \(\rho \ge 1\), \(\eta \le \eta ^u\), and
(b) This has been shown in [38, Th. 3.1]. \(\square \)
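The role of the smoothing parameter can be illustrated with the classical Huber smoothing of \(|x|\): an \(\eta \)-smooth approximation whose gradient is \(\tfrac{1}{\eta }\)-Lipschitz and whose uniform error is at most \(\eta /2\). This is a generic example of a smoothable nonsmooth function; the paper's constraints \(g_i\) need not be of this form:

```python
import numpy as np

def huber(x, eta):
    """Huber smoothing of |x|: (1/eta)-smooth, within eta/2 of |x| uniformly."""
    return np.where(np.abs(x) <= eta, x**2 / (2 * eta), np.abs(x) - eta / 2)

def huber_grad(x, eta):
    # gradient: x/eta in the quadratic region, sign(x) outside
    return np.clip(x / eta, -1.0, 1.0)

eta = 0.1
xs = np.linspace(-2, 2, 4001)

# Uniform approximation error: 0 <= |x| - g_eta(x) <= eta/2
gap = np.abs(xs) - huber(xs, eta)
assert gap.min() >= -1e-12 and gap.max() <= eta / 2 + 1e-12

# Gradient is (1/eta)-Lipschitz: check difference quotients on a fine grid
g = huber_grad(xs, eta)
slopes = np.abs(np.diff(g) / np.diff(xs))
assert slopes.max() <= 1.0 / eta + 1e-6
```

Shrinking \(\eta \) tightens the approximation but inflates the smoothness constant, the trade-off that drives the choice of \(\eta _k\) in the analysis.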
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Zhang, P., Shanbhag, U.V. & Fang, E.X. A Smoothed Augmented Lagrangian Framework for Convex Optimization with Nonsmooth Constraints. J Sci Comput 104, 46 (2025). https://doi.org/10.1007/s10915-025-02934-w