1 Introduction

We consider the nonsmooth convex program, defined as

$$\begin{aligned} \min _{{\textbf {x}}\in {\mathcal {X}}}&\ \left\{ \, f({\textbf {x}}) \, \mid \, g({\textbf {x}})\, \le \, 0 \right\} , \end{aligned}$$
(NSCopt)

where \(f: {\mathcal {X}}\rightarrow \mathbb {R}\) is a real-valued convex function that is possibly nonsmooth (but smoothable), \({\mathcal {X}}\subset \mathbb {R}^{n}\) is a closed and convex set, and \(g(\textbf{x}) = (g_1(\textbf{x}),g_2(\textbf{x}),\ldots ,g_m(\textbf{x}))^\top \), where each \(g_i :{\mathcal {X}}\rightarrow \mathbb {R}, i = 1, 2, \cdots , m,\) is a possibly complicated nonsmooth (but smoothable) convex function. Generally, the presence of such constraints precludes the use of projection-based methods to ensure feasibility of iterates. In deterministic regimes, a host of approaches have been employed for contending with complicated constraints, a subset of which includes sequential quadratic programming [18, 43], interior point methods [8], and augmented Lagrangian (AL) schemes [38, 39]. Of these, AL schemes have proven to be enormously influential in the context of scientific computing [1, 9, 13], and more specifically in nonlinear programming in the form of solvers such as minos [16, 28] and lancelot [10] as well as more refined techniques [15, 17]. There has been significant interest in deriving overall complexity bounds [24, 44] in convex regimes when the Lagrangian subproblem is solved via a first-order method. However, such bounds tend to be poor when constraints are possibly nonsmooth; e.g., standard AL schemes display complexity guarantees of \(\mathcal {O}(\varvec{\varepsilon }^{-5})\) for computing an \(\varvec{\varepsilon }\)-optimal solution in such settings (see Table 1).

Table 1 ALM for deterministic convex optimization

1.1. Related work. Before proceeding, we discuss related prior research. (a) Augmented Lagrangian Methods. The augmented Lagrangian method (ALM) was proposed by Hestenes [19] and Powell [37], with a comprehensive rate analysis subsequently provided by Rockafellar [38]. The ALM framework relies on solving a sequence of unconstrained (or relaxed) problems, requiring the minimization of a suitably defined augmented Lagrangian function \(\mathcal{L}_{\rho }(\textbf{x},\lambda )\) in \(\textbf{x}\), where \(\rho \) and \(\lambda \) denote the penalty parameter and the Lagrange multiplier associated with g, respectively. In high-dimensional settings, the Lagrangian subproblems cannot be solved exactly, leading to the development of variants that allow for inexact resolution of the Lagrangian subproblem. Kang et al. [21] presented an inexact accelerated ALM for strongly convex optimization with linear constraints at a rate of \(\mathcal {O}(1/k^2)\), where k is the iteration counter. Non-ergodic convergence guarantees were provided in [24, 25], where either smoothness of f [24] or a composite structure [25] is assumed. Overall complexity guarantees were first provided by Lan and Monteiro [24], Aybat and Iyengar [4], Necoara et al. [29] and most recently Lu and Zhou [26], where the latter three references allowed for conic settings. In fact, Lu and Zhou [26] showed that in conic convex settings with smooth nonlinear constraints, by introducing a regularization, the overall complexity is improved to \(\mathcal {O}\left( \varvec{\varepsilon }^{-1}\ln (\varvec{\varepsilon }^{-1})\right) \) with a geometrically increasing penalty parameter. Nedelcu et al. [30] considered convex and strongly convex regimes. Notably, Necoara et al. [29] derived an overall complexity of \(\mathcal {O}(\varvec{\varepsilon }^{-\frac{3}{2}})\) and \(\mathcal {O}(\varvec{\varepsilon }^{-1})\) for convex and strongly convex objectives in smooth settings, respectively. More recently, Xu [44] considered nonlinear but smooth regimes in proposing an inexact ALM (under a suitable boundedness requirement) with complexity guarantees of \(\mathcal {O}(\varvec{\varepsilon }^{-1})\) (under convex f) and \(\mathcal {O}(\varvec{\varepsilon }^{-\frac{1}{2}}\log ({\varvec{\varepsilon }^{-1}}))\) (under strongly convex f), respectively. Table 1 compares existing complexity guarantees for AL schemes with our schemes in convex (Sm-AL) and strongly convex (Sm-AL(S)) settings, as well as with standard ALM (N-AL), where \(\tilde{\mathcal {O}}\) suppresses logarithmic terms.

(b) Smoothing techniques. While subgradient methods have proven effective in addressing nonsmooth convex objectives [36], smoothing techniques [6] represent an efficient avenue for a subclass of nonsmooth problems. Moreau [27] introduced the (Moreau)-smoothing \(f_\eta \) of a convex function f, with parameter \(\eta \), defined as

$$\begin{aligned} f_{\eta }(\textbf{x}) \, \triangleq \, \inf _{\textbf{u}}\left\{ f(\textbf{u}) + \tfrac{1}{\eta }\Vert \textbf{u}-\textbf{x}\Vert ^2\right\} . \end{aligned}$$

Nesterov [33] employed a fixed smoothing parameter in developing a smoothing framework for nonsmooth convex optimization problems with a rate of \(\mathcal {O}(\varvec{\varepsilon }^{-1})\), an improvement over \(\mathcal {O}(\varvec{\varepsilon }^{-2})\) attainable by subgradient methods. In related work, Aybat and Iyengar [3] designed a smoothed penalty method for obtaining \(\varvec{\varepsilon }\)-optimal solutions for \(l_1\)-minimization problems with linear equality constraints in \(\tilde{\mathcal {O}}\left( \varvec{\varepsilon }^{-3/2}\right) \) steps. Subsequently, Beck and Teboulle [7] defined an \((\alpha , \beta )\)-smoothing for a nonsmooth convex f satisfying the following two conditions: (i) \(f_{\eta }(\textbf{x}) \, \le \, f(\textbf{x}) \, \le \, f_{\eta }(\textbf{x}) + \eta \beta \) for all \(\textbf{x}\) and (ii) \(\, f_{\eta }\) is \((\alpha /\eta )\)-smooth. For instance, \(f(\textbf{x}) \triangleq \max \{0,\textbf{x}\}\) has a smoothing \(f_{\eta }\), defined as \(f_{\eta }(\textbf{x}) \triangleq \eta \log (1+\exp (\frac{\textbf{x}}{\eta }))-\eta \log 2.\) Analogous approaches have been employed for addressing deterministic [12] and stochastic [20] convex optimization problems.
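To make the \((\alpha , \beta )\)-smoothing concrete, the following Python sketch (our own illustration, not drawn from the cited references) numerically checks the sandwich \(f_{\eta } \le f \le f_{\eta } + \eta \beta \) for the softplus smoothing of \(f(x) = \max \{0,x\}\) quoted above; the constants \(\beta = \log 2\) and \(\alpha = 1/4\) appearing in the comments are our annotations for this particular example.

```python
import numpy as np

# Illustrative check (ours): softplus smoothing of f(x) = max{0, x}.
# We verify f_eta <= f <= f_eta + eta*beta with beta = log 2, and estimate the
# Lipschitz constant of grad f_eta, which should not exceed alpha/eta with alpha = 1/4.

def f(x):
    return np.maximum(0.0, x)

def f_eta(x, eta):
    # eta*log(1 + exp(x/eta)) - eta*log(2), evaluated stably via logaddexp
    return eta * np.logaddexp(0.0, x / eta) - eta * np.log(2.0)

def grad_f_eta(x, eta):
    # derivative of the softplus smoothing: the logistic function sigma(x/eta)
    return 1.0 / (1.0 + np.exp(-x / eta))

eta = 0.1
x = np.linspace(-5.0, 5.0, 20001)
gap = f(x) - f_eta(x, eta)
print("0 <= f - f_eta <= eta*log 2 ?",
      gap.min() >= -1e-12, gap.max() <= eta * np.log(2.0) + 1e-12)

g = grad_f_eta(x, eta)
L_est = np.max(np.abs(np.diff(g)) / np.diff(x))   # crude Lipschitz estimate
print("estimated Lipschitz constant:", L_est, " bound alpha/eta:", 0.25 / eta)
```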

1.2. Applications. We present three applications where nonsmooth convex constraints emerge.

(a) Regression. Lasso regression [40] is a model widely used in variable selection in statistical learning. Assume that the dataset consists of \(\{y_i,X_i\}_{i=1}^{N}\), where \((y_i,X_i)\) denotes the outcome and feature vector for the ith instance. Then an elastic-net model [46] can be articulated as follows, where \(C_1 > 0\).

$$\begin{aligned} \min _{\beta }\, \left\{ \, \Vert y-X\beta \Vert ^2_2 \, \mid \, (1-\alpha )\Vert \beta \Vert _1 + \alpha \Vert \beta \Vert _2 \le C_1 \, \right\} . \end{aligned}$$
(1)

This reduces to standard Lasso [40] when \(\alpha = 0\) and is generalizable to fused Lasso [41] by adding an additional nonsmooth constraint \(\sum _{j = 2}^{p}|\beta _j-\beta _{j-1}| \le C_2\), where \(C_2 > 0\).
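To see how such constraints fit the smoothable template used throughout, the following Python sketch (our own construction; the coordinatewise Huber smoothing of \(\Vert \cdot \Vert _1\) and the smoothing \(\sqrt{\Vert \cdot \Vert ^2+\eta ^2}-\eta \) of \(\Vert \cdot \Vert _2\) are choices of ours, not taken from the paper) evaluates the elastic-net constraint function \(g(\beta ) = (1-\alpha )\Vert \beta \Vert _1 + \alpha \Vert \beta \Vert _2 - C_1\) together with one possible smoothing \(g_{\eta }\), whose gap to g shrinks at the rate \(\mathcal {O}(\eta )\).

```python
import numpy as np

# Illustrative sketch (ours): one possible smoothing of the elastic-net constraint.

def g(beta, alpha, C1):
    return (1 - alpha) * np.sum(np.abs(beta)) + alpha * np.linalg.norm(beta) - C1

def huber(t, eta):
    # smoothing of |t|: quadratic near zero, linear (shifted by eta/2) outside
    return np.where(np.abs(t) <= eta, t**2 / (2 * eta), np.abs(t) - eta / 2)

def g_eta(beta, alpha, C1, eta):
    l1_smooth = np.sum(huber(beta, eta))                      # smooths ||beta||_1
    l2_smooth = np.sqrt(np.dot(beta, beta) + eta**2) - eta    # smooths ||beta||_2
    return (1 - alpha) * l1_smooth + alpha * l2_smooth - C1

beta = np.random.randn(50)
for eta in [1e-1, 1e-2, 1e-3]:
    # nonnegative gap g - g_eta, shrinking like O(eta)
    print(eta, g(beta, 0.5, 1.0) - g_eta(beta, 0.5, 1.0, eta))
```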

(b) Classification. In statistical learning, the Neyman-Pearson (NP) classification [42] is designed to minimize the type II error while maintaining the type I error below a user-specified level \(\alpha \). Consider a labeled training dataset \(\{a_i\}_{i = 1}^N\), where the positive and negative sets are represented by \(\{a_i^{(1)}\}_{i = 1}^{N_{(1)}}\) and \(\{a_i^{(-1)}\}_{i = 1}^{N_{(-1)}}\), respectively. The empirical NP classification problem is given as follows [45]:

$$\begin{aligned} \min _{\textbf{x}} \,\left\{ \, \tfrac{\sum _{i = 1}^{N_{(-1)}}\varvec{\ell }\left( 1, \textbf{x}^\top a_{i}^{(-1)}\right) }{N_{(-1)}} \, \bigg | \, \tfrac{\sum _{i = 1}^{N_{(1)}}\varvec{\ell }\left( -1, \textbf{x}^\top a_{i}^{(1)}\right) }{N_{(1)}}-\alpha \le 0 \, \right\} , \end{aligned}$$

where \(\varvec{\ell }(\bullet )\) denotes the loss function. Choices of the loss function include nonsmooth variants such as mean absolute error (MAE) and hinge loss.

(c) Multiple Kernel learning. Multiple kernel learning (MKL) employs a predefined set of kernels to learn an optimal linear or nonlinear combination of these kernels, defined as follows [22].

$$\begin{aligned} \min _{ w, b, (\theta ,\xi )\ge 0}&\quad \tfrac{1}{2}\sum _{m = 1}^{M}\tfrac{\Vert w_m\Vert _2^2}{\theta _m} + C\Vert \xi \Vert _1 \, \\ \mathop {\mathrm {subject\;to}}\limits&\quad y_i\left( \sum _{m = 1}^{M}w_m'\psi _{m}(\textbf{x}_i) + b\right) \, \ge \, 1-\xi _i, \quad i = 1, \cdots , N \\&\quad \Vert \theta \Vert _{p}^{p}\, \le \, 1, \end{aligned}$$

where \(\psi _m(\bullet ), m = 1, \dots , M,\) are predefined kernel (feature) maps, \(\theta \) is a vector of coefficients for each kernel, \(w = (w_1, \dots , w_M)\) collects the weight vectors of the primal model for learning with multiple kernels, and N denotes the number of training pairs \((\textbf{x}_i, y_i)\).

1.3. Contributions. We present a smoothed AL framework (Sm-AL) where the nonsmooth (but smoothable) objective/constraints are smoothed with a diminishing smoothing parameter \(\eta _k\). Consequently, the AL subproblem (with penalty parameter \(\rho _k\)) is proven to be \(\mathcal {O}(\rho _k/\eta _k)\)-smooth, allowing for (accelerated) computation of an \(\epsilon _k\)-exact solution in finite time. By a careful selection of the sequences \(\{\epsilon _k,\eta _k,\rho _k\}\), we derive rate and complexity guarantees. Our contributions are formalized next.

(i) In Section 2, we derive an ex-ante bound on the optimal multiplier set of the \(\eta \)-smoothed problem. This result, which is of independent interest, allows us to claim that a saddle-point of the \(\eta \)-smoothed problem is an \(\mathcal {O}(\eta )\)-saddle point of the original problem, enabling the derivation of fixed-smoothing schemes.

(ii) In Section 3, we establish a dual suboptimality rate of \(\mathcal {O}(k^{-1})\) and primal infeasibility rate of \(\mathcal {O}(k^{-1/2})\) (constant penalty) while geometric rates of \(\mathcal {O}(1/\rho _k)\) on primal infeasibility and suboptimality are derived under geometrically increasing penalty parameters. In Section 4, by employing an accelerated gradient framework for resolving the \(\eta _k\)-smoothed AL subproblem, the overall complexities of (Sm-AL) in terms of inner projection steps for obtaining an \(\varvec{\varepsilon }\)-optimal solution are proven to be \(\mathcal {O}(\varvec{\varepsilon }^{-(3+\delta )})\) (constant penalty) and \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-3/2})\) (geometrically increasing penalty). Analogous bounds in strongly convex settings are given by \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-(2+\delta )})\) and \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-1})\) for constant and geometrically increasing penalty parameters, respectively. Similar complexity guarantees are available with a fixed smoothing parameter, akin to those developed in [7, 33] for convex programs with nonsmooth objectives.

(iii) We also develop practical termination criteria in Section 2, which when overlaid with our proposed scheme lead to significantly improved empirical complexity in our numerical experiments with little impact on accuracy.

(iv) Preliminary numerical results are provided in Section 5 before concluding in Section 6.

Organization The remainder of the paper is organized as follows. In Section 2, we introduce the smoothed augmented Lagrangian framework, providing the requisite background and the assumptions. Sections 3 and 4 provide the rate and complexity analysis while Section 5 presents a description of our numerical experiments. The paper concludes in Section 6.

Notation. Let \(\Vert \cdot \Vert \) denote the Euclidean norm. Given a closed convex set \(\mathcal {X} \subseteq \mathbb {R}^n\) and \(y\in \mathbb {R}^n\), \(d_{\mathcal {X}}(y)\triangleq {\displaystyle \min _{s\in \mathcal {X}}} \Vert y-s\Vert \), \(d^2_{\mathcal {X}}(y)\triangleq \left( d_{\mathcal {X}}(y)\right) ^2\), and \(\varPi _{\mathcal {X}}(y)\triangleq {\displaystyle \text{ argmin}_{s\in \mathcal{X}}} \Vert y-s\Vert \); hence, \(d_{\mathcal {X}}(y)=\Vert y-\varPi _{\mathcal {X}}(y)\Vert \). Moreover, \(d^2_{\mathcal {X}}(\cdot )\) is differentiable and its gradient is \(\nabla d^2_{\mathcal {X}}(y)=2(y-\varPi _{\mathcal {X}}(y))\). \(d_{-}(u)\) denotes the distance of u to the nonpositive orthant \(\mathbb {R}^n_{-}\), defined as \(d_{-}(u) \, \triangleq \, \Vert u - \varPi _{\mathbb {R}^n_-} [u]\Vert _{2}.\) Further, \(\tilde{\mathcal {O}}(f(n))\) is \(\mathcal {O}(f(n))\) up to a \(\log (n)\) factor. Finally, \(\textbf{1}\) denotes the column of ones in \(\mathbb {R}^n\).
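As a concrete rendering of this notation, the following minimal Python sketch (ours; the Euclidean ball stands in for a generic closed convex set purely for illustration) evaluates \(d_{\mathcal {X}}(y)\), \(\varPi _{\mathcal {X}}(y)\), \(\nabla d^2_{\mathcal {X}}(y) = 2(y-\varPi _{\mathcal {X}}(y))\), and \(d_{-}(u)\).

```python
import numpy as np

# Minimal sketch (ours) of the notation above, with X the Euclidean ball of radius r.

def proj_X(y, r=1.0):
    nrm = np.linalg.norm(y)
    return y if nrm <= r else (r / nrm) * y

def d_X(y, r=1.0):
    return np.linalg.norm(y - proj_X(y, r))

def grad_d2_X(y, r=1.0):
    # gradient of the squared distance: 2*(y - Pi_X(y))
    return 2.0 * (y - proj_X(y, r))

def d_minus(u):
    # distance of u to the nonpositive orthant: ||u - Pi_{R^n_-}[u]|| = ||max(u, 0)||
    return np.linalg.norm(np.maximum(u, 0.0))

y = np.array([2.0, -1.0, 0.5])
print(d_X(y), grad_d2_X(y), d_minus(y))
```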

2 A Smoothed Augmented Lagrangian Framework

In this section, we first provide some background and then analyze the smoothed problem, ending with a relation between a saddle-point of the \(\eta \)-smoothed problem and an \(\eta \)-approximate saddle-point of the original problem.

2.1 Background and Assumptions

Corresponding to problem (NSCopt), we may define the Lagrangian function \(\mathcal {L}_0\) as follows.

$$\begin{aligned} \mathcal {L}_0(\textbf{x},\lambda ) \, \triangleq \, {\left\{ \begin{array}{ll} f(\textbf{x}) + \lambda ^\top g(\textbf{x}), & \lambda \, \ge \, 0 \\ -\infty . & \, \text{ otherwise } \end{array}\right. } \end{aligned}$$

This allows for denoting the set of minimizers of \(\mathcal {L}_0({\bullet },\lambda )\) over the set \(\mathcal{X}\) by \(\mathcal {X}^*(\lambda )\), the dual function by \(\mathcal {D}_0(\lambda )\), and the dual solution set by \(\varLambda ^*\), each of which is defined next.

$$\begin{aligned} \mathcal {X}^*(\lambda ) \, \triangleq \, \arg \min _{\textbf{x}\in \mathcal X} \, \mathcal {L}_0(\textbf{x},\lambda ), \, \mathcal {D}_0(\lambda ) \, \triangleq \, \inf _{\textbf{x}\in \mathcal {X}}\, \mathcal {L}_0(\textbf{x},\lambda ), \text{ and } \varLambda ^* \, \triangleq \, \arg \max _{\lambda \ge 0} \mathcal {D}_0(\lambda ). \end{aligned}$$

By adding a slack variable \(\textbf{v} \in \mathbb {R}^m\), we may recast (NSCopt) as follows.

$$\begin{aligned} \begin{aligned} \min _{\textbf{x}\, \in \, {\mathcal {X}}, \textbf{v} \, \ge \, 0}&\quad f(\textbf{x}) \\ \mathop {\mathrm {subject\;to}}\limits&\quad g(\textbf{x}) + \textbf{v} = 0, \qquad (\lambda ) \end{aligned} \end{aligned}$$

where \(\lambda \in \mathbb {R}^m\) denotes the Lagrange multiplier associated with the constraint \(g(\textbf{x}) + \textbf{v} = 0\). Then the augmented Lagrangian function, denoted by \(\mathcal {L}_{\rho }\), where \(\rho \) denotes the penalty parameter, is defined as follows (cf. [38]).

$$\begin{aligned} \mathcal {L}_{\rho } (\textbf{x},\lambda )&\, \triangleq \, \min _{\textbf{v}\, \ge \, 0} \, \left[ \, f(\textbf{x})+ \lambda ^\top (g(\textbf{x}) + \textbf{v}) + \tfrac{\rho }{2} \left\| \, g(\textbf{x})+ \textbf{v}\right\| ^2\, \right] . \end{aligned}$$
(2)

It has been shown that \((\bar{\textbf{x}},\bar{\lambda })\) is a saddle-point of the augmented Lagrangian \(\mathcal{L}_{\rho }\) for any \(\rho \ge 0\) if and only if \((\bar{\textbf{x}},\bar{\lambda })\) is a saddle-point of \(\mathcal{L}_0\). Further, if \(\bar{\lambda }\) is an optimal dual solution, then \(\bar{\textbf{x}}\) is an optimal solution of (NSCopt) if and only if \(\bar{\textbf{x}}\) minimizes \(\mathcal{L}(\bullet ,\bar{\lambda })\) over \(\mathcal{X}\) [38, Th. 3.5].

If \(d_{-}(u) \, \triangleq \, {\displaystyle \inf _{v \in \mathbb {R}^m_-}} \Vert u-v\Vert \) and \(\varPi _{+}[u]\) denotes the Euclidean projection of u onto \(\mathbb {R}^m_+\), then the AL function \(\mathcal {L}_{\rho }\) and its gradient can be expressed as follows [38, Sec. 2].

Lemma 1

Consider the function \(\mathcal {L}_{\rho }\) for \(\rho > 0\), \(\textbf{x}\in {\mathcal {X}}\) and \(\lambda \ge 0\). Then

$$\begin{aligned} \mathcal {L}_{\rho } ({\textbf {x}},\lambda )&=\left( f({\textbf {x}})+ \tfrac{\rho }{2} \left( d_-\left( \tfrac{\lambda }{\rho } + g({\textbf {x}}) \right) \right) ^2 - \tfrac{1}{2\rho }\Vert \lambda \Vert ^2 \right) , \\ \nabla _{\lambda } \mathcal {L}_{\rho } ({\textbf {x}},\lambda )&= \left( -\tfrac{\lambda }{\rho } + \varPi _{+} \left( \tfrac{\lambda }{\rho } + g({\textbf {x}})\right) \right) , \\ \text { and } \nabla _{{\textbf {x}}} \mathcal {L}_{\rho }({\textbf {x}},\lambda )&= \nabla _{{\textbf {x}}}f({\textbf {x}})+\rho J_g({\textbf {x}}) \left( \tfrac{\lambda }{\rho } + g({\textbf {x}}) - \varPi _{-} \left( \tfrac{\lambda }{\rho } + g({\textbf {x}}) \right) \right) , \end{aligned}$$

where \(J_{g}(\textbf{x})\) is the Jacobian matrix of g. \(\Box \)
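The closed forms in Lemma 1 are straightforward to evaluate. The following Python sketch (a numerical check of ours, on a toy quadratic objective with affine constraints \(g(\textbf{x}) = A\textbf{x}-b\) of our choosing) compares \(\mathcal {L}_{\rho }\) computed from the definition (2), i.e., by carrying out the minimization over \(\textbf{v} \ge 0\) in closed form, against the \(d_{-}\)-based expression, and evaluates \(\nabla _{\lambda }\mathcal {L}_{\rho }\).

```python
import numpy as np

# Numerical check (ours) of the closed forms in Lemma 1 on a toy problem.
rng = np.random.default_rng(0)
n, m, rho = 5, 3, 2.0
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

f = lambda x: 0.5 * np.dot(x, x)      # toy smooth objective
g = lambda x: A @ x - b               # affine constraint map

def L_rho_from_definition(x, lam):
    # the minimizer of (2) over v >= 0 gives w = g(x) + v = max(g(x), -lam/rho)
    w = np.maximum(g(x), -lam / rho)
    return f(x) + lam @ w + 0.5 * rho * np.dot(w, w)

def L_rho_closed_form(x, lam):
    u = lam / rho + g(x)
    d_minus = np.linalg.norm(np.maximum(u, 0.0))      # distance of u to R^m_-
    return f(x) + 0.5 * rho * d_minus**2 - np.dot(lam, lam) / (2 * rho)

def grad_lambda_L_rho(x, lam):
    return -lam / rho + np.maximum(lam / rho + g(x), 0.0)

x, lam = rng.standard_normal(n), np.abs(rng.standard_normal(m))
print(abs(L_rho_from_definition(x, lam) - L_rho_closed_form(x, lam)))   # ~ 0
print(grad_lambda_L_rho(x, lam))
```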

Similarly, the augmented dual function \(\mathcal {D}_{\rho }\), defined as

$$\begin{aligned} \begin{aligned} \mathcal {D}_{\rho }(\lambda )&\, \triangleq \, \inf _{\textbf{x}\in \mathcal {X}} \mathcal {L}_{\rho }(\textbf{x},\lambda ), \end{aligned} \end{aligned}$$
(3)

can be shown to be differentiable [38, Th. 3.2].

Lemma 2

Consider the function \(\mathcal {D}_{\rho }\) defined as (3). Then \(\mathcal {D}_{\rho }\) is a C\(^1\) and concave function over \(\mathbb {R}^m\) and is the Moreau envelope of \(\mathcal {D}_0\), defined as

$$\begin{aligned} \mathcal {D}_{\rho }(\lambda ) = \max _{u \in \mathbb {R}^m} \left[ \mathcal {D}_0(u) - \tfrac{1}{2\rho }\Vert u-\lambda \Vert ^2\right] \text{ and } \nabla _{\lambda } \mathcal {D}_{\rho }{(\lambda )}\, \triangleq \, \tfrac{1}{\rho }\left( q_{\rho }(\lambda ) - \lambda \right) , \end{aligned}$$

where \(q_{\rho }(\lambda ) \, \triangleq \, \arg {\displaystyle \max _{u}} \left[ \mathcal {D}_0(u) - \tfrac{1}{2\rho }\Vert u-\lambda \Vert ^2\right] .\) \(\Box \)

Since \(\mathcal{D}_{\rho }\) is the Moreau envelope of \(\mathcal{D}_0\), \(\mathcal{D}_{\rho }\) has the same set of maximizers as \(\mathcal{D}_0\) for any \(\rho \ge 0\) [38, Th. 3.2]. Our interest lies in nonsmooth, albeit smoothable, convex functions, defined next [7].

Definition 1

A closed, proper, and convex function \(h: \mathbb {R}^n \rightarrow \mathbb {R}\) is \((\alpha ,\beta )\) smoothable if for any \(\eta > 0\), there exists a convex differentiable function \(h_{\eta }\) such that

$$\begin{aligned} \left\| \, \nabla _{\textbf{x}} h_{\eta }(\textbf{x}_1) - \nabla _{\textbf{x}} h_{\eta }(\textbf{x}_2) \, \right\|&\, \le \, \tfrac{\alpha }{\eta } \Vert \textbf{x}_1-\textbf{x}_2\Vert , \quad \forall \textbf{x}_1, \textbf{x}_2 \in \mathbb {R}^n, \\ h_{\eta }(\textbf{x}) \, \le \, h(\textbf{x})&\, \le \, h_{\eta }(\textbf{x}) + \eta \beta , \qquad \forall \textbf{x}\in \mathbb {R}^n. \end{aligned}$$

\(\Box \)

In fact, one may be faced with compositional convex constraints in which the layers may be nonsmooth. In such instances, under suitable conditions, smoothability of the layers implies smoothability of the compositional function, but we postpone such avenues for future work. We leverage the smoothability assumptions in [7] to state our basic assumptions on the objective and constraint functions. In addition, we impose both compactness requirements on \(\mathcal {X}\) as well as a Slater regularity condition. Before stating the required assumptions, we define the \(\epsilon \)-KKT conditions of (NSCopt), which are inspired by the classical KKT conditions.

Definition 2

(\(\epsilon \)-optimal solution) Let \(f^*\) be the optimal value of (NSCopt). Given \(\epsilon \ge 0\), a point \(\tilde{\textbf{x}}\, \in \, \mathcal {X}\) is called an \(\epsilon \)-optimal and \(\epsilon \)-feasible solution to (NSCopt) if

$$\begin{aligned} \, f(\tilde{\textbf{x}})-f^* \, \le \, \epsilon \text{ and } d_{-}\left( g(\tilde{\textbf{x}})\right) \, \le \, \epsilon , \quad \text{ respectively }. \end{aligned}$$
(4)

\(\Box \)

Then the partial KKT conditions corresponding to relaxing the constraint \(g(\textbf{x}) \, \le \, 0\) are defined as follows, where \(\mathcal {L}(\bullet ,\bullet )\) denotes the Lagrangian function and \(\mathcal{N}_\mathcal{X}(x)\) denotes the normal cone of \(\mathcal{X}\) at x.

$$\begin{aligned} 0 \,&\in \, \nabla _{\textbf{x}} \mathcal{L}(\textbf{x}, \lambda ) + \mathcal{N}_\mathcal{X}(\textbf{x}) \end{aligned}$$
(5)
$$\begin{aligned} 0 \,&\le \, \lambda \, \perp \, g(\textbf{x}) \, \le \, 0. \end{aligned}$$
(6)

Consider an optimization problem of the form

$$\begin{aligned} \begin{aligned} \min _{x}&\ f(\textbf{x}) \\ \text{ subject } \text{ to }&\ g(\textbf{x}) \, \le \, 0, \end{aligned} \end{aligned}$$
(C-Opt)

where \(f, g_{i}\) are smooth functions mapping from \(\mathbb {R}^n\) to \(\mathbb {R}\) for \(i = 1, \cdots , m\). Recall that, under a suitable regularity condition, if \(x^*\) is a local minimizer of (C-Opt), then there exists \(\lambda \in \mathbb {R}^m_+\) such that

$$\begin{aligned} \nabla f(\textbf{x}) + \sum _{i=1}^m \lambda _i \nabla g_i(\textbf{x})&= 0 \end{aligned}$$
(7)
$$\begin{aligned} \lambda _i g_i(\textbf{x})&= 0, \quad i = 1, \cdots , m \end{aligned}$$
(8)
$$\begin{aligned} g(\textbf{x})&\, \le 0. \end{aligned}$$
(9)

In fact, (8)–(9), together with \(\lambda \ge 0\), can be compactly stated as

$$\begin{aligned} \lambda \ge 0, \quad \lambda _i g_i(\textbf{x}) = 0, \forall i, \quad g(\textbf{x}) \, \le \, 0. \end{aligned}$$

By leveraging the “perp” notation, we have that \(\lambda \perp g(\textbf{x})\), i.e., \(\lambda _i g_i(\textbf{x}) = 0\) for all i. Therefore, we may compactly represent the KKT conditions as

$$\begin{aligned}&\nabla f(\textbf{x}) + \sum _{i=1}^m \lambda _i \nabla g_i(\textbf{x}) = 0 \end{aligned}$$
(10)
$$\begin{aligned}&0 \le \lambda \, \perp \, g(\textbf{x}) \, \le 0. \end{aligned}$$
(11)

Note that such a notation is common in complementarity theory (see Cottle, Pang, and Stone [11] or Facchinei and Pang [14]). This allows us to define a (partial) \(\epsilon \)-KKT point.

Definition 3

(Partial \(\epsilon \)-KKT condition) Consider the problem (NSCopt). Then (\(\textbf{x}_{\epsilon },\lambda _{\epsilon }\)) is a partial \(\epsilon \)-KKT point if \(\textbf{x}_{\epsilon } \in \mathcal{X}\),

$$\begin{aligned} \mathcal{L}(\textbf{x}_{\epsilon },\lambda _{\epsilon })&\, \le \, \mathcal{L}(\textbf{x}^*,\lambda _{\epsilon }) +\epsilon , \end{aligned}$$
(12)
$$\begin{aligned} 0 \, \le \, \lambda _{\epsilon },&\quad g(\textbf{x}_{\epsilon }) \, \le \, \epsilon \textbf{1}, \text{ and } \lambda _{\epsilon }^\top g(\textbf{x}_{\epsilon }) \, \ge \, -\epsilon , \end{aligned}$$
(13)

where \((\textbf{x}^*,\lambda ^*)\) denotes a KKT point of (NSCopt) satisfying (5)–(6). \(\square \)

This allows us to build a simple relation whereby a partial \(\epsilon \)-KKT point satisfies \(2\epsilon \)-suboptimality and \(m\epsilon \)-infeasibility.

Lemma 3

Consider a tuple \((\textbf{x}_{\epsilon },\lambda _{\epsilon })\) satisfying the partial \(\epsilon \)-KKT conditions given by (12)–(13). Then \((\textbf{x}_{\epsilon },\lambda _{\epsilon })\) satisfies \({2\epsilon }\)-suboptimality and \({m \epsilon }\)-infeasibility, in the sense of (4).

Proof

We observe that \(2\epsilon \)-primal suboptimality (in the sense of (4)) holds by the following sequence of relations.

$$\begin{aligned} f(\textbf{x}_{\epsilon }) -\epsilon&\overset{(13)}{\le } f(\textbf{x}_{\epsilon }) +\lambda _{\epsilon }^\top g(\textbf{x}_{\epsilon }) = \mathcal{L}(\textbf{x}_{\epsilon },\lambda _{\epsilon }) \\&\overset{(12)}{\le } \mathcal{L}(\textbf{x}^*,\lambda _{\epsilon }) + \epsilon = f(\textbf{x}^*)+\underbrace{\lambda _{\epsilon }^\top g(\textbf{x}^*)}_{\, \le \, 0} + \epsilon \\&\le f(\textbf{x}^*)+ \epsilon \\ \implies f(\textbf{x}_{\epsilon })&\le f(\textbf{x}^*) + 2\epsilon . \end{aligned}$$

To show \(m\epsilon \)-feasibility of \(\textbf{x}_{\epsilon }\) (in the sense of (4)), we observe that

$$d_{-}\left( g(\textbf{x}_{\epsilon })\right) \le {\displaystyle \sum _{i=1}^m} \max \{g_i(\textbf{x}_{\epsilon }), 0\} \le m\epsilon ,$$

which completes the proof. \(\square \)

We now present our ground assumption on the problem of interest, which is assumed to hold throughout the paper unless explicitly mentioned otherwise.

[Assumption 1 (ground assumption): f and each \(g_i\), \(i = 1, \cdots , m\), are \((\alpha ,\beta )\)-smoothable in the sense of Definition 1; \(\mathcal {X}\) is compact and convex; and (d) a Slater condition holds, i.e., there exists \(\bar{\textbf{x}} \in \mathcal {X}\) such that \(g(\bar{\textbf{x}}) < 0\).]

Condition (d) allows for bounding the set of optimal dual variables (cf. [23]). We now consider the smoothed counterpart of (NSCopt), defined as

$$\begin{aligned} \min _{{\textbf {x}}\in {\mathcal {X}}}&\ \left\{ \, f_{\eta }({\textbf {x}}) \, \mid \, g_{\eta }({\textbf {x}})\, \le \, 0 \right\} . \end{aligned}$$
(NSCopt\(_{\eta }\))

We note that the solution and multiplier set of (NSCopt\(_{\eta })\) are denoted by \(X^*_{\eta }\) and \(\varLambda ^*_{\eta }\), respectively. Naturally, associated with this problem is the Lagrangian function \(\mathcal{L}_{\eta ,0}\) of the smoothed problem (referred to as the smoothed Lagrangian) as well as the corresponding dual function \(\mathcal{D}_{\eta ,0}\); these objects and their augmented counterparts are defined and analyzed in the next subsection.

2.2 Analysis of Smoothed Lagrangians

We now analyze the smoothed Lagrangian framework where f and g are approximated by smoothings \(f_{\eta }\) and \(g_{\eta }\), where the latter is a vector function with components \(g_{1,\eta }, \cdots , g_{m,\eta }\). The resulting smoothed Lagrangian function \(\mathcal {L}_{\eta ,0}\) and the smoothed dual function \(\mathcal {D}_{\eta ,0}(\lambda )\) are defined as

$$\begin{aligned} \mathcal {L}_{\eta ,0}(\textbf{x},\lambda )&\triangleq \left. {\left\{ \begin{array}{ll} f_{\eta }(\textbf{x}) + \lambda ^\top g_{{\eta }}(\textbf{x}), & \lambda \ge 0 \\ -\infty , & \text{ otherwise } \end{array}\right. } \right\} \text{ and } \mathcal {D}_{\eta ,0}(\lambda ) \triangleq \inf _{\textbf{x}\in \mathcal {X}} \mathcal {L}_{\eta ,0}(\textbf{x},\lambda ). \end{aligned}$$

Then the smoothed augmented Lagrangian function \(\mathcal {L}_{\eta , \rho }\) is defined as

$$\begin{aligned} \mathcal {L}_{\eta ,\rho } (\textbf{x},\lambda )&\triangleq \min _{\textbf{v} \ge 0} \left[ \, f_{\eta }(\textbf{x}) + \lambda ^\top (g_{{\eta }}(\textbf{x}) + \textbf{v}) + \tfrac{\rho }{2} \Vert g_{{\eta }}(\textbf{x}) + \textbf{v}\Vert ^2\, \right] \\&= f_{\eta }(\textbf{x})+ \tfrac{\rho }{2} \left( d_-\left( \tfrac{\lambda }{\rho } + g_{\eta }(\textbf{x}) \right) \right) ^2 - \tfrac{1}{2\rho }\Vert \lambda \Vert ^2. \end{aligned}$$

We may now define \(\mathcal {D}_{\eta ,\rho }\) and \(q_{\eta ,\rho }\) as \(\mathcal {D}_{\eta ,\rho }(\lambda ) \, = \, \max _u [\, \mathcal {D}_{\eta ,0}(u) - \tfrac{1}{2\rho }\Vert u-\lambda \Vert ^2\,]\) and \(\nabla _{\lambda } \mathcal {D}_{\eta ,\rho }(\lambda ) \, = \, \tfrac{1}{\rho }\left( q_{\eta ,\rho }(\lambda ) - \lambda \right) \), where \(q_{\eta , \rho }(\lambda ) \triangleq \textrm{arg}\hspace{-0.02in}\max _{u} [\, \mathcal {D}_{\eta ,0}(u) - \tfrac{1}{2\rho }\Vert u-\lambda \Vert ^2\, ].\) We now relate \(\mathcal {D}_{\rho }\) to \(\mathcal {D}_{\eta ,\rho }\) and \(q_{\rho }\) to \(q_{\eta ,\rho }\) in the next lemma.

Lemma 4

For any \(\lambda \in \mathbb {R}_{+}^m\), the following hold:

(i) \(\left| \mathcal {L}_{0}(\textbf{x},\lambda ) - \mathcal {L}_{\eta ,0}(\textbf{x},\lambda ) \right| \, \le \, \eta (\Vert \lambda \Vert m+1) {\beta };\)

(ii) \(| \mathcal {D}_{\eta ,0}(\lambda ) - \mathcal {D}_{0}(\lambda )| \le \eta (\Vert \lambda \Vert m+1) {\beta } ;\)

(iii)\(| \mathcal {D}_{\eta ,\rho }(\lambda ) - \mathcal {D}_{\rho }(\lambda )| \le \eta (\Vert \lambda \Vert m+1) {\beta }.\) \(\Box \)

Under a Slater regularity condition, the set of optimal multipliers is bounded (cf. [23]). Similar bounds are derived for the \(\eta \)-smoothed problem.

[Proposition 1 (bounds on the optimal multiplier sets): under Assumption 1, (a) the Slater point \(\bar{\textbf{x}}\) of (NSCopt) is also a Slater point of (NSCopt\(_{\eta }\)); (b) \(\varLambda ^*\) admits the Slater-based bound displayed in the proof; (c) \(\varLambda _{\eta }^* \subseteq B_{\lambda ,\eta } \triangleq \{\lambda \ge 0 \mid \sum _{i=1}^{m}\lambda _i \le b_{\lambda ,\eta }\}\).]

Proof

(a) By Assumption 1(d), there exists a vector \(\bar{\textbf{x}} \in \mathcal{X}\) such that \(g(\bar{\textbf{x}}) < 0\), implying that \(g_{\eta }(\bar{\textbf{x}}) < 0\) by the property of smoothability (Def. 1).

(b) By the Slater regularity condition, we directly conclude from [23] that

$$\begin{aligned} \varLambda ^*\,\subseteq \, \left\{ \,\lambda \ge 0\,\bigg |\, \sum _{i = 1}^{m}\lambda _i\,\le \, \tfrac{f(\bar{{\textbf {x}}}) - \mathcal {D}_{0}^*}{\min _{j}\{-g_j(\bar{{\textbf {x}}})\}}\right\} , \text { where } \mathcal {D}_0^* = f^*. \end{aligned}$$

(c) Similarly, \(\varLambda _{\eta }^*\), the dual optimal solution set, is bounded as follows.

$$\begin{aligned} \varLambda _{\eta }^* \, \subseteq \, \left\{ \, \lambda \ge 0 \, \mid \, \sum _{i=1}^m \lambda _i \, \le \, \tfrac{f_{\eta }(\bar{\textbf{x}}) - \mathcal{D}_{0,\eta }^*}{\min _{j} \{ - g_{j,\eta }({\bar{\textbf{x}}})\}} \, \right\} \, \subseteq \, \left\{ \, \lambda \ge 0 \, \mid \, \sum _{i=1}^m \lambda _i \, \le \, \tfrac{f(\bar{\textbf{x}}) - \mathcal{D}_{0,\eta }^*}{\min _{j} \{ - g_{j,\eta }({\bar{\textbf{x}}})\}} \, \right\} . \end{aligned}$$

Recall that \(-g_{j,\eta }({\bar{\textbf{x}}}) \ge -g_j({\bar{\textbf{x}}})\) for \(j = 1, \cdots , m\). Furthermore, \({\displaystyle \min _j}\{ -g_{j,\eta }({\bar{\textbf{x}}})\} \ge {\displaystyle \min _j} \{-g_{j}({\bar{\textbf{x}}})\}\). It follows from (b) that

$$ -\mathcal{D}_{0,\eta }(\lambda _{\eta }^*) \, \overset{\tiny \text{(Optimality } \text{ of } \lambda _{\eta }^*)}{\le } \, -\mathcal{D}_{0,\eta }(\lambda ^*) \, \overset{\tiny \text{(Lemma } 4\hbox {(ii))}}{\le } \, -\mathcal{D}_0(\lambda ^*) + \eta (mb_{\lambda }+1)\beta .$$

Consequently, if \(\mathcal{D}_{0,\eta }^* \triangleq \mathcal{D}_{0,\eta }(\lambda _{\eta }^*)\), \(\mathcal{D}_0^* \triangleq \mathcal{D}_{0}(\lambda ^*)\), then

$$\begin{aligned} \varLambda _{\eta }^*&\, \subseteq \, \left\{ \, \lambda \ge 0 \, \mid \, \sum _{i=1}^m \lambda _i \, \le \, \tfrac{f(\bar{\textbf{x}}) - \mathcal{D}_{0,\eta }^*}{\min _{j} \{ - g_{j,\eta }({\bar{\textbf{x}}})\}} \, \right\} \\&\, \subseteq \, \left\{ \, \lambda \ge 0 \, \mid \, \sum _{i=1}^m \lambda _i \, \le \, \tfrac{f(\bar{\textbf{x}}) - \mathcal{D}_{0,\eta }^*}{\min _{j} \{ - g_{j}({\bar{\textbf{x}}})\}} \, \right\} \\&\, \subseteq \, \left\{ \, \lambda \ge 0 \, \mid \, \sum _{i=1}^m \lambda _i \, \le \, \tfrac{f(\bar{\textbf{x}}) - \mathcal{D}_{0}^* + \eta {(mb_{\lambda }+1)\beta }}{\min _{j} \{ - g_{j}({\bar{\textbf{x}}})\}} \, \right\} \\&\,\subseteq \, B_{\lambda ,\eta } \,\triangleq \, \left\{ \lambda \ge 0\,|\, \sum _{i = 1}^{m} \lambda _i \le b_{\lambda ,\eta }\right\} . \end{aligned}$$

\(\square \)
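As a worked instance of the bound in (b), consider the toy problem \(\min \{x_1+x_2 \mid 1-x_1 \le 0, \, 1-x_2 \le 0\}\) (an example of ours, in which \(\mathcal {X} = \mathbb {R}^2\) and the compactness requirement is ignored purely for illustration); the optimal value is \(f^* = 2\) and the unique optimal multiplier is \(\lambda ^* = (1,1)\), so the Slater-based bound computed below is tight.

```python
import numpy as np

# Toy evaluation (ours) of the Slater-based multiplier bound
#   sum_i lambda_i <= (f(x_bar) - D_0^*) / min_j{ -g_j(x_bar) },   with D_0^* = f^*.
x_bar = np.array([2.0, 2.0])      # Slater point: g(x_bar) = 1 - x_bar = (-1, -1) < 0
f_bar = x_bar.sum()               # f(x_bar) = 4
f_star = 2.0                      # optimal value, attained at (1, 1)
g_bar = 1.0 - x_bar
b_lambda = (f_bar - f_star) / np.min(-g_bar)
print("bound on the sum of multipliers:", b_lambda)   # = 2, matching lambda* = (1, 1)
```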

Both Lemma 4 and Proposition 1 play crucial roles in the convergence analysis presented in Section 3. We now relate a saddle-point \((\textbf{x}^*_{\eta },\lambda _{\eta }^*)\) of (NSCopt\(_{\eta }\)) to an \(\eta \)-saddle-point of (NSCopt), where an \(\eta \)-saddle point satisfies the saddle-point requirements with an \(\mathcal {O}(\eta )\) error, and where the bounds on the multipliers for (NSCopt) and (NSCopt\(_{\eta }\)) are denoted by \({b}_{\lambda }\) and \({b}_{\lambda ,\eta }\), respectively.

[Proposition (relating saddle-points): if \((\textbf{x}^*_{\eta },\lambda ^*_{\eta })\) is a saddle-point of the smoothed Lagrangian \(\mathcal {L}_{\eta ,0}\), then (a) \(d_{-}(g(\textbf{x}^*_{\eta })) \le \eta \beta \Vert \textbf{1}\Vert \) and (b) \((\textbf{x}^*_{\eta },\lambda ^*_{\eta })\) is an \(\mathcal {O}(\eta )\)-saddle-point of \(\mathcal {L}_{0}\).]

Proof

(a) Suppose \(\textbf{x}_{\eta }^* \, \in \, \mathcal{X}\) is a feasible solution of (NSCopt\(_{\eta }\)). Then \(g_{\eta }(\textbf{x}^*_{\eta }) \le 0\). Furthermore, \(g(\textbf{x}_{\eta }^*) \le g_{\eta }(\textbf{x}_{\eta }^*) + \eta \beta \textbf{1} \le \eta \beta \textbf{1}\), implying that \(d_{-}(g(\textbf{x}_{\eta }^*)) \le \eta \beta \Vert \textbf{1}\Vert .\)

(b) The dual optimal set \(\varLambda _{\eta }^*\) is nonempty and bounded as per Proposition 1. Let \((\textbf{x}_{\eta }^*,\lambda _{\eta }^*)\) be a saddle point of \(\mathcal {L}_{\eta ,0}(\cdot ,\cdot )\). We now proceed to show that \((\textbf{x}_{\eta }^*,\lambda _{\eta }^*)\) is an approximate saddle-point of \(\mathcal{L}_0\).

$$\begin{aligned} {\mathcal {L}_{{0}}}&({\textbf {x}}_{\eta }^*,\lambda _{\eta }^*) = f({\textbf {x}}_{\eta }^*) + (\lambda ^*_{\eta })^\top g({\textbf {x}}_{\eta }^*) \le f_{\eta }({\textbf {x}}_{\eta }^*) +\eta {\beta }+ (\lambda ^*_{\eta })^\top g_{\eta }({\textbf {x}}_{\eta }^*) + \eta b_{\lambda ,\eta }{\beta }\Vert {\textbf {1}}\Vert \\ &= {{\mathcal {L}_{0,\eta }}({\textbf {x}}_{\eta }^*,\lambda ^*_{\eta }) + \eta \beta (1+b_{\lambda ,\eta } m)} \le {\mathcal {L}_{0,\eta }}({\textbf {x}},\lambda ^*_{\eta }) + \eta \beta (1+b_{\lambda ,\eta } m) \text { for } \text { all } {\textbf {x}}\in \mathcal {X} \\ &\overset{\tiny {-(\lambda ^*_{\eta })^\top g({\textbf {x}}) \le 0}}{=}\mathcal {L}_{{0}}({\textbf {x}},\lambda ^*_{\eta }) + f_{\eta }({\textbf {x}}) - f({\textbf {x}}) + (\lambda ^*_{\eta })^\top (g_{\eta }({\textbf {x}})-g({\textbf {x}})) + \eta \beta (1+b_{\lambda ,\eta } m)\\ &\le {\mathcal {L}_{{0}}({\textbf {x}},\lambda ^*_{\eta }) + \eta \beta (1+b_{\lambda ,\eta } m)} \text { for } \text { all } {\textbf {x}}\in \mathcal {X}. \end{aligned}$$

The final result follows from the following sequence of inequalities.

$$\begin{aligned} \mathcal {L}_{{0}}({\textbf {x}}_{\eta }^*,\lambda _{\eta }^*)&= f({\textbf {x}}_{\eta }^*) + (\lambda ^*_{\eta })^\top g({\textbf {x}}_{\eta }^*) \ge f_\eta ({\textbf {x}}_{\eta }^*) + (\lambda _{\eta }^*)^\top \left( g_{\eta }({\textbf {x}}_{\eta }^*)\right) \\ &= \mathcal {L}_{0,\eta }({\textbf {x}}_{\eta }^*,\lambda _{\eta }^*) \ge \mathcal {L}_{0,\eta }({\textbf {x}}_{\eta }^*,\lambda ) \quad \text{ for } \text{ all } \lambda \in \mathbb {R}^m_+ \\ &= \mathcal {L}_{{0}}({\textbf {x}}_{\eta }^*,\lambda ) + f_{\eta }({\textbf {x}}_{\eta }^*) - f({\textbf {x}}_{\eta }^*) + \lambda ^\top \left( g_{\eta }({\textbf {x}}_{\eta }^*) - g({\textbf {x}}_{\eta }^*)\right) \quad \\ &\ge \mathcal {L}_{{0}}({\textbf {x}}_{\eta }^*,\lambda ) - \eta \beta (1 + m \Vert \lambda \Vert ) \\ &\ge \mathcal {L}_{{0}}({\textbf {x}}_{\eta }^*,\lambda ) - \eta \beta \big (1 + m \max \{b_{\lambda ,{\eta }}, \Vert \lambda \Vert \}\big ) \quad \forall \lambda \in \mathbb {R}^m_+. \end{aligned}$$

\(\square \)

The following Lemma 5 shows the relation between \(q_{\eta ,\rho }(\bullet )\) and \(q_{\rho }(\bullet )\).

Lemma 5

For any \(\lambda \in \mathbb {R}_{+}^m\), the following hold:

(i) \(\Vert q_{\eta ,\rho }(\lambda ) - q_{\rho }(\lambda )\Vert \le \sqrt{{4}\rho \eta (\Vert \lambda \Vert m+C_m) {\beta }};\)

(ii) \(\Vert \nabla _{\lambda } \mathcal {D}_{\eta ,\rho }(\lambda )-\nabla _{\lambda } \mathcal {D}_{\rho }(\lambda )\Vert = \tfrac{1}{\rho }\Vert q_{\eta ,\rho }(\lambda ) - q_{\rho }(\lambda )\Vert \le \sqrt{\tfrac{{4} \eta (\Vert \lambda \Vert m+C_m) {\beta } }{\rho }}.\) \(\Box \)

We now formally state the smoothed AL scheme. The traditional ALM relies on solving the subproblem exactly or \(\epsilon _k\)-inexactly at epoch k. However, in regimes with nonsmooth constraints, the AL subproblem is nonsmooth, precluding the use of accelerated gradient methods and leading to far poorer performance. Our proposed scheme solves a sequence of \(\eta _k\)-smoothed AL subproblems, each to within an error tolerance of \(\epsilon _k \eta _k^b\), where \(b\ge 0\). A formal statement of the scheme is provided next.

[Algorithm (Sm-AL): given \(\textbf{x}_0 \in \mathcal {X}\), \(\lambda _0 \ge 0\), and sequences \(\{\epsilon _k, \eta _k, \rho _k\}\), at epoch k: [1] compute \(\textbf{x}_{k+1}\), an \(\epsilon _k\eta _k^b\)-minimizer of \(\mathcal {L}_{\eta _k,\rho _k}(\bullet ,\lambda _k)\) over \(\mathcal {X}\); [2] update the multiplier \(\lambda _{k+1}\) via a dual step on the smoothed AL (see Lemma 6 and (14)).]

Observe that step [1] requires that \(\textbf{x}_{k+1}\) is an \(\epsilon _k \eta _k^b\)-minimizer of the AL subproblem, given by

$$\begin{aligned} \min _{\textbf{x} \in \mathcal {X}} \ \mathcal{L}_{\eta _k,\rho _k}(\textbf{x},\lambda _k), \end{aligned}$$

where \(\mathcal{D}_{\eta _k,\rho _k}(\lambda _k) = \min _{\textbf{x} \in \mathcal {X}} \mathcal{L}_{\eta _k,\rho _k}(\textbf{x},\lambda _k).\) Since we have rate guarantees for the accelerated scheme applied to the subproblem, we can determine the minimum number of gradient steps that ensures that \(\epsilon _k \eta _k^b\)-suboptimality holds. The Lagrange multiplier update can be expressed as follows (cf. [2]).

Lemma 6

Consider the smoothed augmented Lagrangian scheme (Sm-AL). Then for any \(k > 0\), step [2] is equivalent to the following equation.

$$\begin{aligned} \lambda _{k+1} = \varPi _+\left[ \, {\lambda _k}+{\rho _k}g_{\eta _k}(\textbf{x}_{k+1})\, \right] . \end{aligned}$$
(14)
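To fix ideas, the following Python sketch gives a minimal rendition of (Sm-AL) under simplifying assumptions: the inner solver is a generic FISTA-type accelerated projected-gradient method (standing in for the scheme of Section 4), the smoothness estimates \(L_k\) and inner step counts \(M_k\) are supplied by the caller rather than derived from the rate guarantees, and `smooth_problem`, `proj_X`, and the parameter sequences are user-supplied placeholders. Step [2] is implemented directly as the update (14).

```python
import numpy as np

def fista(grad_h, proj_X, x0, L, steps):
    """Accelerated projected gradient for min_{x in X} h(x), h convex and L-smooth."""
    x_prev, y, t = x0, x0.copy(), 1.0
    for _ in range(steps):
        x = proj_X(y - grad_h(y) / L)                        # projected gradient step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)          # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev

def sm_al(x0, lam0, proj_X, smooth_problem, rho_seq, eta_seq, L_seq, M_seq):
    """Sketch of the (Sm-AL) outer loop; smooth_problem(eta) must return
    (grad_f_eta, g_eta, jac_g_eta) for the eta-smoothed problem data."""
    x, lam = x0, lam0
    for rho, eta, L, M in zip(rho_seq, eta_seq, L_seq, M_seq):
        grad_f_eta, g_eta, jac_g_eta = smooth_problem(eta)

        def grad_L(z):
            # grad_x of L_{eta,rho}(., lam); cf. Lemma 1 with f, g replaced by f_eta, g_eta
            return grad_f_eta(z) + jac_g_eta(z).T @ np.maximum(lam + rho * g_eta(z), 0.0)

        x = fista(grad_L, proj_X, x, L, M)                   # step [1], eps_k*eta_k^b-inexact
        lam = np.maximum(lam + rho * g_eta(x), 0.0)          # step [2], i.e., update (14)
    return x, lam
```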

The next assumption holds for parameter sequences employed in (Sm-AL). Unless mentioned otherwise, Assumptions 1 and 2 hold throughout.

[Assumption 2: requirements on the parameter sequences \(\{\epsilon _k, \eta _k, \rho _k\}\) employed in (Sm-AL), ensuring in particular that the summations invoked in Lemmas 8 and 9 are finite.]

While our rate guarantees for the schemes responsible for resolving the subproblem as well as the outer (dual) problem allow for defining precise lower bounds on the number of steps required, this computational requirement relies on a worst-case analysis. In addition, we may attempt to check whether the sub-optimality requirement is met at some intermediate step. However, it is not obvious how to check sub-optimality in the current setting, since the optimal value corresponding to either the subproblem or the outer-level problem is unavailable. Instead, we appeal to a residual function and consider such an approach next. We emphasize that such a potential early termination of either the subproblem solver or the outer scheme may have computational benefits.

2.3 Termination Criteria

Our inexact augmented Lagrangian framework relies on utilizing inexact solutions to the Lagrangian subproblem, obtained by taking a finite but increasing number of gradient-based steps and leveraging the rate guarantees for accelerated gradient methods. However, we may well meet the required accuracy prior to taking the prescribed number of gradient steps by checking a suitable condition. Such a condition is by no means immediate, since a naive assessment of accuracy requires knowing the optimal value of the subproblem; instead, we leverage a residual function and present the resulting analysis next for both the inner and outer loops.

(I). Termination criterion for inner loop. The inner loop at iteration k terminates when \(\textbf{x}_{k+1}\) satisfies the following \(\epsilon _k \eta _k^b\)-optimality requirement, where \(\epsilon _k\) is a positive accuracy threshold at iteration k, \(\eta _k\) is the smoothing parameter at iteration k, and b is a nonnegative scalar that is defined subsequently in the complexity analysis.

$$\begin{aligned} \mathcal{L}_{\eta _k,\rho _k}(\textbf{x}_{k+1},\lambda _k) - \mathcal{D}_{\eta _k,\rho _k}(\lambda _k) \, \le \, \epsilon _k \eta _k^b. \end{aligned}$$
(15)

In effect, we may view the minimization of the augmented Lagrangian function as the following convex problem, defined as

$$\begin{aligned} \min _{\textbf{x}\, \in \, \mathcal{X}} \ h(\textbf{x}) \triangleq \mathcal{L}_{\eta _k,\rho _k}(\textbf{x},\lambda _k), \end{aligned}$$
(Opt)

where h is a convex and smooth function on \(\mathcal{X}\), a closed and convex set. We proceed to show that (15) is equivalent to \(\textbf{x}_{k+1}\) approximately satisfying the following variational inequality condition.

$$\begin{aligned} \nabla _{\textbf{x}} \mathcal{L}_{\eta _k,\rho _k}(\textbf{x}_{k+1},\lambda _k)^\top (\textbf{y} - \textbf{x}_{k+1}) \ge -\epsilon _k \eta _k^b \qquad \forall \textbf{y} \, \in \, \mathcal {X}. \end{aligned}$$
(16)

In fact, we now develop a verifiable condition whose satisfaction implies (16).

Lemma 7

Consider the problem (Opt). Suppose \(\Vert \textbf{y}\Vert ^2 \le C\) and \(\Vert \nabla h(\textbf{y})\Vert ^2 \le D\) for any \(\textbf{y}\in \mathcal {X}\) and \(\gamma \) is any positive scalar. Consider the following statements.

(a) \(\textbf{x}^*_{\epsilon }\) is an \(\epsilon \)-optimal solution of (Opt).

(b) \(\nabla h(\textbf{x}^*_{\epsilon })^\top (\textbf{y}-\textbf{x}^*_{\epsilon }) \, \ge \,- \epsilon , \quad \forall \, \textbf{y}\, \in \, \mathcal {X}.\)

(c) There exist \(\textbf{u}\in \mathcal{X}\) and \(\textbf{x}^*_{\epsilon } \in \mathcal{X}\) such that \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X} (\textbf{x}^*_{\epsilon },\textbf{u}) = 0\), where \(F^{{{{\textrm{nat}}}},\tilde{\epsilon }}_\mathcal{X}(\bullet ,\bullet )\) represents the perturbed natural map with a chosen parameter \(\gamma \), defined as

$$\begin{aligned} F^{{{{\textrm{nat}}}},\tilde{\epsilon }}_\mathcal{X}(\textbf{x},\textbf{u})\, \triangleq \, \tilde{\epsilon } \left( \, \textbf{u}- \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] \,\right) - \textbf{x}+ \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] . \end{aligned}$$

Then the following hold.

(i) \((a) \, \iff \, (b)\);

(ii) \((c) \implies (b)\), where \(\tilde{\epsilon } \, = \frac{\gamma \epsilon }{7C + \gamma (C+D)}\) and \(\epsilon <\frac{7C + \gamma (C+D)}{\gamma }\). \(\Box \)

Observe that the perturbed natural map is rooted in the natural map, a residual function for variational inequality problems [14]. When specialized to the setting of the smooth convex optimization problem

$$\begin{aligned} \min _{\textbf{x}\in \mathcal{X}} f(\textbf{x}), \end{aligned}$$
(COpt)

we have that

$$ \left[ \, \textbf{x}^* \text{ solves } \text{(COpt) } \, \right] \, \iff \, \left[ F^{{\textrm{nat}}}_\mathcal{X}(\textbf{x}^*) \triangleq \, \textbf{x}^* - \varPi _X \left[ \textbf{x}^* - \gamma \nabla f(\textbf{x}^*) \, \right] = 0 \, \right] . $$

The lemma above develops a suitably defined \(\tilde{\epsilon }\)-perturbed counterpart of \(F^{{\textrm{nat}}}_\mathcal{X}\), denoted by \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}\). We observe that \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}(\textbf{x},\textbf{x})\) reduces to

$$\begin{aligned} F^{{\textrm{nat}},\tilde{\epsilon }}_X(\textbf{x},\textbf{x})\, \triangleq \, \left( 1-\tilde{\epsilon }\right) \left( \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] -\textbf{x}\right) . \end{aligned}$$
(17)

In other words, for any \(\tilde{\epsilon } < 1\),

$$\begin{aligned} F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}(\textbf{x},\textbf{x}) \, = \, 0 \quad \iff F^{{\textrm{nat}}}_\mathcal{X}(\textbf{x}) \, = \, 0, \end{aligned}$$
(18)

where \(F^{{\textrm{nat}}}_\mathcal{X}(\textbf{x}) \triangleq - \textbf{x}+ \varPi _\mathcal{X}\left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] .\) Based on the aforementioned result, in the kth iteration, this termination criterion reduces to

$$\begin{aligned} (\textbf{T1}) \begin{aligned}\quad&\, \left\| \, \tilde{\epsilon }_k \textbf{u}+ \left( 1-\tilde{\epsilon }_k\right) \varPi _\mathcal{X} \left[ \, \textbf{x}_{k+1} - \gamma \nabla _{\textbf{x}} \mathcal{L}_{\eta _k,\rho _k}(\textbf{x}_{k+1},\lambda _k)\, \right] - \textbf{x}_{k+1} \, \right\| \, = 0, \end{aligned} \end{aligned}$$
(19)

where \(\tilde{\epsilon }_k = \tfrac{\gamma \epsilon _k\eta _k^b}{{7C+ \gamma (C+D)}}\) and \(\textbf{u}\in \mathcal{X}\).
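In practice, the equality in (T1) is checked up to a small numerical tolerance rather than exactly. The following Python sketch (ours; `grad_L`, `proj_X`, and the probe point \(\textbf{u} \in \mathcal {X}\) are supplied by the caller) evaluates the perturbed natural-map residual and the resulting stopping test.

```python
import numpy as np

def t1_residual(x_next, u, grad_L, proj_X, gamma, eps_tilde):
    """Norm of the perturbed natural-map expression appearing in (T1)."""
    p = proj_X(x_next - gamma * grad_L(x_next))          # Pi_X[ x - gamma * grad L ]
    return np.linalg.norm(eps_tilde * u + (1.0 - eps_tilde) * p - x_next)

def t1_satisfied(x_next, u, grad_L, proj_X, gamma, eps_k, eta_k, b, C, D, tol=1e-10):
    # eps_tilde_k as defined above; C and D bound ||y||^2 and ||grad h(y)||^2 over X
    eps_tilde = gamma * eps_k * eta_k**b / (7.0 * C + gamma * (C + D))
    return t1_residual(x_next, u, grad_L, proj_X, gamma, eps_tilde) <= tol
```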

(II) Termination criterion for outer loop. Here we consider two settings.

(a) Constant penalty parameter. In setting (a), the outer scheme terminates when

$$\begin{aligned}&\left| \, f(\bar{\textbf{x}}_K)-f^*\, \right| \, \le \, \tfrac{C_1}{\sqrt{K}}+\eta _K\beta \text{ and } d_{-}\left( g(\bar{\textbf{x}}_K)\right) \, \le \, \tfrac{C_2}{\sqrt{K}} +m\eta _K\beta , \end{aligned}$$

where \(C_1 \triangleq B_5\), \(C_2 \triangleq B_4\), and the constants \(B_3, B_4, B_5, B_6\) are defined in Table 3. Since we have access to \(g(\bullet )\), it is easy to check \(d_{-}\left( g(\textbf{x}_K)\right) \le \sqrt{\epsilon }\). However, evaluating \(f(\bar{\textbf{x}}_K) - f^*\) is not directly possible, since \(f^*\) is unavailable. Since f is nonsmooth, we apply Lemma 7 to the optimality gap of the smoothed problem, \(|f_{\eta _K}(\bar{\textbf{x}}_K)-f_{\eta _K}^*|\), since it is related to the true optimality gap; i.e., by leveraging the property of smoothability of f,

$$\begin{aligned} f(\bar{\textbf{x}}_K) - f(\textbf{x}^*) \, \le \, f_{\eta _K}(\bar{\textbf{x}}_K) - f_{\eta _K}(\textbf{x}^*) + \eta _K B \, \le \, \left| \, f_{\eta _K}(\bar{\textbf{x}}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| + \eta _K B \\ f(\textbf{x}^*) - f(\bar{\textbf{x}}_K) \, \le \, f_{\eta _K}(\textbf{x}^*) - f_{\eta _K}(\bar{\textbf{x}}_K) + \eta _K B\, \le \, \left| \, f_{\eta _K}(\bar{\textbf{x}}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| + \eta _K B \\ \implies \left| \, f(\bar{\textbf{x}}_K) - f(\textbf{x}^*) \,\right| \, \le \left| \, f_{\eta _K}(\bar{\textbf{x}}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| + \eta _K B. \end{aligned}$$

Consequently, it suffices to get a bound on each term on the right. To get a bound on \(\left| \, f_{\eta _K}(\bar{\textbf{x}}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| \) given \(\hat{\textbf{x}} \in \mathcal{X}\), we leverage the following residual function:

$$\begin{aligned} G_{\mathcal {X}}^{\text {nat},\tilde{\epsilon }_K}(\textbf{x}_{K+1},\hat{\textbf{x}}) \triangleq \tilde{\epsilon }_K \hat{\textbf{x}} + (1-\tilde{\epsilon }_K)\varPi _{X}\left[ \textbf{x}_{K+1}-\gamma \nabla \mathcal{L}_{\eta _K}(\textbf{x}_{K+1})\,\right] - \textbf{x}_{K+1}, \end{aligned}$$

where \(\tilde{\epsilon }_K \triangleq \tfrac{\gamma C_1}{({7C + \gamma (C+D)})\sqrt{K}}\), and C and D are as defined in Lemma 7. (We can set the values of \(\eta _K\) such that the overall optimality gap (\(|f-f^*|\)) remains controlled below a tighter error tolerance \(\epsilon ^2\) to ensure consistency with our complexity analysis.) Therefore, we may employ the following termination criterion (T2) at the Kth iterate.

$$\begin{aligned} (\textbf{T2}) \quad \left\| \, G^{{\textrm{nat}},\tilde{\epsilon }_K}_\mathcal{X}(\textbf{x}_{K+1},\hat{\textbf{x}}) \, \right\| \, =\, 0 \text{ and } d_{-}\left( g(\textbf{x}_K)\right) \le \tilde{\epsilon }_K. \end{aligned}$$
(20)

(b) Increasing penalty parameter. In setting (b), the outer scheme terminates when

$$\begin{aligned}&\left| \, f(\textbf{x}_K)-f^*\, \right| \, \le \, \tfrac{C_1}{\rho _K} +\eta _K\beta \text{ and } d_{-}\left( g(\textbf{x}_K)\right) \, \le \, \tfrac{C_2}{\rho _K} + m\eta _K\beta , \end{aligned}$$

where \(C_1\triangleq {B_7}\) and \(C_2 \triangleq {B_8}\) are defined in Table 3. While it is easy to check \(d_{-}\left( g(\textbf{x}_K)\right) \le \epsilon \), since \(f^*\) is unavailable and f is nonsmooth, we apply Lemma 7 to the optimality gap of the smoothed problem, \(|f_{\eta _K}(\textbf{x}_K)-f_{\eta _K}^*|\), since it is related to the true optimality gap; i.e., by leveraging the property of smoothability of f, similar to the previous analysis, \(\left| \, f(\textbf{x}_K) - f(\textbf{x}^*) \,\right| \, \le \left| \, f_{\eta _K}(\textbf{x}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| + \eta _K B.\) Consequently, it suffices to get a bound on both terms on the right. To get a bound on \(\left| \, f_{\eta _K}(\textbf{x}_K) - f_{\eta _K}(\textbf{x}^*) \,\right| \) given \(\hat{\textbf{x}} \in \mathcal {X}\), we leverage the following residual function:

$$\begin{aligned} G_{\mathcal {X}}^{\text {nat},\tilde{\epsilon }_K}(\textbf{x}_{K+1},\hat{\textbf{x}}) \triangleq \tilde{\epsilon }_{{K}} \hat{\textbf{x}} + (1-\tilde{\epsilon }_K)\varPi _{X}\left[ \textbf{x}_{K+1}-\gamma \nabla \mathcal{L}_{\eta _K}(\textbf{x}_{K+1})\,\right] - \textbf{x}_{K+1}, \end{aligned}$$

where \(\tilde{\epsilon }_K \triangleq \tfrac{\gamma C_1}{({7C + \gamma (C+D)})\rho _{K}}\), and CD are as defined in Lemma 7. Akin to earlier, we may set the value of \(\eta _K\) such that the overall optimality gap (\(|f-f^*|\)) remains controlled below \(\epsilon \). Therefore, we may employ the following termination criterion (T2) at the Kth iterate.

$$\begin{aligned} (\textbf{T2}) \quad \left\| \, G^{{\textrm{nat}},\tilde{\epsilon }_K}_\mathcal{X}(\textbf{x}_{K+1},\hat{x}) \, \right\| \, =\, 0 \text{ and } d_{-}\left( g(\textbf{x}_K)\right) \le \tilde{\epsilon }_K. \end{aligned}$$
(21)

The modified algorithm statement should read as follows.

[Algorithm (Sm-AL with termination checks): identical to (Sm-AL), except that the inner solver may exit before the prescribed \(M_k\) steps once (T1) holds, and the outer loop exits once (T2) holds.]

Note that the subproblem solver is essentially an accelerated gradient scheme introduced in Section 4; the minimum number of steps prescribed by the rate guarantees is denoted by \(M_k\) and is derived in Section 4.

3 Rate Analysis

In this section, we analyze the rate of convergence of (Sm-AL). In Subsection 3.1, we provide some preliminaries, and we then derive rate statements for constant and increasing penalty parameters in Subsections 3.2 and 3.3, respectively.

3.1 Preliminary results

We begin by recalling the following bound, an extension of the result proved in [38, Lemma 4.3].

Lemma 8

Let \(\{\textbf{x}_{k},\lambda _{k}\}\) be generated by (Sm-AL). For any \(k \ge 0\), suppose \(\textbf{x}_{k+1}\) satisfies \(\mathcal {L}_{\eta _k,\rho _k}(\textbf{x}_{k+1},\lambda _k) - {\mathcal {D}_{\eta _k,\rho _k}}(\lambda _k) \le \epsilon _k\eta _k^b\) where \(b\ge 0\). Then for \(k \ge 0,\)

$$\begin{aligned} \left\| \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho _k} (x_{k+1},\lambda _k) - \nabla _{\lambda } \mathcal {D}_{\eta _k,\rho _k}(\lambda _k) \right\| ^2 \le \tfrac{2\epsilon _k\eta _k^b}{\rho _k}. \end{aligned}$$
(22)

By choosing appropriate sequences \(\{\epsilon _k,\eta _k,\rho _k\}\), \(\{(2\epsilon _k\eta _k^b)/\rho _k\}\) is diminishing (see Lemma 8). We now derive a uniform bound on the sequence \(\{\lambda _k\}\).

Lemma 9

(Bound on \(\lambda _k\)) Consider \(\{\lambda _k\}\) generated by (Sm-AL).

(a) \(\{ \lambda _k\}\) is a convergent sequence. (b) For any K, we have

$$\Vert \lambda _K - \lambda ^*\Vert \le \sum _{k=0}^{\infty } \left( \sqrt{2\rho _k \epsilon _k{\eta _k^b}} +2\sqrt{\eta _k \rho _k({\Vert \lambda ^*\Vert }m+{C_m})\beta } \right) + \Vert \lambda _0-\lambda ^*\Vert { \, \triangleq \, B_{\lambda }}.$$

3.2 Rate analysis under constant \(\rho _k\)

Next, we derive rate statements for the dual sub-optimality and primal infeasibility when \(\rho _k = \rho \) for all k. Our first result relies on the observation that the augmented dual function \(\mathcal{D}_{\rho }\) has the same set of optimal solutions (and supremum) as the original dual function \(\mathcal{D}_0\) (see [38, Th. 3.2]).

[Proposition (dual suboptimality under constant penalty): with \(\bar{\lambda }_K \triangleq \tfrac{1}{K}\sum _{i=1}^{K}\lambda _i\), \( f(\textbf{x}^*) - \mathcal {D}_{\rho }(\bar{\lambda }_K) \, \le \, \tfrac{1}{2\rho K}\Vert \lambda _0-\lambda ^*\Vert ^2 + \tfrac{B_{\lambda }}{K}\sum _{k=0}^{K-1}\sqrt{\tfrac{2\epsilon _k\eta _k^b}{\rho }} + \tfrac{B_2}{K}\sum _{k=0}^{K-1}\eta _k\).]

Proof

Recall that \(\mathcal {D}_{\eta _k,\rho }\) is the Moreau envelope of \(\mathcal{D}_{\eta _k,0}\). Consequently, \(\nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }\) is \(\tfrac{1}{\rho }\)-Lipschitz. We then have

$$\begin{aligned} -\mathcal{D}_{\eta _k,\rho }&(\lambda _{k+1}) \le -\mathcal{D}_{\eta _k,\rho }(\lambda _k) - \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)^\top (\lambda _{k+1}-\lambda _{k}) + \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2 \\&= -\mathcal{D}_{\eta _k,\rho }(\lambda _k) - \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)^\top {(\lambda _{k+1}-\lambda ^*)}{- \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)^\top (\lambda ^*-\lambda _{k})} \\&+ \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2 \\&\le -\mathcal{D}_{\eta _k,\rho }(\lambda _k) - \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)^\top {(\lambda _{k+1}-\lambda ^*)}{+ (\mathcal{D}_{\eta _k,\rho }(\lambda _k) - \mathcal{D}_{\eta _k,\rho }(\lambda ^*))} \\&+ \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2 \\&{=} -\mathcal{D}_{\eta _k,\rho }(\lambda ^*) - \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)^\top (\lambda _{k+1}-\lambda ^*) + \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2, \end{aligned}$$

where \(-\mathcal{D}_{\eta _k,\rho }(\lambda ^*) \ge -\mathcal{D}_{\eta _k,\rho }(\lambda _k) - \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)^\top (\lambda ^*-\lambda _k).\) By adding and subtracting \(\nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k)^\top (\lambda _{k+1}-\lambda ^*) \), it follows that

$$\begin{aligned} -\mathcal{D}_{\eta _k,\rho }&(\lambda _{k+1}) { \,\le \,} -\mathcal{D}_{\eta _k,\rho }(\lambda ^*) - \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k)^\top (\lambda _{k+1}-\lambda ^*) + \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2 \\&\quad - \left( \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k)\right) ^\top (\lambda _{k+1}-\lambda ^*) \\&{\, = \, } -\mathcal{D}_{\eta _k,\rho }(\lambda ^*) - \tfrac{1}{\rho }(\lambda _{k+1}-\lambda _k)^\top (\lambda _{k+1}-\lambda ^*) + \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2 \\&\quad - \left( \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k)\right) ^\top (\lambda _{k+1}-\lambda ^*) \\&\le -\mathcal{D}_{\eta _k,\rho }(\lambda ^*) - \tfrac{1}{\rho }(\lambda _{k+1}-\lambda _k)^\top (\lambda _{k+1}-\lambda ^*)+ \tfrac{1}{2\rho }\Vert \lambda _{k+1}-\lambda _k\Vert ^2 \\&\quad + \Vert \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \Vert \Vert \lambda _{k+1}-\lambda ^* \Vert \\&= -\mathcal{D}_{\eta _k,\rho }(\lambda ^*) + \tfrac{1}{{2\rho }} (\Vert \lambda _k-\lambda ^*\Vert ^2 - \Vert \lambda _{k+1}-\lambda ^*\Vert ^2) \\&\quad + \Vert \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \Vert \Vert \lambda _{k+1}-\lambda ^* \Vert \\&\le -\mathcal{D}_{{\rho }}(\lambda ^*) + \eta _k({\Vert \lambda ^*\Vert }m+1)\beta + \tfrac{1}{{2\rho }} (\Vert \lambda _k-\lambda ^*\Vert ^2 - \Vert \lambda _{k+1}-\lambda ^*\Vert ^2) \\&\quad + \Vert \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \Vert \Vert \lambda _{k+1}-\lambda ^* \Vert , \end{aligned}$$

where the last inequality follows from Lemma 4(iii). By invoking Lemma 9, and \(\Vert \lambda _k\Vert +\Vert \lambda ^*\Vert \le {\Vert \lambda _k-\lambda ^*\Vert +\Vert \lambda ^*\Vert } +\Vert \lambda ^*\Vert \le B_{\lambda } + 2b_{\lambda } {\, \triangleq \, \tilde{B}_{\lambda }}\), we obtain

$$\begin{aligned} -\mathcal{D}_{\rho }(\lambda _{k+1})&\le -\mathcal{D}_{\rho }(\lambda ^*) +{\eta _k(\Vert \lambda _{k+1}\Vert m+1)\beta }+ {\eta _k(\Vert \lambda ^*\Vert m+1)\beta }\\&\quad + \tfrac{1}{{2\rho }} (\Vert \lambda _k-\lambda ^*\Vert ^2 - \Vert \lambda _{k+1}-\lambda ^*\Vert ^2) \\&\quad + \Vert \nabla _{\lambda } \mathcal{D}_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \Vert \Vert \lambda _{k+1}-\lambda ^* \Vert \\&{ \le -\mathcal{D}_{\rho }(\lambda ^*) +\eta _k({\tilde{B}_{\lambda }}m+1)\beta } + \tfrac{1}{2\rho } (\Vert \lambda _k-\lambda ^*\Vert ^2 - \Vert \lambda _{k+1}-\lambda ^*\Vert ^2) \\&\quad + \Vert \nabla _{\lambda } D_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \Vert \Vert \lambda _{k+1}-\lambda ^* \Vert . \end{aligned}$$

By summing from \(k = 0,\cdots ,K-1\) and dividing by K, we obtain

$$\begin{aligned}&-\left( \tfrac{1}{K}{\sum _{i=0}^{K-1}\mathcal{D}_{\rho }(\lambda _{i+1})} - {\mathcal{D}_{\rho }(\lambda ^*)}\right) \nonumber \\&\le \tfrac{1}{2\rho K} (\Vert \lambda _0-\lambda ^*\Vert ^2 - \Vert \lambda _{K}-\lambda ^*\Vert ^2) + \tfrac{1}{K} \sum _{k=0}^{K-1}{\eta _k ({\tilde{B}_{\lambda }} m+1)\beta }\nonumber \\&\quad +\tfrac{1}{K}\sum _{k = 0}^{K-1}\left\| \nabla _{\lambda } D_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \right\| \Vert \lambda _{k+1}-\lambda ^*\Vert \nonumber \\&\le \tfrac{1}{2\rho K}\Vert \lambda _0-\lambda ^*\Vert ^2 + \tfrac{{B_{\lambda }}}{K} \sum _{k=0}^{K-1} \tfrac{\sqrt{2\epsilon _k\eta _k^b}}{\sqrt{\rho }} +\tfrac{{B_2}}{K} \sum _{k=0}^{K-1}\eta _k, \end{aligned}$$
(23)

where boundedness of \(\lambda _k\) follows from Lemma 9 and \(\tilde{B}_{\lambda }, {B_\lambda , B_2}\) are constants. Consequently, by invoking the concavity of \(\mathcal{D}_{\rho }\), we may bound the term on the left to obtain the required inequality, where \(\bar{\lambda }_K = \tfrac{1}{K}\sum _{i=1}^{K} \lambda _i\).

$$\begin{aligned} -\left( {\mathcal{D}_{\rho }(\bar{\lambda }_K) - \mathcal{D}_{\rho }(\lambda ^*)}\right)&\le \tfrac{1}{2\rho K} (\Vert \lambda _0-\lambda ^*\Vert ^2 - \Vert \lambda _{K}-\lambda ^*\Vert ^2) + \tfrac{1}{K} \sum _{k=0}^{K-1}{\eta _k ({\tilde{B}_{\lambda }} m+1)\beta }\nonumber \\&\quad +\tfrac{1}{K}\sum _{k = 0}^{K-1}\left\| \nabla _{\lambda } D_{\eta _k,\rho }(\lambda _k)- \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) \right\| \Vert \lambda _{k+1}-\lambda ^*\Vert \nonumber \\&\le \tfrac{1}{2\rho K}\Vert \lambda _0-\lambda ^*\Vert ^2 + \tfrac{{B_{\lambda }}}{K} \sum _{k=0}^{K-1} \tfrac{\sqrt{2\epsilon _k\eta _k^b}}{\sqrt{\rho }} +\tfrac{{B_2}}{K} \sum _{k=0}^{K-1}\eta _k. \end{aligned}$$
(24)

The final result follows by noting that \(\mathcal{D}_{\rho }\) is the Moreau envelope of \(\mathcal{D}_0\) and strong duality holds, implying that \(\mathcal{D}_{\rho }(\lambda ^*) = \mathcal{D}_{0}(\lambda ^*) = f(\textbf{x}^*)\). \(\square \)

Next, we derive a rate statement on the infeasibility.

[Proposition (primal infeasibility under constant penalty): a bound on \(d_{-}(g(\bar{\textbf{x}}_K))\) of order \(\mathcal {O}(1/\sqrt{K})\) under suitable choices of \(\{\epsilon _k,\eta _k\}\), established via (26) and the dual suboptimality bound above.]

Proof

We have that \(g_{\eta _k}(\textbf{x}_{k+1})\) can be expressed as

$$\begin{aligned} g_{\eta _k}(\textbf{x}_{k+1}) = \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _{k}) + \left( \varPi _{-} \left( \tfrac{\lambda _{{k}}}{\rho } + g_{\eta _k}(\textbf{x}_{k+1})\right) \right) . \end{aligned}$$

Recall that \(d_{-}(u+v) \le d_-(u) + \Vert v\Vert \) for any \(u,v \in \mathbb {R}^m\). Consequently,

$$\begin{aligned} d_-(g_{\eta _k}(\textbf{x}_{k+1}))&\le \Vert \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _{{k}})\Vert + \underbrace{d_- \left( \varPi _{-} \left( \tfrac{\lambda _{k}}{\rho } + g_{\eta _k}(\textbf{x}_{k+1}) \right) \right) }_{=0} \nonumber \\&= \Vert \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k)\Vert . \end{aligned}$$
(25)

By definition of \(d_{-}(\bullet )\), convexity of \(\max \{g_j(\bullet ),0\}\), and \(\Vert u\Vert _2 \le \Vert u\Vert _1 \le \sqrt{m}\Vert u\Vert _2\),

$$\begin{aligned}&d_{-}(g(\bar{\textbf{x}}_K)) \, = \, \inf _{u \in \mathbb {R}^m_-} \Vert g(\bar{\textbf{x}}_K) - u \Vert _2 \, \le \, \inf _{u \in \mathbb {R}^m_-} \Vert g(\bar{\textbf{x}}_K) - u\Vert _1 \, = \, \sum _{j=1}^m \inf _{u_j \le 0} \left| g_j(\bar{\textbf{x}}_K) - u_j\right| \nonumber \\&\, = \sum _{j=1}^m \max \{g_j(\bar{\textbf{x}}_K),0\} \, \le \, \tfrac{1}{K} \sum _{i=0}^{K-1} \sum _{j=1}^m \max \{g_j(\textbf{x}_{i+1}),0\} \nonumber \\&\le \tfrac{1}{K} \sum _{i=0}^{K-1} \sum _{j=1}^m \max \{g_{j,\eta _i}(\textbf{x}_{i+1})+\eta _i \beta ,0\} = \, \tfrac{1}{K} \sum _{i=0}^{K-1} \inf _{u \in \mathbb {R}^m_-} \Vert g_{\eta _i}(\textbf{x}_{i+1}) + \eta _i \beta \textbf{1} - u \Vert _1 \nonumber \\&\le \tfrac{1}{K} \sum _{i=0}^{K-1} \inf _{u \in \mathbb {R}^m_-} \sqrt{m} \Vert g_{\eta _i}(\textbf{x}_{i+1}) + \eta _i \beta \textbf{1} - u \Vert _2 \, = \, \tfrac{\sqrt{m}}{K} \sum _{i=0}^{K-1} d_{-} (g_{\eta _i}(\textbf{x}_{i+1})+\eta _i \beta \textbf{1})\nonumber \\&\le \tfrac{\sqrt{m}}{K} \sum _{i=0}^{K-1}\left( d_-(g_{\eta _i}(\textbf{x}_{i+1})) + \eta _i\beta \Vert \textbf{1}\Vert _2\right) \overset{\tiny (25)}{\le } \tfrac{\sqrt{m}}{K}\sum _{i=0}^{K-1}\left( \Vert \nabla _{\lambda } \mathcal {L}_{\eta _i,\rho }(\textbf{x}_{i+1},\lambda _{i})\Vert + \sqrt{m}\eta _i\beta \right) \nonumber \\&\le \tfrac{\sqrt{m}}{K}\sum _{i=0}^{K-1} \left( \Vert \nabla _{\lambda } \mathcal {L}_{\eta _i,\rho }(\textbf{x}_{i+1},\lambda _{i}) - \nabla _{\lambda } \mathcal {D}_{\eta _i,\rho }(\lambda _i)\Vert + \Vert \nabla _{\lambda } \mathcal {D}_{\eta _i,\rho }(\lambda _i)\Vert + \sqrt{m}\eta _i\beta \right) . \end{aligned}$$
(26)

Recall that

$$ \Vert \nabla _{\lambda } \mathcal {D}_{\eta _k,\rho }(\lambda _1)-\nabla _{\lambda } \mathcal {D}_{\eta _k,\rho }(\lambda _2)\Vert \le \tfrac{1}{\rho }\left\| q_{\eta ,\rho }(\lambda _1)-q_{\eta ,\rho }(\lambda _2)\right\| + \tfrac{1}{\rho }\left\| \lambda _1-\lambda _2\right\| \le \tfrac{2}{\rho }\Vert \lambda _1-\lambda _2\Vert ,$$

allowing us to claim that \(\mathcal {D}_{\eta _k,\rho }\) is a \((2/\rho )\)-smooth concave function. Then, by leveraging [32], we have for any \(\lambda \ge 0\) that

$$\begin{aligned} \Vert \nabla _{\lambda }&\mathcal {D}_{\eta _k,\rho }(\lambda ) \Vert \le \sqrt{\tfrac{2}{\rho } \left( \mathcal {D}_{\eta _k,\rho }({\lambda _{\eta _k}^*})-\mathcal{D}_{\eta _k,\rho }(\lambda )\right) } \le \sqrt{\tfrac{2}{\rho } \left( \mathcal {D}_{\rho }({\lambda _{\eta _k}^*})-\mathcal{D}_{\rho }(\lambda )+2\eta _k \beta \tilde{B}_{\lambda }\right) } \\&\le \sqrt{\tfrac{2}{\rho } \left( \mathcal {D}_{\rho }({\lambda ^*})-\mathcal{D}_{\rho }(\lambda )+2\eta _k \beta \tilde{B}_{\lambda }\right) } \le \sqrt{\tfrac{2}{\rho } \left( \mathcal {D}_{\rho }({\lambda ^*})-\mathcal{D}_{\rho }(\lambda )\right) } +{2\sqrt{\tfrac{\eta _k \beta \tilde{B}_{\lambda }}{\rho }}}, \end{aligned}$$

where \({\lambda _{\eta }^*}\) is a maximizer of \(\mathcal {D}_{\eta ,\rho }\). By leveraging the concavity of the square-root function, the prior dual sub-optimality bound, and the subadditivity property \(\sqrt{u+v} \le \sqrt{u}+\sqrt{v}\) for \(u, v \ge 0\) (a consequence of concavity), we have from (26),

$$\begin{aligned} \hspace{-0.1in}&d_-(g(\bar{\textbf{x}}_{K})) \le \tfrac{{\sqrt{m}}}{K}\sum _{i=0}^{K-1}\left( \sqrt{\tfrac{2\epsilon _i{\eta _i^b}}{\rho }}+{\sqrt{m}}\eta _i\beta \right) + \tfrac{{\sqrt{m}}}{K}\sum _{i=0}^{K-1}\sqrt{\tfrac{2}{\rho } \left( \mathcal {D}_{\rho }({\lambda ^*})-\mathcal{D}_{\rho }(\lambda _i)\right) } \\&+ {\tfrac{2\sqrt{m}}{K}\sum _{i=0}^{K-1}\sqrt{\tfrac{\eta _i \beta \tilde{B}_{\lambda }}{\rho }}} \\&\overset{\tiny \text{(Concavity } \text{ of } \sqrt{\cdot })}{\le } \tfrac{{\sqrt{m}}}{K}\sum _{i=0}^{K-1}\left( \sqrt{\tfrac{2\epsilon _i{\eta _i^b}}{\rho }}+{\sqrt{m}}\eta _i\beta \right) + \sqrt{\tfrac{2m}{\rho } \left( \mathcal {D}_{\rho }({\lambda ^*})-\tfrac{1}{K} \sum _{i=0}^{K-1}\mathcal{D}_{\rho }(\lambda _i)\right) } \\&+ {\tfrac{\sqrt{m}}{K}\sum _{i=0}^{K-1}\sqrt{\tfrac{2 \eta _i \beta \tilde{B}_{\lambda }}{\rho }}}. \end{aligned}$$

Recalling (24), it follows that

$$\begin{aligned} \tfrac{1}{K} \sum _{i=0}^{K-1}\left( \mathcal {D}_{\rho }({\lambda ^*})-\mathcal{D}_{\rho }(\lambda _i)\right) \le \tfrac{1}{2\rho K}\Vert \lambda _0-\lambda ^*\Vert ^2 + \tfrac{{B_{\lambda }}}{K} \sum _{k=0}^{K-1} \tfrac{\sqrt{2\epsilon _k\eta _k^b}}{\sqrt{\rho }} +\tfrac{{B_2}}{K} \sum _{k=0}^{K-1}\eta _k = \tfrac{C}{K}, \end{aligned}$$

which implies that

$$\begin{aligned} d_-(g(\bar{\textbf{x}}_{K}))\le \tfrac{{\sqrt{m}}}{K}\sum _{i=0}^{K-1}\left( \sqrt{\tfrac{2\epsilon _i{\eta _i^b}}{\rho }}+{\sqrt{m}}\eta _i\beta + \sqrt{\tfrac{2 \eta _i \beta \tilde{B}_{\lambda }}{\rho }}\right) + \sqrt{\tfrac{2mC}{\rho K} } \end{aligned}$$

where \(C \triangleq \tfrac{\Vert \lambda _0-\lambda ^*\Vert ^2}{2\rho }+\left( B_{\lambda }\sum _{k=0}^{K-1}\tfrac{\sqrt{2\epsilon _k\eta _k^b}}{\sqrt{\rho }}+B_2\sum _{k=0}^{K-1}\eta _k\right) \). \(\square \)

We now derive a rate statement for the primal sub-optimality.

figure k

Proof

Since \(\textbf{x}_k\) may not be feasible with respect to the constraints, we derive both upper and lower bounds on the sub-optimality.

(i) Lower bound. A rate statement for the lower bound is first constructed. Since \(\max _{\lambda } \mathcal {D}_{\rho }(\lambda ) = {\displaystyle \min _{\textbf{x}\in \mathcal {X}}} \ \mathcal {L}_{\rho }(\textbf{x},\lambda ^*) = f^*\), the following sequence of inequalities hold where \(\bar{\textbf{x}}_K = \tfrac{1}{K}\sum _{k = 0}^{K-1}\textbf{x}_k\), \(f_{\eta _K}^* = {\displaystyle \min _{\textbf{x}\in \mathcal X}} \ \mathcal {L}_{\eta _K,{\rho }}\left( \textbf{x}, {\lambda _{\eta _K}^*}\right) \), and \((\textbf{x}_{\eta _K}^*,\lambda _{\eta _K}^*)\) is the saddle point of \(\mathcal {L}_{\eta _K, 0}(\textbf{x},\lambda )\).

$$\begin{aligned} f_{\eta _K}^*&{= \mathcal{L}_{\eta _K,\rho }(\textbf{x}^*_{\eta _K},\lambda ^*_{\eta _K}) } \le \mathcal {L}_{\eta _K,{\rho }}(\bar{\textbf{x}}_K,{\lambda _{\eta _K}^*}) \\&= f_{\eta _K}(\bar{\textbf{x}}_K) +\tfrac{\rho }{2} \left( d_-\left( \tfrac{{{\lambda ^*_{\eta _K}}}}{\rho } + g_{\eta _K}(\bar{\textbf{x}}_K) \right) \right) ^2 - \tfrac{1}{2\rho }\Vert {\lambda ^*_{\eta _K}}\Vert ^2 \\&\le f_{\eta _K}(\bar{\textbf{x}}_K) +\tfrac{\rho }{2} \left( d_-\left( g_{\eta _K}(\bar{\textbf{x}}_K) \right) + \left\| \tfrac{{\lambda _{\eta _K}^*}}{\rho }\right\| \right) ^2 - \tfrac{1}{2\rho }\Vert {\lambda _{\eta _K}^*}\Vert ^2 \\&= f_{\eta _K}(\bar{\textbf{x}}_K) +\tfrac{\rho }{2} \left( d_-\left( g_{\eta _K}(\bar{\textbf{x}}_K) \right) \right) ^2 + \left\| {\lambda _{\eta _K}^*}\right\| d_-\left( g_{\eta _K}(\bar{\textbf{x}}_K) \right) \\&\overset{\tiny \text{ Lem. } 1}{\le } f_{\eta _K}(\bar{\textbf{x}}_K) +\tfrac{\rho }{2} \left( d_-\left( g_{\eta _K}(\bar{\textbf{x}}_K) \right) \right) ^2 + {b_{\lambda ,\eta }} d_-\left( g_{\eta _K}(\bar{\textbf{x}}_K) \right) . \end{aligned}$$

By invoking Proposition 3, we obtain the following inequality.

$$\begin{aligned} f_{\eta _K}^* - f_{\eta _K}(\bar{\textbf{x}}_K) \le \tfrac{{B_5^2}}{K} + \tfrac{{B_5}}{\sqrt{K}}. \end{aligned}$$
(28)

Let \(\textbf{x}^*\in \mathcal {X}^*\) and let \(\textbf{x}_{{\eta _K}}^*\) be a minimizer of \(\mathcal{L}_{\eta _K,\rho }(\cdot , {\lambda _{\eta _K}^*})\). By Lemma 4, we have that

$$\begin{aligned} f(\textbf{x}^*) = \mathcal{L}(\textbf{x}^*,\lambda ^*)&\le \mathcal{L}(\textbf{x}^*_{\eta _K},\lambda ^*) = f(\textbf{x}^*_{\eta _K}) + \sum _{i=1}^m \lambda ^*_i g_i(\textbf{x}^*_{\eta _K}) \nonumber \\&\le f(\textbf{x}^*_{\eta _K}) + \sum _{i=1}^m \lambda ^*_i (g_i(\textbf{x}^*_{\eta _K}) - g_{i,\eta _K}(\textbf{x}_{\eta _K}^*)) \nonumber \\&\le f(\textbf{x}^*_{\eta _K}) + m b_{\lambda } \eta _K \beta , \end{aligned}$$
(29)

implying that \(f(\textbf{x}^*) \le f(\textbf{x}^*_{\eta _K}) + mb_{\lambda } \beta \eta _K.\) By definition of the smoothing, \(f(\textbf{x}_{{\eta _K}}^*)-f_{\eta _K}(\textbf{x}_{{\eta _K}}^*) \le \beta \eta _K \) and \(f_{\eta _K}(\bar{\textbf{x}}_K)-f(\bar{\textbf{x}}_K) \le 0\). Combining these bounds with (28), we obtain

$$\begin{aligned} f(\textbf{x}^*)-f(\bar{\textbf{x}}_K)&= \underbrace{f(\textbf{x}^*) - f(\textbf{x}_{{\eta _K}}^*)}_{\le {mb_{\lambda } \beta \eta _K}} + \underbrace{f(\textbf{x}_{{\eta _K}}^*)-f_{\eta _K}(\textbf{x}_{{\eta _K}}^*)}_{\le \beta \eta _K } + \underbrace{f_{\eta _K}(\textbf{x}_{{\eta _K}}^*) -f_{\eta _K}(\bar{\textbf{x}}_K)}_{(28) } \\&\quad +\underbrace{f_{\eta _K}(\bar{\textbf{x}}_K)-f(\bar{\textbf{x}}_K)}_{\le 0} \le (1+ {mb_{\lambda }}) \eta _K \beta + \tfrac{{B_5^2}}{K} + \tfrac{{B_5}}{\sqrt{K}}. \end{aligned}$$

(ii) Upper bound. Let \(\textbf{x}_{\eta _k,\lambda _k}^* {\in } \arg {\displaystyle \min _{\textbf{x}\in \mathcal X}} \ \mathcal {L}_{\eta _k,{\rho }}\left( \textbf{x}, {\lambda _k}\right) \) and \((\textbf{x}_{\eta _k}^*,\lambda _{\eta _k}^*)\) be the saddle point of \(\mathcal {L}_{\eta _k, 0}(\textbf{x},\lambda )\). Based on the definition of \(\textbf{x}_{\eta _k,\lambda _k}^*\) and \(\textbf{x}_{\eta _k}^*\), the following two inequalities hold.

$$\begin{aligned} \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) - \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k,\lambda _k}^*,{\lambda _k})&\le \epsilon _k{\eta _k^b}\\ \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k,\lambda _k}^*,\lambda _k)&\le \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k}^*,\lambda _k) \end{aligned}$$

By adding the two inequalities, we obtain

$$\begin{aligned}&\mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k) - \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k}^*,{\lambda _k})\nonumber \\&= \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{k+1},\lambda _k)- \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k,\lambda _k}^*,\lambda _k) + \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k,\lambda _k}^*,\lambda _k)- \mathcal {L}_{\eta _k,\rho }(\textbf{x}_{\eta _k}^*,{\lambda _k})\nonumber \\&\le \epsilon _k\eta _k^b. \end{aligned}$$
(30)

Consequently, by leveraging (30) and invoking the definition of \(\mathcal{L}_{\eta _k,\rho }(\cdot ,\lambda _k)\), we have that

$$\begin{aligned} f_{\eta _k}(\textbf{x}_{k+1}) - f_{\eta _k}{(\textbf{x}^*_{\eta _k})}&\le \tfrac{\rho }{2} \left( d_{-}\left( \tfrac{\lambda _k}{\rho } +g_{\eta _k}(\textbf{x}_{\eta _k}^*)\right) \right) ^2 - \tfrac{\rho }{2} \left( d_{-}\left( \tfrac{\lambda _k}{\rho } + g_{\eta _k}(\textbf{x}_{k+1}) \right) \right) ^2+ \epsilon _k\eta _k^b. \end{aligned}$$

We observe that

$$\begin{aligned} d_{-} (u) = \left\| \varPi _-(u) - u\right\| = \left\| \varPi _-(u) - (\varPi _-(u) + \varPi _+(u)) \right\| = \left\| - \varPi _+(u) \right\| = \left\| \varPi _+(u)\right\| . \end{aligned}$$

By choosing \(u = g_{\eta _k}(\textbf{x}_{k+1}) + \tfrac{\lambda _k}{\rho }\), it follows from Lemma 6 that

$$\begin{aligned} d_{-}\left( g_{\eta _k}(\textbf{x}_{k+1}) + \tfrac{\lambda _k}{\rho } \right) = \left\| \varPi _+ \left( g_{\eta _k}(\textbf{x}_{k+1}) + \tfrac{\lambda _k}{\rho } \right) \right\| = \left\| \tfrac{\lambda _{k+1}}{\rho }\right\| . \end{aligned}$$

Furthermore, we have that \(g_{\eta _k}(\textbf{x}_{\eta _k}^*) \le 0\) since \(\textbf{x}_{\eta _k}^*\) is feasible with respect to the \(\eta _k\)-smoothed constraints, implying

$$\begin{aligned} d_{-}\left( \tfrac{\lambda _k}{\rho } + g_{\eta _k}(\textbf{x}_{\eta _k}^*) \right) \le \underbrace{d_{-}\left( g_{\eta _k}(\textbf{x}_{\eta _k}^*) \right) }_{\tiny = 0, \text{ since } g_{\eta _k}(\textbf{x}_{\eta _k}^*) \le 0}+ \left\| \tfrac{\lambda _k}{\rho } \right\| , \end{aligned}$$

which implies

$$\begin{aligned}&f_{\eta _k}(\textbf{x}_{k+1}) - {f_{\eta _k}(\textbf{x}_{\eta _k}^*)}\le \tfrac{\rho }{2}\left( \left\| \tfrac{\lambda _k}{\rho }\right\| ^2-\left\| \tfrac{\lambda _{k+1}}{\rho }\right\| ^2\right) + \epsilon _k\eta _k^b. \end{aligned}$$
(31)

We observe that \(g_{\eta _k}(\textbf{x}^*) \le g(\textbf{x}^*) \le 0\), implying that \(\textbf{x}^*\) is feasible for the \(\eta _k\)-smoothed problem and consequently,

$$\begin{aligned} f_{\eta _k}(\textbf{x}^*_{\eta _k}) - f_{\eta _k}(\textbf{x}^*) \le 0. \end{aligned}$$
(32)

Summing from \(k = 0\) to \(K-1\) and leveraging the convexity of \(f_{\eta _k}\), we obtain that

where \({B_6}> 0\) is a constant. \(\square \)
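As a numerical sanity check on the projection identities used in the preceding proof, namely \(d_-(u) = \Vert \varPi _+(u)\Vert \) and the relation from Lemma 6 equating \(d_-\left( g_{\eta _k}(\textbf{x}_{k+1}) + \tfrac{\lambda _k}{\rho }\right) \) with \(\Vert \tfrac{\lambda _{k+1}}{\rho }\Vert \), the following minimal NumPy sketch may be useful; the dimension, the stand-ins for \(g_{\eta _k}(\textbf{x}_{k+1})\) and \(\lambda _k\), and the value of \(\rho \) are arbitrary illustrative choices, and the multiplier update is written in the form stated in the proof of Lemma 10 below.

```python
import numpy as np

def proj_minus(u):
    # Euclidean projection onto the nonpositive orthant R^m_-
    return np.minimum(u, 0.0)

def proj_plus(u):
    # Euclidean projection onto the nonnegative orthant R^m_+
    return np.maximum(u, 0.0)

def d_minus(u):
    # d_-(u) = || u - Pi_-(u) ||, the distance from u to R^m_-
    return np.linalg.norm(u - proj_minus(u))

rng = np.random.default_rng(0)
g_val = rng.normal(size=4)           # stand-in for g_{eta_k}(x_{k+1})
lam_k = np.abs(rng.normal(size=4))   # stand-in for a multiplier lambda_k >= 0
rho = 2.0                            # stand-in penalty parameter

u = lam_k / rho + g_val
# identity d_-(u) = || Pi_+(u) ||
assert np.isclose(d_minus(u), np.linalg.norm(proj_plus(u)))

# multiplier update lambda_{k+1} = lambda_k + rho * grad_lambda L_{eta_k,rho}(x_{k+1}, lambda_k),
# with grad_lambda L = g_{eta_k}(x_{k+1}) - Pi_-(lambda_k/rho + g_{eta_k}(x_{k+1}))
lam_next = lam_k + rho * (g_val - proj_minus(u))
assert np.allclose(lam_next, rho * proj_plus(u))        # equivalently rho * Pi_+(u)
print(d_minus(u), np.linalg.norm(lam_next) / rho)       # equal, as in the display above
```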

3.3 Rate analysis under increasing \(\rho _k\)

We now consider the setting where \(\{\rho _k\}\) is an increasing sequence.

Lemma 10

(Rate on primal infeasibility) Suppose \(\{{(\textbf{x}_k,\lambda _k)}\}\) is generated by (Sm-AL). Then for any \(k \ge 0\), \(d_{-}\left( g(\textbf{x}_{k+1}) \right) \le \left\| \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k}\right\| + m\eta _k {\beta }.\)

Proof

By the update rule, we have that

$$\begin{aligned} \lambda _{k+1}&:= \lambda _k + \rho _k \nabla _{\lambda } \mathcal {L}_{\eta _k,\rho _k}(\textbf{x}_{k+1},\lambda _k) = \lambda _k + \rho _k g_{\eta _k}(\textbf{x}_{k+1}) - \rho _k \varPi _-\left( \tfrac{\lambda _k}{\rho _k}+ g_{\eta _k}(\textbf{x}_{k+1}) \right) . \end{aligned}$$

It follows that \( g_{\eta _k}(\textbf{x}_{k+1}) = \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k} + \varPi _-\left( \tfrac{\lambda _k}{\rho _k}+g_{\eta _k}(\textbf{x}_{k+1}) \right) \), implying

$$\begin{aligned} d_{-}\left( g_{\eta _k}(\textbf{x}_{k+1}) \right)&\le d_{-}\left( \varPi _-\left( \tfrac{\lambda _k}{\rho _k}+g_{\eta _k}(\textbf{x}_{k+1}) \right) \right) + \left\| \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k}\right\| = \left\| \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k}\right\| . \end{aligned}$$

Akin to the proof in Proposition 3, we have

$$\begin{aligned} d_{-}\left( g(\textbf{x}_{k+1}) \right) \le d_{-}\left( g_{\eta _k}(\textbf{x}_{k+1}) \right) +m\eta _{k}{\beta } \le \left\| \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k}\right\| + m\eta _k {\beta }. \end{aligned}$$

\(\square \)

figure l

Proof

(i) Let \(f_{\eta _k}^* \triangleq f_{\eta _k}({\textbf{x}_{\eta _k}^*})\) and \((\textbf{x}_{\eta _k}^*,\lambda _{\eta _k}^*)\) be the saddle point of \(\mathcal {L}_{\eta _k, 0}(\textbf{x},\lambda )\). We have that

$$\begin{aligned} f_{\eta _k}^*&\le \mathcal {L}_{\eta _k,\rho _k}({\textbf{x}_{k+1}},{\lambda _{\eta _k}^*}) = f_{\eta _k}({\textbf{x}_{k+1}}) +\tfrac{\rho _k}{2} \left( d_-\left( \tfrac{{\lambda _{\eta _k}^*}}{{\rho _k}} + g_{\eta _k}({\textbf{x}_{k+1}}) \right) \right) ^2 - \tfrac{1}{2\rho _k}\Vert {\lambda _{\eta _k}^*}\Vert ^2 \nonumber \\&{=} f_{\eta _k}({\textbf{x}_{k+1}}) +\tfrac{\rho _k}{2} \left( d_-\left( \tfrac{{\lambda _{k}}}{{\rho _k}} - \tfrac{{\lambda _{k}}}{{\rho _k}} + \tfrac{{\lambda _{\eta _k}^*}}{{\rho _k}} + g_{\eta _k}({\textbf{x}_{k+1}}) \right) \right) ^2 - \tfrac{1}{2\rho _k}\Vert {\lambda _{\eta _k}^*}\Vert ^2 \nonumber \\&\le f_{\eta _k}({\textbf{x}_{k+1}}) +\tfrac{\rho _k}{2} \left( d_-\left( \tfrac{{\lambda _{k}}}{{\rho _k}} + g_{\eta _k}({\textbf{x}_{k+1}}) \right) + \left\| \tfrac{{\lambda _{k}}}{{\rho _k}} - \tfrac{{\lambda _{\eta _k}^*}}{{\rho _k}} \right\| \right) ^2 - \tfrac{1}{2\rho _k}\Vert {\lambda _{\eta _k}^*}\Vert ^2 \nonumber \\&\le f_{\eta _k}({\textbf{x}_{k+1}}) +\tfrac{\rho _k}{2} \left( \tfrac{\Vert \lambda _{k+1}\Vert }{\rho _k} + \left\| \tfrac{{\lambda _{k}}}{{\rho _k}} - \tfrac{{\lambda _{\eta _k}^*}}{{\rho _k}} \right\| \right) ^2 - \tfrac{1}{2\rho _k}\Vert {\lambda _{\eta _k}^*}\Vert ^2 \nonumber \\&\le f_{\eta _k}({\textbf{x}_{k+1}}) +\tfrac{1}{\rho _k} {\left( \Vert \lambda _{k+1}\Vert ^2 + \left\| \lambda _{k}-\lambda _{\eta _k}^* \right\| ^2\right) }. \end{aligned}$$
(33)

By adding and subtracting \(f(\textbf{x}_{\eta _k}^*), f_{\eta _k}^*\) and \( f_{\eta _k}({\textbf{x}_{k+1}})\), it follows that

$$\begin{aligned} f^*-f({\textbf{x}_{k+1}})&= \underbrace{f^* - f(\textbf{x}_{\eta _k}^*)}_{\le { b_{\lambda } m \beta \eta _k } \tiny { \text{ from } (29)}} + \underbrace{f(\textbf{x}_{\eta _k}^*)-f_{\eta _k}^*}_{\le \eta _k \beta } \\&\quad + \underbrace{f_{\eta _k}^* -f_{\eta _k}({\textbf{x}_{k+1}})}_{(33)} + \underbrace{f_{\eta _k}({\textbf{x}_{k+1}}) -f({\textbf{x}_{k+1}})}_{\le 0}. \end{aligned}$$

Consequently, we have that \(f(\textbf{x}_{k+1}) - f(\textbf{x}^*) \ge -{(1+b_{\lambda } m)}\eta _k \beta -\left( \tfrac{\Vert \lambda _{k+1}\Vert ^2}{\rho _k} + \tfrac{\Vert {\lambda _{\eta _k}^*}-\lambda _k\Vert ^2}{\rho _k}\right) .\)

(ii) Similar to the previous analysis in Theorem 2, we have

$$\begin{aligned}&\mathcal {L}_{\eta _k,\rho _k}(\textbf{x}_{k+1},\lambda _k) - \mathcal {L}_{\eta _k,\rho _k}(\textbf{x}_{\eta _k}^*,\lambda _k) \le \epsilon _k{\eta _k^b}. \end{aligned}$$

which implies

$$\begin{aligned}&f_{\eta _k}(\textbf{x}_{k+1}) - f_{\eta _k}^* \\&\le \tfrac{\rho _k}{2}\left( \left( d_{-}\left( \tfrac{\lambda _k}{\rho _k} +g_{\eta _k}(\textbf{x}_{\eta _k}^*)\right) \right) ^2 - \left( d_{-}\left( \tfrac{\lambda _k}{\rho _k} + g_{\eta _k}(\textbf{x}_{k+1}) \right) \right) ^2\right) +\epsilon _k\eta _k^b\\&\le {\tfrac{\rho _k}{2}\left( \left( d_{-}\left( \tfrac{\lambda _k}{\rho _k} +g_{\eta _k}(\textbf{x}_{\eta _k}^*)\right) \right) ^2 \right) +\epsilon _k\eta _k^b} \le {\tfrac{\rho _k}{2}\left( \left( d_{-}\left( g_{\eta _k}(\textbf{x}_{\eta _k}^*)\right) + \tfrac{\Vert \lambda _k\Vert }{\rho _k} \right) ^2 \right) +\epsilon _k\eta _k^b}\\&= {\left( \tfrac{\Vert \lambda _k\Vert ^2}{2\rho _k} \right) +\epsilon _k\eta _k^b}\\ \implies&f(\textbf{x}_{k+1}) - f^* =\underbrace{ f(\textbf{x}_{k+1}) - f_{\eta _k}(\textbf{x}_{k+1})}_{\le \eta _k {\beta }}+ f_{\eta _k}(\textbf{x}_{k+1})- f_{\eta _k}^*+\underbrace{f_{\eta _k}(\textbf{x}_{\eta _k}^*)-f_{\eta _k}(\textbf{x}^*)}_{\le 0 { \tiny \text{ from } (32)}}\\&+ \underbrace{f_{\eta _k}(\textbf{x}^*)- f^*}_{\le 0} \le \eta _k {\beta }+ \tfrac{\Vert \lambda _k\Vert ^2}{{2}\rho _k} + \epsilon _k\eta _k^b. \end{aligned}$$

\(\square \)

We conclude with an overall rate for sub-optimality and infeasibility.

figure m

Proof

Suppose \(\rho _k = \rho _0 \zeta ^k\) where \(\zeta > 1\). By choosing \(\epsilon _k\eta _k^b = \tfrac{1}{k^{2+\delta }\rho _k}\), we have that

$$\begin{aligned} |&f(\textbf{x}_{k+1}) - f^* |\le \max \left\{ \eta _{k} \beta {(1+b_{\lambda }m)}+ \tfrac{\Vert {\lambda _{k+1}\Vert ^2}}{{\rho _{k}}} + \tfrac{\Vert {\lambda _{\eta _k}^*}-{\lambda _{k}}\Vert ^2}{{\rho _{k}}}, \eta _k {\beta }+ \tfrac{\Vert \lambda _k\Vert ^2}{2\rho _k} + {\epsilon _k\eta _k^b} \right\} \\&\le \eta _k {\beta }{(1+b_{\lambda }m)}+\tfrac{2\Vert \lambda _{k+1}\Vert ^2+5\Vert {\lambda _{k}}\Vert ^2 + {4} \Vert {\lambda _{\eta _k}^*}\Vert ^2}{2\rho _k} + {\epsilon _k\eta _k^b} \le \eta _k {\beta }{(1+b_{\lambda }m)} + \tfrac{\tilde{C}_1}{\rho _k} + \tfrac{1}{k^{2+\delta }\rho _k}\\&\le \eta _k {\beta }{(1+b_{\lambda }m)}+ \tfrac{{B_7}}{\rho _k}. \end{aligned}$$

Next, we derive a rate on the infeasibility. Recall from Lemma 4 that \(g(\textbf{x}_{k+1}) \le g_{\eta _k}(\textbf{x}_{k+1}) + \eta _k {\beta } \textbf{1}\), implying that \(d_-(g(\textbf{x}_{k+1})) \le d_-(g_{\eta _k}(\textbf{x}_{k+1}) + \eta _k {\beta } \textbf{1})\). Therefore,

$$\begin{aligned} d_{-}\left( g(\textbf{x}_{k+1})\right)&\le d_-(g_{\eta _k}(\textbf{x}_{k+1}) + \eta _k {\beta } \textbf{1}) \le \left\| \tfrac{\lambda _{k+1}-\lambda _k}{\rho _k}\right\| + \eta _k {\beta } \Vert \textbf{1}\Vert \\&\le \eta _k {\beta m} + \tfrac{2{B_{\lambda }}}{\rho _k} = \eta _k {\beta m} + \tfrac{B_8}{\rho _k}. \end{aligned}$$

4 Overall Complexity Guarantees

In Section 4.1, we begin with some preliminaries, including the derivation of Lipschitzian properties for the smoothed AL function. This allows for employing an accelerated gradient framework for the inexact resolution of the subproblem, leading to suitable complexity guarantees in Section 4.2 for convex and strongly convex regimes. In Section 4.3, overall complexity guarantees for (Sm-AL) with a fixed smoothing parameter are presented.

4.1 Preliminaries

We first derive the L-smoothness of \(\mathcal {L}_{\eta ,\rho }(\bullet ,\lambda )\) uniformly in \(\lambda \). Our bound necessitates utilizing an upper bound on \(\eta \), which we denote by \(\eta ^u\).

Lemma 11

Suppose \({0 \, < \, \eta \, \le \eta ^u}\) and \(\rho \ge {1}\). Then the following hold.

(a) For any \(\lambda \ge 0\), there exists \(\tilde{C}\) such that \(\mathcal {L}_{\eta ,\rho }(\bullet ,\lambda )\) is \(\tfrac{\tilde{C} \rho }{\eta }\)-smooth.

(b) \(\mathcal{L}_{\eta ,\rho }(\textbf{x},\lambda )\) is convex in \(\textbf{x}\in \mathcal{X}\) and concave in \(\lambda \ge 0\). \(\Box \)

Next, we formally state an accelerated gradient method for resolving the augmented Lagrangian subproblem (ALSub\(_{\eta _k,\rho _k}(\lambda _k)\)), defined as

figure n

Suppose \(\textbf{x}_k^*\) denotes an optimal solution of (ALSub\(_{\eta _k,\rho _k}(\lambda _k)\)). Since \(\mathcal {L}_{\eta _k,\rho _k}(\bullet ,\lambda _k)\) is a convex and \({\tfrac{\tilde{C} \rho _k}{\eta _k}}\)-smooth function, we employ an accelerated gradient method that constructs a sequence \(\{ \textbf{y}_j,\textbf{z}_j \}_{j=0}^{M_k}\) as follows, where \(\textbf{z}_0 = \textbf{y}_0 = \textbf{x}_k\).

$$\begin{aligned}\left\{ \begin{aligned} \textbf{y}_{j+1}&= \varPi _{\mathcal{X}} \left[ \, \textbf{z}_j - \beta _j \nabla _{\textbf{x}} \mathcal {L}_{\eta _k,\rho _k}(\textbf{z}_j,\lambda _k) \, \right] \\ \textbf{z}_{j+1}&= \textbf{y}_{j+1} + \gamma _j \left( \textbf{y}_{j+1}- \textbf{y}_j \right) \end{aligned} \right\} , \quad j \ge 0. \end{aligned}$$
(AG)
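A minimal Python sketch of (AG) applied to a generic \(L\)-smooth convex subproblem is given below. The constant steplength \(\beta _j = 1/L\) and the momentum choice \(\gamma _j = j/(j+3)\) are standard options consistent with the cited accelerated schemes but are assumptions here (the parameter sequences used in Theorem 4 are not restated); the box feasible set and the toy quadratic are likewise illustrative.

```python
import numpy as np

def accelerated_gradient(grad, proj, x0, L, M):
    """Sketch of (AG): y_{j+1} = proj(z_j - (1/L) * grad(z_j)),
       z_{j+1} = y_{j+1} + gamma_j * (y_{j+1} - y_j), for j = 0, ..., M-1."""
    y = x0.copy()
    z = x0.copy()
    for j in range(M):
        y_new = proj(z - grad(z) / L)     # projected gradient step with beta_j = 1/L
        gamma = j / (j + 3.0)             # an assumed standard momentum sequence
        z = y_new + gamma * (y_new - y)   # extrapolation step
        y = y_new
    return y

# illustrative use: a separable L-smooth quadratic over the box [-1, 1]^n
n = 10
d = np.linspace(1.0, 10.0, n)               # diagonal Hessian, so L = 10
b = np.linspace(-2.0, 2.0, n)
grad = lambda x: d * x - b                  # gradient of 0.5 * sum(d * x^2) - b.x
proj = lambda x: np.clip(x, -1.0, 1.0)      # projection onto the box
x_sol = accelerated_gradient(grad, proj, np.zeros(n), L=10.0, M=300)
print(np.max(np.abs(x_sol - np.clip(b / d, -1.0, 1.0))))  # error vs. the exact solution
```

Within (Sm-AL), grad would be \(\nabla _{\textbf{x}} \mathcal {L}_{\eta _k,\rho _k}(\cdot ,\lambda _k)\), the smoothness constant would be \(L_k = \tfrac{\tilde{C}\rho _k}{\eta _k}\) from Lemma 11, and M would be the inner iteration count \(M_k\) prescribed in the complexity results of Section 4.2.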

We now restate the convergence guarantees [6, 31, 32, 34] associated with (AG).

figure o

4.2 Complexity guarantees for convex and strongly convex f

We begin by leveraging Theorem 4 to develop complexity guarantees for an \(\varvec{\varepsilon }\)-optimal solution in convex settings, using the rate statement for dual suboptimality (in constant penalty settings) and primal sub-optimality (in increasing penalty settings). Throughout, we recall that the AL subproblem objective is \(L_k\)-smooth, where \(L_k = \tfrac{\tilde{C} \rho _k}{\eta _k}\) and \(\Vert x-y\Vert \le {C_1}\) for any \(x, y \in \mathcal{X}\). Additionally, complexity guarantees are derived by utilizing the rate guarantees presented in Theorem 2 (constant \(\rho _0\)) or Theorem 3 (increasing \(\rho _k\)) to determine the number of outer iterations K; specifically, by these results, ensuring an \(\varvec{\varepsilon }\)-suboptimal solution requires that \(K = \lceil \tfrac{C}{\varvec{\varepsilon }}\rceil \) (constant \(\rho \)) or \(K = \lceil \tfrac{\ln (C/\varvec{\varepsilon })}{\ln (\zeta )}\rceil \) (increasing \(\rho _k\)) for a suitable constant C.

figure p

Proof

(a) By Theorem 4, \(M_k\) is the smallest integer satisfying

$$\begin{aligned} \mathcal {L}_{\rho _k,\eta _k}({\textbf{x}_{k+1}},\lambda _{k}) - {\mathcal {D}_{\rho _k,\eta _k}}(\lambda _k)&\le {\left( \tfrac{{C_1}L_k}{M_k^2}\right) {\,=\,} \left( \tfrac{{C_1}\tilde{C}\rho _0}{\eta _k M_k^2}\right) \le \epsilon _k\eta _k^b} \\ \implies M_k&= {\bigg \lceil \sqrt{\tfrac{{C_1}\tilde{C} \rho _0}{ \epsilon _k \eta _k^{b+1}}}\bigg \rceil = \bigg \lceil \left( \sqrt{ {C_1} \tilde{C} \rho _0} \right) k^{2(1+\delta )}\bigg \rceil }. \end{aligned}$$

Then the iteration complexity of computing a pair \((\bar{\textbf{x}}_K,\bar{\lambda }_K)\) satisfying \(f^*-\mathcal {D}(\bar{\lambda }_K)\le \varvec{\varepsilon }\) is given by

$$\begin{aligned} \sum _{k = 1}^{K(\varvec{\varepsilon })} M_k= \sum _{k = 1}^{{\lceil C/\varvec{\varepsilon }\rceil }}\bigg \lceil \left( \sqrt{ {C_1} \tilde{C} \rho _0} \right) k^{2(1+\delta )}\bigg \rceil = \mathcal {O}\left( \varvec{\varepsilon }^{-(3+{2}\delta )}\right) . \end{aligned}$$

(b) Proceeding similarly, by Theorem 4, \(M_k\) is defined as follows.

$$\begin{aligned} M_k = {\bigg \lceil \sqrt{\tfrac{{C_1}\tilde{C} \rho _k}{ \epsilon _k \eta _k^{b+1}}} \bigg \rceil } = \bigg \lceil \sqrt{\tfrac{{C_1}\tilde{C} \rho _k^2 \eta _k^b k^{(2+\delta )}}{ \eta _k^{b+1}}}\bigg \rceil =\bigg \lceil \left( \sqrt{{C_1}\tilde{C}}\right) \rho _k^{3/2} k^{2+\delta }\bigg \rceil . \end{aligned}$$

Then the iteration complexity of producing an \(\textbf{x}_K\) satisfying \(|f^*-f({\textbf{x}_K})|\,\le \, \varvec{\varepsilon }\) is bounded as follows.

$$\begin{aligned} \sum _{k=1}^{K(\varvec{\varepsilon })} M_k&= \sum _{k=1}^{\lceil \ln {\frac{C}{\varvec{\varepsilon }}}/\ln {\zeta }\rceil }\bigg \lceil \left( \sqrt{{C_1}\tilde{C}}\right) \rho _k^{\frac{3}{2}}k^{(2+\delta )}\bigg \rceil {\, \le \, } {2}{\left( \sqrt{{C_1}\tilde{C}}\right) \rho _0^{\frac{3}{2}}}\sum _{k=1}^{{\log }_{{\zeta }}{\left( \frac{C}{\varvec{\varepsilon }}\right) +1}}\zeta ^{\frac{3}{2}k}k^{(2+\delta )}\\&\le 2{\left( \sqrt{{C_1}\tilde{C}}\right) \rho _0^{3/2}}\left( \lceil \ln {\left( \tfrac{C}{\varvec{\varepsilon }}\right) }+1\rceil \right) ^{3(1+\delta )}\int _{1}^{\ln _{{\zeta }}{\left( \frac{C}{\varvec{\varepsilon }}\right) }{+2}}\zeta ^{\frac{3}{2}u} du \le \tilde{\mathcal {O}}\left( \varvec{\varepsilon }^{-\frac{3}{2}}\right) . \end{aligned}$$
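To make the bookkeeping in (b) concrete, the following small Python sketch tabulates \(\sum _k M_k\) using the expression \(M_k = \lceil (\sqrt{C_1\tilde{C}})\rho _k^{3/2}k^{2+\delta }\rceil \) and \(K(\varvec{\varepsilon }) = \lceil \ln (C/\varvec{\varepsilon })/\ln (\zeta )\rceil \) derived above; the constants \(C_1\tilde{C} = \rho _0 = C = 1\), \(\zeta = 2\), and \(\delta = 0.1\) are illustrative placeholders rather than values prescribed by the analysis.

```python
import math

def total_inner_iterations(eps, rho0=1.0, zeta=2.0, C=1.0, C1Ctilde=1.0, delta=0.1):
    # K(eps) outer iterations under the geometric penalty rho_k = rho0 * zeta^k
    K = math.ceil(math.log(C / eps) / math.log(zeta))
    total = 0
    for k in range(1, K + 1):
        rho_k = rho0 * zeta ** k
        # M_k = ceil( sqrt(C1 * Ctilde) * rho_k^(3/2) * k^(2 + delta) ), as derived in (b)
        total += math.ceil(math.sqrt(C1Ctilde) * rho_k ** 1.5 * k ** (2 + delta))
    return total

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(eps, total_inner_iterations(eps), total_inner_iterations(eps) * eps ** 1.5)
```

The last column should grow only polylogarithmically in \(1/\varvec{\varepsilon }\), consistent with the \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-3/2})\) guarantee.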

Remark 1

(Constant \(\rho \).) Suppose \(\varvec{\varepsilon }\) is a positive scalar. Let \(K \triangleq \lceil C/\varvec{\varepsilon }\rceil \) where C is defined in Proposition 2. Suppose the Sm-AL scheme is run for K iterations, producing \(\bar{\textbf{x}}_K\) and \(\bar{\lambda }_K\). Then we have that

$$\begin{aligned} f^*-\mathcal {D}(\bar{\lambda }_K)\le \varvec{\varepsilon }, |f^*-f(\bar{\textbf{x}}_K)|\le {\mathcal {O}\left( \sqrt{\varvec{\varepsilon }}\right) }, \text { and } d_{-}\left( g(\bar{\textbf{x}}_K)\right) \le {\mathcal {O}\left( \sqrt{\varvec{\varepsilon }}\right) .} \end{aligned}$$

(Increasing \(\rho _k\)). Suppose \(\varvec{\varepsilon }\) is a positive scalar. Let \(K \triangleq \lceil \ln \left( \tfrac{C}{\varvec{\varepsilon }}\right) /\ln \left( \zeta \right) \rceil \) where C is defined in Theorem 3 and \(\rho _k = \rho _0\zeta ^k\) with \(\zeta >1\). Suppose the Sm-AL scheme is run for K iterations, producing \(\textbf{x}_K\), where

$$\begin{aligned} \left| \, f^*-f(\textbf{x}_K) \, \right| \, \le \, \varvec{\varepsilon }\text { and } d_{-}\left( g(\textbf{x}_K)\right) \, \le \, {\mathcal {O}\left( \varvec{\varepsilon }\right) }. \end{aligned}$$

We now present an extension of these results to strongly convex settings.

figure q

Proof

(a) Suppose \(\rho _k = \rho _0\) for all k and let \(M_k\) denote the least number of inner steps taken at iteration k to achieve \((\epsilon _k \eta _k^b)\)-optimality of the subproblem. By Theorem 4 and \(\ln (x) \ge \tfrac{x-1}{x}\) for \(x > 0\),

$$\begin{aligned} \mathcal {L}_{\rho _k,\eta _k}&(\textbf{x}_{{k+1}},\lambda _{k}) - {\mathcal {D}_{\rho _k,\eta _k}(\lambda _k)}\le {\tilde{C}{\tfrac{\rho _0}{\eta _k}}\left( 1-\tfrac{\sqrt{\mu }}{\sqrt{L_k}}\right) ^{M_k}} \, \le \, \epsilon _k\eta _k^b. \end{aligned}$$
$$\begin{aligned} \implies M_k&= {\bigg \lceil \left( \tfrac{\ln \left( \tfrac{\tilde{C}{\rho _0}}{\epsilon _k\eta _k^{{b{+1}}}}\right) }{\ln \left( \tfrac{\sqrt{L_k}}{\sqrt{L_k}-\sqrt{\mu }}\right) }\right) \bigg \rceil } \le {\bigg \lceil \left( \tfrac{\ln \left( \tilde{C}{\rho _0 {k^{4+2\delta }}}\right) }{\left( 1- \left( \tfrac{\sqrt{L_k}-\sqrt{\mu }}{\sqrt{L_k}}\right) \right) }\right) \bigg \rceil }\\&= {\bigg \lceil \left( \tfrac{\ln \left( \tilde{C}{\rho _0 {k^{4+2\delta }}}\right) }{\left( \tfrac{\sqrt{\mu }}{\sqrt{L_k}}\right) }\right) \bigg \rceil } = {\bigg \lceil \left( \tfrac{\sqrt{\tilde{C}\rho _0}\ln \left( \tilde{C}{\rho _0 {k^{4+2\delta }}}\right) }{\left( \sqrt{\mu \eta _k}\right) }\right) \bigg \rceil }\\&= {\bigg \lceil \left( \tfrac{\sqrt{\tilde{C}\rho _0}\ln \left( {(\hat{C}k)^{4+2\delta }}\right) }{\left( \sqrt{\mu \eta _k}\right) }\right) \bigg \rceil }, \text{ where } \hat{C} = {(\tilde{C} \rho _0})^{1/({4}+2\delta )}. \end{aligned}$$

Consequently, since \(K(\varvec{\varepsilon }) = \lceil C/\varvec{\varepsilon }\rceil \) outer steps are required, the overall complexity is

$$\begin{aligned} \sum _{k = 1}^{K(\varvec{\varepsilon })}M_k&= \sum _{k = 1}^{\lceil C/\varvec{\varepsilon }\rceil }{\bigg \lceil \left( \tfrac{\sqrt{\tilde{C}\rho _0}\ln \left( {(\hat{C}k)^{4+2\delta }}\right) }{\left( \sqrt{\mu \eta _k}\right) }\right) \bigg \rceil } = \sum _{k = 1}^{\lceil C/\varvec{\varepsilon }\rceil }{\bigg \lceil \left( \tfrac{(4+2\delta )k^{1+\delta }\sqrt{\tilde{C}\rho _0}\ln \left( {(\hat{C}k)}\right) }{\left( \sqrt{\mu }\right) }\right) \bigg \rceil }\\&\le \mathcal {O}\left( \tfrac{1}{\varvec{\varepsilon }^{{2+{2}\delta }}}\ln \left( \tfrac{1}{\varvec{\varepsilon }}\right) \right) . \end{aligned}$$

(b) Consider \(\rho _k = \rho _0\zeta ^{k}\) where \(k\ge 0\) and \(\zeta >1\). Proceeding as in (a), by Theorem 4 and the bound \(\ln (x) \ge \tfrac{x-1}{x}\) for \(x > 0\),

$$\begin{aligned} \mathcal {L}_{\rho _k,\eta _k}&(\textbf{x}_{k+1},\lambda _{k}) - {\mathcal {D}_{\rho _k,\eta _k}(\lambda _k)} \le \tilde{C}{\left( \tfrac{\rho _k}{\eta _k}\right) }\left( 1-\tfrac{\sqrt{\mu }}{\sqrt{L_k}}\right) ^{M_k} \, \le \, \epsilon _k\eta _k^b \\ \implies M_k&= \bigg \lceil \left( \tfrac{\ln \left( \tfrac{\tilde{C}{\rho _k}}{\epsilon _k\eta _k^{b+1}}\right) }{\ln \left( \tfrac{\sqrt{L_k}}{\sqrt{L_k}-\sqrt{\mu }}\right) }\right) \bigg \rceil \le \bigg \lceil \left( \tfrac{\ln \left( \tilde{C}k^{({4}+{2}\delta )} {\rho _k^{{3}}} \right) }{\left( 1-\tfrac{\sqrt{L_k}-\sqrt{\mu }}{\sqrt{L_k}}\right) }\right) \bigg \rceil \le {\tfrac{2 \sqrt{\rho _k}\ln \left( \rho _k^{{3}} \tilde{C}k^{(4+2\delta )}\right) }{\sqrt{\mu \eta _k}}}. \end{aligned}$$

Consequently, if \(K(\varvec{\varepsilon }) = \lceil \ln (C/\varvec{\varepsilon })/\ln (\zeta ) \rceil = \lceil {\log }_{\zeta }(C/\varvec{\varepsilon })\rceil \) outer steps are employed, then the overall complexity can be bounded as follows.

$$\begin{aligned} \sum _{k = 1}^{K(\varvec{\varepsilon })}M_k&= \sum _{k = 1}^{\lceil {\log }_{\zeta }(C/\varvec{\varepsilon }) \rceil }2\bigg \lceil \tfrac{\sqrt{\rho _k}}{\sqrt{\mu \eta _k}}\ln \left( \rho _k^{{3}} {\tilde{C}}k^{(4+2\delta )}\right) \bigg \rceil \\&\le \sum _{k = 1}^{\lceil {\log }_{\zeta }(C/\varvec{\varepsilon }) \rceil }{\tilde{C}_1 \bigg \lceil \rho _k k^{(1+\delta )} \ln \left( \rho _k^{{3}} \tilde{C}k^{(4+{2}\delta )}\right) \bigg \rceil } \\&\le \rho _0 \zeta ^{(\lceil {\log }_{\zeta }(C/\varvec{\varepsilon })\rceil )} \left( \lceil {\log }_{\zeta }(C/\varvec{\varepsilon }) \rceil \right) ^{(1+\delta )} \\&\times \ln \left( \rho _0^{{3}} \zeta ^{{{3}}(\lceil {\log _{\zeta }}(C/\varvec{\varepsilon })\rceil )} \tilde{C} \left( \lceil {\log _{\zeta }}(C/\varvec{\varepsilon })\rceil \right) ^{(4+{2}\delta )} \right) \\&\le \tilde{\mathcal {O}}\left( \tfrac{1}{\varvec{\varepsilon }}\right) . \end{aligned}$$

\(\square \)

Remark 2

Sm-AL is designed for convex problems with nonsmooth nonlinear convex constraints, achieving an overall complexity of \(\tilde{\mathcal {O}}\left( \varvec{\varepsilon }^{-3/2}\right) \) under geometric growth of \(\rho _k\); this is slightly worse than the best known complexity of \({\mathcal {O}}(\varvec{\varepsilon }^{-1})\) (up to logarithmic terms) for settings with smooth nonlinear constraints (cf. [26, 44]).

4.3 Complexity analysis for (Sm-AL) with fixed \(\eta \)

Next, we apply (Sm-AL) to (NSCopt\(_{\eta }\)) with a fixed, appropriately chosen \(\eta \), with the goal of finding an \((\bar{\textbf{x}}_K,\bar{\lambda }_K)\) such that either the dual suboptimality is sufficiently small, i.e., \(f_{\eta }^* - \mathcal{D}_{\eta ,0}(\bar{\lambda }_K) \, \le \, \varvec{\varepsilon }\) (constant \(\rho _k = \rho _0\)), or the primal suboptimality is sufficiently small, i.e., \(|f_{\eta }(\textbf{x}_K) - f_{\eta }^*| < \varvec{\varepsilon }\) (geometrically increasing \(\rho _k\)).

(a) (Constant \(\rho \)) Suppose \({\eta \le \tilde{c}\varvec{\varepsilon }}\), where \(\tilde{c}\) is specified below. After K steps of (Sm-AL), \(f_{\eta }^* - \mathcal{D}_{\eta ,0}(\bar{\lambda }_K) \, \le \, \tfrac{\varvec{\varepsilon }}{2}\), where \(K = \bigg \lceil \tfrac{C}{\varvec{\varepsilon }}\bigg \rceil \) for a suitable C. By Lemma 4,

$$\begin{aligned} f(\textbf{x}^*) - \mathcal{D}_0(\bar{\lambda }_K) \,&\le \, f_{\eta }(\textbf{x}^*) + \eta \beta - \mathcal{D}_{\eta ,0}(\bar{\lambda }_K) + \eta (\Vert \bar{\lambda }_K\Vert m + 1) \beta \\ \,&\le \, {\underbrace{f_{\eta }(\textbf{x}_{\eta }^*) - \mathcal{D}_{\eta ,0}(\bar{\lambda }_K)}_{{\, \le \, \tfrac{\varvec{\varepsilon }}{2}}}} + \underbrace{\eta \left( \beta ({\tilde{B}}_{\lambda } m+2)\right) }_{{\, \le \, \tfrac{\varvec{\varepsilon }}{2}}} \, \le \, {\varvec{\varepsilon }}. \end{aligned}$$

To ensure that the second term is less than \(\varvec{\varepsilon }/2\), we select \(\eta \le \tfrac{\varvec{\varepsilon }}{2 \left( \beta (2 + {\tilde{B}}_{\lambda } m)\right) }\).

(b) (Geometrically increasing \(\rho _k\)). Proceeding similarly, suppose \({\eta \le \tilde{c} \varvec{\varepsilon }}\); then, by taking K steps of (Sm-AL), \(|f_{\eta }(\textbf{x}_K) - f_{\eta }^*| \, \le \, \tfrac{\varvec{\varepsilon }}{2}\), where \(K = \lceil \ln \left( \tfrac{C}{\varvec{\varepsilon }}\right) /\ln \left( \zeta \right) \rceil \) for a suitable C. Consequently, if \(\eta \le \tfrac{\varvec{\varepsilon }}{2\beta }\), then \(f(\textbf{x}_K) - f^* \le \varvec{\varepsilon }\), as seen from the following.

$$\begin{aligned} f(\textbf{x}_K) - f^* \, \le \, f_{\eta }(\textbf{x}_K) - f_{\eta }(\textbf{x}^*) + \eta \beta \, \le \, \underbrace{f_{\eta }(\textbf{x}_K) - f_{\eta }(\textbf{x}_{\eta }^*)}_{\le \tfrac{\varvec{\varepsilon }}{2}} + \underbrace{\eta \beta }_{\le \tfrac{\varvec{\varepsilon }}{2}} \, \le \, {\varvec{\varepsilon }}. \end{aligned}$$

Similarly, \(f^* - f(\textbf{x}_K) \le \varvec{\varepsilon }\) under the same choice of \(\eta \), implying that \(| f(\textbf{x}_K)-f^*| \le \varvec{\varepsilon }\) whenever \(\eta \le \tfrac{\varvec{\varepsilon }}{2\beta }\).
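As a simple numerical illustration (with purely hypothetical values \(\beta = 1\), \(\tilde{B}_{\lambda } m = 3\), and \(\varvec{\varepsilon } = 10^{-2}\), not constants arising from the analysis), the prescription in (a) requires

$$\begin{aligned} \eta \, \le \, \tfrac{\varvec{\varepsilon }}{2\left( \beta (2 + \tilde{B}_{\lambda } m)\right) } = \tfrac{10^{-2}}{2\cdot 5} = 10^{-3}, \end{aligned}$$

whereas the prescription in (b) only requires \(\eta \le \tfrac{\varvec{\varepsilon }}{2\beta } = 5\times 10^{-3}\); the dual-gap criterion in (a) thus dictates a smaller smoothing parameter in this instance.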

figure r

Proof

(a) By Theorem 4, \(M_k\) is the smallest integer satisfying

$$\begin{aligned}&\mathcal {L}_{\rho _k,\eta }(\textbf{x}_{k+1},\lambda _k) - {\mathcal {D}_{\rho _k,\eta }(\lambda _k)}\le \left( \tfrac{{C_1}L_k}{M_k^2}\right) \le \left( \tfrac{{C_1}\tilde{C}\rho _0}{\eta M_k^2}\right) \le \epsilon _k\\ \implies&M_k = {\bigg \lceil \sqrt{\tfrac{{C_1}\tilde{C}\rho _0}{\epsilon _k\eta }}\bigg \rceil } = \bigg \lceil \left( \sqrt{ \tfrac{{2{C_1}\tilde{C} \left( \beta (2 + B_{\lambda } m)\right) }\rho _0}{\varvec{\varepsilon }}}\right) k^{1+\delta }\bigg \rceil = \bigg \lceil \left( \sqrt{\tfrac{D\rho _0}{\varvec{\varepsilon }}}\right) k^{1+\delta }\bigg \rceil \end{aligned}$$

where \(C_1, \tilde{C}, \beta , B_{\lambda }\) are constants and \(D \triangleq 2C_1\tilde{C} \left( \beta (2 + B_{\lambda } m)\right) \). Then the complexity of computing a pair \((\bar{\textbf{x}}_K,\bar{\lambda }_K)\) satisfying \(f^* - \mathcal {D}_0(\bar{\lambda }_K) \le \varvec{\varepsilon }\) is bounded as follows.

$$\begin{aligned} \sum _{k = 1}^{K(\varvec{\varepsilon })} M_k = \sum _{k = 1}^{\lceil C/\varvec{\varepsilon }\rceil } \bigg \lceil \left( \sqrt{\tfrac{D\rho _0}{\varvec{\varepsilon }}}\right) k^{1+\delta }\bigg \rceil = {\sqrt{D\rho _0}\varvec{\varepsilon }^{-\tfrac{1}{2}} \sum _{k = 1}^{\lceil C/\varvec{\varepsilon }\rceil }\bigg \lceil k^{1+\delta }\bigg \rceil }\le \mathcal {O}\left( \varvec{\varepsilon }^{-\left( \frac{5}{2} + \delta \right) }\right) . \end{aligned}$$

(b) Consider \(\rho _k = \rho _0\zeta ^{k}\) where \(k\ge 0\) and \(\zeta >1\). Proceeding as in (a) and by invoking Theorem 4,

$$\begin{aligned} M_k = {\bigg \lceil \sqrt{\tfrac{{C_1}\tilde{C}\rho _k}{\epsilon _k\eta }}\bigg \rceil }= \bigg \lceil \sqrt{\tfrac{2C_1\tilde{C}\beta }{\varvec{\varepsilon }}}\rho _kk^{1+\delta }\bigg \rceil = \bigg \lceil \sqrt{\tfrac{D}{\varvec{\varepsilon }}}\rho _kk^{1+\delta }\bigg \rceil \end{aligned}$$

where \(C_1, \tilde{C}, \beta \) are constants and \(D \triangleq 2C_1\tilde{C}\beta \). Then the iteration complexity of producing an \(\textbf{x}_K\) satisfying \(|f^* - f(\textbf{x}_K)| \le \varvec{\varepsilon }\) admits the following bound, where \(C, D > 0\).

$$\begin{aligned}&\sum _{k=1}^{K(\varvec{\varepsilon })} M_k = \sum _{k=1}^{\lceil \ln {\frac{C}{\varvec{\varepsilon }}}/\ln {\zeta }\rceil }\bigg \lceil \left( \sqrt{\tfrac{D}{\varvec{\varepsilon }}}\right) \rho _kk^{(1+\delta )}\bigg \rceil \, \le \, \left( \sqrt{\tfrac{D}{\varvec{\varepsilon }}}\right) \rho _0^2\sum _{k=1}^{\log _{\zeta }{\left( \frac{C}{\varvec{\varepsilon }}\right) }+1}\zeta ^{k}k^{(1+\delta )}\\&\le \sqrt{D}\rho _0^2\varvec{\varepsilon }^{-\tfrac{1}{2}} \left( \lceil \log _{\zeta }{\left( \tfrac{C}{\varvec{\varepsilon }}\right) }+1\rceil \right) ^{2(1+\delta )}\int _{1}^{\log _{\zeta }{\left( \frac{C}{\varvec{\varepsilon }}\right) }+2}\zeta ^{u} du \le \tilde{\mathcal {O}}\left( \varvec{\varepsilon }^{-\frac{3}{2}}\right) . \end{aligned}$$

\(\square \)

Remark 3

We observe that the complexity guarantees are close to those for diminishing \(\eta _k\), with a slight improvement in the constant \(\rho _0\) regime. We recall that Nesterov [33] and Beck and Teboulle [7] adopted different smoothing techniques with fixed \(\eta \) to obtain an \(\varvec{\varepsilon }\)-optimal solution within \(\mathcal {O}(1/\varvec{\varepsilon })\) iterations. Compared to the smoothing schemes in [7, 33], Sm-AL targets problems with nonsmooth constraint functions. Moreover, Sm-AL accommodates both fixed and varying \(\eta \), with an effective complexity rate of \(\tilde{\mathcal {O}}(\varvec{\varepsilon }^{-3/2})\), matching the complexity of a smoothed penalized scheme [3].

Table 2 summarizes rates and complexities for Sm-AL, Sm-AL(\(\eta \)), Sm-AL(S), and N-AL, where (a) Sm-AL is the smoothed ALM for convex problems; (b) Sm-AL(\(\eta \)) is the \(\eta \)-smoothed ALM; (c) Sm-AL(S) is Sm-AL for strongly convex problems; and (d) N-AL is the original ALM for nonsmooth problems. Additionally, Table 3 collects all of the constants utilized in the results from Sections 3 and 4.

Table 2 Rates & Complexity
Table 3 Constants in Theorems/Propositions

5 Numerical Experiments

5.1 Fused lasso problems

In this section, we apply (Sm-AL) to a fused lasso problem with a dataset \(\left\{ X_i, y_i\right\} _{i = 1}^N\), where \(X_i\) is the d-dimensional feature vector for the ith instance and \(y_i\) is the corresponding response. Consider the \(\eta \)-smoothing of (1).

$$\begin{aligned} \min _{\beta \in {\mathcal {X}}}&\quad \Vert Y-X^\top \beta \Vert ^2 \\ \mathop {\mathrm {subject\;to}}\limits&\quad \sum _{j}\left( \sqrt{\beta _j^2 + \eta ^2} - \eta \right) \le C_1, \sum _{j}\left( \sqrt{(\beta _j-\beta _{j-1})^2+\eta ^2}-\eta \right) \le C_2. \end{aligned}$$
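For concreteness, a minimal NumPy sketch of the two \(\eta \)-smoothed constraint functions and their gradients is provided below. The sample data, the values of \(\eta \), \(C_1\), and \(C_2\), and the convention that the fused term runs over consecutive pairs \((\beta _{j-1},\beta _j)\) are illustrative assumptions rather than the exact experimental setup.

```python
import numpy as np

def smoothed_l1(beta, eta):
    # eta-smoothing of sum_j |beta_j|: sum_j ( sqrt(beta_j^2 + eta^2) - eta )
    return np.sum(np.sqrt(beta ** 2 + eta ** 2) - eta)

def smoothed_l1_grad(beta, eta):
    return beta / np.sqrt(beta ** 2 + eta ** 2)

def smoothed_fused(beta, eta):
    d = np.diff(beta)  # consecutive differences beta_j - beta_{j-1}
    return np.sum(np.sqrt(d ** 2 + eta ** 2) - eta)

def smoothed_fused_grad(beta, eta):
    d = np.diff(beta)
    w = d / np.sqrt(d ** 2 + eta ** 2)
    g = np.zeros_like(beta)
    g[1:] += w    # contribution through beta_j
    g[:-1] -= w   # contribution through beta_{j-1}
    return g

# illustrative simulated data (not the datasets used in Table 4)
rng = np.random.default_rng(1)
n, N = 5, 50
X = rng.normal(size=(n, N))                 # columns are feature vectors X_i
beta_true = np.array([1.0, 1.0, 0.0, 0.0, -1.0])
Y = X.T @ beta_true + 0.1 * rng.normal(size=N)

eta, C1, C2 = 1e-2, 3.0, 2.0                # illustrative parameter choices
beta = np.zeros(n)
obj = np.linalg.norm(Y - X.T @ beta) ** 2   # objective ||Y - X^T beta||^2
g1 = smoothed_l1(beta, eta) - C1            # smoothed constraint g_{1,eta}(beta) <= 0
g2 = smoothed_fused(beta, eta) - C2         # smoothed constraint g_{2,eta}(beta) <= 0
print(obj, g1, g2, smoothed_l1_grad(beta, eta), smoothed_fused_grad(beta, eta))
```

These smoothed constraints, together with the quadratic objective, are the ingredients entering the augmented Lagrangian subproblem solved within (Sm-AL).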

We conducted experiments on simulated datasets with the dimension of \(\beta \) ranging from 5 to 1000; the results are shown in Table 4. The optimal solution for each experiment was obtained using fmincon in Matlab. In Table 4, we compare the results from Sm-AL with those from N-AL. Both Sm-AL and N-AL terminated at 50 outer iterations, except that the \(n = 1000\) case for Sm-AL was stopped at the 30th outer iteration to save time. N-AL was terminated when the overall runtime exceeded two hours for higher dimensional problems. In all cases, Sm-AL outperforms N-AL with respect to primal suboptimality and overall runtime.

Table 4 Numerical results

Next, we compare the results from Sm-AL with AL on an \(\eta \)-smoothed problem for a single instance (\(n = 5\)). We observe that such fixed-smoothing avenues provide relatively coarse approximations compared to their iteratively smoothed counterparts. Finally, we compare empirical rates of Sm-AL in two settings of \(\rho _k\) for a smaller problem (\(n = 5\)) in terms of primal suboptimality in Figure 1 and observe alignment with the theoretical rates, represented by blue lines with triangular markers.

The following insights were derived from the analysis of primal suboptimality, as shown in Figure 1.

  (i) First, employing a constant \(\eta \) leads to a sequence that converges to an approximate solution, while a diminishing \(\eta _k\) allows for asymptotic convergence to a true solution.

  (ii) Second, choosing a very small \(\eta \) may impede early progress of the scheme, since this leads to a large Lipschitz constant L, constraining the steplength. On the other hand, selecting a larger \(\eta \) allows for better early progress, but the sequence will converge to a solution that may differ significantly from the true solution. A diminishing \(\eta _k\) sequence starts with a larger \(\eta \) (allowing for larger steps and greater progress) but comes with a guarantee that the sequence will converge to a true solution. This is reflected in Figure 1.

  (iii) Third, the complexity guarantees for constant \(\eta \) are close to those for diminishing \(\eta _k\), with a slight improvement in the constant \(\rho _0\) regime (see Theorem X and Remark 3). When compared to the results in Proposition 2 with constant \(\rho \), Sm-AL with constant \(\eta \) improves the overall complexity by \(\mathcal {O}\left( \varvec{\varepsilon }^{-1/2}\right) \). The diminishing nature of \(\eta _k\) slows convergence owing to the additional summability requirement on \(\eta _k\).

Fig. 1
figure 1
Primal suboptimality for fused lasso problems under constant (left) and increasing (right) \(\rho _k\)

5.2 Incorporation of termination criteria

Next, we consider the introduction of termination criteria T1 and T2 and examine the impact of potential early termination, measured by \(\sum _{k}N_k\). Table 5 provides a comparison of the Sm-AL scheme with and without termination criteria. It can be observed that the incorporation of these termination criteria leads to significant computational benefits with little (if any) impact on accuracy. A natural question lies in the choice of \(\gamma \) in the definition of the residual function \(F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}\). We observe that when \(\textbf{x}\in \mathcal{X}\),

$$\begin{aligned} \left\| F^{{\textrm{nat}},\tilde{\epsilon }}_\mathcal{X}(\textbf{x},\textbf{u}) \right\|&= \left\| \tilde{\epsilon } \left( \textbf{u}- \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] \right) - \textbf{x}+ \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] \right\| \\&\le \left\| \tilde{\epsilon } \left( \varPi _\mathcal{X}[\textbf{u}] - \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] \right) \right\| + \left\| \varPi _\mathcal{X}[\textbf{x}] - \varPi _\mathcal{X} \left[ \, \textbf{x}- \gamma \nabla h(\textbf{x}) \, \right] \right\| \\&\le \left\| \tilde{\epsilon } \left( \textbf{u}- \textbf{x}+ \gamma \nabla h(\textbf{x}) \, \right) \right\| + \left\| \gamma \nabla h(\textbf{x}) \, \right\| \\&\le 2\tilde{\epsilon } {C} + \gamma (1+\tilde{\epsilon }) {D}, \end{aligned}$$

where \(\Vert \textbf{x}\Vert \le {C}\) and \(\Vert \nabla h(\textbf{x})\Vert \le {D}\) for any \(\textbf{x}, \textbf{u} \in \mathcal X\). From the above bound, it may be observed that small choices of \(\gamma \) may lead to early satisfaction of conditions T1 or T2, while larger choices of \(\gamma \) may require significantly more iterations. Ideally, since we have already developed convergence guarantees, it would be helpful to relate \(\gamma \) to \(\eta _k\). Some preliminary numerics are provided in Table 6, where the choice of \(\gamma \) is varied in condition T2, leading to some variability in performance. It can be surmised from this table that a constant \(\gamma \) leads to poorer performance, while diminishing choices of \(\gamma \) lead to far superior behavior. This is perhaps unsurprising in that, for larger values of K, \(\gamma \) is smaller and imposes a more modest threshold for satisfying the condition, thereby allowing for earlier termination.
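To illustrate the dependence on \(\gamma \) discussed above, the following minimal sketch evaluates the map inside the displayed norm for several values of \(\gamma \); the box feasible set, the quadratic \(h\), the point \(\textbf{x}\), and the reference point \(\textbf{u}\) are illustrative assumptions, and the precise tests T1 and T2 (defined earlier in the paper) are not reproduced here.

```python
import numpy as np

def proj_X(x, lo=-1.0, hi=1.0):
    # illustrative feasible set: the box [lo, hi]^n
    return np.clip(x, lo, hi)

def nat_residual(x, u, grad_h, gamma, eps_tilde):
    # F^{nat, eps~}_X(x, u) = eps~ * (u - Pi_X[x - gamma * grad h(x)]) - x + Pi_X[x - gamma * grad h(x)]
    p = proj_X(x - gamma * grad_h(x))
    return eps_tilde * (u - p) - x + p

# illustrative data: h(x) = 0.5 * ||x - c||^2, so grad h(x) = x - c
c = np.array([0.3, -2.0, 0.8])
grad_h = lambda x: x - c
x = np.array([0.5, -0.9, 0.2])
u = np.zeros_like(x)

for gamma in [1.0, 0.1, 0.01]:
    r = np.linalg.norm(nat_residual(x, u, grad_h, gamma, eps_tilde=1e-2))
    print(gamma, r)   # smaller gamma yields a smaller residual norm, easing a T2-type check
```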

6 Conclusion

In this paper, we develop a smoothed AL scheme for resolving convex programs with possibly nonsmooth constraints and provide rate and complexity guarantees for convex and strongly convex settings under constant and increasing penalty parameter sequences. The complexity guarantees represent significant improvements over the best available guarantees for AL schemes applied to convex programs with nonsmooth objectives and constraints. A by-product of our analysis is a relationship between saddle points of \(\eta \)-smoothed problems and \(\eta \)-saddle points of the original problem. Moreover, to improve the practical behavior of the proposed Sm-AL scheme, we have developed termination criteria that allow for early termination. Our preliminary numerics suggest that such criteria lead to significant improvements in the complexity of our scheme with modest impacts on the accuracy of the resulting solutions. We believe that our findings represent a foundation for considering extensions to compositional regimes with expectation-valued and possibly nonsmooth constraints.

Table 5 Numerical results with termination criteria
Table 6 Performance vs choice of \(\gamma \)