Abstract
Unbalanced dimensions are a crucial characteristic of various minimax optimization problems, such as few-shot learning (Cortes and Mohri in Adv Neural Inf Process Syst 16, 2003; Ying et al. in Adv Neural Inf Process Syst 29, 2016) and fairness-aware machine learning (Lowd and Meek, in: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining, 2005; Zhang et al., in: Proceedings of the 2018 AAAI/ACM conference on AI, ethics, and society, 2018). In this paper, we propose a communication-efficient second-order method named PANDA (Partially Approximate Newton methods for Distributed minimAx) to solve problems with unbalanced dimensions. PANDA requires almost the same per-iteration communication cost as first-order methods by exploiting the special problem structure in the design of the data exchange between the clients and the server. More importantly, it exhibits a superior linear-quadratic convergence rate and significantly reduces the total number of communication rounds through the efficient use of second-order information. We also develop GIANT-PANDA based on the framework of PANDA, which further reduces the computation cost of PANDA by performing sketching operations on each client. Through comprehensive theoretical analysis and empirical evaluations, we demonstrate the superior performance of the proposed methods compared to existing state-of-the-art methods.
1 Introduction
We consider a class of minimax optimization problems formulated as finite-sum expressions:
where N denotes the sample size, and \(n_\textbf{x}\) and \(n_\textbf{y}\) denote the dimensions of the variables \(\textbf{x}\) and \(\textbf{y}\), respectively. The smooth function \(f(\textbf{x}, \textbf{y})\) is supposed to be strongly convex in \(\textbf{x}\) and strongly concave in \(\textbf{y}\) (i.e., satisfying the SC-SC condition). For many machine learning problems with a large sample size N, distributed methods are often preferable for solving problems in a parallel fashion. To facilitate our study in a distributed setting, let us distribute the N samples across m clients so that the i-th client holds \(|S_i|\) samples. Consequently, we have \(N=\sum _{i=1}^m|S_i|\), leading us to the following alternative problem:
In this context, for each client \(i=1,\ldots ,m\), \(f^i\) represents its local function and \(S_i\) is the index set of local samples.
Minimax optimization has gained significant attention in the data mining and machine learning community due to its broad applications in various domains, including game theory (Basar & Olsder, 1999; Facchinei, 2003), supervised learning (Lanckriet et al., 2002), robust optimization (Ben-Tal & Nemirovski, 2002; Deng & Mahdavi, 2021; Gao & Kleywegt, 2022), and fairness-aware machine learning (Creswell et al., 2018; Liu et al., 2020). Among these applications, many share a critical property that the dimensions of the variables \(\textbf{x}\) and \(\textbf{y}\) are unbalanced (Liu et al., 2022). For instance, AUC maximization (Cortes & Mohri, 2003; Ying et al., 2016) aims to train a binary classifier on imbalanced datasets \(\{{\textbf{a}}_j,b_j\}_{j=1}^{N}\), where \({\textbf{a}}_j\) denotes the input with d features and \(b_j\in \{-1,+1\}\) denotes the label. This problem can be formulated as a minimax problem with \(n_\textbf{x}=d+2\) and \(n_\textbf{y}=1\). Additionally, in fairness-aware machine learning tasks (Lowd & Meek, 2005; Zhang et al., 2018), we are given a training set \(\{{\textbf{a}}_j,b_j, {\textbf{c}}_j\}_{j=1}^{N}\), where \({\textbf{a}}_j\) represents d-dimensional features to learn from, \({\textbf{c}}_j\) contains s protected features, and \(b_j\) denotes the label. In this case, we have \(n_\textbf{x}=d\gg n_\textbf{y}=s\). Throughout this paper, we use the term “unbalanced dimensions” to describe the above special problem structure, which can be expressed as \(n_\textbf{x}\gg n_\textbf{y}\) or \(n_\textbf{y}\gg n_\textbf{x}\).
There are numerous first-order methods for solving minimax optimization problems in (1), including gradient descent ascent, extra gradient, and many of their variants (Chavdarova et al., 2019; Hsieh et al., 2019; Korpelevich, 1976; Lin et al., 2020; Malitsky, 2015; Mishchenko et al., 2020; Nedić & Ozdaglar, 2009; Nouiehed et al., 2019; Tseng, 2000). These first-order methods can be straightforwardly generalized to solve distributed problems in (2), by simply aggregating the gradients to the server at each iteration. Distributed first-order methods that perform multiple local iterations for each client before communication have also been proposed to solve minimax problems (Deng & Mahdavi, 2021; Sun & Wei, 2022; Zhang et al., 2024). Among these methods, Zhang et al. (2024) achieved the best-known results in terms of the total communication rounds, with an order of \({\mathcal {O}}(\kappa _\textbf{g}\ln (1/\epsilon ))\) where \(\kappa _\textbf{g}\) is the condition number of the objective (cf. Sect. 2.1) and \(\epsilon\) is the desired accuracy. It is worth noting that the per-iteration communication complexity between the client and server for first-order methods is only \({\mathcal {O}}(n_\textbf{x}+n_\textbf{y})\). However, these methods often require a substantial number of communication rounds to attain an accurate solution (e.g., their communication rounds depend heavily on the condition number). As a result, factors such as unpredictable network latency can lead to expensive total communication costs.
Second-order methods are well-known for their fast convergence rates, brought about by the utilization of the Hessian information of the objective. The cubic regularized Newton method (Huang et al., 2022) and its restart variant (Huang & Zhang, 2022) have been proposed for solving (1) and demonstrated local superlinear convergence rates under the SC-SC condition. However, applying these methods directly in a distributed setting would require the communication of the full local Hessian matrix, resulting in a per-iteration communication complexity of \(\mathcal {O}((n_\textbf{x}+n_\textbf{y})^2)\). This communication overhead is unacceptable due to bandwidth limitations. On the other hand, several communication-efficient distributed second-order methods (Islamov et al., 2022; Liu et al., 2024; Shamir et al., 2014; Wang & Li, 2020; Ye et al., 2022) have been proposed for convex optimization problems, eliminating the need for full Hessian communication. However, the design and analysis of these communication-efficient second-order methods heavily depend on the convexity of the objective function, making it challenging to generalize them to the minimax setting. Building upon this, it is natural to ask: Is it possible to develop communication-efficient distributed second-order methods for minimax optimization by leveraging the structure of “unbalanced dimensions”?
In this paper, we provide an affirmative answer to this question by introducing PANDA (Partially Approximate Newton methods for Distributed minimAx optimization). In each iteration, PANDA avoids communicating the full Hessian matrix and instead requires only the exchange of the partial Hessian blocks associated with \(\textbf{y}\) (recall that we suppose \(n_\textbf{x}\gg n_\textbf{y}\) in this paper), i.e., \(\nabla ^2_{\textbf{y}\textbf{y}} f^{i}(\textbf{x},\textbf{y})\in {\mathbb {R}}^{n_\textbf{y}\times n_\textbf{y}}\), \(\nabla ^2_{\textbf{x}\textbf{y}} f^{i}(\textbf{x},\textbf{y})\in {\mathbb {R}}^{n_\textbf{x}\times n_\textbf{y}}\), and \([\nabla ^2_{\textbf{x}\textbf{x}}f^{i}(\textbf{x},\textbf{y})]^{-1}\nabla ^2_{\textbf{x}\textbf{y}} f(\textbf{x},\textbf{y})\in {\mathbb {R}}^{n_{\textbf{x}}\times n_\textbf{y}}\). Additionally, it exchanges vectors such as gradients and local descent directions. As a result, the per-iteration communication complexity of PANDA can be summarized as:
This complexity significantly reduces that of typical second-order methods, bringing it to the same order as first-order methods. Furthermore, the utilization of second-order information in PANDA results in improved convergence behavior compared to existing distributed first-order methods.
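To make the comparison concrete, the per-iteration traffic can be tallied as follows (a minimal illustrative sketch in Python; the exact constant factors are our assumption, not part of the paper's formal accounting):

```python
def comm_per_iter(n_x, n_y):
    """Illustrative per-iteration communication in floats (per client, per round).

    first_order: gradients only, O(n_x + n_y).
    full_newton: shipping the entire local Hessian, O((n_x + n_y)^2).
    panda: partial Hessian blocks H_yy (n_y^2) and H_xy (n_x * n_y),
           the product [H_xx]^{-1} H_xy (n_x * n_y), plus gradient-sized vectors.
    """
    first_order = n_x + n_y
    full_newton = (n_x + n_y) ** 2
    panda = n_y ** 2 + 2 * n_x * n_y + 2 * (n_x + n_y)
    return first_order, full_newton, panda

# Unbalanced dimensions (n_x >> n_y): PANDA stays within a small constant
# factor of the first-order cost, far below the full-Hessian cost.
fo, fn, pa = comm_per_iter(n_x=10**5, n_y=3)
```

With \(n_\textbf{y}\) treated as a small constant, the PANDA count is \({\mathcal {O}}(n_\textbf{x}n_\textbf{y})\approx {\mathcal {O}}(n_\textbf{x}+n_\textbf{y})\), whereas the full-Hessian count is quadratic in \(n_\textbf{x}\).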
1.1 Contribution
The contribution of this paper is threefold.
-
(a)
We develop a Partially Approximate Newton (PAN) method to solve the general minimax problem (1) with unbalanced dimensions. If the approximate Hessian \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\) satisfies \((1-\eta ) \nabla ^2_{\textbf{x}\textbf{x}} f\preceq \tilde{\textbf{H}}_{\textbf{x}\textbf{x}} \preceq (1+\eta ) \nabla ^2_{\textbf{x}\textbf{x}} f\) with \(\eta \in (0, 1)\), then PAN exhibits a linear-quadratic convergence rate for some measure \(\lambda _t\):
$$\begin{aligned} \lambda _{t+1}\le \frac{\eta }{1-\eta }\lambda _t +\beta \lambda _t^2. \end{aligned}$$This result of PAN generalizes the approximate Newton method for convex optimization in Ye et al. (2021) and relaxes the conditions in Liu et al. (2022, Lemma 4.3) for minimax optimization.
-
(b)
We develop the PANDA method to solve the distributed minimax problem (2). If the local partial Hessian satisfies \((1-\eta )\nabla ^2_{\textbf{x}\textbf{x}} f\preceq \nabla ^2_{\textbf{x}\textbf{x}} f^i\preceq (1+\eta ) \nabla ^2_{\textbf{x}\textbf{x}} f\) with \(\eta \in (0,1)\), then PANDA exhibits a linear-quadratic convergence rate:
$$\begin{aligned} \lambda _{t+1}\le \frac{\eta ^2}{1-\eta } \lambda _t +\beta _1 \lambda _t^2. \end{aligned}$$Furthermore, we can guarantee the existence of \(\eta\) provided that \(N\ge {\mathcal {O}}(mK/\mu )\), where m is the number of clients, \(K=\max _j\Vert \nabla ^2_{\textbf{x}\textbf{x}}l_j\Vert\), and \(\mu\) is the strong convexity parameter (cf. Sect. 2.1).
-
(c)
We develop the GIANT-PANDA method, which employs matrix sketching techniques on each local client to construct the partial approximate Hessian \(\tilde{\textbf{H}}^i_{\textbf{x}\textbf{x}}\). This method exhibits a linear-quadratic convergence rate:
$$\begin{aligned} \lambda _{t+1}\le \left( \frac{\eta }{\sqrt{m}}+\frac{\eta ^2}{1-\eta }\right) \lambda _t +\beta _2 \lambda _t^2. \end{aligned}$$This result leads to a sharper analysis compared to the original GIANT method in (Wang et al., 2018), as it improves the convergence rate by a factor of \(\sqrt{\kappa _\textbf{g}}\) in the linear term.
Organization We introduce fundamental notation, assumptions, and preliminary results for Hessian approximation in Sect. 2. In Sects. 3 and 4, we introduce PAN for solving (1) and introduce PANDA (along with GIANT-PANDA) for solving (2), respectively. We conduct empirical studies in Sect. 5 and provide conclusions in Sect. 6. All proofs are deferred to the appendix.
2 Preliminaries
2.1 Notation and assumptions
We use \(\textbf{g}_\textbf{x}(\textbf{x},\textbf{y})\) and \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x},\textbf{y})\) to denote the gradient \(\nabla _{\textbf{x}} f(\textbf{x},\textbf{y})\) and Hessian \(\nabla ^2_{\textbf{x}\textbf{x}} f(\textbf{x},\textbf{y})\) with respect to \(\textbf{x}\), respectively (similar for \(\textbf{g}_\textbf{y}\), \(\textbf{H}_{\textbf{x}\textbf{y}}\), \(\textbf{H}_{\textbf{y}\textbf{y}}\)). For the local gradient and Hessian associated with the i-th client, we use \(\textbf{g}_\textbf{x}^i(\textbf{x},\textbf{y})\) and \(\textbf{H}_{\textbf{x}\textbf{x}}^{i}(\textbf{x},\textbf{y})\) to denote \(\nabla _{\textbf{x}} f^i(\textbf{x},\textbf{y})\) and \(\nabla ^2_{\textbf{x}\textbf{x}} f^i(\textbf{x},\textbf{y})\) (similar for \(\textbf{g}_\textbf{y}^{i}\), \(\textbf{H}_{\textbf{x}\textbf{y}}^i\), \(\textbf{H}_{\textbf{y}\textbf{y}}^i\)). We use \(\left\| \,\cdot \, \right\|\) to denote the spectral norm for matrices and the Euclidean norm for vectors. Additionally, we define the matrix row coherence as follows.
Definition 2.1
(Wang et al., 2018, Definition 1) Let \(\textbf{A}\in {\mathbb {R}}^{N\times d}\) be a matrix with full column rank and \(\textbf{A}=\textbf{U}\Sigma \textbf{V}^\top\) be its reduced singular value decomposition with \(\textbf{U}, \textbf{V}\in {\mathbb {R}}^{N\times d}\). The row coherence of \(\textbf{A}\) is defined as \(\nu (\textbf{A})\overset{\textrm{def}}{=}\frac{N}{d}\max _j \Vert \textbf{u}_j\Vert ^2 \in [1, \frac{N}{d}]\), where \(\textbf{u}_j\) is the j-th row of \(\textbf{U}\).
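For intuition, the row coherence can be computed directly from the reduced SVD; below is a small NumPy sketch of Definition 2.1 (the matrices are toy examples of our own choosing):

```python
import numpy as np

def row_coherence(A):
    """Row coherence nu(A) = (N/d) * max_j ||u_j||^2, where u_j is the
    j-th row of U in the reduced SVD A = U Sigma V^T (Definition 2.1)."""
    N, d = A.shape
    U, _, _ = np.linalg.svd(A, full_matrices=False)  # U has shape (N, d)
    return (N / d) * np.max(np.sum(U ** 2, axis=1))

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))  # random matrices have low coherence
nu = row_coherence(A)                # always lies in [1, N/d] = [1, 50]
```

A matrix whose mass concentrates on a few rows (e.g. the first d rows of the identity, padded with zeros) attains the upper bound N/d, which is why uniform row sampling needs larger sample sizes for high-coherence matrices.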
We introduce the following assumption for the objective function in (1).
Assumption 2.2
We assume \(f({\textbf{x}},{\textbf{y}})\) is twice differentiable, \(\mu\)-strongly convex in \({\textbf{x}}\), \(\mu\)-strongly concave in \({\textbf{y}}\), and has \(L_\textbf{g}\)-Lipschitz continuous gradient and \(L_\textbf{H}\)-Lipschitz continuous Hessian. We also assume each individual function \(l_j(\textbf{x},\textbf{y})\) is convex in \(\textbf{x}\). We denote \(\kappa _\textbf{g}\overset{\textrm{def}}{=}L_\textbf{g}/\mu\) and \(\kappa _\textbf{H}\overset{\textrm{def}}{=}L_\textbf{H}/\mu\).
The convexity of each individual function \(l_j\) in \(\textbf{x}\), along with the \(L_\textbf{g}\)-Lipschitz continuity of the gradient \(\textbf{g}_\textbf{x}\), implies that the Hessian \(\nabla _{\textbf{x}\textbf{x}}^2 l_j\) is bounded. Let us denote \(K\overset{\textrm{def}}{=}\max _{j}\Vert \nabla ^2_{\textbf{x}\textbf{x}} l_j\Vert\) and \(\hat{\kappa }\overset{\textrm{def}}{=}K/\mu\).
2.2 Matrix approximation via sub-sampling and sketching
Let us introduce some preliminary results for approximating a positive definite Hessian matrix. We first consider a Hessian matrix in the form of \(\textbf{H}= \frac{1}{N}\sum _{j=1}^N \textbf{H}_j\in {\mathbb {R}}^{d\times d}\), and approximate it using sub-sampling:
where elements in \({\mathcal {S}}\) are uniformly sampled from \(\{1,\cdots , N\}\). The following lemma characterizes the error of the sub-sampling approximation.
Lemma 2.3
(Ye et al., 2021, Lemma 9) Suppose \(\textbf{H}\succeq \mu \textbf{I}\) and \(\max _{1\le j\le N}\Vert \textbf{H}_j\Vert \le \hat{K}\) for some constants \(\mu , \hat{K}>0\). For any \(\delta \in (0,1)\) and \(\eta \in (0,0.5)\), if the sample size satisfies \(|{\mathcal {S}}|\ge \frac{3\hat{K}\log (2d/\delta )}{\mu \eta ^2}\), then with probability at least \(1-\delta\), we have \((1-\eta )\textbf{H}\preceq \tilde{\textbf{H}}\preceq (1+\eta )\textbf{H}\) for \(\tilde{\textbf{H}}\) defined in (3).
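The guarantee can be sanity-checked numerically; the following sketch uses a toy strongly convex model (our own choice of \(\textbf{H}_j\), sample sizes, and constants) and measures the smallest \(\eta\) with \((1-\eta )\textbf{H}\preceq \tilde{\textbf{H}}\preceq (1+\eta )\textbf{H}\):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, mu = 20000, 10, 0.5
Z = rng.standard_normal((N, d))
# Toy per-sample Hessians H_j = z_j z_j^T + mu*I, so the full Hessian
# H = (1/N) sum_j H_j = Z^T Z / N + mu*I satisfies H >= mu*I.
H = Z.T @ Z / N + mu * np.eye(d)

# Uniform sub-sample of size |S| = 2000, as in the sub-sampled estimate (3).
S = rng.choice(N, size=2000, replace=False)
H_tilde = Z[S].T @ Z[S] / len(S) + mu * np.eye(d)

# Smallest eta with (1-eta) H <= H_tilde <= (1+eta) H, i.e. the spectral
# deviation of H^{-1/2} H_tilde H^{-1/2} from the identity.
w, V = np.linalg.eigh(H)
H_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
eta = np.max(np.abs(np.linalg.eigvalsh(H_inv_sqrt @ H_tilde @ H_inv_sqrt) - 1.0))
```

With \(|{\mathcal {S}}|\) well above the \(\hat{K}/\mu\) threshold, the measured \(\eta\) is small, consistent with the lemma.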
We then consider a special case where the Hessian matrix is expressed as \(\textbf{H}=\textbf{A}^{\top }\textbf{A}+\alpha \textbf{I}\), with \(\textbf{A}\in {\mathbb {R}}^{N\times d}\) being a full column-rank matrix. This form of Hessian matrix naturally arises in classical regression problems (Ye et al., 2021; Wang et al., 2017). We construct two approximate Hessians using sketching techniques:
where \(\textbf{S}_i\in {\mathbb {R}}^{N\times s'}\), \(i=1,\ldots ,m\), denote the sketching matrices and \(\textbf{S}\overset{\textrm{def}}{=}\frac{1}{\sqrt{m}}[\textbf{S}_1,\cdots ,\textbf{S}_m]\in {\mathbb {R}}^{N\times ms'}\). The following lemma characterizes the errors of these two sketching approximations.
Lemma 2.4
Adapted from (Wang et al., 2018, Lemma 8)
Let \(\eta\) and \(\delta \in (0,1)\) be fixed parameters, \(\nu =\nu (\textbf{A})\), and \(\textbf{S}_1,\cdots ,\textbf{S}_m\in {\mathbb {R}}^{N\times s'}\) be independent uniform sampling matrices with \(s'\ge \frac{3\nu d}{\eta ^2}\log (\frac{d m}{\delta })\). Then, with probability at least \(1-\delta\), we have \((1-\eta )\textbf{H}\preceq \tilde{\textbf{H}}_i\preceq (1+\eta ) \textbf{H}\) and \((1-\eta /\sqrt{m})\textbf{H}\preceq \hat{\textbf{H}}\preceq (1+\eta /\sqrt{m}) \textbf{H}\) for \(\tilde{\textbf{H}}_i\) and \(\hat{\textbf{H}}\) defined in (4).
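The \(\sqrt{m}\) gain from averaging independent sketches can be observed numerically; the sketch below (toy dimensions and sampling sizes of our own choosing) compares a single \(\tilde{\textbf{H}}_i\) against the averaged \(\hat{\textbf{H}}\):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, alpha, m, s_prime = 20000, 8, 1.0, 25, 500
A = rng.standard_normal((N, d))
H = A.T @ A + alpha * np.eye(d)  # H = A^T A + alpha*I

def spectral_eta(H_hat):
    """Smallest eta with (1-eta) H <= H_hat <= (1+eta) H."""
    w, V = np.linalg.eigh(H)
    R = V @ np.diag(w ** -0.5) @ V.T
    return np.max(np.abs(np.linalg.eigvalsh(R @ H_hat @ R) - 1.0))

# m independent uniform row-sampling sketches, each rescaled so that it is
# an unbiased estimate of A^T A.
sketches = []
for _ in range(m):
    idx = rng.choice(N, size=s_prime, replace=False)
    sketches.append((N / s_prime) * A[idx].T @ A[idx] + alpha * np.eye(d))

eta_single = spectral_eta(sketches[0])     # error of one ~H_i
eta_avg = spectral_eta(sum(sketches) / m)  # error of the averaged ^H
```

With m = 25 sketches, the averaged estimate is typically around \(\sqrt{m}=5\) times tighter than a single sketch, matching the \(\eta /\sqrt{m}\) factor in the lemma.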
3 The analysis framework of partially approximate Newton method
In this section, we propose a Partially Approximate Newton (PAN) method for Problem (1). We start with the classical Newton update:
Using the approximate Hessian matrix \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\) to replace the exact Hessian matrix \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x},\textbf{y})\) in (5) leads to the update rule of PAN as follows:
We use the weighted gradient norm as the measure in our analysis (Liu et al., 2022):
where \(\textbf{P}(\textbf{x},\textbf{y})\) is defined as
The following lemma shows that if \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\) is a good approximation of \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x},\textbf{y})\), then \(\textbf{P}(\textbf{x},\textbf{y})\) can be also well approximated by \(\textbf{C}(\textbf{x},\textbf{y})\), defined as
Lemma 3.1
Under Assumption 2.2, and supposing that \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\) satisfies \((1-\eta ) \textbf{H}_{\textbf{x}\textbf{x}}\preceq \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\preceq (1+\eta )\textbf{H}_{\textbf{x}\textbf{x}}\), we have
We establish a linear-quadratic convergence rate for the PAN update when \(\textbf{P}(\textbf{x},\textbf{y})\) and \(\textbf{C}(\textbf{x},\textbf{y})\) are close.
Theorem 3.2
Under Assumption 2.2, and supposing that \(\textbf{P}(\textbf{x},\textbf{y})\) and \(\textbf{C}(\textbf{x},\textbf{y})\) are close in the sense that \(\Vert \textbf{I}- \textbf{P}(\textbf{x},\textbf{y})^{1/2}\textbf{C}(\textbf{x},\textbf{y})^{-1}\textbf{P}(\textbf{x},\textbf{y})^{1/2}\Vert \le \eta _1\), the update of PAN in (6) exhibits the following linear-quadratic convergence rate:
When employing the sub-sampling approximation to construct \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\), we derive the following corollary by combining the results from Lemma 3.1 and Theorem 3.2.
Corollary 3.3
Let us construct the partial Hessian approximation by sub-sampling \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}=\frac{1}{|S|}\sum _{j\in S}\nabla ^2_{\textbf{x}\textbf{x}} l_j(\textbf{x},\textbf{y})\). Under Assumption 2.2 and for any \(\delta \in (0, 1)\), if the sample size satisfies \(|S|\ge {12\hat{\kappa }\log (2n_\textbf{x}/\delta )}\), then with probability at least \(1-\delta\), the update of PAN in (6) satisfies
with \(\eta =\eta _{\textrm{PAN}}\overset{\textrm{def}}{=}\sqrt{\frac{3\hat{\kappa }\log (2n_\textbf{x}/\delta )}{|S|}}\).
Corollary 3.3 suggests that PAN requires \({\mathcal {O}}\left( {\log (1/\epsilon )}/{\log (|S|/\hat{\kappa })}\right)\) iterations to achieve \(\epsilon\)-accuracy in terms of the measure \(\lambda (\textbf{x},\textbf{y})\) for a quadratic objective function. Analyzing the complexity of linear-quadratic rates on quadratic functions is a common practice in the literature (Roosta-Khorasani & Mahoney, 2019; Wang et al., 2018; Ye et al., 2020, 2021), which allows us to simply ignore the quadratic term in (9) since \(\kappa _\textbf{H}=0\). In comparison, state-of-the-art first-order methods such as optimistic gradient and extra gradient methods have complexities \({\mathcal {O}}\left( \kappa _\textbf{g}\log (1/\epsilon )\right)\); methods that do not access the full Hessian at each iteration such as the quasi-Newton method (Liu & Luo, 2022) and the partial-quasi-Newton method (Liu et al., 2022) have complexities \({\mathcal {O}}(\kappa _\textbf{g}^2 + \sqrt{n_\textbf{x}\log (1/\epsilon )})\) and \({\mathcal {O}}(\kappa _\textbf{g}+ \sqrt{n_\textbf{x}\log (1/\epsilon )})\), respectively. We can see that PAN exhibits a much weaker dependency on the condition number \(\kappa _\textbf{g}\). We present the comparisons in Table 1.
4 Partially approximate Newton methods for distributed minimax optimization
In this section, we present the PANDA method for solving Problem (2) in Sect. 4.1, establish its convergence results in Sect. 4.2, and extend PANDA to GIANT-PANDA for a special function class that commonly appears in regression problems in Sect. 4.3.
4.1 The PANDA algorithm
For simplicity, we suppress the evaluation point and use \(\textbf{g}_\textbf{x}\) to denote \(\textbf{g}_\textbf{x}(\textbf{x},\textbf{y})\) (similar to \(\textbf{g}_\textbf{y}\), \(\textbf{H}_{\textbf{x}\textbf{x}}\), \(\textbf{H}_{\textbf{x}\textbf{y}}\), \(\textbf{H}_{\textbf{y}\textbf{y}}\)). We start with the standard Newton direction \(\begin{bmatrix} {\textbf{d}}_{\textbf{x}}\\ {\textbf{d}}_\textbf{y}\end{bmatrix} \overset{\textrm{def}}{=}\begin{bmatrix} \textbf{H}_{\textbf{x}\textbf{x}}& \textbf{H}_{\textbf{x}\textbf{y}}\\ \textbf{H}_{\textbf{x}\textbf{y}}^{\top }& \textbf{H}_{\textbf{y}\textbf{y}} \end{bmatrix} ^{-1}\begin{bmatrix} \textbf{g}_\textbf{x}\\ \textbf{g}_\textbf{y}\end{bmatrix}\), which can be expressed explicitly by using block matrix inversion formula:
where \({\Delta }_{\textbf{y}\textbf{y}} = (\textbf{H}_{\textbf{y}\textbf{y}}-\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} )^{-1}.\)
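The blockwise direction can be verified against a direct solve of the full system; the following sketch (with toy dimensions and random data of our own choosing) implements the Schur-complement computation using \({\Delta }_{\textbf{y}\textbf{y}}\) defined above:

```python
import numpy as np

rng = np.random.default_rng(3)
n_x, n_y = 8, 2
B = rng.standard_normal((n_x, n_x))
C = rng.standard_normal((n_y, n_y))
H_xx = B @ B.T + np.eye(n_x)    # positive definite (strongly convex in x)
H_yy = -(C @ C.T + np.eye(n_y)) # negative definite (strongly concave in y)
H_xy = rng.standard_normal((n_x, n_y))
g_x = rng.standard_normal(n_x)
g_y = rng.standard_normal(n_y)

# Blockwise solve via Delta_yy = (H_yy - H_xy^T H_xx^{-1} H_xy)^{-1}.
Hxx_inv_gx = np.linalg.solve(H_xx, g_x)
Hxx_inv_Hxy = np.linalg.solve(H_xx, H_xy)
Delta_yy = np.linalg.inv(H_yy - H_xy.T @ Hxx_inv_Hxy)
d_y = Delta_yy @ (g_y - H_xy.T @ Hxx_inv_gx)
d_x = Hxx_inv_gx - Hxx_inv_Hxy @ d_y

# Direct solve of the full (n_x + n_y)-dimensional system for comparison.
H_full = np.block([[H_xx, H_xy], [H_xy.T, H_yy]])
d_full = np.linalg.solve(H_full, np.concatenate([g_x, g_y]))
```

The blockwise computation only ever inverts against \(\textbf{H}_{\textbf{x}\textbf{x}}\) and the small \(n_\textbf{y}\times n_\textbf{y}\) Schur complement, which is what makes the unbalanced-dimension structure exploitable.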
Under the setup of unbalanced dimensions \(n_\textbf{x}\gg n_\textbf{y}\), obtaining the exact Hessian \(\textbf{H}_{\textbf{x}\textbf{x}}\) on the server is prohibitive due to the communication overhead associated with \(\textbf{H}^i_{\textbf{x}\textbf{x}}\). However, communication costs of gradients and partial Hessians \(\textbf{H}_{\textbf{x}\textbf{y}}^{i}\) and \(\textbf{H}_{\textbf{y}\textbf{y}}^i\) are relatively low. Thus, in the first round of PANDA, the server aggregates these quantities to acquire precise gradient and partial Hessian information as follows:
The server then broadcasts the above-aggregated quantities to the clients, allowing each client to access global information of \(\textbf{g}_\textbf{x}\), \(\textbf{g}_\textbf{y}\), \(\textbf{H}_{\textbf{x}\textbf{y}}\), and \(\textbf{H}_{\textbf{y}\textbf{y}}\). Further, since the communication costs of \(\textbf{Q}^{i}_{\textbf{x}\textbf{y}}\overset{\textrm{def}}{=}[\textbf{H}^{i}_{\textbf{x}\textbf{x}}]^{-1}\textbf{H}_{\textbf{x}\textbf{y}}\) and \(\textbf{q}^{i}_\textbf{x}\overset{\textrm{def}}{=}[\textbf{H}^{i}_{\textbf{x}\textbf{x}}]^{-1}\textbf{g}_\textbf{x}\) are only \({\mathcal {O}}(n_\textbf{x}n_\textbf{y})\) and \({\mathcal {O}}(n_\textbf{x})\), in the second round of PANDA, the server aggregates \(\textbf{Q}^{i}_{\textbf{x}\textbf{y}}\) and \(\textbf{q}^{i}_\textbf{x}\) as follows:
Using \(\textbf{Q}_{\textbf{x}\textbf{y}}\) and \(\textbf{q}_\textbf{x}\) to replace \(\textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}}\) and \(\textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{g}_\textbf{x}\) in (10), the server finally computes the following approximate Newton direction
with \(\tilde{\Delta }_{\textbf{y}\textbf{y}} \overset{\textrm{def}}{=}[\textbf{H}_{\textbf{y}\textbf{y}} -\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\textbf{Q}_{\textbf{x}\textbf{y}}]^{-1}\) and updates the parameters based on \(\tilde{{\textbf{d}}}_\textbf{x}\) and \(\tilde{{\textbf{d}}}_\textbf{y}\).
We formally summarize the PANDA method in Algorithm 1. The following proposition indicates that the update rule of PANDA can be viewed as a partially approximate Newton method.
Proposition 4.1
Using PANDA in Algorithm 1, the update rule on the server is equivalent to
where \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}, t} \overset{\textrm{def}}{=}\left[ \frac{1}{m}\sum _{i=1}^m(\textbf{H}^i_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t))^{-1}\right] ^{-1}.\)
Proof
We omit the subscript t in the following proof, writing \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}=\tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}\) and \(\textbf{H}_{\textbf{x}\textbf{y}}=\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)\) (similarly for \(\textbf{H}_{\textbf{x}\textbf{x}}\), \(\textbf{H}_{\textbf{x}\textbf{x}}^i\), \(\textbf{H}_{\textbf{y}\textbf{y}}\), \(\textbf{g}_\textbf{x}\), \(\textbf{g}_\textbf{y}\)). We denote
Then, it holds that
\(\square\)
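Proposition 4.1 says that aggregating the locally preconditioned quantities \(\textbf{Q}^i_{\textbf{x}\textbf{y}}\) and \(\textbf{q}^i_\textbf{x}\) amounts to preconditioning with the harmonic mean of the local Hessians. A toy numerical check (dimensions and data are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n_x, n_y = 5, 8, 2
H_xx_local = []
for _ in range(m):
    B = rng.standard_normal((n_x, n_x))
    H_xx_local.append(B @ B.T + np.eye(n_x))  # local H_xx^i, positive definite
H_xy = rng.standard_normal((n_x, n_y))
g_x = rng.standard_normal(n_x)

# Second communication round: aggregate Q_xy^i = (H_xx^i)^{-1} H_xy and
# q_x^i = (H_xx^i)^{-1} g_x across clients.
Q_xy = sum(np.linalg.solve(Hi, H_xy) for Hi in H_xx_local) / m
q_x = sum(np.linalg.solve(Hi, g_x) for Hi in H_xx_local) / m

# Harmonic-mean Hessian of Proposition 4.1.
H_tilde = np.linalg.inv(sum(np.linalg.inv(Hi) for Hi in H_xx_local) / m)
```

The aggregates coincide with \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}}\) and \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{g}_\textbf{x}\), so the server effectively runs PAN with \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\) while no \(n_\textbf{x}\times n_\textbf{x}\) matrix is ever transmitted.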
4.2 Convergence analysis of PANDA
We suppose the N samples are drawn i.i.d. from some distribution and each sample is associated with a local loss function \(l_j(\cdot )\). We also assume each client holds s samples drawn from \(\{l_j(\cdot )\}_{j=1}^N\), such that \(N=ms\) and \(|S_i|\equiv s\). According to Lemma 2.3, each local partial Hessian, \(\textbf{H}_{\textbf{x}\textbf{x}}^i(\textbf{x}_t,\textbf{y}_t) = \frac{1}{s} \sum _{j\in S_i}\nabla ^2_{\textbf{x}\textbf{x}} l_j(\textbf{x}_t,\textbf{y}_t)\), can be viewed as a sub-sampling approximation of \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\) when s is large. The following lemma indicates that
with \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}\) defined in (14), is a good estimate of \(\textbf{P}(\textbf{x}_t,\textbf{y}_t)\).
Lemma 4.2
Under Assumption 2.2 and suppose that for all \(i\in [m]\), \(\textbf{H}_{\textbf{x}\textbf{x}}^i(\textbf{x},\textbf{y})\) satisfies that
then we have \(\left\| \textbf{I}- \textbf{P}(\textbf{x}_t,\textbf{y}_t)^{1/2}{\textbf{C}}_{t}^{-1}\textbf{P}(\textbf{x}_t,\textbf{y}_t)^{1/2} \right\| \le \frac{\eta ^2}{1-\eta }\).
Incorporating the linear-quadratic rates established by the PAN framework, we can obtain the improved linear-quadratic rates for PANDA.
Theorem 4.3
Under Assumption 2.2, and supposing that
holds for all \(i\in [m]\), the update rule of PANDA (Algorithm 1) in (14) satisfies that
Similar to Corollary 3.3, we can guarantee a small \(\eta \in (0,0.5)\) for Theorem 4.3.
Corollary 4.4
Under Assumption 2.2, for any \(\delta \in (0,1)\) and \(\eta \in (0,0.5)\), if each client holds \(s\ge \frac{3\hat{\kappa }\log (2n_\textbf{x}m/\delta )}{\eta ^2}\) samples, then with probability at least \(1-\delta\), the update rule of PANDA (Algorithm 1) in (14) satisfies (15).
Remark 4.5
Corollary 4.4 can be interpreted as follows: if N is at least \(12m\hat{\kappa }\log (2n_\textbf{x}m/\delta )\), then (15) holds with probability at least \(1-\delta\), where \(\eta =\eta _\mathrm{\text {PANDA}}\overset{\textrm{def}}{=}\sqrt{\frac{3\hat{\kappa }m\log (2n_\textbf{x}m/\delta )}{ N}}\).
We highlight the advancements of the PANDA method in the following two aspects:
-
(a)
We compare PANDA with its single-agent version, which corresponds to using N/m samples to construct the approximated Hessian in PAN. According to Corollary 3.3 and Corollary 4.4, we observe that \(\eta _{\text {PAN}} = \sqrt{\frac{3\hat{\kappa }m\log (2n_\textbf{x}/\delta )}{N}}\approx \eta _{\text {PANDA}}\). This indicates that the linear-quadratic rate (15) of PANDA significantly improves upon its single-agent version (9), which demonstrates the superiority of using the distributed framework.
-
(b)
We compare PANDA with state-of-the-art first-order methods in Table 2. Both distributed EG and Proxskip-VI-FL (Zhang et al., 2024) require \({\mathcal {O}}(\kappa _\textbf{g}\log (1/\epsilon ))\) communication rounds, whereas PANDA only requires \({\mathcal {O}}\left( \frac{\log (1/\epsilon )}{\log (N/(m\hat{\kappa }))}\right)\) communication rounds. This highlights the advantage of using second-order information.
4.3 Extension to the GIANT-PANDA Algorithm
PANDA exhibits provably faster convergence rates than first-order methods for distributed minimax optimization; however, each client is required to access the full local Hessian at each iteration. In this section, we develop a communication-efficient algorithm that allows using an inexact Hessian instead of the exact one during local computation.
We focus on a specific function class in which \(l_j(\cdot )\) in (2) can be expressed as \(l_j(\textbf{x},\textbf{y})\overset{\textrm{def}}{=}h_j(\textbf{w}_j^{\top }\textbf{x},\textbf{y}) + \frac{\mu }{2}\Vert \textbf{x}\Vert ^2\), where \(h_j(\cdot ,\cdot )\) is convex in its first argument and \(\mu\)-strongly concave in \(\textbf{y}\). This function class generalizes the objective considered in convex optimization as discussed in GIANT (Wang et al., 2018), which has important applications in regression-type models.
The partial Hessian of the objective at \((\textbf{x}_t,\textbf{y}_t)\) can be written as
where \(\textbf{A}_t \overset{\textrm{def}}{=}\left[ {\textbf{a}}_1,\cdots ,{\textbf{a}}_N\right] ^{\top }\in {\mathbb {R}}^{N\times n_\textbf{x}}\) is a full column-rank matrix with \(n_\textbf{x}\le N\), \({\textbf{a}}_j = \sqrt{\nabla ^2_{\textbf{x}\textbf{x}}h_j(\textbf{w}_j^{\top }\textbf{x}_t,\textbf{y}_t)}\,\textbf{w}_j/\sqrt{N}\), and \(\textbf{S}^{i}\) is a sketching matrix such that \((\textbf{S}^i)^{\top }\textbf{A}_t\) contains the rows of \(\textbf{A}_t\) indexed by \({\mathcal {S}}_i\). The local partial Hessian of the i-th client is then given by \(\textbf{H}_{\textbf{x}\textbf{x}}^{i}(\textbf{x}_t,\textbf{y}_t)\overset{\textrm{def}}{=}\textbf{A}_t^{\top }\textbf{S}^{i}(\textbf{S}^{i})^{\top }\textbf{A}_t + \mu \textbf{I}\).
Taking advantage of such a structure, we perform a sketch operation on \(\textbf{H}_{\textbf{x}\textbf{x}}^{i}(\textbf{x}_t,\textbf{y}_t)\) to reduce the computation cost on the client such that:
where \(\tilde{\textbf{S}}_t^{i}\in {\mathbb {R}}^{N\times s_t}\) is formed by choosing \(s_t\le s\) columns of \(\textbf{S}^i\) uniformly at random. We replace \(\textbf{H}_{\textbf{x}\textbf{x}}^i(\textbf{x}_t,\textbf{y}_t)\) by its sketched approximation \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{i}(\textbf{x}_t,\textbf{y}_t)\) in lines 15 and 16 of PANDA (Algorithm 1), naturally resulting in the modified algorithm GIANT-PANDA. The routine of GIANT-PANDA is formally presented in Algorithm 2 in Appendix A.
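A minimal sketch of the local sketching step, assuming uniform row sub-sampling with rescaling (the scaling convention, dimensions, and data model here are our illustrative assumptions, not the paper's exact specification):

```python
import numpy as np

rng = np.random.default_rng(5)
N, n_x, mu, m = 12000, 10, 0.5, 4
s = N // m                                      # samples per client
A = rng.standard_normal((N, n_x)) / np.sqrt(N)  # rows a_j / sqrt(N)

rows_i = np.arange(s)  # rows held by client i
# Local partial Hessian (rescaled so it estimates A^T A + mu*I globally).
H_i = (N / s) * A[rows_i].T @ A[rows_i] + mu * np.eye(n_x)

# Sketch: keep s_t <= s of the client's rows uniformly at random, rescaled.
s_t = s // 3
keep = rng.choice(rows_i, size=s_t, replace=False)
H_i_tilde = (N / s_t) * A[keep].T @ A[keep] + mu * np.eye(n_x)

# Relative spectral error: smallest eta with
# (1-eta) H_i <= H_i_tilde <= (1+eta) H_i.
w, V = np.linalg.eigh(H_i)
R = V @ np.diag(w ** -0.5) @ V.T
eta = np.max(np.abs(np.linalg.eigvalsh(R @ H_i_tilde @ R) - 1.0))
```

The client only touches \(s_t\) rows when forming \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{i}\), which is where the per-iteration computation saving of GIANT-PANDA over PANDA comes from.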
We now characterize the convergence behavior of GIANT-PANDA. Since GIANT-PANDA inherits the framework of PANDA, its update rule can be viewed as
where \(\tilde{\textbf{H}}^{\text {gp}}_{\textbf{x}\textbf{x}, t} \overset{\textrm{def}}{=}\left[ \frac{1}{m}\sum _{i=1}^m[\tilde{\textbf{H}}^i_{\textbf{x}\textbf{x},t}]^{-1}\right] ^{-1}\).
Let \(\textbf{C}^{\textrm{gp}}_{t}\overset{\textrm{def}}{=}\tilde{\textbf{H}}^{{ \textrm{gp}}}_{\textbf{x}\textbf{x},t}-\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)[\textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)]^{-1}(\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t))^{\top }\). The following lemma shows that \(\textbf{C}^{\textrm{gp}}_{t}\) is still a good approximation to \(\textbf{P}(\textbf{x}_t,\textbf{y}_t)\).
Lemma 4.6
Let \(\eta ,\delta \in (0,1)\) be fixed parameters, \(\nu _t=\nu (\textbf{A}_t)\), and let \(\{\tilde{\textbf{S}}_t^i\}\) be independent uniform sampling matrices with \(s_t\ge \frac{3\nu _tn_\textbf{x}}{\eta ^2}\log \left( \frac{mn_\textbf{x}}{\delta }\right)\). Under Assumption 2.2, the bound
holds with probability at least \(1-\delta\).
Remark 4.7
The condition of Lemma 4.6 requires \(\{\tilde{\textbf{S}}^i_t\}\) to be uniform sampling matrices, which means we perform uniform sketching to obtain the local approximate Hessian \(\tilde{\textbf{H}}^i_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\) in GIANT-PANDA. GIANT-PANDA also allows using other sketching techniques, such as the count sketch (Clarkson & Woodruff, 2017; Meng & Mahoney, 2013) or the Gaussian sketch (Johnson & Lindenstrauss, 1984), to obtain \(\tilde{\textbf{S}}^i_t\). These sketching methods can improve the dependence of \(s_t\) on \(\nu _t\), but are more expensive to implement than the simple uniform sampling matrix (Wang et al., 2018).
Using the analysis framework of PAN, we establish the linear-quadratic rate of GIANT-PANDA.
Theorem 4.8
Under the same condition of Lemma 4.6, the update of GIANT-PANDA (Algorithm 2) in (17) satisfies
with probability at least \(1-\delta\), where \(c=\frac{12((\sqrt{m}-1)(\eta ^2-\eta +1)+1)^2}{(1-\eta )^2m}\).
Remark 4.9
The coefficient \(\frac{\eta }{\sqrt{m}}+\frac{\eta ^2}{1-\eta }\) of the linear term \(\lambda (\textbf{x}_{t},\textbf{y}_t)\) in (18) for GIANT-PANDA is slightly worse than the coefficient \(\frac{\eta ^2}{1-\eta }\) in (15) for PANDA. This is because GIANT-PANDA uses the approximate local partial Hessian instead of the full local partial Hessian. However, it is still better than the coefficient \(\frac{\eta }{1-\eta }\) in (9) for PAN by a factor of roughly \(\frac{1}{\sqrt{m}}\). This demonstrates the advantage of utilizing m clients in the parallel training process.
Improved Results for GIANT. GIANT (Wang et al., 2018) (Algorithm 3 in Appendix A) can be regarded as a special case of GIANT-PANDA for convex optimization when taking \(n_\textbf{y}= 0\). Using the analysis techniques developed for GIANT-PANDA, we also improve the convergence results for GIANT.
In the following corollary, we present a sharper linear-quadratic rate for GIANT under the same assumptions as in Wang et al. (2018), which improves the previous result by a factor of \(\sqrt{\kappa _\textbf{g}}\) in the linear term.
Corollary 4.10
To solve the minimization problem \((\mu >0)\) \(\min _{\textbf{x}\in {\mathbb {R}}^{n_\textbf{x}}} f(\textbf{x}) = \frac{1}{N}\sum _{j=1}^N h_j(\textbf{w}_j^{\top }\textbf{x}) + \frac{\mu }{2}\Vert \textbf{x}\Vert ^2\) on m clients, where each client holds s samples, if \(h_j(\cdot )\) is a convex loss function, \(f(\cdot )\) has an \(L_2\)-Lipschitz continuous Hessian, and \(s_t\) satisfies \(s\ge s_t\ge \frac{3\nu _tn_\textbf{x}}{\eta ^2}\log \left( \frac{mn_\textbf{x}}{\delta }\right)\) for some fixed parameters \(\eta ,\delta \in (0,1)\), then with probability at least \(1-\delta\), the update rule of GIANT (Algorithm 3) satisfies that
where \(\hat{\lambda }(\textbf{x})\overset{\textrm{def}}{=}\sqrt{\nabla f(\textbf{x})^{\top }[\nabla ^2 f(\textbf{x})]^{-1}\nabla f(\textbf{x})}\).
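The quantity \(\hat{\lambda }(\textbf{x})\) is the classical Newton decrement. As a small illustration (ours, not from the paper), it can be computed stably with a Cholesky solve instead of an explicit matrix inverse:

```python
import numpy as np

def newton_decrement(grad, hess):
    """lambda_hat(x) = sqrt(g^T H^{-1} g), via a Cholesky factorization."""
    L = np.linalg.cholesky(hess)
    z = np.linalg.solve(L, grad)   # z = L^{-1} g, so ||z||^2 = g^T H^{-1} g
    return float(np.linalg.norm(z))

# Toy check on a diagonal Hessian where the value is known in closed form:
g = np.array([3.0, 4.0])
H = np.diag([9.0, 16.0])
# g^T H^{-1} g = 9/9 + 16/16 = 2, so the decrement is sqrt(2).
assert abs(newton_decrement(g, H) - np.sqrt(2.0)) < 1e-12
```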
5 Experiments
We validate the proposed methods on the following important data mining tasks, which enjoy the structure of “unbalanced dimensions” and have been well studied in previous literature (Liu et al., 2022; Liu & Luo, 2022). The experiments are conducted on a workstation with an Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz. The code was executed using Python 3.8.
-
AUC Maximization. To train a classifier \(\textbf{w}\) on an imbalanced dataset \(\{{\textbf{a}}_j,b_j\}_{j=1}^N\) such that \(p = \frac{N^{+}}{N}\approx 1~\text {or}~0\), where \(N^{+}\) is the number of positive instances, AUC maximization can be reformulated as a minimax problem, where \(l_j(\textbf{x},\textbf{y})\) of (1) takes the following quadratic form
$$\begin{aligned} l_j(\textbf{x},y)&= (1-p)\big ( \big (\textbf{w}^\top {\textbf {a}}_j-u \big )^2 - 2(1+y)\textbf{w}^\top {\textbf {a}}_j\big ){\mathbb {I}}_{b_j=1}+\frac{\lambda }{2}\left\| \textbf{x} \right\| ^2\\&\quad +p\big ( \big (\textbf{w}^\top {\textbf {a}}_j-v \big )^2 + 2(1+y)\textbf{w}^\top {\textbf {a}}_j\big ){\mathbb {I}}_{b_j=-1} -p(1-p)y^2, \end{aligned}$$where \(\textbf{x}= [\textbf{w}; u; v] \in {\mathbb {R}}^{d+2}\), \(\textbf{w}\in {\mathbb {R}}^d\), \(u\in {\mathbb {R}}\), \(v\in {\mathbb {R}}\), and \(y\in {\mathbb {R}}\). We perform experiments on “a9a” (\(N=32,651\), \(n_\textbf{x}=125\), \(n_\textbf{y}=1\), \(p=0.241\)), “w8a” (\(N=45,546\), \(n_\textbf{x}=302\), \(n_\textbf{y}=1\), \(p=0.029\)), and “sido0” (\(N=12,678\), \(n_\textbf{x}=4,932\), \(n_\textbf{y}=1\)), which can be downloaded from Libsvm (Chang & Lin, 2011). We choose the regularization parameter \(\lambda = 0.5\) and the number of clients \(m=8\).
-
Fairness-Aware Machine Learning. Given the training set \(\{{\textbf{a}}_j, b_j, c_j\}_{j=1}^{N}\), where \({\textbf{a}}_j\in {\mathbb {R}}^d\) and \(c_j\in {\mathbb {R}}\), we can use the following adversarial training model (Zhang et al., 2018) to train a binary classifier \(\textbf{x}\) that is unbiased with respect to the protected feature \(c_j\):
$$\begin{aligned} l_j(\textbf{x}, y) =\log \big (1+\exp \big (-b_j({\textbf {a}}_j)^\top \textbf{x}\big )\big ) +\lambda \left\| \textbf{x} \right\| ^2 - \gamma y^2-\beta \log \big (1 + \exp \big (-c_j({\textbf {a}}_j)^\top \textbf{x}y\big )\big ). \end{aligned}$$We choose \(\lambda = 0.5\) and \(\beta = \gamma = 0.0001\). We conduct experiments on the “adult” (\(N=32,651\), \(n_\textbf{x}=122\), \(n_\textbf{y}=1\)) and “law school” (\(N=20,427\), \(n_\textbf{x}=379\), \(n_\textbf{y}=1\)) datasets (Le Quy et al., 2022; Liu & Luo, 2022).
We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and the running time (in seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for fairness-aware machine learning on the datasets “adult” and “law school” with different sketch ratios p under the case \(m=8\)
5.1 Comparison with the baselines
We compare PANDA and GIANT-PANDA with existing state-of-the-art communication-efficient methods. Specifically, we adopt the distributed version of the extra gradient method (Korpelevich, 1976; Tseng, 2000) (EG), federated gradient descent ascent with gradient tracking (Sun & Wei, 2022) (FedGDA), and the proximal skip method for variational inequalities (Zhang et al., 2024) (ProxSkip) as baselines. Both EG and ProxSkip achieve the optimal communication complexity for first-order methods. We tune the learning rates of all methods (including the baselines) from \(\{1.0, 0.9, \dots ,0.1\}\).
For all experiments, GIANT-PANDA uses \(70\%\) of the local data. The results for AUC maximization under client numbers \(m=8\) and \(m=128\) are presented in Figs. 1 and 2. We also demonstrate the results for fairness-aware machine learning under client numbers \(m=8\) and \(m=128\) in Figs. 3 and 4.
We observe that our newly proposed PANDA and GIANT-PANDA outperform the baselines in terms of both communication rounds and running time in all cases. This indicates that our methods not only significantly reduce the number of communication rounds compared with the optimal first-order methods, but also keep the per-iteration communication cost low, which makes the overall optimization procedure fast.
We also observe that the communication complexity of PANDA can be affected by the number of clients (m). This is because \(\eta _{\text {PANDA}}\) is proportional to \(\sqrt{m}\) according to Remark 4.5. On the other hand, increasing m reduces the training time per iteration due to the distributed framework; thus, a larger m always leads to a faster training process. Taking (a) and (b) of Figs. 1 and 2 as examples, PANDA requires fewer communication rounds when \(m=8\), but takes less running time when \(m=128\).
5.2 Comparison of different sketch ratios for GIANT-PANDA
We investigate the impact of the sketch ratio \(p = s_t/s\) on GIANT-PANDA, choosing \(p\) from \(\{10\%, 30\%, 50\%, 70\%, 100\%\}\). For the case \(p=100\%\), GIANT-PANDA reduces to its full version, PANDA. We set the number of clients to \(m=8\).
We present the results for AUC maximization and fairness-aware machine learning in Figs. 5 and 6 respectively. The numerical results show that larger sketch ratios lead to fewer communication rounds, because a larger \(s_t\) yields a better approximation to the exact local partial Hessian and thus a smaller \(\eta\). GIANT-PANDA with \(p=100\%\) (PANDA) outperforms the other cases in terms of communication rounds. On the other hand, GIANT-PANDA shows its advantage in terms of running time: GIANT-PANDA with \(p=30\%\) for “a9a” and \(p=10\%\) for “w8a” in AUC maximization, and with \(p=10\%\) for “law school” in fairness-aware machine learning, achieves the best running time ((b), (d) of Figs. 5 and 6). This is because the sketch operation in GIANT-PANDA reduces the computation time on each client.
We also provide additional experiments to study the impact of the sketch ratio under \(m=128\) and the impact of using different sketch methods in Appendix G.
6 Conclusion
In this paper, we have proposed PANDA and GIANT-PANDA to solve the distributed minimax problems with unbalanced dimensions. PANDA eliminates the requirement of communicating the full Hessian and substantially reduces the communication rounds compared to the optimal first-order methods. GIANT-PANDA further reduces the computation cost by performing sketch operations to compute the local partial Hessian on each client.
For future work, it would be interesting to generalize PANDA and GIANT-PANDA to more general minimax optimization problems (Adil et al., 2022; Lin & Jordan, 2022; Liu & Luo, 2022; Luo et al., 2022). It is also possible to leverage the idea of Hessian averaging (Na et al., 2023) to further enhance GIANT-PANDA, and to extend PANDA and GIANT-PANDA to the decentralized setting.
Data availability
No datasets were generated or analysed during the current study.
Notes
For simplicity, we only consider the case \(n_{\textbf{x}} \gg n_{\textbf{y}}\) in this paper; the case \(n_\textbf{y}\gg n_\textbf{x}\) can be treated in the same way by symmetry.
We use the term “communication-efficient” to describe second-order methods whose per-iteration communication cost is of the same order as that of first-order methods.
References
Adil, D., Bullins, B., Jambulapati, A., & Sachdeva, S. (2022). Line search-free methods for higher-order smooth monotone variational inequalities. arXiv preprint arXiv:2205.06167
Basar, T., & Olsder, G. J. (1999). Dynamic noncooperative game theory. Ser. Classics in Applied Mathematics. SIAM.
Ben-Tal, A., & Nemirovski, A. (2002). Robust optimization-methodology and applications. Mathematical Programming, 92, 453–480.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27–12727.
Chavdarova, T., Gidel, G., Fleuret, F., & Lacoste-Julien, S. (2019). Reducing noise in GAN training with variance reduced extragradient. Advances in Neural Information Processing Systems, 32.
Clarkson, K. L., & Woodruff, D. P. (2017). Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM), 63(6), 1–45.
Cortes, C., & Mohri, M. (2003). AUC optimization vs. error rate minimization. Advances in Neural Information Processing Systems, 16.
Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., & Bharath, A. A. (2018). Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1), 53–65.
Deng, Y., & Mahdavi, M. (2021). Local stochastic gradient descent ascent: Convergence analysis and communication efficiency. International conference on artificial intelligence and statistics (pp. 1387–1395). PMLR.
Facchinei, F. (2003). Finite-dimensional variational inequalities and complementarity problems. Springer.
Gao, R., & Kleywegt, A. (2022). Distributionally robust stochastic optimization with Wasserstein distance. Mathematics of Operations Research, 48(2), 603–655.
Hsieh, Y.-G., Iutzeler, F., Malick, J., & Mertikopoulos, P. (2019). On the convergence of single-call stochastic extra-gradient methods. Advances in Neural Information Processing Systems, 32.
Huang, K., & Zhang, S. (2022). An approximation-based regularized extra-gradient method for monotone variational inequalities. arXiv preprint arXiv:2210.04440
Huang, K., Zhang, J., & Zhang, S. (2022). Cubic regularized Newton method for the saddle point models: A global and local convergence analysis. Journal of Scientific Computing, 91(2), 60.
Islamov, R., Qian, X., Hanzely, S., Safaryan, M., & Richtárik, P. (2022). Distributed Newton-type methods with communication compression and Bernoulli aggregation. Transactions on Machine Learning Research.
Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics, 26, 189–206.
Korpelevich, G. M. (1976). The extragradient method for finding saddle points and other problems. Matecon, 12, 747–756.
Lanckriet, G. R., Ghaoui, L. E., Bhattacharyya, C., & Jordan, M. I. (2002). A robust minimax approach to classification. Journal of Machine Learning Research, 3, 555–582.
Le Quy, T., Roy, A., Iosifidis, V., Zhang, W., & Ntoutsi, E. (2022). A survey on datasets for fairness-aware machine learning. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 12(3), 1452.
Lin, T., & Jordan, M.I. (2022). Perseus: A simple high-order regularization method for variational inequalities. arXiv preprint arXiv:2205.03202
Lin, T., Jin, C., & Jordan, M. (2020). On gradient descent ascent for nonconvex-concave minimax problems. International conference on machine learning (pp. 6083–6093). PMLR.
Liu, C., & Luo, L. (2022). Regularized Newton methods for monotone variational inequalities with Hölder continuous Jacobians. arXiv preprint arXiv:2212.07824
Liu, C., Bi, S., Luo, L., & Lui, J. C. (2022). Partial-quasi-Newton methods: Efficient algorithms for minimax optimization problems with unbalanced dimensionality. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining (pp. 1031–1041).
Liu, C., Chen, L., Luo, L., & Lui, J. (2024). Communication efficient distributed Newton method with fast convergence rates. In Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining.
Liu, C., & Luo, L. (2022). Quasi-Newton methods for saddle point problems. Advances in Neural Information Processing Systems, 35, 3975–3987.
Liu, M., Zhang, W., Mroueh, Y., Cui, X., Ross, J., Yang, T., & Das, P. (2020). A decentralized parallel algorithm for training generative adversarial nets. Advances in Neural Information Processing Systems, 33, 11056–11070.
Lowd, D., & Meek, C. (2005). Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (pp. 641–647).
Luo, L., Li, Y., & Chen, C. (2022). Finding second-order stationary points in nonconvex-strongly-concave minimax optimization. Advances in Neural Information Processing Systems, 35, 36667–36679.
Malitsky, Y. (2015). Projected reflected gradient methods for monotone variational inequalities. SIAM Journal on Optimization, 25(1), 502–520.
Meng, X., & Mahoney, M. W. (2013). Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the forty-fifth annual ACM symposium on theory of computing (pp. 91–100).
Mishchenko, K., Kovalev, D., Shulgin, E., Richtárik, P., & Malitsky, Y. (2020). Revisiting stochastic extragradient. In International conference on artificial intelligence and statistics (pp. 4573–4582). PMLR
Na, S., Dereziński, M., & Mahoney, M. W. (2023). Hessian averaging in stochastic Newton methods achieves superlinear convergence. Mathematical Programming, 201(1–2), 473–520.
Nedić, A., & Ozdaglar, A. (2009). Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142, 205–228.
Nouiehed, M., Sanjabi, M., Huang, T., Lee, J. D., & Razaviyayn, M. (2019). Solving a class of non-convex min-max games using iterative first order methods. Advances in Neural Information Processing Systems, 32.
Rodomanov, A., & Nesterov, Y. (2021). Greedy quasi-Newton methods with explicit superlinear convergence. SIAM Journal on Optimization, 31(1), 785–811.
Roosta-Khorasani, F., & Mahoney, M. W. (2019). Sub-sampled Newton methods. Mathematical Programming, 174, 293–326.
Shamir, O., Srebro, N., & Zhang, T. (2014). Communication-efficient distributed optimization using an approximate Newton-type method. International conference on machine learning (pp. 1000–1008). PMLR.
Sun, Z., & Wei, E. (2022). A communication-efficient algorithm with linear convergence for federated minimax learning. Advances in Neural Information Processing Systems, 35, 6060–6073.
Tseng, P. (2000). A modified forward-backward splitting method for maximal monotone mappings. SIAM Journal on Control and Optimization, 38(2), 431–446.
Wang, S., Gittens, A., & Mahoney, M. W. (2017). Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging. International conference on machine learning (pp. 3608–3616). PMLR.
Wang, S., Roosta, F., Xu, P., & Mahoney, M. W. (2018). GIANT: Globally improved approximate Newton method for distributed optimization. Advances in Neural Information Processing Systems, 31.
Wang, Y., & Li, J. (2020). Improved algorithms for convex-concave minimax optimization. Advances in Neural Information Processing Systems, 33, 4800–4810.
Ye, H., He, C., & Chang, X. (2022). Accelerated distributed approximate Newton method. IEEE Transactions on Neural Networks and Learning Systems.
Ye, H., Luo, L., & Zhang, Z. (2020). Nesterov’s acceleration for approximate Newton. The Journal of Machine Learning Research, 21(1), 5627–5663.
Ye, H., Luo, L., & Zhang, Z. (2021). Approximate Newton methods. The Journal of Machine Learning Research, 22(1), 3067–3107.
Ying, Y., Wen, L., & Lyu, S. (2016). Stochastic online AUC maximization. Advances in Neural Information Processing Systems, 29.
Zhang, S., Choudhury, S., Stich, S. U., & Loizou, N. (2024). Communication-efficient gradient descent-accent methods for distributed variational inequalities: Unified analysis and local updates. In The twelfth international conference on learning representations
Zhang, B. H., Lemoine, B., & Mitchell, M. (2018). Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM conference on AI, ethics, and society (pp. 335–340).
Author information
Authors and Affiliations
Contributions
M.X. and C.L. wrote the main manuscript text, including numerical experiments and proofs. C.C. provided proofs of a key lemma. J.L and S.N. reviewed the manuscript and offered valuable suggestions and assistance. All authors reviewed and approved the final version of the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Editor: Lam M. Nguyen.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A The GIANT-PANDA algorithm
We present the detailed implementation of GIANT-PANDA and GIANT in Algorithms 2 and 3 respectively.
Appendix B Auxiliary Lemmas for positive definite matrices
We first provide some useful lemmas for positive definite matrices.
Lemma B.1
For two positive definite matrices \(\textbf{A},\textbf{B}\), if
for some \(\eta \in (0,1)\), then it holds that
Proof
We have
so that
and
So, we have
\(\square\)
Lemma B.2
For two positive definite matrices \(\textbf{A},\textbf{B}\), if
for some \(\eta \in (0,1)\), then it holds that
for any \(\Delta \succeq \textbf{0}\).
Proof
According to (B1), we have
which means
so that
Since \(\Delta \succeq \textbf{0}\), we have
which means
So that we have
Finally, we have
\(\square\)
Appendix C The Proof of Sect. 2.2
C.1 The Proof of Lemma 2.4
Proof
Recall the singular value decomposition \(\textbf{A}=\textbf{U}\Sigma \textbf{V}\) in Definition 2.1, where \(\textbf{U}\in {\mathbb {R}}^{N\times d}\), \(\Sigma \in {\mathbb {R}}^{d\times d}\), and \(\textbf{V}\in {\mathbb {R}}^{d\times d}\). We can directly obtain the following results by taking \(\rho =d\) in Wang et al. (2018, Lemma 8):
holds for all \(i\in [m]\) with probability at least \(1-\delta\). Then we have
and
Recalling the definitions \(\textbf{H}_i=\textbf{A}^{\top }\textbf{S}_i\textbf{S}_i^{\top }\textbf{A}+\alpha \textbf{I}\) and \(\textbf{H}=\textbf{A}^{\top }\textbf{A}+\alpha \textbf{I}\), we have
and
Similarly, recalling the definition \(\hat{\textbf{H}}=\textbf{A}^{\top }\textbf{S}\textbf{S}^{\top }\textbf{A}+\alpha \textbf{I}\), we have
and
\(\square\)
Appendix D The Proof of Sect. 3
D.1 The Proof of Lemma 3.1
Proof
According to the condition, it holds that
Using Lemma B.1, we have
Denote \(\Delta =-\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x},\textbf{y})\textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x},\textbf{y})^{-1}\textbf{H}_{\textbf{x}\textbf{y}}^{\top }(\textbf{x},\textbf{y})\). According to Assumption 2.2, we have \(\textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x},\textbf{y})\preceq -\mu \textbf{I}\), which means that \(\Delta \succeq \textbf{0}\). Using Lemma B.2 on \(\textbf{A}=\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x},\textbf{y})\) and \(\textbf{B}=\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\), we have
\(\square\)
D.2 The Proof of Theorem 3.2
Denote \(\textbf{J}= \begin{bmatrix} \textbf{P}(\textbf{x}, \textbf{y})^{1/2} & \textbf{0} \\ \textbf{0} & \frac{\sqrt{\mu }}{2}\textbf{I}_{n_\textbf{y}}\end{bmatrix}\) and \(r = \left\| \begin{bmatrix}\textbf{P}(\textbf{x},\textbf{y})^{1/2}(\textbf{x}_{+}-\textbf{x})\\ \frac{\sqrt{\mu }}{2}(\textbf{y}_+ - \textbf{y})\end{bmatrix} \right\|\). The following lemma illustrates the relation between \(\textbf{P}(\textbf{x},\textbf{y})\) and \(\textbf{P}(\textbf{x}_{+},\textbf{y}_{+})\).
Lemma D.1
(Liu et al., 2022, Lemma 4.3) Under Assumption 2.2, we have
-
\(\frac{1}{1 + \frac{3\kappa ^2\kappa _2}{\sqrt{\mu }}r}\textbf{P}(\textbf{x}, \textbf{y}) \preceq \textbf{P}(\textbf{x}_+, \textbf{y}_+) \preceq \bigg (1 + \frac{3\kappa ^2\kappa _2}{\sqrt{\mu }}r\bigg )\textbf{P}(\textbf{x}, \textbf{y})\).
-
\(\textbf{J}\succeq \frac{\sqrt{\mu }}{2}\textbf{I}\).
For simplicity, we use \(\textbf{g}_\textbf{x}\), \(\textbf{g}_\textbf{y}\), \(\textbf{H}_{\textbf{x}\textbf{x}}\), \(\textbf{H}_{\textbf{x}\textbf{y}}\), \(\textbf{H}_{\textbf{y}\textbf{y}}\), \(\textbf{P}\), \(\textbf{P}_{+}\), and \(\textbf{C}\) to represent \(\textbf{g}_\textbf{x}(\textbf{x},\textbf{y})\), \(\textbf{g}_\textbf{y}(\textbf{x},\textbf{y})\), \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x},\textbf{y})\), \(\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x},\textbf{y})\), \(\textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x},\textbf{y})\), \(\textbf{P}(\textbf{x},\textbf{y})\), \(\textbf{P}(\textbf{x}_{+},\textbf{y}_{+})\), and \(\textbf{C}(\textbf{x},\textbf{y})\). We also write the full exact Hessian matrix as \(\textbf{H}= \textbf{H}(\textbf{x}, \textbf{y}) = \begin{bmatrix} \textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}, \textbf{y}) & \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}, \textbf{y}) \\ \textbf{H}_{\textbf{y}\textbf{x}}(\textbf{x}, \textbf{y}) & \textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x}, \textbf{y}) \end{bmatrix}\) and the approximate full Hessian matrix as \(\tilde{\textbf{H}} = \tilde{\textbf{H}}(\textbf{x}, \textbf{y}) = \begin{bmatrix} \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}(\textbf{x}, \textbf{y}) & \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}, \textbf{y}) \\ \textbf{H}_{\textbf{y}\textbf{x}}(\textbf{x}, \textbf{y}) & \textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x}, \textbf{y}) \end{bmatrix}\).
Proof
According to the condition \(\Vert \textbf{I}-\textbf{P}^{1/2}\textbf{C}^{-1}\textbf{P}^{1/2}\Vert \le \eta _1\), we can obtain that
Under Assumption 2.2, we have \(\textbf{H}_{\textbf{x}\textbf{x}} \succeq \mu \textbf{I}\) and \(\textbf{H}_{\textbf{y}\textbf{y}} \preceq -\mu \textbf{I}\); then
and
It follows that
and
According to the Woodbury identity, we have
Hence, we have
According to the update rule, we have
Using block inverse formula, we can write \(\varvec{\Upsilon }\) as
Let \(\begin{bmatrix} \varvec{\zeta }_\textbf{x}\\ \varvec{\zeta }_\textbf{y}\end{bmatrix} = \textbf{J}^{-1}\begin{bmatrix} \textbf{g}_\textbf{x}\\ \textbf{g}_\textbf{y}\end{bmatrix} = \begin{bmatrix} \textbf{P}^{-1/2} & 0 \\ 0 & \frac{2}{\sqrt{\mu }}\textbf{I}_{n_\textbf{y}}\end{bmatrix}\begin{bmatrix} \textbf{g}_\textbf{x}\\ \textbf{g}_\textbf{y}\end{bmatrix} = \begin{bmatrix} \textbf{P}^{-1/2}\textbf{g}_\textbf{x}\\ \frac{2}{\sqrt{\mu }}\textbf{g}_\textbf{y}\end{bmatrix}\), then the weighted gradient norm can be written as
From (D6) we can build the following relationship
Hence, in order to build the relationship between \(\lambda _+\) and \(\lambda\), we need to bound \(\left\| \textbf{P}_+^{-1/2}\varvec{\Upsilon }_\textbf{x} \right\|\), \(\left\| \textbf{P}_+^{-1/2}\varvec{\Gamma }_\textbf{x} \right\|\), and \(\left\| \varvec{\Gamma }_\textbf{y} \right\|\). Since
and
we have
Using Lemma D.1, we have
Then, the term \(\left\| \varvec{\Gamma } \right\|\) can be bounded under Assumption 2.2 as follows:
We can further bound \(\left\| \textbf{P}_+^{-1/2}\varvec{\Gamma }_\textbf{x} \right\|\) as follows:
With all the bounds obtained above, we can construct the following intermediate relationship:
Then we need to bound r by \(\lambda\). By the update rule, we have
which is equivalent to
Then we can bound \(\left\| \textbf{P}^{1/2}(\textbf{x}_+-\textbf{x}) \right\|\) and \(\left\| \frac{\sqrt{\mu }}{2}(\textbf{y}_+-\textbf{y}) \right\|\) as follows:
and
Combining these two bounds, we have
Plugging this back into (D7), we have
\(\square\)
D.3 The Proof of Corollary 3.3
Proof
According to Lemma 2.3, taking \(\eta = \eta _{\text {PAN}}=\sqrt{\frac{3K\log (2n_\textbf{x}/\delta )}{|{\mathcal {S}}|\mu }}\), it holds with probability at least \(1-\delta\) that
Using Lemma 3.1, we have
holds with probability at least \(1-\delta\). Using Theorem 3.2, we can directly conclude (9). \(\square\)
Appendix E The Proof of Sect. 4.2
To simplify the presentation, we use \(\textbf{H}_{\textbf{x}\textbf{x},t}\), \(\textbf{H}_{\textbf{x}\textbf{x},t}^i\) to represent \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\) and \(\textbf{H}_{\textbf{x}\textbf{x}}^i(\textbf{x}_t,\textbf{y}_t)\).
E.1 The Proof of Lemma 4.2
Proof
We denote
According to the condition for \(\textbf{H}^{i}_{\textbf{x}\textbf{x},t}\), we have
so that \(\textbf{E}^i\) satisfies that
and
We rewrite \(\textbf{H}^{1/2}_{\textbf{x}\textbf{x},t} [\textbf{H}_{\textbf{x}\textbf{x},t}^i ]^{-1}\textbf{H}^{1/2}_{\textbf{x}\textbf{x},t}-\textbf{I}\) as follows
Then, \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}\) satisfies that
Thus, we have
Applying Lemma B.2 on \(\textbf{A}=\textbf{H}_{\textbf{x}\textbf{x},t}\), \(\textbf{B}=\tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}\) and
we obtain
\(\square\)
E.2 The Proof of Theorem 4.3
Proof
Using Lemma 4.2, we have
According to Proposition 4.1 and using Theorem 3.2 by taking \(\eta _1=\frac{\eta ^2}{1-\eta }\), we obtain (15). \(\square\)
E.3 The Proof of Corollary 4.4
Proof
Using Lemma 2.3 with sample size \(s \ge \frac{3K\log (2n_\textbf{x}m/\delta )}{\mu \eta ^2}\) on each client, we have
holds with probability at least \(1-\delta /m\). Then we have
holds for all \(i\in [m]\) with probability \(1 - \delta\). The remaining proof directly follows the proof of Theorem 4.3. \(\square\)
Appendix F The Proof of Sect. 4.3
To simplify the presentation, we use \(\textbf{H}_{\textbf{x}\textbf{x},t}\), \(\textbf{H}_{\textbf{x}\textbf{x},t}^i\) to represent \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\) and \(\textbf{H}_{\textbf{x}\textbf{x}}^i(\textbf{x}_t,\textbf{y}_t)\).
F.1 The Proof of Lemma 4.6
Proof
Recall that we perform matrix sketching on each client such that
We denote \(\textbf{S}=\frac{1}{\sqrt{m}}[\tilde{\textbf{S}}_1,\cdots ,\tilde{\textbf{S}}_m]\in {\mathbb {R}}^{N\times ms_t}\); then the aggregation of \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}^i\) can be defined by
Using the results of Lemma 2.4, we have
and
hold with probability at least \(1-\delta\). Following the proof of Lemma 4.2, we also denote
According to (F12), we have
and
hold with probability at least \(1-\delta\). It also satisfies that
Thus, we have
holds with probability at least \(1-\delta\).
Applying Lemma B.2 on \(\textbf{A}=\textbf{H}_{\textbf{x}\textbf{x},t}\), \(\textbf{B}=\tilde{\textbf{H}}^{\text {gp}}_{\textbf{x}\textbf{x},t}\) and
we conclude that
holds with probability at least \(1-\delta\). \(\square\)
F.2 The Proof of Theorem 4.8
Using Lemma 4.6, we have
holds with probability at least \(1-\delta\). The update rule of GIANT-PANDA on the server can be written as
Using the convergence results of PAN (Theorem 3.2) by taking \(\eta _1 = \frac{\eta }{\sqrt{m}}+\frac{\eta ^2}{1-\eta }\), we can directly obtain (18).
F.3 The Proof of Corollary 4.10
When the variable \(\textbf{y}\) vanishes, we have
the corresponding measure can be written as
which recovers the measure of Rodomanov and Nesterov (2021). Under the condition of Corollary 4.10, f is \(\mu\)-strongly convex and has an \(L_2\)-Lipschitz continuous Hessian. We define \(\hat{r}\overset{\textrm{def}}{=}\Vert \hat{\textbf{P}}(\textbf{x})^{1/2}(\textbf{x}_{+}-\textbf{x})\Vert\). Using the results in Rodomanov and Nesterov (2021), it holds that
-
\(\frac{1}{1+\frac{L_2\hat{r}}{2\mu ^{3/2}}} \hat{\textbf{P}}(\textbf{x}) \preceq \hat{\textbf{P}}(\textbf{x}_{+})\preceq \left( 1+\frac{L_2\hat{r}}{2\mu ^{3/2}}\right) \hat{\textbf{P}}(\textbf{x})\),
-
\(\hat{r}\le \hat{\lambda }(\textbf{x})\).
Lemma F.1
Under the same condition of Corollary 4.10, when we use the update rule
where \(\textbf{H}\in {\mathbb {R}}^{n_\textbf{x}\times n_\textbf{x}}\) is a positive definite matrix satisfying
then it holds that
We first present the proof of Lemma F.1.
Proof
Replacing the factor \(\sqrt{1+\frac{3\kappa ^2\kappa _2}{\sqrt{\mu }}r}\) in (D7) of Theorem 3.2 by \(\sqrt{1+\frac{L_2}{2\mu ^{3/2}}\hat{r}}\), we have
Using \(\hat{r}\le \hat{\lambda }(\textbf{x})\), we have
\(\square\)
Now we present the proof of Corollary 4.10, which is very similar to the proof of Theorem 4.8.
Proof
Using the results of Lemma 4.6, we have
holds with probability at least \(1-\delta\) where \(\tilde{\textbf{H}}_t\) is defined by
The update rule of GIANT can be written as
Using Lemma F.1 by taking \(\eta _1=\frac{\eta }{\sqrt{m}}+\frac{\eta ^2}{1-\eta }\), we have
holds with probability at least \(1-\delta\). \(\square\)
We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and the running time (in seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for fairness-aware machine learning on the datasets “adult” and “law school” with different sketch ratios p under the case \(m=128\)
We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and the running time (in seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for fairness-aware machine learning under \(m=8\) on the datasets “adult” and “law school” with different sketch methods
Appendix G Additional experiments
This section provides additional experiments to validate our newly proposed methods. In Appendix G.1, we present additional numerical results for different sketch ratios in GIANT-PANDA. In Appendix G.2, we study the impact of the sketch method on the convergence behavior of GIANT-PANDA. In Appendix G.3, we compare PAN with existing state-of-the-art methods for single-agent optimization (\(m=1\)).
G.1 More study on the impact of the sketch ratio
We choose the sketch ratio \(p\in \{10\%,30\%,50\%,70\%,100\%\}\) and set the number of clients to \(m=128\). We present the results for AUC maximization and fairness-aware machine learning in Figs. 7 and 8 respectively. We observe behaviors similar to those in Sect. 5.2.
G.2 Comparison of different sketch methods for GIANT-PANDA
We validate the impact of using different sketch methods in GIANT-PANDA. Specifically, we choose the uniform sketch, the Gaussian sketch (Johnson & Lindenstrauss, 1984), and the count sketch (Clarkson & Woodruff, 2017) to construct the local approximate partial Hessian \(\tilde{\textbf{H}}^{i}_{\textbf{x}\textbf{x},t}\) in (16).
We present the behavior of GIANT-PANDA under \(m=8\) for AUC maximization and Fairness-aware machine learning in Figs. 9 and 10 respectively.
We also present the behavior of GIANT-PANDA under \(m=128\) for AUC maximization and Fairness-aware machine learning in Figs. 11 and 12 respectively.
We observe that when \(m=8\), GIANT-PANDA with the uniform sketch achieves the best performance in terms of both communication rounds and running time (Figs. 9 and 10). This suggests that when the local sample size s is relatively large, employing the uniform sketch in GIANT-PANDA is sufficient.
However, when the number of clients is as large as \(m=128\), the count sketch and the Gaussian sketch perform better than the uniform sketch (Figs. 11 and 12). This encourages the use of more sophisticated sketch methods to improve the performance of GIANT-PANDA when the local sample size is small.
G.3 Comparison of baselines on single-agent optimization
We compare partially approximate Newton (PAN) with existing state-of-the-art methods for single-agent minimax optimization. We adopt the extra gradient method (EG) and the partial-quasi-Newton method with SR1 updates (Liu et al., 2022) (RaSR1) as baselines. We present the numerical results for AUC maximization and fairness-aware machine learning in Figs. 13 and 14. The experimental results indicate that PAN with a sample ratio (\(p'=|{\mathcal {S}}|/N\)) of \(p'=10\%\) or \(p'= 20\%\) significantly outperforms the baselines.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Xiao, M., Liu, C., Chen, C. et al. Panda: partially approximate newton methods for distributed minimax optimization with unbalanced dimensions. Mach Learn 114, 174 (2025). https://doi.org/10.1007/s10994-025-06813-1