1 Introduction

We consider a class of minimax optimization problems formulated as finite-sum expressions:

$$\begin{aligned} \min _{\textbf{x}\in \mathbb {R}^{n_\textbf{x}}} \max _{\textbf{y}\in \mathbb {R}^{n_\textbf{y}}} f(\textbf{x}, \textbf{y}) \overset{\textrm{def}}{=}\frac{1}{N}\sum \limits _{j=1}^{N} l_j(\textbf{x}, \textbf{y}), \end{aligned}$$
(1)

where N denotes the sample size, and \(n_\textbf{x}\) and \(n_\textbf{y}\) denote the dimensions of the variables \(\textbf{x}\) and \(\textbf{y}\), respectively. The smooth function \(f(\textbf{x}, \textbf{y})\) is assumed to be strongly convex in \(\textbf{x}\) and strongly concave in \(\textbf{y}\) (i.e., to satisfy the SC-SC condition). For many machine learning problems with a large sample size N, distributed methods are often preferable for solving the problem in a parallel fashion. To facilitate our study in a distributed setting, we divide the N samples among m clients so that the i-th client holds \(|S_i|\) samples. Consequently, we have \(N=\sum _{i=1}^m|S_i|\), leading us to the following alternative problem:

$$\begin{aligned} \!\!\min _{\textbf{x}\in \mathbb {R}^{n_\textbf{x}}} \max _{\textbf{y}\in \mathbb {R}^{n_\textbf{y}}} f(\textbf{x}, \textbf{y}) \!=\! \frac{1}{m}\sum _{i=1}^{m} f^{i}(\textbf{x},\textbf{y}),~~ \text {where}~~ f^{i}(\textbf{x},\textbf{y}) \!\overset{\textrm{def}}{=}\! \frac{1}{|S_i|}\sum _{j\in S_i}l_j(\textbf{x},\textbf{y}). \end{aligned}$$
(2)

In this context, for each client \(i=1,\ldots ,m\), \(f^i\) represents its local function and \(S_i\) is the index set of local samples.

Minimax optimization has gained significant attention in the data mining and machine learning community due to its broad applications in various domains, including game theory (Basar & Olsder, 1999; Facchinei, 2003), supervised learning (Lanckriet et al., 2002), robust optimization (Ben-Tal & Nemirovski, 2002; Deng & Mahdavi, 2021; Gao & Kleywegt, 2022), and fairness-aware machine learning (Creswell et al., 2018; Liu et al., 2020). Many of these applications share a critical property: the dimensions of the variables \(\textbf{x}\) and \(\textbf{y}\) are unbalanced (Liu et al., 2022). For instance, AUC maximization (Cortes & Mohri, 2003; Ying et al., 2016) aims to train a binary classifier on imbalanced datasets \(\{{\textbf{a}}_j,b_j\}_{j=1}^{N}\), where \({\textbf{a}}_j\) denotes the input with d features and \(b_j\in \{-1,+1\}\) denotes the label. This problem can be formulated as a minimax problem with \(n_\textbf{x}=d+2\) and \(n_\textbf{y}=1\). Additionally, in fairness-aware machine learning tasks (Lowd & Meek, 2005; Zhang et al., 2018), we are given a training set \(\{{\textbf{a}}_j,b_j, {\textbf{c}}_j\}_{j=1}^{N}\), where \({\textbf{a}}_j\) represents d-dimensional features to learn from, \({\textbf{c}}_j\) contains s protected features, and \(b_j\) denotes the label. In this case, we have \(n_\textbf{x}=d\gg n_\textbf{y}=s\). Throughout this paper, we use the term “unbalanced dimensions” to describe the above special problem structure, which can be expressed as \(n_\textbf{x}\gg n_\textbf{y}\) or \(n_\textbf{y}\gg n_\textbf{x}\).

There are numerous first-order methods for solving minimax optimization problems in (1), including gradient descent ascent, extra gradient, and many of their variants (Chavdarova et al., 2019; Hsieh et al., 2019; Korpelevich, 1976; Lin et al., 2020; Malitsky, 2015; Mishchenko et al., 2020; Nedić & Ozdaglar, 2009; Nouiehed et al., 2019; Tseng, 2000). These first-order methods can be straightforwardly generalized to solve distributed problems in (2), by simply aggregating the gradients to the server at each iteration. Distributed first-order methods that perform multiple local iterations for each client before communication have also been proposed to solve minimax problems (Deng & Mahdavi, 2021; Sun & Wei, 2022; Zhang et al., 2024). Among these methods, Zhang et al. (2024) achieved the best-known results in terms of the total communication rounds, with an order of \({\mathcal {O}}(\kappa _\textbf{g}\ln (1/\epsilon ))\) where \(\kappa _\textbf{g}\) is the condition number of the objective (cf. Sect. 2.1) and \(\epsilon\) is the desired accuracy. It is worth noting that the per-iteration communication complexity between the client and server for first-order methods is only \({\mathcal {O}}(n_\textbf{x}+n_\textbf{y})\). However, these methods often require a substantial number of communication rounds to attain an accurate solution (e.g., their communication rounds depend heavily on the condition number). As a result, factors such as unpredictable network latency can lead to expensive total communication costs.

Second-order methods are well-known for their fast convergence rates, brought about by the utilization of the Hessian information of the objective. The Cubic regularized Newton method (Huang et al., 2022) and its restart variant (Huang & Zhang, 2022) have been proposed for solving (1) and demonstrated local superlinear convergence rates under the SC-SC condition. However, applying these methods directly in a distributed setting would require the communication of the full local Hessian matrix, resulting in a per-iteration communication complexity of \(\mathcal {O}((n_\textbf{x}+n_\textbf{y})^2)\). This communication overhead is unacceptable due to bandwidth limitations. On the other hand, several communication-efficient distributed second-order methods (Islamov et al., 2022; Liu et al., 2024; Shamir et al., 2014; Wang & Li, 2020; Ye et al., 2022) have been proposed for convex optimization problems, eliminating the need for full Hessian communication. However, the design and analysis of these communication-efficient second-order methods heavily depend on the convexity of the objective function, making it challenging to generalize them to the minimax setting. Building upon this, it is natural to ask: Is it possible to develop communication-efficient distributed second-order methods for minimax optimization by leveraging the structure of “unbalanced dimensions”?

In this paper, we provide an affirmative answer to this question by introducing PANDA (Partially Approximate Newton methods for Distributed minimAx optimization). In each iteration, PANDA avoids the need for communicating the full Hessian matrix, instead requiring only the exchange of the partial Hessian blocks associated with \(\textbf{y}\) (recall that we suppose \(n_\textbf{x}\gg n_\textbf{y}\) in this paper), i.e., \(\nabla ^2_{\textbf{y}\textbf{y}} f^{i}(\textbf{x},\textbf{y})\in {\mathbb {R}}^{n_\textbf{y}\times n_\textbf{y}}\), \(\nabla ^2_{\textbf{x}\textbf{y}} f^{i}(\textbf{x},\textbf{y})\in {\mathbb {R}}^{n_\textbf{x}\times n_\textbf{y}}\), and \([\nabla ^2_{\textbf{x}\textbf{x}}f^{i}(\textbf{x},\textbf{y})]^{-1}\nabla ^2_{\textbf{x}\textbf{y}} f(\textbf{x},\textbf{y})\in {\mathbb {R}}^{n_{\textbf{x}}\times n_\textbf{y}}\). Additionally, it exchanges vectors such as gradients and local descent directions. As a result, the per-iteration communication complexity of PANDA can be summarized as:

$$\begin{aligned} {\mathcal {O}}\Bigg (\!\!\!\!\!\underbrace{\!\!\!n_\textbf{x}n_\textbf{y}+n_\textbf{y}^2\!\!}_{\nabla ^ 2_{\textbf{x}\textbf{y}} f^i,~ [\nabla ^2_{\textbf{x}\textbf{x}}f^i]^{-1}\nabla ^2_{\textbf{x}\textbf{y}}f,~\nabla ^2_{\textbf{y}\textbf{y}} f^i} \!\!\!\!\!+\ \ \underbrace{n_\textbf{x}+n_\textbf{y}}_{\text {vectors}} \Bigg ) ={\mathcal {O}}(n_\textbf{x}n_\textbf{y}) \approx {\mathcal {O}}(n_\textbf{x}). \end{aligned}$$

This is a significant reduction compared to typical second-order methods and brings the communication cost to the same order as that of first-order methods. Furthermore, the utilization of second-order information in PANDA results in improved convergence behavior compared to existing distributed first-order methods.
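As a rough numerical illustration (with hypothetical dimensions of our own choosing, not taken from the experiments below), the following Python snippet compares the per-client, per-iteration number of floats exchanged under the communication schemes discussed above:

```python
# Back-of-the-envelope per-iteration communication per client (floats exchanged),
# using hypothetical unbalanced dimensions n_x >> n_y.
n_x, n_y = 100_000, 5

full_hessian = (n_x + n_y) ** 2                 # shipping the full local Hessian
panda = 2 * n_x * n_y + n_y ** 2 + n_x + n_y    # the partial blocks above plus gradient-sized vectors
first_order = n_x + n_y                         # gradients only

print(f"full Hessian: {full_hessian:,d}   PANDA: {panda:,d}   first-order: {first_order:,d}")
```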

1.1 Contribution

The contribution of this paper is threefold.

  1. (a)

    We develop a Partially Approximate Newton (PAN) method to solve the general minimax problem (1) with unbalanced dimensions. If the approximate Hessian \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\) satisfies \((1-\eta ) \nabla ^2_{\textbf{x}\textbf{x}} f\preceq \tilde{\textbf{H}}_{\textbf{x}\textbf{x}} \preceq (1+\eta ) \nabla ^2_{\textbf{x}\textbf{x}} f\) with \(\eta \in (0, 1)\), then PAN exhibits a linear-quadratic convergence rate for some measure \(\lambda _t\):

    $$\begin{aligned} \lambda _{t+1}\le \frac{\eta }{1-\eta }\lambda _t +\beta \lambda _t^2. \end{aligned}$$

    This result of PAN generalizes the approximate Newton method for convex optimization in Ye et al. (2021) and relaxes the conditions in Liu et al. (2022, Lemma 4.3) for minimax optimization.

  2. (b)

    We develop the PANDA method to solve the distributed minimax problem (2). If the local partial Hessian satisfies \((1-\eta )\nabla ^2_{\textbf{x}\textbf{x}} f\preceq \nabla ^2_{\textbf{x}\textbf{x}} f^i\preceq (1+\eta ) \nabla ^2_{\textbf{x}\textbf{x}} f\) with \(\eta \in (0,1)\), then PANDA exhibits a linear-quadratic convergence rate:

    $$\begin{aligned} \lambda _{t+1}\le \frac{\eta ^2}{1-\eta } \lambda _t +\beta _1 \lambda _t^2. \end{aligned}$$

    Furthermore, we can guarantee the existence of \(\eta\) provided that \(N\ge {\mathcal {O}}(mK/\mu )\), where m is the number of clients, \(K=\max _j\Vert \nabla ^2_{\textbf{x}\textbf{x}}l_j\Vert\), and \(\mu\) is the strong convexity parameter (cf. Sect. 2.1).

  3. (c)

    We develop the GIANT-PANDA method, which employs matrix sketching techniques on each local client to construct the partial approximate Hessian \(\tilde{\textbf{H}}^i_{\textbf{x}\textbf{x}}\). This method exhibits a linear-quadratic convergence rate:

    $$\begin{aligned} \lambda _{t+1}\le \left( \frac{\eta }{\sqrt{m}}+\frac{\eta ^2}{1-\eta }\right) \lambda _t +\beta _2 \lambda _t^2. \end{aligned}$$

    This result leads to a sharper analysis compared to the original GIANT method in (Wang et al., 2018), as it improves the convergence rate by a factor of \(\sqrt{\kappa _\textbf{g}}\) in the linear term.

Organization We introduce fundamental notation, assumptions, and preliminary results for Hessian approximation in Sect. 2. In Sects. 3 and 4, we introduce PAN for solving (1) and introduce PANDA (along with GIANT-PANDA) for solving (2), respectively. We conduct empirical studies in Sect. 5 and provide conclusions in Sect. 6. All proofs are deferred to the appendix.

2 Preliminaries

2.1 Notation and assumptions

We use \(\textbf{g}_\textbf{x}(\textbf{x},\textbf{y})\) and \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x},\textbf{y})\) to denote the gradient \(\nabla _{\textbf{x}} f(\textbf{x},\textbf{y})\) and Hessian \(\nabla ^2_{\textbf{x}\textbf{x}} f(\textbf{x},\textbf{y})\) with respect to \(\textbf{x}\), respectively (similar for \(\textbf{g}_\textbf{y}\), \(\textbf{H}_{\textbf{x}\textbf{y}}\), \(\textbf{H}_{\textbf{y}\textbf{y}}\)). For the local gradient and Hessian associated with the i-th client, we use \(\textbf{g}_\textbf{x}^i(\textbf{x},\textbf{y})\) and \(\textbf{H}_{\textbf{x}\textbf{x}}^{i}(\textbf{x},\textbf{y})\) to denote \(\nabla _{\textbf{x}} f^i(\textbf{x},\textbf{y})\) and \(\nabla ^2_{\textbf{x}\textbf{x}} f^i(\textbf{x},\textbf{y})\) (similar for \(\textbf{g}_\textbf{y}^{i}\), \(\textbf{H}_{\textbf{x}\textbf{y}}^i\), \(\textbf{H}_{\textbf{y}\textbf{y}}^i\)). We use \(\left\| \,\cdot \, \right\|\) to denote the spectral norm for matrices and the Euclidean norm for vectors. Additionally, we define the matrix row coherence as follows.

Definition 2.1

(Wang et al., 2018, Definition 1) Let \(\textbf{A}\in {\mathbb {R}}^{N\times d}\) be a matrix with full column rank and \(\textbf{A}=\textbf{U}\Sigma \textbf{V}^\top\) be its reduced singular value decomposition with \(\textbf{U}, \textbf{V}\in {\mathbb {R}}^{N\times d}\). The row coherence of \(\textbf{A}\) is defined as \(\nu (\textbf{A})\overset{\textrm{def}}{=}\frac{N}{d}\max _j \Vert \textbf{u}_j\Vert ^2 \in [1, \frac{N}{d}]\), where \(\textbf{u}_j\) is the j-th row of \(\textbf{U}\).
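As a small illustration, the row coherence in Definition 2.1 can be computed directly from the reduced SVD; the following Python sketch (with a randomly generated matrix as a stand-in for real data) shows one way to do so:

```python
import numpy as np

def row_coherence(A: np.ndarray) -> float:
    """nu(A) = (N / d) * max_j ||u_j||^2, where u_j is the j-th row of U in the reduced SVD."""
    N, d = A.shape
    U, _, _ = np.linalg.svd(A, full_matrices=False)   # U has shape (N, d)
    return N / d * float(np.max(np.sum(U ** 2, axis=1)))

rng = np.random.default_rng(0)
print(row_coherence(rng.standard_normal((1000, 20))))  # always lies in [1, N/d] = [1, 50]
```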

We introduce the following assumption for the objective function in (1).

Assumption 2.2

We assume \(f({\textbf{x}},{\textbf{y}})\) is twice differentiable, \(\mu\)-strongly convex in \({\textbf{x}}\), \(\mu\)-strongly concave in \({\textbf{y}}\), and has \(L_\textbf{g}\)-Lipschitz continuous gradient and \(L_\textbf{H}\)-Lipschitz continuous Hessian. We also assume each individual function \(l_j(\textbf{x},\textbf{y})\) is convex in \(\textbf{x}\). We denote \(\kappa _\textbf{g}\overset{\textrm{def}}{=}L_\textbf{g}/\mu\) and \(\kappa _\textbf{H}\overset{\textrm{def}}{=}L_\textbf{H}/\mu\).

The convexity of each individual function \(l_j\) in \(\textbf{x}\), along with the \(L_\textbf{g}\)-Lipschitz continuity of the gradient \(\textbf{g}_\textbf{x}\), implies that the Hessian \(\nabla _{\textbf{x}\textbf{x}}^2 l_j\) is bounded. Let us denote \(K\overset{\textrm{def}}{=}\max _{j}\Vert \nabla ^2_{\textbf{x}\textbf{x}} l_j\Vert\) and \(\hat{\kappa }\overset{\textrm{def}}{=}K/\mu\).

2.2 Matrix approximation via sub-sampling and sketching

Let us introduce some preliminary results for approximating a positive definite Hessian matrix. We first consider a Hessian matrix in the form of \(\textbf{H}= \frac{1}{N}\sum _{j=1}^N \textbf{H}_j\in {\mathbb {R}}^{d\times d}\), and approximate it using sub-sampling:

$$\begin{aligned} \tilde{\textbf{H}} = \frac{1}{|{\mathcal {S}}|}\sum _{j\in {\mathcal {S}}}\textbf{H}_j, \end{aligned}$$
(3)

where elements in \({\mathcal {S}}\) are uniformly sampled from \(\{1,\cdots , N\}\). The following lemma characterizes the error of the sub-sampling approximation.

Lemma 2.3

(Ye et al., 2021, Lemma 9) Suppose \(\textbf{H}\succeq \mu \textbf{I}\) and \(\max _{1\le j\le N}\Vert \textbf{H}_j\Vert \le \hat{K}\) for some constants \(\mu , \hat{K}>0\). For any \(\delta \in (0,1)\) and \(\eta \in (0,0.5)\), if the sample size satisfies \(|{\mathcal {S}}|\ge \frac{3\hat{K}\log (2d/\delta )}{\mu \eta ^2}\), then with probability at least \(1-\delta\), we have \((1-\eta )\textbf{H}\preceq \tilde{\textbf{H}}\preceq (1+\eta )\textbf{H}\) for \(\tilde{\textbf{H}}\) defined in (3).
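To make the sub-sampling approximation (3) and the spectral guarantee of Lemma 2.3 concrete, the following Python sketch builds a toy finite-sum Hessian (synthetic data of our own choosing) and measures the empirical \(\eta\):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, mu = 20_000, 10, 1e-1

# Toy per-sample Hessians H_j = a_j a_j^T + mu * I, so H = (1/N) sum_j H_j.
a = rng.standard_normal((N, d))
H = a.T @ a / N + mu * np.eye(d)

# Sub-sampled approximation (3) with a uniformly sampled index set S.
S = rng.choice(N, size=2_000, replace=False)
H_tilde = a[S].T @ a[S] / len(S) + mu * np.eye(d)

# Empirical eta: (1 - eta) H <= H_tilde <= (1 + eta) H holds iff the eigenvalues of
# H^{-1/2} H_tilde H^{-1/2} lie in [1 - eta, 1 + eta].
w, V = np.linalg.eigh(H)
H_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
print(np.max(np.abs(np.linalg.eigvalsh(H_inv_sqrt @ H_tilde @ H_inv_sqrt) - 1.0)))
```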

We then consider a special case where the Hessian matrix is expressed as \(\textbf{H}=\textbf{A}^{\top }\textbf{A}+\alpha \textbf{I}\), with \(\textbf{A}\in {\mathbb {R}}^{N\times d}\) being a full column-rank matrix. This form of Hessian matrix naturally arises in classical regression problems (Ye et al., 2021; Wang et al., 2017). We construct two approximate Hessians using sketching techniques:

$$\begin{aligned} \tilde{\textbf{H}}_i = \textbf{A}^{\top }\textbf{S}_i\textbf{S}_i^{\top }\textbf{A}+ \alpha \textbf{I},\quad \quad \hat{\textbf{H}}=\textbf{A}^{\top }\textbf{S}\textbf{S}^{\top }\textbf{A}+\alpha \textbf{I}. \end{aligned}$$
(4)

Here, \(\textbf{S}_1,\cdots ,\textbf{S}_m\in {\mathbb {R}}^{N\times s'}\) are sketching matrices and \(\textbf{S}\overset{\textrm{def}}{=}\frac{1}{\sqrt{m}}[\textbf{S}_1,\cdots ,\textbf{S}_m]\in {\mathbb {R}}^{N\times ms'}\). The following lemma characterizes the errors of these two sketching approximations.

Lemma 2.4

Adapted from (Wang et al., 2018, Lemma 8)

Let \(\eta\) and \(\delta \in (0,1)\) be fixed parameters, \(\nu =\nu (\textbf{A})\), and \(\textbf{S}_1,\cdots ,\textbf{S}_m\in {\mathbb {R}}^{N\times s'}\) be independent uniform sampling matrices with \(s'\ge \frac{3\nu d}{\eta ^2}\log (\frac{d m}{\delta })\). Then, with probability at least \(1-\delta\), we have \((1-\eta )\textbf{H}\preceq \tilde{\textbf{H}}_i\preceq (1+\eta ) \textbf{H}\) and \((1-\eta /\sqrt{m})\textbf{H}\preceq \hat{\textbf{H}}\preceq (1+\eta /\sqrt{m}) \textbf{H}\) for \(\tilde{\textbf{H}}_i\) and \(\hat{\textbf{H}}\) defined in (4).
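The improvement from \(\eta\) to \(\eta /\sqrt{m}\) for the averaged sketch \(\hat{\textbf{H}}\) (which equals the average of the \(\tilde{\textbf{H}}_i\)) can be checked numerically; the sketch below uses synthetic data and uniform sampling matrices, with sizes chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, alpha, m, s_prime = 20_000, 10, 1e-1, 8, 1_000

A = rng.standard_normal((N, d)) / np.sqrt(N)
H = A.T @ A + alpha * np.eye(d)
w, V = np.linalg.eigh(H)
H_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

def spectral_eta(H_approx):
    """Smallest eta with (1 - eta) H <= H_approx <= (1 + eta) H."""
    return np.max(np.abs(np.linalg.eigvalsh(H_inv_sqrt @ H_approx @ H_inv_sqrt) - 1.0))

# Each S_i selects s' rows uniformly and rescales by sqrt(N / s'); averaging the
# resulting tilde{H}_i reproduces hat{H} in (4) because S = (1/sqrt(m))[S_1, ..., S_m].
etas, H_hat = [], np.zeros((d, d))
for _ in range(m):
    idx = rng.choice(N, size=s_prime, replace=False)
    H_i = (N / s_prime) * A[idx].T @ A[idx] + alpha * np.eye(d)
    H_hat += H_i / m
    etas.append(spectral_eta(H_i))

print(f"single sketch eta ~ {np.mean(etas):.3f}, averaged sketch eta ~ {spectral_eta(H_hat):.3f}")
```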

3 The analysis framework of partially approximate Newton method

In this section, we propose a Partially Approximate Newton (PAN) method for Problem (1). We start with the classical Newton update:

$$\begin{aligned} \begin{bmatrix} \textbf{x}_{+}\\ \textbf{y}_{+} \end{bmatrix} ={ \begin{bmatrix} \textbf{x} \\ \textbf{y} \end{bmatrix}}- \begin{bmatrix} \textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x},\textbf{y})& \quad \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x},\textbf{y})\\ (\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x},\textbf{y}))^{\top }& \quad \textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x},\textbf{y}) \end{bmatrix}^{-1}\begin{bmatrix} \textbf{g}_\textbf{x}(\textbf{x},\textbf{y})\\ \textbf{g}_\textbf{y}(\textbf{x},\textbf{y}) \end{bmatrix}. \end{aligned}$$
(5)

Using the approximate Hessian matrix \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\) to replace the exact Hessian matrix \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x},\textbf{y})\) in (5) leads to the update rule of PAN as follows:

$$\begin{aligned} \begin{bmatrix} \textbf{x}_{+}\\ \textbf{y}_{+} \end{bmatrix}= { \begin{bmatrix} \textbf{x} \\ \textbf{y} \end{bmatrix}}- \begin{bmatrix} \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}& \quad \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x},\textbf{y})\\ (\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x},\textbf{y}))^{\top }& \quad \textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x},\textbf{y}) \end{bmatrix}^{-1}\begin{bmatrix} \textbf{g}_\textbf{x}(\textbf{x},\textbf{y})\\ \textbf{g}_\textbf{y}(\textbf{x},\textbf{y}) \end{bmatrix}. \end{aligned}$$
(6)
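For concreteness, the following Python sketch implements one PAN step (6) and runs it on a toy SC-SC quadratic; the toy problem, the perturbation used as \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\), and all names are ours, chosen only to illustrate the update:

```python
import numpy as np

def pan_step(x, y, g_x, g_y, H_tilde_xx, H_xy, H_yy):
    """One PAN update (6): a Newton step with H_xx replaced by an approximation."""
    n_x = x.size
    K = np.block([[H_tilde_xx, H_xy],
                  [H_xy.T,     H_yy]])
    step = np.linalg.solve(K, np.concatenate([g_x, g_y]))
    return x - step[:n_x], y - step[n_x:]

# Toy SC-SC quadratic f(x, y) = 0.5 x^T A x + x^T B y - 0.5 y^T C y (saddle point at zero).
rng = np.random.default_rng(3)
n_x, n_y = 50, 2
M = rng.standard_normal((n_x, n_x)); A = M @ M.T / n_x + np.eye(n_x)
B = rng.standard_normal((n_x, n_y)) / np.sqrt(n_x)
C = np.eye(n_y)

x, y = rng.standard_normal(n_x), rng.standard_normal(n_y)
for _ in range(5):
    g_x, g_y = A @ x + B @ y, B.T @ x - C @ y
    print(f"gradient norm: {np.linalg.norm(np.concatenate([g_x, g_y])):.2e}")
    H_tilde = A + 0.01 * np.diag(rng.standard_normal(n_x))  # a crude (1 +- eta)-type surrogate of H_xx = A
    x, y = pan_step(x, y, g_x, g_y, H_tilde, B, -C)         # H_yy = -C for this f
```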

We use the weighted gradient norm as the measure in our analysis (Liu et al., 2022):

$$\begin{aligned} \lambda (\textbf{x},\textbf{y})\! \overset{\textrm{def}}{=}\!&\sqrt{(\textbf{g}_\textbf{x}(\textbf{x},\textbf{y}))^{\top }(\textbf{P}(\textbf{x},\textbf{y}))^{-1}\textbf{g}_\textbf{x}(\textbf{x},\textbf{y})}\! +\!\frac{2}{\sqrt{\mu }} \Vert \textbf{g}_\textbf{y}(\textbf{x},\textbf{y})\Vert , \end{aligned}$$

where \(\textbf{P}(\textbf{x},\textbf{y})\) is defined as

$$\begin{aligned} \textbf{P}(\textbf{x},\textbf{y}) \! \overset{\textrm{def}}{=}\textbf{H}_{{\textbf{x}}{\textbf{x}}}({\textbf{x}},{\textbf{y}})\! -\! \textbf{H}_{{\textbf{x}}{\textbf{y}}}({\textbf{x}},{\textbf{y}})(\textbf{H}_{{\textbf{y}}{\textbf{y}}}({\textbf{x}},{\textbf{y}}))^{-1}\textbf{H}_{\textbf{y}\textbf{x}} ({\textbf{x}},{\textbf{y}}). \end{aligned}$$
(7)

The following lemma shows that if \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\) is a good approximation of \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x},\textbf{y})\), then \(\textbf{P}(\textbf{x},\textbf{y})\) can also be well approximated by \(\textbf{C}(\textbf{x},\textbf{y})\), defined as

$$\begin{aligned} \textbf{C}(\textbf{x},\textbf{y}) \overset{\textrm{def}}{=}\tilde{\textbf{H}}_{{\textbf{x}}{\textbf{x}}} - \textbf{H}_{{\textbf{x}}{\textbf{y}}}({\textbf{x}},{\textbf{y}})({\textbf{H}}_{{\textbf{y}}{\textbf{y}}}({\textbf{x}},{\textbf{y}}))^{-1}\textbf{H}_{\textbf{y}\textbf{x}}({\textbf{x}},{\textbf{y}}). \end{aligned}$$
(8)
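A minimal NumPy sketch of the measure \(\lambda (\textbf{x},\textbf{y})\) and of the approximation error \(\Vert \textbf{I}-\textbf{P}^{1/2}\textbf{C}^{-1}\textbf{P}^{1/2}\Vert\) that appears in the results below is given here; the helper names are ours, and the inputs are the gradient and Hessian blocks evaluated at \((\textbf{x},\textbf{y})\):

```python
import numpy as np

def weighted_grad_norm(g_x, g_y, H_xx, H_xy, H_yy, mu):
    """lambda(x, y) = sqrt(g_x^T P^{-1} g_x) + (2 / sqrt(mu)) ||g_y||, with P as in (7)."""
    P = H_xx - H_xy @ np.linalg.solve(H_yy, H_xy.T)
    return np.sqrt(g_x @ np.linalg.solve(P, g_x)) + 2.0 / np.sqrt(mu) * np.linalg.norm(g_y)

def approximation_error(H_tilde_xx, H_xx, H_xy, H_yy):
    """||I - P^{1/2} C^{-1} P^{1/2}|| for P in (7) and C in (8)."""
    schur = H_xy @ np.linalg.solve(H_yy, H_xy.T)
    P, C = H_xx - schur, H_tilde_xx - schur
    w, V = np.linalg.eigh(P)                      # P is positive definite under the SC-SC condition
    P_sqrt = V @ np.diag(np.sqrt(w)) @ V.T
    return np.linalg.norm(np.eye(P.shape[0]) - P_sqrt @ np.linalg.solve(C, P_sqrt), 2)
```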

Lemma 3.1

Suppose Assumption 2.2 holds and \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\) satisfies \((1-\eta ) \textbf{H}_{\textbf{x}\textbf{x}}\preceq \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\preceq (1+\eta )\textbf{H}_{\textbf{x}\textbf{x}}\). Then we have

$$\begin{aligned} \left\| \textbf{I}- \textbf{P}(\textbf{x},\textbf{y})^{1/2}(\textbf{C}(\textbf{x},\textbf{y}))^{-1}\textbf{P}(\textbf{x},\textbf{y})^{1/2} \right\| \le \frac{\eta }{1-\eta } \end{aligned}$$

We establish a linear-quadratic convergence rate for the PAN update when \(\textbf{P}(\textbf{x},\textbf{y})\) and \(\textbf{C}(\textbf{x},\textbf{y})\) are close.

Theorem 3.2

Suppose Assumption 2.2 holds and \(\textbf{P}(\textbf{x},\textbf{y})\) and \(\textbf{C}(\textbf{x},\textbf{y})\) are close in the sense that \(\Vert \textbf{I}- \textbf{P}(\textbf{x},\textbf{y})^{1/2}\textbf{C}(\textbf{x},\textbf{y})^{-1}\textbf{P}(\textbf{x},\textbf{y})^{1/2}\Vert \le \eta _1\). Then the update of PAN in (6) exhibits the following linear-quadratic convergence rate:

$$\begin{aligned} \lambda (\textbf{x}_{+},\textbf{y}_{+})\le \eta _1 \lambda (\textbf{x},\textbf{y}) + \frac{12\kappa _\textbf{g}^2\kappa _{\textbf{H}}(1+\eta _1)^2}{\sqrt{\mu }} \lambda (\textbf{x},\textbf{y})^2. \end{aligned}$$

When employing the sub-sampling approximation to construct \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\), we derive the following corollary by combining the results from Lemma 3.1 and Theorem 3.2.

Corollary 3.3

Let us construct the partial Hessian approximation by sub-sampling \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}=\frac{1}{|S|}\sum _{j\in S}\nabla ^2_{\textbf{x}\textbf{x}} l_j(\textbf{x},\textbf{y})\). Under Assumption 2.2 and for any \(\delta \in (0, 1)\), if the sample size satisfies \(|S|\ge {12\hat{\kappa }\log (2n_\textbf{x}/\delta )}\), then with probability at least \(1-\delta\), the update of PAN in (6) satisfies

$$\begin{aligned} \lambda (\textbf{x}_{+},\textbf{y}_{+})\le \frac{\eta }{1-\eta } \lambda (\textbf{x},\textbf{y}) + \frac{12\kappa _\textbf{g}^2\kappa _{\textbf{H}}(1+\eta /(1-\eta ))^2}{\sqrt{\mu }} \lambda (\textbf{x},\textbf{y})^2 \end{aligned}$$
(9)

with \(\eta =\eta _{\textrm{PAN}}\overset{\textrm{def}}{=}\sqrt{\frac{3\hat{\kappa }\log (2n_\textbf{x}/\delta )}{|S|}}\).

Corollary 3.3 suggests that PAN requires \({\mathcal {O}}\left( {\log (1/\epsilon )}/{\log (|S|/\hat{\kappa })}\right)\) iterations to achieve \(\epsilon\)-accuracy in terms of the measure \(\lambda (\textbf{x},\textbf{y})\) for a quadratic objective function. Analyzing the complexity of linear-quadratic rates on quadratic functions is a common practice in the literature (Roosta-Khorasani & Mahoney, 2019; Wang et al., 2018; Ye et al., 2020, 2021), which allows us to simply ignore the quadratic term in (9) since \(\kappa _\textbf{H}=0\). In comparison, state-of-the-art first-order methods such as optimistic gradient and extra gradient methods have complexities \({\mathcal {O}}\left( \kappa _\textbf{g}\log (1/\epsilon )\right)\); methods that do not access the full Hessian at each iteration such as the quasi-Newton method (Liu & Luo, 2022) and the partial-quasi-Newton method (Liu et al., 2022) have complexities \({\mathcal {O}}(\kappa _\textbf{g}^2 + \sqrt{n_\textbf{x}\log (1/\epsilon )})\) and \({\mathcal {O}}(\kappa _\textbf{g}+ \sqrt{n_\textbf{x}\log (1/\epsilon )})\), respectively. We can see that PAN exhibits a much weaker dependency on the condition number \(\kappa _\textbf{g}\). We present the comparisons in Table 1.

Table 1 We present the iteration complexity of the proposed method (PAN) and the baselines for solving quadratic minimax optimization (AUC maximization)

4 Partially approximate Newton methods for distributed minimax optimization

In this section, we present the PANDA method for solving Problem (2) in Sect. 4.1, establish its convergence results in Sect. 4.2, and extend PANDA to GIANT-PANDA for a special function class that commonly appears in regression problems in Sect. 4.3.

Algorithm 1 PANDA\(\,({\textbf{x}}_0, {\textbf{y}}_0, T)\)

4.1 The PANDA algorithm

For simplicity, we suppress the evaluation point and use \(\textbf{g}_\textbf{x}\) to denote \(\textbf{g}_\textbf{x}(\textbf{x},\textbf{y})\) (similar for \(\textbf{g}_\textbf{y}\), \(\textbf{H}_{\textbf{x}\textbf{x}}\), \(\textbf{H}_{\textbf{x}\textbf{y}}\), \(\textbf{H}_{\textbf{y}\textbf{y}}\)). We start with the standard Newton direction \(\begin{bmatrix} {\textbf{d}}_{\textbf{x}}\\ {\textbf{d}}_\textbf{y}\end{bmatrix} \overset{\textrm{def}}{=}\begin{bmatrix} \textbf{H}_{\textbf{x}\textbf{x}}& \textbf{H}_{\textbf{x}\textbf{y}}\\ \textbf{H}_{\textbf{x}\textbf{y}}^{\top }& \textbf{H}_{\textbf{y}\textbf{y}} \end{bmatrix} ^{-1}\begin{bmatrix} \textbf{g}_\textbf{x}\\ \textbf{g}_\textbf{y}\end{bmatrix}\), which can be expressed explicitly by using the block matrix inversion formula:

$$\begin{aligned} \begin{aligned} {\textbf{d}}_{\textbf{x}}&= \textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{g}_\textbf{x}- \left( \textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} \right) \Delta _{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}+ \left( \textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} \right) {\Delta }_{\textbf{y}\textbf{y}} \left( \textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} \right) ^{\top }\textbf{g}_\textbf{x}\\ {\textbf{d}}_\textbf{y}&= -\Delta _{\textbf{y}\textbf{y}}\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{g}_\textbf{x}+ \Delta _{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}, \end{aligned} \end{aligned}$$
(10)

where \({\Delta }_{\textbf{y}\textbf{y}} = (\textbf{H}_{\textbf{y}\textbf{y}}-\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} )^{-1}.\)
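The block-inversion form (10) can be verified against a direct solve of the full Newton system; the following Python sketch does so on randomly generated Hessian blocks (synthetic and only for checking the algebra):

```python
import numpy as np

rng = np.random.default_rng(4)
n_x, n_y = 40, 3

# Random SC-SC-style blocks: H_xx positive definite, H_yy negative definite.
M = rng.standard_normal((n_x, n_x)); H_xx = M @ M.T / n_x + np.eye(n_x)
H_xy = rng.standard_normal((n_x, n_y)) / np.sqrt(n_x)
H_yy = -np.eye(n_y)
g_x, g_y = rng.standard_normal(n_x), rng.standard_normal(n_y)

# Direct solve of the full Newton system.
K = np.block([[H_xx, H_xy], [H_xy.T, H_yy]])
d_direct = np.linalg.solve(K, np.concatenate([g_x, g_y]))

# Block-inversion form (10).
Q = np.linalg.solve(H_xx, H_xy)                     # H_xx^{-1} H_xy
q = np.linalg.solve(H_xx, g_x)                      # H_xx^{-1} g_x
Delta = np.linalg.inv(H_yy - H_xy.T @ Q)            # Delta_yy
d_x = q - Q @ (Delta @ g_y) + Q @ (Delta @ (Q.T @ g_x))
d_y = -Delta @ (H_xy.T @ q) + Delta @ g_y
print(np.allclose(d_direct, np.concatenate([d_x, d_y])))   # True
```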

Under the setup of unbalanced dimensions \(n_\textbf{x}\gg n_\textbf{y}\), obtaining the exact Hessian \(\textbf{H}_{\textbf{x}\textbf{x}}\) on the server is prohibitive due to the communication overhead associated with \(\textbf{H}^i_{\textbf{x}\textbf{x}}\). However, communication costs of gradients and partial Hessians \(\textbf{H}_{\textbf{x}\textbf{y}}^{i}\) and \(\textbf{H}_{\textbf{y}\textbf{y}}^i\) are relatively low. Thus, in the first round of PANDA, the server aggregates these quantities to acquire precise gradient and partial Hessian information as follows:

$$\begin{aligned} \!\!\!\!\!\!\! \textbf{g}_{\textbf{x}} = \frac{1}{m}\sum _{i=1}^{m} \textbf{g}^i_{\textbf{x}}, \quad \textbf{g}_{\textbf{y}}=\frac{1}{m}\sum _{i=1}^m \textbf{g}^{i}_{\textbf{y}},\quad \textbf{H}_{\textbf{x}\textbf{y}} = \frac{1}{m}\sum _{i=1}^{m} \textbf{H}_{\textbf{x}\textbf{y}}^i,\quad \textbf{H}_{\textbf{y}\textbf{y}} = \frac{1}{m}\sum _{i=1}^{m} \textbf{H}_{\textbf{y}\textbf{y}}^i. \end{aligned}$$
(11)

The server then broadcasts the above-aggregated quantities to the clients, allowing each client to access global information of \(\textbf{g}_\textbf{x}\), \(\textbf{g}_\textbf{y}\), \(\textbf{H}_{\textbf{x}\textbf{y}}\), and \(\textbf{H}_{\textbf{y}\textbf{y}}\). Further, since the communication costs of \(\textbf{Q}^{i}_{\textbf{x}\textbf{y}}\overset{\textrm{def}}{=}[\textbf{H}^{i}_{\textbf{x}\textbf{x}}]^{-1}\textbf{H}_{\textbf{x}\textbf{y}}\) and \(\textbf{q}^{i}_\textbf{x}\overset{\textrm{def}}{=}[\textbf{H}^{i}_{\textbf{x}\textbf{x}}]^{-1}\textbf{g}_\textbf{x}\) are only \({\mathcal {O}}(n_\textbf{x}n_\textbf{y})\) and \({\mathcal {O}}(n_\textbf{x})\), in the second round of PANDA, the server aggregates \(\textbf{Q}^{i}_{\textbf{x}\textbf{y}}\) and \(\textbf{q}^{i}_\textbf{x}\) as follows:

$$\begin{aligned} \textbf{Q}_{\textbf{x}\textbf{y}} = \frac{1}{m}\sum _{i=1}^{m}\textbf{Q}^{i}_{\textbf{x}\textbf{y}},\quad \quad \textbf{q}_{\textbf{x}} =\frac{1}{m} \sum _{i=1}^{m}\textbf{q}^i_\textbf{x}. \end{aligned}$$
(12)

Using \(\textbf{Q}_{\textbf{x}\textbf{y}}\) and \(\textbf{q}_\textbf{x}\) to replace \(\textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}}\) and \(\textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{g}_\textbf{x}\) in (10), the server finally computes the following approximate Newton direction

$$\begin{aligned} \begin{aligned} \begin{bmatrix} \tilde{{\textbf{d}}}_{\textbf{x}}\\ \tilde{{\textbf{d}}}_\textbf{y}\end{bmatrix} \overset{\textrm{def}}{=}\begin{bmatrix} \textbf{q}_\textbf{x}- \textbf{Q}_{\textbf{x}\textbf{y}}\tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}+ \textbf{Q}_{\textbf{x}\textbf{y}}\tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{Q}_{\textbf{x}\textbf{y}}^{\top }\textbf{g}_\textbf{x}\\ -\tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\textbf{q}_\textbf{x}+ \tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{g}_{\textbf{y}} \end{bmatrix}, \end{aligned} \end{aligned}$$
(13)

with \(\tilde{\Delta }_{\textbf{y}\textbf{y}} \overset{\textrm{def}}{=}[\textbf{H}_{\textbf{y}\textbf{y}} -\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\textbf{Q}_{\textbf{x}\textbf{y}}]^{-1}\) and updates the parameters based on \(\tilde{{\textbf{d}}}_\textbf{x}\) and \(\tilde{{\textbf{d}}}_\textbf{y}\).
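To make the two communication rounds (11)-(12) and the direction (13) concrete, the following single-process Python sketch mimics one PANDA iteration on the server; the dictionary keys and the function name are ours, and in an actual deployment the averages below correspond to network aggregations:

```python
import numpy as np

def panda_direction(local_stats):
    """One PANDA iteration, given per-client statistics g_x^i, g_y^i, H_xy^i, H_yy^i, H_xx^i."""
    m = len(local_stats)
    # Round 1: aggregate gradients and the small partial Hessian blocks, as in (11).
    g_x = sum(c["g_x"] for c in local_stats) / m
    g_y = sum(c["g_y"] for c in local_stats) / m
    H_xy = sum(c["H_xy"] for c in local_stats) / m
    H_yy = sum(c["H_yy"] for c in local_stats) / m
    # Round 2: each client returns Q^i = [H_xx^i]^{-1} H_xy and q^i = [H_xx^i]^{-1} g_x,
    # which the server aggregates as in (12).
    Q = sum(np.linalg.solve(c["H_xx"], H_xy) for c in local_stats) / m
    q = sum(np.linalg.solve(c["H_xx"], g_x) for c in local_stats) / m
    # Approximate Newton direction (13).
    Delta = np.linalg.inv(H_yy - H_xy.T @ Q)
    d_x = q - Q @ (Delta @ g_y) + Q @ (Delta @ (Q.T @ g_x))
    d_y = -Delta @ (H_xy.T @ q) + Delta @ g_y
    return d_x, d_y
```

In this sketch the local solves with \([\textbf{H}^i_{\textbf{x}\textbf{x}}]^{-1}\) are carried out inside the loop for simplicity; in the algorithm they are performed by the clients, so only \({\mathcal {O}}(n_\textbf{x}n_\textbf{y})\) numbers travel over the network.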

We formally summarize the PANDA method in Algorithm 1. The following proposition indicates that the update rule of PANDA can be viewed as a partially approximate Newton method.

Proposition 4.1

Using PANDA in Algorithm 1, the update rule on the server is equivalent to

$$\begin{aligned} \begin{aligned} { \begin{bmatrix} \textbf{x}_{t+1}\\ \textbf{y}_{t+1} \end{bmatrix}} ={ \begin{bmatrix} \textbf{x}_{t}\\ \textbf{y}_{t} \end{bmatrix}}-\begin{bmatrix} \tilde{\textbf{H}}_{\textbf{x}\textbf{x}, t } & \quad \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)\\ \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)^{\top }& \quad \textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x}_t,\textbf{y}_t) \end{bmatrix}^{-1}\!\!\begin{bmatrix} \textbf{g}_{\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\\ \textbf{g}_\textbf{y}(\textbf{x}_t,\textbf{y}_t) \end{bmatrix}, \end{aligned} \end{aligned}$$
(14)

where \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}, t} \overset{\textrm{def}}{=}\left[ \frac{1}{m}\sum _{i=1}^m(\textbf{H}^i_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t))^{-1}\right] ^{-1}.\)

Proof

We omit the subscript t in the following proof, writing \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}=\tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}\) and \(\textbf{H}_{\textbf{x}\textbf{y}}=\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)\) (similar for \(\textbf{H}_{\textbf{x}\textbf{x}}\), \(\textbf{H}_{\textbf{x}\textbf{x}}^i\), \(\textbf{H}_{\textbf{y}\textbf{y}}\), \(\textbf{g}_\textbf{x}\), \(\textbf{g}_\textbf{y}\)). We denote

$$\begin{aligned} \hat{\Delta }_{\textbf{y}\textbf{y}}&\!\overset{\textrm{def}}{=}\! [\textbf{H}_{\textbf{y}\textbf{y}}-\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}}]^{-1}\!\!= \!\left[ \textbf{H}_{\textbf{y}\textbf{y}}-\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\frac{1}{m}\sum _{i=1}^m(\textbf{H}_{\textbf{x}\textbf{x}}^i)^{-1}\textbf{H}_{\textbf{x}\textbf{y}}\right] ^{-1}\\&= \left[ \textbf{H}_{\textbf{y}\textbf{y}}-\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\frac{1}{m}\sum _{i=1}^m\textbf{Q}_{\textbf{x}\textbf{y}}^{i}\right] ^{-1}{=}\tilde{\Delta }_{\textbf{y}\textbf{y}}. \end{aligned}$$

Then, it holds that

$$\begin{aligned}&\begin{bmatrix} \tilde{\textbf{H}}_{\textbf{x}\textbf{x}} & \quad \textbf{H}_{\textbf{x}\textbf{y}}\\ \textbf{H}_{\textbf{x}\textbf{y}}^{\top }& \quad \textbf{H}_{\textbf{y}\textbf{y}} \end{bmatrix}^{-1}\begin{bmatrix} \textbf{g}_{\textbf{x}}\\ \textbf{g}_\textbf{y}\end{bmatrix} \\&= \begin{bmatrix} \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{g}_\textbf{x}+ \left( \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} \right) {\hat{\Delta }_{\textbf{y}\textbf{y}}}\left( \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} \right) ^{\top }\textbf{g}_\textbf{x}- \left( \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} \right) \hat{\Delta }_{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}\\ -\hat{\Delta }_{\textbf{y}\textbf{y}}\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{g}_\textbf{x}+ \hat{\Delta }_{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}\end{bmatrix}\\&\!\!\!\overset{(12)}{=}\begin{bmatrix} \frac{1}{m}\sum _{i=1}^m \bigg (\textbf{q}_x^i \!\!+\! \textbf{Q}_{\textbf{x}\textbf{y}}^{i}\tilde{\Delta }_{\textbf{y}\textbf{y}} \big (\frac{1}{m}\sum _{i=1}^m\textbf{Q}_{\textbf{x}\textbf{y}}^i\big )^{\top }\textbf{g}_\textbf{x}- \textbf{Q}_{\textbf{x}\textbf{y}}^i\tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}\bigg ) \\ -\frac{1}{m}\sum _{i=1}^m\tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\textbf{q}_{\textbf{x}}^i + \tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}\end{bmatrix}\!\!\overset{(13)}{=}\!\!\begin{bmatrix} \tilde{{\textbf{d}}}_\textbf{x}\\ \tilde{{\textbf{d}}}_\textbf{y}\end{bmatrix}. \end{aligned}$$

\(\square\)
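Proposition 4.1 can also be checked numerically: on random local Hessian blocks, the aggregated direction (13) coincides with the partially approximate Newton step (14) built from the harmonic-mean Hessian. The sketch below uses synthetic blocks of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n_x, n_y = 4, 30, 2

# Random per-client H_xx^i (positive definite) and shared H_xy, H_yy, gradients.
H_xx_i = []
for _ in range(m):
    M = rng.standard_normal((n_x, n_x)); H_xx_i.append(M @ M.T / n_x + np.eye(n_x))
H_xy = rng.standard_normal((n_x, n_y)) / np.sqrt(n_x)
H_yy = -np.eye(n_y)
g_x, g_y = rng.standard_normal(n_x), rng.standard_normal(n_y)

# PANDA direction via the aggregations (12) and the formula (13).
Q = sum(np.linalg.solve(H, H_xy) for H in H_xx_i) / m
q = sum(np.linalg.solve(H, g_x) for H in H_xx_i) / m
Delta = np.linalg.inv(H_yy - H_xy.T @ Q)
d = np.concatenate([q - Q @ (Delta @ g_y) + Q @ (Delta @ (Q.T @ g_x)),
                    -Delta @ (H_xy.T @ q) + Delta @ g_y])

# Equivalent step (14) with the harmonic-mean Hessian tilde{H}_xx.
H_tilde = np.linalg.inv(sum(np.linalg.inv(H) for H in H_xx_i) / m)
K = np.block([[H_tilde, H_xy], [H_xy.T, H_yy]])
print(np.allclose(d, np.linalg.solve(K, np.concatenate([g_x, g_y]))))   # True
```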

4.2 Convergence analysis of PANDA

We suppose the N samples are drawn i.i.d. from some distribution and each sample is associated with a local loss function \(l_j(\cdot )\). We also assume each client holds s samples drawn from \(\{l_j(\cdot )\}_{j=1}^N\), such that \(N=ms\) and \(|S_i|\equiv s\). According to Lemma 2.3, each local partial Hessian, \(\textbf{H}_{\textbf{x}\textbf{x}}^i(\textbf{x}_t,\textbf{y}_t) = \frac{1}{s} \sum _{j\in S_i}\nabla ^2_{\textbf{x}\textbf{x}} l_j(\textbf{x}_t,\textbf{y}_t)\), can be viewed as a sub-sampling approximation of \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\) when s is large. The following lemma indicates that

$$\begin{aligned} \textbf{C}_t\overset{\textrm{def}}{=}\tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}-\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)(\textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x}_t,\textbf{y}_t))^{-1}(\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t))^{\top } \end{aligned}$$

with \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}\) defined in (14) is a good estimate of \(\textbf{P}(\textbf{x}_t,\textbf{y}_t)\).

Lemma 4.2

Suppose Assumption 2.2 holds. If, for all \(i\in [m]\), \(\textbf{H}_{\textbf{x}\textbf{x}}^i(\textbf{x}_t,\textbf{y}_t)\) satisfies

$$\begin{aligned} \left( 1-\eta \right) \textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\preceq \textbf{H}_{\textbf{x}\textbf{x}}^{i}(\textbf{x}_t,\textbf{y}_t)\preceq (1+\eta ) \textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t), \end{aligned}$$

then we have \(\left\| \textbf{I}- \textbf{P}(\textbf{x}_t,\textbf{y}_t)^{1/2}{\textbf{C}}_{t}^{-1}\textbf{P}(\textbf{x}_t,\textbf{y}_t)^{1/2} \right\| \le \frac{\eta ^2}{1-\eta }\).

Incorporating the linear-quadratic rates established by the PAN framework, we can obtain the improved linear-quadratic rates for PANDA.

Theorem 4.3

Suppose Assumption 2.2 holds and that

$$\begin{aligned} \left( 1-{\eta }\right) \textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\preceq \textbf{H}_{\textbf{x}\textbf{x}}^{i}(\textbf{x}_t,\textbf{y}_t)\preceq (1+\eta )\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t) \end{aligned}$$

holds for all \(i\in [m]\). Then the update rule of PANDA (Algorithm 1) in (14) satisfies

$$\begin{aligned} \begin{aligned} \lambda (\textbf{x}_{t+1},\textbf{y}_{t+1})&\le \frac{\eta ^2}{1-\eta }\lambda (\textbf{x}_t,\textbf{y}_t)+\frac{12\kappa _\textbf{g}^2\kappa _{\textbf{H}}(1-\eta +\eta ^2)^2}{\sqrt{\mu }(1-\eta )^2} \lambda (\textbf{x}_t,\textbf{y}_t)^2. \end{aligned} \end{aligned}$$
(15)

Similar to Corollary 3.3, we can guarantee a small \(\eta \in (0,0.5)\) for Theorem 4.3.

Corollary 4.4

Under Assumption 2.2, for any \(\delta \in (0,1)\) and \(\eta \in (0,0.5)\), if each client holds \(s\ge \frac{3\hat{\kappa }\log (2n_\textbf{x}m/\delta )}{\eta ^2}\) samples, then with probability at least \(1-\delta\), the update rule of PANDA (Algorithm 1) in (14) satisfies (15).

Remark 4.5

Corollary 4.4 can be interpreted as follows: if N is at least \(12m\hat{\kappa }\log (2n_\textbf{x}m/\delta )\), then (15) holds with probability at least \(1-\delta\), where \(\eta =\eta _\mathrm{\text {PANDA}}\overset{\textrm{def}}{=}\sqrt{\frac{3\hat{\kappa }m\log (2n_\textbf{x}m/\delta )}{ N}}\).

We highlight the advancements of the PANDA method in the following two aspects:

  1. (a)

    We compare PANDA with its single-agent version, which corresponds to using N/m samples to construct the approximated Hessian in PAN. According to Corollary 3.3 and Corollary 4.4, we observe that \(\eta _{\text {PAN}} = \sqrt{\frac{3\hat{\kappa }m\log (2n_\textbf{x}/\delta )}{N}}\approx \eta _{\text {PANDA}}\). This indicates that the linear-quadratic rate (15) of PANDA significantly improves upon its single-agent version (9), which demonstrates the superiority of using the distributed framework.

  2. (b)

    We compare PANDA with state-of-the-art first-order methods in Table 2. Both distributed EG and Proxskip-VI-FL (Zhang et al., 2024) require \({\mathcal {O}}(\kappa _\textbf{g}\log (1/\epsilon ))\) communication rounds, whereas PANDA only requires \({\mathcal {O}}\left( \frac{\log (1/\epsilon )}{\log (N/(m\hat{\kappa }))}\right)\) communication rounds. This highlights the advantage of using second-order information.

Table 2 We present the communication complexity of the proposed method (PANDA) and the baselines for solving quadratic distributed minimax optimization (AUC maximization)

4.3 Extension to the GIANT-PANDA Algorithm

PANDA exhibits provably faster convergence rates than first-order methods for distributed minimax optimization; however, each client is required to access the full local Hessian at each iteration. In this section, we develop a communication-efficient algorithm that allows using an inexact Hessian instead of the exact one during the local computation.

We focus on a specific function class in which each \(l_j(\cdot )\) in (2) can be expressed as \(l_j(\textbf{x},\textbf{y})\overset{\textrm{def}}{=}h_j(\textbf{w}_j^{\top }\textbf{x},\textbf{y}) + \frac{\mu }{2}\Vert \textbf{x}\Vert ^2\), where \(\textbf{w}_j\) is the data vector of the j-th sample and \(h_j(\cdot ,\cdot )\) is convex in \(\textbf{x}\) and \(\mu\)-strongly concave in \(\textbf{y}\). This function class generalizes the objective considered in convex optimization as discussed in GIANT (Wang et al., 2018), which has important applications in regression-type models.

The partial Hessian of the objective at \((\textbf{x}_t,\textbf{y}_t)\) can be written as

$$\begin{aligned} \textbf{H}_{\textbf{x}\textbf{x}} (\textbf{x}_t,\textbf{y}_t)&= \frac{1}{N}\sum _{j=1}^N\nabla ^2_{\textbf{x}\textbf{x}} h_j \left( \textbf{w}_j^{\top }\textbf{x}_t,\textbf{y}_t \right) \textbf{w}_j\textbf{w}_j^{\top } + \mu \textbf{I}= \frac{1}{m} \sum _{i=1}^m{\left\{ \textbf{A}_t^{\top }\textbf{S}^{i}(\textbf{S}^{i})^{\top }\textbf{A}_t + \mu \textbf{I}\right\} }, \end{aligned}$$

where \(\textbf{A}_t \overset{\textrm{def}}{=}\left[ {\textbf{a}}_1,\cdots ,{\textbf{a}}_N\right]^{\top } \in {\mathbb {R}}^{N\times n_\textbf{x}}\) is a full column-rank matrix with \(n_\textbf{x}\le N\), \({\textbf{a}}_j = \sqrt{\nabla ^2_{\textbf{x}\textbf{x}}h_j(\textbf{w}_j^{\top }\textbf{x}_t,\textbf{y}_t)}\textbf{w}_j/\sqrt{N}\), and \(\textbf{S}^{i}\) is a sketching matrix such that \((\textbf{S}^i)^{\top }\textbf{A}_t\) contains the rows of \(\textbf{A}_t\) indexed by \({\mathcal {S}}_i\). The local partial Hessian of the i-th client is then \(\textbf{H}_{\textbf{x}\textbf{x}}^{i}(\textbf{x}_t,\textbf{y}_t)\overset{\textrm{def}}{=}\textbf{A}_t^{\top }\textbf{S}^{i}(\textbf{S}^{i})^{\top }\textbf{A}_t + \mu \textbf{I}\).

Taking advantage of this structure, we perform a sketch operation on \(\textbf{H}_{\textbf{x}\textbf{x}}^{i}(\textbf{x}_t,\textbf{y}_t)\) to reduce the computation cost on each client:

$$\begin{aligned} \tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}^i \overset{\textrm{def}}{=}\textbf{A}_t^{\top }\tilde{\textbf{S}}_t^{i}(\tilde{\textbf{S}}_t^{i})^{\top }\textbf{A}_t + \mu \textbf{I}, \end{aligned}$$
(16)

where \(\tilde{\textbf{S}}_t^{i}\in {\mathbb {R}}^{N\times s_t}\) is formed by randomly choosing \(s_t\le s\) columns of \(\textbf{S}^i\). We replace \(\textbf{H}_{\textbf{x}\textbf{x}}^i(\textbf{x}_t,\textbf{y}_t)\) by its sketched approximation \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}^{i}\) in line 15 and line 16 of PANDA (Algorithm 1), naturally resulting in the modified algorithm GIANT-PANDA. The routine of GIANT-PANDA is formally presented in Algorithm 2 in Appendix A.
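The following Python sketch illustrates the construction (16) on synthetic data. The data vectors, the curvature values standing in for \(\nabla ^2_{\textbf{x}\textbf{x}}h_j\), and the rescaling by \(N/s_t\) (our reading of the uniform sketch, chosen so that the sketched matrix is an unbiased estimate of \(\textbf{A}_t^{\top }\textbf{A}_t\)) are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
N, n_x, mu, m = 8_000, 50, 1e-1, 8
s = N // m                                           # local sample size per client

# Hypothetical per-sample data vectors w_j and positive curvatures of h_j at the current iterate.
W = rng.standard_normal((N, n_x))
h_dd = 1.0 / (1.0 + np.exp(rng.standard_normal(N)))  # stand-in for the scalar second derivative of h_j

A_t = np.sqrt(h_dd)[:, None] * W / np.sqrt(N)        # rows a_j = sqrt(h_j'') w_j / sqrt(N)
H_xx = A_t.T @ A_t + mu * np.eye(n_x)                # global partial Hessian H_xx(x_t, y_t)

# Client i holds the rows indexed by S_i; a further uniform sketch of ratio p = s_t / s gives (16).
i, p = 0, 0.3
local = np.arange(i * s, (i + 1) * s)
s_t = int(p * s)
sketch = rng.choice(local, size=s_t, replace=False)
H_tilde_i = (N / s_t) * A_t[sketch].T @ A_t[sketch] + mu * np.eye(n_x)

# Empirical eta: largest relative spectral deviation of the sketched local Hessian from H_xx.
w_eig, V = np.linalg.eigh(H_xx)
H_inv_sqrt = V @ np.diag(w_eig ** -0.5) @ V.T
print(np.max(np.abs(np.linalg.eigvalsh(H_inv_sqrt @ H_tilde_i @ H_inv_sqrt) - 1.0)))
```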

Now, we start to characterize the convergence behavior of GIANT-PANDA. Since GIANT-PANDA inherits the framework of PANDA, the update rule of GIANT-PANDA can be viewed as

$$\begin{aligned} \begin{aligned} { \begin{bmatrix} \textbf{x}_{t+1}\\ \textbf{y}_{t+1} \end{bmatrix}} ={ \begin{bmatrix} \textbf{x}_{t}\\ \textbf{y}_{t} \end{bmatrix}}-\begin{bmatrix} \tilde{\textbf{H}}_{\textbf{x}\textbf{x}, t }^{\text {gp}} & \quad \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)\\ \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)^{\top }& \quad \textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x}_t,\textbf{y}_t) \end{bmatrix}^{-1}\!\!\begin{bmatrix} \textbf{g}_{\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\\ \textbf{g}_\textbf{y}(\textbf{x}_t,\textbf{y}_t) \end{bmatrix}, \end{aligned} \end{aligned}$$
(17)

where \(\tilde{\textbf{H}}^{\text {gp}}_{\textbf{x}\textbf{x}, t} \overset{\textrm{def}}{=}\left[ \frac{1}{m}\sum _{i=1}^m[\tilde{\textbf{H}}^i_{\textbf{x}\textbf{x},t}]^{-1}\right] ^{-1}\).

Let \(\textbf{C}^{\textrm{gp}}_{t}\overset{\textrm{def}}{=}\tilde{\textbf{H}}^{{ \textrm{gp}}}_{\textbf{x}\textbf{x},t}-\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)[\textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)]^{-1}(\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t))^{\top }\). The following lemma shows that \(\textbf{C}^{\textrm{gp}}_{t}\) is still a good approximation to \(\textbf{P}(\textbf{x}_t,\textbf{y}_t)\).

Lemma 4.6

Let \(\eta ,\delta \in (0,1)\) be fixed parameters, \(\nu _t=\nu (\textbf{A}_t)\), and let \(\tilde{\textbf{S}}_t^1,\cdots ,\tilde{\textbf{S}}_t^m\) be independent uniform sampling matrices with \(s_t\ge \frac{3\nu _tn_\textbf{x}}{\eta ^2}\log \left( \frac{mn_\textbf{x}}{\delta }\right)\). Under Assumption 2.2, we have

$$\begin{aligned} \left\| \textbf{I}- \textbf{P}(\textbf{x}_t,\textbf{y}_t)^{1/2}({\textbf{C}}^{{\textrm{gp}}}_{t})^{-1}\textbf{P}(\textbf{x}_t,\textbf{y}_t)^{1/2} \right\| \le \frac{\eta }{\sqrt{m}}+ \frac{\eta ^2}{1-\eta } \end{aligned}$$

holds with probability at least \(1-\delta\).

Remark 4.7

The condition of Lemma 4.6 requires \(\{\tilde{\textbf{S}}^i_t\}\) to be uniform sampling matrices, which means we perform a uniform sketch to obtain the local approximate Hessian \(\tilde{\textbf{H}}^i_{\textbf{x}\textbf{x},t}\) in GIANT-PANDA. GIANT-PANDA also allows using other sketching techniques such as the count sketch (Clarkson & Woodruff, 2017; Meng & Mahoney, 2013) or the Gaussian sketch (Johnson & Lindenstrauss, 1984) to obtain \(\tilde{\textbf{S}}^i_t\). These sketching methods can improve the dependence of \(s_t\) on \(\nu _t\), but are more expensive to implement than the simple uniform sketching matrix (Wang et al., 2018).

Using the analysis framework of PAN, we establish the linear-quadratic rate of GIANT-PANDA.

Theorem 4.8

Under the same conditions as in Lemma 4.6, the update of GIANT-PANDA (Algorithm 2) in (17) satisfies

$$\begin{aligned} \begin{aligned} \lambda (\textbf{x}_{t+1},\textbf{y}_{t+1})&\le \left( \frac{\eta }{\sqrt{m}}+\frac{\eta ^2}{1-\eta }\right) \lambda (\textbf{x}_t,\textbf{y}_t) +\frac{c\kappa _\textbf{g}^2\kappa _\textbf{H}}{\sqrt{\mu }} \lambda (\textbf{x}_t,\textbf{y}_t)^2. \end{aligned} \end{aligned}$$
(18)

with probability at least \(1-\delta\), where \(c=\frac{12((\sqrt{m}-1)(\eta ^2-\eta +1)+1)^2}{(1-\eta )^2m}\).

Remark 4.9

The coefficient \(\left( \frac{\eta }{\sqrt{m}}+\frac{\eta ^2}{1-\eta }\right)\) of the linear term \(\lambda (\textbf{x}_{t},\textbf{y}_t)\) in (18) for GIANT-PANDA is slightly worse than the coefficient \(\left( \frac{\eta ^2}{1-\eta }\right)\) in (15) for PANDA. This is because GIANT-PANDA uses the approximate local partial Hessian instead of the full local partial Hessian. However, it is still better than the coefficient \(\left( \frac{\eta }{1-\eta }\right)\) in (9) for PAN by a factor of \(\frac{1}{\sqrt{m}}\). This demonstrates the advantage of utilizing m clients in the parallel training process.

Improved Results for GIANT. GIANT (Wang et al., 2018) (Algorithm 3 in Appendix A) can be regarded as a special case of GIANT-PANDA for convex optimization when taking \(n_\textbf{y}= 0\). Using the analysis techniques developed for GIANT-PANDA, we also improve the convergence results for GIANT.

In the following corollary, we present a sharper linear-quadratic rate for GIANT under the same assumption as in (Wang et al., 2018), which improves the previous result by a factor of \(\sqrt{\kappa _\textbf{g}}\) in the linear term.

Corollary 4.10

Consider solving the minimization problem \(\min _{\textbf{x}\in {\mathbb {R}}^{n_\textbf{x}}} f(\textbf{x}) = \frac{1}{N}\sum _{j=1}^N h_j(\textbf{w}_j^{\top }\textbf{x}) + \frac{\mu }{2}\Vert \textbf{x}\Vert ^2\) with \(\mu >0\) on m clients, where each client holds s samples. If \(h_j(\cdot )\) is a convex loss function, \(f(\cdot )\) has \(L_2\)-Lipschitz continuous Hessian, and \(s_t\) satisfies \(s\ge s_t\ge \frac{3\nu _tn_\textbf{x}}{\eta ^2}\log \left( \frac{mn_\textbf{x}}{\delta }\right)\) for some fixed parameters \(\eta ,\delta \in (0,1)\), then with probability at least \(1-\delta\), the update rule of GIANT (Algorithm 3) satisfies

$$\begin{aligned} \hat{\lambda }(\textbf{x}_{t+1})\le \left( \frac{\eta }{\sqrt{m}}+\frac{\eta ^2}{1-\eta }\right) \hat{\lambda } (\textbf{x}_t) + \frac{2L_2}{\mu ^{3/2}} \hat{\lambda }(\textbf{x}_t)^2, \end{aligned}$$

where \(\hat{\lambda }(\textbf{x})\overset{\textrm{def}}{=}\sqrt{(\nabla f(\textbf{x}))^{\top }[\nabla ^2 f(\textbf{x})]^{-1}\nabla f(\textbf{x})}\).

5 Experiments

We validate the proposed methods on the following important data mining tasks, which enjoy the structure of “unbalanced dimensions” and have been well studied in the previous literature (Liu et al., 2022; Liu & Luo, 2022). The experiments are conducted on a workstation with an Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz. The code was executed using Python 3.8.

  • AUC Maximization. To train a classifier \(\textbf{w}\) on an imbalanced dataset \(\{{\textbf{a}}_j,b_j\}_{j=1}^N\) with \(p = \frac{N^{+}}{N}\approx 1~\text {or}~0\), where \(N^{+}\) is the number of positive instances, AUC maximization can be reformulated as a minimax problem in which \(l_j(\textbf{x},\textbf{y})\) of (1) takes the following quadratic form (a minimal implementation sketch of this per-sample loss is given after this list):

    $$\begin{aligned} l_j(\textbf{x},y)&= (1-p)\big ( \big (\textbf{w}^\top {\textbf {a}}_j-u \big )^2 - 2(1+y)\textbf{w}^\top {\textbf {a}}_j\big ){\mathbb {I}}_{b_j=1}+\frac{\lambda }{2}\left\| \textbf{x} \right\| ^2\\&\quad +p\big ( \big (\textbf{w}^\top {\textbf {a}}_j-v \big )^2 + 2(1+y)\textbf{w}^\top {\textbf {a}}_j\big ){\mathbb {I}}_{b_j=-1} -p(1-p)y^2, \end{aligned}$$

    where \(\textbf{x}= [\textbf{w}; u; v] \in {\mathbb {R}}^{d+2}\), \(\textbf{w}\in {\mathbb {R}}^d\), \(u\in {\mathbb {R}}\), \(v\in {\mathbb {R}}\), and \(y\in {\mathbb {R}}\). We perform experiments on “a9a” (\(N = 32,651\), \(n_\textbf{x}=125\), \(n_\textbf{y}=1\), \(p=0.241\)), “w8a” (\(N = 45,546\), \(n_\textbf{x}=302\), \(n_\textbf{y}= 1\), \(p=0.029\)), and “sido0” (\(N=12,678\), \(n_\textbf{x}=4,932\), \(n_\textbf{y}=1\)), which can be downloaded from Libsvm (Chang & Lin, 2011). We set the regularization parameter \(\lambda = 0.5\) and the number of clients \(m=8\).

  • Fairness-Aware Machine Learning. Given the training set \(\{{\textbf{a}}_j, b_j, c_j\}_{j=1}^{N}\), where \({\textbf{a}}_j\in {\mathbb {R}}^d\) and \(c_j\in {\mathbb {R}}\), we can use the following adversarial training model to train a binary classifier \(\textbf{x}\) (Zhang et al., 2018) that is unbiased with respect to the protected feature \(c_j\) (this per-sample loss is also included in the sketch after this list):

    $$\begin{aligned} l_j(\textbf{x}, y) =\log \big (1+\exp \big (-b_j({\textbf {a}}_j)^\top \textbf{x}\big )\big ) +\lambda \left\| \textbf{x} \right\| ^2 - \gamma y^2-\beta \log \big (1 + \exp \big (-c_j({\textbf {a}}_j)^\top \textbf{x}y\big )\big ). \end{aligned}$$

    We set the regularization parameters \(\lambda = 0.5\) and \(\beta = \gamma = 0.0001\). We conduct experiments on “adult” (\(N = 32,651\), \(n_\textbf{x}=122\), \(n_\textbf{y}=1\)) and “law school” (\(N = 20,427\), \(n_\textbf{x}=379\), \(n_\textbf{y}= 1\)) datasets (Le Quy et al., 2022; Liu & Luo, 2022).
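For reference, a minimal Python sketch of the two per-sample losses used above is given below; the function names and default regularization values (taken from the settings reported in this section) are ours, and the implementation is only an illustration:

```python
import numpy as np

def auc_loss(x, y, a_j, b_j, p, lam=0.5):
    """Per-sample AUC-maximization loss l_j(x, y) with x = [w; u; v]."""
    w, u, v = x[:-2], x[-2], x[-1]
    z = w @ a_j
    pos = (1 - p) * ((z - u) ** 2 - 2 * (1 + y) * z) if b_j == 1 else 0.0
    neg = p * ((z - v) ** 2 + 2 * (1 + y) * z) if b_j == -1 else 0.0
    return pos + neg + lam / 2 * np.dot(x, x) - p * (1 - p) * y ** 2

def fairness_loss(x, y, a_j, b_j, c_j, lam=0.5, beta=1e-4, gamma=1e-4):
    """Per-sample fairness-aware logistic loss l_j(x, y)."""
    z = a_j @ x
    return (np.log1p(np.exp(-b_j * z)) + lam * np.dot(x, x)
            - gamma * y ** 2 - beta * np.log1p(np.exp(-c_j * z * y)))
```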

Fig. 1 We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and running time (seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for AUC maximization under the case \(m=8\) on datasets “a9a”, “w8a”, and “sido0”

Fig. 2 We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and running time (seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for AUC maximization under the case \(m=128\) on datasets “a9a”, “w8a”, and “sido0”

Fig. 3 We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and running time (seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for Fairness-aware machine learning under the case \(m=8\) on datasets “adult” and “law school”

Fig. 4 We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and running time (seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for Fairness-aware machine learning under the case \(m=128\) on datasets “adult” and “law school”

Fig. 5 We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and running time (seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for AUC maximization on datasets “a9a” and “w8a” with different sketch ratios p under the case \(m=8\)

Fig. 6 We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and running time (seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for Fairness-aware machine learning on datasets “adult” and “law school” with different sketch ratios p under the case \(m=8\)

5.1 Comparison with the baselines

We compare PANDA and GIANT-PANDA with existing state-of-the-art communication-efficient methods. Specifically, we adopt the distributed version of the extra gradient method (Korpelevich, 1976; Tseng, 2000) (EG), federated gradient descent ascent with gradient tracking (Sun & Wei, 2022) (FedGDA), and the proximal skip method for variational inequalities (Zhang et al., 2024) (ProxSkip) as the baselines. Both EG and ProxSkip achieve the optimal communication complexity for first-order methods. We tune the learning rates of all methods (including the baselines) from \(\{1.0, 0.9, \dots ,0.1\}\).

For all experiments, we use \(70\%\) of the local data in GIANT-PANDA. The results for AUC maximization under different client numbers \(m=8\) and \(m=128\) are presented in Figs. 1 and 2. We also demonstrate the results for Fairness-aware machine learning under different client numbers \(m=8\) and \(m=128\) in Figs. 3 and 4.

We observe that our newly proposed PANDA and GIANT-PANDA outperform the baselines in terms of both communication rounds and running time in all cases. This indicates that our methods not only significantly reduce the number of communication rounds compared to the optimal first-order methods, but also maintain per-round communication efficiency, which keeps the overall optimization procedure fast.

We also observe that the communication complexity of PANDA can be affected by the number of clients m. This is because \(\eta _{\text {PANDA}}\) is proportional to \(\sqrt{m}\) according to Remark 4.5. On the other hand, increasing m reduces the training time per iteration due to the distributed framework, so a larger m leads to a faster training process. Taking panels (a) and (b) of Figs. 1 and 2 as an example, PANDA requires fewer communication rounds when \(m=8\), but takes less running time when \(m=128\).

5.2 Comparison of different sketch ratios for GIANT-PANDA

We investigate the impact of the sketch ratio (\(p = s_t/s\)) on GIANT-PANDA. We choose different sketch ratios p from \(\{10\%, 30\%, 50\%, 70\%, 100\%\}\) for GIANT-PANDA. For the case \(p=100\%\), GIANT-PANDA reduces to its full version PANDA. We set the number of clients as \(m=8\).

We present the results for AUC maximization and Fairness-aware machine learning in Figs. 5 and 6, respectively. The numerical results show that larger sketch ratios lead to fewer communication rounds, since a larger ratio yields a better approximation to the exact local partial Hessian and thus a smaller \(\eta\). GIANT-PANDA with \(p=100\%\) (i.e., PANDA) outperforms the other cases in terms of communication rounds. On the other hand, GIANT-PANDA shows its advantage in terms of running time. We find that GIANT-PANDA with \(p=30\%\) for “a9a” and \(p=10\%\) for “w8a” in AUC maximization, and with \(p=10\%\) for “law school” in Fairness-aware machine learning, achieves the best running time (panels (b) and (d) of Figs. 5 and 6). This is because the sketch operation in GIANT-PANDA reduces the computation time on each client.

We also provide additional experiments to study the impact of the sketch ratio under \(m=128\) and the impact of using different sketch methods in Appendix G.

6 Conclusion

In this paper, we have proposed PANDA and GIANT-PANDA to solve the distributed minimax problems with unbalanced dimensions. PANDA eliminates the requirement of communicating the full Hessian and substantially reduces the communication rounds compared to the optimal first-order methods. GIANT-PANDA further reduces the computation cost by performing sketch operations to compute the local partial Hessian on each client.

For future work, it is interesting to generalize PANDA and GIANT-PANDA to more general minimax optimization problems (Adil et al., 2022; Lin & Jordan, 2022; Liu & Luo, 2022; Luo et al., 2022). It is also possible to leverage the idea of Hessian averaging (Na et al., 2023) to further enhance the behavior of GIANT-PANDA, and to design decentralized variants of PANDA and GIANT-PANDA.