1 Introduction

We consider a class of minimax optimization problems formulated as finite-sum expressions:

$$\begin{aligned} \min _{\textbf{x}\in \mathbb {R}^{n_\textbf{x}}} \max _{\textbf{y}\in \mathbb {R}^{n_\textbf{y}}} f(\textbf{x}, \textbf{y}) \overset{\textrm{def}}{=}\frac{1}{N}\sum \limits _{j=1}^{N} l_j(\textbf{x}, \textbf{y}), \end{aligned}$$
(1)

where N denotes the sample size, and \(n_\textbf{x}\) and \(n_\textbf{y}\) denote the dimensions of the variables \(\textbf{x}\) and \(\textbf{y}\), respectively. The smooth function \(f(\textbf{x}, \textbf{y})\) is assumed to be strongly convex in \(\textbf{x}\) and strongly concave in \(\textbf{y}\) (i.e., to satisfy the SC-SC condition). For many machine learning problems with a large sample size N, distributed methods are often preferable for solving the problem in a parallel fashion. To facilitate our study in a distributed setting, we divide the N samples among m clients so that the i-th client holds \(|S_i|\) samples. Consequently, we have \(N=\sum _{i=1}^m|S_i|\), leading us to the following alternative problem:

$$\begin{aligned} \!\!\min _{\textbf{x}\in \mathbb {R}^{n_\textbf{x}}} \max _{\textbf{y}\in \mathbb {R}^{n_\textbf{y}}} f(\textbf{x}, \textbf{y}) \!=\! \frac{1}{m}\sum _{i=1}^{m} f^{i}(\textbf{x},\textbf{y}),~~ \text {where}~~ f^{i}(\textbf{x},\textbf{y}) \!\overset{\textrm{def}}{=}\! \frac{1}{|S_i|}\sum _{j\in S_i}l_j(\textbf{x},\textbf{y}). \end{aligned}$$
(2)

In this context, for each client \(i=1,\ldots ,m\), \(f^i\) represents its local function and \(S_i\) is the index set of local samples.

Minimax optimization has gained significant attention in the data mining and machine learning community due to its broad applications in various domains, including game theory (Basar & Olsder, 1999; Facchinei, 2003), supervised learning (Lanckriet et al., 2002), robust optimization (Ben-Tal & Nemirovski, 2002; Deng & Mahdavi, 2021; Gao & Kleywegt, 2022), and fairness-aware machine learning (Creswell et al., 2018; Liu et al., 2020). Many of these applications share a critical property: the dimensions of the variables \(\textbf{x}\) and \(\textbf{y}\) are unbalanced (Liu et al., 2022). For instance, AUC maximization (Cortes & Mohri, 2003; Ying et al., 2016) aims to train a binary classifier on imbalanced datasets \(\{{\textbf{a}}_j,b_j\}_{j=1}^{N}\), where \({\textbf{a}}_j\) denotes the input with d features and \(b_j\in \{-1,+1\}\) denotes the label. This problem can be formulated as a minimax problem with \(n_\textbf{x}=d+2\) and \(n_\textbf{y}=1\). Additionally, in fairness-aware machine learning tasks (Lowd & Meek, 2005; Zhang et al., 2018), we are given a training set \(\{{\textbf{a}}_j,b_j, {\textbf{c}}_j\}_{j=1}^{N}\), where \({\textbf{a}}_j\) represents d-dimensional features to learn from, \({\textbf{c}}_j\) contains s protected features, and \(b_j\) denotes the label. In this case, we have \(n_\textbf{x}=d\gg n_\textbf{y}=s\). Throughout this paper, we use the term “unbalanced dimensions” to describe the above special problem structure, which can be expressed as \(n_\textbf{x}\gg n_\textbf{y}\) or \(n_\textbf{y}\gg n_\textbf{x}\).

There are numerous first-order methods for solving minimax optimization problems in (1), including gradient descent ascent, extra gradient, and many of their variants (Chavdarova et al., 2019; Hsieh et al., 2019; Korpelevich, 1976; Lin et al., 2020; Malitsky, 2015; Mishchenko et al., 2020; Nedić & Ozdaglar, 2009; Nouiehed et al., 2019; Tseng, 2000). These first-order methods can be straightforwardly generalized to solve distributed problems in (2), by simply aggregating the gradients to the server at each iteration. Distributed first-order methods that perform multiple local iterations for each client before communication have also been proposed to solve minimax problems (Deng & Mahdavi, 2021; Sun & Wei, 2022; Zhang et al., 2024). Among these methods, Zhang et al. (2024) achieved the best-known results in terms of the total communication rounds, with an order of \({\mathcal {O}}(\kappa _\textbf{g}\ln (1/\epsilon ))\) where \(\kappa _\textbf{g}\) is the condition number of the objective (cf. Sect. 2.1) and \(\epsilon\) is the desired accuracy. It is worth noting that the per-iteration communication complexity between the client and server for first-order methods is only \({\mathcal {O}}(n_\textbf{x}+n_\textbf{y})\). However, these methods often require a substantial number of communication rounds to attain an accurate solution (e.g., their communication rounds depend heavily on the condition number). As a result, factors such as unpredictable network latency can lead to expensive total communication costs.

Second-order methods are well-known for their fast convergence rates, brought about by the utilization of the Hessian information of the objective. The Cubic regularized Newton method (Huang et al., 2022) and its restart variant (Huang & Zhang, 2022) have been proposed for solving (1) and demonstrated local superlinear convergence rates under the SC-SC condition. However, applying these methods directly in a distributed setting would require the communication of the full local Hessian matrix, resulting in a per-iteration communication complexity of \(\mathcal {O}((n_\textbf{x}+n_\textbf{y})^2)\). This communication overhead is unacceptable due to bandwidth limitations. On the other hand, several communication-efficient distributed second-order methods (Islamov et al., 2022; Liu et al., 2024; Shamir et al., 2014; Wang & Li, 2020; Ye et al., 2022) have been proposed for convex optimization problems, eliminating the need for full Hessian communication. However, the design and analysis of these communication-efficient second-order methods heavily depend on the convexity of the objective function, making it challenging to generalize them to the minimax setting. Building upon this, it is natural to ask: Is it possible to develop communication-efficient distributed second-order methods for minimax optimization by leveraging the structure of “unbalanced dimensions”?

In this paper, we provide an affirmative answer to this question by introducing PANDA (Partially Approximate Newton methods for Distributed minimAx optimization). In each iteration, PANDA avoids the need for communicating the full Hessian matrix, instead requiring only the exchange of the partial Hessian blocks associated with \(\textbf{y}\) (recall that we suppose \(n_\textbf{x}\gg n_\textbf{y}\) in this paper), i.e., \(\nabla ^2_{\textbf{y}\textbf{y}} f^{i}(\textbf{x},\textbf{y})\in {\mathbb {R}}^{n_\textbf{y}\times n_\textbf{y}}\), \(\nabla ^2_{\textbf{x}\textbf{y}} f^{i}(\textbf{x},\textbf{y})\in {\mathbb {R}}^{n_\textbf{x}\times n_\textbf{y}}\), and \([\nabla ^2_{\textbf{x}\textbf{x}}f^{i}(\textbf{x},\textbf{y})]^{-1}\nabla ^2_{\textbf{x}\textbf{y}} f(\textbf{x},\textbf{y})\in {\mathbb {R}}^{n_{\textbf{x}}\times n_\textbf{y}}\). Additionally, it exchanges vectors such as gradients and local descent directions. As a result, the per-iteration communication complexity of PANDA can be summarized as:

$$\begin{aligned} {\mathcal {O}}\Bigg (\!\!\!\!\!\underbrace{\!\!\!n_\textbf{x}n_\textbf{y}+n_\textbf{y}^2\!\!}_{\nabla ^ 2_{\textbf{x}\textbf{y}} f^i,~ [\nabla ^2_{\textbf{x}\textbf{x}}f^i]^{-1}\nabla ^2_{\textbf{x}\textbf{y}}f,~\nabla ^2_{\textbf{y}\textbf{y}} f^i} \!\!\!\!\!+\ \ \underbrace{n_\textbf{x}+n_\textbf{y}}_{\text {vectors}} \Bigg ) ={\mathcal {O}}(n_\textbf{x}n_\textbf{y}) \approx {\mathcal {O}}(n_\textbf{x}). \end{aligned}$$

This is a significant reduction compared to typical second-order methods and brings the communication cost to the same order as that of first-order methods. Furthermore, the utilization of second-order information in PANDA results in improved convergence behavior compared to existing distributed first-order methods.
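As a rough numerical illustration (with hypothetical dimensions of our own choosing, not taken from the experiments below), the following Python snippet compares the per-client, per-iteration number of floats exchanged under the communication schemes discussed above:

```python
# Back-of-the-envelope per-iteration communication per client (floats exchanged),
# using hypothetical unbalanced dimensions n_x >> n_y.
n_x, n_y = 100_000, 5

full_hessian = (n_x + n_y) ** 2                 # shipping the full local Hessian
panda = 2 * n_x * n_y + n_y ** 2 + n_x + n_y    # the partial blocks above plus gradient-sized vectors
first_order = n_x + n_y                         # gradients only

print(f"full Hessian: {full_hessian:,d}   PANDA: {panda:,d}   first-order: {first_order:,d}")
```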

1.1 Contribution

The contribution of this paper is threefold.

  1. (a)

    We develop a Partially Approximate Newton (PAN) method to solve the general minimax problem (1) with unbalanced dimensions. If the approximate Hessian \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\) satisfies \((1-\eta ) \nabla ^2_{\textbf{x}\textbf{x}} f\preceq \tilde{\textbf{H}}_{\textbf{x}\textbf{x}} \preceq (1+\eta ) \nabla ^2_{\textbf{x}\textbf{x}} f\) with \(\eta \in (0, 1)\), then PAN exhibits a linear-quadratic convergence rate for some measure \(\lambda _t\):

    $$\begin{aligned} \lambda _{t+1}\le \frac{\eta }{1-\eta }\lambda _t +\beta \lambda _t^2. \end{aligned}$$

    This result of PAN generalizes the approximate Newton method for convex optimization in Ye et al. (2021) and relaxes the conditions in Liu et al. (2022, Lemma 4.3) for minimax optimization.

  2. (b)

    We develop the PANDA method to solve the distributed minimax problem (2). If the local partial Hessian satisfies \((1-\eta )\nabla ^2_{\textbf{x}\textbf{x}} f\preceq \nabla ^2_{\textbf{x}\textbf{x}} f^i\preceq (1+\eta ) \nabla ^2_{\textbf{x}\textbf{x}} f\) with \(\eta \in (0,1)\), then PANDA exhibits a linear-quadratic convergence rate:

    $$\begin{aligned} \lambda _{t+1}\le \frac{\eta ^2}{1-\eta } \lambda _t +\beta _1 \lambda _t^2. \end{aligned}$$

    Furthermore, we can guarantee the existence of \(\eta\) provided that \(N\ge {\mathcal {O}}(mK/\mu )\), where m is the number of clients, \(K=\max _j\Vert \nabla ^2_{\textbf{x}\textbf{x}}l_j\Vert\), and \(\mu\) is the strong convexity parameter (cf. Sect. 2.1).

  3. (c)

    We develop the GIANT-PANDA method, which employs matrix sketching techniques on each local client to construct the partial approximate Hessian \(\tilde{\textbf{H}}^i_{\textbf{x}\textbf{x}}\). This method exhibits a linear-quadratic convergence rate:

    $$\begin{aligned} \lambda _{t+1}\le \left( \frac{\eta }{\sqrt{m}}+\frac{\eta ^2}{1-\eta }\right) \lambda _t +\beta _2 \lambda _t^2. \end{aligned}$$

    This result leads to a sharper analysis compared to the original GIANT method in (Wang et al., 2018), as it improves the convergence rate by a factor of \(\sqrt{\kappa _\textbf{g}}\) in the linear term.

Organization We introduce fundamental notation, assumptions, and preliminary results for Hessian approximation in Sect. 2. In Sects. 3 and 4, we introduce PAN for solving (1) and introduce PANDA (along with GIANT-PANDA) for solving (2), respectively. We conduct empirical studies in Sect. 5 and provide conclusions in Sect. 6. All proofs are deferred to the appendix.

2 Preliminaries

2.1 Notation and assumptions

We use \(\textbf{g}_\textbf{x}(\textbf{x},\textbf{y})\) and \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x},\textbf{y})\) to denote the gradient \(\nabla _{\textbf{x}} f(\textbf{x},\textbf{y})\) and Hessian \(\nabla ^2_{\textbf{x}\textbf{x}} f(\textbf{x},\textbf{y})\) with respect to \(\textbf{x}\), respectively (similar for \(\textbf{g}_\textbf{y}\), \(\textbf{H}_{\textbf{x}\textbf{y}}\), \(\textbf{H}_{\textbf{y}\textbf{y}}\)). For the local gradient and Hessian associated with the i-th client, we use \(\textbf{g}_\textbf{x}^i(\textbf{x},\textbf{y})\) and \(\textbf{H}_{\textbf{x}\textbf{x}}^{i}(\textbf{x},\textbf{y})\) to denote \(\nabla _{\textbf{x}} f^i(\textbf{x},\textbf{y})\) and \(\nabla ^2_{\textbf{x}\textbf{x}} f^i(\textbf{x},\textbf{y})\) (similar for \(\textbf{g}_\textbf{y}^{i}\), \(\textbf{H}_{\textbf{x}\textbf{y}}^i\), \(\textbf{H}_{\textbf{y}\textbf{y}}^i\)). We use \(\left\| \,\cdot \, \right\|\) to denote the spectral norm for matrices and the Euclidean norm for vectors. Additionally, we define the matrix row coherence as follows.

Definition 2.1

(Wang et al., 2018, Definition 1) Let \(\textbf{A}\in {\mathbb {R}}^{N\times d}\) be a matrix with full column rank and \(\textbf{A}=\textbf{U}\Sigma \textbf{V}^\top\) be its reduced singular value decomposition with \(\textbf{U}, \textbf{V}\in {\mathbb {R}}^{N\times d}\). The row coherence of \(\textbf{A}\) is defined as \(\nu (\textbf{A})\overset{\textrm{def}}{=}\frac{N}{d}\max _j \Vert \textbf{u}_j\Vert ^2 \in [1, \frac{N}{d}]\), where \(\textbf{u}_j\) is the j-th row of \(\textbf{U}\).
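As a small illustration, the row coherence in Definition 2.1 can be computed directly from the reduced SVD; the following Python sketch (with a randomly generated matrix as a stand-in for real data) shows one way to do so:

```python
import numpy as np

def row_coherence(A: np.ndarray) -> float:
    """nu(A) = (N / d) * max_j ||u_j||^2, where u_j is the j-th row of U in the reduced SVD."""
    N, d = A.shape
    U, _, _ = np.linalg.svd(A, full_matrices=False)   # U has shape (N, d)
    return N / d * float(np.max(np.sum(U ** 2, axis=1)))

rng = np.random.default_rng(0)
print(row_coherence(rng.standard_normal((1000, 20))))  # always lies in [1, N/d] = [1, 50]
```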

We introduce the following assumption for the objective function in (1).

Assumption 2.2

We assume \(f({\textbf{x}},{\textbf{y}})\) is twice differentiable, \(\mu\)-strongly convex in \({\textbf{x}}\), \(\mu\)-strongly concave in \({\textbf{y}}\), and has \(L_\textbf{g}\)-Lipschitz continuous gradient and \(L_\textbf{H}\)-Lipschitz continuous Hessian. We also assume each individual function \(l_j(\textbf{x},\textbf{y})\) is convex in \(\textbf{x}\). We denote \(\kappa _\textbf{g}\overset{\textrm{def}}{=}L_\textbf{g}/\mu\) and \(\kappa _\textbf{H}\overset{\textrm{def}}{=}L_\textbf{H}/\mu\).

The convexity of each individual function \(l_j\) in \(\textbf{x}\), along with the \(L_\textbf{g}\)-Lipschitz continuity of the gradient \(\textbf{g}_\textbf{x}\), implies that the Hessian \(\nabla _{\textbf{x}\textbf{x}}^2 l_j\) is bounded. Let us denote \(K\overset{\textrm{def}}{=}\max _{j}\Vert \nabla ^2_{\textbf{x}\textbf{x}} l_j\Vert\) and \(\hat{\kappa }\overset{\textrm{def}}{=}K/\mu\).

2.2 Matrix approximation via sub-sampling and sketching

Let us introduce some preliminary results for approximating a positive definite Hessian matrix. We first consider a Hessian matrix in the form of \(\textbf{H}= \frac{1}{N}\sum _{j=1}^N \textbf{H}_j\in {\mathbb {R}}^{d\times d}\), and approximate it using sub-sampling:

$$\begin{aligned} \tilde{\textbf{H}} = \frac{1}{|{\mathcal {S}}|}\sum _{j\in {\mathcal {S}}}\textbf{H}_j, \end{aligned}$$
(3)

where elements in \({\mathcal {S}}\) are uniformly sampled from \(\{1,\cdots , N\}\). The following lemma characterizes the error of the sub-sampling approximation.

Lemma 2.3

(Ye et al., 2021, Lemma 9) Suppose \(\textbf{H}\succeq \mu \textbf{I}\) and \(\max _{1\le j\le N}\Vert \textbf{H}_j\Vert \le \hat{K}\) for some constants \(\mu , \hat{K}>0\). For any \(\delta \in (0,1)\) and \(\eta \in (0,0.5)\), if the sample size satisfies \(|{\mathcal {S}}|\ge \frac{3\hat{K}\log (2d/\delta )}{\mu \eta ^2}\), then with probability at least \(1-\delta\), we have \((1-\eta )\textbf{H}\preceq \tilde{\textbf{H}}\preceq (1+\eta )\textbf{H}\) for \(\tilde{\textbf{H}}\) defined in (3).
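To make the sub-sampling approximation (3) and the spectral guarantee of Lemma 2.3 concrete, the following Python sketch builds a toy finite-sum Hessian (synthetic data of our own choosing) and measures the empirical \(\eta\):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, mu = 20_000, 10, 1e-1

# Toy per-sample Hessians H_j = a_j a_j^T + mu * I, so H = (1/N) sum_j H_j.
a = rng.standard_normal((N, d))
H = a.T @ a / N + mu * np.eye(d)

# Sub-sampled approximation (3) with a uniformly sampled index set S.
S = rng.choice(N, size=2_000, replace=False)
H_tilde = a[S].T @ a[S] / len(S) + mu * np.eye(d)

# Empirical eta: (1 - eta) H <= H_tilde <= (1 + eta) H holds iff the eigenvalues of
# H^{-1/2} H_tilde H^{-1/2} lie in [1 - eta, 1 + eta].
w, V = np.linalg.eigh(H)
H_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
print(np.max(np.abs(np.linalg.eigvalsh(H_inv_sqrt @ H_tilde @ H_inv_sqrt) - 1.0)))
```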

We then consider a special case where the Hessian matrix is expressed as \(\textbf{H}=\textbf{A}^{\top }\textbf{A}+\alpha \textbf{I}\), with \(\textbf{A}\in {\mathbb {R}}^{N\times d}\) being a full column-rank matrix. This form of Hessian matrix naturally arises in classical regression problems (Ye et al., 2021; Wang et al., 2017). We construct two approximate Hessians using sketching techniques:

$$\begin{aligned} \tilde{\textbf{H}}_i = \textbf{A}^{\top }\textbf{S}_i\textbf{S}_i^{\top }\textbf{A}+ \alpha \textbf{I},\quad \quad \hat{\textbf{H}}=\textbf{A}^{\top }\textbf{S}\textbf{S}^{\top }\textbf{A}+\alpha \textbf{I}. \end{aligned}$$
(4)

Here, \(\textbf{S}_1,\cdots ,\textbf{S}_m\in {\mathbb {R}}^{N\times s'}\) are sketching matrices and \(\textbf{S}\overset{\textrm{def}}{=}\frac{1}{\sqrt{m}}[\textbf{S}_1,\cdots ,\textbf{S}_m]\in {\mathbb {R}}^{N\times ms'}\). The following lemma characterizes the errors of these two sketching approximations.

Lemma 2.4

Adapted from (Wang et al., 2018, Lemma 8)

Let \(\eta\) and \(\delta \in (0,1)\) be fixed parameters, \(\nu =\nu (\textbf{A})\), and \(\textbf{S}_1,\cdots ,\textbf{S}_m\in {\mathbb {R}}^{N\times s'}\) be independent uniform sampling matrices with \(s'\ge \frac{3\nu d}{\eta ^2}\log (\frac{d m}{\delta })\). Then, with probability at least \(1-\delta\), we have \((1-\eta )\textbf{H}\preceq \tilde{\textbf{H}}_i\preceq (1+\eta ) \textbf{H}\) and \((1-\eta /\sqrt{m})\textbf{H}\preceq \hat{\textbf{H}}\preceq (1+\eta /\sqrt{m}) \textbf{H}\) for \(\tilde{\textbf{H}}_i\) and \(\hat{\textbf{H}}\) defined in (4).
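The improvement from \(\eta\) to \(\eta /\sqrt{m}\) for the averaged sketch \(\hat{\textbf{H}}\) (which equals the average of the \(\tilde{\textbf{H}}_i\)) can be checked numerically; the sketch below uses synthetic data and uniform sampling matrices, with sizes chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, alpha, m, s_prime = 20_000, 10, 1e-1, 8, 1_000

A = rng.standard_normal((N, d)) / np.sqrt(N)
H = A.T @ A + alpha * np.eye(d)
w, V = np.linalg.eigh(H)
H_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

def spectral_eta(H_approx):
    """Smallest eta with (1 - eta) H <= H_approx <= (1 + eta) H."""
    return np.max(np.abs(np.linalg.eigvalsh(H_inv_sqrt @ H_approx @ H_inv_sqrt) - 1.0))

# Each S_i selects s' rows uniformly and rescales by sqrt(N / s'); averaging the
# resulting tilde{H}_i reproduces hat{H} in (4) because S = (1/sqrt(m))[S_1, ..., S_m].
etas, H_hat = [], np.zeros((d, d))
for _ in range(m):
    idx = rng.choice(N, size=s_prime, replace=False)
    H_i = (N / s_prime) * A[idx].T @ A[idx] + alpha * np.eye(d)
    H_hat += H_i / m
    etas.append(spectral_eta(H_i))

print(f"single sketch eta ~ {np.mean(etas):.3f}, averaged sketch eta ~ {spectral_eta(H_hat):.3f}")
```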

3 The analysis framework of partially approximate Newton method

In this section, we propose a Partially Approximate Newton (PAN) method for Problem (1). We start with the classical Newton update:

$$\begin{aligned} \begin{bmatrix} \textbf{x}_{+}\\ \textbf{y}_{+} \end{bmatrix} ={ \begin{bmatrix} \textbf{x} \\ \textbf{y} \end{bmatrix}}- \begin{bmatrix} \textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x},\textbf{y})& \quad \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x},\textbf{y})\\ (\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x},\textbf{y}))^{\top }& \quad \textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x},\textbf{y}) \end{bmatrix}^{-1}\begin{bmatrix} \textbf{g}_\textbf{x}(\textbf{x},\textbf{y})\\ \textbf{g}_\textbf{y}(\textbf{x},\textbf{y}) \end{bmatrix}. \end{aligned}$$
(5)

Using the approximate Hessian matrix \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\) to replace the exact Hessian matrix \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x},\textbf{y})\) in (5) leads to the update rule of PAN as follows:

$$\begin{aligned} \begin{bmatrix} \textbf{x}_{+}\\ \textbf{y}_{+} \end{bmatrix}= { \begin{bmatrix} \textbf{x} \\ \textbf{y} \end{bmatrix}}- \begin{bmatrix} \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}& \quad \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x},\textbf{y})\\ (\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x},\textbf{y}))^{\top }& \quad \textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x},\textbf{y}) \end{bmatrix}^{-1}\begin{bmatrix} \textbf{g}_\textbf{x}(\textbf{x},\textbf{y})\\ \textbf{g}_\textbf{y}(\textbf{x},\textbf{y}) \end{bmatrix}. \end{aligned}$$
(6)
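For concreteness, the following Python sketch implements one PAN step (6) and runs it on a toy SC-SC quadratic; the toy problem, the perturbation used as \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\), and all names are ours, chosen only to illustrate the update:

```python
import numpy as np

def pan_step(x, y, g_x, g_y, H_tilde_xx, H_xy, H_yy):
    """One PAN update (6): a Newton step with H_xx replaced by an approximation."""
    n_x = x.size
    K = np.block([[H_tilde_xx, H_xy],
                  [H_xy.T,     H_yy]])
    step = np.linalg.solve(K, np.concatenate([g_x, g_y]))
    return x - step[:n_x], y - step[n_x:]

# Toy SC-SC quadratic f(x, y) = 0.5 x^T A x + x^T B y - 0.5 y^T C y (saddle point at zero).
rng = np.random.default_rng(3)
n_x, n_y = 50, 2
M = rng.standard_normal((n_x, n_x)); A = M @ M.T / n_x + np.eye(n_x)
B = rng.standard_normal((n_x, n_y)) / np.sqrt(n_x)
C = np.eye(n_y)

x, y = rng.standard_normal(n_x), rng.standard_normal(n_y)
for _ in range(5):
    g_x, g_y = A @ x + B @ y, B.T @ x - C @ y
    print(f"gradient norm: {np.linalg.norm(np.concatenate([g_x, g_y])):.2e}")
    H_tilde = A + 0.01 * np.diag(rng.standard_normal(n_x))  # a crude (1 +- eta)-type surrogate of H_xx = A
    x, y = pan_step(x, y, g_x, g_y, H_tilde, B, -C)         # H_yy = -C for this f
```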

We use the weighted gradient norm as the measure in our analysis (Liu et al., 2022):

$$\begin{aligned} \lambda (\textbf{x},\textbf{y})\! \overset{\textrm{def}}{=}\!&\sqrt{(\textbf{g}_\textbf{x}(\textbf{x},\textbf{y}))^{\top }(\textbf{P}(\textbf{x},\textbf{y}))^{-1}\textbf{g}_\textbf{x}(\textbf{x},\textbf{y})}\! +\!\frac{2}{\sqrt{\mu }} \Vert \textbf{g}_\textbf{y}(\textbf{x},\textbf{y})\Vert , \end{aligned}$$

where \(\textbf{P}(\textbf{x},\textbf{y})\) is defined as

$$\begin{aligned} \textbf{P}(\textbf{x},\textbf{y}) \! \overset{\textrm{def}}{=}\textbf{H}_{{\textbf{x}}{\textbf{x}}}({\textbf{x}},{\textbf{y}})\! -\! \textbf{H}_{{\textbf{x}}{\textbf{y}}}({\textbf{x}},{\textbf{y}})(\textbf{H}_{{\textbf{y}}{\textbf{y}}}({\textbf{x}},{\textbf{y}}))^{-1}\textbf{H}_{\textbf{y}\textbf{x}} ({\textbf{x}},{\textbf{y}}). \end{aligned}$$
(7)

The following lemma shows that if \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\) is a good approximation of \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x},\textbf{y})\), then \(\textbf{P}(\textbf{x},\textbf{y})\) can also be well approximated by \(\textbf{C}(\textbf{x},\textbf{y})\), defined as

$$\begin{aligned} \textbf{C}(\textbf{x},\textbf{y}) \overset{\textrm{def}}{=}\tilde{\textbf{H}}_{{\textbf{x}}{\textbf{x}}} - \textbf{H}_{{\textbf{x}}{\textbf{y}}}({\textbf{x}},{\textbf{y}})({\textbf{H}}_{{\textbf{y}}{\textbf{y}}}({\textbf{x}},{\textbf{y}}))^{-1}\textbf{H}_{\textbf{y}\textbf{x}}({\textbf{x}},{\textbf{y}}). \end{aligned}$$
(8)
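A minimal NumPy sketch of the measure \(\lambda (\textbf{x},\textbf{y})\) and of the approximation error \(\Vert \textbf{I}-\textbf{P}^{1/2}\textbf{C}^{-1}\textbf{P}^{1/2}\Vert\) that appears in the results below is given here; the helper names are ours, and the inputs are the gradient and Hessian blocks evaluated at \((\textbf{x},\textbf{y})\):

```python
import numpy as np

def weighted_grad_norm(g_x, g_y, H_xx, H_xy, H_yy, mu):
    """lambda(x, y) = sqrt(g_x^T P^{-1} g_x) + (2 / sqrt(mu)) ||g_y||, with P as in (7)."""
    P = H_xx - H_xy @ np.linalg.solve(H_yy, H_xy.T)
    return np.sqrt(g_x @ np.linalg.solve(P, g_x)) + 2.0 / np.sqrt(mu) * np.linalg.norm(g_y)

def approximation_error(H_tilde_xx, H_xx, H_xy, H_yy):
    """||I - P^{1/2} C^{-1} P^{1/2}|| for P in (7) and C in (8)."""
    schur = H_xy @ np.linalg.solve(H_yy, H_xy.T)
    P, C = H_xx - schur, H_tilde_xx - schur
    w, V = np.linalg.eigh(P)                      # P is positive definite under the SC-SC condition
    P_sqrt = V @ np.diag(np.sqrt(w)) @ V.T
    return np.linalg.norm(np.eye(P.shape[0]) - P_sqrt @ np.linalg.solve(C, P_sqrt), 2)
```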

Lemma 3.1

Suppose Assumption 2.2 holds and \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\) satisfies \((1-\eta ) \textbf{H}_{\textbf{x}\textbf{x}}\preceq \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\preceq (1+\eta )\textbf{H}_{\textbf{x}\textbf{x}}\). Then we have

$$\begin{aligned} \left\| \textbf{I}- \textbf{P}(\textbf{x},\textbf{y})^{1/2}(\textbf{C}(\textbf{x},\textbf{y}))^{-1}\textbf{P}(\textbf{x},\textbf{y})^{1/2} \right\| \le \frac{\eta }{1-\eta } \end{aligned}$$

We establish a linear-quadratic convergence rate for the PAN update when \(\textbf{P}(\textbf{x},\textbf{y})\) and \(\textbf{C}(\textbf{x},\textbf{y})\) are close.

Theorem 3.2

Suppose Assumption 2.2 holds and \(\textbf{P}(\textbf{x},\textbf{y})\) and \(\textbf{C}(\textbf{x},\textbf{y})\) are close in the sense that \(\Vert \textbf{I}- \textbf{P}(\textbf{x},\textbf{y})^{1/2}\textbf{C}(\textbf{x},\textbf{y})^{-1}\textbf{P}(\textbf{x},\textbf{y})^{1/2}\Vert \le \eta _1\). Then the update of PAN in (6) exhibits the following linear-quadratic convergence rate:

$$\begin{aligned} \lambda (\textbf{x}_{+},\textbf{y}_{+})\le \eta _1 \lambda (\textbf{x},\textbf{y}) + \frac{12\kappa _\textbf{g}^2\kappa _{\textbf{H}}(1+\eta _1)^2}{\sqrt{\mu }} \lambda (\textbf{x},\textbf{y})^2. \end{aligned}$$

When employing the sub-sampling approximation to construct \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}\), we derive the following corollary by combining the results from Lemma 3.1 and Theorem 3.2.

Corollary 3.3

Let us construct the partial Hessian approximation by sub-sampling \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}=\frac{1}{|S|}\sum _{j\in S}\nabla ^2_{\textbf{x}\textbf{x}} l_j(\textbf{x},\textbf{y})\). Under Assumption 2.2 and for any \(\delta \in (0, 1)\), if the sample size satisfies \(|S|\ge {12\hat{\kappa }\log (2n_\textbf{x}/\delta )}\), then with probability at least \(1-\delta\), the update of PAN in (6) satisfies

$$\begin{aligned} \lambda (\textbf{x}_{+},\textbf{y}_{+})\le \frac{\eta }{1-\eta } \lambda (\textbf{x},\textbf{y}) + \frac{12\kappa _\textbf{g}^2\kappa _{\textbf{H}}(1+\eta /(1-\eta ))^2}{\sqrt{\mu }} \lambda (\textbf{x},\textbf{y})^2 \end{aligned}$$
(9)

with \(\eta =\eta _{\textrm{PAN}}\overset{\textrm{def}}{=}\sqrt{\frac{3\hat{\kappa }\log (2n_\textbf{x}/\delta )}{|S|}}\).

Corollary 3.3 suggests that PAN requires \({\mathcal {O}}\left( {\log (1/\epsilon )}/{\log (|S|/\hat{\kappa })}\right)\) iterations to achieve \(\epsilon\)-accuracy in terms of the measure \(\lambda (\textbf{x},\textbf{y})\) for a quadratic objective function. Analyzing the complexity of linear-quadratic rates on quadratic functions is a common practice in the literature (Roosta-Khorasani & Mahoney, 2019; Wang et al., 2018; Ye et al., 2020, 2021), which allows us to simply ignore the quadratic term in (9) since \(\kappa _\textbf{H}=0\). In comparison, state-of-the-art first-order methods such as optimistic gradient and extra gradient methods have complexities \({\mathcal {O}}\left( \kappa _\textbf{g}\log (1/\epsilon )\right)\); methods that do not access the full Hessian at each iteration such as the quasi-Newton method (Liu & Luo, 2022) and the partial-quasi-Newton method (Liu et al., 2022) have complexities \({\mathcal {O}}(\kappa _\textbf{g}^2 + \sqrt{n_\textbf{x}\log (1/\epsilon )})\) and \({\mathcal {O}}(\kappa _\textbf{g}+ \sqrt{n_\textbf{x}\log (1/\epsilon )})\), respectively. We can see that PAN exhibits a much weaker dependency on the condition number \(\kappa _\textbf{g}\). We present the comparisons in Table 1.

Table 1 We present the iteration complexity of the proposed method (PAN) and the baselines for solving quadratic minimax optimization (AUC maximization)

4 Partially approximate Newton methods for distributed minimax optimization

In this section, we present the PANDA method for solving Problem (2) in Sect. 4.1, establish its convergence results in Sect. 4.2, and extend PANDA to GIANT-PANDA for a special function class that commonly appears in regression problems in Sect. 4.3.

Algorithm 1 PANDA\(\,({\textbf{x}}_0, {\textbf{y}}_0, T)\)

4.1 The PANDA algorithm

For simplicity, we suppress the evaluation point and use \(\textbf{g}_\textbf{x}\) to denote \(\textbf{g}_\textbf{x}(\textbf{x},\textbf{y})\) (similar for \(\textbf{g}_\textbf{y}\), \(\textbf{H}_{\textbf{x}\textbf{x}}\), \(\textbf{H}_{\textbf{x}\textbf{y}}\), \(\textbf{H}_{\textbf{y}\textbf{y}}\)). We start with the standard Newton direction \(\begin{bmatrix} {\textbf{d}}_{\textbf{x}}\\ {\textbf{d}}_\textbf{y}\end{bmatrix} \overset{\textrm{def}}{=}\begin{bmatrix} \textbf{H}_{\textbf{x}\textbf{x}}& \textbf{H}_{\textbf{x}\textbf{y}}\\ \textbf{H}_{\textbf{x}\textbf{y}}^{\top }& \textbf{H}_{\textbf{y}\textbf{y}} \end{bmatrix} ^{-1}\begin{bmatrix} \textbf{g}_\textbf{x}\\ \textbf{g}_\textbf{y}\end{bmatrix}\), which can be expressed explicitly by using the block matrix inversion formula:

$$\begin{aligned} \begin{aligned} {\textbf{d}}_{\textbf{x}}&= \textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{g}_\textbf{x}- \left( \textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} \right) \Delta _{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}+ \left( \textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} \right) {\Delta }_{\textbf{y}\textbf{y}} \left( \textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} \right) ^{\top }\textbf{g}_\textbf{x}\\ {\textbf{d}}_\textbf{y}&= -\Delta _{\textbf{y}\textbf{y}}\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{g}_\textbf{x}+ \Delta _{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}, \end{aligned} \end{aligned}$$
(10)

where \({\Delta }_{\textbf{y}\textbf{y}} = (\textbf{H}_{\textbf{y}\textbf{y}}-\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} )^{-1}.\)
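The block-inversion form (10) can be verified against a direct solve of the full Newton system; the following Python sketch does so on randomly generated Hessian blocks (synthetic and only for checking the algebra):

```python
import numpy as np

rng = np.random.default_rng(4)
n_x, n_y = 40, 3

# Random SC-SC-style blocks: H_xx positive definite, H_yy negative definite.
M = rng.standard_normal((n_x, n_x)); H_xx = M @ M.T / n_x + np.eye(n_x)
H_xy = rng.standard_normal((n_x, n_y)) / np.sqrt(n_x)
H_yy = -np.eye(n_y)
g_x, g_y = rng.standard_normal(n_x), rng.standard_normal(n_y)

# Direct solve of the full Newton system.
K = np.block([[H_xx, H_xy], [H_xy.T, H_yy]])
d_direct = np.linalg.solve(K, np.concatenate([g_x, g_y]))

# Block-inversion form (10).
Q = np.linalg.solve(H_xx, H_xy)                     # H_xx^{-1} H_xy
q = np.linalg.solve(H_xx, g_x)                      # H_xx^{-1} g_x
Delta = np.linalg.inv(H_yy - H_xy.T @ Q)            # Delta_yy
d_x = q - Q @ (Delta @ g_y) + Q @ (Delta @ (Q.T @ g_x))
d_y = -Delta @ (H_xy.T @ q) + Delta @ g_y
print(np.allclose(d_direct, np.concatenate([d_x, d_y])))   # True
```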

Under the setup of unbalanced dimensions \(n_\textbf{x}\gg n_\textbf{y}\), obtaining the exact Hessian \(\textbf{H}_{\textbf{x}\textbf{x}}\) on the server is prohibitive due to the communication overhead associated with \(\textbf{H}^i_{\textbf{x}\textbf{x}}\). However, communication costs of gradients and partial Hessians \(\textbf{H}_{\textbf{x}\textbf{y}}^{i}\) and \(\textbf{H}_{\textbf{y}\textbf{y}}^i\) are relatively low. Thus, in the first round of PANDA, the server aggregates these quantities to acquire precise gradient and partial Hessian information as follows:

$$\begin{aligned} \!\!\!\!\!\!\! \textbf{g}_{\textbf{x}} = \frac{1}{m}\sum _{i=1}^{m} \textbf{g}^i_{\textbf{x}}, \quad \textbf{g}_{\textbf{y}}=\frac{1}{m}\sum _{i=1}^m \textbf{g}^{i}_{\textbf{y}},\quad \textbf{H}_{\textbf{x}\textbf{y}} = \frac{1}{m}\sum _{i=1}^{m} \textbf{H}_{\textbf{x}\textbf{y}}^i,\quad \textbf{H}_{\textbf{y}\textbf{y}} = \frac{1}{m}\sum _{i=1}^{m} \textbf{H}_{\textbf{y}\textbf{y}}^i. \end{aligned}$$
(11)

The server then broadcasts the above-aggregated quantities to the clients, allowing each client to access global information of \(\textbf{g}_\textbf{x}\), \(\textbf{g}_\textbf{y}\), \(\textbf{H}_{\textbf{x}\textbf{y}}\), and \(\textbf{H}_{\textbf{y}\textbf{y}}\). Further, since the communication costs of \(\textbf{Q}^{i}_{\textbf{x}\textbf{y}}\overset{\textrm{def}}{=}[\textbf{H}^{i}_{\textbf{x}\textbf{x}}]^{-1}\textbf{H}_{\textbf{x}\textbf{y}}\) and \(\textbf{q}^{i}_\textbf{x}\overset{\textrm{def}}{=}[\textbf{H}^{i}_{\textbf{x}\textbf{x}}]^{-1}\textbf{g}_\textbf{x}\) are only \({\mathcal {O}}(n_\textbf{x}n_\textbf{y})\) and \({\mathcal {O}}(n_\textbf{x})\), in the second round of PANDA, the server aggregates \(\textbf{Q}^{i}_{\textbf{x}\textbf{y}}\) and \(\textbf{q}^{i}_\textbf{x}\) as follows:

$$\begin{aligned} \textbf{Q}_{\textbf{x}\textbf{y}} = \frac{1}{m}\sum _{i=1}^{m}\textbf{Q}^{i}_{\textbf{x}\textbf{y}},\quad \quad \textbf{q}_{\textbf{x}} =\frac{1}{m} \sum _{i=1}^{m}\textbf{q}^i_\textbf{x}. \end{aligned}$$
(12)

Using \(\textbf{Q}_{\textbf{x}\textbf{y}}\) and \(\textbf{q}_\textbf{x}\) to replace \(\textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}}\) and \(\textbf{H}_{\textbf{x}\textbf{x}}^{-1}\textbf{g}_\textbf{x}\) in (10), the server finally computes the following approximate Newton direction

$$\begin{aligned} \begin{aligned} \begin{bmatrix} \tilde{{\textbf{d}}}_{\textbf{x}}\\ \tilde{{\textbf{d}}}_\textbf{y}\end{bmatrix} \overset{\textrm{def}}{=}\begin{bmatrix} \textbf{q}_\textbf{x}- \textbf{Q}_{\textbf{x}\textbf{y}}\tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}+ \textbf{Q}_{\textbf{x}\textbf{y}}\tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{Q}_{\textbf{x}\textbf{y}}^{\top }\textbf{g}_\textbf{x}\\ -\tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\textbf{q}_\textbf{x}+ \tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{g}_{\textbf{y}} \end{bmatrix}, \end{aligned} \end{aligned}$$
(13)

with \(\tilde{\Delta }_{\textbf{y}\textbf{y}} \overset{\textrm{def}}{=}[\textbf{H}_{\textbf{y}\textbf{y}} -\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\textbf{Q}_{\textbf{x}\textbf{y}}]^{-1}\) and updates the parameters based on \(\tilde{{\textbf{d}}}_\textbf{x}\) and \(\tilde{{\textbf{d}}}_\textbf{y}\).
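To make the two communication rounds (11)-(12) and the direction (13) concrete, the following single-process Python sketch mimics one PANDA iteration on the server; the dictionary keys and the function name are ours, and in an actual deployment the averages below correspond to network aggregations:

```python
import numpy as np

def panda_direction(local_stats):
    """One PANDA iteration, given per-client statistics g_x^i, g_y^i, H_xy^i, H_yy^i, H_xx^i."""
    m = len(local_stats)
    # Round 1: aggregate gradients and the small partial Hessian blocks, as in (11).
    g_x = sum(c["g_x"] for c in local_stats) / m
    g_y = sum(c["g_y"] for c in local_stats) / m
    H_xy = sum(c["H_xy"] for c in local_stats) / m
    H_yy = sum(c["H_yy"] for c in local_stats) / m
    # Round 2: each client returns Q^i = [H_xx^i]^{-1} H_xy and q^i = [H_xx^i]^{-1} g_x,
    # which the server aggregates as in (12).
    Q = sum(np.linalg.solve(c["H_xx"], H_xy) for c in local_stats) / m
    q = sum(np.linalg.solve(c["H_xx"], g_x) for c in local_stats) / m
    # Approximate Newton direction (13).
    Delta = np.linalg.inv(H_yy - H_xy.T @ Q)
    d_x = q - Q @ (Delta @ g_y) + Q @ (Delta @ (Q.T @ g_x))
    d_y = -Delta @ (H_xy.T @ q) + Delta @ g_y
    return d_x, d_y
```

In this sketch the local solves with \([\textbf{H}^i_{\textbf{x}\textbf{x}}]^{-1}\) are carried out inside the loop for simplicity; in the algorithm they are performed by the clients, so only \({\mathcal {O}}(n_\textbf{x}n_\textbf{y})\) numbers travel over the network.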

We formally summarize the PANDA method in Algorithm 1. The following proposition indicates that the update rule of PANDA can be viewed as a partially approximate Newton method.

Proposition 4.1

Using PANDA in Algorithm 1, the update rule on the server is equivalent to

$$\begin{aligned} \begin{aligned} { \begin{bmatrix} \textbf{x}_{t+1}\\ \textbf{y}_{t+1} \end{bmatrix}} ={ \begin{bmatrix} \textbf{x}_{t}\\ \textbf{y}_{t} \end{bmatrix}}-\begin{bmatrix} \tilde{\textbf{H}}_{\textbf{x}\textbf{x}, t } & \quad \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)\\ \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)^{\top }& \quad \textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x}_t,\textbf{y}_t) \end{bmatrix}^{-1}\!\!\begin{bmatrix} \textbf{g}_{\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\\ \textbf{g}_\textbf{y}(\textbf{x}_t,\textbf{y}_t) \end{bmatrix}, \end{aligned} \end{aligned}$$
(14)

where \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}, t} \overset{\textrm{def}}{=}\left[ \frac{1}{m}\sum _{i=1}^m(\textbf{H}^i_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t))^{-1}\right] ^{-1}.\)

Proof

We omit the subscript t in the following proof, writing \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}=\tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}\) and \(\textbf{H}_{\textbf{x}\textbf{y}}=\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)\) (similar for \(\textbf{H}_{\textbf{x}\textbf{x}}\), \(\textbf{H}_{\textbf{x}\textbf{x}}^i\), \(\textbf{H}_{\textbf{y}\textbf{y}}\), \(\textbf{g}_\textbf{x}\), \(\textbf{g}_\textbf{y}\)). We denote

$$\begin{aligned} \hat{\Delta }_{\textbf{y}\textbf{y}}&\!\overset{\textrm{def}}{=}\! [\textbf{H}_{\textbf{y}\textbf{y}}-\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}}]^{-1}\!\!= \!\left[ \textbf{H}_{\textbf{y}\textbf{y}}-\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\frac{1}{m}\sum _{i=1}^m(\textbf{H}_{\textbf{x}\textbf{x}}^i)^{-1}\textbf{H}_{\textbf{x}\textbf{y}}\right] ^{-1}\\&= \left[ \textbf{H}_{\textbf{y}\textbf{y}}-\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\frac{1}{m}\sum _{i=1}^m\textbf{Q}_{\textbf{x}\textbf{y}}^{i}\right] ^{-1}{=}\tilde{\Delta }_{\textbf{y}\textbf{y}}. \end{aligned}$$

Then, it holds that

$$\begin{aligned}&\begin{bmatrix} \tilde{\textbf{H}}_{\textbf{x}\textbf{x}} & \quad \textbf{H}_{\textbf{x}\textbf{y}}\\ \textbf{H}_{\textbf{x}\textbf{y}}^{\top }& \quad \textbf{H}_{\textbf{y}\textbf{y}} \end{bmatrix}^{-1}\begin{bmatrix} \textbf{g}_{\textbf{x}}\\ \textbf{g}_\textbf{y}\end{bmatrix} \\&= \begin{bmatrix} \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{g}_\textbf{x}+ \left( \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} \right) {\hat{\Delta }_{\textbf{y}\textbf{y}}}\left( \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} \right) ^{\top }\textbf{g}_\textbf{x}- \left( \tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{H}_{\textbf{x}\textbf{y}} \right) \hat{\Delta }_{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}\\ -\hat{\Delta }_{\textbf{y}\textbf{y}}\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\tilde{\textbf{H}}_{\textbf{x}\textbf{x}}^{-1}\textbf{g}_\textbf{x}+ \hat{\Delta }_{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}\end{bmatrix}\\&\!\!\!\overset{(12)}{=}\begin{bmatrix} \frac{1}{m}\sum _{i=1}^m \bigg (\textbf{q}_x^i \!\!+\! \textbf{Q}_{\textbf{x}\textbf{y}}^{i}\tilde{\Delta }_{\textbf{y}\textbf{y}} \big (\frac{1}{m}\sum _{i=1}^m\textbf{Q}_{\textbf{x}\textbf{y}}^i\big )^{\top }\textbf{g}_\textbf{x}- \textbf{Q}_{\textbf{x}\textbf{y}}^i\tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}\bigg ) \\ -\frac{1}{m}\sum _{i=1}^m\tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{H}_{\textbf{x}\textbf{y}}^{\top }\textbf{q}_{\textbf{x}}^i + \tilde{\Delta }_{\textbf{y}\textbf{y}}\textbf{g}_\textbf{y}\end{bmatrix}\!\!\overset{(13)}{=}\!\!\begin{bmatrix} \tilde{{\textbf{d}}}_\textbf{x}\\ \tilde{{\textbf{d}}}_\textbf{y}\end{bmatrix}. \end{aligned}$$

\(\square\)
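Proposition 4.1 can also be checked numerically: on random local Hessian blocks, the aggregated direction (13) coincides with the partially approximate Newton step (14) built from the harmonic-mean Hessian. The sketch below uses synthetic blocks of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n_x, n_y = 4, 30, 2

# Random per-client H_xx^i (positive definite) and shared H_xy, H_yy, gradients.
H_xx_i = []
for _ in range(m):
    M = rng.standard_normal((n_x, n_x)); H_xx_i.append(M @ M.T / n_x + np.eye(n_x))
H_xy = rng.standard_normal((n_x, n_y)) / np.sqrt(n_x)
H_yy = -np.eye(n_y)
g_x, g_y = rng.standard_normal(n_x), rng.standard_normal(n_y)

# PANDA direction via the aggregations (12) and the formula (13).
Q = sum(np.linalg.solve(H, H_xy) for H in H_xx_i) / m
q = sum(np.linalg.solve(H, g_x) for H in H_xx_i) / m
Delta = np.linalg.inv(H_yy - H_xy.T @ Q)
d = np.concatenate([q - Q @ (Delta @ g_y) + Q @ (Delta @ (Q.T @ g_x)),
                    -Delta @ (H_xy.T @ q) + Delta @ g_y])

# Equivalent step (14) with the harmonic-mean Hessian tilde{H}_xx.
H_tilde = np.linalg.inv(sum(np.linalg.inv(H) for H in H_xx_i) / m)
K = np.block([[H_tilde, H_xy], [H_xy.T, H_yy]])
print(np.allclose(d, np.linalg.solve(K, np.concatenate([g_x, g_y]))))   # True
```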

4.2 Convergence analysis of PANDA

We suppose the N samples are drawn i.i.d. from some distribution and each sample is associated with a local loss function \(l_j(\cdot )\). We also assume each client holds s samples drawn from \(\{l_j(\cdot )\}_{j=1}^N\), such that \(N=ms\) and \(|S_i|\equiv s\). According to Lemma 2.3, each local partial Hessian, \(\textbf{H}_{\textbf{x}\textbf{x}}^i(\textbf{x}_t,\textbf{y}_t) = \frac{1}{s} \sum _{j\in S_i}\nabla ^2_{\textbf{x}\textbf{x}} l_j(\textbf{x}_t,\textbf{y}_t)\), can be viewed as a sub-sampling approximation of \(\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\) when s is large. The following lemma indicates that

$$\begin{aligned} \textbf{C}_t\overset{\textrm{def}}{=}\tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}-\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)(\textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x}_t,\textbf{y}_t))^{-1}(\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t))^{\top } \end{aligned}$$

with \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}\) defined in (14) is a good estimate of \(\textbf{P}(\textbf{x}_t,\textbf{y}_t)\).

Lemma 4.2

Suppose Assumption 2.2 holds. If, for all \(i\in [m]\), \(\textbf{H}_{\textbf{x}\textbf{x}}^i(\textbf{x}_t,\textbf{y}_t)\) satisfies

$$\begin{aligned} \left( 1-\eta \right) \textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\preceq \textbf{H}_{\textbf{x}\textbf{x}}^{i}(\textbf{x}_t,\textbf{y}_t)\preceq (1+\eta ) \textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t), \end{aligned}$$

then we have \(\left\| \textbf{I}- \textbf{P}(\textbf{x}_t,\textbf{y}_t)^{1/2}{\textbf{C}}_{t}^{-1}\textbf{P}(\textbf{x}_t,\textbf{y}_t)^{1/2} \right\| \le \frac{\eta ^2}{1-\eta }\).

Incorporating the linear-quadratic rates established by the PAN framework, we can obtain the improved linear-quadratic rates for PANDA.

Theorem 4.3

Suppose Assumption 2.2 holds and that

$$\begin{aligned} \left( 1-{\eta }\right) \textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\preceq \textbf{H}_{\textbf{x}\textbf{x}}^{i}(\textbf{x}_t,\textbf{y}_t)\preceq (1+\eta )\textbf{H}_{\textbf{x}\textbf{x}}(\textbf{x}_t,\textbf{y}_t) \end{aligned}$$

holds for all \(i\in [m]\). Then the update rule of PANDA (Algorithm 1) in (14) satisfies

$$\begin{aligned} \begin{aligned} \lambda (\textbf{x}_{t+1},\textbf{y}_{t+1})&\le \frac{\eta ^2}{1-\eta }\lambda (\textbf{x}_t,\textbf{y}_t)+\frac{12\kappa _\textbf{g}^2\kappa _{\textbf{H}}(1-\eta +\eta ^2)^2}{\sqrt{\mu }(1-\eta )^2} \lambda (\textbf{x}_t,\textbf{y}_t)^2. \end{aligned} \end{aligned}$$
(15)

Similar to Corollary 3.3, we can guarantee a small \(\eta \in (0,0.5)\) for Theorem 4.3.

Corollary 4.4

Under Assumption 2.2, for any \(\delta \in (0,1)\) and \(\eta \in (0,0.5)\), if each client holds \(s\ge \frac{3\hat{\kappa }\log (2n_\textbf{x}m/\delta )}{\eta ^2}\) samples, then with probability at least \(1-\delta\), the update rule of PANDA (Algorithm 1) in (14) satisfies (15).

Remark 4.5

Corollary 4.4 can be interpreted as follows: if N is at least \(12m\hat{\kappa }\log (2n_\textbf{x}m/\delta )\), then (15) holds with probability at least \(1-\delta\), where \(\eta =\eta _\mathrm{\text {PANDA}}\overset{\textrm{def}}{=}\sqrt{\frac{3\hat{\kappa }m\log (2n_\textbf{x}m/\delta )}{ N}}\).

We highlight the advancements of the PANDA method in the following two aspects:

  1. (a)

    We compare PANDA with its single-agent version, which corresponds to using N/m samples to construct the approximated Hessian in PAN. According to Corollary 3.3 and Corollary 4.4, we observe that \(\eta _{\text {PAN}} = \sqrt{\frac{3\hat{\kappa }m\log (2n_\textbf{x}/\delta )}{N}}\approx \eta _{\text {PANDA}}\). This indicates that the linear-quadratic rate (15) of PANDA significantly improves upon its single-agent version (9), which demonstrates the superiority of using the distributed framework.

  2. (b)

    We compare PANDA with state-of-the-art first-order methods in Table 2. Both distributed EG and Proxskip-VI-FL (Zhang et al., 2024) require \({\mathcal {O}}(\kappa _\textbf{g}\log (1/\epsilon ))\) communication rounds, whereas PANDA only requires \({\mathcal {O}}\left( \frac{\log (1/\epsilon )}{\log (N/(m\hat{\kappa }))}\right)\) communication rounds. This highlights the advantage of using second-order information.

Table 2 We present the communication complexity of the proposed method (PANDA) and the baselines for solving quadratic distributed minimax optimization (AUC maximization)

4.3 Extension to the GIANT-PANDA Algorithm

PANDA exhibits provably faster convergence rates than first-order methods for distributed minimax optimization; however, each client is required to access the full local Hessian at each iteration. In this section, we develop a communication-efficient algorithm that allows using an inexact Hessian instead of the exact one during the local computation.

We focus on a specific function class in which each \(l_j(\cdot )\) in (2) can be expressed as \(l_j(\textbf{x},\textbf{y})\overset{\textrm{def}}{=}h_j(\textbf{w}_j^{\top }\textbf{x},\textbf{y}) + \frac{\mu }{2}\Vert \textbf{x}\Vert ^2\), where \(\textbf{w}_j\) is the data vector of the j-th sample and \(h_j(\cdot ,\cdot )\) is convex in \(\textbf{x}\) and \(\mu\)-strongly concave in \(\textbf{y}\). This function class generalizes the objective considered in convex optimization as discussed in GIANT (Wang et al., 2018), which has important applications in regression-type models.

The partial Hessian of the objective at \((\textbf{x}_t,\textbf{y}_t)\) can be written as

$$\begin{aligned} \textbf{H}_{\textbf{x}\textbf{x}} (\textbf{x}_t,\textbf{y}_t)&= \frac{1}{N}\sum _{j=1}^N\nabla ^2_{\textbf{x}\textbf{x}} h_j \left( \textbf{w}_j^{\top }\textbf{x}_t,\textbf{y}_t \right) \textbf{w}_j\textbf{w}_j^{\top } + \mu \textbf{I}= \frac{1}{m} \sum _{i=1}^m{\left\{ \textbf{A}_t^{\top }\textbf{S}^{i}(\textbf{S}^{i})^{\top }\textbf{A}_t + \mu \textbf{I}\right\} }, \end{aligned}$$

where \(\textbf{A}_t \overset{\textrm{def}}{=}\left[ {\textbf{a}}_1,\cdots ,{\textbf{a}}_N\right]^{\top } \in {\mathbb {R}}^{N\times n_\textbf{x}}\) is a full column-rank matrix with \(n_\textbf{x}\le N\), \({\textbf{a}}_j = \sqrt{\nabla ^2_{\textbf{x}\textbf{x}}h_j(\textbf{w}_j^{\top }\textbf{x}_t,\textbf{y}_t)}\textbf{w}_j/\sqrt{N}\), and \(\textbf{S}^{i}\) is a sketching matrix such that \((\textbf{S}^i)^{\top }\textbf{A}_t\) contains the rows of \(\textbf{A}_t\) indexed by \({\mathcal {S}}_i\). The local partial Hessian of the i-th client is then \(\textbf{H}_{\textbf{x}\textbf{x}}^{i}(\textbf{x}_t,\textbf{y}_t)\overset{\textrm{def}}{=}\textbf{A}_t^{\top }\textbf{S}^{i}(\textbf{S}^{i})^{\top }\textbf{A}_t + \mu \textbf{I}\).

Taking advantage of this structure, we perform a sketch operation on \(\textbf{H}_{\textbf{x}\textbf{x}}^{i}(\textbf{x}_t,\textbf{y}_t)\) to reduce the computation cost on each client:

$$\begin{aligned} \tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}^i \overset{\textrm{def}}{=}\textbf{A}_t^{\top }\tilde{\textbf{S}}_t^{i}(\tilde{\textbf{S}}_t^{i})^{\top }\textbf{A}_t + \mu \textbf{I}, \end{aligned}$$
(16)

where \(\tilde{\textbf{S}}_t^{i}\in {\mathbb {R}}^{N\times s_t}\) is formed by randomly choosing \(s_t\le s\) columns of \(\textbf{S}^i\). We replace \(\textbf{H}_{\textbf{x}\textbf{x}}^i(\textbf{x}_t,\textbf{y}_t)\) by its sketched approximation \(\tilde{\textbf{H}}_{\textbf{x}\textbf{x},t}^{i}\) in line 15 and line 16 of PANDA (Algorithm 1), naturally resulting in the modified algorithm GIANT-PANDA. The routine of GIANT-PANDA is formally presented in Algorithm 2 in Appendix A.
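The following Python sketch illustrates the construction (16) on synthetic data. The data vectors, the curvature values standing in for \(\nabla ^2_{\textbf{x}\textbf{x}}h_j\), and the rescaling by \(N/s_t\) (our reading of the uniform sketch, chosen so that the sketched matrix is an unbiased estimate of \(\textbf{A}_t^{\top }\textbf{A}_t\)) are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
N, n_x, mu, m = 8_000, 50, 1e-1, 8
s = N // m                                           # local sample size per client

# Hypothetical per-sample data vectors w_j and positive curvatures of h_j at the current iterate.
W = rng.standard_normal((N, n_x))
h_dd = 1.0 / (1.0 + np.exp(rng.standard_normal(N)))  # stand-in for the scalar second derivative of h_j

A_t = np.sqrt(h_dd)[:, None] * W / np.sqrt(N)        # rows a_j = sqrt(h_j'') w_j / sqrt(N)
H_xx = A_t.T @ A_t + mu * np.eye(n_x)                # global partial Hessian H_xx(x_t, y_t)

# Client i holds the rows indexed by S_i; a further uniform sketch of ratio p = s_t / s gives (16).
i, p = 0, 0.3
local = np.arange(i * s, (i + 1) * s)
s_t = int(p * s)
sketch = rng.choice(local, size=s_t, replace=False)
H_tilde_i = (N / s_t) * A_t[sketch].T @ A_t[sketch] + mu * np.eye(n_x)

# Empirical eta: largest relative spectral deviation of the sketched local Hessian from H_xx.
w_eig, V = np.linalg.eigh(H_xx)
H_inv_sqrt = V @ np.diag(w_eig ** -0.5) @ V.T
print(np.max(np.abs(np.linalg.eigvalsh(H_inv_sqrt @ H_tilde_i @ H_inv_sqrt) - 1.0)))
```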

Now, we start to characterize the convergence behavior of GIANT-PANDA. Since GIANT-PANDA inherits the framework of PANDA, the update rule of GIANT-PANDA can be viewed as

$$\begin{aligned} \begin{aligned} { \begin{bmatrix} \textbf{x}_{t+1}\\ \textbf{y}_{t+1} \end{bmatrix}} ={ \begin{bmatrix} \textbf{x}_{t}\\ \textbf{y}_{t} \end{bmatrix}}-\begin{bmatrix} \tilde{\textbf{H}}_{\textbf{x}\textbf{x}, t }^{\text {gp}} & \quad \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)\\ \textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)^{\top }& \quad \textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x}_t,\textbf{y}_t) \end{bmatrix}^{-1}\!\!\begin{bmatrix} \textbf{g}_{\textbf{x}}(\textbf{x}_t,\textbf{y}_t)\\ \textbf{g}_\textbf{y}(\textbf{x}_t,\textbf{y}_t) \end{bmatrix}, \end{aligned} \end{aligned}$$
(17)

where \(\tilde{\textbf{H}}^{\text {gp}}_{\textbf{x}\textbf{x}, t} \overset{\textrm{def}}{=}\left[ \frac{1}{m}\sum _{i=1}^m[\tilde{\textbf{H}}^i_{\textbf{x}\textbf{x},t}]^{-1}\right] ^{-1}\).

Let \(\textbf{C}^{\textrm{gp}}_{t}\overset{\textrm{def}}{=}\tilde{\textbf{H}}^{{ \textrm{gp}}}_{\textbf{x}\textbf{x},t}-\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)[\textbf{H}_{\textbf{y}\textbf{y}}(\textbf{x}_t,\textbf{y}_t)]^{-1}(\textbf{H}_{\textbf{x}\textbf{y}}(\textbf{x}_t,\textbf{y}_t))^{\top }\). The following lemma shows that \(\textbf{C}^{\textrm{gp}}_{t}\) is still a good approximation to \(\textbf{P}(\textbf{x}_t,\textbf{y}_t)\).

Lemma 4.6

Let \(\eta ,\delta \in (0,1)\) be fixed parameters, \(\nu _t=\nu (\textbf{A}_t)\), and let \(\tilde{\textbf{S}}_t^1,\cdots ,\tilde{\textbf{S}}_t^m\) be independent uniform sampling matrices with \(s_t\ge \frac{3\nu _tn_\textbf{x}}{\eta ^2}\log \left( \frac{mn_\textbf{x}}{\delta }\right)\). Under Assumption 2.2, we have

$$\begin{aligned} \left\| \textbf{I}- \textbf{P}(\textbf{x}_t,\textbf{y}_t)^{1/2}({\textbf{C}}^{{\textrm{gp}}}_{t})^{-1}\textbf{P}(\textbf{x}_t,\textbf{y}_t)^{1/2} \right\| \le \frac{\eta }{\sqrt{m}}+ \frac{\eta ^2}{1-\eta } \end{aligned}$$

holds with probability at least \(1-\delta\).

Remark 4.7

The condition of Lemma 4.6 requires \(\{\tilde{\textbf{S}}^i_t\}\) to be uniform sampling matrices, which means we perform a uniform sketch to obtain the local approximate Hessian \(\tilde{\textbf{H}}^i_{\textbf{x}\textbf{x},t}\) in GIANT-PANDA. GIANT-PANDA also allows using other sketching techniques such as the count sketch (Clarkson & Woodruff, 2017; Meng & Mahoney, 2013) or the Gaussian sketch (Johnson & Lindenstrauss, 1984) to obtain \(\tilde{\textbf{S}}^i_t\). These sketching methods can improve the dependence of \(s_t\) on \(\nu _t\), but are more expensive to implement than the simple uniform sketching matrix (Wang et al., 2018).

Using the analysis framework of PAN, we establish the linear-quadratic rate of GIANT-PANDA.

Theorem 4.8

Under the same conditions as in Lemma 4.6, the update of GIANT-PANDA (Algorithm 2) in (17) satisfies

$$\begin{aligned} \begin{aligned} \lambda (\textbf{x}_{t+1},\textbf{y}_{t+1})&\le \left( \frac{\eta }{\sqrt{m}}+\frac{\eta ^2}{1-\eta }\right) \lambda (\textbf{x}_t,\textbf{y}_t) +\frac{c\kappa _\textbf{g}^2\kappa _\textbf{H}}{\sqrt{\mu }} \lambda (\textbf{x}_t,\textbf{y}_t)^2. \end{aligned} \end{aligned}$$
(18)

with probability at least \(1-\delta\), where \(c=\frac{12((\sqrt{m}-1)(\eta ^2-\eta +1)+1)^2}{(1-\eta )^2m}\).

Remark 4.9

The coefficient \(\left( \frac{\eta }{\sqrt{m}}+\frac{\eta ^2}{1-\eta }\right)\) of the linear term \(\lambda (\textbf{x}_{t},\textbf{y}_t)\) in (18) for GIANT-PANDA is slightly worse than the coefficient \(\left( \frac{\eta ^2}{1-\eta }\right)\) in (15) for PANDA. This is because GIANT-PANDA uses the approximate local partial Hessian instead of the full local partial Hessian. However, it is still better than the coefficient \(\left( \frac{\eta }{1-\eta }\right)\) in (9) for PAN by a factor of \(\frac{1}{\sqrt{m}}\). This demonstrates the advantage of utilizing m clients in the parallel training process.

Improved Results for GIANT. GIANT (Wang et al., 2018) (Algorithm 3 in Appendix A) can be regarded as a special case of GIANT-PANDA for convex optimization when taking \(n_\textbf{y}= 0\). Using the analysis techniques developed for GIANT-PANDA, we also improve the convergence results for GIANT.

In the following corollary, we present a sharper linear-quadratic rate for GIANT under the same assumption as in (Wang et al., 2018), which improves the previous result by a factor of \(\sqrt{\kappa _\textbf{g}}\) in the linear term.

Corollary 4.10

Consider solving the minimization problem \(\min _{\textbf{x}\in {\mathbb {R}}^{n_\textbf{x}}} f(\textbf{x}) = \frac{1}{N}\sum _{j=1}^N h_j(\textbf{w}_j^{\top }\textbf{x}) + \frac{\mu }{2}\Vert \textbf{x}\Vert ^2\) with \(\mu >0\) on m clients, where each client holds s samples. If \(h_j(\cdot )\) is a convex loss function, \(f(\cdot )\) has \(L_2\)-Lipschitz continuous Hessian, and \(s_t\) satisfies \(s\ge s_t\ge \frac{3\nu _tn_\textbf{x}}{\eta ^2}\log \left( \frac{mn_\textbf{x}}{\delta }\right)\) for some fixed parameters \(\eta ,\delta \in (0,1)\), then with probability at least \(1-\delta\), the update rule of GIANT (Algorithm 3) satisfies

$$\begin{aligned} \hat{\lambda }(\textbf{x}_{t+1})\le \left( \frac{\eta }{\sqrt{m}}+\frac{\eta ^2}{1-\eta }\right) \hat{\lambda } (\textbf{x}_t) + \frac{2L_2}{\mu ^{3/2}} \hat{\lambda }(\textbf{x}_t)^2, \end{aligned}$$

where \(\hat{\lambda }(\textbf{x})\overset{\textrm{def}}{=}\sqrt{(\nabla f(\textbf{x}))^{\top }[\nabla ^2 f(\textbf{x})]^{-1}\nabla f(\textbf{x})}\).

5 Experiments

We validate the proposed methods on the following important data mining tasks, which enjoy the structure of “unbalanced dimensions” and have been well studied in the previous literature (Liu et al., 2022; Liu & Luo, 2022). The experiments are conducted on a workstation with an Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz. The code was executed using Python 3.8.

  • AUC Maximization. To train a classifier \(\textbf{w}\) on an imbalanced dataset \(\{{\textbf{a}}_j,b_j\}_{j=1}^N\) with \(p = \frac{N^{+}}{N}\approx 1~\text {or}~0\), where \(N^{+}\) is the number of positive instances, AUC maximization can be reformulated as a minimax problem in which \(l_j(\textbf{x},\textbf{y})\) of (1) takes the following quadratic form (a minimal implementation sketch of this per-sample loss is given after this list):

    $$\begin{aligned} l_j(\textbf{x},y)&= (1-p)\big ( \big (\textbf{w}^\top {\textbf {a}}_j-u \big )^2 - 2(1+y)\textbf{w}^\top {\textbf {a}}_j\big ){\mathbb {I}}_{b_j=1}+\frac{\lambda }{2}\left\| \textbf{x} \right\| ^2\\&\quad +p\big ( \big (\textbf{w}^\top {\textbf {a}}_j-v \big )^2 + 2(1+y)\textbf{w}^\top {\textbf {a}}_j\big ){\mathbb {I}}_{b_j=-1} -p(1-p)y^2, \end{aligned}$$

    where \(\textbf{x}= [\textbf{w}; u; v] \in {\mathbb {R}}^{d+2}\), \(\textbf{w}\in {\mathbb {R}}^d\), \(u\in {\mathbb {R}}\), \(v\in {\mathbb {R}}\), and \(y\in {\mathbb {R}}\). We perform experiments on “a9a” (\(N = 32,651\), \(n_\textbf{x}=125\), \(n_\textbf{y}=1\), \(p=0.241\)), “w8a” (\(N = 45,546\), \(n_\textbf{x}=302\), \(n_\textbf{y}= 1\), \(p=0.029\)), and “sido0” (\(N=12,678\), \(n_\textbf{x}=4,932\), \(n_\textbf{y}=1\)), which can be downloaded from Libsvm (Chang & Lin, 2011). We set the regularization parameter \(\lambda = 0.5\) and the number of clients \(m=8\).

  • Fairness-Aware Machine Learning. Given the training set \(\{{\textbf{a}}_j, b_j, c_j\}_{j=1}^{N}\), where \({\textbf{a}}_j\in {\mathbb {R}}^d\) and \(c_j\in {\mathbb {R}}\), we can use the following adversarial training model to train a binary classifier \(\textbf{x}\) (Zhang et al., 2018) that is unbiased with respect to the protected feature \(c_j\) (this per-sample loss is also included in the sketch after this list):

    $$\begin{aligned} l_j(\textbf{x}, y) =\log \big (1+\exp \big (-b_j({\textbf {a}}_j)^\top \textbf{x}\big )\big ) +\lambda \left\| \textbf{x} \right\| ^2 - \gamma y^2-\beta \log \big (1 + \exp \big (-c_j({\textbf {a}}_j)^\top \textbf{x}y\big )\big ). \end{aligned}$$

    We set the regularization parameters \(\lambda = 0.5\) and \(\beta = \gamma = 0.0001\). We conduct experiments on “adult” (\(N = 32,651\), \(n_\textbf{x}=122\), \(n_\textbf{y}=1\)) and “law school” (\(N = 20,427\), \(n_\textbf{x}=379\), \(n_\textbf{y}= 1\)) datasets (Le Quy et al., 2022; Liu & Luo, 2022).
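For reference, a minimal Python sketch of the two per-sample losses used above is given below; the function names and default regularization values (taken from the settings reported in this section) are ours, and the implementation is only an illustration:

```python
import numpy as np

def auc_loss(x, y, a_j, b_j, p, lam=0.5):
    """Per-sample AUC-maximization loss l_j(x, y) with x = [w; u; v]."""
    w, u, v = x[:-2], x[-2], x[-1]
    z = w @ a_j
    pos = (1 - p) * ((z - u) ** 2 - 2 * (1 + y) * z) if b_j == 1 else 0.0
    neg = p * ((z - v) ** 2 + 2 * (1 + y) * z) if b_j == -1 else 0.0
    return pos + neg + lam / 2 * np.dot(x, x) - p * (1 - p) * y ** 2

def fairness_loss(x, y, a_j, b_j, c_j, lam=0.5, beta=1e-4, gamma=1e-4):
    """Per-sample fairness-aware logistic loss l_j(x, y)."""
    z = a_j @ x
    return (np.log1p(np.exp(-b_j * z)) + lam * np.dot(x, x)
            - gamma * y ** 2 - beta * np.log1p(np.exp(-c_j * z * y)))
```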

Fig. 1 We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and running time (seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for AUC maximization under the case \(m=8\) on datasets “a9a”, “w8a”, and “sido0”

Fig. 2 We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and running time (seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for AUC maximization under the case \(m=128\) on datasets “a9a”, “w8a”, and “sido0”

Fig. 3 We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and running time (seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for Fairness-aware machine learning under the case \(m=8\) on datasets “adult” and “law school”

Fig. 4 We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and running time (seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for Fairness-aware machine learning under the case \(m=128\) on datasets “adult” and “law school”

Fig. 5 We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and running time (seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for AUC maximization on datasets “a9a” and “w8a” with different sketch ratios p under the case \(m=8\)

Fig. 6 We demonstrate the communication rounds (#comm) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) and running time (seconds) against \(\Vert \nabla f(\textbf{x},\textbf{y})\Vert _2\) for Fairness-aware machine learning on datasets “adult” and “law school” with different sketch ratios p under the case \(m=8\)

5.1 Comparison with the baselines

We compare PANDA and GIANT-PANDA with existing state-of-the-art communication-efficient methods. Specifically, we adopt the distributed version of the extra gradient method (Korpelevich, 1976; Tseng, 2000) (EG), federated gradient descent ascent with gradient tracking (Sun & Wei, 2022) (FedGDA), and the proximal skip method for variational inequalities (Zhang et al., 2024) (ProxSkip) as the baselines. Both EG and ProxSkip achieve the optimal communication complexity for first-order methods. We tune the learning rates of all methods (including the baselines) from \(\{1.0, 0.9, \dots ,0.1\}\).

For all experiments, we use \(70\%\) of the local data in GIANT-PANDA. The results for AUC maximization under different client numbers \(m=8\) and \(m=128\) are presented in Figs. 1 and 2. We also demonstrate the results for Fairness-aware machine learning under different client numbers \(m=8\) and \(m=128\) in Figs. 3 and 4.

We observe that our newly proposed PANDA and GIANT-PANDA outperform the baselines in terms of both communication rounds and running time in all cases. This indicates that our methods not only significantly reduce the number of communication rounds compared to the optimal first-order methods, but also maintain per-round communication efficiency, which keeps the overall optimization procedure fast.

We also observe that the communication complexity of PANDA can be affected by the number of clients m. This is because \(\eta _{\text {PANDA}}\) is proportional to \(\sqrt{m}\) according to Remark 4.5. On the other hand, increasing m reduces the training time per iteration due to the distributed framework, so a larger m leads to a faster training process. Taking panels (a) and (b) of Figs. 1 and 2 as an example, PANDA requires fewer communication rounds when \(m=8\), but takes less running time when \(m=128\).

5.2 Comparison of different sketch ratios for GIANT-PANDA

We investigate the impact of the sketch ratio (\(p = s_t/s\)) on GIANT-PANDA. We choose different sketch ratios p from \(\{10\%, 30\%, 50\%, 70\%, 100\%\}\) for GIANT-PANDA. For the case \(p=100\%\), GIANT-PANDA reduces to its full version PANDA. We set the number of clients as \(m=8\).

We present the results for AUC maximization and Fairness-aware machine learning in Figs. 5 and 6, respectively. The numerical results show that larger sketch ratios lead to fewer communication rounds, since a larger ratio yields a better approximation to the exact local partial Hessian and thus a smaller \(\eta\). GIANT-PANDA with \(p=100\%\) (i.e., PANDA) outperforms the other cases in terms of communication rounds. On the other hand, GIANT-PANDA shows its advantage in terms of running time. We find that GIANT-PANDA with \(p=30\%\) for “a9a” and \(p=10\%\) for “w8a” in AUC maximization, and with \(p=10\%\) for “law school” in Fairness-aware machine learning, achieves the best running time (panels (b) and (d) of Figs. 5 and 6). This is because the sketch operation in GIANT-PANDA reduces the computation time on each client.

We also provide additional experiments to study the impact of the sketch ratio under \(m=128\) and the impact of using different sketch methods in Appendix G.

6 Conclusion

In this paper, we have proposed PANDA and GIANT-PANDA to solve the distributed minimax problems with unbalanced dimensions. PANDA eliminates the requirement of communicating the full Hessian and substantially reduces the communication rounds compared to the optimal first-order methods. GIANT-PANDA further reduces the computation cost by performing sketch operations to compute the local partial Hessian on each client.

For future work, it is interesting to generalize PANDA and GIANT-PANDA to more general minimax optimization problems (Adil et al., 2022; Lin & Jordan, 2022; Liu & Luo, 2022; Luo et al., 2022). It is also possible to leverage the idea of Hessian averaging (Na et al., 2023) to further enhance the behavior of GIANT-PANDA, and to design decentralized variants of PANDA and GIANT-PANDA.