Limited memory bundle DC algorithm for sparse pairwise kernel learning

Karmitsa, Napsu; Joki, Kaisa; Airola, Antti; Pahikkala, Tapio

doi:10.1007/s10898-025-01481-w

Limited memory bundle DC algorithm for sparse pairwise kernel learning

Open access
Published: 03 April 2025

Volume 92, pages 55–85, (2025)
Cite this article

You have full access to this open access article

Download PDF

Journal of Global Optimization Aims and scope Submit manuscript

Limited memory bundle DC algorithm for sparse pairwise kernel learning

Download PDF

Napsu Karmitsa ORCID: orcid.org/0000-0002-8747-4836¹,
Kaisa Joki²,
Antti Airola¹ &
…
Tapio Pahikkala¹

946 Accesses
1 Citation
Explore all metrics

Abstract

Pairwise learning is a specialized form of supervised learning that focuses on predicting outcomes for pairs of objects. In this paper, we formulate the pairwise learning problem as a difference of convex (DC) optimization problem using the Kronecker product kernel, $\ell _1$- and $\ell _0$-regularizations, and various, possibly nonsmooth, loss functions. Our aim is to develop an efficient learning algorithm, SparsePKL, that produces accurate predictions with the desired sparsity level. In addition, we propose a novel limited memory bundle DC algorithm (LMB-DCA) for large-scale nonsmooth DC optimization and apply it as an underlying solver in the SparsePKL. The performance of the SparsePKL-algorithm is studied in seven real-world drug-target interaction data and the results are compared with those of the state-of-art methods in pairwise learning.

A unified DC programming framework and efficient DCA based approaches for large scale batch reinforcement learning

Article 21 August 2018

Robust discriminative feature learning with calibrated data reconstruction and sparse low-rank model

Article 07 October 2017

Combining Primal and Dual Representations in Deep Restricted Kernel Machines Classifiers

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Pairwise learning corresponds to the supervised learning setting with the aim of making predictions for pairs of objects. Data with pairwise observations naturally arise, for instance, in recommender systems [20, 31, 42], information retrieval [29], drug-target interaction (DTI) prediction [6, 39], and link prediction in social networks [47, 49] to mention but a few. In this paper, we formulate the pairwise learning problem as a difference of convex (DC) optimization problem using the Kronecker product kernel, $\ell _1$- and $\ell _0$-regularizations, and various, possible nonsmooth, loss functions. Our aim is to create a learning algorithm—SparsePKL —that produces accurate predictions with the desired sparsity level. In particular, we aim to solve the realistic form of the DTI problem called zero-shot learning. It corresponds to predicting labels for such drug-target pairs where neither the drug nor the target is encountered during the training phase.

In addition, we propose a novel limited memory bundle DC algorithm (LMB-DCA) for large-scale nonsmooth DC optimization and apply it as an underlying solver in the SparsePKL algorithm. We would like to emphasize that the proposed LMB-DCA is not the well-known DCA [27, 41] with the limited memory bundle method (LMBM) [15, 16] as an underlying solver but rather the LMBM modified to solve DC optimization problems using the DCA as a backup.

Using a learning algorithm that produces accurate predictions with a sparse model offers benefits such as interpretability, efficiency, better generalization to new data, scalability, and robustness to noise. These advantages make sparse models desirable in various applications (see, e.g., [10, 11, 18, 24, 50, 54]), allowing for improved understanding, computational efficiency, and reliable predictions.

In practice, we tackle the problem of finding sparse solutions by adding an $\ell _0$-pseudo-norm as a penalty for the pairwise kernel learning problem. Note that basically the $\ell _0$-pseudo-norm simply counts the number of nonzero components of a vector, and the optimization problems involving this pseudo-norm as it is are known to be NP-hard. Therefore, we instead use the continuous formulation of the $\ell _0$-pseudo-norm [11, 13] based on the polyhedral k-norms. This allows us to control the sparsity of the solution vector. It is worth of noting that the resulting problem is a nonsmooth DC optimization problem.

Applying the $\ell _0$-pseudo-norm in sparse optimization is not a new idea. However, to our knowledge, the continuous formulation of the $\ell _0$-pseudo-norm has been applied to a regression problem utilizing kernel transformations only in [37], where the LMBM-Kron$\ell _0$LS algorithm for sparse pairwise kernel learning problems is introduced. The differences between the SparsePKL algorithm proposed here and the LMBM-Kron$\ell _0$LS algorithm introduced in [37] are that the SparsePKL algorithm

utilizes the exact DC formulation of the sparse pairwise kernel learning problem and solves it with the novel LMB-DCA;
applies a double regularization scheme instead of the sole $\ell _0$-pseudo-norm (see, Sect. 5.1); and
solves the pairwise kernel learning problem with the user-specified sparsity percentage and with the selected loss function such as the squared loss, squared epsilon-intensive loss, epsilon-intensive squared loss, epsilon-intensive absolute loss, and absolute loss.

In contrast to the last point, the LMBM-Kron$\ell _0$LS algorithm applies only the squared loss and aims to find a solution as sparse as possible while maintaining satisfactory prediction accuracy with respect to the non-sparse reference solution. Although, finding a solution as sparse as possible may sound like a good idea, it also makes the LMBM-Kron$\ell _0$LS algorithm rather complex and time-consuming.

The contribution of the paper is three-fold:

1.
to introduce a novel algorithm LMB-DCA for large-scale nonsmooth DC optimization;
2.
to present a novel algorithm SparsePKL to solve sparse pairwise kernel learning problems; and
3.
to compare different, possible nonsmooth, formulations of the loss functions when seeking sparse solutions.

The paper is divided into two main parts. The first part considers nonsmooth and DC optimization (Sects. 2 and 3) whereas the second part is devoted to pairwise kernel learning problems (Sects. 4–7). More precisely, we first give some preliminaries to nonsmooth, DC, and sparse optimization in Sect. 2 and then we propose a new LMB-DCA for solving large-scale nonsmooth DC optimization problems in Sect. 3. In Sect. 4, we present the basics of pairwise learning problems and kernel methods including Kronecker product kernels. Section 5 focuses on the model formulation by recalling the used loss functions with their (sub)gradients and introducing the double regularization procedure used. Further, in Sect. 6 we propose a new SparsePKL algorithm for solving pairwise kernel learning problems with the desired sparsity of the solution. This algorithm applies the LMB-DCA as an underlying solver at each iteration. In Sect. 7, we give numerical results for the new SparsePKL algorithm and compare the obtained results against the state-of-the-art methods Kronecker product kernel regularized least squares (KronRLS) and Kronecker support vector machine (KronSVM) [1, 38] and against the LMBM-Kron$\ell _0$LS algorithm [37]. Further, we evaluate the LMB-DCA as a standalone optimization method by comparing it with the well-known DCA. Finally, Sect. 8 concludes the paper. The notations used throughout the paper are listed in Table 1.

Table 1 Notations

Full size table

2 Preliminaries

We start with a short introduction to nonsmooth and DC optimization. An interested reader can find more details, for example, from [3, 8, 19, 28, 36, 41, 43]. In addition, we recall the basic idea of sparse optimization and define the continuous formulation of the $\ell _0$-pseudo-norm.

In what follows, we denote by $\mathbb {R}^n$ the n-dimensional Euclidean space, by $ a^\top b = \sum _{i=1}^n a_i b_i$ the inner product of vectors $a, b \in \mathbb {R}^n$ and by $B(a,\varepsilon )$ the open ball at $a\in \mathbb {R}^n$ with a radius $\varepsilon >0$. The associated $\ell _2$- and $\ell _1$-norms are denoted by $\Vert a\Vert _2 = ( a^\top a )^{1/2}$ and $\Vert a\Vert _1=\sum _{i=1}^n |a_i|$, respectively.

2.1 Nonsmooth optimization

A function $J:\mathbb {R}^n \rightarrow \mathbb {R}$ is nonsmooth if there are one or more points $a\in \mathbb {R}^n$ such that J fails to be differentiable at these points. Typically these points are encountered at extremes of nonsmooth functions [3]. In nonsmooth optimization, we try to find a minimizer (or a maximizer) of the problem, where at least one of the objective functions involved is nonsmooth.

A function $J:\mathbb {R}^n \rightarrow \mathbb {R}$ is convex if for all $a,b \in \mathbb {R}^n$ and $\lambda \in [0,1]$ we have

$$\begin{aligned} J(\lambda a+ (1-\lambda )b)\le \lambda J(a) + (1-\lambda )J(b). \end{aligned}$$

Further, a function $J:\mathbb {R}^n \rightarrow \mathbb {R}$ is locally Lipschitz continuous if at every $a\in \mathbb {R}^n$ there exists $L > 0$ and $\varepsilon >0$ such that

$$\begin{aligned} |J(b)-J(c)| \le L\Vert b-c\Vert _2 \quad \text {for all } b, c \in B(a,\varepsilon ). \end{aligned}$$

For example, both convex and smooth functions always satisfy this property. Whenever a function $J:\mathbb {R}^n \rightarrow \mathbb {R}$ is locally Lipschitz continuous, the Clarke subdifferential at a point $a \in \mathbb {R}^n$ can be calculated with the formula [3, 8]

$$\begin{aligned} \partial J({a}) = {{\,\textrm{conv}\,}}\Big \{ \lim _{i \rightarrow \infty } \nabla J(a^i) \mid a^i \rightarrow a \text { and } \nabla J ({a}^i) \text { exists } \Big \}, \end{aligned}$$

where “conv” is the convex hull of a set and each vector $\xi \in \partial J(a)$ is called a subgradient. By using this subdifferential, it is easy to derive a necessary optimality condition for $a^*\in \mathbb {R}^n$ to be a local optimizer, namely, $0 \in \partial J(a^*)$ [3]. Any point satisfying this condition is called stationary. Note that stationarity can also be a sufficient condition for global optimality, but only when the considered function is convex.

2.2 DC programming

In DC programming, we consider problems where all functions can be stated as a difference of two convex (DC) functions. The general unconstrained DC problem is given with

$$\begin{aligned} \displaystyle \underset{a \in \mathbb {R}^n}{\text {minimize}} \quad J(a), \end{aligned}$$

(1)

where $J:\mathbb {R}^n\rightarrow \mathbb {R}$ is DC. Formally, a function $J:\mathbb {R}^n \rightarrow \mathbb {R}$ is DC if there exists two convex functions $J_1,J_2:\mathbb {R}^n\rightarrow \mathbb {R}$ enabling us to write

$$\begin{aligned} J(a)=J_1(a)-J_2(a)\quad \text { for all }a\in \mathbb {R}^n. \end{aligned}$$

In what follows, $J_1-J_2$ is called a DC decomposition, where the separate convex functions $J_1$ and $J_2$ are DC components. Note that one or both of the DC components can be nonsmooth and that a DC function has infinitely many different DC decompositions. Altogether DC functions constitute an important and wide subclass of nonsmooth nonconvex functions including, for example, every convex, concave, and twice continuously differentiable function [19, 21].

Before stating some optimality conditions for the DC problem (1), we need to introduce a subdifferential for convex DC components $J_i: \mathbb {R}^n \rightarrow \mathbb {R}$, $i=1,2$. The subdifferential at a point $a \in \mathbb {R}^n$ is given as a set [2, 3]

$$\begin{aligned} \partial _c J_i(a) = \Big \{\xi \in \mathbb {R}^n:~J_i(b) - J_i(a) \ge \xi ^{\top } (b - a) \,\, \text {for all } b \in \mathbb {R}^n \Big \}, \quad i=1,2. \end{aligned}$$

Each vector $\xi ^i\in \partial _c J_i(a)$ is called a subgradient of $J_i$ at a for $i=1,2$. From the above definitions, we see that $\partial _c J_i(a)=\partial J_i(a)$ for $i=1,2$. Thus, to simplify the notations we denote the subdiffereantial of the convex DC component $J_i$, $i=1,2$, also with the Clarke subdifferential $\partial J_i(a)$.

For the problem (1), the most commonly used necessary optimality condition is criticality requiring that at a point $a^*\in \mathbb {R}^n$ the condition

$$\begin{aligned} \partial J_1(a^*)\cap \partial J_2(a^*) \ne \emptyset \end{aligned}$$

is fulfilled. Note that a stationary point is always critical. However, the inverse result does not necessarily hold. This is due to the fact that at a point $a\in \mathbb {R}^n$ for a DC function $J=J_1-J_2$ we can only guarantee a result [3]

$$\begin{aligned} \partial J(a)\subseteq \partial J_1(a)-\partial J_2(a) \end{aligned}$$

(2)

and equality in the relationship happens only when $J_1$ or $J_2$ is differentiable. Thus, only differentiability of $J_1$ or $J_2$ together with criticality ensures also stationarity. In DC programming algorithms, the goal is typically find only a critical point [23, 27, 28].

2.3 $\ell _0$-pseudo-norm and sparse optimization

Optimization problems seeking sparsity of the solutions have recently received broad attention. They are applied, for instance, in regression analysis (see, e.g. [4, 11, 34]), sparse principal component analysis (see, e.g. [17, 53, 55]), and signal and image processing (see, e.g. [24, 30, 54]). In this subsection, we recall the DC formulation for sparse optimization problem [11, 13].

In sparse optimization, we consider a cardinality-constrained optimization problem:

$$\begin{aligned} {\left\{ \begin{array}{ll} \underset{{a\in \mathbb {R}^n}}{\text {minimize}} \quad & J(a) \\ \text {subject to} & \Vert a \Vert _0 \le k, \, \end{array}\right. } \end{aligned}$$

(3)

where $J:\mathbb {R}^n\rightarrow \mathbb {R}$, $\Vert a \Vert _0$ is the $\ell _0$-pseudo-norm that simply counts the number of nonzero components of a vector, and $k \in \{1,\ldots , n\}$ represents the maximum allowed number of non-zero elements. Due to the non-convex and discontinuous nature of the $\ell _0$-pseudo-norm, solving the problem (3) is known to be NP-hard [35]. As a result, the $\ell _0$-pseudo-norm in (3) is often replaced with its tight convex relaxation, the $\ell _1$-norm [11]. However, this kind of relaxation deteriorates the control over the desired sparsity level and, thus, we propose taking advantage of the continuous formulation of the $\ell _0$-pseudo-norm [11, 13]. We rely to the class of polyhedral k-norms $\Vert {a} \Vert _{[k]}$ defined as the sum of k largest components (in modulus) of the vector a. The fact that polyhedral k-norms are intermediate between $\ell _1$- and $\ell _{\infty }$-norms, allows us to derive the following equivalence, valid for $1\le k \le n$:

$$\begin{aligned} \Vert {a} \Vert _0 \le k \iff \Vert {a} \Vert _1 - \Vert {a} \Vert _{[k]} = 0. \end{aligned}$$

It is noteworthy that the cardinality-constraint in (3) is defined by a discontinuous function whereas the exact reformulation $\Vert {a} \Vert _1 - \Vert {a} \Vert _{[k]}$ is a continuous one, although both describe the same feasible set [11, 13].

We remind that $\Vert {a} \Vert _1 - \Vert {a} \Vert _{[k]} \ge 0$ for all ${a} \in \mathbb {R}^n$ and, thus, we transfer the cardinality-constrained problem (3) into a penalty function formulation:

$$\begin{aligned} \underset{a\in \mathbb {R}^n}{\text {minimize}} \quad J(a) + \rho \left( \Vert {a} \Vert _1 - \Vert {a} \Vert _{[k]} \right) \end{aligned}$$

(4)

with a penalty parameter $\rho >0$. Assuming that the original objective function J in (4) is a DC function with the decomposition $J = J_1-J_2$ (note that a convex function is trivially DC with $J_1=J$), formula (4) can be written as a DC optimization problem:

$$\begin{aligned} \underset{a\in \mathbb {R}^n}{\text {minimize}} \quad \hat{J}(a)=\hat{J}_1(a) - \hat{J}_2(a), \end{aligned}$$

where $\hat{J}_1(a)=J_1(a) + \rho \Vert {a} \Vert _1$ and $\hat{J}_2(a)=J_2(a) + \rho \Vert {a} \Vert _{[k]}$.

3 Limited memory bundle DC algorithm

In this section, we propose a novel LMB-DCA for solving the large nonsmooth DC optimization problem (1). The backbone of the new method is the LMBM [14, 16] developed for general large-scale nonsmooth optimization. The LMBM is a bundle-type method characterized by the use of serious and null steps together with the simple aggregation of subgradients. In addition, the search direction is calculated by the limited memory variable metric approach [5] (the limited memory BFGS (L-BFGS) update after a serious step and the limited memory SR1 (L-SR1) update, otherwise). Thus, the LMBM avoids solving the time-consuming quadratic direction finding problem appearing in standard bundle methods (see, e.g. [2, 26]) as well as storing and manipulating large matrices as is the case in the standard variable metric methods. These aspects make the LMBM suitable for solving large-scale nonsmooth optimization problems. Namely, the number of operations needed for the calculation of the search direction and the aggregate values is only linearly dependent on the number of variables.

The basic idea of the new LMB-DCA is very simple: we use the original LMBM, but instead of a subgradient $\xi \in \partial J(a)$ at a point $a\in \mathbb {R}^n$ we substitute it with the subgradient difference $\xi _1-\xi _2$, where $ \xi _1\in \partial J_1(a)$ and $ \xi _2 \in \partial J_2(a)$, unless we encounter an issue. The issue may arise when the difference of subgradients of the DC components does not belong to the subdifferential of the original function (see, Eq. (2)). In practice, this is noted by the failure in the line search procedure or by many consecutive null steps. In these cases, we switch to solve a subproblem similar to the one used in DCA [27, 41]. In what follows, we describe different components of the LMB-DCA in more detail.

The fundamental idea behind the LMBM is to calculate the search direction using the classical limited memory variable metric scheme $d^h= -{ G^h} \nabla J(a^h)$, where ${ G^h}$ is the limited memory variable metric update that represents the approximation of the inverse of the Hessian matrix, $\nabla J(a^h)$ is a gradient of function J at a point $a^h$, and h is an iteration index. However, since for a general nonsmooth objective J the gradient does not necessarily exist, an aggregate subgradient $\tilde{\xi }^h\in \partial J(a^h)$ is employed instead in the LMBM. Here, since also the subgradient of the DC function may be unknown, we go one step further and employ the (aggregate) subgradient difference. Therefore, the search direction is given by

$$\begin{aligned} d^h= -{ G^h} \tilde{\xi }^h, \end{aligned}$$

(5)

where $\tilde{\xi }^h = \xi ^h_1 -\xi ^h_2$ with $\xi ^{h}_1\in \partial J_1(a^{h})$ and $\xi ^{h}_2\in \partial J_2(a^{h})$ if the previous step was a serious step. However, if the previous step was a null step, $\tilde{\xi }^h$ is an aggregate subgradient difference (see, Eq. (10)). In what follows, we simply denote the subgradient difference by $\xi ^{h}$ for all $h=1,2,\ldots $.

The role of the matrix ${ G^h}$ is to accumulate information from the previous iterations. Note, however, that the large matrix ${ G^h}$ is not formed explicitly. Instead, we store only a few (say $\hat{m}_c$) correction vectors obtained at the previous iterations of the algorithm and use these vectors to implicitly define the product ${ G^h}\tilde{\xi }^h$. In the smooth case, the correction vectors are given by $s^h=a^{h+1}-a^h$ and $u^{h}=\nabla J(a^{h+1})-\nabla J(a^{h})$. Nevertheless, in the LMBM — as well as in the proposed LMB-DCA — we may have $a^{h+1}=a^h$ due to the usage of null steps and we use an auxiliary point $b^{h+1}$ (described in detail below) instead of $a^{h+1}$ when updating $s^h$. In addition, since the (sub)gradient does not necessarily exist for nonsmooth DC objective the correction vectors $u^h$ are computed using the difference of the subgradients of DC components, that is, the vectors are given by $s^h=b^{h+1}-a^h$ and $u^{h}=\xi ^{h+1}-\xi ^{m}$, where m is the index of the last serious step. These correction vectors are append into correction matrices $S^h$ and $U^h$, respectively. If $h>\hat{m}_c$, we first delete the oldest correction vectors from these matrices.

The search direction $d^h$ in (5) is not necessarily a descent one.^{Footnote 1} Thus, we apply the line search procedure that generates a null step instead of a serious step whenever necessary or indicates that neither the condition for a serious step nor the condition for a null step can be reached. We first determine a new auxiliary point $b^{h+1}=a^h+ t_R^hd^h$, where $t_R^h \in (0,t_{max}]$ is a step size with an upper bound $t_{max}>1$. A necessary condition for a serious step is to have

$$\begin{aligned} J(b^{h+1}) \le J(a^h) - \varepsilon _L t_R^h w^h, \end{aligned}$$

(6)

where $\varepsilon _L \in (0,1/2)$ is a line search parameter, and $w^h>0$ represents the desirable amount of descent of J at $a^h$. If the condition (6) is satisfied, we set $a^{h+1} = b^{h+1}$ and call this step a serious step. If the current search direction is not good enough, that is, the value of the objective at the auxiliary point $b^{h+1}$ is not decreased enough, either the null step occurs or we need to enter the DCA procedure since there is an issue during the line search procedure. In the case of a null step, we have

$$\begin{aligned} - \beta ^{h+1} + (\xi ^{h+1})^\top d^h \ge - \varepsilon _R w^h, \end{aligned}$$

(7)

where $\varepsilon _R \in (\varepsilon _L,1/2)$ is a line search parameter, $\xi ^{h+1}$ is the subgradient difference computed at $b^{h+1}$, and $\beta ^{h+1}$ is the subgradient locality measure given by

$$\begin{aligned} \beta ^{h+1} = \max \Big \{ \left| J(a^h) - J(b^{h+1}) + (s^h)^\top \xi ^{h+1}\right| , \gamma \Vert s^h\Vert ^2 \Big \} \end{aligned}$$

(8)

with the distance measure parameter $\gamma > 0$. In addition, we set $a^{h+1} = a^h$ but information about the objective function is increased since we store the auxiliary point $b^{h+1}$ and the corresponding auxiliary subgradient difference $\xi ^{h+1}$, and use them to compute the new aggregate values. More precisely, if the condition (7) is satisfied and the maximum number of consecutive null steps is not yet reached (implying that we do not have serious problems with the subgradient differences), we proceed to the aggregation procedure similar to that of the original LMBM — and very different from the one used with standard bundle methods (see, e.g. [2, 26]). That is, we determine multipliers $\lambda _i^h$ ($i \in \{1,2,3\}$) minimizing the function

$$\begin{aligned} \varphi (\lambda _1, \lambda _2, \lambda _3)&= [\lambda _1 \xi ^m + \lambda _2 \xi ^{h+1} + \lambda _3\tilde{\xi }^h]\, ^\top { G^h} [\lambda _1 \xi ^m + \lambda _2 \xi ^{h+1} + \lambda _3 \tilde{\xi }^h\,] \\ &\qquad + 2(\lambda _2 \beta ^{h+1} + \lambda _3 \tilde{\beta }^h),\nonumber \end{aligned}$$

(9)

and satisfying $\lambda ^h_{i} \ge 0$ and $\sum _{i=1}^3 \lambda ^k_{i}=1$. Here, $\xi ^m$ is the subgradient difference at the current point $a^m=a^h$ (m denotes the iteration index of the last serious step), $\xi ^{h+1}$ is the subgradient difference at the auxiliary point $b^{h+1}$, and $\tilde{\xi }^h$ is the current aggregate subgradient difference from the previous iteration ($\tilde{\xi }^1 = \xi ^1_1-\xi ^1_2$). In addition, $\beta ^{h+1}$ is the current subgradient locality measure and $\tilde{\beta }^h$ is the current aggregate subgradient locality measure ($\tilde{\beta }^1=0$). The new aggregate values can then be computed as

$$\begin{aligned} \tilde{\xi }^{h+1} = \lambda ^h_{1} \xi ^m + \lambda ^h_{2} \xi ^{h+1} + \lambda ^h_{3} \tilde{\xi }^k \qquad \text {and} \qquad \tilde{\beta }^{h+1} = \lambda ^h_{2} \beta ^{h+1} + \lambda ^h_{3} \tilde{\beta }^h. \end{aligned}$$

(10)

This simple aggregation procedure gives us the possibility to retain the global convergence without solving the complicated quadratic direction finding problem appearing in standard bundle methods (see, e.g. [2]). Moreover, only one trial point $b^{h+1}$ and the corresponding subgradient difference $\xi ^{h+1}$ need to be stored instead of $n+3$ subgradients typically stored in bundle methods. Finally, we point out that the aggregate values need to be computed only if the last step was a null step. Otherwise, we set $\tilde{\xi }^{h+1} ={\xi }^{h+1} $ and $\tilde{\beta }^{h+1}=0$.

On the other hand, if the condition (7) is not satisfied during the line search procedure or the maximum number of consecutive null steps is met, we solve the convex subproblem similar to the DCA to escape the current problematic iteration point. This yields a problem

$$\begin{aligned} \underset{{a\in \mathbb {R}^n}}{\text {minimize}} \quad \bar{J}_{h}(a)=J_1({a}) - [J_2({a^h}) + ({\xi _2^h})^\top (a-a^h)], \end{aligned}$$

(11)

where ${\xi _2^h} \in \partial J_2(a^h)$, which is also solved with LMBM. This means that in the subproblem we are linearizing the second DC component $J_2$ at the current iteration point $a^h$. One main benefit of this is that the solution $a^{h+1}$ of the subproblem (11) always satisfies

$$\begin{aligned} J(a^{h+1})\le \bar{J}_h(a^{h+1})\le \bar{J}_h(a^h)=J(a^h). \end{aligned}$$

Thus, we should always use $a^h$ as a starting point in the LMBM when we solve (11). Furthermore, if $\bar{J}_h(a^{h+1})<\bar{J}_h(a^h)$ we know that a better solution is found for the original objective J and a serious step can be performed. However, if $\bar{J}_h(a^{h+1})=\bar{J}_h(a^h)$ is satisfied we know that $0\in \partial \bar{J}_h(a^h)$ implying that $\xi _2^h\in \partial J_1(a^h)$. This guarantees the criticality of the current iteration point $a^h$ and we can terminate the execution of the whole LMB-DCA with $a^h$ as its solution.

Remark 1

The original LMBM is globally convergent for locally Lipschitz continuous functions [16]. Therefore, if $\xi ^h_1-\xi ^h_2 \in \partial J(a^h)$ for all h also the proposed algorithm converges to a Clarke stationary point. This is the case, for instance, if either $J_1$ or $J_2$ is a smooth function.

Remark 2

If we stop at Step 3 of the LMB-DCA with $m=h$ (i.e. the previous step was a serious step), we have found a point $a^h$ that (approximately) satisfies $0 = \xi ^h=\xi _1^h-\xi _2^h$. Thus, we can say $a^h$ is a critical point.

Remark 3

The DCA is proved to be globally convergent to a critical point [27, 41]. In principle, the DCA just successively solves the subproblem (11) and updates the iteration point. Thus, if we enter Step 8 of the LMB-DCA infinitely many times, the proposed algorithm converges to a critical point due to the repeated execution of the DCA step. In addition, if the execution of the LMB-DCA is stopped at Step 8, then we directly have a critical point.

Remark 4

To guarantee the global convergence in the original LMBM, the sequence $\{w^h\}$ has to be nonincreasing in consecutive null steps and matrices $({ D}^i)^{-1}$ for all $i=1,\ldots , h$ need to be bounded. This leads to a more complex algorithm that is described in detail in [16].

4 Preliminaries for pairwise kernel learning

In this section, we provide the theoretical background on pairwise learning problems and kernel methods.

4.1 Pairwise learning problem

The goal of supervised learning (e.g., regression and classification) is to learn an unknown prediction function $f:\mathcal {X} \rightarrow \mathcal {Y}$ from a set of training samples $\{(x_i,y_i)\}_{i=1}^n$ each consisting of an input $x_i \in \mathcal {X}$ and its corresponding label $y_i \in \mathcal {Y}$ such that f can correctly predict a label for a new object $x \in \mathcal {X}$ unseen during the training phase. Here, the label space is $\mathcal {Y}=\mathbb {R}$ for regression and $\mathcal {Y} =\{0,1\}$ for classification problems. In what follows, we consider almost solely the regression problems.

In pairwise learning, the training input consists of pairs of objects $x=(d,t)$ and their real valued labels y. We call these objects drugs $d \in \mathcal {D}$ and targets $t \in \mathcal {T}$, but they can represent various other objects in different applications. Different fields often consider divergent but related pairwise prediction tasks.^{Footnote 2} Nevertheless, the two most fundamental assumptions of pairwise learning are given below [48].

Assumption 1

(Pairwise data assumption) Both the drugs and the targets tend to appear several times as parts of different inputs in an observed data set.

Assumption 2

(Nonlinearity assumption) The predicting functions are usually not linear combinations of functions depending only on drugs or targets.

Assumption 1 means that the same drug $d_i$ may belong to two (or more) different samples $(d_i,t_j)$ and $(d_i,t_k)$ while Assumption 2 means that we do not have global ordering of drugs (or targets). More precisely, the opposite of Assumption 2 would mean that if $f(d,t)= f_d(d)+f_t(t)$ for some functions $f_d$ and $f_t$, then $f(d_i,t_j)>f(d_k,t_j)$ would imply $f(d_i,t_l)>f(d_k,t_l)$ for all drugs and targets. In other words, there would always be a single drug that would be the best choice for all targets (and vice versa).

We formulate the problem of learning the prediction function f as the regularized empirical risk minimization problem

$$\begin{aligned} f^* = \underset{f \in \mathcal {H}}{\arg \min } \,\, \mathcal {L}(p,y)+\lambda \mathcal {C}(f), \end{aligned}$$

(12)

where $p = f(X) \in \mathbb {R}^n$ are the predicted and $y \in \mathbb {R}^n$ are the correct outputs for the training set, $X=(x_1,\ldots ,x_n) \in (\mathcal {X}^n)^\top $ is the set of inputs, $\mathcal {L}$ is a convex non-negative loss function, and $\lambda >0$ and $\mathcal {C}$ are the regularization parameter and function, respectively. In addition, $\mathcal {H}$ is the hypothesis space.

The standard formulation of problem (12) is called RLS or ridge regression. It is obtained by choosing a squared loss to loss function $\mathcal {L}(p,y) = \Vert y-p\Vert ^2$, a Euclidean norm for regularization $\lambda \mathcal {C}(f) = \frac{\lambda }{2} \Vert f\Vert ^2_\mathcal {H}$, and the reproducing kernel Hilbert space (RKHS) as the hypothesis space $\mathcal {H}$.

4.2 Kernel methods

Nowadays kernel methods are the de facto methods in supervised learning [1]. They provide a powerful framework for learning complex patterns in data by implicitly mapping it into a higher-dimensional feature space [22]. Kernel methods can be used when the training data either have explicit feature vectors or implicit feature vectors are defined via positive semi-definite kernel functions. Furthermore, when both drugs and targets have their own feature representation or kernel, we can use a pairwise kernel to define a kernel for the pair. The simplest pairwise kernel is obtained by concatenating the feature vectors of drugs and targets and applying a standard kernel (like polynomial kernel) into this combined feature vector. However, the simplest kernel that models actual pairwise interactions between drug and target features is the pairwise Kronecker product kernel [48]. It also allows generalization for such new pairs where neither the drug nor the target occur in the training set — a setting known as zero-shot learning [1].

Let $k_\mathcal {D}: \mathcal {D} \times \mathcal {D} \rightarrow \mathbb {R}$ and $k_\mathcal {T}: \mathcal {T} \times \mathcal {T} \rightarrow \mathbb {R}$ denote positive semidefinite kernel functions^{Footnote 3} defined for drugs and targets. Then the Kronecker product kernel for the drug-target pairs $k_{\mathcal {D},\mathcal {T}}:(\mathcal {D} \times \mathcal {T})\times (\mathcal {D} \times \mathcal {T})\rightarrow \mathbb {R}$ is defined as a product of these two base kernels. That is,

$$\begin{aligned} k_{\mathcal {D},\mathcal {T}} ((d,t),(d',t')) = k_\mathcal {D}(d,d')k_\mathcal {T}(t,t'). \end{aligned}$$

Further, let $K\in \mathbb {R}^{n \times n}$ denote the pairwise Kroneker product kernel matrix containing all the kernel evaluations between the drug-target pairs such that $K_{i,j}= k_{\mathcal {D},\mathcal {T}}((d_i,t_i),(d_j,t_j)),\, i,j =1,\ldots ,n$. By choosing RKHS associated with $k_{\mathcal {D},\mathcal {T}}$ as the hypothesis space, the generalized representer theorem [44] implies that the minimizing function f in (12) can be given as:

$$\begin{aligned} f(d,t) = \sum _{i=1}^n a_i k_{\mathcal {D},\mathcal {T}}((d_i, t_i),(d,t)), \end{aligned}$$

where $a_i$ is a component of the vector ${a} \in \mathbb {R}^n$ of dual coefficients and index i goes trough all observed drug-target pairs. In addition, the predictions for the training data can be given with the kernel matrix as $p = K a$ [48].

It is worth noting that Assumption 1 implies that if n is the number of observed data and $n_D$ and $n_T$ are the numbers of unique drugs and targets in the data, respectively, then $n>>n_D+n_T$. This fact is used to develop computational short-cuts tailored specifically to pairwise learning problems in [1, 48]. More precisely, the large kernel matrix K need not be computed explicitly but the matrix–vector product Ka can be computed implicitly in $\mathcal {O}(n\,(n_D+n_T))$ time using a sparse Kronecker product multiplication algorithm known as the generalized vec trick (GVT). This computational shortcut can be applied to speed up any optimization approach whose computational complexity is dominated by multiplications of a pairwise kernel matrix with a vector.

5 Model formulation

In this section, we introduce the model we are using. We consider the regularized empirical risk minimization problem (12) with various loss functions described in detail in Sect. 5.2. In addition, instead of just replacing the most commonly used squared regularization term with the $\ell _0$-pseudo-norm as in [37], we apply a double regularization scheme. It is worth mentioning that we end up with a nonsmooth DC optimization problem.

5.1 Double regularization

As pointed out in Sect. 4, the standard formulation of the pairwise kernel learning problem (12) is to use the squared loss with the squared regularization. In [37], the squared regularization is replaced with the $\ell _0$-pseudo-norm. However, the direct use of the $\ell _0$-pseudo-norm may lead to a situation where the largest nonzero dual coefficients are left without any regularization. That is, in the continuous representation of the $\ell _0$-pseudo-norm the polyhedral k-norm cancels out the k largest dual coefficients from the $\ell _1$-norm. Therefore, we use the $\ell _1$-norm as a double regularization to provide regularization for all dual coefficients. The $\ell _1$-norm is selected (instead of the usually used $\ell _2$-norm) as it supports our effort to produce sparse solutions.

We consider the double regularized empirical risk minimization problem in a form

$$\begin{aligned} \underset{{a\in \mathbb {R}^n}}{\text {minimize}} \quad&J(a)=\mathcal {L}(Ka,y) + \rho _1 \Vert a \Vert _1 + \rho _2 ( \Vert a \Vert _1 - \Vert a \Vert _{[k]} ) \end{aligned}$$

(13)

with $\rho _1,\rho _2 > 0$. This formula is both nonsmooth and nonconvex. However, function J(a) can be written as a DC function, which allows us to compute the subgradients of convex DC components separately. The DC formulation of the problem (13) is

$$\begin{aligned} \underset{{a\in \mathbb {R}^n}}{\text {minimize}} \quad J(a) = J_1(a) - J_2(a), \end{aligned}$$

where $J_1(a)=\mathcal {L}(Ka,y) + (\rho _1 + \rho _2) \Vert a \Vert _1$ and $J_2(a)=\rho _2 \Vert a \Vert _{[k]}$.

The (initial) double regularization parameters $\rho _1$ and $\rho _2$ in (13) are given by the formula

$$\begin{aligned} \rho _1 = \rho _2 = \frac{1}{n^2} \cdot \frac{\mathcal {L}(p^1,y)}{\Vert a^1\Vert _1 }, \end{aligned}$$

where $a^1\in \mathbb {R}^n$ is the user-specified starting point and $\mathcal {L}(p^1,y)$ is the value of the loss function in that point (remind that $p^1=Ka^1$). Parameter $\rho _1$ is fixed while parameter $\rho _2$ is used as a penalty parameter and increased at every iteration of the proposed SparsePKL algorithm until the feasible solution with respect to the desired sparsity level is obtained. More details of this will be given in Sect. 6.

5.2 Loss functions and their (sub)gradients

We use the following loss functions with (13):

Squared loss: $\mathcal {L}(p,y)=\frac{1}{2} \sum \nolimits _{i=1}^n(p_i-y_i)^2$;
Squared epsilon-intensive loss: $\mathcal {L}(p,y)= \frac{1}{2n}\sum \nolimits _{i=1}^n\max (0,|p_i-y_i|-\varepsilon )^2$, where $\varepsilon > 0$;
Epsilon-intensive squared loss: $\mathcal {L}(p,y)=\frac{1}{2n}\sum \nolimits _{i=1}^n\max (0,(p_i-y_i)^2-\varepsilon )$, where $\varepsilon > 0$;
Epsilon-intensive absolute loss: $\mathcal {L}(p,y)=\frac{1}{n}\sum \nolimits _{i=1}^n\max (0,|p_i-y_i|-\varepsilon )$, where $\varepsilon > 0$;
Absolute loss: $\mathcal {L}(p,y)= \frac{1}{n}\sum \nolimits _{i=1}^n |p_i-y_i|$.

These loss functions are illustrated in Fig. 1.

The gradients $\zeta = \nabla \mathcal {L}(p,y)$ and subgradients $\zeta \in \partial \mathcal {L}(p,y)$ of the above-mentioned loss functions with respect to the predictions are given below. Note that it is enough to find one subgradient $\zeta $ from the subdifferential $ \partial \mathcal {L}(p,y)$ at each point p to apply a nonsmooth optimization method.

Squared loss: $\zeta _i = p_i - y_i$, $i=1,\ldots ,n$.
Squared epsilon-intensive loss: $\zeta _i = {\left\{ \begin{array}{ll} 0, & \quad \text {if } |p_i-y_i|<\varepsilon ,\\ p_i-y_i-\varepsilon )/n, & \quad \text {if } p_i-y_i{\ge \varepsilon },\\ (p_i-y_i+\varepsilon )/n, & \quad \text {otherwise.} \end{array}\right. }$
Epsilon-intensive squared loss: $\zeta _i = {\left\{ \begin{array}{ll} 0, & \quad \text {if } (p_i-y_i)^2<\varepsilon ,\\ (p_i-y_i)/n, & \text {otherwise.} \end{array}\right. }$
Epsilon-intensive absolute loss: $\zeta _i= {\left\{ \begin{array}{ll} 0, & \quad \text {if } |p_i-y_i|<\varepsilon ,\\ 1/n, & \quad \text {if } p_i-y_i \ge \varepsilon ,\\ -1/n, & \quad \text {otherwise.} \end{array}\right. }$
Absolute loss: $\zeta _i= {\left\{ \begin{array}{ll} 1/n, & \quad \text {if } p_i-y_i>0,\\ 0, & \quad \text {if } p_i-y_i=0,\\ -1/n, & \quad \text {otherwise.} \end{array}\right. }$

As pointed out the gradients and subgradients above are given with respect to predictions p. In the case of Kronecker product kernel methods, the loss function is $\mathcal {L}(p,y)$, where $p= Ka$. Due to this, the loss function can be written in the form $f=g\circ h$ where $h:\mathbb {R}^n\rightarrow \mathbb {R}^n$ and $g:\mathbb {R}^n\rightarrow \mathbb {R}$ are given with formulas $h(a)=Ka$ and $g(p)=\mathcal {L}(p,y)$. In addition, h is linear and continuously differentiable, whereas g is convex and thus subdifferentially regular. These properties guarantee that the (sub)differential of f with respect to a can be computed via (generalized, see e.g. [3] Theorem 3.20) chain rule as

$$\begin{aligned} \partial f(a) = \nabla h(a)^\top \partial g(h(a)) = K^\top \partial \mathcal {L}(p,y). \end{aligned}$$

This means that $K^{\top } \zeta $, where $\zeta = \nabla \mathcal {L}(p,y)$ or $\zeta \in \partial \mathcal {L}(p,y)$, is a (sub)gradient of the loss function with respect to the dual variable a.

5.3 Subgradient of double regularization

The final point in this section is to give subgradients of the regularization part in (13). A subgradient $\eta $ of $(\rho _1 + \rho _2) \Vert a \Vert _1$ with respect to the dual variable a is given by

$$\begin{aligned} \eta _i= {\left\{ \begin{array}{ll} (\rho _1 + \rho _2), & \quad \text {if } a_i>0,\\ -(\rho _1 + \rho _2), & \quad \text {if } a_i<0,\\ 0, & \text {otherwise.} \end{array}\right. } \end{aligned}$$

Next, let us denote by $\mathcal {I}_{[k]}= \{i_1, \ldots , i_k \}\subseteq \{1,2,\ldots ,n\}$ the index set of k largest (in modulus) components of a. The subgradient $\vartheta $ of $\rho _2 \Vert a \Vert _{[k]}$ with respect to the dual variable a is given by

$$\begin{aligned} \vartheta _i= {\left\{ \begin{array}{ll} \rho _2, & \quad \text {if } a_i\ge 0 \text { and } i \in \mathcal {I}_{[k]},\\ -\rho _2, & \quad \text {if } a_i<0 \text { and } i \in \mathcal {I}_{[k]},\\ 0, & \text {otherwise.} \end{array}\right. } \end{aligned}$$

When we combine the previous information, the subgradients $\xi _1\in J_1(a)$ and $\xi _2\in J_2(a)$ are obtained in the model (13) with respect to the dual variable a using the formulas

$$\begin{aligned} \xi _1= K^{\top }\zeta +\eta \quad \text { and }\quad \xi _2= \vartheta . \end{aligned}$$

Furthermore, utilizing these subgradients we obtain the subgradient difference $\xi _1-\xi _2$ for the objective J in (13). This is enough for calculations since the proposed method LMB-DCA does not need the exact subgradient of J.

6 Sparse pairwise kernel learning

In this section, we propose the pairwise kernel learning algorithm SparsePKL as Algorithm 2 and give some details of its implementation.

6.1 Pairwise kernel learning algorithm

The pairwise kernel learning algorithm SparsePKL aims to produce predictions with the desired sparsity of the solution. This is done by solving the double regularized risk minimization problem (13) with an increasing penalty parameter $\rho _2$. With $\rho _2$ large enough, the problem (13) has a solution $a_{best}$ which is feasible with respect to the $\ell _0$-pseudo-norm constraint $\Vert a\Vert _0 \le k$ (see, Eq. (3)). Nevertheless, if the desired sparsity level is not achieved within the maximum number of iterations given by the user, the SparsePKL algorithm returns the sparsest infeasible solution obtained so far (denoted by $a_{sis}$) with the warning.

^{Footnote 4}

The above-mentioned double regularized risk minimization problem (13) is solved with the LMB-DCA. To accelerate the solution process the problem is solved using a little bit smaller value of k than the user-specified sparsity level would imply (see Steps 0 and 1 of Algorithm 2). The reason for this is that the penalty-based optimization methods tend to slow down considerably near the boundary of the feasible region. Naturally, we accept the solution whenever the desired sparsity level is achieved (see Step 3 of Algorithm 2).

At every iteration of the SparsePKL algorithm the predictions are validated using a separate validation set and an evaluation metric called the concordance index (C-index) [12]. The C-index is a rank-based performance measure that is defined as the probability that the predictions for two sample points are in the same order as their real labels. In the case of binary interaction labels, the C-index is equal to the widely used metric area under the receiver operating characteristic curve (AUC) [39].

Remark 5

One might expect that the number of nonzero variables would decrease at each iteration of the SparsePKL algorithm. However, this is not necessarily the case in penalty-based methods with changing penalty parameters and, especially, when the DCA step of LMB-DCA is employed (see Step 8 of Algorithm 1) the number of nonzero variables may increase. Therefore, if the desired sparsity is not yet achieved in Step 2 of Algorithm 2, we save the sparsest infeasible solution $a_{sis}$ obtained so far and the number of nonzero components $nz_{sis}$ in it.

6.2 Implementation

The SparsePKL algorithm is implemented using a combination of Python and Fortran 95 via the F2PY interface generator. The GVT from RLScore https://github.com/aatapa/RLScore [38] is applied to compute Kronecker product kernels implicitly. The k-norm is computed using the bisection method (mostly) without sorting or copying the data. Here, we employed the source code by PierU available at https://stackoverflow.com/questions/75975549/find-and-modify-highest-n-values-in-a-2d-array-in-fortran. The source code of SparsePKL — including LMB-DCA and all the above-mentioned loss functions — is available at https://github.com/napsu/sparsePKL.

7 Numerical experiments

Now we are ready to compare the proposed algorithms SparsePKL and LMB-DCA with some state-of-the-art methods in pairwise kernel learning and DC optimization. We start this section by describing the data sets and different experimental settings for data and continue with some implementation details of the algorithms used. Finally, we give the results of our experiments.

7.1 Data and experimental settings

Numerical experiments are run on seven real pairwise drug-target data sets whose characteristics are presented in Table 2. These datasets are selected for three key reasons:

1.
Pairwise kernel methods were originally introduced for biomedical interaction prediction and are most widely used in this domain. The datasets used in this study serve as standard benchmarks in the field.
2.
The datasets exhibit diverse characteristics, including variations in size, label type (continuous or binary), and sparsity.
3.
The datasets provide feature representations for both drugs and targets, which are essential for pairwise learning using the Kronecker kernel. While open-source drug-target interaction (DTI) datasets with these properties are scarce, they are crucial for achieving meaningful generalization in zero-shot learning scenarios.

Each data consists of three matrices $X_{d} \in \mathbb {R}^{n_{D} \times n_{D}}$, $X_{t} \in \mathbb {R}^{n_{T} \times n_{T}}$ and $Y \in \mathbb {R}^{n_{D} \times n_{T}}$, where the first two are feature matrices for drugs/compounds and target kinases, respectively, and the last one is the drug-target interaction affinity matrix. For Davis, GPCR, IC, and E all pairwise interactions are known, whereas Metz, KiBA, and Merget are naturally sparse. That is, interaction information is available only for a small subset of all possible drug-target pairs. The numbers of drugs ($n_D$), targets ($n_T$), and their known interactions (n) are presented in Table 2. For more details of the data sets, we recommend the references provided in Table 2.

Table 2 Data set characteristics

Full size table

According to [39, 40, 45], we can divide pairwise prediction tasks into different experimental settings based on the assumptions regarding the overlap between the training data and the new pairs for which predictions are needed. The four distinct settings can be defined as

${x} \in S1$ if and only if ${d} \in X_d$ and ${t} \in X_t$
${x} \in S2$ if and only if ${d} \in X_d$ but ${t} \notin X_t$
${x} \in S3$ if and only if ${d} \notin X_d$ but ${t} \in X_t$
${x} \in S4$ if and only if ${d} \notin X_d$ and ${t} \notin X_t$.

Setting S1 corresponds to missing value imputation as both the drug and the target are encountered in the training set. Settings S2 and S3 correspond to multilabel learning problems, where the aim is to predict the labels for novel targets and drugs, respectively. Setting S4 is the most challenging setting called zero-shot learning. It corresponds to predicting labels for such pairs where neither the drug nor the target occurs in the training set. For more details about splitting the data in different settings, we refer to [39].

All data sets are divided five times to separate, randomly selected, training, validation, and test sets (1/3, 1/3, and 1/3) addressing the separate experimental settings accordingly. In what follows, we report the average result obtained under each setting in each data set. Naturally, the same sets are used with all learning methods.

7.2 Methods

The following pairwise kernel learning algorithms are used in our experiments:

CGKronRLS: We apply the state-of-the-art method CGKronRLS [1] as a benchmark for non-sparse solutions. CGKronRLS has demonstrated strong real-world performance, ranking among the top-performing methods in the IDG-DREAM Drug-Kinase Binding Prediction Challenge [7]. CGKronRLS uses the conjugate gradient (CG) method to solve the standard RLS formulation of the pairwise kernel learning problem (12). That is, it applies the squared loss $\Vert y-p\Vert ^2$ as a loss function and the Euclidean norm $\frac{\lambda }{2} \Vert f\Vert ^2_\mathcal {H}$ for regularization. The regularization parameter $\lambda $ is selected from the grid $\{2^{-10}, 2^{-5}, 2^{-4}, 2^{-3}, 2^{-2}, 2^{-1}, 2^0, 2^1, 2^3, 2^4, 2^5, 2^{10}\}$. The tuning of $\lambda $ is made separately for each data set and each setting. In CGKronRLS, the C-index of the validation data is used to select the best solution with an early stopping procedure [52].

CGKronRLS is implemented in Python/Cython and it belongs to RLScore [38] — the regularized least-squares machine learning algorithms package — available at https://github.com/aatapa/RLScore.

CGKronSVM: In addition, for binary labelled data sets, we apply CGKronSVM [1], a hinge loss variant of CGKronRLS, as hinge loss is generally considered more effective for classification than squared error loss. Notably, in its original introduction [1], CGKronSVM demonstrated slightly better predictive accuracy than CGKronRLS for binary-labeled data. CGKronSVM uses CG to solve the SVM formulation of the classification problem. It sometimes produces sparse solutions with an arbitrary level of sparsity. The parameters for CGKronSVM are selected similarly to CGKronRLS. In addition, CGKronSVM is implemented in Python/Cython, belongs to RLScore, and is available at https://github.com/aatapa/RLScore.

LMBM-Kron$\ell _0$LS: LMBM-Kron$\ell _0$LS [37] is a nonsmooth optimization-based method designed for learning sparse models in pairwise interaction affinity prediction. This method is chosen as a benchmark because, like the proposed approach, it utilizes kernels and the continuous formulation of the $\ell _0$-pseudonorm to address sparse pairwise learning problems. In LMBM-Kron$\ell _0$LS, the pairwise kernel learning problem (12) is formulated with squared loss and $\ell _0$-pseudonorm regularization. The method employs LMBM to efficiently solve the resulting nonsmooth optimization problem. An early stopping procedure based on the C-index values of the validation data and the obtained sparsity level is applied.

Note that LMBM-Kron$\ell _0$LS is unable to find a solution with the predetermined sparsity percentage (cf. the proposed method). Instead, it aims to find a solution as sparse as possible while maintaining satisfactory prediction accuracy with respect to the non-sparse reference solution. This is done by solving a bi-objective optimization problem. Similar to [37], we allow a 5% decrease in C-index from the solution obtained with the reference method CGKronRLS. The main objective of the method is set to MO1 and the starting point procedure SP1 is used.^{Footnote 5} Otherwise, the parameters similar to [37] are used.

LMBM-Kron$\ell _0$LS is implemented in combination of Python and Fortran 95 via the F2PY interface. The source code of LMBM-Kron$\ell _0$LS is available at https://github.com/pavetsu14/LMBM-KronL0LS.

SparsePKL: SparsePKL is an implementation of Algorithm 2. We test the algorithm with every loss function described in Sect. 5.2. In what follows we call these different formulations

DRSL—squared loss with double regularization;
DRSEI—squared epsilon-intensive loss with double regularization;
DREIS—epsilon-intensive squared loss with double regularization;
DREIA—epsilon-intensive absolute loss with double regularization;
DRAL—absolute loss with double regularization.

We employ the same set of parameters for all losses in penalty updating and optimization. The parameters are chosen to be generally effective across all losses in the Davis data set. Note that due to automatic definition of the (starting) regularization parameters $\rho _1$ and $\rho _2^1$ (see, Algorithm 2 Step 0) they are different for different data sets, settings, and loss functions. In addition, our preliminary experiments showed that DRAL and DREIA tend to produce trivial feasible solutions $a^*=0$ in binary labeled data with the original starting point $a^1=1/n$ (see Algorithm 2, Step 0). Therefore, we used the starting point $a^1=1.0$ with these versions of SparsePKL in the binary labeled data sets GPCR, IC, and E.

There is no early stopping procedure employed in SparsePKL but the algorithm is terminated if a feasible critical point is found with the LMB-DCA or if the maximum number of iterations (50 in our experiments) in SparsePKL is reached. The best result is selected using the C-index of the validation data. For more details of the implementation, see Algorithm 2 and Sect. 6.2.

As already mentioned the source code of SparsePKL is available at https://github.com/napsu/sparsePKL.

All the learning algorithms use Kronecker product kernel matrices to compute the predictions $p=Ka$. The Kronecker product matrices are computed implicitly using the GVT [39, 48] from RLScore. The drug and target kernels are computed via Gaussian RBF kernels^{Footnote 6} with the kernel width parameter $\mu =10^5$ as recommended in [1].

In addition to these pairwise kernel learning algorithms, we consider two DC optimization algorithms:

LMB-DCA: LMB-DCA is a Fortran 95 implementation of Algorithm 1. It is capable of solving any large-scale nonsmooth DC programming problem, although, here it is used as an underlying solver for pairwise kernel learning. The parameters are chosen to be generally effective across all losses in the Davis data set. The source code of LMB-DCA is included in SparsePKL and available at https://github.com/napsu/sparsePKL.

DCA: The DCA software applied here is a Fortran 95 implementation of the renowned DC programming algorithm DCA [27, 41]. We employ the LMBM to solve the underlying convex optimization problem at each iteration of DCA. The parameter selection used is similar to that of LMB-DCA. The source code of DCA is included in SparsePKL and available at https://github.com/napsu/sparsePKL.

Computational experiments are carried out on iMac, 4.0 GHz Intel(R) Core(TM) i7 machine with 16 GB of RAM. We use Python 3.7 and gfortran to compile the Fortran codes.

7.3 Results

The results of our numerical experiments are analyzed and evaluated using the C-index, mean squared error (MSE), sparsity of the solution, and the used CPU time. As already mentioned, the results are averaged over five different training, validation, and test set splits of each data. The C-index provides convenient performance metrics when it is more important to predict the relative order of labels than their exact values, while MSE measures these exact values. Note that any constant function obtains trivial 0.5 level performance with respect to the C-index. We say that the sparsity level is zero if the solution is dense (i.e. all variables are non-zero) and it is 100% if we have a trivial feasible solution $a^*=0$. The CPU times for the algorithms are averaged over all runs in data. That is, they include computational times used in all different experimental settings S1–S4 in five different splits.

The results are given in Tables 4, 5, 6, 7, 8, 9 and10 in the appendix and summarized here in Figs. 2 and 3. We run the proposed SparsePKL with four desired levels of sparsity: 0% ($k=n$), 50% ($k=n/2$), 80% ($k=n/5$), and 90% ($k=n/10$). In Fig. 2 we give the C-indices (blue bar) and MSEs (red line) for different algorithms/loss functions in continuously labeled data sets. The desired sparsity level used with SparsePKL is 80% and it was always reached but once with DREIS in Davis data under setting S4. The results obtained with the other desired levels of sparsity are very similar but with the 90% sparsity, there were somewhat more infeasible solutions (see Tables 4, 5, 6 and 7). Note that CGKronRLS only produces dense solutions, while the final sparsity levels obtained with LMBMKron$\ell _0$LS are given in the caption of the figure. In Fig. 3, we summarize the corresponding results (see Tables 8, 9 and 10) for binary labeled data. In addition, we report the results and sparsity levels obtained with CGKronSVM.

In Tables 4, 5, 6, 7, 8, 9 and10, we use the bold font to emphasize the best result (greatest C-index or smallest MSE) obtained with any loss function attached to SparsePKL. In addition, we use a blue pen to emphasize predictions and computational times that are better with SparsePKL with $k=n$ than those of CGKronRLS and we use a red pen to emphasize predictions that are better with SparsePKL than those of LMBMKron$\ell _0$LS. In the latter, we only compare the results which have approximately the same sparsity levels. That is, the results we select for SparsePKL depend on the sparsity levels reached with LMBMKron$\ell _0$LS. If none of the solutions is better, we use the blue (red) pen to the result obtained with CGKronRLS (LMBMKron$\ell _0$LS). Further, in the tables, $nz\%$ denotes the percentages of the variables that are nonzero at the solution (i.e. it is opposite to the sparsity level) and the CPU time is given in seconds.

7.3.1 General observations

As expected, the highest prediction accuracy was obtained under the most informative setting S1 while the more realistic settings S2, S3, and S4 resulted in reduced accuracies (see Figs. 2 and 3). Setting S2 showed often higher accuracy than S3 suggesting that predicting new targets for drugs is easier than predicting new targeted compounds. Moreover, predicting labels for unseen drugs and targets (S4, zero-shot learning) was clearly the most challenging task for all methods. Nevertheless, also in this case, a predicting function was successfully learned in the majority of the data sets. Further, we note that learning to predict is usually easier in the binary labeled data sets GPCR, IC, and E than in the continuously labeled data sets Davis, Metz, KiBA, and Merget. This result is consistent with the one obtained in [39].

One might expect that increasing the desired sparsity level would decrease the obtained prediction accuracy. Indeed, this is usually the case under settings S1–S3 in continuously labeled data sets (see Tables 4, 5, 6 and 7). However, in binary labeled data sets (see Tables 8–10) and/or in zero-shot learning (regardless of the type of the labels) the obtained predictions with enforced sparsity are often more accurate than the corresponding predictions produced with dense solution vectors. In the first case, this is just due to the characteristics of the data. In zero-shot learning, there is a lot of uncertainty involved due to the fact that neither the drug nor the target has been seen in the training phase. Employing sparse solution vectors enables us to recognize the most essential drug-target interactions, helping us uncover the real patterns within the data while disregarding the redundant ones. This, in turn, leads to improved prediction accuracy.

7.3.2 Comparison with CGKronRLS, CGKronSVM, and LMBMKron $\ell _0$ LS

A quick look at Tables 4, 5, 6, 7, 8, 9 and 10 (see the blue pen) reveals that CGKronRLS is always the most accurate predictor under the simplest setting S1 and also often in settings S2 and S3 when no sparsity is required. On the other hand, in setting S4, there is always a version of SparsePKL that produces more accurate predictions than CGKronRLS^{Footnote 7}. As noted before, the sparsity of the solution vector may give SparsePKL some advantage in setting S4. However, the above result holds true (but in one data) even if we only compare the SparsePKL results with $k=n$. The reason for improved prediction accuracy is most probably the usage of the nonsmooth optimization solver LMB-DCA that is less sensitive to all kinds of perturbations than the smooth CG solver. The CPU times used with DRSEI and DREIS are always significantly smaller than those of CGKronRLS. On the other hand, CGKronRLS clearly win the other loss functions applied within SparsePKL.

The comparison between LMBMKron$\ell _0$LS and SparsePKL (see the red pen in Tables 4, 5, 6, 7, 8, 9 and 10) reveals that LMBMKron$\ell _0$LS often finds sparse solutions with a bit more accurate predictions. More so, if we only compare the results in continuously labeled data with respect to the C-index. However, in binary labeled data and/or with respect to MSE different versions of SparsePKL start to look attractive alternatives. It is also worth noting that most results obtained with SparsePKL are still pretty accurate predictions and the used computational times are vastly shorter than those of LMBMKron$\ell _0$LS: for instance, in Merget DREIS with $k=n/2$ is about 150 times faster than LMBMKron$\ell _0$LS. Besides, the implementation of LMBMKron$\ell _0$LS seems to be a bit unstable and the method crashed often with an unknown reason, especially with bigger data sets (the results in tables are averaged over the successful runs).

In binary labeled data sets CGKronSVM is always the best algorithm with respect to both the C-index and MSE under setting S1 (sometimes together with CGKronRLS, see Fig. 3). However, this result does not hold with more demanding settings S2–S4. In addition, the solutions provided by CGKronSVM are usually not very sparse and they take longer to compute than those of CGKronRLS and most versions of SparsePKL (see Tables 8–10).

Hence, we conclude that CGKronRLS is a very good predictor if the prediction task falls under simple settings S1–S3 and no sparsity of the solution is required. Otherwise, particularly for zero-shot learning, using SparsePKL with a suitable loss function is a preferable option.

7.3.3 Different loss functions

Now we compare different loss functions used with SparsePKL. First, we point out that the nonlinear losses DRSL, DRSEI, and DREIS usually produce more accurate predictions than the piecewise linear losses DRAL and DREIA. An exception here is DREIA under setting S1. In addition, DRSEI and DREIS find these predictions efficiently.

The original LMBM is shown to be best suited for solving highly nonlinear (and nonsmooth) optimization problems [25]. That is, it works better — both in terms of accuracy and efficiency — in nonlinear than in (piecewise) linear cases. As a successor of the LMBM the proposed LMB-DCA most probably shares this property, which explains the previous result. The inefficiency of DRSL is because it never triggers the termination criterion of the LMB-DCA (see Step 3 of Algorithm 1). Instead, it terminates only when the maximum number of iterations in the LMB-DCA is reached. This also means that it runs until the maximum number of iterations in the SparsePKL algorithm (see Algorithm 2) is reached. Here, an easy fix would be an early stopping procedure using the validation data.

The predictions produced by DRAL are often clearly less accurate than those of the others. The same is true for DREIA but with less clear deterioration. There are two possible reasons for this. The first one is the used optimization method LMB-DCA as both absolute loss and epsilon-intensive absolute loss comprise only piecewise linear terms. The second one is the used starting point. As already mentioned, DRAL and DREIA tend to produce trivial feasible solutions $a^*=0$ in binary labeled data when the starting point is too small. Although trivial feasible solutions are not obtained in continuously labeled data, bigger starting points might work better (give more accurate predictions) with these loss functions in continuously labeled data as well but then the number of iterations should be increased in order to get feasible solutions as the convergence to feasible solutions with these two losses is quite slow.

As a result, we recommend to use SparsePKL with a nonlinear (nonsmooth) loss. More precisely, apply DRSL under settings S1 and either DRSEI or DREIS with settings S2–S4.

7.3.4 LMB-DCA vs. DCA

Finally, we compare LMB-DCA with the well-known DCA to show its performance as a pure optimization method. The DCA applied here employs the LMBM as an underlying solver. Therefore, the comparison should be quite fair as both solvers are capable of solving large-scale nonsmooth optimization problems and internal parameter selection is similar. We use the same data sets as before, with one split to different experimental settings each, and the epsilon-intensive squared loss as an objective function. The desired sparsity level is set to 50% (if we set $k=n$ we would have a convex problem) but only one fixed value 0.0001 is used for both $\rho _1$ and $\rho _2$. In practice, this means that we cannot expect to get feasible solutions in terms of sparsity level. However, we can compare the values of the objectives at the end of the optimization procedure (the training procedure without intermediate validation).

In Table 3 we report

n—the number of variables in optimization: in addition to data, this value depends on the experimental setting we are using;
$J^*$—the obtained value of the objective function;
$n_{J1}$, $n_{J2}$—numbers of DC component evaluations used with DCA;
$n_{\xi 1}$, $n_{\xi 2}$—numbers of subgradient evaluations for the DC components used with DCA;
$n_J$ and $n_\xi $—numbers of function and subgradient evaluations for LMB-DCA: here $n_J=n_{J1}=n_{J2}$ and $n_\xi =n_{\xi 1}=n_{\xi 2}$, that is both DC components (subgradients) were evaluated $n_J$ ($n_\xi $) times^{Footnote 8};
CPU—the used CPU times in seconds.

In the table, bold font is used for the better objective value and for the better CPU time. It is easy to see that LMB-DCA outperforms DCA in terms of efficiency. The differences in accuracy are not as significant. However, there is a clear difference in the sense that LMB-DCA found slightly smaller objective values in data sets with continuous labels while DCA found smaller values in data with binary labels (see Table 2). In addition, if we solved a DC optimization problem with a computationally more expensive second DC component $J_2$, DCA would be an obvious choice as it uses only very few evaluations of the second DC component.

Table 3 Comparison of LMB-DCA and DCA

Full size table

8 Conclusions

In this paper, we have introduced the sparse pairwise kernel learning algorithm (SparsePKL) that produces accurate predictions with the desired sparsity level of the solution. The most important reason to seek a sparse solution vector—that is, a vector with only a few nonzero entries—is that it enables us to recognize the most essential input pairs and omit the redundant ones. This leads to improved prediction accuracy, especially under the most challenging experimental setting called zero-shot learning. Moreover, having more zeros in the solution vector speeds up the prediction process. This is due to the fact that in the prediction phase, the dominating costs are the kernel matrix multiplications with the sparse dual coefficient vector.

In addition to the SparsePKL, we have proposed a novel limited memory bundle DC algorithm (LMB-DCA) for large-scale nonsmooth DC optimization and used it as an underlying solver in the SparsePKL. As a pure optimization method, the proposed LMB-DCA outperforms the well-known DCA in terms of efficiency (unless we would solve a DC optimization problem with a computationally very expensive second DC component) while the accuracies of these algorithms are similar.

We have evaluated the performance of the SparsePKL with various loss functions in seven real-life drug-target interaction data sets. The data sets are divided into different experimental settings S1–S4 (including zero-shot learning) according to [39] and different levels of sparsity are tested with the SparsePKL. As a result, we recommend to use SparsePKL with a nonlinear (nonsmooth) loss. More precisely, apply the squared loss under settings S1 (may be used with S2 and S3 in binary labeled data as well) and either the squared epsilon-intensive loss or epsilon-intensive squared loss with settings S2–S4.

The obtained results are compared against the state-of-the-art methods KronRLS and KronSVM [1, 38], as well as the LMBM-Kron$\ell _0$LS algorithm [37] that produces as sparse a solution as possible while keeping the predetermined accuracy of the prediction. The proposed algorithm outperformed the LMBM-Kron$\ell _0$LS by producing relatively accurate predictions within only a small fraction of the time used by the LMBM-Kron$\ell _0$LS. In addition, the proposed algorithm produced more accurate predictions than KronRLS and KronSVM under the most challenging setting S4. It usually also made it faster.

We conclude that the KronRLS is a very good predictor if the prediction task falls under the simple settings S1–S3 and no sparsity of the solution is required. Otherwise, particularly for zero-shot learning, using the SparsePKL with a suitable loss function is a preferable option.

Data Availability

All the data analysed during this study are reported in the paper and openly available.

Code availability

the source code of SparsePKL is available at https://github.com/napsu/sparsePKL.

Notes

This is a common feature in nonsmooth optimization.
This leads to the need for different experimental settings, see [39, 48] and Sect. 7.1.
In this work, the kernel functions $k_\mathcal {D}$ for drugs and $k_\mathcal {T}$ for targets come from the chemical structure and sequence similarity matrices, respectively.
In this work Gaussian kernels for drugs and targets are used but other kernel functions can be applied as well. The pairwise interactions between drug and target features are modelled via the pairwise Kronecker product kernel matrix K.
We only used one starting point procedure here due to the fact that even with one starting point procedure the method is the most time consuming of the methods tested and using also the second starting point procedure would have doubled the time.
Our preliminary tests with other kernel combinations resulted in much worse predictions.
To be precise, in IC data under S4, CGKronRLS produces the most accurate prediction w.r.t. C-index if we only compare it to SparsePKL results with $k=n$. However, both DRSL and DRSEI predict better than CGKronRLS with higher sparsity levels (see Table 9).
This is due to the fact that the DCA step (Step 9 in Algorithm 1) was not applied in these experiments. Nevertheless, it was employed semi-regularly in the experiments of the previous sections.

References

Airola, A., Pahikkala, T.: Fast Kronecker product kernel methods via generalized Vec trick. IEEE Trans. Neural Netw. Learn. Syst. 29, 3374–3387 (2018)
MathSciNet Google Scholar
Bagirov, A.M., Gaudioso, M., Karmitsa, N., Mäkelä, M.M., Taheri, S.: Numerical Nonsmooth Optimization: State of the Art Algorithms. Springer, Cham (2020)
Google Scholar
Bagirov, A.M., Karmitsa, N., Mäkelä, M.M.: Introduction to Nonsmooth Optimization: Theory, Practice and Software. Springer, Cham (2014)
Google Scholar
Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens. Ann. Stat. 44(2), 813–852 (2016)
MathSciNet Google Scholar
Byrd, R.H., Nocedal, J., Schnabel, R.B.: Representations of quasi-Newton matrices and their use in limited memory methods. Math. Program. 63, 129–156 (1994)
MathSciNet Google Scholar
Cichonska, A., Pahikkala, T., Szedmak, S., Julkunen, H., Airola, A., Heinonen, M., Aittokallio, T., Rousu, J.: Learning with multiple pairwise kernels for drug bioactivity prediction. Bioinformatics 34(13), i509–i518 (2018)
Google Scholar
Cichonska, A., Ravikumar, B., Allaway, R.J., et al.: Crowdsourced mapping of unexplored target space of kinase inhibitors. Nat. Commun. 12(3307), 146–157 (2021)
Google Scholar
Clarke, F.H.: Optimization and Nonsmooth Analysis. Wiley-Interscience, New York (1983)
Google Scholar
Davis, M.I., Hunt, J.P., Herrgard, S., Ciceri, P., Wodicka, L.M., Pallares, G., Hocker, M., Treiber, D.K., Zarrinkar, P.P.: Comprehensive analysis of kinase inhibitor selectivity. Nat. Biotechnol. 29(11), 1046–1051 (2011)
Google Scholar
Gaudioso, M., Giallombardo, G., Miglionico, G.: Sparse optimization via vector k-norm and DC programming with an application to feature selection for support vector machines. Comput. Optim. Appl. 86, 745–766 (2023)
MathSciNet Google Scholar
Gaudioso, M., Gorgone, E., Hiriart-Urruty, J.: Feature selection in SVM via polyhedral $k$-norm. Optim. Lett. 14, 19–36 (2020)
MathSciNet Google Scholar
Gönen, M., Heller, G.: Concordance probability and discriminatory power in proportional hazards regression. Biometrika 92, 965–970 (2005)
MathSciNet Google Scholar
Gotoh, J., Takeda, A., Tono, K.: DC formulations and algorithms for sparse optimization problems. Math. Program. Ser. B 169, 141–176 (2018)
MathSciNet Google Scholar
Haarala, M.: Large-scale nonsmooth optimization: variable metric bundle method with limited memory. Ph.D. thesis, University of Jyväskylä, Department of Mathematical Information Technology (2004)
Haarala, M., Miettinen, K., Mäkelä, M.M.: New limited memory bundle method for large-scale nonsmooth optimization. Optim. Methods Softw. 19(6), 673–692 (2004)
MathSciNet Google Scholar
Haarala, N., Miettinen, K., Mäkelä, M.M.: Globally convergent limited memory bundle method for large-scale nonsmooth optimization. Math. Program. 109(1), 181–205 (2007)
MathSciNet Google Scholar
Hager, W.W., Phan, D.T., Zhu, J.: Projection algorithms for nonconvex minimization with application to sparse principal component analysis. J. Global Optim. 65, 657–676 (2016)
MathSciNet Google Scholar
Halkola, A., Joki, K., Mirtti, T., Mäkelä, M.M., Aittokallio, T., Laajala, T.D.: OSCAR: Optimal subset cardinality regression using the L0-pseudonorm with applications to prognostic modelling of prostate cancer. PLoS Comput. Biol. 19(3), e1010333 (2023)
Google Scholar
Hartman, P.: On functions representable as a difference of convex functions. Pac. J. Math. 9(3), 707–713 (1959)
MathSciNet Google Scholar
Hertz, A., Kuflik, T., Tuval, N.: Resolving sets and integer programs for recommender systems. J. Global Optim. 81, 153–178 (2021)
MathSciNet Google Scholar
Hiriart-Urruty, J.B.: Generalized differentiability, duality and optimization for problems dealing with differences of convex functions. In: Ponstein, J. (ed.) Convexity Dual. Optim., vol. 256, pp. 37–70. Springer, Berlin (1985)
Google Scholar
Hofmann, T., Schölkopf, B., Smola, A.: Kernel methods in machine learning. Ann. Stat. 36(3), 1171–1220 (2008)
MathSciNet Google Scholar
Joki, K., Bagirov, A.M., Karmitsa, N., Mäkelä, M.M.: A proximal bundle method for nonsmooth DC optimization utilizing nonconvex cutting planes. J. Global Optim. 68(3), 501–535 (2017)
MathSciNet Google Scholar
Kampa, K., Mehta, S., Chou, C.A., Chaovalitwongse, W.A., Grabowski, T.J.: Sparse optimization in feature selection: application in neuroimaging. J. Global Optim. 59, 439–457 (2014)
MathSciNet Google Scholar
Karmitsa, N., Bagirov, A.M., Mäkelä, M.M.: Comparing different nonsmooth optimization methods and software. Optim. Methods Softw. 27(1), 131–153 (2012)
MathSciNet Google Scholar
Kiwiel, K.C.: Methods of Descent for Nondifferentiable Optimization. Lecture Notes in Mathematics, vol. 1133. Springer, Berlin (1985)
Le Thi, H.A., Pham Dinh, T.: Convex analysis approach to DC programming: theory, algorithms and applications. Math. Program. 169(1), 5–68 (2018)
MathSciNet Google Scholar
Le Thi, H.A., Pham Dinh, T.: Open issues and recent advances in DC programming and DCA. J. Global Optim. (2023). https://doi.org/10.1007/s10898-023-01272-1
Article Google Scholar
Liu, T.Y.: Learning to Rank for Information Retrieval. Springer, Berlin (2011)
Google Scholar
Meng, N., Zhao, Y.-B., Kočvara, M., Sun, Z.: Partial gradient optimal thresholding algorithms for a class of sparse optimization problems. J. Global Optim. 84, 393–413 (2022)
MathSciNet Google Scholar
Menon, A., Elkan, C.: A log-linear model with latent features for dyadic prediction. In: The 10th IEEE International Conference on Data Mining (ICDM), pp. 364–373 (2010)
Merget, B., Turk, S., Eid, S., Rippmann, F., Fulle, S.: Profiling prediction of kinase inhibitors: toward the virtual assay. J. Med. Chem. 60(1), 474–485 (2017)
Google Scholar
Metz, J.T., Johnson, E.F., Soni, N.B., Merta, P.J., Kifle, L., Hajduk, P.J.: Navigating the kinome. Nat. Chem. Biol. 7, 200–202 (2011)
Google Scholar
Miyashiro, R., Takano, Y.: Mixed integer second-order cone programming formulations for variable selection in linear regression. Eur. J. Oper. Res. 247(3), 721–731 (2015)
MathSciNet Google Scholar
Natarajan, B.K.: Sparse approximate solutions to linear systems. SIAM J. Comput. 24(2), 227–234 (1995)
MathSciNet Google Scholar
de Oliveira, W.: The ABC of DC programming. Set-Valued Var. Anal. 28(1), 679–706 (2022)
MathSciNet Google Scholar
Paasivirta, P., Numminen, R., Airola, A., Karmitsa, N., Pahikkala, T.: Predicting pairwise interaction affinities with $\ell _0$-penalized least squares—a nonsmooth bi-objective optimization based approach. Optim. Methods Softw. (2023, in press)
Pahikkala, T., Airola, A.: Rlscore: regularized least-squares learners. J. Mach. Learn. Res. 17, 1–5 (2016)
MathSciNet Google Scholar
Pahikkala, T., Airola, A., Pietilä, S., Shakyawar, S., Szwajda, A., Tang, J., Aittokallio, T.: Toward more realistic drug-target interaction predictions. Brief. Bioinform. 16(2), 325–337 (2014)
Google Scholar
Park, Y., Marcotte, E.: Flaws in evaluation schemes for pair-input computational predictions. Nat. Methods 9(12), 1134–1136 (2012)
Google Scholar
Pham Dinh, T., Le Thi, H.A.: DC programming and DCA: thirty years of developments. Acta Math. Vietnam 22(1), 289–355 (1997)
MathSciNet Google Scholar
Rendle, S., Freudenthaler, C.: Improving pairwise learning for item recommendation from implicit feedback. In: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM’14, pp. 273–282. Association for Computing Machinery (2014)
Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton, NJ (1970)
Google Scholar
Schölkopf, B., Herbrich, R., Smola, A.J.: A generalized representer theorem. In: Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory pp. 416–426 (2001)
Schrynemackers, M., Küffner, R., Geurts, P.: On protocols and measures for the validation of supervised methods for the inference of biological networks. Front. Genet. 4, 262 (2013)
Google Scholar
Tang, J., Szwajda, A., Shakyawar, S., Xu, T., Hintsanen, P., Wennerberg, K., Aittokallio, T.: Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J. Chem. Inf. Model. 54(3), 735–743 (2014)
Google Scholar
Taskar, B., Wong, M.-F., Abbeel, P., Koller, D.: Link prediction in relational data. In: Advances in Neural Information Processing Systems, vol. 16. MIT Press (2003)
Viljanen, M., Airola, A., Pahikkala, T.: Generalized vec trick for fast learning of pairwise kernel models. Mach. Learn. 111, 543–573 (2022)
MathSciNet Google Scholar
Wang, Z., Zhou, Y., Hong, L., Zou, Y., Su, H., Chen, S.: Pairwise learning for neural link prediction (2022). https://doi.org/10.48550/arXiv.2112.02936
Wei, Z., Li, Q., Wei, J., Bian, W.: Neural network for a class of sparse optimization with L0-regularization. Neural Netw. 151, 211–221 (2022)
Google Scholar
Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W., Kanehisa, M.: Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24(13), i232–i240 (2008)
Google Scholar
Yao, Y., Rosasco, L., Caponnetto, A.: On early stopping in gradient descent learning. Constr. Approx. 26(2), 289–315 (2007)
MathSciNet Google Scholar
Yi, S., Lai, Z., He, Z., Cheung, Y.-M., Liu, Y.: Joint sparse principal component analysis. Pattern Recogn. 61, 524–536 (2017)
Google Scholar
Yin, Y., Wang, Y., Dai, T., Wang, L.: DOA estimation based on smoothed sparse reconstruction with time-modulated linear arrays. Signal Process. 214, 109229 (2024)
Google Scholar
Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph. Stat. 15(2), 265–286 (2006)
MathSciNet Google Scholar

Download references

Funding

Open Access funding provided by University of Turku (including Turku University Central Hospital). Open Access funding provided by University of Turku (including Turku University Central Hospital. This work was financially supported by University of Turku and Research Council of Finland Grants #345804 and #345805.

Author information

Authors and Affiliations

Department of Computing, University of Turku, 20014, Turku, Finland
Napsu Karmitsa, Antti Airola & Tapio Pahikkala
Department of Mathematics and Statistics, University of Turku, 20014, Turku, Finland
Kaisa Joki

Authors

Napsu Karmitsa
View author publications
Search author on:PubMed Google Scholar
Kaisa Joki
View author publications
Search author on:PubMed Google Scholar
Antti Airola
View author publications
Search author on:PubMed Google Scholar
Tapio Pahikkala
View author publications
Search author on:PubMed Google Scholar

Corresponding author

Correspondence to Napsu Karmitsa.

Ethics declarations

Conflict of interest:

The authors have no financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Detailed numerical results

The results of our numerical experiments are given in Tables 4, 5, 6, 7, 8, 9 and 10. We recall that the results are averaged over five different training, validation, and test splits of each data.

Table 4 Results in Davis

Full size table

Table 5 Results in Metz

Full size table

Table 6 Results in KiBA

Full size table

Table 7 Results in Merget

Full size table

Table 8 Results in GPCR

Full size table

Table 9 Results in IC

Full size table

Table 10 Results in E

Full size table

In the tables, we use the bold font to emphasize the best result (greatest C-index or smallest MSE) obtained with any loss function attached to SparsePKL. We use a blue pen to emphasize predictions and computational times that are better or equal with SparsePKL with $k=n$ than those of CGKronRLS. If none of the solutions is better, we use the blue pen to the result obtained with CGKronRLS. In addition, we use a red pen to emphasize predictions of SparsePKL that are more or equally accurate than those obtained with LMBMKron$\ell _0$LS. Here we only compare the results that have approximately the same sparsity levels. That is, the results of SparsePKL we use for comparison depend on the sparsity levels obtained with LMBMKron$\ell _0$LS. If none of the solutions is better, we use the red pen to the result obtained with LMBMKron$\ell _0$LS.

In the tables, $nz\%$ denotes the percentages of the variables that are nonzero at the solution, and CPU times (given in seconds) are averaged over all runs in data. That is, they include computation of all different settings. In addition, we report the number of infeasible solutions together with the biggest $nz\%$ obtained with SparsePKL as a footnote to each table.

It is worth noting that LMBMKron$\ell _0$LS crashed often (with an unknown reason), especially with bigger data sets. The results are averaged over the successful runs.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Karmitsa, N., Joki, K., Airola, A. et al. Limited memory bundle DC algorithm for sparse pairwise kernel learning. J Glob Optim 92, 55–85 (2025). https://doi.org/10.1007/s10898-025-01481-w

Download citation

Received: 30 October 2023
Accepted: 11 March 2025
Published: 03 April 2025
Version of record: 03 April 2025
Issue date: May 2025
DOI: https://doi.org/10.1007/s10898-025-01481-w

Keywords

Profiles

Tapio Pahikkala View author profile

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Limited memory bundle DC algorithm for sparse pairwise kernel learning

Abstract

Similar content being viewed by others

A unified DC programming framework and efficient DCA based approaches for large scale batch reinforcement learning

Robust discriminative feature learning with calibrated data reconstruction and sparse low-rank model

Combining Primal and Dual Representations in Deep Restricted Kernel Machines Classifiers

Explore related subjects

1 Introduction

2 Preliminaries

2.1 Nonsmooth optimization

2.2 DC programming

2.3 \(\ell _0\)-pseudo-norm and sparse optimization

3 Limited memory bundle DC algorithm

Remark 1

Remark 2

Remark 3

Remark 4

4 Preliminaries for pairwise kernel learning

4.1 Pairwise learning problem

Assumption 1

Assumption 2

4.2 Kernel methods

5 Model formulation

5.1 Double regularization

5.2 Loss functions and their (sub)gradients

5.3 Subgradient of double regularization

6 Sparse pairwise kernel learning

6.1 Pairwise kernel learning algorithm

Remark 5

6.2 Implementation

7 Numerical experiments

7.1 Data and experimental settings

7.2 Methods

7.3 Results

7.3.1 General observations

7.3.2 Comparison with CGKronRLS, CGKronSVM, and LMBMKron \(\ell _0\) LS

7.3.3 Different loss functions

7.3.4 LMB-DCA vs. DCA

8 Conclusions

Data Availability

Code availability

Notes

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest:

Additional information

Publisher's Note

Appendix A: Detailed numerical results

Appendix A: Detailed numerical results

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Profiles