1 Introduction

Different from single-task learning, Multi-Task Learning (MTL) trains a unified model over several tasks simultaneously [1]. Recently, MTL, particularly in combination with Deep Learning (DL), has led to many successes in Natural Language Processing (NLP) and Computer Vision (CV) [2, 3].

Current DL-based MTL approaches mainly adopt two network structures, i.e., the soft structure and the hard structure shown in Fig. 1 [4]. In the soft structure, each task has its own parameters. In the soft layers, the parameters of different tasks are regularized towards each other to learn the relationships between tasks, while in the specific layers the parameters belong to each task alone. This structure allows the model to learn its own feature representations for each task, adapt to the differences between tasks, and thus improve overall performance; however, the larger number of parameters may increase the training complexity. The hard structure shares the same parameters across tasks, i.e., all tasks are trained and predicted with the same model weights, which forces the MTL model to first learn a unified representation over all tasks. Specific layers for each task then follow the hard layers. This structure accelerates training because the number of parameters is small, and the tasks can promote each other by sharing knowledge; however, the feature representations required by the tasks may differ, which can lower the performance on certain tasks. Compared with the soft structure, the hard structure significantly reduces the model size and the risk of overfitting, so it is more popular in industry [5].

In the hard structure, the hard layers are expected to learn general representations over all tasks, while the specific layers are expected to learn specific representations for each task. Since the specific layers directly follow the hard layers, the MTL model has to switch abruptly from estimating general representations to estimating specific ones. To a certain degree, this rapid change is difficult for the MTL model to handle without any transition. An efficient MTL network should consider the shared and the task-specific parts simultaneously: it needs to learn the general representation across tasks to avoid overfitting and the specific representation of each task to avoid underfitting.

Fig. 1 The soft and hard parameter sharing structures

To alleviate this problem, we propose a novel parameter sharing mechanism in this paper. To a certain degree, related tasks tend to learn similar parameters in the MTL model [4]. Inspired by this observation, we present a novel neural layer, i.e., the cluster layer. As shown in Fig. 2a, the MTL model first estimates the specific parameters for each task, and similar tasks tend to learn similar parameters (shown with similar colors). Then, we perform network clustering to group tasks with similar parameters into clusters. For the tasks in one cluster, their estimated parameters are replaced with the same parameters (i.e., the parameter center of all tasks in this cluster), as shown in Fig. 2b. This cluster sharing mechanism has three main advantages. First, like the hard layers, the cluster layer learns general representations for all tasks within one cluster; like the specific layers, it learns distinct representations across clusters. The cluster layer can therefore be regarded as a transition from the hard layers to the specific layers, so the MTL model learns representations that move gradually from general to specific. Second, the cluster layer significantly reduces the parameter size. The parameter size of a specific layer for MTL is \(N\) times larger than that for a single task (\(N\) is the task number). Replacing it with a cluster layer reduces the parameter size to \(K/N\) of that of the specific layer (\(K\) is the cluster number). This greatly benefits massive multi-task learning (MMTL) [6], where MTL confronts the learning problem over tens or hundreds of tasks (e.g., drug discovery [7]), since the parameters scale up rapidly as the task number increases. Finally, this cluster sharing mechanism acts as a regularization on the MTL model, which helps reduce the over-fitting risk. In this paper, we also present the learning algorithm of the Network Clustering MTL (NCMTL) model, which alternately performs model estimation and network clustering during training.
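As a rough illustration of this reduction (using the task number \(N = 14\) and the smallest cluster number \(K = 3\) adopted later in our experiments, and writing \(d_{in} \times d_{out}\) for the size of a single-task weight matrix, biases omitted):

$$ \underbrace{N \cdot d_{in} d_{out}}_{\text{specific layer}} = 14\, d_{in} d_{out} \qquad \text{vs.} \qquad \underbrace{K \cdot d_{in} d_{out}}_{\text{cluster layer}} = 3\, d_{in} d_{out}, $$

i.e., the cluster layer keeps roughly \(3/14 \approx 21\%\) of the specific-layer parameters.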

Fig. 2 The proposed cluster parameter sharing structure

Experiments are conducted on multi-task document classification with public datasets. The results show that the cluster layer is quite efficient in MTL. The main contributions of this paper are summarized as follows: 1) As far as we know, we are the first to present the cluster layer for MTL. In this way, the MTL model gradually learns representations from general to specific. 2) We present the NCMTL approach, which alternately performs network clustering and parameter estimation. This idea may be leveraged in similar scenarios.

2 Related Work

Multi-Task Learning (MTL) aims to train a unified model over several tasks simultaneously [1]. In NLP and CV, MTL, especially with Deep Learning, has recently led to many successes [2, 3]. Thus, we focus only on DL-based MTL approaches in this paper.

Ruder summarizes the DL-based MTL network structures, which fall into two categories, i.e., the soft parameter sharing structure and the hard parameter sharing structure [4]. For example, the soft structure is selected in [8] to regularize the parameters of all models, while the hard structure is adopted in [9] to share the convolutional backbone network over all tasks. Sarkar et al. develop an MTL model that exploits the latent embedding space of words and topics for prediction, shares information across related tasks through a message passing mechanism, and is trained with active learning to address the lack of standardized fine-grained labeled data [10]. Specifically for MTL text classification, two mechanisms (i.e., an external memory and a reading/writing communication mechanism) are introduced to share task information [11, 12], and Lu et al. introduce a hybrid representation-learning network [13]. Liu et al. employ an adversarial mechanism to overcome the task differences [14]. The meta-knowledge from different tasks is extracted and used in MTL learning in [15]. There are many studies on MTL document classification [2]. Tan et al. propose an MTL-based framework that utilizes deep neural networks to model correlations and improve the overall performance of sentiment analysis [16]. Gawron et al. create a multi-task multilingual model for the following text classification tasks: functional style, domain, readability, and sentiment [17]. Luo et al. propose a text-guided multi-task learning network for multimodal sentiment analysis [18]. Ameur et al. propose a deep multi-task learning approach for image/video distortion identification [19]. Given the limited space, we focus only on the MTL network structure. These methods mainly adopt the hard structure, in which the hard and specific layers tend to learn general and specific representations respectively. In this paper, we present the cluster layers, which act as a transition from the hard layers to the specific layers and learn representations that shift gradually from general to specific.

3 Network Clustering MTL

The task of multi-task learning is mathematically formulated as follows: given a series of tasks \(\mathrm{T} = \{ T_{1}, T_{2}, \ldots, T_{N} \}\), we aim to learn a unified mapping function \(Y = f(X,\theta )\), which outputs the labels \(Y_{i}\) given the inputs \(X_{i}\) of task \(T_{i}\). In our Network Clustering MTL (NCMTL) model, \(\theta\) is written as \(\left\{ W_{H}, W_{C}, W_{S} \right\}\), which denotes the parameters of the hard, cluster and specific layers respectively. The cluster layers are placed between the hard layers and the specific layers, as shown in Fig. 2b.
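A minimal PyTorch-style sketch of this parameter partition is given below. It is illustrative only: the names (`NCMTLSketch`, `in_dim`, `cluster_dim`) are ours, and a simple feed-forward stand-in replaces the LSTM hard layers used in Sect. 4.

```python
import torch
import torch.nn as nn

class NCMTLSketch(nn.Module):
    """Illustrative partition of theta into {W_H, W_C, W_S}."""

    def __init__(self, num_tasks: int, in_dim: int, cluster_dim: int, num_classes: int):
        super().__init__()
        # W_H: hard layers shared by all tasks (any backbone can be plugged in here).
        self.hard = nn.Sequential(nn.Linear(in_dim, cluster_dim), nn.ReLU())
        # W_C: one weight matrix per task in the cluster layer; after network
        # clustering, tasks in the same cluster hold identical values (their center).
        self.cluster_weight = nn.Parameter(0.01 * torch.randn(num_tasks, cluster_dim, cluster_dim))
        # W_S: task-specific output layers.
        self.specific = nn.ModuleList(
            [nn.Linear(cluster_dim, num_classes) for _ in range(num_tasks)]
        )

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        h = self.hard(x)                                  # general representation (hard layers)
        h = torch.relu(h @ self.cluster_weight[task_id])  # cluster-layer representation
        return self.specific[task_id](h)                  # task-specific logits
```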

For the hard layers \(W_{H}\), recurrent and/or convolutional neural networks can be adopted as the backbone networks. However, the hard layers are not the key interest of this paper; given the limited space, we focus on the cluster layers.

For the cluster layers \(W_{C}\), we incorporate \(L\) cluster layers in the NCMTL model. Taking the cluster layer \(l_{i}\) as an example, we first assume that there exist specific weights \(w^{\prime}_{ij}\) for task \(T_{j}\). Since similar tasks tend to learn similar parameters, all tasks are grouped into \(K_{i}\) clusters according to the similarities of their specific weights \(w^{\prime}_{i*}\) (\(*\) denotes all tasks, and \(K_{i}\) is the predefined cluster number of this cluster layer). We then obtain the parameter center of each cluster. Let \(c_{ik}\) be the parameter center of the \(k^{th}\) cluster; the original specific weights of the tasks in this cluster are replaced with \(c_{ik}\). In this way, the tasks in one cluster share the same parameters. Compared with a hard layer, which produces the same representation for all tasks, the cluster layer produces the same representation only for a group of tasks. When we gradually increase the cluster numbers of the cluster layers, the NCMTL model learns features that move gradually from general to specific. It is important to highlight that \(W_{C}\) are updated only by the NCMTL training itself; the clustering step merely determines how they are shared.
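The replacement step for a single cluster layer can be sketched as follows, assuming the hard cluster assignment is already available; `share_by_cluster` is a hypothetical helper used only for illustration.

```python
import torch

def share_by_cluster(task_weights: torch.Tensor, assign: torch.Tensor) -> torch.Tensor:
    """Replace each task's weights with the parameter center of its cluster.

    task_weights: [N, D] flattened specific weights w'_{i*} of one cluster layer
    assign:       [N]    hard cluster assignment of each task (values in 0..K_i-1)
    returns:      [N, D] weights in which the tasks of one cluster share the center c_{ik}
    """
    num_clusters = int(assign.max()) + 1
    centers = torch.stack(
        [task_weights[assign == k].mean(dim=0) for k in range(num_clusters)]
    )                                  # [K_i, D] parameter centers
    return centers[assign]             # broadcast each center back to its tasks
```

Only the centers and the assignment need to survive training, which is exactly the memory saving discussed next.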

To save memory, the final NCMTL model does not need to store the specific weights \(w^{\prime}_{i*}\); we only keep the parameter centers and the cluster results. Since the \(K_{i}\) parameter centers replace the specific weights of all \(N\) tasks, the parameter size of the cluster layer is reduced to \(K_{i}/N\) of that of a specific layer.

In practice, the clustering procedure is performed during training. As shown in Fig. 3, at the \((t - 1)^{th}\) training batch, we update \(w^{\prime}_{ij}\) with the parameter center \(c_{ik}\) of task \(T_{j}\). Then, clustering is carried out on the updated \(w^{\prime}_{i*}\) and forms the new cluster centers \(c_{ik}\) for the next, \(t^{th}\), training batch. Under the hard cluster mode (i.e., one task is grouped into exactly one cluster in a cluster layer), the clustering loss function is given by

$$ {\mathcal{L}}_{c} \left( \theta \right) = \sum\limits_{i = 1}^{L} \sum\limits_{j = 1}^{N} \sum\limits_{k = 1}^{K_{i}} r_{ijk} \left\| w^{\prime}_{ij} - c_{ik} \right\|^{2} $$
(1)
$$ r_{ijk} = \left\{ \begin{array}{ll} 1 & \text{if } T_{j} \in c_{ik} \\ 0 & \text{otherwise} \end{array} \right. $$

where \(r_{ijk}\) denotes the clustering result, i.e., whether task \(T_{j}\) is grouped into the cluster \(c_{ik}\) of the cluster layer \(l_{i}\).
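Eq. (1) translates almost literally into code. The sketch below assumes the centers and hard assignments come from the clustering step of the previous batch; the variable names are ours.

```python
import torch

def clustering_loss(task_weights, centers, assigns):
    """Eq. (1): squared distance of each task's weights w'_{ij} to its cluster center c_{ik}.

    task_weights: list of L tensors, each [N, D_i]   (specific weights of cluster layer i)
    centers:      list of L tensors, each [K_i, D_i] (parameter centers of cluster layer i)
    assigns:      list of L long tensors, each [N]   (hard assignments, i.e. where r_{ijk} = 1)
    """
    loss = torch.zeros(())
    for w, c, a in zip(task_weights, centers, assigns):
        loss = loss + ((w - c[a]) ** 2).sum()  # sum over tasks j and weight dimensions
    return loss
```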

Fig. 3 Parameter updating and clustering

The specific layers predict the final probabilities \(Y_{j}\) given the input \(X_{j}\) of task \(T_{j}\), which is a standard classification problem. Given the gold labels \(\hat{Y}_{j}\), the classification loss function is defined by

$$ {\mathcal{L}}_{p} \left( \theta \right) = - \sum\limits_{j = 1}^{N} {\hat{Y}_{j} } \log \left( {Y_{j} } \right) $$
(2)

In sum, the loss function of our NCMTL model is written as

$$ {\mathcal{L}}\left( \theta \right) = {\mathcal{L}}_{p} \left( \theta \right) + \alpha {\mathcal{L}}_{c} \left( \theta \right) $$
(3)

where \(\alpha\) balances the classification loss and the clustering loss. To some extent, \({\mathcal{L}}_{c} \left( \theta \right)\) acts as a regularization term in the NCMTL model. We alternately perform model estimation and network clustering, as shown in Algorithm 1.
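One training batch of this alternation can be sketched as follows. This is a sketch rather than the actual implementation: `model.cluster_layer_weights()` and `model.cluster_numbers` are hypothetical accessors for the per-task weights \(w^{\prime}_{i*}\) (flattened to \([N, D_i]\)) and the cluster numbers \(K_i\), and scikit-learn's k-means with k-means++ initialization stands in for the clustering step used in Sect. 4.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def train_step(model, batch, optimizer, alpha, cluster_state):
    """One batch of Algorithm 1: model estimation followed by network clustering."""
    x, y, task_id = batch
    # 1) Model estimation: classification loss (Eq. 2) plus the clustering loss (Eq. 1).
    loss = F.cross_entropy(model(x, task_id), y)
    for w, (centers, assign) in zip(model.cluster_layer_weights(), cluster_state):
        loss = loss + alpha * ((w - centers[assign]) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # 2) Network clustering: re-estimate centers and assignments for the next batch.
    new_state = []
    for w, k in zip(model.cluster_layer_weights(), model.cluster_numbers):
        km = KMeans(n_clusters=k, init="k-means++", n_init=1).fit(w.detach().cpu().numpy())
        new_state.append((torch.as_tensor(km.cluster_centers_, dtype=w.dtype, device=w.device),
                          torch.as_tensor(km.labels_, dtype=torch.long, device=w.device)))
    return loss.item(), new_state
```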

Algorithm 1 Network clustering MTL

4 Experiments

4.1 Datasets

In this paper, our NCMTL model is evaluated on MTL document classification. Fourteen Amazon product review datasets are selected from the domains {"apparel", "baby", "books", "camera", "DVD", "electronic", "health", "kitchen", "magazines", "music", "software", "sports", "toys", "video"} [20]. These datasets contain document-level reviews, and the goal is to predict the sentiment (i.e., positive or negative) of each document. In total, we collect 27,755 documents across all 14 domains. Each dataset contains a vocabulary of 37K words on average, and each document includes 115 words on average. Each dataset is split equally into ten sets, nine of which are used as the training set and the remaining one as the testing set. Accuracy is employed to evaluate the performance of our NCMTL model.

Our proposed approach is compared with the following baselines, which include both single-task learning and multi-task learning approaches.

  • LSTM: The standard LSTM-based neural network for single-task document classification [21].

  • TextCNN: The standard convolutional neural network for single-task document classification [22].

  • HAN: The hierarchical sentence-level and document-level attention network for single-task document classification [23].

  • SoftShare: The soft parameter sharing approach for MTL document classification [8]. For each task, we adopt a network structure like that of the HAN approach in the soft layers. The soft layers of different tasks are regularized with \(\ell_{2}\) regularization, and the specific layers are used for the classification of each task.

  • HardShare: The hard parameter sharing approach for MTL document classification [9]. In the hard layers, we adopt the same network configuration as the soft layers of the SoftShare approach. The specific layers in the HardShare approach are the same as those of the SoftShare approach.

  • ASP-MTL: The hard parameter sharing approach for MTL document classification [14]. An adversarial task is introduced to alleviate feature interference between tasks.

  • NCMTL: Our cluster-layer-based MTL approach. Different from the HardShare approach, network clustering is performed on all specific layers except the last one. KMeans++ [24] is adopted in the network clustering of each training batch.

Implementation details For a fair comparison, we reimplement all the above approaches with nearly the same hyperparameter settings. The word embeddings are initialized with pre-trained GloVe vectors, as in [2], and the dimension of the word embeddings is 200. In these experiments, we focus on the comparison of the cluster layers rather than on carefully hand-designed, dedicated network structures; thus, we adopt a relatively simple network structure as the base model. Two word-level and two sentence-level LSTM layers are adopted as the hard layers. We adopt three cluster layers with hidden sizes {32, 32, 16} for each task and cluster numbers {3, 5, 10}. Finally, one specific layer is used for the prediction of each task. All approaches are trained on a Tesla V100 GPU with the Adam optimizer (learning rate \(1 \times 10^{-5}\)) and a batch size of 32.
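For reference, the configuration described above can be summarized as follows; this is an illustrative summary of the reported values, not a released configuration file.

```python
# Illustrative summary of the hyperparameters reported in this section.
ncmtl_config = {
    "word_embedding": {"init": "pre-trained GloVe", "dim": 200},
    "hard_layers": {"word_level_lstm": 2, "sentence_level_lstm": 2},
    "cluster_layers": {"hidden_sizes": [32, 32, 16], "cluster_numbers": [3, 5, 10]},
    "specific_layers": 1,                     # one prediction layer per task
    "optimizer": {"name": "Adam", "learning_rate": 1e-5},
    "batch_size": 32,
    "hardware": "Tesla V100 GPU",
}
```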

4.2 Main Results

We run all approaches three times and report the average accuracy in Table 1. The NCMTL approach achieves significant improvements over all baseline approaches except ASP-MTL. The experimental results thus illustrate the efficiency of the cluster layer.

Table 1 Accuracy of the NCMTL model on 14 datasets against baselines

Figure 4 shows the cluster results of some cluster layers in our final NCMTL model. To a certain degree, the cluster results verify our previous hypothesis that similar tasks tend to learn similar parameters. For example, "camera", "electronic" and "software" are related to the electronics category, and these tasks fall into the same cluster. Meanwhile, we find that the clusters in cluster layer 1 are more general than those in cluster layer 2, e.g., the cluster {"apparel", "baby", "books", "magazines"} in cluster layer 1 compared with the clusters {"apparel", "baby"} and {"books", "magazines"} in cluster layer 2. To a certain degree, this verifies our motivation that the cluster layers gradually learn representations from general to specific. We also trace the change of the clustering results over the training iterations and find that they are relatively stable, i.e., certain tasks (e.g., the "music" and "video" classification tasks) are always grouped into the same cluster after a few training iterations. To accelerate training, we freeze the cluster results after four training epochs.

Fig. 4 Cluster results in our final NCMTL model

4.3 Ablation Studies

In this section, we conduct ablation experiments to evaluate the effects of cluster number, embedding size and iteration epoch.

Effect of cluster number The cluster numbers of the cluster layers are investigated first. The cluster numbers of Layer 1 and Layer 2 are each set to {1, 3, 5, 8, 10}. The experimental results in Fig. 5 suggest that the best results are achieved when the cluster numbers of Layer 1 and Layer 2 are set to <3, 5> or <5, 8>. When the cluster number of Layer 1 is smaller than that of Layer 2, improved results are more likely. As the number of layers increases, the NCMTL approach possibly pays more attention to the diversity of the tasks and therefore requires a larger number of clusters.

Fig. 5 Evaluation of cluster number in layer 1 and layer 2

Effect of embedding size We compare different embedding sizes for cluster layers 1–3, which are set to <8, 8, 4>, <16, 16, 8>, <32, 32, 16>, <64, 64, 32>, <128, 128, 64> and <256, 256, 128>. As shown in Fig. 6, the NCMTL approach achieves the best results on average when the embedding sizes are set to <32, 32, 16> or <64, 64, 32>. Meanwhile, we observe that the NCMTL approach cannot capture the connections among the tasks when the embedding size is set to <8, 8, 4>, and there is a significant improvement when the embedding size increases from <8, 8, 4> to <16, 16, 8>. When the embedding size increases beyond <128, 128, 64>, the performance decreases slightly, which may hint at overfitting of our approach.

Fig. 6 Comparison of embedding size

Effect of iteration epoch The accuracies on the various datasets during training are illustrated in Fig. 7. All tasks converge smoothly. There is a sharp performance improvement across the tasks during the first 10 training epochs, and most tasks start to converge gradually after 15 epochs.

Fig. 7 Accuracy during the training iterations

5 Conclusion

In this paper, we propose a novel cluster layer to address MTL problems. The cluster layer automatically groups the network parameters of similar tasks. It not only reduces the model size, but also learns representations that move gradually from general to specific. The experimental results show that the proposed NCMTL approach is quite efficient in multi-task document classification, and the cluster results verify that the cluster layers learn representations from general to specific.

Currently, network clustering is performed within layers; we will attempt to cluster networks across layers. Besides, network clustering can be regarded as a form of automatic Neural Architecture Search (NAS) [25], and we are exploring opportunities to extend our NCMTL with NAS approaches.