Abstract
The Multi-Task Learning (MTL) technique has been widely studied. The majority of current MTL studies adopt the hard parameter sharing structure, where the hard layers tend to learn general representations over all tasks and the specific layers are prone to learn specific representations for each task. Since the specific layers directly follow the hard layers, the MTL model also has to cope with this abrupt change from general to specific representations. To alleviate this problem, we introduce the novel cluster layer, which groups tasks into clusters during training. In a cluster layer, the tasks in the same cluster are further required to share the same network. In this way, the cluster layer produces a general representation within each cluster, while producing relatively specific representations across different clusters. The cluster layers are used as transitions between the hard layers and the specific layers, so the MTL model can learn representations that shift gradually from general to specific. We evaluate our model on MTL document classification, and the results demonstrate that the cluster layer is quite effective in MTL.
1 Introduction
Different from single-task learning, Multi-Task Learning (MTL) trains a unified model over several tasks simultaneously [1]. Recently, MTL, particularly in combination with Deep Learning (DL), has led to many successes in Natural Language Processing (NLP) and Computer Vision (CV) [2, 3].
There are mainly two network structures in current DL-based MTL approaches, i.e., the soft structure and the hard structure, as shown in Fig. 1 [4]. The soft structure requires that each task has its own parameters. In the soft layers, the parameters of different tasks are regularized to learn the relationships between tasks; in the specific layers, the parameters are specific to each task. This structure allows the model to learn its own feature representations for each task and to adapt to the differences between tasks, which can improve the overall performance. However, the larger number of parameters may make training more complex. The hard structure shares the same parameters among different tasks, so all tasks are trained and predicted with the same model weights. It forces the MTL model to first learn a unified representation over all tasks; specific layers for each task then follow the hard layers. This structure can accelerate training because the number of parameters is small, and tasks can promote each other by sharing knowledge. However, the feature representations required by the individual tasks may differ, which can lower the performance on certain tasks. Compared with the soft structure, the hard structure significantly reduces the model size and the risk of overfitting, so it is more popular in industry [5]. In the hard structure, the hard layers are expected to learn general representations over all tasks, while the specific layers are expected to learn specific representations for each task. Since the specific layers directly follow the hard layers, the MTL model has to estimate the general representations first and then the specific representations immediately afterwards. To a certain degree, this abrupt change is difficult for the MTL model to handle without any transition. An efficient MTL network should consider the shared and the task-specific parts simultaneously: it needs to learn the general representations across tasks to avoid overfitting and the specific representations of each task to avoid underfitting.
To alleviate this problem, we propose a novel parameter sharing mechanism in this paper. To a certain degree, related tasks tend to learn similar parameters in an MTL model [4]. Inspired by this idea, we present a novel neural layer, i.e., the cluster layer. As shown in Fig. 2a, the MTL model first estimates the specific parameters for each task, and similar tasks tend to learn similar parameters (drawn with similar colors). Then, we perform network clustering to group the tasks with similar parameters into clusters. For the tasks in one cluster, their estimated parameters are further replaced with the same parameters (i.e., the parameter center of all tasks in this cluster), as shown in Fig. 2b. This cluster sharing mechanism has three main advantages. First, like the hard layers, the cluster layer learns general representations for all tasks within one cluster; like the specific layers, it learns specific representations across clusters. The cluster layer is thus regarded as a transition from the hard layers to the specific layers, and the MTL model can learn representations that shift gradually from general to specific. Second, the cluster layer significantly reduces the parameter size. The parameter size of a specific layer for MTL is \(N\) times larger than that for a single task (\(N\) is the number of tasks). Replaced with a cluster layer, the parameter size shrinks to \(K/N\) of that of a specific layer (\(K\) is the number of clusters). This greatly benefits massive multi-task learning (MMTL) [6], where MTL confronts the learning problem over tens or hundreds of tasks (e.g., drug discovery [7]), since the parameters scale up rapidly as the number of tasks increases. Finally, this cluster sharing mechanism acts as a regularization of the MTL model, which helps reduce the risk of overfitting. In this paper, we also present the learning algorithm of the Network Clustering MTL (NCMTL) model, which alternately performs model estimation and network clustering during training.
Experiments are conducted on multi-task document classification with public datasets, and the results show that the cluster layer is quite effective in MTL. The main contributions of this paper are summarized as follows: 1) To the best of our knowledge, we are the first to present the cluster layer for MTL. In this way, the MTL model gradually learns representations from general to specific. 2) We present the NCMTL approach, which alternately performs network clustering and parameter estimation; this idea may be applied to similar scenarios.
2 Related Work
Multi-Task Learning (MTL) aims to train a unified model over several tasks simultaneously [1]. In NLP and CV, MTL, especially with Deep Learning, has recently led to many successes [2, 3]. Thus, we focus only on DL-based MTL approaches in this paper.
Ruder summarizes the DL-based MTL network structures, which are divided into two categories, i.e., the soft parameter sharing structure and the hard parameter sharing structure [4]. For example, the soft structure is selected in [8] to regularize the parameters of all models, while the hard structure is adopted in [9] to share the convolutional backbone network over all tasks. Sarkar et al. develop an MTL model that utilizes the latent embedding space of words and topics for prediction, achieves information sharing among related tasks through message passing mechanisms, and is trained with active learning to address the lack of standardized fine-grained label data for the MTL task [10]. Specifically for MTL text classification, two mechanisms (i.e., an external memory and a reading/writing communication mechanism) are introduced to share task information [11, 12], and Lu et al. introduce a hybrid representation-learning network [13]. Liu et al. employ an adversarial mechanism to overcome the task differences in [14]. The meta-knowledge from different tasks is extracted and used for MTL learning in [15]. There are many studies on MTL document classification [2]. Tan et al. propose an MTL-based framework, which utilizes deep neural networks to model the correlation between tasks to improve the overall performance of sentiment analysis [16]. Gawron et al. create a multi-task multilingual model for the following text classification tasks: functional style, domain, readability, and sentiment [17]. Luo et al. propose a text-guided multi-task learning network for multimodal sentiment analysis [18]. Ameur et al. propose deep multi-task learning for image/video distortion identification [19]. Owing to limited space, we focus only on the MTL network structure. These methods mainly adopt the hard structure, in which the hard and specific layers tend to learn general and specific representations, respectively. In this paper, we present the cluster layers, which serve as a transition from the hard layers to the specific layers for MTL and learn representations from general to specific.
3 Network Clustering MTL
The multi-task learning problem is mathematically formulated as follows: given a series of tasks \({\text{T}} = \left\{ {T_{1} ,T_{2} , \ldots ,T_{N} } \right\}\), we aim to learn a unified mapping function \(Y = f(X,\theta )\), which outputs the labels \(Y_{i}\) given the inputs \(X_{i}\) of task \(T_{i}\). In our Network Clustering MTL (NCMTL) model, \(\theta\) is written as \(\left\{ {W_{H} ,W_{C} ,W_{S} } \right\}\), which denotes the parameters of the hard, cluster and specific neural networks, respectively. The cluster layers are placed between the hard layers and the specific layers, as shown in Fig. 2b.
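For concreteness, a minimal sketch of how \(\theta = \left\{ {W_{H} ,W_{C} ,W_{S} } \right\}\) could be organized in a PyTorch-style implementation (the module types and dimensions are illustrative assumptions, not the authors' exact architecture):

```python
import torch
import torch.nn as nn

class NCMTL(nn.Module):
    """Skeleton of the NCMTL parameter layout: hard, cluster and specific parts."""

    def __init__(self, n_tasks, in_dim, backbone_dim=64,
                 cluster_dims=(32, 32, 16), n_classes=2):
        super().__init__()
        # W_H: hard layers shared by all tasks (a simple stand-in backbone here;
        # the experiments use stacked word- and sentence-level LSTMs).
        self.hard = nn.Sequential(nn.Linear(in_dim, backbone_dim), nn.ReLU())
        # W_C: cluster layers -- one weight set per task and per layer; tasks in
        # the same cluster will be forced to share that cluster's parameter centre.
        dims = [backbone_dim] + list(cluster_dims)
        self.cluster = nn.ModuleList([
            nn.ModuleList([nn.Linear(dims[i], dims[i + 1]) for _ in range(n_tasks)])
            for i in range(len(cluster_dims))
        ])
        # W_S: one specific output layer per task.
        self.specific = nn.ModuleList(
            [nn.Linear(cluster_dims[-1], n_classes) for _ in range(n_tasks)])

    def forward(self, x, task_id):
        h = self.hard(x)
        for layer in self.cluster:
            h = torch.relu(layer[task_id](h))
        return self.specific[task_id](h)
```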
For the hard layers \(W_{H}\), recurrent and/or convolutional neural networks can be adopted as the backbone networks. However, the hard layers are not the key interest of this paper; given the limited space, we focus on the cluster layers.
For the cluster layers \(W_{C}\), we incorporate \(L\) cluster layers into the NCMTL model. Taking the cluster layer \(l_{i}\) as an example, we first assume that there exist specific weights \(w^{\prime}_{ij}\) for task \(T_{j}\). Since similar tasks tend to learn similar parameters, all tasks are grouped into \(K_{i}\) clusters according to the similarities of their specific weights \(w^{\prime}_{i*}\) (* denotes all tasks, and \(K_{i}\) is the predefined cluster number of this cluster layer), and we obtain the parameter center of each cluster. Let \(c_{ik}\) be the parameter center of the \(k^{th}\) cluster; the original specific weights of the tasks in this cluster are replaced with \(c_{ik}\). In this way, the tasks in one cluster share the same parameters. Compared with the hard layer, which produces the same representation for all tasks, the cluster layer produces the same representation only for a group of tasks. When we gradually increase the cluster numbers of the cluster layers, the NCMTL model is prone to learn features from general to specific. It is important to highlight that \(W_{C}\) are updated only during NCMTL training and are unrelated to the clustering itself.
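As an illustration, a minimal sketch of this clustering-and-sharing step for one cluster layer, built on scikit-learn's k-means (the helper name, shapes and toy numbers are ours, not the authors' implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_share(task_weights, n_clusters):
    """Group the per-task weight vectors of one cluster layer and replace each
    task's weights with its cluster's parameter centre (hard-cluster mode).

    task_weights: array of shape (n_tasks, n_params), i.e. the flattened w'_{i*}.
    Returns (shared_weights, assignments)."""
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10)
    assignments = km.fit_predict(task_weights)   # hard assignments, one cluster per task
    centers = km.cluster_centers_                # the parameter centres c_{ik}
    shared = centers[assignments]                # every task now carries its centre
    return shared, assignments

# Toy usage: 14 tasks, a 32x32 weight matrix per task, 3 clusters.
weights = np.random.randn(14, 32 * 32)
shared, assignments = cluster_and_share(weights, n_clusters=3)
```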
To save memory, the final NCMTL model does not need to store the specific weights \(w^{\prime}_{i*}\); we only keep the parameter centers and the cluster assignments. Since the \(K_{i}\) parameter centers replace the specific weights of all \(N\) tasks, the parameter size of the cluster layer is reduced to \(K_{i} /N\) of that of a specific layer.
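As a concrete illustration with the experimental setting of Sect. 4 (\(N = 14\) tasks), a cluster layer with \(K_{i} = 3\) stores only 3 parameter centers plus a 14-entry assignment table, i.e., roughly \(3/14 \approx 21\%\) of the parameters of a fully task-specific layer.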
In practice, the clustering procedure is performed during the training stages. As shown in Fig. 3, at the \((t - 1)^{th}\) training batch, we update \(w^{\prime}_{ij}\) with the parameter center \(c_{ik}\) of task \(j\). Then, clustering is carried out on the updated \(w^{\prime}_{i*}\) to form the new cluster centers \(c_{ik}\) for the next, \(t^{th}\), training batch. Under the hard cluster mode (i.e., one task is grouped into only one cluster in a cluster layer), the clustering loss function is presented by
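(the displayed equation is reconstructed below from the surrounding definitions and takes the familiar k-means form; the original may weight the terms differently)

\({\mathcal{L}}_{c} \left( \theta \right) = \sum\nolimits_{i = 1}^{L} {\sum\nolimits_{j = 1}^{N} {\sum\nolimits_{k = 1}^{{K_{i} }} {r_{ijk} \left\| {w^{\prime}_{ij} - c_{ik} } \right\|_{2}^{2} } } }\)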
where \(r_{ijk}\) are the cluster assignments, indicating that task \(T_{j}\) is grouped into the cluster with center \(c_{ik}\) in the cluster layer \(l_{i}\).
The specific layers predict the final probabilities \(Y_{j}\) given the input \(X_{j}\) of the task \(T_{j}\). This is a standard classification problem. With the expected labels \(\hat{Y}_{j}\), the classification loss function is defined by
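(the displayed equation is again a reconstruction; a standard cross-entropy form is assumed, with the summation over training examples left implicit)

\({\mathcal{L}}_{cls} \left( \theta \right) = - \sum\nolimits_{j = 1}^{N} {\hat{Y}_{j} \log Y_{j} }\)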
In sum, the loss function of our NCMTL model is written as
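(reconstructed from the role of \(\alpha\) described below)

\({\mathcal{L}}\left( \theta \right) = {\mathcal{L}}_{cls} \left( \theta \right) + \alpha {\mathcal{L}}_{c} \left( \theta \right)\)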
where \(\alpha\) balances the classification loss and the clustering loss. To some extent, \({\mathcal{L}}_{c} \left( \theta \right)\) acts as a regularization term for the NCMTL model. We alternately perform model estimation and network clustering, as shown in Algorithm 1.
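A rough code sketch of one alternating step (in the spirit of Algorithm 1, not the authors' exact procedure), reusing the hypothetical NCMTL module and cluster_and_share helper above; for brevity, the \(\alpha\)-weighted clustering penalty is not added to the loss, and the hard replacement of each task's cluster-layer weights by its cluster centre is applied directly after the gradient update:

```python
import torch
import torch.nn.functional as F

def flatten_task_weights(layer):
    """Stack the flattened parameters of every per-task module in one cluster layer."""
    return torch.stack([torch.cat([p.reshape(-1) for p in m.parameters()]) for m in layer])

def train_step(model, task_batches, optimizer, cluster_numbers=(3, 5, 10)):
    """One alternating step: gradient-based model estimation, then network clustering
    that overwrites each task's cluster-layer weights with its cluster centre."""
    loss = 0.0
    for x, y, task_id in task_batches:               # one mini-batch per task
        loss = loss + F.cross_entropy(model(x, task_id), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():                            # network clustering (no gradients)
        for layer, k in zip(model.cluster, cluster_numbers):
            w = flatten_task_weights(layer).cpu().numpy()
            shared, _ = cluster_and_share(w, n_clusters=k)   # k-means++ step from above
            for task_idx, module in enumerate(layer):
                offset = 0
                for p in module.parameters():        # copy the centre back in place
                    n = p.numel()
                    centre = torch.as_tensor(shared[task_idx, offset:offset + n],
                                             dtype=p.dtype).view_as(p)
                    p.copy_(centre)
                    offset += n
```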
4 Experiments
4.1 Datasets
In this paper, our NCMTL model is evaluated on MTL document classification. Fourteen Amazon product review datasets are selected from the domains {“apparel”, “baby”, “books”, “camera”, “DVD”, “electronic”, “health”, “kitchen”, “magazines”, “music”, “software”, “sports”, “toys”, “video”} [20]. These datasets consist of document-level reviews, and the goal is to predict the sentiment (i.e., positive or negative) of each document. In total, we collect 27,755 documents across all 14 domains. Each dataset contains a vocabulary of about 37 K words on average, and each document contains 115 words on average. Each dataset is split equally into ten sets, nine of which are used for training and the remaining one for testing. Accuracy is employed to evaluate the performance of our NCMTL model.
Our proposed approach is compared with the following baselines, including both single-task learning approaches and multi-task learning approaches.
- LSTM: The standard LSTM-based neural network for single-task document classification [21].
- TextCNN: The standard convolutional neural network for single-task document classification [22].
- HAN: The hierarchical sentence-level and document-level attention network for single-task document classification [23].
- SoftShare: The soft parameter sharing approach for MTL document classification [8]. For each task, we adopt a network structure like that of the HAN approach in the soft layers. The soft layers of different tasks are regularized with the \(\ell_{2}\) norm. The specific layers are used for the classification of each task.
- HardShare: The hard parameter sharing approach for MTL document classification [9]. In the hard layers, we adopt the same network configuration as the soft layers of the SoftShare approach. The specific layers in the HardShare approach are the same as those of the SoftShare approach.
- ASP-MTL: The hard parameter sharing approach for MTL document classification [14]. An adversarial task is introduced to alleviate the feature interaction between tasks.
- NCMTL: Our cluster-layer-based MTL approach. Different from the HardShare approach, network clustering is performed on all specific layers except the last one. k-means++ [24] is adopted for the network clustering of each training batch.
Implementation details For a fair comparison, we reimplement all the above approaches with nearly the same hyperparameter settings. The word embeddings are initialized with pre-trained GloVe vectors, as in [2], and the dimension of the word embeddings is 200. In these experiments, we focus on the comparison of the cluster layers rather than on manually designed, dedicated network structures; thus, we adopt a relatively simple network structure as the base model. Two word-level and two sentence-level LSTM layers are adopted as the hard layers. We adopt three cluster layers with the hidden sizes {32, 32, 16} for each task and the cluster numbers {3, 5, 10}. Finally, one specific layer is used for the prediction of each task. The approaches are trained on a Tesla V100 GPU with the Adam optimizer (learning rate 1e−5), and the batch size is set to 32.
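For reference, the reported settings collected into a single configuration sketch (the dictionary and its field names are ours, not part of the original implementation):

```python
NCMTL_CONFIG = {
    "embedding_dim": 200,                      # pre-trained GloVe vectors
    "hard_layers": {"word_lstm": 2, "sentence_lstm": 2},
    "cluster_hidden_sizes": [32, 32, 16],      # three cluster layers
    "cluster_numbers": [3, 5, 10],
    "specific_layers_per_task": 1,
    "optimizer": "Adam",
    "learning_rate": 1e-5,
    "batch_size": 32,
    "n_tasks": 14,
}
```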
4.2 Main Results
We run all approaches three times and report the average accuracy in Table 1. The NCMTL approach achieves a significant improvement over all baseline approaches except the ASP-MTL approach. Thus, the experimental results illustrate the effectiveness of the cluster layer.
We show the cluster results of some cluster layers of our final NCMTL model in Fig. 4. To a certain degree, the cluster results verify our earlier hypothesis that similar tasks are prone to learn similar parameters. For example, “camera”, “electronic” and “software” are related to the electronics category, and these tasks fall into the same cluster. Meanwhile, we find that the clusters in cluster layer 1 are more general than those in cluster layer 2, e.g., the cluster {“apparel”, “baby”, “books”, “magazines”} in cluster layer 1 compared with the clusters {“apparel”, “baby”} and {“books”, “magazines”} in cluster layer 2. To a certain degree, this verifies our motivation that the cluster layers gradually learn representations from general to specific. We also trace the change of the clustering results over the training iterations and find that they are relatively stable, i.e., certain tasks (e.g., the “music” and “video” classification tasks) are always grouped into the same cluster after a few training iterations. To accelerate training, we freeze the cluster results after four training epochs.
4.3 Ablation Studies
In this section, we conduct ablation experiments to evaluate the effects of the cluster number, the embedding size and the number of training epochs.
Effect of cluster number The cluster numbers of the cluster layers are investigated first. The cluster numbers of Layer 1 and Layer 2 are each chosen from {1, 3, 5, 8, 10}. The experimental results in Fig. 5 suggest that the best results are achieved when the cluster numbers of Layer 1 and Layer 2 are set to <3, 5> or <5, 8>. When the cluster number of Layer 1 is smaller than that of Layer 2, improved outcomes are more likely. As the depth of the model increases, the NCMTL approach possibly pays more attention to the diversity of the tasks and therefore requires a larger number of clusters.
Effect of embedding size The embedding sizes of cluster layers 1–3 are compared in this section; they are set to <8, 8, 4>, <16, 16, 8>, <32, 32, 16>, <64, 64, 32>, <128, 128, 64> and <256, 256, 128>. As shown in Fig. 6, the NCMTL approach performs best on average when the embedding sizes are set to <32, 32, 16> and <64, 64, 32>. Meanwhile, the NCMTL approach is unable to capture the connections among the tasks when the embedding size is set to <8, 8, 4>, and there is a significant improvement when the embedding size increases from <8, 8, 4> to <16, 16, 8>. When the embedding size increases beyond <128, 128, 64>, the performance decreases slightly, which may hint at overfitting.
Effect of training epochs The accuracies on the various datasets during training are illustrated in Fig. 7. All tasks converge smoothly. There is a sharp performance improvement across the tasks during the first 10 training epochs, and most tasks start to converge gradually after 15 epochs.
5 Conclusion
In this paper, we propose the novel cluster layer to address MTL problems. The cluster layer automatically groups the network parameters of similar tasks. It not only reduces the model size but also learns representations that shift gradually from general to specific. The experimental results show that the proposed NCMTL is quite effective in multi-task document classification, and the cluster results verify that the cluster layers learn representations from general to specific.
Currently, network clustering is performed within layers; we will attempt to cluster networks across layers. Besides, network clustering can be regarded as a method of automatic Neural Architecture Search (NAS) [25], and we are exploring opportunities to extend our NCMTL with NAS approaches.
Data Availability
No datasets were generated or analysed during the current study.
References
Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75
Liu P, Fu J, Dong Y et al (2018) Multi-task learning over graph structures. arXiv preprint arXiv:1811.10211
Zhang T, Ghanem B, Liu S, Ahuja N (2012) Robust visual tracking via multi-task sparse learning. In: IEEE conference on computer vision and pattern recognition, pp. 2042–2049
Ruder S (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098
Caruana R (1993) Multitask learning: A knowledge-based source of inductive bias. In: Proceedings of the 10th international conference on machine learning, pp. 41–48
Ratner A, Hancock B, Dunnmon J, Sala F, Pandey S, Ré C (2019) Training complex models with multi-task weak supervision. Proc AAAI Conf Artif Intell 33:4763–4771
Ramsundar B, Kearnes S, Riley P, Webster D, Konerding D, Pande V (2015) Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072
Yang Y, Hospedales TM (2016) Trace norm regularised deep multi-task learning. arXiv preprint arXiv:1606.04038
Long M, Cao Z, Wang J, Yu PS (2017) Learning multiple tasks with multilinear relationship networks. Advances in neural information processing systems 30
Sarkar S, Alhamadani A, Alkulaib L, Lu CT (2022) Predicting depression and anxiety on reddit: a multi-task learning approach. In: IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM), pp 427–435
Liu P, Qiu X, Huang X (2016) Deep multi-task learning with shared memory. arXiv preprint arXiv:1609.07222
Liu P, Qiu X, Huang X (2016) Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101
Lu G, Gan J, Yin J, Luo Z, Li B, Zhao X (2020) Multi-task learning using a hybrid representation for text classification. Neural Comput & Applic 32:6467–6480. https://doi.org/10.1007/s00521-018-3934-y
Liu P, Qiu X, Huang X (2017) Adversarial multi-task learning for text classification. arXiv preprint arXiv:1704.05742
Chen J, Qiu X, Liu P, Huang X (2018) Meta multi-task learning for sequence modeling. In: Proceedings of the AAAI conference on artificial intelligence, 32(1)
Tan YY, Chow CO, Kanesan J, Chuah JH, Lim Y (2023) Sentiment analysis and sarcasm detection using deep multi-task learning. Wireless Pers Commun 129:2213–2237. https://doi.org/10.1007/s11277-023-10235-4
Gawron K, Pogoda M, Ropiak N, Swędrowski M, Kocoń J (2021) Deep neural language-agnostic multi-task text classifier. In: IEEE International conference on data mining workshops (ICDMW), pp 136–142
Luo Y, Wu R, Liu J, Tang X (2023) A text guided multi-task learning network for multimodal sentiment analysis. Neurocomputing 560:126836. https://doi.org/10.1016/j.neucom.2023.126836
Ameur Z, Fezza SA, Hamidouche W (2022) Deep multi-task learning for image/video distortions identification. Neural Comput & Applic 34:21607–21623. https://doi.org/10.1007/s00521-021-06576-5
Blitzer J, Dredze M, Pereira F (2007) Biographies, bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In: Proceedings of the 45th annual meeting of the association of computational linguistics, pp 440–447
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Kim Y (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882
Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E (2016) Hierarchical attention networks for document classification. In: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 1480–1489
Arthur D, Vassilvitskii S (2007) k-means++: The advantages of careful seeding. In: Proceedings of the 18th Annual ACM-SIAM symposium on discrete algorithms, pp 1027–1035
Zoph B, Le QV (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by ZM. ZM wrote the main manuscript text and DG commented and revised on it. SG prepared figures and table. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
We declare that there is no conflict of interests regarding the publication of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Mu, Z., Gao, D. & Guo, S. Network Clustering for Multi-task Learning. Neural Process Lett 57, 4 (2025). https://doi.org/10.1007/s11063-024-11712-y