1 Introduction

Around one-third of the proteins in a cell are found in its membrane, and approximately one-third of these proteins are involved in molecule transport [21]. Transmembrane transport proteins, also known as transporters, are required for cell metabolism, ion homeostasis, signal transduction, binding with small molecules in the extracellular space, immune recognition, energy transduction, and physiological and developmental processes [21].

Protein research has advanced our knowledge of human health and disease treatment. The decreasing cost of sequencing technology has enabled the generation of massive datasets of naturally occurring proteins with enough information to build sophisticated machine learning models of protein sequences [23].

Since proteins, like human languages, can be written as strings of characters, natural language processing (NLP) approaches can be applied to them [18]. Transformer neural networks (Transformers) have contributed significantly to the field of NLP [22]. Autoencoders such as BERT (Bidirectional Encoder Representations from Transformers) [9] are stacked encoder models trained by corrupting input tokens and attempting to recover the original sentence [11]. Although they can also generate text, they are typically used to produce vector representations for downstream tasks such as classification [11].

ProtTrans [10] is an adaptation of six Transformer architectures (Transformer-XL, BERT, ALBERT, XLNet, T5, and Electra) to the protein domain, pre-trained on massive collections of protein sequences from the UniProt Archive (UniParc) [14] and the Big Fantastic Database (BFD) [11, 13] comprising over 390 billion amino acids.

TooT-BERT-T proposes a method for discriminating transport proteins from non-transport proteins using representations from ProtBERT-BFD and Logistic Regression. Our contributions can be summarised as follows: 1) using ProtBERT-BFD to discriminate between transport and non-transport proteins for the first time; 2) evaluating frozen and fine-tuned ProtBERT-BFD representations; 3) evaluating frozen and fine-tuned MembraneBERT representations; 4) releasing TransporterBERT, a publicly accessible model pre-trained on the BFD database and fine-tuned on the transport proteins dataset (https://huggingface.co/ghazikhanihamed/TransporterBERT); and 5) proposing TooT-BERT-T, a method for classifying transport proteins that outperforms previously published approaches on this benchmark.

The remainder of the paper is organised as follows: Sect. 2 describes related work, Sect. 3 presents the dataset and experimental design used in this study, Sect. 4 compares and analyses the results of TooT-BERT-T, and Sect. 5 concludes the paper.

2 Related Work

Aplop and Butler [4, 5] provide a comprehensive overview of transport protein prediction methods. Earlier efforts used experimentally characterized databases to conduct homology searches for novel transporters. For example, TransATH [5] automates Saier’s protocol via sequence similarity. TransATH improves transmembrane segment computation by incorporating subcellular localization and reports an overall accuracy of 71.0%.

TrSSP (Transporter Substrate Specificity Prediction Service) [16] was developed to predict the substrate category of membrane transport proteins in an attempt to overcome the limitations of homology methods. The TrSSP tool predicts top-level transporters with accuracies of 78.99% and 80.00% and MCCs of 0.58 and 0.57 on the cross-validation and independent test sets, respectively.

SCMMTP [15] uses a novel scoring card method (SCM) based on the dipeptide composition of putative membrane transport proteins. SCMMTP begins with a 400-dipeptide scoring matrix, scoring each dipeptide by the difference between its composition in the positive and negative sets, and then optimizes the matrix with a genetic algorithm. SCMMTP achieved overall accuracies of 81.12% and 76.11% and MCCs of 0.62 and 0.47 on the training and independent datasets, respectively.

Nguyen et al. [17] characterize transporter protein sequences using a word-embedding technique, representing each protein by its word embeddings and the frequency of the protein’s biological terms. They achieved accurate results for transporter substrate specificity but not for transporter detection: the prediction accuracy for transporters was only 83.94% under cross-validation and 85.00% on the independent dataset.

In 2020, Alballa and Butler developed TooT-T [2], an ensemble technique that combines the results of two distinct approaches: homology annotation transfer and machine learning. BLAST searches the Transporter Classification Database (TCDB) [20] for homologs of a query protein, and the query is predicted to be a transporter if it meets three thresholds. TooT-T also computes three composition feature sets, each used to train an SVM model. Finally, a meta-model combines these predictions to assign the transport protein classification. They report accuracies of 90.07% and 92.22% and MCC values of 0.80 and 0.82 on the cross-validation and independent test sets, respectively. While incorporating multiple feature sets and classifiers improves the classification of transport proteins in TooT-T, it also increases the task’s complexity.

3 Materials and Methods

3.1 Dataset

This work utilizes the dataset from the TrSSP project [16], which can be accessed at https://www.zhaolab.org/TrSSP/. The dataset was created from the UniProt database [14], in which 10,780 transporter, carrier, and channel proteins were initially well characterized at the protein level with different substrate specificity annotations. Mishra et al. [16] removed from this benchmarking dataset fragmented sequences, sequences with more than two substrate specificities, and sequences whose biological function annotations were based only on sequence similarity. As presented in Table 1, the final dataset contains 1,560 protein sequences split into training and test sets. This dataset is referred to as DS-T, which stands for dataset for transporter proteins.

Table 1 DS-T: transport proteins dataset

3.2 Protein Sequence Representation

As multiple studies demonstrate, representation learning, a branch of machine learning in which the representation is estimated concurrently with the statistical model, is gaining traction in biology. Works [3, 6, 19] highlight how representations can assist in extracting crucial biological information from the millions of observations collected by modern sequencing technologies [8].

BERT (Bidirectional Encoder Representations from Transformers) [9] is a language model for natural language processing built from a multi-layer bidirectional Transformer encoder. Each encoder layer uses an attention mechanism to condition on both the left and right context and to process all words in a sentence in parallel. Each encoder layer comprises two sub-layers: multi-head self-attention and a feed-forward neural network. While encoding a specific word, the multi-head self-attention sub-layer allows the encoder to attend to the other words in the input sentence. Multi-head attention is computed from the scaled dot-product attention as follows [22]:

$$\begin{aligned} MultiHead(Q,K,V)&= Concat(head_1,...,head_n)W^O \end{aligned}$$
(1)
$$\begin{aligned} head_i&= Attention(QW^Q_i, KW^K_i, VW^V_i) \end{aligned}$$
(2)
$$\begin{aligned} Attention(Q,K,V)&= softmax(\frac{QK^T}{\sqrt{d_k}})V \end{aligned}$$
(3)

where Q (Query), K (Key), and V (Value) are different linear transformations of the input features, giving representations in different subspaces. The dimension of K is \(d_k\), and \(W_i^Q\), \(W_i^K\), \(W_i^V\), and \(W^O\) are learned weight matrices.
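
As an illustration, the following is a minimal NumPy sketch of the attention computation in Eqs. (1)–(3); the dimensions and random inputs are arbitrary choices for demonstration and are not taken from this work.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (3)."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

# Toy single-head example: 4 tokens with d_k = d_v = 8. Multi-head attention
# (Eqs. (1)-(2)) projects the inputs with per-head matrices W_i^Q, W_i^K, W_i^V,
# runs this attention for each head in parallel, concatenates the head outputs,
# and applies the output projection W^O.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)  # shape (4, 8)
```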

BERT is a two-step framework: pre-training and fine-tuning. Pre-training is training the model on a large amount of unlabeled data in an unsupervised manner. In contrast, fine-tuning is the process of initializing the model with the pre-trained parameters and fine-tuning all parameters using labeled data from downstream tasks via an additional classifier [9].

There are two ways to extract representations from a pre-trained BERT model: (i) frozen and (ii) fine-tuned. The former extracts features from the pre-trained model without updating its weights, whereas the latter first fine-tunes the model’s weights on a smaller task-specific dataset and then extracts the features [9].

ProtBERT-BFD [10] is a BERT model pre-trained on a large corpus of 2.5 billion protein sequences from the BFD database (https://bfd.mmseqs.com). MembraneBERT is ProtBERT-BFD fine-tuned on the TooT-M membrane proteins dataset [1] and is available at https://huggingface.co/ghazikhanihamed/MembraneBERT.

The representations from the final hidden layer of the ProtBERT-BFD and MembraneBERT models are combined with a mean-pooling strategy, which ProtTrans [10] found to be the optimal pooling method.
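
As a rough sketch of this extraction step, the snippet below mean-pools the final hidden layer with the HuggingFace transformers library and PyTorch, assuming the publicly released Rostlab/prot_bert_bfd checkpoint on the HuggingFace Hub; the checkpoint name and the toy sequence are assumptions for illustration, not taken from this work.

```python
import re
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint name for ProtBERT-BFD on the HuggingFace Hub.
MODEL = "Rostlab/prot_bert_bfd"
tokenizer = BertTokenizer.from_pretrained(MODEL, do_lower_case=False)
model = BertModel.from_pretrained(MODEL)
model.eval()

def embed(sequence: str) -> torch.Tensor:
    """Mean-pooled final-hidden-layer representation of one protein sequence."""
    # ProtBERT expects space-separated residues; rare amino acids are mapped to X.
    seq = " ".join(re.sub(r"[UZOB]", "X", sequence))
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (1, seq_len, 1024)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 1024)

vector = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")  # toy sequence
```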

3.3 Fine-Tuning a BERT Model

We fine-tune a BERT model by adding a classification layer and training the entire model on the transporters training set. We randomly chose 10% of the training samples as the validation set. During the fine-tuning phase, all weights initialized from pre-training are updated on the downstream task dataset. We fine-tuned the BERT models using the Trainer API from HuggingFace [24]. As this is a preliminary investigation of BERT’s role in transport protein analysis, we used the same hyperparameter settings as ProtTrans [10], except for the number of training epochs, which was determined empirically as 13 for ProtBERT-BFD and 10 for MembraneBERT; these values gave the maximum performance on the validation set. Additional fine-tuning hyperparameters, recommended and used in the ProtTrans project, are listed in Table 2.

Table 2 Fine-tuning ProtBERT-BFD and MembraneBERT hyperparameters
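
The following is a minimal sketch of this fine-tuning setup with the HuggingFace Trainer API, assuming the Rostlab/prot_bert_bfd checkpoint; the toy sequences, split, batch size, and output directory are placeholders for illustration, and the actual hyperparameter values are those listed in Table 2.

```python
from datasets import Dataset
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

MODEL = "Rostlab/prot_bert_bfd"  # assumed public checkpoint
tokenizer = BertTokenizer.from_pretrained(MODEL, do_lower_case=False)
model = BertForSequenceClassification.from_pretrained(MODEL, num_labels=2)

def tokenize(batch):
    # ProtBERT expects space-separated residues.
    seqs = [" ".join(s) for s in batch["sequence"]]
    return tokenizer(seqs, truncation=True, padding="max_length", max_length=512)

# Toy stand-in for DS-T: (sequence, label) pairs, 1 = transporter, 0 = non-transporter.
raw = Dataset.from_dict({
    "sequence": ["MKTAYIAKQR", "GAVLKVLTTG", "MSLLILVLCF", "AEKTKIVPRG"],
    "label": [1, 0, 1, 0],
})
splits = raw.map(tokenize, batched=True).train_test_split(test_size=0.25, seed=42)

args = TrainingArguments(
    output_dir="protbert_bfd_transporters",  # placeholder path
    num_train_epochs=13,                     # 13 for ProtBERT-BFD, 10 for MembraneBERT
    per_device_train_batch_size=1,           # placeholder; see Table 2 for actual values
    evaluation_strategy="epoch",
    logging_strategy="epoch",
)

Trainer(model=model, args=args,
        train_dataset=splits["train"], eval_dataset=splits["test"]).train()
```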

3.4 Logistic Regression

Logistic Regression is a widely used classification technique in medical and biological research [12]. We used the Logistic Regression implementation from the scikit-learn Python module (https://scikit-learn.org) with its default hyperparameters.
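
As a sketch of this step, the snippet below fits a default scikit-learn Logistic Regression on placeholder features standing in for the 1024-dimensional mean-pooled BERT representations; the random data and sample count are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder features standing in for the mean-pooled BERT representations
# (n_samples, 1024) and binary labels (1 = transporter, 0 = non-transporter).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 1024))
y_train = rng.integers(0, 2, size=200)

clf = LogisticRegression()  # default hyperparameters, as used in this study
cv_acc = cross_val_score(clf, X_train, y_train, cv=10, scoring="accuracy")  # 10-fold CV
clf.fit(X_train, y_train)
print(cv_acc.mean())
```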

3.5 Evaluation

A 10-fold cross-validation (CV) technique was used in this analysis to evaluate the model’s performance by partitioning the dataset into ten folds. For fine-tuning the BERT models, 10% of the training set was used as the validation set, while the remaining 90% was used for training. The independent test set is utilised solely for the final evaluation of the method.

3.6 Evaluation Metrics

Four key evaluation criteria are considered in this project: Sensitivity (Sen), Specificity (Spc), Accuracy (Acc), and MCC.

$$\begin{aligned} MCC = \frac{(TP\times TN) - (FP \times FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \end{aligned}$$
(4)

MCC stands for the Matthews correlation coefficient; it is a more stable assessment metric for imbalanced data [7].
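
For reference, the other three metrics follow their standard confusion-matrix definitions (stated here for completeness; they are standard and not specific to this work):

$$\begin{aligned} Sen = \frac{TP}{TP+FN}, \qquad Spc = \frac{TN}{TN+FP}, \qquad Acc = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$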

4 Results and Discussion

4.1 Fine-Tuning ProtBERT-BFD and MembraneBERT

We compared the representations of ProtBERT-BFD and MembraneBERT, both without (frozen) and with (fine-tuned) fine-tuning, on the DS-T dataset. Figure 1 visualises the effect of fine-tuning ProtBERT-BFD and MembraneBERT at each epoch.

Fig. 1 The effect of fine-tuning (This figure depicts the results of fine-tuning ProtBERT-BFD (left) and MembraneBERT (right) with accuracy and MCC metrics at each epoch on the validation set. The y-axis and x-axis display the scores and epochs, respectively)

Table 3 Logistic Regression performance with ProtBERT-BFD and MembraneBERT

As demonstrated, the ProtBERT-BFD representations improved with each epoch, increasing from an MCC of zero and 56% accuracy to an MCC of 0.77 and 87% accuracy on the validation set. ProtBERT-BFD outperforms MembraneBERT, indicating that a BERT model pre-trained on a more extensive set of protein sequences yields better representations and performance when fine-tuned on the downstream task. Additionally, ProtBERT-BFD performs better than MembraneBERT with both frozen and fine-tuned representations, with the exception of sensitivity for the frozen representation. Despite the high cost of fine-tuning the 420-million-parameter ProtBERT-BFD model, our results (Table 3) demonstrate that fine-tuning ProtBERT-BFD yields the best representation for transport protein prediction.

4.2 Logistic Regression with Fine-Tuned ProtBERT-BFD

We selected Logistic Regression as the binary classifier for this preliminary study because it is simple to implement and interpret, has been tested in the ProtTrans project, and produces competitive results [10]. Table 3 shows that Logistic Regression performs well with both fine-tuned ProtBERT-BFD and MembraneBERT representations; fine-tuned ProtBERT-BFD outperforms MembraneBERT on all independent test set results, while MembraneBERT achieves higher sensitivity, accuracy, and MCC in cross-validation.

4.3 Comparison of TooT-BERT-T with State-of-the-Art Models

Table 4 and Fig. 2 compare TooT-BERT-T with other published methods that use only the protein sequence on the same dataset. As shown, TooT-BERT-T outperforms the other published works in all evaluation metrics except sensitivity, where Nguyen et al. [17] achieve 100% sensitivity.

Table 4 Comparative performance of TooT-BERT-T with state-of-the-art
Fig. 2 Comparison of methodologies (bar graph of the evaluation metrics for each method; TooT-BERT-T achieves the highest MCC, accuracy, and specificity)

TooT-BERT-T has a greater specificity (true negative rate) than the approach of Nguyen et al. [17], indicating that it makes fewer false positive predictions (Fig. 3); it achieves a high true negative rate of 90% on non-transport proteins.

The proposed method, TooT-BERT-T, which employs fine-tuned ProtBERT-BFD representation and a Logistic Regression classifier using the dataset explained in Sect. 3.1, outperforms previous methods with an accuracy of 93.89% and an MCC of 0.86 on the independent test set.

The ProtBERT-BFD representation is effective because it captures the context of each amino acid within a protein sequence, whereas the other methods rely on static protein-encoding techniques.

Fig. 3 TooT-BERT-T confusion matrix (This figure summarises the performance of TooT-BERT-T, showing actual versus predicted labels, where T represents transport protein and non-T represents non-transport protein)

Figure 3 shows the confusion matrix of TooT-BERT-T for separating transport proteins from non-transport proteins. Although the number of errors is low, the model makes more mistakes when identifying non-transporters as transporters (six false positives) than when predicting transporters as non-transporters (five false negatives). This suggests that the proposed strategy is slightly skewed towards predicting the positive class (transport proteins), which can occur when the dataset is imbalanced, with more positive than negative samples.

5 Conclusion

TooT-BERT-T distinguishes transport proteins from non-transport proteins using the fine-tuned ProtBERT-BFD representation. The representations of two BERT models, ProtBERT-BFD and MembraneBERT, were compared using frozen and fine-tuned representations. The ProtBERT-BFD fine-tuned representation outperforms the MembraneBERT representation on the independent test set. The proposed method, TooT-BERT-T, which utilizes fine-tuned ProtBERT-BFD and Logistic Regression, achieves an accuracy of 93.89% and an MCC of 0.86 on the independent test set and outperforms other methods. Given that this study was a preliminary examination of the BERT representation’s performance in transport protein analysis, other classifiers such as SVM and CNN can be evaluated in the future.