1 Introduction

Around one-third of the proteins in a cell are found in its membrane, and approximately one-third of these proteins are involved in molecule transport [21]. Transmembrane transport proteins, also known as transporters, are required for cell metabolism, ion homeostasis, signal transduction, binding with small molecules in the extracellular space, immune recognition, energy transduction, and physiological and developmental processes [21].

Protein research has advanced our knowledge of human health and disease treatment. The decreasing cost of sequencing technology has enabled the generation of massive datasets of naturally occurring proteins with enough information to build sophisticated machine learning models of protein sequences [23].

Since proteins, like human languages, can be written as strings of characters, natural language processing (NLP) approaches can be applied to them [18]. Transformer neural networks (Transformers) have contributed significantly to the field of NLP [22]. Autoencoders such as BERT (Bidirectional Encoder Representations from Transformers) [9] are stacked encoder models trained by corrupting input tokens and attempting to recover the original sentence [11]. Although they can also generate text, they are typically used to produce vector representations for downstream tasks such as classification [11].

ProtTrans [10] is an adaptation of six Transformer architectures (Transformer-XL, BERT, ALBERT, XLNet, T5, and Electra) to the protein domain, pre-trained on massive collections of protein sequences from the UniProt Archive (UniParc) [14] and the Big Fantastic Database (BFD) [11, 13] comprising over 390 billion amino acids.

TooT-BERT-T proposes a method for discriminating transport proteins from non-transport proteins using representations from ProtBERT-BFD and Logistic Regression. Our contributions can be summarised as follows: 1) using ProtBERT-BFD to discriminate between transport and non-transport proteins for the first time; 2) evaluating frozen and fine-tuned ProtBERT-BFD representations; 3) evaluating frozen and fine-tuned MembraneBERT representations; 4) releasing TransporterBERT, a publicly accessible model pre-trained on the BFD database and fine-tuned on the transport proteins dataset (https://huggingface.co/ghazikhanihamed/TransporterBERT); and 5) proposing TooT-BERT-T, a method for classifying transport proteins that outperforms previously published approaches on this benchmark.

The remainder of the paper is organised as follows: Sect. 2 describes related work, Sect. 3 presents the dataset and experimental design used in this study, Sect. 4 compares and analyses the results of TooT-BERT-T, and Sect. 5 concludes the paper.

2 Related Work

Aplop and Butler [4, 5] provide a comprehensive overview of transport protein prediction methods. Earlier efforts used experimentally characterized databases to conduct homology searches for novel transporters. For example, TransATH [5] automates Saier’s protocol via sequence similarity. TransATH improves transmembrane segment computation by incorporating subcellular localization and reports an overall accuracy of 71.0%.

TrSSP (Transporter Substrate Specificity Prediction Service) [16] was developed to predict the substrate category of membrane transport proteins in an attempt to overcome the limitations of homology methods. The TrSSP tool predicts top-level transporters with accuracies of 78.99% and 80.00% and MCCs of 0.58 and 0.57 on the cross-validation and independent test sets, respectively.

SCMMTP [15] uses a novel scoring card method (SCM) based on the dipeptide composition of putative membrane transport proteins. SCMMTP begins with a 400-dipeptide scoring matrix, scoring each dipeptide by the difference between its composition in the positive and negative sets, and then optimizes the matrix with a genetic algorithm. SCMMTP achieved overall accuracies of 81.12% and 76.11% and MCCs of 0.62 and 0.47 on the training and independent datasets, respectively.

Nguyen et al. [17] characterize transporter protein sequences using a word-embedding technique, representing each protein by its word embeddings and the frequency of the protein’s biological terms. They achieved accurate results for transporter substrate specificity but not for transporter detection: the prediction accuracy for transporters was only 83.94% under cross-validation and 85.00% on the independent dataset.

In 2020, Alballa and Butler developed TooT-T [2], an ensemble technique that combines the results of two distinct approaches: homology annotation transfer and machine learning. BLAST searches the Transporter Classification Database (TCDB) [20] for homologs of a query protein, and the query is predicted to be a transporter if it meets three thresholds. TooT-T also computes three composition feature sets, each used to train an SVM model. Finally, a meta-model combines these predictions to assign the transport protein classification. They report accuracies of 90.07% and 92.22% and MCC values of 0.80 and 0.82 on the cross-validation and independent test sets, respectively. While incorporating multiple feature sets and classifiers improves the classification of transport proteins in TooT-T, it also increases the task’s complexity.

3 Materials and Methods

3.1 Dataset

This work utilizes the dataset from the TrSSP project [16], which can be accessed at https://www.zhaolab.org/TrSSP/. The dataset was created from the UniProt database [14], in which 10,780 transporter, carrier, and channel proteins were initially well characterized at the protein level with different substrate specificity annotations. Mishra et al. [16] removed from this benchmarking dataset fragmented sequences, sequences with more than two substrate specificities, and sequences whose biological function annotations were based only on sequence similarity. As presented in Table 1, the final dataset contains 1,560 protein sequences split into training and test sets. This dataset is referred to as DS-T, which stands for dataset for transporter proteins.

Table 1 DS-T: transport proteins dataset

3.2 Protein Sequence Representation

As multiple studies demonstrate, representation learning, a branch of machine learning in which the representation is estimated concurrently with the statistical model, is gaining traction in biology. Works [3, 6, 19] highlight how representations can assist in extracting crucial biological information from the millions of observations collected by modern sequencing technologies [8].

BERT (Bidirectional Encoder Representations from Transformers) [9] is a language model for natural language processing built from a multi-layer bidirectional Transformer encoder. Each encoder layer uses an attention mechanism to condition on both the left and right context and to process all words in a sentence in parallel. Each encoder layer comprises two sub-layers: multi-head self-attention and a feed-forward neural network. While encoding a specific word, the multi-head self-attention sub-layer allows the encoder to attend to the other words in the input sentence. Multi-head attention is computed from the scaled dot-product attention as follows [22]:

$$\begin{aligned} MultiHead(Q,K,V)&= Concat(head_1,...,head_n)W^O \end{aligned}$$
(1)
$$\begin{aligned} head_i&= Attention(QW^Q_i, KW^K_i, VW^V_i) \end{aligned}$$
(2)
$$\begin{aligned} Attention(Q,K,V)&= softmax(\frac{QK^T}{\sqrt{d_k}})V \end{aligned}$$
(3)

where Q (Query), K (Key), and V (Value) are different linear transformations of the input features, giving representations in different subspaces. The dimension of K is \(d_k\), and \(W_i^Q\), \(W_i^K\), \(W_i^V\), and \(W^O\) are learned weight matrices.
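
As an illustration, the following is a minimal NumPy sketch of the attention computation in Eqs. (1)–(3); the dimensions and random inputs are arbitrary choices for demonstration and are not taken from this work.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (3)."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

# Toy single-head example: 4 tokens with d_k = d_v = 8. Multi-head attention
# (Eqs. (1)-(2)) projects the inputs with per-head matrices W_i^Q, W_i^K, W_i^V,
# runs this attention for each head in parallel, concatenates the head outputs,
# and applies the output projection W^O.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)  # shape (4, 8)
```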

BERT is a two-step framework: pre-training and fine-tuning. Pre-training is training the model on a large amount of unlabeled data in an unsupervised manner. In contrast, fine-tuning is the process of initializing the model with the pre-trained parameters and fine-tuning all parameters using labeled data from downstream tasks via an additional classifier [9].

There are two ways to extract representations from a pre-trained BERT model: (i) frozen and (ii) fine-tuned. The former extracts features from the pre-trained model without updating its weights, whereas the latter first fine-tunes the model’s weights on a smaller task-specific dataset and then extracts the features [9].

ProtBERT-BFD [10] is a BERT model pre-trained on a large corpus of 2.5 billion protein sequences from the BFD database (https://bfd.mmseqs.com). MembraneBERT is ProtBERT-BFD fine-tuned on the TooT-M membrane proteins dataset [1] and is available at https://huggingface.co/ghazikhanihamed/MembraneBERT.

The representations from the final hidden layer of the ProtBERT-BFD and MembraneBERT models are combined with a mean-pooling strategy, which ProtTrans [10] found to be the optimal pooling method.
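
As a rough sketch of this extraction step, the snippet below mean-pools the final hidden layer with the HuggingFace transformers library and PyTorch, assuming the publicly released Rostlab/prot_bert_bfd checkpoint on the HuggingFace Hub; the checkpoint name and the toy sequence are assumptions for illustration, not taken from this work.

```python
import re
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint name for ProtBERT-BFD on the HuggingFace Hub.
MODEL = "Rostlab/prot_bert_bfd"
tokenizer = BertTokenizer.from_pretrained(MODEL, do_lower_case=False)
model = BertModel.from_pretrained(MODEL)
model.eval()

def embed(sequence: str) -> torch.Tensor:
    """Mean-pooled final-hidden-layer representation of one protein sequence."""
    # ProtBERT expects space-separated residues; rare amino acids are mapped to X.
    seq = " ".join(re.sub(r"[UZOB]", "X", sequence))
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (1, seq_len, 1024)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 1024)

vector = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")  # toy sequence
```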

3.3 Fine-Tuning a BERT Model

We fine-tune a BERT model by adding a classification layer and training the entire model on the transporters training set. We randomly chose 10% of the training samples as the validation set. During the fine-tuning phase, all weights initialized from pre-training are updated on the downstream task dataset. We fine-tuned the BERT models using the Trainer API from HuggingFace [24]. As this is a preliminary investigation of BERT’s role in transport protein analysis, we used the same hyperparameter settings as ProtTrans [10], except for the number of training epochs, which was determined empirically as 13 for ProtBERT-BFD and 10 for MembraneBERT; these values gave the maximum performance on the validation set. Additional fine-tuning hyperparameters, recommended and used in the ProtTrans project, are listed in Table 2.

Table 2 Fine-tuning ProtBERT-BFD and MembraneBERT hyperparameters
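
The following is a minimal sketch of this fine-tuning setup with the HuggingFace Trainer API, assuming the Rostlab/prot_bert_bfd checkpoint; the toy sequences, split, batch size, and output directory are placeholders for illustration, and the actual hyperparameter values are those listed in Table 2.

```python
from datasets import Dataset
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

MODEL = "Rostlab/prot_bert_bfd"  # assumed public checkpoint
tokenizer = BertTokenizer.from_pretrained(MODEL, do_lower_case=False)
model = BertForSequenceClassification.from_pretrained(MODEL, num_labels=2)

def tokenize(batch):
    # ProtBERT expects space-separated residues.
    seqs = [" ".join(s) for s in batch["sequence"]]
    return tokenizer(seqs, truncation=True, padding="max_length", max_length=512)

# Toy stand-in for DS-T: (sequence, label) pairs, 1 = transporter, 0 = non-transporter.
raw = Dataset.from_dict({
    "sequence": ["MKTAYIAKQR", "GAVLKVLTTG", "MSLLILVLCF", "AEKTKIVPRG"],
    "label": [1, 0, 1, 0],
})
splits = raw.map(tokenize, batched=True).train_test_split(test_size=0.25, seed=42)

args = TrainingArguments(
    output_dir="protbert_bfd_transporters",  # placeholder path
    num_train_epochs=13,                     # 13 for ProtBERT-BFD, 10 for MembraneBERT
    per_device_train_batch_size=1,           # placeholder; see Table 2 for actual values
    evaluation_strategy="epoch",
    logging_strategy="epoch",
)

Trainer(model=model, args=args,
        train_dataset=splits["train"], eval_dataset=splits["test"]).train()
```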

3.4 Logistic Regression

Logistic Regression is a widely used classification technique in medical and biological research [12]. We used the Logistic Regression implementation from the scikit-learn Python module (https://scikit-learn.org) with its default hyperparameters.
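
As a sketch of this step, the snippet below fits a default scikit-learn Logistic Regression on placeholder features standing in for the 1024-dimensional mean-pooled BERT representations; the random data and sample count are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder features standing in for the mean-pooled BERT representations
# (n_samples, 1024) and binary labels (1 = transporter, 0 = non-transporter).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 1024))
y_train = rng.integers(0, 2, size=200)

clf = LogisticRegression()  # default hyperparameters, as used in this study
cv_acc = cross_val_score(clf, X_train, y_train, cv=10, scoring="accuracy")  # 10-fold CV
clf.fit(X_train, y_train)
print(cv_acc.mean())
```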

3.5 Evaluation

A 10-fold cross-validation (CV) technique was used in this analysis to evaluate the model’s performance by partitioning the dataset into ten folds. For fine-tuning the BERT models, 10% of the training set was used as the validation set, while the remaining 90% was used for training. The independent test set is utilised solely for the final evaluation of the method.

3.6 Evaluation Metrics

Four key evaluation criteria are considered in this project: Sensitivity (Sen), Specificity (Spc), Accuracy (Acc), and MCC.

$$\begin{aligned} MCC = \frac{(TP\times TN) - (FP \times FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \end{aligned}$$
(4)

MCC stands for the Matthews correlation coefficient; it is a more stable assessment metric for imbalanced data [7].
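
For reference, the other three metrics follow their standard confusion-matrix definitions (stated here for completeness; they are standard and not specific to this work):

$$\begin{aligned} Sen = \frac{TP}{TP+FN}, \qquad Spc = \frac{TN}{TN+FP}, \qquad Acc = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$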

4 Results and Discussion

4.1 Fine-Tuning ProtBERT-BFD and MembraneBERT

We compared the representations of ProtBERT-BFD and MembraneBERT, both without (frozen) and with (fine-tuned) fine-tuning, on the DS-T dataset. Figure 1 visualises the effect of fine-tuning ProtBERT-BFD and MembraneBERT at each epoch.

Fig. 1 The effect of fine-tuning (This figure depicts the results of fine-tuning ProtBERT-BFD (left) and MembraneBERT (right) with accuracy and MCC metrics at each epoch on the validation set. The y-axis and x-axis display the scores and epochs, respectively)

Table 3 Logistic Regression performance with ProtBERT-BFD and MembraneBERT

As demonstrated, the ProtBERT-BFD representations improved with each epoch, increasing from an MCC of zero and 56% accuracy to an MCC of 0.77 and 87% accuracy on the validation set. ProtBERT-BFD outperforms MembraneBERT, indicating that a BERT model pre-trained on a more extensive set of protein sequences yields better representations and performance when fine-tuned on the downstream task. Additionally, ProtBERT-BFD performs better than MembraneBERT with both frozen and fine-tuned representations, with the exception of sensitivity for the frozen representation. Despite the high cost of fine-tuning the 420-million-parameter ProtBERT-BFD model, our results (Table 3) demonstrate that fine-tuning ProtBERT-BFD yields the best representation for transport protein prediction.

4.2 Logistic Regression with Fine-Tuned ProtBERT-BFD

We selected Logistic Regression as the binary classifier for this preliminary study because it is simple to implement and interpret, has been tested in the ProtTrans project, and produces competitive results [10]. Table 3 shows that Logistic Regression performs well with both fine-tuned ProtBERT-BFD and MembraneBERT representations; fine-tuned ProtBERT-BFD outperforms MembraneBERT on all independent test set results, while MembraneBERT achieves higher sensitivity, accuracy, and MCC in cross-validation.

4.3 Comparison of TooT-BERT-T with State-of-the-Art Models

Table 4 and Fig. 2 compare TooT-BERT-T with other published methods that use only the protein sequence on the same dataset. As shown, TooT-BERT-T outperforms the other published works in all evaluation metrics except sensitivity, where Nguyen et al. [17] achieve 100% sensitivity.

Table 4 Comparative performance of TooT-BERT-T with state-of-the-art
Fig. 2 Comparison of methodologies (bar graph of the evaluation metrics for each method; TooT-BERT-T achieves the highest MCC, accuracy, and specificity)

TooT-BERT-T has a greater specificity (true negative rate) than the approach of Nguyen et al. [17], indicating that it makes fewer false positive predictions (Fig. 3); it achieves a high true negative rate of 90% on non-transport proteins.

The proposed method, TooT-BERT-T, which employs fine-tuned ProtBERT-BFD representation and a Logistic Regression classifier using the dataset explained in Sect. 3.1, outperforms previous methods with an accuracy of 93.89% and an MCC of 0.86 on the independent test set.

The ProtBERT-BFD representation is effective because it captures the context of each amino acid within a protein sequence, whereas the other methods rely on static protein-encoding techniques.

Fig. 3 TooT-BERT-T confusion matrix (This figure summarises the performance of TooT-BERT-T, showing actual versus predicted labels, where T represents transport protein and non-T represents non-transport protein)

Figure 3 shows the confusion matrix of TooT-BERT-T for separating transport proteins from non-transport proteins. Although the number of errors is low, the model makes more mistakes when identifying non-transporters as transporters (six false positives) than when predicting transporters as non-transporters (five false negatives). This suggests that the proposed strategy is slightly skewed towards predicting the positive class (transport proteins), which can occur when the dataset is imbalanced, with more positive than negative samples.

5 Conclusion

TooT-BERT-T distinguishes transport proteins from non-transport proteins using the fine-tuned ProtBERT-BFD representation. The representations of two BERT models, ProtBERT-BFD and MembraneBERT, were compared using frozen and fine-tuned representations. The ProtBERT-BFD fine-tuned representation outperforms the MembraneBERT representation on the independent test set. The proposed method, TooT-BERT-T, which utilizes fine-tuned ProtBERT-BFD and Logistic Regression, achieves an accuracy of 93.89% and an MCC of 0.86 on the independent test set and outperforms other methods. Given that this study was a preliminary examination of the BERT representation’s performance in transport protein analysis, other classifiers such as SVM and CNN can be evaluated in the future.