Nicholas Choa 22517804
Aivan Eugene Francisco 66939519
CS 184A
Protein Structures: Predicting Secondary Structures
Abstract
This report examines the process of predicting the secondary structure labels of
proteins and their amino acids. Of the 57 features included for each amino acid, we used the
amino acid residues, whether a position was an N- or C-terminal, relative/absolute solvent
accessibility, and sequence profiles as the basis for our predictions. Each of the 57 features was
recorded for all 700 amino acid positions per sample. We utilized multiple learning algorithms in
the process of predicting the secondary structure labels. After identifying the most precise and
accurate learning algorithm, we put our predictive methods to the test by applying the
best-performing algorithm to the training, test, and validation datasets, which contained 5600,
272, and 256 samples respectively. We ran our datasets through our chosen training algorithms
and computed the prediction accuracy to assess their individual performances. The end goal of a
more than satisfactory prediction score was achieved through our implementations. The result is
a function that fits our training data to a machine learning classifier and takes a string of primary
amino acids as input, predicting a corresponding secondary structure string for the amino acid
sequence.
Introduction
Protein function is determined by structure, and structure is correlated with the particular
combination of amino acids in the sequence. The amino acids provide a strong clue to a protein's
three secondary structure classes: alpha helix, beta sheet, and coil. However, other factors, such
as solvent accessibility and the terminal location of the amino acid sequences (the N- and
C-terminals), also play a key role in determining the protein's structure. Given the vast number
of possible sequence combinations and the unpredictable external factors involved, machine
learning becomes imperative for accurately and efficiently predicting the structure of a protein.
Machine learning, the idea of teaching a machine how to interpret a specific kind of data, proves
useful for problems like predicting protein structure. Fields such as finance and computer vision
use machine learning heavily, whether to predict changes in stock prices given a set of parameter
values or to determine the formation of a shape given its current state. In a similar fashion,
machine learning is used to solve problems within bioinformatics, notably the complexity of a
protein's secondary structure.
An early attempt at predicting protein structure was the 1980s feed-forward neural network,
which consisted of three layers (input, middle, and output), each with a specific number of nodes
used to store the amino acid sequences and secondary structure labels (Kirschner 2008). Its
middle layers were connected via context nodes, which served as a liaison for feeding the
outputs (structures) back in, so that inputs (sequences) propagated forward from the input layer
matched up with the most probable output. This development improved structure prediction
accuracy from 60% to 70%.
Despite the advancements made by neural networks, certain training algorithms are more
meticulous and provide an effective way to regularize individual input features so that all of
them are weighed equally. Support vector machines (SVMs), for example, include a soft-margin
criterion to distinguish one feature from another (Gassend and O'Donnell 2006). As we discuss
later, we found the SVM approach too slow and inaccurate compared to the alternatives. Hidden
Markov models (HMMs), in another case, conveniently condense the number of parameters as
part of their implementation (Gassend and O'Donnell 2006). Such methods have since improved
overall accuracy and performance to approximately 80%. Through this project, we found that
these advancements in protein structure prediction have led to many important medical
breakthroughs (Guzzo 56).
Methods Used
As our first step, we read the amino acid sequences from the dataset file, where the batches of
amino acid sequences were split into training, testing, and validation datasets containing 5600,
272, and 256 samples respectively. We then reshaped them into two-dimensional matrices of
700 amino acids × 57 features. We also predefined a window of size 30 to determine how many
times to iterate over the dataset. The amino acids and features were placed in our training set X,
while training set Y contained the secondary structure labels, i.e., the target dataset. Training set
X consisted of amino acid residues, solvent accessibility values, and sequence profiles, all of
which were represented as binary values (0 or 1) in the matrix. Our Y training set likewise
contained binary values representing the eight structure labels (L, B, E, G, I, H, S, T) plus a
NoSeq padding label.
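To illustrate this preparation step, the sketch below shows one way it could look in Python with NumPy. The column offsets for the residues, labels, and remaining features are our assumptions about the dataset layout rather than confirmed details of our pipeline, and a tiny block of random data stands in for the real file so the snippet runs on its own.

```python
import numpy as np

# Tiny stand-in for the real file (the actual dataset would be loaded with
# something like np.load("dataset.npy") and holds 5600/272/256 samples);
# 10/3/3 toy proteins keep the sketch light enough to run anywhere.
rng = np.random.default_rng(0)
data = rng.random((16, 700 * 57))

# Reshape each flattened protein into 700 amino acids x 57 features.
data = data.reshape(-1, 700, 57)

# Sequential split into train/test/validation (5600/272/256 in the real data).
train, test, valid = data[:10], data[10:13], data[13:16]

# Assumed column layout: 0-21 one-hot amino acid residues, 22-30 one-hot
# structure labels (L, B, E, G, I, H, S, T, NoSeq), and the remaining columns
# the terminals, solvent accessibility, and sequence profiles.
X_train = np.concatenate([train[:, :, :22], train[:, :, 31:]], axis=2)
Y_train = train[:, :, 22:31]

print(X_train.shape, Y_train.shape)   # (10, 700, 48) (10, 700, 9)
```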
To train the data, we chose four learning algorithms. We began with Logistic Regression (LR).
We randomized the data by shuffling it through k-fold cross-validation (KFold), defining the
KFold size as the number of amino acid samples in training set X. We split (folded) the data 10
times and set the pseudo-random generator seed to 3 for shuffling. Afterwards, we assessed LR's
performance with a cross-validation score, passing in training set X, training set Y, and the
KFold object for partitioning and randomizing the data. The score reflected how accurately each
sample matched the predicted output, so our final score was the mean of the fold scores along
with their standard deviation.
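A minimal sketch of this evaluation, assuming scikit-learn's modern model_selection API, is shown below; the toy feature and label arrays are placeholders for a batch drawn from training sets X and Y.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy stand-ins for a batch of residue rows from training set X and their
# integer-coded structure labels (0-8) from training set Y.
rng = np.random.default_rng(0)
X = rng.random((250, 48))
y = rng.integers(0, 9, 250)

# 10 shuffled folds with seed 3, as described above.
kfold = KFold(n_splits=10, shuffle=True, random_state=3)
model = LogisticRegression(max_iter=1000)

# cross_val_score fits on 9 folds and scores accuracy on the held-out fold.
scores = cross_val_score(model, X, y, cv=kfold)
print("LR accuracy: %.4f +/- (%.3f)" % (scores.mean(), scores.std()))
```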
The three other models were the Gaussian Naive Bayes (GNB), the Decision Tree Classifier
(DT), and the Random Forest (RF) ensemble. As before, we shuffled the data using KFold over
the number of amino acid samples in training set X, with the number of folds set to 10 and the
pseudo-random generator seed to 7. We then scored each model against training sets X and Y,
partitioning and randomizing the data with the KFold object, and our final prediction score was
the overall average across the folds along with its standard deviation.
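The same cross-validation loop extends to the other three classifiers, as in the sketch below; the estimator settings (such as the number of trees) are illustrative defaults, not our exact configuration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Toy residue features and labels, as in the previous sketch.
rng = np.random.default_rng(1)
X = rng.random((250, 48))
y = rng.integers(0, 9, 250)

# 10 shuffled folds with seed 7, as described above.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
models = [("GNB", GaussianNB()),
          ("DT", DecisionTreeClassifier(random_state=7)),
          ("RF", RandomForestClassifier(n_estimators=100, random_state=7))]

# Report the mean fold accuracy and its standard deviation per model.
for name, model in models:
    scores = cross_val_score(model, X, y, cv=kfold)
    print("%s: %.4f +/- (%.3f)" % (name, scores.mean(), scores.std()))
```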
Specifically for the Random Forest method, we evaluated overall performance against the
validation sets. Choosing an initial window size of 100, we split the data into X and Y validation
sets. Using validation set X, we attempted to predict the structure labels and compared those
predictions with validation set Y. Afterwards, we computed the accuracy score, defined as the
number of correct predictions over the total number of samples. Additionally, we generated a
confusion matrix to break accuracy down for each of the structure labels.
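A sketch of this held-out evaluation is given below; the arrays are again synthetic placeholders for the real training and validation splits, and the forest's settings are assumed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-ins for the training split and the validation sets.
rng = np.random.default_rng(2)
X_train, y_train = rng.random((500, 48)), rng.integers(0, 9, 500)
X_valid, y_valid = rng.random((100, 48)), rng.integers(0, 9, 100)

rf = RandomForestClassifier(n_estimators=100, random_state=7)
rf.fit(X_train, y_train)

# Compare predictions on validation X against validation Y.
y_pred = rf.predict(X_valid)
print("Validation accuracy: %.4f" % accuracy_score(y_valid, y_pred))

# The confusion matrix breaks the errors down label by label.
print(confusion_matrix(y_valid, y_pred))
```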
Results and Findings
The table below shows the quantitative results of the various methods we trained for protein
structure prediction. We varied the batch size when looking at groups of proteins, as training on
the dataset as a whole, or in very large batches, resulted in extremely long wait times. With
logistic regression, we initially thought the output probabilities would be useful, but we realized
the sklearn methods could handle the prediction for us rather than our inspecting raw
probabilities ourselves. Naive Bayes seemed like a better alternative, but its accuracy eventually
dipped. We then settled on a decision-tree-based method, based on the average accuracies
reported both in class and online. After implementing a plain decision tree algorithm, accuracy
went up considerably. This uptick prompted us to try the Random Forest method mentioned
during presentations, which we found more intuitive for growing deeper decision trees for
deeper learning. While the variance would increase, we understood that the decisions would be
much less biased and would yield better performance. Random Forest also performed well in
processing time, as increasing the batch size did not significantly lengthen prediction and
training.
Method                 Batch Size   Time       Accuracy
Logistic Regression    250          00:02:34   0.7069 ± 0.014
Logistic Regression    1000         00:06:21   0.6961 ± 0.190
Naive-Bayes            250          00:03:21   0.6785 ± 0.014
Naive-Bayes            1000         00:05:56   0.6810 ± 0.015
Decision Tree          500          00:04:03   0.6870 ± 0.011
Random Forest          100          00:00:45   0.7511 ± 0.799
Random Forest          500          00:03:51   0.7503 ± 0.085
Random Forest          1000         00:07:33   0.7876 ± 0.040
The resulting graph below compares the predicted values with the actual validation data.
Despite a few minor differences, mainly among secondary structure labels 4 through 7, it clearly
shows what a 0.80 prediction accuracy entails: an almost psychic-like intuition for which label
matches each primary amino acid letter. From an analytical point of view, larger quantities of
sample data produce better prediction results, as the correctness of the predictions increases.
Further Discussion
In conclusion, our results show that modern training algorithms not only provide a more
efficient way to compute predictions of secondary structure labels but also achieve higher
accuracy. The Random Forest method proved to be the best-performing prediction model, with a
score of 0.7876 ± 0.040. Although our best training algorithm only reached an accuracy similar
to past models, we have a solid foundation to build on and improve. We could have devised
more specific and precise criteria for determining the output by studying and adding new
features associated with each particular amino acid, which could give a slight boost to the
prediction accuracy. Adding new features, however, could cause overfitting of the data. Overall,
the process was insightful, and we now know the specific tools and implementation needed to
predict secondary structure labels and compute their accuracy scores along the way.
Works Cited
Kirschner, Andreas. Prediction of Protein Structural Features by Machine Learning Methods.
Lehrstuhl für Genomorientierte Bioinformatik, 2008. Print.
Gassend, Blaise, and Charles O'Donnell. Predicting Secondary Structure of All-Helical
Proteins Using Hidden Markov Support Vector Machines. Computer Science and Artificial
Intelligence Laboratory, Massachusetts Institute of Technology, 2006. Print.
Guzzo, Anthony V. "The Influence of Amino Acid Sequence on Protein Structure." Biophysical
Journal. U.S. National Library of Medicine, Nov. 2009. Web. 11 Dec. 2016.
