Nicholas Choa 22517804
Aivan Eugene Francisco 66939519
CS 184A
Protein Structures: Predicting Secondary Structures
Abstract
This report examines the process of predicting the secondary structure labels of
proteins and their amino acids. Of the 57 features included for each amino acid, we used the
amino acid residues, whether a position was an N- or C-terminal, relative/absolute solvent
accessibility, and sequence profiles as the basis for our predictions. Each of the 57 features was
recorded for all 700 amino acid positions per sample. We utilized multiple learning algorithms in
the process of predicting the secondary structure labels. After identifying the most precise and
accurate learning algorithm, we put our predictive methods to the test by applying the
best-performing algorithm to the training, test, and validation datasets, which contained 5600,
272, and 256 samples respectively. We ran our datasets through our chosen training algorithms
and computed the prediction accuracy to assess their individual performances. The end goal of a
more than satisfactory prediction score was achieved through our implementations. The result is
a function that fits our training data to a machine learning classifier and takes a string of primary
amino acids as input, predicting a corresponding secondary structure string for the amino acid
sequence.
Introduction
Protein function is determined by structure, and structure is correlated with the particular
combination of amino acids in the sequence. The amino acids provide a strong clue to a protein's
three secondary structure classes: alpha helix, beta sheet, and coil. However, other factors, such
as solvent accessibility and the terminal location of the amino acid sequences (the N- and
C-terminals), also play a key role in determining the protein's structure. Given the vast number
of possible sequence combinations and the unpredictable external factors involved, machine
learning becomes imperative for accurately and efficiently predicting the structure of a protein.
Machine learning, the idea of teaching a machine how to interpret a specific kind of data, proves
useful for problems like predicting protein structure. Fields such as finance and computer vision
use machine learning heavily, whether to predict changes in stock prices given a set of parameter
values or to determine the formation of a shape given its current state. In a similar fashion,
machine learning is used to solve problems within bioinformatics, notably the complexity of a
protein's secondary structure.
An early attempt at predicting protein structure was the 1980s feed-forward neural network,
which consisted of three layers (input, middle, and output), each with a specific number of nodes
used to store the amino acid sequences and secondary structure labels (Kirschner 2008). Its
middle layers were connected via context nodes, which served as a liaison for feeding the
outputs (structures) back in, so that inputs (sequences) propagated forward from the input layer
matched up with the most probable output. This development improved structure prediction
accuracy from 60% to 70%.
Despite the advancements made by neural networks, certain training algorithms are more
meticulous and provide an effective way to regularize individual input features so that all of
them are weighed equally. Support vector machines (SVMs), for example, include a soft-margin
criterion to distinguish one feature from another (Gassend and O'Donnell 2006). As we discuss
later, we found the SVM approach too slow and inaccurate compared to the alternatives. Hidden
Markov models (HMMs), in another case, conveniently condense the number of parameters as
part of their implementation (Gassend and O'Donnell 2006). Such methods have since improved
overall accuracy and performance to approximately 80%. Through this project, we found that
these advancements in protein structure prediction have led to many important medical
breakthroughs (Guzzo 56).
Methods Used
As our first step, we read the amino acid sequences from the dataset file, where the batches of
amino acid sequences were split into training, testing, and validation datasets containing 5600,
272, and 256 samples respectively. We then reshaped them into two-dimensional matrices of
700 amino acids × 57 features. We also predefined a window of size 30 to determine how many
times to iterate over the dataset. The amino acids and features were placed in our training set X,
while training set Y contained the secondary structure labels, i.e., the target dataset. Training set
X consisted of amino acid residues, solvent accessibility values, and sequence profiles, all of
which were represented as binary values (0 or 1) in the matrix. Our Y training set likewise
contained binary values representing the eight structure labels (L, B, E, G, I, H, S, T) plus a
NoSeq padding label.
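To illustrate this preparation step, the sketch below shows one way it could look in Python with NumPy. The column offsets for the residues, labels, and remaining features are our assumptions about the dataset layout rather than confirmed details of our pipeline, and a tiny block of random data stands in for the real file so the snippet runs on its own.

```python
import numpy as np

# Tiny stand-in for the real file (the actual dataset would be loaded with
# something like np.load("dataset.npy") and holds 5600/272/256 samples);
# 10/3/3 toy proteins keep the sketch light enough to run anywhere.
rng = np.random.default_rng(0)
data = rng.random((16, 700 * 57))

# Reshape each flattened protein into 700 amino acids x 57 features.
data = data.reshape(-1, 700, 57)

# Sequential split into train/test/validation (5600/272/256 in the real data).
train, test, valid = data[:10], data[10:13], data[13:16]

# Assumed column layout: 0-21 one-hot amino acid residues, 22-30 one-hot
# structure labels (L, B, E, G, I, H, S, T, NoSeq), and the remaining columns
# the terminals, solvent accessibility, and sequence profiles.
X_train = np.concatenate([train[:, :, :22], train[:, :, 31:]], axis=2)
Y_train = train[:, :, 22:31]

print(X_train.shape, Y_train.shape)   # (10, 700, 48) (10, 700, 9)
```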
To train the data, we chose four learning algorithms. We began with Logistic Regression (LR).
We randomized the data by shuffling it through k-fold cross-validation (KFold), defining the
KFold size as the number of amino acid samples in training set X. We split (folded) the data 10
times and set the pseudo-random generator seed to 3 for shuffling. Afterwards, we assessed LR's
performance with a cross-validation score, passing in training set X, training set Y, and the
KFold object for partitioning and randomizing the data. The score reflected how accurately each
sample matched the predicted output, so our final score was the mean of the fold scores along
with their standard deviation.
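A minimal sketch of this evaluation, assuming scikit-learn's modern model_selection API, is shown below; the toy feature and label arrays are placeholders for a batch drawn from training sets X and Y.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy stand-ins for a batch of residue rows from training set X and their
# integer-coded structure labels (0-8) from training set Y.
rng = np.random.default_rng(0)
X = rng.random((250, 48))
y = rng.integers(0, 9, 250)

# 10 shuffled folds with seed 3, as described above.
kfold = KFold(n_splits=10, shuffle=True, random_state=3)
model = LogisticRegression(max_iter=1000)

# cross_val_score fits on 9 folds and scores accuracy on the held-out fold.
scores = cross_val_score(model, X, y, cv=kfold)
print("LR accuracy: %.4f +/- (%.3f)" % (scores.mean(), scores.std()))
```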
The three other models were the Gaussian Naive Bayes (GNB), the Decision Tree Classifier
(DT), and the Random Forest (RF) ensemble. As before, we shuffled the data using KFold over
the number of amino acid samples in training set X, with the number of folds set to 10 and the
pseudo-random generator seed to 7. We then scored each model against training sets X and Y,
partitioning and randomizing the data with the KFold object, and our final prediction score was
the overall average across the folds along with its standard deviation.
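The same cross-validation loop extends to the other three classifiers, as in the sketch below; the estimator settings (such as the number of trees) are illustrative defaults, not our exact configuration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Toy residue features and labels, as in the previous sketch.
rng = np.random.default_rng(1)
X = rng.random((250, 48))
y = rng.integers(0, 9, 250)

# 10 shuffled folds with seed 7, as described above.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
models = [("GNB", GaussianNB()),
          ("DT", DecisionTreeClassifier(random_state=7)),
          ("RF", RandomForestClassifier(n_estimators=100, random_state=7))]

# Report the mean fold accuracy and its standard deviation per model.
for name, model in models:
    scores = cross_val_score(model, X, y, cv=kfold)
    print("%s: %.4f +/- (%.3f)" % (name, scores.mean(), scores.std()))
```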
Specifically for the Random Forest method, we evaluated overall performance against the
validation sets. Choosing an initial window size of 100, we split the data into X and Y validation
sets. Using validation set X, we attempted to predict the structure labels and compared those
predictions with validation set Y. Afterwards, we computed the accuracy score, defined as the
number of correct predictions over the total number of samples. Additionally, we generated a
confusion matrix to break accuracy down for each of the structure labels.
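A sketch of this held-out evaluation is given below; the arrays are again synthetic placeholders for the real training and validation splits, and the forest's settings are assumed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-ins for the training split and the validation sets.
rng = np.random.default_rng(2)
X_train, y_train = rng.random((500, 48)), rng.integers(0, 9, 500)
X_valid, y_valid = rng.random((100, 48)), rng.integers(0, 9, 100)

rf = RandomForestClassifier(n_estimators=100, random_state=7)
rf.fit(X_train, y_train)

# Compare predictions on validation X against validation Y.
y_pred = rf.predict(X_valid)
print("Validation accuracy: %.4f" % accuracy_score(y_valid, y_pred))

# The confusion matrix breaks the errors down label by label.
print(confusion_matrix(y_valid, y_pred))
```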
Results and Findings
The table below shows the quantitative results of the various methods we trained for protein
structure prediction. We varied the batch size when looking at groups of proteins, as training on
the dataset as a whole, or in very large batches, resulted in extremely long wait times. With
logistic regression, we initially thought the output probabilities would be useful, but we realized
the sklearn methods could handle the prediction for us rather than our inspecting raw
probabilities ourselves. Naive Bayes seemed like a better alternative, but its accuracy eventually
dipped. We then settled on a decision-tree-based method, based on the average accuracies
reported both in class and online. After implementing a plain decision tree algorithm, accuracy
went up considerably. This uptick prompted us to try the Random Forest method mentioned
during presentations, which we found more intuitive for growing deeper decision trees for
deeper learning. While the variance would increase, we understood that the decisions would be
much less biased and would yield better performance. Random Forest also performed well in
processing time, as increasing the batch size did not significantly lengthen prediction and
training.
Method                 Batch Size   Time       Accuracy
Logistic Regression    250          00:02:34   0.7069 ± 0.014
Logistic Regression    1000         00:06:21   0.6961 ± 0.190
Naive-Bayes            250          00:03:21   0.6785 ± 0.014
Naive-Bayes            1000         00:05:56   0.6810 ± 0.015
Decision Tree          500          00:04:03   0.6870 ± 0.011
Random Forest          100          00:00:45   0.7511 ± 0.799
Random Forest          500          00:03:51   0.7503 ± 0.085
Random Forest          1000         00:07:33   0.7876 ± 0.040
The resulting graph below compares the predicted values with the actual validation data.
Despite a few minor differences, mainly among secondary structure labels 4 through 7, it clearly
shows what a 0.80 prediction accuracy entails: an almost psychic-like intuition for which label
matches each primary amino acid letter. From an analytical point of view, larger quantities of
sample data produce better prediction results, as the correctness of the predictions increases.
Further Discussion
In conclusion, our results show that modern training algorithms not only provide a more
efficient way to compute predictions of secondary structure labels but also achieve higher
accuracy. The Random Forest method proved to be the best-performing prediction model, with a
score of 0.7876 ± 0.040. Although our best training algorithm only reached an accuracy similar
to past models, we have a solid foundation to build on and improve. We could have devised
more specific and precise criteria for determining the output by studying and adding new
features associated with each particular amino acid, which could give a slight boost to the
prediction accuracy. Adding new features, however, could cause overfitting of the data. Overall,
the process was insightful, and we now know the specific tools and implementation needed to
predict secondary structure labels and compute their accuracy scores along the way.
Works Cited
Kirschner, Andreas. Prediction of Protein Structural Features by Machine Learning Methods.
Lehrstuhl für Genomorientierte Bioinformatik, 2008. Print.
Gassend, Blaise, and Charles O'Donnell. Predicting Secondary Structure of All-Helical
Proteins Using Hidden Markov Support Vector Machines. Computer Science and Artificial
Intelligence Laboratory, Massachusetts Institute of Technology, 2006. Print.
Guzzo, Anthony V. "The Influence of Amino Acid Sequence on Protein Structure." Biophysical
Journal. U.S. National Library of Medicine, Nov. 2009. Web. 11 Dec. 2016.
