International Journal of Computer Science, Engineering and Information Technology (IJCSEIT), Vol.1, No.3, August 2011
DOI: 10.5121/ijcseit.2011.1301
HMM’S INTERPOLATION OF PROTEINS FOR
PROFILE ANALYSIS
Er. Neeshu Sharma¹, Er. Dinesh Kumar², Er. Reet Kamal Kaur³
¹ Department of Computer Science & Engineering, PTU RIMT-MAEC, Mandi Gobindgarh
neeshukhn@yahoo.com
² Department of Computer Science & Engineering, PTU, DAVIET, Jallandhar
er.dineshk@gmail.com
³ Department of Computer Science & Engineering, PTU RIMT-MAEC, Mandi Gobindgarh
reetkamal1901@yahoo.co.in
ABSTRACT
Hidden Markov models (HMMs) have found applications in a wide range of fields, and applying them to
biological sequences has its own advantages. Being more systematic and specific, HMMs can yield better
results than consensus techniques. Profile HMMs use position-specific scoring for the matching and
substitution of a residue and for the opening or extension of a gap. HMMs apply a statistical method to
estimate the true frequency of a residue at a given position in the alignment from its observed frequency,
while standard profiles use the observed frequency itself to assign the score for that residue. This means
that a profile HMM derived from only 10 to 20 aligned sequences can be of equivalent quality to a standard
profile created from 40 to 50 aligned sequences.
KEYWORDS:
Sequence alignment, Profile analysis, HMM, Profile HMM.
1. INTRODUCTION
Proteins are complex organic compounds that consist of amino acids joined by peptide bonds.
Proteins are essential to the structure and function of all living cells and viruses. Many proteins
function as enzymes or form subunits of enzymes. Some proteins play structural or mechanical
roles. Some proteins function in the immune response and in the storage and transport of various
ligands. Proteins serve as nutrients as well; they provide the organism with the amino acids that
it cannot synthesize itself. Proteins are amongst the most actively studied molecules in
biochemistry; the name "protein" was coined by the Swedish chemist Jöns Jakob Berzelius in 1838.
An amino acid is any molecule that contains both an amino group and a carboxylic acid group.
An amino acid residue is what remains of an amino acid after it forms a peptide bond and loses a
water molecule. Since we are interested in amino acids that form proteins, it is safe to use the
terms residue and amino acid interchangeably. There are 20 standard amino acids that
form proteins.
Fig 1: Structure of Amino Acid
2. PROFILE ANALYSIS
Profile analysis is a sequence comparison method for finding and aligning distantly related
sequences. The comparison allows a new sequence to be aligned optimally to a family of similar
sequences. The comparison uses a scoring matrix, such as a PAM matrix, and an existing optimal
alignment of two or more similar protein sequences. The group or family of similar sequences is
first aligned to create a multiple sequence alignment.[16] The information in the multiple
sequence alignment is then represented quantitatively as a table of position-specific symbol
comparison values and gap penalties. This table is called a profile.
The starting point for the creation of a profile is a sequence or group of aligned sequences. This
probe is generally a group of functionally related proteins that have been aligned. A profile,
however, can be created from a single sequence. The similarity of new sequences to an existing
profile can be tested by comparing each new sequence to the profile with the same algorithm used
to make optimal alignments. To understand how this is done we must first recall what alignment
algorithms do. Alignment algorithms find alignments between two sequences that maximize the
number of matches and minimize the number of gaps. Gaps are given penalties in the same units
as the values in the scoring matrix. The best alignment is then simply defined as the alignment for
which the sum of the scoring matrix values minus the gap penalties is maximal. Each row in the
profile corresponds to a position in the original multiple sequence alignment. Each possible
sequence symbol has a value (a column) in each row of the profile. The comparison of a sequence
symbol to any row of the profile defines a specific value or "profile comparison value." The best
alignments of a sequence to a profile are found by aligning the symbols of the sequence to the
profile in such a way that the sum of the profile comparison values minus the gap penalties is
maximal. The profile also contains gap coefficients that are specific for each position so the
penalty for inserting a gap in one part of the alignment might be more or less than in another part.
The position-specific gap coefficients penalize gaps in conserved regions more heavily than gaps
in more variable regions.[16] The profile contains a consensus sequence for the display of
alignments of other sequences to the profile. The consensus sequence character corresponds to the
highest value in the row. Since the table on which the profile is based is usually the Dayhoff
evolutionary distance table, the consensus residue is the residue that has the smallest evolutionary
distance from all of the residues in that position of the alignment rather than simply the most
frequent residue at that position. In the original approach of Dayhoff the actual estimation is
restricted to only very closely related pairs of sequences. However, once a Markov model is fitted
to this data, replacement frequencies characteristic for distantly related sequences can be
extrapolated from the model.
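The scoring rule described above (the sum of scoring-matrix values minus the gap penalties, maximized over all alignments) can be illustrated with a small dynamic program. This is a generic Needleman-Wunsch-style sketch, not the profile software discussed in this paper. The function names, the toy scoring function (+1 for a match, -1 for a mismatch) and the linear gap penalty are assumptions made for illustration; a real analysis would substitute a matrix such as PAM and position-specific gap coefficients.

```python
def global_alignment_score(a, b, score, gap):
    """Best global alignment score of strings a and b.

    score(x, y) returns the scoring-matrix value for aligning x with y;
    gap is the (positive) penalty subtracted for each gap position.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # best[i][j] = best score for aligning a[:i] with b[:j]
    best = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        best[i][0] = -gap * i          # a[:i] aligned against gaps only
    for j in range(1, cols):
        best[0][j] = -gap * j          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            best[i][j] = max(
                best[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align two symbols
                best[i - 1][j] - gap,                            # gap in b
                best[i][j - 1] - gap,                            # gap in a
            )
    return best[-1][-1]


def toy_score(x, y):
    # Hypothetical scoring: +1 for a match, -1 for a mismatch.
    return 1 if x == y else -1
```

With these toy values, aligning `"ACD"` with `"AD"` is best done by opening one gap opposite `C`, since two matches minus one gap penalty beats any alignment that forces a mismatch.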
For example, the table of values for a profile that is 25 amino acids long will have 25 rows of 20
scores, each score in a row being the score for matching one of the 20 amino acids at that position.
If a sequence 100 amino acids in length is to be searched, each 25-amino-acid stretch of the sequence
will be examined: 1-25, 2-26, ..., 76-100. The first 25-amino-acid stretch is evaluated using the
profile scores for the amino acids in that stretch, then the next 25-amino-acid stretch, and so on.
The highest-scoring stretch is the one most similar to the profile.
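The window count in this example follows from simple arithmetic, written out below for an ungapped comparison. The symbols n (length of the searched sequence), w (length of the profile), x (the searched sequence) and M (the profile table) are notation introduced here for illustration and do not appear in the original text.

```latex
\[
  \text{number of windows} = n - w + 1 = 100 - 25 + 1 = 76
\]
\[
  S(k) = \sum_{j=1}^{w} M_{j,\,x_{k+j-1}}, \qquad k = 1, \dots, n - w + 1
\]
```

Here S(k) is the score of the window that starts at position k, and the stretch most similar to the profile is the one whose k maximizes S(k).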
The profile method differs in two major respects from methods of sequence comparison in
common use:
• Any number of known sequences can be used to construct the profile, allowing more information
to be used in the testing of the target than is possible with pairwise alignment methods.
• The profile includes the penalties for insertion or deletion at each position, which allow one to
include the probe secondary structure in the testing scheme.
2.1 Techniques for Profile Analysis:
2.1.1 Protein Microarrays: Protein microarrays consist of antibodies, proteins, protein
fragments, peptides or carbohydrate elements that are immobilized in a grid-like pattern on a
glass surface. The arrayed molecules are then used to screen and assess protein interaction
patterns with samples containing distinct proteins.[17]
Fig 2: Protein Microarrays
These microarrays are used to identify protein-protein interactions, to identify the substrates of
proteins or to identify the targets of biologically active small molecules. As their use grows, so
does the need for bioinformatics tools to analyze the microarray data.
2.1.2 Protein Amino Acid Sequences: The analysis of amino acid sequences, or primary
structure, of proteins provides the foundation for many other types of protein studies. The primary
structure ultimately determines how proteins fold into functional 3D structures. Primary structure
is used in multiple sequence alignment studies to determine the evolutionary relationships
between proteins, and to determine relationships between structure and function in related
proteins.
Fig 3: Protein Amino Acid Sequences
2.1.3 Protein-Ligand Docking: In drug discovery and development, the manner in which
small-molecule compounds bind or dock with proteins is of the utmost importance. Proteins are
often the main targets for new drugs, and many drug compounds are small molecules that are
designed to bind preferentially to specific proteins. Because of this need to design small
molecules for protein docking, many bioinformatics tools exist for the analysis of protein-ligand
interactions. These tools often fall in the category of computational chemistry. At the atomic
scales at which compounds dock with proteins, the interactions are biochemical and biophysical
in nature.[17]
2.1.4 Protein Folds: Although there is no universal agreement on how to define protein folds,
one simple characterization of folds is “an arrangement of secondary structures into a unique
tertiary structure.” That is, protein amino acid sequences arrange themselves in recognizable,
identifiable 3D structures. Some of these structures are so common in many different proteins
that they are given special names, such as Rossmann folds and TIM barrels.[17]
2.2 Role of Profile Analysis:
Typical scenarios of a profiling approach become relevant, particularly, in the cases of the first
two groups, where researchers commonly wish to combine information derived from several
sources about a single query or target sequence. For example, users might use the sequence
alignment and search tool BLAST to identify homologs of their gene of interest in other species,
and then use these results to locate a solved protein structure for one of the homologs. Similarly,
they might also want to know the likely secondary structure of the mRNA encoding the gene of
interest, or whether a company sells a DNA construct containing the gene. Sequence profiling
tools serve to automate and integrate the process of seeking such disparate information by
rendering the process of searching several different external databases transparent to the user.
Advantages of sequence profiling tools include the ability to use several of these specialized
tools in a single query and present the output with a common interface, the ability to direct the
output of one set of tools or database searches into the input of another, and the capacity to
disseminate hosting and compilation obligations to a network of research groups and institutions
rather than a single centralized repository.
3. HIDDEN MARKOV MODEL (HMM)
Hidden Markov models are sophisticated and flexible statistical tools for the study of protein
models. Using HMMs to analyze proteins is part of a new scientific field called bioinformatics,
which draws on computer science, statistics and molecular biology. Hidden
Markov models (HMMs) offer a more systematic approach to estimating model parameters. The
HMM is a dynamic kind of statistical profile. Like an ordinary profile, it is built by analyzing the
distribution of amino acids in a training set of related proteins. However, an HMM has a more
complex topology than a profile. It can be visualized as a finite state machine. Finite state
machines typically move through a series of states and produce some kind of output either when
the machine has reached a particular state or when it is moving from state to state. A Markov
model is a statistical model of a system that changes state step by step, characterized by the
property that each change depends only on the current state. HMMs are
hidden because only the symbols emitted by the system are observable, not the underlying walk
between states[15]. HMMs are the Legos of computational sequence analysis.
A hidden Markov model M is defined by:
• a set of states X; the states of X are “hidden” states
• a set A of transition probabilities between the states, an |X| x |X| matrix, where
aij ≡ P(Xj | Xi) is the probability of going from state i to state j
• an alphabet Σ of symbols emitted in states of X
• a set E of emission probabilities, an |X| x |Σ| matrix, where ei(b) ≡ P(b | Xi) is the
probability that b is emitted in state i. (Emissions are sometimes called observations.)[1]
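As a minimal sketch of this definition, the snippet below stores A and E as nested dictionaries and computes the joint probability of one state path together with one emitted sequence. The two-state model, its state names (`H`, `L`), the start distribution and every numeric value are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
# Hypothetical two-state HMM over the alphabet {"a", "b"}.
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}                  # assumed initial distribution
A = {"H": {"H": 0.9, "L": 0.1},               # A[i][j] = P(X_j | X_i)
     "L": {"H": 0.2, "L": 0.8}}
E = {"H": {"a": 0.7, "b": 0.3},               # E[i][b] = P(b | X_i)
     "L": {"a": 0.4, "b": 0.6}}


def joint_probability(path, symbols):
    """P(path, symbols): product of start, transition and emission terms."""
    if len(path) != len(symbols):
        raise ValueError("path and symbols must have the same length")
    if not path:
        return 1.0
    p = start[path[0]] * E[path[0]][symbols[0]]
    for prev, cur, sym in zip(path, path[1:], symbols[1:]):
        p *= A[prev][cur] * E[cur][sym]
    return p
```

Each row of `A` and of `E` sums to 1, as required of probability distributions. In a real application only `symbols` would be observed; the path is the hidden part that has to be inferred.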
It is important to note that in most cases of HMM use in bioinformatics a fictitious inversion
occurs between causes and effects when dealing with emissions. For example, one can synthesize
a (known) polymer sequence that can have different (unknown) features along the sequence. In an
HMM one must choose as emissions the monomers of the sequence, because they are the only
known data, and as internal states the features to be estimated. In this way, one hypothesizes that
the sequence is the effect and the features are the cause, while obviously the reverse is true. An
excellent case is provided by the polypeptides, for which it is just the amino acid sequence that
causes the secondary structures, while in an HMM the amino acids are assumed as emissions and
the secondary structures are assumed as internal states. States “emit” certain symbols according to
these probabilities.
3.1 Major Applications of HMM in Bioinformatics
HMMs are in general well suited to natural language processing; they were initially
employed in speech recognition and later in optical character recognition and melody
classification. In bioinformatics, many algorithms based on HMMs have been applied to
biological sequence analysis, such as gene finding and protein family characterization.
A detailed description of all applications would be, in our opinion, outside the scope and the size
of a normal survey paper. Nevertheless, in order to give a feeling of how the models described in
the first part are implemented in real-life bioinformatics problems, we shall describe in more
detail, in what follows, a single application, i.e. the use, for multiple sequence alignment, of the
profile HMM, which is a powerful, simple, and very popular algorithm, especially suited to this
purpose.[13]
3.2 Profile HMM
Profile HMMs use position specific scoring for the matching & substitution of a residue and for
the opening or extension of a gap. Profile hidden Markov models (HMMs) have several
advantages over standard profiles. Profile HMMs have a formal probabilistic basis and have a
consistent theory behind gap and insertion scores, in contrast to standard profile methods which
use heuristic methods. HMMs apply a statistical method to estimate the true frequency of a
residue at a given position in the alignment from its observed frequency while standard profiles
use the observed frequency itself to assign the score for that residue. This means that a profile
HMM derived from only 10 to 20 aligned sequences can be of equivalent quality to a standard
profile created from 40 to 50 aligned sequences. [14] In general, producing good profile HMMs
requires less skill and manual intervention than producing good standard profiles. A profile HMM
has several types of probabilities associated with it. One type is the transition probability -- the
probability of transitioning from one state to another. In a simple ungapped model, the probability
of a transition from one match state to the next match state is 1.0 and the path through the model
is strictly linear, moving from the match state of node n to the match state of node n+1.
There are also emissions probabilities associated with each match state, based on the probability
of a given residue existing at that position in the alignment. For example, for a fairly well
conserved column in a protein alignment, the emissions probability for the most common amino
acid may be 0.81, while for each of the other 19 amino acids it may be 0.01. If you follow a path
through the model to generate a sequence consistent with the model, the probability of any
sequence that is generated depends on the transition and emissions probabilities at each node. In
order to model real sequences, we also need to consider the possibility that gaps might occur
when a model is aligned to a sequence. Two types of gaps may arise. The first type occurs when
the sequence contains a region that is not present in the model (an insertion in the sequence). The
second type occurs when there is a region in the model that is not present in the sequence (a
deletion in the sequence). To handle these cases, each node in the profile HMM must now have
three states: the match state, an insert state, and a delete state. The model also needs more types
of transition probabilities: match->match, match->insert, match->delete, insert->match, etc. [1]
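The numbers in the ungapped example can be checked directly. In the sketch below, the three-node model, the choice of conserved residues (`L`, `V`, `F`) and the helper names are hypothetical; only the values 0.81 and 0.01 come from the text. Because every match-to-match transition has probability 1.0, the probability of a sequence reduces to the product of its emission probabilities.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def conserved_column(residue, p_major=0.81, p_minor=0.01):
    """Emission distribution for a well-conserved match state."""
    return {aa: (p_major if aa == residue else p_minor) for aa in AMINO_ACIDS}


def ungapped_probability(sequence, columns):
    """P(sequence | model) for a linear model with transition probability 1.0."""
    if len(sequence) != len(columns):
        raise ValueError("an ungapped model only emits sequences of its own length")
    p = 1.0
    for residue, emissions in zip(sequence, columns):
        p *= emissions[residue]
    return p


# Hypothetical three-node model whose columns favour L, V and F.
model = [conserved_column(aa) for aa in "LVF"]
```

Each column sums to 0.81 + 19 x 0.01 = 1.0, so it is a valid distribution. A sequence that matches the conserved residue at every node is far more probable under this model than one that deviates at a single position.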
Aligning a sequence to a profile HMM is done by a dynamic programming algorithm that finds
the most probable path that the sequence may take through the model, using the transition and
emissions probabilities to score each possible path.
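The dynamic programming step that finds the most probable path is commonly the Viterbi algorithm. The sketch below implements it for a plain HMM in log space, as an illustration only. It is not a full profile HMM aligner: delete states, which emit no symbol, need additional handling that is omitted here. The two-state model and all of its probabilities are hypothetical.

```python
import math


def viterbi(symbols, states, start, trans, emit):
    """Return (best_path, log_probability) for a non-empty symbol sequence."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(start[s]) + math.log(emit[s][symbols[0]]) for s in states}
    backpointers = []
    for sym in symbols[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # Best predecessor of s for this step.
            prev = max(states, key=lambda r: score[r] + math.log(trans[r][s]))
            new_score[s] = score[prev] + math.log(trans[prev][s]) + math.log(emit[s][sym])
            pointer[s] = prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical model: all probabilities are non-zero so math.log is defined.
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.2, "L": 0.8}}
emit = {"H": {"a": 0.7, "b": 0.3}, "L": {"a": 0.4, "b": 0.6}}
```

Working with log-probabilities turns products into sums and avoids numerical underflow on long sequences, which is why scores in this setting are usually reported as logarithms.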
3.2.1 Purpose of Profile HMM
Profile HMMs are statistical tools that can model the commonalities of the amino acid sequences
for a family of proteins. Considered to be more expressive than a standard consensus sequence or
a regular expression, profile HMMs allow position dependent insertion and deletion penalties, as
well as the option to use a separate distribution for inserted portions of the amino acid sequence.
Once a model is trained on a number of amino acid sequences from a given family or group, it is
most commonly used for three purposes:
• By aligning sequences to the model, one can construct multiple alignments.
• The model itself can offer insight into the characteristics of the family when one examines the
structure and probabilities of the trained HMM.
• The model can be used to score how well a new protein sequence fits the family motif. For
example, one could train a model on a number of proteins in a family, and then match sequences
in a database to that model in order to try to find other family members. This technique is also
used to infer protein structure and function.
3.3 Advantages of Hidden Markov Model
Statistical Grounding
• Statisticians are comfortable with the theory behind hidden Markov models
• Freedom to manipulate the training and verification processes
• Mathematical / theoretical analysis of the results and processes
• HMMs are still very powerful modeling tools – far more powerful than many statistical
methods
Modularity
• HMMs can be combined into larger HMMs
Transparency of the Model
• Assuming an architecture with a good design
• People can read the model and make sense of it
• The model itself can help increase understanding
Incorporation of Prior Knowledge
• Incorporate prior knowledge into the architecture
• Initialize the model close to something believed to be correct
• Use prior knowledge to constrain training process
Example of HMM [1].
Fig 4: Hidden Markov Model
Probabilistic parameters of a hidden Markov model given in the above example.
x — states
y — possible observations
a — state transition probabilities
b — output probabilities
4. PRESENT WORK
Profile analysis has long been a useful tool in finding and aligning distantly related sequences and
in identifying known sequence domains in new sequences. Basically, a profile is a description of
the consensus of a multiple sequence alignment. It uses a position-specific scoring system to
capture information about the degree of conservation at various positions in the multiple
alignment. This makes it a much more sensitive and specific method for database searching than
pairwise methods. The following steps were followed in this research work:
1. Align the sequences in the family: Initially, we will assume that there are no gaps in the
alignment. We look at the alignment of N sequences of l positions as follows:
Table 1: Alignment of sequences
Sequence   Position
           1     2     3     4     …     l
1          a11   a12   a13   …     …     a1l
2          a21   a22   a23   …     …     a2l
3          a31   …
…
N          aN1   aN2   aN3   …     …     aNl
where aij denotes the amino acid from the ith sequence at the jth position.
2. Use the alignment to create a profile: We build the profile as follows. We compute:
fij = % of column j that is amino acid i
bi = % of background which is amino acid i
The “background” can be computed, for example, from a large sequence database, or from a
genome, or from some particular protein family.
Now compute the 20 x l array Pij, where
Pij = fij/bi
Intuitively, Pij is the “propensity” for amino acid i at position j in the alignment.
This gives us the following table:
Table 2: Alignment to compute the Profile
Amino acid   Position
             1     2     3     4     5     …     l
L            PL1   PL2   PL3   …     …     PLl
V            PV1   PV2   PV3   …     …     PVl
F            PF1   …
…
And we use this table to compute:
Scoreij = log(Pij)
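Step 2 can be sketched in a few lines. The function below follows the definitions given above (fij, bi, Pij = fij/bi, Scoreij = log(Pij)) with one addition that is an assumption of this sketch and not part of the paper: a pseudocount is added to every residue count, because a residue that never occurs in a column would otherwise give log(0). The function name, the uniform background and the four-sequence toy family are likewise hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, background, pseudocount=1.0):
    """Return score[j][aa] = log(P_ij) with P_ij = f_ij / b_i.

    alignment  : list of equal-length, gap-free sequences (N rows, l columns)
    background : dict mapping each amino acid to its background frequency b_i
    pseudocount: ASSUMPTION, not in the paper; added to every count so that
                 residues absent from a column do not give log(0)
    """
    n_seqs, length = len(alignment), len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for j in range(length):
        column = [seq[j] for seq in alignment]
        total = n_seqs + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            f_ij = (column.count(aa) + pseudocount) / total   # smoothed frequency
            scores[aa] = math.log(f_ij / background[aa])      # Score_ij = log(P_ij)
        profile.append(scores)
    return profile


# Hypothetical inputs: a uniform background and a four-sequence toy family.
uniform = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
toy_alignment = ["LVF", "LVF", "LIF", "MVF"]
toy_profile = build_profile(toy_alignment, uniform)
```

A positive score means the residue is more frequent in that column than in the background, and a negative score means it is rarer. The size of the pseudocount is a modelling choice and would normally be tuned or replaced by a more principled prior.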
3. Test new sequences against the profile: To use the profile to score a new sequence, we do
the following:
• Slide a window of width l over the new sequence.
• The score of the window equals the sum of the scores of each position in the window.
• If the score of the window is higher than the cutoff, which is determined empirically,
we can conclude that the window is a member of the family. In addition, the higher
the score, the more confident the prediction.
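The three bullets above translate almost directly into code. In this sketch the profile is a list with one dictionary of scores per position; the two-position profile over a reduced alphabet, the test sequence and the cutoff of 1.5 are hypothetical values used only to show the mechanics, since the text states that the real cutoff is determined empirically.

```python
def scan(sequence, profile, cutoff):
    """Yield (start, score) for every window whose score exceeds the cutoff.

    profile[j][aa] is the score of amino acid aa at profile position j;
    start is the 0-based offset of the window in the sequence.
    """
    width = len(profile)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(profile[j][aa] for j, aa in enumerate(window))
        if score > cutoff:
            yield start, score


# Hypothetical two-position profile over a reduced alphabet, for illustration.
toy_profile = [
    {"L": 1.0, "V": -1.0, "F": -1.0},
    {"L": -1.0, "V": 1.0, "F": -1.0},
]
hits = list(scan("FLVLVF", toy_profile, cutoff=1.5))
```

Here the two `LV` windows score 2.0 and are reported, while every other window scores -2.0 and is rejected. A sequence shorter than the profile yields no windows at all.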
5. CONCLUSION AND FUTURE WORK
Currently, one very promising approach for protein family related analysis of amino acid
sequences is the application of so-called Profile Hidden Markov Models (Profile HMMs) as
probabilistic target family models. Given a training set of protein data, discrete HMMs are
estimated. These models are then evaluated for unknown query sequences which are aligned to
the explicit protein family models. Such explicit target family models are favorable for sequence
analysis since family specific data is incorporated into the analysis. One of the main purposes of
developing profile HMMs is to use them to detect potential membership in a family. We can use
either the Viterbi algorithm to get the most probable alignment or the forward algorithm to
calculate the full probability of the sequence summed over all possible paths.
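Where the Viterbi algorithm keeps only the best path, the forward algorithm replaces the maximum with a sum over paths. The sketch below shows this for a plain HMM and is meant as an illustration rather than as the implementation used in this work; the two-state model and its probabilities are hypothetical, and the insert and delete states of a profile HMM are not modelled.

```python
def forward_probability(symbols, states, start, trans, emit):
    """P(symbols | model), summed over every state path (forward algorithm)."""
    if not symbols:
        return 1.0
    # alpha[s] = P(symbols seen so far, current state = s)
    alpha = {s: start[s] * emit[s][symbols[0]] for s in states}
    for sym in symbols[1:]:
        alpha = {
            s: emit[s][sym] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical two-state model; every value is made up for illustration.
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.2, "L": 0.8}}
emit = {"H": {"a": 0.7, "b": 0.3}, "L": {"a": 0.4, "b": 0.6}}
```

Because the result sums over all paths, the probabilities of all sequences of a given length add up to 1. For long sequences the raw probabilities underflow, so practical implementations rescale `alpha` at each step or work in log space.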
The research can be extended to:
1. Real user interface.
2. Provision to include other sequences (i.e. with different accession numbers and their
supported files) automatically.
3. Provision to access the data from a database.
4. Provision for choice of alignment technique
5. Provision to incorporate various input formats
6. REFERENCES
[1] Sharma N.,Kumar D., Kaur Reet. (2011) “Applying Hidden markov model to sequence alignment”,
Vol 2 (3),pages 1031-1035
[2] Devos, D. and Valencia, A. (2000) “Practical Limits of Function Prediction”, Protein Design Group,
National Centre for Biotechnology, CNB-CSIC Madrid, E-28049, Spain, pp. 134-170.
[3] Erik L. L. Sonnhammer, Sean R. Eddy, Ewan Birney, Alex Bateman and Richard Durbin (1998)
“Pfam: multiple sequence alignments and HMM-profiles of protein domains”, Nucleic Acids
Research vol. 26, No.1, pp. 320-322.
[4] Georgina Mirceva and Danco Davcev (2009) “HMM based approach for classifying protein
structures” International Journal of Bio-Science and Bio-Technology, vol. 1, no. 1, pp. 37-46.
[5] N. von Öhsen, I. Sommer, R. Zimmer (2003) “Profile-Profile Alignment: A Powerful Tool for Protein
Structure Prediction” Pacific Symposium on Biocomputing, Vol 8, pp 252-263.
[6] Park, C.Y., Park, S.H., Kim, D.H., Park, S.H. and Hwang, C.J. (2004) “A new protein Classification
method using dynamic classifier”, Bioinformatics, vol. 9, pp 32-35.
[7] Herbert Popp, Mona Singh and Johnson parker (2002) “Topics in Computational Molecular Biology”
Lecture notes in bio computing, pp.1-11.
[8] Raninder Kaur, Shavinder Kaur, Reet Kamal Kaur and Amandeep Kaur (2010) “Characterization of
Parathyroid Hormone using HMM Framework” International Journal of Computer Applications, vol.
1, no. 16, pp. 65-68.
[9] T. Plötz, and G.A. Fink, “Pattern recognition methods for advanced stochastic protein sequence
analysis using HMMs”, Pattern Recognition, vol. 39, 2006, pp. 2267-2280.
[10] Thakoor N, Gao J, Jung S.(2007) “Hidden Markov model-based weighted likelihood discriminant for
2-D shape classification.” Online journal at Springerlink.com
[11] Tolga Can, Orhan Çamoglu, Ambuj K. Singh, Yuan-Fang Wang (2004) “Automated Protein
Classification Using Consensus Decision” Journal of Molecular Biology, Volume 348, Issue 4, Pages
66-68.
[12] Usman Roshan and Dennis R. Livesay (2006) “Probalign: multiple sequence alignment using partition
function posterior probabilities” Bioinformatics, Vol. 22, No. 22, pp 2715-2721.
[13] Valeria De Fonzo, Filippo Aluffi-Pentini and Valerio Parisi (2007) “Hidden Markov Models in
Bioinformatics”, Current Bioinformatics, Vol. 2, No. 1, pp. 49-61.
[14] Wong, L., Chua, H., 17]W.R. Taylor, and C.A. Orengo, “Protein structure alignment”, J. Mol. Biol.,
vol. 208, 1989, pp. 1-22.
[15] Li, Z., Liu, G. and Sung, W. (2008) “Graph – Based Protein Function Prediction”, Genome
Informatics, vol. 16(1), pp. 17-23.
[16] http://www.avatar.se/molbioinfo2001/multali.html
[17] http://www.b.eye-network.com/view/1127
[18] http://www.caspur.it/castri/bioinformatica/gcghelp/profileanalysis.html
Authors
Er. Neeshu Sharma: Neeshu Sharma was born on September 17, 1984 at Kurukshetra,
India. She completed her B.Tech in Computer Science from Kurukshetra University in the
year 2005 and is pursuing her M.Tech from DAVIET College, Jallandhar.
Er. Dinesh Kumar: He has completed his B.Tech and M.Tech in Computer Science and is
currently pursuing his Ph.D. He has guided 7 M.Tech research theses and has active
research publications in the fields of Machine Learning & Natural Language Processing,
Computer Networks, and Data Structures.
Er. Reet Kamal Kaur: Born on 19 July 1984, she completed her B.Tech from LCET and her
M.Tech from GNDEC, Ludhiana in 2008. She is currently serving at RIMT-MAEC,
Mandi Gobindgarh as an assistant professor in the CSE department.

More Related Content

PDF
International Journal of Computer Science, Engineering and Information Techno...
IJCSEIT Journal
 
PDF
Bioinformatics data mining
Sangeeta Das
 
DOCX
Bioinformatics_Sequence Analysis
Sangeeta Das
 
PPTX
Structure alignment methods
Samvartika Majumdar
 
PPT
Homology modeling
Malla Reddy College of Pharmacy
 
PPTX
Bioinformatics
seyed mohammad motevalli
 
PPT
demonstration lecture on Homology modeling
Maharaj Vinayak Global University
 
PPTX
Protein threading using context specific alignment potential ismb-2013
Sheng Wang
 
International Journal of Computer Science, Engineering and Information Techno...
IJCSEIT Journal
 
Bioinformatics data mining
Sangeeta Das
 
Bioinformatics_Sequence Analysis
Sangeeta Das
 
Structure alignment methods
Samvartika Majumdar
 
Bioinformatics
seyed mohammad motevalli
 
demonstration lecture on Homology modeling
Maharaj Vinayak Global University
 
Protein threading using context specific alignment potential ismb-2013
Sheng Wang
 

What's hot (20)

PPTX
Protein 3 d structure prediction
Samvartika Majumdar
 
PPT
Sequence Alignment In Bioinformatics
Nikesh Narayanan
 
PDF
Bioinformatics.Assignment
Naima Tahsin
 
PPTX
Protein computational analysis
Kinza Irshad
 
PPTX
Homology modelling
Ayesha Choudhury
 
PDF
Crimson Publishers-Predicting Protein Transmembrane Regionsby Using LSTM Model
CrimsonPublishers-SBB
 
PDF
Construction of phylogenetic tree from multiple gene trees using principal co...
IAEME Publication
 
PPTX
Presentation1
firesea
 
PPTX
Threading modeling methods
ratanvishwas
 
PDF
Msa
Swati Kumari
 
PPTX
threading and homology modelling methods
mohammed muzammil
 
PDF
Multi objective approach in predicting
ijaia
 
PPTX
Molecular modelling (1)
Bharatesha S Viru
 
PDF
Molecular dynamics and Simulations
Abhilash Kannan
 
PPTX
protein sequence analysis
RamikaSingla
 
PPTX
In silico structure prediction
Subin E K
 
PPT
Molecular modelling for in silico drug discovery
Lee Larcombe
 
PPT
Protein modeling
Malla Reddy College of Pharmacy
 
PPTX
Sequence homology search and multiple sequence alignment(1)
AnkitTiwari354
 
Protein 3 d structure prediction
Samvartika Majumdar
 
Sequence Alignment In Bioinformatics
Nikesh Narayanan
 
Bioinformatics.Assignment
Naima Tahsin
 
Protein computational analysis
Kinza Irshad
 
Homology modelling
Ayesha Choudhury
 
Crimson Publishers-Predicting Protein Transmembrane Regionsby Using LSTM Model
CrimsonPublishers-SBB
 
Construction of phylogenetic tree from multiple gene trees using principal co...
IAEME Publication
 
Presentation1
firesea
 
Threading modeling methods
ratanvishwas
 
threading and homology modelling methods
mohammed muzammil
 
Multi objective approach in predicting
ijaia
 
Molecular modelling (1)
Bharatesha S Viru
 
Molecular dynamics and Simulations
Abhilash Kannan
 
protein sequence analysis
RamikaSingla
 
In silico structure prediction
Subin E K
 
Molecular modelling for in silico drug discovery
Lee Larcombe
 
Sequence homology search and multiple sequence alignment(1)
AnkitTiwari354
 
Ad

Similar to HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS (20)

PDF
BITS: Basics of sequence analysis
BITS
 
PDF
Basics of bioinformatics
Abhishek Vatsa
 
PPT
Protein Evolution and Sequence Analysis.ppt
Francis de Castro
 
PPTX
Bioinformatics
Arockiyajainmary
 
PPT
SooryaKiran Bioinformatics
contactsoorya
 
PDF
Sequence-analysis-pairwise-alignment.pdf
sriaisvariyasundar
 
PDF

HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS

International Journal of Computer Science, Engineering and Information Technology (IJCSEIT), Vol.1, No.3, August 2011
DOI: 10.5121/ijcseit.2011.1301

HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS

Er. Neeshu Sharma1, Er. Dinesh Kumar2, Er. Reet Kamal Kaur3
1 Department of Computer Science & Engineering, PTU RIMT-MAEC, Mandi Gobindgarh, neeshukhn@yahoo.com
2 Department of Computer Science & Engineering, PTU, DAVIET, Jallandhar, er.dineshk@gmail.com
3 Department of Computer Science & Engineering, PTU RIMT-MAEC, Mandi Gobindgarh, reetkamal1901@yahoo.co.in

ABSTRACT
Hidden Markov models (HMMs) have found application in many fields, and applying them to biological sequences has its own advantages. Because HMMs are more systematic and specific, they yield better results than consensus techniques. Profile HMMs use position-specific scoring for the matching and substitution of a residue and for the opening or extension of a gap. HMMs apply a statistical method to estimate the true frequency of a residue at a given position in the alignment from its observed frequency, while standard profiles use the observed frequency itself to assign the score for that residue. This means that a profile HMM derived from only 10 to 20 aligned sequences can be of equivalent quality to a standard profile created from 40 to 50 aligned sequences.

KEYWORDS: Sequence alignment, Profile Analysis, HMM, Profile HMM.

1. INTRODUCTION
Proteins are complex organic compounds that consist of amino acids joined by peptide bonds. Proteins are essential to the structure and function of all living cells and viruses. Many proteins function as enzymes or form subunits of enzymes. Some proteins play structural or mechanical roles. Some proteins function in the immune response and in the storage and transport of various ligands. Proteins serve as nutrients as well; they provide the organism with the amino acids that it cannot synthesize itself. Proteins are amongst the most actively studied molecules in biochemistry, and they were named by the Swedish scientist Jöns Jakob Berzelius in 1838.

An amino acid is any molecule that contains both an amino group and a carboxylic acid group. An amino acid residue is what remains of an amino acid after it forms a peptide bond and loses a water molecule. Since we are interested in amino acids that form proteins, it is safe to use the terms residue and amino acid interchangeably. There are 20 standard amino acids in nature that form proteins.

Fig 1: Structure of Amino Acid

2. PROFILE ANALYSIS
Profile analysis is a sequence comparison method for finding and aligning distantly related sequences. The comparison allows a new sequence to be aligned optimally to a family of similar sequences. The comparison uses a scoring matrix called a PAM matrix and an existing optimal alignment of two or more similar protein sequences. The group or family of similar sequences is first aligned to create a multiple sequence alignment.[16] The information in the multiple sequence alignment is then represented quantitatively as a table of position-specific symbol comparison values and gap penalties. This table is called a profile.

The starting point for the creation of a profile is a sequence or group of aligned sequences, called the probe. The probe is generally a group of functionally related proteins that have been aligned. A profile, however, can also be created from a single sequence. The similarity of new sequences to an existing profile can be tested by comparing each new sequence to the profile with the same algorithm used to make optimal alignments.

To understand how this is done we must first recall what alignment algorithms do. Alignment algorithms find alignments between two sequences that maximize the number of matches and minimize the number of gaps. Gaps are given penalties in the same units as the values in the scoring matrix. The best alignment is then simply defined as the alignment for which the sum of the scoring matrix values minus the gap penalties is maximal. Each row in the profile corresponds to a position in the original multiple sequence alignment. Each possible sequence symbol has a value (a column) in each row of the profile.
The comparison of a sequence symbol to any row of the profile defines a specific value, the "profile comparison value." The best alignments of a sequence to a profile are found by aligning the symbols of the sequence to the profile in such a way that the sum of the profile comparison values minus the gap penalties is maximal. The profile also contains gap coefficients that are specific to each position, so the penalty for inserting a gap in one part of the alignment might be more or less than in another part. The position-specific gap coefficients penalize gaps in conserved regions more heavily than gaps in more variable regions.[16]

The profile contains a consensus sequence for the display of alignments of other sequences to the profile. The consensus sequence character corresponds to the highest value in the row. Since the table on which the profile is based is usually the Dayhoff evolutionary distance table, the consensus residue is the residue that has the smallest evolutionary distance from all of the residues in that position of the alignment, rather than simply the most frequent residue at that position. In the original approach of Dayhoff the actual estimation is restricted to only very closely related pairs of sequences. However, once a Markov model is fitted to this data, replacement frequencies characteristic of distantly related sequences can be extrapolated from the model.

For example, a profile that is 25 amino acids long will have 25 rows of 20 scores, each score in a row being the score for matching one of the 20 amino acids at that position. If a sequence 100 amino acids in length is to be searched, each 25-amino-acid stretch of the sequence is examined: 1-25, 2-26, ..., 76-100. The first 25-amino-acid stretch is evaluated using the profile scores for the amino acids in that stretch, then the next 25-amino-acid stretch, and so on. The highest-scoring stretch is the one most similar to the profile.

The profile method differs in two major respects from methods of sequence comparison in common use:
• Any number of known sequences can be used to construct the profile, allowing more information to be used in the testing of the target than is possible with pairwise alignment methods.
• The profile includes the penalties for insertion or deletion at each position, which allows one to include the probe secondary structure in the testing scheme.

2.1 Techniques for Profile Analysis

2.1.1 Protein Microarrays
Protein microarrays consist of antibodies, proteins, protein fragments, peptides or carbohydrate elements that are immobilized in a grid-like pattern on a glass surface. The arrayed molecules are then used to screen and assess protein interaction patterns with samples containing distinct proteins.[17]

Fig 2: Protein Microarrays

These microarrays are used to identify protein-protein interactions, to identify the substrates of proteins, or to identify the targets of biologically active small molecules. As their use grows, so does the need for bioinformatics tools to analyze microarray data.

2.1.2 Protein Amino Acid Sequences
The analysis of the amino acid sequence, or primary structure, of proteins provides the foundation for many other types of protein studies. The primary structure ultimately determines how proteins fold into functional 3D structures. Primary structure is used in multiple sequence alignment studies to determine the evolutionary relationships between proteins, and to determine relationships between structure and function in related proteins.

Fig 3: Protein Amino Acid Sequences

2.1.3 Protein-Ligand Docking
In drug discovery and development, the manner in which small-molecule compounds bind or dock with proteins is of the utmost importance. Proteins are often the main targets for new drugs, and many drug compounds are small molecules designed to bind preferentially to specific proteins. Because of this need to design small molecules for protein docking, many bioinformatics tools exist for the analysis of protein-ligand interactions. These tools often fall into the category of computational chemistry. At the atomic scales at which compounds dock with proteins, the interactions are biochemical and biophysical in nature.[17]

2.1.4 Protein Folds
Although there is no universal agreement on how to define protein folds, one simple characterization of a fold is “an arrangement of secondary structures into a unique tertiary structure.” That is, protein amino acid sequences arrange themselves into recognizable, identifiable 3D structures. Some of these structures are so common across many different proteins that they are given special names, e.g. Rossmann folds and TIM barrels.[17]

2.2 Role of Profile Analysis
A profiling approach becomes relevant when researchers wish to combine information derived from several sources about a single query or target sequence. For example, users might use the sequence alignment and search tool BLAST to identify homologs of their gene of interest in other species, and then use these results to locate a solved protein structure for one of the homologs. Similarly, they might also want to know the likely secondary structure of the mRNA encoding the gene of interest, or whether a company sells a DNA construct containing the gene.
Sequence profiling tools serve to automate and integrate the process of seeking such disparate information by making the searching of several different external databases transparent to the user. Advantages of sequence profiling tools include the ability to use several of these specialized tools in a single query and present the output through a common interface, the ability to direct the output of one set of tools or database searches into the input of another, and the capacity to distribute hosting and compilation obligations across a network of research groups and institutions rather than a single centralized repository.

3. HIDDEN MARKOV MODEL (HMM)
Hidden Markov models are a sophisticated and flexible statistical tool for the study of protein models. Using HMMs to analyze proteins is part of a new scientific field called bioinformatics, based on the relationship between computer science, statistics and molecular biology. Hidden Markov models (HMMs) offer a more systematic approach to estimating model parameters. The HMM is a dynamic kind of statistical profile. Like an ordinary profile, it is built by analyzing the distribution of amino acids in a training set of related proteins. However, an HMM has a more complex topology than a profile. It can be visualized as a finite state machine. Finite state machines typically move through a series of states and produce some kind of output, either when the machine has reached a particular state or when it is moving from state to state.

A Markov model is a statistical model that changes state step by step. It is characterized by the property that each change depends only on the current state. HMMs are hidden because only the symbols emitted by the system are observable, not the underlying walks between states.[15] HMMs are the Legos of computational sequence analysis.

A hidden Markov model M is defined by:
• a set of states X; the states of X are “hidden” states;
• a set A of transition probabilities between the states, an |X| x |X| matrix with a_ij ≡ P(X_j | X_i), the probability of going from state i to state j;
• an alphabet Σ of symbols emitted in the states of X;
• a set of emission probabilities E, an |X| x |Σ| matrix with e_i(b) ≡ P(b | X_i), the probability that b is emitted in state i. (Emissions are sometimes called observations.)[1]

It is important to note that in most uses of HMMs in bioinformatics a fictitious inversion occurs between causes and effects when dealing with emissions. For example, one can synthesize a (known) polymer sequence that can have different (unknown) features along the sequence. In an HMM one must choose the monomers of the sequence as emissions, because they are the only known data, and the features to be estimated as internal states. In this way, one hypothesizes that the sequence is the effect and the features are the cause, while obviously the reverse is true. A clear case is provided by polypeptides, for which it is the amino acid sequence that causes the secondary structures, while in an HMM the amino acids are treated as emissions and the secondary structures as internal states.
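The definition above can be made concrete with a small sketch. The two-state model below is purely hypothetical: the state names, the three-letter alphabet and every probability are illustrative assumptions rather than values from the paper, and a start distribution is added because the definition does not specify one. It treats secondary-structure classes as hidden states and residues as emissions, then computes the total probability of a sequence (forward algorithm) and its most probable state path (Viterbi algorithm).

```python
# Hypothetical two-state HMM; all names and numbers are illustrative assumptions.
states = ["helix", "coil"]

# Start distribution (an added assumption; not part of the definition in the text).
start = {"helix": 0.5, "coil": 0.5}

# Transition probabilities: a[i][j] = P(X_j | X_i).
a = {
    "helix": {"helix": 0.8, "coil": 0.2},
    "coil": {"helix": 0.3, "coil": 0.7},
}

# Emission probabilities over the toy alphabet {A, L, G}: e[i][b] = P(b | X_i).
e = {
    "helix": {"A": 0.5, "L": 0.4, "G": 0.1},
    "coil": {"A": 0.2, "L": 0.2, "G": 0.6},
}


def forward(seq):
    """Probability of seq summed over all hidden state paths."""
    f = {s: start[s] * e[s][seq[0]] for s in states}
    for b in seq[1:]:
        f = {j: e[j][b] * sum(f[i] * a[i][j] for i in states) for j in states}
    return sum(f.values())


def viterbi(seq):
    """Return (probability, path) of the single most probable state path."""
    v = {s: (start[s] * e[s][seq[0]], [s]) for s in states}
    for b in seq[1:]:
        nxt = {}
        for j in states:
            # Best predecessor state i for arriving in state j.
            p, path = max(
                ((v[i][0] * a[i][j], v[i][1]) for i in states),
                key=lambda t: t[0],
            )
            nxt[j] = (p * e[j][b], path + [j])
        v = nxt
    return max(v.values(), key=lambda t: t[0])


if __name__ == "__main__":
    print(forward("ALG"))
    print(viterbi("GGG"))
```

Plain probabilities are used here for readability; for realistic sequence lengths the same recurrences are normally computed with logarithms to avoid numerical underflow.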
In an HMM, each state “emits” symbols according to its emission probabilities.

3.1 Major Applications of HMM in Bioinformatics
HMMs are in general well suited to natural language processing; they were initially employed in speech recognition and later in optical character recognition and melody classification. In bioinformatics, many algorithms based on HMMs have been applied to biological sequence analysis, such as gene finding and protein family characterization. A detailed description of all applications would be, in our opinion, outside the scope and the size of a normal survey paper. Nevertheless, in order to give a feeling of how the models described in the first part are implemented in real-life bioinformatics problems, we shall describe in more detail, in what follows, a single application: the use, for multiple sequence alignment, of the profile HMM, which is a powerful, simple, and very popular algorithm, especially suited to this purpose.[13]

3.2 Profile HMM
Profile HMMs use position-specific scoring for the matching and substitution of a residue and for the opening or extension of a gap. Profile hidden Markov models (HMMs) have several advantages over standard profiles. Profile HMMs have a formal probabilistic basis and a consistent theory behind gap and insertion scores, in contrast to standard profile methods, which use heuristic methods. HMMs apply a statistical method to estimate the true frequency of a residue at a given position in the alignment from its observed frequency, while standard profiles use the observed frequency itself to assign the score for that residue. This means that a profile HMM derived from only 10 to 20 aligned sequences can be of equivalent quality to a standard profile created from 40 to 50 aligned sequences.[14] In general, producing good profile HMMs requires less skill and manual intervention than producing good standard profiles.

A profile HMM has several types of probabilities associated with it. One type is the transition probability, the probability of transitioning from one state to another. In a simple ungapped model, the probability of a transition from one match state to the next match state is 1.0 and the path through the model is strictly linear, moving from the match state of node n to the match state of node n+1. There are also emission probabilities associated with each match state, based on the probability of a given residue occurring at that position in the alignment. For example, for a fairly well conserved column in a protein alignment, the emission probability for the most common amino acid may be 0.81, while for each of the other 19 amino acids it may be 0.01. If you follow a path through the model to generate a sequence consistent with the model, the probability of any sequence that is generated depends on the transition and emission probabilities at each node.

In order to model real sequences, we also need to consider the possibility that gaps might occur when a model is aligned to a sequence. Two types of gaps may arise. The first type occurs when the sequence contains a region that is not present in the model (an insertion in the sequence). The second type occurs when there is a region in the model that is not present in the sequence (a deletion in the sequence).
To handle these cases, each node in the profile HMM must now have three states: a match state, an insert state, and a delete state. The model also needs more types of transition probabilities: match->match, match->insert, match->delete, insert->match, etc.[1] Aligning a sequence to a profile HMM is done by a dynamic programming algorithm that finds the most probable path that the sequence may take through the model, using the transition and emission probabilities to score each possible path.

3.2.1 Purpose of Profile HMM
Profile HMMs are statistical tools that can model the commonalities of the amino acid sequences for a family of proteins. Considered to be more expressive than a standard consensus sequence or a regular expression, profile HMMs allow position-dependent insertion and deletion penalties, as well as the option to use a separate distribution for inserted portions of the amino acid sequence. Once a model is trained on a number of amino acid sequences from a given family or group, it is most commonly used for three purposes:
• By aligning sequences to the model, one can construct multiple alignments.
• The model itself can offer insight into the characteristics of the family when one examines the structure and probabilities of the trained HMM.
• The model can be used to score how well a new protein sequence fits the family motif. For example, one could train a model on a number of proteins in a family, and then match sequences in a database to that model in order to try to find other family members. This technique is also used to infer protein structure and function.
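As a minimal sketch of the simple ungapped case described in Section 3.2 (match states only, every match-to-match transition equal to 1.0), the probability of a sequence is just the product of the match-state emission probabilities along the linear path. The three-node model below is hypothetical: the conserved residues L, V and F are chosen only for illustration, and each column uses the 0.81 / 0.01 split from the text. Insert and delete states are deliberately left out.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def conserved_column(residue, p_major=0.81):
    """Emission distribution of one match state.

    The conserved residue gets p_major; the remaining 19 residues share
    the rest equally (0.01 each when p_major is 0.81).
    """
    p_other = (1.0 - p_major) / (len(AMINO_ACIDS) - 1)
    return {aa: (p_major if aa == residue else p_other) for aa in AMINO_ACIDS}


def ungapped_log_prob(seq, match_states):
    """Log-probability of seq under an ungapped profile HMM.

    All match->match transitions are 1.0 (log 1.0 = 0), so only the
    emission terms contribute to the sum.
    """
    if len(seq) != len(match_states):
        raise ValueError("an ungapped model only scores sequences of its own length")
    return sum(math.log(col[aa]) for aa, col in zip(seq, match_states))


# Hypothetical three-node model whose conserved residues are L, V, F.
model = [conserved_column(r) for r in "LVF"]

if __name__ == "__main__":
    print(math.exp(ungapped_log_prob("LVF", model)))  # consensus sequence
    print(math.exp(ungapped_log_prob("LVA", model)))  # one mismatch
```

Working in log space keeps the product of many small probabilities numerically stable; a full profile HMM would add the insert and delete states and their transition probabilities to the same calculation.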
3.3 Advantages of Hidden Markov Model

Statistical Grounding
• Statisticians are comfortable with the theory behind hidden Markov models
• Freedom to manipulate the training and verification processes
• Mathematical / theoretical analysis of the results and processes
• HMMs are still very powerful modeling tools, far more powerful than many statistical methods

Modularity
• HMMs can be combined into larger HMMs

Transparency of the Model
• Assuming an architecture with a good design, people can read the model and make sense of it
• The model itself can help increase understanding

Incorporation of Prior Knowledge
• Incorporate prior knowledge into the architecture
• Initialize the model close to something believed to be correct
• Use prior knowledge to constrain the training process

Example of HMM [1].

Fig 4: Hidden Markov Model

Probabilistic parameters of the hidden Markov model in the above example:
x: states
y: possible observations
a: state transition probabilities
b: output probabilities

4. PRESENT WORK
Profile analysis has long been a useful tool for finding and aligning distantly related sequences and for identifying known sequence domains in new sequences. Basically, a profile is a description of the consensus of a multiple sequence alignment. It uses a position-specific scoring system to capture information about the degree of conservation at various positions in the multiple alignment. This makes it a much more sensitive and specific method for database searching than pairwise methods. The steps followed in this research work are as follows:

1. Align the sequences in the family: Initially, we assume that there are no gaps in the alignment. We look at the alignment of N sequences of l positions as follows:

Table 1: Alignment of sequence

Sequence   Position
           1      2      3      4    …    l
1          a_11   a_12   a_13   …    …    a_1l
2          a_21   a_22   a_23   …    …    a_2l
3          a_31   -      -
N          a_N1   a_N2   a_N3   …    …    a_Nl

where a_ij denotes the amino acid from the ith sequence at the jth position.

2. Use the alignment to create a profile: We build the profile as follows. We compute:

f_ij = fraction of column j that is amino acid i
b_i = fraction of the background that is amino acid i

The “background” can be computed, for example, from a large sequence database, from a genome, or from some particular protein family. Now compute the 20 x l array P_ij, where

P_ij = f_ij / b_i

Intuitively, P_ij is the “propensity” for amino acid i at position j in the alignment. This gives us the following table:

Table 2: Alignment to compute the Profile

Amino acid   Position
             1      2      3      4    5    …    l
L            P_L1   P_L2   P_L3   …    …         P_Ll
V            P_V1   P_V2   P_V3   …    …         P_Vl
F            P_F1   .

And we use this table to compute:

Score_ij = log(P_ij)
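Steps 1 and 2 can be sketched as follows; this is an illustration under stated assumptions, not the authors' implementation. The toy alignment, the uniform background and the function names `build_scores` and `window_scores` are all hypothetical. A background-weighted pseudocount is also added, which the text does not mention: without it, any amino acid absent from a column has f_ij = 0 and log(P_ij) is undefined. The second helper sums the scores over each window of width l, in the manner of the sliding-window test.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_scores(alignment, background, pseudocount=1.0):
    """Return scores[j][aa] = log(f_ij / b_i) for an ungapped alignment.

    The pseudocount is an added assumption (not in the text) that keeps
    f_ij above zero for residues never observed in a column.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("ungapped alignment rows must have equal length")
    n = len(alignment)
    scores = []
    for j in range(length):
        counts = Counter(seq[j] for seq in alignment)
        column = {}
        for aa in AMINO_ACIDS:
            f = (counts[aa] + pseudocount * background[aa]) / (n + pseudocount)
            column[aa] = math.log(f / background[aa])
        scores.append(column)
    return scores


def window_scores(sequence, scores):
    """Score of every window of profile width l along the sequence."""
    l = len(scores)
    return [
        sum(scores[j][sequence[start + j]] for j in range(l))
        for start in range(len(sequence) - l + 1)
    ]


# Hypothetical inputs: uniform background and a four-sequence toy family.
background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
alignment = ["LVF", "LVF", "LIF", "MVF"]

scores = build_scores(alignment, background)
totals = window_scores("AALVFAA", scores)
best_start = max(range(len(totals)), key=totals.__getitem__)

if __name__ == "__main__":
    print(best_start, totals[best_start])
```

In practice the background frequencies would come from a large sequence database or genome, as the text suggests, and the cut-off applied to the window scores would be chosen empirically.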
3. Test new sequences against the profile: To use the profile to score a new sequence, we do the following:
• Slide a window of width l over the new sequence.
• The score of the window equals the sum of the scores of each position in the window.
• If the score of the window is higher than the cut-off, which is determined empirically, we can conclude that the window is a member of the family. In addition, the higher the score, the more confident the prediction.

5. CONCLUSION AND FUTURE WORK
Currently, one very promising approach for protein-family-related analysis of amino acid sequences is the application of so-called Profile Hidden Markov Models (Profile HMMs) as probabilistic target family models. Given a training set of protein data, discrete HMMs are estimated. These models are then evaluated for unknown query sequences, which are aligned to the explicit protein family models. Such explicit target family models are favorable for sequence analysis since family-specific data is incorporated into the analysis. One of the main purposes of developing profile HMMs is to use them to detect potential membership in a family. We can use either the Viterbi algorithm to get the most probable alignment or the forward algorithm to calculate the full probability of the sequence summed over all possible paths.

The research can be extended to:
1. Real user interface.
2. Provision to include other sequences (i.e. with different accession numbers and their supported files) automatically.
3. Provision to access the data from a database.
4. Provision for choice of alignment technique.
5. Provision to incorporate various input formats.

6. REFERENCES
[1] Sharma N., Kumar D., Kaur Reet. (2011) “Applying Hidden markov model to sequence alignment”, Vol 2 (3), pages 1031-1035.
[2] Devos, D. and Valencia, A. (2000) “Practical Limits of Function Prediction”, Protein Design Group, National Centre for Biotechnology, CNB-CSIC Madrid, E-28049, Spain, pp. 134-170.
[3] Erik L. L. Sonnhammer, Sean R. Eddy, Ewan Birney, Alex Bateman and Richard Durbin (1998) “Pfam: multiple sequence alignments and HMM-profiles of protein domains”, Nucleic Acids Research, vol. 26, no. 1, pp. 320-322.
[4] Georgina Mirceva and Danco Davcev (2009) “HMM based approach for classifying protein structures”, International Journal of Bio-Science and Bio-Technology, vol. 1, no. 1, pp. 37-46.
[5] N. von Öhsen, I. Sommer, R. Zimmer (2003) “Profile-Profile Alignment: A Powerful Tool for Protein Structure Prediction”, Pacific Symposium on Biocomputing, vol. 8, pp. 252-263.
[6] Park, C.Y., Park, S.H., Kim, D.H., Park, S.H. and Hwang, C.J. (2004) “A new protein Classification method using dynamic classifier”, Bioinformatics, vol. 9, pp. 32-35.
[7] Herbert Popp, Mona Singh and Johnson Parker (2002) “Topics in Computational Molecular Biology”, Lecture notes in bio computing, pp. 1-11.
[8] Raninder Kaur, Shavinder Kaur, Reet Kamal Kaur and Amandeep Kaur (2010) “Characterization of Parathyroid Hormone using HMM Framework”, International Journal of Computer Applications, vol. 1, no. 16, pp. 65-68.
[9] T. Plötz and G.A. Fink (2006) “Pattern recognition methods for advanced stochastic protein sequence analysis using HMMs”, Pattern Recognition, vol. 39, pp. 2267-2280.
[10] Thakoor N, Gao J, Jung S. (2007) “Hidden Markov model-based weighted likelihood discriminant for 2-D shape classification”, Online journal at Springerlink.com.
[11] Tolga Can, Orhan Çamoglu, Ambuj K. Singh, Yuan-Fang Wang (2004) “Automated Protein Classification Using Consensus Decision”, Journal of Molecular Biology, vol. 348, issue 4, pp. 66-68.
[12] Usman Roshan and Dennis R. Livesay (2006) “Probalign: multiple sequence alignment using partition function posterior probabilities”, Bioinformatics, vol. 22, no. 22, pp. 2715-2721.
[13] Valeria De Fonzo, Filippo Aluffi-Pentini and Valerio Parisi (2007) “Hidden Markov Models in Bioinformatics”, Current Bioinformatics, vol. 2, no. 1, pp. 49-61.
[14] Wong, L., Chua, H., … W.R. Taylor and C.A. Orengo (1989) “Protein structure alignment”, J. Mol. Biol., vol. 208, pp. 1-22.
[15] Li, Z., Liu, G. and Sung, W. (2008) “Graph-Based Protein Function Prediction”, Genome Informatics, vol. 16(1), pp. 17-23.
[16] http://www.avatar.se/molbioinfo2001/multali.html
[17] http://www.b.eye-network.com/view/1127
[18] http://www.caspur.it/castri/bioinformatica/gcghelp/profileanalysis.html

Authors
Er. Neeshu Sharma: Neeshu Sharma was born on September 17, 1984 at Kurukshetra, India. She completed her B.Tech in Computer Science from Kurukshetra University in 2005 and is pursuing her M.Tech from DAVIET college, Jallandhar.
Er. Dinesh Kumar: He has completed his B.Tech and M.Tech in Computer Sciences and is currently pursuing his Ph.D. He has guided 7 M.Tech research theses and has active research publications in the fields of Machine Learning & Natural Language Processing, Computer Networks, and Data Structures.
Er. Reet Kamal Kaur: Born on 19 July 1984, she completed her B.Tech from LCET and her M.Tech from GNDEC, Ludhiana in 2008. She is currently serving at RIMT-MAEC, Mandi Gobindgarh as an assistant professor in the CSE department.