Bioinformatics t8-go-hmm v2014
FBW 
2-12-2014 
Wim Van Criekinge
Gene Prediction, HMM & ncRNA 
What to do with an unknown 
sequence ? 
Gene Ontologies 
Gene Prediction 
Composite Gene Prediction 
Non-coding RNA 
HMM
UNKNOWN PROTEIN SEQUENCE 
LOOK FOR: 
• Similar sequences in databases ((PSI) 
BLAST) 
• Distinctive patterns/domains associated 
with function 
• Functionally important residues 
• Secondary and tertiary structure 
• Physical properties (hydrophobicity, isoelectric
point (IEP), etc.)
BASIC INFORMATION COMES FROM SEQUENCE 
• One sequence- can get some information eg 
amino acid properties 
• More than one sequence- get more info on 
conserved residues, fold and function 
• Multiple alignments of related sequences-can 
build up consensus sequences of known 
families, domains, motifs or sites. 
• Sequence alignments can give information 
on loops, families and function from 
conserved regions
Additional analysis of protein sequences 
• transmembrane 
regions 
• signal sequences 
• localisation 
signals 
• targeting 
sequences 
• GPI anchors 
• glycosylation sites 
• hydrophobicity 
• amino acid 
composition 
• molecular weight 
• solvent accessibility 
• antigenicity
FINDING CONSERVED PATTERNS IN PROTEIN SEQUENCES 
• Pattern - short, simplest, but limited 
• Motif - conserved element of a sequence 
alignment, usually predictive of structural or 
functional region 
To get more information across whole 
alignment: 
• Profile 
• HMM
PATTERNS 
• Small, highly conserved regions 
• Shown as regular expressions 
Example: 
[AG]-x-V-x(2)-x-{YW} 
– [ ] lists the amino acids allowed at a position 
– x is any amino acid 
– x(2) is any amino acid at each of the next 2 positions 
– { } lists the amino acids not allowed at a position 
BUT: limited to a near-exact match in a small 
region
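A pattern in this notation maps almost one-to-one onto a regular expression. The sketch below is only an illustration: the function name prosite_to_regex is hypothetical (not part of any library), and it handles just the four elements listed on this slide, so ranges such as x(2,4) are not supported.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles only the elements shown on the slide: [..], {..}, x and (n).
    """
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as the (2) in x(2).
        match = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        core, count = match.group(1), match.group(2)
        if core in ("x", "X"):
            regex = "."                      # any amino acid
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"  # any amino acid except these
        else:
            regex = core                     # single residue or [..] set
        if count:
            regex += "{" + count + "}"
        parts.append(regex)
    return "".join(parts)

# The example pattern from the slide.
regex = prosite_to_regex("[AG]-x-V-x(2)-x-{YW}")
hit = re.search(regex, "MKAQVLLSDE")
```

The match is exact by construction, which is the limitation the slide points out: a single substitution at a fixed position makes the pattern miss.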
PROFILES 
• Table or matrix containing comparison 
information for aligned sequences 
• Used to find sequences similar to 
alignment rather than one sequence 
• Contains same number of rows as 
positions in sequences 
• Row contains score for alignment of 
position with each residue
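As a rough illustration of the table described above, the following sketch builds one row of log-odds scores per alignment column and slides the profile along a sequence. Everything specific here is an assumption made for the example: the toy alignment, the uniform background, the pseudocount of 1 and the function names. Gaps are not handled.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one row per alignment column: {residue: log-odds score}."""
    background = 1.0 / len(AMINO_ACIDS)   # assumed uniform background
    profile = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            row[aa] = math.log2(freq / background)
        profile.append(row)
    return profile

def score(profile, segment):
    """Sum the per-position scores of an ungapped segment."""
    return sum(row[aa] for row, aa in zip(profile, segment))

def best_hit(profile, sequence):
    """Slide the profile along a sequence; return (best score, offset)."""
    width = len(profile)
    return max((score(profile, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Hypothetical ungapped alignment of three related segments.
profile = build_profile(["GAVLK", "GAVIK", "GSVLK"])
best_score, offset = best_hit(profile, "MMGAVLKMM")
```

Unlike a pattern, the profile still gives a (lower) score to a segment that deviates at one position instead of rejecting it outright.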
HIDDEN MARKOV MODELS (HMM) 
HMM 
• An HMM is a large-scale profile with gaps, 
insertions and deletions allowed in the 
alignments, and built around probabilities 
• Package used: HMMER (http://hmmer.wustl.edu/) 
• Start with one sequence or alignment (hmmbuild), 
then calibrate with hmmcalibrate, search 
database with the HMM 
• E-value: number of false matches expected with 
a certain score 
• Assume extreme value distribution for noise; 
calibrate by searching random sequences with the HMM 
to build up the curve of noise (EVD)
Sequence
Gene Prediction, HMM & ncRNA 
What to do with an unknown 
sequence ? 
Gene Ontologies 
Gene Prediction 
HMM 
Composite Gene Prediction 
Non-coding RNA
What is an ontology? 
• An ontology is an explicit 
specification of a conceptualization. 
• A conceptualization is an abstract, 
simplified view of the world that we 
want to represent. 
• If the specification medium is a 
formal representation, the ontology 
defines the vocabulary.
Why Create Ontologies? 
• to enable data exchange among 
programs 
• to simplify unification (or translation) 
of disparate representations 
• to employ knowledge-based services 
• to embody the representation of a 
theory 
• to facilitate communication among 
people
Summary 
• Ontologies are what they do: 
artifacts to help people and their 
programs communicate, coordinate, 
collaborate. 
• Ontologies are essential elements in 
the technological infrastructure of 
the Knowledge Age 
• http://www.geneontology.org/
The Three Ontologies 
•Molecular Function — elemental activity or task 
nuclease, DNA binding, transcription factor 
•Biological Process — broad objective or goal 
mitosis, signal transduction, metabolism 
•Cellular Component — location or complex 
nucleus, ribosome, origin recognition complex
DAG Structure 
Directed acyclic graph: each child 
may have one or more parents
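Because a child can have several parents, walking "up" the graph from a term can reach the same ancestor along more than one route. A minimal sketch of collecting all ancestors, using a made-up five-term graph (the term names and the PARENTS mapping are hypothetical, not real GO identifiers):

```python
# Hypothetical miniature DAG: term -> list of its direct parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],   # one child, two parents
    "D": ["C"],
}

def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:       # a term reached twice is kept once
            seen.add(current)
            stack.extend(parents[current])
    return seen

all_ancestors_of_d = ancestors("D", PARENTS)
```

In practice this is the set of terms a gene is implicitly annotated to when it carries an annotation to the most specific term.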
Example - Molecular Function
Example - Biological Process
Example - Cellular Location
AmiGO browser
GO: Applications 
• E.g. chip-data analysis: an overrepresented item 
can provide functional clues 
• Overrepresentation check: contingency table 
– Chi-square test (or Fisher's exact test if a frequency is < 5)
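For small counts the one-sided Fisher exact test can be written directly as a hypergeometric tail sum. The sketch below assumes a 2x2 table described by four counts; the function and parameter names are chosen for this example only.

```python
from math import comb

def enrichment_p_value(n_genes, n_annotated, n_selected, n_overlap):
    """One-sided Fisher exact p-value for over-representation.

    n_genes      total genes on the chip
    n_annotated  genes on the chip annotated with the GO term
    n_selected   genes in the list of interest
    n_overlap    selected genes that carry the GO term

    Returns P(overlap >= n_overlap) when n_selected genes are drawn at
    random without replacement.
    """
    upper = min(n_annotated, n_selected)
    favourable = sum(
        comb(n_annotated, k) * comb(n_genes - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_genes, n_selected)

# Toy numbers: 4 genes, 2 annotated, 2 selected, both selected are annotated.
p = enrichment_p_value(4, 2, 2, 2)
```

With many GO terms tested at once, the resulting p-values would still need a multiple-testing correction.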
Gene Prediction, HMM & ncRNA 
What to do with an unknown 
sequence ? 
Web applications 
Gene Ontologies 
Gene Prediction 
HMM 
Composite Gene Prediction 
Non-coding RNA
Computational Gene Finding 
Problem: 
Given a very long DNA sequence, identify coding 
regions (including intron splice sites) and their 
predicted protein sequences
Computational Gene Finding 
Eukaryotic gene structure
Computational Gene Finding 
• There is no (yet known) perfect method 
for finding genes. All approaches rely on 
combining various “weak signals” 
together 
• Find elements of a gene 
– coding sequences (exons) 
– promoters and start signals 
– poly-A tails and downstream signals 
• Assemble into a consistent gene model
Genefinder
GENE STRUCTURE INFORMATION - POSITION ON PHYSICAL MAP 
This gene structure corresponds to the position on the physical map
The Active Zone limits the extent of 
analysis, genefinder & fasta dumps 
A blue line within the yellow box 
indicates regions outside of the active 
zone 
The active zone is set by entering 
coordinates in the active zone (yellow 
box) 
GENE STRUCTURE INFORMATION - ACTIVE ZONE 
This gene structure shows the Active Zone
GENE STRUCTURE INFORMATION - POSITION 
This gene structure relates to the Position: 
Change origin of 
this scale by 
entering a 
number in the 
green 'origin' 
box
Boxes are Exons, 
thin lines (or 
springs) are Introns 
GENE STRUCTURE INFORMATION - PREDICTED GENE STRUCTURE 
This gene structure relates to the predicted gene structures
Find the open reading frames 
The triplet, non-punctuated nature of the genetic code helps us out 
64 potential codons 
61 true codons 
3 stop codons (TGA, TAA, TAG) 
Random distribution: approx. 1 in 21 codons (3/64) will be a stop 
Any sequence has 3 potential reading frames (+1, +2, +3) 
Its complement also has three potential reading frames (-1, -2, -3) 
6 possible reading frames 
GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT 
E K A P A Q S E M V S L S F H R 
K K L L P N L K W L A Y L S T 
K S S C P I * N G * P I F P P
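The three forward translations above can be reproduced with a few lines of Python. This is a sketch using the standard genetic code; the helper names are illustrative, and the input is assumed to contain only A, C, G and T.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frames("GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT")
```

Here frames["+1"], frames["+2"] and frames["+3"] correspond to the three translated lines shown on the slide; an open reading frame is a stretch without "*" in any one of the six strings.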
There is one column 
for each frame 
Small horizontal 
lines represent stop 
codons 
GENE STRUCTURE INFORMATION - OPEN READING FRAMES 
This gene structure relates to Open reading Frames
They have one 
column for each 
frame 
The size indicates 
relative score for the 
particular start site 
GENE STRUCTURE INFORMATION - START CODONS 
This gene structure represents Start Codons
Computational Gene Finding: Hexanucleotide frequencies 
• Amino acid distributions are biased 
e.g. p(A) > p(C) 
• Pairwise distributions also biased 
e.g. p(AT)/[p(A)*p(T)] > p(AC)/[p(A)*p(C)] 
• Nucleotides that code for preferred amino 
acids (and AA pairs) occur more frequently in 
coding regions than in non-coding regions. 
• Codon biases (per amino acid) 
• Hexanucleotide distributions that reflect those 
biases indicate coding regions.
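One simple way to turn the hexanucleotide idea into a score is a log-likelihood ratio between hexamer frequencies learned from coding and from non-coding training sequences. The sketch below is an assumption-laden illustration (toy training data, pseudocount of 1, frame ignored), not the method of any particular gene finder.

```python
import math
from collections import Counter

def hexamer_frequencies(sequences, pseudocount=1.0):
    """Return a function giving the smoothed frequency of any hexamer."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values()) + pseudocount * 4 ** 6
    return lambda hexamer: (counts[hexamer] + pseudocount) / total

def coding_score(window, coding_freq, noncoding_freq):
    """Sum of log ratios over all overlapping hexamers in the window.

    Positive values mean the window looks more like the coding training
    set, negative values more like the non-coding one.
    """
    return sum(
        math.log(coding_freq(window[i:i + 6]) / noncoding_freq(window[i:i + 6]))
        for i in range(len(window) - 5)
    )

# Toy training sets, purely for illustration.
coding_freq = hexamer_frequencies(["GCCGCCGCCGCCGCC"])
noncoding_freq = hexamer_frequencies(["ATATATATATATATA"])
```

A real implementation would train on large sets of known exons and introns and would usually keep separate tables for each codon position.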
Gene prediction 
Generation of datasets (Ensmart@Ensembl): 
Dataset 1 (http://biobix.ugent.be/txt/coding.txt) consists of >900 
coding regions (DNA) 
Dataset 2 (http://biobix.ugent.be/txt/noncoding.txt) consists of 
>900 non-coding regions 
Distance Array: Calculate for every base all the distances (in 
bp) to the same nucleotide (focus on the first 1000 bp of the 
coding region and limit the distance array to a window of 
1000 bp) 
Do you see a difference in this “distance array” between coding 
and noncoding sequence ? 
Could it be used to predict genes ? 
Write a program to predict genes in the following genomic 
sequence (http://biobix.ugent.be/txt/genomic.txt) 
What else could help in finding genes in raw genomic 
sequences ?
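A possible starting point for the distance array, written as a plain function on a sequence string. The interpretation used here is an assumption: for each distance d, count the position pairs (i, i+d) that carry the same nucleotide. Loading the datasets is left out.

```python
def distance_array(seq, max_distance=1000):
    """counts[d] = number of position pairs (i, i + d) with the same base."""
    counts = [0] * (max_distance + 1)
    for i, base in enumerate(seq):
        # Only look ahead as far as the window and the sequence allow.
        furthest = min(max_distance, len(seq) - 1 - i)
        for d in range(1, furthest + 1):
            if seq[i + d] == base:
                counts[d] += 1
    return counts

# Toy sequence with a strict period of 3.
toy = distance_array("ACGACGACG", max_distance=6)
```

Comparing the averaged arrays of the coding and non-coding datasets, for example at distances that are multiples of 3, is one way to approach the question on the slide.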
The grey boxes indicate 
regions where the codon 
frequencies match those of 
known C. elegans genes. 
the larger the grey box the 
more this region resembles a 
C. elegans coding element 
GENE STRUCTURE INFORMATION - CODING POTENTIAL 
This gene structure corresponds to the Coding Potential
blastn (EST) 
For raw DNA sequence analysis blastx is 
extremely useful 
Will probe your DNA sequence against the protein database 
A match (homolog) gives you some ideas regarding function 
One problem is the sheer number of genome sequences in the databases: 
you will get matches to entries that were themselves annotated strictly by 
sequence homology – often you need some experimental evidence
The blue boxes indicate 
regions of sequence which 
when translated have 
similarity to previously 
characterised proteins. 
To view the alignment, 
select the right mouse 
button whilst over the blue 
box. 
GENE STRUCTURE INFORMATION - SEQUENCE SIMILARITY 
This feature shows protein sequence similarity
The yellow boxes represent 
DNA matches (Blast) to C. 
elegans Expressed Sequence 
Tags (ESTS) 
To view the alignment use the 
right mouse button whilst 
over the yellow box to invoke 
Blixem 
GENE STRUCTURE INFORMATION - EST MATCHES 
This gene structure relates to Est Matches
New generation of programs to predict gene coding 
sequences based on a non-random repeat pattern 
(eg. Glimmer, GeneMark) – actually pretty good 
Borodovsky et al., 1999, Organization of the Prokaryotic Genome (Charlebois, ed) pp. 11-34
Computational Gene Finding 
• CpG islands are regions of sequence that 
have a high proportion of CG dinucleotide 
pairs (p is the phosphodiester bond linking 
them) 
– CpG islands are present in the promoter and 
exonic regions of approximately 40% of 
mammalian genes 
– Other regions of the mammalian genome contain 
few CpG dinucleotides and these are largely 
methylated 
• Definition: sequences of >500 bp with 
– G+C > 55% 
– Observed(CpG)/Expected(CpG) > 0.65
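The three criteria can be checked directly on a sequence string. In this sketch the expected CpG count is taken as (number of C x number of G) / length, which is the usual convention but is not stated on the slide; the function names are illustrative.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")
    expected = c * g / n          # assumed definition of Expected(CpG)
    obs_exp = observed / expected if expected else 0.0
    return (c + g) / n, obs_exp

def looks_like_cpg_island(seq, min_length=500, min_gc=0.55, min_obs_exp=0.65):
    """Apply the thresholds from the slide to one candidate region."""
    if len(seq) <= min_length:
        return False
    gc, obs_exp = cpg_stats(seq)
    return gc > min_gc and obs_exp > min_obs_exp

gc, ratio = cpg_stats("CG" * 300)
```

To scan a genome one would slide a window along the sequence and merge neighbouring windows that pass the test.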
This column shows 
matches to members of a 
number of repeat families 
Currently a hidden markov 
model is used to detect 
these 
GENE STRUCTURE INFORMATION - REPEAT FAMILIES 
This gene structure corresponds to Repeat Families
This column shows regions 
of localised repeats both 
tandem and inverted 
Clicking on the boxes will 
show the complete repeat 
information in the blue line 
at the top end of the screen 
GENE STRUCTURE INFORMATION - REPEATS 
This gene structure relates to Repeats
Exon/intron boundaries
Computational Gene Finding: Splice junctions 
• Most eukaryotic introns have a 
consensus splice signal: GU (GT in the DNA) at the 
beginning (“donor”), AG at the end 
(“acceptor”). 
• Variation does occur in the splice sites 
• Many AGs and GTs are not splice sites. 
• Database of experimentally validated 
human splice sites: 
http://www.ebi.ac.uk/~thanaraj/splice.html
The Splice Sites are shown 
'Hooked' 
The Hook points in the 
direction of splicing, therefore 
3' splice sites point up and 5' 
Splice sites point down 
The colour of the Splice Site 
indicates the position at which 
it interrupts the Codon 
The height of the Splices is 
proportional to the Genefinder 
score of the Splice Site 
GENE STRUCTURE INFORMATION - PUTATIVE SPLICE SITES 
This gene structure shows putative splice sites
Gene Prediction, HMM & ncRNA 
What to do with an unknown 
sequence ? 
Web applications 
Gene Ontologies 
Gene Prediction 
HMM 
Composite Gene Prediction 
Non-coding RNA
Towards profiles (PSSM) with indels – insertions and/or deletions 
• Recall that profiles are matrices that 
identify the probability of seeing an 
amino acid at a particular location in a 
motif. 
• What about motifs that allow insertions 
or deletions (together, called indels)? 
• Patterns and regular expressions can 
handle these easily, but profiles are 
more flexible. 
• Can indels be integrated into profiles?
Hidden Markov Models: Graphical models of sequences 
• Need a representation that allows 
specification of the probability of 
introducing (and/or extending) a gap in 
the profile. 
A .1 
C .05 
D .2 
E .08 
F .01 
continue 
Gap A .04 
C .1 
D .01 
E .2 
F .02 
Gap A .2 
C .01 
D .05 
E .1 
F .06 
delete
Hidden Markov Chain 
• A sequence is said to be Markovian if the 
probability of the occurrence of an element in 
a particular position depends only on the 
previous elements in the sequence. 
• Order of a Markov chain depends on how 
many previous elements influence probability 
– 0th order: probability at each position is independent of the preceding elements 
– 1st order: probability depends only on immediately 
previous position. 
• 1st order Markov chains are good for proteins.
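A first-order chain is fully described by a table of transition probabilities P(current | previous). The sketch below estimates that table from training sequences and scores a new sequence in log space; the DNA alphabet, the uniform start probability and the pseudocount are assumptions made for the example.

```python
import math

def train_first_order(sequences, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(current | previous) from observed neighbouring pairs."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
        for a in alphabet
    }

def log_probability(seq, transitions, alphabet="ACGT"):
    """Log-probability of a non-empty sequence under the chain."""
    logp = math.log(1.0 / len(alphabet))   # assumed uniform first element
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(transitions[prev][cur])
    return logp

transitions = train_first_order(["ACACACAC"])
```

Two such chains, one trained on each class of sequence, can be compared through the difference of their log-probabilities for the same input.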
Markov Chain for DNA
Markov chain with begin and end
Markov Models: Graphical models of sequences 
• Consists of states (boxes) and transitions 
(arcs) labeled with probabilities 
• States have probability(s) of “emitting” an 
element of a sequence (or nothing). 
• Arcs have probability of moving from one 
state to another. 
– Sum of probabilities of all out arcs must be 1 
– Self-loops (e.g. gap extend) are OK.
Markov Models 
• Simplest example: Each state emits (or, 
equivalently, recognizes) a particular 
element with probability 1, and each 
transition is equally likely. 
[Figure: states Begin, Emit 1, Emit 2, Emit 3, Emit 4 and End connected by arcs] 
Example sequences: 1234 234 14 121214 2123334
Hidden Markov Models: Probabilistic Markov Models 
• Now, add probabilities to each transition (let 
emission remain a single element) 
[Figure: states Begin, Emit 1, Emit 2, Emit 3, Emit 4 and End; arcs labelled 
0.5, 0.5, 0.1, 0.9, 0.75, 0.25, 0.2, 0.8 and 1.0] 
• We can calculate the probability of any sequence given this 
model by multiplying 
p(1234) = 0.5 * 0.1 * 0.75 * 0.8 = 0.03 
p(14) = 0.5 * 0.9 = 0.45 
p(2334)= 0.5 * 0.75 * 0.2 * 0.8 = 0.06
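The three products can be checked with a small lookup table. The arc assignment below is a reconstruction, not a copy of the figure: most arcs follow from the worked products, while 2 -> 4 = 0.25 and 4 -> End = 1.0 are inferred from the rule that the outgoing probabilities of a state sum to 1.

```python
# Reconstructed transition table (see the assumptions stated above).
TRANSITIONS = {
    "Begin": {"1": 0.5, "2": 0.5},
    "1": {"2": 0.1, "4": 0.9},
    "2": {"3": 0.75, "4": 0.25},   # 0.25 inferred
    "3": {"3": 0.2, "4": 0.8},     # self-loop on state 3
    "4": {"End": 1.0},             # inferred
}

def sequence_probability(symbols):
    """Multiply the transition probabilities along Begin -> ... -> End.

    Each state emits its own symbol with probability 1, so the sequence
    determines the path; a missing arc gives probability 0.
    """
    path = ["Begin", *symbols, "End"]
    p = 1.0
    for src, dst in zip(path, path[1:]):
        p *= TRANSITIONS.get(src, {}).get(dst, 0.0)
    return p

p_1234 = sequence_probability("1234")
```

Because every emission is deterministic here, nothing is hidden yet: each sequence corresponds to exactly one path.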
Hidden Markov Models: Probabilistic Emission 
• If we let the states define a set of emission 
probabilities for elements, we can no longer be 
sure which state we are in given a particular 
element of a sequence 
[Figure: the same model with arcs labelled 0.5, 0.5, 0.1, 0.9, 0.75, 0.25, 0.2, 
0.8 and 1.0; the four emitting states now emit 
A (0.8) B (0.2) 
B (0.7) C (0.3) 
C (0.1) D (0.9) 
C (0.6) A (0.4)] 
BCCD or BCCD ?
Hidden Markov Models 
• Emission uncertainty means the sequence doesn't 
identify a unique path. The states are “hidden” 
[Figure: the model with probabilistic emissions; states emit 
A (0.8) B (0.2), B (0.7) C (0.3), C (0.1) D (0.9) and C (0.6) A (0.4)] 
• Probability of a sequence is the sum over all paths that can 
produce it: 
p(bccd) = 0.5 * 0.2 * 0.1 * 0.3 * 0.75 * 0.6 * 0.8 * 0.9 
+ 0.5 * 0.7 * 0.75 * 0.6 * 0.2 * 0.6 * 0.8 * 0.9 
= 0.000972 + 0.013608 = 0.01458 
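Summing over paths by hand stops being practical very quickly; the forward algorithm does the same sum one symbol at a time. The sketch below uses a reconstruction of the slide's model: state names S1 to S4 are made up, and the assignment of probabilities to arcs and states is inferred from the two products in the worked example plus the sum-to-1 rule.

```python
# Reconstructed toy HMM (state names and arc assignment are assumptions).
START = {"S1": 0.5, "S2": 0.5}
TRANS = {
    "S1": {"S2": 0.1, "S4": 0.9},
    "S2": {"S3": 0.75, "S4": 0.25},
    "S3": {"S3": 0.2, "S4": 0.8},
    "S4": {"End": 1.0},
}
EMIT = {
    "S1": {"A": 0.8, "B": 0.2},
    "S2": {"B": 0.7, "C": 0.3},
    "S3": {"C": 0.6, "A": 0.4},
    "S4": {"C": 0.1, "D": 0.9},
}

def forward_probability(seq):
    """P(seq | model), summed over every path from Begin to End.

    f[s] holds the probability of emitting the symbols read so far and
    being in state s after the last one. Expects a non-empty sequence.
    """
    f = {s: START.get(s, 0.0) * EMIT[s].get(seq[0], 0.0) for s in EMIT}
    for symbol in seq[1:]:
        f = {
            s: EMIT[s].get(symbol, 0.0)
            * sum(f[r] * TRANS[r].get(s, 0.0) for r in EMIT)
            for s in EMIT
        }
    return sum(f[s] * TRANS[s].get("End", 0.0) for s in EMIT)

p_bccd = forward_probability("BCCD")
```

The work grows with (sequence length) x (number of states) squared, rather than with the number of paths.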
Hidden Markov Models
Hidden Markov Models: The occasionally dishonest casino
Hidden Markov Models: The occasionally dishonest casino
Use of Hidden Markov Models 
• The HMM must first be “trained” using a training set 
– Eg. database of known genes. 
– Consensus sequences for all signal sensors are needed. 
– Compositional rules (i.e., emission probabilities) and 
length distributions are necessary for content sensors. 
• Transition probabilities between all connected 
states must be estimated. 
• Estimate the probability of sequence s, given model 
m, P(s|m) 
– Multiply probabilities along most likely path 
(or add logs – less numeric error)
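The "most likely path, adding logs" recipe is the Viterbi algorithm. A minimal sketch on a small four-state toy model follows; the model values, the state names S1 to S4 and the helper names are assumptions for illustration. Note that the score of the single best path is an approximation to P(s|m), which strictly is the sum over all paths.

```python
import math

NEG_INF = float("-inf")

# Hypothetical toy HMM used only for this example.
START = {"S1": 0.5, "S2": 0.5}
TRANS = {
    "S1": {"S2": 0.1, "S4": 0.9},
    "S2": {"S3": 0.75, "S4": 0.25},
    "S3": {"S3": 0.2, "S4": 0.8},
    "S4": {"End": 1.0},
}
EMIT = {
    "S1": {"A": 0.8, "B": 0.2},
    "S2": {"B": 0.7, "C": 0.3},
    "S3": {"C": 0.6, "A": 0.4},
    "S4": {"C": 0.1, "D": 0.9},
}

def log(p):
    """Natural log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(seq):
    """Return (most likely state path, its log-probability)."""
    states = list(EMIT)
    v = {s: log(START.get(s, 0.0)) + log(EMIT[s].get(seq[0], 0.0))
         for s in states}
    back = []                       # one back-pointer table per step
    for symbol in seq[1:]:
        pointers, nxt = {}, {}
        for s in states:
            prev = max(states, key=lambda r: v[r] + log(TRANS[r].get(s, 0.0)))
            pointers[s] = prev
            nxt[s] = (v[prev] + log(TRANS[prev].get(s, 0.0))
                      + log(EMIT[s].get(symbol, 0.0)))
        back.append(pointers)
        v = nxt
    last = max(states, key=lambda s: v[s] + log(TRANS[s].get("End", 0.0)))
    best = v[last] + log(TRANS[last].get("End", 0.0))
    path = [last]
    for pointers in reversed(back):  # trace the pointers backwards
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best

path, logp = viterbi("BCCD")
```

Adding logs instead of multiplying probabilities avoids underflow on sequences of realistic length.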
Applications of Hidden Markov Models 
• HMMs are effectively profiles with gaps, and 
have applications throughout Bioinformatics 
• Protein sequence applications: 
– MSAs and identifying distant homologs 
E.g. Pfam uses HMMs to define its MSAs 
– Domain definitions 
– Used for fold recognition in protein structure 
prediction 
• Nucleotide sequence applications: 
– Models of exons, genes, etc. for gene 
recognition.
Hidden Markov Models Resources 
• UC Santa Cruz (David Haussler group) 
– SAM-02 server. Returns alignments, secondary 
structure predictions, HMM parameters, etc. etc. 
– SAM HMM building program 
(requires free academic license) 
• Washington U. St. Louis (Sean Eddy group) 
– Pfam. Large database of precomputed HMM-based 
alignments of proteins 
– HMMer, program for building HMMs 
• Gene finders and other HMMs (more later)
Example TMHMM 
Beyond Kyte-Doolittle …
HMM in protein analysis 
• http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html
Hidden Markov model for gene structure 
Signals (blue nodes): 
• begin sequence (B) 
• start translation (S) 
• donor splice site (D) 
• acceptor splice site (A) 
• stop translation (T) 
• end sequence (F) 
Contents (red arcs): 
• 5’ UTR (J5’) 
• initial exon (EI) 
• exon (E) 
• intron (I) 
• final exon (EF) 
• single exon (ES) 
• 3’ UTR (J3’) 
• A representation of the linguistic rules for what features might follow 
what other features when parsing a sequence consisting of a multiple 
exon gene. 
• A candidate gene structure is created by tracing a path from B to F. 
• A hidden Markov model (or hidden semi-Markov model) is defined by 
attaching stochastic models to each of the arcs and nodes.
Classic Programs for gene finding 
Some of the best programs are HMM based: 
• GenScan – http://genes.mit.edu/GENSCAN.html 
• GeneMark – http://opal.biology.gatech.edu/GeneMark/ 
Other programs 
• AAT, EcoParse, Fexeh, Fgeneh, Fgenes, Finex, GeneHacker, GeneID-3, 
GeneParser 2, GeneScope, Genie, GenLang, Glimmer, GlimmerM, Grail 
II, HMMgene, Morgan, MZEF, Procrustes, SORFind, Veil, Xpound
Hidden Markov Models: Gene Finding Software 
• A Semi-Markov Model 
GENSCAN 
not to be confused with GeneScan, a commercial product 
– Explicit model of how long 
to stay in a state (rather 
than just self-loops, which 
must be exponentially 
decaying) 
• Tracks “phase” of exon or 
intron (0 coincides with codon 
boundary, or 1 or 2) 
• Tracks strand (and direction)
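Why self-loops force an exponentially decaying length distribution: a state that loops back to itself with probability p is occupied for exactly L steps with probability p^(L-1) * (1 - p). A short sketch (function names are illustrative):

```python
def self_loop_length_pmf(p_stay, length):
    """P(a state with self-loop probability p_stay lasts exactly `length` steps)."""
    return p_stay ** (length - 1) * (1.0 - p_stay)

def mean_length(p_stay):
    """Expected number of steps spent in the state."""
    return 1.0 / (1.0 - p_stay)

# The most probable length is always 1, whatever p_stay is.
pmf = [self_loop_length_pmf(0.9, n) for n in range(1, 6)]
```

The probabilities shrink by the same factor at every step, so the shortest length is always the most likely one. A semi-Markov model instead draws the length from an explicit distribution, which can peak at any length.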
Conservation of Gene Features 
[Figure: aligning identity (y-axis, 50% to 100%) plotted across gene regions] 
Conservation pattern across 3165 mappings of human 
RefSeq mRNAs to the genome. A program sampled 200 
evenly spaced bases across 500 bases upstream of 
transcription, the 5’ UTR, the first coding exon, introns, 
middle coding exons, introns, the 3’ UTR and 500 bases 
after polyadenylation. There are peaks of conservation at the 
transition from one region to another.
Composite Approaches 
• Use EST info to constrain HMMs (Genie) 
• Use protein homology info on top of HMMs 
(fgenesh++, GenomeScan) 
• Use cross species genomic alignments on top 
of HMMs (twinscan, fgenesh2, SLAM, SGP)
Gene Prediction: more complex … 
1. Species specific 
2. Splicing enhancers found in coding regions 
3. Trans-splicing 
4. …
Length preference 
5’ ss intcomp branch 3’ ss
Contents-Schedule 
RNA genes 
Besides the 6000 protein-coding genes, there are: 
140 ribosomal RNA genes 
275 transfer RNA genes 
40 small nuclear RNA genes 
>100 small nucleolar RNA genes 
? 
pRNA in the phi29 rotary packaging motor (Simpson 
et al. Nature 408:745-750, 2000) 
Cartilage-hair hypoplasia mapped to an RNA 
(Ridanpää et al. Cell 104:195-203, 2001) 
The human Prader-Willi critical region (Cavaille 
et al. PNAS 97:14035-7, 2000)
RNA genes can be hard to detect 
UGAGGUAGUAGGUUGUAUAGU 
C. elegans let-7; 21 nt 
(Pasquinelli et al. Nature 408:86-89,2000) 
Often small 
Sometimes multicopy and redundant 
Often not polyadenylated 
(not represented in ESTs) 
Immune to frameshift and nonsense mutations 
No open reading frame, no codon bias 
Often evolving rapidly in primary sequence 
miRNA genes
Lin-4 
• Lin-4 identified in a screen for mutations that affect timing and 
sequence of postembryonic development in C.elegans. Mutants re-iterate 
L1 instead of later stages of development 
• Gene positionally cloned by isolating a 693-bp DNA fragment that 
can rescue the phenotype of mutant animals 
• No protein found but 61-nucleotide precursor RNA with stem-loop 
structure which is processed to 22-mer ncRNA 
• Genetically lin-4 acts as negative regulator of lin-14 and lin-28 
• The 3’ UTR of the target genes have short stretches of 
complementarity to lin-4 
• Deletion of these lin-4 target seq causes unregulated gof phenotype 
• Lin-4 RNA inhibits accumulation of LIN-14 and LIN-28 proteins 
although the target mRNA
Let-7 
(Pasquinelli et al. Nature 408:86-89,2000) 
Let-7 (lethal-7) was also mapped to a ncRNA gene with a 21- 
nucleotide product 
The small let-7 RNA is also thought to be a post-transcriptional 
negative regulator for lin-41 and lin-42 
100% conserved in all bilaterally symmetrical animals (not 
jellyfish and sponges) 
Sometimes called stRNAs, small temporal RNAs
Two computational analysis problems 
• Similarity search (eg BLAST), I give you a query, 
you find sequences in a database that look like the 
query (note: SW/Blat) 
– For RNA, you want to take the secondary structure of 
the query into account 
• Genefinding. Based solely on a priori knowledge 
of what a “gene” looks like, find genes in a 
genome sequence 
– For RNA, with no open reading frame and no codon 
bias, what do you look for ?
Context-free grammars 
Basic CFG 
“production rules” 
S -> aS 
S -> Sa 
S -> aSu 
S -> SS 
A CFG “derivation” 
S -> aS 
S -> aaS 
S -> aaSS 
S -> aagScuS 
S -> aagaSucugSc 
S -> aagaSaucuggScc 
S -> aagacSgaucuggcgSccc 
S -> aagacuSgaucuggcgSccc 
S -> aagacuuSgaucuggcgaSccc 
S -> aagacuucSgaucuggcgacSccc 
S -> aagacuucgSgaucuggcgacaSccc 
S -> aagacuucggaucuggcgacaccc
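The four production rules have the same shape as the cases of the classic Nussinov base-pair maximisation recursion: an unpaired base on the left (S -> aS), an unpaired base on the right (S -> Sa), a pair around a substructure (S -> aSu) and two structures side by side (S -> SS). The sketch below implements that recursion as an illustration of how such a grammar can be parsed. It is a swapped-in textbook technique, not the stochastic grammar method itself; only Watson-Crick pairs are allowed and the minimum loop size of 3 is an assumption.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov recursion)."""
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = most pairs that can be formed within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j],      # S -> aS : i unpaired
                        best[i][j - 1])      # S -> Sa : j unpaired
            if (seq[i], seq[j]) in PAIRS:    # S -> aSu: i pairs with j
                score = max(score, best[i + 1][j - 1] + 1)
            for k in range(i + 1, j):        # S -> SS : bifurcation
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]

pairs_in_hairpin = max_base_pairs("gggaaaccc")
```

A stochastic context-free grammar replaces the pair count with probabilities attached to each production, so the same dynamic programme returns the most probable structure instead of the one with the most pairs.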
Context-free grammars 
[Figure: the derivation above, with the final sequence 
aagacuucggaucuggcgacaccc drawn as a folded secondary structure]
The power of comparative analysis 
• Comparative genome analysis is an indispensable means of 
inferring whether a locus produces a ncRNA as opposed to 
encoding a protein. 
• For a small gene to be called a protein-coding gene, one 
excellent line of evidence is that the ORF is significantly 
conserved in another related species. 
• It is more difficult to positively corroborate a ncRNA by 
comparative analysis but, in at least some cases, a ncRNA 
might conserve an intramolecular secondary structure and 
comparative analysis can show compensatory base 
substitutions. 
• With comparative genome sequence data now 
accumulating in the public domain for most if not all 
important genetic systems, comparative analysis can (and 
should) become routine.
Compensatory substitutions 
that maintain the structure 
U U 
C G 
U A 
A U 
G C 
A UCGAC 3’ 
G C 
5’
Evolutionary conservation of RNA molecules can be revealed 
by identification of compensatory substitutions
• Manual annotation of 60,770 full-length mouse complementary 
DNA sequences, clustered into 33,409 ‘transcriptional units’, 
contributing 90.1% of a newly established mouse transcriptome 
database. 
• Of these transcriptional units, 4,258 are new protein-coding and 
11,665 are new non-coding messages, indicating that non-coding 
RNA is a major component of the transcriptome.
Function of ncRNAs
ncRNAs & RNAi
Therapeutic Applications 
• Shooting millions of tiny RNA molecules into a 
mouse’s bloodstream can protect its liver from the 
ravages of hepatitis, a new study shows. In this 
case, they blunt the liver’s self-destructive 
inflammatory response, which can be triggered by 
agents such as the hepatitis B or C viruses. 
(Harvard University immunologists Judy 
Lieberman and Premlata Shankar) 
• In a series of experiments published online this 
week by Nature Medicine, Lieberman’s team gave 
mice injections of siRNAs designed to shut down a 
gene called Fas. When overactivated during an 
inflammatory response, it induces liver cells to 
self-destruct. The next day, the animals were given 
an antibody that sends Fas into hyperdrive. Control 
mice died of acute liver failure within a few days, 
but 82% of the siRNA-treated mice remained free 
of serious disease and survived. Between 80% and 
90% of their liver cells had incorporated the 
siRNAs.
suganthim28
 
PPTX
Five Point Someone – Chetan Bhagat | Book Summary & Analysis by Bhupesh Kushwaha
Bhupesh Kushwaha
 
PDF
The Minister of Tourism, Culture and Creative Arts, Abla Dzifa Gomashie has e...
nservice241
 
PPTX
CDH. pptx
AneetaSharma15
 
PPTX
Basics and rules of probability with real-life uses
ravatkaran694
 
PPTX
Applications of matrices In Real Life_20250724_091307_0000.pptx
gehlotkrish03
 
PPTX
Command Palatte in Odoo 18.1 Spreadsheet - Odoo Slides
Celine George
 
PPTX
HISTORY COLLECTION FOR PSYCHIATRIC PATIENTS.pptx
PoojaSen20
 
PPTX
An introduction to Dialogue writing.pptx
drsiddhantnagine
 
PPTX
INTESTINALPARASITES OR WORM INFESTATIONS.pptx
PRADEEP ABOTHU
 
PPTX
Tips Management in Odoo 18 POS - Odoo Slides
Celine George
 
PPTX
Continental Accounting in Odoo 18 - Odoo Slides
Celine George
 
DOCX
pgdei-UNIT -V Neurological Disorders & developmental disabilities
JELLA VISHNU DURGA PRASAD
 
PPTX
Gupta Art & Architecture Temple and Sculptures.pptx
Virag Sontakke
 
PPTX
Measures_of_location_-_Averages_and__percentiles_by_DR SURYA K.pptx
Surya Ganesh
 
PPTX
An introduction to Prepositions for beginners.pptx
drsiddhantnagine
 
Care of patients with elImination deviation.pptx
AneetaSharma15
 
How to Close Subscription in Odoo 18 - Odoo Slides
Celine George
 
Artificial-Intelligence-in-Drug-Discovery by R D Jawarkar.pptx
Rahul Jawarkar
 
Unit 5: Speech-language and swallowing disorders
JELLA VISHNU DURGA PRASAD
 
BASICS IN COMPUTER APPLICATIONS - UNIT I
suganthim28
 
Five Point Someone – Chetan Bhagat | Book Summary & Analysis by Bhupesh Kushwaha
Bhupesh Kushwaha
 
The Minister of Tourism, Culture and Creative Arts, Abla Dzifa Gomashie has e...
nservice241
 
CDH. pptx
AneetaSharma15
 
Basics and rules of probability with real-life uses
ravatkaran694
 
Applications of matrices In Real Life_20250724_091307_0000.pptx
gehlotkrish03
 
Command Palatte in Odoo 18.1 Spreadsheet - Odoo Slides
Celine George
 
HISTORY COLLECTION FOR PSYCHIATRIC PATIENTS.pptx
PoojaSen20
 
An introduction to Dialogue writing.pptx
drsiddhantnagine
 
INTESTINALPARASITES OR WORM INFESTATIONS.pptx
PRADEEP ABOTHU
 
Tips Management in Odoo 18 POS - Odoo Slides
Celine George
 
Continental Accounting in Odoo 18 - Odoo Slides
Celine George
 
pgdei-UNIT -V Neurological Disorders & developmental disabilities
JELLA VISHNU DURGA PRASAD
 
Gupta Art & Architecture Temple and Sculptures.pptx
Virag Sontakke
 
Measures_of_location_-_Averages_and__percentiles_by_DR SURYA K.pptx
Surya Ganesh
 
An introduction to Prepositions for beginners.pptx
drsiddhantnagine
 

Bioinformatics t8-go-hmm v2014

  • 2. FBW 2-12-2014 Wim Van Criekinge
  • 4. Gene Prediction, HMM & ncRNA What to do with an unknown sequence ? Gene Ontologies Gene Prediction Composite Gene Prediction Non-coding RNA HMM
  • 5. UNKNOWN PROTEIN SEQUENCE LOOK FOR: • Similar sequences in databases ((PSI) BLAST) • Distinctive patterns/domains associated with function • Functionally important residues • Secondary and tertiary structure • Physical properties (hydrophobicity, IEP etc)
  • 6. BASIC INFORMATION COMES FROM SEQUENCE • One sequence- can get some information eg amino acid properties • More than one sequence- get more info on conserved residues, fold and function • Multiple alignments of related sequences-can build up consensus sequences of known families, domains, motifs or sites. • Sequence alignments can give information on loops, families and function from conserved regions
  • 7. Additional analysis of protein sequences • transmembrane regions • signal sequences • localisation signals • targeting sequences • GPI anchors • glycosylation sites • hydrophobicity • amino acid composition • molecular weight • solvent accessibility • antigenicity
  • 8. FINDING CONSERVED PATTERNS IN PROTEIN SEQUENCES • Pattern - short, simplest, but limited • Motif - conserved element of a sequence alignment, usually predictive of structural or functional region To get more information across whole alignment: • Profile • HMM
  • 9. PATTERNS • Small, highly conserved regions • Shown as regular expressions Example: [AG]-x-V-x(2)-x-{YW} – [] shows either amino acid – X is any amino acid – X(2) any amino acid in the next 2 positions – {} shows any amino acid except these BUT- limited to near exact match in small region
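The pattern notation on this slide maps almost one-to-one onto an ordinary regular expression. The sketch below is a minimal illustration of that mapping, not a full PROSITE parser; the function name `prosite_to_regex` and the test peptide are made up for the example.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core.lower() == "x":
            core = "."                       # x: any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {YW}: any amino acid except Y or W
        # [AG] and single letters are already valid regex fragments.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

regex = prosite_to_regex("[AG]-x-V-x(2)-x-{YW}")
print(regex)

# Hypothetical peptide, used only to show a hit and its position.
hit = re.search(regex, "MKTAYVLLKAGG")
print(hit.group(), hit.span())
```

Because the pattern is matched exactly, a single substitution at a fixed position removes the hit, which is the limitation the slide points out.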
  • 10. PROFILES • Table or matrix containing comparison information for aligned sequences • Used to find sequences similar to alignment rather than one sequence • Contains same number of rows as positions in sequences • Row contains score for alignment of position with each residue
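As a rough illustration of the table described here, the sketch below builds a small log-odds profile (one row per alignment position, one score per residue) and slides it along a sequence. The toy alignment, the uniform background frequency and the pseudocount of 1 are assumptions chosen for the example, not values from the slides.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one row per alignment column: {residue: log-odds score}."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        row = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        }
        profile.append(row)
    return profile

def score_window(profile, window):
    """Sum the per-position scores of a window as long as the profile."""
    return sum(row[aa] for row, aa in zip(profile, window))

def best_hit(profile, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(profile)
    return max(
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical ungapped alignment of four related five-residue segments.
toy_alignment = ["ACVLK", "GCVLR", "ACVIK", "ASVLK"]
profile = build_profile(toy_alignment)
print(best_hit(profile, "MMACVLKWW"))
```

Unlike the regular-expression pattern, a window with one unusual residue still gets a score, only a lower one.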
  • 11. HIDDEN MARKOV MODELS (HMM) • An HMM is a large-scale profile with gaps, insertions and deletions allowed in the alignments, and built around probabilities • Package used: HMMER (http://hmmer.wusd.edu/) • Start with one sequence or alignment: build the model with HMMbuild, calibrate it with HMMcalibrate, then search the database with the HMM • E-value: number of false matches expected with a certain score • Assume an extreme value distribution (EVD) for noise; calibrate by searching random sequences with the HMM to build up the curve of noise
  • 13. Gene Prediction, HMM & ncRNA What to do with an unknown sequence ? Gene Ontologies Gene Prediction HMM Composite Gene Prediction Non-coding RNA
  • 14. What is an ontology? • An ontology is an explicit specification of a conceptualization. • A conceptualization is an abstract, simplified view of the world that we want to represent. • If the specification medium is a formal representation, the ontology defines the vocabulary.
  • 15. Why Create Ontologies? • to enable data exchange among programs • to simplify unification (or translation) of disparate representations • to employ knowledge-based services • to embody the representation of a theory • to facilitate communication among people
  • 16. Summary • Ontologies are what they do: artifacts to help people and their programs communicate, coordinate, collaborate. • Ontologies are essential elements in the technological infrastructure of the Knowledge Age • http://www.geneontology.org/
  • 17. The Three Ontologies •Molecular Function — elemental activity or task nuclease, DNA binding, transcription factor •Biological Process — broad objective or goal mitosis, signal transduction, metabolism •Cellular Component — location or complex nucleus, ribosome, origin recognition complex
  • 18. DAG Structure Directed acyclic graph: each child may have one or more parents
  • 21. Example - Cellular Location
  • 23. GO: Applications • Eg. chip-data analysis: Overrepresented item can provide functional clues • Overrepresentation check: contingency table – Chi-square test (or Fisher's exact test if a frequency is < 5)
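The over-representation check can be sketched with a one-sided Fisher exact test, which is the upper tail of the hypergeometric distribution. The function name and all counts below are hypothetical; the slide gives no numbers.

```python
from math import comb

def fisher_enrichment(hits_in_list, list_size, hits_in_genome, genome_size):
    """One-sided Fisher exact p-value for over-representation of a GO term.

    hits_in_list   : genes in the selected list annotated with the term
    list_size      : number of genes in the selected list
    hits_in_genome : genes in the whole background annotated with the term
    genome_size    : number of genes in the background (includes the list)
    """
    max_hits = min(list_size, hits_in_genome)
    total = comb(genome_size, list_size)
    # Probability of seeing hits_in_list or more annotated genes by chance.
    tail = sum(
        comb(hits_in_genome, k) * comb(genome_size - hits_in_genome, list_size - k)
        for k in range(hits_in_list, max_hits + 1)
    )
    return tail / total

# Hypothetical chip experiment: 50 regulated genes, 8 of them carry a term
# that annotates 100 of the 10000 genes on the array.
print(fisher_enrichment(8, 50, 100, 10000))
```

A small p-value suggests the term is over-represented in the list; when many GO terms are tested, the p-values would still need a multiple-testing correction.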
  • 24. Gene Prediction, HMM & ncRNA What to do with an unknown sequence ? Web applications Gene Ontologies Gene Prediction HMM Composite Gene Prediction Non-coding RNA
  • 25. Computational Gene Finding Problem: Given a very long DNA sequence, identify coding regions (including intron splice sites) and their predicted protein sequences
  • 26. Computational Gene Finding Eukaryotic gene structure
  • 27. Computational Gene Finding • There is no (yet known) perfect method for finding genes. All approaches rely on combining various “weak signals” together • Find elements of a gene – coding sequences (exons) – promoters and start signals – poly-A tails and downstream signals • Assemble into a consistent gene model
  • 30. GENE STRUCTURE INFORMATION - POSITION ON PHYSICAL MAP This gene structure corresponds to the position on the physical map
  • 31. The Active Zone limits the extent of analysis, genefinder & fasta dumps A blue line within the yellow box indicates regions outside of the active zone The active zone is set by entering coordinates in the active zone (yellow box) GENE STRUCTURE INFORMATION - ACTIVE ZONE This gene structure shows the Active Zone
  • 32. GENE STRUCTURE INFORMATION - POSITION This gene structure relates to the Position: Change origin of this scale by entering a number in the green 'origin' box
  • 33. Boxes are Exons, thin lines (or springs) are Introns GENE STRUCTURE INFORMATION - PREDICTED GENE STRUCTURE This gene structure relates to the predicted gene structures
  • 34. Find the open reading frames • The triplet, non-punctuated nature of the genetic code helps us out • 64 potential codons: 61 true codons, 3 stop codons (TGA, TAA, TAG) • With a random distribution, approx. 1 in 21 codons will be a stop • Any sequence has 3 potential reading frames (+1, +2, +3) • Its complement also has three potential reading frames (-1, -2, -3) • 6 possible reading frames in total • Example: GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT – Frame +1: E K A P A Q S E M V S L S F H R – Frame +2: K K L L P N L K W L A Y L S T – Frame +3: K S S C P I * N G * P I F P P
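The three forward translations on this slide can be reproduced with a few lines of code. This is a minimal sketch using the standard genetic code; the helper names are illustrative and it assumes an uppercase sequence containing only A, C, G and T.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna, offset=0):
    """Translate complete codons starting at the given offset (0, 1 or 2)."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)

def six_frames(dna):
    """Translate all six reading frames; '*' marks a stop codon."""
    rc = reverse_complement(dna)
    frames = {f"+{k + 1}": translate(dna, k) for k in range(3)}
    frames.update({f"-{k + 1}": translate(rc, k) for k in range(3)})
    return frames

seq = "GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT"
for name, protein in six_frames(seq).items():
    print(name, protein)
```

In this example only frame +3 contains stop codons among the forward frames, so frames +1 and +2 remain open across the whole fragment.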
  • 35. There is one column for each frame Small horizontal lines represent stop codons GENE STRUCTURE INFORMATION - OPEN READING FRAMES This gene structure relates to Open reading Frames
  • 36. They have one column for each frame The size indicates relative score for the particular start site GENE STRUCTURE INFORMATION - START CODONS This gene structure represents Start Codons
  • 37. Computational Gene Finding: Hexanucleotide frequencies • Amino acid distributions are biased e.g. p(A) > p(C) • Pairwise distributions also biased e.g. p(AT)/[p(A)*p(T)] > p(AC)/[p(A)*p(C)] • Nucleotides that code for preferred amino acids (and AA pairs) occur more frequently in coding regions than in non-coding regions. • Codon biases (per amino acid) • Hexanucleotide distributions that reflect those biases indicate coding regions.
  • 38. Gene prediction • Generation of datasets (Ensmart@Ensembl): Dataset 1 (http://biobix.ugent.be/txt/coding.txt) consists of >900 coding regions (DNA); Dataset 2 (http://biobix.ugent.be/txt/noncoding.txt) consists of >900 non-coding regions • Distance Array: Calculate for every base all the distances (in bp) to the same nucleotide (focus on the first 1000 bp of the coding region and limit the distance array to a window of 1000 bp) • Do you see a difference in this “distance array” between coding and noncoding sequence? Could it be used to predict genes? • Write a program to predict genes in the following genomic sequence (http://biobix.ugent.be/txt/genomic.txt) • What else could help in finding genes in raw genomic sequences?
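One possible reading of the distance-array exercise is sketched below: for every base, count the distances to each later occurrence of the same nucleotide, up to a window. The function names and the period-3 summary are assumptions made for illustration, not the course's reference solution.

```python
from collections import Counter

def distance_array(sequence, window=1000):
    """Histogram of distances (in bp) between identical nucleotides."""
    counts = Counter()
    positions = {}  # nucleotide -> positions seen so far, in increasing order
    for i, base in enumerate(sequence.upper()):
        seen = positions.setdefault(base, [])
        for j in reversed(seen):       # nearest earlier occurrences first
            distance = i - j
            if distance > window:
                break                  # everything further back is out of range
            counts[distance] += 1
        seen.append(i)
    return counts

def period3_share(counts):
    """Fraction of all counted distances that are multiples of three."""
    total = sum(counts.values())
    in_phase = sum(n for d, n in counts.items() if d % 3 == 0)
    return in_phase / total if total else 0.0

toy = distance_array("ATGATGATG")
print(sorted(toy.items()), period3_share(toy))
```

In real coding sequence the expectation is a weaker version of this toy result: distances that are multiples of three tend to be enriched because of codon structure, which is one way to approach the question the slide asks.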
  • 39. The grey boxes indicate regions where the codon frequencies match those of known C. elegans genes. the larger the grey box the more this region resembles a C. elegans coding element GENE STRUCTURE INFORMATION - CODING POTENTIAL This gene structure corresponds to the Coding Potential
  • 40. blastn (EST) • For raw DNA sequence analysis blastx is extremely useful • It will probe your DNA sequence against the protein database • A match (homolog) gives you some ideas regarding function • One problem is the sheer number of genome sequences: you will get matches to genome database entries that are identified strictly by sequence homology – often you need some experimental evidence
  • 41. The blue boxes indicate regions of sequence which when translated have similarity to previously characterised proteins. To view the alignment, select the right mouse button whilst over the blue box. GENE STRUCTURE INFORMATION - SEQUENCE SIMILARITY This feature shows protein sequence similarity
  • 42. The yellow boxes represent DNA matches (Blast) to C. elegans Expressed Sequence Tags (ESTS) To view the alignment use the right mouse button whilst over the yellow box to invoke Blixem GENE STRUCTURE INFORMATION - EST MATCHES This gene structure relates to Est Matches
  • 43. New generation of programs to predict gene coding sequences based on a non-random repeat pattern (eg. Glimmer, GeneMark) – actually pretty good Borodovsky et al., 1999, Organization of the Prokaryotic Genome (Charlebois, ed) pp. 11-34
  • 44. Computational Gene Finding • CpG islands are regions of sequence that have a high proportion of CG dinucleotide pairs (p is a phosphodiester bond linking them) – CpG islands are present in the promoter and exonic regions of approximately 40% of mammalian genes – Other regions of the mammalian genome contain few CpG dinucleotides and these are largely methylated • Definition: sequences of >500 bp with – G+C > 55% – Observed(CpG)/Expected(CpG) > 0.65
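The definition on this slide can be checked directly. The sketch below assumes the usual estimate Expected(CpG) = (number of C × number of G) / length, which the slide does not spell out, and tests one fixed sequence rather than sliding a window along a genome.

```python
def cpg_stats(sequence):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = sequence.upper()
    n = len(seq)
    c_count, g_count = seq.count("C"), seq.count("G")
    observed = seq.count("CG")
    gc_fraction = (c_count + g_count) / n if n else 0.0
    # Assumed estimate of the expected CpG count: (#C * #G) / length.
    expected = c_count * g_count / n if n else 0.0
    ratio = observed / expected if expected else 0.0
    return gc_fraction, ratio

def looks_like_cpg_island(sequence, min_length=500, min_gc=0.55, min_ratio=0.65):
    """Apply the three thresholds from the slide to one sequence."""
    gc_fraction, ratio = cpg_stats(sequence)
    return len(sequence) > min_length and gc_fraction > min_gc and ratio > min_ratio

print(cpg_stats("CGCG"))
print(looks_like_cpg_island("CG" * 300), looks_like_cpg_island("AT" * 300))
```

To scan a chromosome one would typically apply the same test to overlapping windows and merge adjacent windows that pass.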
  • 45. This column shows matches to members of a number of repeat families Currently a hidden markov model is used to detect these GENE STRUCTURE INFORMATION - REPEAT FAMILIES This gene structure corresponds to Repeat Families
  • 46. This column shows regions of localised repeats both tandem and inverted Clicking on the boxes will show the complete repeat information in the blue line at the top end of the screen GENE STRUCTURE INFORMATION - REPEATS This gene structure relates to Repeats
  • 48. Computational Gene Finding: Splice junctions • Most Eukaryotic introns have a consensus splice signal: GU at the beginning (“donor”), AG at the end (“acceptor”). • Variation does occur in the splice sites • Many AGs and GTs are not splice sites. • Database of experimentally validated human splice sites: http://www.ebi.ac.uk/~thanaraj/splice.html
  • 49. The Splice Sites are shown 'Hooked' The Hook points in the direction of splicing, therefore 3' splice sites point up and 5' Splice sites point down The colour of the Splice Site indicates the position at which it interrupts the Codon The height of the Splices is proportional to the Genefinder score of the Splice Site GENE STRUCTURE INFORMATION - PUTATIVE SPLICE SITES This gene structure shows putative splice sites
  • 50. Gene Prediction, HMM & ncRNA What to do with an unknown sequence ? Web applications Gene Ontologies Gene Prediction HMM Composite Gene Prediction Non-coding RNA
  • 52. Towards profiles (PSSM) with indels – insertions and/or deletions • Recall that profiles are matrices that identify the probability of seeing an amino acid at a particular location in a motif. • What about motifs that allow insertions or deletions (together, called indels)? • Patterns and regular expressions can handle these easily, but profiles are more flexible. • Can indels be integrated into profiles?
  • 53. Hidden Markov Models: Graphical models of sequences • Need a representation that allows specification of the probability of introducing (and/or extending) a gap in the profile. A .1 C .05 D .2 E .08 F .01 continue Gap A .04 C .1 D .01 E .2 F .02 Gap A .2 C .01 D .05 E .1 F .06 delete
  • 54. Hidden Markov Chain • A sequence is said to be Markovian if the probability of the occurrence of an element in a particular position depends only on the previous elements in the sequence. • Order of a Markov chain depends on how many previous elements influence probability – 0th order: probability does not depend on any previous position (the same distribution at every position) – 1st order: probability depends only on immediately previous position. • 1st order Markov chains are good for proteins.
  • 56. Markov chain with begin and end
  • 57. Markov Models: Graphical models of sequences • Consists of states (boxes) and transitions (arcs) labeled with probabilities • States have probability(s) of “emitting” an element of a sequence (or nothing). • Arcs have probability of moving from one state to another. – Sum of probabilities of all out arcs must be 1 – Self-loops (e.g. gap extend) are OK.
  • 58. Markov Models • Simplest example: Each state emits (or, equivalently, recognizes) a particular element with probability 1, and each transition is equally likely. [Figure: states Begin, Emit 1, Emit 2, Emit 3, Emit 4, End] Example sequences: 1234, 234, 14, 121214, 2123334
  • 59. Hidden Markov Models: Probabilistic Markov Models • Now, add probabilities to each transition (let emission remain a single element) • We can calculate the probability of any sequence given this model by multiplying: p(1234) = 0.5 * 0.1 * 0.75 * 0.8 = 0.03; p(14) = 0.5 * 0.9 = 0.45; p(2334) = 0.5 * 0.75 * 0.2 * 0.8 = 0.06 [Figure: states Begin, Emit 1, Emit 2, Emit 3, Emit 4, End; transition probabilities shown: 1.0, 0.5, 0.5, 0.25, 0.9, 0.75, 0.1, 0.2, 0.8]
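The three products on this slide can be recomputed from a transition table. The figure is flattened in this transcript, so the assignment of each probability to an arc below is an assumption inferred from the worked products and from the example sequences on the previous slide (for instance, 2 → 1 = 0.25 is not used in any product shown).

```python
# Transition probabilities; the arc assignment is inferred, see the note above.
TRANSITIONS = {
    "Begin": {"1": 0.5, "2": 0.5},
    "1": {"2": 0.1, "4": 0.9},
    "2": {"1": 0.25, "3": 0.75},
    "3": {"3": 0.2, "4": 0.8},
    "4": {"End": 1.0},
}

def sequence_probability(symbols):
    """Multiply the transition probabilities along Begin -> symbols -> End."""
    path = ["Begin", *symbols, "End"]
    probability = 1.0
    for current, following in zip(path, path[1:]):
        probability *= TRANSITIONS.get(current, {}).get(following, 0.0)
    return probability

for example in ("1234", "14", "2334", "13"):
    print(example, sequence_probability(example))
```

Because every state emits exactly one symbol, the sequence itself names the path, so a single product is enough; a sequence such as 13 that uses a missing arc gets probability zero.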
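  • 60. Hidden Markov Models: Probabilistic Emission • If we let the states define a set of emission probabilities for elements, we can no longer be sure which state we are in given a particular element of a sequence • BCCD or BCCD ? (the same sequence can be produced along different state paths) [Figure: four emitting states between Begin and End; emission probabilities A (0.8) B (0.2); B (0.7) C (0.3); C (0.1) D (0.9); C (0.6) A (0.4); transition probabilities shown: 1.0, 0.9, 0.5, 0.5, 0.25, 0.75, 0.1, 0.2, 0.8]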
  • 61. Hidden Markov Models • Emission uncertainty means the sequence doesn't identify a unique path. The states are “hidden” • Probability of a sequence is the sum over all paths that can produce it: p(bccd) = 0.5 * 0.2 * 0.1 * 0.3 * 0.75 * 0.6 * 0.8 * 0.9 + 0.5 * 0.7 * 0.75 * 0.6 * 0.2 * 0.6 * 0.8 * 0.9 = 0.000972 + 0.013608 = 0.01458 [Figure: four emitting states between Begin and End; emission probabilities A (0.8) B (0.2); B (0.7) C (0.3); C (0.1) D (0.9); C (0.6) A (0.4)]
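Summing over all paths by hand becomes impractical quickly; the forward algorithm does the same sum one symbol at a time. The sketch below recomputes p(bccd) for this toy model. The mapping of probabilities to states and arcs is an assumption read off the flattened figure, chosen so that it agrees with the two products on the slide.

```python
# Toy HMM; state numbering and arc assignment are inferred from the slide.
TRANSITIONS = {
    "Begin": {"1": 0.5, "2": 0.5},
    "1": {"2": 0.1, "4": 0.9},
    "2": {"1": 0.25, "3": 0.75},
    "3": {"3": 0.2, "4": 0.8},
    "4": {"End": 1.0},
}
EMISSIONS = {
    "1": {"A": 0.8, "B": 0.2},
    "2": {"B": 0.7, "C": 0.3},
    "3": {"C": 0.6, "A": 0.4},
    "4": {"C": 0.1, "D": 0.9},
}

def forward_probability(observed):
    """P(observed | model), summed over every state path from Begin to End."""
    # forward[state] = probability of the prefix read so far, ending in state
    forward = {"Begin": 1.0}
    for symbol in observed.upper():
        following = {}
        for state, prob in forward.items():
            for target, transition in TRANSITIONS.get(state, {}).items():
                emission = EMISSIONS.get(target, {}).get(symbol, 0.0)
                if emission:
                    following[target] = following.get(target, 0.0) + prob * transition * emission
        forward = following
    # Finish by moving from the last emitting state to End.
    return sum(prob * TRANSITIONS.get(state, {}).get("End", 0.0)
               for state, prob in forward.items())

print(forward_probability("bccd"))
```

For this model only the two paths shown on the slide contribute, so the forward value matches their sum; the benefit of the algorithm is that its cost grows linearly with sequence length rather than with the number of paths.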
  • 63. Hidden Markov Models: The occasionally dishonest casino
  • 64. Hidden Markov Models: The occasionally dishonest casino
  • 65. Use of Hidden Markov Models • The HMM must first be “trained” using a training set – Eg. database of known genes. – Consensus sequences for all signal sensors are needed. – Compositional rules (i.e., emission probabilities) and length distributions are necessary for content sensors. • Transition probabilities between all connected states must be estimated. • Estimate the probability of sequence s, given model m, P(s|m) – Multiply probabilities along most likely path (or add logs – less numeric error)
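The last point, following the most likely path and adding logs instead of multiplying, is the Viterbi algorithm. The sketch below applies it to the small four-state model from the earlier slides; the parameter table is the same inferred reading of that figure and is an assumption, as are the function and variable names.

```python
import math

# Toy HMM from the earlier slides; arc assignment inferred from the figure.
TRANSITIONS = {
    "Begin": {"1": 0.5, "2": 0.5},
    "1": {"2": 0.1, "4": 0.9},
    "2": {"1": 0.25, "3": 0.75},
    "3": {"3": 0.2, "4": 0.8},
    "4": {"End": 1.0},
}
EMISSIONS = {
    "1": {"A": 0.8, "B": 0.2},
    "2": {"B": 0.7, "C": 0.3},
    "3": {"C": 0.6, "A": 0.4},
    "4": {"C": 0.1, "D": 0.9},
}

def viterbi(observed):
    """Return (log probability, state path) of the most likely path."""
    best = {"Begin": (0.0, [])}  # state -> (log probability, path so far)
    for symbol in observed.upper():
        following = {}
        for state, (log_prob, path) in best.items():
            for target, transition in TRANSITIONS.get(state, {}).items():
                emission = EMISSIONS.get(target, {}).get(symbol, 0.0)
                if emission == 0.0:
                    continue
                # Adding logs replaces multiplying small probabilities.
                candidate = log_prob + math.log(transition) + math.log(emission)
                if target not in following or candidate > following[target][0]:
                    following[target] = (candidate, path + [target])
        best = following
    finals = [
        (log_prob + math.log(TRANSITIONS[state]["End"]), path)
        for state, (log_prob, path) in best.items()
        if TRANSITIONS.get(state, {}).get("End", 0.0) > 0.0
    ]
    return max(finals) if finals else (float("-inf"), [])

log_prob, path = viterbi("bccd")
print(path, math.exp(log_prob))
```

Working in log space matters for real sequences, where the product of thousands of probabilities would underflow to zero in floating point.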
  • 66. Applications of Hidden Markov Models • HMMs are effectively profiles with gaps, and have applications throughout Bioinformatics • Protein sequence applications: – MSAs and identifying distant homologs E.g. Pfam uses HMMs to define its MSAs – Domain definitions – Used for fold recognition in protein structure prediction • Nucleotide sequence applications: – Models of exons, genes, etc. for gene recognition.
  • 67. Hidden Markov Models Resources • UC Santa Cruz (David Haussler group) – SAM-02 server. Returns alignments, secondary structure predictions, HMM parameters, etc. etc. – SAM HMM building program (requires free academic license) • Washington U. St. Louis (Sean Eddy group) – Pfam. Large database of precomputed HMM-based alignments of proteins – HMMer, program for building HMMs • Gene finders and other HMMs (more later)
  • 68. Example TMHMM Beyond Kyte-Doolittle …
  • 69. HMM in protein analysis • http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html
  • 71. Hidden Markov model for gene structure Signals (blue nodes): • begin sequence (B) • start translation (S) • donor splice site (D) • acceptor splice site (A) • stop translation (T) • end sequence (F) Contents (red arcs): • 5’ UTR (J5’) • initial exon (EI) • exon (E) • intron (I) • final exon (EF) • single exon (ES) • 3’ UTR (J3’) • A representation of the linguistic rules for what features might follow what other features when parsing a sequence consisting of a multiple exon gene. • A candidate gene structure is created by tracing a path from B to F. • A hidden Markov model (or hidden semi-Markov model) is defined by attaching stochastic models to each of the arcs and nodes.
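The "linguistic rules" on this slide can be written down as a small graph whose nodes are the signals and whose arcs carry the content types. The figure itself is not in this transcript, so the wiring below is an assumption reconstructed from the legend (for example, that a single exon runs directly from S to T); no probabilities are attached.

```python
# Arcs between signal nodes, labeled with the content they represent.
# The wiring is inferred from the slide's legend, not copied from the figure.
ARCS = {
    ("B", "S"): "J5'",  # 5' UTR: begin sequence -> start translation
    ("S", "D"): "EI",   # initial exon: start translation -> donor site
    ("D", "A"): "I",    # intron: donor site -> acceptor site
    ("A", "D"): "E",    # internal exon: acceptor site -> donor site
    ("A", "T"): "EF",   # final exon: acceptor site -> stop translation
    ("S", "T"): "ES",   # single exon: start -> stop translation
    ("T", "F"): "J3'",  # 3' UTR: stop translation -> end sequence
}

def is_candidate_gene(path):
    """True if the path runs from B to F using only allowed arcs."""
    if len(path) < 2 or path[0] != "B" or path[-1] != "F":
        return False
    return all(step in ARCS for step in zip(path, path[1:]))

def contents_along(path):
    """List the content labels traversed by a valid path."""
    if not is_candidate_gene(path):
        raise ValueError("not a valid B-to-F path")
    return [ARCS[step] for step in zip(path, path[1:])]

print(contents_along(["B", "S", "T", "F"]))
print(contents_along(["B", "S", "D", "A", "D", "A", "T", "F"]))
```

Attaching a stochastic model to each arc and node, as the slide describes, would turn this parse graph into a hidden (semi-)Markov model that can score competing gene structures.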
  • 72. Classic Programs for gene finding Some of the best programs are HMM based: • GenScan – http://genes.mit.edu/GENSCAN.html • GeneMark – http://opal.biology.gatech.edu/GeneMark/ Other programs • AAT, EcoParse, Fexeh, Fgeneh, Fgenes, Finex, GeneHacker, GeneID-3, GeneParser 2, GeneScope, Genie, GenLang, Glimmer, GlimmerM, Grail II, HMMgene, Morgan, MZEF, Procrustes, SORFind, Veil, Xpound
  • 73. Hidden Markov Models: Gene Finding Software • A Semi-Markov Model GENSCAN not to be confused with GeneScan, a commercial product – Explicit model of how long to stay in a state (rather than just self-loops, which must be exponentially decaying) • Tracks “phase” of exon or intron (0 coincides with codon boundary, or 1 or 2) • Tracks strand (and direction)
  • 74. Conservation of Gene Features • Conservation pattern across 3165 mappings of human RefSeq mRNAs to the genome. A program sampled 200 evenly spaced bases across 500 bases upstream of transcription, the 5’ UTR, the first coding exon, introns, middle coding exons, introns, the 3’ UTR and 500 bases after polyadenylation. There are peaks of conservation at the transition from one region to another. [Figure: aligning identity, axis from 50% to 100%]
  • 75. Composite Approaches • Use EST info to constrain HMMs (Genie) • Use protein homology info on top of HMMs (fgenesh++, GenomeScan) • Use cross species genomic alignments on top of HMMs (twinscan, fgenesh2, SLAM, SGP)
  • 76. Gene Prediction: more complex … 1. Species specific 2. Splicing enhancers found in coding regions 3. Trans-splicing 4. …
  • 77. Length preference 5’ ss intcomp branch 3’ ss
  • 79. Contents-Schedule RNA genes Besides the 6000 protein-coding genes, there are: • 140 ribosomal RNA genes • 275 transfer RNA genes • 40 small nuclear RNA genes • >100 small nucleolar genes • ? • pRNA in the phi29 rotary packaging motor (Simpson et al. Nature 408:745-750, 2000) • Cartilage-hair hypoplasia mapped to an RNA (Ridanpää et al. Cell 104:195-203, 2001) • The human Prader-Willi critical region (Cavaille et al. PNAS 97:14035-7, 2000)
  • 84. RNA genes can be hard to detect • UGAGGUAGUAGGUUGUAUAGU: C. elegans let-7; 21 nt (Pasquinelli et al. Nature 408:86-89, 2000) • Often small • Sometimes multicopy and redundant • Often not polyadenylated (not represented in ESTs) • Immune to frameshift and nonsense mutations • No open reading frame, no codon bias • Often evolving rapidly in primary sequence • miRNA genes
  • 85. Lin-4 • Lin-4 identified in a screen for mutations that affect timing and sequence of postembryonic development in C.elegans. Mutants re-iterate L1 instead of later stages of development • Gene positionally cloned by isolating a 693-bp DNA fragment that can rescue the phenotype of mutant animals • No protein found but 61-nucleotide precursor RNA with stem-loop structure which is processed to 22-mer ncRNA • Genetically lin-4 acts as negative regulator of lin-14 and lin-28 • The 3’ UTR of the target genes have short stretches of complementarity to lin-4 • Deletion of these lin-4 target seq causes unregulated gof phenotype • Lin-4 RNA inhibits accumulation of LIN-14 and LIN-28 proteins although the target mRNA
  • 86. Let-7 (Pasquinelli et al. Nature 408:86-89,2000) Let-7 (lethal-7) was also mapped to a ncRNA gene with a 21- nucleotide product The small let-7 RNA is also thought to be a post-transcriptional negative regulator for lin-41 and lin-42 100% conserved in all bilaterally symmetrical animals (not jellyfish and sponges) Sometimes called stRNAs, small temporal RNAs
  • 88. Two computational analysis problems • Similarity search (eg BLAST), I give you a query, you find sequences in a database that look like the query (note: SW/Blat) – For RNA, you want to take the secondary structure of the query into account • Genefinding. Based solely on a priori knowledge of what a “gene” looks like, find genes in a genome sequence – For RNA, with no open reading frame and no codon bias, what do you look for ?
  • 95. Context-free grammars Basic CFG “production rules” S -> aS S -> Sa S -> aSu S -> SS A CFG “derivation” S -> aS
  • 96. Context-free grammars Basic CFG “production rules” S -> aS S -> Sa S -> aSu S -> SS A CFG “derivation” S -> aS S -> aaS
  • 97. Context-free grammars Basic CFG “production rules” S -> aS S -> Sa S -> aSu S -> SS A CFG “derivation” S -> aS S -> aaS S -> aaSS
  • 98. Context-free grammars Basic CFG “production rules” S -> aS S -> Sa S -> aSu S -> SS A CFG “derivation” S -> aS S -> aaS S -> aaSS S -> aagScuS
  • 99. Context-free grammars Basic CFG “production rules” S -> aS S -> Sa S -> aSu S -> SS A CFG “derivation” S -> aS S -> aaS S -> aaSS S -> aagScuS
  • 100. Context-free grammars Basic CFG “production rules” S -> aS S -> Sa S -> aSu S -> SS A CFG “derivation” S -> aS S -> aaS S -> aaSS S -> aagScuS S -> aagaSucugSc
  • 101. Context-free grammars Basic CFG “production rules” S -> aS S -> Sa S -> aSu S -> SS A CFG “derivation” S -> aS S -> aaS S -> aaSS S -> aagScuS S -> aagaSucugSc S -> aagaSaucuggScc S -> aagacSgaucuggcgSccc
  • 102. Context-free grammars Basic CFG “production rules” S -> aS S -> Sa S -> aSu S -> SS A CFG “derivation” S -> aS S -> aaS S -> aaSS S -> aagScuS S -> aagaSucugSc S -> aagaSaucuggScc S -> aagacSgaucuggcgSccc S -> aagacuSgaucuggcgSccc S -> aagacuuSgaucuggcgaSccc S -> aagacuucSgaucuggcgacSccc S -> aagacuucgSgaucuggcgacaSccc S -> aagacuucggaucuggcgacaccc
  • 103. Context-free grammars Basic CFG “production rules” S -> aS S -> Sa S -> aSu S -> SS A CFG “derivation” S -> aS S -> aaS S -> aaSS S -> aagScuS S -> aagaSucugSc S -> aagaSaucuggScc S -> aagacSgaucuggcgSccc S -> aagacuSgaucuggcgSccc S -> aagacuuSgaucuggcgaSccc S -> aagacuucSgaucuggcgacSccc S -> aagacuucgSgaucuggcgacaSccc S -> aagacuucggaucuggcgacaccc
  • 104. Context-free grammars Basic CFG “production rules” S -> aS S -> Sa S -> aSu S -> SS A CFG “derivation” S -> aS S -> aaS S -> aaSS S -> aagScuS S -> aagaSucugSc S -> aagaSaucuggScc S -> aagacSgaucuggcgSccc S -> aagacuSgaucuggcgSccc S -> aagacuuSgaucuggcgaSccc S -> aagacuucSgaucuggcgacSccc S -> aagacuucgSgaucuggcgacaSccc S -> aagacuucggaucuggcgacaccc [Figure: secondary structure of the derived sequence]
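The four production rules have the same shape as the recursion in Nussinov-style base-pair maximization, which is a common way to connect such a grammar to RNA folding. The sketch below is that swapped-in dynamic program, not a parser for the grammar on the slide; it assumes Watson-Crick pairs only (no G-U wobble) and a minimum hairpin loop of three unpaired bases.

```python
# Assumption: only Watson-Crick pairs count, G-U wobble pairs are ignored.
PAIRS = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")}

def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style recursion)."""
    rna = rna.lower()
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]                        # S -> aS: i unpaired
            score = max(score, best[i][j - 1])            # S -> Sa: j unpaired
            if (rna[i], rna[j]) in PAIRS:                 # S -> aSu: i pairs with j
                score = max(score, best[i + 1][j - 1] + 1)
            for k in range(i + 1, j):                     # S -> SS: two substructures
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("gggaaaccc"))
print(max_base_pairs("aagacuucggaucuggcgacaccc"))
```

Maximizing the pair count is only a crude stand-in for energy-based or probabilistic (stochastic CFG) folding, but it shows how each production rule becomes one case of the recursion.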
  • 107. The power of comparative analysis • Comparative genome analysis is an indispensable means of inferring whether a locus produces a ncRNA as opposed to encoding a protein. • For a small gene to be called a protein-coding gene, one excellent line of evidence is that the ORF is significantly conserved in another related species. • It is more difficult to positively corroborate a ncRNA by comparative analysis but, in at least some cases, a ncRNA might conserve an intramolecular secondary structure and comparative analysis can show compensatory base substitutions. • With comparative genome sequence data now accumulating in the public domain for most if not all important genetic systems, comparative analysis can (and should) become routine.
  • 108. Compensatory substitutions that maintain the structure U U C G U A A U G C A UCGAC 3’ G C 5’
  • 109. Evolutionary conservation of RNA molecules can be revealed by identification of compensatory substitutions
  • 111. • Manual annotation of 60,770 full-length mouse complementary DNA sequences, clustered into 33,409 ‘transcriptional units’, contributing 90.1% of a newly established mouse transcriptome database. • Of these transcriptional units, 4,258 are new protein-coding and 11,665 are new non-coding messages, indicating that non-coding RNA is a major component of the transcriptome.
  • 114. Therapeutic Applications • Shooting millions of tiny RNA molecules into a mouse’s bloodstream can protect its liver from the ravages of hepatitis, a new study shows. In this case, they blunt the liver’s self-destructive inflammatory response, which can be triggered by agents such as the hepatitis B or C viruses. (Harvard University immunologists Judy Lieberman and Premlata Shankar) • In a series of experiments published online this week by Nature Medicine, Lieberman’s team gave mice injections of siRNAs designed to shut down a gene called Fas. When overactivated during an inflammatory response, it induces liver cells to self-destruct. The next day, the animals were given an antibody that sends Fas into hyperdrive. Control mice died of acute liver failure within a few days, but 82% of the siRNA-treated mice remained free of serious disease and survived. Between 80% and 90% of their liver cells had incorporated the siRNAs.