A brief introduction to artificial neural networks and their application to bioinformatics, showing how neural networks can be used to predict splice sites in genome and gene sequences.
Journal Club, Dec 24, 2015: Splice Site Prediction Using Artificial Neural Networks
1. Splice Site Prediction Using
Artificial Neural Networks
Journal Club Dec24, 2015
Hiroya MORIMOTO
2. Information
— Title:
— Splice Site Prediction Using Artificial Neural Networks
— Authors:
— Øystein Johansen, Tom Ryen, Trygve Eftesøl, Thomas Kjosmoen, and
Peter Ruoff
— Institutional affiliations:
— University of Stavanger, Norway
— Publishing year:
— 2009
3. 0. ABSTRACT
— SQ
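— Splice site (SS) prediction using neural networks (NN) is already in use.
— Methods
— After applying the NN, the output values are fitted to a parabolic function, which enables accurate SS prediction.
— Data
— From Arabidopsis genes: 16,965 genes -> training, 5,000 genes -> benchmark, 20 genes -> verification
— Result
— At best, Sn=0.891, Sp=0.816, CC=0.552.
A paper that uses ANNs to predict DSS/ASS.
Tested on Arabidopsis thaliana, it achieved high accuracy.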
Abbreviations 1:
ANN = Artificial neural network,
SS = Splice site
DSS = Donor splice site
ASS = Acceptor splice site
Abbreviations 2:
CC = Correlation coefficient, Sp = Specificity, Sn = Sensitivity,
TP = True positive, FP = False positive,
TN = True negative, FN = False negative
4. 1. INTRODUCTION
— SQ
— Some of the data in databases already has genes and promoters annotated by genetic and biochemical methods.
— Sequence information grows every day, and much of it remains unannotated.
— This study did …
— Aimed to predict DSS (donor splice sites) and ASS (acceptor splice sites) using artificial neural networks.
Sequence information keeps growing, so let us use computing power to annotate it at low cost!
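[Figure: example sequence ATGCGATTTAGC AGCGCGAATAGGGTGTCAGTTAAGCTGAG laid out as exon, intron, exon, with candidate positions labeled "This is a DSS!", "This is an ASS!", and "Not an ASS…"]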
10. 2. NEURAL NETWORK
Prediction system overview: an ANN that takes a fixed-length DNA sequence as input
and outputs whether a DSS/ASS is present within that window.
neural network will give an output score if it recognizes there is a splice site in
the window. A diagram of the entire prediction system is shown in Fig. 2. The
window size is chosen to be 60 nucleotides. This is hopefully wide enough to find
significant patterns on both sides of the splice site. A bigger window will make
the neural network bigger and thereby harder to train. Smaller window would
maybe exclude important information around the splice site.
11. 2. NEURAL NETWORK (cont’d)
— Network topology
— 3 layer feedforward neural network
— 2 output units (DSS/ASS)
— 128 hidden layer units
— 240 input units (=60bp)
— Each base maps to 4 units, e.g. A -> (1,0,0,0).
— 31,106 parameters in total.
— 240 x 128 + 128 x 2 + 128 + 2
— Including the bias parameters.
— Activation function
— Uses the standard sigmoid function (β = 0.1).
— Backpropagation
— No second derivatives are used when computing how the error changes at each unit; this presumably means that first derivatives are used.
— Parameters are adjusted toward a (local) optimum where the derivative of the error is 0, i.e. where the error no longer changes.
[Figure: sigmoid activation function plotted for inputs from -50 to 50; output axis ticks at 0.25, 0.5, 0.75, 1]
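The parameter count and the activation function on this slide can be checked with a few lines of Python. This is a minimal sketch: the logistic form 1 / (1 + exp(-beta * x)) is an assumption about what "standard sigmoid function with β = 0.1" means, and the function names are hypothetical.

```python
import math

WINDOW = 60                          # nucleotides per window (from the slides)
UNITS_PER_BASE = 4                   # one-hot width per nucleotide
N_INPUT = WINDOW * UNITS_PER_BASE    # 240 input units
N_HIDDEN = 128
N_OUTPUT = 2                         # one unit each for DSS and ASS


def parameter_count(n_in: int, n_hidden: int, n_out: int) -> int:
    """Weights of both layers plus one bias per hidden and output unit."""
    weights = n_in * n_hidden + n_hidden * n_out
    biases = n_hidden + n_out
    return weights + biases


def sigmoid(x: float, beta: float = 0.1) -> float:
    """Logistic function with slope parameter beta (assumed form)."""
    return 1.0 / (1.0 + math.exp(-beta * x))


print(parameter_count(N_INPUT, N_HIDDEN, N_OUTPUT))  # 31106
print(sigmoid(0.0), round(sigmoid(50.0), 3))
```

With β = 0.1 the curve only approaches saturation near the ends of the plotted range of -50 to 50, which is consistent with the axis shown on the slide.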
12. 3. TRAINING DATA & BENCHMARKING DATA
— Data
— The Arabidopsis Information Resource (TAIR) release 8
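— A database for Arabidopsis thaliana.
— Excluded genes
— Genes meeting any of the following conditions were excluded, so that the ANN input never contains several SS of the same type at once and so that elements that harm (complicate) training are removed.
— Genes containing N gaps.
— Single exon genes
— Genes containing exons/introns of 30 bp or shorter.
— Genes with alternative splicing.
— One variant was kept and the others were removed.
— Training data set & benchmark data set
— The remaining 21,985 genes were randomly divided, with 16,965 genes for training and 5,000 genes for benchmarking.
— 20 genes were set aside for final verification.
— Each gene set contains genes from all 5 chromosomes.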
[Figure: dividing the remaining data set into the training, benchmark, and final verification sets]
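The random division described on this slide can be sketched as below. The set sizes come from the slides (16,965 + 5,000 + 20 = 21,985); the function name, the seed, and the placeholder gene ids are hypothetical, and the sketch does not check that each set covers all 5 chromosomes.

```python
import random
from typing import List, Sequence, Tuple


def split_genes(gene_ids: Sequence[str],
                n_train: int = 16965,
                n_benchmark: int = 5000,
                n_verify: int = 20,
                seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle the gene ids and cut them into three disjoint sets."""
    needed = n_train + n_benchmark + n_verify
    if len(gene_ids) < needed:
        raise ValueError(f"need {needed} genes, got {len(gene_ids)}")
    shuffled = list(gene_ids)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    train = shuffled[:n_train]
    benchmark = shuffled[n_train:n_train + n_benchmark]
    verify = shuffled[n_train + n_benchmark:needed]
    return train, benchmark, verify


genes = [f"gene_{i:05d}" for i in range(21985)]   # placeholder ids, not TAIR ids
train, benchmark, verify = split_genes(genes)
print(len(train), len(benchmark), len(verify))    # 16965 5000 20
```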
13. 3. TRAINING DATA & BENCHMARKING DATA (cont’d)
— Relationship between the training data, benchmark data, and final verification data
— Training data (16,965 genes)
— Used for training.
— Fed to the network in every epoch (training iterations).
— Benchmark data (5,000 genes)
— Used to measure the progress of training.
— Used at the end of every epoch.
— Final verification data (20 genes)
— Used to evaluate the accuracy of the final NN (measurements of the performance = verification).
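The interplay of training and benchmarking can be sketched as a loop that trains for one epoch and then scores the network on the benchmark set. This is only an illustration under assumptions: `train_one_epoch` and `benchmark_score` are hypothetical callables, the loop is bounded by `max_epochs` for the demo, and remembering the best epoch is a design choice of this sketch rather than something the slides state.

```python
from typing import Callable, Tuple


def run_training(train_one_epoch: Callable[[], None],
                 benchmark_score: Callable[[], float],
                 max_epochs: int) -> Tuple[int, float]:
    """Train epoch by epoch, benchmark after each epoch.

    Returns (best_epoch, best_score).
    """
    best_epoch, best_score = 0, float("-inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()            # one pass over the training genes
        score = benchmark_score()    # measured on the benchmark genes
        if score > best_score:
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


# Toy stand-ins: no real network, just a made-up score per epoch.
scores = iter([0.60, 0.72, 0.81, 0.79, 0.80])
best = run_training(lambda: None, lambda: next(scores), max_epochs=5)
print(best)  # (3, 0.81)
```

The final verification genes are deliberately absent from this loop; they are only used once, after training has finished.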
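14. 4. TRAINING METHOD
— Input data
— A 60 bp window is placed on each gene sequence and slid along it 1 bp at a time.
— The input calculator converts each window into binary values, which become the input data.
— Desired labels
— There are two output layer units, corresponding to DSS and ASS respectively.
— Basically, an output of 1 at an output unit supports that type of SS.
The neural network was trained with standard backpropagation.
The input data and the corresponding desired labels are shown below.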
... AACGGTTTCT GGTAAATGGA AGCTTACCGG AGAATCTGTT GAGGTCAAGG ACCAGTGGGG ATTTTGTAGT GAGGGCTTTC GTGGTAAGAT TGGTCATAAA …
Input calculator
Input data: 0010 0010 0100 1000 1000 1000 0100 0010 0010 1000 1000 0010 0001 0100 0100 … 1000 0100 0100 0100 0100 0010 0100 1000 0010 0100
Desired labels: 0 – 1 (for DSS), 0 – 1 (for ASS)
Input data + desired labels = one training sample
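The sliding window and the input calculator can be sketched as follows. The slides only state that A maps to (1,0,0,0); the column order A, C, G, T used here is an assumption, and the function names are hypothetical. The example sequence is the first 70 bases of the sequence shown on the slide.

```python
from typing import Iterator, List

# Column order A, C, G, T is an assumption; the slides only give A -> (1,0,0,0).
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def sliding_windows(sequence: str, size: int = 60) -> Iterator[str]:
    """Yield every window of `size` bases, moving 1 bp at a time."""
    for start in range(len(sequence) - size + 1):
        yield sequence[start:start + size]


def encode_window(window: str) -> List[int]:
    """Turn a window into 4 binary inputs per base (240 for 60 bp)."""
    bits: List[int] = []
    for base in window.upper():
        bits.extend(ONE_HOT[base])   # raises KeyError on N or other symbols
    return bits


sequence = "".join([
    "AACGGTTTCT", "GGTAAATGGA", "AGCTTACCGG", "AGAATCTGTT",
    "GAGGTCAAGG", "ACCAGTGGGG", "ATTTTGTAGT",
])
windows = list(sliding_windows(sequence))
inputs = [encode_window(w) for w in windows]
print(len(windows), len(inputs[0]))  # 11 240
```

Genes containing N gaps were excluded from the data sets, so the sketch does not try to encode ambiguous bases.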
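15. 4. TRAINING METHOD (cont’d)
— Desired labels (in detail)
— When a true SS sits at the center of the window, the label takes its maximum value 1; as the SS moves away from the center, the label approaches the minimum value 0 in proportion to the offset, as set by the score function below.
The desired label is not a binary 0 or 1 value
but a value that changes continuously with the position of the SS within the window.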
However, if it is only a 1.0 output when a splice site is in the middle of the
window, and 0.0 when a splice site is not in the middle of the window, there
will probably be too many 0.0 training samples that the neural network would
learn to predict everything as ’no splice site’. This is why we introduce a score
function which calculates a target output not only when the splice site is in
the middle of the window, but whenever there is a splice site somewhere in the
window. We use a weighting function where the weight of a splice site depends
on the distance from the respective nucleotide to the nucleotide at the window
mid-point. The further from the mid point of the window this splice site is, the
lower value we get in the target values. The target values decrease linearly from
the mid point of the window. This gives the score function as shown in Eq. 2
$$f(n) = 1 - \left| 1 - \frac{2n}{L_W} \right| \tag{2}$$
If a splice site is exactly at the mid point, the target output is 1.0. An example
window is shown in Fig. 2.
[Figure: example window with a true splice site at n = 21, where L_W = 60. SS: splice sites]
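The score function of Eq. 2 is easy to evaluate directly. In this sketch `n` is taken to be the position of the splice site within the window, as in the slide's example with n = 21; the function name is hypothetical.

```python
def target_score(n: int, window_length: int = 60) -> float:
    """Eq. 2: 1 at the window mid-point, falling linearly to 0 at the edges."""
    return 1.0 - abs(1.0 - 2.0 * n / window_length)


for n in (0, 21, 30, 45, 60):
    print(n, round(target_score(n), 3))
```

For the slide's example, n = 21 and L_W = 60 give 1 - |1 - 42/60| = 0.7, while a splice site exactly at the mid-point (n = 30) gives the maximum target of 1.0.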
18. 6. BENCHMARK (cont’d)
— Significant top (cont’d)
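— In the example below, only positions 2 – 6 exceed the threshold of 0.2, and the regression curve is drawn using those points alone.
— Dynamic values such as the mean and SD were tried for the threshold, but in the end the constant 0.2 gave good results.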
value in the donor splice site indicator. When the algorithm finds a significant
top in the donor splice site indicator, the state switches to intron. The algorithm
continues to look for a significant top in the acceptor splice site indicator, and
the state is switched back to exon. This process continues until the end of the
gene. The gene must end in the exon state.
In the above paragraph, it is unclear what is meant by a significant top. To
indicate a top in a splice site indicator, the algorithm first finds a indicator value
above some threshold value. It then finds all successive indicator data points that
are higher than this threshold value. Through all these values, a second order
polynomial regression line is fitted, and the maximum of this parabola is used
to indicate the splice site. This method is explained with some example data in
Fig. 3. In this example the indicator value at 0 and 1 is below the threshold. The
value at 2 is just above the threshold and the successive values at 3,4,5 and 6 is
also above the threshold and these five values are used in the curve fitting. The
rest of the data points are below the threshold and not used in the curve fitting.
Fig. 3. Predicting a splice site based on the splice site indicator. When the indicator
reaches above the threshold value, 0.2 in the figure, all successive data points above
this threshold are used in a curve fitting of a parabola. The nucleotide closest to the
parabola maxima is used the splice site.
Finding a good threshold value is difficult. Several values have been tried. We
Significant top
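The "significant top" search can be sketched in plain Python: collect the run of consecutive indicator values above the threshold, fit a parabola through them by least squares, and take the nucleotide closest to its maximum. The indicator values below are made up for illustration, the function names are hypothetical, and the fallback to the largest value when a parabola cannot be used is an assumption of this sketch.

```python
from typing import List, Optional, Sequence, Tuple


def det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_parabola(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of y = a*x^2 + b*x + c via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]            # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    matrix = [[s[4], s[3], s[2]],
              [s[3], s[2], s[1]],
              [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    d = det3(matrix)
    coeffs: List[float] = []
    for col in range(3):                                       # Cramer's rule
        replaced = [list(row) for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(det3(replaced) / d)
    return coeffs[0], coeffs[1], coeffs[2]


def significant_top(indicator: Sequence[float], start: int = 0,
                    threshold: float = 0.2) -> Optional[Tuple[int, int]]:
    """Return (predicted position, index just past the run), or None."""
    i = start
    while i < len(indicator) and indicator[i] <= threshold:
        i += 1
    if i == len(indicator):
        return None                        # nothing above the threshold
    j = i
    while j < len(indicator) and indicator[j] > threshold:
        j += 1
    xs = list(range(i, j))
    ys = list(indicator[i:j])
    peak = xs[ys.index(max(ys))]           # fallback: largest value in the run
    if len(xs) >= 3:
        a, b, _ = fit_parabola(xs, ys)
        if a < 0:                          # parabola opens downward: has a maximum
            peak = min(max(round(-b / (2 * a)), i), j - 1)
    return peak, j


indicator = [0.05, 0.12, 0.22, 0.45, 0.60, 0.48, 0.25, 0.10, 0.04]  # made-up values
print(significant_top(indicator))  # (4, 7)
```

Returning the index just past the run lets a caller continue the scan from there, for example to look for the next top in the other indicator.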
19. 6. BENCHMARK (cont’d)
— Measurement indicators
— Sensitivity (Sn)
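— Out of 100 targets, how many were hit. "Fire enough shots and some will hit."
— Specificity (Sp)
— Out of 100 shots fired, how many hit.
— Correlation coefficient (CC)
— Whether more targets were hit with fewer shots.
Exons and introns were predicted from the estimated SS, and the accuracy of those predictions
was measured. The evaluation metrics are shown below.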
Abbreviations:
CC = Correlation coefficient, Sp = Specificity, Sn = Sensitivity,
TP = True positive, FP = False positive,
TN = True negative, FN = False negative
There are four different outcomes of this comparison, true positive (TP), false negative (FN), false positive (FP) and true negative (TN). The comparison of actual and predicted location is done at nucleotide level.
The count of each comparison outcome are used to compute standard measurement indicators to benchmark the performance of the predictor. The sensitivity, specificity and correlation coefficient has been the de facto standard way of measuring the performance of prediction tools. These prediction measurement values are defined by Burset and Guigó [2] and by Snyder and Stormo [7].
The sensitivity (Sn) is defined as the ratio of correctly predicted exon nucleotides to all actual exon nucleotides as given in Eq. 3.
$$Sn = \frac{TP}{TP + FN} \tag{3}$$
The higher the ratio, the better prediction. As we can see, this ratio is between 0.0 and 1.0, where 1.0 is the best possible.
The specificity (Sp) is defined as the ratio of correctly predicted exon nucleotides to all predicted exon nucleotides as given in Eq. 4.
$$Sp = \frac{TP}{TP + FP} \tag{4}$$
The higher the ratio, the better prediction. As we can see, this ratio is between 0.0 and 1.0, where 1.0 is the best possible.
The correlation coefficient (CC) combines all the four possible outcomes into one value. The correlation coefficient is defined as given in Eq. 5.
$$CC = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}} \tag{5}$$
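Eqs. 3 to 5 can be computed from per-nucleotide labels as in the sketch below. The label strings are toy data ("E" for exon, "I" for intron), the function names are hypothetical, and the square root in the CC denominator follows the standard form of this coefficient. Zero denominators are not handled.

```python
import math
from typing import Tuple


def confusion_counts(actual: str, predicted: str) -> Tuple[int, int, int, int]:
    """Count (TP, FP, TN, FN) per nucleotide, treating exon as positive."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == "E" and p == "E":
            tp += 1
        elif a == "I" and p == "E":
            fp += 1
        elif a == "I" and p == "I":
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)                      # Eq. 3


def specificity(tp: int, fp: int) -> float:
    return tp / (tp + fp)                      # Eq. 4


def correlation_coefficient(tp: int, fp: int, tn: int, fn: int) -> float:
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom         # Eq. 5


actual = "EEEEIIIIEE"       # toy annotation
predicted = "EEEIIIIEEE"    # toy prediction
tp, fp, tn, fn = confusion_counts(actual, predicted)
print(tp, fp, tn, fn)                                     # 5 1 3 1
print(round(sensitivity(tp, fn), 3), round(specificity(tp, fp), 3))
print(round(correlation_coefficient(tp, fp, tn, fn), 3))
```

Note that Sp here is the ratio used in the excerpt (correct exon predictions over all exon predictions), which is often called precision elsewhere.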
6.3 The Overall Training Algorithm
The main loop of the training is very simple and is an infinite loop with two
significant activities. First, the infinite loop trains the neural network on all genes
20. 7. EXPERIMENTS & RESULTS
— Finding splice sites in a particular gene
— The predicted exon/intron structure of one gene picked arbitrarily from the 20 genes of the final verification set is shown below (a better-than-average prediction).
— Most SS were predicted correctly.
— There were a few errors, probably because the 'low-pass filtering effect' introduced by the window attenuated sharp peaks in the indicator.
For this gene, most predicted SS matched the actual ones.
Fig. 4. The splice site indicators plotted along an arbitrary gene (AT4G18370.1) from the verification set, with line indicators for predicted exons and actual exons.
'AT4G18370.1' Sn=0.961 Sp=0.910 CC=0.835 Err=0.079
Error:
The fraction of nucleotides where the prediction and the truth disagree.
(total nucleotides where exon and intron were predicted the wrong way round) / (gene length)
Epoch count = about 80
Learning rate = 0.2
22. 7. EXPERIMENTS & RESULTS
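— Benchmarking was run with three different learning rates.
(A higher learning rate speeds up training but makes it easier to overshoot and settle in a poor local solution.)
— The benchmark algorithm judged performance by the average of Sn and Sp.
(This presumably means that the end of training was decided by that average.)
The NNs built with the three settings were evaluated
on the 20 genes of the final verification set.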
Fig. 4. The splice site indicators plotted along an arbitrary gene (AT4G18370.1) from the verification set. Above the splice site indicators, there are two line indicators where the upper line indicates predicted exons, and the other line indicates actual exons. The sensitivity, specificity and correlation coefficient of this gene is given in the figure heading. (Err is an error rate defined as the ratio of false predicted nucleotides to all nucleotides. Err = 1 − SMC.)
Table 1. Measurements of the neural network performances for each of the three training sessions. Numbers are based on a set of 20 genes which are not found in the training set nor the benchmarking set.
Session  | Average Sn | Average Sp | All nucleotides Sn | All nucleotides Sp | All nucleotides CC | All nucleotides SMC
η = 0.20 | 0.864      | 0.801      | 0.844              | 0.802              | 0.5205             | 0.7761
η = 0.10 | 0.891      | 0.816      | 0.872              | 0.806              | 0.5517             | 0.7916
η = 0.02 | 0.888      | 0.778      | 0.873              | 0.777              | 0.4978             | 0.7680
8 Conclusion
This study shows an artificial neural networks used in splice site prediction. The
best neural network trained in this study, achieve a correlation coefficient at
0.552. This result is achieved without any prior knowledge of any sensor signals
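Average:
The mean of the measurements computed for each gene.
All nucleotides in set:
The measurements computed over all nucleotides, regardless of gene.
SMC: standard simple matching coefficient
The fraction of nucleotides where the prediction and the truth agree.
(both support exon + both support intron) / (total nucleotides)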
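The difference between the "Average" and "All nucleotides in set" columns, and the relation Err = 1 − SMC, can be illustrated with a short sketch. The per-gene counts below are made up, and all names are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Counts = Tuple[int, int, int, int]   # (TP, FP, TN, FN) for one gene


def sensitivity(c: Counts) -> float:
    tp, fp, tn, fn = c
    return tp / (tp + fn)


def smc(c: Counts) -> float:
    """Simple matching coefficient: agreeing nucleotides / all nucleotides."""
    tp, fp, tn, fn = c
    return (tp + tn) / (tp + fp + tn + fn)


def error_rate(c: Counts) -> float:
    return 1.0 - smc(c)              # Err = 1 - SMC


def per_gene_average(genes: Sequence[Counts],
                     metric: Callable[[Counts], float]) -> float:
    """'Average': compute the metric per gene, then take the mean."""
    return sum(metric(g) for g in genes) / len(genes)


def pooled(genes: Sequence[Counts],
           metric: Callable[[Counts], float]) -> float:
    """'All nucleotides in set': sum the counts first, then compute once."""
    tp, fp, tn, fn = (sum(g[k] for g in genes) for k in range(4))
    return metric((tp, fp, tn, fn))


genes = [(90, 10, 80, 20), (10, 10, 70, 10)]   # made-up counts for two genes
print(round(per_gene_average(genes, sensitivity), 3))
print(round(pooled(genes, sensitivity), 3))
print(round(pooled(genes, smc), 3), round(pooled(genes, error_rate), 3))
```

Pooling weights long genes more heavily than short ones, which is one reason the two groups of columns in Table 1 can differ.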
23. 8. CONCLUSION
— Best neural network in this study
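— Achieved CC=0.552 at best.
— This result was obtained under the handicaps below, so the conclusion is fair.
— No prior knowledge about splice sites, such as the GT-AG rule, was used.
— Some genes that do not follow basic rules such as the one above were present, which hinders training.
— Genes with alternative splicing, which also hinders training, were used (only one of the splice variants was used).
— Outlook
— Combined with a GHMM, the method would be even stronger.
ANNs are useful for gene prediction, and this approach based on a sliding window
deserves deeper study.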
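25. Appendix: FAQ
Q. So what is a neural network, in the end?
A. A capable know-nothing.
Q. So what is new in this paper?
A. Using a window to output an SS indicator that changes gradually.
Q. Isn't there too much training data? Aren't there too few verification genes?
A. Fair point.
Q. Why Arabidopsis?
A. In plants, introns tend to be AT-rich relative to exons, so prediction may be easier.
Q. What about predicting from gene sequences alone, with the IR excluded?
A. Well, it is probably easier to train, easier to predict, and more likely to hit.
Q. How good is CC=0.552? Is it high?
A. It is not low, I would say. However, the SS algorithm of Genie, which likewise exploits NNs, reaches CC > 0.81.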