SUMOylation site prediction

SUMOylation-site Prediction
Denis C. Bauer, Fabian A. Buske, Mikael Bodén

Overview
- Background
- SUMOylation: what is that?
- Published predictors
- Our approach
- What makes SUMO hard to tackle

SUMO is not 相撲 (sumo wrestling)
- Small Ubiquitin-related Modifier (SUMO) is a small protein of 97 amino acids
- 20% homology to ubiquitin
- Post-translational modification
- Covalently attached to lysines
- Involved in many pathways/mechanisms, e.g. transcriptional regulation and compartmentalisation

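SUMOylation motif
- One consensus motif, [ILV]K.E, accounts for about 60% of known sites
- However, not all [ILV]K.E sites are SUMOylated
- Not all SUMOylated sites have the consensus motif
- Treating the motif as a predictor: SUMOylated motif sites are true positives (TP), non-SUMOylated motif sites are false positives (FP), and SUMOylated sites without the motif are false negatives (FN)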

Baseline prediction
Method                      CC
Regular Expression scanner  0.68
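
A regular expression scanner of this kind can be sketched in a few lines. This is a minimal illustration that assumes the scanner simply flags every lysine sitting in an [ILV]K.E match; the function name `scan_consensus` and the toy sequence are hypothetical and not taken from the slides.

```python
import re

# Consensus motif from the slides: one of I/L/V, then lysine (K),
# then any residue, then glutamate (E).
CONSENSUS = re.compile(r"[ILV]K.E")

def scan_consensus(sequence):
    """Return 1-based positions of lysines that sit inside an [ILV]K.E match."""
    # m.start() is the 0-based index of the [ILV] residue, so the lysine is at
    # 0-based index m.start() + 1, which is 1-based position m.start() + 2.
    return [m.start() + 2 for m in CONSENSUS.finditer(sequence.upper())]

# Made-up toy sequence (not a real protein), used only to show the call.
toy_sequence = "MAVKDEGGLKPEKK"
predicted_sites = scan_consensus(toy_sequence)
print(predicted_sites)  # lysines 4 (VKDE) and 10 (LKPE); the last two K are not in a motif
```

Every lysine outside a match is implicitly predicted as not SUMOylated, which is why such a scanner cannot recover sites that lack the consensus motif.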

Comparison with existing predictors
Method                      CC
Regular Expression scanner  0.68
SUMOpre +                   0.64
SUMOsp ‡                    0.26
SUMOplot †                  0.48
+ Xu J., BMC Bioinformatics 2008, 9:8
‡ Xue Y., Nucleic Acids Res 2006, W254-W257
† http://www.abgent.com/doc/sumoplot (commercial)

Case study: Core histones in yeast
- Identified SUMOylation sites +
- H2B: K6/7, K16/17
- H2A: K2, K126
- H4: somewhere in the tail
- No SUMOylation consensus site
- Predictors to date are not able to predict even a single SUMOylation site in the histone sequences
+ Nathan D., Genes Dev 2006, 20(8):966-76

Our approach
- Identify the window size
- Identify which ML method is best
- Voilà: a better predictor!
[Figure: a sequence window xxxxKxxxx is passed to the ML method, which outputs SUMOylation 1/0]

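Training in more detail
- Imbalance in the dataset: more negatives than positives
[Figure: each lysine (K) in the protein sequence is extracted with a window of w_U upstream and w_D downstream residues and passed to the ML method; rows labelled T (0 1 0) and P (1 1 0) are shown for three example lysines; legend: SUMOylated K, not SUMOylated K]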
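
The windowing step can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how lysines near the sequence ends are handled, so padding with "X" is an assumption, and the names `lysine_windows` and `labelled_examples` are hypothetical.

```python
def lysine_windows(sequence, w_up, w_down, pad="X"):
    """Yield (position, window) for every lysine; position is 1-based.

    Each window holds w_up residues upstream of the lysine, the lysine itself,
    and w_down residues downstream. Padding with `pad` is an assumption.
    """
    padded = pad * w_up + sequence + pad * w_down
    for i, residue in enumerate(sequence):
        if residue == "K":
            # The lysine sits at index i + w_up in the padded string, so the
            # window starts at index i and is w_up + 1 + w_down characters long.
            yield i + 1, padded[i : i + w_up + 1 + w_down]

def labelled_examples(sequence, sumo_positions, w_up, w_down):
    """Pair each lysine window with 1 (SUMOylated) or 0 (not SUMOylated)."""
    return [
        (window, 1 if position in sumo_positions else 0)
        for position, window in lysine_windows(sequence, w_up, w_down)
    ]

# Made-up toy sequence in which lysine 2 is treated as SUMOylated.
examples = labelled_examples("AKCDEKG", {2}, w_up=2, w_down=2)
print(examples)
```

At prediction time the same `lysine_windows` call would be used without labels, with each window handed to the trained model.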

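Prediction in more detail
[Figure: each lysine (K) in the protein sequence is extracted with a window of w_U upstream and w_D downstream residues and passed to the trained ML method, which outputs 1 1 0 for three example lysines; legend: SUMOylated K, not SUMOylated K]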

ML methods
- Bidirectional Recurrent Neural Network (BRNN)
  - Uses information from flanking windows, decaying with distance to the center window
  - Prone to overfitting
- Support Vector Machine (SVM)
  - Regularized
  - Requires a suitable kernel and feature representation
  - Standard kernels: linear, polynomial, RBF
  - String kernels: P-kernel, local-alignment kernel
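
To make "kernel and feature representation" concrete, here is a small sketch of the three standard kernels applied to one-hot encoded windows. The one-hot encoding, the 20-letter alphabet, and the parameter defaults (`degree`, `coef0`, `gamma`) are assumptions chosen for illustration; the slides do not state which representation or settings were used, and the string kernels are not covered here.

```python
import math

# Assumed alphabet: the 20 standard amino acids. Any other character
# (for example a padding symbol) is encoded as an all-zero block.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(window):
    """Flatten a sequence window into a 0/1 vector, 20 entries per position."""
    vector = []
    for residue in window:
        vector.extend(1.0 if residue == aa else 0.0 for aa in ALPHABET)
    return vector

def linear_kernel(x, y):
    """Dot product; for one-hot windows this counts identical positions."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) ** degree."""
    return (linear_kernel(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.1):
    """exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Two made-up windows that agree at the K and E positions only.
a, b = one_hot("LKAE"), one_hot("IKGE")
print(linear_kernel(a, b), polynomial_kernel(a, b), rbf_kernel(a, b))
```

With this encoding the linear kernel treats I, L and V as entirely different residues, which is one reason a feature representation or string kernel that captures residue similarity may be preferable.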

Data set
- Training/testing data
  - 144 proteins with 241 SUMOylation sites
  - 5,741 non-SUMOylated lysines
  - 68% of the SUMOylated sites conform to the consensus motif
- Hold-out
  - 13 proteins with 27 SUMOylation sites
  - 48% conform to the consensus motif
Xu J., BMC Bioinformatics 2008, 9:8

Evaluation
- 5-fold cross-validation
- Matthews correlation coefficient (CC)
- Sensitivity, specificity, accuracy
- Area under the curve (AUC)
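
The threshold-based measures above can be computed from the confusion counts as sketched below. The function names are hypothetical, returning 0.0 when a denominator is zero is a convention chosen here, and AUC is left out because it needs ranked scores rather than 0/1 predictions.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = fp = tn = fn = 0
    for truth, prediction in zip(y_true, y_pred):
        if truth == 1 and prediction == 1:
            tp += 1
        elif truth == 0 and prediction == 1:
            fp += 1
        elif truth == 0 and prediction == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def evaluate(y_true, y_pred):
    """Return sensitivity, specificity, accuracy and Matthews CC as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    # CC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        "cc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }

# Made-up labels: 2 SUMOylated and 4 non-SUMOylated lysines.
scores = evaluate([1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0])
print(scores)
```

Because negatives far outnumber positives in this kind of dataset, accuracy alone can look high for a predictor that rarely predicts a site, which is why CC is the more informative single number.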

Comparison with existing methods

Quest to improve performance
- Protein structural features and evolutionary features
- Separating SUMOylation sites from different species or compartments
- Clustering for other motifs using kernel hierarchical clustering

Summary
- The Regular Expression scanner is still the best classifier.
- SUMO is more versatile than expected!
- The road to better predictions:
  - Are there other motifs?
  - Which features can discriminate?
  - Is the dataset biased?
Image: http://spot.colorado.edu/~colemab/Theatre_Resources/SumoBallerina.jpg

Acknowledgment
- Predictor/Analysis: Mikael Bodén, Fabian Buske
- Dataset: Xu et al.
- PhD supervisors: Tim Bailey, Andrew Perkins, Mikael Bodén
- Other bioinformatic tools: STREAM, a practical workbench for modeling transcriptional regulation. www.bioinformatics.org.au/stream/
