Protein Database

LECTURE TOPIC: PROTEIN DATABASE
T. ASHOK KUMAR
HEAD, DEPARTMENT OF BIOINFORMATICS
NOORUL ISLAM COLLEGE OF ARTS AND SCIENCE
KUMARACOIL, THUCKALAY - 629180

TOPICS COVERED
• Protein Terms & Definitions – Computational biology aspects of proteins
• ExPASy – SIB Bioinformatics Resource Portal (http://www.expasy.org)
• UniProt/Swiss-Prot – A comprehensive, non-redundant, expertly and manually annotated protein
sequence database (http://www.uniprot.org/)
• NBRF/PIR – A comprehensive, non-redundant, expertly and manually annotated, fully classified and
extensively cross-referenced protein sequence database (http://pir.georgetown.edu/)
• PDB – A single worldwide repository of information about the 3D structures of large biological
molecules, including proteins and nucleic acids (http://rcsb.org/pdb)
• SCOP – Knowledge-based expert analysis and classification of proteins that are structurally
characterized and deposited in the Protein Data Bank (http://scop2.mrc-lmb.cam.ac.uk/)
• CATH – A hierarchical domain classification of protein structures in the Protein Data Bank
(http://www.cathdb.info/)
• MOTIF – Finds sequence motifs in a query sequence and provides functional and genomic
information on the motifs found, using DBGET and LinkDB for the hyperlinked annotations
(http://www.genome.jp/tools/motif/)
• Pfam – Database of protein HMM profiles that define domain families (http://pfam.xfam.org/)
• PROSITE – Database of protein motifs expressed as patterns or profiles (http://prosite.expasy.org/)

PROTEIN TERMS & DEFINITIONS
• Protein Sequence – A linear string over the 20 standard amino acid one-letter codes
[A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y]
• Protein Structure – 3D atomic coordinates [x, y, z] of the atoms in the molecule
• Types of Biological Databases – [Flat-file (Raw) Database = Plain text, Object-oriented Database =
Objects (Records), Relational Database = Linked tables]
• 3D Atom Model – [Sphere = Atom, Cylinder = Bond, Dotted Line = Bond Interaction]
• Sequence Alignment – [Match = Identical Character, Mismatch = Different Character, Gap = Position
with No Corresponding Character, Word = Sub-string, Sequence = Super-string, Score = Numerical
Rating of the Alignment, Identity = Fraction of Aligned Positions with Identical Residues]
• Motif – Short, conserved sequence associated with a distinct function.
• Domain – Evolutionarily conserved sequence region that corresponds to a structurally
independent 3D unit associated with a particular functional role. It is usually much larger than a
motif.
• Pattern – Sequence motif written in a symbolic notation. Example: N-{P}-[ST]-{P}
• Regular Expression – Representation format for a sequence motif, which includes positional
information for conserved and partly conserved residues. Similar to a pattern, but derived from a
multiple sequence alignment (MSA).
• Profile – Scoring matrix that represents a multiple sequence alignment. It contains probability or
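The alignment terms above (match, mismatch, gap, identity) can be made concrete with a short Python sketch. The two aligned sequences are invented for illustration, and identity is computed here as matches divided by alignment length, which is only one of several conventions in use.

```python
def alignment_stats(seq_a: str, seq_b: str, gap: str = "-") -> dict:
    """Count matches, mismatches and gap columns in a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            gaps += 1          # one sequence has no corresponding residue
        elif a == b:
            matches += 1       # identical residue in both sequences
        else:
            mismatches += 1    # different residues at the same position
    # Assumption: identity is reported relative to the full alignment length.
    identity = 100.0 * matches / len(seq_a) if seq_a else 0.0
    return {"matches": matches, "mismatches": mismatches,
            "gaps": gaps, "identity": identity}


# Hypothetical aligned pair, not taken from any database entry.
stats = alignment_stats("MEHW-GQP", "MEHWAGKP")
print(stats)
```

Other tools divide by the length of the shorter sequence or by the number of ungapped columns, so identity values are only comparable when the convention is stated.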

EXPASY
• ExPASy (Expert Protein Analysis System) is a bioinformatics resource portal operated by the
Swiss Institute of Bioinformatics (SIB).
• ExPASy was the first website of the life sciences.
• Extensible and integrative portal for accessing many scientific resources, databases and
software tools.
• Wide range of resources in many different domains, such as proteomics, genomics,
phylogeny/evolution, systems biology, population genetics, transcriptomics, etc.
• Proteomics server for analyzing protein sequences and structures, as well as 2D PAGE
(two-dimensional polyacrylamide gel electrophoresis) data.
• Databases and online and offline software tools are hosted by different groups of the SIB and
partner institutions, e.g., CFSSP.
• ExPASy references the protein sequence knowledgebase, UniProtKB/Swiss-Prot, and its
computer-annotated supplement, UniProtKB/TrEMBL.

ARCHITECTURE OF UNIPROT/SWISS-PROT
• Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and
annotation data
• The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference
Clusters (UniRef), and the UniProt Archive (UniParc)
• UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository
specifically developed for metagenomic and environmental data

BACKGROUND OF UNIPROT/SWISS-PROT
• UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the Swiss
Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR)
• EMBL-EBI and SIB together used to produce Swiss-Prot and TrEMBL, while PIR produced the
Protein Sequence Database (PIR-PSD)
• Translated EMBL Nucleotide Sequence Data Library (TrEMBL) was originally created because
sequence data was being generated at a pace that exceeded Swiss-Prot's ability to keep up
• PIR maintained the PIR-PSD and related databases, including iProClass, a database of protein
sequences and curated families
http://www.uniprot.org/

UNIPROT/SWISS-PROT FILE FORMAT
Line code  Content                       Occurrence in an entry
ID         Identification                Once; starts the entry
AC         Accession number(s)           Once or more
DT         Date                          Three times
DE         Description                   Once or more
GN         Gene name(s)                  Optional
OS         Organism species              Once or more
OG         Organelle                     Optional
OC         Organism classification       Once or more
OX         Taxonomy cross-reference      Once
OH         Organism host                 Optional
RN         Reference number              Once or more
RP         Reference position            Once or more
RC         Reference comment(s)          Optional
RX         Reference cross-reference(s)  Optional
RG         Reference group               Once or more (Optional if RA line)
RA         Reference authors             Once or more (Optional if RG line)
RT         Reference title               Optional
RL         Reference location            Once or more
CC         Comments or notes             Optional
DR         Database cross-references     Optional
PE         Protein existence             Once
KW         Keywords                      Optional
FT         Feature table data            Once or more in Swiss-Prot, optional in TrEMBL
SQ         Sequence header               Once
(blanks)   Sequence data                 Once or more
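
The line codes in the table above lend themselves to a simple reader: the first two characters of each line give the code, the content starts at column 6, and lines with a blank code carry sequence data. The sketch below assumes that layout and that an entry ends with a `//` line. The sample entry is made up and heavily abbreviated, so it is not a valid Swiss-Prot record.

```python
from collections import defaultdict

# Made-up, abbreviated entry for illustration only (not a real record).
ENTRY = "\n".join([
    "ID   EXAMPLE_HUMAN           Reviewed;          12 AA.",
    "AC   X00001;",
    "DE   RecName: Full=Hypothetical example protein;",
    "OS   Homo sapiens (Human).",
    "SQ   SEQUENCE   12 AA;",
    "     MEHWGQPIPG AG",
    "//",
])


def parse_entry(text: str):
    """Group line contents by two-letter line code and collect the sequence."""
    fields = defaultdict(list)
    seq_parts = []
    for line in text.splitlines():
        if line.startswith("//"):       # assumed entry terminator
            break
        if not line.strip():
            continue
        code, content = line[:2], line[5:].rstrip()
        if code.strip():
            fields[code].append(content)
        else:
            # Blank line code: sequence data, written in space-separated blocks.
            seq_parts.append(content.replace(" ", ""))
    return dict(fields), "".join(seq_parts)


fields, sequence = parse_entry(ENTRY)
print(fields["AC"], sequence)
```

For real work an established parser (for example the one in Biopython) is a safer choice, since continuation rules for DE, CC and FT lines are more involved than this sketch handles.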

NBRF/PIR
• The Protein Information Resource (PIR) was established in 1984 by the National Biomedical
Research Foundation (NBRF) as a resource to assist researchers in the identification and
interpretation of protein sequence information.
• In 2002, PIR, along with its international partners EBI and SIB, was awarded a grant from
NIH to create UniProt by unifying the PIR-PSD, Swiss-Prot, and TrEMBL databases.
• As of 2010, PIR offers a wide variety of resources mainly oriented to assist the propagation
and standardization of protein annotation: PRO, iProClass, and iProLINK.
http://pir.georgetown.edu/

SEQUENCE RETRIEVED FROM NBRF/PIR IN FASTA FILE FORMAT
>F7VJQ1 APRIO_HUMAN Alternative prion protein [Homo sapiens]
MEHWGQPIPGAGQPWRQPLPTSGRWWLGAASWWWLGAASWWWLGAAPWWWLGTASWWWLG
SRRWHPQSVEQAE
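
A FASTA record like the one above is a `>` header line followed by one or more sequence lines that must be joined. The following is a minimal reader, assuming well-formed input. The function name `read_fasta` is chosen for this sketch and is not part of any PIR tool.

```python
def read_fasta(text: str) -> dict:
    """Return {header: sequence} for every record in a FASTA-formatted string."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]            # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line) # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


FASTA = """>F7VJQ1 APRIO_HUMAN Alternative prion protein [Homo sapiens]
MEHWGQPIPGAGQPWRQPLPTSGRWWLGAASWWWLGAASWWWLGAAPWWWLGTASWWWLG
SRRWHPQSVEQAE
"""

for header, seq in read_fasta(FASTA).items():
    accession = header.split()[0]        # first token of the header line
    print(accession, len(seq), seq[:10])
```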

PDB
• The Protein Data Bank (PDB) archive is the single worldwide repository of information about
the 3D structures of large biological molecules, including proteins and nucleic acids.
• The PDB was established in 1971 at Brookhaven National Laboratory (BNL) under the
leadership of Walter Hamilton and originally contained 7 structures.
• In 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) became
responsible for the management of the PDB.
• In 2003, the Worldwide Protein Data Bank (wwPDB) was formed to maintain a single PDB archive
of macromolecular structural data that is freely and publicly available to the global community.
• The RCSB PDB supports a website where visitors can perform simple and complex queries
on the data, analyze, and visualize the results.
• Members of wwPDB are RCSB PDB (USA), PDBe (Europe), PDBj (Japan), and the
Biological Magnetic Resonance Data Bank, BMRB (USA).
http://rcsb.org/pdb/
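
In the legacy PDB text format, the 3D coordinates are stored in fixed-width ATOM records, with x, y and z in columns 31-38, 39-46 and 47-54. The sketch below builds two illustrative ATOM lines with invented coordinates, reads the coordinates back by column position, and computes the distance between the atoms. The helper names are hypothetical.

```python
import math


def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a fixed-width ATOM line (illustrative values, not real data)."""
    return (f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}"
            f"{res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}{1.00:6.2f}{0.00:6.2f}")


def parse_atom(line: str) -> dict:
    """Extract selected fields from an ATOM record using 0-based slices."""
    return {
        "name": line[12:16].strip(),      # columns 13-16: atom name
        "res_name": line[17:20].strip(),  # columns 18-20: residue name
        "chain": line[21],                # column 22: chain identifier
        "res_seq": int(line[22:26]),      # columns 23-26: residue number
        "xyz": (float(line[30:38]),       # columns 31-38: x
                float(line[38:46]),       # columns 39-46: y
                float(line[46:54])),      # columns 47-54: z
    }


# Two made-up atoms of one residue.
n_line = format_atom(1, " N", "MET", "A", 1, 0.0, 0.0, 0.0)
ca_line = format_atom(2, " CA", "MET", "A", 1, 3.0, 4.0, 0.0)

n_atom, ca_atom = parse_atom(n_line), parse_atom(ca_line)
print(n_line)
print(math.dist(n_atom["xyz"], ca_atom["xyz"]))  # distance in angstroms
```

Large structures are distributed in PDBx/mmCIF rather than this legacy format, so column slicing of this kind only applies to `.pdb` files.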

SCOP
• The Structural Classification of Proteins (SCOP) database is a largely manual classification of
protein structural domains based on similarities of their structures and amino acid sequences.
• A motivation for this classification is to determine the evolutionary relationship between
proteins.
• Proteins with the same shapes but having little sequence or functional similarity are placed in
different "superfamilies", and are assumed to have only a very distant common ancestor.
• Proteins having the same shape and some similarity of sequence and/or function are placed in
"families", and are assumed to have a closer common ancestor.
• The original SCOP has been discontinued; its last official version is 1.75. Its successor,
SCOP2, is a redesigned classification rather than another name for SCOP 1.75.
• SCOP2 offers two different ways for accessing data: SCOP2-browser, and SCOP2-graph.
• SCOP2-browser allows navigation in a traditional way by browsing pages displaying the node
information.
• SCOP2-graph is a graph-based web tool for display and navigation.
• The source of protein structures is the Protein Data Bank.

HIERARCHICAL STRUCTURE OF SCOP
• The unit of classification of structure in SCOP is the protein domain.
• The levels of SCOP are as follows.
1. Class: Types of folds, e.g., all α, all β, α/β, α+β, α&β, etc.
2. Fold: The different shapes of domains within a class, e.g., 2 helices; antiparallel hairpin, left-handed
twist, etc.
3. Superfamily: The domains in a fold are grouped into superfamilies, which have at least a distant
common ancestor.
4. Family: The domains in a superfamily are grouped into families, which have a more recent common
ancestor.
5. Protein domain: The domains in families are grouped into protein domains, which are essentially the
same protein.
6. Species: The domains in "protein domains" are grouped according to species.
7. Domain: It is part of a protein. For simple proteins, it can be the entire protein.
http://scop2.mrc-lmb.cam.ac.uk/
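
Classic SCOP releases label each node with a compact identifier (the "sccs") of the form class.fold.superfamily.family, for example `a.1.1.2`, covering the top four levels listed above. The sketch below uses such identifiers to find the deepest level two domains share. The identifiers are used only as strings, and nothing is looked up in SCOP.

```python
from typing import NamedTuple, Optional

LEVELS = ("class", "fold", "superfamily", "family")


class Sccs(NamedTuple):
    scop_class: str   # a = all alpha, b = all beta, c = alpha/beta, d = alpha+beta
    fold: int
    superfamily: int
    family: int


def parse_sccs(sccs: str) -> Sccs:
    """Split an identifier such as 'a.1.1.2' into its four levels."""
    cls, fold, superfamily, family = sccs.split(".")
    return Sccs(cls, int(fold), int(superfamily), int(family))


def deepest_shared_level(x: str, y: str) -> Optional[str]:
    """Return the lowest level at which two domains are classified together."""
    shared = None
    for level, a, b in zip(LEVELS, parse_sccs(x), parse_sccs(y)):
        if a != b:
            break            # lower levels are meaningless once a level differs
        shared = level
    return shared


print(deepest_shared_level("a.1.1.1", "a.1.1.2"))
print(deepest_shared_level("a.1.1.1", "b.1.1.1"))
```

Two domains that share a superfamily but not a family are, in SCOP's terms, assumed to have only a distant common ancestor.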

CATH
• CATH (Class, Architecture, Topology, and Homologous superfamily) is a semi-automatic,
hierarchical classification of protein domains.
• CATH shares many broad features with its principal rival, SCOP.
• The four main levels of the CATH hierarchy are as follows:
• Class: the overall secondary-structure content of the domain, e.g., mainly α, mainly β, mixed α-β,
few secondary structures.
• Architecture: the overall arrangement of secondary structures in 3D space, regardless of their
connectivity; high structural similarity but no evidence of homology.
• Topology: a large-scale grouping of domains that share the same fold, i.e., the same arrangement
and connectivity of secondary structures. Equivalent to the fold level of SCOP.
• Homologous superfamily: indicative of a demonstrable evolutionary relationship. Equivalent to
the superfamily level of SCOP.
http://www.cathdb.info/
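
CATH identifies each superfamily by a four-part number whose fields follow the four levels above, for example `3.40.50.300`. The sketch below splits such an identifier into the node it names at each level. The class names in the lookup table are the four main CATH classes, and the function name is made up for this example.

```python
# The four main CATH classes, keyed by the first number of the identifier.
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


def describe_cath(cath_id: str) -> dict:
    """Map a C.A.T.H identifier to the node it names at each level."""
    parts = [int(p) for p in cath_id.split(".")]
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers: C.A.T.H")
    c, a, t, h = parts
    return {
        "class": CATH_CLASSES.get(c, "unknown"),
        "architecture": f"{c}.{a}",
        "topology": f"{c}.{a}.{t}",
        "homologous_superfamily": f"{c}.{a}.{t}.{h}",
    }


print(describe_cath("3.40.50.300"))
```

Because each level is a prefix of the next, two domains belong to the same architecture exactly when their identifiers agree on the first two numbers.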

MOTIF
• Motif is a search service provided by GenomeNet to search with a protein query
sequence against Motif Libraries.
• Supports several motif databases such as PROSITE, BLOCKS, ProDom, Pfam, and
PRINTS.
• Allows you to search protein sequence libraries with your own patterns, written with
the following syntax:
• Each residue must be separated with - (minus sign).
• x represents any amino acid.
• [DE] means either D or E.
• {FWY} means any amino acid except F, W and Y.
• A(2,3) means that A appears 2 to 3 times consecutively.
• The pattern string must be terminated with . (period).
For example, C-x-{C}-[DN]-x(2)-C-x(5)-C-C.
• Generates a profile from a set of multiple aligned sequences using PFMake (PROSITE-style
profile) or HMMBuild (hidden Markov model).
http://www.genome.jp/tools/motif/
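
The pattern rules listed above map almost one-to-one onto regular expressions. The sketch below translates such a pattern into a Python `re` expression and scans a sequence with it. It covers only the rules listed here, the function name `prosite_to_regex` is invented for this example, and the test sequence is hypothetical.

```python
import re

# One pattern element: x, a residue, [set] or {excluded set}, optionally
# followed by a repeat count such as (2) or (2,3).
ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a pattern like 'C-x-{C}-[DN]-x(2)-C.' into a regex string."""
    pattern = pattern.strip().rstrip(".")      # drop the terminating period
    parts = []
    for element in pattern.split("-"):         # residues are separated by '-'
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        if residue == "x":
            regex = "."                         # any amino acid
        elif residue.startswith("{"):
            regex = "[^" + residue[1:-1] + "]"  # any amino acid except these
        else:
            regex = residue                     # literal residue or [set]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


print(prosite_to_regex("C-x-{C}-[DN]-x(2)-C-x(5)-C-C."))

# Scan a made-up sequence with the N-glycosylation pattern N-{P}-[ST]-{P}.
site = re.compile(prosite_to_regex("N-{P}-[ST]-{P}."))
for hit in site.finditer("AANGSAA"):
    print(hit.start(), hit.group())
```

`re.finditer` reports non-overlapping hits only, so overlapping motif occurrences would need a lookahead-based scan instead.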

PATTERN OF MATCHING MOTIF HITS

PFAM
• The Pfam database is a large collection of protein families, each represented by
multiple sequence alignments and hidden Markov models (HMMs).
• Pfam version 27.0 was produced at the European Bioinformatics Institute using a
sequence database called Pfamseq, which is based on UniProt.
• The descriptions of Pfam families are managed by the general public using
Wikipedia.
• The Pfam database contains information about protein domains and families.
• Pfam-A is the manually curated portion of the database that contains over 10,000
entries.
• Pfam-B contains a large number of small families derived automatically from clusters
produced by an algorithm called ADDA (Automatic Domain Decomposition Algorithm).
• Pfam-B families are of lower quality, but can be useful when no Pfam-A families are
found.
http://pfam.xfam.org/

PROSITE
• PROSITE is a protein domain database for functional characterization and annotation.
• PROSITE consists of entries describing the protein families, domains and functional
sites as well as amino acid patterns and profiles in them.
• PROSITE is manually curated by a team of the Swiss Institute of Bioinformatics and
tightly integrated into Swiss-Prot protein annotation.
• PROSITE is complemented by ProRule, a collection of rules based on profiles and
patterns.
• The rules contain information about biologically meaningful residues, like active
sites, substrate- or co-factor-binding sites, posttranslational modification sites or
disulfide bonds, to help function determination.
http://prosite.expasy.org/
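
To illustrate what a profile derived from a multiple sequence alignment holds, the sketch below builds a plain per-column residue frequency matrix from a tiny invented alignment and reads off the consensus. This is a deliberate simplification: real PROSITE profiles and Pfam HMMs use weighted log-odds scores and position-specific gap penalties, which are not modelled here.

```python
from collections import Counter


def build_profile(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")  # ignore gaps
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()}
                       if total else {})
    return profile


def consensus(profile):
    """Most frequent residue per column ('-' for all-gap columns)."""
    return "".join(max(col, key=col.get) if col else "-" for col in profile)


# Made-up four-sequence alignment, for illustration only.
MSA = ["NGSA", "NKTA", "NGTV", "NATA"]

profile = build_profile(MSA)
for position, column in enumerate(profile, start=1):
    print(position, column)
print(consensus(profile))
```

A fully conserved column (here the first one) is what a pattern would write as a fixed residue, while the variable columns are where a profile carries more information than a pattern.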
