Data mining in Bioinformatics:
Data mining is the process of automatically discovering novel and understandable models and
patterns in large amounts of data, using methods at the intersection of machine learning,
statistics and database systems. Bioinformatics is the science of storing, analyzing, and utilizing
information from biological data such as sequences, molecules, gene expressions, and
pathways. The development of novel data mining methods will play a fundamental role in
understanding these rapidly expanding sources of biological data.
Data mining is an interdisciplinary subfield of computer science and statistics with an overall
goal to extract information from a large set of data and transform the information into a
comprehensible structure for further use. Data mining is the analysis step of the "knowledge
discovery in databases" process or KDD (Fig.1). Aside from the raw analysis step, it also
involves database and data management aspects, data pre-processing, model and inference
considerations, interestingness metrics, complexity considerations, post-processing of
discovered structures, visualization, and online updating.
Fig. 1: The process of KDD and the steps involved.
Data mining approaches are well suited to bioinformatics, where enormous volumes of data
are deposited continuously. The extensive databases of biological information
create both challenges and opportunities for developing novel data mining methods. The
Workshop on Data Mining in Bioinformatics (BIOKDD) has been held annually since 2001 with
the goal of encouraging KDD researchers worldwide to take on the numerous challenges that
bioinformatics offers.
The difference between data analysis and data mining is that data analysis is used to test models
and hypotheses on a dataset, e.g., analyzing the effectiveness of a marketing campaign,
regardless of the amount of data; in contrast, data mining uses machine learning and statistical
models to uncover hidden patterns in a large volume of data.
BOTMT:604
Bioinformatics and Biophysics
Prepared By-
Dr. Sangeeta Das.
Assistant Professor, Department of Botany, Bahona College, Jorhat, Assam, India.
Data mining Tools in Bioinformatics:
Various tools for data mining are used in bioinformatics. The following are the tools for
nucleotide sequence analysis:
1. BLAST:
The Basic Local Alignment Search Tool (BLAST) compares gene and protein sequences
against others in public databases. It now comes in several types, including PSI-BLAST,
PHI-BLAST, and BLAST 2 Sequences. Specialized BLASTs are also available for human,
microbial, malaria, and other genomes, as well as for vector contamination, immunoglobulins,
and tentative human consensus sequences.
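The idea of a local alignment can be illustrated with a short sketch. BLAST itself is a heuristic and is not reproduced here; the code below swaps in the exact Smith-Waterman dynamic programming method and returns only the best local score. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not BLAST defaults.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not BLAST parameters.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

With these assumed scores, two sequences that share only the stretch ACGT score 8 regardless of what flanks it, which is the behaviour that makes local alignment useful for database searches.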
2. Electronic PCR:
This tool searches a target DNA sequence for sequence tagged sites (STSs) that have
been used as landmarks in various types of genomic maps. It compares the query sequence
against data in NCBI’s UniSTS, a unified, non-redundant view of STSs from a wide range of
sources.
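The principle behind an electronic PCR search can be sketched as follows: an STS is reported where a forward primer and the reverse complement of a reverse primer both occur on the template at a distance that gives a product of the expected size. This is a minimal sketch assuming exact matches on the forward strand only; the function names and parameters are hypothetical and do not reflect NCBI's implementation.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sts(template, fwd_primer, rev_primer, min_size, max_size):
    """Return (start, end) spans where the primer pair would give a product
    whose length lies between min_size and max_size (inclusive).

    Assumption: exact matching, forward strand of the template only.
    """
    template = template.upper()
    fwd = fwd_primer.upper()
    # The reverse primer anneals to the opposite strand, so on the forward
    # strand we look for its reverse complement.
    rev_site = reverse_complement(rev_primer.upper())
    hits = []
    start = template.find(fwd)
    while start != -1:
        site = template.find(rev_site, start + len(fwd))
        while site != -1:
            end = site + len(rev_site)
            size = end - start
            if size > max_size:
                break  # later sites only give longer products
            if size >= min_size:
                hits.append((start, end))
            site = template.find(rev_site, site + 1)
        start = template.find(fwd, start + 1)
    return hits
```

For example, on the toy template TTTTACGTACGGGGGGGGTTCAGGAAAA with primers ACGTAC and CCTGAA, the sketch reports a single 20-base product when the allowed size range is 15 to 25.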
3. Entrez:
The Entrez Global Query Cross-Database Search System is a federated search engine, or
web portal, that allows users to search many discrete health sciences databases at the National
Center for Biotechnology Information (NCBI) website. The name "Entrez" (meaning "Come
in" in French) was chosen to reflect the spirit of welcoming the public to search the content
available from the National Library of Medicine (NLM).
Entrez Global Query is an integrated search and retrieval system that provides access to all
databases simultaneously with a single query string and user interface. Entrez can efficiently
retrieve related sequences, structures, and references. The Entrez system can provide views of
gene and protein sequences and chromosome maps. Some textbooks are also available online
through the Entrez system. Entrez searches databases such as PubMed, PubMed Central,
Site Search, online Books, Online Mendelian Inheritance in Man (OMIM), the Nucleotide
sequence database (GenBank), the Protein sequence database, Genome Project, UniGene, and
the NLM Catalog.
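Entrez can also be queried programmatically through NCBI's E-utilities web service, which these notes do not otherwise cover. The sketch below only builds a search URL for the ESearch endpoint and makes no network request; the helper name esearch_url and the example query are illustrative assumptions.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one Entrez database (hypothetical helper).

    db     -- Entrez database name, e.g. "pubmed" or "nucleotide"
    term   -- Entrez query string
    retmax -- maximum number of record IDs to return
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Example query; the gene and organism are arbitrary illustrations.
    print(esearch_url("nucleotide", "BRCA1[Gene] AND human[Organism]", retmax=5))
```

Fetching the resulting URL returns a list of record identifiers, which can then be passed to other E-utilities calls to retrieve the records themselves.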
Each Entrez Gene record encapsulates a wide range of information for a given gene and
organism. When possible, the information includes results of analyses that have been done on
the sequence data. The amount and type of information presented depend on what is available
for a particular gene and organism and may include:
(1) a graphic summary of the genomic context, intron/exon structure, and flanking genes
(2) a link to a graphic view of the mRNA sequence, which in turn shows biological features such
as CDS, SNPs, etc.
(3) links to gene ontology and phenotypic information
(4) links to corresponding protein sequence data and conserved domains
(5) links to related resources, such as mutation databases.
Entrez Gene is the successor to LocusLink.
4. Model Maker:
It allows the user to view the evidence (mRNAs, ESTs, and gene predictions) that was aligned
to the assembled genomic sequence to build a gene model, and to edit the model by selecting or
removing putative exons. Model Maker is accessible from sequence maps that were analyzed
at NCBI and displayed in Map Viewer.
5. ORF (Open Reading Frame) Finder:
ORF Finder identifies all possible ORFs in a DNA sequence by locating the standard and
alternative start and stop codons. The deduced amino acid sequences can then be used in a
BLAST search against GenBank. ORF Finder is also packaged in the sequence submission
software Sequin.
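The codon scan described above can be sketched in a few lines. This is a minimal illustration, not the NCBI tool: it assumes only the standard start codon ATG and the three standard stop codons (alternative codons are ignored), and the function names and the min_codons parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons only


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=2):
    """Return ORFs as (strand, frame, start, end, sequence) tuples.

    All six reading frames are scanned. An ORF runs from the first ATG in a
    frame to the next in-frame stop codon; min_codons counts the stop codon.
    Coordinates on the "-" strand refer to the reverse-complemented sequence.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None  # position of the currently open ATG, if any
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, seq[start:i + 3]))
                    start = None
    return orfs
```

For the toy input CCATGAAATAGCC, the sketch reports one ORF, ATGAAATAG, on the forward strand in the third reading frame. An ATG that is never followed by an in-frame stop codon is not reported.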
6. SAGEmap:
It is a tool for performing statistical tests designed specifically for differential-type analyses of
SAGE (Serial Analysis of Gene Expression) data. The data include SAGE libraries generated
by individual labs as well as those generated by the Cancer Genome Anatomy Project (CGAP),
which have been submitted to the Gene Expression Omnibus (GEO). Gene expression profiles
that compare expression in different SAGE libraries are also available on the Entrez GEO
Profiles pages. It is possible to enter a query sequence in the SAGEmap resource to determine
which SAGE tags are in the sequence, then map to the associated SAGE tag records and view
the expression of those tags in different CGAP SAGE libraries.
7. Spidey:
It aligns one or more mRNA sequences to a single genomic sequence. Spidey will try to
determine the exon/intron structure, returning one or more models of the genomic structure,
including the genomic/mRNA alignments for each exon.
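The task of mapping an mRNA onto genomic sequence can be illustrated with a toy sketch. It is not Spidey's algorithm: it assumes error-free sequences and greedily takes the longest exact match for each successive piece of the mRNA, treating the skipped genomic stretches as introns. The function name map_exons and the min_exon parameter are hypothetical.

```python
def map_exons(mrna, genomic, min_exon=3):
    """Greedily place an mRNA on a genomic sequence as a list of exons.

    Returns (start, end) genomic coordinates, one pair per exon.
    Assumptions: exact matches only, exons in order on the forward strand.
    Raises ValueError if part of the mRNA cannot be placed.
    """
    mrna, genomic = mrna.upper(), genomic.upper()
    exons = []
    m_pos = 0  # next unplaced position in the mRNA
    g_pos = 0  # genomic position after the previous exon
    while m_pos < len(mrna):
        remaining = len(mrna) - m_pos
        # Try the longest possible piece first, then shorten it
        for length in range(remaining, min_exon - 1, -1):
            idx = genomic.find(mrna[m_pos:m_pos + length], g_pos)
            if idx != -1:
                break
        else:
            raise ValueError(f"no exact match for mRNA position {m_pos}")
        exons.append((idx, idx + length))
        m_pos += length
        g_pos = idx + length
    return exons
```

On the toy genomic sequence GGGATGGCCGTAAGTTTTCAGTTTAAACCC and the mRNA ATGGCCTTTAAA, the sketch returns two exons separated by an intron that begins with GT and ends with AG. A real aligner must also tolerate mismatches and score splice sites, which this sketch does not attempt.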
8. VecScreen:
It is a tool for identifying segments of a nucleic acid sequence that may be of vector, linker, or
adapter origin prior to sequence analysis or submission. VecScreen was developed to combat
the problem of vector contamination in public sequence databases.
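Screening for vector contamination can be sketched as a search for query segments that also occur in a known vector sequence. The actual tool relies on a sequence similarity search against a vector database; the sketch below swaps in exact k-mer matching against a single vector string, and the function name, the value of k, and the example sequences are illustrative assumptions.

```python
def vector_segments(query, vector, k=8):
    """Return merged (start, end) spans of the query that share exact
    k-mers with the vector sequence (forward strand only).
    """
    query, vector = query.upper(), vector.upper()
    # Index every k-mer that occurs in the vector
    vector_kmers = {vector[i:i + k] for i in range(len(vector) - k + 1)}
    segments = []
    for i in range(len(query) - k + 1):
        if query[i:i + k] in vector_kmers:
            if segments and i <= segments[-1][1]:
                # Overlaps or touches the previous hit: extend that span
                segments[-1] = (segments[-1][0], i + k)
            else:
                segments.append((i, i + k))
    return segments
```

For instance, if the query CCCCCCGGATCCGAATTCTTTTTT is screened against the toy vector AAGCTTGGATCCGAATTC, the sketch flags positions 6 to 18, the stretch GGATCCGAATTC, as being of possible vector origin.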
