Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic Data Analysis
Special thank you to: Dr. Javeed Iqbal, UNMC
Speakers:
Dr. Vladimir Galatenko, Chief Scientist at the Tauber Bioinformatics Research Center. His work focuses on issues related to Big Data analysis and, in particular, on the integration of multi-omics datasets. His special research interest is feature selection, which is vital for the efficient development of clinical test systems.
Julia Panov, Ph.D. student involved in a number of neuroscience research projects and an experienced bioinformatics user. She relies on the T-BioInfo platform for regular processing and integration of omics data and collaborates with the TBRC research group on platform development.
Biological Examples and Reference Data Sets:
• “Modeling precision treatment of breast cancer”, Daemen et al. (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-10-r110),
• “Whole transcriptome profiling of patient-derived xenograft models as a tool to identify both tumor and stromal specific biomarkers”, Bradford et al. (http://www.oncotarget.com/index.php?journal=oncotarget&page=article&op=view&path[]=8014&path[]=23533), and
• Processed data from The Cancer Genome Atlas samples (https://cancergenome.nih.gov/).
Processing of NGS data:
1. Next Generation Sequencing data pre-processing:
• Trimming technical sequences
• Removing PCR duplicates
2. RNA-seq based quantification of expression levels:
• Conventional pipelines (looking at known transcripts)
• Identification of novel isoforms
Analysis of Expression Data Using Machine Learning:
3. Unsupervised analysis of expression data:
• Principal Component Analysis
• Clustering
4. Supervised analysis:
• Differential expression analysis
• Classification, gene signature construction
Gene set enrichment analysis
Part 1:
Biological Significance
Cell Line Data Types: Gene Expression and RNA Editing
Breast cancer and cell line models
Cell Lines as Cancer Models
Gene Expression Table
          Sample 1   Sample 2   Sample 3   Sample 4
gene 1           4          3          3          7
gene 2           6          5          5          8
gene 3           6          6          6          6
gene 4           1          2          1          2
gene 5           9         10          1          5
gene 6          12          4          0          5
gene 7           1          7          9          8
gene 8           4          8          3         10
RNA-editing Table
Chr     Pos start   End         Sample 1   Sample 2   Sample 3   Sample 4
chr1    1312400     1312400     0          0          0          0
chr1    8362100     8362100     0          0          0          0
chr11   842700      842700      0.705023   0          0          1.17938
chr12   753200      753200      0          0          0          0
chr16   521100      521100      0          0          0          0
chr16   1362700     1362700     0          0          0          0
chr16   1446900     1446900     0          0          8.55549    0
chr16   2176500     2176500     0          0          0          0
chr16   2896600     2896600     0          0          0          0
chr16   29972700    29972700    0          0          0          0
chr16   30358600    30358600    0          0          0          0
chr16   30778800    30778800    0          0          0          0
chr17   2042900     2042900     0          0          15.332     0
chr17   4538300     4538300     0          0          0          0
chr17   4891100     4891100     0          0          0          0
chr17   4946300     4946300     0          38.4794    0          0
chr17   5033200     5033200     0          0          0          0
49 Cell Lines
Gene expression: genes x samples table of expression values
RNA editing: positions x samples table of abundance values
Matrix of distances between samples based on gene expression
Matrix of distances between samples based on abundance of RNA editing
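As a rough illustration of how a matrix of distances between samples can be built from an expression table, the sketch below uses the small gene expression table shown earlier. The slides do not say which distance was used, so the Euclidean distance here is an assumption, and the function name `sample_distance_matrix` is made up for this example.

```python
import math

# Toy expression table from the slides: rows are genes 1-8, columns are samples 1-4.
expression = [
    [4, 3, 3, 7],
    [6, 5, 5, 8],
    [6, 6, 6, 6],
    [1, 2, 1, 2],
    [9, 10, 1, 5],
    [12, 4, 0, 5],
    [1, 7, 9, 8],
    [4, 8, 3, 10],
]

def sample_distance_matrix(table):
    """Return Euclidean distances between every pair of sample columns.

    The choice of Euclidean distance is an assumption made for illustration.
    """
    columns = list(zip(*table))  # one tuple of gene values per sample
    return [[math.dist(a, b) for b in columns] for a in columns]

distances = sample_distance_matrix(expression)
for row in distances:
    print(" ".join(f"{value:6.2f}" for value in row))
```

The same function could be applied to the RNA-editing abundance table, which would give a second matrix to compare against the expression-based one.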
HCC202
Both the gene expression table and the RNA-editing abundance table separate the HCC202 sample from the other samples.
Genes and RNA editing
Figure labels: Genes; RNA editing; olfactory receptors; miRNA; Rab GTPases; enhanced trafficking
Learn more at: T-bio.info
Part 2:
Working with RNA-Seq Data
RNA-seq: overview
Genome: …TCTGAAACAATGCTTCAATCTAACTTATCATTCATTGGGA…
Genes on the genome: Gene A, Gene B, Gene C
Transcripts: Transcr. A (several copies), Transcr. C
Reads, attributed back to Transcr. A and Transcr. C
RNA-seq: some details
Genome: Gene A, Gene B, Gene C; transcripts Transcr. A and Transcr. C
Steps shown:
• Shattering
• Adapter ligation
• PCR amplification
• “Reading”
RNA-seq: per-sample processing
Preprocessing:
• Adapter removal plus additional trimming
• Removing PCR duplicates
Mapping:
• Mapping on the set of known transcripts
• Mapping on the genome (and potential identification of novel transcripts)
• Combined strategy
Quantification of expression levels
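The two preprocessing steps listed above can be sketched in a few lines. This is a deliberately simplified toy: the adapter sequence, the exact-match trimming and the sequence-identity notion of a duplicate are all assumptions for illustration, and dedicated tools handle mismatches, partial adapters, base qualities and mapping positions.

```python
# Illustrative adapter sequence; an assumption, not taken from the slides.
ADAPTER = "AGATCGGAAGAGC"

def trim_adapter(read, adapter=ADAPTER):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    position = read.find(adapter)
    return read if position == -1 else read[:position]

def remove_exact_duplicates(reads):
    """Keep the first copy of each distinct sequence, preserving order."""
    return list(dict.fromkeys(reads))

raw_reads = [
    "TCTGAAACAATGCTTC" + ADAPTER + "TT",  # adapter followed by extra bases
    "TCTGAAACAATGCTTC",                   # same insert, no adapter
    "ATCATTCATTGGGA" + ADAPTER,           # adapter at the very end
]

trimmed = [trim_adapter(read) for read in raw_reads]
unique_reads = remove_exact_duplicates(trimmed)

print(trimmed)
print(unique_reads)
```

Note that the second step cannot tell a PCR duplicate from two identical fragments that arose naturally, which is why duplicate removal needs care with highly expressed transcripts.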
RNA-seq: Comments
PCR duplicate removal should be used with caution to avoid removing natural duplicates. Valuable links:
http://www.cureffi.org/2012/12/11/how-pcr-duplicates-arise-in-next-generation-sequencing/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4965708/ - DNA-seq and variant calling
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597324/ - RNA-seq, ChIP-seq data
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3871669/ - trimming
RNA-seq: processing
RNA-seq: expression level quantification
Standard measures
• Read counts (raw, expected)
• FPKM – fragments per kilobase per million mapped reads:
FPKM_g = (number of reads mapped on the gene) / ((total number of mapped reads, in millions) x (gene length, in kilobases))
• TPM – transcripts per million:
For one sample, TPM_g = C x FPKM_g, where C is chosen so that the sum of all TPM_g equals one million. The constant C differs from sample to sample.
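The FPKM and TPM definitions above translate directly into code. The sketch below uses made-up read counts and gene lengths for a single sample; the function names `fpkm` and `tpm_from_fpkm` are hypothetical helpers written for this example.

```python
def fpkm(counts, lengths_bp):
    """FPKM per gene: reads / (total mapped reads in millions * gene length in kb)."""
    total_millions = sum(counts) / 1e6
    return [
        count / (total_millions * (length / 1e3))
        for count, length in zip(counts, lengths_bp)
    ]

def tpm_from_fpkm(fpkm_values):
    """Rescale FPKM values by a per-sample constant C so that they sum to one million."""
    scale = 1e6 / sum(fpkm_values)  # this is the constant C for the sample
    return [scale * value for value in fpkm_values]

# Made-up example: three genes in one sample.
counts = [100, 200, 700]        # reads mapped on each gene
lengths_bp = [1000, 2000, 500]  # gene lengths in bases

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm_from_fpkm(fpkm_values)

print(fpkm_values)
print(tpm_values)
print(sum(tpm_values))
```

Because C depends on the sum of FPKM values in the sample, two samples with the same FPKM for a gene can still have different TPM values for it.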
RNA-seq: expression level quantification
Alternative definition of TPM:
$$\mathrm{TPM}_g = \frac{N_g \cdot \bar{r} \cdot 10^6}{L_g \cdot T}, \qquad T = \sum_{h} \frac{N_h \cdot \bar{r}}{L_h},$$
where $N_g$ is the number of reads mapped on gene $g$, $\bar{r}$ is the mean read length, $L_g$ is the gene length, and the sum in $T$ runs over all genes.
Each term of the sum represents the number of sampled transcripts corresponding to a gene, and T estimates the total number of sampled transcripts (molecules). Thus, TPM is an estimate of the number of transcripts corresponding to a gene in every million transcripts.
Details: Wagner G.P., Kin K., Lynch V.J. (Theory Biosci., 2012) https://www.ncbi.nlm.nih.gov/pubmed/22872506
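A small sketch of the alternative TPM definition, again with made-up counts and lengths. The function name `tpm_direct` and the read lengths are assumptions for illustration. The example also shows that the mean read length cancels out, so the result does not depend on it.

```python
def tpm_direct(counts, lengths_bp, mean_read_length):
    """TPM computed from per-gene estimates of the number of sampled transcripts."""
    per_gene = [
        count * mean_read_length / length
        for count, length in zip(counts, lengths_bp)
    ]
    total = sum(per_gene)  # T: estimated total number of sampled transcripts
    return [value * 1e6 / total for value in per_gene]

# Made-up example: three genes in one sample.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 500]

tpm_100 = tpm_direct(counts, lengths_bp, mean_read_length=100)
tpm_50 = tpm_direct(counts, lengths_bp, mean_read_length=50)

print(tpm_100)
print(tpm_50)
```

With these numbers the values agree with rescaling FPKM to a sum of one million, which is what the two definitions are expected to give.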
RNA-seq: expression level quantification
Linear scale vs log scale
Relative differences are biologically more meaningful than absolute ones.
Computations are simplified if log-scaling is performed:
Log-scaled measure = log2 (linear-scale measure + shift)
For relatively large values, a difference of 1 on the log scale corresponds to a 2x difference on the linear scale, a difference of 3 corresponds to an 8x difference, and so on; a difference of -1 corresponds to a 2x difference in the opposite direction.
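The log-scaling rule can be checked numerically. In the sketch below the shift of 1 is an assumption (the slides leave the shift unspecified), and `to_log_scale` is a hypothetical helper name.

```python
import math

def to_log_scale(value, shift=1.0):
    """Log-scaled measure = log2(linear-scale measure + shift)."""
    return math.log2(value + shift)

# For relatively large values the shift hardly matters:
# 1023 vs 511 is roughly a 2x ratio and exactly 1 unit apart after log-scaling.
large_difference = to_log_scale(1023) - to_log_scale(511)

# For small values the shift dominates:
# 3 vs 1 is a 3x ratio, but only 1 unit apart after log-scaling.
small_difference = to_log_scale(3) - to_log_scale(1)

print(large_difference)
print(small_difference)
```

This is the reason the statement about 2x and 8x differences is restricted to relatively large values: near zero, the shift compresses ratios.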
Comparison: the role of preprocessing
Pipelines compared: no preprocessing; no PCR duplicate removal; standard. The output of the comparison is shown as figures.
Extended pipeline
BREAK
Unsupervised analysis: PCA
Unsupervised analysis: hierarchical clustering
Dendrogram
Unsupervised analysis: PCA (15 genes)
Unsupervised analysis: hierarchical clustering, 15 genes
Dendrogram; clusters: N-like, Basal, C-low, Luminal
Gene annotation: ENSG to Gene Symbols plus GO
Unsupervised analysis: K-means, 15 genes
“The SUM52PE cell line was derived from a pleural effusion and was found to be
negative for ER and PR expression, however the original primary tumor from this
patient was positive for both hormone receptors”.
Chavez KJ, Garimella SV, Lipkowitz S. Triple negative breast cancer cell lines: one tool in the
search for better treatment of triple negative breast cancer. Breast Dis. 2010; 32(1-2):35-48.
Ethier SP, Kokeny KE, Ridings JW, Dilts CA. erbB family receptor expression and growth regulation
in a newly isolated human breast cancer cell line. Cancer Res. 1996; 56(4): 899-907.
BREAK
Supervised analysis: SVM with a linear kernel as an example
Figure labels: d, d; “?”
Supervised analysis: available methods
• Linear Discriminant Analysis (LDA)
• Quadratic Discriminant Analysis (QDA)
• Random Forest
• Support Vector Machine (SVM)
• Naïve Bayes
Supervised analysis: 15 genes
Differential expression analysis
Quantities related to the degree of differential expression:
• Difference between mean expression levels – fold change (pay attention to the scale);
• Statistical significance – p-value, adjusted p-value (e.g., FDR);
• Expression level magnitude (treat low-expressed genes with caution, e.g., by excluding them from the analysis).
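Two of the quantities listed above can be sketched briefly: a fold change computed on the log scale and an adjusted p-value. The slides mention FDR only as an example; the Benjamini-Hochberg procedure used here is one common way to obtain it and is an assumption about the method. The p-values are made up, and the shift of 1 in the log transform is also an assumption.

```python
import math
from statistics import mean

def log2_fold_change(group_a, group_b, shift=1.0):
    """Difference of mean log2(value + shift) between two groups (b minus a)."""
    log_a = [math.log2(value + shift) for value in group_a]
    log_b = [math.log2(value + shift) for value in group_b]
    return mean(log_b) - mean(log_a)

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Made-up expression values for one gene in two groups of samples.
fold_change = log2_fold_change([3, 3], [15, 15])

# Made-up raw p-values for four genes.
adjusted_p = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])

print(fold_change)
print(adjusted_p)
```

A log2 fold change of 2 corresponds to a 4x difference on the (shifted) linear scale, which is the scale issue the first bullet warns about.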
Gene set / pathway enrichment analysis
Possible options:
• Use only lists (thresholding required): one of the standard tools here is The Database for Annotation, Visualization and Integrated Discovery – DAVID (https://david.ncifcrf.gov/home.jsp, https://david-d.ncifcrf.gov/);
• Take into consideration degrees of differential expression;
• Additionally take into consideration pathway topology.
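The first option, working with thresholded gene lists only, is often implemented as an over-representation test. The sketch below uses a plain hypergeometric tail probability with made-up numbers. It is not a reimplementation of DAVID, which reports its own score, and the function name `enrichment_p_value` is hypothetical.

```python
from math import comb

def enrichment_p_value(universe, gene_set, selected, overlap):
    """Probability of seeing at least `overlap` gene-set members among the selected genes.

    universe: number of genes considered in total
    gene_set: number of genes in the pathway or gene set
    selected: number of genes that passed the threshold
    overlap:  number of selected genes that belong to the gene set
    """
    upper = min(gene_set, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(gene_set, k) * comb(universe - gene_set, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Made-up numbers: 10 genes in total, a 5-gene set, 5 selected genes, all 5 in the set.
p_all_in_set = enrichment_p_value(universe=10, gene_set=5, selected=5, overlap=5)

# Requiring an overlap of at least 0 always gives probability 1.
p_trivial = enrichment_p_value(universe=10, gene_set=5, selected=5, overlap=0)

print(p_all_in_set)
print(p_trivial)
```

The result depends on the threshold used to build the selected list, which is the limitation the other two options try to address by using the degrees of differential expression and pathway topology.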
BREAK
HANDS-ON
Separation of TCGA and breast cancer PDX samples
HANDS-ON
Analysis of a subset of breast cancer PDX samples
Editor's Notes

  • #2: Welcome to our first workshop of this kind. We are constantly experimenting, so hopefully this experiment will be successful. Our goal is to share with you several important concepts around Next Generation Sequencing analysis techniques, specifically how to process, analyze and annotate gene expression data.
  • #3: Before we start, I would like to say a special thank you to Dr. Javeed Iqbal, whom I am sure you all know from the University of Nebraska Medical Center. He has been a tremendous help in organizing the venue and sharing updates about the workshop with many of you. Also, let me introduce our speakers today: Dr. Vladimir Galatenko, the chief scientist at the Tauber Bioinformatics Research Center, and Julia Panov, a Ph.D. student who regularly relies on the T-BioInfo platform in her research.
  • #4: In this workshop, we will use oncology-related public-domain datasets derived from cell lines and animal models and, if we have time, will touch on TCGA data. These projects were prepared as examples for this workshop; one of our goals is to identify key topics of interest for the future workshops and online courses we are developing. We would be happy to speak with you afterwards about topics, pathologies or other types of data of interest.
  • #5: We will cover important topics about Next Generation Sequencing data: pre-processing and quantification of expression levels.
  • #94: Conclusion