%\VignetteKeywords{runAbsoluteCN}
%\VignetteEngine{knitr::knitr}
%\VignetteDepends{PureCN}
%\VignettePackage{PureCN}
%\VignetteIndexEntry{Overview of the PureCN R package}

\documentclass{article}

<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@ 

\usepackage{booktabs} % book-quality tables

\title{Copy number calling and SNV classification using 
targeted short read sequencing}
\author[1]{Markus Riester}
\affil[1]{Novartis Biomedical Research, Cambridge, MA}

\begin{document}

\maketitle

\begin{abstract}
\Biocpkg{PureCN} \cite{Riester2016} is a purity and ploidy aware copy number
caller for cancer samples inspired by the \software{ABSOLUTE} algorithm
\cite{Carter2012}. It was designed with hybrid capture sequencing data in
mind, especially medium-sized targeted gene panels without matched normal
samples (matched whole-exome data is of course supported).

It can be used to supplement existing normalization and segmentation algorithms,
i.e. the software can start from BAM files, from target-level coverage data,
from copy number log2-ratios or from already segmented data. After the correct
purity and ploidy solution has been identified, \Biocpkg{PureCN} will accurately
classify variants as germline vs. somatic or clonal vs. sub-clonal.

\Biocpkg{PureCN} was further designed to integrate well with industry standard
pipelines \cite{McKenna2010}, but it is straightforward to generate input data
from other pipelines.
\end{abstract}

\packageVersion{\Sexpr{BiocStyle::pkg_ver("PureCN")}}

\tableofcontents
\newpage

<<load-purecn, echo=FALSE, message=FALSE>>=
library(PureCN)
set.seed(1234)
@

\section{Introduction}

This tutorial demonstrates, using a toy example, how we recommend running
\Biocpkg{PureCN} on targeted sequencing data. To estimate tumor purity, we
jointly utilize both target-level\footnote{The captured genomic regions, e.g.
exons.} coverage data and allelic fractions of single nucleotide
variants (SNVs), inside (and optionally outside) the targeted regions.
Knowledge of purity will in turn allow us to accurately (i) infer integer copy
number and (ii) classify variants (somatic vs. germline, mono-clonal vs.
sub-clonal, heterozygous vs. homozygous etc.).

This requires three basic input files:

\begin{enumerate}
    \item A VCF file containing germline SNPs and somatic mutations. Somatic
status is not required if the variant caller was run without a matched
normal sample.
    \item The tumor BAM file.
    \item At least one BAM file from a normal control sample, either matched or
        process-matched.
\end{enumerate}

In addition, we need to know a little bit more about the assay. This is the
tedious step, since here the user needs to provide some information. Most
importantly, we need to know the positions of all targets. Then we need to
correct for GC-bias, for which we need the GC-content of each target. Optionally,
if gene-level calls are wanted, we also need a gene symbol for each target. We
may also observe subtle differences in coverage of tumor compared to
normal due to varying proliferation rates, and we can provide replication
timing data to check and correct for such a bias. Finally, to obtain the best
results, we can use a pool of normal samples to automatically learn more
about the assay and its biases and common artifacts.

The next sections will show how to do all this with \Biocpkg{PureCN} alone or
with the help of \software{GATK} and/or existing copy number pipelines. 

\textbf{All the steps described in the following are available in easy-to-use
command line scripts described in a separate vignette.}

\section{Basic input files}

\subsection{VCF}
\label{secinputvcf}
Germline SNPs and somatic mutations are expected in a single VCF file. At the
bare minimum, this VCF should contain read depths of reference and alt alleles
in an AD format field and a DB info flag for membership in germline
databases\footnote{The name of the flag is customizable in
\Rfunction{runAbsoluteCN}}. 

Without a DB flag, variant IDs starting with \Rcode{rs} are assumed to be in
dbSNP. Population allele frequencies are expected in a POP\_AF info field. These
frequencies are currently only used to infer membership in germline databases
when the DB flag is missing; in future versions they will be used to calculate
somatic prior probabilities more accurately.

If a matched normal is available, then somatic status information is currently
expected in a SOMATIC info flag in the VCF. The \Biocpkg{VariantAnnotation}
package provides examples of how to add info fields to a VCF in case the
variant caller used does not add this flag. If the VCF contains a BQ format field
containing base quality scores, \Biocpkg{PureCN} can remove low quality calls.
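
As a minimal sketch of the latter, a SOMATIC flag could be added with
\Biocpkg{VariantAnnotation} as follows (the file name and the logical vector
\Rcode{is.somatic} are hypothetical):

<<addsomaticflag, eval=FALSE>>=
# Sketch only: add a SOMATIC info flag to an existing VCF;
# "is.somatic" is a hypothetical logical vector, one value per variant
library(VariantAnnotation)
vcf <- readVcf("tumor_matched.vcf", genome = "hg19")
newInfo <- DataFrame(Number = "0", Type = "Flag",
    Description = "Somatic event", row.names = "SOMATIC")
info(header(vcf)) <- rbind(info(header(vcf)), newInfo)
info(vcf)$SOMATIC <- is.somatic
writeVcf(vcf, "tumor_matched_somatic.vcf")
@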

VCF files generated by \software{MuTect} \cite{Cibulskis2013} should work well
and in general require no post-processing. \Biocpkg{PureCN} can handle
\software{MuTect} VCF files generated in both tumor-only and matched normal
mode. Beta support for \software{MuTect 2} and \software{FreeBayes}
VCFs is available.  

As an alternative to VCF input, read counts of common SNPs are also supported
in case classification of somatic variants is of no interest and genotyped
SNPs are already generated in an upstream pipeline.
The function \Rfunction{readAllelicCountsFile} can be used to parse a
\software{GATK4 CollectAllelicCounts} file into a VCF supported by
\Biocpkg{PureCN}\footnote{Only the \Rcode{RG} header line is required for
extracting the sample id}:

\begin{verbatim}
@HD	VN:1.6
@SQ	SN:1	LN:248956422
@SQ	SN:2	LN:242193529
...
@RG	ID:GATK	SM:Sample1
CONTIG	POSITION	REF_COUNT	ALT_COUNT	REF_NUCLEOTIDE	ALT_NUCLEOTIDE
1	114515871	177	189	G	A
1	150044293	119	157	T	G
\end{verbatim}

<<exampleac>>=
ac.file <- system.file("extdata", "example_allelic_counts.tsv", 
          package="PureCN")
vcf.ac <- readAllelicCountsFile(ac.file)
@

\subsection{Target information}
\label{sectargetinfo}
For the default segmentation function provided by \Biocpkg{PureCN}, the
algorithm first needs to calculate log2-ratios of tumor vs. normal control
coverage. To do this, we need to know the locations of the captured genomic
regions (targets). These are provided by the manufacturer of your capture
kit\footnote{While \Biocpkg{PureCN} can use a pool of normal samples to learn
which intervals are reliable and which are not, it is highly recommended to
provide the correct intervals. Garbage in, garbage out.}. Please double check
that the genome version of the target file matches the reference. Usually the
manufacturer provides two files: the baits file containing the coordinates of
the actual capture baits, and the target file containing the coordinates of the
regions we want to capture. We recommend using the baits file (and acknowledge
the confusing nomenclature, which we follow due to conventions in established
tools).

Default parameters assume that these targets do NOT include a ``padding'' to
include flanking regions. \Biocpkg{PureCN} will automatically include variants
in the 50bp flanking regions if the variant caller was run either without an
interval file or with interval padding (see Section~\ref{faqinput}).

\Biocpkg{PureCN} will attempt to optimize the targets for copy number calling
(similar to \cite{Talevich2016}):

\begin{itemize} 
\item Large targets are split to obtain a higher resolution
\item Targets in regions of low mappability are dropped
\item Optionally, accessible regions in-between the target (off-target) regions
    are included so that available coverage information in on- and off-target
    reads can be used by the segmentation function. In the following, we
    will use \textit{intervals} when something applies to both on-target and
    off-target regions and \textit{targets} when it only applies to
    on-target.
\end{itemize}

It further annotates intervals by GC-content (how coverage is normalized is
described later in Section~\ref{secgcbias}). 

\Biocpkg{PureCN} provides the \Rfunction{preprocessIntervals} 
function:

<<examplegc>>=
reference.file <- system.file("extdata", "ex2_reference.fa", 
    package = "PureCN", mustWork = TRUE)
bed.file <- system.file("extdata", "ex2_intervals.bed", 
        package = "PureCN", mustWork = TRUE)
mappability.file <- system.file("extdata", "ex2_mappability.bigWig", 
        package = "PureCN", mustWork = TRUE)

intervals <- import(bed.file)
mappability <- import(mappability.file)

preprocessIntervals(intervals, reference.file, 
    mappability = mappability, output.file = "ex2_gc_file.txt")
@

A command line script described in a separate vignette provides convenient
access to this function and also attempts to annotate the targets with gene
symbols using the \Rfunction{annotateTargets} function.
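
Within R, this could look like the following sketch (the hg19 annotation
packages used here are assumptions and must match your genome version):

<<annotatetargets, eval=FALSE>>=
# Sketch only: annotate preprocessed intervals with gene symbols,
# assuming the hg19 annotation packages below are installed
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
intervals.gc <- preprocessIntervals(intervals, reference.file,
    mappability = mappability)
intervals.annotated <- annotateTargets(intervals.gc,
    TxDb.Hsapiens.UCSC.hg19.knownGene, org.Hs.eg.db)
@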

\subsection{Coverage data}
\label{purecncoverage}

The \Rfunction{calculateBamCoverageByInterval} function can be used to generate
the required coverage data from BAM files. All we need to do is provide the
desired intervals (as generated by \Rfunction{preprocessIntervals}):

<<examplecoverage>>=
bam.file <- system.file("extdata", "ex1.bam", package="PureCN", 
    mustWork = TRUE)
interval.file <- system.file("extdata", "ex1_intervals.txt", 
    package = "PureCN", mustWork = TRUE)

calculateBamCoverageByInterval(bam.file = bam.file, 
 interval.file = interval.file, output.file = "ex1_coverage.txt")
@

\subsection{Third-party coverage tools}
\label{gatkcoverage}

Calculating coverage from BAM files is a common task and your pipeline might
already provide this information. As an alternative to
\Rfunction{calculateBamCoverageByInterval}, \Biocpkg{PureCN} currently supports
coverage files generated by \software{GATK3 DepthOfCoverage}, \software{GATK4
CollectFragmentCounts} and by \software{CNVkit}. When files with standard file
extensions are provided, \Biocpkg{PureCN} will automatically detect the correct
format, and all subsequent steps are the same for all tools. You will, however,
still need the interval file generated in Section~\ref{sectargetinfo} and the
third-party tool must use the exact same intervals. See also FAQ
Section~\ref{faqinput} for recommended settings for \software{GATK3
DepthOfCoverage}. 
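
All supported formats are parsed with the same \Rfunction{readCoverageFile}
function; a minimal sketch (the file name is hypothetical):

<<readthirdparty, eval=FALSE>>=
# Sketch: the format is detected from the standard file extension
tumor.coverage <- readCoverageFile("Sample1_targetcoverage.cnn")
@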

\subsection{Third-party segmentation tools}

\Biocpkg{PureCN} integrates well with existing copy number pipelines. Instead of
coverage data, the user then needs to provide either already segmented data or a
wrapper function. This is described in Section~\ref{customseg}.

\subsection{Example data}

We now load a few example files that we will use throughout this tutorial:

<<example_files, message=FALSE, warning=FALSE, results='hide'>>=
library(PureCN)

normal.coverage.file <- system.file("extdata", "example_normal.txt.gz",
    package = "PureCN") 
normal2.coverage.file <- system.file("extdata", "example_normal2.txt.gz", 
    package = "PureCN") 
normal.coverage.files <- c(normal.coverage.file, normal2.coverage.file) 
tumor.coverage.file <- system.file("extdata", "example_tumor.txt.gz", 
    package = "PureCN") 
seg.file <- system.file("extdata", "example_seg.txt",
    package = "PureCN")
vcf.file <- system.file("extdata", "example.vcf.gz", package = "PureCN")
interval.file <- system.file("extdata", "example_intervals.txt", 
    package = "PureCN")
@

\section{Library-specific coverage bias}
\label{secgcbias}

In coverage normalization, we distinguish between assay- and library-specific
biases.  Assay-specific biases, for example due to probe density, probe capture
efficiency and read mappability, are best removed with a pool of normal samples
(Section~\ref{bestnormal}, \cite{Talevich2016}). In other words, by examining
the coverage of particular intervals in a pool of normals, we can estimate how well
this assay captures these intervals and will then adjust the tumor coverage
accordingly.

Other biases are library-specific, meaning a patient sample captured in
different libraries may display dramatically different coverage profiles across
libraries. Data from good sequencing centers nowadays usually show relatively
small technical variance, but some biases are not completely avoidable. The
most important library-specific bias is due to GC-content, i.e. regions of high
AT- or GC-content are not always captured with exactly the same efficiency in
tumor and normal samples.

We usually also observe that early replicating regions have a slightly higher
coverage than late replicating regions \cite{Koren2014, Kleinheinz210807}.
Since there is often a significant difference in proliferation rates of tumor
and normal, the pool of normals might also not completely adjust for this small
bias.

A good panel of normal samples will remove both assay- and library-specific 
biases. We recommend sequencing the normals in multiple batches so that the
coverage normalization has access to any expected day-to-day assay variance.
This will greatly increase the normalization's ability to remove
library-specific biases.

Sometimes it can be beneficial to remove library-specific biases explicitly,
especially with smaller panels of normals that might show significant
differences to tumors (FFPE vs blood, high vs low coverage, etc.).

In the following example, we correct the raw coverage of all samples, tumor and
normal, for the two major sources of library-specific coverage biases mentioned
above (Figure~\ref{figure/figuregccorrect-1}). For GC-normalization, we use a
2-step loess normalization \cite{Ha2014}. For the replication timing bias, a
linear model of log-transformed coverage and provided replication timing score
is used.

<<figuregccorrect, fig.show='hide', fig.width=7, fig.height=7, warning=FALSE>>=
correctCoverageBias(normal.coverage.file, interval.file, 
    output.file = "example_normal_loess.txt.gz", plot.bias = TRUE)
@
\incfig{figure/figuregccorrect-1}{0.95\textwidth}{Coverage before and after
normalization for GC-content and replication timing.} 
{This plot shows coverage as a function of on- and off-target GC-content and
replication timing before and after normalization. Each dot is an
interval. The example files are already GC-normalized; real data will show more
dramatic differences.}

The example coverage files are already GC-normalized.  We provide a convenient
command line script for generating normalized coverage data from BAM files or
from \software{GATK} coverage files (see Quick vignette).


\section{Pool of normals}

\subsection{Coverage normalization}
\label{bestnormal}

For calculating copy number log2-ratios of tumor vs. normal, \Biocpkg{PureCN}
requires coverage from a process-matched normal sample.  Using a normal that
was sequenced using a similar, but not identical assay, rarely works.
Differently covered genomic regions simply result in too many log2-ratio
outliers. This section describes how to optimally normalize coverage against a
pool of normals.

The \Rfunction{createNormalDatabase} function builds a database of coverage
files (a command line script providing this functionality is described in
a separate vignette): 

<<normaldb>>=
normalDB <- createNormalDatabase(normal.coverage.files)
# serialize, so that we need to do this only once for each assay
saveRDS(normalDB, file = "normalDB.rds")
@

When using a GC-normalization in tumors, please make sure that all provided
coverage files here were also GC-normalized prior to building the database
(Section~\ref{secgcbias}).  Internally, \Rfunction{createNormalDatabase}
determines the sex of the samples and trains a PCA that is later used for
denoising tumor coverage using Tangent normalization \cite{Tabak566505}:

<<normaldbpca>>=
normalDB <- readRDS("normalDB.rds")
pool <- calculateTangentNormal(tumor.coverage.file, normalDB)
@

The \Rfunction{createNormalDatabase} function further automatically estimates
copy number log2-ratio standard deviations. Assuming that all normal samples
are in general diploid, a high variance in log2-ratio is indicative of an
interval with either common germline alterations or frequent artifacts; high or
low copy number log2-ratios in these intervals are unlikely to measure somatic
copy number events. The segmentation function can use this information to skip
over such noisy regions.


\subsection{Artifact filtering}
\label{artifactfiltering}

It is important to remove as many artifacts as possible, because low ploidy
solutions are typically punished more by artifacts than high ploidy solutions.
High ploidy solutions are complex and usually find ways of explaining artifacts
reasonably well. The following steps in this section are optional, but
recommended since they will reduce the number of samples requiring manual
curation, especially when matched normal samples are not available.

\subsubsection{VCF}
\label{artifactfilteringvcf}

We recommend running \software{MuTect} with a pool of normal samples to filter
common sequencing errors and alignment artifacts from the VCF.
\software{MuTect} requires a single VCF containing all normal samples, for
example generated by the \software{GATK3 CombineVariants} tool (see
Section~\ref{faqinput}). For \software{MuTect 2} in \software{GATK4}, please
follow the somatic best practices including the generation of a
\software{GenomicsDB} containing variants from the normal samples.

It is highly recommended to provide \Biocpkg{PureCN} these variant pools of
normals as well; this will help the software correct non-reference read
mapping biases. This is described in the \Rfunction{setMappingBiasVcf}
documentation. To reduce memory usage, the normal panel VCF can be reduced to
contain only variants present in 2-3 or more samples. \software{GenomicsDBs}
from \software{GATK4} can be provided as is, but require the installation of
the \software{genomicsdb} R package from
\url{https://github.com/nalinigans/GenomicsDB-R}.

Because these pools can become huge with an increasing number of normal samples,
we pre-compute the mapping bias:

<<calculatemappingbias>>=
# speed-up future runtimes by pre-calculating variant mapping biases
normal.panel.vcf.file <- system.file("extdata", "normalpanel.vcf.gz",
                                     package = "PureCN")

bias <- calculateMappingBiasVcf(normal.panel.vcf.file, genome = "hg19")
saveRDS(bias, "mapping_bias.rds")
mapping.bias.file <- "mapping_bias.rds"
@

This function will first fit beta-binomial distributions of allelic fractions
from heterozygous variants (between fractions of 0.1 and 0.9) observed in a
sufficient number of samples; by default \Rcode{min.normals.betafit} is set to
7. Under the assumption that there are only a limited number of error modes
resulting in different distributions, we next cluster those position-specific
distributions into \Rcode{num.betafit.clusters} (default 9). Next, all variants
observed as heterozygous in fewer than \Rcode{min.normals.betafit} but at least
\Rcode{min.normals.assign.betafit} (default 3) samples are assigned the best
fitting beta-distribution cluster.

For variants with at least \Rcode{min.normals} (default 1) but fewer than
\Rcode{min.normals.assign.betafit} samples, the empirical Bayes estimate
described in the original \Biocpkg{PureCN} paper is used, in which the
reference and alternate counts of an average high quality SNP are added to the
observed counts, thus forcing the estimate to the average when
position-specific information is limited.

Note that for variants with low median coverage
(\Rcode{min.median.coverage.betafit}, default 5), no beta-binomial fit is
attempted and the empirical Bayes method is used instead.
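
For illustration, the earlier \Rfunction{calculateMappingBiasVcf} call written
out with the defaults described above made explicit:

<<mappingbiasdefaults, eval=FALSE>>=
# Defaults spelled out for illustration only
bias <- calculateMappingBiasVcf(normal.panel.vcf.file, genome = "hg19",
    min.normals = 1, min.normals.betafit = 7,
    min.normals.assign.betafit = 3, num.betafit.clusters = 9,
    min.median.coverage.betafit = 5)
@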

\subsection{Artifact filtering without a pool of normals}
\label{artifactfiltering2}

By default, \Biocpkg{PureCN} will exclude targets with coverage below 15X from
segmentation (with a pool of normals, targets are filtered based on the
coverage and variance in normal database only). For variants in the provided
VCF, the same 15X cutoff is applied.  \software{MuTect} applies more
sophisticated artifact tests and flags suspicious variants. If
\software{MuTect} was run in matched normal mode, then both potential artifacts
and germline variants are rejected, which means we cannot simply filter by the
PASS/REJECT \software{MuTect} flags. The \Rfunction{filterVcfMuTect} function
optionally reads the \software{MuTect 1.1.7} stats file and will keep germline
variants, while removing potential artifacts.  Without the stats file,
\Biocpkg{PureCN} will use only the filters based on read depths as defined in
\Rfunction{filterVcfBasic}.  Both functions are automatically called by
\Biocpkg{PureCN}, but can be easily modified and replaced if necessary.

We can also use a BED file to blacklist regions expected to be problematic, for
example the simple repeats track from the UCSC Genome Browser:

<<ucsc_segmental>>=
# Instead of using a pool of normals to find low quality regions,
# we use suitable BED files, for example from the UCSC genome browser.

# We do not download these in this vignette to avoid build failures 
# due to internet connectivity problems.
downloadFromUCSC <- FALSE
if (downloadFromUCSC) {
    library(rtracklayer)
    mySession <- browserSession("UCSC")
    genome(mySession) <- "hg19"
    simpleRepeats <- track(ucscTableQuery(mySession, 
        track = "Simple Repeats", table = "simpleRepeat"))
    export(simpleRepeats, "hg19_simpleRepeats.bed")
}

snp.blacklist <- "hg19_simpleRepeats.bed"
@

\section{Recommended run}
\label{secrecommended}
Finally, we can run \Biocpkg{PureCN} with all that information:

<<runpurecn>>=
ret <- runAbsoluteCN(normal.coverage.file = pool,
    tumor.coverage.file = tumor.coverage.file, vcf.file = vcf.file, 
    genome = "hg19", sampleid = "Sample1", 
    interval.file = interval.file, normalDB = normalDB,
#    args.setMappingBiasVcf=list(
        # mapping.bias.file = mapping.bias.file
#    ),
    args.filterVcf=list(
        # snp.blacklist = snp.blacklist, 
        # stats.file = mutect.stats.file
    ),
    args.filterIntervals = list(
        min.total.counts = 50
    ), 
    post.optimize = FALSE, plot.cnv = FALSE, verbose = FALSE)
@

The \Rcode{normal.coverage.file} argument points to a coverage file obtained from
either a matched or a process-matched normal sample, but can also be a small
pool of best normals (Section~\ref{bestnormal}).  

The \Rcode{normalDB} argument (Section~\ref{bestnormal}) provides a pool of
normal samples and for example allows the segmentation function to skip targets
with low coverage or common germline deletions in the pool of normals.  If
available, a VCF containing all variants from the normal samples should be
provided via \Rcode{args.setMappingBiasVcf} to correct read mapping biases.  The
files specified in \Rcode{args.filterVcf} help \Biocpkg{PureCN} filter SNVs
for artifacts more efficiently, as described in Sections~\ref{artifactfiltering}
and \ref{artifactfiltering2}. The \Rcode{snp.blacklist} is only necessary if
neither a matched normal nor a large pool of normals is available. We set
\Rcode{min.total.counts} to a very low value to make this toy example use all
simulated intervals. 

The \Rcode{post.optimize} flag will increase the runtime by about a factor of
2-5, but might return slightly more accurate purity estimates. For high quality
whole-exome data, this is typically not necessary for copy number calling (but
might be for variant classification, see Section~\ref{predictsomatic}). For
smaller targeted panels, the runtime increase is typically marginal and
\Rcode{post.optimize} should always be set to \Rcode{TRUE}.

The \Rcode{plot.cnv} argument allows the segmentation function to generate
additional plots if set to \Rcode{TRUE}. Finally, \Rcode{verbose} outputs
important and helpful information about all the steps performed and is therefore
set to \Rcode{TRUE} by default.

\section{Output}

\subsection{Plots}

We now create a few output files:

<<createoutput>>=
file.rds <- "Sample1_PureCN.rds"
saveRDS(ret, file = file.rds)
pdf("Sample1_PureCN.pdf", width = 10, height = 11)
plotAbs(ret, type = "all")
dev.off()
@

The RDS file now contains the serialized return object of the
\Rfunction{runAbsoluteCN} call. The PDF contains helpful plots for all local
minima, sorted by likelihood.  The first plot in the generated PDF is displayed
in Figure~\ref{figure/figureexample1-1} and shows the purity and ploidy local
optima, sorted by final likelihood score after fitting both copy number and
allelic fractions.

<<figureexample1, fig.show='hide', fig.width=6, fig.height=6>>=
plotAbs(ret, type = "overview")
@
\incfig{figure/figureexample1-1}{0.75\textwidth}{Overview.}
{The colors visualize the copy number fitting score from low (blue) to high
(red). The numbers indicate the ranks of the local optima. Yellow fonts
indicate that the corresponding solutions were flagged, which does not
necessarily mean the solutions are wrong. The correct solution (number 1) of
this toy example was flagged due to the large amount of LOH.}

We now look at the main plots of the maximum likelihood solution in more detail.

<<figureexample2, fig.show='hide', fig.width=6, fig.height=6>>=
plotAbs(ret, 1, type = "hist")
@
\incfig{figure/figureexample2-1}{0.75\textwidth}{Log-ratio histogram.}

Figure~\ref{figure/figureexample2-1} displays a histogram of tumor vs. normal
copy number log2-ratios for the maximum likelihood solution (number 1 in
Figure~\ref{figure/figureexample1-1}). The height of a bar in this plot is
proportional to the fraction of the genome falling into the particular
log2-ratio copy number range. The vertical dotted lines and numbers visualize
the expected log2-ratios for all integer copy numbers from 0 to 7, given the
purity/ploidy combination. It can be seen that most of the log2-ratios of
the maximum likelihood solution align well with the expected values for copy
numbers of 0, 1, 2 and 4.

<<figureexample3, fig.show='hide', fig.width=8, fig.height=8>>=
plotAbs(ret, 1, type = "BAF")
@
\incfig{figure/figureexample3-1}{0.95\textwidth}{B-allele frequency plot.}{Each
dot is a (predicted) germline SNP. The first panel shows the allelic fractions
as provided in the VCF file.  The alternating grey and white background colors
visualize odd and even chromosome numbers, respectively. The black lines
visualize the expected (not the average!) allelic fractions in the segment.
These are calculated using the estimated purity and the total and minor segment
copy numbers. These are visualized in black and grey, respectively, in the
second and third panel.  The second panel shows the copy number log2-ratios, the
third panel the integer copy numbers.}

Germline variant data are informative for calculating integer copy number
because unbalanced maternal and paternal chromosome numbers in the tumor
portion of the sample lead to unbalanced germline allelic fractions.
Figure~\ref{figure/figureexample3-1} shows the allelic fractions of predicted
germline SNPs. The goodness of fit (GoF) is provided on an arbitrary scale in
which 100\% corresponds to a perfect fit and 0\% to the worst possible fit.
The latter is defined as a fit in which allelic fractions on average differ by
0.2 from their expected fractions. Note that this does not take purity into
account and low purity samples are expected to have a better fit. In the
middle panel, the corresponding copy number log2-ratios are shown. The lower
panel displays the calculated integer copy numbers, corrected for purity and
ploidy. We can zoom into particular chromosomes 
(Figure~\ref{figure/figureexample3b-1}).

<<figureexample3b, fig.show='hide', fig.width=9, fig.height=8>>=
plotAbs(ret, 1, type = "BAF", chr = "chr19")
@
\incfig{figure/figureexample3b-1}{0.95\textwidth}{Chromosome plot.}{Similar
    to Figure~\ref{figure/figureexample3-1}, but zoomed into a particular 
    chromosome. The grey dots in the middle panel visualize copy number 
    log2-ratios of targets without heterozygous SNPs, which are omitted from the 
    previous genome-wide plot. The x-axis now indicates genomic coordinates in
    kbps.}

<<figureexample4, fig.show='hide', fig.width=8, fig.height=8>>=
plotAbs(ret, 1, type = "AF")
@
\incfig{figure/figureexample4-1}{0.95\textwidth}{Allele fraction plots.}{Each
dot is again a (predicted) germline SNP. The shapes visualize the different SNV
groups based on prior and posterior probabilities.  The labels show the
expected values for all called states; 2m1 would be diploid, heterozygous, 2m2
diploid, homozygous.  The relationship of allelic fraction and coverage is
shown in the top right panel.  This plot normally also shows somatic mutations
in two additional panels, with the left panel showing the same plot as for
germline SNPs and the bottom right panel a histogram of cellular fraction of
predicted somatic mutations. This toy example contains only germline SNPs
however.}

Finally, Figure~\ref{figure/figureexample4-1} provides more insight into how
well the variants fit the expected values.
\clearpage

\subsection{Data structures}

The R data file (\Rcode{file.rds}) contains gene-level copy number calls,
SNV status and LOH calls.  The purity/ploidy combinations are sorted by
likelihood and stored in \Rcode{ret\$results}.

<<output1>>=
names(ret)
@

We provide convenient functions to extract information from this data structure
and show their usage in the next sections. We recommend using these functions
instead of accessing the data directly since data structures might change in
future versions. 

\subsubsection{Prediction of somatic status and cellular fraction}
\label{predictsomatic}
To understand allelic fractions of particular SNVs, we must know (i) the
somatic status, (ii) the tumor purity, (iii) the local copy number, as well as
(iv) the number of chromosomes harboring the mutation or SNP. One of
\Biocpkg{PureCN}'s main functions is to find the most likely combination of
these four values. We further assign posterior probabilities to all possible
combinations or states. Availability of matched normals reduces the search
space by already providing somatic status.

The \Rfunction{predictSomatic} function provides access to these probabilities.
For predicted somatic mutations, this function also provides cellular fraction
estimates \cite{McGranahan2015}, i.e. the fraction of tumor cells carrying the
mutation. Fractions significantly below 1 indicate sub-clonality:

<<output3>>=
head(predictSomatic(ret), 3)
@

The output columns are explained in Table~\ref{tbl:predictSomatic}.

To annotate the input VCF file with these values:
<<output4>>=
vcf <- predictSomatic(ret, return.vcf = TRUE)
writeVcf(vcf, file = "Sample1_PureCN.vcf") 
@

For optimal classification results:
\begin{itemize}
    \item Set \Rcode{post.optimize = TRUE} since small inaccuracies in purity can
decrease the classification performance significantly
    \item Provide \Rcode{args.setMappingBiasVcf} a pool-of-normals VCF to
        obtain position-specific mapping bias information
    \item Exclude variants in regions of low mappability
    \item Use a somatic posterior probability cutoff of 0.8 and 0.2 for somatic
    and germline variants, respectively (see the sketch after this list). This
    appears to be a good compromise of call rate and accuracy. If the
    beta-binomial model was selected in the \Rcode{model} argument of
    \Rfunction{runAbsoluteCN} with a small panel of normals, these cutoffs
    might need to be relaxed to get acceptable call rates.
    \item Add a \Rcode{Cosmic.CNT} info field to the VCF or provide a COSMIC VCF
        in \Rfunction{runAbsoluteCN} (see Section~\ref{seccosmic}). 
\end{itemize}        
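
A minimal sketch applying the recommended posterior probability cutoffs to the
\Rfunction{predictSomatic} output:

<<posteriorcutoffs, eval=FALSE>>=
# Sketch: classify variants using the recommended cutoffs
calls <- predictSomatic(ret)
somatic <- calls[calls$POSTERIOR.SOMATIC >= 0.8, ]
germline <- calls[calls$POSTERIOR.SOMATIC <= 0.2, ]
@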

\textbf{Note that the posterior probabilities assume that the purity and ploidy
combination is correct. Before classifying variants, it is thus recommended to
manually curate flagged samples.}

\begin{table*}
\caption{\Rfunction{predictSomatic} output columns.}
\begin{tabular}{lp{9.5cm}}
\toprule
Column name       & Description \\
\midrule
\Rcode{chr, start, end}   & Variant coordinates \\
\Rcode{ID}   & The variant ID as provided in the VCF  \\
\Rcode{REF, ALT}   & The reference and alt alleles  \\
\Rcode{SOMATIC.M*} & Posterior probabilities for all somatic states.  M0 to M7 
are multiplicity values, i.e. the number of chromosomes harboring the 
mutation (e.g.  1 heterozygous, 2 homozygous if copy number C is 2). SOMATIC.M0
represents a sub-clonal state (somatic mutations by definition have a 
multiplicity larger than 0). \\
\Rcode{GERMLINE.M*} & Posterior probabilities for all heterozygous germline 
states \\
\Rcode{GERMLINE.CONTHIGH} & Posterior probability for contamination.  This 
state corresponds to homozygous germline SNPs that were not filtered out 
because reference alleles from another individual were sequenced, resulting in 
allelic fractions smaller than 1. \\
\Rcode{GERMLINE.CONTLOW} & Posterior probability for contamination. This state
corresponds to non-reference alleles only present in the contamination. \\
\Rcode{GERMLINE.HOMOZYGOUS} & Posterior probability that SNP is homozygous in 
normal. Requires the \Rcode{model.homozygous} option in 
\Rfunction{runAbsoluteCN}. See Section~\ref{sec:celllines}. \\
\Rcode{ML.SOMATIC} & \Rcode{TRUE} if the maximum likelihood state is a somatic
    state \\
\Rcode{POSTERIOR.SOMATIC} & The posterior probability that the variant is 
somatic (sum of all somatic state posterior probabilities) \\
\Rcode{ML.M}  & Maximum likelihood multiplicity \\
\Rcode{ML.C}  & Maximum likelihood integer copy number \\
\Rcode{ML.M.SEGMENT} & Maximum likelihood minor segment copy number \\
\Rcode{M.SEGMENT.POSTERIOR} & Posterior probability of \Rcode{ML.M.SEGMENT} \\
\Rcode{M.SEGMENT.FLAGGED} & Segment flag indicating \Rcode{ML.M.SEGMENT} is
unreliable, either due to low posterior probability (< 0.5) or few variants (<
\Rcode{min.variants.segment}). Indels are always flagged. \\
\Rcode{ML.AR}  & Expected allelic fraction of the maximum likelihood state \\
\Rcode{AR}   & Observed allelic fraction (as provided in VCF) \\
\Rcode{AR.ADJUSTED} & Observed allelic fraction adjusted for mapping bias\\
\Rcode{ML.LOH} & \Rcode{TRUE} if variant is most likely in LOH \\
\Rcode{CN.SUBCLONAL} & \Rcode{TRUE} if copy number segment is sub-clonal \\
\Rcode{CELLFRACTION} & Fraction of tumor cells harboring the somatic mutation
    \cite{McGranahan2015} \\
\Rcode{CELLFRACTION.95.LOWER} & Lower 95\% of confidence interval \\
\Rcode{CELLFRACTION.95.UPPER} & Upper 95\% of confidence interval \\
\Rcode{ALLELIC.IMBALANCE} & Log-likelihood that variant is in allelic balance.
Requires position-specific mapping bias information to be reliable. \\
\Rcode{FLAGGED} & Flag indicating that call is unreliable 
    (currently only due to high mapping bias and high pool of normal counts)\\
\Rcode{log.ratio} & The copy number log2-ratio (tumor vs. normal) for this 
    variant \\
\Rcode{depth} & The total sequencing depth at this position \\
\Rcode{prior.somatic} & Prior probability of variant being somatic \\
\Rcode{prior.contamination} & Prior probability that variant is contamination 
    from another individual \\
\Rcode{on.target} & 1 for variants within intervals, 2 for variants in flanking
 regions, 0 for off-target variants \\
\Rcode{seg.id}       & Segment id  \\
\Rcode{pon.count}  & Number of hits in the \Rcode{mapping.bias.file} \\
\Rcode{gene.symbol} & Gene symbol if available \\ 
\bottomrule
\end{tabular}
\label{tbl:predictSomatic}
\end{table*}

\subsubsection{Amplifications and deletions}
\label{callalterations}

To call amplifications, we recommend using a cutoff of 6 for focal
amplifications and a cutoff of 7 otherwise. For homozygous deletions, a cutoff
of 0.5 is useful to allow some heterogeneity in copy number.

For samples that failed \Biocpkg{PureCN} calling, we recommend using
common log2-ratio cutoffs to call amplifications, for example 0.9.

This strategy is implemented in the \Rfunction{callAlterations} function:

<<calling2>>=
gene.calls <- callAlterations(ret)
head(gene.calls)
@

It is also often useful to filter the list further by known biology, for
example to exclude non-focal amplifications of tumor suppressor genes. The
Sanger Cancer Gene Census \cite{Futreal2004} provides such a list.
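
For example, a sketch restricting the calls to a hypothetical list of genes of
interest (gene symbols are provided as row names):

<<filtergenecalls, eval=FALSE>>=
# Sketch: filter by a hypothetical list of genes of interest
genes.of.interest <- c("EGFR", "ERBB2", "MYC", "PTEN", "RB1", "TP53")
gene.calls[rownames(gene.calls) %in% genes.of.interest, ]
@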

The output columns of \Rfunction{callAlterations} are explained in
Table~\ref{tbl:callAlterations}.

\begin{table*}
\caption{\Rfunction{callAlterations} output columns.}
\begin{tabular}{lp{9.5cm}}
\toprule
Column name       & Description \\
\midrule
\Rcode{chr, start, end} & Gene coordinates \\
\Rcode{C}            & Segment integer copy number. See also \Rcode{seg.id}. \\
\Rcode{C.flagged}    & Flagged when segment interval weights are low.  Requires
\Rcode{normalDB}. In case multiple segments overlap, \Rcode{C.flagged} will be
\Rcode{TRUE} when any segment is flagged. \\
\Rcode{seg.mean}     & Segment mean of copy number log2-ratios 
(not adjusted for purity/ploidy). See also \Rcode{seg.id}.  \\
\Rcode{seg.id}       & Segment id. In case multiple segments overlap,
\Rcode{seg.id} will point to the smallest overlapping segment and the column
\Rcode{breakpoints} will be non-zero.  \\
\Rcode{number.targets} & Number of targets annotated with this symbol \\
\Rcode{gene.*}       & Gene copy number log2-ratios  
(not adjusted for purity/ploidy). \Rcode{gene.mean} is weighted
by interval weights when these are available.  \\
\Rcode{focal}       & \Rcode{TRUE} for focal amplification, as defined by the 
\Rfunction{fun.focal} argument in \Rfunction{runAbsoluteCN} \\
\Rcode{breakpoints} & Number of breakpoints between \Rcode{start} and 
\Rcode{end} \\
\Rcode{type}        & Amplification or deletion \\
\Rcode{num.snps} & Number of SNPs in this segment informative for 
    LOH detection (requires VCF)\\
\Rcode{M} & minor copy number of segment (requires VCF) \\
\Rcode{M.flagged} & flag indicating that \Rcode{M} is unreliable (requires VCF) \\
\Rcode{loh}         & \Rcode{TRUE} if segment is in LOH, \Rcode{FALSE} if not 
    and \Rcode{NA} if number of SNPs is insufficient (requires VCF) \\
\bottomrule
\end{tabular}
\label{tbl:callAlterations}
\end{table*}

\subsubsection{Amplifications in low purity samples}

The default \Rfunction{callAlterations} is not suited for samples of very low
purity (below 10\%), as frequently observed in cfDNA samples. To call high
level amplifications in samples with purities between 4 and 10\%, we recommend
a simple z-score approach implemented in
\Rfunction{callAmplificationsInLowPurity}. This will calculate gene-level log2
copy number ratio standard deviations in all normal samples, which are then
used to obtain gene-level cutoffs for amplifications in tumor samples. The
output of this function is nearly identical to \Rfunction{callAlterations},
with additional columns explained in
Table~\ref{tbl:callamplificationsinlowpurity}.

<<calling3>>=
gene.calls.zscore <- callAmplificationsInLowPurity(ret, normalDB)
head(gene.calls.zscore)
# filter to known amplifications 
known.calls.zscore <- gene.calls.zscore[gene.calls.zscore$gene.symbol %in%
    c("AR", "BRAF", "CCND1", "CCNE1", "CDK4", "CDK6", "EGFR", "ERBB2", "FGFR1",
      "FGFR2", "KIT", "KRAS", "MCL1", "MDM2", "MDM4", "MET", "MYC", "PDGFRA",
      "PIK3CA", "RAF1"), ]
@

Since this algorithm is intended to rescue known high-level amplifications at
very low tumor purities, we recommend using a liberal p-value cutoff, but
restricting the search to a limited set of genes.
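
For example, a sketch applying a liberal p-value cutoff to the restricted gene
list from above:

<<filterzscore, eval=FALSE>>=
# Sketch: liberal p-value cutoff on the restricted gene list
known.calls.zscore[which(known.calls.zscore$p.value < 0.05), ]
@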

\begin{table*}
\caption{\Rfunction{callAmplificationsInLowPurity} output columns not already
explained in Table~\ref{tbl:callAlterations}.}
\begin{tabular}{lp{9.5cm}}
\toprule
Column name        & Description \\
\midrule
\Rcode{p.value}     & P-value of observing such a gene-level log2-ratio 
in normal samples \\
\Rcode{percentile.genome} & Percentile of gene-level log2-ratio in the tumor sample.
Useful for distinguishing focal amplifications from broad gains. \\
\Rcode{percentile.chromosome} & Same as \Rcode{percentile.genome}, but for the
chromosome \\
\bottomrule
\end{tabular}
\label{tbl:callamplificationsinlowpurity}
\end{table*}

\subsubsection{Find genomic regions in LOH}

The \Rcode{gene.calls} data.frame described above provides gene-level LOH
information. To find the corresponding genomic regions in LOH, we can use the
\Rfunction{callLOH} function:

<<loh>>=
loh <- callLOH(ret)
head(loh)
@

The output columns are explained in Table~\ref{tbl:callLOH}.

\begin{table*}
\caption{\Rfunction{callLOH} output columns.}
\begin{tabular}{lp{9.5cm}}
\toprule
Column name & Description \\
\midrule
\Rcode{chr, start, end} & Segment coordinates \\
\Rcode{arm}         & Chromosome arm \\
\Rcode{C}           & Segment integer copy number \\
\Rcode{M}           & Minor integer copy number ($M+N=C,M\leq N$) \\
\Rcode{type}        & LOH type if $M=0$ \\
\Rcode{seg.mean}    & Segment mean of copy number log2-ratios 
(not adjusted for purity/ploidy)  \\
\Rcode{num.mark} & Number of intervals in this segment (following DNAcopy
    naming convention) \\
\Rcode{num.snps} & Number of SNPs in this segment informative for 
    LOH detection \\
\Rcode{M.flagged} & flag indicating that \Rcode{M} is unreliable (due to small
    number of variants defined by \Rcode{min.variants.segment} or because 
    of ambiguous \Rcode{M} call) \\
\Rcode{maf.expected} & Expected minor allele frequency of heterozygous SNPs
    in this segment \\
\Rcode{maf.observed} & Median of observed minor allele frequency of heterozygous
    SNPs in this segment \\
\bottomrule
\end{tabular}
\label{tbl:callLOH}
\end{table*}

\section{Curation}

For prediction of variant status (germline vs. somatic, sub-clonal vs. clonal,
homozygous vs. heterozygous), it is important that both purity and ploidy are
correct. We provide functionality for curating results:

<<curationfile>>=
createCurationFile(file.rds) 
@

This will generate a CSV file in which the correct purity and ploidy values
can be manually entered. It also contains a column ``Curated'', which should be
set to \Rcode{TRUE}; otherwise the file will be overwritten when re-run.

Then in R, the correct solution (closest to the combination in the CSV file)
can be loaded with the \Rfunction{readCurationFile} function:

<<readcurationfile>>=
ret <- readCurationFile(file.rds)
@

This function has various handy features, but most importantly it will re-order
the local optima so that the curated purity and ploidy combination is ranked
first. This means that \Rcode{plotAbs(ret, 1, type = "hist")} would show the
plot for the curated purity/ploidy combination, for example.

The default curation file will list the maximum likelihood solution:

<<curationfileshow>>=
read.csv("Sample1_PureCN.csv")
@

\Biocpkg{PureCN} currently only flags samples with warnings; it does not mark
any samples as failed. The \Rcode{Failed} column in the curation file can be
used to manually flag samples for exclusion in downstream analyses. See
Table~\ref{tbl:flags} for an explanation of all flags.

\begin{table*}
\caption{\Rfunction{createCurationFile} flags.}
\begin{tabular}{lp{6cm}}
\toprule
Flag & Description \\
\midrule
EXCESSIVE LOH & $> 50$\% of genome in LOH and ploidy $< 2.6$ \\
EXCESSIVE LOSSES & $\geq 1$\% of genome deleted \\
HIGH AT- OR GC-DROPOUT & High GC-bias exceeding cutoff in \Rcode{max.dropout} \\
HIGH PURITY & (when \Rcode{model.homozygous=FALSE}). For very high purity
samples, it is recommended to set \Rcode{model.homozygous=TRUE}. See
Section~\ref{sec:celllines}. \\ 
LOW PURITY & Purity $< 30$\% \\
LOW BOOTSTRAP VALUE & \Rfunction{bootstrapResults} identified multiple plausible solutions \\
NOISY LOG-RATIO & Log-ratio standard deviation $>$ \Rcode{max.logr.sdev} \\ 
NOISY SEGMENTATION & More than \Rcode{max.segments} \\ 
NON-ABERRANT & $\geq 99 $\% of genome has identical copy number and $\geq 0.5$\% has second most common state \\
POLYGENOMIC & $\geq 0.75 \times$ \Rcode{max.non.clonal} fraction of the genome in sub-clonal state \\
POOR GOF & GoF $<$ \Rcode{min.gof} \\
POTENTIAL SAMPLE CONTAMINATION & Significant portion of variants present in germline databases are potentially cross-contaminated \\
RARE KARYOTYPE & Ploidy $< 1.5$ or  $> 4.5$  \\
\bottomrule
\end{tabular}
\label{tbl:flags}
\end{table*}

\section{Cell lines}
\label{sec:celllines}

Default parameters assume some normal contamination. In 100\% pure samples
without matched normal samples, such as cell lines, we cannot distinguish
homozygous SNPs from LOH by looking at single allelic fractions alone. It is
thus necessary to keep homozygous variants and include this uncertainty in the
likelihood model. This is done by setting the \Rfunction{runAbsoluteCN} argument
\Rcode{model.homozygous=TRUE}. If matched normals are available, then variants
homozygous in normal are automatically removed since they are uninformative.

For technical reasons, the maximum purity \Biocpkg{PureCN} currently models is
0.99. We recommend setting \Rcode{test.purity=seq(0.9,0.99,by=0.01)} in
\Rfunction{runAbsoluteCN} for cell lines. 

Please note that in order to detect homozygous deletions in 100\% pure samples,
you will need to provide a \Rcode{normalDB} in \Rfunction{runAbsoluteCN} to
filter low quality targets efficiently (Section~\ref{secrecommended}).
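
Putting these recommendations together, a \Rfunction{runAbsoluteCN} call for a
cell line might look like the following sketch:

<<celllinesketch, eval=FALSE>>=
# Sketch: recommended settings for a 100% pure cell line
retCellLine <- runAbsoluteCN(normal.coverage.file = pool,
    tumor.coverage.file = tumor.coverage.file, vcf.file = vcf.file,
    genome = "hg19", interval.file = interval.file, normalDB = normalDB,
    model.homozygous = TRUE, test.purity = seq(0.9, 0.99, by = 0.01))
@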

\section{Maximizing the number of heterozygous SNPs}
\label{secmaxsnps}

It is possible to use SNPs in off-target reads in the variant fitting step by
running \software{MuTect} without an interval file and then setting the
\Rfunction{filterVcfBasic} argument \Rcode{remove.off.target.snvs} to
\Rcode{FALSE}. We recommend a large pool of normals in this case and then
generating SNP blacklists as described in Sections~\ref{artifactfiltering} and
\ref{artifactfiltering2}. Remember to also run all the normals in
\software{MuTect} without an interval file.
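
A sketch of the corresponding \Rfunction{runAbsoluteCN} call:

<<offtargetsnvs, eval=FALSE>>=
# Sketch: keep off-target SNVs in the variant fitting
# (assumes MuTect was run without an interval file)
retOffTarget <- runAbsoluteCN(normal.coverage.file = pool,
    tumor.coverage.file = tumor.coverage.file, vcf.file = vcf.file,
    genome = "hg19", interval.file = interval.file, normalDB = normalDB,
    args.filterVcf = list(remove.off.target.snvs = FALSE))
@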

An often better alternative to including all off-target reads is to include
only variants in the flanking regions of targets (between 50-100bp). This will
usually significantly increase the number of heterozygous SNPs (see
Section~\ref{faqinput}). These SNPs are automatically added if the variant
caller was run without an interval file or with interval padding.

\section{Advanced usage}
\subsection{Custom normalization and segmentation}
\label{customseg}
Copy number normalization and segmentation are crucial for obtaining good
purity and ploidy estimates. If you have a well-tested pipeline that produces
clean results for your data, you might want to use \Biocpkg{PureCN} as an
add-on to your pipeline. By default, we will use \CRANpkg{DNAcopy}
\cite{Venkatraman2007} to segment normalized target-level coverage log2-ratios.
It is straightforward to replace the default with other methods and the
\Rfunction{segmentationCBS} function can serve as an example.

The next section describes how to replace the default segmentation. For the
presumably less common case in which only the coverage normalization is
performed by third-party tools, see Section~\ref{customnorm}.

\subsubsection{Custom segmentation}
It is possible to provide already segmented data, which is especially
recommended when third-party segmentation tools are not written in R. Otherwise
it is usually better to customize the default segmentation function,
since the algorithm then has access to the raw log2-ratio
distribution\footnote{If the third-party tool provides target-level
log2-ratios, then these can be provided via the \Rcode{log.ratio} argument in
addition to \Rcode{seg.file} though. See also Section~\ref{customnorm}.}. The
expected file format for already segmented copy number data parsed by
\Rfunction{readSegmentationFile} is\footnote{This segmentation file can contain
multiple samples, in which case the provided \Rcode{sampleid} must match a
sample in the column \Rcode{ID}}:

\begin{verbatim}
    ID  chrom   loc.start   loc.end num.mark    seg.mean
    Sample1   1   61723   5773942 2681    0.125406444072723
    Sample1   1   5774674 5785170 10  -0.756511807441712
\end{verbatim}
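
A sketch parsing such a file with \Rfunction{readSegmentationFile}:

<<readsegsketch, eval=FALSE>>=
# Sketch: parse the segmentation file for the specified sample
seg <- readSegmentationFile(seg.file, sampleid = "Sample1")
@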

Since its likelihood model is exon-based, \Biocpkg{PureCN} currently still
requires an interval file to generate simulated target-level log2-ratios from a
segmentation file.  For simplicity, this interval file is expected either via
the \Rcode{tumor.coverage.file} or via the \Rcode{interval.file} argument (see
Figure~\ref{figure/figurecustombaf-1}). Note that \Biocpkg{PureCN} will
re-segment the simulated log2-ratios using the default
\Rfunction{segmentationCBS} function, in particular to identify regions of
copy-number neutral LOH and to cluster segments with similar allelic imbalance
and log2-ratio. The provided interval file should therefore cover all
significant copy number alterations\footnote{If this behavior is not wanted,
for example because the custom function already identifies CNNLOH reliably,
\Rfunction{segmentationCBS} can be replaced with a minimal version.}. Please
check that the log2-ratios are similar to the ones obtained by the default
\Biocpkg{PureCN} segmentation and normalization.

<<customseg>>=
retSegmented <- runAbsoluteCN(seg.file = seg.file, 
    interval.file = interval.file, vcf.file = vcf.file, 
    max.candidate.solutions = 1, genome = "hg19",
    test.purity = seq(0.3,0.7,by = 0.05), verbose = FALSE, 
    plot.cnv = FALSE)
@

The \Rcode{max.candidate.solutions} and \Rcode{test.purity} arguments are set
to non-default values to reduce the runtime of this vignette. 

<<figurecustombaf, fig.show='hide', fig.width=8, fig.height=8>>=
plotAbs(retSegmented, 1, type = "BAF")
@
\incfig{figure/figurecustombaf-1}{0.95\textwidth}{B-allele frequency plot for
segmented data.}{This plot shows the maximum likelihood solution for an example
where segmented data are provided instead of coverage data. Note that the middle
panel shows no variance in log2-ratios, since only segment-level log2-ratios are
available.}
  
\subsubsection{Custom normalization}
\label{customnorm}
If third-party tools such as \software{GATK4} are used to calculate
target-level copy number log2-ratios, and \Biocpkg{PureCN} should be used for
segmentation and purity/ploidy inference only, it is possible to provide these
log2-ratios: 

<<customlogratio, message=FALSE>>=
# We still use the log2-ratio exactly as normalized by PureCN for this
# example
log.ratio <- calculateLogRatio(readCoverageFile(normal.coverage.file),
    readCoverageFile(tumor.coverage.file))

retLogRatio <- runAbsoluteCN(log.ratio = log.ratio,
    interval.file = interval.file, vcf.file = vcf.file, 
    max.candidate.solutions = 1, genome = "hg19",
    test.purity = seq(0.3, 0.7, by = 0.05), verbose = FALSE, 
    normalDB = normalDB, plot.cnv = FALSE)
@

Again, the \Rcode{max.candidate.solutions} and \Rcode{test.purity} arguments
are set to non-default values to reduce the runtime of this vignette. 
Note that this example uses a pool of normals to filter
low quality targets. Interval coordinates are again expected in either an
\Rcode{interval.file} or a \Rcode{tumor.coverage.file}.

The \Rfunction{readLogRatioFile} function provides support for parsing
\software{GATK3}-style and \software{GATK4 DenoiseReadCounts} formats:

\begin{verbatim}
# GATK3-style
Target log_ratio
1:3682750-3683489 0.109473
1:3690585-3691264 0.126205

# GATK4
@HD	VN:1.6
@SQ	SN:1	LN:248956422
@SQ	SN:2	LN:242193529
...
CONTIG	START	END	LOG2_COPY_RATIO
1	3682750	3683489	0.109473
1	3690585	3691264	0.126205
\end{verbatim}
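
A sketch parsing such a file (the file name is hypothetical):

<<readlogratiosketch, eval=FALSE>>=
# Sketch: parse third-party log2-ratios
log.ratio <- readLogRatioFile("Sample1_denoised_copy_ratios.tsv")
@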

It is highly recommended to compare the log2-ratios obtained by
\Biocpkg{PureCN} and the third-party tool, since some pipelines automatically
adjust log2-ratios for a default purity value. 

\subsection{COSMIC annotation}
\label{seccosmic}
If a matched normal is not available, it is also helpful to provide
\Rfunction{runAbsoluteCN} the COSMIC database \cite{Forbes2015} via
\Rcode{cosmic.vcf.file} (or via a \Rcode{Cosmic.CNT} INFO field in the VCF).
While this has limited effect on purity and ploidy estimation due to the
sparsity of hotspot mutations, it often helps in the manual curation to compare
how well high confidence germline (based on dbSNP membership or POP\_AF) vs.
somatic (COSMIC) variants fit a particular purity/ploidy combination.

For variant classification (Section~\ref{predictsomatic}), providing COSMIC
annotation also prevents hotspot mutations present in germline databases from
being assigned a very low prior probability of being somatic.
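
A sketch providing a COSMIC VCF (the file name is hypothetical and must match
the genome version):

<<cosmicsketch, eval=FALSE>>=
# Sketch: provide COSMIC annotation via a VCF
retCosmic <- runAbsoluteCN(normal.coverage.file = pool,
    tumor.coverage.file = tumor.coverage.file, vcf.file = vcf.file,
    genome = "hg19", interval.file = interval.file, normalDB = normalDB,
    cosmic.vcf.file = "CosmicCodingMuts_hg19.vcf.gz")
@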

\subsection{ExAC and gnomAD annotation}

\Biocpkg{PureCN} does not automatically annotate input VCFs with data from
common germline databases such as ExAC. See Section~\ref{secinputvcf} for ways
to tell \Biocpkg{PureCN} where to find either a summary binary flag (i.e. likely
germline yes/no) or population allele frequencies.

\subsection{Mutation burden}

The \Rfunction{predictSomatic} function described in
Section~\ref{predictsomatic} can be used to efficiently remove private germline
mutations. This in turn allows the calculation of mutation burden for unmatched
tumor samples. A wrapper function for this specific task is included as
\Rfunction{callMutationBurden}:

<<callmutationburden>>=
callableBed <- import(system.file("extdata", "example_callable.bed.gz", 
    package = "PureCN"))

callMutationBurden(ret, callable = callableBed)
@

The \Rcode{callableBed} file should be a file parsable by
\Biocpkg{rtracklayer}. This file can specify genomic regions that are callable,
for example as obtained by \software{GATK3 CallableLoci}. This is optional, but
if provided can be used to accurately calculate mutation rates per
megabase. Variants outside the callable regions are not counted. Private
germline rates should be fairly constant across samples; outliers here should be
manually inspected.

The output columns are explained in Table~\ref{tbl:callMutationBurden}.

\begin{table*}
\caption{\Rfunction{callMutationBurden} output columns.}
\begin{tabular}{lp{6cm}}
\toprule
Column name & Description \\
\midrule
\Rcode{somatic.ontarget} & Number of mutations in target regions \\
\Rcode{somatic.all}      & Total number of mutations. Depending on input
VCF and \Rfunction{runAbsoluteCN} arguments, this might include calls in 
flanking regions and off-target reads.  \\
\Rcode{private.germline.ontarget} & Number of private germline SNPs in 
targets. For matched tumor/normal samples, this is the number of variants
annotated as somatic, but classified as germline.  \\
\Rcode{private.germline.all} & Total number of private germline SNPs \\
\Rcode{callable.bases.ontarget} & Number of callable on-target bases \\
\Rcode{callable.bases.flanking} & Number of callable on-target and 
flanking bases \\
\Rcode{callable.bases.all} & Total number of callable bases. With
default parameters includes off-target regions that were ignored by 
\Rfunction{runAbsoluteCN}. \\
\Rcode{somatic.rate.ontarget} & Somatic mutations per megabase in target regions \\
\Rcode{somatic.rate.ontarget.95.lower} & Lower 95\% of confidence interval \\
\Rcode{somatic.rate.ontarget.95.upper} & Upper 95\% of confidence interval \\
\Rcode{private.germline.rate.ontarget} & Private germline mutations per megabase in target regions \\
\Rcode{private.germline.rate.ontarget.95.lower} & Lower 95\% of confidence interval \\
\Rcode{private.germline.rate.ontarget.95.upper} & Upper 95\% of confidence interval \\
\bottomrule
\end{tabular}
\label{tbl:callMutationBurden}
\end{table*}

\subsection{Chromosomal Instability}

Chromosomal Instability (CIN) is usually defined as the fraction of the genome
that is altered. The \Rfunction{callCIN} function can be used to estimate this
fraction. 

Two parameters define which regions count as altered. First,
\Rcode{allele.specific} defines whether a state is defined by total copy number
only or by both minor and major copy number. Copy number neutral LOH counts as
altered only when this parameter is set to \Rcode{TRUE}. Second,
\Rcode{reference.state} defines the unaltered copy number state. This can be
either \Rcode{normal} for \Rcode{2/1}, or \Rcode{dominant} for the most common
state. While technically potentially wrong, the latter is robust to errors in
ploidy and is thus recommended in the absence of careful manual curation.
Similarly, setting \Rcode{allele.specific} to \Rcode{FALSE} makes this metric
more robust to purity and ploidy errors, but usually to a much lesser extent.

It is recommended to test for a relationship of tumor purity and CIN and
if necessary exclude low purity samples with uncertain CIN.

<<callcin>>=
# All 4 possible ways to calculate fraction of genome altered
cin <- data.frame(
    cin = callCIN(ret, allele.specific = FALSE, reference.state = "normal"),
    cin.allele.specific = callCIN(ret, reference.state = "normal"),
    cin.ploidy.robust = callCIN(ret, allele.specific = FALSE),
    cin.allele.specific.ploidy.robust = callCIN(ret)
)
@

\subsection{Detect cross-sample contamination}
\label{seccontamination}
It is important to correctly handle heterozygous common SNPs that do not have an
expected allelic fraction of 0.5 in normal samples. These can be SNPs in poor
quality regions (as already described, see
Section~\ref{artifactfilteringvcf}), but also SNPs from cross-sample
contaminated DNA. Without matched normals, detection of those problematic SNPs
is not trivial. 

For cross-sample contamination, \Biocpkg{PureCN} by default always tests for a
1\% contamination and assigns common SNPs to a contamination state when allelic
fractions are either close to 0 or close to 1 and when this cannot be explained
by CNAs. The main purpose of these states is to provide a bin for common SNPs
that for artifactual reasons do not fit any other state.

This tool applies a simple heuristic to flag samples for cross-contamination:
Given the coverage and putative contamination rate based on allelic fractions of
potentially contaminated SNPs, how many SNPs do we expect to detect based on our
power to detect variants at that contamination rate? If the expected number is
much higher than observed, then significant contamination is unlikely; 
observed SNPs close to 0 or 1 are more likely artifacts or the contamination
rate is much lower than the minimum tested. Otherwise \Biocpkg{PureCN} will
perform a post-optimization in which contamination rate is optimized in
additional variant fitting steps.
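For intuition, this power calculation can be sketched with
\Rfunction{calculatePowerDetectSomatic} (see the following subsection) by
treating the minimum tested contamination rate of 1\% as the expected allelic
fraction; the coverage value below is an arbitrary example and the chunk is
not evaluated:

<<contaminationpower, eval=FALSE>>=
# Power to detect a contaminating SNP at allelic fraction 0.01 and 300X
# coverage; if this power is high but few such SNPs are observed,
# significant contamination is unlikely
calculatePowerDetectSomatic(coverage = 300, f = 0.01,
    verbose = FALSE)$power
@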

Cross-sample contamination can also result in increased observed heterozygosity
on chrX for males, which in turn often results in a \Biocpkg{PureCN} warning
that sex inferred from coverage and VCF are in conflict.

By default, cross-contamination is tested in the range from 1 to 7.5\%.
Catastrophic failures with higher contamination might not be flagged.

\subsection{Power to detect somatic mutations}

As a final quality control step, we can test whether coverage and tumor
purity are sufficient to detect mono-clonal or even sub-clonal somatic
mutations. We strictly follow the power calculation by Carter et al.
\cite{Carter2012}.

Figure~\ref{figure/power1-1} shows the power to detect mono-clonal
somatic mutations as a function of tumor purity and sequencing coverage
(reproduced from \cite{Carter2012}):

<<power1, fig.show='hide', fig.width=6, fig.height=6>>=
purity <- c(0.1,0.15,0.2,0.25,0.4,0.6,1)
coverage <- seq(5,35,1)
power <- lapply(purity, function(p) sapply(coverage, function(cv) 
    calculatePowerDetectSomatic(coverage = cv, purity = p, ploidy = 2, 
    verbose = FALSE)$power))

# Figure S7b in Carter et al.
plot(coverage, power[[1]], col = 1, xlab = "Sequence coverage", 
    ylab = "Detection power", ylim = c(0, 1), type = "l")

for (i in 2:length(power)) lines(coverage, power[[i]], col = i)
abline(h = 0.8, lty = 2, col = "grey")
legend("bottomright", legend = paste("Purity", purity),
    fill = seq_along(purity))
@
\incfig{figure/power1-1}{0.95\textwidth}{Power to detect mono-clonal somatic
mutations.}{Reproduced from \cite{Carter2012}.}

Figure~\ref{figure/power2-1} then shows the same plot for sub-clonal mutations
present in 20\% of all tumor cells: 

<<power2, fig.show='hide', fig.width=6, fig.height=6>>=
coverage <- seq(5,350,1)
power <- lapply(purity, function(p) sapply(coverage, function(cv) 
    calculatePowerDetectSomatic(coverage = cv, purity = p, ploidy = 2,
    cell.fraction = 0.2, verbose = FALSE)$power))
plot(coverage, power[[1]], col = 1, xlab = "Sequence coverage", 
    ylab="Detection power", ylim = c(0, 1), type = "l")

for (i in 2:length(power)) lines(coverage, power[[i]], col = i)
abline(h = 0.8, lty = 2, col = "grey")
legend("bottomright", legend = paste("Purity", purity),
    fill = seq_along(purity))
@
\incfig{figure/power2-1}{0.95\textwidth}{Power to detect sub-clonal somatic
mutations present in 20\% of all tumor cells.}{Reproduced from \cite{Carter2012}.}

\clearpage

\section{Limitations}
\label{sec:limits}
\Biocpkg{PureCN} currently assumes a completely diploid normal genome. For
human samples, it tries to detect sex by calculating the coverage ratio of
chromosomes X and Y and will then remove sex chromosomes in male 
samples\footnote{Loss of Y chromosome (LOY) can result in wrong female calls,
especially in high purity samples or if LOY is in both tumor and contaminating
normal cells. The software throws a warning in this case when germline SNPs
are provided.}. For non-human samples, the user needs to manually remove all
non-diploid chromosomes from the coverage data and specify
\Rcode{sex="diploid"} in the \Biocpkg{PureCN} call.

While \Biocpkg{PureCN} supports and models sub-clonal somatic copy number
alterations, it currently assumes that the majority of alterations are
mono-clonal. For most clinical samples, this is reasonable, but very
heterogeneous samples likely cannot be called correctly without manual
curation. Poly-genomic tumors are often called as high ploidy or low purity.
The former usually happens when sub-clonal losses are called as 2 copies and
mono-clonal losses correctly as 1 copy; the latter when sub-clonal losses are
called mono-clonal, which only happens when there are far more sub-clonal than
mono-clonal losses. Please note, however, that unless purities are very high,
algorithms that model poly-genomic tumors do not necessarily have a higher
call rate, since they tend to overfit noisy samples or similarly confuse true
high-ploidy with poly-genomic tumors. Due to the lack of signal, manual
curation is also recommended in low purity samples or very quiet genomes.

\section{Support}

If you encounter bugs or have problems running \Biocpkg{PureCN},
please post them at

\begin{itemize}
    \item \url{https://support.bioconductor.org/p/new/post/?tag\_val=PureCN} or
    \item \url{https://github.com/lima1/PureCN/issues}.
\end{itemize}

If \Biocpkg{PureCN} throws user errors, then there is likely a problem with the
input files.  If the error message is not self-explanatory, feel free to seek
help at the support site. 

In your report, please include the output of the \Rfunction{runAbsoluteCN}
call (run with \Rcode{verbose = TRUE}, or the \Rcode{*.log} file when using
\Rcode{PureCN.R}) and of \Rfunction{sessionInfo()}. Please also check that
your problem is not already covered in the following sections.

For general feedback such as suggestions for improvements, feature requests,
complaints, etc. please do not hesitate to send us an email.

\subsection{Checklist}

\begin{itemize}
    \item Used the correct interval files provided by the manufacturer of the
    capture kit, and the genome version of the interval file matches the
    reference. Ideally used the baits file, not the targets file (in Agilent
    data, the baits files are called ``covered'' and the targets ``regions'').
    \item For hybrid capture data, included off-target reads in the coverage
        calculation.
    \item BAM files were generated following established best practices and
    tools finished successfully.
    \item Checked standard QC metrics such as AT/GC dropout and duplication 
    rates.
    \item Tumor and normal data were obtained using the same capture kit and
    pipeline.
    \item Coverage data of tumor and normal were GC-normalized.
    \item The VCF file contains germline variants (i.e. not only somatic calls).
    \item Maximized the number of high coverage heterozygous SNPs, for example
    by running \software{MuTect} with a 50-75bp interval padding
    (Section~\ref{secmaxsnps}).  The \Rfunction{runAbsoluteCN} output lists the
    percentage of targets with variants and this should be around 10-15\%.
    Ultra-deep sequencing data can provide good SNP allelic fractions in the
    100-200bp flanking regions.
    \item If a pool of normal samples is available, followed the steps in
    Section~\ref{artifactfiltering}.
    \item Read the output of \Rfunction{runAbsoluteCN} with
        \Rcode{verbose = TRUE}, fixed all warnings.
    \item If third-party segmentation tools are used, checked that normalized
    log2-ratios are not biased, i.e. very similar compared to \Biocpkg{PureCN}
    log2-ratios (some pipelines already adjust for a default normal 
    contamination).
    \item Followed the best practices described in the Quick vignette.
\end{itemize}

\subsection{FAQ}

\paragraph{If the ploidy is frequently too high, please check:}

\begin{itemize}
    \item Does the log2-ratio histogram (Figure~\ref{figure/figureexample2-1})
look noisy? If yes, then
    \begin{itemize}
    \item Is the coverage sufficient? Tumor coverages below 80X can be
difficult, especially in low purity samples. Normal coverages below 50X might
result in high variance of log2-ratios. See Section~\ref{bestnormal} for finding
a good normal sample for log2-ratio calculation.
    \item Is the coverage data of both tumor and normal GC-normalized? If not,
    see \Rfunction{correctCoverageBias}. It can be safe to skip the
    normalization, with large panels of normals potentially even slightly
    beneficial, but make sure that GC-normalization was either skipped or
    performed in both the tumor sample and the normal coverage database.
    \item Is the quality of both tumor and normal sufficient? A high AT or
GC-dropout might result in high variance of log2-ratios. Challenging FFPE
samples might also need parameter tuning of the segmentation function; see
\Rfunction{segmentationCBS}. A high expected tumor purity allows more
aggressive segmentation parameters, such as \Rcode{prune.hclust.h=0.2} or
higher (see the sketch after this list).
    \item Was the correct target interval file used (genome version and capture
    kit, see Section~\ref{gatkcoverage})? If unsure, ask the help desk of your
    sequencing center.
    \item Were the normal samples run with the same assay and pipeline? 
    \item Did you provide \Rfunction{runAbsoluteCN} all the recommended files
    as described in Section~\ref{secrecommended}?  
    \item For whole-genome data, you will get better results using a specialized
        third-party segmentation method as described in Section~\ref{customseg},
        since our default is optimized for targeted sequencing. In general,
        you should probably start with tools optimized for WGS data,
        such as Battenberg \cite{NikZainal2012}, ABSOLUTE \cite{Carter2012},
        ACEseq \cite{Kleinheinz210807}, or TitanCNA \cite{Ha2014}. We are
        planning to incorporate proper support for WGS once high coverage
        diagnostic WGS becomes more common.
    \end{itemize}
    \item Otherwise, if log2-ratio peaks are clean as in 
Figure~\ref{figure/figureexample2-1}:
    \begin{itemize}
        \item Was \software{MuTect} run without a matched normal? If yes, make
        sure to provide either a pool of normals VCF
        (\Rcode{normal.panel.vcf.file}) or a SNP blacklist (if no pool of
        normal samples is available), as described in
        Sections~\ref{artifactfiltering} and \ref{artifactfiltering2}.
        \item A high fraction of sub-clonal copy-number alterations might also
    result in a low ranking of correct low ploidy solutions (see
    Section~\ref{sec:limits}). 
    \end{itemize}
\end{itemize} 
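As mentioned above, a minimal sketch of a more aggressive segmentation for a
noisy, high purity sample; all other arguments are as in the standard
\Rfunction{runAbsoluteCN} examples, and the exact \Rcode{prune.hclust.h} value
is assay-specific:

<<aggressiveseg, eval=FALSE>>=
ret <- runAbsoluteCN(tumor.coverage.file = tumor.coverage.file,
    normal.coverage.file = normal.coverage.file, vcf.file = vcf.file,
    genome = "hg19", fun.segmentation = segmentationCBS,
    # prune more aggressively than the default
    args.segmentation = list(prune.hclust.h = 0.2))
@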

\paragraph{If the ploidy is frequently too low:}
\begin{itemize}
    \item \Biocpkg{PureCN} with default parameters is conservative in calling
    genome duplications.
    \item This should only affect low purity samples (< 35\%), since in higher
    purity samples the duplication signal is usually strong enough to
    reliably detect it.  
    \item In whole-exome data, it is usually safe to decrease the
    \Rcode{max.homozygous.loss} default, since such large losses are rare.
\end{itemize}

\paragraph{Will PureCN work with my data?}

\begin{itemize}
    \item \Biocpkg{PureCN} was designed for medium-sized (>2-3Mb) targeted
        panels. The more data, the better; best results are typically
        achieved with whole-exome data.

    \item The number of heterozygous SNPs is also important (>1000 per sample).
        Copy number baits enriched in SNPs are therefore very helpful (see
        Section~\ref{secmaxsnps}).

    \item Some users got acceptable results with small (<1Mb) panels.
        Try to find an optimal off-target bin width
        (\Rcode{average.off.target.width} in \Rfunction{preprocessIntervals};
        see the sketch after this list) and maximize the number of
        heterozygous SNPs by including as much padding as possible. Keep in
        mind that without tiling baits, you will only have poor resolution to
        detect LOH.

    \item Coverages below 80X are difficult unless purities are high and
        coverages are even.

    \item \Biocpkg{PureCN} also needs process-matched normal samples; again,
        the more, the better.

    \item Samples with tumor purities below 15-20\% usually cannot be analyzed
        with this algorithm and \Biocpkg{PureCN} might return very wrong purity
        estimates. In high coverage samples with low duplication rates, this
        limit can be close to 10\%.

    \item Whole-genome data is not officially supported and specialized tools
        will likely provide better results. Third-party segmentation tools
        designed for this data type would again be required.

    \item Amplicon sequencing data is also not officially supported. If the
        assay contains tiling probes (at least with 1Mb spacing) and uses a
        barcode protocol that reduces PCR bias of measured allelic fractions,
        then this method might produce acceptable results. Setting the
        \Rcode{model} argument of \Rfunction{runAbsoluteCN} to \Rcode{betabin}
        is recommended. Specialized segmentation tools might again be better
        than our default.
    
    \item This tool was originally designed for single sample pipelines. It
        is becoming more common to have multiple biopsies per patient, and
        sharing information across biopsies might be beneficial. Other tools
        worth checking out are lumosVar 2.0 \cite{Halperin2019} and SuperFreq
        \cite{Flensburg2020}.
\end{itemize}
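As referenced above, a minimal sketch of the small panel tuning; the
\Rcode{interval.file} and \Rcode{reference.file} variables are placeholders,
and the bin width is a starting value to be optimized per assay, not a
recommendation:

<<smallpanelsketch, eval=FALSE>>=
# Smaller off-target bins than the default to gain resolution
intervals <- preprocessIntervals(interval.file, reference.file,
    off.target = TRUE, average.off.target.width = 100000,
    output.file = "panel_intervals.txt")
# In runAbsoluteCN, additionally consider setting model = "betabin"
@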

\paragraph{If you have trouble generating input data PureCN accepts, please check:}
\label{faqinput}
\begin{itemize}
    \item For problems related to generating valid coverage data, either consult
        the \software{GATK} manual for the \software{DepthOfCoverage} or
        \software{CollectFragmentCounts} tools or Section~\ref{purecncoverage}
        for the equivalent function in \Biocpkg{PureCN}. If you use
        \software{DepthOfCoverage} and off-target intervals as generated by
        IntervalFile.R (see Quick vignette), make sure to run it with the
        parameters \Rcode{--omitDepthOutputAtEachBase} and
        \Rcode{--interval\_merging OVERLAPPING\_ONLY} (see the sketch after
        this list).

    \item Currently only VCF files generated by \software{MuTect 1} are
        officially supported and well tested. A minimal example 
        \software{MuTect} call would be:
    \begin{verbatim}
    $JAVA7 -Xmx6g -jar $MUTECT \
    --analysis_type MuTect -R $REFERENCE \
    --dbsnp $DBSNP_VCF \
    --cosmic $COSMIC_VCF \
    -I:normal $BAM_NORMAL \
    -I:tumor $BAM_TUMOR  \
    -o $OUT/${ID}_bwa_mutect_stats.txt \
    -vcf $OUT/${ID}_bwa_mutect.vcf
    \end{verbatim}
    The normal needs to be matched; without a matched normal, only provide the
    tumor BAM file (do NOT provide a process-matched normal here). The default
    output file is the stats or call-stats file; this can be provided in
    addition to the required VCF file via \Rcode{args.filterVcf} in
    \Rfunction{runAbsoluteCN}. If provided, it may help \Biocpkg{PureCN} filter
    artifacts. This requires \software{MuTect} version 1.1.7. This version
    is currently available at
    \url{https://software.broadinstitute.org/gatk/download/mutect} and requires
    Java 1.7 (it does not work with Java 1.8).
    
    Note that this \software{MuTect} VCF will contain variants in off-target
    reads.  By default, \Biocpkg{PureCN} will remove variants outside the
    provided targets and their 50bp flanking regions. We highly recommend
    finding good values for each assay. A good cutoff will maximize the number
    of heterozygous SNPs and keep only an acceptable number of lower quality
    calls.  This cutoff is set via \Rcode{interval.padding} in
    \Rcode{args.filterVcf}.  See Section~\ref{secmaxsnps}. 

    \item Support for \software{GATK4} and \software{Mutect2} is in beta.
    Version 4.1.7.0 or higher, as well as specifying the
    \Rcode{--genotype-germline-sites} flag in \software{Mutect2}, is required.
    Please follow the Broad Best Practices as closely as possible, as we only
    test those workflows.

    \item For VCFs generated by other callers, the required dbSNP annotation 
    can be added, for example, with \software{bcftools}:
    \begin{verbatim}
    bcftools annotate --annotations $DBSNP_VCF \
        --columns ID --output $OUT/${ID}_bwa_dbsnp.vcf.gz --output-type z \
            $OUT/${ID}_bwa.vcf.gz
    \end{verbatim}
    \item To generate a mappability file with the \software{GEM} library:
         \begin{itemize}
         \item Calculate mappability; set the kmer size to the length of the mapped reads. 
         \begin{verbatim}
    REFERENCE="hg38.fa"
    PREF=`basename $REFERENCE .fa`     
    THREADS=4
    KMER=100
    gem-indexer -T ${THREADS} -c dna -i ${REFERENCE} -o ${PREF}_index
    gem-mappability -T ${THREADS} -I ${PREF}_index.gem -l ${KMER} \
        -o ${PREF}_${KMER} -m 2 -e 2
    gem-2-wig -I ${PREF}_index.gem -i ${PREF}_${KMER}.mappability \
        -o ${PREF}_${KMER}
         \end{verbatim}
        \item Convert to bigWig format, for example using the UCSC
        \software{wigToBigWig} tool:
         \begin{verbatim}
    cut -f1,2 ${REFERENCE}.fai > ${PREF}.sizes
    wigToBigWig ${PREF}_${KMER}.wig ${PREF}.sizes ${PREF}_${KMER}.bw
         \end{verbatim}
         \end{itemize}   

    \item To generate the normal panel VCF (\Rcode{normal.panel.vcf.file})
    with \software{GATK3}:
    \begin{itemize}
        \item Run \software{MuTect} on the normal with
        \Rcode{-I:tumor \$BAM\_NORMAL} and optionally with the 
        \Rcode{--artifact\_detection\_mode} flag.
        \item Remove the empty \Rcode{none} sample from the VCF:
        \begin{verbatim}
    $JAVA -Xmx6g -jar $GATK3 \
    --analysis_type SelectVariants -R $REFERENCE \
    --exclude_sample_expressions none \
    -V $OUT/${ID}_bwa_mutect_artifact_detection_mode.vcf \
    -o $OUT/${ID}_bwa_mutect_artifact_detection_mode_no_none.vcf 
        \end{verbatim}
        \item Merge the VCFs:
        \begin{verbatim}
    CMD="java  -Xmx12g -jar $GATK3 -T CombineVariants --minimumN 3 -R $REFERENCE"
    for VCF in $OUT/*bwa_mutect_artifact_detection_mode_no_none.vcf;
    do
        CMD="$CMD --variant $VCF"
    done
    CMD="$CMD -o $OUT/normals_merged_min3.vcf"
    echo $CMD > $OUT/merge_normals_min3.sh
    bgzip $OUT/normals_merged_min3.vcf
    tabix $OUT/normals_merged_min3.vcf.gz
        \end{verbatim}
    \end{itemize}
\end{itemize}
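As referenced in the first item of this list, a minimal sketch of a
\software{DepthOfCoverage} call with the required parameters (a
\software{GATK3} command line; the interval list name is a placeholder):

\begin{verbatim}
    $JAVA -Xmx6g -jar $GATK3 \
    --analysis_type DepthOfCoverage -R $REFERENCE \
    -I $BAM_TUMOR -L $OUT/baits_optimized.interval_list \
    --omitDepthOutputAtEachBase \
    --interval_merging OVERLAPPING_ONLY \
    -o $OUT/${ID}_coverage.txt
\end{verbatim}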

\paragraph{Questions related to the output.}

\begin{itemize}
    \item \textit{Where is the major and minor segment copy number?} The minor
        copy number is \Rcode{ML.M.SEGMENT} in \Rfunction{predictSomatic}
        (Table~\ref{tbl:predictSomatic}) and \Rcode{M} in
        \Rfunction{callAlterations} (Table~\ref{tbl:callAlterations}) and
        \Rfunction{callLOH} (Table~\ref{tbl:callLOH}). The major copy number is
        simply the total copy number (column \Rcode{C} or \Rcode{ML.C} in
        \Rfunction{predictSomatic}) minus the minor copy number (see the
        sketch after this list).
    \item \textit{Why are there two different amplification output files?} See
    Section~\ref{callalterations}. Gene-level p-values are not very useful for
    calling high-level amplifications in tissue samples with a significant
    amount of tumor cells, because even regions of single gains will have
    significant read pile-ups. These p-values can, however, be useful in the
    low purity samples commonly observed in cfDNA data. There is no
    corresponding file for homozygous deletions, since detecting those is
    usually impossible in very low purity samples.
\end{itemize}
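A minimal sketch deriving the major copy number from the \Rfunction{callLOH}
output, as referenced above:

<<majorminorsketch>>=
loh <- callLOH(ret)
# Major copy number is the total (C) minus the minor (M) copy number
loh$major <- loh$C - loh$M
head(loh)
@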

\paragraph{Questions related to pool of normals.}

As described in detail in Sections~\ref{bestnormal} and
\ref{artifactfiltering}, a pool of normal samples is used in \Biocpkg{PureCN}
for coverage normalization (to adjust for target-specific capture biases) and
for artifact filtering. A few recommendations:

\begin{itemize}
    \item Matched normals are obviously perfect for identifying most alignment
        artifacts and mapping biases. While not critical, we still recommend
        generating a \Rcode{normal.panel.vcf.file} for \software{MuTect} and
        \Rfunction{calculateMappingBiasVcf} using the available normals.

    \item Without matched normals, we need process-matched normals for coverage
        normalization. We recommend more than 5 from 2-3 different library
        preparation and sequencing batches. The more, the better.
    
    \item If you are able to add new normal samples over time, capturing
        potentially subtle changes in the lab, do so!

    \item The normals used for coverage normalization should ideally be
        sequenced to at least half the coverage of the tumor samples. This is
        completely different from matched normals for germline calling, where
        30-40X provide sufficient power to detect heterozygosity.

    \item For artifact removal, the more normals available, the more rare
        artifacts are removed. We recommend at least 10, 50 or more are ideal.
        The more artifacts are removed, the less likely \Biocpkg{PureCN} output
        requires manual curation (Section~\ref{artifactfiltering}).

    \item For position-specific mapping bias correction, the more normals are
        available, the more rare SNPs will have reliable mapping bias
        estimates. This again requires at least about 10 normals to be
        useful; 100 or more are ideal (see the sketch after this list).

    \item With a smaller pool of normals, we additionally recommend filtering
        SNPs from low quality regions (Section~\ref{artifactfiltering2}).

    \item It is worth trying the beta-binomial function instead of
        the default in the \Rcode{model} argument of \Rfunction{runAbsoluteCN}.
        This will incorporate uncertainty of observed variant allelic fractions
        in the variant fitting step. The more normals, the more SNPs will have
        reliable estimates of allelic fraction overdispersion.
    
    \item \textit{Do I really need a pool of normals? I only have tumor samples.}
        Unfortunately, yes. If you used a commercial capture kit, you might be
        able to obtain control samples from the vendor or the public domain.
        This is not optimal, but usually better than nothing. 
    
    \item It is safe to include multiple normals from the same individual.
        Fewer common germline CNVs will be detected in
        \Rfunction{createNormalDatabase} and fewer SNPs in
        \Rfunction{calculateMappingBiasVcf}, but especially when the controls
        were sequenced in multiple batches, these replicates will still
        provide useful information for coverage normalization.
\end{itemize}
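A minimal sketch of the position-specific mapping bias database referenced
above, built from the merged normal panel VCF created in the previous FAQ
(the genome version is an assumption):

<<mappingbiassketch, eval=FALSE>>=
bias <- calculateMappingBiasVcf("normals_merged_min3.vcf.gz",
    genome = "hg19")
saveRDS(bias, "mapping_bias_hg19.rds")
# Pass the RDS file to runAbsoluteCN as described in the artifact
# filtering section
@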

\paragraph{Questions related to manual curation.}

\Biocpkg{PureCN}, like most other related tools, essentially finds the
simplest explanation of the data. There are three major problems with this
approach:
\begin{itemize}
    \item First, hybrid capture data can be noisy and the algorithm must
        distinguish signal from noise; if the algorithm mistakes noise for
        signal, then this often results in wrong high ploidy calls (see
        Sections~\ref{artifactfiltering} and \ref{artifactfiltering2}). If all
        steps in this vignette were followed, then \Biocpkg{PureCN} should
        ignore common artifacts. Noisy samples thus often have outlier ploidy
        values and are often automatically flagged by \Biocpkg{PureCN}. The
        correct solution is in most of these cases ranked second or third.

    \item The second problem is that the signal can be sparse, i.e. when the
        tumor purity is very low or when there are only few somatic events.
        Manual curation is often easy in the latter case, for example when
        small losses are called as homozygous, but the corresponding germline
        allele frequencies are unbalanced (a complete loss would result in
        balanced germline allele frequencies, since only normal DNA is left).
        Future versions might improve calling in these cases by
        underweighting uninformative genomic regions.

    \item The third problem is that tumor evolution is fast and complex and very
        difficult to incorporate into general likelihood models. Sometimes
        multiple solutions explain the data equally well, but one solution is
        then often clearly more consistent with known biology, for example LOH
        in tumor suppressor genes such as \textit{TP53}.  A basic understanding
        of both the algorithm and the tumor biology of the particular cancer
        type are thus important for curation.  Fortunately, in most cancer
        types, such ambiguity is rather rare. See also Section~\ref{sec:limits}.

\end{itemize}

\paragraph{Questions related to matched normals.}
\begin{description}
    \item[Coverage normalization:] Even when matched normals are available, we
        recommend building a normal database for coverage normalization. This
        usually produces cleaner coverage profiles than the matched normal
        \cite{Tabak566505} (see the sketch after this list).
    \item[VCFs: ] When matched normals are available, simply provide this
        information to the variant caller and make sure that germline SNPs are
        not filtered out. \Biocpkg{PureCN} should automatically find the matched
        information.
\end{description}
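A minimal sketch of building such a database, assuming
\Rcode{normal.coverage.files} is a character vector of GC-normalized coverage
file names:

<<normaldbsketch, eval=FALSE>>=
normalDB <- createNormalDatabase(normal.coverage.files)
saveRDS(normalDB, "normalDB.rds")
# Find the best normalization for a particular tumor sample
pool <- calculateTangentNormal(tumor.coverage.file, normalDB)
@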

\paragraph{Questions related to off-target reads.}
Hybrid capture assays usually produce a significant number of reads mapping
outside the captured regions. These assays thus provide shallow whole-genome
sequencing data for free. \Biocpkg{PureCN} makes it straightforward
to use that information, but it might not always make sense to do so:
\begin{itemize}
    \item Does the assay contain a fairly large number of big gaps (>1Mb)
        between neighboring baits? For whole-exome data or targeted panels
        with copy number tiling probes, the answer can be no.
    \item Off-target reads are usually less affected by library-specific biases
        (Section~\ref{secgcbias}) than on-target reads. In noisy data,
        off-target reads can thus provide relatively clean additional
        information.
    \item Very efficient assays with low off-target mapping might not provide a
        large number of reads in useful window sizes (usually below 0.2Mb),
        especially when total sequencing coverages are rather low. This can
        result in off-target data that are significantly noisier than
        on-target data.
    \item It is recommended to check the \Rfunction{runAbsoluteCN} output for
        the standard deviations of log2 copy number ratios for both on- and
        off-target reads. If the off-target noise is consistently larger than
        the on-target noise, you can try increasing the off-target bin width;
        conversely, make it smaller when the on-target noise is larger.
\end{itemize}

\paragraph{Questions related to GC normalization.}
\begin{itemize} 
    \item Explicit GC-normalization is useful whenever a significant
difference between tumor and normal controls is suspected. One example would
be FFPE tumors normalized with blood controls. Otherwise the normalization
using the panel of normals should automatically remove any GC bias.
    \item Smaller panels of normals might also benefit from explicit
GC-normalization (see the sketch after this list).
    \item Otherwise it is safe to skip the GC-normalization. In the command
line tools, set \Rcode{--skip-gc-norm} in \Rcode{Coverage.R}.
\end{itemize}
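A minimal sketch of an explicit GC-normalization of a single coverage file
with \Rfunction{correctCoverageBias} (file names are placeholders; the
interval file must contain GC content, see Section~\ref{secgcbias}):

<<gcnormsketch, eval=FALSE>>=
correctCoverageBias("tumor_coverage.txt.gz", interval.file,
    output.file = "tumor_coverage_gcnorm.txt.gz")
@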

\paragraph{If all or most of the samples are flagged as:}

\begin{description}
    \item[Noisy segmentation:] The default of 300 for \Rcode{max.segments} is
        calibrated for high quality and high coverage whole-exome data. For
        whole-genome data or lower coverage data, this value needs to be
        re-calibrated.  In case the copy number data looks indeed noisy, please
        see the first FAQ. Please be aware that \Biocpkg{PureCN} will apply more
        aggressive segmentation parameters when the number of segments exceeds
        this cutoff. If the high segment count is real, this might confound
        downstream analyses.
    \item[High AT/GC dropout:] If the data are GC-normalized, then there might be
        issues with either the target intervals or the provided GC content.
        Please double check that all files are correct and that all the coverage
        files are GC-normalized (Section~\ref{secgcbias}).
    \item[Sex mismatch of coverage and VCF:] If the panel contains baits for
        chromosome Y, then the interval file was probably generated without
        mappability file (Section~\ref{sectargetinfo}). Similarly when
        third-party tools were used for coverage normalization and segmentation,
        this usually means probes on chromosome Y were not filtered for
        mappability. Cross-sample contamination (Section~\ref{seccontamination})
        can also cause sex mismatches.  
\end{description}

\appendix

\bibliography{PureCN}

\section*{Session Info}
<<sessioninfo, results='asis', echo=FALSE>>=
toLatex(sessionInfo())
@

\end{document}