Bioconductor Code: SplicingGraphs

Raw Blame Patch Log History
\name{countReads-methods}

\alias{countReads-methods}

\alias{countReads}
\alias{countReads,SplicingGraphs-method}


\title{
  Summarize the reads assigned to a SplicingGraphs object
}

\description{
  \code{countReads} counts the reads assigned to a SplicingGraphs object.
  The counting can be done by splicing graph edge, \emph{reduced} splicing
  graph edge, transcript, or gene.

  \code{reportReads} is similar to \code{countReads} but returns right before
  the final counting step, that is, the returned DataFrame contains the reads
  instead of their counts.
}

\usage{
countReads(x, by=c("sgedge", "rsgedge", "tx", "gene"))
reportReads(x, by=c("sgedge", "rsgedge", "tx", "gene"))
}

\arguments{
  \item{x}{
    A \link{SplicingGraphs} object.
  }
  \item{by}{
    Can be \code{"sgedge"}, \code{"rsgedge"}, \code{"tx"}, or \code{"gene"}.
    Specifies the \emph{level of resolution} that summarization should be
    performed at. See Details section below.
  }
}

\details{
  \subsection{Levels of resolution}{
    \code{countReads} and \code{reportReads} allow summarization of the reads
    at different levels of resolution. The level of resolution is determined
    by the type of feature that one chooses via the \code{by} argument.
    The supported resolutions are (from highest to lowest resolution):
    \enumerate{
      \item \code{by="sgedge"} for summarization at the splicing graph edge
            level (i.e. at the exons/intron level);
      \item \code{by="rsgedge"} for summarization at the \emph{reduced}
            splicing graph edge level;
      \item \code{by="tx"} for summarization at the transcript level;
      \item \code{by="gene"} for summarization at the gene level.
    }
  }

  \subsection{Relationship between levels of resolution}{
    There is a parent-child relationship between the features
    corresponding to a given level of resolution (the parent features)
    and those corresponding to a higher level of resolution (the child
    features).

    For example, in the case of the 2 first levels of resolution listed
    above, the parent-child relationship is the following: the parent
    features are the \emph{reduced} splicing graph edges, the child features
    are the splicing graph edges, and each parent feature is obtained by
    merging one or more child features together.
    Similarly, transcripts can be seen as parent features of \emph{reduced}
    splicing graph edges, and genes as parent features of transcripts.
    Note that, the rsgedge/sgedge and gene/tx relationships are one-to-many,
    but the tx/rsgedge relationship is many-to-many because a given edge can
    belong to more than one transcript.

    Finally the parent-child relationships between 2 arbitrary levels of
    resolution is defined by combining the relationships between consecutive
    levels. All possible parent-child relationships are summarized in the
    following table:
    \preformatted{
                    | to: sgedge   | to: rsgedge  | to: tx
      --------------+--------------+--------------+------------
      from: rsgedge | one-to-many  |              |            
      from: tx      | many-to-many | many-to-many |            
      from: gene    | one-to-many  | one-to-many  | one-to-many
    }
  }

  \subsection{Multiple hits and ambiguous reads}{
    An important distinction needs to be made between a read that hits a
    given feature multiple times and a read that hits more than one feature.

    If the former, the read is counted/reported only once for that feature.
    For example, when summarizing at the transcript level, a read is
    counted/reported only once for a given transcript, even if that read
    hits more than one splicing graph edge (or \emph{reduced} splicing graph
    edge) associated with that transcript.

    If the latter, the read is said to be \emph{ambiguous}. An ambiguous read
    is currently counted/reported for each feature where it has a hit.
    This is a temporary situation: in the near future the user will be offered
    options to handle ambiguous reads in different ways.
  }

  \subsection{Ambiguous reads and levels of resolution}{
    A read might be ambiguous at one level of resolution but not at the other.
    Also the number of ambiguous reads is typically affected by the level
    of resolution. However, even though higher resolution generally means
    more ambiguous reads, this is only true when the switch from one level
    of resolution to the other implies a parent-child relationship between
    features that is one-to-many.
    So, based on the above table, this is always true, except when
    switching from using \code{by="tx"} to using \code{by="sgedge"} or
    \code{by="rsgedge"}. In those cases, the switch can produce more
    ambiguities but it can also produce less.
  }
}

\value{
  A \link[S4Vectors]{DataFrame} object with one row per:
  \itemize{
    \item unique splicing graph edge, if \code{by="sgedge"};
    \item unique \emph{reduced} splicing graph edge, if \code{by="rsgedge"};
    \item transcript if \code{by="tx"};
    \item gene if \code{by="gene"}.
  }

  And with one column per sample (containing the counts for that sample for
  \code{countReads}, and the reads for that sample for \code{reportReads}),
  plus the two following left columns:
  \itemize{
    \item if \code{by="sgedge"}: \code{"sgedge_id"}, containing the
          \emph{global splicing graph edge ids}, and \code{"ex_or_in"},
          containing the type of edge (exon or intron);
    \item if \code{by="rsgedge"}: \code{"rsgedge_id"}, containing the
          \emph{global reduced splicing graph edge ids}, and
          \code{"ex_or_in"}, containing the type of edge (exon, intron,
          or mixed);
    \item if \code{by="tx"}: \code{"tx_id"} and \code{"gene_id"};
    \item if \code{by="gene"}: \code{"gene_id"} and \code{"tx_id"}.
  }

  For \code{countReads}, each column of counts is of type integer and
  is named after the corresponding sample.
  For \code{reportReads}, each column of reads is a CharacterList object
  and its name is the name of the corresponding sample with the
  \code{".hits"} suffix added to it.
  In both cases, the name of the sample is the name that was passed to
  \code{assignReads} when the reads of a given sample were initially
  assigned. See \code{?\link{assignReads}} for more information.
}

\author{
  H. Pages
}

\seealso{
  This man page is part of the \pkg{SplicingGraphs} package.
  Please see \code{?`\link{SplicingGraphs-package}`} for an overview of the
  package and for an index of its man pages.
}

\examples{
## ---------------------------------------------------------------------
## 1. Make SplicingGraphs object 'sg' from toy gene model and assign toy
##    reads to it (see '?assignReads')
## ---------------------------------------------------------------------
example(assignReads)

## ---------------------------------------------------------------------
## 2. Summarize the reads by splicing graph edge
## ---------------------------------------------------------------------
countReads(sg)
reportReads(sg) 

## ---------------------------------------------------------------------
## 3. Summarize the reads by reduced splicing graph edge
## ---------------------------------------------------------------------
countReads(sg, by="rsgedge")
reportReads(sg, by="rsgedge")

## ---------------------------------------------------------------------
## 4. Summarize the reads by transcript
## ---------------------------------------------------------------------
countReads(sg, by="tx")
reportReads(sg, by="tx")

## ---------------------------------------------------------------------
## 5. Summarize the reads by gene
## ---------------------------------------------------------------------
countReads(sg, by="gene")
reportReads(sg, by="gene")

## ---------------------------------------------------------------------
## 6. A close look at ambiguous reads
## ---------------------------------------------------------------------
resolutions <- c("sgedge", "rsgedge", "tx", "gene")

reported_reads <- lapply(resolutions,
    function(by) {
        reported_reads <- reportReads(sg, by=by)
        unlist(reported_reads$TOYREADS.hits)
    })

## The set of reported reads is the same at all levels of resolution:
unique_reported_reads <- lapply(reported_reads, unique)
stopifnot(identical(unique_reported_reads,
                    rep(unique_reported_reads[1], 4)))

## Extract ambigous reads for each level of resolution:
ambiguous_reads <- lapply(reported_reads,
                          function(x) unique(x[duplicated(x)]))
names(ambiguous_reads) <- resolutions
ambiguous_reads

## Reads that are ambiguous at the "rsgedge" level must also be
## ambiguous at the "sgedge" level:
stopifnot(all(ambiguous_reads$rsgedge \%in\% ambiguous_reads$sgedge))

## However, there is no reason why reads that are ambiguous at the
## "tx" level should also be ambiguous at the "sgedge" or "rsgedge"
## level!

## ---------------------------------------------------------------------
## 7. Remove the reads from 'sg'.
## ---------------------------------------------------------------------
sg <- removeReads(sg)
countReads(sg)
}