% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/seqArchR.R, R/seqArchR_main.R
\docType{package}
\name{seqArchR}
\alias{seqArchR}
\title{seqArchR: A package for de novo discovery of different sequence
architectures}
\usage{
seqArchR(
  config,
  seqs_ohe_mat,
  seqs_raw,
  seqs_pos = NULL,
  total_itr = NULL,
  set_ocollation = NULL,
  fresh = TRUE,
  use_oc = NULL,
  o_dir = NULL
)
}
\arguments{
\item{config}{seqArchR configuration object as returned by
\code{\link{set_config}}. This is a required argument.}

\item{seqs_ohe_mat}{A matrix of one-hot encoded sequences with sequences
along columns. This is a required argument.}

\item{seqs_raw}{The FASTA sequences as a
\code{\link[Biostrings]{DNAStringSet}} object. This is a required argument.}

\item{seqs_pos}{Vector. Specify the tick labels for sequence positions.
Default is NULL.}

\item{total_itr}{Numeric. Specify the number of iterations to perform. This
should be greater than zero. Default is NULL.}

\item{set_ocollation}{Logical vector. A logical vector of length `total_itr`
specifying for every iteration of seqArchR if collation of clusters from
outer chunks should be performed. TRUE denotes clusters are collated,
FALSE otherwise.}

\item{fresh}{Logical. Specify whether this is a fresh run (TRUE) or a
continuation of an earlier run (FALSE). Because
seqArchR enables checkpointing, it is possible to perform additional
iterations upon clusters from an existing seqArchR result (or a checkpoint)
object. See 'use_oc' argument.
For example, when processing a set of FASTA sequences,
if an earlier call to seqArchR performed two iterations, and now you wish to
perform a third, the arguments `fresh` and `use_oc` can be used. Simply set
`fresh` to FALSE and assign the sequence clusters from iteration two from
the earlier result to `use_oc`. As of v0.1.3, with this setting, seqArchR
returns a new result object as if the additional iteration performed is the
only iteration.}

\item{use_oc}{List. Clusters to be further processed with seqArchR. These can
be from a previous seqArchR result (in which case use
\code{\link{get_seqs_clust_list}} function), or simply clusters from any
other method.
Warning: This has not been rigorously tested yet (v0.1.3).}

\item{o_dir}{Character. Specify the output directory with its path. seqArchR
will create this directory. If a directory with the given name exists at the
given location, seqArchR will add a suffix to the directory name. This
change is reported to the user. Default is NULL. When NULL, only the result
is returned, and no plots, checkpoints, or results are written to disk.}
}
\value{
A nested list of elements as follows:
\describe{
\item{seqsClustLabels}{A list with cluster labels for all sequences per
iteration of seqArchR. The cluster labels are stored as characters.}

\item{clustBasisVectors}{A list with information on NMF basis vectors per
iteration of seqArchR. Per iteration, there are two variables `nBasisVectors`
storing the number of basis vectors after model selection,
and `basisVectors`, a matrix storing the basis vectors themselves.
Dimensions of the `basisVectors` matrix are 4*L x nBasisVectors
(mononucleotide case) or 16*L x nBasisVectors (dinucleotide case).}

\item{clustSol}{The clustering solution obtained upon processing the raw
clusters from the last iteration of seqArchR's result. This is handled
internally by the function \code{\link{collate_seqArchR_result}} using the
default setting of Euclidean distance and ward.D linkage hierarchical
clustering.}

\item{rawSeqs}{The input sequences as a DNAStringSet object.}

\item{timeInfo}{Stores the time taken (in minutes) for processing each
iteration. This element is added only if `time` flag is set to TRUE in
config.}

\item{config}{The configuration used for processing.}
\item{call}{The function call itself.}
}
}
\description{
Given a set of DNA sequences, \code{seqArchR} enables unsupervised
discovery of \emph{de novo} clusters with characteristic sequence
architectures, defined by position-specific motifs or by the composition
of stretches of nucleotides, e.g., CG-richness.

Call this function to process a data set using seqArchR.
}
\details{
The seqArchR package provides four categories of important functions:
data preparation and manipulation, non-negative matrix factorization,
clustering, and visualization.
}
\section{Functions for data preparation and manipulation}{

\itemize{
\item \code{\link{prepare_data_from_FASTA}}
\item \code{\link{get_one_hot_encoded_seqs}}
}
}

\section{Functions for visualizations}{

\itemize{
\item \code{\link{plot_arch_for_clusters}}
\item \code{\link{plot_ggseqlogo_of_seqs}}
\item \code{\link{viz_bas_vec}}
\item \code{\link{viz_seqs_acgt_mat}}
\item \code{\link{viz_pwm}}
}
}

\examples{


# Here, we re-use the example input sequences and one-hot encoded matrix
# shipped with seqArchR. Please see examples in the corresponding man pages
# for generating a one-hot encoded input matrix from raw FASTA sequences
# in `prepare_data_from_FASTA`
#
inputSeqsMat <- readRDS(system.file("extdata", "tssSinuc.rds",
                             package = "seqArchR", mustWork = TRUE))

inputSeqsRaw <- readRDS(system.file("extdata", "tssSeqsRaw.rds",
                             package = "seqArchR", mustWork = TRUE))
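
# A minimal, hedged sketch (base R only) of the layout described for
# `seqs_ohe_mat`: one column per sequence and 4*L rows for sequences of
# length L. `sketch_one_hot` is a hypothetical helper written only for this
# illustration; it is not the package's `get_one_hot_encoded_seqs`, and the
# row order used here (all L positions for A, then C, G, T) is an assumption.
sketch_one_hot <- function(seqs, alphabet = c("A", "C", "G", "T")) {
    seq_len_L <- unique(nchar(seqs))
    # All sequences must have the same length
    stopifnot(length(seq_len_L) == 1)
    vapply(seqs, function(s) {
        chars <- strsplit(s, "")[[1]]
        # For each letter, a 0/1 indicator over the L positions
        as.numeric(unlist(lapply(alphabet, function(a) chars == a)))
    }, numeric(length(alphabet) * seq_len_L), USE.NAMES = FALSE)
}

# Three toy sequences of length 4 give a 16 x 3 matrix
toyMat <- sketch_one_hot(c("ACGT", "AACC", "GGTT"))
dim(toyMat)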

# Set seqArchR configuration
seqArchRconfig <- seqArchR::set_config(
    parallelize = TRUE,
    n_cores = 2,
    n_runs = 100,
    k_min = 1,
    k_max = 20,
    mod_sel_type = "stability",
    bound = 10^-8,
    chunk_size = 100,
    flags = list(debug = FALSE, time = TRUE, verbose = TRUE,
        plot = FALSE)
)

# Run seqArchR
seqArchRresult <- seqArchR::seqArchR(config = seqArchRconfig,
                          seqs_ohe_mat = inputSeqsMat,
                          seqs_raw = inputSeqsRaw,
                          seqs_pos = seq(1,100,by=1),
                          total_itr = 2,
                          set_ocollation = c(TRUE, FALSE))
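
# A hedged, self-contained sketch (base R only) of the shape documented for
# `basisVectors`: 4*L rows by nBasisVectors columns in the mononucleotide
# case. `toyBasis` and `toyPwm` are made-up illustrations, not seqArchR
# output, and the row layout (all L positions for A, then C, G, T) is an
# assumption made for this example.
toyL <- 5
toyK <- 2
toyBasis <- matrix(seq_len(4 * toyL * toyK), nrow = 4 * toyL, ncol = toyK)

# View the first toy basis vector as a 4 x L matrix, one row per nucleotide
toyPwm <- matrix(toyBasis[, 1], nrow = 4, ncol = toyL, byrow = TRUE,
                 dimnames = list(c("A", "C", "G", "T"), seq_len(toyL)))
dim(toyPwm)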


}