README.md
de95dafd
 
 <!-- badges: start -->
 <!-- badges: end -->
 
f62360e9
 # Introduction
 
 `TENxIO` allows users to import 10X pipeline files into known
fda59ec3
 Bioconductor classes. The package is not comprehensive, there are file
 types that are not supported. For Visium datasets, we direct users to
 the `VisiumIO` package on Bioconductor. TENxIO consolidates
 functionality from `DropletUtils`. If you would like a file format to be
 supported, open an issue at <https://github.com/waldronlab/TENxIO>.
f62360e9
 
facc7851
 # Supported Formats
 
fda59ec3
 | **Extension**       | **Class**       | **Imported as**                    |
 |---------------------|-----------------|------------------------------------|
 | .h5                 | TENxH5          | SingleCellExperiment w/ TENxMatrix |
 | .mtx / .mtx.gz      | TENxMTX         | SummarizedExperiment w/ dgCMatrix  |
 | .tar.gz             | TENxFileList    | SingleCellExperiment w/ dgCMatrix  |
 | peak_annotation.tsv | TENxPeaks       | GRanges                            |
 | fragments.tsv.gz    | TENxFragments   | RaggedExperiment                   |
 | .tsv / .tsv.gz      | TENxTSV         | tibble                             |
 | spatial.tar.gz      | TENxSpatialList | inter. DataFrame list              |
facc7851
 
 # Tested 10X Products
 
 We have tested these functions with *some*
 [datasets](https://www.10xgenomics.com/resources/datasets) from 10x
 Genomics including those from:
 
 - Single Cell Gene Expression
 - Single Cell ATAC
 - Single Cell Multiome ATAC + Gene Expression
fda59ec3
 - Spatial Gene Expression
facc7851
 
 Note. That extensive testing has not been performed and the codebase may
 require some adaptation to ensure compatibility with all pipeline
 outputs.
 
 # Bioconductor implementations
 
 We are aware of existing functionality in both `DropletUtils` and
 `SpatialExperiment`. We are working with the authors of those packages
 to cover the use cases in both those packages and possibly port I/O
 functionality into `TENxIO`. We are using long tests and the
 `DropletTestFiles` package to cover example datasets on `ExperimentHub`,
 if you would like to know more, see the `longtests` directory on GitHub.
 
969eca88
 # Installation
de95dafd
 
 ``` r
969eca88
 if (!requireNamespace("BiocManager", quietly = TRUE))
de95dafd
     install.packages("BiocManager")
969eca88
 
de95dafd
 BiocManager::install("waldronlab/TENxIO")
 ```
 
969eca88
 # Load the package
 
 ``` r
 library(TENxIO)
 ```
 
 # Description
 
 `TENxIO` offers an set of classes that allow users to easily work with
7d138058
 files typically obtained from the 10X Genomics website. Generally, these
 are outputs from the Cell Ranger pipeline.
 
 # Procedure
 
 Loading the data into a Bioconductor class is a two step process. First,
 the file must be identified by either the user or the `TENxFile`
 function. The appropriate function will be evoked to provide a `TENxIO`
 class representation, e.g., `TENxH5` for HDF5 files with an `.h5`
 extension. Secondly, the `import` method for that particular file class
 will render a common Bioconductor class representation for the user. The
 main representations used by the package are `SingleCellExperiment`,
 `SummarizedExperiment`, `GRanges`, and `RaggedExperiment`.
969eca88
 
ae47d0ad
 # Dataset versioning
 
 The versioning schema in the package mostly applies to HDF5 resources
 and is loosely based on versions of 10X datasets. For the most part,
 version 3 datasets usually contain ranged information at specific
 locations in the data file. Version 2 datasets will usually contain a
 `genes.tsv` file, rather than `features.tsv` as in version 3. If the
 file version is unknown, the software will attempt to derive the version
 from the data where possible.
 
7d138058
 # File classes
 
 ## TENxFile
969eca88
 
 The `TENxFile` class is the catch-all class superclass that allows
 transition to subclasses pertinent to specific files. It inherits from
 the `BiocFile` class and allows for easy dispatching `import` methods.
 
 ``` r
 showClass("TENxFile")
 #> Class "TENxFile" [package "TENxIO"]
 #> 
 #> Slots:
fda59ec3
 #>                                                                                                                               
 #> Name:                extension                  colidx                  rowidx                  remote              compressed
 #> Class:               character                 integer                 integer                 logical                 logical
 #>                               
 #> Name:                 resource
 #> Class: character_OR_connection
969eca88
 #> 
 #> Extends: "BiocFile"
 #> 
f8bfe71f
 #> Known Subclasses: "TENxFragments", "TENxH5", "TENxMTX", "TENxPeaks", "TENxTSV"
969eca88
 ```
 
7d138058
 ### `ExperimentHub` resources
969eca88
 
 `TENxFile` can handle resources from `ExperimentHub` with careful
 inputs. For example, one can import a `TENxBrainData` dataset via the
 appropriate `ExperimentHub` identifier (`EH1039`):
 
 ``` r
 hub <- ExperimentHub::ExperimentHub()
fda59ec3
 #> snapshotDate(): 2025-04-21
969eca88
 hub["EH1039"]
 #> ExperimentHub with 1 record
fda59ec3
 #> # snapshotDate(): 2025-04-21
969eca88
 #> # names(): EH1039
 #> # package(): TENxBrainData
 #> # $dataprovider: 10X Genomics
 #> # $species: Mus musculus
 #> # $rdataclass: character
 #> # $rdatadateadded: 2017-10-26
 #> # $title: Brain scRNA-seq data, 'HDF5-based 10X Genomics' format
fda59ec3
 #> # $description: Single-cell RNA-seq data for 1.3 million brain cells from E18 mice. 'HDF5-based 10X Genomics' format originally pro...
969eca88
 #> # $taxonomyid: 10090
 #> # $genome: mm10
 #> # $sourcetype: HDF5
fda59ec3
 #> # $sourceurl: http://cf.10xgenomics.com/samples/cell-exp/1.3.0/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5
969eca88
 #> # $sourcesize: NA
fda59ec3
 #> # $tags: c("SequencingData", "RNASeqData", "ExpressionData", "SingleCell") 
969eca88
 #> # retrieve record with 'object[["EH1039"]]'
 ```
 
 Currently, `ExperimentHub` resources do not have an extension and it is
 best to provide that to the `TENxFile` constructor function.
 
 ``` r
 fname <- hub[["EH1039"]]
 TENxFile(fname, extension = "h5", group = "mm10", version = "2")
 ```
de95dafd
 
fda59ec3
 Note. `EH1039` is a large ~ 4GB file and files without extension as
facc7851
 those obtained from `ExperimentHub` will emit a warning so that the user
 is aware that the import operation may fail, esp. if the internal
 structure of the file is modified.
f62360e9
 
7d138058
 ### TENxH5
de95dafd
 
f62360e9
 `TENxIO` mainly supports version 3 and 2 type of H5 files. These are
 files with specific groups and names as seen in `h5.version.map`, an
 internal `data.frame` map that guides the import operations.
 
 ``` r
 TENxIO:::h5.version.map
 #>   Version           ID         Symbol                   Type             Ranges
 #> 1       3 /features/id /features/name /features/feature_type /features/interval
 #> 2       2       /genes    /gene_names                   <NA>               <NA>
 ```
 
 In the case that, there is a file without genomic coordinate
 information, the constructor function can take an `NA_character_` input
 for the `ranges` argument.
 
 The `TENxH5` constructor function can be used on either version of these
 H5 files. In this example, we use a subset of the PBMC granulocyte H5
 file obtained from the [10X
 website](https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix.h5).
969eca88
 
 ``` r
 h5f <- system.file(
     "extdata", "pbmc_granulocyte_ff_bc_ex.h5",
     package = "TENxIO", mustWork = TRUE
f62360e9
 )
 library(rhdf5)
 h5ls(h5f)
 #>               group          name       otype  dclass dim
 #> 0                 /        matrix   H5I_GROUP            
 #> 1           /matrix      barcodes H5I_DATASET  STRING  10
 #> 2           /matrix          data H5I_DATASET INTEGER   2
 #> 3           /matrix      features   H5I_GROUP            
 #> 4  /matrix/features _all_tag_keys H5I_DATASET  STRING   2
 #> 5  /matrix/features  feature_type H5I_DATASET  STRING  10
 #> 6  /matrix/features        genome H5I_DATASET  STRING  10
 #> 7  /matrix/features            id H5I_DATASET  STRING  10
 #> 8  /matrix/features      interval H5I_DATASET  STRING  10
 #> 9  /matrix/features          name H5I_DATASET  STRING  10
 #> 10          /matrix       indices H5I_DATASET INTEGER   2
 #> 11          /matrix        indptr H5I_DATASET INTEGER  11
 #> 12          /matrix         shape H5I_DATASET INTEGER   2
969eca88
 ```
 
ae47d0ad
 Note. The `h5ls` function gives an overview of the structure of the
f62360e9
 file. It matches version 3 in our version map.
 
 The show method gives an overview of the data components in the file:
969eca88
 
de95dafd
 ``` r
 con <- TENxH5(h5f)
969eca88
 con
 #> TENxH5 object 
fda59ec3
 #> resource: /home/mramos/R/bioc-devel/TENxIO/extdata/pbmc_granulocyte_ff_bc_ex.h5 
969eca88
 #> dim: 10 10 
 #> rownames: ENSG00000243485 ENSG00000237613 ... ENSG00000286448 ENSG00000236601 
 #> rowData names(3): ID Symbol Type 
 #>   Type: Gene Expression 
 #> colnames: AAACAGCCAAATATCC-1 AAACAGCCAGGAACTG-1 ... AAACCGCGTGAGGTAG-1 AAACGCGCATACCCGG-1
 ```
 
 ### import TENxH5 method
 
 We can simply use the import method to convert the file representation
 to a Bioconductor class representation, typically a
 `SingleCellExperiment`.
 
 ``` r
de95dafd
 import(con)
fda59ec3
 #> preview <= 12 rowRanges: pbmc_granulocyte_ff_bc_ex.h5
969eca88
 #> class: SingleCellExperiment 
 #> dim: 10 10 
fda59ec3
 #> metadata(1): TENxFile
969eca88
 #> assays(1): counts
fda59ec3
 #> rownames(10): ENSG00000243485 ENSG00000237613 ... ENSG00000286448 ENSG00000236601
969eca88
 #> rowData names(3): ID Symbol Type
fda59ec3
 #> colnames(10): AAACAGCCAAATATCC-1 AAACAGCCAGGAACTG-1 ... AAACCGCGTGAGGTAG-1 AAACGCGCATACCCGG-1
969eca88
 #> colData names(0):
 #> reducedDimNames(0):
 #> mainExpName: Gene Expression
 #> altExpNames(0):
 ```
 
f62360e9
 **Note**. Although the main representation in the package is
 `SingleCellExperiment`, there could be a need for alternative data class
 representations of the data. The `projection` field in the `TENxH5` show
 method is an initial attempt to allow alternative representations.
969eca88
 
7d138058
 ## TENxMTX
969eca88
 
 Matrix Market formats are also supported (`.mtx` extension). These are
 typically imported as SummarizedExperiment as they usually contain count
 data.
 
 ``` r
 mtxf <- system.file(
     "extdata", "pbmc_3k_ff_bc_ex.mtx",
     package = "TENxIO", mustWork = TRUE
 )
 con <- TENxMTX(mtxf)
 con
 #> TENxMTX object
fda59ec3
 #> resource: /home/mramos/R/bioc-devel/TENxIO/extdata/pbmc_3k_ff_bc_ex.mtx
7d138058
 ```
 
 ## import MTX method
 
 The `import` method yields a `SummarizedExperiment` without colnames or
 rownames.
 
 ``` r
969eca88
 import(con)
 #> class: SummarizedExperiment 
 #> dim: 171 10 
fda59ec3
 #> metadata(1): TENxFile
969eca88
 #> assays(1): counts
 #> rownames: NULL
 #> rowData names(0):
 #> colnames: NULL
 #> colData names(0):
 ```
 
7d138058
 ## TENxFileList
 
 Generally, the 10X website will provide tarballs (with a `.tar.gz`
 extension) which can be imported with the `TENxFileList` class. The
 tarball can contain components of a gene expression experiment including
 the matrix data, row data (aka ‘features’) expressed as Ensembl
 identifiers, gene symbols, etc. and barcode information for the columns.
969eca88
 
7d138058
 The `TENxFileList` class allows importing multiple files within a
 `tar.gz` archive. The `untar` function with the `list = TRUE` argument
 shows all the file names in the tarball.
969eca88
 
 ``` r
 fl <- system.file(
     "extdata", "pbmc_granulocyte_sorted_3k_ff_bc_ex_matrix.tar.gz",
     package = "TENxIO", mustWork = TRUE
 )
 untar(fl, list = TRUE)
 #> [1] "./pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix/filtered_feature_bc_matrix/"               
 #> [2] "./pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix/filtered_feature_bc_matrix/barcodes.tsv.gz"
 #> [3] "./pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix/filtered_feature_bc_matrix/features.tsv.gz"
 #> [4] "./pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix/filtered_feature_bc_matrix/matrix.mtx.gz"
 ```
 
7d138058
 We then use the `import` method across all file types to obtain an
 integrated Bioconductor representation that is ready for analysis. Files
 in `TENxFileList` can be represented as a `SingleCellExperiment` with
 row names and column names.
969eca88
 
 ``` r
 con <- TENxFileList(fl)
 import(con)
 #> class: SingleCellExperiment 
 #> dim: 10 10 
fda59ec3
 #> metadata(1): TENxFileList
969eca88
 #> assays(1): counts
fda59ec3
 #> rownames(10): ENSG00000243485 ENSG00000237613 ... ENSG00000286448 ENSG00000236601
969eca88
 #> rowData names(3): ID Symbol Type
fda59ec3
 #> colnames(10): AAACAGCCAAATATCC-1 AAACAGCCAGGAACTG-1 ... AAACCGCGTGAGGTAG-1 AAACGCGCATACCCGG-1
969eca88
 #> colData names(0):
 #> reducedDimNames(0):
 #> mainExpName: Gene Expression
 #> altExpNames(0):
 ```
 
7d138058
 ## TENxPeaks
969eca88
 
 Peak files can be handled with the `TENxPeaks` class. These files are
 usually named `*peak_annotation` files with a `.tsv` extension. Peak
 files are represented as `GRanges`.
 
 ``` r
 pfl <- system.file(
     "extdata", "pbmc_granulocyte_sorted_3k_ex_atac_peak_annotation.tsv",
     package = "TENxIO", mustWork = TRUE
 )
 tenxp <- TENxPeaks(pfl)
 peak_anno <- import(tenxp)
 peak_anno
 #> GRanges object with 10 ranges and 3 metadata columns:
 #>        seqnames        ranges strand |        gene  distance   peak_type
 #>           <Rle>     <IRanges>  <Rle> | <character> <numeric> <character>
 #>    [1]     chr1    9768-10660      * | MIR1302-2HG    -18894      distal
 #>    [2]     chr1 180582-181297      * |  AL627309.5     -6721      distal
 #>    [3]     chr1 181404-181887      * |  AL627309.5     -7543      distal
 #>    [4]     chr1 191175-192089      * |  AL627309.5    -17314      distal
 #>    [5]     chr1 267561-268455      * |  AP006222.2       707      distal
 #>    [6]     chr1 270864-271747      * |  AP006222.2      4010      distal
 #>    [7]     chr1 273947-274758      * |  AP006222.2      7093      distal
 #>    [8]     chr1 585751-586647      * |  AC114498.1      -982    promoter
 #>    [9]     chr1 629484-630393      * |  AC114498.1     41856      distal
 #>   [10]     chr1 633556-634476      * |  AC114498.1     45928      distal
 #>   -------
 #>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
 ```
 
7d138058
 ## TENxFragments
969eca88
 
 Fragment files are quite large and we make use of the `Rsamtools`
 package to import them with the `yieldSize` parameter. By default, we
 use a `yieldSize` of 200.
 
 ``` r
 fr <- system.file(
     "extdata", "pbmc_3k_atac_ex_fragments.tsv.gz",
     package = "TENxIO", mustWork = TRUE
 )
 ```
 
 Internally, we use the `TabixFile` constructor function to work with
 indexed `tsv.gz` files.
 
 **Note**. A warning is emitted whenever a `yieldSize` parameter is not
 set.
 
 ``` r
 tfr <- TENxFragments(fr)
 #> Warning in TENxFragments(fr): Using default 'yieldSize' parameter
 tfr
 #> TENxFragments object
fda59ec3
 #> resource: /home/mramos/R/bioc-devel/TENxIO/extdata/pbmc_3k_atac_ex_fragments.tsv.gz
969eca88
 ```
 
 Because there may be a variable number of fragments per barcode, we use
 a `RaggedExperiment` representation for this file type.
 
 ``` r
 fra <- import(tfr)
 fra
 #> class: RaggedExperiment 
 #> dim: 10 10 
 #> assays(2): barcode readSupport
 #> rownames: NULL
fda59ec3
 #> colnames(10): AAACCGCGTGAGGTAG-1 AAGCCTCCACACTAAT-1 ... TGATTAGTCTACCTGC-1 TTTAGCAAGGTAGCTT-1
969eca88
 #> colData names(0):
 ```
 
fda59ec3
 Similar operations to those used with `SummarizedExperiment` are
969eca88
 supported. For example, the genomic ranges can be displayed via
 `rowRanges`:
 
 ``` r
 rowRanges(fra)
 #> GRanges object with 10 ranges and 0 metadata columns:
 #>        seqnames      ranges strand
 #>           <Rle>   <IRanges>  <Rle>
 #>    [1]     chr1 10152-10180      *
 #>    [2]     chr1 10152-10195      *
 #>    [3]     chr1 10080-10333      *
 #>    [4]     chr1 10091-10346      *
 #>    [5]     chr1 10152-10180      *
 #>    [6]     chr1 10152-10202      *
 #>    [7]     chr1 10097-10344      *
 #>    [8]     chr1 10080-10285      *
 #>    [9]     chr1 10090-10560      *
 #>   [10]     chr1 10074-10209      *
 #>   -------
 #>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
 ```
 
fda59ec3
 <details>
 <summary>
 Click here to expand <code>sessionInfo()</code>
 </summary>
 
969eca88
 # Session Information
 
 ``` r
 sessionInfo()
fda59ec3
 #> R version 4.5.0 Patched (2025-04-15 r88148)
 #> Platform: x86_64-pc-linux-gnu
 #> Running under: Ubuntu 24.04.2 LTS
969eca88
 #> 
 #> Matrix products: default
fda59ec3
 #> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
969eca88
 #> 
 #> locale:
fda59ec3
 #>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 #>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 #>  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
 #> 
 #> time zone: America/New_York
 #> tzcode source: system (glibc)
969eca88
 #> 
 #> attached base packages:
fda59ec3
 #> [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
969eca88
 #> 
 #> other attached packages:
fda59ec3
 #>  [1] rhdf5_2.53.0                TENxIO_1.11.1               SingleCellExperiment_1.31.0 SummarizedExperiment_1.39.0
 #>  [5] Biobase_2.69.0              GenomicRanges_1.61.0        GenomeInfoDb_1.45.3         IRanges_2.43.0             
 #>  [9] S4Vectors_0.47.0            BiocGenerics_0.55.0         generics_0.1.3              MatrixGenerics_1.21.0      
 #> [13] matrixStats_1.5.0           colorout_1.3-2             
969eca88
 #> 
 #> loaded via a namespace (and not attached):
fda59ec3
 #>  [1] tidyselect_1.2.1        dplyr_1.1.4             blob_1.2.4              bitops_1.0-9            filelock_1.0.3         
 #>  [6] R.utils_2.13.0          Biostrings_2.77.0       RaggedExperiment_1.33.1 fastmap_1.2.0           BiocFileCache_2.99.0   
 #> [11] digest_0.6.37           lifecycle_1.0.4         KEGGREST_1.49.0         RSQLite_2.3.9           magrittr_2.0.3         
 #> [16] compiler_4.5.0          rlang_1.1.6             tools_4.5.0             yaml_2.3.10             knitr_1.50             
 #> [21] S4Arrays_1.9.0          bit_4.6.0               curl_6.2.2              DelayedArray_0.35.1     BiocParallel_1.43.0    
 #> [26] abind_1.4-8             rsconnect_1.3.4         HDF5Array_1.37.0        withr_3.0.2             purrr_1.0.4            
 #> [31] R.oo_1.27.0             grid_4.5.0              ExperimentHub_2.99.0    Rhdf5lib_1.31.0         cli_3.6.5              
 #> [36] rmarkdown_2.29          crayon_1.5.3            rstudioapi_0.17.1       httr_1.4.7              tzdb_0.5.0             
 #> [41] BiocBaseUtils_1.11.0    DBI_1.2.3               cachem_1.1.0            parallel_4.5.0          AnnotationDbi_1.71.0   
 #> [46] BiocManager_1.30.25     XVector_0.49.0          vctrs_0.6.5             Matrix_1.7-3            jsonlite_2.0.0         
 #> [51] hms_1.1.3               bit64_4.6.0-1           archive_1.1.12          h5mread_1.1.0           glue_1.8.0             
 #> [56] codetools_0.2-20        BiocVersion_3.22.0      BiocIO_1.19.0           UCSC.utils_1.5.0        tibble_3.2.1           
 #> [61] pillar_1.10.2           rappdirs_0.3.3          htmltools_0.5.8.1       rhdf5filters_1.21.0     R6_2.6.1               
 #> [66] dbplyr_2.5.0            httr2_1.1.2             vroom_1.6.5             evaluate_1.0.3          lattice_0.22-7         
 #> [71] readr_2.1.5             AnnotationHub_3.99.0    Rsamtools_2.25.0        png_0.1-8               R.methodsS3_1.8.2      
 #> [76] memoise_2.0.1           BiocStyle_2.37.0        SparseArray_1.9.0       xfun_0.52               pkgconfig_2.0.3
de95dafd
 ```
fda59ec3
 
 </details>