vignettes/BiocPkgTools.Rmd
eb04b614
 ---
f216925b
 title: "Overview of BiocPkgTools"
eb04b614
 author:
fe3872d0
 - name: Shian Su
   affiliation: Walter and Eliza Hall Institute, Melbourne, Australia
a6a83be4
 - name: Vince Carey
e03f2c6a
   affiliation: Channing Lab, Brigham and Womens Hospital, Harvard University, Boston, MA USA
3fee48e1
 - name: Marcel Ramos
7aa10a79
   affiliation: CUNY School of Public Health, New York, NY USA
a6a83be4
 - name: Lori Shepherd
e03f2c6a
   affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY USA
a6a83be4
 - name: Martin Morgan
e03f2c6a
   affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY USA
fe3872d0
 - name: Sean Davis
   affiliation: National Cancer Institute, National Institutes of Health, Bethesda, MD USA
   email: [email protected]
eb04b614
 package: BiocPkgTools
 output:
   BiocStyle::html_document:
     toc: false
 abstract: |
   Bioconductor has a rich ecosystem of metadata around packages, usage, and build status. This package is a simple collection of functions to access that metadata from R in a tidy data format. The goal is to expose metadata for data mining and value-added functionality such as package searching, text mining, and analytics on packages. 
 vignette: |
a6a83be4
   %\VignetteIndexEntry{Overview of BiocPkgTools}
eb04b614
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
a6a83be4
 # Introduction
 
3fee48e1
 Bioconductor has a rich ecosystem of metadata around packages, usage, and build
 status. This package is a simple collection of functions to access that metadata
 from R in a tidy data format. The goal is to expose metadata for data mining and
 value-added functionality such as package searching, text mining, and analytics
 on packages. 
eb04b614
 
 Functionality includes access to :
 
 - Download statistics
 - General package listing
 - Build reports
2cdbee28
 - Package dependency graphs
eb04b614
 - Vignettes
 
 ```{r init, include=FALSE}
 library(knitr)
2cdbee28
 opts_chunk$set(warning = FALSE, message = FALSE, cache=FALSE)
a6a83be4
 ```
 
 ```{r style, echo = FALSE, results = 'asis'}
 BiocStyle::markdown()
eb04b614
 ```
 
 # Build reports
 
a6a83be4
 The Bioconductor build reports are available online as HTML pages. 
 However, they are not very computable.
 The `biocBuildReport` function does some heroic parsing of the HTML
 to produce a *tidy* data.frame for further processing in R. 
 
eb04b614
 ```{r}
 library(BiocPkgTools)
a6a83be4
 head(biocBuildReport())
eb04b614
 ```
 
a6a83be4
 ## Personal build report
 
3fee48e1
 Because developers may be interested in a quick view of their own packages,
 there is a simple function, `problemPage`, to produce an HTML report of the
 build status of packages matching a given author *regex* supplied to the
 `authorPattern` argument. The default is to report only "problem" build statuses
 (ERROR, WARNING).
a6a83be4
 
50ef3f60
 ```{r eval=FALSE}
17396fdd
 problemPage(authorPattern = "V.*Carey")
 ```
 
3fee48e1
 In similar fashion, maintainers of packages that have many downstream packages
 that depend on them may wish to check that a change they introduced hasn't
 suddenly broken a large number of these.  You can use the `dependsOn` argument
 to produce the summary report of those packages that "depend on" the given
 package.
17396fdd
 
 ```{r eval=FALSE}
 problemPage(dependsOn = "limma")
a6a83be4
 ```
 
 When run in an interactive environment, the `problemPage` function 
 will open a browser window for user interaction. Note that if you want
 to include all your package results, not just the broken ones, simply 
 specify `includeOK = TRUE`.
 
eb04b614
 # Download statistics
 
a6a83be4
 Bioconductor supplies download stats for all packages. The `biocDownloadStats`
 function grabs all available download stats for all packages in all
 Experiment Data, Annotation Data, and Software packages. The results
 are returned as a tidy data.frame for further analysis.
 
eb04b614
 ```{r}
a6a83be4
 head(biocDownloadStats())
eb04b614
 ```
 
3fee48e1
 The download statistics reported are for ***all available versions*** of a
 package. There are no separate, publicly available statistics broken down by
 version.
 The majority of Bioconductor Software packages are also available through other
 channels such as Anaconda, who also provided download statistics for packages
 installed from their repositories.  Access to these counts is provided by the
 `anacondaDownloadStats` function:
17396fdd
 
 ```{r}
 head(anacondaDownloadStats())
 ```
 
3fee48e1
 Note that Anaconda do not provide counts for distinct IP addresses, but this
 column is included for compatibility with the Bioconductor count tables.
17396fdd
 
eb04b614
 # Package details
 
a6a83be4
 The R `DESCRIPTION` file contains a plethora of information regarding package
3fee48e1
 authors, dependencies, versions, etc. In a repository such as Bioconductor,
 these details are available in bulk for all included packages. The `biocPkgList`
 returns a data.frame with a row for each package. Tons of information are
 available, as evidenced by the column names of the results.
a6a83be4
 
 ```{r}
 bpi = biocPkgList()
 colnames(bpi)
 ```
 
 Some of the variables are parsed to produce `list` columns. 
 
 ```{r}
 head(bpi)
 ```
 
 As a simple example of how these columns can be used, extracting
 the `importsMe` column to find the packages that import the 
 `r Biocpkg("GEOquery")` package.
 
 
eb04b614
 ```{r}
 require(dplyr)
a6a83be4
 bpi = biocPkgList()
eb04b614
 bpi %>% 
a6a83be4
     filter(Package=="GEOquery") %>%
     pull(importsMe) %>%
     unlist()
eb04b614
 ```
 
fe3872d0
 # Package Explorer
 
 For the end user of Bioconductor, an analysis often starts with finding a
 package or set of packages that perform required tasks or are tailored 
e01aac8f
 to a specific operation or data type. The `biocExplore()` function
fe3872d0
 implements an interactive bubble visualization with filtering based on 
 biocViews terms. Bubbles are sized based on download statistics. Tooltip
 and detail-on-click capabilities are included. To start a local session:
 
50ef3f60
 ```{r biocExplore}
 biocExplore()
fe3872d0
 ```
 
2cdbee28
 # Dependency graphs
 
 The Bioconductor ecosystem is built around the concept of interoperability
 and dependencies. These interdependencies are available as part of the
 `biocPkgList()` output. The `BiocPkgTools` provides some convenience
 functions to convert package dependencies to R graphs. A modular approach leads
 to the following workflow.
 
 1. Create a `data.frame` of dependencies using `buildPkgDependencyDataFrame`.
3fee48e1
 2. Create an `igraph` object from the dependency data frame using
   `buildPkgDependencyIgraph`
2cdbee28
 3. Use native `igraph` functionality to perform arbitrary network operations. 
3fee48e1
   Convenience functions, `inducedSubgraphByPkgs` and `subgraphByDegree` are
   available.
2cdbee28
 4. Visualize with packages such as `r CRANpkg("visNetwork")`.
 
 
 ## Working with dependency graphs
 
 A dependency graph for all of Bioconductor is a starting place.
 
 ```{r}
 dep_df = buildPkgDependencyDataFrame()
 g = buildPkgDependencyIgraph(dep_df)
 g
 library(igraph)
 head(V(g))
 head(E(g))
 ```
 
 See `inducedSubgraphByPkgs` and `subgraphByDegree` to produce
 subgraphs based on a subset of packages.
 
 See the igraph documentation for more detail on graph analytics, setting
 vertex and edge attributes, and advanced subsetting.
 
 ## Graph visualization
 
 The visNetwork package is a nice interactive visualization tool
 that implements graph plotting in a browser. It can be integrated
 into shiny applications. Interactive graphs can also be included in
 Rmarkdown documents (see vignette)
 
 ```{r}
 igraph_network = buildPkgDependencyIgraph(buildPkgDependencyDataFrame())
 ``` 
 
 The full dependency graph is really not that informative to look at, though
 doing so is possible. A common use case is to visualize the graph of
 dependencies "centered" on a package of interest. In this case, I will 
 focus on the `r Biocpkg("GEOquery")` package. 
 
 ```{r}
 igraph_geoquery_network = subgraphByDegree(igraph_network, "GEOquery")
 ```
 
 The `subgraphByDegree()` function returns all nodes and connections within
 `degree` of the named package; the default `degree` is `1`.
 
 The visNework package can plot `igraph` objects directly, but more flexibility
 is offered by first converting the graph to visNetwork form.
 
 ```{r}
 library(visNetwork)
 data <- toVisNetworkData(igraph_geoquery_network)
 ```
 
 The next few code chunks highlight just a few examples of the visNetwork
 capabilities, starting with a basic plot.
 
50ef3f60
 ```{r }
2cdbee28
 visNetwork(nodes = data$nodes, edges = data$edges, height = "500px")
 ```
 
 For fun, we can watch the graph stabilize during drawing, best viewed
 interactively.
 
 ```{r}
 visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
     visPhysics(stabilization=FALSE)
 ```
 
 Add arrows and colors to better capture dependencies.
 
 ```{r}
 data$edges$color='lightblue'
 data$edges[data$edges$edgetype=='Imports','color']= 'red'
 data$edges[data$edges$edgetype=='Depends','color']= 'green'
 
 visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
     visEdges(arrows='from') 
 ```
 
 Add a legend.
 
 ```{r}
 ledges <- data.frame(color = c("green", "lightblue", "red"),
   label = c("Depends", "Suggests", "Imports"), arrows =c("from", "from", "from"))
 visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
   visEdges(arrows='from') %>%
   visLegend(addEdges=ledges)
 ```
a6a83be4
 
e03f2c6a
 ## Integration with `r Biocpkg("BiocViews")`
 
fe3872d0
 [Work in progress]
 
e03f2c6a
 The `r Biocpkg("biocViews")` package is a small ontology of terms describing
3fee48e1
 Bioconductor packages. This is a work-in-progress section, but here is a small
 example of plotting the biocViews graph.
e03f2c6a
 
 ```{r biocViews}
 library(biocViews)
 data(biocViewsVocab)
 biocViewsVocab
 library(igraph)
 g = igraph.from.graphNEL(biocViewsVocab)
 library(visNetwork)
 gv = toVisNetworkData(g)
 visNetwork(gv$nodes, gv$edges, width="100%") %>%
fe3872d0
     visIgraphLayout(layout = "layout_as_tree", circular=TRUE) %>%
     visNodes(size=20) %>%
     visPhysics(stabilization=FALSE)
e03f2c6a
 ```
 
2b05f079
 # Dependency burden
 
 The dependency burden of a package, namely the amount of functionality that
 a given package is importing, is an important parameter to take into account
 during package development. A package may break because one or more of its
 dependencies have changed the part of the API our package is importing or
 this part has even broken. For this reason, it may be useful for package
 developers to quantify the dependency burden of a given package. To do that
 we should first gather all dependency information using the function
 `buildPkgDependencyDataFrame()` but setting the arguments to work with
 packages in Bioconductor and CRAN and dependencies categorised as `Depends`
 or `Imports`, which are the ones installed by default for a given package.
 
 ```{r}
 
 depdf <- buildPkgDependencyDataFrame(repo=c("BioCsoft", "CRAN"),
                                      dependencies=c("Depends", "Imports"))
dbf1ceed
 dim(depdf)
 head(depdf)  # too big to show all
2b05f079
 ```
 
 Finally, we call the function `pkgDepMetrics()` to obtain different metrics
 on the dependency burden of a package we want to analyze, in the case below,
 the package `BiocPkgTools` itself:
 
 ```{r}
 pkgDepMetrics("BiocPkgTools", depdf)
 ```
 
 In this resulting table, rows correspond to dependencies and columns provide
 the following information:
 
   * `ImportedAndUsed`: number of functionality calls imported and used in
     the package.
   * `Exported`: number of functionality calls exported by the dependency.
   * `Usage`: (`ImportedAndUsed`x 100) / `Exported`. This value provides an
     estimate of what fraction of the functionality of the dependency is
     actually used in the given package.
   * `DepOverlap`: Similarity between the dependency graph structure of the
     given package and the one of the dependency in the corresponding row,
3fee48e1
     estimated as the
     [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index)
2b05f079
     between the two sets of vertices of the corresponding graphs. Its values
     goes between 0 and 1, where 0 indicates that no dependency is shared, while
     1 indicates that the given package and the corresponding dependency depend
     on an identical subset of packages.
593acca1
   * `DepGainIfExcluded`: The 'dependency gain' (decrease in the total number 
     of dependencies) that would be obtained if this package was excluded 
     from the list of direct dependencies. 
2b05f079
 
 The reported information is ordered by the `Usage` column to facilitate the
 identification of dependencies for which the analyzed package is using a small
 fraction of their functionality and therefore, it could be easier remove them.
 To aid in that decision, the column `DepOverlap` reports the overlap of the
 dependency graph of each dependency with the one of the analyzed package. Here
 a value above, e.g., 0.5, could, albeit not necessarily, imply that removing
3fee48e1
 that dependency could substantially lighten the dependency burden of the
 analyzed package.
2b05f079
 
 An `NA` value in the `ImportedAndUsed` column indicates that the function
 `pkgDepMetrics()` could not identify what functionality calls in the analyzed
 package are made to the dependency. This may happen because `pkgDepMetrics()`
 has failed to identify the corresponding calls, as it happens with imported
4cc5b984
 built-in constants such as `DNA_BASES` from `Biostrings`, or that although the
3fee48e1
 given package is importing that dependency, none of its functionality is
 actually being used. In such a case, this dependency could be safely removed
 without any further change in the analyzed package.
2b05f079
 
 We can find out what actually functionality calls are we importing as follows:
 
 ```{r}
 imp <- pkgDepImports("BiocPkgTools")
 imp %>% filter(pkg == "DT")
 ```
 
dbf1ceed
 # Identities of maintainers
 
 It is important to be able to identify the maintainer of a package in
 a reliable way.  The DESCRIPTION file for a package can include an `Authors@R`
 field.  This field can capture metadata about maintainers and contributors
 in a programmatically accessible way.  Each element of the role
 field of a `person` (see `?person`) in the `Authors@R` field comes
 from a subset of the 
 [relations vocabulary of the Library of Congress](https://www.loc.gov/marc/relators/relaterm.html).  
 
 Metadata about maintainers can be extracted from DESCRIPTION in various ways.
 As of October 2022, we focus on the use of the ORCID field which is
 an optional `comment` component in a `person` element.  For example,
 in the DESCRIPTION for the AnVIL package we have
 ```
 Authors@R:
     c(person(
         "Martin", "Morgan", role = c("aut", "cre"),
         email = "[email protected]",
         comment = c(ORCID = "0000-0002-5874-8148")
      ),
 ```
 This convention is used for a fair number of Bioconductor and CRAN packages.
 
 We'll demonstrate the use of `get_cre_orcids` with some packages.
 ```{r demoorc}
 inst = rownames(installed.packages())
 cands = c("devtools", "evaluate", "ggplot2", "GEOquery", "gert", "utils")
 totry = intersect(cands, inst)
 oids = get_cre_orcids(totry)
 oids
 ```
 
 We use the ORCID API to tabulate metadata about the holders of these IDs.
 We'll avoid evaluating this because a token must be refreshed for the query
 to succeed.
 
 ```{r lktab,eval=FALSE}
 orcid_table(oids)
 ```
 In October 2022 the result is
 ```
 > orcid_table(.Last.value)
                         name                                            org
 devtools      Jennifer Bryan                                        RStudio
 evaluate           Yihui Xie                                  RStudio, Inc.
 ggplot2  Thomas Lin Pedersen                                        RStudio
 GEOquery          Sean Davis University of Colorado Anschutz Medical Campus
 gert             Jeroen Ooms            Berkeley Institute for Data Science
 utils                   <NA>                                           <NA>
                city   region country               orcid
 devtools     Boston       MA      US 0000-0002-6983-2759
 evaluate    Elkhorn       NE      US 0000-0003-0645-5666
 ggplot2  Copenhagen     <NA>      DK 0000-0002-5147-4711
 GEOquery     Aurora Colorado      US 0000-0002-8991-6458
 gert       Berkeley       CA      US 0000-0002-4035-0289
 utils          <NA>     <NA>    <NA>                <NA>
 ```
 
a6a83be4
 # Provenance
 
 ```{r}
 sessionInfo()
 ```