Add a variety of read-level modified basecalling summary statistics to a SummarizedExperiment

This function calculates various per-read summary statistics on modification probabilities or calls from a SummarizedExperiment object with genomic positions in rows and reads in columns and adds them to the colData. See details for more information on the statistics that are calculated.

Usage

addReadStats(
  se,
  regions = NULL,
  sequence.context = NULL,
  stats = NULL,
  min.Nobs.ppos = NULL,
  min.Nobs.pread = 0,
  LowConf = 0.7,
  LagRange = c(12, 64),
  verbose = FALSE
)

Arguments

se: A RangedSummarizedExperiment object with assay "mod_prob" typically returned by readModkitExtract or readModBam.
regions: A GRanges object limiting the positions included in the calculations to the ones overlapping the corresponding genomic regions. Alternatively, regions can be specified as a character vector (e.g. "chr1:1200-1300") that can be coerced into a GRanges object.
sequence.context: A character vector with sequence context(s) to include in the calculations. Only positions that match one of the provided sequence contexts will be included. Sequence contexts can be provided using IUPAC redundancy codes. The sequence contexts of modified bases are obtained from rowData(se)$sequence.context and thus requires that se contains the appropriate information, for example by setting the sequence.context and sequence.reference arguments of readModkitExtract when it was generated, or by adding it using addSeqContext.
stats: Character vector specifying which statistics to calculate. When set to NULL all available statistics are calculated. See details for a complete list of available read statistics.
min.Nobs.ppos: A numeric scalar value >=1 indicating the minimum coverage on individual positions for them to be included in the calculations. In high coverage data this is an effective filter for removing spurious modbases, typically the result of erroneous basecalling. The default NULL sets its value to Q3-0.5*IQR, where Q3 and IQR are the third quartile and interquartile range of the coverage distribution estimated from the data in se.
min.Nobs.pread: A numeric scalar with the minimum number of observed modifiable bases per read for it to be included in the calculations. NA values are returned for the reads that do not pass this threshold.
LowConf: A numeric scalar with the minimum call confidence below which calls are considered "low confidence".
LagRange: A numeric vector of two values (minimum and maxium) defining the range of lags for the calculation of autocorrelation and partial autocorrelation (see details section).
verbose: If TRUE, report on progress.

Value

A SummarizedExperiment object with colData filtered for positions according to regions, sequence.context and min.Nobs.ppos arguments and extended to include the read statistics in its row- and column-data.

Details

Calculates a collection of location/scatter statistics and information theoretic/signal-processing metrics for the modification probability, confidence or modification call value vectors across individual reads. When sequence.context, min.coverage or min.Nobs.ppos filters are enforced, only modifiable bases passing the filters are included in the calculations. The following statistics are available:

MeanModProb: : Mean modification probability across the read.
FracMod: : Fraction of confidently modified bases, defined as the ratio of modifiable bases with modification probability >= 0.5 over all modifiable bases.
MeanConf: : Mean call confidence across the read.
MeanConfUnm: : Mean call confidence confined to unmodified bases (modifiable bases with modification probability < 0.5).
MeanConfMod: : Mean call confidence confined to modified bases (modifiable bases with modification probability >= 0.5).
FracLowConf: : Fraction of modifiable bases called with low confidence (call confidence < LowConf).
IQRModProb: : Interquartile range of modification probabilities across the read.
sdModProb: : Standard deviation of modification probabilities across the read.
SEntrModProb: : Sample entropy of the modification probability signal. Sample entropy is a metric assessing the complexity of one-dimensional physiological signals. The higher the sample entropy, the more irregular, unpredictable and therefore complex the signal. See wikipedia:Sample_entropy for more details.
Lag1DModProb: : Mean Lag1 differences of modification calls, defined as: mean(Mod[i]-Mod[i-1]), where Mod is a {0,1} modification call.
ACModProb: : Autocorrelation of the modification probability values for lags in the range LagRange. This range typically covers the signal of nucleosome periodicity.
PACModProb: : Partial autocorrelation of the modification probability values for lags in the range LagRange. This range typically covers the signal of nucleosome periodicity.

Author

Panagiotis Papapasaikas

Examples

library(SummarizedExperiment)
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: ‘MatrixGenerics’
#> The following objects are masked from ‘package:matrixStats’:
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
#>     union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: ‘S4Vectors’
#> The following object is masked from ‘package:utils’:
#> 
#>     findMatches
#> The following objects are masked from ‘package:base’:
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: ‘Biobase’
#> The following object is masked from ‘package:MatrixGenerics’:
#> 
#>     rowMedians
#> The following objects are masked from ‘package:matrixStats’:
#> 
#>     anyMissing, rowMedians
modbamfile <- system.file("extdata", "6mA_1_10reads.bam",
                          package = "footprintR")
se <- readModBam(bamfile = modbamfile, regions = "chr1:6940000-6955000",
           modbase = "a", verbose = TRUE)
#> extracting base modifications from modBAM files
#>     opening input file /Users/runner/work/_temp/Library/footprintR/extdata/6mA_1_10reads.bam
#>     reading alignments overlapping any of 1 regions
#>     removed 1724 unaligned (e.g. soft-masked) of 25090 called bases
#>     read 3 alignments
#> finding unique genomic positions...
#> collapsed 11300 positions to 4772 unique ones
se_withReadStats <- addReadStats(se)
rowData(se_withReadStats)
#> DataFrame with 4772 rows and 0 columns
colData(se_withReadStats)
#> DataFrame with 1 row and 5 columns
#>         sample     modbase   n_reads                  qscore
#>    <character> <character> <integer>                  <List>
#> s1          s1           a         3 14.1428,16.0127,20.3082
#>                                                                                                        QC
#>                                                                                                    <List>
#> s1 0.1813986:0.1540698:0.931635:...,0.0969716:0.0632411:0.944381:...,0.0714111:0.0353929:0.950393:...,...