Calculate a variety of read-level modified basecalling summary statistics
Source:R/calcReadStats.R
calcReadStats.Rd
This function calculates various per-read summary statistics on modification
probabilities or calls from a SummarizedExperiment
object with genomic positions in rows and reads in columns. See details
for more information on the statistics that are calculated.
Usage
calcReadStats(
se,
regions = NULL,
sequence.context = NULL,
stats = NULL,
min.Nobs.ppos = NULL,
min.Nobs.pread = 0,
LowConf = 0.7,
LagRange = c(12, 64),
verbose = FALSE
)
Arguments
- se
A
RangedSummarizedExperiment
object with assay"mod_prob"
typically returned byreadModkitExtract
orreadModBam
.- regions
A
GRanges
object limiting the positions included in the calculations to the ones overlapping the corresponding genomic regions. Alternatively, regions can be specified as a character vector (e.g. "chr1:1200-1300") that can be coerced into aGRanges
object.- sequence.context
A character vector with sequence context(s) to include in the calculations. Only positions that match one of the provided sequence contexts will be included. Sequence contexts can be provided using IUPAC redundancy codes. The sequence contexts of modified bases are obtained from
rowData(se)$sequence.context
and thus requires thatse
contains the appropriate information, for example by setting thesequence.context
andsequence.reference
arguments ofreadModkitExtract
when it was generated, or by adding it usingaddSeqContext
.- stats
Character vector specifying which statistics to calculate. When set to
NULL
all available statistics are calculated. See details for a complete list of available read statistics.- min.Nobs.ppos
A numeric scalar value >=1 indicating the minimum coverage on individual positions for them to be included in the calculations. In high coverage data this is an effective filter for removing spurious modbases, typically the result of erroneous basecalling. The default
NULL
sets its value to Q3-0.5*IQR, where Q3 and IQR are the third quartile and interquartile range of the coverage distribution estimated from the data inse
.- min.Nobs.pread
A numeric scalar with the minimum number of observed modifiable bases per read for it to be included in the calculations.
NA
values are returned for the reads that do not pass this threshold.- LowConf
A numeric scalar with the minimum call confidence below which calls are considered "low confidence".
- LagRange
A numeric vector of two values (minimum and maxium) defining the range of lags for the calculation of autocorrelation and partial autocorrelation (see details section).
- verbose
If
TRUE
, report on progress.
Details
Calculates a collection of location/scatter statistics and information
theoretic/signal-processing metrics for the modification probability,
confidence or modification call value vectors across individual reads. When
sequence.context
, min.coverage
or min.Nobs.ppos
filters
are enforced, only modifiable bases passing the filters are included in the
calculations. The following statistics are available:
- MeanModProb
: Mean modification probability across the read.
- FracMod
: Fraction of confidently modified bases, defined as the ratio of modifiable bases with modification probability >= 0.5 over all modifiable bases.
- MeanConf
: Mean call confidence across the read.
- MeanConfUnm
: Mean call confidence confined to unmodified bases (modifiable bases with modification probability < 0.5).
- MeanConfMod
: Mean call confidence confined to modified bases (modifiable bases with modification probability >= 0.5).
- FracLowConf
: Fraction of modifiable bases called with low confidence (call confidence <
LowConf
).- IQRModProb
: Interquartile range of modification probabilities across the read.
- sdModProb
: Standard deviation of modification probabilities across the read.
- SEntrModProb
: Sample entropy of the modification probability signal. Sample entropy is a metric assessing the complexity of one-dimensional physiological signals. The higher the sample entropy, the more irregular, unpredictable and therefore complex the signal. See wikipedia:Sample_entropy for more details.
- Lag1DModProb
: Mean Lag1 differences of modification calls, defined as:
mean(Mod[i]-Mod[i-1])
, whereMod
is a{0,1}
modification call.- ACModProb
: Autocorrelation of the modification probability values for lags in the range
LagRange
. This range typically covers the signal of nucleosome periodicity.- PACModProb
: Partial autocorrelation of the modification probability values for lags in the range
LagRange
. This range typically covers the signal of nucleosome periodicity.
Examples
library(SummarizedExperiment)
modbamfile <- system.file("extdata", "6mA_1_10reads.bam",
package = "footprintR")
se <- readModBam(bamfile = modbamfile, regions = "chr1:6940000-6955000",
modbase = "a", verbose = TRUE)
#> extracting base modifications from modBAM files
#> opening input file /Users/runner/work/_temp/Library/footprintR/extdata/6mA_1_10reads.bam
#> reading alignments overlapping any of 1 regions
#> removed 1724 unaligned (e.g. soft-masked) of 25090 called bases
#> read 3 alignments
#> finding unique genomic positions...
#> collapsed 11300 positions to 4772 unique ones
ReadStats <- calcReadStats(se)
ReadStats[["s1"]]
#> DataFrame with 3 rows and 13 columns
#> MeanModProb FracMod MeanConf
#> <numeric> <numeric> <numeric>
#> s1-233e48a7-f379-4dcf-9270-958231125563 0.1813986 0.1540698 0.931635
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.0969716 0.0632411 0.944381
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d 0.0714111 0.0353929 0.950393
#> MeanConfUnm MeanConfMod FracLowConf
#> <numeric> <numeric> <numeric>
#> s1-233e48a7-f379-4dcf-9270-958231125563 0.943439 0.866826 0.0668605
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.952309 0.826942 0.0456066
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d 0.955617 0.808031 0.0371926
#> IQRModProb sdModProb SEntrModProb
#> <numeric> <numeric> <numeric>
#> s1-233e48a7-f379-4dcf-9270-958231125563 0.114766 0.310155 0.210238
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.000000 0.208093 0.161245
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d 0.000000 0.160210 0.193140
#> Lag1DModProb
#> <numeric>
#> s1-233e48a7-f379-4dcf-9270-958231125563 0.1546961
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.0821168
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d 0.0519052
#> ACModProb
#> <list>
#> s1-233e48a7-f379-4dcf-9270-958231125563 0.1400568,0.0864025,0.0542389,...
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.0490902,0.0539862,0.0119943,...
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d -0.00389478,-0.01949284,-0.00804977,...
#> PACModProb
#> <list>
#> s1-233e48a7-f379-4dcf-9270-958231125563 -0.04826710,-0.03493131,-0.00989425,...
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.01549346,-0.03706194, 0.00959782,...
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d -0.01961229, 0.00615528,-0.01833084,...
#> sample
#> <character>
#> s1-233e48a7-f379-4dcf-9270-958231125563 s1
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 s1
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d s1