Calculate or add summary statistics for read-level base modification data
Source:R/calcReadStats.R
calcReadStats.Rd
calcReadStats
calculates various per-read summary statistics on
modification probabilities or calls from a
SummarizedExperiment
object with genomic
positions in rows and samples in columns. addReadStats
adds them to
the colData
under name
. See details for more information on the
statistics that are calculated.
Usage
calcReadStats(
se,
assayName = "mod_prob",
stats = NULL,
regions = NULL,
sequenceContext = NULL,
minNobsPpos = 0,
minNobsPread = 0,
LowConf = 0.7,
LagRange = c(12, 64),
BPPARAM = MulticoreParam(4L, RNGseed = 42L),
verbose = FALSE
)
addReadStats(se, ..., name = "QC")
Arguments
- se
A
RangedSummarizedExperiment
object with assayassayName
typically returned byreadModkitExtract
orreadModBam
.- assayName
A character scalar specifying the assay of
se
containing the read-level data to be summarized. Typically, this assay contains modification probabilities.- stats
Character vector specifying which statistics to calculate. When set to
NULL
all available statistics except the sample entropy are calculated. See details for available read statistics.- regions
A
GRanges
object limiting the positions included in the calculations to the ones overlapping the corresponding genomic regions. Alternatively, regions can be specified as a character vector (e.g. "chr1:1200-1300") that can be coerced into aGRanges
object.- sequenceContext
A character vector with sequence context(s) to include in the calculations. Only positions that match one of the provided sequence contexts will be included. Sequence contexts can be provided using IUPAC redundancy codes. The sequence contexts of modified bases are obtained from
rowData(se)$sequenceContext
and thus requires thatse
contains the appropriate information, for example by setting thesequenceContextWidth
andsequenceReference
arguments ofreadModkitExtract
when it was generated, or by adding it usingaddSeqContext
.- minNobsPpos
A numeric scalar value >=1 indicating the minimum coverage on individual positions for them to be included in the calculations. In high coverage data this is an effective filter for removing spurious modbases, typically the result of erroneous basecalling. The default
NULL
sets its value to Q3-0.5*IQR, where Q3 and IQR are the third quartile and interquartile range of the coverage distribution estimated from the data inse
.- minNobsPread
A numeric scalar with the minimum number of observed modifiable bases per read for it to be included in the calculations.
NA
values are returned for the reads that do not pass this threshold.- LowConf
A numeric scalar with the minimum call confidence below which calls are considered "low confidence".
- LagRange
A numeric vector of two values (minimum and maxium) defining the range of lags for the calculation of autocorrelation and partial autocorrelation (see details section).
- BPPARAM
A
BiocParallelParam
object that controls the number of parallel CPU threads to use for calculating read statistics. The default value is (MulticoreParam(4L, RNGseed = 42L)
).- verbose
If
TRUE
, report on progress.- ...
For
addReadStats
only: Additional arguments passed on tocalcReadStats
.- name
For
addReadStats
only: A character scalar specifying the name to be used to store the result in thecolData
of the output.
Value
For calcReadStats
, a SimpleList
object with summary statistics
for the samples (columns) in se
.
For addReadStats
, a SummarizedExperiment
object with summary statistics added to the name
column of
colData
.
Details
calcReadStats
calculates a collection of location/scatter statistics
and information theoretic/signal-processing metrics for the modification
probability, confidence or modification call value vectors across individual
reads (data in assay assayName
). Only bases matching the criteria
given byregions
, sequenceContext
, minNobsPpos
and
minNobsPread
are included in the calculations. The values of these
filtering parameters are stored in the attribute of the output.
stats
selects the summaries to be calculated. Currently available
values are:
- MeanModProb
: Mean modification probability across the read.
- FracMod
: Fraction of confidently called bases (either modified or unmodified), that are modified. By default, all bases are considered confidently called.
- MeanConf
: Mean call confidence across the read.
- MeanConfUnm
: Mean call confidence confined to unmodified bases (modifiable bases with modification probability < 0.5).
- MeanConfMod
: Mean call confidence confined to modified bases (modifiable bases with modification probability >= 0.5).
- FracLowConf
: Fraction of modifiable bases called with low confidence (call confidence <
LowConf
).- IQRModProb
: Interquartile range of modification probabilities across the read.
- sdModProb
: Standard deviation of modification probabilities across the read.
- SEntrModProb
: Sample entropy of the modification probability signal. Sample entropy is a metric assessing the complexity of one-dimensional physiological signals. The higher the sample entropy, the more irregular, unpredictable and therefore complex the signal. See wikipedia:Sample_entropy for more details.
- Lag1DModProb
: Mean Lag1 differences of modification calls, defined as:
mean(Mod[i]-Mod[i-1])
, whereMod
is a{0,1}
modification call.- ACModProb
: Autocorrelation of the modification probability values for lags in the range
LagRange
. This range typically covers the signal of nucleosome periodicity.- PACModProb
: Partial autocorrelation of the modification probability values for lags in the range
LagRange
. This range typically covers the signal of nucleosome periodicity.
Examples
# load example data
library(SummarizedExperiment)
modbamfile <- system.file("extdata", "6mA_1_10reads.bam",
package = "footprintR")
se <- readModBam(bamfile = modbamfile, regions = "chr1:6940000-6955000",
modbase = "a", verbose = TRUE,
BPPARAM = BiocParallel::SerialParam())
#> ℹ extracting base modifications from modBAM files
#> ℹ finding unique genomic positions...
#> ✔ finding unique genomic positions... [18ms]
#>
#> ℹ collapsed 11300 positions to 4772 unique ones
#> ✔ collapsed 11300 positions to 4772 unique ones [421ms]
#>
readStats <- calcReadStats(se, BPPARAM = BiocParallel::SerialParam())
readStats$s1
#> DataFrame with 3 rows and 11 columns
#> MeanModProb FracMod MeanConf
#> <numeric> <numeric> <numeric>
#> s1-233e48a7-f379-4dcf-9270-958231125563 0.1652868 0.1498969 0.943914
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.0835376 0.0646707 0.958320
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d 0.0556013 0.0347512 0.965854
#> MeanConfUnm MeanConfMod FracLowConf
#> <numeric> <numeric> <numeric>
#> s1-233e48a7-f379-4dcf-9270-958231125563 0.957961 0.864255 0.0680724
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.967633 0.823622 0.0470060
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d 0.971512 0.808703 0.0355852
#> IQRModProb sdModProb Lag1DModProb
#> <numeric> <numeric> <numeric>
#> s1-233e48a7-f379-4dcf-9270-958231125563 0.1308594 0.312860 0.1474094
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.0488281 0.215142 0.0868524
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d 0.0000000 0.164023 0.0536707
#> ACModProb
#> <list>
#> s1-233e48a7-f379-4dcf-9270-958231125563 0.1306906,0.0840176,0.0591518,...
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.0619512,0.0479904,0.0392903,...
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d -0.00193260,-0.01562365,-0.00424019,...
#> PACModProb
#> <list>
#> s1-233e48a7-f379-4dcf-9270-958231125563 -0.0415954,-0.0216324,-0.0132802,...
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 -0.00443190,-0.00209608,-0.00710440,...
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d -0.01637500, 0.00794205,-0.02053895,...
se_withReadStats <- addReadStats(se, name = "QC",
BPPARAM = BiocParallel::SerialParam())
se_withReadStats$QC$s1
#> DataFrame with 3 rows and 11 columns
#> MeanModProb FracMod MeanConf
#> <numeric> <numeric> <numeric>
#> s1-233e48a7-f379-4dcf-9270-958231125563 0.1652868 0.1498969 0.943914
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.0835376 0.0646707 0.958320
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d 0.0556013 0.0347512 0.965854
#> MeanConfUnm MeanConfMod FracLowConf
#> <numeric> <numeric> <numeric>
#> s1-233e48a7-f379-4dcf-9270-958231125563 0.957961 0.864255 0.0680724
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.967633 0.823622 0.0470060
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d 0.971512 0.808703 0.0355852
#> IQRModProb sdModProb Lag1DModProb
#> <numeric> <numeric> <numeric>
#> s1-233e48a7-f379-4dcf-9270-958231125563 0.1308594 0.312860 0.1474094
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.0488281 0.215142 0.0868524
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d 0.0000000 0.164023 0.0536707
#> ACModProb
#> <list>
#> s1-233e48a7-f379-4dcf-9270-958231125563 0.1306906,0.0840176,0.0591518,...
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 0.0619512,0.0479904,0.0392903,...
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d -0.00193260,-0.01562365,-0.00424019,...
#> PACModProb
#> <list>
#> s1-233e48a7-f379-4dcf-9270-958231125563 -0.0415954,-0.0216324,-0.0132802,...
#> s1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9 -0.00443190,-0.00209608,-0.00710440,...
#> s1-92e906ae-cddb-4347-a114-bf9137761a8d -0.01637500, 0.00794205,-0.02053895,...
metadata(se_withReadStats$QC$s1)
#> list()