Scan one or more chromosomes for high-scoring regions
Source:R/scanGenome.R
scanForHighScoringRegions.Rd
Given one or several modBam files and a vector with chromosome lengths,
quantify the modification data in windows using quantFunction
,
calculate a score for each window using scoreFunction
and fuse
neighboring high-scoring windows. For example, the quantification could
sum up observed counts in each window (implemented in sumNmodNvalid
),
and the score could be a directional P value measuring differential
modification between two groups of modBam files (implemented in
getDifferentiallyModifiedWindows
), and the identified windows would
then be differentially modified regions (DMRs). Please note that
quantFunction
needs to be compatible with scoreFunction
, as
the latter consumes the output of the former.
Usage
scanForHighScoringRegions(
bamfiles,
sampleAnnot,
chromosomeLengths,
quantFunction = "sumNmodNvalid",
quantFunctionArgs = list(),
scoreFunction = "getDifferentiallyModifiedWindows",
scoreFunctionArgs = list(),
modbase,
modProbThreshold = 0.5,
tileSize = 1e+06,
seqinfo = NULL,
sequenceContextWidth = 0,
sequenceReference = NULL,
sequenceContext = NULL,
windowMode = "fixed",
windowSize = 24,
windowStep = round(windowSize/2),
scoreCol = "dirNegLog10PValue",
scoreAction = "smoothFuse",
minperiod = 3,
thresh = 3,
maxGap = 50,
BPPARAM = MulticoreParam(4L, RNGseed = 42L),
verbose = FALSE
)
Arguments
- bamfiles
Character vector with one or several modBam file names. Note that no read-filtering will be performed on the data from these files, so possibly the files should contain only filtered alignments.
- sampleAnnot
A
data.frame
(orNULL
) providing annotations for the samples. It must contain at least one column, named"sample"
, which must contain all the values ofnames(bamfiles)
. The provided annotations will be propagated to the returnedSummarizedExperiment
object.- chromosomeLengths
A named character vector with lengths of chromosomes to be analyzed, for example the return value of
seqlengths(BSgenomeObject)
.- quantFunction
A character scalar giving the name of the window quantification function. This function should take as a first argument a
RangedSummarizedExperiment
object, as generated byreadModBam
. The second argument should be aGRanges
object defining the windows to quantify. Additional arguments may be passed toquantFunction
usingquantFunctionArgs
. The function must return aRangedSummarizedExperiment
object with one row per window defined by the second argument and the window quantifications contained in the assay(s).- quantFunctionArgs
List with additional arguments for
quantFunction
.- scoreFunction
A character scalar giving the name of the scoring function. This function should take as a first argument a
SummarizedExperiment
object, as generated byquantifyWindowsInRegion
. Additional arguments may be passed toscoreFunction
usingscoreFunctionArgs
. The function must return aGRanges
object (typically corresponding to therowRanges
of the input) with the score added to themcols()
.- scoreFunctionArgs
List with additional arguments for
scoreFunction
.- modbase
Character vector defining the modified base to extract.
- modProbThreshold
A numeric scalar, indicating the modification probability threshold to use to classify a base as 'modified' or 'unmodified'.
- tileSize
A numeric scalar giving the approximate size of a genomic region for which the data is loaded at once. Chromosomes that are larger than
tileSize
will be automatically split into multiple tiles. Reduce this size if you would like to reduce the memory usage of the function.- seqinfo
NULL
or aSeqinfo
object containing information about the set of genomic sequences (chromosomes). Alternatively, a named numeric vector with genomic sequence names and lengths. Useful to set the sorting order of sequence names.- sequenceContextWidth, sequenceReference
Define the sequence context to be extracted around modified bases. By default (
sequenceContextWidth = 0
), no sequence context will be extracted, otherwise it will be returned inrowData(x)$sequenceContext
. SeeaddSeqContext
for details.- sequenceContext
A character vector with sequence contexts to retain. To apply this filter, the arguments
"sequenceContextWidth"
and"sequenceReference"
must be set.- windowMode
Character scalar defining how windows in
region
are define. Currently supported are:- "fixed"
: The window is of fixed size (in number of bases), corresponding to the
windowSize
argument value.- "predefined"
: The windows are provided by the user, as a
GRanges
object.
- windowSize
Numeric scalar defining the size of windows (see
windowMode
argument).- windowStep
Numeric scalar defining the step (shift) between start positions of consecutive windows. For non-overlapping consecutive windows,
windowStep
should be equal towindowSize
.- scoreCol
Character scalar giving the column name in
mcols(x)
to use for the analysis. ForscanForHighScoringRegions
, it can be a vector, in which caseprocessWindowScores
will be run for each of them and the results will be returned as a list.- scoreAction
A character scalar indicating how to process the scores. Currently supported values are:
- pass
: Do not modify window scores (pass-through). Note that if
mcols(x)
has multiple columns, they will all be passed through (i.e.,scoreCol
is ignored). To return onlyscoreCol
, useselect
.- select
: Pass through only
scoreCol
.- smooth
: Smooth window scores (
minperiod
argument).- smoothFuse (default)
: Smooth window scores, identify high scoring elements (
thresh
argument) and fuse nearby high-scoring elements (maxGap
argument).
- minperiod
Numeric scalar that defines the low-pass filter parameter used to smooth the scores for segmentation.
minperiod
gives the minimal period (in number of windows) for the critical frequency in the score signal that should be retained. The default value is suitable to retain signals that occur in three neighboring windows. The filtering is performed usingfiltfilt
and thus requires thesignal
package to be installed.- thresh
A numeric scalar giving the minimal absolute window score (after smoothing, see
minperiod
argument) defining a region of interest. Higher values make the region detection more stringent.- maxGap
Numeric scalar giving the maximal gap between neighboring windows, in base pairs, from the end of the first to the start of the next, to be fused into a single region of interest.
- BPPARAM
A
BiocParallelParam
object that controls the number of parallel CPU threads to use for some of the steps inreadModBam()
. The default value is (MulticoreParam(4L, RNGseed = 42L)
). If randomly sampling reads (nAlnsToSample > 0
), make sure to set theRNGseed
argument when constructing theBPPARAM
object for reproducible results (see alsovignette("Random_Numbers", package = "BiocParallel")
).- verbose
Logical scalar. If
TRUE
, report on progress.
Value
A GRanges
object with identified
regions of interest, or a named list of such objects if
scoreCol
is a vector of length >1.
Examples
modbamfiles <- system.file("extdata",
c("6mA_1_10reads.bam", "6mA_1_10reads.bam",
"6mA_2_10reads.bam", "6mA_2_10reads.bam"),
package = "footprintR")
gr <- scanForHighScoringRegions(
bamfiles = modbamfiles,
sampleAnnot = data.frame(sample = c("s1","s2","s3","s4"),
group = c("A","A","B","B")),
chromosomeLengths = c(chr1 = 6955000),
modbase = "a", BPPARAM = BiocParallel::SerialParam())
gr
#> GRanges object with 38 ranges and 5 metadata columns:
#> seqnames ranges strand | dirNegLog10PValueThresh
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] chr1 6928525-6928560 * | -4.31289
#> [2] chr1 6928621-6928692 * | 3.95187
#> [3] chr1 6929185-6929208 * | 3.14622
#> [4] chr1 6929353-6929496 * | 3.98559
#> [5] chr1 6929797-6929868 * | 4.16167
#> ... ... ... ... . ...
#> [34] chr1 6937909-6937932 * | 3.15746
#> [35] chr1 6938521-6938592 * | 5.31284
#> [36] chr1 6938617-6938652 * | -4.54483
#> [37] chr1 6938713-6938760 * | 3.81666
#> [38] chr1 6939973-6940044 * | 5.17778
#> numWindowsThresh direction dirNegLog10PValue numWindows
#> <integer> <factor> <numeric> <integer>
#> [1] 2 negative -4.31289 2
#> [2] 5 positive 3.95187 5
#> [3] 1 positive 3.14622 1
#> [4] 8 positive 3.35340 11
#> [5] 5 positive 4.16167 5
#> ... ... ... ... ...
#> [34] 1 positive 3.15746 1
#> [35] 5 positive 5.31284 5
#> [36] 2 negative -4.54483 2
#> [37] 3 positive 3.81666 3
#> [38] 5 positive 5.17778 5
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths