Scan one or more chromosomes for high-scoring regions
Source:R/scanGenome.R
scanForHighScoringRegions.Rd
Given one or several modBam files and a vector with chromosome lengths, quantify the modification data in windows, calculate a score for each window and fuse neighboring high-scoring windows. For example, the score could be a directional P value measuring differential modification between two groups of modBam files, and the identified windows would then be differentially modified regions (DMRs).
Usage
scanForHighScoringRegions(
bamfiles,
sampleAnnot,
chromosomeLengths,
scoreFunction = "getDifferentiallyModifiedWindows",
modbase,
modProbThreshold = 0.5,
tileSize = 1e+06,
seqinfo = NULL,
sequenceContextWidth = 0,
sequenceReference = NULL,
sequenceContext = NULL,
windowMode = "fixed",
windowSize = 24,
windowStep = round(windowSize/2),
scoreCol = "dirNegLog10PValue",
thresh = 3,
minperiod = 3,
maxGap = 50,
BPPARAM = MulticoreParam(4L, RNGseed = 42L),
verbose = FALSE,
...
)
Arguments
- bamfiles
Character vector with one or several modBam file names. Note that no read-filtering will be performed on the data from these files, so possibly the files should contain only filtered alignments.
- sampleAnnot
A
data.frame
(orNULL
) providing annotations for the samples. It must contain at least one column, named"sample"
, which must contain all the values ofnames(bamfiles)
. The provided annotations will be propagated to the returnedSummarizedExperiment
object.- chromosomeLengths
A named character vector with lengths of chromosomes to be analyzed, for example the return value of
seqlengths(BSgenomeObject)
.- scoreFunction
A character scalar giving the name of the scoring function. This function should take as a first argument a
SummarizedExperiment
object, as generated byquantifyWindowsInRegion
. Additional arguments may be passed toscoreFunction
using...
. The function must return aGRanges
object (typically corresponding to therowRanges
of the input) with the score added to themcols()
.- modbase
Character vector defining the modified base to extract.
- modProbThreshold
A numeric scalar, indicating the modification probability threshold to use to classify a base as 'modified' or 'unmodified'.
- tileSize
A numeric scalar giving the approximate size of a genomic region for which the data is loaded at once. Chromosomes that are larger than
tileSize
will be automatically split into multiple tiles. Reduce this size if you would like to reduce the memory usage of the function.- seqinfo
NULL
or aSeqinfo
object containing information about the set of genomic sequences (chromosomes). Alternatively, a named numeric vector with genomic sequence names and lengths. Useful to set the sorting order of sequence names.- sequenceContextWidth, sequenceReference
Define the sequence context to be extracted around modified bases. By default (
sequenceContextWidth = 0
), no sequence context will be extracted, otherwise it will be returned inrowData(x)$sequenceContext
. SeeaddSeqContext
for details.- sequenceContext
A character vector with sequence contexts to retain. To apply this filter, the arguments
"sequenceContextWidth"
and"sequenceReference"
must be set.- windowMode
Character scalar defining how windows in
region
are define. Currently supported are:- "fixed"
: The window is of fixed size (in number of bases), corresponding to the
windowSize
argument value.
- windowSize
Numeric scalar defining the size of windows (see
windowMode
argument).- windowStep
Numeric scalar defining the step (shift) between start positions of consecutive windows. For non-overlapping consecutive windows,
windowStep
should be equal towindowSize
.- scoreCol
Character scalar giving the column name in
mcols(x)
to use for the analysis.- thresh
A numeric scalar giving the minimal absolute window score (after smoothing, see
minperiod
argument) defining a region of interest. Higher values make the region detection more stringent.- minperiod
Numeric scalar that defines the low-pass filter parameter used to smooth the scores for segmentation.
minperiod
gives the minimal period (in number of windows) for the critical frequency in the score signal that should be retained. The default value is suitable to retain signals that occur in three neighboring windows. The filtering is performed usingfiltfilt
and thus requires thesignal
package to be installed.- maxGap
Numeric scalar giving the maximal gap between neighboring windows, in base pairs, from the end of the first to the start of the next, to be fused into a single region of interest.
- BPPARAM
A
BiocParallelParam
object that controls the number of parallel CPU threads to use for some of the steps inreadModBam()
. The default value is (MulticoreParam(4L, RNGseed = 42L)
). If randomly sampling reads (nAlnsToSample > 0
), make sure to set theRNGseed
argument when constructing theBPPARAM
object for reproducible results (see alsovignette("Random_Numbers", package = "BiocParallel")
).- verbose
Logical scalar. If
TRUE
, report on progress.- ...
Additional parameters sent to
scoreFunction
.
Value
A GRanges
object with identified
regions of interest.
Examples
modbamfiles <- system.file("extdata",
c("6mA_1_10reads.bam", "6mA_1_10reads.bam",
"6mA_2_10reads.bam", "6mA_2_10reads.bam"),
package = "footprintR")
gr <- scanForHighScoringRegions(bamfiles = modbamfiles,
sampleAnnot = data.frame(sample = c("s1","s2","s3","s4"),
group = c("A","A","B","B")),
chromosomeLengths = c(chr1 = 6955000),
modbase = "a", BPPARAM = BiocParallel::SerialParam())
gr
#> GRanges object with 38 ranges and 5 metadata columns:
#> seqnames ranges strand | dirNegLog10PValueThresh
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] chr1 6928530-6928565 * | -3.89850
#> [2] chr1 6928626-6928685 * | 4.07694
#> [3] chr1 6929190-6929213 * | 3.05522
#> [4] chr1 6929346-6929561 * | 3.91628
#> [5] chr1 6929802-6929861 * | 4.45501
#> ... ... ... ... . ...
#> [34] chr1 6938526-6938597 * | 5.30074
#> [35] chr1 6938610-6938657 * | -4.04405
#> [36] chr1 6938718-6938765 * | 3.31873
#> [37] chr1 6938874-6938897 * | -3.24545
#> [38] chr1 6939966-6940049 * | 5.22974
#> numWindowsThresh direction dirNegLog10PValue numWindows
#> <integer> <factor> <numeric> <integer>
#> [1] 2 down -3.89850 2
#> [2] 4 up 4.07694 4
#> [3] 1 up 3.05522 1
#> [4] 8 up 2.80645 17
#> [5] 4 up 4.45501 4
#> ... ... ... ... ...
#> [34] 5 up 5.30074 5
#> [35] 3 down -4.04405 3
#> [36] 3 up 3.31873 3
#> [37] 1 down -3.24545 1
#> [38] 6 up 5.22974 6
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths