Scan one or more chromosomes for high-scoring regions
Source:R/scanGenome.R
scanForHighScoringRegions.RdGiven one or several modBam files and a vector with chromosome lengths,
quantify the modification data in windows using quantFunction,
calculate a score for each window using scoreFunction and fuse
neighboring high-scoring windows. For example, the quantification could
sum up observed counts in each window (implemented in sumNmodNvalid),
and the score could be a directional P value measuring differential
modification between two groups of modBam files (implemented in
getDifferentiallyModifiedWindows), and the identified windows would
then be differentially modified regions (DMRs). Please note that
quantFunction needs to be compatible with scoreFunction, as
the latter consumes the output of the former.
Usage
scanForHighScoringRegions(
bamfiles,
sampleAnnot,
chromosomeLengths,
quantFunction = "sumNmodNvalid",
quantFunctionArgs = list(),
scoreFunction = "getDifferentiallyModifiedWindows",
scoreFunctionArgs = list(),
modbase,
level = "summary",
modProbThreshold = 0.5,
tileSize = 1e+06,
seqinfo = NULL,
sequenceContextWidth = 0,
sequenceReference = NULL,
sequenceContext = NULL,
windowMode = "fixed",
windowSize = 24,
windowStep = round(windowSize/2),
scoreCol = "dirNegLog10PValue",
scoreAction = "smoothFuse",
minperiod = 3,
thresh = 3,
maxGap = 50,
BPPARAM = MulticoreParam(4L, RNGseed = 42L),
verbose = FALSE
)Arguments
- bamfiles
Character vector with one or several modBam file names. Note that no read-filtering will be performed on the data from these files, so possibly the files should contain only filtered alignments.
- sampleAnnot
A
data.frame(orNULL) providing annotations for the samples. It must contain at least one column, named"sample", which must contain all the values ofnames(bamfiles). The provided annotations will be propagated to the returnedSummarizedExperimentobject.- chromosomeLengths
A named character vector with lengths of chromosomes to be analyzed, for example the return value of
seqlengths(BSgenomeObject).- quantFunction
A character scalar giving the name of the window quantification function. This function should take as a first argument a
RangedSummarizedExperimentobject, as generated byreadModBam. The second argument should be aGRangesobject defining the windows to quantify. Additional arguments may be passed toquantFunctionusingquantFunctionArgs. The function must return aRangedSummarizedExperimentobject with one row per window defined by the second argument and the window quantifications contained in the assay(s).- quantFunctionArgs
List with additional arguments for
quantFunction.- scoreFunction
A character scalar giving the name of the scoring function. This function should take as a first argument a
SummarizedExperimentobject, as generated byquantifyWindowsInRegion. Additional arguments may be passed toscoreFunctionusingscoreFunctionArgs. The function must return aGRangesobject (typically corresponding to therowRangesof the input) with the score added to themcols().- scoreFunctionArgs
List with additional arguments for
scoreFunction.- modbase
Character vector defining the modified base to extract.
- level
A character scalar giving the level of the data to required by
quantFunction(seereadModBam).- modProbThreshold
A numeric scalar, indicating the modification probability threshold to use to classify a base as 'modified' or 'unmodified'.
- tileSize
A numeric scalar giving the approximate size of a genomic region for which the data is loaded at once. Chromosomes that are larger than
tileSizewill be automatically split into multiple tiles. Reduce this size if you would like to reduce the memory usage of the function.- seqinfo
NULLor aSeqinfoobject containing information about the set of genomic sequences (chromosomes). Alternatively, a named numeric vector with genomic sequence names and lengths. Useful to set the sorting order of sequence names.- sequenceContextWidth, sequenceReference
Define the sequence context to be extracted around modified bases. By default (
sequenceContextWidth = 0), no sequence context will be extracted, otherwise it will be returned inrowData(x)$sequenceContext. SeeaddSeqContextfor details.- sequenceContext
A character vector with sequence contexts to retain. To apply this filter, the arguments
"sequenceContextWidth"and"sequenceReference"must be set.- windowMode
Character scalar defining how windows in
regionare define. Currently supported are:- "fixed"
: The window is of fixed size (in number of bases), corresponding to the
windowSizeargument value.- "predefined"
: The windows are provided by the user, as a
GRangesobject.
- windowSize
Numeric scalar defining the size of windows (see
windowModeargument).- windowStep
Numeric scalar defining the step (shift) between start positions of consecutive windows. For non-overlapping consecutive windows,
windowStepshould be equal towindowSize.- scoreCol
Character scalar giving the column name in
mcols(x)to use for the analysis. ForscanForHighScoringRegions, it can be a vector, in which caseprocessWindowScoreswill be run for each of them and the results will be returned as a list.- scoreAction
A character scalar indicating how to process the scores. Currently supported values are:
- pass
: Do not modify window scores (pass-through). Note that if
mcols(x)has multiple columns, they will all be passed through (i.e.,scoreColis ignored). To return onlyscoreCol, useselect.- select
: Pass through only
scoreCol.- smooth
: Smooth window scores (
minperiodargument).- smoothFuse (default)
: Smooth window scores, identify high scoring elements (
threshargument) and fuse nearby high-scoring elements (maxGapargument).
- minperiod
Numeric scalar that defines the low-pass filter parameter used to smooth the scores for segmentation.
minperiodgives the minimal period (in number of windows) for the critical frequency in the score signal that should be retained. The default value is suitable to retain signals that occur in three neighboring windows. The filtering is performed usingfiltfiltand thus requires thesignalpackage to be installed.- thresh
A numeric scalar giving the minimal absolute window score (after smoothing, see
minperiodargument) defining a region of interest. Higher values make the region detection more stringent.- maxGap
Numeric scalar giving the maximal gap between neighboring windows, in base pairs, from the end of the first to the start of the next, to be fused into a single region of interest.
- BPPARAM
A
BiocParallelParamobject that controls the number of parallel CPU threads to use for some of the steps inreadModBam(). The default value is (MulticoreParam(4L, RNGseed = 42L)). If randomly sampling reads (nAlnsToSample > 0), make sure to set theRNGseedargument when constructing theBPPARAMobject for reproducible results (see alsovignette("Random_Numbers", package = "BiocParallel")).- verbose
Logical scalar. If
TRUE, report on progress.
Value
A GRanges object with identified
regions of interest, or a named list of such objects if
scoreCol is a vector of length >1.
Examples
modbamfiles <- system.file("extdata",
c("6mA_1_10reads.bam", "6mA_1_10reads.bam",
"6mA_2_10reads.bam", "6mA_2_10reads.bam"),
package = "footprintR")
gr <- scanForHighScoringRegions(
bamfiles = modbamfiles,
sampleAnnot = data.frame(sample = c("s1", "s2", "s3", "s4"),
group = c("A", "A", "B", "B")),
chromosomeLengths = c(chr1 = 6955000),
modbase = "a", BPPARAM = BiocParallel::SerialParam())
gr
#> GRanges object with 38 ranges and 5 metadata columns:
#> seqnames ranges strand | dirNegLog10PValueThresh
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] chr1 6928525-6928560 * | -4.31289
#> [2] chr1 6928621-6928692 * | 3.95187
#> [3] chr1 6929185-6929208 * | 3.14622
#> [4] chr1 6929353-6929496 * | 3.98559
#> [5] chr1 6929797-6929868 * | 4.16167
#> ... ... ... ... . ...
#> [34] chr1 6937909-6937932 * | 3.15746
#> [35] chr1 6938521-6938592 * | 5.31284
#> [36] chr1 6938617-6938652 * | -4.54483
#> [37] chr1 6938713-6938760 * | 3.81666
#> [38] chr1 6939973-6940044 * | 5.17778
#> numWindowsThresh direction dirNegLog10PValue numWindows
#> <integer> <factor> <numeric> <integer>
#> [1] 2 negative -4.31289 2
#> [2] 5 positive 3.95187 5
#> [3] 1 positive 3.14622 1
#> [4] 8 positive 3.35340 11
#> [5] 5 positive 4.16167 5
#> ... ... ... ... ...
#> [34] 1 positive 3.15746 1
#> [35] 5 positive 5.31284 5
#> [36] 2 negative -4.54483 2
#> [37] 3 positive 3.81666 3
#> [38] 5 positive 5.17778 5
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths