Scan one or more chromosomes for high-scoring regions — scanForHighScoringRegions • footprintR

Given one or several modBam files and a vector with chromosome lengths, quantify the modification data in windows using quantFunction, calculate a score for each window using scoreFunction and fuse neighboring high-scoring windows. For example, the quantification could sum up observed counts in each window (implemented in sumNmodNvalid), and the score could be a directional P value measuring differential modification between two groups of modBam files (implemented in getDifferentiallyModifiedWindows), and the identified windows would then be differentially modified regions (DMRs). Please note that quantFunction needs to be compatible with scoreFunction, as the latter consumes the output of the former.

Usage

scanForHighScoringRegions(
  bamfiles,
  sampleAnnot,
  chromosomeLengths,
  quantFunction = "sumNmodNvalid",
  quantFunctionArgs = list(),
  scoreFunction = "getDifferentiallyModifiedWindows",
  scoreFunctionArgs = list(),
  modbase,
  level = "summary",
  modProbThreshold = 0.5,
  tileSize = 1e+06,
  seqinfo = NULL,
  sequenceContextWidth = 0,
  sequenceReference = NULL,
  sequenceContext = NULL,
  windowMode = "fixed",
  windowSize = 24,
  windowStep = round(windowSize/2),
  scoreCol = "dirNegLog10PValue",
  scoreAction = "smoothFuse",
  minperiod = 3,
  thresh = 3,
  maxGap = 50,
  BPPARAM = MulticoreParam(4L, RNGseed = 42L),
  verbose = FALSE
)

Arguments

bamfiles

Character vector with one or several modBam file names. Note that no read-filtering will be performed on the data from these files, so possibly the files should contain only filtered alignments.

sampleAnnot

A data.frame (or NULL) providing annotations for the samples. It must contain at least one column, named "sample", which must contain all the values of names(bamfiles). The provided annotations will be propagated to the returned SummarizedExperiment object.

chromosomeLengths

A named character vector with lengths of chromosomes to be analyzed, for example the return value of seqlengths(BSgenomeObject).

quantFunction

A character scalar giving the name of the window quantification function. This function should take as a first argument a RangedSummarizedExperiment object, as generated by readModBam. The second argument should be a GRanges object defining the windows to quantify. Additional arguments may be passed to quantFunction using quantFunctionArgs. The function must return a RangedSummarizedExperiment object with one row per window defined by the second argument and the window quantifications contained in the assay(s).

quantFunctionArgs

List with additional arguments for quantFunction.

scoreFunction

A character scalar giving the name of the scoring function. This function should take as a first argument a SummarizedExperiment object, as generated by quantifyWindowsInRegion. Additional arguments may be passed to scoreFunction using scoreFunctionArgs. The function must return a GRanges object (typically corresponding to the rowRanges of the input) with the score added to the mcols().

scoreFunctionArgs

List with additional arguments for scoreFunction.

modbase

Character vector defining the modified base to extract.

level

A character scalar giving the level of the data to required by quantFunction (see readModBam).

modProbThreshold

A numeric scalar, indicating the modification probability threshold to use to classify a base as 'modified' or 'unmodified'.

tileSize

A numeric scalar giving the approximate size of a genomic region for which the data is loaded at once. Chromosomes that are larger than tileSize will be automatically split into multiple tiles. Reduce this size if you would like to reduce the memory usage of the function.

seqinfo

NULL or a Seqinfo object containing information about the set of genomic sequences (chromosomes). Alternatively, a named numeric vector with genomic sequence names and lengths. Useful to set the sorting order of sequence names.

sequenceContextWidth, sequenceReference

Define the sequence context to be extracted around modified bases. By default ( sequenceContextWidth = 0), no sequence context will be extracted, otherwise it will be returned in rowData(x)$sequenceContext. See addSeqContext for details.

sequenceContext

A character vector with sequence contexts to retain. To apply this filter, the arguments "sequenceContextWidth" and "sequenceReference" must be set.

windowMode

Character scalar defining how windows in region are define. Currently supported are:

"fixed": : The window is of fixed size (in number of bases), corresponding to the windowSize argument value.
"predefined": : The windows are provided by the user, as a GRanges object.

windowSize

Numeric scalar defining the size of windows (see windowMode argument).

windowStep

Numeric scalar defining the step (shift) between start positions of consecutive windows. For non-overlapping consecutive windows, windowStep should be equal to windowSize.

scoreCol

Character scalar giving the column name in mcols(x) to use for the analysis. For scanForHighScoringRegions, it can be a vector, in which case processWindowScores will be run for each of them and the results will be returned as a list.

scoreAction

A character scalar indicating how to process the scores. Currently supported values are:

pass: : Do not modify window scores (pass-through). Note that if mcols(x) has multiple columns, they will all be passed through (i.e., scoreCol is ignored). To return only scoreCol, use select.
select: : Pass through only scoreCol.
smooth: : Smooth window scores (minperiod argument).
smoothFuse (default): : Smooth window scores, identify high scoring elements (thresh argument) and fuse nearby high-scoring elements (maxGap argument).

minperiod

Numeric scalar that defines the low-pass filter parameter used to smooth the scores for segmentation. minperiod gives the minimal period (in number of windows) for the critical frequency in the score signal that should be retained. The default value is suitable to retain signals that occur in three neighboring windows. The filtering is performed using filtfilt and thus requires the signal package to be installed.

thresh

A numeric scalar giving the minimal absolute window score (after smoothing, see minperiod argument) defining a region of interest. Higher values make the region detection more stringent.

maxGap

Numeric scalar giving the maximal gap between neighboring windows, in base pairs, from the end of the first to the start of the next, to be fused into a single region of interest.

BPPARAM

A BiocParallelParam object that controls the number of parallel CPU threads to use for some of the steps in readModBam(). The default value is (MulticoreParam(4L, RNGseed = 42L)). If randomly sampling reads (nAlnsToSample > 0), make sure to set the RNGseed argument when constructing the BPPARAM object for reproducible results (see also vignette("Random_Numbers", package = "BiocParallel")).

verbose

Logical scalar. If TRUE, report on progress.

Value

A GRanges object with identified regions of interest, or a named list of such objects if scoreCol is a vector of length >1.

Author

Charlotte Soneson, Michael Stadler

Examples

modbamfiles <- system.file("extdata",
                           c("6mA_1_10reads.bam", "6mA_1_10reads.bam",
                             "6mA_2_10reads.bam", "6mA_2_10reads.bam"),
                           package = "footprintR")
gr <- scanForHighScoringRegions(
    bamfiles = modbamfiles,
    sampleAnnot = data.frame(sample = c("s1","s2","s3","s4"),
                             group = c("A","A","B","B")),
    chromosomeLengths = c(chr1 = 6955000),
    modbase = "a", BPPARAM = BiocParallel::SerialParam())
gr
#> GRanges object with 38 ranges and 5 metadata columns:
#>        seqnames          ranges strand | dirNegLog10PValueThresh
#>           <Rle>       <IRanges>  <Rle> |               <numeric>
#>    [1]     chr1 6928525-6928560      * |                -4.31289
#>    [2]     chr1 6928621-6928692      * |                 3.95187
#>    [3]     chr1 6929185-6929208      * |                 3.14622
#>    [4]     chr1 6929353-6929496      * |                 3.98559
#>    [5]     chr1 6929797-6929868      * |                 4.16167
#>    ...      ...             ...    ... .                     ...
#>   [34]     chr1 6937909-6937932      * |                 3.15746
#>   [35]     chr1 6938521-6938592      * |                 5.31284
#>   [36]     chr1 6938617-6938652      * |                -4.54483
#>   [37]     chr1 6938713-6938760      * |                 3.81666
#>   [38]     chr1 6939973-6940044      * |                 5.17778
#>        numWindowsThresh direction dirNegLog10PValue numWindows
#>               <integer>  <factor>         <numeric>  <integer>
#>    [1]                2  negative          -4.31289          2
#>    [2]                5  positive           3.95187          5
#>    [3]                1  positive           3.14622          1
#>    [4]                8  positive           3.35340         11
#>    [5]                5  positive           4.16167          5
#>    ...              ...       ...               ...        ...
#>   [34]                1  positive           3.15746          1
#>   [35]                5  positive           5.31284          5
#>   [36]                2  negative          -4.54483          2
#>   [37]                3  positive           3.81666          3
#>   [38]                5  positive           5.17778          5
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths