Parse ML and MM tags (see https://samtools.github.io/hts-specs/SAMtags.pdf,
section 1.7) and return a SummarizedExperiment
object with information on modified bases. Implicitly called bases will get
a modification probability of zero. Secondary and supplementary alignments
are ignored.
Usage
readModBam(
bamfiles,
regions = NULL,
modbase,
level = ifelse(nAlnsToSample == 0 & is.null(variantPositions), "quickread", "read"),
sampleAnnot = NULL,
nAlnsToSample = 0,
seqnamesToSampleFrom = "chr19",
seqinfo = NULL,
sequenceContextWidth = 0,
sequenceReference = NULL,
variantPositions = NULL,
modProbThreshold = 0.5,
trim = FALSE,
BPPARAM = MulticoreParam(4L, RNGseed = 42L),
verbose = FALSE
)
Arguments
- bamfiles
Character vector with one or several paths of
modBAM
files, containing information about base modifications inMM
andML
tags. Ifbamfiles
is a named vector, the names are used as sample names and prefixes for read names. Otherwise, the prefixes will bes1
, ...,sN
, whereN
is the length ofbamfiles
. Allbamfiles
must have an index.- regions
A
GRanges
object specifying which genomic regions to extract the reads from. Alternatively, regions can be specified as a character vector (e.g. "chr1:1200-1300", "chr2:-6000" or "chrM") that will be coerced into aGRanges
object. If the end coordinate is not provided (for example in "chr1:10-", which means "to the end of chr1"), it will be obtained fromseqinfo
or default to a large value ifseqinfo
is not provided. Note that unlesstrim=TRUE
, the reads are not trimmed to the boundaries of the specified ranges. As a result, returned positions will typically extend out of the specified regions. IfnAlnsToSample
is set to a non-zero value,regions
is ignored.- modbase
Character vector defining the modified base for each sample. If
modbase
is a named vector, the names should correspond to the names ofbamfiles
. Otherwise, it will be assumed that the elements are in the same order as the files inbamfiles
. Ifmodbase
has length 1, the same modified base will be used for all samples.- level
Character scalar specifying the level of returned modification data. Supported values are:
- "read"
: Extracts modification probabilities for individual reads into an assay called
"mod_prob"
, with each column (sample) consisting of a position-by-readNaMatrix
. This is the default ifnAlnsToSample
is non-zero orvariantPositions
is notNULL
.- "quickread"
: Like "read", but runs faster, does not support read sampling, and does not return all annotations currently provided by "read". This is the default if
nAlnsToSample
is zero andvariantPositions
isNULL
.- "summary"
: Counts the total and modified bases for each position and strand and returns them in assays named
"Nvalid"
and"Nmod"
, respectively, as well as theNmod/Nvalid
ratio in the assay"FracMod"
.
- sampleAnnot
A
data.frame
(orNULL
) providing annotations for the samples. It must contain at least one column, named"sample"
, which must contain all the values ofnames(bamfiles)
. The provided annotations will be propagated to the returnedSummarizedExperiment
object.- nAlnsToSample
A numeric scalar. If non-zero,
regions
is ignored and approximatelynAlnsToSample
randomly selected alignments onseqnamesToSampleFrom
are read from each of thebamfiles
. In order to make the results reproducible, make sure to set theRNGseed
argument in the providedBPPARAM
object (see below). Please note that secondary and supplementary alignments inbamfiles
contribute to the total number of alignments but will not be sampled, thus the number of returned alignments may be lower thannAlnsToSample
.- seqnamesToSampleFrom
A character vector with one or several sequence names (chromosomes) from which to sample alignments from (only used if
nAlnsToSample
is greater than zero).- seqinfo
NULL
or aSeqinfo
object containing information about the set of genomic sequences (chromosomes). Alternatively, a named numeric vector with genomic sequence names and lengths. Useful to set the sorting order of sequence names.- sequenceContextWidth, sequenceReference
Define the sequence context to be extracted around modified bases. By default (
sequenceContextWidth = 0
), no sequence context will be extracted, otherwise it will be returned inrowData(x)$sequenceContext
. SeeaddSeqContext
for details.- variantPositions
An optional
GPos
object with seqnames and coordinates of single nucleotide variant positions, to be used to construct read labels for allele-specific analysis. Ignored ifNULL
ornAlnsToSample > 0
(sampling-mode).- modProbThreshold
A numeric scalar, indicating the modification probability threshold to use to classify a base as 'modified' or 'unmodified'. Only used if
level
is"summary"
.- trim
A logical scalar. If
TRUE
, the returnedSummarizedExperiment
object will only contain the positions overlapping the specifiedregions
. IfFALSE
(default), the object will be extended to all positions covered by the reads overlappingregions
. In both cases, only reads overlapping the specifiedregions
are included.- BPPARAM
A
BiocParallelParam
object that controls the number of parallel CPU threads to use for some of the steps inreadModBam()
. The default value is (MulticoreParam(4L, RNGseed = 42L)
). If randomly sampling reads (nAlnsToSample > 0
), make sure to set theRNGseed
argument when constructing theBPPARAM
object for reproducible results (see alsovignette("Random_Numbers", package = "BiocParallel")
).- verbose
Logical scalar. If
TRUE
, report on progress.
Value
A SummarizedExperiment
object
with genomic positions in rows and samples in columns. The assays
depend on the value of the level
argument (see above).
See also
https://samtools.github.io/hts-specs/SAMtags.pdf describing the SAM ML and MM tags for base modifications.
Examples
modbamfile <- system.file("extdata", "6mA_1_10reads.bam",
package = "footprintR")
readModBam(bamfiles = modbamfile, regions = "chr1:6940000-6955000",
modbase = "a", verbose = TRUE,
BPPARAM = BiocParallel::SerialParam())
#> ℹ extracting base modifications from modBAM files
#> ⠙ 0.000 Mio. genomic positions processed (0.002 Mio./s) [1ms]
#> ⠙ 0.000 Mio. genomic positions processed (0.002 Mio./s) [1ms]
#> ℹ finding unique genomic positions...
#> ✔ finding unique genomic positions... [16ms]
#>
#> ⠙ 0.000 Mio. genomic positions processed (0.002 Mio./s) [1ms]
#> ℹ collapsed 11300 positions to 4772 unique ones
#> ✔ collapsed 11300 positions to 4772 unique ones [101ms]
#>
#> ⠙ 0.000 Mio. genomic positions processed (0.002 Mio./s) [1ms]
#> class: RangedSummarizedExperiment
#> dim: 4772 1
#> metadata(3): readLevelData variantPositions filteredOutReads
#> assays(1): mod_prob
#> rownames(4772): chr1:6925830:- chr1:6925834:- ... chr1:6941622:-
#> chr1:6941631:-
#> rowData names(0):
#> colnames(1): s1
#> colData names(4): sample modbase n_reads readInfo