Parse ML and MM tags (see https://samtools.github.io/hts-specs/SAMtags.pdf,
section 1.7) and return a SummarizedExperiment
object with information on modified bases. Implicitly called bases will get
a modification probability of zero. Secondary and supplementary alignments
are ignored.
Usage
readModBam(
bamfiles,
regions = NULL,
modbase,
level = ifelse(nAlnsToSample == 0 & is.null(variantPositions), "quickread", "read"),
sampleAnnot = NULL,
nAlnsToSample = 0,
seqnamesToSampleFrom = "chr19",
seqinfo = NULL,
sequenceContextWidth = 0,
sequenceReference = NULL,
variantPositions = NULL,
modProbThreshold = 0.5,
trim = FALSE,
BPPARAM = MulticoreParam(4L, RNGseed = 42L),
verbose = FALSE
)Arguments
- bamfiles
Character vector with one or several paths of
modBAMfiles, containing information about base modifications inMMandMLtags. Ifbamfilesis a named vector, the names are used as sample names and prefixes for read names. Otherwise, the prefixes will bes1, ...,sN, whereNis the length ofbamfiles. Allbamfilesmust have an index.- regions
A
GRangesobject specifying which genomic regions to extract the reads from. Alternatively, regions can be specified as a character vector (e.g. "chr1:1200-1300", "chr2:-6000" or "chrM") that will be coerced into aGRangesobject. If the end coordinate is not provided (for example in "chr1:10-", which means "to the end of chr1"), it will be obtained fromseqinfoor default to a large value ifseqinfois not provided. Note that unlesstrim=TRUE, the reads are not trimmed to the boundaries of the specified ranges. As a result, returned positions will typically extend out of the specified regions. IfnAlnsToSampleis set to a non-zero value,regionsis ignored.- modbase
Character vector defining the modified base for each sample. If
modbaseis a named vector, the names should correspond to the names ofbamfiles. Otherwise, it will be assumed that the elements are in the same order as the files inbamfiles. Ifmodbasehas length 1, the same modified base will be used for all samples.- level
Character scalar specifying the level of returned modification data. Supported values are:
- "read"
: Extracts modification probabilities for individual reads into an assay called
"mod_prob", with each column (sample) consisting of a position-by-readNaMatrix. This is the default ifnAlnsToSampleis non-zero orvariantPositionsis notNULL.- "quickread"
: Like "read", but runs faster, does not support read sampling, and does not return all annotations currently provided by "read". This is the default if
nAlnsToSampleis zero andvariantPositionsisNULL.- "summary"
: Counts the total and modified bases for each position and strand and returns them in assays named
"Nvalid"and"Nmod", respectively, as well as theNmod/Nvalidratio in the assay"FracMod".
- sampleAnnot
A
data.frame(orNULL) providing annotations for the samples. It must contain at least one column, named"sample", which must contain all the values ofnames(bamfiles). The provided annotations will be propagated to the returnedSummarizedExperimentobject.- nAlnsToSample
A numeric scalar. If non-zero,
regionsis ignored and approximatelynAlnsToSamplerandomly selected alignments onseqnamesToSampleFromare read from each of thebamfiles. In order to make the results reproducible, make sure to set theRNGseedargument in the providedBPPARAMobject (see below). Please note that secondary and supplementary alignments inbamfilescontribute to the total number of alignments but will not be sampled, thus the number of returned alignments may be lower thannAlnsToSample.- seqnamesToSampleFrom
A character vector with one or several sequence names (chromosomes) from which to sample alignments from (only used if
nAlnsToSampleis greater than zero).- seqinfo
NULLor aSeqinfoobject containing information about the set of genomic sequences (chromosomes). Alternatively, a named numeric vector with genomic sequence names and lengths. Useful to set the sorting order of sequence names.- sequenceContextWidth, sequenceReference
Define the sequence context to be extracted around modified bases. By default (
sequenceContextWidth = 0), no sequence context will be extracted, otherwise it will be returned inrowData(x)$sequenceContext. SeeaddSeqContextfor details.- variantPositions
An optional
GPosobject with seqnames and coordinates of single nucleotide variant positions, to be used to construct read labels for allele-specific analysis. Ignored ifNULLornAlnsToSample > 0(sampling-mode).- modProbThreshold
A numeric scalar, indicating the modification probability threshold to use to classify a base as 'modified' or 'unmodified'. Only used if
levelis"summary".- trim
A logical scalar. If
TRUE, the returnedSummarizedExperimentobject will only contain the positions overlapping the specifiedregions. IfFALSE(default), the object will be extended to all positions covered by the reads overlappingregions. In both cases, only reads overlapping the specifiedregionsare included.- BPPARAM
A
BiocParallelParamobject that controls the number of parallel CPU threads to use for some of the steps inreadModBam(). The default value is (MulticoreParam(4L, RNGseed = 42L)). If randomly sampling reads (nAlnsToSample > 0), make sure to set theRNGseedargument when constructing theBPPARAMobject for reproducible results (see alsovignette("Random_Numbers", package = "BiocParallel")).- verbose
Logical scalar. If
TRUE, report on progress.
Value
A SummarizedExperiment object
with genomic positions in rows and samples in columns. The assays
depend on the value of the level argument (see above).
See also
https://samtools.github.io/hts-specs/SAMtags.pdf describing the SAM ML and MM tags for base modifications.
Examples
modbamfile <- system.file("extdata", "6mA_1_10reads.bam",
package = "footprintR")
readModBam(bamfiles = modbamfile, regions = "chr1:6940000-6955000",
modbase = "a", verbose = TRUE,
BPPARAM = BiocParallel::SerialParam())
#> ℹ extracting base modifications from modBAM files
#> ⠙ 0.000 Mio. genomic positions processed (0.001 Mio./s) [2ms]
#> ⠙ 0.000 Mio. genomic positions processed (0.001 Mio./s) [1ms]
#> ℹ finding unique genomic positions...
#> ✔ finding unique genomic positions... [38ms]
#>
#> ⠙ 0.000 Mio. genomic positions processed (0.001 Mio./s) [1ms]
#> ℹ collapsed 11300 positions to 4772 unique ones
#> ✔ collapsed 11300 positions to 4772 unique ones [225ms]
#>
#> ⠙ 0.000 Mio. genomic positions processed (0.001 Mio./s) [1ms]
#> class: RangedSummarizedExperiment
#> dim: 4772 1
#> metadata(3): readLevelData variantPositions filteredOutReads
#> assays(1): mod_prob
#> rownames(4772): chr1:6925830:- chr1:6925834:- ... chr1:6941622:-
#> chr1:6941631:-
#> rowData names(0):
#> colnames(1): s1
#> colData names(4): sample modbase n_reads readInfo