Count pairs of modified bases by distance and modification state for bam files where modifications are indicated by sequence mismatches
Source:R/countMismatchStatePairs.R
countMismatchStatePairs.RdFor all pairs of bases with modification calls in a read (pair), tabulate the number of pairs at a given distance with a given modification state.
Usage
countMismatchStatePairs(
bamfile,
bamFormat = "QuasR",
regions = ".",
sequenceContext = "GCH",
nAlnsToSample = 0,
seqnamesToSampleFrom = character(0),
readBaseUnmod = "T",
readBaseMod = "C",
windowSize = 200,
minMapQ = 0,
minAlignedLength = 0,
seqinfo = NULL,
sequenceReference = NULL,
BPPARAM = MulticoreParam(4L),
verbose = FALSE
)Arguments
- bamfile
Character scalar giving the path to a
BAMfile, containing alignments with specific (e.g., C-to-T) mismatches. Thebamfilemust have an index.- bamFormat
A character scalar giving the format of
BAMfiles. Currently supported arBAMfiles that have been created with"QuasR"(usingqAlignin bisulfite mode) or"Bismark".- regions
A
GRangesobject specifying which genomic regions to extract the reads from. Alternatively, regions can be specified as a character vector (e.g. "chr1:1200-1300", "chr2:-6000", "chr1:10-", "chrM" or ".").- sequenceContext
A character scalar with an odd number of characters, specifying the genomic context on the plus strand, for which to report (mis-)matches. The string may contain IUPAC ambiguity codes, e.g.
sequenceContext="GCH"would correspond to all GpC dinucleotides, excluding C's that are in CpG dinucleotides. The call will be made by comparing the genomic base in the middle ofsequenceContextto the aligned base in the read and interpreted according toreadBaseUnmodandreadBaseMod.- nAlnsToSample
A numeric scalar. If non-zero,
regionsis ignored and approximatelynAlnsToSamplerandomly selected alignments onseqnamesToSampleFromare read from thebamfile. Ifbamfilecontains paired alignments, they are included or excluded as pairs. Alignments are counted individually towardsnAlnsToSample, thus for paired-end data this parameter has to be set to twice the number of pairs to be sampled. In order to make the results reproducible, make sure to set the random number seed usingset.seed. Please note that secondary and supplementary alignments inbamfilescontribute to the total number of alignments but will not be sampled, thus the number of used alignments may be lower thannAlnsToSample.- seqnamesToSampleFrom
A character vector with one or several sequence names (chromosomes) from which to sample alignments from (only used if
nAlnsToSampleis greater than zero).- readBaseUnmod, readBaseMod
Character scalars defining the read bases that are interpreted as unmodified or modified, respectively, when aligned to the middle base of
sequenceContext.- windowSize
Numeric scalar giving the maximum window size covering pairs of modified bases to consider in pair-counting mode. A window size of 1 corresponds to a single base, a size of 2 to directly adjacent bases, etc.
- minMapQ
Numeric scalar giving the minimal mapping quality to include alignments in pair-counting mode.
- minAlignedLength
Numeric scalar giving the minimal alignment length to include alignments in pair-counting mode.
- seqinfo
NULLor aSeqinfoobject containing information about the set of genomic sequences (chromosomes). Alternatively, a named numeric vector with genomic sequence names and lengths. Useful to set the sorting order of sequence names.- sequenceReference
A
BSgenomeobject, or a character scalar giving the path to a fasta formatted file with reference sequences, or aDNAStringSetobject. The sequence context (seesequenceContextWidthargument) will be extracted from these sequences.- BPPARAM
A
BiocParallelParamobject that controls the number of parallel CPU threads to use for some of the steps inreadModBam(). The default value is (MulticoreParam(4L, RNGseed = 42L)). If randomly sampling reads (nAlnsToSample > 0), make sure to set theRNGseedargument when constructing theBPPARAMobject for reproducible results (see alsovignette("Random_Numbers", package = "BiocParallel")).- verbose
A logical scalar. If
TRUE, report on progress.
Examples
bamfile <- system.file("extdata", "BisSeq_quasr_single.bam",
package = "SingleMoleculeGenomicsIO")
reffile <- system.file("extdata", "reference.fa.gz",
package = "SingleMoleculeGenomicsIO")
tbl <- countMismatchStatePairs(bamfile = bamfile, bamFormat = "QuasR",
regions = "chr1:6940000-6955000",
sequenceContext = "C",
readBaseUnmod = "T", readBaseMod = "C",
windowSize = 200,
sequenceReference = reffile,
BPPARAM = BiocParallel::SerialParam(),
verbose = TRUE)
#> ℹ finding positions with C
#> ℹ opening input file /Users/runner/work/_temp/Library/SingleMoleculeGenomicsIO/extdata/BisSeq_quasr_single.bam using 1 thread
#> ℹ finding positions with C
#> ℹ counting state-pairs for alignments overlapping 1 region
#> ℹ finding positions with C
#> ℹ removed 0 unaligned (e.g. soft-masked) of 0 called bases
#> ℹ finding positions with C
#> ℹ read 184 alignments
#> ℹ finding positions with C
#> ✔ finding positions with C [142ms]
#>
tbl
#> DataFrame with 200 rows and 5 columns
#> S unmod_unmod unmod_mod mod_unmod mod_mod
#> <integer> <numeric> <numeric> <numeric> <numeric>
#> 1 1 3242 0 0 68
#> 2 2 691 10 5 0
#> 3 3 561 9 10 1
#> 4 4 563 12 15 0
#> 5 5 533 12 13 0
#> ... ... ... ... ... ...
#> 196 196 0 0 0 0
#> 197 197 0 0 0 0
#> 198 198 0 0 0 0
#> 199 199 0 0 0 0
#> 200 200 0 0 0 0