Skip to contents

For all pairs of bases with modification calls in a read, tabulate the number of pairs at a given distance with a given modification state.

Usage

countMismatchStatePairs(
  bamfile,
  bamFormat = "QuasR",
  regions = ".",
  sequenceContext = "GCH",
  readBaseUnmod = "T",
  readBaseMod = "C",
  windowSize = 200,
  minMapQ = 0,
  minAlignedLength = 0,
  seqinfo = NULL,
  sequenceReference = NULL,
  BPPARAM = MulticoreParam(4L),
  verbose = FALSE
)

Arguments

bamfile

Character scalar giving the path to a BAM file, containing alignments with specific (e.g., C-to-T) mismatches. The bamfile must have an index.

bamFormat

A character scalar giving the format of BAM files. Currently supported ar BAM files that have been created with "QuasR" (using qAlign in bisulfite mode) or "Bismark".

regions

A GRanges object specifying which genomic regions to extract the reads from. Alternatively, regions can be specified as a character vector (e.g. "chr1:1200-1300", "chr2:-6000", "chr1:10-", "chrM" or ".").

sequenceContext

A character scalar with an odd number of characters, specifying the genomic context on the plus strand, for which to report (mis-)matches. The string may contain IUPAC ambiguity codes, e.g. sequenceContext="GCH" would correspond to all GpC dinucleotides, excluding C's that are in CpG dinucleotides. The call will be made by comparing the genomic base in the middle of sequenceContext to the aligned base in the read and interpreted according to readBaseUnmod and readBaseMod.

readBaseUnmod, readBaseMod

Character scalars defining the read bases that are interpreted as unmodified or modified, respectively, when aligned to the middle base of sequenceContext.

windowSize

Numeric scalar giving the maximum window size covering pairs of modified bases to consider in pair-counting mode. A window size of 1 corresponds to a single base, a size of 2 to directly adjacent bases, etc.

minMapQ

Numeric scalar giving the minimal mapping quality to include alignments in pair-counting mode.

minAlignedLength

Numeric scalar giving the minimal alignment length to include alignments in pair-counting mode.

seqinfo

NULL or a Seqinfo object containing information about the set of genomic sequences (chromosomes). Alternatively, a named numeric vector with genomic sequence names and lengths. Useful to set the sorting order of sequence names.

sequenceReference

A BSgenome object, or a character scalar giving the path to a fasta formatted file with reference sequences, or a DNAStringSet object. The sequence context (see sequenceContextWidth argument) will be extracted from these sequences.

BPPARAM

A BiocParallelParam object that controls the number of parallel CPU threads to use for decompressing bam records.

verbose

A logical scalar. If TRUE, report on progress.

Value

A DataFrame with windowSize rows and five columns.

Author

Charlotte Soneson, Michael Stadler

Examples

bamfile <- system.file("extdata", "BisSeq_quasr_single.bam",
                       package = "SingleMoleculeGenomicsIO")
reffile <- system.file("extdata", "reference.fa.gz",
                       package = "SingleMoleculeGenomicsIO")
tbl <- countMismatchStatePairs(bamfile = bamfile, bamFormat = "QuasR",
                               regions = "chr1:6940000-6955000",
                               sequenceContext = "C",
                               readBaseUnmod = "T", readBaseMod = "C",
                               windowSize = 200,
                               sequenceReference = reffile,
                               BPPARAM = BiocParallel::SerialParam(),
                               verbose = TRUE)
#>  finding positions with C
#>  opening input file /Users/runner/work/_temp/Library/SingleMoleculeGenomicsIO/extdata/BisSeq_quasr_single.bam using 1 thread
#>  finding positions with C

#>  counting state-pairs for alignments overlapping 1 region
#>  finding positions with C

#>  removed 0 unaligned (e.g. soft-masked) of 0 called bases
#>  finding positions with C

#>  read 184 alignments
#>  finding positions with C

#>  finding positions with C [89ms]
#> 
tbl
#> DataFrame with 200 rows and 5 columns
#>             S unmod_unmod unmod_mod mod_unmod   mod_mod
#>     <integer>   <numeric> <numeric> <numeric> <numeric>
#> 1           1        3242         0         0        68
#> 2           2         691        10         5         0
#> 3           3         561         9        10         1
#> 4           4         563        12        15         0
#> 5           5         533        12        13         0
#> ...       ...         ...       ...       ...       ...
#> 196       196           0         0         0         0
#> 197       197           0         0         0         0
#> 198       198           0         0         0         0
#> 199       199           0         0         0         0
#> 200       200           0         0         0         0