2 Reading data from files

This chapter describes how to read footprinting data from modBam files and other formats into R. It also introduces how the data is represented in memory. footprintR provides three functions for reading data into R from specific file types: readModBam, readBedMethyl and readModkitExtract. If you want to jump right in, have a look at the documentation of the reader function corresponding to the type of your data.

We start by loading the required packages and a BSgenome object providing the mouse genome sequence.

Show/hide code

BSgenomeName <- "BSgenome.Mmusculus.GENCODE.GRCm39.gencodeM34"

library(BiocParallel)
library(footprintR)
library(BSgenomeName, character.only = TRUE)
library(SummarizedExperiment)
library(SparseArray)

# Load genome
gnm <- get(BSgenomeName)
genome(gnm) <- "mm39"

2.1 Read-level versus summary-level

Typically, single molecule footprinting data at the single read level is stored in modBam files, which are standard bam files (Li et al. (2009)) that in addition to the usual information on alignments also contain base modifications encoded in the ML and MM tags (see section 1.7 of SAMtags.pdf for details).

readModBam reads modBam files and can either keep the data at the level of individual reads or summarize them per genomic position and sample. This is controlled via the level argument to readModBam. By default, read-level data is imported. In fact, footprintR provides two modes for importing read-level data: level = "quickread" is often the fastest, but does not return the full set of read-level annotations, and is not compatible with read sampling. level = "read" on the other hand provides a more complete output, but is a bit slower. If level is not specified, footprintR will select the best mode based on the other arguments provided to readModBam (specifically, whether variant names or read sampling are requested). As illustrated in Section 2.4 below, summary data (aggregated across reads) can also be imported by setting level = "summary".

2.2 Data representation

Let’s start by reading individual reads from a modBam file for a small region of the genome:

Show/hide code

se <- readModBam(
    bamfiles = c(wt1 = "data/mESC_wt_6mA_rep1.bam",
                 wt2 = "data/mESC_wt_6mA_rep2.bam"),
    modbase = "a", 
    regions = "chr8:39286000-39286200", 
    seqinfo = seqinfo(gnm), 
    sequenceContextWidth = 1, 
    sequenceReference = gnm,
    trim = TRUE
)
se

class: RangedSummarizedExperiment 
dim: 201 2 
metadata(3): readLevelData variantPositions filteredOutReads
assays(1): mod_prob
rownames(201): chr8:39286000:- chr8:39286001:- ... chr8:39286198:+
  chr8:39286199:+
rowData names(1): sequenceContext
colnames(2): wt1 wt2
colData names(4): sample modbase n_reads readInfo

Here is a scheme that illustrates the internals of se containing read-level data:

SummarizedExperiment with read-level data in assays. — Figure 2.1

Let’s go through the individual parts. You can see that the data is contained in a RangedSummarizedExperiment object. This is a container that contains one or more assays (here only one assay called mod_prob). Each assay is a matrix-like object and contains the loaded data (here: base modification probabilities of individual reads).

Show/hide code

assayNames(se)

[1] "mod_prob"

The rows represent genomic positions and the columns represent samples. In our example, we have loaded data from two modBam files corresponding to the two samples "wt1" and "wt2", thus the object has two columns:

Show/hide code

ncol(se)

[1] 2

Show/hide code

colnames(se)

[1] "wt1" "wt2"

The 201 rows correspond to individual base positions, for which at least one modification has been loaded. You can get the genomic coordinates of these positions using rowRanges:

Show/hide code

nrow(se)

[1] 201

Show/hide code

rowRanges(se)

UnstitchedGPos object with 201 positions and 1 metadata column:
                  seqnames       pos strand | sequenceContext
                     <Rle> <integer>  <Rle> |  <DNAStringSet>
  chr8:39286000:-     chr8  39286000      - |               G
  chr8:39286001:-     chr8  39286001      - |               A
  chr8:39286002:+     chr8  39286002      + |               A
  chr8:39286003:-     chr8  39286003      - |               A
  chr8:39286004:+     chr8  39286004      + |               A
              ...      ...       ...    ... .             ...
  chr8:39286195:+     chr8  39286195      + |               A
  chr8:39286196:+     chr8  39286196      + |               A
  chr8:39286197:-     chr8  39286197      - |               A
  chr8:39286198:+     chr8  39286198      + |               A
  chr8:39286199:+     chr8  39286199      + |               A
  -------
  seqinfo: 61 sequences (1 circular) from mm39 genome

These positions are stranded, as base modifications can be observed on either of the two strands:

Show/hide code

table(strand(se))


  +   -   * 
100 101   0

Finally, the individual reads of a sample are stored in the assay, inside the “column” of that sample:

Show/hide code

assayNames(se)

[1] "mod_prob"

Show/hide code

assay(se, "mod_prob")

DataFrame with 201 rows and 2 columns
                         wt1          wt2
                  <NaMatrix>   <NaMatrix>
chr8:39286000:- NA:NA:NA:... NA:NA:NA:...
chr8:39286001:-  0:NA:NA:...   0:NA:0:...
chr8:39286002:+   NA:0:0:...  NA:0:NA:...
chr8:39286003:-  0:NA:NA:...   0:NA:0:...
chr8:39286004:+   NA:0:0:...  NA:0:NA:...
...                      ...          ...
chr8:39286195:+  NA:0:NA:...  NA:0:NA:...
chr8:39286196:+  NA:0:NA:...  NA:0:NA:...
chr8:39286197:-  0:NA:NA:...   0:NA:0:...
chr8:39286198:+  NA:0:NA:... NA:NA:NA:...
chr8:39286199:+  NA:0:NA:... NA:NA:NA:...

You can see that each column is itself an object of type NaMatrix.

Show/hide code

# get the NaMatrix with reads of sample wt1
mp1 <- assay(se, "mod_prob")$wt1
mp1

<201 x 35 NaMatrix> of type "double" [nnacount=2368 (34%)]:
       wt1-d914fb79-7d7a-4a67-bb98-5d1168b057b6 ...
  [1,]                                       NA   .
  [2,]                                        0   .
  [3,]                                       NA   .
  [4,]                                        0   .
  [5,]                                       NA   .
   ...                                        .   .
[197,]                                       NA   .
[198,]                                       NA   .
[199,]                                        0   .
[200,]                                       NA   .
[201,]                                       NA   .
       wt1-70e9a128-ea64-4c1d-b9e9-44f76a51a780
  [1,]                                       NA
  [2,]                                       NA
  [3,]                                       NA
  [4,]                                       NA
  [5,]                                       NA
   ...                                        .
[197,]                                        0
[198,]                                        0
[199,]                                       NA
[200,]                                        0
[201,]                                        0

The rows in the NaMatrix corresponding to the same genomic positions we have seen above, but the columns correspond to the 35 individual reads. The numeric values correspond to modification probabilities. A value greater than 0.5 or 0.7 is typically considered modified. Many values are NA (missing), either because a read did not overlap a given position, or there was no information about modifications at that position in that read. Storing all these NA values would be a waste of memory, which is avoided in the NaMatrix, which is a special type of sparse matrix from the SparseArray package that only stores non-NA values.

2.3 Flattening read-level to summary-level data

We can use flattenReadLevelAssay to go from this read-level to summary-level data:

Show/hide code

se_summary <- flattenReadLevelAssay(se)

Here is an illustration of the summary-level data container:

SummarizedExperiment with summary-level data in assays. — Figure 2.2

Its dimensions are identical to the original se (positions by samples), however it contains additional assays that do not have the nested structure with individual reads, but are simple matrices:

Show/hide code

dim(se_summary)

[1] 201   2

Show/hide code

assayNames(se_summary)

[1] "mod_prob" "Nmod"     "Nvalid"   "FracMod"

Show/hide code

head(assay(se_summary, "Nmod"))

                wt1 wt2
chr8:39286000:-   0   0
chr8:39286001:-   0   1
chr8:39286002:+   0   1
chr8:39286003:-   1   3
chr8:39286004:+   0   0
chr8:39286005:-   0   2

By default, flattenReadLevelAssay keeps the read-level assay ("mod_prob") and generates summarized assays with the numbers of modified and total bases ("Nmod" and "Nvalid"), and the fraction of modified bases per position and sample (“FracMod”`).

2.4 Directly reading summary-level data

If the individual reads are not required, you can directly summarize them while reading, which is faster and eliminates the call to flattenReadLevelAssay. This is achieved by adding the argument level = "summary" to the readModBam call:

Show/hide code

se_summary_direct <- readModBam(
    bamfiles = c(wt1 = "data/mESC_wt_6mA_rep1.bam",
                 wt2 = "data/mESC_wt_6mA_rep2.bam"),
    modbase = "a", 
    regions = "chr8:39286000-39286200", 
    seqinfo = seqinfo(gnm), 
    sequenceContextWidth = 1, 
    sequenceReference = gnm,
    trim = TRUE,
    level = "summary"
)
se_summary_direct

class: RangedSummarizedExperiment 
dim: 201 2 
metadata(1): readLevelData
assays(3): Nmod Nvalid FracMod
rownames(201): chr8:39286000:- chr8:39286001:- ... chr8:39286198:+
  chr8:39286199:+
rowData names(1): sequenceContext
colnames(2): wt1 wt2
colData names(2): sample modbase

Show/hide code

identical(assay(se_summary, "FracMod"),
          assay(se_summary_direct, "FracMod"))

[1] TRUE

2.5 Read sampling

In some cases, we may want to read a random set of reads from a modBam file, for example to calculate average modification rates or quality statistics. In such cases, reading data by genomic position is not optimal.

readModBam provides a special reading mode for this, controlled by the arguments nAlnsToSample and seqnamesToSampleFrom:

Show/hide code

se_sampled <- readModBam(
    bamfiles = c(wt1 = "data/mESC_wt_6mA_rep1.bam",
                 wt2 = "data/mESC_wt_6mA_rep2.bam"),
    modbase = "a", 
    seqinfo = seqinfo(gnm), 
    sequenceContextWidth = 1, 
    sequenceReference = gnm,
    nAlnsToSample = 20,
    seqnamesToSampleFrom = "chr19"
)
se_sampled

class: RangedSummarizedExperiment 
dim: 61348 2 
metadata(3): readLevelData variantPositions filteredOutReads
assays(1): mod_prob
rownames(61348): chr19:6119863:- chr19:6119864:- ... chr19:60772501:+
  chr19:60772504:+
rowData names(1): sequenceContext
colnames(2): wt1 wt2
colData names(4): sample modbase n_reads readInfo

Show/hide code

se_sampled$n_reads

[1] 19 18

For details of how to make the sampling reproducible even when using multiple parallel threads, and why you might get a lower number of sampled reads than requested, please look at the documenation of readModBam.

2.6 Improving reading performance

Even though readModBam is implemented using C/C++ for speed and efficiency, reading the data may become slow. Here are a few points that may help to reduce the time to read modification data from modBam files:

Read summary-level data: Reading of summary-level data (argument level = "summary") is much faster than reading of read-level data, so use it whenever you do not need to represent indidvual reads.
Parallelization: readModBam supports parallel processing, controlled by the BPPARAM argument. If multiple workers are available in the BPPARAM object (see BiocParallel), they will be used to read data from from multiple modBam files in parallel. If the number of workers is larger than the number of bamfiles, the workers will further be used to speed-up decompression of bam records.
Region size: If you are reading read-level data from large or many regions, certain operations like sub-setting of positions may become slow. It may overall be faster to call readModBam multiple times on fewer or smaller regions, and keeping track of the partial results, than calling readModBam a single time on all regions.

2.7 Reading from other file types

In addition to reading from modBam files, you can also read (summary-level) data from bedMethyl files using readBedMethyl, or from tabular files created by modkit using readModkitExtract. The arguments of these reader functions are largely similar to the ones used for reading modBam files.

Show/hide code

# read from bedMethyl
bmfile <- system.file("extdata", "modkit_pileup_1.bed.gz",
                      package = "footprintR")
se_bedmethyl <- readBedMethyl(
    fnames = bmfile,
    modbase = "m",
    seqinfo = seqinfo(gnm), 
    sequenceContextWidth = 3, 
    sequenceReference = gnm
)
se_bedmethyl

class: RangedSummarizedExperiment 
dim: 10000 1 
metadata(1): readLevelData
assays(2): Nmod Nvalid
rownames: NULL
rowData names(1): sequenceContext
colnames(1): s1
colData names(2): sample modbase

Show/hide code

# read from a tabular file created by modkit
extrfile <- system.file("extdata", "modkit_extract_rc_5mC_1.tsv.gz",
                        package = "footprintR")
se_modkit <- readModkitExtract(
    fnames = extrfile,
    modbase = "m",
    filter = NULL,
    seqinfo = seqinfo(gnm), 
    sequenceContextWidth = 3, 
    sequenceReference = gnm
)
se_modkit

class: RangedSummarizedExperiment 
dim: 6432 1 
metadata(3): modkit_threshold filter_threshold readLevelData
assays(1): mod_prob
rownames(6432): chr1:6928983:- chr1:6928995:- ... chr1:6949455:+
  chr1:6949461:+
rowData names(1): sequenceContext
colnames(1): s1
colData names(2): sample modbase

2.8 Session info

Click to view session info

Show/hide code

sessioninfo::session_info(info = "packages")

═ Session info ═══════════════════════════════════════════════════════════════
─ Packages ───────────────────────────────────────────────────────────────────
 package                                      * version   date (UTC) lib source
 abind                                        * 1.4-8     2024-09-12 [1] CRAN (R 4.5.0)
 Biobase                                      * 2.68.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 BiocGenerics                                 * 0.54.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 BiocIO                                       * 1.18.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 BiocParallel                                 * 1.42.1    2025-06-01 [1] Bioconductor 3.21 (R 4.5.0)
 Biostrings                                   * 2.76.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 bitops                                         1.0-9     2024-10-03 [1] CRAN (R 4.5.0)
 BSgenome                                     * 1.76.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 BSgenome.Mmusculus.GENCODE.GRCm39.gencodeM34 * 0.1.0     2025-04-17 [1] Bioconductor
 cli                                            3.6.5     2025-04-23 [1] CRAN (R 4.5.0)
 codetools                                      0.2-20    2024-03-31 [2] CRAN (R 4.5.0)
 crayon                                         1.5.3     2024-06-20 [1] CRAN (R 4.5.0)
 curl                                           6.3.0     2025-06-06 [1] CRAN (R 4.5.0)
 data.table                                     1.17.4    2025-05-26 [1] CRAN (R 4.5.0)
 DelayedArray                                   0.34.1    2025-04-17 [1] Bioconductor 3.21 (R 4.5.0)
 dichromat                                      2.0-0.1   2022-05-02 [1] CRAN (R 4.5.0)
 digest                                         0.6.37    2024-08-19 [1] CRAN (R 4.5.0)
 dplyr                                          1.1.4     2023-11-17 [1] CRAN (R 4.5.0)
 evaluate                                       1.0.3     2025-01-10 [1] CRAN (R 4.5.0)
 farver                                         2.1.2     2024-05-13 [1] CRAN (R 4.5.0)
 fastmap                                        1.2.0     2024-05-15 [1] CRAN (R 4.5.0)
 footprintR                                   * 0.3.5     2025-06-04 [1] Github (fmicompbio/footprintR@612f713)
 generics                                     * 0.1.4     2025-05-09 [1] CRAN (R 4.5.0)
 GenomeInfoDb                                 * 1.44.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 GenomeInfoDbData                               1.2.14    2025-04-17 [1] Bioconductor
 GenomicAlignments                              1.44.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 GenomicRanges                                * 1.60.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 ggforce                                        0.4.2     2024-02-19 [1] CRAN (R 4.5.0)
 ggplot2                                        3.5.2     2025-04-09 [1] CRAN (R 4.5.0)
 glue                                           1.8.0     2024-09-30 [1] CRAN (R 4.5.0)
 gtable                                         0.3.6     2024-10-25 [1] CRAN (R 4.5.0)
 htmltools                                      0.5.8.1   2024-04-04 [1] CRAN (R 4.5.0)
 htmlwidgets                                    1.6.4     2023-12-06 [1] CRAN (R 4.5.0)
 httr                                           1.4.7     2023-08-15 [1] CRAN (R 4.5.0)
 IRanges                                      * 2.42.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 jsonlite                                       2.0.0     2025-03-27 [1] CRAN (R 4.5.0)
 knitr                                          1.50      2025-03-16 [1] CRAN (R 4.5.0)
 lattice                                        0.22-7    2025-04-02 [1] CRAN (R 4.5.0)
 lifecycle                                      1.0.4     2023-11-07 [1] CRAN (R 4.5.0)
 magrittr                                       2.0.3     2022-03-30 [1] CRAN (R 4.5.0)
 MASS                                           7.3-65    2025-02-28 [2] CRAN (R 4.5.0)
 Matrix                                       * 1.7-3     2025-03-11 [2] CRAN (R 4.5.0)
 MatrixGenerics                               * 1.20.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 matrixStats                                  * 1.5.0     2025-01-07 [1] CRAN (R 4.5.0)
 patchwork                                      1.3.0     2024-09-16 [1] CRAN (R 4.5.0)
 pillar                                         1.10.2    2025-04-05 [1] CRAN (R 4.5.0)
 pkgconfig                                      2.0.3     2019-09-22 [1] CRAN (R 4.5.0)
 polyclip                                       1.10-7    2024-07-23 [1] CRAN (R 4.5.0)
 purrr                                          1.0.4     2025-02-05 [1] CRAN (R 4.5.0)
 R.methodsS3                                    1.8.2     2022-06-13 [1] CRAN (R 4.5.0)
 R.oo                                           1.27.1    2025-05-02 [1] CRAN (R 4.5.0)
 R.utils                                        2.13.0    2025-02-24 [1] CRAN (R 4.5.0)
 R6                                             2.6.1     2025-02-15 [1] CRAN (R 4.5.0)
 RColorBrewer                                   1.1-3     2022-04-03 [1] CRAN (R 4.5.0)
 Rcpp                                           1.0.14    2025-01-12 [1] CRAN (R 4.5.0)
 RCurl                                          1.98-1.17 2025-03-22 [1] CRAN (R 4.5.0)
 restfulr                                       0.0.15    2022-06-16 [1] CRAN (R 4.5.0)
 rjson                                          0.2.23    2024-09-16 [1] CRAN (R 4.5.0)
 rlang                                          1.1.6     2025-04-11 [1] CRAN (R 4.5.0)
 rmarkdown                                      2.29      2024-11-04 [1] CRAN (R 4.5.0)
 Rsamtools                                      2.24.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 rtracklayer                                  * 1.68.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 S4Arrays                                     * 1.8.1     2025-06-01 [1] Bioconductor 3.21 (R 4.5.0)
 S4Vectors                                    * 0.46.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 scales                                         1.4.0     2025-04-24 [1] CRAN (R 4.5.0)
 sessioninfo                                    1.2.3     2025-02-05 [1] CRAN (R 4.5.0)
 SparseArray                                  * 1.8.0     2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 SummarizedExperiment                         * 1.38.1    2025-04-30 [1] Bioconductor 3.21 (R 4.5.0)
 tibble                                         3.3.0     2025-06-08 [1] CRAN (R 4.5.0)
 tidyr                                          1.3.1     2024-01-24 [1] CRAN (R 4.5.0)
 tidyselect                                     1.2.1     2024-03-11 [1] CRAN (R 4.5.0)
 tweenr                                         2.0.3     2024-02-26 [1] CRAN (R 4.5.0)
 UCSC.utils                                     1.4.0     2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 vctrs                                          0.6.5     2023-12-01 [1] CRAN (R 4.5.0)
 withr                                          3.0.2     2024-10-28 [1] CRAN (R 4.5.0)
 xfun                                           0.52      2025-04-02 [1] CRAN (R 4.5.0)
 XML                                            3.99-0.18 2025-01-01 [1] CRAN (R 4.5.0)
 XVector                                      * 0.48.0    2025-04-15 [1] Bioconductor 3.21 (R 4.5.0)
 yaml                                           2.3.10    2024-07-26 [1] CRAN (R 4.5.0)
 zoo                                            1.8-14    2025-04-10 [1] CRAN (R 4.5.0)

 [1] /tungstenfs/groups/gbioinfo/Appz/R/BioC/R-4.5-release-foss-2024.05_BioC-3.21-release-foss-2024.05
 [2] /tachyon/groups/gbioinfo/Appz/easybuild/software/R/4.5.0-foss-2024.05/lib64/R/library
 * ── Packages attached to the search path.

──────────────────────────────────────────────────────────────────────────────