
Load data

We start the analysis by loading footprinting data at the level of individual reads (for an example of working with summary-level data, see vignette("using-footprintR")). Read-level data can be extracted from modBam files (standard bam files with modification data stored in MM and ML tags, see SAMtags.pdf) using readModBam(). This is the preferred approach and the one described below.

Alternatively, read-level modification data can also be extracted from modBam files using modkit (see modkitExtract() for calling modkit from R and readModkitExtract() to read the resulting output file). This requires a local installation of modkit and is mainly of interest if you need specific modkit features that are not available otherwise.
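
If you want to try the modkit-based route, a quick first check is whether a modkit executable can be found on your system; here is a minimal base R sketch (see the help pages of modkitExtract() and readModkitExtract() for their actual arguments):

# check whether a 'modkit' executable is available on the PATH
# (only needed for the modkitExtract()/readModkitExtract() route)
nzchar(Sys.which("modkit"))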

The footprintR package contains small example modBam files that were generated using the Dorado aligner:

# load packages
library(footprintR)
library(SummarizedExperiment)

# read-level 6mA data generated by 'dorado'
modbamfiles <- system.file("extdata",
                           c("6mA_1_10reads.bam", "6mA_2_10reads.bam"),
                            package = "footprintR")
names(modbamfiles) <- c("sample1", "sample2")

Modification data is read from these files using readModBam():

se <- readModBam(bamfiles = modbamfiles,
                 regions = "chr1:6940000-7000000",
                 modbase = "a",
                 verbose = TRUE)
#> extracting base modifications from modBAM files
#>     opening input file /Users/runner/work/_temp/Library/footprintR/extdata/6mA_1_10reads.bam
#>     reading alignments overlapping any of 1 regions
#>     removed 1724 unaligned (e.g. soft-masked) of 25090 called bases
#>     read 3 alignments
#>     opening input file /Users/runner/work/_temp/Library/footprintR/extdata/6mA_2_10reads.bam
#>     reading alignments overlapping any of 1 regions
#>     removed 320 unaligned (e.g. soft-masked) of 10174 called bases
#>     read 2 alignments
#> finding unique genomic positions...collapsed 17739 positions to 7967 unique ones
se
#> class: RangedSummarizedExperiment 
#> dim: 7967 2 
#> metadata(0):
#> assays(1): mod_prob
#> rownames(7967): chr1:6930410:+ chr1:6930415:+ ... chr1:6941622:-
#>   chr1:6941631:-
#> rowData names(0):
#> colnames(2): sample1 sample2
#> colData names(4): sample modbase n_reads qscore

This will create a RangedSummarizedExperiment object with positions in rows:

# rows are positions...
rowRanges(se)
#> UnstitchedGPos object with 7967 positions and 0 metadata columns:
#>                  seqnames       pos strand
#>                     <Rle> <integer>  <Rle>
#>   chr1:6930410:+     chr1   6930410      +
#>   chr1:6930415:+     chr1   6930415      +
#>   chr1:6930419:+     chr1   6930419      +
#>   chr1:6930421:+     chr1   6930421      +
#>   chr1:6930428:+     chr1   6930428      +
#>              ...      ...       ...    ...
#>   chr1:6941611:-     chr1   6941611      -
#>   chr1:6941614:-     chr1   6941614      -
#>   chr1:6941620:-     chr1   6941620      -
#>   chr1:6941622:-     chr1   6941622      -
#>   chr1:6941631:-     chr1   6941631      -
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Just like with summary-level data, columns correspond to samples:

# ... and columns are samples
colData(se)
#> DataFrame with 2 rows and 4 columns
#>              sample     modbase   n_reads                  qscore
#>         <character> <character> <integer>                  <List>
#> sample1     sample1           a         3 14.1428,16.0127,20.3082
#> sample2     sample2           a         2        9.67461,13.66480

The sample names are taken from the names of the input file vector (here modbamfiles); if the files are not named, sample names are assigned automatically (each file corresponding to a separate sample).
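
For example, the assigned sample names can be inspected directly on the returned object; they are used both as its column names and as the sample column of its colData (shown above):

# sample names are used as column names ...
colnames(se)
# ... and are also stored in the 'sample' column of the colData
se$sample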

However, as each sample typically contains several reads (see se$n_reads), the qscore column, which stores the quality scores of individual reads, is not a simple vector but a list of vectors in which the scores are grouped by sample. Use unlist() if a flat vector is needed:

se$qscore
#> List of length 2
#> names(2): sample1 sample2
unlist(se$qscore)
#>  sample1  sample1  sample1  sample2  sample2 
#> 14.14280 16.01270 20.30820  9.67461 13.66480
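
Because the scores are grouped by sample, per-sample summaries are easy to compute; for example, a short base R sketch of the mean read quality per sample:

# mean quality score of the reads in each sample
sapply(as.list(se$qscore), mean)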

Explore assay data

The single assay mod_prob is a DataFrame with modification probabilities.

assayNames(se)
#> [1] "mod_prob"

m <- assay(se, "mod_prob")
m
#> DataFrame with 7967 rows and 2 columns
#>                        sample1        sample2
#>                 <SparseMatrix> <SparseMatrix>
#> chr1:6930410:+           0:0:0         0:0.02
#> chr1:6930415:+           0:0:0         0:0.02
#> chr1:6930419:+           0:0:0         0:0.02
#> chr1:6930421:+           0:0:0         0:0.02
#> chr1:6930428:+           0:0:0         0:0.02
#> ...                        ...            ...
#> chr1:6941611:- 0:0:0.052734375            0:0
#> chr1:6941614:- 0:0:0.130859375            0:0
#> chr1:6941620:- 0:0:0.068359375            0:0
#> chr1:6941622:- 0:0:0.060546875            0:0
#> chr1:6941631:- 0:0:0.056640625            0:0

Again, to store read-level data for a variable number of reads per sample, each column (sample) is not a simple vector but a position-by-read SparseMatrix. If a single flat matrix is needed, in which columns correspond to reads instead of samples, it can be created using as.matrix():

m$sample1
#> <7967 x 3 SparseMatrix> of type "double" [nzcount=11300 (47%)]:
#>         sample1-233e48a7-f379-4dcf-9270-958231125563
#>    [1,]                                            0
#>    [2,]                                            0
#>    [3,]                                            0
#>    [4,]                                            0
#>    [5,]                                            0
#>     ...                                            .
#> [7963,]                                   0.00000000
#> [7964,]                                   0.00000000
#> [7965,]                                   0.00000000
#> [7966,]                                   0.00000000
#> [7967,]                                   0.00000000
#>         sample1-d52a5f6a-a60a-4f85-913e-eada84bfbfb9
#>    [1,]                                            0
#>    [2,]                                            0
#>    [3,]                                            0
#>    [4,]                                            0
#>    [5,]                                            0
#>     ...                                            .
#> [7963,]                                   0.00000000
#> [7964,]                                   0.00000000
#> [7965,]                                   0.00000000
#> [7966,]                                   0.00000000
#> [7967,]                                   0.00000000
#>         sample1-92e906ae-cddb-4347-a114-bf9137761a8d
#>    [1,]                                            0
#>    [2,]                                            0
#>    [3,]                                            0
#>    [4,]                                            0
#>    [5,]                                            0
#>     ...                                            .
#> [7963,]                                   0.05273438
#> [7964,]                                   0.13085938
#> [7965,]                                   0.06835938
#> [7966,]                                   0.06054688
#> [7967,]                                   0.05664062

as.matrix(m)
#> <7967 x 5 SparseMatrix> of type "double" [nzcount=17739 (45%)]:
#>                sample1-233e48a7-f379-4dcf-9270-958231125563 ...
#> chr1:6930410:+                                            0   .
#> chr1:6930415:+                                            0   .
#> chr1:6930419:+                                            0   .
#> chr1:6930421:+                                            0   .
#> chr1:6930428:+                                            0   .
#>            ...                                            .   .
#> chr1:6941611:-                                            0   .
#> chr1:6941614:-                                            0   .
#> chr1:6941620:-                                            0   .
#> chr1:6941622:-                                            0   .
#> chr1:6941631:-                                            0   .
#>                sample2-d03efe3b-a45b-430b-9cb6-7e5882e4faf8
#> chr1:6930410:+                                         0.02
#> chr1:6930415:+                                         0.02
#> chr1:6930419:+                                         0.02
#> chr1:6930421:+                                         0.02
#> chr1:6930428:+                                         0.02
#>            ...                                            .
#> chr1:6941611:-                                            0
#> chr1:6941614:-                                            0
#> chr1:6941620:-                                            0
#> chr1:6941622:-                                            0
#> chr1:6941631:-                                            0
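
As can be seen above, the columns of the flattened matrix are named by combining the sample name and the read identifier, separated by a dash. The sample of origin of each read column can thus be recovered, for example with:

# extract the sample name from the "<sample>-<readID>" column names
sub("-.*", "", colnames(as.matrix(m)))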

One advantage of this grouping of reads per sample is that you can easily perform per-sample operations using lapply() (returns a list) or endoapply() (returns a DataFrame):

lapply(m, ncol)
#> $sample1
#> [1] 3
#> 
#> $sample2
#> [1] 2
endoapply(m, ncol)
#> DataFrame with 1 row and 2 columns
#>     sample1   sample2
#>   <integer> <integer>
#> 1         3         2
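
Other per-sample operations work in the same way; for example, the read identifiers of each sample are simply the column names of the corresponding SparseMatrix:

# read identifiers, grouped by sample
lapply(m, colnames)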

The SparseMatrix objects do not store the zeros and thus use much less memory than a normal (dense) matrix. However, you have to be careful when interpreting the values in such a matrix, as they follow a specific convention:

Important: The zeros correspond to unobserved read/position combinations, while all values that are ‘implicitly’ called (with a modification probability of less than 5%) are represented by a value of 0.02.

That means that if you were to manually calculate, for example, the average modification probability at a given position, taking the row mean would not be correct, as it would include the zero values that correspond to unobserved data:

# modification probabilities at position chr1:6928850:-
m["chr1:6928850:-", ]
#> DataFrame with 1 row and 2 columns
#>                                  sample1        sample2
#>                           <SparseMatrix> <SparseMatrix>
#> chr1:6928850:- 0.623046875:0.099609375:0            0:0

# WRONG: take the mean of all values
lapply(m["chr1:6928850:-", ], rowMeans)
#> $sample1
#> [1] 0.2408854
#> 
#> $sample2
#> [1] 0

# CORRECT: exclude the zeros
lapply(m["chr1:6928850:-", ], function(x) {
    non_zero <- SparseArray::nzwhich(x != 0)
    mean(x[non_zero])
})
#> $sample1
#> [1] 0.3613281
#> 
#> $sample2
#> [1] NaN
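
The same observed-only average can be computed for all positions of a sample at once, for example by dividing the per-position sums by the number of non-zero values per position (a sketch, assuming that rowSums() works on SparseMatrix objects just like the rowMeans() used above):

# per-position mean over observed (non-zero) values only
lapply(m, function(x) rowSums(x) / rowSums(x != 0))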

This is, however, rarely needed, as there are convenience functions that automatically exclude the unobserved (zero) values for you. For example, you can summarize the reads in each sample using addReadsSummary() (see the next section).

Summarize read-level data

Summarized data can be obtained from the read-level data by calculating a per-position summary over the reads in each sample:

se_summary <- addReadsSummary(se, keep.reads = TRUE,
                              statistics = c("Nmod", "Nvalid", "FracMod",
                                             "Pmod", "AvgConf"))

As discussed above, this will automatically exclude the non-observed (zero) values from the data when calculating the modification probability at each position (Pmod assay):

assay(se_summary, "Pmod")["chr1:6928850:-", ]
#>   sample1   sample2 
#> 0.3613281       NaN

The summary statistics to calculate are selected using the statistics argument. By default, addReadsSummary() counts the number of modified (Nmod) and total (Nvalid) reads for each position and sample, and calculates the fraction of modified bases from these two counts (FracMod).

assay(se_summary, "Nmod")["chr1:6928850:-", ]
#> sample1 sample2 
#>       1       0
assay(se_summary, "Nvalid")["chr1:6928850:-", ]
#> sample1 sample2 
#>       2       0
assay(se_summary, "FracMod")["chr1:6928850:-", ]
#> sample1 sample2 
#>     0.5     NaN
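
As a quick check, FracMod is simply the ratio of these two counts at each position:

# FracMod corresponds to Nmod / Nvalid
assay(se_summary, "Nmod")["chr1:6928850:-", ] /
    assay(se_summary, "Nvalid")["chr1:6928850:-", ]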

In the example above, we additionally calculate the average modification probability (Pmod) and the average confidence of the modification calls per position (AvgConf). Because we have set keep.reads = TRUE, the read-level assay from the input object (mod_prob), in which the reads are grouped by sample, is also retained:

# read-level data is retained in "mod_prob" assay
assayNames(se_summary)
#> [1] "Nmod"     "Nvalid"   "FracMod"  "Pmod"     "AvgConf"  "mod_prob"

# ... which groups the reads by sample
assay(se_summary, "mod_prob")
#> DataFrame with 7967 rows and 2 columns
#>                        sample1        sample2
#>                 <SparseMatrix> <SparseMatrix>
#> chr1:6930410:+           0:0:0         0:0.02
#> chr1:6930415:+           0:0:0         0:0.02
#> chr1:6930419:+           0:0:0         0:0.02
#> chr1:6930421:+           0:0:0         0:0.02
#> chr1:6930428:+           0:0:0         0:0.02
#> ...                        ...            ...
#> chr1:6941611:- 0:0:0.052734375            0:0
#> chr1:6941614:- 0:0:0.130859375            0:0
#> chr1:6941620:- 0:0:0.068359375            0:0
#> chr1:6941622:- 0:0:0.060546875            0:0
#> chr1:6941631:- 0:0:0.056640625            0:0

# the dimensions of read-level `se` and summarized `se_summary` are identical
dim(se)
#> [1] 7967    2
dim(se_summary)
#> [1] 7967    2

Plot data

The read-level data can then be visualized just like the summary-level data with plotRegion(), using the tracks.reads argument instead of tracks.summary to select the type of plot(s).

For reference, here we plot the summary-level data:

plotRegion(se_summary, region = "chr1:6932700-6932800",
           tracks.summary = list(FracMod = "Point"))

A summary-level point plot showing the fraction of modified bases per position.

… and here we plot the read-level data of the same region:

plotRegion(se, region = "chr1:6932700-6932800",
           tracks.summary = NULL,
           tracks.reads = list(mod_prob = "Heatmap"))

A read-level heatmap showing modification probabilities of individual modified bases of each read grouped by sample.

plotRegion(se, region = "chr1:6932700-6932800",
           tracks.summary = NULL,
           tracks.reads = list(mod_prob = "Lollipop"))

A read-level point (lollipop) plot showing modification probabilities of individual modified bases of each read grouped by sample.

The x-axis in these plots is in “base-space”, meaning that it shows the coordinates of genomic bases, on which modifications can be irregularly spaced. Alternatively, we can also generate these plots in “modbase-space”, in which only modified bases are shown and the gaps between them are removed:

plotRegion(se, region = "chr1:6932700-6932800",
           tracks.summary = NULL,
           tracks.reads = list(mod_prob = "Heatmap"),
           modbaseSpace = TRUE)

A read-level heatmap showing modification probabilities of individual modified bases only, without intervening non-modified bases.

plotRegion(se, region = "chr1:6932700-6932800",
           tracks.summary = NULL,
           tracks.reads = list(mod_prob = "Lollipop"),
           modbaseSpace = TRUE)

A read-level point (lollipop) plot showing modification probabilities of individual modified bases only, without intervening non-modified bases.

As mentioned above, the read-level data is still contained in the summary object se_summary, so we can plot both summary-level and read-level data simultaneously with this object as input:

plotRegion(se_summary, region = "chr1:6932700-6932800",
           tracks.summary = list(FracMod = c("Smooth")),
           tracks.reads = list(mod_prob = c("Heatmap", "Lollipop")))

A plot of modification data combining different track types.

Analyze read-level data

Amongst other things, single-molecule footprinting data allows us to measure the average distance between neighboring nucleosome particles, the so-called nucleosome repeat length (NRL). Traditionally, this was measured using MNase-seq data (phasogram analysis, see Valouev et al., Nature 2011, doi:10.1038/nature10002), which required millions of reads for an accurate estimate. By instead measuring distances between modified bases within reads, we extract much more information per read and can already obtain accurate estimates from a handful of reads:

moddist <- calcModbaseSpacing(se)
print(estimateNRL(moddist$sample1)[1:2])
#> $nrl
#> [1] 182.6
#> 
#> $nrl.CI95
#>    2.5 %   97.5 % 
#> 179.5475 185.6525
plotModbaseSpacing(moddist$sample1)

plotModbaseSpacing(moddist$sample1, detailedPlots = TRUE)

Session info

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: UTC
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SummarizedExperiment_1.35.1 Biobase_2.65.1             
#>  [3] GenomicRanges_1.57.1        GenomeInfoDb_1.41.1        
#>  [5] IRanges_2.39.2              S4Vectors_0.43.2           
#>  [7] BiocGenerics_0.51.1         MatrixGenerics_1.17.0      
#>  [9] matrixStats_1.4.1           footprintR_0.1.2           
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1            viridisLite_0.4.2          
#>  [3] dplyr_1.1.4                 farver_2.1.2               
#>  [5] Biostrings_2.73.1           bitops_1.0-8               
#>  [7] fastmap_1.2.0               SingleCellExperiment_1.27.2
#>  [9] RCurl_1.98-1.16             GenomicAlignments_1.41.0   
#> [11] XML_3.99-0.17               digest_0.6.37              
#> [13] lifecycle_1.0.4             magrittr_2.0.3             
#> [15] compiler_4.4.1              rlang_1.1.4                
#> [17] sass_0.4.9                  tools_4.4.1                
#> [19] utf8_1.2.4                  yaml_2.3.10                
#> [21] data.table_1.16.0           rtracklayer_1.65.0         
#> [23] knitr_1.48                  S4Arrays_1.5.7             
#> [25] labeling_0.4.3              curl_5.2.2                 
#> [27] DelayedArray_0.31.11        abind_1.4-8                
#> [29] BiocParallel_1.39.0         withr_3.0.1                
#> [31] purrr_1.0.2                 desc_1.4.3                 
#> [33] grid_4.4.1                  fansi_1.0.6                
#> [35] beachmat_2.21.6             colorspace_2.1-1           
#> [37] ggplot2_3.5.1               scales_1.3.0               
#> [39] cli_3.6.3                   rmarkdown_2.28             
#> [41] crayon_1.5.3                ragg_1.3.3                 
#> [43] generics_0.1.3              httr_1.4.7                 
#> [45] rjson_0.2.23                scuttle_1.15.4             
#> [47] cachem_1.1.0                zlibbioc_1.51.1            
#> [49] parallel_4.4.1              XVector_0.45.0             
#> [51] restfulr_0.0.15             vctrs_0.6.5                
#> [53] Matrix_1.7-0                jsonlite_1.8.8             
#> [55] patchwork_1.3.0             systemfonts_1.1.0          
#> [57] jquerylib_0.1.4             tidyr_1.3.1                
#> [59] glue_1.7.0                  pkgdown_2.1.0.9000         
#> [61] codetools_0.2-20            gtable_0.3.5               
#> [63] BiocIO_1.15.2               UCSC.utils_1.1.0           
#> [65] munsell_0.5.1               tibble_3.2.1               
#> [67] pillar_1.9.0                htmltools_0.5.8.1          
#> [69] GenomeInfoDbData_1.2.12     BSgenome_1.73.1            
#> [71] R6_2.5.1                    textshaping_0.4.0          
#> [73] evaluate_1.0.0              lattice_0.22-6             
#> [75] highr_0.11                  Rsamtools_2.21.1           
#> [77] bslib_0.8.0                 Rcpp_1.0.13                
#> [79] SparseArray_1.5.34          xfun_0.47                  
#> [81] fs_1.6.4                    pkgconfig_2.0.3