This function reads one or more modkit extract files and generates a
single SummarizedExperiment object with rows representing positions and
columns representing samples. The SummarizedExperiment object
contains one assay (mod_prob) with modification probabilities for
each position in each read. Unobserved read/position combinations are
represented by NA.
Usage
readModkitExtract(
fnames,
modbase,
filter = NULL,
nrows = Inf,
sampleAnnot = NULL,
seqinfo = NULL,
sequenceContextWidth = 0,
sequenceReference = NULL,
BPPARAM = MulticoreParam(4L, RNGseed = 42L),
verbose = FALSE
)Arguments
- fnames
Character vector with one or several paths of
extract.tsvfiles, such as generated bymodkit. Iffnamesis a named vector, the names are used as prefixes for the column (read) names in the returnedSummarizedExperimentobject, and as column names for the object itself. Otherwise, the prefixes/names will bes1, ...,sN, whereNis the length offnames.- modbase
Character vector defining the modified base for each sample. If
modbaseis a named vector, the names should correspond to the names offnames. Otherwise, it will be assumed that the elements are in the same order as the files infnames. Ifmodbasehas length 1, the same modified base will be used for all samples.- filter
Either
NULL(no filtering is performed),"modkit"(only records where the"fail"column in the extract file isFALSEare retained), or a named vector, providing filter thresholds (on the call probability) for the unmodified and modified bases, respectively. The names of the vector should be"-"and the value(s) inmodbase.- nrows
Only read
nrowsrows of each input file.- sampleAnnot
A
data.frame(orNULL) providing annotations for the samples. It must contain at least one column, named"sample", which must contain all the values ofnames(fnames). The provided annotations will be propagated to the returnedSummarizedExperimentobject.- seqinfo
NULLor aSeqinfoobject containing information about the set of genomic sequences (chromosomes). Alternatively, a named numeric vector with genomic sequence names and lengths. Useful to set the sorting order of sequence names.- sequenceContextWidth, sequenceReference
Define the sequence context to be extracted around modified bases. By default (
sequenceContextWidth = 0), no sequence context will be extracted, otherwise it will be returned inrowData(x)$sequenceContext. SeeaddSeqContextfor details.- BPPARAM
A
BiocParallelParamobject that controls the number of parallel CPU threads to use for some of the steps inreadModkitExtract(). The default value is (MulticoreParam(4L, RNGseed = 42L)).- verbose
If
TRUE, report on progress.
Value
A SummarizedExperiment object
with genomic positions in rows and samples in columns. The assay
"mod_prob" contains per-read modification probabilities,
with each column (sample) corresponding to a position-by-read
NaMatrix.
See also
modkit software,
SummarizedExperiment for the returned object type,
fread for the function used to read the input files
Examples
extrfile <- system.file("extdata", "modkit_extract_rc_5mC_1.tsv.gz",
package = "footprintR")
## ... no filtering
readModkitExtract(extrfile, modbase = "m", filter = NULL,
BPPARAM = BiocParallel::SerialParam())
#> class: RangedSummarizedExperiment
#> dim: 6432 1
#> metadata(3): modkit_threshold filter_threshold readLevelData
#> assays(1): mod_prob
#> rownames(6432): chr1:6928983:- chr1:6928995:- ... chr1:6949455:+
#> chr1:6949461:+
#> rowData names(0):
#> colnames(1): s1
#> colData names(2): sample modbase
## ... modkit filtering
readModkitExtract(extrfile, modbase = "m", filter = "modkit",
BPPARAM = BiocParallel::SerialParam())
#> class: RangedSummarizedExperiment
#> dim: 5893 1
#> metadata(3): modkit_threshold filter_threshold readLevelData
#> assays(1): mod_prob
#> rownames(5893): chr1:6928983:- chr1:6928995:- ... chr1:6949455:+
#> chr1:6949461:+
#> rowData names(0):
#> colnames(1): s1
#> colData names(2): sample modbase