This function reads one or more modkit extract files and generates a
single SummarizedExperiment object with rows representing positions and
columns representing samples. The SummarizedExperiment object
contains one assay (mod_prob
) with modification probabilities for
each position in each read. Unobserved read/position combinations are
represented by NA
.
Usage
readModkitExtract(
fnames,
modbase,
filter = NULL,
nrows = Inf,
seqinfo = NULL,
sequenceContextWidth = 0,
sequenceReference = NULL,
BPPARAM = bpparam(),
verbose = FALSE
)
Arguments
- fnames
Character vector with one or several paths of
extract.tsv
files, such as generated bymodkit
. Iffnames
is a named vector, the names are used as prefixes for the column (read) names in the returnedSummarizedExperiment
object, and as column names for the object itself. Otherwise, the prefixes/names will bes1
, ...,sN
, whereN
is the length offnames
.- modbase
Character vector defining the modified base for each sample. If
modbase
is a named vector, the names should correspond to the names offnames
. Otherwise, it will be assumed that the elements are in the same order as the files infnames
. Ifmodbase
has length 1, the same modified base will be used for all samples.- filter
Either
NULL
(no filtering is performed),"modkit"
(only records where the"fail"
column in the extract file isFALSE
are retained), or a named vector, providing filter thresholds (on the call probability) for the unmodified and modified bases, respectively. The names of the vector should be"-"
and the value(s) inmodbase
.- nrows
Only read
nrows
rows of each input file.- seqinfo
NULL
or aSeqinfo
object containing information about the set of genomic sequences (chromosomes). Alternatively, a named numeric vector with genomic sequence names and lengths. Useful to set the sorting order of sequence names.- sequenceContextWidth, sequenceReference
Define the sequence context to be extracted around modified bases. By default (
sequenceContextWidth = 0
), no sequence context will be extracted, otherwise it will be returned inrowData(x)$sequenceContext
. SeeaddSeqContext
for details.- BPPARAM
A
BiocParallelParam
object that controls the number of parallel CPU threads to use for some of the steps inreadModkitExtract()
. The default value (bpparam
) will select an appropriate value for the current environment, or the default parallel backend registered usingregister
.- verbose
If
TRUE
, report on progress.
Value
A SummarizedExperiment
object
with genomic positions in rows and samples in columns. The assay
"mod_prob"
contains per-read modification probabilities,
with each column (sample) corresponding to a position-by-read
NaMatrix
.
See also
modkit
software,
SummarizedExperiment
for the returned object type,
fread
for the function used to read the input files
Examples
extrfile <- system.file("extdata", "modkit_extract_rc_5mC_1.tsv.gz",
package = "footprintR")
## ... no filtering
readModkitExtract(extrfile, modbase = "m", filter = NULL)
#> class: RangedSummarizedExperiment
#> dim: 6432 1
#> metadata(3): modkit_threshold filter_threshold readLevelData
#> assays(1): mod_prob
#> rownames(6432): chr1:6932662:+ chr1:6932664:+ ... chr1:6942543:-
#> chr1:6942552:-
#> rowData names(0):
#> colnames(1): s1
#> colData names(2): sample modbase
## ... modkit filtering
readModkitExtract(extrfile, modbase = "m", filter = "modkit")
#> class: RangedSummarizedExperiment
#> dim: 5893 1
#> metadata(3): modkit_threshold filter_threshold readLevelData
#> assays(1): mod_prob
#> rownames(5893): chr1:6932662:+ chr1:6932664:+ ... chr1:6942543:-
#> chr1:6942552:-
#> rowData names(0):
#> colnames(1): s1
#> colData names(2): sample modbase