Skip to contents

This function reads one or more modkit extract files and generates a single SummarizedExperiment object with rows representing positions and columns representing samples. The SummarizedExperiment object contains one assay (mod_prob) with modification probabilities for each position in each read. Unobserved read/position combinations are represented by NA.

Usage

readModkitExtract(
  fnames,
  modbase,
  filter = NULL,
  nrows = Inf,
  seqinfo = NULL,
  sequenceContextWidth = 0,
  sequenceReference = NULL,
  BPPARAM = bpparam(),
  verbose = FALSE
)

Arguments

fnames

Character vector with one or several paths of extract.tsv files, such as generated by modkit. If fnames is a named vector, the names are used as prefixes for the column (read) names in the returned SummarizedExperiment object, and as column names for the object itself. Otherwise, the prefixes/names will be s1, ..., sN, where N is the length of fnames.

modbase

Character vector defining the modified base for each sample. If modbase is a named vector, the names should correspond to the names of fnames. Otherwise, it will be assumed that the elements are in the same order as the files in fnames. If modbase has length 1, the same modified base will be used for all samples.

filter

Either NULL (no filtering is performed), "modkit" (only records where the "fail" column in the extract file is FALSE are retained), or a named vector, providing filter thresholds (on the call probability) for the unmodified and modified bases, respectively. The names of the vector should be "-" and the value(s) in modbase.

nrows

Only read nrows rows of each input file.

seqinfo

NULL or a Seqinfo object containing information about the set of genomic sequences (chromosomes). Alternatively, a named numeric vector with genomic sequence names and lengths. Useful to set the sorting order of sequence names.

sequenceContextWidth, sequenceReference

Define the sequence context to be extracted around modified bases. By default ( sequenceContextWidth = 0), no sequence context will be extracted, otherwise it will be returned in rowData(x)$sequenceContext. See addSeqContext for details.

BPPARAM

A BiocParallelParam object that controls the number of parallel CPU threads to use for some of the steps in readModkitExtract(). The default value (bpparam) will select an appropriate value for the current environment, or the default parallel backend registered using register.

verbose

If TRUE, report on progress.

Value

A SummarizedExperiment object with genomic positions in rows and samples in columns. The assay "mod_prob" contains per-read modification probabilities, with each column (sample) corresponding to a position-by-read NaMatrix.

See also

modkit software, SummarizedExperiment for the returned object type, fread for the function used to read the input files

Author

Charlotte Soneson, Michael Stadler

Examples

extrfile <- system.file("extdata", "modkit_extract_rc_5mC_1.tsv.gz",
                        package = "footprintR")
## ... no filtering
readModkitExtract(extrfile, modbase = "m", filter = NULL)
#> class: RangedSummarizedExperiment 
#> dim: 6432 1 
#> metadata(3): modkit_threshold filter_threshold readLevelData
#> assays(1): mod_prob
#> rownames(6432): chr1:6932662:+ chr1:6932664:+ ... chr1:6942543:-
#>   chr1:6942552:-
#> rowData names(0):
#> colnames(1): s1
#> colData names(2): sample modbase
## ... modkit filtering
readModkitExtract(extrfile, modbase = "m", filter = "modkit")
#> class: RangedSummarizedExperiment 
#> dim: 5893 1 
#> metadata(3): modkit_threshold filter_threshold readLevelData
#> assays(1): mod_prob
#> rownames(5893): chr1:6932662:+ chr1:6932664:+ ... chr1:6942543:-
#>   chr1:6942552:-
#> rowData names(0):
#> colnames(1): s1
#> colData names(2): sample modbase