Read modkit extract file(s)

This function reads one or more modkit extract files and generates a single SummarizedExperiment object with rows representing positions and columns representing samples. The SummarizedExperiment object contains one assay (mod_prob) with modification probabilities for each position in each read. Unobserved read/position combinations are represented by NA.

Usage

readModkitExtract(
  fnames,
  modbase,
  filter = NULL,
  nrows = Inf,
  sampleAnnot = NULL,
  seqinfo = NULL,
  sequenceContextWidth = 0,
  sequenceReference = NULL,
  BPPARAM = MulticoreParam(4L, RNGseed = 42L),
  verbose = FALSE
)

Arguments

fnames: Character vector with one or several paths of extract.tsv files, such as generated by modkit. If fnames is a named vector, the names are used as prefixes for the column (read) names in the returned SummarizedExperiment object, and as column names for the object itself. Otherwise, the prefixes/names will be s1, ..., sN, where N is the length of fnames.
modbase: Character vector defining the modified base for each sample. If modbase is a named vector, the names should correspond to the names of fnames. Otherwise, it will be assumed that the elements are in the same order as the files in fnames. If modbase has length 1, the same modified base will be used for all samples.
filter: Either NULL (no filtering is performed), "modkit" (only records where the "fail" column in the extract file is FALSE are retained), or a named vector, providing filter thresholds (on the call probability) for the unmodified and modified bases, respectively. The names of the vector should be "-" and the value(s) in modbase.
nrows: Only read nrows rows of each input file.
sampleAnnot: A data.frame (or NULL) providing annotations for the samples. It must contain at least one column, named "sample", which must contain all the values of names(fnames). The provided annotations will be propagated to the returned SummarizedExperiment object.
seqinfo: NULL or a Seqinfo object containing information about the set of genomic sequences (chromosomes). Alternatively, a named numeric vector with genomic sequence names and lengths. Useful to set the sorting order of sequence names.
sequenceContextWidth, sequenceReference: Define the sequence context to be extracted around modified bases. By default ( sequenceContextWidth = 0), no sequence context will be extracted, otherwise it will be returned in rowData(x)$sequenceContext. See addSeqContext for details.
BPPARAM: A BiocParallelParam object that controls the number of parallel CPU threads to use for some of the steps in readModkitExtract(). The default value is (MulticoreParam(4L, RNGseed = 42L)).
verbose: If TRUE, report on progress.

Value

A SummarizedExperiment object with genomic positions in rows and samples in columns. The assay "mod_prob" contains per-read modification probabilities, with each column (sample) corresponding to a position-by-read NaMatrix.

Author

Charlotte Soneson, Michael Stadler

Examples

extrfile <- system.file("extdata", "modkit_extract_rc_5mC_1.tsv.gz",
                        package = "footprintR")
## ... no filtering
readModkitExtract(extrfile, modbase = "m", filter = NULL,
                  BPPARAM = BiocParallel::SerialParam())
#> class: RangedSummarizedExperiment 
#> dim: 6432 1 
#> metadata(3): modkit_threshold filter_threshold readLevelData
#> assays(1): mod_prob
#> rownames(6432): chr1:6928983:- chr1:6928995:- ... chr1:6949455:+
#>   chr1:6949461:+
#> rowData names(0):
#> colnames(1): s1
#> colData names(2): sample modbase
## ... modkit filtering
readModkitExtract(extrfile, modbase = "m", filter = "modkit",
                  BPPARAM = BiocParallel::SerialParam())
#> class: RangedSummarizedExperiment 
#> dim: 5893 1 
#> metadata(3): modkit_threshold filter_threshold readLevelData
#> assays(1): mod_prob
#> rownames(5893): chr1:6928983:- chr1:6928995:- ... chr1:6949455:+
#>   chr1:6949461:+
#> rowData names(0):
#> colnames(1): s1
#> colData names(2): sample modbase

Usage

Arguments

Value

See also

Author

Examples