Given a set of sequences, calculate observed and expected k-mer
frequencies. Expected frequencies are based on a Markov model of order
MMorder
.
Arguments
- seqs
Set of sequences, either a
character
vector or aDNAStringSet
.- kmerLen
A
numeric
scalar giving the k-mer length.- MMorder
A
numeric
scalar giving the order of the Markov model used to calculate the expected frequencies.- pseudocount
A
numeric
scalar - will be added to the observed counts for each k-mer to avoid zero values.- zoops
A
logical
scalar. IfTRUE
(the default), only one or zero occurences of a k-mer are considered per sequence.- strata
A
factor
or anumeric
scalar defining the strata of sequences. A separate Markov model and expected k-mer frequencies are estimated for the set of sequences in each stratum (level in astrata
factor). Ifstrata
is a scalar value, it will be interpreted as the number of strata to split the sequences into according to their CpG observed-over-expected counts usingkmeans(CpGoe, centers = strata)
.- p.adjust.method
A character scalar selecting the p value adjustment method (used in
p.adjust
).- includeRevComp
A
logical
scalar. IfTRUE
(default), count k-mer occurrences in bothseqs
and their reverse-complement, by concatenatingseqs
and their reverse-complemented versions before the counting. This is useful if motifs can be expected to occur on any strand (e.g. DNA sequences of ChIP-seq peaks). If motifs are only expected on the forward strand (e.g. RNA sequences of CLIP-seq peaks),includeRevComp = FALSE
should be used. Note that ifstrata
is a vector of the same length asseqs
, each reverse-complemented sequence will be assigned to the same stratum as the forward sequence.
Value
A list
with observed and expected k-mer frequencies
(freq.obs
and freq.exp
, respectively), and enrichment
statistics for each k-mer.
Examples
res <- getKmerFreq(seqs = c("AAAAATT", "AAATTTT"), kmerLen = 3)
names(res)
#> [1] "freq.obs" "freq.exp" "log2enr" "sqrtDelta" "z"
#> [6] "p" "padj" "strata" "freq.strata" "CpGoe"
head(res$freq.obs)
#> AAA AAC AAG AAT ACA ACC
#> 3 0 0 4 0 0
head(res$freq.exp)
#> AAA AAC AAG AAT ACA ACC
#> 1.8295455 0.1663223 0.1663223 1.4969008 0.3326446 0.3326446