Create a conversion table for collapsing similar sequences
groupSimilarSequences.Rd
Usage
groupSimilarSequences(
  seqs,
  scores,
  collapseMaxDist,
  collapseMinScore,
  collapseMinRatio,
  verbose
)
Arguments
- seqs
Character vector with nucleotide sequences (or pairs of sequences concatenated with "_") to be collapsed. The sequences must all be of the same length.
- scores
Numeric vector of "scores" for the sequences. Typically the total read/UMI count. A higher score will be preferred when deciding which sequence to use as the representative for a group of collapsed sequences.
- collapseMaxDist
Numeric scalar defining the tolerance for collapsing similar sequences. If the value is in [0, 1), it defines the maximal Hamming distance as a fraction of the sequence length (round(collapseMaxDist * nchar(sequence))). A value greater than or equal to 1 is rounded and used directly as the maximum allowed Hamming distance. Note that sequences can only be collapsed if they are all of the same length.
- collapseMinScore
Numeric scalar, indicating the minimum score required for a sequence to be considered as a representative for a group of similar sequences (i.e., to allow other sequences to be collapsed into it).
- collapseMinRatio
Numeric scalar. During collapsing of similar sequences, a low-frequency sequence will be collapsed with a higher-frequency sequence only if the ratio between the high-frequency and the low-frequency scores is at least this high. A value of 0 indicates that no such check is performed.
- verbose
Logical scalar, whether to print progress messages.
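The interpretation of collapseMaxDist and collapseMinRatio described above can be sketched as follows. This is only an illustration of the documented rules, not the package's internal code; the helper names maxHammingDist and passesRatioCheck are hypothetical and do not exist in the package.

```r
# Hypothetical helper: translate collapseMaxDist into an absolute
# Hamming distance, following the rule stated in the argument description.
maxHammingDist <- function(collapseMaxDist, seqLength) {
    if (collapseMaxDist < 1) {
        # Values in [0, 1) are a fraction of the sequence length
        round(collapseMaxDist * seqLength)
    } else {
        # Values >= 1 are rounded and used directly
        round(collapseMaxDist)
    }
}

# Hypothetical helper: check the score ratio between a high-scoring and a
# low-scoring sequence. A collapseMinRatio of 0 disables the check.
passesRatioCheck <- function(highScore, lowScore, collapseMinRatio) {
    if (collapseMinRatio == 0) {
        return(TRUE)
    }
    highScore / lowScore >= collapseMinRatio
}

maxHammingDist(0.1, 9)     # round(0.9): at most 1 mismatch in 9 bases
maxHammingDist(2, 9)       # used directly: at most 2 mismatches
passesRatioCheck(5, 1, 3)  # ratio 5 is at least 3
passesRatioCheck(5, 3, 3)  # ratio of about 1.67 is below 3
```

With 9-nucleotide sequences, a fractional tolerance of 0.1 and an absolute tolerance of 1 therefore allow the same number of mismatches.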
Value
A data.frame with two columns, containing the input sequences and the representatives for the groups resulting from grouping similar sequences, respectively.
Examples
seqs <- c("AACGTAGCA", "ACCGTAGCA", "AACGGAGCA", "ATCGGAGCA", "TGAGGCATA")
scores <- c(5, 1, 3, 1, 8)
groupSimilarSequences(seqs = seqs, scores = scores,
                      collapseMaxDist = 1, collapseMinScore = 0,
                      collapseMinRatio = 0, verbose = FALSE)
#>    sequence representative
#> 1 AACGTAGCA      AACGTAGCA
#> 2 ACCGTAGCA      AACGTAGCA
#> 3 AACGGAGCA      AACGTAGCA
#> 4 ATCGGAGCA      ATCGGAGCA
#> 5 TGAGGCATA      TGAGGCATA
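To see why the fourth sequence is not collapsed into the first one, it can help to compute the Hamming distances by hand. The snippet below is a minimal base R sketch; the hamming function is a hypothetical helper written for this illustration and is not exported by the package.

```r
seqs <- c("AACGTAGCA", "ACCGTAGCA", "AACGGAGCA", "ATCGGAGCA", "TGAGGCATA")

# Hypothetical helper: number of mismatching positions between two
# sequences of equal length
hamming <- function(a, b) {
    sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Distance of each sequence to the first one (score 5)
distToFirst <- vapply(seqs, hamming, integer(1), b = seqs[1],
                      USE.NAMES = FALSE)
distToFirst
```

Sequences 2 and 3 each differ from the first sequence at one position, which is within collapseMaxDist = 1, whereas sequence 4 differs at two positions and sequence 5 at seven. This matches the representatives shown in the output above.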