Skip to contents

Create a conversion table for collapsing similar sequences

Usage

groupSimilarSequences(
  seqs,
  scores,
  collapseMaxDist,
  collapseMinScore,
  collapseMinRatio,
  verbose
)

Arguments

seqs

Character vector with nucleotide sequences (or pairs of sequences concatenated with "_") to be collapsed. The sequences must all be of the same length.

scores

Numeric vector of "scores" for the sequences. Typically the total read/UMI count. A higher score will be preferred when deciding which sequence to use as the representative for a group of collapsed sequences.

collapseMaxDist

Numeric scalar defining the tolerance for collapsing similar sequences. If the value is in [0, 1), it defines the maximal Hamming distance in terms of a fraction of sequence length: (round(collapseMaxDist * nchar(sequence))). A value greater or equal to 1 is rounded and directly used as the maximum allowed Hamming distance. Note that sequences can only be collapsed if they are all of the same length.

collapseMinScore

Numeric scalar, indicating the minimum score required for a sequence to be considered as a representative for a group of similar sequences (i.e., to allow other sequences to be collapsed into it).

collapseMinRatio

Numeric scalar. During collapsing of similar sequences, a low-frequency sequence will be collapsed with a higher-frequency sequence only if the ratio between the high-frequency and the low-frequency scores is at least this high. A value of 0 indicates that no such check is performed.

verbose

Logical scalar, whether to print progress messages.

Value

A data.frame with two columns, containing the input sequences and the representatives for the groups resulting from grouping similar sequences, respectively.

Author

Michael Stadler, Charlotte Soneson

Examples

seqs <- c("AACGTAGCA", "ACCGTAGCA", "AACGGAGCA", "ATCGGAGCA", "TGAGGCATA")
scores <- c(5, 1, 3, 1, 8)
groupSimilarSequences(seqs = seqs, scores = scores, 
                      collapseMaxDist = 1, collapseMinScore = 0, 
                      collapseMinRatio = 0, verbose = FALSE)
#>    sequence representative
#> 1 AACGTAGCA      AACGTAGCA
#> 2 ACCGTAGCA      AACGTAGCA
#> 3 AACGGAGCA      AACGTAGCA
#> 4 ATCGGAGCA      ATCGGAGCA
#> 5 TGAGGCATA      TGAGGCATA