Collapse mutants by similarity — collapseMutantsBySimilarity • mutscan

These functions collapse variants, either by sequence similarity or according to a pre-defined grouping. collapseMutants and collapseMutantsByAA assume that the grouping variable is available as a column in rowData(se); collapseMutantsByAA is a convenience function for the case where this column is "mutantNameAA", and is provided for backwards compatibility. collapseMutantsBySimilarity instead generates the grouping variable from user-provided thresholds on the sequence similarity (defined by the Hamming distance), and then collapses based on the derived grouping.

Usage

collapseMutantsBySimilarity(
  se,
  assayName,
  scoreMethod = "rowSum",
  sequenceCol = "sequence",
  collapseMaxDist = 0,
  collapseMinScore = 0,
  collapseMinRatio = 0,
  verbose = TRUE
)

collapseMutantsByAA(se)

collapseMutants(se, nameCol)

Arguments

se: A SummarizedExperiment generated by summarizeExperiment.
assayName: The name of the assay that will be used to calculate a "score" (typically derived from the read counts) for each variant.
scoreMethod: Character scalar giving the approach used to calculate ranking scores from the assay defined by assayName. Currently, this can be one of "rowSum" or "rowMean". All filtering criteria will be applied to these scores.
sequenceCol: Character scalar giving the name of the column in rowData(se) that contains the nucleotide sequence of the variants.
collapseMaxDist: Numeric scalar defining the tolerance for collapsing similar sequences. If the value is in [0, 1), it defines the maximal Hamming distance as a fraction of the sequence length: round(collapseMaxDist * nchar(sequence)). A value greater than or equal to 1 is rounded and used directly as the maximum allowed Hamming distance. Note that sequences can only be collapsed if they are all of the same length.
collapseMinScore: Numeric scalar, indicating the minimum score for the sequence to be considered for collapsing with similar sequences.
collapseMinRatio: Numeric scalar. During collapsing of similar sequences, a low-frequency sequence will be collapsed with a higher-frequency sequence only if the ratio between the high-frequency and the low-frequency scores is at least this high. The default value of 0 indicates that no such check is performed.
verbose: Logical, whether to print progress messages.
nameCol: A character scalar giving the column of rowData(se) that contains the amino acid mutant names; these become the row names of the collapsed object.
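
The interplay of collapseMaxDist and collapseMinRatio can be sketched in a few lines of base R. This is only an illustration of the rules stated in the argument descriptions, not the code used inside mutscan; the helpers toleranceFromMaxDist, hammingDist and passesRatio are hypothetical names introduced here, and the sequences and scores are made up.

```r
# Hypothetical helper (not part of mutscan): translate collapseMaxDist into an
# absolute Hamming-distance tolerance, following the rule described above.
toleranceFromMaxDist <- function(collapseMaxDist, sequence) {
    if (collapseMaxDist >= 0 && collapseMaxDist < 1) {
        # Values in [0, 1) are interpreted as a fraction of the sequence length
        round(collapseMaxDist * nchar(sequence))
    } else {
        # Values >= 1 are rounded and used directly
        round(collapseMaxDist)
    }
}

# Hypothetical helper: Hamming distance between two sequences of equal length
hammingDist <- function(a, b) {
    stopifnot(nchar(a) == nchar(b))
    sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Hypothetical helper: the ratio check described for collapseMinRatio.
# A value of 0 disables the check.
passesRatio <- function(highScore, lowScore, collapseMinRatio) {
    collapseMinRatio == 0 || highScore / lowScore >= collapseMinRatio
}

# Made-up sequences of length 10 that differ at positions 3 and 10
seqA <- "ACCGATACAC"
seqB <- "ACGGATACAT"

tol <- toleranceFromMaxDist(0.2, seqA)  # round(0.2 * 10)
d <- hammingDist(seqA, seqB)

# Under these assumptions, the pair is a candidate for collapsing if it is
# within the tolerance and the score ratio is large enough
candidate <- d <= tol && passesRatio(highScore = 100, lowScore = 10,
                                     collapseMinRatio = 5)
candidate
```

With a fractional tolerance of 0.2 and sequences of length 10, up to two mismatches are allowed, so the two toy sequences qualify; raising collapseMinRatio above 10 in this toy setting would block the collapse.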

Value

A SummarizedExperiment in which counts have been aggregated by the grouping variable (the mutated amino acid(s) in the case of collapseMutantsByAA).
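
To make the aggregation concrete, the following base R sketch collapses a toy count matrix by a grouping vector. It assumes that aggregation means summing the counts of all variants in a group, and it uses base R's rowsum rather than any mutscan function; the matrix, the sample names and the grouping are made up for illustration.

```r
# Toy count matrix: four codon variants (rows), two samples (columns)
counts <- matrix(c(10, 5, 3, 2,
                   20, 6, 4, 1),
                 ncol = 2,
                 dimnames = list(c("f.1.GCC", "f.1.GCG", "f.1.TGC", "f.1.GAC"),
                                 c("sample1", "sample2")))

# Grouping variable, playing the role of a column in rowData(se)
# (for example "mutantNameAA")
group <- c("f.1.A", "f.1.A", "f.1.C", "f.1.D")

# Sum the rows within each group; the group labels become the new row names
collapsed <- rowsum(counts, group)
collapsed
```

The two codon variants assigned to f.1.A end up in a single row, and the column totals are unchanged by the collapsing.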

Author

Charlotte Soneson, Michael Stadler

Examples

library(SummarizedExperiment)
se <- readRDS(system.file("extdata", "GSE102901_cis_se.rds",
                          package = "mutscan"))[1:200, ]
## The rows of this object correspond to individual codon variants
dim(se)
#> [1] 200   6
head(rownames(se))
#> [1] "f.0.WT"  "f.1.AAC" "f.1.AAG" "f.1.ACC" "f.1.ACG" "f.1.AGC"

## Collapse by amino acid
sec <- collapseMutantsByAA(se)
## The rows of the collapsed object correspond to amino acid variants
dim(sec)
#> [1] 128   6
head(rownames(sec))
#> [1] "f.0.WT" "f.1.*"  "f.1.A"  "f.1.C"  "f.1.D"  "f.1.E" 
## The mutantName column contains the individual codon variants that were 
## collapsed
head(rowData(sec))
#> DataFrame with 6 rows and 19 columns
#>        mutantNameAA               sequence             mutantName
#>         <character>            <character>            <character>
#> f.0.WT       f.0.WT ACCGATACACTCCAAGCGGA.. f.0.WT,f.1.ACC,f.1.A..
#> f.1.*         f.1.* TAGGATACACTCCAAGCGGA..                f.1.TAG
#> f.1.A         f.1.A GCCGATACACTCCAAGCGGA..        f.1.GCC,f.1.GCG
#> f.1.C         f.1.C TGCGATACACTCCAAGCGGA..                f.1.TGC
#> f.1.D         f.1.D GACGATACACTCCAAGCGGA..                f.1.GAC
#> f.1.E         f.1.E GAGGATACACTCCAAGCGGA..                f.1.GAG
#>                mutantNameBase     mutantNameBaseHGVS        mutantNameCodon
#>                   <character>            <character>            <character>
#> f.0.WT f.0.WT,f.3.C,f.3.G,f.. f:c,f:c.30A>G,f:c.31.. f.0.WT,f.1.ACC,f.1.A..
#> f.1.*       f.1.T_f.2.A_f.3.G       f:c.1_3delinsTAG                f.1.TAG
#> f.1.A  f.1.G_f.3.C,f.1.G_f... f:c.1_3delinsGCC,f:c..        f.1.GCC,f.1.GCG
#> f.1.C       f.1.T_f.2.G_f.3.C       f:c.1_3delinsTGC                f.1.TGC
#> f.1.D       f.1.G_f.2.A_f.3.C       f:c.1_3delinsGAC                f.1.GAC
#> f.1.E       f.1.G_f.2.A_f.3.G       f:c.1_3delinsGAG                f.1.GAG
#>        mutantNameAAHGVS             sequenceAA mutationTypes nbrMutBases
#>             <character>            <character>   <character> <character>
#> f.0.WT              f:p TDTLQAETDQLEDEKSALQT..        silent       0,1,2
#> f.1.*       f:p.(Thr1*) *DTLQAETDQLEDEKSALQT..          stop           3
#> f.1.A     f:p.(Thr1Ala) ADTLQAETDQLEDEKSALQT.. nonsynonymous           2
#> f.1.C     f:p.(Thr1Cys) CDTLQAETDQLEDEKSALQT.. nonsynonymous           3
#> f.1.D     f:p.(Thr1Asp) DDTLQAETDQLEDEKSALQT.. nonsynonymous           3
#> f.1.E     f:p.(Thr1Glu) EDTLQAETDQLEDEKSALQT.. nonsynonymous           3
#>        nbrMutCodons   nbrMutAAs  varLengths minNbrMutBases minNbrMutCodons
#>         <character> <character> <character>      <integer>       <integer>
#> f.0.WT          0,1           0          96              0               0
#> f.1.*             1           1          96              3               1
#> f.1.A             1           1          96              2               1
#> f.1.C             1           1          96              3               1
#> f.1.D             1           1          96              3               1
#> f.1.E             1           1          96              3               1
#>        minNbrMutAAs maxNbrMutBases maxNbrMutCodons maxNbrMutAAs
#>           <integer>      <integer>       <integer>    <integer>
#> f.0.WT            0              2               1            0
#> f.1.*             1              3               1            1
#> f.1.A             1              2               1            1
#> f.1.C             1              3               1            1
#> f.1.D             1              3               1            1
#> f.1.E             1              3               1            1

## Collapse similar sequences
sec2 <- collapseMutantsBySimilarity(
    se = se, assayName = "counts", scoreMethod = "rowSum",
    sequenceCol = "sequence", collapseMaxDist = 2,
    collapseMinScore = 0, collapseMinRatio = 0)
#> start collapsing sequences (tolerance: 2)...done (reduced from 200 to 12)
dim(sec2)
#> [1] 12  6
head(rownames(sec2))
#> [1] "f.0.WT"   "f.1.GGG"  "f.1.TTC"  "f.10.GGG" "f.10.TCC" "f.11.GCC"
head(rowData(sec2))
#> DataFrame with 6 rows and 20 columns
#>          collapseCol               sequence             mutantName
#>          <character>            <character>            <character>
#> f.0.WT        f.0.WT AACGATACACTCCAAGCGGA.. f.0.WT,f.1.AAC,f.1.A..
#> f.1.GGG      f.1.GGG CAGGATACACTCCAAGCGGA.. f.1.CAG,f.1.CGC,f.1...
#> f.1.TTC      f.1.TTC CACGATACACTCCAAGCGGA.. f.1.CAC,f.1.CTC,f.1...
#> f.10.GGG    f.10.GGG ACTGATACACTCCAAGCGGA.. f.10.ACG,f.10.AGC,f...
#> f.10.TCC    f.10.TCC ACTGATACACTCCAAGCGGA.. f.10.ACC,f.10.ATC,f...
#> f.11.GCC    f.11.GCC ACTGATACACTCCAAGCGGA.. f.11.ACC,f.11.AGC,f...
#>                  mutantNameBase     mutantNameBaseHGVS        mutantNameCodon
#>                     <character>            <character>            <character>
#> f.0.WT   f.0.WT,f.1.C_f.3.C,f.. f:c,f:c.1_3delinsCCC.. f.0.WT,f.1.AAC,f.1.A..
#> f.1.GGG  f.1.C_f.2.A_f.3.G,f... f:c.1_3delinsCAG,f:c.. f.1.CAG,f.1.CGC,f.1...
#> f.1.TTC  f.1.C_f.2.A_f.3.C,f... f:c.1_3delinsCAC,f:c.. f.1.CAC,f.1.CTC,f.1...
#> f.10.GGG f.28.A_f.29.C_f.30.G.. f:c.28_30delinsACG,f.. f.10.ACG,f.10.AGC,f...
#> f.10.TCC f.28.A_f.29.C_f.30.C.. f:c.28_30delinsACC,f.. f.10.ACC,f.10.ATC,f...
#> f.11.GCC f.31.A_f.32.C_f.33.C.. f:c.31_33delinsACC,f.. f.11.ACC,f.11.AGC,f...
#>                    mutantNameAA       mutantNameAAHGVS             sequenceAA
#>                     <character>            <character>            <character>
#> f.0.WT   f.0.WT,f.1.A,f.1.I,f.. f:p,f:p.(Asp13*),f:p.. ADTLQAETDQLEDEKSALQT..
#> f.1.GGG  f.1.*,f.1.C,f.1.D,f... f:p.(Thr1*),f:p.(Thr.. *DTLQAETDQLEDEKSALQT..
#> f.1.TTC  f.1.F,f.1.H,f.1.L,f... f:p.(Thr1His),f:p.(T.. FDTLQAETDQLEDEKSALQT..
#> f.10.GGG f.10.A,f.10.C,f.10.G.. f:p.(Gln10Ala),f:p.(.. TDTLQAETDALEDEKSALQT..
#> f.10.TCC f.10.F,f.10.I,f.10.S.. f:p.(Gln10Ile),f:p.(.. TDTLQAETDFLEDEKSALQT..
#> f.11.GCC f.11.A,f.11.G,f.11.S.. f:p.(Leu11Ala),f:p.(.. TDTLQAETDQAEDEKSALQT..
#>                   mutationTypes nbrMutBases nbrMutCodons   nbrMutAAs
#>                     <character> <character>  <character> <character>
#> f.0.WT   nonsynonymous,silent..       0,1,2          0,1         0,1
#> f.1.GGG      nonsynonymous,stop           3            1           1
#> f.1.TTC           nonsynonymous           3            1           1
#> f.10.GGG          nonsynonymous           3            1           1
#> f.10.TCC          nonsynonymous           3            1           1
#> f.11.GCC          nonsynonymous           3            1           1
#>           varLengths minNbrMutBases minNbrMutCodons minNbrMutAAs maxNbrMutBases
#>          <character>      <integer>       <integer>    <integer>      <integer>
#> f.0.WT            96              0               0            0              2
#> f.1.GGG           96              3               1            1              3
#> f.1.TTC           96              3               1            1              3
#> f.10.GGG          96              3               1            1              3
#> f.10.TCC          96              3               1            1              3
#> f.11.GCC          96              3               1            1              3
#>          maxNbrMutCodons maxNbrMutAAs
#>                <integer>    <integer>
#> f.0.WT                 1            1
#> f.1.GGG                1            1
#> f.1.TTC                1            1
#> f.10.GGG               1            1
#> f.10.TCC               1            1
#> f.11.GCC               1            1
## Collapsed count matrix
assay(sec2, "counts")
#>          SRR5952435 SRR5952436 SRR5952437 SRR5952438 SRR5952439 SRR5952440
#> f.0.WT      3068572    3757468    3572269    6831160    6678754    6624525
#> f.1.GGG         621        726        503        190         98        208
#> f.1.TTC          97        115        109         27         18         23
#> f.10.GGG        470        511        371        173         84        181
#> f.10.TCC         36         82         32         11          7         20
#> f.11.GCC         61         74         42         13         10         14
#> f.11.TAG        259        294        196         55         41         64
#> f.12.AGG         81         81         51         19         15         46
#> f.12.CCC        286        294        167         76         45         89
#> f.13.CGC        471        581        447        156         87        195
#> f.13.TTG        114        146         86         37         12         42
#> f.14.CGC        122        137        123         36         26         42