These functions collapse variants, either by sequence similarity or
according to a predefined grouping. collapseMutants and
collapseMutantsByAA assume that a grouping variable is available
as a column in rowData(se) (collapseMutantsByAA is a
convenience function for the case when this column is "mutantNameAA", and
is provided for backwards compatibility).
collapseMutantsBySimilarity generates the grouping variable
from user-provided thresholds on the sequence similarity (defined by
the Hamming distance), and then collapses based on the derived
grouping.
Usage
collapseMutantsBySimilarity(
se,
assayName,
scoreMethod = "rowSum",
sequenceCol = "sequence",
collapseMaxDist = 0,
collapseMinScore = 0,
collapseMinRatio = 0,
verbose = TRUE
)
collapseMutantsByAA(se)
collapseMutants(se, nameCol)
Arguments
- se
A SummarizedExperiment generated by summarizeExperiment.
- assayName
The name of the assay that will be used to calculate a "score" (typically derived from the read counts) for each variant.
- scoreMethod
Character scalar giving the approach used to calculate ranking scores from the assay defined by assayName. Currently, this can be one of "rowSum" or "rowMean". All filtering criteria will be applied to these scores.
- sequenceCol
Character scalar giving the name of the column in rowData(se) that contains the nucleotide sequence of the variants.
- collapseMaxDist
Numeric scalar defining the tolerance for collapsing similar sequences. If the value is in [0, 1), it defines the maximal Hamming distance as a fraction of the sequence length (round(collapseMaxDist * nchar(sequence))). A value greater than or equal to 1 is rounded and used directly as the maximum allowed Hamming distance. Note that sequences can only be collapsed if they are all of the same length.
- collapseMinScore
Numeric scalar indicating the minimum score required for a sequence to be considered for collapsing with similar sequences.
- collapseMinRatio
Numeric scalar. During collapsing of similar sequences, a low-frequency sequence will be collapsed with a higher-frequency sequence only if the ratio between the high-frequency and the low-frequency scores is at least this high. The default value of 0 indicates that no such check is performed.
- verbose
Logical, whether to print progress messages.
- nameCol
A character scalar providing the column of rowData(se) that contains the amino acid mutant names (that will be the new row names).
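The following base R sketch illustrates how the descriptions of collapseMaxDist and collapseMinRatio above can be read. The helpers maxHammingDist, hammingDist and canCollapse are hypothetical names introduced only for this illustration; they are not part of the package, and the package's internal implementation may differ.

```r
## Hypothetical helper (not part of the package): translate collapseMaxDist
## into an absolute Hamming distance threshold, following the argument
## description.
maxHammingDist <- function(collapseMaxDist, seqLength) {
    if (collapseMaxDist >= 0 && collapseMaxDist < 1) {
        ## Value in [0, 1): fraction of the sequence length
        round(collapseMaxDist * seqLength)
    } else {
        ## Value >= 1: rounded and used directly
        round(collapseMaxDist)
    }
}

## Hypothetical helper: Hamming distance between two sequences of
## equal length (number of positions at which they differ)
hammingDist <- function(a, b) {
    stopifnot(nchar(a) == nchar(b))
    sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

## Hypothetical helper: ratio check as described for collapseMinRatio.
## A value of 0 disables the check (assumed reading of the default).
canCollapse <- function(highScore, lowScore, collapseMinRatio = 0) {
    collapseMinRatio == 0 || highScore / lowScore >= collapseMinRatio
}

maxHammingDist(0.05, 96)         # round(4.8) = 5 mismatches allowed
maxHammingDist(2, 96)            # 2 mismatches allowed
hammingDist("ACCGAT", "GCCGAA")  # 2 (positions 1 and 6 differ)
canCollapse(100, 10, collapseMinRatio = 5)  # TRUE, ratio is 10
canCollapse(100, 50, collapseMinRatio = 5)  # FALSE, ratio is 2
```

With the 96-nucleotide sequences used in the Examples section, a value of collapseMaxDist = 2 would therefore correspond to at most two mismatching bases under this reading.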
Value
A SummarizedExperiment where
counts have been aggregated by the mutated amino acid(s).
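As a rough illustration of what aggregating counts by a grouping variable can look like, the sketch below sums a toy count matrix within groups using base R's rowsum(). The matrix values and sample names are made up, and the assumption that aggregation means summing within groups is ours; the sketch does not claim to reproduce the package's internal code.

```r
## Toy count matrix: four codon variants, two samples (made-up numbers)
counts <- matrix(c(10, 5, 3, 7,
                   20, 1, 4, 2),
                 ncol = 2,
                 dimnames = list(c("f.1.GCC", "f.1.GCG", "f.1.TGC", "f.1.GAC"),
                                 c("sample1", "sample2")))

## Grouping variable, e.g. the amino acid-level name of each variant
group <- c("f.1.A", "f.1.A", "f.1.C", "f.1.D")

## Sum the counts within each group; the result has one row per group,
## named by the group label
collapsed <- rowsum(counts, group = group)
collapsed
```

Here the two codon variants encoding alanine (f.1.GCC and f.1.GCG) end up in a single row f.1.A, whose counts are the sums of the two original rows.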
Examples
library(SummarizedExperiment)
se <- readRDS(system.file("extdata", "GSE102901_cis_se.rds",
package = "mutscan"))[1:200, ]
## The rows of this object correspond to individual codon variants
dim(se)
#> [1] 200 6
head(rownames(se))
#> [1] "f.0.WT" "f.1.AAC" "f.1.AAG" "f.1.ACC" "f.1.ACG" "f.1.AGC"
## Collapse by amino acid
sec <- collapseMutantsByAA(se)
## The rows of the collapsed object correspond to amino acid variants
dim(sec)
#> [1] 128 6
head(rownames(sec))
#> [1] "f.0.WT" "f.1.*" "f.1.A" "f.1.C" "f.1.D" "f.1.E"
## The mutantName column contains the individual codon variants that were
## collapsed
head(rowData(sec))
#> DataFrame with 6 rows and 19 columns
#> mutantNameAA sequence mutantName
#> <character> <character> <character>
#> f.0.WT f.0.WT ACCGATACACTCCAAGCGGA.. f.0.WT,f.1.ACC,f.1.A..
#> f.1.* f.1.* TAGGATACACTCCAAGCGGA.. f.1.TAG
#> f.1.A f.1.A GCCGATACACTCCAAGCGGA.. f.1.GCC,f.1.GCG
#> f.1.C f.1.C TGCGATACACTCCAAGCGGA.. f.1.TGC
#> f.1.D f.1.D GACGATACACTCCAAGCGGA.. f.1.GAC
#> f.1.E f.1.E GAGGATACACTCCAAGCGGA.. f.1.GAG
#> mutantNameBase mutantNameBaseHGVS mutantNameCodon
#> <character> <character> <character>
#> f.0.WT f.0.WT,f.3.C,f.3.G,f.. f:c,f:c.30A>G,f:c.31.. f.0.WT,f.1.ACC,f.1.A..
#> f.1.* f.1.T_f.2.A_f.3.G f:c.1_3delinsTAG f.1.TAG
#> f.1.A f.1.G_f.3.C,f.1.G_f... f:c.1_3delinsGCC,f:c.. f.1.GCC,f.1.GCG
#> f.1.C f.1.T_f.2.G_f.3.C f:c.1_3delinsTGC f.1.TGC
#> f.1.D f.1.G_f.2.A_f.3.C f:c.1_3delinsGAC f.1.GAC
#> f.1.E f.1.G_f.2.A_f.3.G f:c.1_3delinsGAG f.1.GAG
#> mutantNameAAHGVS sequenceAA mutationTypes nbrMutBases
#> <character> <character> <character> <character>
#> f.0.WT f:p TDTLQAETDQLEDEKSALQT.. silent 0,1,2
#> f.1.* f:p.(Thr1*) *DTLQAETDQLEDEKSALQT.. stop 3
#> f.1.A f:p.(Thr1Ala) ADTLQAETDQLEDEKSALQT.. nonsynonymous 2
#> f.1.C f:p.(Thr1Cys) CDTLQAETDQLEDEKSALQT.. nonsynonymous 3
#> f.1.D f:p.(Thr1Asp) DDTLQAETDQLEDEKSALQT.. nonsynonymous 3
#> f.1.E f:p.(Thr1Glu) EDTLQAETDQLEDEKSALQT.. nonsynonymous 3
#> nbrMutCodons nbrMutAAs varLengths minNbrMutBases minNbrMutCodons
#> <character> <character> <character> <integer> <integer>
#> f.0.WT 0,1 0 96 0 0
#> f.1.* 1 1 96 3 1
#> f.1.A 1 1 96 2 1
#> f.1.C 1 1 96 3 1
#> f.1.D 1 1 96 3 1
#> f.1.E 1 1 96 3 1
#> minNbrMutAAs maxNbrMutBases maxNbrMutCodons maxNbrMutAAs
#> <integer> <integer> <integer> <integer>
#> f.0.WT 0 2 1 0
#> f.1.* 1 3 1 1
#> f.1.A 1 2 1 1
#> f.1.C 1 3 1 1
#> f.1.D 1 3 1 1
#> f.1.E 1 3 1 1
## Collapse similar sequences
sec2 <- collapseMutantsBySimilarity(
se = se, assayName = "counts", scoreMethod = "rowSum",
sequenceCol = "sequence", collapseMaxDist = 2,
collapseMinScore = 0, collapseMinRatio = 0)
#> start collapsing sequences (tolerance: 2)...done (reduced from 200 to 12)
dim(sec2)
#> [1] 12 6
head(rownames(sec2))
#> [1] "f.0.WT" "f.1.GGG" "f.1.TTC" "f.10.GGG" "f.10.TCC" "f.11.GCC"
head(rowData(sec2))
#> DataFrame with 6 rows and 20 columns
#> collapseCol sequence mutantName
#> <character> <character> <character>
#> f.0.WT f.0.WT AACGATACACTCCAAGCGGA.. f.0.WT,f.1.AAC,f.1.A..
#> f.1.GGG f.1.GGG CAGGATACACTCCAAGCGGA.. f.1.CAG,f.1.CGC,f.1...
#> f.1.TTC f.1.TTC CACGATACACTCCAAGCGGA.. f.1.CAC,f.1.CTC,f.1...
#> f.10.GGG f.10.GGG ACTGATACACTCCAAGCGGA.. f.10.ACG,f.10.AGC,f...
#> f.10.TCC f.10.TCC ACTGATACACTCCAAGCGGA.. f.10.ACC,f.10.ATC,f...
#> f.11.GCC f.11.GCC ACTGATACACTCCAAGCGGA.. f.11.ACC,f.11.AGC,f...
#> mutantNameBase mutantNameBaseHGVS mutantNameCodon
#> <character> <character> <character>
#> f.0.WT f.0.WT,f.1.C_f.3.C,f.. f:c,f:c.1_3delinsCCC.. f.0.WT,f.1.AAC,f.1.A..
#> f.1.GGG f.1.C_f.2.A_f.3.G,f... f:c.1_3delinsCAG,f:c.. f.1.CAG,f.1.CGC,f.1...
#> f.1.TTC f.1.C_f.2.A_f.3.C,f... f:c.1_3delinsCAC,f:c.. f.1.CAC,f.1.CTC,f.1...
#> f.10.GGG f.28.A_f.29.C_f.30.G.. f:c.28_30delinsACG,f.. f.10.ACG,f.10.AGC,f...
#> f.10.TCC f.28.A_f.29.C_f.30.C.. f:c.28_30delinsACC,f.. f.10.ACC,f.10.ATC,f...
#> f.11.GCC f.31.A_f.32.C_f.33.C.. f:c.31_33delinsACC,f.. f.11.ACC,f.11.AGC,f...
#> mutantNameAA mutantNameAAHGVS sequenceAA
#> <character> <character> <character>
#> f.0.WT f.0.WT,f.1.A,f.1.I,f.. f:p,f:p.(Asp13*),f:p.. ADTLQAETDQLEDEKSALQT..
#> f.1.GGG f.1.*,f.1.C,f.1.D,f... f:p.(Thr1*),f:p.(Thr.. *DTLQAETDQLEDEKSALQT..
#> f.1.TTC f.1.F,f.1.H,f.1.L,f... f:p.(Thr1His),f:p.(T.. FDTLQAETDQLEDEKSALQT..
#> f.10.GGG f.10.A,f.10.C,f.10.G.. f:p.(Gln10Ala),f:p.(.. TDTLQAETDALEDEKSALQT..
#> f.10.TCC f.10.F,f.10.I,f.10.S.. f:p.(Gln10Ile),f:p.(.. TDTLQAETDFLEDEKSALQT..
#> f.11.GCC f.11.A,f.11.G,f.11.S.. f:p.(Leu11Ala),f:p.(.. TDTLQAETDQAEDEKSALQT..
#> mutationTypes nbrMutBases nbrMutCodons nbrMutAAs
#> <character> <character> <character> <character>
#> f.0.WT nonsynonymous,silent.. 0,1,2 0,1 0,1
#> f.1.GGG nonsynonymous,stop 3 1 1
#> f.1.TTC nonsynonymous 3 1 1
#> f.10.GGG nonsynonymous 3 1 1
#> f.10.TCC nonsynonymous 3 1 1
#> f.11.GCC nonsynonymous 3 1 1
#> varLengths minNbrMutBases minNbrMutCodons minNbrMutAAs maxNbrMutBases
#> <character> <integer> <integer> <integer> <integer>
#> f.0.WT 96 0 0 0 2
#> f.1.GGG 96 3 1 1 3
#> f.1.TTC 96 3 1 1 3
#> f.10.GGG 96 3 1 1 3
#> f.10.TCC 96 3 1 1 3
#> f.11.GCC 96 3 1 1 3
#> maxNbrMutCodons maxNbrMutAAs
#> <integer> <integer>
#> f.0.WT 1 1
#> f.1.GGG 1 1
#> f.1.TTC 1 1
#> f.10.GGG 1 1
#> f.10.TCC 1 1
#> f.11.GCC 1 1
## collapsed count matrix
assay(sec2, "counts")
#> SRR5952435 SRR5952436 SRR5952437 SRR5952438 SRR5952439 SRR5952440
#> f.0.WT 3068572 3757468 3572269 6831160 6678754 6624525
#> f.1.GGG 621 726 503 190 98 208
#> f.1.TTC 97 115 109 27 18 23
#> f.10.GGG 470 511 371 173 84 181
#> f.10.TCC 36 82 32 11 7 20
#> f.11.GCC 61 74 42 13 10 14
#> f.11.TAG 259 294 196 55 41 64
#> f.12.AGG 81 81 51 19 15 46
#> f.12.CCC 286 294 167 76 45 89
#> f.13.CGC 471 581 447 156 87 195
#> f.13.TTG 114 146 86 37 12 42
#> f.14.CGC 122 137 123 36 26 42