Collapse mutants by similarity
These functions can be used to collapse variants, either by similarity or according to a pre-defined grouping. The functions collapseMutants and collapseMutantsByAA assume that a grouping variable is available as a column in rowData(se) (collapseMutantsByAA is a convenience function for the case when this column is "mutantNameAA", and is provided for backwards compatibility). The collapseMutantsBySimilarity function generates the grouping variable based on user-provided thresholds on the sequence similarity (defined by the Hamming distance), and subsequently collapses the variants based on the derived grouping.
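The Hamming distance mentioned above is the number of positions at which two equal-length sequences differ. The following is a minimal base R sketch of that definition; the helper name hammingDist is hypothetical and is not claimed to be how the package computes distances internally.

```r
## Hypothetical helper (not part of the package API): Hamming distance
## between two sequences of equal length, i.e. the number of positions
## at which they differ.
hammingDist <- function(a, b) {
    stopifnot(nchar(a) == nchar(b))
    sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

hammingDist("ACCGAT", "ACCGAT")  # identical sequences: 0
hammingDist("ACCGAT", "AAGGAT")  # positions 2 and 3 differ: 2
```

With a tolerance of 2, the second pair would be similar enough to be considered for collapsing, subject to the score-based criteria described under Arguments.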
Usage
collapseMutantsBySimilarity(
  se,
  assayName,
  scoreMethod = "rowSum",
  sequenceCol = "sequence",
  collapseMaxDist = 0,
  collapseMinScore = 0,
  collapseMinRatio = 0,
  verbose = TRUE
)

collapseMutantsByAA(se)

collapseMutants(se, nameCol)
Arguments
- se
A SummarizedExperiment generated by summarizeExperiment.
- assayName
The name of the assay that will be used to calculate a "score" (typically derived from the read counts) for each variant.
- scoreMethod
Character scalar giving the approach used to calculate ranking scores from the assay defined by assayName. Currently, this can be one of "rowSum" or "rowMean". All filtering criteria will be applied to these scores.
- sequenceCol
Character scalar giving the name of the column in rowData(se) that contains the nucleotide sequence of the variants.
- collapseMaxDist
Numeric scalar defining the tolerance for collapsing similar sequences. If the value is in [0, 1), it defines the maximal Hamming distance as a fraction of the sequence length, i.e., round(collapseMaxDist * nchar(sequence)). A value greater than or equal to 1 is rounded and used directly as the maximum allowed Hamming distance. Note that sequences can only be collapsed if they are all of the same length.
- collapseMinScore
Numeric scalar indicating the minimum score for a sequence to be considered for collapsing with similar sequences.
- collapseMinRatio
Numeric scalar. During collapsing of similar sequences, a low-frequency sequence will be collapsed with a higher-frequency sequence only if the ratio between the high-frequency and the low-frequency scores is at least this high. The default value of 0 indicates that no such check is performed.
- verbose
Logical scalar, whether to print progress messages.
- nameCol
A character scalar providing the name of the column in rowData(se) that contains the amino acid mutant names (these will be the new row names).
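The rules for collapseMaxDist, scoreMethod and collapseMinRatio can be illustrated with a small base R sketch. The helper maxHammingDist and the toy count matrix are hypothetical and only restate the argument descriptions above; they are not taken from the package source.

```r
## Hypothetical helper: translate collapseMaxDist into an absolute
## Hamming distance threshold, following the documented rule.
maxHammingDist <- function(collapseMaxDist, sequence) {
    if (collapseMaxDist < 1) {
        ## value in [0, 1): fraction of the sequence length
        round(collapseMaxDist * nchar(sequence))
    } else {
        ## value >= 1: rounded and used directly
        round(collapseMaxDist)
    }
}

sequence <- strrep("ACGT", 24)   # toy sequence of length 96
maxHammingDist(0.05, sequence)   # round(0.05 * 96) = round(4.8) = 5
maxHammingDist(2, sequence)      # 2

## Toy counts (made-up numbers). rowSums() and rowMeans() correspond to
## scoreMethod = "rowSum" and scoreMethod = "rowMean", respectively.
counts <- matrix(c(100, 120,
                     3,   5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("high", "low"), c("s1", "s2")))
scores <- rowSums(counts)

## The low-scoring sequence is only collapsed into the high-scoring one
## if the ratio of their scores is at least collapseMinRatio.
collapseMinRatio <- 10
ratioOK <- unname(scores["high"] / scores["low"]) >= collapseMinRatio
ratioOK                          # 220 / 8 = 27.5 >= 10, so TRUE
```

With the default collapseMinRatio = 0, the ratio condition is always satisfied, which matches the statement that no such check is performed.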
Value
A SummarizedExperiment where counts have been aggregated by the mutated amino acid(s).
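To make the aggregation concrete, the sketch below sums the rows of a toy count matrix that share a group label, using base R's rowsum(). The counts and the grouping vector are made up for illustration, and this is only an assumption about the general idea, not the package's actual implementation.

```r
## Toy codon-level counts (made-up numbers, not from the package data)
counts <- matrix(c(10, 12,
                    4,  6,
                    7,  1),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("f.1.GCC", "f.1.GCG", "f.1.TGC"),
                                 c("sample1", "sample2")))

## Hypothetical grouping variable, analogous to a column of rowData(se)
mutantNameAA <- c("f.1.A", "f.1.A", "f.1.C")

## Sum all rows sharing a group label; the labels become the row names
collapsed <- rowsum(counts, group = mutantNameAA)
collapsed
```

Here the two codon variants encoding alanine are merged into a single row "f.1.A" with counts 14 and 18, while "f.1.C" keeps its original counts.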
Examples
se <- readRDS(system.file("extdata", "GSE102901_cis_se.rds",
                          package = "mutscan"))[1:200, ]
## The rows of this object correspond to individual codon variants
dim(se)
#> [1] 200 6
head(rownames(se))
#> [1] "f.0.WT" "f.1.AAC" "f.1.AAG" "f.1.ACC" "f.1.ACG" "f.1.AGC"
## Collapse by amino acid
sec <- collapseMutantsByAA(se)
## The rows of the collapsed object correspond to amino acid variants
dim(sec)
#> [1] 128 6
head(rownames(sec))
#> [1] "f.0.WT" "f.1.*" "f.1.A" "f.1.C" "f.1.D" "f.1.E"
## The mutantName column contains the individual codon variants that were
## collapsed
head(SummarizedExperiment::rowData(sec))
#> DataFrame with 6 rows and 19 columns
#> mutantNameAA sequence mutantName
#> <character> <character> <character>
#> f.0.WT f.0.WT ACCGATACACTCCAAGCGGA.. f.0.WT,f.1.ACC,f.1.A..
#> f.1.* f.1.* TAGGATACACTCCAAGCGGA.. f.1.TAG
#> f.1.A f.1.A GCCGATACACTCCAAGCGGA.. f.1.GCC,f.1.GCG
#> f.1.C f.1.C TGCGATACACTCCAAGCGGA.. f.1.TGC
#> f.1.D f.1.D GACGATACACTCCAAGCGGA.. f.1.GAC
#> f.1.E f.1.E GAGGATACACTCCAAGCGGA.. f.1.GAG
#> mutantNameBase mutantNameBaseHGVS mutantNameCodon
#> <character> <character> <character>
#> f.0.WT f.0.WT,f.3.C,f.3.G,f.. f:c,f:c.30A>G,f:c.31.. f.0.WT,f.1.ACC,f.1.A..
#> f.1.* f.1.T_f.2.A_f.3.G f:c.1_3delinsTAG f.1.TAG
#> f.1.A f.1.G_f.3.C,f.1.G_f... f:c.1_3delinsGCC,f:c.. f.1.GCC,f.1.GCG
#> f.1.C f.1.T_f.2.G_f.3.C f:c.1_3delinsTGC f.1.TGC
#> f.1.D f.1.G_f.2.A_f.3.C f:c.1_3delinsGAC f.1.GAC
#> f.1.E f.1.G_f.2.A_f.3.G f:c.1_3delinsGAG f.1.GAG
#> mutantNameAAHGVS sequenceAA mutationTypes nbrMutBases
#> <character> <character> <character> <character>
#> f.0.WT f:p TDTLQAETDQLEDEKSALQT.. silent 0,1,2
#> f.1.* f:p.(Thr1*) *DTLQAETDQLEDEKSALQT.. stop 3
#> f.1.A f:p.(Thr1Ala) ADTLQAETDQLEDEKSALQT.. nonsynonymous 2
#> f.1.C f:p.(Thr1Cys) CDTLQAETDQLEDEKSALQT.. nonsynonymous 3
#> f.1.D f:p.(Thr1Asp) DDTLQAETDQLEDEKSALQT.. nonsynonymous 3
#> f.1.E f:p.(Thr1Glu) EDTLQAETDQLEDEKSALQT.. nonsynonymous 3
#> nbrMutCodons nbrMutAAs varLengths minNbrMutBases minNbrMutCodons
#> <character> <character> <character> <integer> <integer>
#> f.0.WT 0,1 0 96 0 0
#> f.1.* 1 1 96 3 1
#> f.1.A 1 1 96 2 1
#> f.1.C 1 1 96 3 1
#> f.1.D 1 1 96 3 1
#> f.1.E 1 1 96 3 1
#> minNbrMutAAs maxNbrMutBases maxNbrMutCodons maxNbrMutAAs
#> <integer> <integer> <integer> <integer>
#> f.0.WT 0 2 1 0
#> f.1.* 1 3 1 1
#> f.1.A 1 2 1 1
#> f.1.C 1 3 1 1
#> f.1.D 1 3 1 1
#> f.1.E 1 3 1 1
## Collapse similar sequences
sec2 <- collapseMutantsBySimilarity(
    se = se, assayName = "counts", scoreMethod = "rowSum",
    sequenceCol = "sequence", collapseMaxDist = 2,
    collapseMinScore = 0, collapseMinRatio = 0)
#> start collapsing sequences (tolerance: 2)...done (reduced from 200 to 12)
dim(sec2)
#> [1] 12 6
head(rownames(sec2))
#> [1] "f.0.WT" "f.1.GGG" "f.1.TTC" "f.10.GGG" "f.10.TCC" "f.11.GCC"
head(SummarizedExperiment::rowData(sec2))
#> DataFrame with 6 rows and 20 columns
#> collapseCol sequence mutantName
#> <character> <character> <character>
#> f.0.WT f.0.WT AACGATACACTCCAAGCGGA.. f.0.WT,f.1.AAC,f.1.A..
#> f.1.GGG f.1.GGG CAGGATACACTCCAAGCGGA.. f.1.CAG,f.1.CGC,f.1...
#> f.1.TTC f.1.TTC CACGATACACTCCAAGCGGA.. f.1.CAC,f.1.CTC,f.1...
#> f.10.GGG f.10.GGG ACTGATACACTCCAAGCGGA.. f.10.ACG,f.10.AGC,f...
#> f.10.TCC f.10.TCC ACTGATACACTCCAAGCGGA.. f.10.ACC,f.10.ATC,f...
#> f.11.GCC f.11.GCC ACTGATACACTCCAAGCGGA.. f.11.ACC,f.11.AGC,f...
#> mutantNameBase mutantNameBaseHGVS mutantNameCodon
#> <character> <character> <character>
#> f.0.WT f.0.WT,f.1.C_f.3.C,f.. f:c,f:c.1_3delinsCCC.. f.0.WT,f.1.AAC,f.1.A..
#> f.1.GGG f.1.C_f.2.A_f.3.G,f... f:c.1_3delinsCAG,f:c.. f.1.CAG,f.1.CGC,f.1...
#> f.1.TTC f.1.C_f.2.A_f.3.C,f... f:c.1_3delinsCAC,f:c.. f.1.CAC,f.1.CTC,f.1...
#> f.10.GGG f.28.A_f.29.C_f.30.G.. f:c.28_30delinsACG,f.. f.10.ACG,f.10.AGC,f...
#> f.10.TCC f.28.A_f.29.C_f.30.C.. f:c.28_30delinsACC,f.. f.10.ACC,f.10.ATC,f...
#> f.11.GCC f.31.A_f.32.C_f.33.C.. f:c.31_33delinsACC,f.. f.11.ACC,f.11.AGC,f...
#> mutantNameAA mutantNameAAHGVS sequenceAA
#> <character> <character> <character>
#> f.0.WT f.0.WT,f.1.A,f.1.I,f.. f:p,f:p.(Asp13*),f:p.. ADTLQAETDQLEDEKSALQT..
#> f.1.GGG f.1.*,f.1.C,f.1.D,f... f:p.(Thr1*),f:p.(Thr.. *DTLQAETDQLEDEKSALQT..
#> f.1.TTC f.1.F,f.1.H,f.1.L,f... f:p.(Thr1His),f:p.(T.. FDTLQAETDQLEDEKSALQT..
#> f.10.GGG f.10.A,f.10.C,f.10.G.. f:p.(Gln10Ala),f:p.(.. TDTLQAETDALEDEKSALQT..
#> f.10.TCC f.10.F,f.10.I,f.10.S.. f:p.(Gln10Ile),f:p.(.. TDTLQAETDFLEDEKSALQT..
#> f.11.GCC f.11.A,f.11.G,f.11.S.. f:p.(Leu11Ala),f:p.(.. TDTLQAETDQAEDEKSALQT..
#> mutationTypes nbrMutBases nbrMutCodons nbrMutAAs
#> <character> <character> <character> <character>
#> f.0.WT nonsynonymous,silent.. 0,1,2 0,1 0,1
#> f.1.GGG nonsynonymous,stop 3 1 1
#> f.1.TTC nonsynonymous 3 1 1
#> f.10.GGG nonsynonymous 3 1 1
#> f.10.TCC nonsynonymous 3 1 1
#> f.11.GCC nonsynonymous 3 1 1
#> varLengths minNbrMutBases minNbrMutCodons minNbrMutAAs maxNbrMutBases
#> <character> <integer> <integer> <integer> <integer>
#> f.0.WT 96 0 0 0 2
#> f.1.GGG 96 3 1 1 3
#> f.1.TTC 96 3 1 1 3
#> f.10.GGG 96 3 1 1 3
#> f.10.TCC 96 3 1 1 3
#> f.11.GCC 96 3 1 1 3
#> maxNbrMutCodons maxNbrMutAAs
#> <integer> <integer>
#> f.0.WT 1 1
#> f.1.GGG 1 1
#> f.1.TTC 1 1
#> f.10.GGG 1 1
#> f.10.TCC 1 1
#> f.11.GCC 1 1
## collapsed count matrix
SummarizedExperiment::assay(sec2, "counts")
#> SRR5952435 SRR5952436 SRR5952437 SRR5952438 SRR5952439 SRR5952440
#> f.0.WT 3068572 3757468 3572269 6831160 6678754 6624525
#> f.1.GGG 621 726 503 190 98 208
#> f.1.TTC 97 115 109 27 18 23
#> f.10.GGG 470 511 371 173 84 181
#> f.10.TCC 36 82 32 11 7 20
#> f.11.GCC 61 74 42 13 10 14
#> f.11.TAG 259 294 196 55 41 64
#> f.12.AGG 81 81 51 19 15 46
#> f.12.CCC 286 294 167 76 45 89
#> f.13.CGC 471 581 447 156 87 195
#> f.13.TTG 114 146 86 37 12 42
#> f.14.CGC 122 137 123 36 26 42