runDIANNAnalysis.Rd
Launch an analysis workflow on quantifications obtained with DIA-NN.
Note that DIA-NN support in einprot is currently experimental - please be
aware that the interface may change, and interpret results with caution.
runDIANNAnalysis(
  templateRmd = system.file("extdata/process_basic_template.Rmd", package = "einprot"),
  outputDir = ".",
  outputBaseName = "DIANNAnalysis",
  reportTitle = "DIA-NN LFQ data processing",
  reportAuthor = "",
  forceOverwrite = FALSE,
  experimentInfo = list(),
  species,
  diannFile,
  diannFileType,
  outLevel,
  diannLogFile,
  aName,
  idCol = function(df) combineIds(df, combineCols = c("Gene.names",
    "Majority.protein.IDs")),
  labelCol = function(df) combineIds(df, combineCols = c("Gene.names",
    "Majority.protein.IDs")),
  geneIdCol = function(df) getFirstId(df, colName = "Gene.names"),
  proteinIdCol = "Majority.protein.IDs",
  stringIdCol = function(df) combineIds(df, combineCols = c("Gene.names",
    "Majority.protein.IDs"), combineWhen = "missing", makeUnique = FALSE),
  extraFeatureCols = NULL,
  sampleAnnot,
  includeOnlySamples = "",
  excludeSamples = "",
  minScore = 10,
  minPeptides = 2,
  imputeMethod = "MinProb",
  assaysForExport = NULL,
  addHeatmaps = TRUE,
  mergeGroups = list(),
  comparisons = list(),
  ctrlGroup = "",
  allPairwiseComparisons = TRUE,
  singleFit = TRUE,
  subtractBaseline = FALSE,
  baselineGroup = "",
  normMethod = "none",
  spikeFeatures = NULL,
  stattest = "limma",
  minNbrValidValues = 2,
  minlFC = 0,
  samSignificance = TRUE,
  nperm = 250,
  volcanoAdjPvalThr = 0.05,
  volcanoLog2FCThr = 1,
  volcanoMaxFeatures = 25,
  volcanoLabelSign = "both",
  volcanoS0 = 0.1,
  volcanoFeaturesToLabel = "",
  addInteractiveVolcanos = FALSE,
  interactiveDisplayColumns = NULL,
  interactiveGroupColumn = NULL,
  complexFDRThr = 0.1,
  maxNbrComplexesToPlot = Inf,
  seed = 42,
  includeFeatureCollections = c(),
  minSizeToKeepSet = 2,
  customComplexes = list(),
  complexSpecies = "all",
  complexDbPath = NULL,
  stringVersion = "11.5",
  stringDir = NULL,
  linkTableColumns = c(),
  customYml = NULL,
  doRender = TRUE
)
Path to the template R Markdown file. Typically does not need to be modified.
Path to a directory where all output files will be written. Will be created if it doesn't exist.
Character string providing the 'base name' of the output files. All output files will start with this prefix.
Character scalars, giving the title and author for the result report.
Logical, whether to force overwrite an existing Rmd file with the same outputBaseName in the outputDir.
Named list with information about the experiment. Each entry of the list must be a scalar value.
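As an illustration of the expected shape, experimentInfo could be built as in the sketch below. The entry names and values are invented for this example; only the requirement that the list is named and that each entry is a scalar comes from the description above.

```r
## Hypothetical experiment description; names and values are illustrative only
experimentInfo <- list(
    "Experiment type" = "DIA label-free quantification",
    "Sample" = "Cell lysate",
    "Operator" = "Jane Doe"
)

## The list must be named, and each entry must be a scalar (length-1) value
stopifnot(
    !is.null(names(experimentInfo)),
    all(nzchar(names(experimentInfo))),
    all(lengths(experimentInfo) == 1)
)
```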
Character scalar providing the species. Must be one of the supported species (see getSupportedSpecies()). Either the common or the scientific name can be used.
Character string pointing to the DIA-NN pg_matrix.tsv, pr_matrix.tsv or main report.tsv file. File paths will be expressed in canonical form (using normalizePath()) before they are processed.
Character string indicating what type of input file diannFile represents. Either "pg_matrix", "pr_matrix" or "main_report".
Character string indicating the desired output level. Either "pg" or "pr".
Character string pointing to the DIA-NN log file. File paths will be expressed in canonical form (using normalizePath()) before they are processed.
Character scalar indicating the desired name of the main assay (if diannFileType is "pg_matrix" or "pr_matrix"), or the column to use for the main assay (if diannFileType is "main_report").
Arguments defining the feature identifiers (row names, should be unique), feature labels (for plots, can be anything), gene IDs (single gene symbols, will be matched against complexes and GO terms, can be NULL), protein IDs (UniProt IDs, will be used to create automatic URLs and match to species-specific identifiers, each entry can consist of multiple UniProt IDs separated by semicolons), and stringIdCol (single gene or protein ID, will be used to retrieve STRING networks, can be NULL). Each of these arguments can be either a character vector of column names in the input file (after application of make.names), in which case the corresponding feature ID is generated by simply concatenating the values in these columns, or a function with one input argument (a data.frame, corresponding to the annotation columns of the input file), returning a character vector corresponding to the desired feature IDs.
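The function form of these arguments can be sketched in base R as follows. This is a minimal illustration, not the package's own combineIds() or getFirstId(): the helper firstEntry(), the function myIdCol() and the toy annotation table are hypothetical, and the column names Genes and Protein.Ids are borrowed from the pg_matrix example further down.

```r
## Toy annotation table standing in for the annotation columns of the input file
annot <- data.frame(
    Genes = c("ACTB", "", "TUBB;TUBB4B"),
    Protein.Ids = c("P60709", "Q9Y6K9", "P07437;P68371"),
    stringsAsFactors = FALSE
)

## Hypothetical helper: keep the first entry of a semicolon-separated string
firstEntry <- function(x) {
    vapply(strsplit(x, ";", fixed = TRUE),
           function(v) if (length(v) > 0) v[1] else "",
           character(1))
}

## Hypothetical ID function: takes a data.frame, returns one ID per row.
## Uses "gene.protein" when a gene symbol is available, else the protein ID,
## and makes the result unique so that it can serve as row names.
myIdCol <- function(df) {
    gene <- firstEntry(df$Genes)
    prot <- firstEntry(df$Protein.Ids)
    ids <- ifelse(gene == "", prot, paste(gene, prot, sep = "."))
    make.unique(ids)
}

myIdCol(annot)
## "ACTB.P60709" "Q9Y6K9" "TUBB.P07437"
```

A function like this could then be supplied as idCol = myIdCol, provided the referenced columns exist in the input file after make.names has been applied.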
Named list (or NULL) defining additional, user-specified feature annotation columns to add to the object (in addition to the ones defined by idCol, labelCol, geneIdCol, proteinIdCol and stringIdCol). Similar to these column definitions, each entry of the list must be either a character vector of column names or a function taking a data.frame as input and returning a single character column. These columns will be created after the standard columns (einprotId, einprotGene, einprotProtein, einprotLabel, IDsForSTRING), and thus these columns can be used as well to create the user-specified ones.
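A possible extraFeatureCols specification is sketched below, mixing the two allowed entry types. The entry names firstGene and geneAndProtein are made up for this example, and the columns Genes and Protein.Ids are assumed to be present in the input (as in the pg_matrix example further down).

```r
## Hypothetical extra annotation columns: one function, one vector of column names
extraFeatureCols <- list(
    ## Function entry: data.frame in, single character column out
    firstGene = function(df) sub(";.*$", "", df$Genes),
    ## Character entry: columns whose values are concatenated
    geneAndProtein = c("Genes", "Protein.Ids")
)

## Each entry must be either a function or a character vector
stopifnot(all(vapply(extraFeatureCols,
                     function(x) is.function(x) || is.character(x),
                     logical(1))))

## Trying the function entry on a toy annotation table
annot <- data.frame(
    Genes = c("TUBB;TUBB4B", "ACTB"),
    Protein.Ids = c("P07437;P68371", "P60709"),
    stringsAsFactors = FALSE
)
extraFeatureCols$firstGene(annot)
## "TUBB" "ACTB"
```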
A data.frame with at least columns named sample and group, used to explicitly specify the group assignment for each sample. It can also contain a column named batch, in which case this will be used as a covariate in the limma or proDA tests. The values in the sample column should correspond to the names of the columns of interest in the input file, after removing the iColPattern.
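A minimal sample annotation table could look like the sketch below. The sample, group and batch values are hypothetical; in a real analysis the sample column has to match the sample columns of the DIA-NN file.

```r
## Hypothetical annotation for four samples in two groups and two batches
sampleAnnot <- data.frame(
    sample = c("ctrl_1", "ctrl_2", "trt_1", "trt_2"),
    group = c("ctrl", "ctrl", "trt", "trt"),
    batch = c("b1", "b2", "b1", "b2"),   # optional covariate
    stringsAsFactors = FALSE
)

## Required columns are present and each sample is listed once
stopifnot(
    all(c("sample", "group") %in% colnames(sampleAnnot)),
    !anyDuplicated(sampleAnnot$sample)
)
```

The batch column is optional; the description above only requires sample and group.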
Character vectors defining specific samples to include or exclude from all analyses.
Numeric, minimum score for a protein to be retained in the analysis. Set to NULL if no score filtering is desired.
Numeric, minimum number of peptides for a protein to be retained in the analysis. Set to NULL if no filtering on the number of peptides is desired.
Character string defining the imputation method to use. Currently, "impSeqRob", "MinProb", and "MinProbGlobal" are supported. See doImputation for more details about the methods.
Character vector defining the name(s) of the assays to use for exported abundances and barplots. This could, for example, be set to an assay containing 'absolute' abundances, if available, even if another assay is used for the actual analysis and comparison of groups. If set to NULL or an assay name that does not exist in the SingleCellExperiment object, the 'main' assay will be used.
Logical scalar indicating whether to include heatmaps or not. This controls both the heatmap showing the missing value pattern in the data, as well as the summary heatmaps of the quantitative information in the data. For large data sets, excluding the heatmaps can significantly speed up the processing time.
Named list of character vectors defining sample groups to merge into new groups that will be used for comparisons. Any specification of comparisons or ctrlGroup should be done in terms of the new (merged) group names.
List of character vectors defining comparisons to perform. The first element of each vector represents the denominator of the comparison. If not empty, ctrlGroup and allPairwiseComparisons are ignored.
Character vector defining the sample group(s) to use as control group in comparisons.
Logical, should all pairwise comparisons be performed?
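The relationship between mergeGroups, comparisons and ctrlGroup can be illustrated with a small sketch. The group names ctrl, drugA, drugB and treated are hypothetical, and the check at the end is only a sanity test for this example, not something the function is known to perform in this form.

```r
## Hypothetical groups present in sampleAnnot$group
groups <- c("ctrl", "drugA", "drugB")

## Merge the two drug groups into a new group called "treated"
mergeGroups <- list(treated = c("drugA", "drugB"))

## One explicit comparison; the first element ("ctrl") is the denominator,
## so this compares treated relative to ctrl
comparisons <- list(c("ctrl", "treated"))

## Alternative to explicit comparisons: name a control group instead.
## This is ignored whenever comparisons is not empty.
ctrlGroup <- "ctrl"

## Sanity check for this sketch: every group used in a comparison is either
## an original group or a newly defined merged group
availableGroups <- c(groups, names(mergeGroups))
stopifnot(all(unlist(comparisons) %in% availableGroups))
```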
Logical scalar indicating whether a single model fit should be used (and results for pairwise comparisons extracted via contrasts). If FALSE, the data set will be subset to the relevant samples for each comparison. Only applicable if stattest is "limma" or "proDA".
Logical scalar, whether to subtract the background/reference value for each feature in each batch before fitting the model. If TRUE, requires that a 'batch' column is available.
Character scalar representing the reference group. Only used if subtractBaseline is TRUE, in which case the abundance values for a given sample will be adjusted by subtracting the average value across all samples in the baselineGroup from the same batch as the original sample.
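The baseline adjustment described above can be worked through for a single feature in base R. This is a sketch of the arithmetic only, under the assumption that the abundances are on a log scale; the values, the group name ref and the batch labels are invented, and the package's internal implementation may differ in detail.

```r
## Hypothetical abundances of one feature across six samples
abund <- c(s1 = 10, s2 = 12, s3 = 15, s4 = 11, s5 = 13, s6 = 18)
group <- c("ref", "ref", "trt", "ref", "ref", "trt")
batch <- c("b1", "b1", "b1", "b2", "b2", "b2")

## Average of the baseline group ("ref") within each batch
isRef <- group == "ref"
baseline <- tapply(abund[isRef], batch[isRef], mean)
## b1: (10 + 12) / 2 = 11, b2: (11 + 13) / 2 = 12

## Subtract, for each sample, the baseline of the batch it belongs to
adjusted <- abund - as.numeric(baseline[batch])
adjusted
## s1 = -1, s2 = 1, s3 = 4, s4 = -1, s5 = 1, s6 = 6
```

Because the baseline is computed per batch, the two trt samples are adjusted by different amounts (11 and 12), which is why a batch column is required when subtractBaseline is TRUE.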
Character scalar indicating the normalization method to use. Currently, any method from MsCoreUtils::normalizeMethods() or "none" is a valid value.
Character vector indicating the 'spike-in' features to use for estimation of normalization factors. If NULL (default), all features are used.
Either "ttest", "limma" or "proDA", the testing framework to use. Could also be "none" if no test should be performed.
Numeric, the minimum number of valid values for a protein to be used for statistical testing.
Numeric, minimum log fold change to test against (only used if stattest = "limma").
Logical scalar, indicating whether the SAM statistic should be used to determine significance (similar to the approach used by Perseus). Only used if stattest = "ttest". If FALSE, the p-values are adjusted using the Benjamini-Hochberg approach and used to determine significance.
Numeric, number of permutations to use in the statistical testing (only used if stattest = "ttest").
Numeric, adjusted p-value threshold to determine which proteins to highlight in the volcano plots.
Numeric, log-fold change threshold to determine which proteins to highlight in the volcano plots.
Numeric, maximum number of significant features to label in the volcano plots.
Character scalar, either 'both', 'pos', or 'neg', indicating whether to label the most significant features regardless of sign, or only those with positive/negative log-fold changes.
Numeric, S0 value to use to generate the significance curve in the volcano plots (only used if stattest = "ttest").
Character vector with features to always label in the volcano plots (regardless of significance).
Logical scalar indicating whether to add interactive volcano plots to the html report. For experiments with many quantified features or many comparisons, setting this to TRUE can make the html report very large and difficult to interact with.
Character vector (or NULL) indicating which columns to include in the tooltip for the interactive volcano plots. The default shows the feature ID.
Character scalar (or NULL, default) indicating the column to group points by in the interactive volcano plot. Hovering over a point will highlight all other points with the same value of this column.
Numeric, FDR threshold for significance in testing of complexes.
Numeric, the maximum number of significant complexes for which to make separate volcano plots. Defaults to Inf, i.e., no limit.
Numeric, random seed to use for any non-deterministic calculations.
Character vector, a subset of c("complexes", "GO", "pathways").
Numeric scalar indicating the smallest number of features that have to overlap with the current data set in order to retain a feature set for testing.
List of character vectors providing custom complexes to test for significant differences between groups.
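A custom complex list might be specified as in the sketch below. The complex name and its members are chosen for illustration, and using a named list with gene symbols as members is an assumption based on the note above that gene IDs are matched against complexes.

```r
## Hypothetical custom complex; the name and membership are illustrative only
customComplexes <- list(
    myProteasomeSubset = c("PSMA1", "PSMA2", "PSMA3")
)

## Each entry is a character vector of feature (gene) identifiers
stopifnot(
    is.list(customComplexes),
    all(vapply(customComplexes, is.character, logical(1)))
)
```

With the default minSizeToKeepSet = 2, a set like this would be retained for testing only if at least two of its members are found in the data set.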
Either "all" or "current", depending on whether complexes defined for all species, or only those defined for the current species, should be tested for significance.
Character string providing path to the complex DB file (generated with makeComplexDB()).
Character scalar giving the version of the STRING database to query.
Character scalar (or NULL) providing the path to a folder where the STRING files will be downloaded (or loaded from, if they already exist). If NULL (default), they will be downloaded to a temporary directory.
Character vector with regular expressions that will be matched against the column names of the rowData of the generated SingleCellExperiment object. Matching columns are included in the link table at the end of the report.
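How a set of regular expressions selects columns can be previewed with base R before running the workflow. The column names below are hypothetical stand-ins for the rowData column names, and grep() is used here only to illustrate the matching; the package may implement the selection differently.

```r
## Hypothetical rowData column names
rowDataCols <- c("einprotId", "Genes", "Protein.Ids",
                 "First.Protein.Description")

## Regular expressions as they could be passed to linkTableColumns
linkTableColumns <- c("^Protein", "Description$")

## Columns matched by at least one of the patterns
matched <- unique(unlist(
    lapply(linkTableColumns, grep, x = rowDataCols, value = TRUE)
))
matched
## "Protein.Ids" "First.Protein.Description"
```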
Character string providing the path to a custom YAML file that can be used to overwrite default settings in the report. If set to NULL (default), no alterations are made.
Logical scalar. If FALSE, the Rmd file will be generated (and any parameters injected), but not rendered.
Invisibly, the path to the compiled html report.
if (interactive()) {
    ## Sample annotation shipped with the package
    sampleAnnot <- read.delim(
        system.file("extdata/diann_example/PXD028735_sampleAnnot.txt",
                    package = "einprot"))

    ## Basic analysis, pg_matrix.tsv
    out <- runDIANNAnalysis(
        outputDir = tempdir(),
        outputBaseName = "DIANN_LFQ_basic",
        species = "human",
        diannFile = system.file("extdata/diann_example/PXD028735.pg_matrix.tsv",
                                package = "einprot"),
        diannFileType = "pg_matrix",
        outLevel = "pg",
        diannLogFile = system.file("extdata/diann_example/diann-output.log.txt",
                                   package = "einprot"),
        sampleAnnot = sampleAnnot,
        includeFeatureCollections = "complexes",
        stringIdCol = NULL,
        aName = "MaxLFQ",
        idCol = function(df) combineIds(df, combineCols = c("Genes", "Protein.Ids")),
        labelCol = function(df) getFirstId(df, colName = "Protein.Names"),
        geneIdCol = function(df) getFirstId(df, colName = "Genes"),
        proteinIdCol = "Protein.Ids"
    )
    ## Output file
    out

    ## Basic analysis, main report
    ## A distinct outputBaseName is used so that this run does not collide with
    ## the Rmd file generated above (forceOverwrite defaults to FALSE)
    outM <- runDIANNAnalysis(
        outputDir = tempdir(),
        outputBaseName = "DIANN_LFQ_basic_report",
        species = "human",
        diannFile = system.file("extdata/diann_example/PXD028735.report.tsv",
                                package = "einprot"),
        diannFileType = "main_report",
        outLevel = "pg",
        diannLogFile = system.file("extdata/diann_example/diann-output.log.txt",
                                   package = "einprot"),
        sampleAnnot = sampleAnnot,
        includeFeatureCollections = "complexes",
        stringIdCol = NULL,
        aName = "PG.MaxLFQ",
        idCol = "Protein.Group",
        labelCol = "Protein.Group",
        geneIdCol = NULL,
        proteinIdCol = "Protein.Group"
    )
    ## Output file
    outM
}