Assign labels to cells using known marker genes

Given marker gene sets for cell types, identify cells with high expression of the marker genes (positive examples), then use these cells to create a reference transcriptome profile for each cell type and identify additional cells of each type using SingleR. These marker genes should specifically expressed a single cell type, e.g. CD3 which is expressed by all T cell subtypes would not be suitable for specific T cell subtypes.

labelCells(
  sce,
  markergenes,
  fraction_topscoring = 0.01,
  expr_values = "logcounts",
  normGenesetExpressionParams = list(R = 200),
  aggregateReferenceParams = list(power = 0.5),
  SingleRParams = list(),
  BPPARAM = SerialParam()
)

Arguments

sce: SingleCellExperiment object.
markergenes: Named list of character vectors with the marker genes for each cell types. The marker genes must be a subset of rownames(sce).
fraction_topscoring: numeric vector of length 1 or the same length as markergenes giving the fraction(s) of top scoring cells for each cell type to pick to create the reference transcriptome profile.
expr_values: Integer scalar or string indicating which assay of sce contains the expression values.
normGenesetExpressionParams: list with additional parameters for normGenesetExpression.
aggregateReferenceParams: list with additional parameters for aggregateReference.
SingleRParams: list with additional parameters for SingleR.
BPPARAM: An optional BiocParallelParam instance determining the parallel back-end to be used during evaluation.

Value

A list of three elements named cells, refs and labels. cells contains a list with the numerical indices of the top scoring cells for each cell type. refs contains the pseudo-bulk transcriptome profiles used as a reference for label assignment, as returned by aggregateReference. labels contains a DataFrame with the annotation statistics for each cell (one cell per row), generated by SingleR.

Author

Michael Stadler

Examples

if (requireNamespace("SingleR", quietly = TRUE) &&
    requireNamespace("SingleCellExperiment", quietly = TRUE) &&
    requireNamespace("scrapper", quietly = TRUE)) {
    
    # create SingleCellExperiment with cell-type specific genes
    library(SingleCellExperiment)
    n_types <- 3
    n_per_type <- 30
    n_cells <- n_types * n_per_type
    n_genes <- 500
    fraction_specific <- 0.1
    n_specific <- round(n_genes * fraction_specific)

    set.seed(42)
    mu <- ceiling(runif(n = n_genes, min = 0, max = 30))
    u <- do.call(rbind, lapply(mu, function(x) rpois(n_cells, lambda = x)))
    rownames(u) <- paste0("g", seq.int(nrow(u)))
    celltype.labels <- rep(paste0("t", seq.int(n_types)), each = n_per_type)
    celltype.genes <- split(sample(rownames(u), size = n_types * n_specific),
                            rep(paste0("t", seq.int(n_types)), each = n_specific))
    for (i in seq_along(celltype.genes)) {
        j <- celltype.genes[[i]]
        k <- celltype.labels == paste0("t", i)
        u[j, k] <- 2 * u[j, k]
    }
    v <- log2(u + 1)
    sce <- SingleCellExperiment(assays=list(counts=u, logcounts=v))

    # define marker genes (subset of true cell-type-specific genes)
    marker.genes <- lapply(celltype.genes, "[", 1:5)
    marker.genes

    # predict cell types
    res <- labelCells(sce, marker.genes,
                      fraction_topscoring = 0.1,
                      normGenesetExpressionParams = list(R = 50))

    # high-scoring cells used as references for each celltype
    res$cells

    # ... from these, pseudo-bulks were created:
    res$refs

    # ... and used to predict labels for all cells
    res$labels$pruned.labels

    # compare predicted to true cell types
    table(true = celltype.labels, predicted = res$labels$pruned.labels)
}
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: ‘MatrixGenerics’
#> The following objects are masked from ‘package:matrixStats’:
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: ‘Biobase’
#> The following object is masked from ‘package:MatrixGenerics’:
#> 
#>     rowMedians
#> The following objects are masked from ‘package:matrixStats’:
#> 
#>     anyMissing, rowMedians
#>     predicted
#> true t1 t2 t3
#>   t1 30  0  0
#>   t2  0 29  0
#>   t3  0  0 30

Assign labels to cells using known marker genes

Arguments

Value

See also

Author

Examples