Quantify alignments from sequencing data.
qCount(
proj,
query,
reportLevel = c(NULL, "gene", "exon", "promoter", "junction"),
selectReadPosition = c("start", "end"),
shift = 0L,
orientation = c("any", "same", "opposite"),
useRead = c("any", "first", "last"),
auxiliaryName = NULL,
mask = NULL,
collapseBySample = TRUE,
includeSpliced = TRUE,
includeSecondary = TRUE,
mapqMin = 0L,
mapqMax = 255L,
absIsizeMin = NULL,
absIsizeMax = NULL,
maxInsertSize = 500L,
clObj = NULL
)
A qProject
object representing a
sequencing experiment as returned by qAlign
An object of type GRanges
,
GRangesList
or
TxDb
with the regions to be
quantified. The type of query
will determine the mode of
quantification (see ‘Details’). For
reportLevel="junction"
, query
is ignored and can also
be NULL
.
Level of quantification (query
of type
TxDb
or NULL
), one of
gene
(default): one value per gene
exon
: one value per exon
promoter
: one value per promoter
junction
: one count per detected exon-exon junction
(query
will be ignored in this case)
defines the part of the alignment that has to be contained within a query region to produce an overlap (see ‘Details’). Possible values are:
start
(default): start of the alignment
end
: end of the alignment
controls the shifting alignments towards their 3'-end before
quantification. shift
can be one of:
an “integer” vector of the same length as the number of alignment files
a single “integer” value
the character string "halfInsert"
(only available for
paired-end experiments)
The default of 0
will not shift any alignments.
sets the required orientation of the alignments relative to the query region in order to be counted, one of:
any
(default): count alignment on the same and opposite strand
same
: count only alignment on the same strand
opposite
: count only alignment on the opposite strand
For paired-end experiments, selects the read mate whose alignments should be counted, one of:
any
(default): count all alignments
first
: count only alignments from the first read
last
: count only alignments from the last read
Which bam files to use in an experiments with auxiliary alignments (see ‘Details’).
If not NULL
, a GRanges
object with reference regions to be masked, i.e. excluded from the
quantification, such as unmappable or highly repetitive regions (see
‘Details’).
If TRUE
(the default), sum alignment
counts from bam files with the same sample name.
If TRUE
(the default), include spliced
alignments when counting. A spliced alignment is defined as an
alignment with a gap in the read of at least 60 bases.
If TRUE
(the default), include alignments
with the secondary bit (0x0100) set in the FLAG
when counting.
Minimal mapping quality of alignments to be included when
counting (mapping quality must be greater than or equal to
mapqMin
). Valid values are between 0 and 255. The default (0)
will include all alignments.
Maximal mapping quality of alignments to be included when
counting (mapping quality must be less than or equal to mapqMax
).
Valid values are between 0 and 255. The default (255) will include
all alignments.
For paired-end experiments, minimal absolute insert
size (TLEN field in SAM Spec v1.4) of alignments to be included when
counting. Valid values are greater than 0 or NULL
(default),
which will not apply any minimum insert size filtering.
For paired-end experiments, maximal absolute insert
size (TLEN field in SAM Spec v1.4) of alignments to be included when
counting. Valid values are greater than 0 or NULL
(default),
which will not apply any maximum insert size filtering.
Maximal fragment size of the paired-end experiment.
This parameter is used if shift="halfInsert"
and will
ensure that query regions are made wide enough to emcompass all
alignment pairs whose mid falls into the query region. The default
value is 500
bases.
A cluster object to be used for parallel processing (see ‘Details’).
A matrix
with effective query regions width in the first
column, and alignment counts in subsequent columns, or a
GRanges
object if reportLevel="junction"
.
The effective query region width returned as first column in the matrix is calculated by the number of unique, non-masked bases in the reference sequence that contributed to the count of this query name (irrespective if the bases were covered by alignments or not). An effective width of zero indicates that the region was fully masked and will have zero counts in all samples.
The alignment counts in the matrix are contained from column two
onwards. For projects with allele-specific quantification, i.e. if a
file with single nucleotide polymorphisms was supplied to the
snpFile
argument of qAlign
, there will be
three columns per bam file (number of alignments for Reference,
Unknown and Alternative genotypes, with suffixed _R, _U and
_A). Otherwise there is a single columns per bam file.
If collapseBySample
=TRUE
, groups of bam files with identical
sample name are combined by summing their alignment counts.
For reportLevel="junction"
, the return value is a
GRanges
object. The start and end coordinates correspond to the
first and last base in each detected intron. Plus- and minus-strand
alignments are quantified separately, so that in an unstranded RNA-seq
experiment, the same intron may be represented twice; once for each
strand. The counts for each sample are contained in the mcols
of the GRanges
object.
qCount
is used to count alignments in each sample from a
qProject
object. The features to be quantified, together with
the mode of quantification, are specified by the query
argument, which is one of:
GRanges
: Overlapping alignments
are counted separately for each coordinate region. If multiple
regions have identical names, their counts will be summed, counting
each alignment only once even if it overlaps more than one of these
regions. Alignments may be counted more than once if they overlap
multiple regions that have different names.
This mode is for example used to quantify ChIP-seq alignments in
promoter regions, or gene expression levels in an RNA-seq experiment
(using a query
with exon regions named by gene).
GRangesList
: Alignments are
counted and summed for each list element in query
if they
overlap with any of the regions contained in the list element. The
order of the list elements defines a hierarchy for quantification:
Alignment will only be counted for the first element (the one with
the lowest index in query
) that they overlap, but not for any
potential further list elements containing overlapping regions.
This mode can be used to hierarchically and uniquely count (assign)
each alignment to a one of several groups of regions (the elements
in query
), for example to estimate the fractions of different
classes of RNA in an RNA-seq experiment (rRNA, tRNA, snRNA, snoRNA,
mRNA, etc.)
TxDb
: Used to extract
regions from annotation and report alignment counts depending on the
value of reportLevel
. If reportLevel="exon"
,
alignments overlapping each exon in query
are counted.
If reportLevel="gene"
, alignment counts for all exons of a
gene will be summed, counting each alignment only once even if it
overlaps multiple annotated exons of a gene. These are useful to
calculate exon or gene expression levels in RNA-seq experiments
based on the annotation in a TxDb
object. If
reportLevel="promoter"
, the promoters
function from package
GenomicFeatures is used with default arguments to extract
promoter regions around transcript start sites, e.g. to quantify
alignments inf a ChIP-seq experiment.
NULL
for
reportLevel="junction"
: The query
argument is ignored
if reportLevel
is set to "junction"
, and qCount
will count the number of alignments supporting each exon-exon
junction detected in any of the samples in proj
. The
arguments selectReadPosition
, shift
,
orientation
, useRead
and mask
will have no
effect in this quantification mode.
The additional arguments allow to fine-tune the quantification:
selectReadPosition
defines the part of the alignment that has
to be contained within a query region for an overlap. The values
start
(default) and end
refer to the biological start
(5'-end) and end (3'-end) of the alignment. For example, the
start
of an alignment on the plus strand is its leftmost
(lowest) base, and the end
of an alignment on the minus strand
is also the leftmost base.
shift
allows on-the-fly shifting of alignments towards their
3'-end prior to overlap determination and counting. This can be
helpful to increase resolution of ChIP-seq experiments by moving
alignments by half the immuno-precipitated fragment size towards the
middle of fragments. shift
is either an “integer” vector
with one value per alignment file in proj
, or a single
“integer” value, in which case all alignment files will be
shifted by the same value. For paired-end experiments, it can be
alternatively set to "halfInsert", which will estimate the true
fragment size from the distance between aligned read pairs and shift
the alignments accordingly.
orientation
controls the interpretation of alignment strand
when counting, relative to the strand of the query region. any
will count all overlapping alignments, irrespective of the alignment
strand (e.g. used in an unstranded RNA-seq experiment). same
will only count the alignments on the same strand as the query region
(e.g. in a stranded RNA-seq experiment that generates reads from the
sense strand), and opposite
will only
count the alignments on the opposite strand from the query region
(e.g. to quantify anti-sense transcription in a stranded RNA-seq
experiment).
includeSpliced
and includeSecondary
can be used to
include or exclude spliced or secondary alignments,
respectively. mapqMin
and mapqMax
allow to select alignments
based on their mapping qualities. mapqMin
and mapqMax
can
take integer values between 0 and 255 and equal to
\(-10 log_{10} Pr(\textnormal{mapping position is wrong})\), rounded to the nearest
integer. A value 255 indicates that the mapping quality is not available.
In paired-end experiments, useRead
allows to quantify either
all alignments (useRead="any"
), or only the first
(useRead="first"
) or last (useRead="last"
) read from a
read pair or read group. Note that for useRead="any"
(the
default), an alignment pair that is fully contained within a query
region will contribute two counts to the value of that
region. absIsizeMin
and absIsizeMax
can be used to
select alignments based on their insert size (TLEN field in SAM Spec
v1.4).
auxiliaryName
selects the reference sequence for which
alignments should be quantified. NULL
(the default) will
select alignments against the genome. If set to a character string
that matches one of the auxiliary target names (as specified in
the auxiliaryFile
argument of qAlign
), the
corresponding alignments will be counted.
mask
can be used to specify a
GRanges
object with regions in the
reference sequence to be excluded from quantification. The regions
will be considered unstranded (strand="*"
). Alignments that
overlap with a region in mask
will not be counted. Masking may
reduce the effective width of query regions reported by qCount
,
even down to zero for regions that are fully contained in mask
.
If clObj
is set to an object that inherits from class
cluster
, for example an object returned by
makeCluster
from package parallel, the
quantification task is split into multiple chunks and processed in
parallel using clusterMap
. Currently, not all
tasks will be efficiently parallelized: For example, a single query
region and a single (group of) bam files will not be split into
multiple chunks.
qAlign
,
qProject
,
makeCluster
from package parallel
library(GenomicRanges)
library(Biostrings)
#> Loading required package: XVector
#>
#> Attaching package: ‘Biostrings’
#> The following object is masked from ‘package:base’:
#>
#> strsplit
library(Rsamtools)
# copy example data to current working directory
file.copy(system.file(package="QuasR", "extdata"), ".", recursive=TRUE)
#> [1] TRUE
# load genome sequence
genomeFile <- "extdata/hg19sub.fa"
gseq <- readDNAStringSet(genomeFile)
chrRegions <- GRanges(names(gseq), IRanges(start=1,width=width(gseq),names=names(gseq)))
# create alignments (paired-end experiment)
sampleFile <- "extdata/samples_rna_paired.txt"
proj <- qAlign(sampleFile, genomeFile, splicedAlignment=TRUE)
#> Creating .fai file for: /private/var/folders/2s/h6hvv9ps03xgz_krkkstvq_r0000gn/T/RtmpHtA1jk/file5ddc16c1b3df/reference/extdata/hg19sub.fa
#> alignment files missing - need to:
#> create alignment index for the genome
#> create 2 genomic alignment(s)
#> Creating an Rbowtie index for /private/var/folders/2s/h6hvv9ps03xgz_krkkstvq_r0000gn/T/RtmpHtA1jk/file5ddc16c1b3df/reference/extdata/hg19sub.fa
#> Finished creating index
#> Testing the compute nodes...
#> OK
#> Loading QuasR on the compute nodes...
#> preparing to run on 1 nodes...
#> done
#> Available cores:
#> Mac-1740133007481.local: 1
#> Performing genomic alignments for 2 samples. See progress in the log file:
#> /private/var/folders/2s/h6hvv9ps03xgz_krkkstvq_r0000gn/T/RtmpHtA1jk/file5ddc16c1b3df/reference/QuasR_log_5ddc6bf69095.txt
#> Genomic alignments have been created successfully
#>
# count reads using a "GRanges" query
qCount(proj, query=chrRegions)
#> counting alignments...
#> done
#> width Sample1 Sample2
#> chr1 40000 850 1464
#> chr2 10000 2924 2224
#> chr3 45000 2228 2312
qCount(proj, query=chrRegions, useRead="first")
#> counting alignments...
#> done
#> width Sample1 Sample2
#> chr1 40000 425 732
#> chr2 10000 1462 1112
#> chr3 45000 1114 1156
# hierarchical counting using a "GRangesList" query
library(rtracklayer)
annotationFile <- "extdata/hg19sub_annotation.gtf"
gtfRegions <- import.gff(annotationFile, format="gtf", feature.type="exon")
names(gtfRegions) <- mcols(gtfRegions)$source
gtfRegionList <- split(gtfRegions, names(gtfRegions))
names(gtfRegionList)
#> [1] "lincRNA" "protein_coding" "snoRNA"
res3 <- qCount(proj, gtfRegionList)
#> hierarchically removing redundancy from 'query'...
#> done
#> counting alignments...
#> done
#> collapsing counts by query name...
#> done
res3
#> width Sample1 Sample2
#> lincRNA 1104 0 0
#> protein_coding 21576 5876 5919
#> snoRNA 104 0 0
# gene expression levels using a "TxDb" query
library("txdbmaker")
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: ‘txdbmaker’
#> The following objects are masked from ‘package:GenomicFeatures’:
#>
#> UCSCFeatureDbTableSchema, browseUCSCtrack, getChromInfoFromBiomart,
#> makeFDbPackageFromUCSC, makeFeatureDbFromUCSC, makePackageName,
#> makeTxDb, makeTxDbFromBiomart, makeTxDbFromEnsembl,
#> makeTxDbFromGFF, makeTxDbFromGRanges, makeTxDbFromUCSC,
#> makeTxDbPackage, makeTxDbPackageFromBiomart,
#> makeTxDbPackageFromUCSC, supportedMiRBaseBuildValues,
#> supportedUCSCFeatureDbTables, supportedUCSCFeatureDbTracks,
#> supportedUCSCtables
genomeRegion <- scanFaIndex(genomeFile)
chrominfo <- data.frame(chrom=as.character(seqnames(genomeRegion)),
length=end(genomeRegion),
is_circular=rep(FALSE, length(genomeRegion)))
txdb <- makeTxDbFromGFF(annotationFile,
format="gtf",
chrominfo=chrominfo,
dataSource="Ensembl modified",
organism="Homo sapiens")
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
res4 <- qCount(proj, txdb, reportLevel="gene")
#> extracting gene regions from TxDb...
#> done
#> counting alignments...
#> done
#> collapsing counts by query name...
#> done
res4
#> width Sample1 Sample2
#> ENSG00000078808 4697 708 1076
#> ENSG00000134075 589 1201 1322
#> ENSG00000134086 4213 282 295
#> ENSG00000171863 5583 2922 2224
#> ENSG00000176022 2793 62 344
#> ENSG00000186827 1721 37 8
#> ENSG00000186891 1370 26 2
#> ENSG00000238345 98 0 0
#> ENSG00000238642 104 0 0
#> ENSG00000247886 1104 0 0
#> ENSG00000252531 163 6 2
#> ENSG00000254999 1233 1839 1970
# exon-exon junctions
res5 <- qCount(proj, NULL, reportLevel="junction")
#> counting junctions...
#> done
res5
#> GRanges object with 52 ranges and 2 metadata columns:
#> seqnames ranges strand | Sample1 Sample2
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] chr1 12213-12321 + | 1 0
#> [2] chr1 12213-12321 - | 2 0
#> [3] chr1 13085-13371 - | 1 0
#> [4] chr1 18069-18837 + | 8 7
#> [5] chr1 18069-18837 - | 6 12
#> ... ... ... ... . ... ...
#> [48] chr1 14166-14362 - | 0 1
#> [49] chr1 19009-19148 - | 0 2
#> [50] chr1 29327-32271 - | 0 2
#> [51] chr1 4617-4778 - | 0 1
#> [52] chr3 2504-5589 + | 0 2
#> -------
#> seqinfo: 3 sequences from an unspecified genome; no seqlengths