Quantify alignments by relative position

Quantify alignments from sequencing data, relative to their position in query regions.

qProfile(
  proj,
  query,
  upstream = 1000,
  downstream = upstream,
  selectReadPosition = c("start", "end"),
  shift = 0L,
  orientation = c("any", "same", "opposite"),
  useRead = c("any", "first", "last"),
  auxiliaryName = NULL,
  mask = NULL,
  collapseBySample = TRUE,
  includeSpliced = TRUE,
  includeSecondary = TRUE,
  mapqMin = 0L,
  mapqMax = 255L,
  absIsizeMin = NULL,
  absIsizeMax = NULL,
  maxInsertSize = 500L,
  binSize = 1L,
  clObj = NULL
)

Arguments

proj

A qProject object representing a sequencing experiment as returned by qAlign

query

An object of type GRanges with the regions to be profiled. All regions in query will be anchored at their biological start position (start(query) for regions on strand “+” or “*”, end(query) for regions on strand “-”). This position will become position zero in the return value.

upstream

An “integer” vector of length one or the same length as query indicating the number of bases upstream of the anchor position to include in the profile.

downstream

An “integer” vector of length one or the same length as query indicating the number of bases downstream of the anchor position to include in the profile.

selectReadPosition

defines the part of the alignment that has to be contained within a query region to produce an overlap (see Details), and that is used to calculate the relative position within the query region. Possible values are:

start (default): : start of the alignment
end: : end of the alignment

shift

controls the shifting alignments towards their 3'-end before quantification. shift can be one of:

an “integer” vector of the same length as the number of alignment files
a single “integer” value
the character string "halfInsert" (only available for paired-end experiments)

The default of 0 will not shift any alignments.

orientation

sets the required orientation of the alignments relative to the query region in order to be counted, one of:

any (default): : count alignment on the same and opposite strand
same: : count only alignment on the same strand
opposite: : count only alignment on the opposite strand

useRead

For paired-end experiments, selects the read mate whose alignments should be counted, one of:

any (default): : count all alignments
first: : count only alignments from the first read
last: : count only alignments from the last read

auxiliaryName

Which bam files to use in an experiments with auxiliary alignments (see Details).

mask

If not NULL, a GRanges object with reference regions to be masked, i.e. excluded from the quantification, such as unmappable or highly repetitive regions (see Details).

collapseBySample

If TRUE (the default), sum alignment counts from bam files with the same sample name.

includeSpliced

If TRUE (the default), include spliced alignments when counting. A spliced alignment is defined as an alignment with a gap in the read of at least 60 bases.

includeSecondary

If TRUE (the default), include alignments with the secondary bit (0x0100) set in the FLAG when counting.

mapqMin

Minimal mapping quality of alignments to be included when counting (mapping quality must be greater than or equal to mapqMin). Valid values are between 0 and 255. The default (0) will include all alignments.

mapqMax

Maximal mapping quality of alignments to be included when counting (mapping quality must be less than or equal to mapqMax). Valid values are between 0 and 255. The default (255) will include all alignments.

absIsizeMin

For paired-end experiments, minimal absolute insert size (TLEN field in SAM Spec v1.4) of alignments to be included when counting. Valid values are greater than 0 or NULL (default), which will not apply any minimum insert size filtering.

absIsizeMax

For paired-end experiments, maximal absolute insert size (TLEN field in SAM Spec v1.4) of alignments to be included when counting. Valid values are greater than 0 or NULL (default), which will not apply any maximum insert size filtering.

maxInsertSize

Maximal fragment size of the paired-end experiment. This parameter is used if shift="halfInsert" and will ensure that query regions are made wide enough to emcompass all alignment pairs whose mid falls into the query region. The default value is 500 bases.

binSize

Numeric scalar giving the size of bins (must be an odd number). The default value (1) gives back counts for single bases. Otherwise, alignments are counted in adjacent, non-overlapping windows of size binSize that tile the interval defined by upstream and downstream.

clObj

A cluster object to be used for parallel processing (see ‘Details’).

Value

A list of matrices with length(unique(names(query))) rows with profile names, and max(upstream)+max(downstream)+1 columns indicating relative position (for binsize=1).

For binSize values greater than 1, the number of columns corresponds to the number of bins (tiles), namely ceiling(max(upstream)/binSize)+ceiling(max(downstream)/binSize). A middle bin of size binSize is always positioned centered at the anchor of each region. Additional bins are positioned upstream and downstream, adjacent to that middle bin, in order to include at least upstream and downstream bases, respectively (potentially more in order to fill the first and last bins).

The relative positions are given as column names (for binSize > 1 they refer to the bin mid). In that case, the bins are "right-open". For example, if binSize = 10, the bin with the midpoint "-50" contains counts for the alignments in [-55,-45).

The first list element is called “coverage” and contains, for each profile and relative position, the number of overlapping regions that contributed to the profile.

Subsequent list elements contain the alignment counts for individual sequence files (collapseBySample=FALSE) or samples (collapseBySample=TRUE) in proj.

For projects with allele-specific quantification, i.e. if a file with single nucleotide polymorphisms was supplied to the snpFile argument of qAlign, there will be three rows instead of one row with counts per unique region name, with numbers of alignments for Reference, Unknown and Alternative genotypes (suffixed _R, _U and _A).

Details

qProfile is used to count alignments in each sample from a qProject object, relative to their position in query regions.

Most arguments are identical to the ones of qCount.

The query argument is a GRanges object that defines the regions for the profile. All regions in query will be aligned to one another at their anchor position, which corresponds to their biological start position (start(query) for regions on strand “+” or “*”, end(query) for regions on strand “-”).

This anchor position will be extended (with regard to strand) by the number of bases specified by upstream and downstream. In the return value, the anchor position will be at position zero.

If binSize is greater than one, upstream and downstream will be slightly increased in order to include the complete first and last bins of binSize bases.

Regions with identical names in names{query} will be summed, and profiles will be padded with zeros to accomodate the length of all profiles.

Author

Anita Lerch, Dimos Gaidatzis and Michael Stadler

Examples

# copy example data to current working directory
file.copy(system.file(package="QuasR", "extdata"), ".", recursive=TRUE)
#> [1] TRUE

# create alignments (single-end experiment)
genomeFile <- "extdata/hg19sub.fa"
sampleFile <- "extdata/samples_chip_single.txt"
proj <- qAlign(sampleFile, genomeFile)
#> all necessary alignment files found

# load transcript start site coordinates
library(rtracklayer)
annotationFile <- "extdata/hg19sub_annotation.gtf"
tssRegions <- import.gff(annotationFile, format="gtf",
                         feature.type="start_codon")

# obtain a combined TSS profile
pr1 <- qProfile(proj, tssRegions)
#> profiling alignments...
#> done
lapply(pr1, dim)
#> $coverage
#> [1]    1 2001
#> 
#> $Sample1
#> [1]    1 2001
#> 
#> $Sample2
#> [1]    1 2001
#> 
lapply(pr1, "[", , 1:5)
#> $coverage
#> -1000  -999  -998  -997  -996 
#>    19    19    19    19    19 
#> 
#> $Sample1
#> -1000  -999  -998  -997  -996 
#>     0     0     1     0     0 
#> 
#> $Sample2
#> -1000  -999  -998  -997  -996 
#>     1     1     1     2     0 
#> 

prComb <- do.call("+", lapply(pr1[-1], function(x) x/pr1[[1]]))
barplot(prComb, xlab="Position", ylab="Mean no. of alignments")


# obtain TSS profiles for individual regions
names(tssRegions) <- mcols(tssRegions)$transcript_id
pr2 <- qProfile(proj, tssRegions)
#> profiling alignments...
#> done
lapply(pr2, dim)
#> $coverage
#> [1]   19 2001
#> 
#> $Sample1
#> [1]   19 2001
#> 
#> $Sample2
#> [1]   19 2001
#> 
lapply(pr2, "[", 1:3, 1:5)
#> $coverage
#>                 -1000 -999 -998 -997 -996
#> ENST00000328596     1    1    1    1    1
#> ENST00000379265     1    1    1    1    1
#> ENST00000379268     1    1    1    1    1
#> 
#> $Sample1
#>                 -1000 -999 -998 -997 -996
#> ENST00000328596     0    0    0    0    0
#> ENST00000379265     0    0    0    0    0
#> ENST00000379268     0    0    0    0    0
#> 
#> $Sample2
#>                 -1000 -999 -998 -997 -996
#> ENST00000328596     0    0    0    0    0
#> ENST00000379265     0    0    0    0    0
#> ENST00000379268     0    0    0    0    0
#>