Truncate sequences, remove parts matching to adapters and filter out low quality or low complexity sequences from (compressed) 'fasta' or 'fastq' files.
preprocessReads(
filename,
outputFilename = NULL,
filenameMate = NULL,
outputFilenameMate = NULL,
truncateStartBases = NULL,
truncateEndBases = NULL,
Lpattern = "",
Rpattern = "",
max.Lmismatch = rep(0:2, c(6, 3, 100)),
max.Rmismatch = rep(0:2, c(6, 3, 100)),
with.Lindels = FALSE,
with.Rindels = FALSE,
minLength = 14L,
nBases = 2L,
complexity = NULL,
nrec = 1000000L,
clObj = NULL
)
the name(s) of the input sequence file(s).
the name(s) of the output sequence file(s).
for paired-end experiments, the name(s) of the input sequence file(s) containing the second read (mate) of each pair.
for paired-end experiments, the name(s) of the output sequence file(s) containing the second read (mate) of each pair.
integer(1): the number of bases to be truncated (removed) from the beginning of each sequence.
integer(1): the number of bases to be truncated (removed) from the end of each sequence.
character(1): the left (5'-end) adapter sequence.
character(1): the right (3'-end) adapter sequence.
mismatch tolerance when searching for matches of
Lpattern
(see ‘Details’).
mismatch tolerance when searching for matches of
Rpattern
(see ‘Details’).
if TRUE
, indels are allowed in the alignments of
the suffixes of Lpattern
with the subject, at its beginning
(see ‘Details’).
same as with.Lindels
but for alignments of the
prefixes of Rpattern
with the subject, at its end (see
‘Details’).
integer(1): the minimal allowed sequence length.
integer(1): the maximal number of Ns allowed per sequence.
NULL
(default) or numeric(1): If not NULL
,
the minimal sequence complexity, as a fraction of the average complexity
in the human genome (~3.9bits). For example, complexity = 0.5
will
filter out sequences that do not have at least half the complexity of the
human genome. See ‘Details’ on how the complexity is calculated.
integer(1): the number of sequence records to read at a time.
a cluster object to be used for parallel processing of multiple files (see ‘Details’).
A matrix with summary statistics on the processed sequences, containing:
One column per input file (or pair of input files for paired-end experiments).
The number of sequences or sequence pairs in rows:
totalSequences
- the total number in the input
matchTo5pAdaptor
- matching to Lpattern
matchTo3pAdaptor
- matching to Rpattern
tooShort
- shorter than minLength
tooManyN
- more than nBases
Ns
lowComplexity
- relative complexity below complexity
totalPassed
- the number of sequences/sequence pairs that pass all filtering criteria and were written to the output file(s).
Sequence files can be in fasta or fastq format, and can be compressed by
either gzip, bzip2 or xz (extensions .gz, .bz2 or .xz). Multiple files
can be processed by a single call to preprocessReads
; in that
case all sequence file vectors must have identical lengths.
nrec
can be used to limit the memory usage when processing
large input files. preprocessReads
iteratively loads chunks of
nrec
sequences from the input until all data been processed.
Sequence pairs from paired-end experiments can be processed by
specifying pairs of input and output files (filenameMate
and
outputFilenameMate
arguments). In that case, it is assumed that
pairs appear in the same order in the two input files, and only pairs
in which both reads pass all filtering criteria are written to the
output files, maintaining the consistent ordering.
If output files are compressed, the processed sequences are first written to temporary files (created in the same directory as the final output file), and the output files are generated at the end by compressing the temporary files.
For the trimming of left and/or right flanking sequences (adapters) from
sequence reads, the trimLRPatterns
function
from package Biostrings is used, and the arguments Lpattern
,
Rpattern
, max.Lmismatch
, max.Rmismatch
,
with.Lindels
and with.Rindels
are used in the call to
trimLRPatterns
. Lfixed
and Rfixed
arguments
of trimLRPatterns
are set to TRUE
, thus only fixed
patterns (without IUPAC codes for ambigous bases) can be
used. Currently, trimming of adapters is only supported for single read
experiments.
Sequence complexity (\(H\)) is calculated based on the dinucleotide
composition using the formula (Shannon entropy): $$H = -\sum_i {f_i \log_2 f_i},$$
where \(f_i\) is the fraction of dinucleotide \(i\) from all
dinucleotides in the sequence. Sequence reads that fulfill the condition
\(H/H_r \ge c\) are retained (not filtered out), where \(H_r =
3.908\) is the reference complexity in bits obtained from the human
genome, and \(c\) is the value given to the argument complexity
.
If an object that inherits from class cluster
is provided to
the clObj
argument, for example an object returned by
makeCluster
from package parallel,
multiple files will be processed in parallel using
clusterMap
from package parallel.
trimLRPatterns
from package Biostrings,
makeCluster
from package parallel
# sample files
infiles <- system.file(package="QuasR", "extdata",
c("rna_1_1.fq.bz2","rna_1_2.fq.bz2"))
outfiles <- paste(tempfile(pattern=c("output_1_","output_2_")),".fastq",sep="")
# single read example
preprocessReads(infiles, outfiles, nBases=0, complexity=0.6)
#> filtering /Users/runner/work/_temp/Library/QuasR/extdata/rna_1_1.fq.bz2
#> filtering /Users/runner/work/_temp/Library/QuasR/extdata/rna_1_2.fq.bz2
#> rna_1_1.fq.bz2 rna_1_2.fq.bz2
#> totalSequences 3002 3002
#> matchTo5pAdapter 0 0
#> matchTo3pAdapter 0 0
#> tooShort 0 0
#> tooManyN 3 0
#> lowComplexity 0 0
#> totalPassed 2999 3002
unlink(outfiles)
# paired-end example
preprocessReads(filename=infiles[1],
outputFilename=outfiles[1],
filenameMate=infiles[2],
outputFilenameMate=outfiles[2],
nBases=0, complexity=0.6)
#> filtering /Users/runner/work/_temp/Library/QuasR/extdata/rna_1_1.fq.bz2 and
#> /Users/runner/work/_temp/Library/QuasR/extdata/rna_1_2.fq.bz2
#> rna_1_1.fq.bz2:rna_1_2.fq.bz2
#> totalSequences 3002
#> matchTo5pAdapter NA
#> matchTo3pAdapter NA
#> tooShort 0
#> tooManyN 3
#> lowComplexity 0
#> totalPassed 2999
unlink(outfiles)