RQCFilter2

Script: rqcfilter2.sh Package: jgi Class: RQCFilter2.java

RQCFilter2 is a revised version of RQCFilter that uses a common path for all dependencies. It performs quality control including adapter trimming, quality trimming, artifact removal, linker trimming, and spike-in removal using BBDuk, along with human/cat/dog/mouse/microbe removal using BBMap. The pipeline requires 40 GB RAM for vertebrate contaminant removal, but only ~1GB without them.

Basic Usage

rqcfilter2.sh in=<input file> path=<output directory> rqcfilterdata=<path to RQCFilterData directory>

The RQCFilterData dependencies are available at: http://portal.nersc.gov/dna/microbial/assembly/bushnell/RQCFilterData.tar

Parameters

RQCFilter2 parameters are organized by processing stage. The pipeline executes multiple stages in sequence: sketching, clumping, adapter trimming, quality trimming, contaminant mapping, filtering, and spikein removal. Each stage has dedicated parameters for precise control.

Primary I/O parameters

in=<file>: Input reads.
in2=<file>: Use this if 2nd read of pairs are in a different file.
path=null: Set to the directory to use for all output files.

Reference file path parameters

rqcfilterdata=: Path to unzipped RQCFilterData directory. Default is /global/projectb/sandbox/gaag/bbtools/RQCFilterData
ref=<file,file>: Comma-delimited list of additional reference files for filtering via BBDuk.

Output parameters

scafstats=scaffoldStats.txt: Scaffold stats file name (how many reads matched which reference scaffold).
kmerstats=kmerStats.txt: Kmer stats file name (duk-like output).
log=status.log: Progress log file name.
filelist=file-list.txt: List of output files.
stats=filterStats.txt: Overall stats file name.
stats2=filterStats2.txt: Better overall stats file name.
ihist=ihist_merge.txt: Insert size histogram name. Set to null to skip merging.
outribo=ribo.fq.gz: Output for ribosomal reads, if removeribo=t.
reproduceName=reproduce.sh: Name of shellscript to reproduce these results.
usetmpdir=t: Write temp files to TMPDIR.
tmpdir=: Override TMPDIR.

Adapter trimming parameters

trimhdist=1: Hamming distance used for trimming.
trimhdist2=: Hamming distance used for trimming with short kmers. If unset, trimhdist will be used.
trimk=23: Kmer length for trimming stage.
mink=11: Minimum kmer length for short kmers when trimming.
trimfragadapter=t: Trim all known Illumina adapter sequences, including TruSeq and Nextera.
trimrnaadapter=f: Trim Illumina TruSeq-RNA adapters.
bisulfite=f: Currently, this trims the last 1bp from all reads after the adapter-trimming phase.
findadapters=t: For paired-end files, attempt to discover the adapter sequence with BBMerge and use that rather than a set of known adapters.
swift=f: Trim Swift sequences: Trailing C/T/N R1, leading G/A/N R2.

Quality trimming parameters

qtrim=f: Trim read ends to remove bases with quality below minq. Performed AFTER looking for kmers. Values: rl (trim both ends), f (neither end), r (right end only), l (left end only).
trimq=10: Trim quality threshold. Must also set qtrim for direction.
minlength=45: (ml) Reads shorter than this after trimming will be discarded. Pairs will be discarded only if both are shorter.
mlf=0.333: (minlengthfraction) Reads shorter than this fraction of original length after trimming will be discarded.
minavgquality=5: (maq) Reads with average quality (before trimming) below this will be discarded.
maxns=0: Reads with more Ns than this will be discarded.
forcetrimmod=5: (ftm) If positive, right-trim length to be equal to zero, modulo this number.
forcetrimleft=-1: (ftl) If positive, trim bases to the left of this position (exclusive, 0-based).
forcetrimright=-1: (ftr) If positive, trim bases to the right of this position (exclusive, 0-based).
forcetrimright2=-1: (ftr2) If positive, trim this many bases on the right end.

Mapping parameters (for vertebrate contaminants)

mapk=14: Kmer length for mapping stage (9-15; longer is faster).
removehuman=f: (human) Remove human reads via mapping.
keephuman=f: Keep reads that map to human (or cat, dog, mouse) rather than removing them.
removedog=f: (dog) Remove dog reads via mapping.
removecat=f: (cat) Remove cat reads via mapping.
removemouse=f: (mouse) Remove mouse reads via mapping.
aggressivehuman=f: Aggressively remove human reads (and cat/dog/mouse) using unmasked references.
aggressivemicrobe=f: Aggressively microbial contaminant reads using unmasked references.
aggressive=f: Set both aggressivehuman and aggressivemicrobe at once.
mapref=: Remove contaminants by mapping to this fasta file (or comma-delimited list).

Bloom filter parameters (for vertebrate mapping)

bloom=t: Use a Bloom filter to accelerate mapping.
bloomminreads=4m: Disable Bloom filter if there are fewer than this many reads.
bloomk=29: Kmer length for Bloom filter.
bloomhashes=1: Number of hashes for the Bloom filter.
bloomminhits=6: Minimum consecutive hits to consider a read as matching.
bloomserial=t: Use the serialized Bloom filter for greater loading speed. This will use the default Bloom filter parameters.

Microbial contaminant removal parameters

detectmicrobes=f: Detect common microbes, but don't remove them. Use this OR removemicrobes, not both.
removemicrobes=f: (microbes) Remove common contaminant microbial reads via mapping, and place them in a separate file.
taxlist=: (tax) Remove these taxa from the database before filtering. Typically, this would be the organism name or NCBI ID, or a comma-delimited list. Organism names should have underscores instead of spaces, such as Escherichia_coli.
taxlevel=order: (level) Level to remove. For example, 'phylum' would remove everything in the same phylum as entries in the taxlist.
taxtree=auto: (tree) Override location of the TaxTree file.
gitable=auto: Override location of the gitable file.
loadgitable=f: Controls whether gi numbers may be used for taxonomy.
microberef=: Path to fasta file of microbes.
microbebuild=1: Chooses which masking was used. 1 is most stringent and should be used for bacteria. Eukaryotes should use 3.

Extended microbial contaminant parameters

detectmicrobes2=f: (detectothermicrobes) Detect an extended set of microbes that are currently being screened. This can be used in conjunction with removemicrobes.

Filtering parameters (for artificial and genomic contaminants)

skipfilter=f: Skip this phase. Not recommended.
filterpolya=f: Remove reads containing poly-A sequence (for RNA-seq).
filterpolyg=0: Remove reads that start with a G polymer at least this long (0 disables).
trimpolyg=6: Trim reads that start or end with a G polymer at least this long (0 disables).
phix=t: Remove reads containing phiX kmers.
lambda=f: Remove reads containing Lambda phage kmers.
pjet=t: Remove reads containing PJET kmers.
sip=f: Remove SIP spikeins.
maskmiddle=t: (mm) Treat the middle base of a kmer as a wildcard, to increase sensitivity in the presence of errors.
maxbadkmers=0: (mbk) Reads with more than this many contaminant kmers will be discarded.
filterhdist=1: Hamming distance used for filtering.
filterqhdist=1: Query hamming distance used for filtering.
copyundefined=f: (cu) Match all possible bases for sequences containing degenerate IUPAC symbols.
entropy=f: Remove low-complexity reads. The threshold can be specified by e.g entropy=0.4; default is 0.42 if enabled.
entropyk=2: Kmer length to use for entropy calculation.
entropywindow=40: Window size to use for entropy calculation.

Spikein removal/quantification parameters

mtst=f: Remove mtst.
kapa=t: Remove and quantify kapa.
spikeink=31: Kmer length for spikein removal.
spikeinhdist=0: Hamming distance for spikein removal.
spikeinref=: Additional references for spikein removal (comma-delimited list).

Ribosomal filtering parameters

ribohdist=1: Hamming distance used for rRNA removal.
riboedist=0: Edit distance used for rRNA removal.
removeribo=f: (ribo) Remove ribosomal reads via kmer-matching, and place them in a separate file.

Organelle filtering parameters

chloromap=f: Remove chloroplast reads by mapping to this organism's chloroplast.
mitomap=f: Remove mitochondrial reads by mapping to this organism's mitochondria.
ribomap=f: Remove ribosomal reads by mapping to this organism's ribosomes.

FilterByTile parameters

filterbytile=t: Run FilterByTile to remove reads from low-quality parts of the flowcell.
tiledump=: Set this to the tiledump of the full lane (recommended).

Recalibration parameters

recalibrate=t: Recalibrate quality scores based on PhiX alignment.
phixsam=: Set this to the aligned PhiX data for the lane (required).
quantize=2: Quantize the quality scores to reduce file size, using this divisor. 2 reduces size by roughly 25%. Disabled if recalibrate=f. Quantization happens AFTER all the quality-related steps.

Polyfilter parameters

polyfilter=GC: Remove reads with homopolymers of these subunits. Set polyfilter=null to disable.

Clumpify parameters

clumpify=f: Run clumpify; all deduplication flags require this.
dedupe=f: Remove duplicate reads; all deduplication flags require this.
opticaldupes=f: Remove optical duplicates (Clumpify optical flag).
edgedupes=f: Remove tile-edge duplicates (Clumpify spany and adjacent flags).
dpasses=1: Use this many deduplication passes.
dsubs=2: Allow this many substitutions between duplicates.
ddist=40: Remove optical/edge duplicates within this distance.
lowcomplexity=f: Set to true for low-complexity libraries such as RNA-seq to improve estimation of memory requirements.
clumpifytmpdir=f: Use TMPDIR for clumpify temp files.
clumpifygroups=-1: If positive, force Clumpify to use this many groups.

Sketch parameters

sketch=t: Run SendSketch on 2M read pairs.
silvalocal=t: Use the local flag for Silva (requires running RQCFilter on NERSC).
sketchreads=1m: Number of read pairs to sketch.
sketchsamplerate=1: Samplerate for SendSketch.
sketchminprob=0.2: Minprob for SendSketch.
sketchdb=nt,refseq,silva: Servers to use for SendSketch.

Other processing parameters

threads=auto: (t) Set number of threads to use; default is number of logical processors.
library=frag: Set to 'frag', 'clip', 'lfpe', or 'clrs'.
filterk=31: Kmer length for filtering stage.
rcomp=t: Look for reverse-complements of kmers in addition to forward kmers.
nexteralmp=f: Split into different files based on Nextera LMP junction sequence. Only for Nextera LMP, not normal Nextera.
extend=f: Extend reads during merging to allow insert size estimation of non-overlapping reads.
pigz=t: Use pigz for compression.
unpigz=t: Use pigz for decompression.
khist=f: Set to true to generate a kmer-frequency histogram of the output data.
merge=t: Set to false to skip generation of insert size histogram.

Header-specific parameters

NOTE: Be sure to disable these if the reads have improper headers, like SRA data.

chastityfilter=t: Remove reads failing chastity filter.
barcodefilter=f: Crash when improper barcodes are discovered. Set to 'f' to disable, 't' to remove improper barcodes, or 'crash' to crash if they are discovered.
barcodes=: A comma-delimited list of barcodes or files of barcodes.
filterbytile: Also needs to be disabled for SRA data.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Quality Control

rqcfilter2.sh in=reads.fq.gz path=filtered/ rqcfilterdata=/path/to/RQCFilterData/

Basic run with default quality control settings.

Remove Human Contamination

rqcfilter2.sh in=reads.fq.gz path=filtered/ rqcfilterdata=/path/to/RQCFilterData/ removehuman=t

Remove human reads via mapping (requires 40GB RAM).

Aggressive Deduplication for NovaSeq

rqcfilter2.sh in=reads.fq.gz path=filtered/ rqcfilterdata=/path/to/RQCFilterData/ clumpify=t dedupe=t opticaldupes=t ddist=12000

Recommended deduplication flags for NovaSeq data.

RNA-seq with Poly-A Filtering

rqcfilter2.sh in=rnaseq.fq.gz path=filtered/ rqcfilterdata=/path/to/RQCFilterData/ trimrnaadapter=t filterpolya=t lowcomplexity=t

Process RNA-seq data with RNA-specific adapter trimming and poly-A removal.

Microbial Contaminant Detection

rqcfilter2.sh in=reads.fq.gz path=filtered/ rqcfilterdata=/path/to/RQCFilterData/ detectmicrobes=t taxlist=Escherichia_coli taxlevel=species

Detect microbial contaminants while excluding specific taxa from filtering.

Comprehensive Pipeline with All Features

rqcfilter2.sh in=reads.fq.gz path=filtered/ rqcfilterdata=/path/to/RQCFilterData/ \
removehuman=t removemicrobes=t clumpify=t dedupe=t sketch=t \
qtrim=rl trimq=15 minlength=50

Full pipeline with human removal, microbial detection, deduplication, sketching, and quality trimming.

Algorithm Details

Pipeline Architecture

RQCFilter2 implements a multi-stage quality control pipeline with step-based staging to minimize I/O overhead. The pipeline calculates the total number of processing steps using numSteps variable and uses temporary files on local disk for intermediate stages, writing final output to the specified directory only at completion.

Processing Stages

The pipeline executes stages in this order:

Sketching: Runs SendSketch on the first X reads (default 1M) for taxonomic classification before any modification
Clumpification: Optional deduplication using Clumpify with configurable optical/edge duplicate detection
FilterByTile: Removes reads from low-quality flowcell regions using tile-based statistics
Adapter Discovery: Uses BBMerge to automatically detect adapter sequences in paired-end data
Adapter Trimming: Removes adapters using kmer-based matching with configurable Hamming distance
Quality Trimming: Trims low-quality bases with multiple force-trim options
Recalibration: Recalibrates quality scores using PhiX alignment data
Artifact Filtering: Removes synthetic contaminants (PhiX, PJET, Lambda) using kmer matching
Spikein Processing: Detects and quantifies spike-in sequences (MTST, KAPA)
Vertebrate Mapping: Maps to human/mouse/cat/dog references using BBMap with Bloom filter acceleration
Microbial Detection: Maps to microbial database with taxonomic filtering capabilities
Ribosomal Filtering: Removes ribosomal sequences via kmer matching
Organelle Mapping: Removes chloroplast/mitochondrial reads via mapping
Polyfilter: Removes homopolymer-containing reads

Memory Management

The tool implements specific memory management strategies:

RAM allocation: Default 40GB RAM allocation with calcXmx() method for automatic scaling based on available memory
Bloom filter loading: Uses serialized Bloom filters for faster loading of large reference databases when bloomserial=t
Temporary file staging: Uses TMPDIR or user-specified tmpdir for intermediate files to minimize disk I/O overhead
Explicit cleanup: Calls Data.unloadAll() between processing stages to free loaded reference indexes

Kmer-Based Filtering Strategy

RQCFilter2 uses multiple kmer strategies:

Trimming kmers: Length 23 with minimum 11 for adapter detection
Filtering kmers: Length 31 for contaminant detection
Spikein kmers: Length 31 with exact matching (Hamming distance 0)
Bloom filter kmers: Length 29 for vertebrate mapping acceleration
Mapping kmers: Length 14 (configurable 9-15) for precise alignment

Quality Control Features

Entropy filtering: Removes low-complexity reads using sliding window entropy calculation
Chastity filtering: Removes reads failing Illumina chastity filter
Barcode validation: Validates expected barcode sequences
Tile-based filtering: Removes reads from poor-quality flowcell regions
Multiple trimming modes: Supports quality trimming, adapter trimming, and force trimming

Performance Implementation

Multi-threading: Uses threads=auto parameter with thread count defaulting to logical processor count
Compressed I/O: Uses pigz=t/unpigz=t parameters for parallel gzip compression/decompression
Bloom filter acceleration: Uses bloom=t with bloomk=29, bloomhashes=1, bloomminhits=6 to reduce mapping time
File staging: Uses step-based temporary file management with usetmpdir=t parameter
Index management: Loads reference indexes on demand with Data.unloadAll() cleanup between stages

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org