Decontaminate
Decontaminates multiplexed assemblies via normalization and mapping. The tool combines read normalization, optional error correction, and coverage-based filtering in a multi-phase pipeline to remove contaminating sequences from multiplexed genomic assemblies.
Basic Usage
decontaminate.sh reads=<file,file> ref=<file,file> out=<directory>
Alternative usage with file lists:
decontaminate.sh readnamefile=<file> refnamefile=<file> out=<directory>
Parameters
Parameters are organized by their function in the decontamination process. The tool processes multiple libraries simultaneously, requiring paired read and reference files for each library.
Input Parameters
- reads=<file,file>
- Input reads, one file per library. Comma-separated list of read files corresponding to each assembly.
- ref=<file,file>
- Input assemblies, one file per library. Comma-separated list of reference assembly files to be decontaminated.
- readnamefile=<file>
- List of input reads, one line per library. Alternative to the reads= parameter: a file containing paths to read files, one per line (see the example after this list).
- refnamefile=<file>
- List of input assemblies, one line per library. Alternative to the ref= parameter: a file containing paths to assembly files, one per line.
- interleaved=auto
- True forces paired/interleaved input; false forces single-ended mapping. If not specified, interleaved status will be autodetected from read names.
- unpigz=t
- Spawn a pigz (parallel gzip) process for faster decompression. Requires pigz to be installed.
- touppercase=t
- (tuc) Convert lowercase letters in reads to upper case (otherwise they will not match the reference).
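The list files are plain text with one path per line, and the two lists must pair up line by line. A minimal sketch using ordinary shell commands (all file names are placeholders):
# Build matching read and reference lists, then run on both
printf '%s\n' lib1.fq lib2.fq lib3.fq > reads.list
printf '%s\n' asm1.fa asm2.fa asm3.fa > refs.list
decontaminate.sh readnamefile=reads.list refnamefile=refs.list out=clean/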
Output Parameters
- pigz=f
- Spawn a pigz (parallel gzip) process for faster compression. Requires pigz to be installed.
- tmpdir=.
- Write temp files here. By default uses the system's $TMPDIR or current directory. Important for managing intermediate files during the multi-phase pipeline.
- outdir=.
- Write output files here. Final clean assemblies will be written to this directory with "_clean.fasta" suffix.
Mapping Parameters
- kfilter=55
- Set to a positive number N to require at least N contiguous matches for a mapped read. Higher values increase mapping specificity (see the sketch after this list).
- ambig=random
- Determines how coverage will be calculated for ambiguously-mapped reads:
- first: Add coverage only at first genomic mapping location
- random: Add coverage at a random best-scoring location
- all: Add coverage at all best-scoring locations
- toss: Discard ambiguously-mapped reads without adding coverage
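As a sketch, mapping can be made more conservative for stricter contamination calls; the values here are illustrative, not recommendations:
# Require longer contiguous matches and discard ambiguously-mapped reads
decontaminate.sh reads=lib1.fq,lib2.fq ref=asm1.fa,asm2.fa out=clean/ \
    kfilter=63 ambig=toss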
Filtering Parameters
- minc=3.5
- Min average coverage to retain scaffold. Scaffolds with coverage below this threshold will be removed as likely contaminants. A combined filtering example follows this list.
- minp=20
- Min percent coverage to retain scaffold. Scaffolds with less than this percentage of bases covered will be filtered out.
- minr=18
- Min mapped reads to retain scaffold. Scaffolds with fewer mapped reads will be considered contaminants (the default shown here is 18, but the underlying Java code defaults to 20).
- minl=500
- Min length to retain scaffold. Short scaffolds below this threshold will be removed.
- ratio=1.2
- Contigs will not be removed by minc unless the coverage changed by at least this factor. 0 disables this filter. Helps distinguish real coverage drops from contamination.
- mapraw=t
- Set to true to map the unnormalized reads. Required in order to filter by the ratio parameter, since it enables the before/after coverage comparison.
- basesundermin=-1
- If positive, removes contigs with at least this many bases in low-coverage windows. Targets regions with consistently poor coverage.
- window=500
- Sliding window size for low-coverage analysis. Used in conjunction with the windowcov parameter.
- windowcov=5
- Average coverage below this will be classified as low within sliding windows. Used with basesundermin filtering.
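The filters compose, and all thresholds below are illustrative; note that mapraw=t is required so the ratio filter has a pre-normalization baseline to compare against:
# Keep scaffolds with >=5x average coverage over >=40% of bases, >=20 mapped
# reads, and >=1kb length; also drop contigs with >=2000 bases falling in
# 500bp windows that average under 5x coverage
decontaminate.sh reads=lib1.fq ref=asm1.fa out=clean/ \
    minc=5.0 minp=40 minr=20 minl=1000 mapraw=t ratio=1.5 \
    basesundermin=2000 window=500 windowcov=5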
Tadpole Parameters
- ecct=f
- Error-correct with Tadpole before normalization. Can improve assembly quality by correcting sequencing errors (see the example after this list).
- kt=42
- Kmer length for Tadpole error correction. Longer kmers provide more specificity but require higher coverage.
- aggressive=f
- Do aggressive error correction. Mutually exclusive with conservative mode. More thorough but potentially over-corrects.
- conservative=f
- Do conservative error correction. Mutually exclusive with aggressive mode. Safer approach with less risk of over-correction.
- tadpoleprefilter=1
- (tadpre) Ignore kmers under this depth to save memory. Filters out low-depth kmers likely to be errors.
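A sketch enabling cautious error correction before normalization; the prefilter depth is illustrative:
# Correct errors conservatively; raise tadpre to save memory on large inputs
decontaminate.sh reads=lib1.fq ref=asm1.fa out=clean/ \
    ecct=t conservative=t kt=42 tadpoleprefilter=2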
Normalization Parameters
- mindepth=2
- Min depth of reads to keep during normalization. Reads below this depth may be considered errors or contaminants.
- target=20
- Target normalization depth. Reads will be normalized to approximately this coverage level (see the example after this list).
- hashes=4
- Number of hashes in Bloom filter. More hashes increase accuracy but require more memory.
- passes=1
- Normalization passes. Multiple passes can improve uniformity of coverage distribution.
- minprob=0.5
- Min probability of correctness to add a kmer to the Bloom filter. Higher values increase specificity.
- dp=0.75
- (depthpercentile) Percentile to use for depth proxy (0.5 means median). Controls which depth statistic is used for normalization decisions.
- prefilter=t
- Enable prefiltering for large datasets. Uses a smaller preliminary filter to reduce memory usage for very large datasets.
- filterbits=32
- (fbits) Bits per cell in primary filter. More bits increase accuracy but require more memory.
- prefilterbits=2
- (pbits) Bits per cell in prefilter. Used when prefilter=t to manage memory usage.
- k=31
- Kmer length for normalization. Longer kmers are more precise but less sensitive to coverage variations.
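Putting the normalization knobs together, a sketch with illustrative values:
# Normalize to ~40x over two passes, prefiltering to reduce memory usage
decontaminate.sh reads=lib1.fq ref=asm1.fa out=clean/ \
    target=40 mindepth=3 passes=2 k=31 hashes=4 \
    prefilter=t filterbits=32 prefilterbits=2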
Other Parameters
- opfn=0
- (onlyprocessfirstn) Set to a positive number to only process that many datasets. This is for internal testing of specificity.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx800m will specify 800 megs. The max is typically 85% of physical memory (see the example after this list).
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
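For example, to cap the Java heap at 20 gigs and exit if memory runs out:
decontaminate.sh reads=lib1.fq ref=asm1.fa out=clean/ -Xmx20g -eoom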
Examples
Basic Decontamination
decontaminate.sh reads=lib1.fq,lib2.fq ref=asm1.fa,asm2.fa out=clean/
Decontaminate two multiplexed libraries using default parameters.
With Error Correction and Strict Filtering
decontaminate.sh reads=reads.fq ref=assembly.fa out=results/ \
ecct=t aggressive=t minc=5.0 minp=30 minl=1000
Perform error correction with Tadpole using aggressive mode, then apply strict coverage and length filtering.
Using File Lists
decontaminate.sh readnamefile=reads.list refnamefile=refs.list out=output/
Process multiple libraries specified in list files, one path per line.
High Memory Dataset with Custom Normalization
decontaminate.sh reads=large.fq ref=genome.fa out=clean/ \
-Xmx100g target=50 k=35 hashes=6 prefilter=t
Process large dataset with increased memory, higher normalization target, longer kmers, and prefiltering enabled.
Algorithm Details
DecontaminateByNormalization.java implements a six-phase sequential pipeline for removing contaminating sequences from multiplexed genomic assemblies:
Pipeline Architecture
The algorithm executes six distinct phases in sequence (a conceptual sketch follows this list):
- Optional Pre-mapping: If mapraw=t, maps original reads to assemblies for baseline coverage analysis
- Rename and Merge: RenameAndMux.main() consolidates all input read libraries into a single merged file, assigning each read a unique identifier built from its core filename plus a numeric ID
- Error Correction (Optional): Uses Tadpole for kmer-based error correction if ecct=t
- Normalization: Applies BBNorm to reduce coverage variation and remove low-quality reads
- Demultiplexing: Separates normalized reads back into individual libraries based on read names
- Final Mapping and Filtering: Maps normalized reads and applies coverage-based filtering
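As a conceptual sketch only, the phases roughly correspond to the following standalone BBTools commands. decontaminate.sh invokes the equivalent Java classes directly, so the tool names and flags below are assumptions based on the wider suite rather than the script's actual internal calls; phases 2 and 5 (merge and demultiplex) only matter with multiple libraries and are omitted here:
# Phase 1 (mapraw=t): baseline coverage from the raw reads
bbmap.sh ref=asm1.fa in=lib1.fq covstats=raw_covstats.txt
# Phase 3 (ecct=t): optional kmer-based error correction
tadpole.sh in=lib1.fq out=ecc.fq mode=correct k=42
# Phase 4: normalize coverage
bbnorm.sh in=ecc.fq out=norm.fq target=20
# Phase 6: map normalized reads, then filter scaffolds on coverage metrics
bbmap.sh ref=asm1.fa in=norm.fq covstats=norm_covstats.txt
filterbycoverage.sh in=asm1.fa cov=norm_covstats.txt cov0=raw_covstats.txt \
    out=asm1_clean.fasta minc=3.5 minp=20 minr=18 minl=500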
Coverage-Based Contamination Detection
The core decontamination strategy relies on the principle that contaminating sequences will have significantly different coverage patterns compared to legitimate assembly sequences:
- Multi-metric Filtering: Combines coverage depth (minc), breadth (minp), read count (minr), and length (minl) criteria (a small filtering sketch follows this list)
- Sliding Window Analysis: Uses configurable windows (default 500bp) to detect regions of consistently low coverage
- Ratio-based Filtering: Compares coverage before and after normalization to identify artifacts
- Ambiguous Mapping Handling: Configurable strategies for reads mapping to multiple locations
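To make the multi-metric idea concrete, here is a minimal sketch applying minc/minp/minl-style cutoffs to a covstats file, assuming the usual BBMap pileup column layout (1=ID, 2=Avg_fold, 3=Length, 5=Covered_percent); the minr read-count test is omitted for brevity:
# Print IDs of scaffolds with avg coverage >=3.5, breadth >=20%, length >=500
awk -F '\t' 'NR > 1 && $2 >= 3.5 && $5 >= 20 && $3 >= 500 {print $1}' \
    covstats.txt > keep_ids.txt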
Memory Management and Scalability
The implementation includes several optimizations for handling large datasets (an example follows this list):
- Temporary File Management: Automatic cleanup of intermediate files to conserve disk space
- Bloom Filter Optimization: Configurable filter sizes and prefiltering for memory efficiency
- Parallel Compression: Optional pigz integration for faster I/O operations
- Multi-threaded Operations: Leverages available CPU cores throughout the pipeline
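For example, parallel (de)compression and a scratch directory for intermediate files can be requested like this (the tmpdir path is a placeholder):
# Use pigz in both directions and stage temp files on local scratch
decontaminate.sh reads=lib1.fq.gz ref=asm1.fa out=clean/ \
    unpigz=t pigz=t tmpdir=/scratch/tmp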
Quality Control Integration
Each phase includes detailed logging and validation:
- Coverage Statistics: Detailed per-scaffold coverage analysis written to covstats files
- Process Logging: Timestamped execution log for debugging and quality control
- Results Summary: Detailed filtering statistics and contamination removal metrics
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org