Decontaminate

Script: decontaminate.sh
Package: jgi
Class: DecontaminateByNormalization.java

Decontaminates multiplexed assemblies via normalization and mapping. The tool runs a multi-phase pipeline that combines read normalization, optional error correction, and coverage-based filtering to remove contaminating sequences from multiplexed genomic assemblies.

Basic Usage

decontaminate.sh reads=<file,file> ref=<file,file> out=<directory>

Alternative usage with file lists:

decontaminate.sh readnamefile=<file> refnamefile=<file> out=<directory>

Parameters

Parameters are organized by their function in the decontamination process. The tool processes multiple libraries simultaneously and requires one read file and one reference file per library, supplied in matching order.

Input Parameters

reads=<file,file>
Input reads, one file per library. Comma-separated list of read files corresponding to each assembly.
ref=<file,file>
Input assemblies, one file per library. Comma-separated list of reference assembly files to be decontaminated.
readnamefile=<file>
List of input reads, one line per library. Alternative to reads= parameter - file containing paths to read files, one per line.
refnamefile=<file>
List of input assemblies, one line per library. Alternative to ref= parameter - file containing paths to assembly files, one per line.
interleaved=auto
True forces paired/interleaved input; false forces single-ended mapping. If not specified, interleaved status will be autodetected from read names.
unpigz=t
Spawn a pigz (parallel gzip) process for faster decompression. Requires pigz to be installed.
touppercase=t
(tuc) Convert lowercase letters in reads to upper case (otherwise they will not match the reference).

Output Parameters

pigz=f
Spawn a pigz (parallel gzip) process for faster compression. Requires pigz to be installed.
tmpdir=.
Write temp files here. By default uses the system's $TMPDIR or current directory. Important for managing intermediate files during the multi-phase pipeline.
outdir=.
Write output files here. Final clean assemblies will be written to this directory with "_clean.fasta" suffix.
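
For example, temp files can be pointed at fast local scratch while final assemblies go to a shared output directory; the paths below are placeholders:

decontaminate.sh reads=lib1.fq,lib2.fq ref=asm1.fa,asm2.fa out=clean/ \
    tmpdir=/scratch/tmp pigz=t unpigz=t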

Mapping Parameters

kfilter=55
Set to a positive number N to require at least N contiguous matching bases for a mapped read. Higher values increase mapping specificity.
ambig=random
Determines how coverage will be calculated for ambiguously-mapped reads:
  • first: Add coverage only at first genomic mapping location
  • random: Add coverage at a random best-scoring location
  • all: Add coverage at all best-scoring locations
  • toss: Discard ambiguously-mapped reads without adding coverage
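
As a sketch, mapping stringency can be tightened for repetitive references by raising kfilter and discarding ambiguous reads; the values here are illustrative, not recommendations:

decontaminate.sh reads=lib1.fq,lib2.fq ref=asm1.fa,asm2.fa out=clean/ \
    kfilter=60 ambig=toss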

Filtering Parameters

minc=3.5
Min average coverage to retain scaffold. Scaffolds with coverage below this threshold will be removed as likely contaminants.
minp=20
Min percent coverage to retain scaffold. Scaffolds with less than this percentage of bases covered will be filtered out.
minr=18
Min mapped reads to retain scaffold. Scaffolds with fewer mapped reads will be considered contaminants. Note that the shell script documents a default of 18, while the Java code defaults to 20.
minl=500
Min length to retain scaffold. Short scaffolds below this threshold will be removed.
ratio=1.2
Contigs will not be removed by minc unless the coverage changed by at least this factor. 0 disables this filter. Helps distinguish real coverage drops from contamination.
mapraw=t
Set true to map the unnormalized reads. Required to filter by 'ratio' parameter. Enables before/after coverage comparison.
basesundermin=-1
If positive, removes contigs with at least this many bases in low-coverage windows. Targets regions with consistently poor coverage.
window=500
Sliding window size for low-coverage analysis. Used in conjunction with windowcov parameter.
windowcov=5
Average coverage below this will be classified as low within sliding windows. Used with basesundermin filtering.
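
For instance, window-based filtering can be enabled to remove contigs carrying at least 1000 bases inside 500bp windows that average under 5x coverage (values are illustrative):

decontaminate.sh reads=lib1.fq ref=asm1.fa out=clean/ \
    basesundermin=1000 window=500 windowcov=5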

Tadpole Parameters

ecct=f
Error-correct with Tadpole before normalization. Can improve assembly quality by correcting sequencing errors.
kt=42
Kmer length for Tadpole error correction. Longer kmers provide more specificity but require higher coverage.
aggressive=f
Do aggressive error correction. Mutually exclusive with conservative mode. More thorough but potentially over-corrects.
conservative=f
Do conservative error correction. Mutually exclusive with aggressive mode. Safer approach with less risk of over-correction.
tadpoleprefilter=1
(tadpre) Ignore kmers under this depth to save memory. Filters out low-depth kmers likely to be errors.
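
A minimal sketch enabling conservative Tadpole error correction before normalization, using the default kmer length:

decontaminate.sh reads=lib1.fq ref=asm1.fa out=clean/ \
    ecct=t conservative=t kt=42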

Normalization Parameters

mindepth=2
Min depth of reads to keep during normalization. Reads below this depth may be considered errors or contaminants.
target=20
Target normalization depth. Reads will be normalized to approximately this coverage level.
hashes=4
Number of hashes in Bloom filter. More hashes increase accuracy but require more memory.
passes=1
Normalization passes. Multiple passes can improve uniformity of coverage distribution.
minprob=0.5
Min probability of correctness to add a kmer to the Bloom filter. Higher values increase specificity.
dp=0.75
(depthpercentile) Percentile to use for depth proxy (0.5 means median). Controls which depth statistic is used for normalization decisions.
prefilter=t
Prefilter, for large datasets. Uses a smaller preliminary filter to reduce memory usage for very large datasets.
filterbits=32
(fbits) Bits per cell in primary filter. More bits increase accuracy but require more memory.
prefilterbits=2
(pbits) Bits per cell in prefilter. Used when prefilter=t to manage memory usage.
k=31
Kmer length for normalization. Longer kmers are more precise but less sensitive to coverage variations.
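
As an illustration, normalization can be tuned toward a higher target depth with a stricter kmer probability cutoff; these values are examples, not recommendations:

decontaminate.sh reads=lib1.fq ref=asm1.fa out=clean/ \
    target=40 mindepth=3 minprob=0.6 passes=2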

Other parameters

opfn=0
(onlyprocessfirstn) Set to a positive number to only process that many datasets. This is for internal testing of specificity.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx800m will specify 800 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.
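
For example, to cap memory explicitly and exit promptly if memory is exhausted:

decontaminate.sh reads=lib1.fq ref=asm1.fa out=clean/ -Xmx31g -eoom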

Examples

Basic Decontamination

decontaminate.sh reads=lib1.fq,lib2.fq ref=asm1.fa,asm2.fa out=clean/

Decontaminate two multiplexed libraries using default parameters.

With Error Correction and Strict Filtering

decontaminate.sh reads=reads.fq ref=assembly.fa out=results/ \
    ecct=t aggressive=t minc=5.0 minp=30 minl=1000

Perform error correction with Tadpole using aggressive mode, then apply strict coverage and length filtering.

Using File Lists

decontaminate.sh readnamefile=reads.list refnamefile=refs.list out=output/

Process multiple libraries specified in list files, one path per line.
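
The list files are plain text with one path per line; line N of the read list is paired with line N of the reference list. A hypothetical reads.list might contain:

/data/lib1.fq.gz
/data/lib2.fq.gz
/data/lib3.fq.gz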

High Memory Dataset with Custom Normalization

decontaminate.sh reads=large.fq ref=genome.fa out=clean/ \
    -Xmx100g target=50 k=35 hashes=6 prefilter=t

Process large dataset with increased memory, higher normalization target, longer kmers, and prefiltering enabled.

Algorithm Details

DecontaminateByNormalization.java implements a six-phase sequential pipeline for removing contaminating sequences from multiplexed genomic assemblies:

Pipeline Architecture

The algorithm executes six distinct phases in sequence; a rough equivalent using standalone BBTools commands is sketched after this list:

  1. Optional Pre-mapping: If mapraw=t, maps original reads to assemblies for baseline coverage analysis
  2. Rename and Merge: RenameAndMux.main() consolidates all input read libraries into a single merged file, giving each read a unique identifier built from its library's core filename plus a numeric ID
  3. Error Correction (Optional): Uses Tadpole for kmer-based error correction if ecct=t
  4. Normalization: Applies BBNorm to reduce coverage variation and remove low-quality reads
  5. Demultiplexing: Separates normalized reads back into individual libraries based on read names
  6. Final Mapping and Filtering: Maps normalized reads and applies coverage-based filtering
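
The sketch below approximates these phases with standalone BBTools commands. It is illustrative only: the real pipeline handles read renaming, demultiplexing, and temp-file management internally, and the merge step (RenameAndMux) has no standalone wrapper shown here. File names are placeholders.

# Phase 1 (optional, mapraw=t): baseline coverage from raw reads
bbmap.sh in=lib1.fq ref=asm1.fa covstats=lib1_raw_covstats.txt

# Phase 3 (optional, ecct=t): kmer-based error correction of merged reads
tadpole.sh in=merged.fq out=ecc.fq mode=correct k=42

# Phase 4: normalize merged reads to the target depth
bbnorm.sh in=ecc.fq out=normalized.fq target=20 mindepth=2

# Phase 5: demultiplexing back into per-library files is handled
# internally via the unique read-name prefixes added in Phase 2

# Phase 6: map normalized reads and filter scaffolds by coverage
bbmap.sh in=lib1_norm.fq ref=asm1.fa covstats=lib1_covstats.txt
filterbycoverage.sh in=asm1.fa cov=lib1_covstats.txt out=asm1_clean.fasta \
    minc=3.5 minp=20 minr=18 minl=500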

Coverage-Based Contamination Detection

The core decontamination strategy relies on the principle that contaminating sequences show coverage patterns markedly different from those of legitimate assembly sequences. After normalized reads are demultiplexed back to their libraries of origin, scaffolds that truly belong to an assembly retain coverage near the normalization target, while contaminating scaffolds receive few or no reads from their own library and fall below the minc, minp, and minr thresholds.
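
As a sketch of the decision rule, assuming a BBMap covstats table whose columns include ID, Avg_fold, Length, and Covered_percent (columns 1, 2, 3, and 5 in recent versions; check your version's header line), low-coverage scaffolds could be listed with:

# print IDs of scaffolds failing the default minc/minp/minl thresholds
awk -F'\t' 'NR>1 && ($2 < 3.5 || $5 < 20 || $3 < 500) {print $1}' \
    lib1_covstats.txt > suspect_scaffolds.txt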

Memory Management and Scalability

The implementation includes several optimizations for handling large datasets: a two-stage Bloom filter (prefilter=t, with prefilterbits and filterbits controlling cell sizes) keeps low-depth kmers out of the primary filter, tadpoleprefilter similarly discards low-depth kmers during error correction, pigz/unpigz offload compression and decompression to parallel processes, and tmpdir lets intermediate files be written to fast local storage.

Quality Control Integration

Each phase includes detailed logging and validation, and each writes its own intermediate files, so failures can be traced to a specific stage of the pipeline.

Support

For questions and support, please contact Brian Bushnell at bbushnell@lbl.gov.