Decontaminate

Script: decontaminate.sh
Package: jgi
Class: DecontaminateByNormalization.java

Decontaminates multiplexed assemblies via normalization and mapping. The tool runs a multi-phase pipeline that combines read normalization, optional error correction, and coverage-based filtering to remove contaminating sequences from multiplexed genomic assemblies.

Basic Usage

decontaminate.sh reads=<file,file> ref=<file,file> out=<directory>

Alternative usage with file lists:

decontaminate.sh readnamefile=<file> refnamefile=<file> out=<directory>

Parameters

Parameters are organized by their function in the decontamination process. The tool processes multiple libraries simultaneously and requires one read file and one reference file per library, supplied in matching order.

Input Parameters

reads=<file,file>
Input reads, one file per library. Comma-separated list of read files corresponding to each assembly.
ref=<file,file>
Input assemblies, one file per library. Comma-separated list of reference assembly files to be decontaminated.
readnamefile=<file>
List of input reads, one line per library. Alternative to reads= parameter - file containing paths to read files, one per line.
refnamefile=<file>
List of input assemblies, one line per library. Alternative to ref= parameter - file containing paths to assembly files, one per line.
interleaved=auto
True forces paired/interleaved input; false forces single-ended mapping. If not specified, interleaved status will be autodetected from read names.
unpigz=t
Spawn a pigz (parallel gzip) process for faster decompression. Requires pigz to be installed.
touppercase=t
(tuc) Convert lowercase letters in reads to upper case (otherwise they will not match the reference).

Output Parameters

pigz=f
Spawn a pigz (parallel gzip) process for faster compression. Requires pigz to be installed.
tmpdir=.
Write temp files here. By default uses the system's $TMPDIR or current directory. Important for managing intermediate files during the multi-phase pipeline.
outdir=.
Write output files here. Final clean assemblies will be written to this directory with "_clean.fasta" suffix.
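
For example, temp files can be pointed at fast local scratch while final assemblies go to a shared output directory; the paths below are placeholders:

decontaminate.sh reads=lib1.fq,lib2.fq ref=asm1.fa,asm2.fa out=clean/ \
    tmpdir=/scratch/tmp pigz=t unpigz=t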

Mapping Parameters

kfilter=55
Set to a positive number N to require at least N contiguous matching bases for a mapped read. Higher values increase mapping specificity.
ambig=random
Determines how coverage will be calculated for ambiguously-mapped reads:
  • first: Add coverage only at first genomic mapping location
  • random: Add coverage at a random best-scoring location
  • all: Add coverage at all best-scoring locations
  • toss: Discard ambiguously-mapped reads without adding coverage
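
As a sketch, mapping stringency can be tightened for repetitive references by raising kfilter and discarding ambiguous reads; the values here are illustrative, not recommendations:

decontaminate.sh reads=lib1.fq,lib2.fq ref=asm1.fa,asm2.fa out=clean/ \
    kfilter=60 ambig=toss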

Filtering Parameters

minc=3.5
Min average coverage to retain scaffold. Scaffolds with coverage below this threshold will be removed as likely contaminants.
minp=20
Min percent coverage to retain scaffold. Scaffolds with less than this percentage of bases covered will be filtered out.
minr=18
Min mapped reads to retain scaffold. Scaffolds with fewer mapped reads will be considered contaminants. Note that the shell script documents a default of 18, while the Java code defaults to 20.
minl=500
Min length to retain scaffold. Short scaffolds below this threshold will be removed.
ratio=1.2
Contigs will not be removed by minc unless the coverage changed by at least this factor. 0 disables this filter. Helps distinguish real coverage drops from contamination.
mapraw=t
Set true to map the unnormalized reads. Required to filter by 'ratio' parameter. Enables before/after coverage comparison.
basesundermin=-1
If positive, removes contigs with at least this many bases in low-coverage windows. Targets regions with consistently poor coverage.
window=500
Sliding window size for low-coverage analysis. Used in conjunction with windowcov parameter.
windowcov=5
Average coverage below this will be classified as low within sliding windows. Used with basesundermin filtering.
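
For instance, window-based filtering can be enabled to remove contigs carrying at least 1000 bases inside 500bp windows that average under 5x coverage (values are illustrative):

decontaminate.sh reads=lib1.fq ref=asm1.fa out=clean/ \
    basesundermin=1000 window=500 windowcov=5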

Tadpole Parameters

ecct=f
Error-correct with Tadpole before normalization. Can improve assembly quality by correcting sequencing errors.
kt=42
Kmer length for Tadpole error correction. Longer kmers provide more specificity but require higher coverage.
aggressive=f
Do aggressive error correction. Mutually exclusive with conservative mode. More thorough but potentially over-corrects.
conservative=f
Do conservative error correction. Mutually exclusive with aggressive mode. Safer approach with less risk of over-correction.
tadpoleprefilter=1
(tadpre) Ignore kmers under this depth to save memory. Filters out low-depth kmers likely to be errors.
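
A minimal sketch enabling conservative Tadpole error correction before normalization, using the default kmer length:

decontaminate.sh reads=lib1.fq ref=asm1.fa out=clean/ \
    ecct=t conservative=t kt=42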

Normalization Parameters

mindepth=2
Min depth of reads to keep during normalization. Reads below this depth may be considered errors or contaminants.
target=20
Target normalization depth. Reads will be normalized to approximately this coverage level.
hashes=4
Number of hashes in Bloom filter. More hashes increase accuracy but require more memory.
passes=1
Normalization passes. Multiple passes can improve uniformity of coverage distribution.
minprob=0.5
Min probability of correctness to add a kmer to the Bloom filter. Higher values increase specificity.
dp=0.75
(depthpercentile) Percentile to use for depth proxy (0.5 means median). Controls which depth statistic is used for normalization decisions.
prefilter=t
Prefilter, for large datasets. Uses a smaller preliminary filter to reduce memory usage for very large datasets.
filterbits=32
(fbits) Bits per cell in primary filter. More bits increase accuracy but require more memory.
prefilterbits=2
(pbits) Bits per cell in prefilter. Used when prefilter=t to manage memory usage.
k=31
Kmer length for normalization. Longer kmers are more precise but less sensitive to coverage variations.
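
As an illustration, normalization can be tuned toward a higher target depth with a stricter kmer probability cutoff; these values are examples, not recommendations:

decontaminate.sh reads=lib1.fq ref=asm1.fa out=clean/ \
    target=40 mindepth=3 minprob=0.6 passes=2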

Other parameters

opfn=0
(onlyprocessfirstn) Set to a positive number to only process that many datasets. This is for internal testing of specificity.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx800m will specify 800 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.
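
For example, to cap memory explicitly and exit promptly if memory is exhausted:

decontaminate.sh reads=lib1.fq ref=asm1.fa out=clean/ -Xmx31g -eoom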

Examples

Basic Decontamination

decontaminate.sh reads=lib1.fq,lib2.fq ref=asm1.fa,asm2.fa out=clean/

Decontaminate two multiplexed libraries using default parameters.

With Error Correction and Strict Filtering

decontaminate.sh reads=reads.fq ref=assembly.fa out=results/ \
    ecct=t aggressive=t minc=5.0 minp=30 minl=1000

Perform error correction with Tadpole using aggressive mode, then apply strict coverage and length filtering.

Using File Lists

decontaminate.sh readnamefile=reads.list refnamefile=refs.list out=output/

Process multiple libraries specified in list files, one path per line.
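
The list files are plain text with one path per line; line N of the read list is paired with line N of the reference list. A hypothetical reads.list might contain:

/data/lib1.fq.gz
/data/lib2.fq.gz
/data/lib3.fq.gz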

High Memory Dataset with Custom Normalization

decontaminate.sh reads=large.fq ref=genome.fa out=clean/ \
    -Xmx100g target=50 k=35 hashes=6 prefilter=t

Process large dataset with increased memory, higher normalization target, longer kmers, and prefiltering enabled.

Algorithm Details

DecontaminateByNormalization.java implements a six-phase sequential pipeline for removing contaminating sequences from multiplexed genomic assemblies:

Pipeline Architecture

The algorithm executes six distinct phases in sequence; a rough equivalent using standalone BBTools commands is sketched after this list:

  1. Optional Pre-mapping: If mapraw=t, maps original reads to assemblies for baseline coverage analysis
  2. Rename and Merge: RenameAndMux.main() consolidates all input read libraries into a single merged file, giving each read a unique identifier built from its library's core filename plus a numeric ID
  3. Error Correction (Optional): Uses Tadpole for kmer-based error correction if ecct=t
  4. Normalization: Applies BBNorm to reduce coverage variation and remove low-quality reads
  5. Demultiplexing: Separates normalized reads back into individual libraries based on read names
  6. Final Mapping and Filtering: Maps normalized reads and applies coverage-based filtering
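
The sketch below approximates these phases with standalone BBTools commands. It is illustrative only: the real pipeline handles read renaming, demultiplexing, and temp-file management internally, and the merge step (RenameAndMux) has no standalone wrapper shown here. File names are placeholders.

# Phase 1 (optional, mapraw=t): baseline coverage from raw reads
bbmap.sh in=lib1.fq ref=asm1.fa covstats=lib1_raw_covstats.txt

# Phase 3 (optional, ecct=t): kmer-based error correction of merged reads
tadpole.sh in=merged.fq out=ecc.fq mode=correct k=42

# Phase 4: normalize merged reads to the target depth
bbnorm.sh in=ecc.fq out=normalized.fq target=20 mindepth=2

# Phase 5: demultiplexing back into per-library files is handled
# internally via the unique read-name prefixes added in Phase 2

# Phase 6: map normalized reads and filter scaffolds by coverage
bbmap.sh in=lib1_norm.fq ref=asm1.fa covstats=lib1_covstats.txt
filterbycoverage.sh in=asm1.fa cov=lib1_covstats.txt out=asm1_clean.fasta \
    minc=3.5 minp=20 minr=18 minl=500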

Coverage-Based Contamination Detection

The core decontamination strategy relies on the principle that contaminating sequences show coverage patterns markedly different from those of legitimate assembly sequences. After normalized reads are demultiplexed back to their libraries of origin, scaffolds that truly belong to an assembly retain coverage near the normalization target, while contaminating scaffolds receive few or no reads from their own library and fall below the minc, minp, and minr thresholds.
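
As a sketch of the decision rule, assuming a BBMap covstats table whose columns include ID, Avg_fold, Length, and Covered_percent (columns 1, 2, 3, and 5 in recent versions; check your version's header line), low-coverage scaffolds could be listed with:

# print IDs of scaffolds failing the default minc/minp/minl thresholds
awk -F'\t' 'NR>1 && ($2 < 3.5 || $5 < 20 || $3 < 500) {print $1}' \
    lib1_covstats.txt > suspect_scaffolds.txt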

Memory Management and Scalability

The implementation includes several optimizations for handling large datasets: a two-stage Bloom filter (prefilter=t, with prefilterbits and filterbits controlling cell sizes) keeps low-depth kmers out of the primary filter, tadpoleprefilter similarly discards low-depth kmers during error correction, pigz/unpigz offload compression and decompression to parallel processes, and tmpdir lets intermediate files be written to fast local storage.

Quality Control Integration

Each phase includes detailed logging and validation, and each writes its own intermediate files, so failures can be traced to a specific stage of the pipeline.

Support

For questions and support, please contact Brian Bushnell at bbushnell@lbl.gov.