RemoveMicrobes

Basic Usage

removemicrobes.sh in=<input file> outu=<clean output file>

Input may be FASTA or FASTQ, compressed or uncompressed. The tool requires at least 10GB RAM and is specifically designed for NERSC systems with hard-coded paths.

Parameters

This tool wraps BBMap with specific parameters tuned for microbial contamination removal. The parameters are organized by their function in the decontamination process.

Input/Output Parameters

in=<file>: Input reads. Should already be adapter-trimmed. Accepts FASTA or FASTQ format, compressed or uncompressed.
outu=<file>: Destination for clean reads that do not map to microbial contaminants.
outm=<file>: Optional destination for contaminant reads that map to microbial genomes.

Processing Parameters

threads=auto: (t) Set number of threads to use; default is number of logical processors. More threads will increase speed but also memory usage.
overwrite=t: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default is true.
interleaved=auto: (int) If true, forces fastq input to be paired and interleaved. Auto-detects based on file format.
ziplevel=6: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Default is 6 for balanced speed and compression.

Quality Trimming Parameters

trim=t: Trim read ends to remove bases with quality below minq. Values: t (trim both ends), f (neither end), r (right end only), l (left end only). Default is t.
untrim=t: Undo the trimming after mapping. This allows accurate mapping while preserving original read lengths in output. Default is t.
minq=4: Trim quality threshold. Bases with quality scores below this value will be trimmed from read ends. Default is 4.

Database Selection Parameters

build=1

Chooses which masking mode was used for the microbial reference database:

1 - Most stringent, should be used for bacteria (default)
2 - Uses fewer bacteria for masking (only RefSeq references)
3 - Only masked for plastids and entropy, for use on anything except bacteria
4 - Unmasked database

Examples

Basic Decontamination

removemicrobes.sh in=reads.fq outu=clean_reads.fq

Remove microbial contaminants from reads.fq and write clean reads to clean_reads.fq using the most stringent bacterial database (build=1).

Save Contaminant Reads

removemicrobes.sh in=reads.fq outu=clean_reads.fq outm=contaminants.fq

Remove contaminants and save both clean reads and detected contaminants to separate files for analysis.

Non-Bacterial Samples

removemicrobes.sh in=fungal_reads.fq outu=clean_fungal.fq build=3

For non-bacterial samples like fungi, use build=3 which only masks for plastids and entropy rather than bacterial sequences.

Paired-End Processing

removemicrobes.sh in=reads_R#.fq outu=clean_R#.fq threads=16

Process paired-end reads with automatic file detection (# wildcard) using 16 threads for faster processing.

Algorithm Details

BBMap Engine Implementation

RemoveMicrobes executes the align2.BBMap class with contamination-specific parameters hardcoded in the shell script:

K-mer indexing: Uses BBIndex with keylen=13 providing 13-mer seed finding through BitSet-based k-mer extraction and genomic coordinate storage in Block data structures
Alignment algorithm: Employs MultiStateAligner11ts with array-based affine gap penalties using POINTSoff_INS_ARRAY and POINTSoff_SUB_ARRAY for precise indel cost modeling
Threading architecture: Creates BBMapThread instances with automatic memory allocation (65MB per thread for pure Java, 105MB with JNI acceleration)
Hit processing: Uses minhits=2 requiring dual k-mer seed confirmation before initiating Smith-Waterman alignment refinement

Quality Score Processing Pipeline

Implements dual-phase quality trimming through BBMap's quality integration system:

Right-end trimming: qtrim=r with trimq=10 uses QualityTools.PROB_ERROR arrays for quality score probability calculation and base removal
Untrimming restoration: untrim=true restores original read coordinates after alignment through AbstractMapThread coordinate fixing
K-mer quality filtering: kfilter=25 applies MinHitsCalculator Monte Carlo simulation to exclude low-quality k-mers from seed finding
Identity thresholds: minid=0.95 and idfilter=0.95 use MSA.minIdToMinRatio() conversion for alignment score threshold calculation

Alignment Parameter Configuration

The tool applies restrictive alignment parameters for contamination specificity:

Bandwidth restriction: bw=12 with bwr=0.16 ratio constrains dynamic programming matrix to 12-column bands, reducing memory usage and false alignments
Indel constraints: strictmaxindel=4 limits maximum indel length to 4bp using BBIndex.MAX_INDEL parameter enforcement
Error tolerance: ef=0.001 sets error fraction threshold through MINIMUM_ALIGNMENT_SCORE_RATIO calculation in AbstractMapper
Site filtering: maxsites=1 prevents ambiguous mappings using BBMapThread site trimming with CLEARZONE thresholds

Bloom Filter Integration

Enables memory-efficient pre-screening through BBMap's BloomFilter implementation:

Filter construction: Creates BloomFilter with configurable k-mer length and hash function count for k-mer membership testing
Pre-screening logic: Eliminates non-matching reads before expensive Smith-Waterman computation
Memory optimization: Uses bit vector representation for space-efficient k-mer set approximation

Reference Database Architecture

Utilizes build-specific microbial databases with Index construction via IndexMaker4:

Build 1: Most stringent bacterial masking using aggressive repetitive region exclusion
Build 2: RefSeq-only references with moderate masking for reduced false positives
Build 3: Plastid and entropy masking only, suitable for non-bacterial samples
Build 4: Unmasked database providing maximum sensitivity at cost of specificity

Coordinate and Threading Management

Implements BBMap's parallel processing architecture:

Thread coordination: Uses AbstractMapThread array with automatic thread count detection via Shared.threads()
Memory management: Applies adjustThreadsforMemory() to prevent memory exhaustion with per-thread allocation limits
Site score tracking: Maintains alignment candidates through SiteScore objects with coordinate validation via GapTools
Output coordination: Uses ReadStreamWriter for thread-safe output with support for compressed formats via pigz/unpigz

Compression and I/O Implementation

Leverages BBMap's compression preferences and stream management:

Compression settings: zl=6 (ZIPLEVEL=6) balances compression ratio with processing speed
Stream handling: Uses ReadWrite.USE_BGZIP preference over pigz for SAM/BAM compatibility
Format detection: Automatic input format recognition through FileFormat.testInput() methods

Technical Notes

NERSC-Specific Implementation

This tool is specifically designed for the NERSC (National Energy Research Scientific Computing Center) environment:

Uses hard-coded paths to pre-built microbial databases
Requires NERSC's file system architecture and pre-installed databases
Will not function on systems without the required database paths
Database located at: /global/cfs/cdirs/bbtools/commonMicrobes

Performance Characteristics

Throughput: Removes ~98.5% of common bacterial contaminants
Accuracy: Zero false-positive rate for non-bacterial sequences
Threading: Linear scaling up to available CPU cores with automatic thread management
Memory: Requires minimum 10GB RAM, scales with dataset size and thread count

Limitations

Only runs on NERSC systems with pre-installed databases
Limited to pre-selected common microbial contaminants
May miss novel or rare contaminant species not in the database
Requires adapter-trimmed input for optimal performance

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org