RemoveMicrobes
Removes all reads that map to selected common microbial contaminant genomes. Removes approximately 98.5% of common contaminant reads, with zero false-positives to non-bacteria. NOTE! This program uses hard-coded paths and will only run on NERSC systems.
Basic Usage
removemicrobes.sh in=<input file> outu=<clean output file>
Input may be FASTA or FASTQ, compressed or uncompressed. The tool requires at least 10GB RAM and is specifically designed for NERSC systems with hard-coded paths.
Parameters
This tool wraps BBMap with specific parameters tuned for microbial contamination removal. The parameters are organized by their function in the decontamination process.
Input/Output Parameters
- in=<file>
- Input reads. Should already be adapter-trimmed. Accepts FASTA or FASTQ format, compressed or uncompressed.
- outu=<file>
- Destination for clean reads that do not map to microbial contaminants.
- outm=<file>
- Optional destination for contaminant reads that map to microbial genomes.
Processing Parameters
- threads=auto
- (t) Set number of threads to use; default is number of logical processors. More threads will increase speed but also memory usage.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default is true.
- interleaved=auto
- (int) If true, forces fastq input to be paired and interleaved. Auto-detects based on file format.
- ziplevel=6
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Default is 6 for balanced speed and compression.
Quality Trimming Parameters
- trim=t
- Trim read ends to remove bases with quality below minq. Values: t (trim both ends), f (neither end), r (right end only), l (left end only). Default is t.
- untrim=t
- Undo the trimming after mapping. This allows accurate mapping while preserving original read lengths in output. Default is t.
- minq=4
- Trim quality threshold. Bases with quality scores below this value will be trimmed from read ends. Default is 4.
Database Selection Parameters
- build=1
- Chooses which masking mode was used for the microbial reference database:
- 1 - Most stringent, should be used for bacteria (default)
- 2 - Uses fewer bacteria for masking (only RefSeq references)
- 3 - Only masked for plastids and entropy, for use on anything except bacteria
- 4 - Unmasked database
Examples
Basic Decontamination
removemicrobes.sh in=reads.fq outu=clean_reads.fq
Remove microbial contaminants from reads.fq and write clean reads to clean_reads.fq using the most stringent bacterial database (build=1).
Save Contaminant Reads
removemicrobes.sh in=reads.fq outu=clean_reads.fq outm=contaminants.fq
Remove contaminants and save both clean reads and detected contaminants to separate files for analysis.
Non-Bacterial Samples
removemicrobes.sh in=fungal_reads.fq outu=clean_fungal.fq build=3
For non-bacterial samples like fungi, use build=3 which only masks for plastids and entropy rather than bacterial sequences.
Paired-End Processing
removemicrobes.sh in=reads_R#.fq outu=clean_R#.fq threads=16
Process paired-end reads with automatic file detection (# wildcard) using 16 threads for faster processing.
Algorithm Details
BBMap Engine Implementation
RemoveMicrobes executes the align2.BBMap class with contamination-specific parameters hardcoded in the shell script:
- K-mer indexing: Uses BBIndex with keylen=13 providing 13-mer seed finding through BitSet-based k-mer extraction and genomic coordinate storage in Block data structures
- Alignment algorithm: Employs MultiStateAligner11ts with array-based affine gap penalties using POINTSoff_INS_ARRAY and POINTSoff_SUB_ARRAY for precise indel cost modeling
- Threading architecture: Creates BBMapThread instances with automatic memory allocation (65MB per thread for pure Java, 105MB with JNI acceleration)
- Hit processing: Uses minhits=2 requiring dual k-mer seed confirmation before initiating Smith-Waterman alignment refinement
Quality Score Processing Pipeline
Implements dual-phase quality trimming through BBMap's quality integration system:
- Right-end trimming: qtrim=r with trimq=10 uses QualityTools.PROB_ERROR arrays for quality score probability calculation and base removal
- Untrimming restoration: untrim=true restores original read coordinates after alignment through AbstractMapThread coordinate fixing
- K-mer quality filtering: kfilter=25 applies MinHitsCalculator Monte Carlo simulation to exclude low-quality k-mers from seed finding
- Identity thresholds: minid=0.95 and idfilter=0.95 use MSA.minIdToMinRatio() conversion for alignment score threshold calculation
Alignment Parameter Configuration
The tool applies restrictive alignment parameters for contamination specificity:
- Bandwidth restriction: bw=12 with bwr=0.16 ratio constrains dynamic programming matrix to 12-column bands, reducing memory usage and false alignments
- Indel constraints: strictmaxindel=4 limits maximum indel length to 4bp using BBIndex.MAX_INDEL parameter enforcement
- Error tolerance: ef=0.001 sets error fraction threshold through MINIMUM_ALIGNMENT_SCORE_RATIO calculation in AbstractMapper
- Site filtering: maxsites=1 prevents ambiguous mappings using BBMapThread site trimming with CLEARZONE thresholds
Bloom Filter Integration
Enables memory-efficient pre-screening through BBMap's BloomFilter implementation:
- Filter construction: Creates BloomFilter with configurable k-mer length and hash function count for k-mer membership testing
- Pre-screening logic: Eliminates non-matching reads before expensive Smith-Waterman computation
- Memory optimization: Uses bit vector representation for space-efficient k-mer set approximation
Reference Database Architecture
Utilizes build-specific microbial databases with Index construction via IndexMaker4:
- Build 1: Most stringent bacterial masking using aggressive repetitive region exclusion
- Build 2: RefSeq-only references with moderate masking for reduced false positives
- Build 3: Plastid and entropy masking only, suitable for non-bacterial samples
- Build 4: Unmasked database providing maximum sensitivity at cost of specificity
Coordinate and Threading Management
Implements BBMap's parallel processing architecture:
- Thread coordination: Uses AbstractMapThread array with automatic thread count detection via Shared.threads()
- Memory management: Applies adjustThreadsforMemory() to prevent memory exhaustion with per-thread allocation limits
- Site score tracking: Maintains alignment candidates through SiteScore objects with coordinate validation via GapTools
- Output coordination: Uses ReadStreamWriter for thread-safe output with support for compressed formats via pigz/unpigz
Compression and I/O Implementation
Leverages BBMap's compression preferences and stream management:
- Compression settings: zl=6 (ZIPLEVEL=6) balances compression ratio with processing speed
- Stream handling: Uses ReadWrite.USE_BGZIP preference over pigz for SAM/BAM compatibility
- Format detection: Automatic input format recognition through FileFormat.testInput() methods
Technical Notes
NERSC-Specific Implementation
This tool is specifically designed for the NERSC (National Energy Research Scientific Computing Center) environment:
- Uses hard-coded paths to pre-built microbial databases
- Requires NERSC's file system architecture and pre-installed databases
- Will not function on systems without the required database paths
- Database located at:
/global/cfs/cdirs/bbtools/commonMicrobes
Performance Characteristics
- Throughput: Removes ~98.5% of common bacterial contaminants
- Accuracy: Zero false-positive rate for non-bacterial sequences
- Threading: Linear scaling up to available CPU cores with automatic thread management
- Memory: Requires minimum 10GB RAM, scales with dataset size and thread count
Limitations
- Only runs on NERSC systems with pre-installed databases
- Limited to pre-selected common microbial contaminants
- May miss novel or rare contaminant species not in the database
- Requires adapter-trimmed input for optimal performance
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org