RemoveMicrobes

Script: removemicrobes.sh Package: align2 Class: BBMap.java

Removes all reads that map to selected common microbial contaminant genomes. Removes approximately 98.5% of common contaminant reads, with zero false positives against non-bacteria. Note: this program uses hard-coded paths and will only run on NERSC systems.

Basic Usage

removemicrobes.sh in=<input file> outu=<clean output file>

Input may be FASTA or FASTQ, compressed or uncompressed. The tool requires at least 10 GB of RAM. Because its reference paths are hard-coded, it runs only on NERSC systems.
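The memory requirement can be checked before a job is submitted. The following is a minimal sketch, not part of removemicrobes.sh: the function name has_enough_ram is hypothetical, 10 GB is interpreted as 10 x 1024 x 1024 kB, and /proc/meminfo is assumed to be readable (true on typical Linux nodes).

```shell
# Hypothetical pre-flight helper; not part of removemicrobes.sh.
# Succeeds when a memory figure given in kB meets the 10 GB minimum.
has_enough_ram() {
    min_kb=$((10 * 1024 * 1024))   # 10 GB expressed in kB (assumed interpretation)
    [ "$1" -ge "$min_kb" ]
}

# On Linux, total memory can be read from /proc/meminfo when it is available.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    if has_enough_ram "$total_kb"; then
        echo "Memory check passed: ${total_kb} kB"
    else
        echo "Warning: only ${total_kb} kB of RAM; at least 10 GB is required"
    fi
fi
```

The check looks at total memory on the node; inside a batch job the memory actually granted may be lower, so treat the result as a rough guide.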

Parameters

This tool wraps BBMap with specific parameters tuned for microbial contamination removal. The parameters are organized by their function in the decontamination process.

Input/Output Parameters

in=<file>
Input reads. Should already be adapter-trimmed. Accepts FASTA or FASTQ format, compressed or uncompressed.
outu=<file>
Destination for clean reads that do not map to microbial contaminants.
outm=<file>
Optional destination for contaminant reads that map to microbial genomes.
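When both outu and outm are written, a quick sanity check is to confirm that the two outputs together account for the input. The sketch below assumes uncompressed FASTQ with exactly four lines per record, and assumes every input read lands in exactly one of the two outputs; the helper names are hypothetical.

```shell
# Hypothetical helper: count records in an uncompressed,
# 4-line-per-record FASTQ file.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Hypothetical check: succeeds when clean + contaminant reads
# add up to the number of input reads.
# usage: reads_accounted_for <input.fq> <clean.fq> <contaminants.fq>
reads_accounted_for() {
    total=$(count_fastq_reads "$1")
    clean=$(count_fastq_reads "$2")
    contam=$(count_fastq_reads "$3")
    [ "$total" -eq $((clean + contam)) ]
}
```

For example, after a run with outu=clean_reads.fq and outm=contaminants.fq, calling reads_accounted_for reads.fq clean_reads.fq contaminants.fq returns success only if no reads went missing.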

Processing Parameters

threads=auto
(t) Set number of threads to use; default is number of logical processors. More threads will increase speed but also memory usage.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default is true.
interleaved=auto
(int) If true, forces FASTQ input to be treated as paired and interleaved. With the default (auto), interleaving is auto-detected based on file format.
ziplevel=6
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Default is 6 for balanced speed and compression.
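Inside a batch job it can be convenient to match threads to the allocation rather than to every logical processor on the node. The sketch below assumes a Slurm scheduler, which exports SLURM_CPUS_PER_TASK when the job requests CPUs per task; the helper name threads_arg and the file names are hypothetical. The command is printed rather than executed so the sketch also works away from NERSC.

```shell
# Hypothetical helper: derive a threads= argument from the Slurm allocation,
# falling back to the tool's own default (auto) outside a job.
threads_arg() {
    echo "threads=${SLURM_CPUS_PER_TASK:-auto}"
}

# Print the command instead of running it.
# overwrite=f aborts rather than clobbering existing output;
# ziplevel=2 trades compression ratio for speed.
echo removemicrobes.sh in=reads.fq.gz outu=clean.fq.gz \
    "$(threads_arg)" overwrite=f ziplevel=2
```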

Quality Trimming Parameters

trim=t
Trim read ends to remove bases with quality below minq. Values: t (trim both ends), f (neither end), r (right end only), l (left end only). Default is t.
untrim=t
Undo the trimming after mapping. This allows accurate mapping while preserving original read lengths in output. Default is t.
minq=4
Trim quality threshold. Bases with quality scores below this value will be trimmed from read ends. Default is 4.
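To make the effect of trim and minq concrete, the sketch below strips bases from both ends of a read while their quality is below the threshold. It is a simplified illustration only, not BBMap's actual trimming code: it assumes Phred+33 quality encoding, and the function name trim_ends is hypothetical.

```shell
# Simplified illustration: remove bases from both ends while their
# Phred+33 quality is below minq. BBMap's real trimming logic may differ.
# usage: trim_ends <bases> <qualities> <minq>
trim_ends() {
    printf '%s\n%s\n' "$1" "$2" | awk -v minq="$3" '
        NR == 1 { seq = $0 }
        NR == 2 { qual = $0 }
        END {
            # Build a lookup from quality character to Phred score.
            for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i - 33
            left = 1; right = length(qual)
            while (left <= right && ord[substr(qual, left, 1)] < minq) left++
            while (right >= left && ord[substr(qual, right, 1)] < minq) right--
            n = right - left + 1
            print substr(seq, left, n), substr(qual, left, n)
        }'
}

# Two low-quality bases at each end are removed when minq=4.
trim_ends 'ACGTACGT' '##IIII#!' 4
```

With untrim=t, the trimmed form is used only for mapping; the read written to the output keeps its original length.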

Database Selection Parameters

build=1
Chooses which masking mode was used for the microbial reference database:
  • 1 - Most stringent, should be used for bacteria (default)
  • 2 - Uses fewer bacteria for masking (only RefSeq references)
  • 3 - Only masked for plastids and entropy, for use on anything except bacteria
  • 4 - Unmasked database
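In a pipeline that handles mixed sample types, the build value can be chosen from the sample type. The sketch below follows the guidance in the list above (build=1 for bacteria, build=3 for anything else); the helper name choose_build and the sample-type labels are hypothetical, and the command is printed rather than executed.

```shell
# Hypothetical helper: map a sample type to a build number
# (1 for bacteria, 3 for anything except bacteria).
choose_build() {
    case "$1" in
        bacteria) echo 1 ;;
        *)        echo 3 ;;
    esac
}

# Print the resulting command for a fungal sample.
echo removemicrobes.sh in=reads.fq outu=clean.fq "build=$(choose_build fungi)"
```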

Examples

Basic Decontamination

removemicrobes.sh in=reads.fq outu=clean_reads.fq

Remove microbial contaminants from reads.fq and write clean reads to clean_reads.fq using the most stringent bacterial database (build=1).

Save Contaminant Reads

removemicrobes.sh in=reads.fq outu=clean_reads.fq outm=contaminants.fq

Remove contaminants and save both clean reads and detected contaminants to separate files for analysis.

Non-Bacterial Samples

removemicrobes.sh in=fungal_reads.fq outu=clean_fungal.fq build=3

For non-bacterial samples such as fungi, use build=3, which selects the database masked only for plastids and entropy.

Paired-End Processing

removemicrobes.sh in=reads_R#.fq outu=clean_R#.fq threads=16

Process paired-end reads stored in two files, using 16 threads for faster processing. The # symbol in each filename stands for 1 and 2, so reads_R#.fq refers to reads_R1.fq and reads_R2.fq.
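As a rough illustration of the # filename convention used in the command above, the sketch below expands a pattern into the two mate files it refers to. The helper name expand_pair is hypothetical; the tool performs this substitution itself, so the helper is only useful for checking that both files exist before launching a run.

```shell
# Illustration of how a # placeholder maps to two mate files.
# usage: expand_pair <pattern containing #>
expand_pair() {
    printf '%s\n' "$1" | sed 's/#/1/'
    printf '%s\n' "$1" | sed 's/#/2/'
}

# Report any mate file that is missing.
for f in $(expand_pair 'reads_R#.fq'); do
    [ -e "$f" ] || echo "missing input: $f"
done
```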

Algorithm Details

BBMap Engine Implementation

RemoveMicrobes executes the align2.BBMap class with contamination-specific parameters hard-coded in the shell script:

Quality Score Processing Pipeline

Implements dual-phase quality trimming through BBMap's quality integration system:

Alignment Parameter Configuration

The tool applies restrictive alignment parameters for contamination specificity:

Bloom Filter Integration

Enables memory-efficient pre-screening through BBMap's BloomFilter implementation:

Reference Database Architecture

Utilizes build-specific microbial databases with Index construction via IndexMaker4:

Coordinate and Threading Management

Implements BBMap's parallel processing architecture:

Compression and I/O Implementation

Leverages BBMap's compression preferences and stream management:

Technical Notes

NERSC-Specific Implementation

This tool is specifically designed for the NERSC (National Energy Research Scientific Computing Center) environment:

Performance Characteristics

Limitations

Support

For questions and support: