RemoveCatDogMouseHuman
Removes all reads that map to the cat, dog, mouse, or human genome with at least 95% identity after quality trimming. Removes approximately 98.6% of human 2x150bp reads, with zero false-positives to non-animals. NOTE! This program uses hard-coded paths and will only run on Nersc systems.
Basic Usage
removecatdogmousehuman.sh in=<input file> outu=<clean output file>
Input may be fasta or fastq, compressed or uncompressed.
Requirements: This script requires at least 52GB RAM and is designed for NERSC systems with hard-coded paths.
Parameters
This tool is a shell script wrapper around BBMap for removing mammalian contamination from sequencing data. It uses hard-coded BBMap parameters: minratio=0.9, maxindel=3, bwr=0.16, bw=12, minhits=2, k=14, kfilter=25, maxsites=1, fast, quickmatch.
Basic Parameters
- threads=auto
- (t) Set number of threads to use; default is number of logical processors.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- interleaved=auto
- (int) If true, forces fastq input to be paired and interleaved.
Quality Trimming Parameters
- trim=t
- Trim read ends to remove bases with quality below minq. Values: t (trim both ends), f (neither end), r (right end only), l (left end only).
- untrim=t
- Undo the trimming after mapping. This allows quality-based trimming for better mapping sensitivity while preserving original read lengths in output.
- minq=4
- Trim quality threshold. Bases with quality scores below this value will be trimmed before mapping.
Output Parameters
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
- outm=<file>
- File to output the reads that mapped to human/mammalian genomes. These are the contaminating reads that were removed.
Pre-configured BBMap Parameters
This tool uses the following hard-coded BBMap parameters for mammalian genome detection:
- minratio=0.9
- Minimum alignment score ratio. Set to 90% to ensure high identity matches (≥95% after accounting for quality trimming).
- maxindel=3
- Maximum allowed indels in alignment. Restricted to 3bp to maintain high specificity.
- bwr=0.16
- Bandwidth ratio for alignment. Set to 0.16 for detecting close matches to mammalian genomes.
- bw=12
- Bandwidth for alignment in bp. Set to 12bp for sensitive detection of mammalian sequences.
- quickmatch
- Enables quick matching mode for faster processing of obvious matches.
- fast
- Uses fast alignment mode (sets TIP_SEARCH_DIST/5, maxindel=80, bwr=0.18, bw=40, minratio=0.65, quickmatch=t).
- minhits=2
- Minimum number of kmers that must match. Set to 2 for sensitive detection.
- k=14
- Kmer length for indexing. Set to 14bp (BBMap.java default keylen=13, overridden by script).
- maxsites=1
- Maximum number of sites to align. Set to 1 since we only need to detect presence of contamination.
- kfilter=25
- Kmer filter threshold. Removes reads with ≥25 matching 14-mers to reduce false positives.
- qtrim=r
- Quality trimming from right end during mapping (overridden by trim parameter).
- trimq=10
- Quality threshold for internal trimming (overridden by minq parameter).
Note: All standard BBMap parameters can be used to override defaults. Run bbmap.sh for complete parameter documentation.
Examples
Basic Contamination Removal
removecatdogmousehuman.sh in=reads.fq outu=clean_reads.fq
Removes mammalian contamination from sequencing reads, outputting clean non-mammalian reads to clean_reads.fq.
Save Contaminating Reads
removecatdogmousehuman.sh in=reads.fq outu=clean_reads.fq outm=mammalian_reads.fq
Removes contamination while saving the mammalian reads to mammalian_reads.fq for inspection.
Paired-End Processing
removecatdogmousehuman.sh in=reads_R1.fq in2=reads_R2.fq outu=clean_R1.fq outu2=clean_R2.fq
Process paired-end reads, maintaining pairing in output files.
Custom Thread Count
removecatdogmousehuman.sh in=reads.fq outu=clean_reads.fq threads=8
Use 8 threads for faster processing on multi-core systems.
Algorithm Details
Contamination Detection Strategy
This tool implements a BBMap-based approach using hard-coded parameters for detecting and removing mammalian contamination from non-mammalian sequencing datasets:
Reference Database
- Multi-genome index: Uses a combined index containing cat, dog, mouse, and human genomes
- Hard-coded paths: References are stored at
/global/cfs/cdirs/bbtools/mousecatdoghuman/
on NERSC systems - Pre-built index: Uses 14-mer index (k=14) with bloomfilter enabled for contamination screening
Mapping Strategy
- Identity threshold enforcement: Uses MINIMUM_ALIGNMENT_SCORE_RATIO=0.9 to detect ≥95% identity matches after quality trimming
- Quality-aware mapping: Performs quality trimming (minq=4) before mapping using MSA.calcMatchScore(), then restores original lengths with untrim
- Indel constraint enforcement: Limits indels to 3bp (maxindel=3) via BBIndex.MAX_INDEL to maintain specificity
- Alignment bandwidth control: Uses MSA.bandwidthRatio=0.16, MSA.bandwidth=12 for sensitive alignment within computational constraints
False Positive Prevention
- Kmer filtering: Uses KFILTER=25 with BBIndex.MIN_APPROX_HITS_TO_KEEP to remove reads with excessive kmer matches (repetitive elements)
- Single-site alignment: Sets maxsites=1 via BBIndex.QUIT_AFTER_TWO_PERFECTS to prevent over-alignment of repetitive sequences
- Conservative matching: Requires AbstractIndex.MIN_APPROX_HITS_TO_KEEP=2 to avoid spurious single-kmer matches
Performance Characteristics
- Memory requirement: 52GB RAM minimum for loading multi-genome index (set via -Xmx50g in shell script)
- Sensitivity: Removes ~98.6% of human 2×150bp reads
- Specificity: Zero false-positives on non-animal sequences in validation testing
- Speed optimization: Uses BBMap fast mode (TIP_SEARCH_DIST/5, bwr=0.18, quickmatch=t) and single-site alignment (maxsites=1)
Quality Control Features
- Quality trimming: Uses qtrim=r trimq=10 to remove low-quality bases before alignment, then untrim=t restores original lengths
- Length preservation: untrim=t restores original read lengths in clean output after quality-based mapping
- Compression control: ziplevel parameter (default=6 internal, overridden by user zl=2) controls gzip compression ratio
- Processing statistics: BBMap reports read counts, mapping rates, and timing statistics via printunmappedcount and usemodulo flags
Use Cases
- Environmental samples: Remove human/pet contamination from environmental DNA
- Non-mammalian genomics: Clean plant, fungal, or microbial datasets
- Metagenomic studies: Remove host contamination from microbiome samples
- Ancient DNA: Remove modern mammalian contamination from archaeological samples
Technical Implementation
Built on BBMap's AbstractMapper framework with hard-coded parameter optimization:
- AbstractMapper framework: Inherits BBMap.java alignment engine with MSA_TYPE="MultiStateAligner11ts"
- Multi-threaded processing: Uses Java threading with configurable thread count (threads=auto defaults to logical processor count)
- BBIndex implementation: Uses 14-mer index (k=14) with FRACTION_GENOME_TO_EXCLUDE and bloomfilter for memory management
- Quality-aware alignment: MSA.calcMatchScore() incorporates Phred quality scores into MINIMUM_ALIGNMENT_SCORE_RATIO calculation
System Requirements
Hardware Requirements
- Memory: Minimum 52GB RAM
- Storage: Access to NERSC filesystem with pre-built indices
- CPU: Multi-core recommended (threads=auto defaults to logical processor count)
Software Requirements
- System: NERSC computing environment
- Java: Java 8 or higher
- File access: Read access to /global/cfs/cdirs/bbtools/mousecatdoghuman/
Limitations
- Platform-specific: Only runs on NERSC systems due to hard-coded paths
- Reference limitation: Limited to cat, dog, mouse, and human genomes
- Memory intensive: Requires substantial RAM for multi-genome index
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org