RemoveCatDogMouseHuman

Script: removecatdogmousehuman.sh Package: align2 Class: BBMap.java

Removes all reads that map to the cat, dog, mouse, or human genome with at least 95% identity after quality trimming. Removes approximately 98.6% of human 2x150bp reads, with zero false-positives to non-animals. NOTE! This program uses hard-coded paths and will only run on Nersc systems.

Basic Usage

removecatdogmousehuman.sh in=<input file> outu=<clean output file>

Input may be fasta or fastq, compressed or uncompressed.

Requirements: This script requires at least 52GB RAM and is designed for NERSC systems with hard-coded paths.

Parameters

This tool is a shell script wrapper around BBMap for removing mammalian contamination from sequencing data. It uses hard-coded BBMap parameters: minratio=0.9, maxindel=3, bwr=0.16, bw=12, minhits=2, k=14, kfilter=25, maxsites=1, fast, quickmatch.

Basic Parameters

threads=auto
(t) Set number of threads to use; default is number of logical processors.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file.
interleaved=auto
(int) If true, forces fastq input to be paired and interleaved.

Quality Trimming Parameters

trim=t
Trim read ends to remove bases with quality below minq. Values: t (trim both ends), f (neither end), r (right end only), l (left end only).
untrim=t
Undo the trimming after mapping. This allows quality-based trimming for better mapping sensitivity while preserving original read lengths in output.
minq=4
Trim quality threshold. Bases with quality scores below this value will be trimmed before mapping.

Output Parameters

ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
outm=<file>
File to output the reads that mapped to human/mammalian genomes. These are the contaminating reads that were removed.

Pre-configured BBMap Parameters

This tool uses the following hard-coded BBMap parameters for mammalian genome detection:

minratio=0.9
Minimum alignment score ratio. Set to 90% to ensure high identity matches (≥95% after accounting for quality trimming).
maxindel=3
Maximum allowed indels in alignment. Restricted to 3bp to maintain high specificity.
bwr=0.16
Bandwidth ratio for alignment. Set to 0.16 for detecting close matches to mammalian genomes.
bw=12
Bandwidth for alignment in bp. Set to 12bp for sensitive detection of mammalian sequences.
quickmatch
Enables quick matching mode for faster processing of obvious matches.
fast
Uses fast alignment mode (sets TIP_SEARCH_DIST/5, maxindel=80, bwr=0.18, bw=40, minratio=0.65, quickmatch=t).
minhits=2
Minimum number of kmers that must match. Set to 2 for sensitive detection.
k=14
Kmer length for indexing. Set to 14bp (BBMap.java default keylen=13, overridden by script).
maxsites=1
Maximum number of sites to align. Set to 1 since we only need to detect presence of contamination.
kfilter=25
Kmer filter threshold. Removes reads with ≥25 matching 14-mers to reduce false positives.
qtrim=r
Quality trimming from right end during mapping (overridden by trim parameter).
trimq=10
Quality threshold for internal trimming (overridden by minq parameter).

Note: All standard BBMap parameters can be used to override defaults. Run bbmap.sh for complete parameter documentation.

Examples

Basic Contamination Removal

removecatdogmousehuman.sh in=reads.fq outu=clean_reads.fq

Removes mammalian contamination from sequencing reads, outputting clean non-mammalian reads to clean_reads.fq.

Save Contaminating Reads

removecatdogmousehuman.sh in=reads.fq outu=clean_reads.fq outm=mammalian_reads.fq

Removes contamination while saving the mammalian reads to mammalian_reads.fq for inspection.

Paired-End Processing

removecatdogmousehuman.sh in=reads_R1.fq in2=reads_R2.fq outu=clean_R1.fq outu2=clean_R2.fq

Process paired-end reads, maintaining pairing in output files.

Custom Thread Count

removecatdogmousehuman.sh in=reads.fq outu=clean_reads.fq threads=8

Use 8 threads for faster processing on multi-core systems.

Algorithm Details

Contamination Detection Strategy

This tool implements a BBMap-based approach using hard-coded parameters for detecting and removing mammalian contamination from non-mammalian sequencing datasets:

Reference Database

Mapping Strategy

False Positive Prevention

Performance Characteristics

Quality Control Features

Use Cases

Technical Implementation

Built on BBMap's AbstractMapper framework with hard-coded parameter optimization:

System Requirements

Hardware Requirements

  • Memory: Minimum 52GB RAM
  • Storage: Access to NERSC filesystem with pre-built indices
  • CPU: Multi-core recommended (threads=auto defaults to logical processor count)

Software Requirements

  • System: NERSC computing environment
  • Java: Java 8 or higher
  • File access: Read access to /global/cfs/cdirs/bbtools/mousecatdoghuman/

Limitations

  • Platform-specific: Only runs on NERSC systems due to hard-coded paths
  • Reference limitation: Limited to cat, dog, mouse, and human genomes
  • Memory intensive: Requires substantial RAM for multi-genome index

Support

For questions and support: