RemoveHuman
Removes reads that map to the human genome with at least 95% identity after quality trimming. Uses hard-coded paths for NERSC systems. Removes approximately 98.6% of human 2x150bp reads with zero false-positives to non-animals. Requires at least 16GB RAM.
Basic Usage
removehuman.sh in=<input file> outu=<clean output file>
Input may be fasta or fastq, compressed or uncompressed.
Important: This program uses hard-coded paths and will only run on NERSC systems unless you change the path parameter. The default path is /global/cfs/cdirs/bbtools/hg19
.
Parameters
Parameters control the human genome removal process, including threading, file handling, quality trimming, and mapping behavior.
Basic Parameters
- threads=auto
- (t) Set number of threads to use; default is number of logical processors.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- interleaved=auto
- (int) If true, forces fastq input to be paired and interleaved.
Quality Trimming Parameters
- trim=t
- Trim read ends to remove bases with quality below minq. Values: t (trim both ends), f (neither end), r (right end only), l (left end only).
- untrim=t
- Undo the trimming after mapping. This allows accurate mapping while preserving original read lengths in output.
- minq=4
- Trim quality threshold. Bases with quality scores below this value will be trimmed during the mapping process.
Output Parameters
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
- outm=<file>
- File to output the reads that mapped to human genome. These are the contaminating reads that were removed.
Reference Path Parameter
- path=
- Set the path to an indexed human genome. Default is /global/cfs/cdirs/bbtools/hg19 for NERSC systems. Must be changed for use on other systems.
Additional BBMap Parameters
All BBMap parameters can be used with removehuman.sh. The tool internally uses BBMap with the following optimized settings:
- minratio=0.9
- Minimum alignment score ratio (set internally)
- maxindel=3
- Maximum indel length (set internally)
- bwr=0.16
- Bandwidth ratio for alignment (set internally)
- bw=12
- Bandwidth for alignment (set internally)
- minhits=2
- Minimum seed hits required (set internally)
- k=14
- Kmer length for mapping (set internally)
- kfilter=25
- Kmer filter threshold (set internally)
- maxsites=1
- Maximum mapping sites per read (set internally)
These internal settings are optimized for human contamination removal but can be overridden by specifying them explicitly.
Examples
Basic Human Removal
removehuman.sh in=sample.fastq.gz outu=clean_sample.fastq.gz
Removes human contamination from compressed FASTQ reads, outputting clean reads to clean_sample.fastq.gz
Paired-end Reads with Contamination Output
removehuman.sh in=reads_R1.fq in2=reads_R2.fq outu=clean_R1.fq outu2=clean_R2.fq outm=human_contam.fq
Process paired-end reads, output clean reads separately and save human contamination to a separate file
Custom Path and Threading
removehuman.sh in=data.fq outu=clean_data.fq path=/custom/path/to/human/index threads=8
Use a custom human genome index path and specify 8 threads for processing
Adjust Quality Trimming
removehuman.sh in=reads.fq outu=clean.fq minq=10 trim=r untrim=f
Trim low-quality bases (Q<10) from the right end only and keep the trimmed length in output
Algorithm Details
RemoveHuman is a wrapper around BBMap configured for identifying and removing human genomic contamination from sequencing data. The implementation uses specific parameter combinations:
Mapping Strategy
The tool uses BBMap's align2.BBMap engine with predefined parameters for contamination detection:
- Identity threshold: Uses minratio=0.9 (90% identity threshold via MINIMUM_ALIGNMENT_SCORE_RATIO) to identify human sequences
- K-mer indexing: k=14 with BBIndex.loadIndex() for seed-based alignment initialization
- Indel constraints: maxindel=3 uses MSA.bandwidth constraints to limit alignment complexity and reduce false positives
- Alignment bandwidth: bw=12, bwr=0.16 configure MSA.bandwidthRatio for dynamic programming alignment matrix boundaries
Quality Control Integration
The tool integrates quality trimming directly into the mapping process:
- Pre-mapping trimming: Removes low-quality bases before mapping to improve accuracy
- Post-mapping restoration: Can restore original read lengths after mapping via untrim parameter
- Quality threshold: Default minq=4 removes very low-quality bases that could cause mapping artifacts
Performance Characteristics
RemoveHuman implementation characteristics:
- Memory requirements: Requires minimum 16GB RAM for human genome index storage (-Xmx15000m)
- Thread scaling: Uses threads=auto parameter to detect available logical processors
- I/O handling: Supports compressed input/output via pigz/unpigz with ziplevel configuration
- File processing: Processes files sequentially through align2.BBMap command execution
Accuracy Metrics
Based on validation with human 2x150bp reads:
- Removal rate: Approximately 98.6% of human reads are correctly identified and removed
- False positive rate: Zero false positives against non-animal sequences
- Cross-species specificity: Parameter settings reduce removal of non-human sequences
Technical Implementation
The tool uses BBMap's align2.BBMap mapping engine with these components:
- Bloom filter: BloomFilter.class with k=14 parameter for k-mer presence testing
- Seed extension: minhits=2 requires multiple k-mer seed matches before triggering alignment via BBMapThread
- Mapping restriction: maxsites=1 limits output to single best alignment per read
- Index management: BBIndex.loadIndex() loads chromosome arrays via ChromosomeArray and Data.loadChromosomes()
- Thread management: AbstractMapThread[] array creates BBMapThread workers coordinated through cris.start() stream processing
- Alignment algorithm: MSA_TYPE="MultiStateAligner11ts" implements dynamic programming alignment with gap penalties
System Requirements
- Memory: Minimum 16GB RAM required for human genome index
- Storage: Access to indexed human genome (hg19 by default)
- Platform: Designed for NERSC systems but adaptable with path modification
- Java: Requires Java runtime environment
Notes and Limitations
- Hard-coded paths: Default configuration only works on NERSC systems
- Genome version: Uses hg19 human genome by default
- Memory requirements: Minimum 16GB RAM is non-negotiable
- Species specificity: Optimized for human contamination, may need adjustment for other species
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org