RemoveHuman2
Removes all reads that map to the human genome with at least 88% identity after quality trimming. This is more aggressive than removehuman.sh and uses an unmasked human genome reference. It removes roughly 99.99% of human 2x150bp reads, but may incur false-positive removals. NOTE! This program uses hard-coded paths and will only run on NERSC systems unless you change the path.
System Requirements
This script requires at least 17GB RAM and is specifically designed for NERSC systems with hard-coded paths. To run on other systems, you must modify the path parameter to point to your indexed human genome reference.
Basic Usage
removehuman2.sh in=<input file> outu=<clean output file>
Input may be fasta or fastq, compressed or uncompressed.
Parameters
This tool is a preconfigured wrapper around BBMap with parameters set for human read removal (minratio=0.75, maxindel=8, bwr=0.22, bw=26, minhits=1, maxsites=1, k=14, tipsearch=0, kfilter=25). The parameters are configured for high sensitivity human read detection.
Core Parameters
- threads=auto
- (t) Set number of threads to use; default is number of logical processors.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- interleaved=auto
- (int) If true, forces fastq input to be paired and interleaved.
- path=
- Set the path to an indexed human genome. Default is hard-coded to /global/cfs/cdirs/bbtools/hg19 for NERSC systems. Must be changed for other systems.
Quality Trimming Parameters
- trim=t
- Trim read ends to remove bases with quality below minq. Values: t (trim both ends), f (neither end), r (right end only), l (left end only).
- untrim=t
- Undo the trimming after mapping. This ensures output reads maintain their original length while benefiting from quality trimming during mapping.
- minq=4
- Trim quality threshold. Bases with quality below this value will be trimmed before mapping.
Output Parameters
- outm=<file>
- File to output the reads that mapped to human. Useful for quality control and verification of human content removal.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
Pre-configured BBMap Parameters
The following parameters are automatically set for human read detection:
- minratio=0.75
- Minimum alignment score ratio for a read to be considered mapped. Set to 0.75 for aggressive filtering.
- maxindel=8
- Maximum allowed indel length during alignment.
- bwr=0.22
- Bandwidth ratio for alignment. Controls the alignment search space.
- bw=26
- Bandwidth for alignment in bases.
- minhits=1
- Minimum number of seed hits required to attempt alignment.
- build=2
- Genome build number (hg19).
- qtrim=r
- Quality trim from right end with quality threshold of 10.
- trimq=10
- Quality threshold for trimming during mapping phase.
- k=14
- Kmer length for initial seed finding.
- tipsearch=0
- Disable tip searching for faster processing.
- kfilter=25
- Filter out reads with fewer than this many kmers.
- maxsites=1
- Maximum number of mapping sites to consider per read.
Examples
Basic Human Contamination Removal
removehuman2.sh in=contaminated_reads.fastq.gz outu=clean_reads.fastq.gz
Removes human reads from contaminated_reads.fastq.gz and outputs clean reads to clean_reads.fastq.gz. Human reads are discarded.
Save Removed Human Reads
removehuman2.sh in=sample.fq outu=clean.fq outm=human_reads.fq
Removes human reads and saves them separately for quality control. Clean reads go to clean.fq, human reads go to human_reads.fq.
Paired-end Processing
removehuman2.sh in=reads_R1.fq in2=reads_R2.fq outu=clean_R1.fq outu2=clean_R2.fq
Process paired-end reads, removing human contamination from both files while maintaining pairing.
Custom Reference Path
removehuman2.sh in=sample.fq outu=clean.fq path=/path/to/human/genome/index/
Use a custom human genome reference instead of the default NERSC path.
Algorithm Details
RemoveHuman2 is built on the BBMap alignment engine (align2.BBMap class) and uses pre-configured mapping parameters tuned for human contamination detection. The tool implements key mapping features from the BBMap AbstractMapper framework:
Mapping Strategy
The tool uses a dual-phase approach: quality trimming followed by sensitive mapping. Quality trimming (qtrim=r, trimq=10) removes low-quality bases that could interfere with accurate mapping, while the untrim=t parameter restores original read lengths in the output.
Sensitivity vs Specificity Balance
RemoveHuman2 is configured for maximum sensitivity in human read detection:
- Low minimum ratio (0.75): Accepts alignments with 75% or better identity, catching divergent human sequences
- Unmasked reference: Uses an unmasked human genome to detect reads mapping to repetitive regions
- Single best site (maxsites=1): Considers only the best mapping location to reduce computation time
- Aggressive bandwidth (bwr=0.22, bw=26): Allows sufficient alignment flexibility for indel detection
Kmer-based Seed Finding
The algorithm uses 14-mer seeds (k=14) with kmer filtering (kfilter=25) to quickly identify potential mapping locations. The kmer filter removes reads with too few valid kmers, indicating low-quality or highly repetitive sequences.
Performance Configuration
The tool uses specific BBMap parameters for computational performance:
- Disabled tip searching (tipsearch=0): Disables TIP_SEARCH_DIST extension mechanism
- Limited indel tolerance (maxindel=8): Maximum allowed insertion/deletion length during alignment
- Single hit requirement (minhits=1): Minimum seed hits needed to attempt alignment (MIN_APPROX_HITS_TO_KEEP)
Output Control
The tool provides mapping statistics and supports separate output of human reads (outm parameter) for verification. The unmasked reference approach enables detection of reads mapping to repetitive human sequences.
Memory and Threading
The 17GB memory requirement (set via -Xmx16000m and -Xms16000m in the shell script) accommodates the human genome index and BBMap data structures. Multi-threading uses the AbstractMapper framework with automatic thread detection or manual specification via the threads parameter.
Performance Characteristics
- Memory Usage: Minimum 17GB RAM required
- Speed: Uses pre-configured BBMap parameters for processing speed (tipsearch=0, maxsites=1, k=14)
- Accuracy: Removes ~99.99% of human 2x150bp reads
- False Positives: May remove some non-human reads with high similarity to human genome
Additional BBMap Parameters
All BBMap parameters can be used with removehuman2.sh. Run bbmap.sh without arguments to see the complete parameter list. Commonly used additional parameters include:
- sam=1.3: Output SAM format alignments
- maxindel: Adjust maximum indel length
- minratio: Adjust minimum alignment score ratio
- ambiguous: Control handling of ambiguous mappings
Troubleshooting
Common Issues
- Path not found: The default path /global/cfs/cdirs/bbtools/hg19 only exists on NERSC systems. Set path= to your indexed human genome location.
- Memory errors: Ensure at least 17GB RAM is available. Consider reducing threads if memory is limited.
- Index not found: Verify the human genome index exists and is properly formatted for BBMap.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org