RemoveHuman2

Script: removehuman2.sh Package: align2 Class: BBMap.java

Removes all reads that map to the human genome with at least 88% identity after quality trimming. This is more aggressive than removehuman.sh and uses an unmasked human genome reference. It removes roughly 99.99% of human 2x150bp reads, but may incur false-positive removals. NOTE! This program uses hard-coded paths and will only run on NERSC systems unless you change the path.

System Requirements

This script requires at least 17GB RAM and is specifically designed for NERSC systems with hard-coded paths. To run on other systems, you must modify the path parameter to point to your indexed human genome reference.

Basic Usage

removehuman2.sh in=<input file> outu=<clean output file>

Input may be fasta or fastq, compressed or uncompressed.

Parameters

This tool is a preconfigured wrapper around BBMap with parameters set for human read removal (minratio=0.75, maxindel=8, bwr=0.22, bw=26, minhits=1, maxsites=1, k=14, tipsearch=0, kfilter=25). The parameters are configured for high sensitivity human read detection.

Core Parameters

threads=auto: (t) Set number of threads to use; default is number of logical processors.
overwrite=t: (ow) Set to false to force the program to abort rather than overwrite an existing file.
interleaved=auto: (int) If true, forces fastq input to be paired and interleaved.
path=: Set the path to an indexed human genome. Default is hard-coded to /global/cfs/cdirs/bbtools/hg19 for NERSC systems. Must be changed for other systems.

Quality Trimming Parameters

trim=t: Trim read ends to remove bases with quality below minq. Values: t (trim both ends), f (neither end), r (right end only), l (left end only).
untrim=t: Undo the trimming after mapping. This ensures output reads maintain their original length while benefiting from quality trimming during mapping.
minq=4: Trim quality threshold. Bases with quality below this value will be trimmed before mapping.

Output Parameters

outm=<file>: File to output the reads that mapped to human. Useful for quality control and verification of human content removal.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

Pre-configured BBMap Parameters

The following parameters are automatically set for human read detection:

minratio=0.75: Minimum alignment score ratio for a read to be considered mapped. Set to 0.75 for aggressive filtering.
maxindel=8: Maximum allowed indel length during alignment.
bwr=0.22: Bandwidth ratio for alignment. Controls the alignment search space.
bw=26: Bandwidth for alignment in bases.
minhits=1: Minimum number of seed hits required to attempt alignment.
build=2: Genome build number (hg19).
qtrim=r: Quality trim from right end with quality threshold of 10.
trimq=10: Quality threshold for trimming during mapping phase.
k=14: Kmer length for initial seed finding.
tipsearch=0: Disable tip searching for faster processing.
kfilter=25: Filter out reads with fewer than this many kmers.
maxsites=1: Maximum number of mapping sites to consider per read.

Examples

Basic Human Contamination Removal

removehuman2.sh in=contaminated_reads.fastq.gz outu=clean_reads.fastq.gz

Removes human reads from contaminated_reads.fastq.gz and outputs clean reads to clean_reads.fastq.gz. Human reads are discarded.

Save Removed Human Reads

removehuman2.sh in=sample.fq outu=clean.fq outm=human_reads.fq

Removes human reads and saves them separately for quality control. Clean reads go to clean.fq, human reads go to human_reads.fq.

Paired-end Processing

removehuman2.sh in=reads_R1.fq in2=reads_R2.fq outu=clean_R1.fq outu2=clean_R2.fq

Process paired-end reads, removing human contamination from both files while maintaining pairing.

Custom Reference Path

removehuman2.sh in=sample.fq outu=clean.fq path=/path/to/human/genome/index/

Use a custom human genome reference instead of the default NERSC path.

Algorithm Details

RemoveHuman2 is built on the BBMap alignment engine (align2.BBMap class) and uses pre-configured mapping parameters tuned for human contamination detection. The tool implements key mapping features from the BBMap AbstractMapper framework:

Mapping Strategy

The tool uses a dual-phase approach: quality trimming followed by sensitive mapping. Quality trimming (qtrim=r, trimq=10) removes low-quality bases that could interfere with accurate mapping, while the untrim=t parameter restores original read lengths in the output.

Sensitivity vs Specificity Balance

RemoveHuman2 is configured for maximum sensitivity in human read detection:

Low minimum ratio (0.75): Accepts alignments with 75% or better identity, catching divergent human sequences
Unmasked reference: Uses an unmasked human genome to detect reads mapping to repetitive regions
Single best site (maxsites=1): Considers only the best mapping location to reduce computation time
Aggressive bandwidth (bwr=0.22, bw=26): Allows sufficient alignment flexibility for indel detection

Kmer-based Seed Finding

The algorithm uses 14-mer seeds (k=14) with kmer filtering (kfilter=25) to quickly identify potential mapping locations. The kmer filter removes reads with too few valid kmers, indicating low-quality or highly repetitive sequences.

Performance Configuration

The tool uses specific BBMap parameters for computational performance:

Disabled tip searching (tipsearch=0): Disables TIP_SEARCH_DIST extension mechanism
Limited indel tolerance (maxindel=8): Maximum allowed insertion/deletion length during alignment
Single hit requirement (minhits=1): Minimum seed hits needed to attempt alignment (MIN_APPROX_HITS_TO_KEEP)

Output Control

The tool provides mapping statistics and supports separate output of human reads (outm parameter) for verification. The unmasked reference approach enables detection of reads mapping to repetitive human sequences.

Memory and Threading

The 17GB memory requirement (set via -Xmx16000m and -Xms16000m in the shell script) accommodates the human genome index and BBMap data structures. Multi-threading uses the AbstractMapper framework with automatic thread detection or manual specification via the threads parameter.

Performance Characteristics

Memory Usage: Minimum 17GB RAM required
Speed: Uses pre-configured BBMap parameters for processing speed (tipsearch=0, maxsites=1, k=14)
Accuracy: Removes ~99.99% of human 2x150bp reads
False Positives: May remove some non-human reads with high similarity to human genome

Additional BBMap Parameters

All BBMap parameters can be used with removehuman2.sh. Run bbmap.sh without arguments to see the complete parameter list. Commonly used additional parameters include:

sam=1.3: Output SAM format alignments
maxindel: Adjust maximum indel length
minratio: Adjust minimum alignment score ratio
ambiguous: Control handling of ambiguous mappings

Troubleshooting

Common Issues

Path not found: The default path /global/cfs/cdirs/bbtools/hg19 only exists on NERSC systems. Set path= to your indexed human genome location.
Memory errors: Ensure at least 17GB RAM is available. Consider reducing threads if memory is limited.
Index not found: Verify the human genome index exists and is properly formatted for BBMap.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org