BBWrap
Simple wrapper that allows BBMap to be run multiple times without reloading the index each time. Particularly useful for saving compute resources when processing multiple datasets against large references, and for handling mixed paired and unpaired reads in the same workflow.
When to Use BBWrap
BBWrap is designed for specific scenarios where standard BBMap is inefficient or cannot handle the input:
Primary Use Cases
- Multiple datasets with large references: Save time by avoiding repeated index loading for each dataset
- Mixed paired and unpaired reads: A single BBMap run cannot process both paired and unpaired reads; BBWrap can, by handling them as separate datasets within one invocation
- Batch processing workflows: Process many files systematically while maintaining consistent reference indexing
- Resource-constrained environments: Particularly beneficial with a small number of reads and a large reference genome
Important Limitation: BBWrap will not work with stdin/stdout or histogram output. Use standard BBMap for those cases.
Basic Usage
bbwrap.sh ref=<reference fasta> in=<file,file,...> out=<file,file,...>
Usage Patterns
To index only:
bbwrap.sh ref=<reference fasta>
To map to an existing index:
bbwrap.sh in=<file,file,...> out=<file,file,...>
To map pairs and singletons to the same output file:
bbwrap.sh in1=read1.fq,singleton.fq in2=read2.fq,null out=mapped.sam append
Input Parameters
- in=<file,file>
- Input sequences to map. Accepts comma-separated list of files.
- in1=<file,file>
- Input sequences for read 1 (paired-end). Accepts comma-separated list of files.
- in2=<file,file>
- Input sequences for read 2 (paired-end). Use "null" as placeholder for unpaired datasets.
- inlist=<fofn>
- File containing list of input files, one per line. Alternative to comma-separated lists.
- in1list=<fofn>
- File containing list of read 1 input files, one per line.
- in2list=<fofn>
- File containing list of read 2 input files, one per line.
Output Parameters
Primary Output
- out=<file,file>
- Primary output files. Accepts comma-separated list of files.
- out1=<file,file>
- Output file for read 1 (paired-end). Accepts comma-separated list of files.
- out2=<file,file>
- Output file for read 2 (paired-end). Accepts comma-separated list of files.
- outlist=<fofn>
- File containing list of primary output files, one per line.
- out1list=<fofn>
- File containing list of read 1 output files, one per line.
- out2list=<fofn>
- File containing list of read 2 output files, one per line.
Mapped/Unmapped Streams
- outm=<file,file>
- Output file for mapped reads. Accepts comma-separated list of files.
- outm1=<file,file>
- Output file for mapped read 1 (paired-end).
- outm2=<file,file>
- Output file for mapped read 2 (paired-end).
- outu=<file,file>
- Output file for unmapped reads. Accepts comma-separated list of files.
- outu1=<file,file>
- Output file for unmapped read 1 (paired-end).
- outu2=<file,file>
- Output file for unmapped read 2 (paired-end).
- outb=<file,file>
- Output file for blacklisted/filtered reads.
- outb1=<file,file>
- Output file for blacklisted read 1 (paired-end).
- outb2=<file,file>
- Output file for blacklisted read 2 (paired-end).
- outmlist=<fofn>
- File containing list of mapped output files, one per line.
- outm1list=<fofn>
- File containing list of mapped read 1 output files, one per line.
- outm2list=<fofn>
- File containing list of mapped read 2 output files, one per line.
- outulist=<fofn>
- File containing list of unmapped output files, one per line.
- outu1list=<fofn>
- File containing list of unmapped read 1 output files, one per line.
- outu2list=<fofn>
- File containing list of unmapped read 2 output files, one per line.
Analysis Output
- qualityhistogram=<file,file>
- Output quality histogram files. Aliases: qualityhist, qhist.
- matchhistogram=<file,file>
- Output match histogram files. Aliases: matchhist, mhist.
- inserthistogram=<file,file>
- Output insert size histogram files. Aliases: inserthist, ihist.
- bamscript=<file,file>
- Files to which BBMap writes a shell script for converting the SAM output to sorted, indexed BAM. Alias: bs.
Parameters
BBWrap accepts all standard BBMap parameters plus wrapper-specific options for managing multiple input/output files.
Control Parameters
- ref=<file>
- Reference fasta file. Specify it only when the index needs to be built; omit it to map against an existing index. Aliases: reference, fasta.
- mapper=bbmap
- Select mapping algorithm. Options: bbmap (default), bbmappacbio, bbmappacbioskimmer, bbmap5, bbmapacc, bbsplit.
- append=f
- Append to files rather than overwriting. When true and exactly one output file is specified, all output is written to that single file.
- path=<dir>
- Root directory for index storage. Alias: root.
BBMap Parameters: All standard BBMap parameters can be used with BBWrap. See bbmap.sh documentation for complete parameter list.
Examples
Efficient Multi-File Processing
bbwrap.sh ref=large_genome.fasta \
in=sample1.fq,sample2.fq,sample3.fq,sample4.fq \
out=mapped1.sam,mapped2.sam,mapped3.sam,mapped4.sam
Process four datasets against a large reference. Index is loaded once and reused for all four mappings, saving significant time compared to running BBMap four times separately.
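When the sample names come from a script rather than being typed by hand, the comma-separated lists can be assembled with a small helper. The following is a minimal sketch: `join_csv` is a hypothetical function name, and the final line only prints the command (a dry run) rather than executing it.

```shell
# join_csv FILE... : print the arguments joined with commas.
# (Hypothetical helper, not part of BBTools.)
join_csv() {
    # "$*" joins on the first character of IFS; the subshell keeps IFS local.
    ( IFS=,; printf '%s\n' "$*" )
}

# Dry run: print the command instead of executing it.
echo "bbwrap.sh ref=large_genome.fasta in=$(join_csv sample1.fq sample2.fq) out=$(join_csv mapped1.sam mapped2.sam)"
```

Keeping the inputs and outputs in the same order matters, because the lists are matched by position.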
Mixed Paired and Unpaired Reads
bbwrap.sh ref=genome.fasta \
in1=paired_R1.fq,unpaired.fq \
in2=paired_R2.fq,null \
out=all_mapped.sam \
append
Map both paired-end and unpaired reads to the same reference, outputting all results to a single file. This workflow is impossible with standard BBMap.
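The in1/in2 lists must stay the same length, with "null" standing in wherever a dataset has no second read file. As a hedged sketch, a hypothetical helper `mixed_args` can pad the in2 list automatically for any number of singleton files:

```shell
# mixed_args R1 R2 SINGLETON... : print in1=/in2= arguments for one paired
# dataset followed by any number of unpaired files. Each unpaired file gets
# a "null" placeholder in in2. (Hypothetical helper, not part of BBTools.)
mixed_args() {
    in1=$1
    in2=$2
    shift 2
    for single in "$@"; do
        in1="$in1,$single"
        in2="$in2,null"
    done
    printf 'in1=%s in2=%s\n' "$in1" "$in2"
}

mixed_args paired_R1.fq paired_R2.fq unpaired.fq
# prints: in1=paired_R1.fq,unpaired.fq in2=paired_R2.fq,null
```

The printed arguments can then be pasted into a bbwrap.sh command line together with out=<file> and append.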
Batch Processing with File Lists
# Create matching input and output file lists (one path per line)
printf '%s\n' dataset1.fq dataset2.fq dataset3.fq > input_files.txt
printf '%s\n' mapped1.sam mapped2.sam mapped3.sam > output_files.txt
bbwrap.sh ref=reference.fasta inlist=input_files.txt outlist=output_files.txt
Process multiple files using file lists, useful for automated pipelines with many datasets.
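For larger batches, the two list files can be generated from a directory listing so that they always stay in the same order. This is a sketch under assumptions: `make_lists` is a hypothetical helper, inputs are assumed to end in .fq, and each output is named after its input with a .sam extension.

```shell
# make_lists DIR INLIST OUTLIST : write one input path and one matching
# output path per line for every .fq file in DIR.
# (Hypothetical helper; the naming scheme is an assumption.)
make_lists() {
    dir=$1
    inlist=$2
    outlist=$3
    : > "$inlist"     # truncate or create both list files
    : > "$outlist"
    for fq in "$dir"/*.fq; do
        [ -e "$fq" ] || continue      # skip the unexpanded glob when nothing matches
        base=$(basename "$fq" .fq)
        printf '%s\n' "$fq" >> "$inlist"
        printf '%s\n' "$dir/$base.sam" >> "$outlist"
    done
}
```

Because both lists are written in the same loop iteration, line N of the output list always corresponds to line N of the input list. The resulting files can be passed as inlist= and outlist=.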
Separate Mapped and Unmapped Outputs
bbwrap.sh ref=host_genome.fasta \
in=sample1.fq,sample2.fq,sample3.fq \
outm=host_reads1.fq,host_reads2.fq,host_reads3.fq \
outu=nonhost_reads1.fq,nonhost_reads2.fq,nonhost_reads3.fq
Separate host and non-host reads from multiple samples efficiently, useful for contamination removal workflows.
PacBio Long Read Processing
bbwrap.sh ref=reference.fasta \
in=pacbio_run1.fq,pacbio_run2.fq \
out=mapped_run1.sam,mapped_run2.sam \
mapper=bbmappacbio
Process multiple PacBio datasets using the specialized long-read mapper, sharing the index across runs.
Quality Control Workflow
bbwrap.sh ref=reference.fasta \
in=sample1.fq,sample2.fq \
out=mapped1.sam,mapped2.sam \
qhist=quality1.txt,quality2.txt \
ihist=insert1.txt,insert2.txt \
mhist=match1.txt,match2.txt
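Generate mapping results and quality control statistics for multiple samples in a single run. Because the limitation noted under "When to Use BBWrap" says histogram output is not supported, confirm that these histogram files are actually written by your version before relying on them.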
Algorithm Details
Index Reuse Strategy
BBWrap's primary efficiency comes from index persistence. When processing multiple datasets:
- First dataset: Loads reference and builds index (normal BBMap overhead)
- Subsequent datasets: Reuses existing index in memory (near-zero index overhead)
This approach reduces total run time most when index loading is slow relative to mapping, as is typical with large references and small read datasets.
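A back-of-the-envelope estimate makes the trade-off concrete. The sketch below assumes a simple model in which each separate BBMap run pays the index-loading cost once, while a wrapped run pays it only for the first dataset. The function name and the timings are hypothetical, not measurements.

```shell
# estimate_savings N LOAD_SECONDS MAP_SECONDS
# Compare N separate runs (each reloads the index) with one wrapped run
# (index loaded once). All inputs are whole seconds; the model is an
# illustrative assumption, not a benchmark.
estimate_savings() {
    n=$1
    load=$2
    map=$3
    separate=$(( n * (load + map) ))
    wrapped=$(( load + n * map ))
    echo "separate=${separate}s wrapped=${wrapped}s saved=$(( separate - wrapped ))s"
}

estimate_savings 4 300 60
# prints: separate=1440s wrapped=540s saved=900s
```

Under this model the saving is (N - 1) times the load time, so it grows with the number of datasets and with index size but is independent of mapping time.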
File Coordination
BBWrap maintains parallel lists of input and output files, processing them in synchronized fashion:
- Position-based matching: First input maps to first output, second to second, etc.
- Append mode exception: When append=true and a single output file is specified, all inputs map to that same output
- Null placeholders: Use "null" in file lists to handle mixed paired/unpaired datasets
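The matching rules above can be sketched as a pre-flight check that prints which input goes to which output. This illustrates the described behavior and is not BBWrap's actual code; `pair_files` is a hypothetical name, and treating a count mismatch as an error is a choice made for this sketch.

```shell
# pair_files IN_CSV OUT_CSV [append] : print "input -> output" for each input.
# append is "t" or "f" (default "f"). Sketch only, not BBWrap's implementation.
pair_files() {
    ins=$1
    outs=$2
    append=${3:-f}
    n_in=$(printf '%s\n' "$ins" | tr ',' '\n' | wc -l | tr -d ' ')
    n_out=$(printf '%s\n' "$outs" | tr ',' '\n' | wc -l | tr -d ' ')

    # Append mode exception: one output receives every input.
    if [ "$append" = t ] && [ "$n_out" -eq 1 ]; then
        printf '%s\n' "$ins" | tr ',' '\n' | while IFS= read -r f; do
            printf '%s -> %s\n' "$f" "$outs"
        done
        return 0
    fi

    # Assumption of this sketch: unequal list lengths are rejected.
    if [ "$n_in" -ne "$n_out" ]; then
        echo "error: $n_in inputs but $n_out outputs" >&2
        return 1
    fi

    # Position-based matching: field i of the inputs pairs with field i of the outputs.
    i=1
    while [ "$i" -le "$n_in" ]; do
        printf '%s -> %s\n' \
            "$(printf '%s\n' "$ins" | cut -d, -f"$i")" \
            "$(printf '%s\n' "$outs" | cut -d, -f"$i")"
        i=$((i + 1))
    done
}
```

For example, `pair_files a.fq,b.fq all.sam t` lists both inputs against all.sam, while the same call without `t` reports a mismatch.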
Mapper Selection
BBWrap can delegate to different alignment algorithms based on the mapper parameter:
- bbmap: Standard short-read aligner (default)
- bbmappacbio: Optimized for PacBio long reads with high error rates
- bbmappacbioskimmer: Fast approximate mapping for error correction workflows
- bbmap5: Enhanced version with additional features
- bbmapacc: High-accuracy variant for sensitive applications
- bbsplit: Multi-reference mapping for contamination detection
Memory Management
BBWrap processes datasets sequentially rather than simultaneously, so memory usage does not grow with the number of input files. The shared index remains in memory across all runs, but read data is processed one dataset at a time.
Performance Characteristics
Performance benefits are most pronounced when:
- Index size >> read data size: Large genomes with relatively small read datasets
- Multiple similar datasets: Same reference, multiple samples
- I/O-constrained environments: Systems where disk access is the bottleneck
Limitations and Considerations
- No stdin/stdout support: Cannot use with pipes or stream processing
- No histogram output support: Use standard BBMap for detailed statistical analysis
- Sequential processing: Datasets are processed one at a time, not in parallel
- Index persistence requirement: Index must fit in available memory for the duration of all runs
- Single reference limitation: All datasets must map to the same reference sequence
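The stdin/stdout limitation can be caught before a long run starts. The following sketch assumes the BBTools convention of naming streams stdin or stdout with an optional format extension (for example stdin.fq); `check_no_streams` is a hypothetical helper and checks only arguments whose key starts with in or out.

```shell
# check_no_streams ARG... : return 1 if any in*/out* argument names a stream.
# Assumes streams are written as "stdin", "stdout", or those names followed
# by an extension. (Hypothetical helper, not part of BBTools.)
check_no_streams() {
    for arg in "$@"; do
        case $arg in
            in*=*|out*=*) ;;          # inspect input/output arguments
            *) continue ;;            # ignore everything else
        esac
        value=${arg#*=}
        # Wrap the list in commas so every file name starts after a comma.
        case ",$value," in
            *,stdin,*|*,stdin.*|*,stdout,*|*,stdout.*)
                echo "error: '$arg' uses a stream, which BBWrap does not support" >&2
                return 1
                ;;
        esac
    done
    return 0
}
```

A wrapper script could call `check_no_streams "$@" || exit 1` before handing the same arguments to bbwrap.sh, and fall back to standard BBMap when the check fails.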