BBSplitPairs
Separates paired reads into files of 'good' pairs and 'good' singletons by removing 'bad' reads that are shorter than a minimum length. Designed for situations where reads become too short to be useful after trimming. The program can also optionally perform quality trimming before the length check.
Basic Usage
bbsplitpairs.sh in=<input file> out=<pair output file> outs=<singleton output file> minlen=<minimum read length, an integer>
Input may be fasta or fastq, compressed or uncompressed. This tool processes paired-end reads and separates them based on length filtering, optionally applying quality trimming first.
Parameters
Parameters are organized by their function in the read processing pipeline. All parameters from the shell script usage() function are documented below.
Input/Output Parameters
- in=<file>
- Primary input file. The 'in=' flag is needed if the input file is not the first parameter. Use 'in=stdin' to pipe from standard input. Required parameter.
- in2=<file>
- Secondary input file. Use this if the second reads of the pairs are in a different file (non-interleaved mode).
- out=<file>
- Output file for valid pairs. The 'out=' flag is needed if the output file is not the second parameter. Use 'out=stdout' to pipe to standard output.
- out2=<file>
- Secondary output file. Use this to write the second read of pairs to a different file.
- outsingle=<file>
- (outs) Write singleton reads here. These are reads that passed length filtering but whose mate did not.
Processing Control Parameters
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true
- showspeed=t
- (ss) Set to false to suppress display of processing speed. Default: true
- interleaved=auto
- (int) Controls interleaved format handling. If true, forces fastq input to be paired and interleaved. Auto-detection is used by default.
Quality Trimming Parameters
- qtrim=f
- Trim read ends to remove bases with quality below trimq. Values: rl (trim both ends), f (neither end), r (right end only), l (left end only). Default: f (no trimming)
- trimq=6
- Trim quality threshold. Bases with quality below this value will be trimmed. Default: 6
Length Filtering Parameters
- minlen=20
- (ml) Reads shorter than this after trimming will be discarded. This is the core parameter that determines which reads are considered "good" vs "bad". Default: 20
Compression Parameters
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (maximum) to change compression level; lower compression is faster. Default: 2
Repair and Correction Parameters
- fixinterleaving=f
- (fint) Fixes corrupted interleaved files by examining pair names. Only use on files with broken interleaving. Cannot be used with repair mode. Default: false
- repair=f
- (rp) Fixes arbitrarily corrupted paired reads by examining read names. High memory usage. Cannot be used with fixinterleaving mode. Default: false
- ain=f
- (allowidenticalnames) When detecting pair names, allows identical names, instead of requiring /1 and /2 or 1: and 2: suffixes. Default: false
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Read Filtering
bbsplitpairs.sh in=reads.fq out=good_pairs.fq outs=singletons.fq minlen=30
Filters paired reads from a single interleaved file, keeping only pairs where both reads are at least 30bp long. Reads that pass individually but whose mate fails go to the singleton file.
Quality Trimming with Filtering
bbsplitpairs.sh in=reads.fq out=clean_pairs.fq outs=clean_singles.fq qtrim=rl trimq=10 minlen=25
First trims low-quality bases from both ends (quality < 10), then filters reads shorter than 25bp after trimming.
Separate Input/Output Files
bbsplitpairs.sh in=R1.fq in2=R2.fq out=clean_R1.fq out2=clean_R2.fq outs=singles.fq minlen=50
Processes separate R1 and R2 files, writing clean pairs to separate output files and singletons to a single file.
Repair Corrupted Pairs
bbsplitpairs.sh in=corrupted.fq out=repaired_pairs.fq outs=unpaired.fq repair=t minlen=30
Repairs arbitrarily corrupted paired reads by examining read names, then applies length filtering. High memory usage but handles severely damaged files.
Fix Interleaving Issues
bbsplitpairs.sh in=broken_interleaved.fq out=fixed_pairs.fq outs=singles.fq fixinterleaving=t minlen=25
Fixes corrupted interleaved files where the pairing has been disrupted, then applies length filtering.
Algorithm Details
Core Processing Architecture
BBSplitPairs implements a three-stage processing pipeline through the SplitPairsAndSingles.java class with distinct processing methods:
1. Quality Trimming Implementation
When qtrim is enabled, the tool calls TrimRead.trimFast(read, qtrimLeft, qtrimRight, trimq, trimE, 1) to remove bases with quality scores below the trimq threshold. The trimE parameter represents the error rate derived from trimq. Trimming operates on both reads in a pair independently, tracking basesTrimmed and readsTrimmed statistics.
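As a rough illustration of end trimming, the sketch below drops bases from either end of a read while their quality is below the threshold, and converts a Phred score to an error probability with the standard relation p = 10^(-Q/10). It is a simplified stand-in, not the actual TrimRead.trimFast algorithm, which may use a different stopping rule; the names phred_to_error and trim_ends are hypothetical.

```python
def phred_to_error(q):
    """Standard Phred relation: error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_ends(bases, quals, trimq, left=True, right=True):
    """Drop leading/trailing bases whose quality is below trimq.

    Simplified illustration only (assumption): the real tool may use a
    different, e.g. error-rate based, rule to decide where to stop.
    """
    start, end = 0, len(bases)
    if left:
        # Advance past low-quality bases at the left end.
        while start < end and quals[start] < trimq:
            start += 1
    if right:
        # Retreat past low-quality bases at the right end.
        while end > start and quals[end - 1] < trimq:
            end -= 1
    return bases[start:end], quals[start:end]
```

Here left and right play the role of qtrim=l and qtrim=r; setting both corresponds to qtrim=rl. A read whose bases are all below the threshold trims to length zero and would then fail any positive minlen.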
2. Length-Based Read Classification Logic
The processPair() method implements the core classification algorithm:
- Valid Pairs: Both reads satisfy rlen1 >= minReadLength && rlen2 >= minReadLength. The mate relationship is kept, with pair numbers set through setPairnum() (0 for the first read, 1 for the second).
- Singletons: Only one read meets the length requirement. The surviving read has its mate set to null and its pair number set to 0 with setPairnum(0).
- Discarded: Neither read meets minReadLength; the removed counter is incremented.
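The three outcomes above can be sketched as a small decision function. This is an illustration of the rule as described here, not the tool's source; classify_pair and its return labels are hypothetical names.

```python
def classify_pair(len1, len2, min_len):
    """Return 'pair', 'single1', 'single2', or 'discard' for a read pair."""
    ok1 = len1 >= min_len
    ok2 = len2 >= min_len
    if ok1 and ok2:
        return "pair"      # both mates are long enough and stay together
    if ok1:
        return "single1"   # only read 1 survives; it becomes a singleton
    if ok2:
        return "single2"   # only read 2 survives; it becomes a singleton
    return "discard"       # neither read is long enough
```

Note that the comparison is inclusive: a read whose length equals the threshold is kept.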
3. Concurrent I/O Stream Management
The tool utilizes ConcurrentReadInputStream and ConcurrentReadOutputStream with default buffer=4 for thread-safe read/write operations. The process3() method uses ArrayList<Read> pairs and singles collections, clearing them after each batch to maintain memory efficiency.
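The batch-and-clear pattern can be sketched as follows. This is a single-threaded simplification under stated assumptions: reads are plain sequence strings, and write_pairs and write_singles are hypothetical callbacks standing in for the output streams.

```python
def process_in_batches(batches, min_len, write_pairs, write_singles):
    """batches: iterable of lists of (read1, read2) sequence strings."""
    pairs, singles = [], []            # buffers reused across batches
    for batch in batches:
        for r1, r2 in batch:
            ok1, ok2 = len(r1) >= min_len, len(r2) >= min_len
            if ok1 and ok2:
                pairs.append((r1, r2))
            elif ok1:
                singles.append(r1)
            elif ok2:
                singles.append(r2)
        # Hand off copies, then clear so memory stays bounded by batch size.
        write_pairs(list(pairs))
        write_singles(list(singles))
        pairs.clear()
        singles.clear()
```

Because the buffers are emptied after every batch, peak memory depends on the batch size rather than on the size of the input file.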
Specialized Processing Modes
Fix Interleaving Mode (process3_fixInterleaving)
Reads are processed sequentially, using FASTQ.testPairNames(prev, current, allowIdenticalPairNames) to check whether two consecutive reads form a pair. The algorithm maintains prev and current read pointers and calls processPair() whenever a valid pair is identified. This repairs corrupted interleaving while preserving read order.
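A minimal sketch of this sequential walk, operating on read names only. The names_match function is a simplified, hypothetical stand-in for FASTQ.testPairNames that only understands /1 and /2 suffixes, and treating an unmatched read as a singleton is an assumption of this sketch.

```python
def names_match(a, b, allow_identical=False):
    """Simplified pair test: same prefix with /1 then /2, or identical names."""
    if a.endswith("/1") and b.endswith("/2") and a[:-2] == b[:-2]:
        return True
    return allow_identical and a == b


def fix_interleaving(names, allow_identical=False):
    """Walk names in order, pairing consecutive reads whose names match."""
    pairs, singles = [], []
    prev = None
    for current in names:
        if prev is None:
            prev = current                    # nothing to compare against yet
        elif names_match(prev, current, allow_identical):
            pairs.append((prev, current))     # consecutive mates found
            prev = None
        else:
            singles.append(prev)              # prev has no adjacent mate
            prev = current
    if prev is not None:
        singles.append(prev)                  # trailing unmatched read
    return pairs, singles
```

Only adjacent reads are ever compared, which is why this approach needs very little memory but cannot recover mates that are far apart in the file.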
Repair Mode (process3_repair)
Uses LinkedHashMap<String, Read> pairMap for mate reconstruction. The repair() method extracts read prefixes from IDs by parsing /1, /2, 1:, 2: suffixes or slash separators. Unpaired reads are stored in pairMap by prefix; when mates are found, pairs are reconstructed and removed from the map. Memory usage scales linearly with unpaired read count.
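The map-based matching can be sketched as below, again on read names only. A Python dict preserves insertion order, which makes it a rough analogue of LinkedHashMap. The prefix_of helper is hypothetical and handles only trailing /1 and /2; the real prefix extraction covers more naming styles.

```python
def prefix_of(name):
    """Hypothetical key: the name without a trailing /1 or /2."""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def repair(names):
    """Re-pair reads that may appear in any order, keyed by name prefix."""
    pending = {}    # prefix -> read still waiting for its mate
    pairs = []
    for name in names:
        key = prefix_of(name)
        mate = pending.pop(key, None)
        if mate is None:
            pending[key] = name               # first of the pair seen so far
        elif mate.endswith("/1"):
            pairs.append((mate, name))        # mate is read 1
        else:
            pairs.append((name, mate))        # current read is read 1
    singles = list(pending.values())          # reads whose mate never arrived
    return pairs, singles
```

Every read without a mate stays in the map until the input ends, which is where the high memory usage of repair mode comes from.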
Memory Management Implementation
The tool implements three distinct memory strategies:
- Standard mode: Uses ArrayList<Read> structures with Shared.bufferLen() capacity, processed in batches
- Repair mode: LinkedHashMap<String, Read> with memory proportional to unpaired reads awaiting mates
- Concurrent buffering: ConcurrentReadOutputStream with configurable buffer size (default=4) for streaming I/O
Pair Name Recognition Algorithm
The repair() method implements comprehensive pair name parsing:
- Splits read IDs on whitespace using split("\\s+") to separate prefix from suffix
- Handles /1, /2 suffixes with id.indexOf('/') and substring extraction
- Recognizes 1:, 2: suffixes with startsWith() pattern matching
- SAM format integration through SamLine.pairnum() when available
- Fallback to allowIdenticalPairNames when suffix parsing fails
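The first three parsing rules can be approximated as follows. This is a simplified sketch, not the tool's parser: parse_pair_name is a hypothetical name, SAM handling is omitted, and returning None for the pair number marks the case where the caller would need the identical-names fallback.

```python
import re


def parse_pair_name(read_id):
    """Return (prefix, pairnum) with pairnum 0 or 1, or (read_id, None)."""
    tokens = re.split(r"\s+", read_id.strip())
    head = tokens[0]
    slash = head.rfind("/")
    # Style 1: "name/1" or "name/2"
    if slash >= 0 and head[slash + 1:] in ("1", "2"):
        return head[:slash], int(head[slash + 1:]) - 1
    # Style 2: "name 1:N:0:..." or "name 2:N:0:..."
    if len(tokens) > 1:
        if tokens[1].startswith("1:"):
            return head, 0
        if tokens[1].startswith("2:"):
            return head, 1
    # No recognizable suffix; the caller decides how to treat the name.
    return read_id, None
```

Two reads belong together when they yield the same prefix and different pair numbers.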
Performance Implementation Details
- I/O Throughput: ConcurrentReadInputStream/ConcurrentReadOutputStream with buffered batch processing
- Memory Efficiency: Standard mode uses O(buffer_size) memory; repair mode uses O(unpaired_reads)
- Data Integrity: Apart from optional quality trimming, original sequence and quality data are preserved; the tool otherwise changes only pair relationships and removes reads that fail the length filter
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org