BBSplitPairs

Script: bbsplitpairs.sh Package: jgi Class: SplitPairsAndSingles.java

Separates paired reads into files of 'good' pairs and 'good' singletons by removing 'bad' reads that are shorter than a minimum length. It is designed for cases where reads become too short to be useful after trimming. The program can also perform quality trimming before the length filter is applied.

Basic Usage

bbsplitpairs.sh in=<input file> out=<pair output file> outs=<singleton output file> minlen=<minimum read length, an integer>

Input may be fasta or fastq, compressed or uncompressed. This tool processes paired-end reads and separates them based on length filtering, optionally applying quality trimming first.

Parameters

Parameters are organized by their function in the read processing pipeline. All parameters from the shell script usage() function are documented below.

Input/Output Parameters

in=<file>: Primary input file. The 'in=' flag is needed if the input file is not the first parameter. Use 'in=stdin' to pipe from standard input. Required parameter.
in2=<file>: Secondary input file. Use this if the second reads of the pairs are in a different file (non-interleaved mode).
out=<file>: Output file for valid pairs. The 'out=' flag is needed if the output file is not the second parameter. Use 'out=stdout' to pipe to standard output.
out2=<file>: Secondary output file. Use this to write the second read of pairs to a different file.
outsingle=<file>: (outs) Write singleton reads here. These are reads that passed length filtering but whose mate did not.

Processing Control Parameters

overwrite=t: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true
showspeed=t: (ss) Set to false to suppress display of processing speed. Default: true
interleaved=auto: (int) Controls interleaved format handling. If true, forces fastq input to be paired and interleaved. Auto-detection is used by default.

Quality Trimming Parameters

qtrim=f: Trim read ends to remove bases with quality below trimq. Values: rl (trim both ends), f (neither end), r (right end only), l (left end only). Default: f (no trimming)
trimq=6: Trim quality threshold. Bases with quality below this value will be trimmed. Default: 6

Length Filtering Parameters

minlen=20: (ml) Reads shorter than this after trimming will be discarded. This is the core parameter that determines which reads are considered "good" vs "bad". Default: 20

Compression Parameters

ziplevel=2: (zl) Set to 1 (lowest) through 9 (maximum) to change compression level; lower compression is faster. Default: 2

Repair and Correction Parameters

fixinterleaving=f: (fint) Fixes corrupted interleaved files by examining pair names. Only use on files with broken interleaving. Cannot be used with repair mode. Default: false
repair=f: (rp) Fixes arbitrarily corrupted paired reads by examining read names. High memory usage. Cannot be used with fixinterleaving mode. Default: false
ain=f: (allowidenticalnames) When detecting pair names, allows identical names, instead of requiring /1 and /2 or 1: and 2: suffixes. Default: false

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Read Filtering

bbsplitpairs.sh in=reads.fq out=good_pairs.fq outs=singletons.fq minlen=30

Filters paired reads from a single interleaved file, keeping only pairs where both reads are at least 30bp long. A read that passes while its mate fails goes to the singleton file; if both fail, both are discarded.

Quality Trimming with Filtering

bbsplitpairs.sh in=reads.fq out=clean_pairs.fq outs=clean_singles.fq qtrim=rl trimq=10 minlen=25

First trims bases with quality below 10 from both ends of each read, then discards any read that is shorter than 25bp after trimming.

Separate Input/Output Files

bbsplitpairs.sh in=R1.fq in2=R2.fq out=clean_R1.fq out2=clean_R2.fq outs=singles.fq minlen=50

Processes separate R1 and R2 files, writing clean pairs to separate output files and singletons to a single file.

Repair Corrupted Pairs

bbsplitpairs.sh in=corrupted.fq out=repaired_pairs.fq outs=unpaired.fq repair=t minlen=30

Repairs arbitrarily corrupted paired reads by examining read names, then applies length filtering. Repair mode uses a lot of memory, but it can handle severely damaged files.

Fix Interleaving Issues

bbsplitpairs.sh in=broken_interleaved.fq out=fixed_pairs.fq outs=singles.fq fixinterleaving=t minlen=25

Fixes corrupted interleaved files where the pairing has been disrupted, then applies length filtering.

Algorithm Details

Core Processing Architecture

BBSplitPairs implements a three-stage processing pipeline through the SplitPairsAndSingles.java class with distinct processing methods:

1. Quality Trimming Implementation

When qtrim is enabled, the tool calls TrimRead.trimFast(read, qtrimLeft, qtrimRight, trimq, trimE, 1) to remove bases with quality scores below the trimq threshold. The trimE parameter represents the error rate derived from trimq. Trimming operates on both reads in a pair independently, tracking basesTrimmed and readsTrimmed statistics.
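As an illustration only, the sketch below shows what trimming read ends against a quality threshold can look like, together with the standard Phred relation p = 10^(-Q/10) that an error rate such as trimE would normally follow. It is a simplified stand-in written in Python, not the logic of TrimRead.trimFast, whose internals are not described here. The names phred_to_error and trim_ends, and the representation of a read as a base string plus a list of quality scores, are assumptions made for the example.

```python
def phred_to_error(q):
    """Standard Phred relation: error probability for quality score q."""
    return 10 ** (-q / 10)


def trim_ends(bases, quals, trimq, left=True, right=True):
    """Drop leading/trailing bases whose quality is below trimq.

    Simplified illustration: trimming on each side stops at the first
    base that meets the threshold. Returns (bases, quals, n_trimmed).
    """
    start, end = 0, len(bases)
    if left:
        while start < end and quals[start] < trimq:
            start += 1
    if right:
        while end > start and quals[end - 1] < trimq:
            end -= 1
    trimmed = len(bases) - (end - start)
    return bases[start:end], quals[start:end], trimmed


# Two low-quality bases on each end are removed with trimq=6.
bases, quals, n_trimmed = trim_ends(
    "ACGTACGT", [2, 5, 30, 30, 30, 30, 4, 2], trimq=6
)
print(bases, n_trimmed)  # GTAC 4
```

The left and right flags correspond loosely to the l, r and rl values of qtrim. A read whose bases are all below the threshold comes back empty, which is why the length filter runs after trimming.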

2. Length-Based Read Classification Logic

The processPair() method implements the core classification algorithm:

Valid Pairs: Both reads satisfy rlen1 >= minReadLength && rlen2 >= minReadLength; the mate relationship is kept and pair numbers are set with setPairnum(0) and setPairnum(1)
Singletons: Only one read meets the length requirement, with mate=null and setPairnum(0)
Discarded: Neither read meets minReadLength, incrementing removed counter
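The three outcomes above can be sketched in a few lines. This is a minimal Python illustration of the rule, not the Java processPair() method: the function name classify_pair is hypothetical and reads are reduced to plain base strings.

```python
def classify_pair(r1, r2, min_len):
    """Return ('pair' | 'single' | 'discard', surviving reads).

    r1 and r2 are base strings; either may be None if the mate is missing.
    """
    ok1 = r1 is not None and len(r1) >= min_len
    ok2 = r2 is not None and len(r2) >= min_len
    if ok1 and ok2:
        return "pair", [r1, r2]          # both long enough: keep as mates
    if ok1 or ok2:
        return "single", [r1 if ok1 else r2]  # exactly one survives
    return "discard", []                 # neither meets the minimum


# With minlen=30, a 35bp read whose mate is 12bp becomes a singleton.
kind, kept = classify_pair("A" * 35, "C" * 12, min_len=30)
print(kind, len(kept))  # single 1
```

Note that the comparison is inclusive: a read whose length equals minlen is kept.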

3. Concurrent I/O Stream Management

The tool utilizes ConcurrentReadInputStream and ConcurrentReadOutputStream with default buffer=4 for thread-safe read/write operations. The process3() method uses ArrayList<Read> pairs and singles collections, clearing them after each batch to maintain memory efficiency.
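The batch pattern described above, where two small collections are filled, handed to the writers, and then cleared, can be sketched as follows. This is a single-threaded Python illustration under stated assumptions: split_in_batches, the writer callbacks and the batch_size parameter are hypothetical, and the concurrent streams of the real tool are not modeled.

```python
def split_in_batches(read_pairs, min_len, batch_size, write_pairs, write_singles):
    """Classify pairs batch by batch, reusing two small buffers."""
    pairs, singles = [], []

    def flush():
        # Hand each buffer to its writer, then clear it for the next batch.
        if pairs:
            write_pairs(list(pairs))
        if singles:
            write_singles(list(singles))
        pairs.clear()
        singles.clear()

    for n, (r1, r2) in enumerate(read_pairs, start=1):
        ok1, ok2 = len(r1) >= min_len, len(r2) >= min_len
        if ok1 and ok2:
            pairs.append((r1, r2))
        elif ok1 or ok2:
            singles.append(r1 if ok1 else r2)
        if n % batch_size == 0:
            flush()
    flush()  # write whatever is left in a final partial batch


out_pairs, out_singles = [], []
split_in_batches(
    [("ACGTACGT", "ACGTACGT"), ("ACGTACGT", "AC"), ("A", "C")],
    min_len=4,
    batch_size=2,
    write_pairs=out_pairs.extend,
    write_singles=out_singles.extend,
)
```

Because the buffers are cleared after every flush, memory held by the loop stays bounded by the batch size regardless of how large the input is.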

Specialized Processing Modes

Fix Interleaving Mode (process3_fixInterleaving)

This mode processes reads sequentially, using FASTQ.testPairNames(prev, current, allowIdenticalPairNames) to check whether two consecutive reads form a pair. The algorithm maintains prev and current read pointers and calls processPair() when a valid pair is identified. Read order is preserved while the corrupted interleaving is repaired.
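A minimal sketch of the prev/current scan follows. It is written in Python for illustration and makes several assumptions: names_match is a hypothetical, much simplified stand-in for FASTQ.testPairNames, reads are (id, bases) tuples, and a read with no adjacent mate is routed to the singles list.

```python
def names_match(id1, id2, allow_identical=False):
    """Hypothetical pair-name test for 'x/1' + 'x/2' or 'x 1:...' + 'x 2:...'."""
    if id1.endswith("/1") and id2.endswith("/2"):
        return id1[:-2] == id2[:-2]
    p1, _, s1 = id1.partition(" ")
    p2, _, s2 = id2.partition(" ")
    if s1.startswith("1:") and s2.startswith("2:"):
        return p1 == p2
    return allow_identical and id1 == id2


def fix_interleaving(reads, allow_identical=False):
    """Scan reads in order; return (pairs, singles) without reordering."""
    pairs, singles = [], []
    prev = None
    for current in reads:
        if prev is None:
            prev = current
        elif names_match(prev[0], current[0], allow_identical):
            pairs.append((prev, current))
            prev = None                # both reads consumed
        else:
            singles.append(prev)       # prev has no adjacent mate
            prev = current
    if prev is not None:
        singles.append(prev)           # trailing unpaired read
    return pairs, singles


reads = [("a/1", "AC"), ("a/2", "GT"), ("b/2", "TT"), ("c/1", "AA"), ("c/2", "CC")]
pairs, singles = fix_interleaving(reads)
```

Here the orphan b/2 is set aside and the scan resynchronizes on c/1 and c/2. Only adjacent reads are ever compared, which is why this approach needs little memory but cannot recover mates that are far apart in the file.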

Repair Mode (process3_repair)

This mode uses LinkedHashMap<String, Read> pairMap for mate reconstruction. The repair() method extracts a prefix from each read ID by parsing /1, /2, 1: and 2: suffixes. A read whose mate has not yet been seen is stored in pairMap under its prefix; when the mate arrives, the pair is reconstructed and the entry is removed from the map. Memory usage therefore scales linearly with the number of reads still waiting for a mate.
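The map-based matching can be sketched as below. This Python version is an illustration rather than the Java implementation: it assumes read names have already been reduced to a prefix and a pair number, represents each read as a (prefix, pairnum, bases) tuple, and uses an ordinary dict (which preserves insertion order) in place of LinkedHashMap.

```python
def repair(reads):
    """Re-pair reads given as (prefix, pairnum, bases) tuples in any order.

    Returns (pairs, unpaired); each pair is ordered read 1 then read 2.
    """
    pending = {}  # prefix -> read still waiting for its mate
    pairs = []
    for read in reads:
        prefix = read[0]
        mate = pending.pop(prefix, None)
        if mate is None:
            pending[prefix] = read     # first sighting: park it in the map
        else:
            # Mate found: emit the pair and leave nothing behind in the map.
            pairs.append(tuple(sorted((mate, read), key=lambda r: r[1])))
    return pairs, list(pending.values())


shuffled = [
    ("a", 2, "GG"),
    ("b", 1, "AA"),
    ("c", 1, "TT"),
    ("a", 1, "CC"),
    ("b", 2, "AC"),
]
pairs, unpaired = repair(shuffled)
```

At any moment the map holds only reads whose mates have not yet appeared, so the worst case is an input where every first read precedes every second read.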

Memory Management Implementation

The tool implements three distinct memory strategies:

Standard mode: Uses ArrayList<Read> structures with Shared.bufferLen() capacity, processed in batches
Repair mode: LinkedHashMap<String, Read> with memory proportional to unpaired reads awaiting mates
Concurrent buffering: ConcurrentReadOutputStream with configurable buffer size (default=4) for streaming I/O

Pair Name Recognition Algorithm

The repair() method implements comprehensive pair name parsing:

Splits read IDs on whitespace using split("\\s+") to separate prefix from suffix
Handles /1, /2 suffixes with id.indexOf('/') and substring extraction
Recognizes 1:, 2: suffixes with startsWith() pattern matching
SAM format integration through SamLine.pairnum() when available
Fallback to allowIdenticalPairNames when suffix parsing fails
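The name-parsing steps listed above can be illustrated with a small function. This is a Python sketch under assumptions, not the Java code: split_name is a hypothetical name, the SAM path is left out, and the last slash is used (rfind) so that a slash earlier in the name is not mistaken for the pair suffix.

```python
def split_name(read_id):
    """Return (prefix, pairnum); pairnum is 0 if no suffix is recognized."""
    parts = read_id.split()            # split on runs of whitespace
    head = parts[0] if parts else ""
    tail = parts[1] if len(parts) > 1 else ""
    if tail.startswith("1:") or tail.startswith("2:"):
        return head, int(tail[0])      # e.g. "read7 2:N:0:ACGT"
    slash = head.rfind("/")
    if slash >= 0 and head[slash + 1:] in ("1", "2"):
        return head[:slash], int(head[slash + 1:])  # e.g. "read7/1"
    return head, 0                     # no suffix: identical-names case


for name in ("read7/1", "read7 2:N:0:ACGT", "read7"):
    print(split_name(name))
```

Both mates of a pair map to the same prefix, which is what allows them to be matched by a map lookup. A pair number of 0 signals that only the identical-names fallback could pair the read.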

Performance Implementation Details

I/O Throughput: ConcurrentReadInputStream/ConcurrentReadOutputStream with buffered batch processing
Memory Efficiency: Standard mode uses O(buffer_size) memory; repair mode uses O(unpaired_reads)
Data Integrity: Apart from optional quality trimming, sequence and quality data are written unchanged; the tool otherwise alters only pair relationships and which reads are kept

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org