BBSort

Basic Usage

bbsort.sh in=<file> out=<file>

Input may be FASTA, FASTQ, or SAM, compressed or uncompressed. Temp files will use the same format as the output. Pairs are kept together if reads are paired, and in2/out2 may be used for paired input/output.

Parameters

Parameters are organized by function: main sorting options, memory management settings, and taxonomy-specific parameters for taxonomic sorting.

Parameters

in=<file>: Input file.
out=<file>: Output file.
name=t: Sort reads by name. This is the default sorting mode.
length=f: Sort reads by length. Longer reads will be placed first when ascending=f.
quality=f: Sort reads by quality. Actually sorts by average expected error rate, so ascending will place the highest-quality reads first.
position=f: Sort reads by position (for mapped reads). Only applicable to SAM files with mapping coordinates.
taxa=f: Sort reads by taxonomy (for NCBI naming convention). Requires taxonomy database files.
sequence=f: Sort reads by sequence, alphabetically. Performs lexicographic comparison of DNA sequences.
clump=f: Sort reads by shared kmers, like Clumpify. Groups reads with similar kmer composition together.
flowcell=f: Sort reads by flowcell coordinates. Parses coordinate information from read headers.
shuffle=f: Shuffle reads randomly (untested). Uses deterministic random ordering.
list=<file>: Sort reads according to this list of names. Reads are ordered based on the sequence of names provided in the file.
ascending=t: Sort ascending. Set to false for descending order.

Memory parameters

memmult=0.30: Write a temp file when used memory exceeds this fraction of available memory. Prevents out-of-memory crashes by spilling data to disk.
memlimit=0.65: Wait for temp files to finish writing until used memory drops below this fraction of available memory. Controls memory pressure during temp file operations.
delete=t: Delete temporary files after merging. Set to false to preserve temp files for debugging.
allowtemp=t: Allow writing temporary files. Set to false to force in-memory sorting only.

Taxonomy-sorting parameters (for taxa mode only)

tree=: Specify a taxtree file. On Genepool, use 'auto' to automatically locate the taxonomy tree file.
gi=: Specify a gitable file. On Genepool, use 'auto' to automatically locate the GI to taxonomy ID mapping file.
accession=: Specify one or more comma-delimited NCBI accession to taxid files. On Genepool, use 'auto' to automatically locate accession mapping files.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Sort by name (default)

bbsort.sh in=raw.fq out=sorted.fq

Sorts reads alphabetically by name using the default name comparator.

Sort by sequence

bbsort.sh in=raw.fq out=sorted.fq sequence

Sorts reads lexicographically by DNA sequence content.

Sort by mapping position

bbsort.sh in=mapped.sam out=sorted.sam position

Sorts mapped reads by their genomic coordinates (chromosome and position).

Sort by length, descending

bbsort.sh in=reads.fq out=sorted.fq length ascending=f

Sorts reads by length with longest reads first.

Sort with custom memory limits

bbsort.sh in=large.fq out=sorted.fq memmult=0.2 memlimit=0.5

Uses conservative memory settings to avoid out-of-memory issues with very large files.

Algorithm Details

BBSort implements external sorting using dual memory thresholds (memBlockMult=0.30f, memTotalMult=0.65f) with temporary file chunking and PriorityQueue-based k-way merging to handle datasets larger than available memory. The algorithm employs the following implementation strategies:

Memory Management Strategy

BBSort uses a dual-threshold memory management system. When memory usage exceeds the memmult fraction of available memory (default 30%), the tool writes accumulated reads to temporary files. It then waits until memory usage drops below the memlimit fraction (default 65%) before continuing to process more data.

External Merge Algorithm

For datasets requiring temporary files, BBSort implements k-way merge using PriorityQueue<CrisContainer> with configurable limits:

Chunking: Input data is processed in chunks up to readLimit=2000000000 reads, each sorted via Shared.sort() and written to temporary files
Recursive merging: When temp files exceed maxFiles=12 (default), files are recursively merged in groups with mergeFileLimit=24 to reduce total count
Priority queue merging: Final merge uses PriorityQueue<CrisContainer> with limit=100000 buffer management for efficient stream combining
Buffer optimization: Buffer sizes use ArrayList<Read> with 200-read batches, dynamically adjusting via maxMergeFiles() based on available memory

Sorting Comparators

BBSort uses specialized ReadComparator implementations selected at runtime:

Name sorting: ReadComparatorName.comparator uses lexicographic string comparison with pair number tie-breaking
Length sorting: ReadLengthComparator.comparator with hierarchical length, mate length, string ID, numeric ID comparison
Quality sorting: ReadQualityComparator.comparator calculates expected error rates from quality scores
Position sorting: ReadComparatorPosition.comparator with ScafMap.loadSamHeader() for coordinate extraction and multi-level genomic sorting
Sequence sorting: ReadComparatorTopological.comparator performs lexicographic DNA sequence comparison
Clump sorting: ReadComparatorClump.comparator uses 31-base k-mer hashing for similarity grouping
Flowcell sorting: ReadComparatorFlowcell.comparator with thread-local coordinate parsing from read headers
Taxonomic sorting: ReadComparatorTaxa.comparator integrates TaxTree.loadTaxTree() with GiToTaxid and AccessionToTaxid mapping

Performance Characteristics

The algorithm implementation provides specific performance characteristics:

Time complexity: O(n log n) via Shared.sort() for in-memory, O(n log k) per mergeRecursive() pass where k=maxFiles (default 12)
Memory usage: Bounded by memAvailable()*memBlockMult (0.30f) with automatic WriteThread.run() temp file creation
Disk I/O: Sequential writes via ConcurrentReadOutputStream with buff=4 threading and temp file cleanup
Paired-read handling: Mate pairs maintained via Read.mate references throughout sorting and merging

File Format Support

BBSort automatically detects and handles multiple sequence formats:

FASTA/FASTQ: Standard sequence formats, compressed or uncompressed
SAM/BAM: Alignment formats with full header preservation
Interleaved/paired: Automatic detection and preservation of pairing information
Compression: Native support for gzip compression on input and output

Important Notes

Sorting modes (name, length, quality, position, etc.) are mutually exclusive
Quality sorting actually sorts by average expected error rate, so ascending order places highest-quality reads first
Taxonomic sorting requires appropriate database files (tree, gi table, accession files)
Temp files are created in the current working directory when memory limits are exceeded
Paired reads are kept together regardless of sorting mode

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org