BBSort

Script: bbsort.sh Package: sort Class: SortByName.java

Sorts reads by name or by other keys such as length, quality, mapping position, flowcell coordinates, or taxonomy. Writes temporary files to disk if the data does not fit in memory.

Basic Usage

bbsort.sh in=<file> out=<file>

Input may be FASTA, FASTQ, or SAM, compressed or uncompressed. Temp files will use the same format as the output. Pairs are kept together if reads are paired, and in2/out2 may be used for paired input/output.
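To illustrate what keeping pairs together means, here is a minimal Python sketch. It is not BBSort's code. It assumes that a pair is ordered by the name of its first read, and the record layout and the function name sort_pairs are hypothetical.

```python
from typing import List, Tuple

Read = Tuple[str, str]    # (name, sequence); hypothetical minimal record
Pair = Tuple[Read, Read]  # (read 1, read 2)

def sort_pairs(pairs: List[Pair]) -> List[Pair]:
    """Sort pairs as units so that mates are never separated.

    Assumption: the sort key is taken from read 1 of each pair.
    """
    return sorted(pairs, key=lambda pair: pair[0][0])

pairs = [
    (("readC/1", "ACGT"), ("readC/2", "TTGA")),
    (("readA/1", "GGCA"), ("readA/2", "CCAT")),
]
for read1, read2 in sort_pairs(pairs):
    print(read1[0], read2[0])
```

Because each pair moves as one element, read 2 always follows its mate in the output regardless of the key used.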

Parameters

Parameters are organized by function: main sorting options, memory management settings, and taxonomy-specific parameters for taxonomic sorting.

in=<file>
Input file.
out=<file>
Output file.
name=t
Sort reads by name. This is the default sorting mode.
length=f
Sort reads by length. Longer reads will be placed first when ascending=f.
quality=f
Sort reads by quality. The sort key is the average expected error rate, so ascending order places the highest-quality reads first.
position=f
Sort reads by position (for mapped reads). Only applicable to SAM files with mapping coordinates.
taxa=f
Sort reads by taxonomy (for NCBI naming convention). Requires taxonomy database files.
sequence=f
Sort reads by sequence, alphabetically. Performs lexicographic comparison of DNA sequences.
clump=f
Sort reads by shared kmers, like Clumpify. Groups reads with similar kmer composition together.
flowcell=f
Sort reads by flowcell coordinates. Parses coordinate information from read headers.
shuffle=f
Shuffle reads randomly (untested). Uses deterministic random ordering.
list=<file>
Sort reads according to this list of names. Reads are ordered based on the sequence of names provided in the file.
ascending=t
Sort ascending. Set to false for descending order.
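
The quality option above sorts by average expected error rate rather than by mean quality score. The following Python sketch shows that quantity under stated assumptions: standard Phred+33 encoding and the usual conversion p = 10^(-Q/10). The helper name average_error_rate is hypothetical, and BBSort's own computation may differ in detail.

```python
def average_error_rate(quality_string: str, offset: int = 33) -> float:
    """Mean per-base error probability of a Phred-encoded quality string."""
    if not quality_string:
        return 0.0
    # Convert each quality character to a Phred score, then to a probability.
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return sum(probs) / len(probs)

reads = {
    "high": "IIII",   # Q40 at every base
    "mixed": "II##",  # two Q40 bases and two Q2 bases
    "low": "####",    # Q2 at every base
}

# Ascending error rate puts the highest-quality read first.
ranked = sorted(reads, key=lambda name: average_error_rate(reads[name]))
print(ranked)  # ['high', 'mixed', 'low']
```

Averaging probabilities instead of scores means a few very poor bases dominate the key, which is why the "mixed" read lands much closer to "low" than to "high".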

Memory parameters

memmult=0.30
Write a temp file when used memory exceeds this fraction of available memory. Prevents out-of-memory crashes by spilling data to disk.
memlimit=0.65
Wait for temp files to finish writing until used memory drops below this fraction of available memory. Controls memory pressure during temp file operations.
delete=t
Delete temporary files after merging. Set to false to preserve temp files for debugging.
allowtemp=t
Allow writing temporary files. Set to false to force in-memory sorting only.

Taxonomy-sorting parameters (for taxa mode only)

tree=
Specify a taxtree file. On Genepool, use 'auto' to automatically locate the taxonomy tree file.
gi=
Specify a gitable file. On Genepool, use 'auto' to automatically locate the GI to taxonomy ID mapping file.
accession=
Specify one or more comma-delimited NCBI accession to taxid files. On Genepool, use 'auto' to automatically locate accession mapping files.

Java Parameters

-Xmx
Sets Java's maximum heap size, overriding autodetection. For example, -Xmx20g requests 20 GB of RAM and -Xmx200m requests 200 MB. The maximum is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Sort by name (default)

bbsort.sh in=raw.fq out=sorted.fq

Sorts reads alphabetically by name using the default name comparator.

Sort by sequence

bbsort.sh in=raw.fq out=sorted.fq sequence

Sorts reads lexicographically by DNA sequence content.

Sort by mapping position

bbsort.sh in=mapped.sam out=sorted.sam position

Sorts mapped reads by their genomic coordinates (chromosome and position).
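
As a rough illustration of a coordinate key, the Python sketch below sorts SAM alignment lines by reference name (column 3, RNAME) and 1-based position (column 4, POS), as defined by the SAM format. It assumes plain string ordering of reference names and gives no special treatment to unmapped reads. BBSort's actual ordering rules are not stated here and may differ. The function name position_key is hypothetical.

```python
from typing import Tuple

def position_key(sam_line: str) -> Tuple[str, int]:
    """Return (reference name, position) for one SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    return (fields[2], int(fields[3]))  # RNAME, POS

sam_records = [
    "r1\t0\tchr2\t150\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t900\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
]

# Header lines (starting with '@') would need to be skipped before sorting.
ordered = sorted(sam_records, key=position_key)
print([line.split("\t")[0] for line in ordered])  # ['r3', 'r2', 'r1']
```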

Sort by length, descending

bbsort.sh in=reads.fq out=sorted.fq length ascending=f

Sorts reads by length with longest reads first.

Sort with custom memory limits

bbsort.sh in=large.fq out=sorted.fq memmult=0.2 memlimit=0.5

Uses conservative memory settings to avoid out-of-memory issues with very large files.

Algorithm Details

BBSort implements external sorting so that it can handle datasets larger than available memory. It combines dual memory thresholds (memBlockMult=0.30f and memTotalMult=0.65f, matching the memmult and memlimit defaults), temporary file chunking, and PriorityQueue-based k-way merging. The subsections below describe the main implementation strategies.

Memory Management Strategy

BBSort uses a dual-threshold memory management system. When memory usage exceeds the memmult fraction of available memory (default 30%), the tool writes the accumulated reads to a temporary file. It then waits for pending temp-file writes until memory usage drops below the memlimit fraction (default 65%) before processing more data.
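
The two thresholds can be sketched in Python as a pair of checks. This is a minimal sketch that uses the documented defaults; the class name SpillPolicy, its method names, and the idea of passing in a byte count are hypothetical, since how BBSort measures used memory is not described here.

```python
from dataclasses import dataclass

@dataclass
class SpillPolicy:
    available_bytes: int
    memmult: float = 0.30   # spill threshold (documented default)
    memlimit: float = 0.65  # wait threshold (documented default)

    def should_spill(self, used_bytes: int) -> bool:
        """True when accumulated reads should be written to a temp file."""
        return used_bytes > self.memmult * self.available_bytes

    def must_wait(self, used_bytes: int) -> bool:
        """True while the reader should pause for temp-file writes to finish."""
        return used_bytes >= self.memlimit * self.available_bytes

policy = SpillPolicy(available_bytes=10_000)
for used in (2_000, 3_500, 7_000):
    print(used, policy.should_spill(used), policy.must_wait(used))
```

With 10,000 bytes available, 3,500 bytes in use triggers a spill without blocking, while 7,000 bytes in use also blocks further reading until usage falls below 6,500.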

External Merge Algorithm

For datasets that require temporary files, BBSort performs a k-way merge of the sorted temp files using a PriorityQueue<CrisContainer>, with configurable limits.
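
The general technique of a priority-queue k-way merge can be sketched in Python, with heapq standing in for Java's PriorityQueue. In-memory lists stand in for sorted temp files, and each heap entry pairs a chunk's next unread record with the chunk's index. Whether CrisContainer works the same way is an assumption, and the function name k_way_merge is hypothetical.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

def k_way_merge(chunks: List[Iterable[str]]) -> Iterator[str]:
    """Merge already-sorted chunks into one sorted stream using a min-heap."""
    iterators = [iter(chunk) for chunk in chunks]
    heap: List[Tuple[str, int]] = []
    # Seed the heap with the first record of every non-empty chunk.
    for index, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heap.append((first, index))
    heapq.heapify(heap)
    while heap:
        value, index = heapq.heappop(heap)
        yield value
        # Refill from the chunk that just supplied the smallest record.
        following = next(iterators[index], None)
        if following is not None:
            heapq.heappush(heap, (following, index))

chunks = [["read01", "read07"], ["read02", "read03", "read09"], ["read05"]]
merged = list(k_way_merge(chunks))
print(merged)
```

Only one record per chunk is held in the heap at a time, so memory use grows with the number of chunks rather than with the total number of reads.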

Sorting Comparators

BBSort uses specialized ReadComparator implementations, selected at runtime according to the requested sort key (name, length, quality, position, and so on).
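
One way to picture runtime comparator selection is a lookup table from sort mode to key function. The Python sketch below covers only three of the documented modes. The names KEY_FUNCTIONS and sort_reads and the record layout are hypothetical, and none of this reflects the actual ReadComparator classes.

```python
from typing import Callable, Dict, List, Tuple

Read = Tuple[str, str]  # (name, sequence); hypothetical minimal record

# Hypothetical registry: one key function per supported sort mode.
KEY_FUNCTIONS: Dict[str, Callable[[Read], object]] = {
    "name": lambda read: read[0],
    "length": lambda read: len(read[1]),
    "sequence": lambda read: read[1],
}

def sort_reads(reads: List[Read], mode: str = "name",
               ascending: bool = True) -> List[Read]:
    """Pick the key function for the requested mode, then sort."""
    if mode not in KEY_FUNCTIONS:
        raise ValueError(f"unsupported sort mode: {mode}")
    return sorted(reads, key=KEY_FUNCTIONS[mode], reverse=not ascending)

reads = [("r2", "ACGTAC"), ("r1", "TT"), ("r3", "CCGA")]
print([name for name, _ in sort_reads(reads, "length", ascending=False)])
# ['r2', 'r3', 'r1']
```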

Performance Characteristics

The algorithm implementation provides specific performance characteristics:

File Format Support

BBSort automatically detects and handles multiple sequence formats: FASTA, FASTQ, and SAM, compressed or uncompressed.

Important Notes

Support

For questions and support: