BBSort
Sorts reads by name or other keys such as length, quality, mapping position, flowcell coordinates, or taxonomy. Writes temp files if memory is exceeded.
Basic Usage
bbsort.sh in=<file> out=<file>
Input may be FASTA, FASTQ, or SAM, compressed or uncompressed. Temp files will use the same format as the output. Pairs are kept together if reads are paired, and in2/out2 may be used for paired input/output.
Parameters
Parameters are organized by function: main sorting options, memory management settings, and taxonomy-specific parameters for taxonomic sorting.
Parameters
- in=<file>
- Input file.
- out=<file>
- Output file.
- name=t
- Sort reads by name. This is the default sorting mode.
- length=f
- Sort reads by length. Longer reads will be placed first when ascending=f.
- quality=f
- Sort reads by quality. Actually sorts by average expected error rate, so ascending will place the highest-quality reads first.
- position=f
- Sort reads by position (for mapped reads). Only applicable to SAM files with mapping coordinates.
- taxa=f
- Sort reads by taxonomy (for NCBI naming convention). Requires taxonomy database files.
- sequence=f
- Sort reads by sequence, alphabetically. Performs lexicographic comparison of DNA sequences.
- clump=f
- Sort reads by shared kmers, like Clumpify. Groups reads with similar kmer composition together.
- flowcell=f
- Sort reads by flowcell coordinates. Parses coordinate information from read headers.
- shuffle=f
- Shuffle reads randomly (untested). Uses deterministic random ordering.
- list=<file>
- Sort reads according to this list of names. Reads are ordered based on the sequence of names provided in the file.
- ascending=t
- Sort ascending. Set to false for descending order.
Memory parameters
- memmult=0.30
- Write a temp file when used memory exceeds this fraction of available memory. Prevents out-of-memory crashes by spilling data to disk.
- memlimit=0.65
- Wait for temp files to finish writing until used memory drops below this fraction of available memory. Controls memory pressure during temp file operations.
- delete=t
- Delete temporary files after merging. Set to false to preserve temp files for debugging.
- allowtemp=t
- Allow writing temporary files. Set to false to force in-memory sorting only.
Taxonomy-sorting parameters (for taxa mode only)
- tree=
- Specify a taxtree file. On Genepool, use 'auto' to automatically locate the taxonomy tree file.
- gi=
- Specify a gitable file. On Genepool, use 'auto' to automatically locate the GI to taxonomy ID mapping file.
- accession=
- Specify one or more comma-delimited NCBI accession to taxid files. On Genepool, use 'auto' to automatically locate accession mapping files.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Sort by name (default)
bbsort.sh in=raw.fq out=sorted.fq
Sorts reads alphabetically by name using the default name comparator.
Sort by sequence
bbsort.sh in=raw.fq out=sorted.fq sequence
Sorts reads lexicographically by DNA sequence content.
Sort by mapping position
bbsort.sh in=mapped.sam out=sorted.sam position
Sorts mapped reads by their genomic coordinates (chromosome and position).
Sort by length, descending
bbsort.sh in=reads.fq out=sorted.fq length ascending=f
Sorts reads by length with longest reads first.
Sort with custom memory limits
bbsort.sh in=large.fq out=sorted.fq memmult=0.2 memlimit=0.5
Uses conservative memory settings to avoid out-of-memory issues with very large files.
Algorithm Details
BBSort implements external sorting using dual memory thresholds (memBlockMult=0.30f, memTotalMult=0.65f) with temporary file chunking and PriorityQueue-based k-way merging to handle datasets larger than available memory. The algorithm employs the following implementation strategies:
Memory Management Strategy
BBSort uses a dual-threshold memory management system. When memory usage exceeds the memmult
fraction of available memory (default 30%), the tool writes accumulated reads to temporary files. It then waits until memory usage drops below the memlimit
fraction (default 65%) before continuing to process more data.
External Merge Algorithm
For datasets requiring temporary files, BBSort implements k-way merge using PriorityQueue<CrisContainer> with configurable limits:
- Chunking: Input data is processed in chunks up to readLimit=2000000000 reads, each sorted via Shared.sort() and written to temporary files
- Recursive merging: When temp files exceed maxFiles=12 (default), files are recursively merged in groups with mergeFileLimit=24 to reduce total count
- Priority queue merging: Final merge uses PriorityQueue<CrisContainer> with limit=100000 buffer management for efficient stream combining
- Buffer optimization: Buffer sizes use ArrayList<Read> with 200-read batches, dynamically adjusting via maxMergeFiles() based on available memory
Sorting Comparators
BBSort uses specialized ReadComparator implementations selected at runtime:
- Name sorting: ReadComparatorName.comparator uses lexicographic string comparison with pair number tie-breaking
- Length sorting: ReadLengthComparator.comparator with hierarchical length, mate length, string ID, numeric ID comparison
- Quality sorting: ReadQualityComparator.comparator calculates expected error rates from quality scores
- Position sorting: ReadComparatorPosition.comparator with ScafMap.loadSamHeader() for coordinate extraction and multi-level genomic sorting
- Sequence sorting: ReadComparatorTopological.comparator performs lexicographic DNA sequence comparison
- Clump sorting: ReadComparatorClump.comparator uses 31-base k-mer hashing for similarity grouping
- Flowcell sorting: ReadComparatorFlowcell.comparator with thread-local coordinate parsing from read headers
- Taxonomic sorting: ReadComparatorTaxa.comparator integrates TaxTree.loadTaxTree() with GiToTaxid and AccessionToTaxid mapping
Performance Characteristics
The algorithm implementation provides specific performance characteristics:
- Time complexity: O(n log n) via Shared.sort() for in-memory, O(n log k) per mergeRecursive() pass where k=maxFiles (default 12)
- Memory usage: Bounded by memAvailable()*memBlockMult (0.30f) with automatic WriteThread.run() temp file creation
- Disk I/O: Sequential writes via ConcurrentReadOutputStream with buff=4 threading and temp file cleanup
- Paired-read handling: Mate pairs maintained via Read.mate references throughout sorting and merging
File Format Support
BBSort automatically detects and handles multiple sequence formats:
- FASTA/FASTQ: Standard sequence formats, compressed or uncompressed
- SAM/BAM: Alignment formats with full header preservation
- Interleaved/paired: Automatic detection and preservation of pairing information
- Compression: Native support for gzip compression on input and output
Important Notes
- Sorting modes (name, length, quality, position, etc.) are mutually exclusive
- Quality sorting actually sorts by average expected error rate, so ascending order places highest-quality reads first
- Taxonomic sorting requires appropriate database files (tree, gi table, accession files)
- Temp files are created in the current working directory when memory limits are exceeded
- Paired reads are kept together regardless of sorting mode
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org