SortByName

Script: sortbyname.sh Package: sort Class: SortByName.java

Sorts reads by name or other keys such as length, quality, mapping position, flowcell coordinates, or taxonomy. Writes temp files if memory is exceeded.

Basic Usage

sortbyname.sh in=<file> out=<file>

Input may be fasta, fastq, or sam, compressed or uncompressed. Temp files will use the same format as the output. Pairs are kept together if reads are paired, and in2/out2 may be used for paired-end files.

Parameters

Parameters are organized into functional groups for different sorting modes and memory management. Note that name, length, and quality are mutually exclusive sorting criteria.

Input/Output Parameters

in=<file>: Input file. Can be fasta, fastq, or sam format, compressed or uncompressed.
out=<file>: Output file. Will use the same format as input unless specified otherwise.
in2=<file>: Secondary input file for paired-end reads (if applicable).
out2=<file>: Secondary output file for paired-end reads (if applicable).
overwrite=t: Overwrite existing output files.
append=f: Append to existing output files instead of overwriting.
extin=<extension>: Override input file extension detection.
extout=<extension>: Override output file extension.

Sorting Mode Parameters

name=t: Sort reads by name (default mode). Uses lexicographic ordering of read names.
length=f: Sort reads by sequence length. Longer reads will be placed first (descending) or last (ascending) based on the ascending parameter.
quality=f: Sort reads by quality score. Actually sorts by average expected error rate, so ascending=t places highest-quality reads first.
position=f: Sort reads by mapping position (for mapped reads in SAM format). Requires coordinate information in the input.
taxa=f: Sort reads by taxonomy (for NCBI naming convention). Requires taxonomy files to be specified.
sequence=f: Sort reads by sequence content alphabetically. Uses lexicographic ordering of actual DNA sequence.
clump=f: Sort reads by shared kmers, similar to Clumpify. Groups reads with similar kmer content together.
flowcell=f: Sort reads by flowcell coordinates. Extracts coordinate information from Illumina read headers.
shuffle=f: Shuffle reads randomly (untested). Uses deterministic random seed for reproducible results.
list=<file>: Sort reads according to the order specified in this list of names. Reads not in the list are placed at the end.
ascending=t: Sort in ascending order. Set to false for descending order.

Memory Parameters

memmult=0.30: Write a temporary file when used memory exceeds this fraction of available memory. Reduces this value if experiencing crashes.
memlimit=0.65: Wait for temporary files to finish writing until used memory drops below this fraction of available memory.
delete=t: Delete temporary files after merging. Set to false to preserve temp files for debugging.
allowtemp=t: Allow writing temporary files when memory is exceeded. If false, the program will crash rather than create temp files.
maxfiles=12: Maximum number of temporary files to merge per pass. Higher values use more memory but may be faster.

Taxonomy-sorting Parameters (for taxa mode only)

tree=<file>: Specify a taxtree file. On Genepool systems, use 'auto' to automatically locate the default taxonomy tree.
gi=<file>: Specify a gitable file for GI number to taxonomy mapping. On Genepool systems, use 'auto' to automatically locate the default GI table.
accession=<file>: Specify one or more comma-delimited NCBI accession to taxid files. On Genepool systems, use 'auto' to automatically locate the default accession files.

Processing Control Parameters

maxreads=-1: Process at most this many reads. -1 means no limit.
minlen=0: Discard reads shorter than this length before sorting.
interleaved=auto: Set interleaved mode explicitly. Auto-detected based on input format.
verbose=f: Print verbose status messages during processing.

Java Parameters

-Xmx<size>: Set Java's maximum memory usage, overriding autodetection. Examples: -Xmx20g (20 GB), -Xmx2000m (2 GB). Maximum is typically 85% of physical memory.
-eoom: Exit if an out-of-memory exception occurs. Requires Java 8u92 or later.
-da: Disable Java assertions for better performance.

Examples

Basic name sorting

sortbyname.sh in=raw.fq out=sorted.fq

Sort reads by name (default mode) and write to sorted file.

Sort by sequence content

sortbyname.sh in=raw.fq out=sorted.fq sequence=t

Sort reads alphabetically by DNA sequence content.

Sort mapped reads by position

sortbyname.sh in=mapped.sam out=sorted.sam position=t

Sort mapped reads by their genomic position.

Sort by read length (longest first)

sortbyname.sh in=reads.fq out=sorted.fq length=t ascending=f

Sort reads by length in descending order, placing longest reads first.

Sort with memory constraints

sortbyname.sh in=large.fq out=sorted.fq memmult=0.2 memlimit=0.5

Sort with conservative memory usage, creating temp files when memory usage exceeds 20% of available RAM.

Sort paired-end reads

sortbyname.sh in1=reads_1.fq in2=reads_2.fq out1=sorted_1.fq out2=sorted_2.fq

Sort paired-end reads while maintaining proper pairing.

Algorithm Details

SortByName implements an external merge sort algorithm using ArrayList storage with dual-threshold memory monitoring to handle datasets larger than available memory:

Memory Management Strategy

The algorithm uses Shared.memAvailable() to monitor memory consumption and employs AtomicLong for thread-safe memory tracking:

Block Threshold (memmult=0.30): When memory usage exceeds this fraction of available memory, current reads are sorted using Shared.sort() and written to a temporary file via WriteThread
Total Threshold (memlimit=0.65): The program calls waitOnMemory() to block until background write operations complete and memory drops below this threshold

Sorting Implementation

The tool uses static ReadComparator instances from the sort package for different sorting modes:

Name sorting: ReadComparatorName.comparator with string comparison and pair number tiebreaker
Length sorting: ReadLengthComparator.comparator with descending length ordering and mate length secondary sort
Quality sorting: ReadQualityComparator.comparator using expected error rate calculation from quality scores
Position sorting: ReadComparatorPosition.comparator using ScafMap.loadSamHeader() for coordinate mapping
Sequence sorting: ReadComparatorTopological.comparator with base-by-base lexicographic comparison
Clump sorting: ReadComparatorClump.comparator using 31-base kmer hashing via ReadComparatorClump.set()
Taxonomy sorting: ReadComparatorTaxa.comparator with TaxTree hierarchy traversal

External Merge Process

When memory is exceeded, the algorithm creates File.createTempFile() temporary files that are later merged using java.util.PriorityQueue:

Each temporary file is opened with a CrisContainer wrapper that fetches ArrayList<Read> batches
A PriorityQueue<CrisContainer> manages merge order using the selected ReadComparator instance
The merge process buffers 100,000 reads maximum, sorts with Shared.sort(), then outputs sorted batches
Recursive merging via mergeRecursive() is used when file count exceeds maxMergeFiles() calculated from available memory

Performance Characteristics

Memory scaling: Uses external merge sort with configurable thresholds (default 30%/65% of available RAM)
I/O implementation: ConcurrentReadInputStream/ConcurrentReadOutputStream with buffer sizes up to 200 reads per batch
Concurrent processing: WriteThread background processing using AtomicLong synchronization and outstandingMem.wait() blocking
Format handling: FileFormat.testInput() detection preserves FASTA/FASTQ/SAM compression and structure through temporary files

Special Modes

Quality sorting note: When sorting by quality, the algorithm actually sorts by average expected error rate rather than raw quality scores, meaning ascending=t will place the highest-quality reads first.

Paired-end handling: When processing paired reads, both mates are kept together and moved as a unit during sorting, ensuring proper pairing is maintained in the output.

Random shuffling: The shuffle mode calls ListNum.setDeterministicRandomSeed(-1) and ListNum.setDeterministicRandom(true) for reproducible pseudorandom ordering via ReadComparatorRandom.comparator.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org