SortByName

Script: sortbyname.sh Package: sort Class: SortByName.java

Sorts reads by name or other keys such as length, quality, mapping position, flowcell coordinates, or taxonomy. Writes temp files if memory is exceeded.

Basic Usage

sortbyname.sh in=<file> out=<file>

Input may be fasta, fastq, or sam, compressed or uncompressed. Temp files will use the same format as the output. Pairs are kept together if reads are paired, and in2/out2 may be used for paired-end files.

Parameters

Parameters are organized into functional groups for different sorting modes and memory management. Note that name, length, and quality are mutually exclusive sorting criteria.

Input/Output Parameters

in=<file>
Input file. Can be fasta, fastq, or sam format, compressed or uncompressed.
out=<file>
Output file. Will use the same format as input unless specified otherwise.
in2=<file>
Secondary input file for paired-end reads (if applicable).
out2=<file>
Secondary output file for paired-end reads (if applicable).
overwrite=t
Overwrite existing output files.
append=f
Append to existing output files instead of overwriting.
extin=<extension>
Override input file extension detection.
extout=<extension>
Override output file extension.

Sorting Mode Parameters

name=t
Sort reads by name (default mode). Uses lexicographic ordering of read names.
length=f
Sort reads by sequence length. Longer reads will be placed first (descending) or last (ascending) based on the ascending parameter.
quality=f
Sort reads by quality score. Actually sorts by average expected error rate, so ascending=t places highest-quality reads first.
position=f
Sort reads by mapping position (for mapped reads in SAM format). Requires coordinate information in the input.
taxa=f
Sort reads by taxonomy (for NCBI naming convention). Requires taxonomy files to be specified.
sequence=f
Sort reads by sequence content alphabetically. Uses lexicographic ordering of actual DNA sequence.
clump=f
Sort reads by shared kmers, similar to Clumpify. Groups reads with similar kmer content together.
flowcell=f
Sort reads by flowcell coordinates. Extracts coordinate information from Illumina read headers.
shuffle=f
Shuffle reads randomly (untested). Uses deterministic random seed for reproducible results.
list=<file>
Sort reads according to the order specified in this list of names. Reads not in the list are placed at the end.
ascending=t
Sort in ascending order. Set to false for descending order.

Memory Parameters

memmult=0.30
Write a temporary file when used memory exceeds this fraction of available memory. Reduces this value if experiencing crashes.
memlimit=0.65
Wait for temporary files to finish writing until used memory drops below this fraction of available memory.
delete=t
Delete temporary files after merging. Set to false to preserve temp files for debugging.
allowtemp=t
Allow writing temporary files when memory is exceeded. If false, the program will crash rather than create temp files.
maxfiles=12
Maximum number of temporary files to merge per pass. Higher values use more memory but may be faster.

Taxonomy-sorting Parameters (for taxa mode only)

tree=<file>
Specify a taxtree file. On Genepool systems, use 'auto' to automatically locate the default taxonomy tree.
gi=<file>
Specify a gitable file for GI number to taxonomy mapping. On Genepool systems, use 'auto' to automatically locate the default GI table.
accession=<file>
Specify one or more comma-delimited NCBI accession to taxid files. On Genepool systems, use 'auto' to automatically locate the default accession files.

Processing Control Parameters

maxreads=-1
Process at most this many reads. -1 means no limit.
minlen=0
Discard reads shorter than this length before sorting.
interleaved=auto
Set interleaved mode explicitly. Auto-detected based on input format.
verbose=f
Print verbose status messages during processing.

Java Parameters

-Xmx<size>
Set Java's maximum memory usage, overriding autodetection. Examples: -Xmx20g (20 GB), -Xmx2000m (2 GB). Maximum is typically 85% of physical memory.
-eoom
Exit if an out-of-memory exception occurs. Requires Java 8u92 or later.
-da
Disable Java assertions for better performance.

Examples

Basic name sorting

sortbyname.sh in=raw.fq out=sorted.fq

Sort reads by name (default mode) and write to sorted file.

Sort by sequence content

sortbyname.sh in=raw.fq out=sorted.fq sequence=t

Sort reads alphabetically by DNA sequence content.

Sort mapped reads by position

sortbyname.sh in=mapped.sam out=sorted.sam position=t

Sort mapped reads by their genomic position.

Sort by read length (longest first)

sortbyname.sh in=reads.fq out=sorted.fq length=t ascending=f

Sort reads by length in descending order, placing longest reads first.

Sort with memory constraints

sortbyname.sh in=large.fq out=sorted.fq memmult=0.2 memlimit=0.5

Sort with conservative memory usage, creating temp files when memory usage exceeds 20% of available RAM.

Sort paired-end reads

sortbyname.sh in1=reads_1.fq in2=reads_2.fq out1=sorted_1.fq out2=sorted_2.fq

Sort paired-end reads while maintaining proper pairing.

Algorithm Details

SortByName implements an external merge sort algorithm using ArrayList storage with dual-threshold memory monitoring to handle datasets larger than available memory:

Memory Management Strategy

The algorithm uses Shared.memAvailable() to monitor memory consumption and employs AtomicLong for thread-safe memory tracking:

Sorting Implementation

The tool uses static ReadComparator instances from the sort package for different sorting modes:

External Merge Process

When memory is exceeded, the algorithm creates File.createTempFile() temporary files that are later merged using java.util.PriorityQueue:

Performance Characteristics

Special Modes

Quality sorting note: When sorting by quality, the algorithm actually sorts by average expected error rate rather than raw quality scores, meaning ascending=t will place the highest-quality reads first.

Paired-end handling: When processing paired reads, both mates are kept together and moved as a unit during sorting, ensuring proper pairing is maintained in the output.

Random shuffling: The shuffle mode calls ListNum.setDeterministicRandomSeed(-1) and ListNum.setDeterministicRandom(true) for reproducible pseudorandom ordering via ReadComparatorRandom.comparator.

Support

For questions and support: