MergeSorted

Basic Usage

mergesorted.sh sort_temp* out=<file>

Input may be fasta, fastq, or sam, compressed or uncompressed.

Parameters

Parameters are organized by their function in the sorting and merging process.

Input/Output Parameters

in=<file,file,...>: Input files. Files may be specified without in=.
out=<file>: Output file.
delete=t: Delete input files after merging. Default: true

Sorting Parameters

name=t: Sort reads by name. Default: true
length=f: Sort reads by length. Default: false
quality=f: Sort reads by quality. Default: false
position=f: Sort reads by position (for mapped reads). Default: false
taxa=f: Sort reads by taxonomy (for NCBI naming convention). Default: false
sequence=f: Sort reads by sequence, alphabetically. Default: false
flowcell=f: Sort reads by flowcell coordinates. Default: false
shuffle=f: Shuffle reads randomly (untested). Default: false
list=<file>: Sort reads according to this list of names.
ascending=t: Sort ascending. Default: true

Memory Management Parameters

memmult=.35: Write a temp file when used memory drops below this fraction of total memory. Default: 0.35

Taxonomy-sorting Parameters

tree=: Specify a taxtree file. On Genepool, use 'auto'.
gi=: Specify a gitable file. On Genepool, use 'auto'.
accession=: Specify one or more comma-delimited NCBI accession-to-taxid files. On Genepool, use 'auto'.

Java Parameters

-Xmx: Sets Java's memory usage, overriding autodetection. For example, -Xmx20g specifies 20 GB of RAM and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Merge Sorted Temp Files

mergesorted.sh sort_temp001.fq.gz sort_temp002.fq.gz sort_temp003.fq.gz out=merged_sorted.fq.gz

Merges multiple sorted temporary files produced by SortByName into a single output file, deleting the temporary files after merging.

Merge Without Deleting Input Files

mergesorted.sh in=temp1.fq,temp2.fq,temp3.fq out=merged.fq delete=f

Merges sorted files while preserving the original input files.

Sort by Length During Merge

mergesorted.sh sort_temp*.fq out=merged_by_length.fq length=t name=f

Merges by read length instead of read name. The input files must already be sorted by length.

Sort by Quality

mergesorted.sh in=reads1.fq,reads2.fq out=quality_sorted.fq quality=t name=f

Merges reads by quality score. The input files must already be sorted by quality.

Algorithm Details

MergeSorted is specifically designed to handle the situation where SortByName ran out of time during the merging phase, leaving multiple sorted temporary files that need to be combined.
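Combining several already-sorted files is a k-way merge. The sketch below illustrates the general technique with in-memory lists and a priority queue. It is not the BBTools implementation: the class KWayMergeSketch and its merge method are hypothetical names introduced only for this example, and the real tool streams reads from files rather than holding them in lists.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration of a k-way merge; not part of BBTools. */
final class KWayMergeSketch {

    /** One heap entry: the current head of a sorted run plus the rest of that run. */
    private static final class Head<T> {
        final T value;
        final Iterator<T> rest;

        Head(T value, Iterator<T> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    /** Merges runs that are each already sorted by cmp into one sorted list. */
    static <T> List<T> merge(List<List<T>> sortedRuns, Comparator<? super T> cmp) {
        // The heap always exposes the smallest remaining head across all runs.
        PriorityQueue<Head<T>> heap =
                new PriorityQueue<>((a, b) -> cmp.compare(a.value, b.value));
        for (List<T> run : sortedRuns) {
            Iterator<T> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head<>(it.next(), it));
            }
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head<T> h = heap.poll();
            out.add(h.value);
            // Refill the heap from the run that just supplied an element.
            if (h.rest.hasNext()) {
                heap.add(new Head<>(h.rest.next(), h.rest));
            }
        }
        return out;
    }
}
```

The merge only produces sorted output if every run is sorted by the same comparator that is passed in, which is why the input files must already be in the requested order.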

Recursive Merge Implementation

The tool implements mergeRecursive() with specific file handling limits:

Maximum Files Limit: Merges at most 16 files simultaneously (configurable via maxFiles parameter) to avoid exceeding file handle limits
Batch Grouping: When the file count exceeds maxFiles, splits the files into (size+maxFiles-1)/maxFiles groups, i.e. the file count divided by maxFiles, rounded up
Recursive Processing: Uses round-robin distribution with listList.get(i%groups).add() to balance file groups
Compression Management: Caps the compression level at 4 during intermediate merging via ReadWrite.ZIPLEVEL=Tools.min(ReadWrite.ZIPLEVEL, 4)
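The grouping arithmetic above can be sketched as follows. This is a minimal illustration, assuming file names are held as strings. The class MergeGroupingSketch and the method group are hypothetical; only the ceiling-division formula and the i%groups round-robin assignment come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of batch grouping; not the BBTools source. */
final class MergeGroupingSketch {

    /** Splits files into ceil(size/maxFiles) groups, assigning them round-robin. */
    static List<List<String>> group(List<String> files, int maxFiles) {
        if (maxFiles < 1) {
            throw new IllegalArgumentException("maxFiles must be >= 1");
        }
        int size = files.size();
        int groups = (size + maxFiles - 1) / maxFiles; // integer ceiling division
        List<List<String>> listList = new ArrayList<>();
        for (int i = 0; i < groups; i++) {
            listList.add(new ArrayList<>());
        }
        // Round-robin assignment keeps group sizes within one file of each other.
        for (int i = 0; i < size; i++) {
            listList.get(i % groups).add(files.get(i));
        }
        return listList;
    }
}
```

For example, 40 input files with maxFiles=16 give (40+15)/16 = 3 groups holding 14, 13, and 13 files. Each group stays under the limit, and the three intermediate outputs can then be merged in one final pass.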

Comparator Architecture

The tool uses a static ReadComparator comparator field with these specialized implementations:

Name Sorting: ReadComparatorName.comparator with null-safe lexicographic comparison and pair number tiebreaking
Length Sorting: ReadLengthComparator.comparator with descending length priority and mate length secondary sorting
Quality Sorting: ReadQualityComparator.comparator using expected error calculations for read and mate
Position Sorting: ReadComparatorPosition.comparator, which requires ScafMap.loadSamHeader() to load scaffold information
Taxonomy Sorting: ReadComparatorTaxa.comparator with hierarchical taxonomic node comparison using TaxTree
Topological Sorting: ReadComparatorTopological.comparator or ReadComparatorTopological5Bit.comparator with 12-mer generation when genKmer=true
Flowcell Sorting: ReadComparatorFlowcell.comparator using thread-local coordinate parsing for lane, tile, y, x coordinates
Random Shuffle: ReadComparatorRandom.comparator with deterministic seed using ListNum.setDeterministicRandomSeed(-1)
List-based Sorting: ReadComparatorList(b) constructor with file-based or comma-separated ordering
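To make three of the orderings above concrete, here is a rough sketch using the standard java.util.Comparator API. It does not reproduce the BBTools comparators. SimpleRead is a hypothetical stand-in for the real Read class, and the choices to place null names first and to put reads with fewer expected errors first are assumptions. The expected-error figure uses the standard Phred relation, summing 10^(-Q/10) over all bases.

```java
import java.util.Comparator;

/** Hypothetical comparator sketches; not the BBTools classes. */
final class ComparatorSketch {

    /** Minimal stand-in for a read (assumption, not the BBTools Read class). */
    static final class SimpleRead {
        final String name;   // may be null
        final int pairNum;   // 0 for read 1, 1 for read 2
        final String bases;
        final byte[] quals;  // Phred scores, one per base

        SimpleRead(String name, int pairNum, String bases, byte[] quals) {
            this.name = name;
            this.pairNum = pairNum;
            this.bases = bases;
            this.quals = quals;
        }
    }

    /** Null-safe lexicographic name order; ties broken by pair number. */
    static final Comparator<SimpleRead> BY_NAME =
            Comparator.comparing((SimpleRead r) -> r.name,
                            Comparator.nullsFirst(Comparator.<String>naturalOrder()))
                    .thenComparingInt(r -> r.pairNum);

    /** Longer reads sort before shorter reads. */
    static final Comparator<SimpleRead> BY_LENGTH_DESC =
            Comparator.comparingInt((SimpleRead r) -> r.bases.length()).reversed();

    /** Expected number of errors: the sum of 10^(-Q/10) over all bases. */
    static double expectedErrors(byte[] quals) {
        double sum = 0;
        for (byte q : quals) {
            sum += Math.pow(10.0, -q / 10.0);
        }
        return sum;
    }

    /** Reads with fewer expected errors sort first (direction is an assumption). */
    static final Comparator<SimpleRead> BY_QUALITY =
            Comparator.comparingDouble((SimpleRead r) -> expectedErrors(r.quals));
}
```

The sketch omits the mate-based secondary keys mentioned above for length and quality sorting. They could be chained with thenComparing in the same way as the pair-number tiebreak.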

Memory Management Implementation

Memory handling uses specific Java classes and methods:

Thread Management: Uses ReadWrite.setZipThreads(Shared.threads()) and ByteFile.FORCE_MODE_BF2=true for multi-threaded I/O when threads > 2
Memory Monitoring: The memmult parameter (default 0.35) controls temporary file writing based on memory fraction
File Creation: Uses File.createTempFile("sort_temp_", tempExt, dir) with automatic extension detection via Tools.getTempExt()
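The memory check and temp-file creation can be approximated with standard JDK calls. This sketch rests on assumptions: it reads memmult as a threshold on the fraction of heap still available, which is one plausible interpretation of the parameter description, and the class MemorySketch and its methods are hypothetical. Only the File.createTempFile call with the "sort_temp_" prefix is taken from the text above.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical illustration of memory-triggered temp files; not the BBTools source. */
final class MemorySketch {

    /** Fraction of the maximum heap that the JVM has not yet used. */
    static double freeFraction() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        long max = rt.maxMemory();
        return (max - used) / (double) max;
    }

    /** Assumed rule: write a temp file once the free fraction falls below memMult. */
    static boolean shouldSpill(double freeFraction, double memMult) {
        return freeFraction < memMult;
    }

    /** Creates a uniquely named temp file; a null dir means the default temp directory. */
    static File newTempFile(String tempExt, File dir) throws IOException {
        return File.createTempFile("sort_temp_", tempExt, dir);
    }
}
```

With the default memmult of 0.35, shouldSpill(0.30, 0.35) is true and shouldSpill(0.50, 0.35) is false under this reading.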

File Format Processing

Format handling through specific FileFormat and stream classes:

Format Detection: Uses FileFormat.testInput() and FileFormat.testOutput() with automatic FASTQ default
Stream Integration: Delegates to SortByName.mergeAndDump() with parameters: inList, ff1, ff2, delete, useSharedHeader, allowInputSubprocess
Compression Support: Enables PIGZ via ReadWrite.USE_PIGZ=ReadWrite.USE_UNPIGZ=true

Taxonomy Database Integration

Taxonomy sorting requires specific database initialization:

GI Table Loading: GiToTaxid.initialize(giTableFile) for GI to taxonomy ID mapping
Accession Loading: AccessionToTaxid.load(accessionFile) for NCBI accession processing
Tax Tree Loading: TaxTree.loadTaxTree(taxTreeFile) with nameMap verification
Auto Configuration: Uses TaxTree.defaultTreeFile(), TaxTree.defaultTableFile(), and TaxTree.defaultAccessionFile() when "auto" specified

Notes

This tool is intended specifically for merging temporary files produced by SortByName when that program ran out of time during merging
Input files are assumed to already be sorted according to the specified criteria
The shuffle option is marked as untested and should be used with caution
For taxonomy-based sorting, appropriate taxonomy databases (tree, gi table, accession files) must be available
Position-based sorting requires mapped reads in SAM format with proper headers
The tool merges at most maxFiles files at a time (16 by default) and merges recursively when there are more, so large numbers of input files can be handled without exceeding file handle limits

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org