MergeSorted
Sorts reads by name or other keys such as length, quality, mapping position, flowcell coordinates, or taxonomy. Intended to merge temp files produced by SortByName if the program ran out of time during merging.
Basic Usage
mergesorted.sh sort_temp* out=<file>
Input may be fasta, fastq, or sam, compressed or uncompressed.
Parameters
Parameters are organized by their function in the sorting and merging process.
Input/Output Parameters
- in=<file,file,...>
- Input files. Files may be specified without in=.
- out=<file>
- Output file.
- delete=t
- Delete input files after merging. Default: true
Sorting Parameters
- name=t
- Sort reads by name. Default: true
- length=f
- Sort reads by length. Default: false
- quality=f
- Sort reads by quality. Default: false
- position=f
- Sort reads by position (for mapped reads). Default: false
- taxa=f
- Sort reads by taxonomy (for NCBI naming convention). Default: false
- sequence=f
- Sort reads by sequence, alphabetically. Default: false
- flowcell=f
- Sort reads by flowcell coordinates. Default: false
- shuffle=f
- Shuffle reads randomly (untested). Default: false
- list=<file>
- Sort reads according to this list of names.
- ascending=t
- Sort ascending. Default: true
Memory Management Parameters
- memmult=.35
- Write a temp file when used memory drops below this fraction of total memory. Default: 0.35
Taxonomy-sorting Parameters
- tree=
- Specify a taxtree file. On Genepool, use 'auto'.
- gi=
- Specify a gitable file. On Genepool, use 'auto'.
- accession=
- Specify one or more comma-delimited NCBI accession to taxid files. On Genepool, use 'auto'.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Merge Sorted Temp Files
mergesorted.sh sort_temp001.fq.gz sort_temp002.fq.gz sort_temp003.fq.gz out=merged_sorted.fq.gz
Merges multiple sorted temporary files produced by SortByName into a single output file, deleting the temporary files after merging.
Merge Without Deleting Input Files
mergesorted.sh in=temp1.fq,temp2.fq,temp3.fq out=merged.fq delete=f
Merges sorted files while preserving the original input files.
Sort by Length During Merge
mergesorted.sh sort_temp*.fq out=merged_by_length.fq length=t name=f
Changes the sorting criterion from read name to read length during the merge operation.
Sort by Quality
mergesorted.sh in=reads1.fq,reads2.fq out=quality_sorted.fq quality=t name=f
Sorts reads by quality score during the merge process.
Algorithm Details
MergeSorted is specifically designed to handle the situation where SortByName ran out of time during the merging phase, leaving multiple sorted temporary files that need to be combined.
Recursive Merge Implementation
The tool implements mergeRecursive() with specific file handling limits:
- Maximum Files Limit: Merges at most 16 files simultaneously (configurable via maxFiles parameter) to avoid exceeding file handle limits
- Batch Grouping: When the file count exceeds maxFiles, creates groups using the (size+maxFiles-1)/maxFiles calculation
- Recursive Processing: Uses round-robin distribution with listList.get(i%groups).add() to balance file groups
- Compression Management: Reduces the compression level to a maximum of 4 during intermediate merging via ReadWrite.ZIPLEVEL=Tools.min(ReadWrite.ZIPLEVEL, 4)
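The grouping and recursion described above can be illustrated with a minimal Python sketch. It is not the tool's Java code: in-memory sorted lists stand in for the sorted temp files, and the function names group_round_robin and merge_recursive are hypothetical. Only the limit of 16, the (size+maxFiles-1)/maxFiles group count, and the i%groups round-robin assignment come from the description above.

```python
import heapq

MAX_FILES = 16  # mirrors the documented default limit


def group_round_robin(items, max_files=MAX_FILES):
    """Deal items into ceil(len(items)/max_files) groups, round-robin."""
    groups = (len(items) + max_files - 1) // max_files
    buckets = [[] for _ in range(groups)]
    for i, item in enumerate(items):
        buckets[i % groups].append(item)
    return buckets


def merge_recursive(sorted_runs, max_files=MAX_FILES, key=None):
    """Merge sorted lists without combining more than max_files at once.

    Lists stand in for sorted temp files (an assumption of this sketch).
    """
    runs = list(sorted_runs)
    # Each pass merges every group into one intermediate run, shrinking
    # the run count until a single final merge is within the limit.
    while len(runs) > max_files:
        runs = [list(heapq.merge(*group, key=key))
                for group in group_round_robin(runs, max_files)]
    return list(heapq.merge(*runs, key=key))
```

With 40 inputs, for example, the group count is (40+16-1)/16 = 3 in integer arithmetic, so the first pass produces three intermediate runs of 14, 13, and 13 inputs, and a second pass merges those three.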
Comparator Architecture
The tool uses a static ReadComparator comparator field with these specialized implementations:
- Name Sorting: ReadComparatorName.comparator with null-safe lexicographic comparison and pair number tiebreaking
- Length Sorting: ReadLengthComparator.comparator with descending length priority and mate length secondary sorting
- Quality Sorting: ReadQualityComparator.comparator using expected error calculations for read and mate
- Position Sorting: ReadComparatorPosition.comparator; requires ScafMap.loadSamHeader() for scaffold information
- Taxonomy Sorting: ReadComparatorTaxa.comparator with hierarchical taxonomic node comparison using TaxTree
- Topological Sorting: ReadComparatorTopological.comparator or ReadComparatorTopological5Bit.comparator with 12-mer k-mer generation when genKmer=true
- Flowcell Sorting: ReadComparatorFlowcell.comparator using thread-local coordinate parsing for lane, tile, y, x coordinates
- Random Shuffle: ReadComparatorRandom.comparator with deterministic seed using ListNum.setDeterministicRandomSeed(-1)
- List-based Sorting: ReadComparatorList(b) constructor with file-based or comma-separated ordering
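As a rough illustration of three of these orderings, the Python sketch below expresses them as sort keys. The Read record and the function names are hypothetical, mate handling is omitted, and placing reads with a missing name first is an assumption of this sketch. The expected-error formula is the standard Phred conversion, where a quality score q corresponds to an error probability of 10^(-q/10).

```python
from collections import namedtuple

# Hypothetical minimal read record; pairnum is 0 for read 1, 1 for read 2.
Read = namedtuple("Read", "name bases quals pairnum")


def name_key(read):
    """Lexicographic by name, tolerating a missing name, then pair number.

    Sorting missing names first is an assumption, not documented behavior.
    """
    return (read.name is not None, read.name or "", read.pairnum)


def length_key(read):
    """Longer reads first: negating the length gives descending order."""
    return -len(read.bases)


def expected_errors(quals):
    """Sum of per-base error probabilities for a list of Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)
```

Any of these can be passed as the key argument to sorted() or to a k-way merge, so a single merge routine can serve every sort mode.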
Memory Management Implementation
Memory handling uses specific Java classes and methods:
- Thread Management: Uses ReadWrite.setZipThreads(Shared.threads()) and ByteFile.FORCE_MODE_BF2=true for multi-threaded I/O when threads > 2
- Memory Monitoring: The memmult parameter (default 0.35) controls temporary file writing based on memory fraction
- File Creation: Uses File.createTempFile("sort_temp_", tempExt, dir) with automatic extension detection via Tools.getTempExt()
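The temp-file naming can be mimicked in Python with tempfile.mkstemp, which, like Java's File.createTempFile, inserts a unique string between a fixed prefix and suffix. The sketch below is only an analogy: the function name make_sort_temp and the default extension are assumptions, and it does not reproduce the extension detection performed by Tools.getTempExt().

```python
import os
import tempfile


def make_sort_temp(ext=".fq.gz", directory=None):
    """Create an empty, uniquely named sort_temp_* file and return its path.

    The default extension is an assumption made for this sketch.
    """
    fd, path = tempfile.mkstemp(prefix="sort_temp_", suffix=ext, dir=directory)
    os.close(fd)  # only the name is needed here; reopen the path to write
    return path
```

The fixed sort_temp_ prefix is what lets the basic usage pick up every temp file with the sort_temp* glob.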
File Format Processing
Format handling goes through specific FileFormat and stream classes:
- Format Detection: Uses FileFormat.testInput() and FileFormat.testOutput() with automatic FASTQ default
- Stream Integration: Delegates to SortByName.mergeAndDump() with parameters: inList, ff1, ff2, delete, useSharedHeader, allowInputSubprocess
- Compression Support: Enables PIGZ via ReadWrite.USE_PIGZ=ReadWrite.USE_UNPIGZ=true
Taxonomy Database Integration
Taxonomy sorting requires specific database initialization:
- GI Table Loading: GiToTaxid.initialize(giTableFile) for GI to taxonomy ID mapping
- Accession Loading: AccessionToTaxid.load(accessionFile) for NCBI accession processing
- Tax Tree Loading: TaxTree.loadTaxTree(taxTreeFile) with nameMap verification
- Auto Configuration: Uses TaxTree.defaultTreeFile(), TaxTree.defaultTableFile(), and TaxTree.defaultAccessionFile() when "auto" is specified
Notes
- This tool is intended specifically for merging temporary files produced by SortByName when that program ran out of time during merging
- Input files are assumed to already be sorted according to the specified criteria
- The shuffle option is marked as untested and should be used with caution
- For taxonomy-based sorting, appropriate taxonomy databases (tree, gi table, accession files) must be available
- Position-based sorting requires mapped reads in SAM format with proper headers
- The tool automatically manages file handles and memory to handle large numbers of input files efficiently
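Because the inputs are assumed to be sorted already, it can be worth checking that assumption before merging. The Python sketch below is one possible check for name-sorted, uncompressed FASTQ with four lines per record. The helper names are hypothetical, and plain string comparison of the full header line is an assumption that may not match the tool's own name comparator in every case.

```python
def fastq_names(lines):
    """Yield the header of each FASTQ record, without the leading '@'.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield line.rstrip("\n")[1:]


def is_sorted(keys):
    """Return True if every key is >= the key before it."""
    prev = None
    for k in keys:
        if prev is not None and k < prev:
            return False
        prev = k
    return True
```

For a file on disk, the check would be is_sorted(fastq_names(open(path))), where path is one of the temp files.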
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org