MergeSorted

Script: mergesorted.sh
Package: sort
Class: MergeSorted.java

Sorts reads by name or other keys such as length, quality, mapping position, flowcell coordinates, or taxonomy. Intended to merge temp files produced by SortByName if the program ran out of time during merging.

Basic Usage

mergesorted.sh sort_temp* out=<file>

Input may be fasta, fastq, or sam, compressed or uncompressed.

Parameters

Parameters are organized by their function in the sorting and merging process.

Input/Output Parameters

in=<file,file,...>
Input files. Files may be specified without in=.
out=<file>
Output file.
delete=t
Delete input files after merging. Default: true

Sorting Parameters

name=t
Sort reads by name. Default: true
length=f
Sort reads by length. Default: false
quality=f
Sort reads by quality. Default: false
position=f
Sort reads by position (for mapped reads). Default: false
taxa=f
Sort reads by taxonomy (for NCBI naming convention). Default: false
sequence=f
Sort reads by sequence, alphabetically. Default: false
flowcell=f
Sort reads by flowcell coordinates. Default: false
shuffle=f
Shuffle reads randomly (untested). Default: false
list=<file>
Sort reads according to this list of names.
ascending=t
Sort ascending. Default: true
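
To illustrate what sorting "according to this list of names" can mean, here is a minimal sketch that ranks names by their position in a reference list. The class ListOrderSketch and its methods are hypothetical and not part of BBTools. Placing names that are absent from the list after all listed names is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of list-driven ordering; not BBTools code. */
class ListOrderSketch {

    /** Builds a comparator that orders names by their position in orderedNames. */
    static Comparator<String> byListOrder(List<String> orderedNames) {
        final Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < orderedNames.size(); i++) {
            rank.putIfAbsent(orderedNames.get(i), i); // first occurrence wins
        }
        // Assumption: names missing from the list sort after all listed names.
        return Comparator.comparingInt(
                (String name) -> rank.getOrDefault(name, Integer.MAX_VALUE));
    }

    /** Returns a sorted copy; List.sort is stable, so unlisted names keep their input order. */
    static List<String> sortByList(List<String> names, List<String> orderedNames) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(byListOrder(orderedNames));
        return copy;
    }

    public static void main(String[] args) {
        List<String> order = Arrays.asList("r3", "r1", "r2");
        // prints [r3, r1, r2, rX]
        System.out.println(sortByList(Arrays.asList("r1", "r2", "r3", "rX"), order));
    }
}
```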

Memory Management Parameters

memmult=.35
Write a temp file when used memory drops below this fraction of total memory. Default: 0.35

Taxonomy-sorting Parameters

tree=
Specify a taxtree file. On Genepool, use 'auto'.
gi=
Specify a gitable file. On Genepool, use 'auto'.
accession=
Specify one or more comma-delimited NCBI accession-to-taxid files. On Genepool, use 'auto'.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Merge Sorted Temp Files

mergesorted.sh sort_temp001.fq.gz sort_temp002.fq.gz sort_temp003.fq.gz out=merged_sorted.fq.gz

Merges multiple sorted temporary files produced by SortByName into a single output file, deleting the temporary files after merging.

Merge Without Deleting Input Files

mergesorted.sh in=temp1.fq,temp2.fq,temp3.fq out=merged.fq delete=f

Merges sorted files while preserving the original input files.

Sort by Length During Merge

mergesorted.sh sort_temp*.fq out=merged_by_length.fq length=t name=f

Changes the sort key from read name to read length during the merge operation.

Sort by Quality

mergesorted.sh in=reads1.fq,reads2.fq out=quality_sorted.fq quality=t name=f

Sorts reads by quality score during the merge process.

Algorithm Details

MergeSorted is specifically designed to handle the situation where SortByName ran out of time during the merging phase, leaving multiple sorted temporary files that need to be combined.

Recursive Merge Implementation

The tool implements mergeRecursive() with specific file handling limits:
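
The general idea of a recursive merge with a cap on how many inputs are open at once can be sketched as follows. This is not MergeSorted's actual code. In-memory lists stand in for sorted temp files, and the names RecursiveMergeSketch, mergeOnce, mergeRecursiveSketch, and the maxInputs limit are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of a fan-in-limited recursive merge; not BBTools code. */
class RecursiveMergeSketch {

    /** One cursor per sorted input; the heap orders cursors by their current element. */
    private static final class Cursor<T> {
        final List<T> run;
        int pos = 0;
        Cursor(List<T> run) { this.run = run; }
        T peek() { return run.get(pos); }
    }

    /** Merges already-sorted runs in a single pass using a min-heap. */
    static <T> List<T> mergeOnce(List<List<T>> runs, Comparator<T> cmp) {
        PriorityQueue<Cursor<T>> heap =
                new PriorityQueue<>((a, b) -> cmp.compare(a.peek(), b.peek()));
        for (List<T> run : runs) {
            if (!run.isEmpty()) heap.add(new Cursor<>(run));
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor<T> c = heap.poll();
            out.add(c.run.get(c.pos++));
            if (c.pos < c.run.size()) heap.add(c); // re-insert with its next element
        }
        return out;
    }

    /** Merges in groups of at most maxInputs, recursing until one run remains. */
    static <T> List<T> mergeRecursiveSketch(List<List<T>> runs, Comparator<T> cmp, int maxInputs) {
        if (maxInputs < 2) throw new IllegalArgumentException("maxInputs must be at least 2");
        if (runs.isEmpty()) return new ArrayList<>();
        if (runs.size() <= maxInputs) return mergeOnce(runs, cmp);
        List<List<T>> next = new ArrayList<>();
        for (int i = 0; i < runs.size(); i += maxInputs) {
            int end = Math.min(i + maxInputs, runs.size());
            next.add(mergeOnce(runs.subList(i, end), cmp)); // one intermediate run per group
        }
        return mergeRecursiveSketch(next, cmp, maxInputs);
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
                Arrays.asList("a", "d"), Arrays.asList("b", "e"), Arrays.asList("c", "f"));
        // prints [a, b, c, d, e, f]
        System.out.println(mergeRecursiveSketch(runs, Comparator.<String>naturalOrder(), 2));
    }
}
```

Capping the group size keeps the number of simultaneously open inputs bounded, at the cost of extra passes over the data when there are many inputs.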

Comparator Architecture

The tool uses a static ReadComparator comparator field with these specialized implementations:
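
As a rough illustration of how one comparator can be selected per sort key and reversed when ascending=f, consider the sketch below. It does not reproduce BBTools' ReadComparator classes. The SimpleRead type, the constants, and the choose method are hypothetical, and breaking length ties by name is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of key-based comparator selection; not BBTools code. */
class ComparatorSketch {

    /** Minimal stand-in for a sequencing read. */
    static final class SimpleRead {
        final String name;
        final String bases;
        SimpleRead(String name, String bases) { this.name = name; this.bases = bases; }
        @Override public String toString() { return name + ":" + bases.length(); }
    }

    static final Comparator<SimpleRead> BY_NAME =
            Comparator.comparing((SimpleRead r) -> r.name);

    // Assumption: ties on length are broken by name so the order is deterministic.
    static final Comparator<SimpleRead> BY_LENGTH =
            Comparator.comparingInt((SimpleRead r) -> r.bases.length()).thenComparing(BY_NAME);

    static final Comparator<SimpleRead> BY_SEQUENCE =
            Comparator.comparing((SimpleRead r) -> r.bases);

    /** Picks a comparator for the given key and applies the requested direction. */
    static Comparator<SimpleRead> choose(String key, boolean ascending) {
        Comparator<SimpleRead> c;
        switch (key) {
            case "name":     c = BY_NAME; break;
            case "length":   c = BY_LENGTH; break;
            case "sequence": c = BY_SEQUENCE; break;
            default: throw new IllegalArgumentException("Unsupported key: " + key);
        }
        return ascending ? c : c.reversed();
    }

    public static void main(String[] args) {
        List<SimpleRead> reads = new ArrayList<>(Arrays.asList(
                new SimpleRead("A", "ACGT"),
                new SimpleRead("C", "AC"),
                new SimpleRead("B", "ACGTAC")));
        reads.sort(choose("length", true));
        // prints [C:2, A:4, B:6]
        System.out.println(reads);
    }
}
```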

Memory Management Implementation

Memory handling uses specific Java classes and methods:
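
For orientation, the standard java.lang.Runtime methods below show how a fraction such as memmult=0.35 translates into a byte count on the running JVM. This is a minimal sketch, not the tool's actual memory logic. The class MemorySketch and its method names are hypothetical, and how MergeSorted compares usage against the threshold is not shown here.

```java
/** Hypothetical sketch of heap inspection with java.lang.Runtime; not BBTools code. */
class MemorySketch {

    /** Heap bytes currently in use (allocated heap minus free space within it). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Upper bound on the heap, normally the value set by -Xmx. */
    static long maxBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Used heap as a fraction of the maximum heap. */
    static double usedFraction() {
        return usedBytes() / (double) maxBytes();
    }

    /** Byte count corresponding to a fraction of the maximum heap, e.g. memMult = 0.35. */
    static long thresholdBytes(double memMult) {
        return (long) (maxBytes() * memMult);
    }

    public static void main(String[] args) {
        System.out.println("max heap bytes:      " + maxBytes());
        System.out.println("used heap bytes:     " + usedBytes());
        System.out.println("used fraction:       " + usedFraction());
        System.out.println("0.35 of max (bytes): " + thresholdBytes(0.35));
    }
}
```

Running this with different -Xmx values shows how the same fraction maps to different absolute limits.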

File Format Processing

Format handling goes through specific FileFormat and stream classes:
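
Since input may be fasta, fastq, or sam, compressed or uncompressed, a tool needs some way to tell these apart. The sketch below guesses the format from the file extension. It neither uses nor reproduces BBTools' FileFormat class. The FormatSketch class and the extension lists are assumptions based on common naming conventions.

```java
import java.util.Locale;

/** Hypothetical extension-based format guess; not BBTools code. */
class FormatSketch {

    enum Format { FASTA, FASTQ, SAM, UNKNOWN }

    /** True if the path ends in a common compression extension (assumed list). */
    static boolean isCompressed(String path) {
        String p = path.toLowerCase(Locale.ROOT);
        return p.endsWith(".gz") || p.endsWith(".bz2");
    }

    /** Strips one compression extension, then matches the remaining extension. */
    static Format detect(String path) {
        String p = path.toLowerCase(Locale.ROOT);
        if (p.endsWith(".gz")) {
            p = p.substring(0, p.length() - 3);
        } else if (p.endsWith(".bz2")) {
            p = p.substring(0, p.length() - 4);
        }
        if (p.endsWith(".fq") || p.endsWith(".fastq")) return Format.FASTQ;
        if (p.endsWith(".fa") || p.endsWith(".fasta") || p.endsWith(".fna")) return Format.FASTA;
        if (p.endsWith(".sam")) return Format.SAM;
        return Format.UNKNOWN;
    }

    public static void main(String[] args) {
        // prints FASTQ true
        System.out.println(detect("sort_temp001.fq.gz") + " " + isCompressed("sort_temp001.fq.gz"));
        // prints SAM false
        System.out.println(detect("mapped.sam") + " " + isCompressed("mapped.sam"));
    }
}
```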

Taxonomy Database Integration

Taxonomy sorting requires specific database initialization:
