Shuffle

Script: shuffle.sh
Package: sort
Class: Shuffle.java

Reorders reads randomly, keeping pairs together. Also supports sorting reads by name, coordinate position, sequence, or read ID.

Basic Usage

shuffle.sh in=<file> out=<file>

Shuffle takes sequence files and randomly reorders the reads while maintaining pairing information for paired-end data.

Parameters

Parameters are grouped by function, following the organization used in the shell script.

Standard parameters

in=<file>
The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard in. Accepts fasta, fastq, or sam formats, compressed or uncompressed.
in2=<file>
Use this if the second reads of pairs are in a separate file. When specified, forces paired-end mode and disables interleaved input detection.
out=<file>
The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard out. Output format matches input format.
out2=<file>
Use this to write the second read of each pair to a separate file. When specified with only one input file, forces interleaved input mode.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true
ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Only applies to compressed output formats.
int=auto
(interleaved) Set to t or f to override interleaving autodetection. Auto mode detects based on file structure and paired-end status.

Processing parameters

shuffle
Randomly reorders reads (default). Uses Collections.shuffle(), which performs a Fisher-Yates shuffle and yields a uniformly random permutation. Pairs stay together because each Read object holds a reference to its mate, so a pair moves as a single unit.
name
Sort reads by name (read header). ReadComparatorName.compareInner() compares IDs lexicographically with r1.id.compareTo(r2.id); null IDs sort first, and ties are broken by r1.pairnum()-r2.pairnum().
coordinate
Sort reads by mapping location. ReadComparatorMapping first prioritizes reads by mapped/unmapped status, then compares chromosome, strand, position, and mapping quality. Requires mapping information in SAM format.
sequence
Sort reads by sequence content. ReadComparatorTopological.compareVectors() compares bases byte by byte up to Tools.min(a.length, b.length). Ties are then broken, in order, by mate sequence, length difference, inverted quality scores, numericID, and string ID.
id
Sort reads by read ID (an additional option present in the Java source). ReadComparatorID.compareInner() compares r1.numericID with r2.numericID first, then r1.pairnum() with r2.pairnum(), and finally falls back to the string comparison r1.id.compareTo(r2.id).
mode=<option>
Alternative way to specify processing mode. Options: shuffle, name, coordinate, sequence, id. Equivalent to using the individual flags above.
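
The pair-preserving behavior of the default shuffle mode can be illustrated with a minimal sketch. It assumes the list holds one element per pair, with the second read reachable through a mate reference. SketchRead, makePairs, shufflePairs, and the seed argument are hypothetical names used only for illustration; they are not taken from Shuffle.java.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; SketchRead is a hypothetical stand-in for the real Read class. */
class PairShuffleSketch {

    /** Minimal read: an ID plus an optional reference to its mate. */
    static final class SketchRead {
        final String id;
        SketchRead mate; // null for unpaired reads

        SketchRead(String id) {
            this.id = id;
        }
    }

    /** Builds n pairs; the list holds only read 1 of each pair, which points at read 2. */
    static List<SketchRead> makePairs(int n) {
        List<SketchRead> list = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            SketchRead r1 = new SketchRead("read" + i + "/1");
            r1.mate = new SketchRead("read" + i + "/2");
            list.add(r1);
        }
        return list;
    }

    /** Shuffles list elements; mates are never touched, so they travel with read 1. */
    static void shufflePairs(List<SketchRead> list, long seed) {
        // The fixed seed is an assumption made here for reproducibility.
        Collections.shuffle(list, new Random(seed));
    }

    public static void main(String[] args) {
        List<SketchRead> reads = makePairs(4);
        shufflePairs(reads, 42L);
        for (SketchRead r : reads) {
            System.out.println(r.id + "\t" + r.mate.id);
        }
    }
}
```

Because only the list elements are permuted, every printed line still shows read 1 next to its own mate, whatever order the pairs end up in.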

Performance and Debugging

verbose=f
Enable verbose output for debugging. Shows detailed information about file processing, threading, and internal operations.
showspeed=t
(ss) Display processing speed statistics including reads per second and bases per second. Default: true

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 2GB
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions. Can improve performance slightly in production environments.

Examples

Basic Shuffling

shuffle.sh in=reads.fq out=shuffled.fq

Randomly reorders all reads in the input file while maintaining read pairing.

Paired-End Shuffling

shuffle.sh in1=reads_R1.fq in2=reads_R2.fq out1=shuffled_R1.fq out2=shuffled_R2.fq

Shuffles paired-end reads while keeping mate pairs together.

Sort by Name

shuffle.sh in=reads.fq out=sorted.fq name

Sorts reads lexicographically by read name instead of shuffling.

Sort by Sequence

shuffle.sh in=reads.fq out=sorted.fq sequence

Sorts reads by their DNA sequence content using topological ordering.
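
As a rough illustration of the first step of sequence ordering, the sketch below compares bases byte by byte over the shared prefix and then falls back to length. It is a simplification: the mate, quality, and ID tiebreakers are omitted, and placing the shorter sequence first is an assumption. SequenceOrderSketch, compareBases, and sortSequences are hypothetical names, not part of the tool.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the tool's ReadComparatorTopological. */
class SequenceOrderSketch {

    /** Compares bases over the shared prefix, then by length (shorter first, by assumption). */
    static int compareBases(byte[] a, byte[] b) {
        int lim = Math.min(a.length, b.length);
        for (int i = 0; i < lim; i++) {
            if (a[i] != b[i]) {
                return a[i] - b[i];
            }
        }
        return a.length - b.length;
    }

    /** Returns a sorted copy of the given sequences. */
    static List<String> sortSequences(List<String> seqs) {
        List<byte[]> bytes = new ArrayList<>();
        for (String s : seqs) {
            bytes.add(s.getBytes(StandardCharsets.US_ASCII));
        }
        bytes.sort(SequenceOrderSketch::compareBases);
        List<String> out = new ArrayList<>();
        for (byte[] b : bytes) {
            out.add(new String(b, StandardCharsets.US_ASCII));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [ACG, ACGT, GATTACA, TTTT]
        System.out.println(sortSequences(Arrays.asList("GATTACA", "ACGT", "ACG", "TTTT")));
    }
}
```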

Using Mode Parameter

shuffle.sh in=reads.fq out=sorted.fq mode=coordinate

Alternative syntax for specifying the sort mode. This example sorts by genomic coordinates, which requires mapping data in the input.

High Memory Usage

shuffle.sh -Xmx32g in=large_dataset.fq out=shuffled.fq

Processes large datasets with a larger heap. Because all reads are held in memory at once (in the ArrayList<Read> bigList), a higher -Xmx value lets more reads fit before processing begins.

Algorithm Details

Core Processing Algorithm

Shuffle uses a load-all-then-process strategy: every read is loaded into an ArrayList<Read> (initial capacity 65530) before any shuffling or sorting takes place.
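
A minimal sketch of a load-all-then-process flow, under the assumption that reads arrive through an iterator and are reordered only after the input is exhausted. LoadAllSketch, loadAll, and reorder are hypothetical names, and plain strings stand in for Read objects; only the 65530 initial capacity comes from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; strings stand in for Read objects. */
class LoadAllSketch {

    /** Phase 1: drain the whole input into memory before any reordering. */
    static <T> ArrayList<T> loadAll(Iterator<T> input) {
        ArrayList<T> bigList = new ArrayList<>(65530); // initial capacity named in the text
        while (input.hasNext()) {
            bigList.add(input.next());
        }
        return bigList;
    }

    /** Phase 2: reorder the full list; a null comparator means "shuffle". */
    static <T> void reorder(List<T> bigList, Comparator<? super T> order, Random rng) {
        if (order == null) {
            Collections.shuffle(bigList, rng);
        } else {
            bigList.sort(order);
        }
    }

    public static void main(String[] args) {
        Iterator<String> input = Arrays.asList("r3", "r1", "r2").iterator();
        ArrayList<String> bigList = loadAll(input);
        reorder(bigList, Comparator.<String>naturalOrder(), new Random(1L));
        // Phase 3: write the reordered reads out (here, to standard out).
        for (String r : bigList) {
            System.out.println(r);
        }
    }
}
```

The trade-off of this design is that memory use grows with input size, which is why the -Xmx setting matters for large files.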

Sorting Algorithms

Different sorting modes use specialized comparators, each with its own comparison hierarchy.
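
The name and ID hierarchies described under Processing parameters can be sketched as two comparators. This is an approximation built from that description, not the code of ReadComparatorName or ReadComparatorID. SketchRead, compareIds, BY_NAME, BY_ID, and the helper methods are hypothetical names, and the null handling in the ID fallback is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; SketchRead is a hypothetical stand-in for the real Read class. */
class ComparatorSketch {

    static final class SketchRead {
        final String id;
        final long numericID;
        final int pairnum;

        SketchRead(String id, long numericID, int pairnum) {
            this.id = id;
            this.numericID = numericID;
            this.pairnum = pairnum;
        }
    }

    /** Lexicographic ID comparison in which null IDs sort first. */
    static int compareIds(String a, String b) {
        if (a == null) {
            return b == null ? 0 : -1;
        }
        if (b == null) {
            return 1;
        }
        return a.compareTo(b);
    }

    /** Name mode: string ID first, then pair number. */
    static final Comparator<SketchRead> BY_NAME = (r1, r2) -> {
        int x = compareIds(r1.id, r2.id);
        return x != 0 ? x : r1.pairnum - r2.pairnum;
    };

    /** ID mode: numeric ID first, then pair number, then string ID. */
    static final Comparator<SketchRead> BY_ID = (r1, r2) -> {
        int x = Long.compare(r1.numericID, r2.numericID);
        if (x != 0) {
            return x;
        }
        x = r1.pairnum - r2.pairnum;
        return x != 0 ? x : compareIds(r1.id, r2.id);
    };

    /** Sorts a copy of the reads and returns their numeric IDs in sorted order. */
    static List<Long> numericIdsSortedBy(List<SketchRead> reads, Comparator<SketchRead> order) {
        List<SketchRead> copy = new ArrayList<>(reads);
        copy.sort(order);
        List<Long> out = new ArrayList<>();
        for (SketchRead r : copy) {
            out.add(r.numericID);
        }
        return out;
    }

    static List<SketchRead> demoReads() {
        return Arrays.asList(
            new SketchRead("b", 0, 0),
            new SketchRead(null, 1, 0),
            new SketchRead("a", 2, 1),
            new SketchRead("a", 3, 0));
    }

    public static void main(String[] args) {
        System.out.println(numericIdsSortedBy(demoReads(), BY_NAME)); // prints [1, 3, 2, 0]
        System.out.println(numericIdsSortedBy(demoReads(), BY_ID));   // prints [0, 1, 2, 3]
    }
}
```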

Threading and Performance

Shuffle includes a synchronized threading management system for batch processing.

Interleaving Detection

Automatic detection of paired-end file organization uses conditional logic based on the input and output file configuration.
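
The rules stated for in2, out2, and int under Standard parameters can be combined into a small decision function. This is a sketch of one plausible reading, not the tool's actual logic: InterleaveSketch, Mode, and resolve are hypothetical names, and the precedence between an explicit int=t/f and out2 is an assumption.

```java
/** Illustrative sketch only; the precedence of the rules is assumed, not confirmed. */
class InterleaveSketch {

    enum Mode { INTERLEAVED, NOT_INTERLEAVED, AUTODETECT }

    /**
     * Resolves how a single input file should be interpreted.
     * in2 and out2 are null when the flag is absent; intFlag is "t", "f", or "auto".
     */
    static Mode resolve(String in2, String out2, String intFlag) {
        if (in2 != null) {
            return Mode.NOT_INTERLEAVED; // pairs come from two files, so detection is disabled
        }
        if ("t".equals(intFlag)) {
            return Mode.INTERLEAVED;     // explicit override
        }
        if ("f".equals(intFlag)) {
            return Mode.NOT_INTERLEAVED; // explicit override
        }
        if (out2 != null) {
            return Mode.INTERLEAVED;     // one input, two outputs: input must hold both mates
        }
        return Mode.AUTODETECT;          // fall back to inspecting the file
    }

    public static void main(String[] args) {
        System.out.println(resolve("reads_R2.fq", "shuffled_R2.fq", "auto")); // prints NOT_INTERLEAVED
        System.out.println(resolve(null, "shuffled_R2.fq", "auto"));          // prints INTERLEAVED
        System.out.println(resolve(null, null, "auto"));                      // prints AUTODETECT
    }
}
```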

Support

For questions and support: