Shuffle2

Script: shuffle2.sh Package: sort Class: Shuffle2.java

Reorders reads randomly, keeping pairs together. Unlike Shuffle, Shuffle2 can write temp files to handle large datasets.

Basic Usage

shuffle2.sh in=<file> out=<file>

Shuffle2 randomly reorders sequencing reads while preserving paired-read relationships. It is designed to handle large datasets by using temporary files when memory becomes limited.

Parameters

Parameters are organized based on their function in the shuffling process.

Standard parameters

in=<file>
The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard in.
in2=<file>
Use this if the second reads of pairs are in a different file.
out=<file>
The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard out.
out2=<file>
Use this to write the second read of each pair to a different file.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file.
ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
int=auto
(interleaved) Set to t or f to override interleaving autodetection.

Processing parameters

shuffle
Randomly reorders reads (default).
seed=-1
Set to a positive number for deterministic shuffling. Default -1 uses a random seed for each run.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Shuffling

shuffle2.sh in=reads.fq out=shuffled.fq

Randomly shuffle reads from a single FASTQ file.

Paired-End Shuffling

shuffle2.sh in1=reads_R1.fq in2=reads_R2.fq out1=shuffled_R1.fq out2=shuffled_R2.fq

Shuffle paired-end reads while maintaining pairing relationships.

Deterministic Shuffling

shuffle2.sh in=reads.fq out=shuffled.fq seed=12345

Shuffle reads with a fixed seed for reproducible results.

Interleaved Input/Output

shuffle2.sh in=interleaved.fq out=shuffled_interleaved.fq int=t

Process interleaved paired-end reads.

Algorithm Details

Memory Management Strategy

Shuffle2 implements external memory sorting, using an AtomicLong for thread-safe memory tracking and waitOnMemory() for synchronization.
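
To make the idea of AtomicLong-based memory tracking concrete, here is a minimal sketch of a byte budget that worker threads reserve from and release to. It is not taken from the Shuffle2 source: the class MemoryBudgetSketch and its methods tryReserve, reserve, release, and inUse are hypothetical names chosen for illustration, and the blocking reserve() only approximates what a method like waitOnMemory() might do.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical memory budget; an illustration, not the Shuffle2 implementation. */
public class MemoryBudgetSketch {
    private final long limitBytes;
    private final AtomicLong inUse = new AtomicLong();
    private final Object lock = new Object();

    MemoryBudgetSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Non-blocking: reserves the bytes and returns true only if they fit under the limit. */
    boolean tryReserve(long bytes) {
        while (true) {
            long current = inUse.get();
            if (current + bytes > limitBytes) {
                return false;
            }
            // compareAndSet retries if another thread changed the counter in between.
            if (inUse.compareAndSet(current, current + bytes)) {
                return true;
            }
        }
    }

    /** Blocking: waits until the bytes fit, then reserves them. */
    void reserve(long bytes) throws InterruptedException {
        if (bytes > limitBytes) {
            throw new IllegalArgumentException("request exceeds the total budget");
        }
        synchronized (lock) {
            while (!tryReserve(bytes)) {
                lock.wait(); // woken by release()
            }
        }
    }

    /** Returns bytes to the budget and wakes any waiting threads. */
    void release(long bytes) {
        inUse.addAndGet(-bytes);
        synchronized (lock) {
            lock.notifyAll();
        }
    }

    long inUse() {
        return inUse.get();
    }

    public static void main(String[] args) throws InterruptedException {
        MemoryBudgetSketch budget = new MemoryBudgetSketch(100);
        budget.reserve(80);
        Thread worker = new Thread(() -> {
            try {
                budget.reserve(50); // cannot succeed until the 80 bytes are released
                System.out.println("worker reserved 50 bytes");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        budget.release(80);
        worker.join();
        System.out.println("bytes in use: " + budget.inUse());
    }
}
```

The counter is an AtomicLong so that release() and inUse() never need the lock; the lock exists only so that a waiting thread cannot miss a wake-up between its failed check and its call to wait().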

Shuffling Algorithm

The shuffling process maintains paired-read relationships through Read.mate references and multi-stage randomization.
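
One simple way to keep mates together is to treat each pair as a single unit and shuffle the units. The sketch below assumes that approach and is not taken from the Shuffle2 source: the Pair class stands in for a read carrying a mate reference, shufflePairs is a hypothetical helper, and treating a negative seed as "pick a random seed" is an assumption modeled on the documented seed=-1 default.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustration only; names and structure are hypothetical, not Shuffle2's own. */
public class PairShuffleSketch {

    /** Stand-in for a read and its mate; r2 would be null for unpaired reads. */
    static final class Pair {
        final String r1;
        final String r2;

        Pair(String r1, String r2) {
            this.r1 = r1;
            this.r2 = r2;
        }
    }

    /**
     * Returns a shuffled copy of the input. Because each Pair moves as one
     * element, mates can never be separated. A negative seed uses a fresh
     * random seed (assumption); any other seed gives a reproducible order.
     */
    static List<Pair> shufflePairs(List<Pair> pairs, long seed) {
        List<Pair> copy = new ArrayList<>(pairs);
        Random rng = (seed < 0) ? new Random() : new Random(seed);
        Collections.shuffle(copy, rng);
        return copy;
    }

    public static void main(String[] args) {
        List<Pair> pairs = new ArrayList<>();
        for (int i = 1; i <= 5; i++) {
            pairs.add(new Pair("read" + i + "/1", "read" + i + "/2"));
        }
        // Running this twice with the same seed prints the same order.
        for (Pair p : shufflePairs(pairs, 12345L)) {
            System.out.println(p.r1 + "\t" + p.r2);
        }
    }
}
```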

File Format Handling

Shuffle2 uses FileFormat.testInput() and FileFormat.testOutput() for format detection and reads input through ConcurrentReadInputStream.

Performance Characteristics

Shuffle2 performance relies on asynchronous I/O threading and dynamic buffer management.

Differences from Original Shuffle

Shuffle2 implements external memory sorting capabilities not present in the original Shuffle class.
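
For readers unfamiliar with shuffling data that does not fit in memory, the following sketch shows one common two-pass approach: scatter records at random into temporary files, then shuffle each file in memory and concatenate the results. It is an assumption-laden illustration, not a description of how Shuffle2 itself works. The class ExternalShuffleSketch and its shuffle method are hypothetical, and each input line is assumed to hold one complete record (for example, both mates of a pair on a single line), which is simpler than real FASTQ.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustration only; a generic temp-file shuffle, not the Shuffle2 algorithm. */
public class ExternalShuffleSketch {

    /** Shuffles the lines of 'in' into 'out' using 'buckets' temp files. */
    static void shuffle(Path in, Path out, int buckets, long seed) throws IOException {
        Random rng = new Random(seed);
        List<Path> temps = new ArrayList<>();
        List<BufferedWriter> writers = new ArrayList<>();
        try {
            for (int i = 0; i < buckets; i++) {
                Path t = Files.createTempFile("shuffle_sketch_", ".tmp");
                temps.add(t);
                writers.add(Files.newBufferedWriter(t));
            }
            // Pass 1: send each record to a uniformly random temp file.
            try (BufferedReader reader = Files.newBufferedReader(in)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    BufferedWriter w = writers.get(rng.nextInt(buckets));
                    w.write(line);
                    w.newLine();
                }
            }
            for (BufferedWriter w : writers) {
                w.close();
            }
            writers.clear();
            // Pass 2: only one temp file's records are held in memory at a time.
            try (BufferedWriter w = Files.newBufferedWriter(out)) {
                for (Path t : temps) {
                    List<String> chunk = new ArrayList<>(Files.readAllLines(t));
                    Collections.shuffle(chunk, rng);
                    for (String s : chunk) {
                        w.write(s);
                        w.newLine();
                    }
                }
            }
        } finally {
            // Best-effort cleanup of anything still open, then delete temp files.
            for (BufferedWriter w : writers) {
                try {
                    w.close();
                } catch (IOException ignored) {
                    // nothing more to do for a temp file that is about to be deleted
                }
            }
            for (Path t : temps) {
                Files.deleteIfExists(t);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("sketch_in_", ".txt");
        Path out = Files.createTempFile("sketch_out_", ".txt");
        List<String> records = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            records.add("pair" + i);
        }
        Files.write(in, records);
        shuffle(in, out, 3, 12345L);
        Files.readAllLines(out).forEach(System.out::println);
    }
}
```

With more buckets, each temp file is smaller and peak memory in the second pass drops, at the cost of more open file handles during the first pass.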

Support

For questions and support: