Shuffle2

Basic Usage

shuffle2.sh in=<file> out=<file>

Shuffle2 randomly reorders sequencing reads while preserving paired-read relationships. It is designed to handle large datasets by using temporary files when memory becomes limited.

Parameters

Parameters are organized based on their function in the shuffling process.

Standard parameters

in=<file>: The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard in.
in2=<file>: Use this if the second reads of pairs are in a different file.
out=<file>: The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard out.
out2=<file>: Use this to write the second reads of pairs to a different file.
overwrite=t: (ow) Set to false to force the program to abort rather than overwrite an existing file.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
int=auto: (interleaved) Set to t or f to override interleaving autodetection.

Processing parameters

shuffle: Randomly reorders reads (default).
seed=-1: Set to a positive number for deterministic shuffling. Default -1 uses a random seed for each run.
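
The effect of the seed parameter can be illustrated with a minimal Java sketch. It is not Shuffle2's code: the class SeededShuffleSketch, the method shuffled(), and the use of a plain java.util.Random are assumptions made for illustration, and treating any negative seed as "random" is a simplification of the seed=-1 default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SeededShuffleSketch {

    /** Returns a shuffled copy; a negative seed means "use a fresh random seed". */
    static List<String> shuffled(List<String> reads, long seed) {
        List<String> copy = new ArrayList<>(reads);
        // Assumption: java.util.Random stands in for the tool's own generator.
        Random rng = (seed < 0) ? new Random() : new Random(seed);
        Collections.shuffle(copy, rng);
        return copy;
    }

    public static void main(String[] args) {
        List<String> reads = Arrays.asList("read1", "read2", "read3", "read4", "read5");
        // The same seed gives the same order on every call.
        System.out.println(shuffled(reads, 12345).equals(shuffled(reads, 12345)));
        // A negative seed may give a different order on each run.
        System.out.println(shuffled(reads, -1));
    }
}
```

A fixed seed is useful when a downstream comparison needs byte-identical shuffled input across runs.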

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Shuffling

shuffle2.sh in=reads.fq out=shuffled.fq

Randomly shuffle reads from a single FASTQ file.

Paired-End Shuffling

shuffle2.sh in1=reads_R1.fq in2=reads_R2.fq out1=shuffled_R1.fq out2=shuffled_R2.fq

Shuffle paired-end reads while maintaining pairing relationships.

Deterministic Shuffling

shuffle2.sh in=reads.fq out=shuffled.fq seed=12345

Shuffle reads with a fixed seed for reproducible results.

Interleaved Input/Output

shuffle2.sh in=interleaved.fq out=shuffled_interleaved.fq int=t

Process interleaved paired-end reads.

Algorithm Details

Memory Management Strategy

Shuffle2 implements external-memory shuffling, using an AtomicLong for thread-safe memory tracking and waitOnMemory() for synchronization:

Memory Monitoring: The limit for in-memory storage, currentLimit, is memMult=0.35f (35%) of available memory; memLimit is set to 75% of maxMem, as reported by Shared.memAvailable().
Temporary Files: When currentMem exceeds currentLimit, or storage reaches readLimit (2 billion reads), shuffleAndDump() is called, which uses a WriteThread for asynchronous file I/O.
Chunked Processing: The ArrayList<Read> storage accumulates reads until a memory threshold triggers creation of a temp file via File.createTempFile(), using the configurable tempExt format.
Recursive Merging: The mergeRecursive() method handles cases where the file count exceeds maxFiles (default 16) or the total size exceeds 2GB, by creating nested merge passes.
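
The threshold-triggered chunking described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Shuffle2 source: reads are plain strings, each in-memory chunk stands in for one temp file, and estimateBytes() is a hypothetical size estimate. Only the names currentMem, currentLimit, readLimit, and storage are taken from the description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class ChunkedDumpSketch {

    /**
     * Splits reads into shuffled chunks. A chunk is closed once its estimated
     * size exceeds currentLimit bytes or it holds readLimit reads.
     * Each returned chunk stands in for one temp file.
     */
    static List<List<String>> shuffleIntoChunks(List<String> reads, long currentLimit,
                                                int readLimit, Random rng) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> storage = new ArrayList<>();
        long currentMem = 0;
        for (String read : reads) {
            storage.add(read);
            currentMem += estimateBytes(read);
            if (currentMem > currentLimit || storage.size() >= readLimit) {
                Collections.shuffle(storage, rng); // shuffle before dumping
                chunks.add(storage);               // "dump" the chunk
                storage = new ArrayList<>();
                currentMem = 0;
            }
        }
        if (!storage.isEmpty()) {                  // dump whatever is left
            Collections.shuffle(storage, rng);
            chunks.add(storage);
        }
        return chunks;
    }

    /** Hypothetical size estimate: two bytes per character. */
    static long estimateBytes(String read) {
        return 2L * read.length();
    }

    public static void main(String[] args) {
        List<String> reads = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            reads.add("ACGTACGT" + i);
        }
        // Tiny limits so that several chunks are produced.
        List<List<String>> chunks = shuffleIntoChunks(reads, 60, 1000, new Random(1));
        System.out.println(chunks.size() + " chunks: " + chunks);
    }
}
```

Shuffling each chunk before it is dumped means the later merge step only has to interleave chunks that are already in random order.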

Shuffling Algorithm

The shuffling process maintains paired-read relationships through Read.mate references and multi-stage randomization:

Paired Storage: Each pair is stored through the r1.mate=r2 relationship and handled as a single unit; readsProcessed is incremented by pairCount() so that both reads are counted.
Collections.shuffle(): WriteThread.run() calls Collections.shuffle(storage) before writing each file, with a generator obtained from Shared.threadLocalRandom(seed).
Deterministic Option: When seed != -1, randy=Shared.threadLocalRandom(seed) yields a seeded Random, which makes the shuffle reproducible.
Multi-Stage Shuffling: Reads are shuffled twice: in WriteThread when each temp file is created, and in mergeAndDump() via Collections.shuffle(buffer) during the final merge.
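
The pair-preserving idea can be shown with a short sketch: only the first read of each pair is held in the list that gets shuffled, and the second read travels with it through a mate reference. The Read class here is a hypothetical minimal record, not the BBTools class, and shufflePairs() is an illustrative name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class PairedShuffleSketch {

    /** Hypothetical minimal read record; the mate field links read 2 to read 1. */
    static final class Read {
        final String id;
        Read mate;
        Read(String id) { this.id = id; }
    }

    /** Shuffles pairs as units and returns the ids in interleaved output order. */
    static List<String> shufflePairs(List<Read> firstReads, long seed) {
        List<Read> storage = new ArrayList<>(firstReads);
        Collections.shuffle(storage, new Random(seed));
        List<String> out = new ArrayList<>();
        for (Read r1 : storage) {
            out.add(r1.id);
            if (r1.mate != null) {
                out.add(r1.mate.id); // the mate always follows its partner
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Read> firstReads = new ArrayList<>();
        for (int i = 1; i <= 4; i++) {
            Read r1 = new Read("pair" + i + "/1");
            r1.mate = new Read("pair" + i + "/2");
            firstReads.add(r1);
        }
        System.out.println(shufflePairs(firstReads, 12345));
    }
}
```

Because the shuffle never sees the second reads as separate list elements, no ordering of the list can split a pair.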

File Format Handling

Shuffle2 uses FileFormat.testInput() and FileFormat.testOutput() for format detection, and reads input through a ConcurrentReadInputStream:

Format Detection: FileFormat.testInput(), with FileFormat.FASTQ as the default, detects the format from the file extension; fasta(), samOrBam(), and compressed formats are supported.
Interleaving Support: The FASTQ.FORCE_INTERLEAVED and FASTQ.TEST_INTERLEAVED flags control paired-end detection, with an override via setInterleaved from the parser.
Compression: ReadWrite.ZIPLEVEL is configurable from 1 to 9 and is temporarily capped at level 2 during processing via Tools.mid(1, ReadWrite.ZIPLEVEL, 2).
Header Preservation: The useSharedHeader flag enables SAM/BAM header preservation when ffin1.samOrBam() and ffout1.samOrBam() are both true.
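
The compression-level expression can be worked through with a small sketch. It assumes that Tools.mid returns the median of its three arguments; the mid() method below is a stand-in written for this example, and tempZipLevel() is a hypothetical helper name.

```java
class ZipLevelSketch {

    /** Returns the median of three ints; a stand-in for the Tools.mid call. */
    static int mid(int a, int b, int c) {
        return Math.max(Math.min(a, b), Math.min(Math.max(a, b), c));
    }

    /** Level for intermediate files: the requested level, kept within 1..2. */
    static int tempZipLevel(int requestedLevel) {
        return mid(1, requestedLevel, 2);
    }

    public static void main(String[] args) {
        for (int level : new int[] {1, 2, 6, 9}) {
            System.out.println("ziplevel=" + level + " -> temp level " + tempZipLevel(level));
        }
    }
}
```

Under that assumption, any requested level of 2 or higher yields 2 for temporary files, and a requested level of 1 stays at 1.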

Performance Characteristics

Shuffle2 performance relies on asynchronous I/O threading and dynamic buffer management:

Scalability: The external-memory algorithm uses WriteThread, which extends Thread, for asynchronous temp file I/O, so datasets larger than RAM can be processed.
Threading: ConcurrentReadInputStream is used with ByteFile.FORCE_MODE_BF2=true for multi-threaded reading when Shared.threads()>2.
Memory Efficiency: WriteThread calls AtomicLong.addAndGet(-currentMem) to decrement outstandingMem, then a synchronized notify() to release waiting threads.
Compression Balance: ReadWrite.ZIPLEVEL is temporarily reduced to Tools.mid(1, ReadWrite.ZIPLEVEL, 2) during processing and restored to ziplevel0 for the final output.
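
The memory accounting between the reading thread and the writer threads can be sketched as a small gate. This is an illustration of the pattern only: the class MemoryGateSketch and the methods reserve(), release(), and outstanding() are hypothetical, and notifyAll() is used here in place of notify() for simplicity. The names outstandingMem, memLimit, and waitOnMemory() follow the description above.

```java
import java.util.concurrent.atomic.AtomicLong;

class MemoryGateSketch {

    private final AtomicLong outstandingMem = new AtomicLong();
    private final long memLimit;

    MemoryGateSketch(long memLimit) {
        this.memLimit = memLimit;
    }

    /** Reader side: register a chunk that has been handed to a writer. */
    void reserve(long bytes) {
        outstandingMem.addAndGet(bytes);
    }

    /** Reader side: block until outstanding memory is at or below the limit. */
    synchronized void waitOnMemory() throws InterruptedException {
        while (outstandingMem.get() > memLimit) {
            wait();
        }
    }

    /** Writer side: release a finished chunk and wake any waiting thread. */
    void release(long bytes) {
        outstandingMem.addAndGet(-bytes);
        synchronized (this) {
            notifyAll();
        }
    }

    long outstanding() {
        return outstandingMem.get();
    }

    public static void main(String[] args) throws InterruptedException {
        MemoryGateSketch gate = new MemoryGateSketch(100);
        gate.reserve(150); // over the limit until the writer finishes
        Thread writer = new Thread(() -> {
            try {
                Thread.sleep(50); // simulate writing a temp file
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            gate.release(150);
        });
        writer.start();
        gate.waitOnMemory(); // blocks until the writer releases its chunk
        writer.join();
        System.out.println("outstanding bytes: " + gate.outstanding());
    }
}
```

The waiting thread rechecks the counter in a loop while holding the monitor, so a release that happens just before the wait is not missed.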

Differences from Original Shuffle

Shuffle2 adds external-memory capabilities that are not present in the original Shuffle class:

Large Dataset Support: allowTempFiles=true enables File.createTempFile() with WriteThread for external shuffling, whereas the original Shuffle processes reads in memory only.
Memory Control: Uses the AtomicLong outstandingMem with waitOnMemory() synchronization and the memMult threshold, versus Shuffle's simpler ArrayList storage.
I/O Performance: WriteThread writes files asynchronously through ConcurrentReadOutputStream buffers, whereas Shuffle writes output directly and synchronously.
Robustness: Implements mergeRecursive() to handle the maxFiles limit, and uses Tools.capBufferLen() to adjust buffer length dynamically under memory pressure.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org