Shuffle2
Reorders reads randomly, keeping pairs together. Unlike Shuffle, Shuffle2 can write temp files to handle large datasets.
Basic Usage
shuffle2.sh in=<file> out=<file>
Shuffle2 randomly reorders sequencing reads while preserving paired-read relationships. It is designed to handle large datasets by using temporary files when memory becomes limited.
Parameters
Parameters are organized based on their function in the shuffling process.
Standard parameters
- in=<file>
- The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard in.
- in2=<file>
- Use this if the second reads of pairs are in a different file.
- out=<file>
- The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard out.
- out2=<file>
- Use this to write the second reads of pairs to a different file.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
- int=auto
- (interleaved) Set to t or f to override interleaving autodetection.
Processing parameters
- shuffle
- Randomly reorders reads (default).
- seed=-1
- Set to a positive number for deterministic shuffling. Default -1 uses a random seed for each run.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Shuffling
shuffle2.sh in=reads.fq out=shuffled.fq
Randomly shuffle reads from a single FASTQ file.
Paired-End Shuffling
shuffle2.sh in=reads_R1.fq in2=reads_R2.fq out=shuffled_R1.fq out2=shuffled_R2.fq
Shuffle paired-end reads while maintaining pairing relationships.
Deterministic Shuffling
shuffle2.sh in=reads.fq out=shuffled.fq seed=12345
Shuffle reads with a fixed seed for reproducible results.
Interleaved Input/Output
shuffle2.sh in=interleaved.fq out=shuffled_interleaved.fq int=t
Process interleaved paired-end reads.
Algorithm Details
Memory Management Strategy
Shuffle2 implements external-memory shuffling. It tracks memory in a thread-safe AtomicLong and blocks in waitOnMemory() when too much is outstanding:
- Memory Monitoring: currentLimit, the budget for in-memory storage, is 35% of available memory (memMult=0.35f); memLimit is 75% of maxMem, as reported by Shared.memAvailable()
- Temporary Files: When currentMem exceeds currentLimit, or storage reaches readLimit (2 billion reads), shuffleAndDump() hands the buffer to a WriteThread for asynchronous file I/O
- Chunked Processing: Reads accumulate in an ArrayList<Read> until a memory threshold is crossed; each temp file is then created with File.createTempFile() using the configurable tempExt extension
- Recursive Merging: When the file count exceeds maxFiles (default 16) or the total size exceeds 2GB, mergeRecursive() runs nested merge passes
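The buffer-then-spill pattern described in this list can be sketched as follows. This is a minimal illustration, not the Shuffle2 source: the class SpillSketch, its byte estimate, and the use of plain strings in place of Read objects are all assumptions made for the example, and the write here is synchronous rather than handed to a separate thread.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: buffer records, then shuffle and spill to a temp file past a byte budget. */
final class SpillSketch {
    private final long currentLimit;                        // byte budget for the in-memory buffer
    private final Random randy;                             // seeded so the demo is repeatable
    private final List<String> storage = new ArrayList<>(); // stand-in for ArrayList<Read>
    private final List<File> tempFiles = new ArrayList<>();
    private long currentMem = 0;                            // estimated bytes currently buffered

    SpillSketch(long currentLimit, long seed) {
        this.currentLimit = currentLimit;
        this.randy = new Random(seed);
    }

    /** Rough per-record estimate (an assumption): 2 bytes per char plus fixed overhead. */
    static long estimateBytes(String record) {
        return 2L * record.length() + 40;
    }

    /** Buffers one record and spills once the estimate exceeds the budget. */
    void add(String record) {
        storage.add(record);
        currentMem += estimateBytes(record);
        if (currentMem > currentLimit) {
            shuffleAndDump();
        }
    }

    /** Shuffles the buffer, writes it to a new temp file, and resets the counters. */
    void shuffleAndDump() {
        if (storage.isEmpty()) {
            return;
        }
        Collections.shuffle(storage, randy);
        try {
            File f = File.createTempFile("shuffle_sketch_", ".txt");
            f.deleteOnExit();
            try (BufferedWriter w = Files.newBufferedWriter(f.toPath(), StandardCharsets.UTF_8)) {
                for (String s : storage) {
                    w.write(s);
                    w.newLine();
                }
            }
            tempFiles.add(f);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        storage.clear();
        currentMem = 0;
    }

    /** Reads every spilled record back, file by file. */
    List<String> readBack() {
        List<String> all = new ArrayList<>();
        try {
            for (File f : tempFiles) {
                all.addAll(Files.readAllLines(f.toPath(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return all;
    }

    int tempFileCount() {
        return tempFiles.size();
    }

    public static void main(String[] args) {
        SpillSketch sketch = new SpillSketch(1024, 7L);
        for (int i = 0; i < 100; i++) {
            sketch.add("read_" + i);
        }
        sketch.shuffleAndDump(); // flush whatever is still buffered
        System.out.println("temp files: " + sketch.tempFileCount()
                + ", records: " + sketch.readBack().size());
    }
}
```

Each temp file is shuffled internally, but records never cross file boundaries at this stage, which is why a later merge step is still needed to mix the chunks.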
Shuffling Algorithm
Shuffling keeps pairs intact through Read.mate references and randomizes in more than one stage:
- Paired Storage: Each pair is stored through the r1.mate=r2 reference, so the two reads move as one unit; pairCount() is used when incrementing readsProcessed so that both reads are counted
- Collections.shuffle(): WriteThread.run() calls Collections.shuffle(storage) before writing each temp file, with randomness from Shared.threadLocalRandom(seed)
- Deterministic Option: When seed != -1, randy=Shared.threadLocalRandom(seed) yields a seeded Random, which makes the shuffle reproducible
- Multi-Stage Shuffling: Reads are shuffled once in WriteThread while temp files are created and again in mergeAndDump(), via Collections.shuffle(buffer), during the final merge
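A small sketch of a pair-preserving, seeded shuffle may make the idea concrete. Only the first read of each pair sits in the list, and the second read is reachable through a mate reference, so shuffling the list cannot split a pair. The names PairShuffleSketch, SketchRead, and makePairs are hypothetical, and treating a negative seed as "unseeded" is an assumption for this example rather than a statement about Shuffle2's behavior.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of shuffling pairs as units with an optional fixed seed. */
final class PairShuffleSketch {

    /** Illustrative stand-in for a read; not the BBTools Read class. */
    static final class SketchRead {
        final String id;
        SketchRead mate; // the other read of the pair, or null if unpaired

        SketchRead(String id) {
            this.id = id;
        }
    }

    /** Returns a shuffled copy of the first-in-pair reads; mates follow via the mate field. */
    static List<SketchRead> shuffled(List<SketchRead> firstReads, long seed) {
        List<SketchRead> copy = new ArrayList<>(firstReads);
        // Assumption for this sketch: a negative seed means "use a fresh random seed".
        Random randy = (seed >= 0) ? new Random(seed) : new Random();
        Collections.shuffle(copy, randy);
        return copy;
    }

    /** Builds n demo pairs named r0/1 + r0/2, r1/1 + r1/2, and so on. */
    static List<SketchRead> makePairs(int n) {
        List<SketchRead> list = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            SketchRead r1 = new SketchRead("r" + i + "/1");
            SketchRead r2 = new SketchRead("r" + i + "/2");
            r1.mate = r2;
            r2.mate = r1;
            list.add(r1); // only the first read is stored in the list
        }
        return list;
    }

    public static void main(String[] args) {
        for (SketchRead r1 : shuffled(makePairs(5), 12345L)) {
            System.out.println(r1.id + "\t" + r1.mate.id);
        }
    }
}
```

Running the demo twice with the same seed prints the pairs in the same order, because java.util.Random produces the same sequence for a given seed.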
File Format Handling
Shuffle2 detects formats with FileFormat.testInput() and FileFormat.testOutput() and reads input through ConcurrentReadInputStream:
- Format Detection: FileFormat.testInput() infers the format from the file extension, defaulting to FileFormat.FASTQ; FASTA, SAM/BAM, and compressed inputs are also recognized (see fasta() and samOrBam())
- Interleaving Support: The FASTQ.FORCE_INTERLEAVED and FASTQ.TEST_INTERLEAVED flags control paired-end detection, and the parser's setInterleaved value overrides autodetection
- Compression: ReadWrite.ZIPLEVEL is configurable from 1 to 9; during processing it is temporarily lowered to Tools.mid(1, ReadWrite.ZIPLEVEL, 2), which caps it at level 2
- Header Preservation: The useSharedHeader flag preserves the SAM/BAM header when both ffin1.samOrBam() and ffout1.samOrBam() are true
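The expression Tools.mid(1, ReadWrite.ZIPLEVEL, 2) in the Compression item reads as a median-of-three clamp. The sketch below assumes that Tools.mid returns the median of its three arguments; that behavior is an assumption here, and the local mid method is a stand-in written for this example, not the BBTools implementation.

```java
/** Hypothetical sketch of clamping a compression level with a median-of-three. */
final class ZipLevelSketch {

    /** Returns the median of three ints; with x <= z this clamps y into [x, z]. */
    static int mid(int x, int y, int z) {
        return Math.max(Math.min(x, y), Math.min(Math.max(x, y), z));
    }

    public static void main(String[] args) {
        for (int requested : new int[] {1, 2, 6, 9}) {
            // Temp files get a cheap level; the requested level is kept for final output.
            System.out.println("ziplevel=" + requested
                    + " -> temp-file level " + mid(1, requested, 2));
        }
    }
}
```

Under that assumption, any requested level of 2 or more maps to 2 for temp files, and a requested level of 1 stays at 1.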
Performance Characteristics
Shuffle2's performance depends on asynchronous I/O threads and dynamic buffer management:
- Scalability: WriteThread (which extends Thread) writes temp files asynchronously, so the external-memory algorithm can process datasets larger than RAM
- Threading: Input is read through ConcurrentReadInputStream, and ByteFile.FORCE_MODE_BF2=true is set for multi-threaded reading when Shared.threads()>2
- Memory Efficiency: WriteThread decrements outstandingMem with AtomicLong.addAndGet(-currentMem) and issues a synchronized notify() to release waiting threads
- Compression Balance: The compression level is temporarily lowered to Tools.mid(1, ReadWrite.ZIPLEVEL, 2) during processing and restored to the saved value ziplevel0 for final output
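The memory hand-off between the reading thread and a writer thread can be sketched with an AtomicLong counter plus wait/notify. This is an illustration under stated assumptions: MemoryGateSketch and its method names other than waitOnMemory are hypothetical, and the sketch uses notifyAll() and a timed wait as defensive choices rather than mirroring the actual code.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: track bytes owned by writer threads and block until they are released. */
final class MemoryGateSketch {
    private final AtomicLong outstandingMem = new AtomicLong();

    /** Producer side: record that a writer now owns this many bytes. */
    void reserve(long bytes) {
        outstandingMem.addAndGet(bytes);
    }

    /** Writer side: give the bytes back, then wake any thread waiting on the counter. */
    void release(long bytes) {
        outstandingMem.addAndGet(-bytes);
        synchronized (outstandingMem) {
            outstandingMem.notifyAll();
        }
    }

    /** Blocks until outstanding bytes drop to the target or below. */
    void waitOnMemory(long target) {
        synchronized (outstandingMem) {
            while (outstandingMem.get() > target) {
                try {
                    outstandingMem.wait(100); // timed wait, re-checked in the loop
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting on memory", e);
                }
            }
        }
    }

    long outstanding() {
        return outstandingMem.get();
    }

    public static void main(String[] args) {
        MemoryGateSketch gate = new MemoryGateSketch();
        gate.reserve(1_000_000);
        // The writer would normally write a chunk to disk before releasing its memory.
        Thread writer = new Thread(() -> gate.release(1_000_000));
        writer.start();
        gate.waitOnMemory(0);
        System.out.println("outstanding bytes: " + gate.outstanding());
    }
}
```

The loop around wait() matters: the condition is re-checked after every wake-up, so a spurious wake-up or an early notification cannot let the caller proceed while memory is still outstanding.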
Differences from Original Shuffle
Shuffle2 adds external-memory shuffling, which the original Shuffle class lacks:
- Large Dataset Support: With allowTempFiles=true, Shuffle2 spills to files created by File.createTempFile() and written by WriteThread, while the original Shuffle processes everything in memory
- Memory Control: Uses the AtomicLong outstandingMem with waitOnMemory() synchronization and the memMult threshold, versus Shuffle's simpler ArrayList storage
- I/O Performance: WriteThread writes asynchronously through ConcurrentReadOutputStream buffers, compared with Shuffle's direct synchronous output
- Robustness: mergeRecursive() handles the maxFiles limit, and Tools.capBufferLen() adjusts buffer lengths dynamically under memory pressure
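The fan-in control that a maxFiles limit implies can be sketched as repeated batch merges. In this illustration, in-memory lists stand in for temp files, so it shows only the control flow of nested passes and not the memory behavior; MergePassSketch and its method bodies are assumptions for the example, not the logic of the real mergeRecursive().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: merge chunks in groups of at most maxFiles until one final pass suffices. */
final class MergePassSketch {

    /** Merges one batch of chunks into a single shuffled chunk. */
    static List<String> mergeAndShuffle(List<List<String>> batch, Random randy) {
        List<String> buffer = new ArrayList<>();
        for (List<String> chunk : batch) {
            buffer.addAll(chunk);
        }
        Collections.shuffle(buffer, randy);
        return buffer;
    }

    /** Runs intermediate passes while too many chunks remain, then one final merge. */
    static List<String> mergeRecursive(List<List<String>> chunks, int maxFiles, Random randy) {
        if (maxFiles < 2) {
            throw new IllegalArgumentException("maxFiles must be at least 2");
        }
        List<List<String>> current = chunks;
        while (current.size() > maxFiles) {
            List<List<String>> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += maxFiles) {
                int end = Math.min(i + maxFiles, current.size());
                next.add(mergeAndShuffle(current.subList(i, end), randy));
            }
            current = next; // each pass shrinks the chunk count by up to a factor of maxFiles
        }
        return mergeAndShuffle(current, randy);
    }

    public static void main(String[] args) {
        List<List<String>> chunks = new ArrayList<>();
        for (int c = 0; c < 40; c++) {
            List<String> chunk = new ArrayList<>();
            for (int i = 0; i < 5; i++) {
                chunk.add("chunk" + c + "_read" + i);
            }
            chunks.add(chunk);
        }
        List<String> merged = mergeRecursive(chunks, 16, new Random(1L));
        System.out.println("merged records: " + merged.size());
    }
}
```

With 40 chunks and a limit of 16, one intermediate pass reduces the count to 3, and the final merge then combines those.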
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org