Shuffle

Basic Usage

shuffle.sh in=<file> out=<file>

Shuffle takes sequence files and randomly reorders the reads while maintaining pairing information for paired-end data.

Parameters are grouped by function, in the same order as in the shell script.

in=<file>: The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard in. Accepts fasta, fastq, or sam formats, compressed or uncompressed.
in2=<file>: Use this if the second reads of pairs are in a separate file. When specified, forces paired-end mode and disables interleaved input detection.
out=<file>: The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard out. Output format matches input format.
out2=<file>: Use this to write the second reads of pairs to a separate file. When specified with only one input file, forces interleaved input mode.
overwrite=t: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Only applies to compressed output formats.
int=auto: (interleaved) Set to t or f to override interleaving autodetection. Auto mode detects based on file structure and paired-end status.

shuffle: Randomly reorders reads (default). Uses Collections.shuffle(), a Fisher-Yates shuffle that yields a uniformly random ordering. Pairing is preserved because each pair is stored as a single list element: the first read, with its mate attached through the Read object's mate reference.
name: Sort reads by name (read header). Uses ReadComparatorName.compareInner(), which compares IDs lexicographically with r1.id.compareTo(r2.id). Null IDs sort first, and ties are broken by r1.pairnum()-r2.pairnum().
coordinate: Sort reads by mapping location. Uses ReadComparatorMapping with mapped/unmapped read prioritization, then compares chromosome, strand, position, and mapping quality. Requires SAM format mapping information.
sequence: Sort reads by sequence content. Uses ReadComparatorTopological.compareVectors() for a byte-wise comparison limited to the first Tools.min(a.length, b.length) positions. Tiebreakers, in order: mate sequence, length difference, inverted quality scores, numericID, and string ID.
id: Sort reads by read ID (an additional option present in the Java source). Uses ReadComparatorID.compareInner(), which compares r1.numericID with r2.numericID numerically, then r1.pairnum() with r2.pairnum(), and finally falls back to the string comparison r1.id.compareTo(r2.id).
mode=<option>: Alternative way to specify processing mode. Options: shuffle, name, coordinate, sequence, id. Equivalent to using the individual flags above.
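
The equivalence between a bare flag (name) and mode=name can be illustrated with a small parser. This is a minimal sketch, not the tool's actual argument handling: the class ModeSketch, the enum Mode, and the methods lookup() and parseMode() are hypothetical names introduced for illustration.

```java
import java.util.Locale;

class ModeSketch {

    /** Hypothetical enum mirroring the five documented modes. */
    enum Mode { SHUFFLE, NAME, COORDINATE, SEQUENCE, ID }

    /** Returns the mode matching the token (case-insensitive), or null if none matches. */
    static Mode lookup(String token) {
        for (Mode m : Mode.values()) {
            if (m.name().equalsIgnoreCase(token)) {
                return m;
            }
        }
        return null;
    }

    /** Scans key=value and bare arguments; the last mode seen wins, default is SHUFFLE. */
    static Mode parseMode(String[] args) {
        Mode mode = Mode.SHUFFLE;
        for (String arg : args) {
            String[] kv = arg.split("=", 2);
            String key = kv[0].toLowerCase(Locale.ROOT);
            if (kv.length == 2 && key.equals("mode")) {
                // mode=<option> form
                Mode m = lookup(kv[1]);
                if (m == null) {
                    throw new IllegalArgumentException("Unknown mode: " + kv[1]);
                }
                mode = m;
            } else if (kv.length == 1) {
                // bare flag form, e.g. "name"; unrelated bare arguments are ignored here
                Mode m = lookup(key);
                if (m != null) {
                    mode = m;
                }
            }
        }
        return mode;
    }

    public static void main(String[] args) {
        System.out.println(parseMode(new String[] {"in=reads.fq", "out=sorted.fq", "name"}));
        System.out.println(parseMode(new String[] {"in=reads.fq", "out=sorted.fq", "mode=coordinate"}));
    }
}
```

In this sketch both spellings resolve to the same enum value, and a command line with no mode argument falls back to shuffling.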

verbose=f: Enable verbose output for debugging. Shows detailed information about file processing, threading, and internal operations.
showspeed=t: (ss) Display processing speed statistics including reads per second and bases per second. Default: true

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 2GB
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions. Can improve performance slightly in production environments.

shuffle.sh in=reads.fq out=shuffled.fq

Randomly reorders all reads in the input file while maintaining read pairing.

shuffle.sh in1=reads_R1.fq in2=reads_R2.fq out1=shuffled_R1.fq out2=shuffled_R2.fq

Shuffles paired-end reads while keeping mate pairs together.

shuffle.sh in=reads.fq out=sorted.fq name

Sorts reads alphabetically by read name instead of shuffling.

shuffle.sh in=reads.fq out=sorted.fq sequence

Sorts reads by their DNA sequence content using topological ordering.

shuffle.sh in=reads.fq out=sorted.fq mode=coordinate

Alternative syntax for specifying the sort mode. Sorts by genomic coordinates, which requires mapping data in the input.

shuffle.sh -Xmx32g in=large_dataset.fq out=shuffled.fq

Processes a large dataset with an increased memory allocation. Because all reads are loaded before processing, a larger heap lets the in-memory list (ArrayList<Read> bigList) hold more reads.

Shuffle uses a load-all-then-process strategy: all reads are held in an ArrayList<Read> with an initial capacity of 65530. The processing workflow:

Stream processing: ConcurrentReadInputStream.nextList() returns ListNum<Read> chunks until the read list is null. Each first read is appended with bigList.add(r1), and its mate stays attached through the r1.mate reference.
Operation dispatch: Based on the mode constant (SHUFFLE=1, SORT_NAME=2, SORT_SEQ=3, SORT_COORD=4, SORT_ID=5), calls either Collections.shuffle(bigList) or Shared.sort(bigList, comparator).
Fisher-Yates randomization: Collections.shuffle() produces a uniformly random ordering. Pairing survives because each list element is a first read that carries its mate, so a pair always moves as one unit.
Output streaming: Writes reads through ByteStreamWriter.println() and clears each slot with bigList.set(i, null) during output so that written reads can be garbage-collected. The output format is determined by FileFormat detection.
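
The workflow above can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not the tool's code: SimpleRead is a hypothetical stand-in for the Read class, the reads are generated rather than parsed from a file, output goes to a list of strings instead of a ByteStreamWriter, and a seeded Random is passed to Collections.shuffle() only to make the demo repeatable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class LoadShuffleSketch {

    /** Hypothetical minimal read; the real Read class carries far more state. */
    static final class SimpleRead {
        final String id;
        SimpleRead mate; // second read of the pair, or null for unpaired data

        SimpleRead(String id) {
            this.id = id;
        }
    }

    /** Loads every pair into one list; only the first read of each pair is a list element. */
    static List<SimpleRead> loadPairs(int pairs) {
        List<SimpleRead> bigList = new ArrayList<>(65530);
        for (int i = 0; i < pairs; i++) {
            SimpleRead r1 = new SimpleRead("read" + i + "/1");
            r1.mate = new SimpleRead("read" + i + "/2");
            bigList.add(r1);
        }
        return bigList;
    }

    /** Shuffles the list, then emits each first read followed by its mate. */
    static List<String> shuffleAndWrite(List<SimpleRead> bigList, long seed) {
        // The seed is an assumption of this sketch, used for repeatable output.
        Collections.shuffle(bigList, new Random(seed));
        List<String> out = new ArrayList<>(bigList.size() * 2);
        for (int i = 0; i < bigList.size(); i++) {
            // set() returns the previous element, so the slot is cleared as it is consumed.
            SimpleRead r1 = bigList.set(i, null);
            out.add(r1.id);
            if (r1.mate != null) {
                out.add(r1.mate.id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : shuffleAndWrite(loadPairs(4), 42L)) {
            System.out.println(line);
        }
    }
}
```

Because the shuffle only permutes first reads, every "/1" record in the output is immediately followed by its own "/2" record, whatever order the pairs end up in.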

Different sorting modes use specialized comparators with precise comparison hierarchies:

Name sorting: ReadComparatorName uses r1.id.compareTo(r2.id) for lexicographic comparison, with the pair number (r1.pairnum()-r2.pairnum()) as tiebreaker. Null IDs sort before non-null IDs.
Sequence sorting: ReadComparatorTopological implements a multi-level comparison: (1) byte-wise sequence comparison up to Tools.min(a.length, b.length), (2) mate sequence comparison, (3) sequence length difference, (4) quality score comparison (inverted), (5) numeric ID, (6) string ID as the final tiebreaker.
Coordinate sorting: ReadComparatorMapping performs mapped/unmapped read prioritization, then compares chromosome, strand, position, and mapping quality with complex mate pair handling
ID sorting: ReadComparatorID compares r1.numericID vs r2.numericID numerically first, then pair numbers (r1.pairnum() vs r2.pairnum()), then falls back to r1.id.compareTo(r2.id) string comparison
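
Two of the hierarchies above, name order and ID order, can be written as ordinary Java comparators, together with a prefix-limited byte comparison of the kind the sequence sort starts with. This is a hedged sketch: SimpleRead, BY_NAME, BY_ID, and this compareVectors() are hypothetical stand-ins written from the descriptions above, not the ReadComparator classes themselves, and the later sequence tiebreakers (mate, length, quality) are omitted.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ComparatorSketch {

    /** Hypothetical minimal read with only the fields the comparators need. */
    static final class SimpleRead {
        final String id;
        final long numericID;
        final int pairnum;
        final byte[] bases;

        SimpleRead(String id, long numericID, int pairnum, String bases) {
            this.id = id;
            this.numericID = numericID;
            this.pairnum = pairnum;
            this.bases = bases.getBytes(StandardCharsets.US_ASCII);
        }
    }

    /** Name order: null IDs first, then lexicographic ID, then pair number. */
    static final Comparator<SimpleRead> BY_NAME = (a, b) -> {
        if (a.id == null && b.id != null) { return -1; }
        if (a.id != null && b.id == null) { return 1; }
        if (a.id != null) {
            int x = a.id.compareTo(b.id);
            if (x != 0) { return x; }
        }
        return Integer.compare(a.pairnum, b.pairnum);
    };

    /** ID order: numeric ID, then pair number, then string ID (assumed non-null here). */
    static final Comparator<SimpleRead> BY_ID = (a, b) -> {
        int x = Long.compare(a.numericID, b.numericID);
        if (x != 0) { return x; }
        x = Integer.compare(a.pairnum, b.pairnum);
        if (x != 0) { return x; }
        return a.id.compareTo(b.id);
    };

    /** Compares byte-wise over the shared prefix only; 0 means "undecided so far". */
    static int compareVectors(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        for (int i = 0; i < limit; i++) {
            if (a[i] != b[i]) { return a[i] - b[i]; }
        }
        return 0;
    }

    public static void main(String[] args) {
        List<SimpleRead> reads = new ArrayList<>();
        reads.add(new SimpleRead("b", 10, 0, "ACGT"));
        reads.add(new SimpleRead("a", 2, 1, "ACGA"));
        reads.add(new SimpleRead("a", 2, 0, "ACG"));
        reads.sort(BY_NAME);
        for (SimpleRead r : reads) {
            System.out.println(r.id + " pair=" + r.pairnum);
        }
    }
}
```

A result of 0 from the prefix comparison is what makes the later tiebreakers necessary: "ACG" and "ACGT" are indistinguishable over their shared prefix, so a subsequent level has to decide their order.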

Shuffle includes a synchronized threading management system for batch processing:

Thread pooling: The ShuffleThread class limits concurrency through a synchronized addThread() method guarded by the SHUFFLE_LOCK object. A thread that would exceed maxShuffleThreads (default 1) waits in wait(2000) intervals, and SHUFFLE_LOCK.notify() wakes a waiting thread when a slot is released.
Memory management: Uses the calcXmx() function, whose freeRam calculation applies a 2000m baseline and an 84% maximum utilization, and sets both -Xmx and -Xms to the same value for consistent heap allocation.
I/O processing: ByteStreamWriter provides parallel I/O with start()/poisonAndWait() lifecycle, supporting configurable compression levels via ziplevel parameter. ByteFile.FORCE_MODE_BF2 enabled when Shared.threads()>2
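
The thread-limiting pattern described above (a shared lock, timed waits, and a notify on release) can be sketched as follows. This is an illustration of the general technique, not the ShuffleThread source: the class ThreadLimiterSketch, its fields, and runDemo() are hypothetical, and the peak counter exists only so the demo can report what it observed.

```java
import java.util.ArrayList;
import java.util.List;

class ThreadLimiterSketch {

    private static final Object LOCK = new Object(); // plays the role of SHUFFLE_LOCK
    private static int maxThreads = 1;               // plays the role of maxShuffleThreads
    private static int current = 0;                  // threads currently admitted
    private static int peak = 0;                     // highest concurrency seen (demo only)

    /** Adds delta to the active count; a positive delta blocks while the limit is reached. */
    static int addThread(int delta) {
        synchronized (LOCK) {
            while (delta > 0 && current >= maxThreads) {
                try {
                    LOCK.wait(2000); // timed wait, then re-check the condition
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for a slot", e);
                }
            }
            current += delta;
            peak = Math.max(peak, current);
            if (delta < 0) {
                LOCK.notify(); // wake one waiter now that a slot is free
            }
            return current;
        }
    }

    /** Runs the given number of short workers under the limit and returns the peak concurrency. */
    static int runDemo(int workers, int limit) {
        synchronized (LOCK) {
            maxThreads = limit;
            current = 0;
            peak = 0;
        }
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                addThread(1);
                try {
                    Thread.sleep(20); // stand-in for the real work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    addThread(-1);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining workers", e);
            }
        }
        synchronized (LOCK) {
            return peak;
        }
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency with limit 2: " + runDemo(6, 2));
    }
}
```

The timed wait means a missed notification delays a waiter by at most two seconds instead of blocking it indefinitely, which is one plausible reason to pair wait(2000) with notify() rather than rely on notify() alone.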

Paired-end file organization is detected automatically from the input and output file configuration:

Auto-detection algorithm: If in2 is specified, sets FASTQ.FORCE_INTERLEAVED=false and FASTQ.TEST_INTERLEAVED=false. If out2 is specified with a single input file, sets FASTQ.FORCE_INTERLEAVED=true.
Input configuration support: Supports separate files (in1/in2), a single interleaved file, or inputs and outputs that use different organizations (for example, interleaved input with separate output files).
Hash wildcard expansion: Preprocesses "#" wildcards by string replacement: "#" → "1" for in1/out1, "#" → "2" for in2/out2, enabling batch file processing
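
The rules above reduce to a small amount of conditional logic, sketched here. The names PairedPathSketch, Interleaving, decide(), and expandHash() are hypothetical. Expanding "#" only when no explicit second path is given is an assumption of this sketch; the text above states only which digit replaces the wildcard.

```java
class PairedPathSketch {

    /** Hypothetical summary of the interleaving flags described above. */
    enum Interleaving { FORCED_OFF, FORCED_ON, AUTODETECT }

    /** in2 present: interleaving off. Only out2 present: interleaving on. Otherwise autodetect. */
    static Interleaving decide(String in2, String out2) {
        if (in2 != null) {
            return Interleaving.FORCED_OFF;
        }
        if (out2 != null) {
            return Interleaving.FORCED_ON;
        }
        return Interleaving.AUTODETECT;
    }

    /**
     * Returns {path1, path2}. A "#" in path1 becomes "1" for the first file and "2" for
     * the second. Assumption: expansion happens only when path2 was not given explicitly.
     */
    static String[] expandHash(String path1, String path2) {
        if (path1 != null && path2 == null && path1.indexOf('#') >= 0) {
            return new String[] { path1.replace("#", "1"), path1.replace("#", "2") };
        }
        return new String[] { path1, path2 };
    }

    public static void main(String[] args) {
        String[] in = expandHash("reads_R#.fq", null);
        System.out.println(in[0] + " " + in[1]);
        System.out.println(decide(in[1], null));
    }
}
```

In this sketch a wildcard input yields an explicit second path, so the in2 rule then applies and interleaving detection is switched off.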
