Shuffle
Reorders reads randomly, keeping pairs together. Also supports sorting reads by name, coordinate position, sequence, or read ID.
Basic Usage
shuffle.sh in=<file> out=<file>
Shuffle takes sequence files and randomly reorders the reads while maintaining pairing information for paired-end data.
Parameters
Parameters are grouped by their function in the shuffling and sorting process, following the organization used in the shell script.
Standard parameters
- in=<file>
- The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard in. Accepts fasta, fastq, or sam formats, compressed or uncompressed.
- in2=<file>
- Use this if the second reads of pairs are in a different file. When specified, forces paired-end mode and disables interleaved input detection.
- out=<file>
- The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard out. Output format matches input format.
- out2=<file>
- Use this to write the second read of each pair to a different file. When specified with only one input file, forces interleaved input mode.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Only applies to compressed output formats.
- int=auto
- (interleaved) Set to t or f to override interleaving autodetection. Auto mode detects based on file structure and paired-end status.
Processing parameters
- shuffle
- Randomly reorders reads (default). Uses Collections.shuffle(), a Fisher-Yates shuffle that gives a uniform random ordering. Pairing is preserved because each pair is stored as a single Read object that references its mate.
- name
- Sort reads by name (read header). Uses ReadComparatorName.compareInner(), which compares IDs lexicographically with r1.id.compareTo(r2.id). Null IDs sort first, and ties are broken by the difference r1.pairnum()-r2.pairnum().
- coordinate
- Sort reads by mapping location. Uses ReadComparatorMapping with mapped/unmapped read prioritization, then compares chromosome, strand, position, and mapping quality. Requires SAM format mapping information.
- sequence
- Sort reads by sequence content. Uses ReadComparatorTopological.compareVectors(), which compares bases byte by byte over the first Tools.min(a.length, b.length) positions. Ties are broken, in order, by mate sequence, sequence length, inverted quality scores, numericID, and string ID.
- id
- Sort reads by read ID (an additional option found in the Java source). Uses ReadComparatorID.compareInner(), which compares r1.numericID with r2.numericID numerically first, then r1.pairnum() with r2.pairnum(), and finally falls back to the string comparison r1.id.compareTo(r2.id).
- mode=<option>
- Alternative way to specify processing mode. Options: shuffle, name, coordinate, sequence, id. Equivalent to using the individual flags above.
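As a rough illustration of how a bare flag and the mode=<option> form could resolve to the same setting, the sketch below maps an argument to one of the mode constants named under Algorithm Details. The class and method names (ModeParseSketch, parseMode) and the -1 return value are hypothetical and are not taken from the BBTools source; only the constant names and values come from this document.

```java
import java.util.Locale;

// Hypothetical sketch; this is not the actual BBTools argument parser.
class ModeParseSketch {
    // Constant names and values as listed in this document.
    static final int SHUFFLE = 1, SORT_NAME = 2, SORT_SEQ = 3, SORT_COORD = 4, SORT_ID = 5;

    // Accepts either a bare flag ("name") or the key=value form ("mode=name").
    // Returns -1 (an assumption of this sketch) if the argument does not select a mode.
    static int parseMode(String arg) {
        String s = arg.toLowerCase(Locale.ROOT);
        if (s.startsWith("mode=")) {
            s = s.substring("mode=".length());
        }
        switch (s) {
            case "shuffle":    return SHUFFLE;
            case "name":       return SORT_NAME;
            case "sequence":   return SORT_SEQ;
            case "coordinate": return SORT_COORD;
            case "id":         return SORT_ID;
            default:           return -1;
        }
    }

    public static void main(String[] args) {
        // Both spellings select the same mode constant.
        System.out.println(parseMode("name") == parseMode("mode=name"));
        System.out.println(parseMode("mode=coordinate"));
    }
}
```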
Performance and Debugging
- verbose=f
- Enable verbose output for debugging. Shows detailed information about file processing, threading, and internal operations.
- showspeed=t
- (ss) Display processing speed statistics including reads per second and bases per second. Default: true
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 2GB
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions. Can improve performance slightly in production environments.
Examples
Basic Shuffling
shuffle.sh in=reads.fq out=shuffled.fq
Randomly reorders all reads in the input file while maintaining read pairing.
Paired-End Shuffling
shuffle.sh in1=reads_R1.fq in2=reads_R2.fq out1=shuffled_R1.fq out2=shuffled_R2.fq
Shuffles paired-end reads while keeping mate pairs together.
Sort by Name
shuffle.sh in=reads.fq out=sorted.fq name
Sorts reads alphabetically by read name instead of shuffling.
Sort by Sequence
shuffle.sh in=reads.fq out=sorted.fq sequence
Sorts reads by their DNA sequence content using topological ordering.
Using Mode Parameter
shuffle.sh in=reads.fq out=sorted.fq mode=coordinate
Alternative syntax for specifying the sort mode. Sorts by genomic coordinate if mapping data is present.
High Memory Usage
shuffle.sh -Xmx32g in=large_dataset.fq out=shuffled.fq
Processes large datasets with a larger heap. Because Shuffle loads every read into memory (the ArrayList<Read> bigList) before reordering, the heap must be large enough to hold the whole dataset.
Algorithm Details
Core Processing Algorithm
Shuffle uses a load-all-then-process strategy, holding reads in an ArrayList<Read> with an initial capacity of 65530. The processing workflow:
- Stream processing: ConcurrentReadInputStream.nextList() returns ListNum<Read> chunks until reads==null. Each first read is appended with bigList.add(r1), and its mate stays attached through the r1.mate reference
- Operation dispatch: Depending on the mode constant (SHUFFLE=1, SORT_NAME=2, SORT_SEQ=3, SORT_COORD=4, SORT_ID=5), calls Collections.shuffle(bigList) or Shared.sort(bigList, comparator)
- Fisher-Yates randomization: Collections.shuffle() gives a uniform random ordering. Pairs stay together because each list element is one Read object carrying its mate
- Output streaming: Each read is written with ByteStreamWriter.println(), and its slot is then cleared with bigList.set(i, null) so the memory can be reclaimed. The output format is determined by FileFormat detection
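The workflow above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the BBTools implementation: the Read class here is a hypothetical stand-in that models only an id and a mate reference, the reads are generated in memory instead of being streamed from a file, and a seeded Random is passed to Collections.shuffle() purely so the sketch is repeatable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of load-all, shuffle, then drain; not the BBTools source.
class PairedShuffleSketch {
    // Stand-in for the BBTools Read class; only id and mate are modeled.
    static final class Read {
        final String id;
        Read mate; // second read of the pair, or null for single-ended data
        Read(String id) { this.id = id; }
    }

    // Shuffles the list of first reads; each mate travels with its r1 via the mate reference.
    static void shufflePairs(List<Read> bigList, long seed) {
        Collections.shuffle(bigList, new Random(seed));
    }

    // Emits ids interleaved (r1, then its mate) and clears each slot after it is emitted.
    static List<String> drain(List<Read> bigList) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < bigList.size(); i++) {
            Read r1 = bigList.set(i, null); // set() returns the element previously stored
            out.add(r1.id);
            if (r1.mate != null) {
                out.add(r1.mate.id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Read> bigList = new ArrayList<>(65530);
        for (int i = 0; i < 5; i++) {
            Read r1 = new Read("read" + i + "/1");
            r1.mate = new Read("read" + i + "/2");
            bigList.add(r1);
        }
        shufflePairs(bigList, 42L);
        for (String id : drain(bigList)) {
            System.out.println(id);
        }
    }
}
```

Because only the first read of each pair sits in the list, no extra bookkeeping is needed to keep mates adjacent in the output.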
Sorting Algorithms
Different sorting modes use specialized comparators with precise comparison hierarchies:
- Name sorting: ReadComparatorName uses r1.id.compareTo(r2.id) for lexicographic comparison, with pair number (r1.pairnum()-r2.pairnum()) as tiebreaker. Null IDs sort before non-null
- Sequence sorting: ReadComparatorTopological implements multi-level comparison: (1) byte-wise sequence comparison using Tools.min(a.length, b.length), (2) mate sequence comparison, (3) sequence length differences, (4) quality score comparison (inverted), (5) numeric ID, (6) string ID as final tiebreaker
- Coordinate sorting: ReadComparatorMapping performs mapped/unmapped read prioritization, then compares chromosome, strand, position, and mapping quality with complex mate pair handling
- ID sorting: ReadComparatorID compares r1.numericID vs r2.numericID numerically first, then pair numbers (r1.pairnum() vs r2.pairnum()), then falls back to r1.id.compareTo(r2.id) string comparison
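The comparison hierarchies for name, ID, and sequence sorting might look roughly like the sketch below. It is an illustration only: Rec is a hypothetical stand-in for the BBTools Read class, the handling of two null IDs and of null IDs in the ID comparator is an assumption, and compareBases covers only the base-by-base and length levels of the sequence comparison (the mate, quality, and ID tiebreakers are omitted).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the comparison orders described above; not the BBTools comparators.
class ComparatorSketch {
    // Stand-in for the BBTools Read class.
    static final class Rec {
        final String id;
        final long numericID;
        final int pairnum;
        Rec(String id, long numericID, int pairnum) {
            this.id = id;
            this.numericID = numericID;
            this.pairnum = pairnum;
        }
    }

    // Name order: null IDs first, then lexicographic by id, then pair number.
    static final Comparator<Rec> BY_NAME = (a, b) -> {
        if (a.id == null || b.id == null) {
            if (a.id != null) { return 1; }   // b has the null id, so b sorts first
            if (b.id != null) { return -1; }  // a has the null id, so a sorts first
            return a.pairnum - b.pairnum;     // both null: assumption of this sketch
        }
        int x = a.id.compareTo(b.id);
        return x != 0 ? x : a.pairnum - b.pairnum;
    };

    // ID order: numeric ID, then pair number, then string id as a fallback.
    static final Comparator<Rec> BY_ID = (a, b) -> {
        int x = Long.compare(a.numericID, b.numericID);
        if (x != 0) { return x; }
        x = Integer.compare(a.pairnum, b.pairnum);
        if (x != 0) { return x; }
        // Null-safe string comparison is an assumption of this sketch.
        return Comparator.nullsFirst(Comparator.<String>naturalOrder()).compare(a.id, b.id);
    };

    // Sequence order (partial): bases over the shared prefix, then length difference.
    static int compareBases(byte[] a, byte[] b) {
        int lim = Math.min(a.length, b.length);
        for (int i = 0; i < lim; i++) {
            if (a[i] != b[i]) { return a[i] - b[i]; }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        List<Rec> reads = new ArrayList<>(Arrays.asList(
            new Rec("readB", 0, 0), new Rec("readA", 1, 1), new Rec("readA", 1, 0)));
        reads.sort(BY_NAME);
        for (Rec r : reads) {
            System.out.println(r.id + " pair " + r.pairnum);
        }
    }
}
```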
Threading and Performance
Shuffle includes a synchronized threading management system for batch processing:
- Thread pooling: The ShuffleThread class limits concurrency through a synchronized addThread() method that locks on SHUFFLE_LOCK. A thread that cannot start yet waits with wait(2000). maxShuffleThreads sets the limit (default 1), and SHUFFLE_LOCK.notify() wakes a waiting thread when a slot is released
- Memory management: The calcXmx() function uses a freeRam calculation (2000m baseline, 84% maximum utilization) and sets -Xmx and -Xms to the same value for consistent heap allocation
- I/O processing: ByteStreamWriter provides parallel I/O with a start()/poisonAndWait() lifecycle and supports configurable compression levels via the ziplevel parameter. ByteFile.FORCE_MODE_BF2 is enabled when Shared.threads()>2
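The wait/notify pattern described for thread limiting can be sketched as follows. This is a generic illustration of the technique, not the ShuffleThread source: the names SHUFFLE_LOCK, maxShuffleThreads, and addThread() mirror the ones mentioned above, while currentShuffleThreads, peakShuffleThreads, runJobs(), and the 10 ms placeholder workload are hypothetical.

```java
// Hypothetical sketch of a wait/notify thread limiter; not the BBTools ShuffleThread source.
class ThreadLimitSketch {
    private static final Object SHUFFLE_LOCK = new Object();
    private static final int maxShuffleThreads = 1; // default limit stated in this document
    private static int currentShuffleThreads = 0;
    private static int peakShuffleThreads = 0;      // bookkeeping for the demo only

    // Adds x to the running count. For x > 0, blocks until a slot is free,
    // rechecking at least every 2000 ms. For x < 0, releases a slot and wakes one waiter.
    static void addThread(int x) throws InterruptedException {
        synchronized (SHUFFLE_LOCK) {
            while (x > 0 && currentShuffleThreads >= maxShuffleThreads) {
                SHUFFLE_LOCK.wait(2000);
            }
            currentShuffleThreads += x;
            peakShuffleThreads = Math.max(peakShuffleThreads, currentShuffleThreads);
            if (x < 0) {
                SHUFFLE_LOCK.notify();
            }
        }
    }

    // Runs the given number of placeholder jobs and returns the peak concurrency observed.
    static int runJobs(int jobs) {
        Thread[] workers = new Thread[jobs];
        for (int i = 0; i < jobs; i++) {
            workers[i] = new Thread(() -> {
                try {
                    addThread(1);
                    try {
                        Thread.sleep(10); // placeholder for the real shuffle work
                    } finally {
                        addThread(-1);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        synchronized (SHUFFLE_LOCK) {
            return peakShuffleThreads;
        }
    }

    public static void main(String[] args) {
        System.out.println("peak concurrent jobs: " + runJobs(4));
    }
}
```

The while loop around wait() rechecks the condition after every wake-up, so a timeout or a spurious wake-up cannot let a thread past the limit.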
Interleaving Detection
Automatic paired-end file organization detection uses conditional logic based on input/output file configurations:
- Auto-detection algorithm: If in2 is specified, Shuffle sets FASTQ.FORCE_INTERLEAVED=false and FASTQ.TEST_INTERLEAVED=false. If out2 is specified with a single input file, it sets FASTQ.FORCE_INTERLEAVED=true
- Input configuration support: Supports separate files (in1/in2), interleaved single files, and configurations where input and output are organized differently (for example, interleaved input with separate output files)
- Hash wildcard expansion: Preprocesses "#" wildcards by string replacement: "#" → "1" for in1/out1, "#" → "2" for in2/out2, enabling batch file processing
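The two rules above, wildcard expansion and the forced interleaving settings, can be sketched as small pure functions. The class, method, and enum names below (InputConfigSketch, expandHash, decide, Interleave) are hypothetical; the comments note which FASTQ flags each outcome corresponds to according to this document.

```java
// Hypothetical sketch of the input-configuration rules described above; not the BBTools source.
class InputConfigSketch {
    enum Interleave { AUTO, FORCED_ON, FORCED_OFF }

    // Expands a "#" wildcard into read-1 and read-2 file names.
    // Returns {path, null} when the path contains no wildcard.
    static String[] expandHash(String path) {
        if (path == null || path.indexOf('#') < 0) {
            return new String[] {path, null};
        }
        return new String[] {path.replace("#", "1"), path.replace("#", "2")};
    }

    // Decides the interleaving setting from which files were specified.
    static Interleave decide(String in1, String in2, String out2) {
        if (in2 != null) {
            // Corresponds to FASTQ.FORCE_INTERLEAVED=false and FASTQ.TEST_INTERLEAVED=false.
            return Interleave.FORCED_OFF;
        }
        if (out2 != null) {
            // Single input with two outputs: corresponds to FASTQ.FORCE_INTERLEAVED=true.
            return Interleave.FORCED_ON;
        }
        return Interleave.AUTO; // leave autodetection enabled
    }

    public static void main(String[] args) {
        String[] in = expandHash("reads_R#.fq");
        System.out.println(in[0] + " " + in[1]);
        System.out.println(decide(in[0], in[1], null));
    }
}
```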
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org