Partition

Script: partition.sh Package: jgi Class: PartitionReads.java

Splits a sequence file evenly into multiple files.

Basic Usage

partition.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> ways=<number>

The output filenames must contain a '%' symbol, which will be replaced by a number (0, 1, 2, etc.). For paired-end data, if out2 is not specified, data will be written interleaved to the single output pattern.

Parameters

Parameters control input/output files, the partitioning strategy, and a PacBio-specific mode.

Parameters and their defaults

in=<file>
Primary input file containing the sequences to partition. For paired-end data in two files, supply the second file with in2.
out=<file>
Output file pattern (containing a % symbol, like 'part%.fa'). The % will be replaced with partition numbers starting from 0.
ways=-1
The number of output files to create; must be positive. Because the default of -1 is not a valid count, this parameter has to be set explicitly.
pacbio=f
Set to true to keep PacBio subreads together. When enabled, sequences with the same ZMW (zero-mode waveguide) identifier will be assigned to the same output file to maintain subread groupings.
bp=f
Optimize for an even split by base pairs instead of sequences. Not compatible with PacBio mode. Uses a priority queue algorithm to balance the total number of base pairs across output files rather than just sequence counts.
ow=f
(overwrite) Set to true to overwrite output files that already exist.
app=f
(append) Set to true to append new data to existing output files instead of overwriting them.
zl=4
(ziplevel) Set compression level, 1 (low) to 9 (max). Controls the compression level for gzipped output files.
int=f
(interleaved) Determines whether INPUT file is considered interleaved. Set to true if the input file contains interleaved paired-end reads.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions. Can improve performance in production environments.

Examples

Basic Sequence Partitioning

partition.sh in=sequences.fasta out=part%.fasta ways=4

Splits sequences.fasta into 4 files: part0.fasta, part1.fasta, part2.fasta, and part3.fasta, distributing sequences evenly using round-robin allocation.

Paired-End Data Partitioning

partition.sh in1=reads_R1.fastq in2=reads_R2.fastq out1=part%_R1.fastq out2=part%_R2.fastq ways=3

Partitions paired-end reads into 3 sets. Both mates of a pair go to the same-numbered partition (for example, part1_R1.fastq and part1_R2.fastq), so pairing is preserved.

Base Pair Optimized Partitioning

partition.sh in=contigs.fasta out=split%.fasta ways=5 bp=t

Splits contigs into 5 files optimized for equal base pair distribution rather than equal sequence counts. Useful when sequences have highly variable lengths.

PacBio Subread Partitioning

partition.sh in=subreads.fastq out=zmw%.fastq ways=8 pacbio=t

Partitions PacBio subreads while keeping all subreads from the same ZMW together in the same output file.

Interleaved Input to Separate Files

partition.sh in=interleaved.fastq out1=part%_R1.fastq out2=part%_R2.fastq ways=6 int=t

Processes interleaved paired-end input and outputs to separate R1 and R2 files for each partition.

Algorithm Details

Partitioning Strategies

Partition provides two distribution algorithms, implemented in the processInner() and processInner_heap() methods:

Round-Robin Distribution (Default)

The default processInner() method uses simple cyclic assignment: each sequence goes to the next output file in turn, with the index advanced by modulo arithmetic (nextIndex=(nextIndex+1)%ways). Sequence counts per file therefore differ by at most one, and each assignment costs O(1). Base pair totals, however, can be uneven when sequence lengths vary significantly.
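The cyclic assignment can be illustrated with a minimal Java sketch. It is not the PartitionReads source: the class name RoundRobinSketch, the use of strings in place of Read objects, and the choice to start at index 0 and advance the index after each assignment are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class RoundRobinSketch {

    /** Distributes reads across `ways` lists by cycling an index (hypothetical helper). */
    static List<List<String>> partition(List<String> reads, int ways) {
        if (ways < 1) throw new IllegalArgumentException("ways must be positive");
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < ways; i++) out.add(new ArrayList<>());
        int nextIndex = 0; // assumption: the first read goes to partition 0
        for (String read : reads) {
            out.get(nextIndex).add(read);
            nextIndex = (nextIndex + 1) % ways; // same update as quoted in the text
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> parts = partition(List.of("r0", "r1", "r2", "r3", "r4"), 3);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("part" + i + ": " + parts.get(i));
        }
    }
}
```

With five reads and ways=3, this sketch places two reads in each of the first two partitions and one in the last, so counts differ by at most one.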

Base Pair Optimization (bp=t)

When bp=true, the tool switches to processInner_heap(), which uses a PriorityQueue<Partition> of Partition objects that implement Comparable<Partition>. Each partition tracks its cumulative base pairs in a bp field. For each sequence, the algorithm calls queue.poll() to retrieve the partition with the fewest base pairs, assigns the sequence via outLists[p.id].add(r1), updates the count with p.bp+=r1.pairLength(), and reinserts the partition using queue.add(p). Each assignment costs O(log ways) rather than O(1), in exchange for output files that are balanced by size.
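The poll, update, reinsert cycle can be sketched as follows. This is an illustration under assumptions, not the actual implementation: the class BalancedSplitSketch and its Bin type are hypothetical stand-ins for the tool's Partition objects, sequence lengths are passed as plain numbers, and breaking ties by partition id is a choice made here so the result is deterministic.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

class BalancedSplitSketch {

    /** One output file's running total; ordered by fewest bases, ties broken by id (assumption). */
    static final class Bin implements Comparable<Bin> {
        final int id;
        long bp = 0;
        Bin(int id) { this.id = id; }
        @Override public int compareTo(Bin o) {
            int c = Long.compare(bp, o.bp);
            return c != 0 ? c : Integer.compare(id, o.id);
        }
    }

    /** Returns, for each sequence length, the index of the bin it was assigned to. */
    static int[] assign(long[] lengths, int ways) {
        if (ways < 1) throw new IllegalArgumentException("ways must be positive");
        PriorityQueue<Bin> queue = new PriorityQueue<>();
        for (int i = 0; i < ways; i++) queue.add(new Bin(i));
        int[] target = new int[lengths.length];
        for (int i = 0; i < lengths.length; i++) {
            Bin b = queue.poll();  // bin with the fewest bases so far
            target[i] = b.id;
            b.bp += lengths[i];    // update the running total while the bin is outside the heap
            queue.add(b);          // reinsert so the heap restores its order
        }
        return target;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(assign(new long[] {100, 10, 10, 10, 50}, 2)));
    }
}
```

Removing the bin before changing its total matters: mutating an element while it sits inside a PriorityQueue would leave the heap order stale. In the example, the single 100 bp sequence occupies one bin while the four shorter ones accumulate in the other, giving totals of 100 and 80.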

PacBio Mode Implementation

PacBio mode (pacbio=t) uses Parse.parseZmw(r1.id) to extract ZMW identifiers from sequence headers and applies modulo arithmetic (zmw%ways) to assign all subreads from the same zero-mode waveguide to identical output files. This maintains the biological relationship between subreads from the same DNA molecule, critical for PacBio-specific downstream analyses that rely on subread consensus calling.
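A minimal sketch of ZMW-based routing is shown below. The parseZmw method here is a hypothetical stand-in for Parse.parseZmw, whose actual behavior is not documented in this guide; it assumes the common PacBio header layout movie/zmw/start_end and reads the second slash-delimited field. The class name ZmwRoutingSketch and the sample headers are also invented for illustration.

```java
class ZmwRoutingSketch {

    /** Hypothetical parser: assumes headers shaped like movie/zmw/start_end. */
    static int parseZmw(String id) {
        int first = id.indexOf('/');
        if (first < 0) throw new IllegalArgumentException("no ZMW field in: " + id);
        int second = id.indexOf('/', first + 1);
        String field = second < 0 ? id.substring(first + 1) : id.substring(first + 1, second);
        return Integer.parseInt(field);
    }

    /** Every subread with the same ZMW number maps to the same output index. */
    static int fileIndex(String id, int ways) {
        return parseZmw(id) % ways;
    }

    public static void main(String[] args) {
        String[] headers = {
            "m64011_190830_220126/4194370/0_1000",
            "m64011_190830_220126/4194370/1050_2100",
            "m64011_190830_220126/4194375/0_500"
        };
        for (String h : headers) {
            System.out.println(h + " -> file " + fileIndex(h, 8));
        }
    }
}
```

Because the index depends only on the ZMW number, the first two sample headers land in the same file. One consequence of this scheme is that file sizes follow the distribution of ZMW numbers and subread counts, so the split is not guaranteed to be even.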

Memory Management Architecture

The tool uses the ConcurrentReadInputStream and ConcurrentReadOutputStream classes for threaded I/O. Processing occurs in batches using an ArrayList<Read>[] outLists array, in which each partition has its own ArrayList. Memory usage stays roughly constant regardless of input size because sequences are processed in ListNum<Read> chunks and written to the output streams as each chunk completes. The priority queue in base pair optimization mode adds only O(ways) memory overhead, where ways is the number of output files.
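The batch pattern can be sketched without the BBTools stream classes. In the hypothetical example below, an Iterator of lists stands in for the chunked input, a BiConsumer stands in for the per-partition output streams, and the class name ChunkedRoutingSketch is invented. Whether the real code allocates fresh lists for every chunk is an assumption.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiConsumer;

class ChunkedRoutingSketch {

    /**
     * Routes each chunk into per-partition buffers and hands them to the sink
     * before reading the next chunk. Returns the largest number of reads that
     * were buffered at any one time.
     */
    static int route(Iterator<List<String>> chunks, int ways,
                     BiConsumer<Integer, List<String>> sink) {
        int nextIndex = 0;
        int peakBuffered = 0;
        while (chunks.hasNext()) {
            List<String> chunk = chunks.next();
            // Fresh buffers per chunk, so nothing accumulates across chunks.
            List<List<String>> outLists = new ArrayList<>();
            for (int i = 0; i < ways; i++) outLists.add(new ArrayList<>());
            for (String read : chunk) {
                outLists.get(nextIndex).add(read);
                nextIndex = (nextIndex + 1) % ways;
            }
            peakBuffered = Math.max(peakBuffered, chunk.size());
            for (int i = 0; i < ways; i++) sink.accept(i, outLists.get(i));
        }
        return peakBuffered;
    }

    public static void main(String[] args) {
        List<List<String>> input = List.of(List.of("a", "b", "c"), List.of("d", "e"));
        int peak = route(input.iterator(), 2,
                (i, reads) -> System.out.println("write to part" + i + ": " + reads));
        System.out.println("peak buffered reads: " + peak);
    }
}
```

The peak buffer size is bounded by the largest chunk rather than by the total input, which is the property the paragraph above describes.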

File Format Handling

The tool creates FileFormat arrays (ffout1[], ffout2[]) during initialization, with each array element corresponding to one output partition. Format detection uses FileFormat.testInput() and FileFormat.testOutput() methods that automatically handle FASTA, FASTQ, and compressed formats. The '%' symbol replacement occurs via out1.replaceFirst("%", ""+i) during the FileFormat array creation loop. Compression levels are managed through the FileFormat objects using the zl parameter passed to the underlying compression libraries.
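The filename expansion can be tried in isolation. The sketch below applies the same String.replaceFirst call quoted above to a pattern; the class name OutputNameSketch, the expand helper, and the check for a missing '%' are assumptions for illustration rather than the tool's own code.

```java
class OutputNameSketch {

    /** Expands a pattern such as "part%.fasta" into one filename per partition. */
    static String[] expand(String pattern, int ways) {
        if (!pattern.contains("%")) {
            throw new IllegalArgumentException("output pattern must contain '%': " + pattern);
        }
        String[] names = new String[ways];
        for (int i = 0; i < ways; i++) {
            // replaceFirst takes a regex; '%' has no special meaning in one.
            names[i] = pattern.replaceFirst("%", "" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : expand("part%.fasta", 4)) {
            System.out.println(name);
        }
    }
}
```

For the pattern part%.fasta and ways=4 this yields part0.fasta through part3.fasta. Since replaceFirst substitutes only the first match, a pattern with a second '%' would keep that character literally in this sketch.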

Support

For questions and support: