Partition
Splits a sequence file evenly into multiple files.
Basic Usage
partition.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> ways=<number>
The output filenames must contain a '%' symbol, which will be replaced by a number (0, 1, 2, etc.). For paired-end data, if out2 is not specified, data will be written interleaved to the single output pattern.
Parameters
Parameters control input/output files, partitioning strategy, and specialized modes for different sequencing technologies.
Parameters and their defaults
- in=<file>
- Primary input file containing the sequences to be partitioned.
- out=<file>
- Output file pattern (containing a % symbol, like 'part%.fa'). The % will be replaced with partition numbers starting from 0.
- ways=-1
- The number of output files to create; must be positive. The default of -1 is not a valid value, so ways must be set explicitly.
- pacbio=f
- Set to true to keep PacBio subreads together. When enabled, sequences with the same ZMW (zero-mode waveguide) identifier will be assigned to the same output file to maintain subread groupings.
- bp=f
- Optimize for an even split by base pairs instead of sequences. Not compatible with PacBio mode. Uses a priority queue algorithm to balance the total number of base pairs across output files rather than just sequence counts.
- ow=f
- (overwrite) Set to true to overwrite output files that already exist.
- app=f
- (append) Set to true to append to existing output files instead of overwriting them.
- zl=4
- (ziplevel) Set compression level, 1 (low) to 9 (max). Controls the compression level for gzipped output files.
- int=f
- (interleaved) Determines whether INPUT file is considered interleaved. Set to true if the input file contains interleaved paired-end reads.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions. Can improve performance in production environments.
Examples
Basic Sequence Partitioning
partition.sh in=sequences.fasta out=part%.fasta ways=4
Splits sequences.fasta into 4 files: part0.fasta, part1.fasta, part2.fasta, and part3.fasta, distributing sequences evenly using round-robin allocation.
Paired-End Data Partitioning
partition.sh in1=reads_R1.fastq in2=reads_R2.fastq out1=part%_R1.fastq out2=part%_R2.fastq ways=3
Partitions paired-end reads into 3 sets while maintaining pair relationships across files.
Base Pair Optimized Partitioning
partition.sh in=contigs.fasta out=split%.fasta ways=5 bp=t
Splits contigs into 5 files optimized for equal base pair distribution rather than equal sequence counts. Useful when sequences have highly variable lengths.
PacBio Subread Partitioning
partition.sh in=subreads.fastq out=zmw%.fastq ways=8 pacbio=t
Partitions PacBio subreads while keeping all subreads from the same ZMW together in the same output file.
Interleaved Input to Separate Files
partition.sh in=interleaved.fastq out1=part%_R1.fastq out2=part%_R2.fastq ways=6 int=t
Processes interleaved paired-end input and outputs to separate R1 and R2 files for each partition.
Algorithm Details
Partitioning Strategies
Partition provides two partitioning algorithms, implemented in the processInner() and processInner_heap() methods:
Round-Robin Distribution (Default)
The default processInner() method assigns sequences cyclically using modulo arithmetic (nextIndex=(nextIndex+1)%ways), so each sequence goes to the next output file in turn. Assignment is O(1) per sequence, and sequence counts per file differ by at most one. Base pair totals may still be uneven when sequence lengths vary significantly.
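The cyclic assignment described above can be sketched as follows. This is a minimal illustration, not the tool's source: the class RoundRobinSketch and the method distribute are hypothetical names, and the sketch assumes the first sequence goes to partition 0. Only the update nextIndex=(nextIndex+1)%ways is taken from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not the tool's actual source.
class RoundRobinSketch {

    /** Distributes items over `ways` lists in cyclic order. */
    static <T> List<List<T>> distribute(List<T> items, int ways) {
        if (ways < 1) {
            throw new IllegalArgumentException("ways must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < ways; i++) {
            out.add(new ArrayList<>());
        }
        int nextIndex = 0; // assumption: the first item lands in partition 0
        for (T item : items) {
            out.get(nextIndex).add(item);
            nextIndex = (nextIndex + 1) % ways; // same update as in the text
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> reads = Arrays.asList(
            "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9");
        List<List<String>> parts = distribute(reads, 4);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("part" + i + ": " + parts.get(i));
        }
    }
}
```

With 10 items and ways=4, partitions 0 and 1 receive three items each and partitions 2 and 3 receive two, which shows why counts differ by at most one.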
Base Pair Optimization (bp=t)
When bp=true, the tool switches to processInner_heap(), which keeps a PriorityQueue<Partition> of custom Partition objects that implement Comparable<Partition>. Each partition tracks its cumulative base pairs (bp field). For each sequence, the algorithm calls queue.poll() to retrieve the partition with the fewest base pairs, assigns the sequence via outLists[p.id].add(r1), updates the count with p.bp+=r1.pairLength(), and reinserts the partition using queue.add(p). Each assignment costs O(log ways) rather than O(1), in exchange for keeping the output files balanced by size.
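The poll, update, reinsert cycle described above can be sketched with java.util.PriorityQueue. This is a simplified illustration under stated assumptions: sequences are represented only by their lengths, the class names HeapBalanceSketch and Part are hypothetical stand-ins for the tool's own classes, and ties are broken by partition id, which the description does not specify.

```java
import java.util.PriorityQueue;

// Hypothetical illustration; not the tool's actual source.
class HeapBalanceSketch {

    /** One output file, ordered by the number of bases assigned so far. */
    static final class Part implements Comparable<Part> {
        final int id;
        long bp = 0;

        Part(int id) {
            this.id = id;
        }

        @Override
        public int compareTo(Part other) {
            int c = Long.compare(bp, other.bp);
            // Tie-break by id so the result is deterministic (assumption).
            return c != 0 ? c : Integer.compare(id, other.id);
        }
    }

    /** Returns, for each sequence length, the index of the chosen partition. */
    static int[] assign(long[] lengths, int ways) {
        if (ways < 1) {
            throw new IllegalArgumentException("ways must be positive");
        }
        PriorityQueue<Part> queue = new PriorityQueue<>();
        for (int i = 0; i < ways; i++) {
            queue.add(new Part(i));
        }
        int[] chosen = new int[lengths.length];
        for (int i = 0; i < lengths.length; i++) {
            Part p = queue.poll(); // partition with the fewest bases so far
            chosen[i] = p.id;
            p.bp += lengths[i];    // mutate only while p is outside the queue
            queue.add(p);          // reinsert with its new priority
        }
        return chosen;
    }

    public static void main(String[] args) {
        long[] lengths = {900, 100, 100, 100, 500, 300};
        int[] chosen = assign(lengths, 3);
        for (int i = 0; i < lengths.length; i++) {
            System.out.println("length " + lengths[i] + " -> partition " + chosen[i]);
        }
    }
}
```

For the lengths in main and ways=3, the partitions end with 900, 500, and 600 bases. Round-robin assignment of the same input would give 1000, 600, and 400. The count is updated between poll and add because changing the priority of an element that is still inside a PriorityQueue does not reorder it.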
PacBio Mode Implementation
PacBio mode (pacbio=t) uses Parse.parseZmw(r1.id) to extract the ZMW identifier from each sequence header and applies modulo arithmetic (zmw%ways), so all subreads from the same zero-mode waveguide are assigned to the same output file. This preserves the relationship between subreads from the same DNA molecule, which is critical for PacBio-specific downstream analyses that rely on subread consensus calling.
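The ZMW-based assignment can be sketched as below. The sketch assumes the common PacBio subread header layout movie/zmw/start_end. The class ZmwSketch and its parseZmw helper are hypothetical stand-ins, and the tool's own Parse.parseZmw may handle more header variants. The headers in main are made-up examples.

```java
// Hypothetical illustration; Parse.parseZmw in the tool may behave differently.
class ZmwSketch {

    /** Extracts the ZMW number from a header shaped like movie/zmw/start_end. */
    static int parseZmw(String id) {
        int first = id.indexOf('/');
        if (first < 0) {
            throw new IllegalArgumentException("No ZMW field in: " + id);
        }
        int second = id.indexOf('/', first + 1);
        String field = (second < 0)
            ? id.substring(first + 1)
            : id.substring(first + 1, second);
        return Integer.parseInt(field);
    }

    /** Maps a header to an output index so that all subreads of a ZMW agree. */
    static int partitionFor(String id, int ways) {
        return parseZmw(id) % ways;
    }

    public static void main(String[] args) {
        String[] ids = {
            "m64011_190830_220126/4194370/0_5120",
            "m64011_190830_220126/4194370/5165_10301",
            "m64011_190830_220126/4194375/0_7423"
        };
        for (String id : ids) {
            System.out.println(id + " -> " + partitionFor(id, 8));
        }
    }
}
```

With ways=8, both subreads of ZMW 4194370 map to partition 2, while ZMW 4194375 maps to partition 7. Because the index depends only on the ZMW number, file sizes are not balanced in this mode, which is consistent with bp=t being incompatible with PacBio mode.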
Memory Management Architecture
The tool uses the ConcurrentReadInputStream and ConcurrentReadOutputStream classes for threaded I/O. Processing occurs in batches using an ArrayList<Read>[] outLists array, in which each partition has its own ArrayList. Memory usage stays roughly constant regardless of input size because sequences are processed in ListNum<Read> chunks and immediately written to the output streams. The priority queue in base pair optimization mode adds only O(ways) memory overhead, where ways is the number of output files.
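The batch pattern described above can be illustrated with a small single-threaded sketch. It does not use the tool's stream classes: records are plain strings, the sinks are StringBuilder objects standing in for output streams, and the names BatchSketch, process, and batchSize are hypothetical. The point is only that the buffers are emptied after every batch, so the number of records held at once is bounded by the batch size rather than the input size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration; the tool's stream classes are not used here.
class BatchSketch {

    /**
     * Reads records in batches of `batchSize`, routes each batch into
     * per-partition buffers, and flushes the buffers before the next batch.
     * Returns the largest number of records buffered at once.
     */
    static int process(Iterator<String> input, List<StringBuilder> sinks, int batchSize) {
        int ways = sinks.size();
        if (ways < 1 || batchSize < 1) {
            throw new IllegalArgumentException("ways and batchSize must be positive");
        }
        List<List<String>> outLists = new ArrayList<>();
        for (int i = 0; i < ways; i++) {
            outLists.add(new ArrayList<>());
        }
        int nextIndex = 0;
        int peak = 0;
        while (input.hasNext()) {
            int buffered = 0;
            // Fill the per-partition buffers with one batch of records.
            while (input.hasNext() && buffered < batchSize) {
                outLists.get(nextIndex).add(input.next());
                nextIndex = (nextIndex + 1) % ways;
                buffered++;
            }
            peak = Math.max(peak, buffered);
            // Write each buffer to its sink, then clear it for reuse.
            for (int i = 0; i < ways; i++) {
                for (String rec : outLists.get(i)) {
                    sinks.get(i).append(rec).append('\n');
                }
                outLists.get(i).clear();
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        List<String> reads = Arrays.asList(
            "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9");
        List<StringBuilder> sinks = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            sinks.add(new StringBuilder());
        }
        int peak = process(reads.iterator(), sinks, 4);
        System.out.println("peak buffered records: " + peak);
    }
}
```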
File Format Handling
The tool creates FileFormat arrays (ffout1[], ffout2[]) during initialization, with each array element corresponding to one output partition. Format detection uses FileFormat.testInput() and FileFormat.testOutput() methods that automatically handle FASTA, FASTQ, and compressed formats. The '%' symbol replacement occurs via out1.replaceFirst("%", ""+i) during the FileFormat array creation loop. Compression levels are managed through the FileFormat objects using the zl parameter passed to the underlying compression libraries.
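The filename expansion can be sketched around the replaceFirst call quoted above. The class PatternSketch and the method expand are hypothetical names, and the check for a missing '%' is an assumption about how a caller might validate the pattern, not a description of the tool's error handling.

```java
// Hypothetical illustration built around the replaceFirst call quoted above.
class PatternSketch {

    /** Expands a pattern such as "part%.fa" into one filename per partition. */
    static String[] expand(String pattern, int ways) {
        if (pattern.indexOf('%') < 0) {
            throw new IllegalArgumentException(
                "Output pattern must contain '%': " + pattern);
        }
        String[] names = new String[ways];
        for (int i = 0; i < ways; i++) {
            // Only the first '%' is replaced, numbering starts at 0.
            names[i] = pattern.replaceFirst("%", "" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : expand("part%.fa", 3)) {
            System.out.println(name);
        }
    }
}
```

For the pattern part%.fa and ways=3, this yields part0.fa, part1.fa, and part2.fa. Because replaceFirst is used, a pattern containing more than one '%' keeps the later ones unchanged.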
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org