Fuse
Fuses sequences together, padding gaps with Ns. Supports two modes: concatenating all sequences into a single long sequence (default), or fusing paired-end reads together with N-padding.
Basic Usage
fuse.sh in=<input file> out=<output file> pad=<number of Ns>
The default behavior concatenates all input sequences into a single long sequence, with N-padding between consecutive sequences. Use fusepairs=t to fuse the two reads of each pair instead.
Parameters
Parameters control input/output, padding behavior, sequence length limits, and output formatting.
Input/Output Parameters
- in=<file>
- The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard in. Accepts fasta or fastq format, gzipped files supported.
- out=<file>
- The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard out. Output format matches input format.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Only applies to gzipped output.
Fusion Parameters
- pad=300
- Pad this many N characters between sequences. For amino acid sequences, uses X instead of N. Default: 300.
- fusepairs=f
- Default mode fuses all sequences into one long sequence. Setting fusepairs=t will instead fuse each pair together, with the second read reverse-complemented. Default: false.
- maxlen=2g
- If positive, fused sequences will not be made longer than this. When appending the next input would exceed the limit, a new output sequence is started. Supports K/M/G suffixes. Default: 2G (2 billion bases).
- padsymbol=N
- Character to use for padding between sequences. Default is N for nucleotide sequences, X for amino acid sequences. (Java parameter only)
Sequence Naming Parameters
- name=
- Set name of output sequence. Default is the name of the first input sequence. When multiple output sequences are created (due to maxlen), numbers are appended.
- addnumber=f
- Always add sequential numbers to output sequence names, even for the first sequence. (Java parameter only)
Quality Parameters
- quality=30
- Fake quality scores to use when generating fastq from fasta input, or when input sequences lack quality scores. Quality 30 corresponds to 99.9% accuracy. Default: 30.
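The relationship between a quality score and accuracy follows the standard Phred definition, where the error probability is 10^(-Q/10). The sketch below is only an illustration of that arithmetic and is not taken from the fuse.sh source; the class and method names are made up, and the Phred+33 ASCII encoding is assumed because it is the common fastq convention.

```java
/** Illustrative sketch of Phred arithmetic; names here are hypothetical. */
final class PhredSketch {

    /** Probability that a base call is wrong: 10^(-Q/10). */
    static double errorProbability(int q) {
        return Math.pow(10.0, -q / 10.0);
    }

    /** Accuracy is the complement of the error probability. */
    static double accuracy(int q) {
        return 1.0 - errorProbability(q);
    }

    /** ASCII character for this score, assuming Phred+33 encoding. */
    static char toAscii33(int q) {
        return (char) (q + 33);
    }

    public static void main(String[] args) {
        System.out.printf("Q30 accuracy: %.4f%n", accuracy(30));
        System.out.println("Q30 as Phred+33: " + toAscii33(30));
    }
}
```

With quality=30, every generated base would carry an error probability of 0.001, which is where the 99.9% figure comes from.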
Advanced Parameters
- amino=f
- Process amino acid sequences instead of nucleotide sequences. Changes default padding symbol from N to X. (Java parameter only)
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
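To check how much heap a given -Xmx setting actually gives the JVM, the limit can be read back from the standard Runtime API. This is a standalone sketch, not part of BBTools; the class name HeapReport is hypothetical.

```java
/** Standalone sketch: prints the heap limit the JVM is running with. */
final class HeapReport {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as java -Xmx200m HeapReport should report a value close to 200; depending on the JVM and garbage collector, the number may be slightly lower than the requested size.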
Examples
Basic Sequence Fusion
fuse.sh in=sequences.fasta out=fused.fasta pad=500
Concatenates all sequences from sequences.fasta into a single sequence, with 500 Ns between consecutive original sequences.
Fuse Paired-End Reads
fuse.sh in=reads.fastq out=fused_pairs.fastq fusepairs=t pad=100
Fuses each pair of reads into a single read, with 100 Ns between the two reads of the pair. The second read in each pair is reverse-complemented before fusion.
Limited Length Output
fuse.sh in=contigs.fasta out=chunks.fasta maxlen=1000000 pad=200
Creates fused sequences up to 1 million bases each. When the length limit is reached, starts a new output sequence.
Custom Sequence Naming
fuse.sh in=input.fasta out=output.fasta name=scaffold pad=300
Creates fused sequences with custom names starting with "scaffold". Multiple sequences will be named "scaffold 1", "scaffold 2", etc.
Amino Acid Sequences
fuse.sh in=proteins.fasta out=fused_proteins.fasta amino=t pad=50
Fuses amino acid sequences using X as the padding character instead of N.
Algorithm Details
Fusion Strategy
FuseSequence implements two distinct fusion modes controlled by the fusePairs boolean parameter in the processReadPair() method:
Default Mode (fusepairs=false)
In concatenation mode, all input sequences are appended to a single ByteBuilder buffer:
- Sequential Processing: The processRead() method appends each sequence to the shared ByteBuilder.bases buffer
- N-padding Implementation: Inserts npad characters using PAD_SYMBOL (default 'N', 'X' for amino acids) via ByteBuilder.append()
- Length Control: Uses condition (bases.length+initialLength1+initialLength2+npad>maxlen) to trigger bufferToRead() before appending reads that would push the buffer past maxlen
- ByteBuilder Architecture: Uses two ByteBuilder instances (bases, quals) that dynamically resize internal byte arrays
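The concatenation steps above can be sketched as follows. This is a simplified approximation, not the FuseSequence code: StringBuilder stands in for ByteBuilder, only bases are handled (no qualities), reads are processed one at a time rather than as pairs, and the class and method names are hypothetical.

```java
/** Simplified sketch of concatenation mode; StringBuilder stands in for ByteBuilder. */
final class ConcatSketch {

    /**
     * Fuses sequences in order, inserting npad copies of padSymbol between them.
     * If maxlen > 0, the buffer is flushed to a new output sequence whenever
     * appending the next input plus its padding would exceed maxlen.
     * A single input longer than maxlen is still emitted whole in this sketch.
     */
    static java.util.List<String> fuseAll(String[] seqs, int npad, char padSymbol, long maxlen) {
        java.util.List<String> out = new java.util.ArrayList<>();
        StringBuilder bases = new StringBuilder();
        for (String s : seqs) {
            // Flush first if this input would overflow the limit.
            if (maxlen > 0 && bases.length() > 0
                    && (long) bases.length() + s.length() + npad > maxlen) {
                out.add(bases.toString());
                bases.setLength(0); // reset the buffer but keep its capacity
            }
            // Pad only between sequences, never before the first one.
            if (bases.length() > 0) {
                for (int i = 0; i < npad; i++) {
                    bases.append(padSymbol);
                }
            }
            bases.append(s);
        }
        if (bases.length() > 0) {
            out.add(bases.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] seqs = {"ACGT", "GG", "TTT"};
        System.out.println(fuseAll(seqs, 2, 'N', 0)); // prints [ACGTNNGGNNTTT]
        System.out.println(fuseAll(seqs, 2, 'N', 8)); // prints [ACGTNNGG, TTT]
    }
}
```

In the second call, adding TTT plus two pad characters to the 8-base buffer would exceed the limit of 8, so the buffer is emitted and TTT starts a new output sequence without leading padding.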
Pair Fusion Mode (fusepairs=true)
In paired mode, the fusePair() method creates fused reads using direct array manipulation:
- Read Pairing: Processes Read.mate pairs together from ConcurrentReadInputStream
- Reverse Complement: Calls r2.reverseComplement() on the second read before fusion
- Direct Array Allocation: Creates new byte[len] arrays where len = r1.length() + r2.length() + npad
- Quality Preservation: Maintains original quality arrays when present, and leaves quality null when absent
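A minimal sketch of the pair-fusion layout described above, assuming the fused read is ordered as read 1, padding, then the reverse complement of read 2. It works on plain byte arrays rather than Read objects, ignores qualities, and maps any base other than A, C, G, or T to N; all names are hypothetical and the real fusePair() may differ in detail.

```java
/** Sketch of pair fusion on raw byte arrays; not the BBTools implementation. */
final class PairFuseSketch {

    /** Complement of one uppercase base; anything unrecognized becomes N. */
    static byte complement(byte b) {
        switch (b) {
            case 'A': return (byte) 'T';
            case 'C': return (byte) 'G';
            case 'G': return (byte) 'C';
            case 'T': return (byte) 'A';
            default:  return (byte) 'N';
        }
    }

    /** Returns a new array holding the reverse complement of bases. */
    static byte[] reverseComplement(byte[] bases) {
        byte[] rc = new byte[bases.length];
        for (int i = 0; i < bases.length; i++) {
            rc[i] = complement(bases[bases.length - 1 - i]);
        }
        return rc;
    }

    /** Allocates r1.length + r2.length + npad bytes and fills them in order. */
    static byte[] fusePair(byte[] r1, byte[] r2, int npad, byte padSymbol) {
        byte[] rc2 = reverseComplement(r2);
        byte[] fused = new byte[r1.length + rc2.length + npad];
        System.arraycopy(r1, 0, fused, 0, r1.length);
        for (int i = 0; i < npad; i++) {
            fused[r1.length + i] = padSymbol;
        }
        System.arraycopy(rc2, 0, fused, r1.length + npad, rc2.length);
        return fused;
    }

    public static void main(String[] args) {
        java.nio.charset.Charset ascii = java.nio.charset.StandardCharsets.US_ASCII;
        byte[] fused = fusePair("ACGT".getBytes(ascii), "AAC".getBytes(ascii), 3, (byte) 'N');
        System.out.println(new String(fused, ascii)); // prints ACGTNNNGTT
    }
}
```

Because the output length is known up front, a single exact-size allocation is enough, which is why this mode does not need a growable buffer.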
Memory Management
FuseSequence uses ListNum-based streaming with controlled memory allocation:
- ListNum Processing: Reads batches via ConcurrentReadInputStream.nextList() to avoid loading entire files
- ByteBuilder Reuse: Uses clear() method on ByteBuilder instances to reset without deallocation
- Length Tracking: Maintains bases.length() counter for maxlen boundary detection
- Conditional Quality Arrays: Only creates quality ByteBuilder when input Read.quality is non-null
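The buffer-reuse idea can be illustrated with a tiny growable byte buffer whose clear() resets the logical length while keeping the backing array. This is a hypothetical stand-in written for illustration; it is not the ByteBuilder class and makes no claim about its internals beyond what is described above.

```java
/** Hypothetical minimal byte buffer showing clear() without deallocation. */
final class ReusableByteBuffer {
    private byte[] array = new byte[16];
    private int length = 0;

    /** Appends one byte, doubling the backing array when it is full. */
    void append(byte b) {
        if (length == array.length) {
            byte[] bigger = new byte[array.length * 2];
            System.arraycopy(array, 0, bigger, 0, length);
            array = bigger;
        }
        array[length++] = b;
    }

    /** Resets the logical length; the backing array is kept for reuse. */
    void clear() {
        length = 0;
    }

    int length() {
        return length;
    }

    int capacity() {
        return array.length;
    }

    public static void main(String[] args) {
        ReusableByteBuffer buf = new ReusableByteBuffer();
        for (int i = 0; i < 100; i++) {
            buf.append((byte) 'A');
        }
        buf.clear();
        System.out.println(buf.length() + " bytes used, capacity " + buf.capacity());
    }
}
```

After clear(), the next fused sequence can be built in the already-grown array, so repeated flushes caused by maxlen do not trigger repeated reallocation.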
Quality Score Management
Quality handling varies by input format, implemented in processRead() and fusePair() methods:
- Fastq Input: Preserves r.quality arrays via ByteBuilder.append(r.quality)
- Fasta Input: Generates qualities using defaultQuality parameter (default 30) in a loop
- Padding Regions: Inserts quality byte 0 for N-padding using quals.append((byte)0)
- Mixed Quality: When r.quality==null, fills with defaultQuality via ByteBuilder.append() loop
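The quality rules above can be sketched on plain arrays: real scores are copied when present, missing scores are filled with the default, and padding positions get quality 0. The method name and signature are hypothetical, and the sketch assumes each non-null quality array is at least as long as its read.

```java
/** Sketch of fused quality construction; not the BBTools implementation. */
final class QualitySketch {

    /**
     * Builds the fused quality array for reads of the given lengths.
     * quals[i] may be null, meaning read i has no quality scores.
     */
    static byte[] fuseQualities(byte[][] quals, int[] lengths, int npad, byte defaultQuality) {
        int total = npad * Math.max(0, lengths.length - 1);
        for (int len : lengths) {
            total += len;
        }
        byte[] fused = new byte[total];
        int pos = 0;
        for (int i = 0; i < lengths.length; i++) {
            if (i > 0) {
                // Padding bases get quality 0 (written explicitly for clarity).
                for (int p = 0; p < npad; p++) {
                    fused[pos++] = (byte) 0;
                }
            }
            if (quals[i] != null) {
                // Input had scores: copy them unchanged.
                System.arraycopy(quals[i], 0, fused, pos, lengths[i]);
                pos += lengths[i];
            } else {
                // No scores: fill with the default value.
                for (int k = 0; k < lengths[i]; k++) {
                    fused[pos++] = defaultQuality;
                }
            }
        }
        return fused;
    }

    public static void main(String[] args) {
        byte[][] quals = {{40, 40}, null};
        int[] lengths = {2, 3};
        byte[] fused = fuseQualities(quals, lengths, 2, (byte) 30);
        System.out.println(java.util.Arrays.toString(fused)); // prints [40, 40, 0, 0, 30, 30, 30]
    }
}
```

Giving the padding a quality of 0 lets downstream tools distinguish real base calls from filler, even when the padding symbol itself is an ordinary N.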
Performance Characteristics
- Time Complexity: O(n) linear scan with ByteBuilder.append() operations
- Memory Usage: Two ByteBuilder instances plus current ListNum batch (typically <1000 reads)
- I/O Pattern: Streaming via ConcurrentReadInputStream/ConcurrentReadOutputStream with ListNum batching
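- Scalability: maxlen parameter keeps each fused sequence within Shared.MAX_ARRAY_LEN, which cannot exceed Java's array-length limit of 2^31-1 elements; the 2G default (2 billion bases) fits under that limit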
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org