Fuse

Script: fuse.sh
Package: synth
Class: FuseSequence.java

Fuses sequences together, padding gaps with Ns. Supports two modes: concatenating all sequences into a single long sequence (default), or fusing paired-end reads together with N-padding.

Basic Usage

fuse.sh in=<input file> out=<output file> pad=<number of Ns>

By default, all input sequences are concatenated into a single long sequence, with pad Ns inserted between consecutive sequences. Set fusepairs=t to fuse the two reads of each pair instead.
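
To make the default behavior concrete, here is a minimal Python sketch of the concatenation idea. It is an illustration only, not the tool's implementation: fuse_all is a hypothetical name, and the sketch assumes padding is placed only between sequences, never at the ends.

```python
def fuse_all(seqs, pad=300, symbol="N"):
    """Join sequences into one string with `pad` copies of `symbol` between neighbors."""
    return (symbol * pad).join(seqs)


fused = fuse_all(["ACGT", "GGCC", "TTAA"], pad=3)
print(fused)  # ACGTNNNGGCCNNNTTAA
```

With pad=3, three input sequences of 4 bases each give one sequence of 4 + 3 + 4 + 3 + 4 = 18 characters.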

Parameters

Parameters control input/output, padding behavior, sequence length limits, and output formatting.

Input/Output Parameters

in=<file>: The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard in. Accepts fasta or fastq input; gzipped files are supported.
out=<file>: The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard out. Output format matches input format.
overwrite=t: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Only applies to gzipped output.

Fusion Parameters

pad=300: Pad this many N characters between sequences. For amino acid sequences, uses X instead of N. Default: 300.
fusepairs=f: Default mode fuses all sequences into one long sequence. Setting fusepairs=t will instead fuse each pair together, with the second read reverse-complemented. Default: false.
maxlen=2g: If positive, fused sequences will not be made longer than this. When the limit is reached, a new output sequence is started. Supports K/M/G suffixes. Default: 2G (2 billion bases).
padsymbol=N: Character to use for padding between sequences. Default is N for nucleotide sequences, X for amino acid sequences. (Java parameter only)
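
The K/M/G suffixes accepted by maxlen can be read as decimal multipliers, which is consistent with the stated default of 2G meaning 2 billion bases. The Python sketch below illustrates that reading. The helper parse_kmg is hypothetical, not the tool's own parser, and it handles whole-number values only.

```python
def parse_kmg(text):
    """Parse values such as '500', '10k', '3m', or '2g' using decimal multipliers."""
    multipliers = {"k": 10**3, "m": 10**6, "g": 10**9}
    s = text.strip().lower()
    if s and s[-1] in multipliers:
        return int(s[:-1]) * multipliers[s[-1]]
    return int(s)


print(parse_kmg("2g"))       # 2000000000
print(parse_kmg("1000000"))  # 1000000
```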

Sequence Naming Parameters

name=: Set name of output sequence. Default is the name of the first input sequence. When multiple output sequences are created (due to maxlen), numbers are appended.
addnumber=f: Always add sequential numbers to output sequence names, even for the first sequence. (Java parameter only)

Quality Parameters

quality=30: Fake quality scores to use when generating fastq from fasta input, or when input sequences lack quality scores. Quality 30 corresponds to 99.9% accuracy. Default: 30.
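
The 99.9% figure follows from the Phred definition, where a score Q corresponds to an error probability of 10^(-Q/10). The short Python sketch below works through that arithmetic. The ASCII conversion assumes the common Phred+33 fastq encoding, which this document does not state explicitly, and both function names are illustrative.

```python
def phred_to_accuracy(q):
    """Probability that a base call with Phred score q is correct."""
    return 1.0 - 10 ** (-q / 10)


def phred_to_ascii(q, offset=33):
    """Fastq quality character for score q, assuming Phred+33 encoding."""
    return chr(q + offset)


print(round(phred_to_accuracy(30), 6))  # 0.999
print(phred_to_ascii(30))               # ?
```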

Advanced Parameters

amino=f: Process amino acid sequences instead of nucleotide sequences. Changes default padding symbol from N to X. (Java parameter only)

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Sequence Fusion

fuse.sh in=sequences.fasta out=fused.fasta pad=500

Concatenates all sequences from sequences.fasta into a single sequence, with 500 Ns between consecutive original sequences.

Fuse Paired-End Reads

fuse.sh in=reads.fastq out=fused_pairs.fastq fusepairs=t pad=100

Fuses each pair of reads into a single sequence, with 100 Ns between the two reads of the pair. The second read in each pair is reverse-complemented before fusion.

Limited Length Output

fuse.sh in=contigs.fasta out=chunks.fasta maxlen=1000000 pad=200

Creates fused sequences up to 1 million bases each. When the length limit is reached, starts a new output sequence.

Custom Sequence Naming

fuse.sh in=input.fasta out=output.fasta name=scaffold pad=300

Creates fused sequences with custom names starting with "scaffold". Multiple sequences will be named "scaffold 1", "scaffold 2", etc.

Amino Acid Sequences

fuse.sh in=proteins.fasta out=fused_proteins.fasta amino=t pad=50

Fuses amino acid sequences using X as the padding character instead of N.

Algorithm Details

Fusion Strategy

FuseSequence implements two distinct fusion modes controlled by the fusePairs boolean parameter in the processReadPair() method:

Default Mode (fusepairs=false)

In concatenation mode, all input sequences are appended to a single ByteBuilder buffer:

Sequential Processing: The processRead() method appends each sequence to the shared ByteBuilder.bases buffer
N-padding Implementation: Inserts npad characters using PAD_SYMBOL (default 'N', 'X' for amino acids) via ByteBuilder.append()
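Length Control: Before appending, checks the condition (bases.length+initialLength1+initialLength2+npad>maxlen); if adding the incoming read lengths plus padding would exceed maxlen, bufferToRead() is called so that a new output sequence is started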
ByteBuilder Architecture: Uses two ByteBuilder instances (bases, quals) that dynamically resize internal byte arrays
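
The length check described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Java code: fuse_with_limit is a hypothetical name, a bytearray stands in for ByteBuilder, mates are ignored so only one incoming length is added, and the buffer is assumed to be flushed only when it is non-empty.

```python
def fuse_with_limit(seqs, pad=300, maxlen=2_000_000_000, symbol="N"):
    """Yield fused sequences, starting a new one before `maxlen` would be exceeded."""
    buf = bytearray()
    for seq in seqs:
        # Analogue of the documented check: current + incoming + padding > maxlen.
        if maxlen > 0 and buf and len(buf) + len(seq) + pad > maxlen:
            yield buf.decode("ascii")
            buf.clear()  # reuse the same buffer for the next output sequence
        if buf:  # assumption: padding goes only between sequences
            buf.extend(symbol.encode("ascii") * pad)
        buf.extend(seq.encode("ascii"))
    if buf:
        yield buf.decode("ascii")


chunks = list(fuse_with_limit(["ACGT", "GGCC", "TTAA"], pad=2, maxlen=10))
print(chunks)  # ['ACGTNNGGCC', 'TTAA']
```

In this run the first two sequences fit exactly (4 + 2 + 4 = 10), while adding the third would need 16 characters, so it starts a new output sequence.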

Pair Fusion Mode (fusepairs=true)

In paired mode, the fusePair() method creates fused reads using direct array manipulation:

Read Pairing: Processes Read.mate pairs together from ConcurrentReadInputStream
Reverse Complement: Calls r2.reverseComplement() on the second read before fusion
Direct Array Allocation: Creates new byte[len] arrays where len = r1.length() + r2.length() + npad
Quality Preservation: Maintains original quality arrays when present, nulls when absent
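
As a rough illustration of pair fusion, the Python sketch below builds read 1, then padding, then the reverse complement of read 2, giving a total length of r1 + r2 + npad as described above. The function names are hypothetical. Reversing the second read's qualities along with its bases and assigning quality 0 to the padding are assumptions made for the sketch, not confirmed details of fusePair().

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def fuse_pair(r1, r2, q1=None, q2=None, pad=300, symbol="N"):
    """Return (bases, quals) for read 1 + padding + reverse-complemented read 2."""
    bases = r1 + symbol * pad + reverse_complement(r2)
    if q1 is None or q2 is None:
        return bases, None  # no qualities in, no qualities out
    # Read 2 qualities are reversed to stay aligned with its reversed bases.
    quals = list(q1) + [0] * pad + list(reversed(q2))
    return bases, quals


bases, quals = fuse_pair("AACC", "TTTG", [30, 30, 30, 30], [10, 20, 30, 40], pad=2)
print(bases)  # AACCNNCAAA
print(quals)  # [30, 30, 30, 30, 0, 0, 40, 30, 20, 10]
```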

Memory Management

FuseSequence uses ListNum-based streaming with controlled memory allocation:

ListNum Processing: Reads batches via ConcurrentReadInputStream.nextList() to avoid loading entire files
ByteBuilder Reuse: Uses clear() method on ByteBuilder instances to reset without deallocation
Length Tracking: Maintains bases.length() counter for maxlen boundary detection
Conditional Quality Arrays: Only creates quality ByteBuilder when input Read.quality is non-null
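
The batch-oriented streaming described above can be approximated in Python with a lazy batching generator. This is only an analogy to ListNum and ConcurrentReadInputStream: the names batches and fused_length and the batch size of 200 are hypothetical choices for the sketch.

```python
from itertools import islice


def batches(records, batch_size=200):
    """Yield lists of up to `batch_size` records from any iterable, lazily."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def fused_length(records, pad=300, batch_size=200):
    """Track the running fused length while holding only one batch in memory."""
    length = 0
    seen = 0
    for batch in batches(records, batch_size):
        for seq in batch:
            if seen:
                length += pad  # padding only between sequences
            length += len(seq)
            seen += 1
    return length


reads = (s for s in ["ACGT", "GG", "T"])  # a generator stands in for a file stream
print(fused_length(reads, pad=3, batch_size=2))  # 13
```

Because the input is consumed one batch at a time, the running length can be compared against maxlen without loading the whole file.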

Quality Score Management

Quality handling varies by input format, implemented in processRead() and fusePair() methods:

Fastq Input: Preserves r.quality arrays via ByteBuilder.append(r.quality)
Fasta Input: Generates qualities using defaultQuality parameter (default 30) in a loop
Padding Regions: Inserts quality byte 0 for N-padding using quals.append((byte)0)
Mixed Quality: When r.quality==null, fills with defaultQuality via ByteBuilder.append() loop
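
The quality rules listed above can be summarized in a small Python sketch: real qualities are copied, missing qualities are filled with the default, and padding receives quality 0. The helper append_quals and its signature are hypothetical, and a plain list stands in for the quals ByteBuilder.

```python
def append_quals(quals, read_quals, read_len, default_quality=30, pad=0):
    """Extend `quals` in place: padding gets 0, missing qualities get the default."""
    quals.extend([0] * pad)
    if read_quals is None:
        quals.extend([default_quality] * read_len)
    else:
        quals.extend(read_quals)
    return quals


quals = []
append_quals(quals, [40, 38, 35], 3)  # fastq-style read with real qualities
append_quals(quals, None, 2, pad=2)   # fasta-style read without qualities
print(quals)  # [40, 38, 35, 0, 0, 30, 30]
```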

Performance Characteristics

Time Complexity: O(n) linear scan with ByteBuilder.append() operations
Memory Usage: Two ByteBuilder instances plus current ListNum batch (typically <1000 reads)
I/O Pattern: Streaming via ConcurrentReadInputStream/ConcurrentReadOutputStream with ListNum batching
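Scalability: The maxlen parameter keeps any single fused sequence below Shared.MAX_ARRAY_LEN, which is bounded by Java's array size limit (just under 2^31 elements); the 2G default stays within that bound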

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org