Fuse

Script: fuse.sh
Package: synth
Class: FuseSequence.java

Fuses sequences together, padding gaps with Ns. Supports two modes: concatenating all sequences into a single long sequence (default), or fusing paired-end reads together with N-padding.

Basic Usage

fuse.sh in=<input file> out=<output file> pad=<number of Ns>

The default behavior concatenates all input sequences into a single long sequence, separated by N-padding. Use fusepairs=t to instead fuse paired-end reads together.

Parameters

Parameters control input/output, padding behavior, sequence length limits, and output formatting.

Input/Output Parameters

in=<file>
The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' reads from standard input. Accepts FASTA or FASTQ input; gzipped files are supported.
out=<file>
The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' writes to standard output. Output format matches input format.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true.
ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Only applies to gzipped output.

Fusion Parameters

pad=300
Pad this many N characters between sequences. For amino acid sequences, uses X instead of N. Default: 300.
fusepairs=f
Default mode fuses all sequences into one long sequence. Setting fusepairs=t will instead fuse each pair together, with the second read reverse-complemented. Default: false.
maxlen=2g
If positive, fused sequences will not be made longer than this. When the limit is reached, a new output sequence is started. Supports K/M/G suffixes. Default: 2G (2 billion bases).
padsymbol=N
Character to use for padding between sequences. Default is N for nucleotide sequences, X for amino acid sequences. (Java parameter only)

Sequence Naming Parameters

name=
Set name of output sequence. Default is the name of the first input sequence. When multiple output sequences are created (due to maxlen), numbers are appended.
addnumber=f
Always add sequential numbers to output sequence names, even for the first sequence. (Java parameter only)

Quality Parameters

quality=30
Fake quality scores to use when generating fastq from fasta input, or when input sequences lack quality scores. Quality 30 corresponds to 99.9% accuracy. Default: 30.
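The 99.9% figure follows from the standard Phred definition, Q = -10 * log10(p), where p is the probability that a base call is wrong. The short Python sketch below shows the conversion; the function name is illustrative only and is not part of fuse.sh.

```python
def phred_to_accuracy(q: float) -> float:
    """Return the base-call accuracy implied by a Phred quality score."""
    error_probability = 10 ** (-q / 10)  # Q = -10 * log10(p), solved for p
    return 1.0 - error_probability


if __name__ == "__main__":
    # Print the accuracy for a few common quality values.
    for q in (10, 20, 30, 40):
        print(q, f"{phred_to_accuracy(q):.4%}")
```

With this definition, every increase of 10 in the quality value reduces the error probability by a factor of 10.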

Advanced Parameters

amino=f
Process amino acid sequences instead of nucleotide sequences. Changes default padding symbol from N to X. (Java parameter only)

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Sequence Fusion

fuse.sh in=sequences.fasta out=fused.fasta pad=500

Concatenates all sequences from sequences.fasta into a single sequence, with 500 Ns between consecutive original sequences.
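To make the effect of pad concrete, here is a minimal Python sketch of the concatenation described above. It operates on plain strings and is not the FuseSequence implementation; the function name fuse_all and its arguments are hypothetical, and it assumes padding is placed only between sequences, not at the ends.

```python
def fuse_all(seqs, pad=300, pad_symbol="N"):
    """Join sequences into one string, with `pad` copies of pad_symbol
    between consecutive sequences (assumed: no padding at either end)."""
    return (pad_symbol * pad).join(seqs)


if __name__ == "__main__":
    # Three toy sequences with a pad of 3 for readability.
    print(fuse_all(["ACGT", "GG", "TTA"], pad=3))
```

Under that assumption, the fused length is the sum of the input lengths plus pad times the number of sequences minus one.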

Fuse Paired-End Reads

fuse.sh in=reads.fastq out=fused_pairs.fastq fusepairs=t pad=100

Fuses each read pair into a single sequence, with 100 Ns between the two reads of the pair. The second read in each pair is reverse-complemented before fusion.
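The pair-fusion step can be sketched in Python as follows. This is an illustration on plain strings under stated assumptions, not the tool's code: the helper names are hypothetical, only the bases A, C, G, T, and N are handled, and quality scores (including whatever quality the padding receives) are left out.

```python
# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def fuse_pair(read1, read2, pad=100, pad_symbol="N"):
    """Fuse one read pair: read1, then padding, then read2 reverse-complemented."""
    return read1 + pad_symbol * pad + reverse_complement(read2)


if __name__ == "__main__":
    # Toy pair with a pad of 2 for readability.
    print(fuse_pair("ACGT", "AACG", pad=2))
```

Reverse-complementing the second read puts both mates on the same strand, so the fused sequence reads left to right across the padded gap.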

Limited Length Output

fuse.sh in=contigs.fasta out=chunks.fasta maxlen=1000000 pad=200

Creates fused sequences up to 1 million bases each. When the length limit is reached, starts a new output sequence.
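One way to picture the length limit is the Python sketch below. The exact rule fuse.sh uses to decide when to start a new sequence is not stated here, so the sketch assumes a new output is started whenever adding the padding plus the next sequence would exceed maxlen, and that a single input longer than maxlen is emitted unchanged. The function name fuse_with_limit is hypothetical.

```python
def fuse_with_limit(seqs, pad=200, maxlen=1_000_000, pad_symbol="N"):
    """Concatenate sequences with padding, starting a new output whenever
    the next sequence (plus padding) would push the total past maxlen.
    A non-positive maxlen disables the limit."""
    padding = pad_symbol * pad
    outputs = []
    current = []      # sequences collected for the output being built
    current_len = 0   # fused length of `current`, padding included
    for s in seqs:
        extra = len(s) + (pad if current else 0)
        if maxlen > 0 and current and current_len + extra > maxlen:
            # Assumed rule: flush what we have and start a new output.
            outputs.append(padding.join(current))
            current, current_len = [], 0
            extra = len(s)
        current.append(s)
        current_len += extra
    if current:
        outputs.append(padding.join(current))
    return outputs


if __name__ == "__main__":
    # Toy input: limit of 10 bases, pad of 2.
    print(fuse_with_limit(["AAAA", "CCCC", "GGGG"], pad=2, maxlen=10))
```

In the toy run, the first two sequences fit exactly within the limit, and the third starts a second output sequence.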

Custom Sequence Naming

fuse.sh in=input.fasta out=output.fasta name=scaffold pad=300

Names the output sequence "scaffold". If more than one output sequence is created (due to maxlen), they will be named "scaffold 1", "scaffold 2", etc.

Amino Acid Sequences

fuse.sh in=proteins.fasta out=fused_proteins.fasta amino=t pad=50

Fuses amino acid sequences using X as the padding character instead of N.

Algorithm Details

Fusion Strategy

FuseSequence implements two distinct fusion modes controlled by the fusePairs boolean parameter in the processReadPair() method:

Default Mode (fusepairs=false)

In concatenation mode, all input sequences are appended to a single ByteBuilder buffer.

Pair Fusion Mode (fusepairs=true)

In paired mode, the fusePair() method creates fused reads using direct array manipulation.

Memory Management

FuseSequence uses ListNum-based streaming with controlled memory allocation.

Quality Score Management

Quality handling varies by input format and is implemented in the processRead() and fusePair() methods.

Performance Characteristics
