FungalRelease

Script: fungalrelease.sh Package: jgi Class: FungalRelease.java

Reformats a fungal assembly for release. Also creates contig and agp files.

Basic Usage

fungalrelease.sh in=<input file> out=<output file>

FungalRelease processes scaffold assemblies to prepare them for public release. It standardizes scaffold names, sorts scaffolds by length, breaks scaffolds into contigs at gap regions, and can write accompanying AGP and legend files for genome database submission.

Parameters

Parameters are organized into functional groups corresponding to input/output handling, processing options, and system configuration. All parameters from the shell script are documented below.

I/O parameters

in=<file>
Input scaffolds file in FASTA format. This is the primary scaffold assembly that will be reformatted for release.
out=<file>
Output scaffolds file. The reformatted scaffold assembly with standardized names and optional sorting applied.
outc=<file>
Output contigs file. Contains individual contigs extracted from scaffolds by breaking at gap regions (stretches of N bases).
qfin=<file>
Optional quality scores input file (a separate quality file that accompanies the FASTA input scaffolds).
qfout=<file>
Optional quality scores output file for the reformatted scaffolds.
qfoutc=<file>
Optional contig quality scores output file, containing quality scores for the extracted contigs.
agp=<file>
Output AGP (A Golden Path) file. Provides a detailed description of how contigs are assembled into scaffolds, including gap locations and sizes. Required for genome database submissions.
legend=<file>
Output name legend file. Maps original scaffold names to the new standardized names (e.g., "original_scaffold_1" → "scaffold_1").
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. Set to true to allow overwriting existing output files.

Processing parameters

fastawrap=60
Wrap length for FASTA output lines. Sequences will be broken into lines of this length for better readability and standard formatting.
tuc=t
Convert sequence to upper case. Ensures all bases (A, C, G, T, N) are in uppercase format for consistency.
baniupac=t
Ban IUPAC ambiguous bases and crash on encountering non-ACGTN base calls. When true, the program will terminate with an error if ambiguous nucleotide codes (R, Y, S, W, K, M, B, D, H, V) are found.
mingap=10
Expand all gaps (stretches of N bases) to be at least this long. Smaller gaps will be padded with additional N bases to reach this minimum length.
mingapin=1
Only expand gaps that are at least this long initially. Gaps shorter than this value will not be expanded.
sortscaffolds=t
Sort scaffolds in descending order by length. Longer scaffolds will appear first in the output file.
sortcontigs=f
Sort contigs in descending order by length. When enabled, longer contigs will appear first in the contig output file.
renamescaffolds=t
Rename scaffolds to standardized format 'scaffold_#' where # is an incremental number. Original names are preserved in the legend file if specified.
scafnum=1
Starting number for the first scaffold when renaming. Scaffolds will be numbered sequentially starting from this value.
renamecontigs=f
When true, rename contigs to 'contig_#' format instead of 'scaffold_name_c#'. Provides completely independent contig naming.
contignum=1
Starting number for the first contig when renamecontigs=t. Contigs will be numbered sequentially starting from this value.
minscaf=1
Minimum scaffold length threshold. Only retain scaffolds that are at least this many bases long.
mincontig=1
Minimum contig length threshold. Only retain contigs that are at least this many bases long after breaking scaffolds at gaps.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Assembly Formatting

fungalrelease.sh in=draft_assembly.fasta out=release_scaffolds.fasta outc=release_contigs.fasta

Basic reformatting of a fungal assembly with scaffold and contig output. Applies default processing: sorts scaffolds by length, renames to standard format, and extracts contigs.

Complete Release Package

fungalrelease.sh in=assembly.fasta out=scaffolds.fasta outc=contigs.fasta agp=assembly.agp legend=name_mapping.txt

Generates a complete release package including AGP file for database submission and legend file for name mapping.

Custom Gap Handling

fungalrelease.sh in=scaffolds.fasta out=clean_scaffolds.fasta mingap=50 mingapin=10

Expands gaps to minimum 50bp length, but only processes gaps that are initially at least 10bp long.

Length Filtering

fungalrelease.sh in=assembly.fasta out=filtered.fasta outc=filtered_contigs.fasta minscaf=1000 mincontig=500

Filters assembly to retain only scaffolds ≥1000bp and contigs ≥500bp, removing small fragments that may be artifacts.

Custom Contig Naming

fungalrelease.sh in=scaffolds.fasta outc=contigs.fasta renamecontigs=t contignum=1000

Uses independent contig naming starting from contig_1000, rather than scaffold-based names.

Algorithm Details

Assembly Processing Pipeline

FungalRelease implements a multi-stage pipeline for preparing fungal genome assemblies for public release. The stages are summarized in the subsections that follow.

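Sequence Validation and Normalization

With tuc=t, sequences are converted to upper case. With baniupac=t, any base call other than A, C, G, T, or N terminates the run with an error.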
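As an illustration of the validation and normalization step controlled by tuc and baniupac, here is a minimal Python sketch. It is not the tool's Java code; the function name normalize and its arguments are hypothetical, and the order of the two checks is an assumption.

```python
# Illustrative sketch only; FungalRelease itself is implemented in Java.
# The name `normalize` and its arguments are hypothetical.

ALLOWED = set("ACGTN")


def normalize(name, seq, to_upper=True, ban_iupac=True):
    """Upper-case the sequence if requested and reject non-ACGTN symbols."""
    if to_upper:
        seq = seq.upper()
    if ban_iupac:
        # Compare case-insensitively so lower-case acgtn is not rejected.
        bad = sorted(set(seq.upper()) - ALLOWED)
        if bad:
            raise ValueError(f"{name}: non-ACGTN symbols found: {bad}")
    return seq
```

Calling normalize("scaffold_1", "acgtn") returns the upper-cased sequence, while a sequence containing an ambiguity code such as R raises an error, mirroring the crash behavior described for baniupac=t.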

Gap Processing Strategy

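The tool uses the Read.inflateGaps() method for gap processing. Gaps (runs of N) that are at least mingapin bases long are padded with additional N bases until they are at least mingap bases long; shorter gaps are left unchanged.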
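The gap expansion rule described by mingap and mingapin can be sketched in a few lines of Python. This is an illustration of the documented behavior, not the Read.inflateGaps() source; the function name inflate_gaps is hypothetical, and the sketch assumes gaps are written as upper-case N.

```python
import re

# Illustrative sketch of the mingap/mingapin rule; not the Java implementation.
# Assumes gaps are runs of upper-case 'N'.


def inflate_gaps(seq, min_gap_in=1, min_gap=10):
    """Pad each run of N that is at least min_gap_in long up to min_gap."""

    def pad(match):
        run = match.group(0)
        if min_gap_in <= len(run) < min_gap:
            return "N" * min_gap
        return run  # too short to qualify, or already long enough

    return re.sub(r"N+", pad, seq)
```

With the defaults, a 2 bp gap becomes a 10 bp gap. With min_gap_in=3 the same 2 bp gap is left alone, which corresponds to the "Custom Gap Handling" example where only gaps of at least 10 bp are expanded.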

Sorting and Organization

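Sorting is optional and configured separately for scaffolds and contigs. By default, scaffolds are sorted in descending order by length and contigs are left unsorted; sortcontigs=t sorts contigs in descending order by length as well.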
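A minimal Python sketch of length filtering followed by a descending length sort is shown here. The function name filter_and_sort is hypothetical, records are assumed to be (name, sequence) tuples, and both the order of the two steps and the handling of equal-length ties are assumptions rather than documented behavior.

```python
# Illustrative sketch; the tie-breaking used by the real tool is not documented here.


def filter_and_sort(records, min_len=1, sort_desc=True):
    """Drop records shorter than min_len, then optionally sort longest first."""
    kept = [(name, seq) for name, seq in records if len(seq) >= min_len]
    if sort_desc:
        # Python's sort is stable, so equal-length records keep input order.
        kept.sort(key=lambda rec: len(rec[1]), reverse=True)
    return kept
```

The same function covers scaffolds (minscaf with the scaffold sort flag) and contigs (mincontig with sortcontigs) by passing different thresholds and flags.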

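Naming and Tracking

With renamescaffolds=t, scaffolds are renamed to 'scaffold_#', numbered from scafnum. Contigs are named 'scaffold_name_c#' by default, or 'contig_#' numbered from contignum when renamecontigs=t. If legend= is given, the legend file records the mapping from original scaffold names to new names.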
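The naming scheme can be sketched as follows. The function names rename_scaffolds and name_contigs are hypothetical, the legend is modeled as a list of (old name, new name) pairs rather than the tool's actual file layout, and starting the per-scaffold contig suffix at 1 is an assumption.

```python
# Illustrative sketch of scaffold/contig naming; not the Java implementation.


def rename_scaffolds(records, first_number=1):
    """Rename (name, seq) records to scaffold_# and return the name legend."""
    renamed, legend = [], []
    for offset, (old_name, seq) in enumerate(records):
        new_name = f"scaffold_{first_number + offset}"
        renamed.append((new_name, seq))
        legend.append((old_name, new_name))
    return renamed, legend


def name_contigs(scaffold_name, count, rename_contigs=False, next_number=1):
    """Name `count` contigs from one scaffold in either naming mode."""
    if rename_contigs:
        # Independent numbering, as with renamecontigs=t and contignum.
        return [f"contig_{next_number + i}" for i in range(count)]
    # Scaffold-based names; the suffix starting at 1 is an assumption.
    return [f"{scaffold_name}_c{i + 1}" for i in range(count)]
```

For example, name_contigs("scaffold_1", 2, True, 1000) yields contig_1000 and contig_1001, matching the "Custom Contig Naming" example.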

AGP File Generation

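The breakAtGaps() method splits each scaffold into contigs at gaps (runs of N) and generates the AGP records that describe how those contigs and gaps are laid out along the scaffold.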
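To make the relationship between contigs and AGP records concrete, here is a minimal Python sketch that splits one scaffold at runs of N and emits tab-delimited lines in the standard AGP column layout. It is not the breakAtGaps() source. The function name break_at_gaps is hypothetical, and the gap type ("scaffold"), linkage ("yes"), and linkage evidence ("paired-ends") values are assumptions about what a release might use. The sketch ignores mincontig and does not treat leading or trailing gaps specially.

```python
import re

# Illustrative sketch; gap type, linkage, and evidence values are assumptions.


def break_at_gaps(scaffold_name, seq):
    """Split a scaffold at runs of N; return (contigs, agp_lines)."""
    contigs, agp_lines = [], []
    part = 0
    for match in re.finditer(r"N+|[^N]+", seq):
        part += 1
        start, end = match.start() + 1, match.end()  # 1-based, inclusive
        length = end - start + 1
        if match.group(0).startswith("N"):
            # Gap line: type N, then length, gap type, linkage, evidence.
            fields = [scaffold_name, start, end, part, "N",
                      length, "scaffold", "yes", "paired-ends"]
        else:
            contig_name = f"{scaffold_name}_c{len(contigs) + 1}"
            contigs.append((contig_name, match.group(0)))
            # Component line: type W, then id, begin, end, orientation.
            fields = [scaffold_name, start, end, part, "W",
                      contig_name, 1, length, "+"]
        agp_lines.append("\t".join(str(f) for f in fields))
    return contigs, agp_lines
```

For a scaffold made of a 4 bp contig, a 10 bp gap, and a 2 bp contig, the sketch produces two contigs and three AGP lines covering scaffold positions 1-4, 5-14, and 15-16.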

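Memory and Performance

Java memory usage is autodetected by default and can be overridden with -Xmx, as described under Java Parameters.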

File Formats

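Input Requirements

The input (in=) is a scaffold assembly in FASTA format. A matching quality scores file can optionally be supplied with qfin=.

Output Formats

The scaffold (out=) and contig (outc=) outputs are sequence files whose FASTA lines are wrapped at fastawrap bases. Optional outputs are the AGP file (agp=), the name legend (legend=), and quality score files for scaffolds (qfout=) and contigs (qfoutc=).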
