FungalRelease
Reformats a fungal assembly for release. Also creates contig and agp files.
Basic Usage
fungalrelease.sh in=<input file> out=<output file>
FungalRelease processes scaffold assemblies to prepare them for public release. It standardizes scaffold names, sorts scaffolds by length, breaks scaffolds into contigs at gap regions, and can write accompanying AGP and legend files for genome database submission.
Parameters
Parameters are organized into functional groups corresponding to input/output handling, processing options, and system configuration. All parameters from the shell script are documented below.
I/O parameters
- in=<file>
- Input scaffolds file in FASTA format. This is the primary scaffold assembly that will be reformatted for release.
- out=<file>
- Output scaffolds file. The reformatted scaffold assembly with standardized names and optional sorting applied.
- outc=<file>
- Output contigs file. Contains individual contigs extracted from scaffolds by breaking at gap regions (stretches of N bases).
- qfin=<file>
- Optional quality scores input file (a separate .qual-style file, not FASTQ), corresponding to the input scaffolds.
- qfout=<file>
- Optional quality scores output file for the reformatted scaffolds.
- qfoutc=<file>
- Optional contig quality scores output file, containing quality scores for the extracted contigs.
- agp=<file>
- Output AGP (A Golden Path) file. Provides a detailed description of how contigs are assembled into scaffolds, including gap locations and sizes. Required for genome database submissions.
- legend=<file>
- Output name legend file. Maps original scaffold names to the new standardized names (e.g., "original_scaffold_1" → "scaffold_1").
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Set to true to allow overwriting existing output files.
Processing parameters
- fastawrap=60
- Wrap length for FASTA output lines. Sequences will be broken into lines of this length for better readability and standard formatting.
- tuc=t
- Convert sequence to upper case. Ensures all bases (A, C, G, T, N) are in uppercase format for consistency.
- baniupac=t
- Ban IUPAC ambiguous bases and crash on encountering non-ACGTN base calls. When true, the program will terminate with an error if ambiguous nucleotide codes (R, Y, S, W, K, M, B, D, H, V) are found.
- mingap=10
- Expand all gaps (stretches of N bases) to be at least this long. Smaller gaps will be padded with additional N bases to reach this minimum length.
- mingapin=1
- Only expand gaps that are at least this long initially. Gaps shorter than this value will not be expanded.
- sortscaffolds=t
- Sort scaffolds in descending order by length. Longer scaffolds will appear first in the output file.
- sortcontigs=f
- Sort contigs in descending order by length. When enabled, longer contigs will appear first in the contig output file.
- renamescaffolds=t
- Rename scaffolds to standardized format 'scaffold_#' where # is an incremental number. Original names are preserved in the legend file if specified.
- scafnum=1
- Starting number for the first scaffold when renaming. Scaffolds will be numbered sequentially starting from this value.
- renamecontigs=f
- When true, rename contigs to 'contig_#' format instead of 'scaffold_name_c#'. Provides completely independent contig naming.
- contignum=1
- Starting number for the first contig when renamecontigs=t. Contigs will be numbered sequentially starting from this value.
- minscaf=1
- Minimum scaffold length threshold. Only retain scaffolds that are at least this many bases long.
- mincontig=1
- Minimum contig length threshold. Only retain contigs that are at least this many bases long after breaking scaffolds at gaps.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Assembly Formatting
fungalrelease.sh in=draft_assembly.fasta out=release_scaffolds.fasta outc=release_contigs.fasta
Basic reformatting of a fungal assembly with scaffold and contig output. Applies default processing: sorts scaffolds by length, renames to standard format, and extracts contigs.
Complete Release Package
fungalrelease.sh in=assembly.fasta out=scaffolds.fasta outc=contigs.fasta agp=assembly.agp legend=name_mapping.txt
Generates a complete release package including AGP file for database submission and legend file for name mapping.
Custom Gap Handling
fungalrelease.sh in=scaffolds.fasta out=clean_scaffolds.fasta mingap=50 mingapin=10
Expands gaps to minimum 50bp length, but only processes gaps that are initially at least 10bp long.
Length Filtering
fungalrelease.sh in=assembly.fasta out=filtered.fasta minscaf=1000 mincontig=500
Filters assembly to retain only scaffolds ≥1000bp and contigs ≥500bp, removing small fragments that may be artifacts.
Custom Contig Naming
fungalrelease.sh in=scaffolds.fasta outc=contigs.fasta renamecontigs=t contignum=1000
Uses independent contig naming starting from contig_1000, rather than scaffold-based names.
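The two contig naming schemes can be compared with a small Python sketch. It is illustrative rather than the tool's own code: the function `contig_names` is hypothetical, and numbering scaffold-based names from 1 within each scaffold is an assumption.

```python
def contig_names(scaffold_name, n_contigs, rename=False, next_number=1):
    """Return names for the contigs of one scaffold.

    rename=False -> '<scaffold_name>_c#', numbered from 1 within the scaffold
                    (the per-scaffold start at 1 is an assumption).
    rename=True  -> 'contig_#', numbered from next_number (cf. contignum).
    """
    if rename:
        return [f"contig_{next_number + i}" for i in range(n_contigs)]
    return [f"{scaffold_name}_c{i + 1}" for i in range(n_contigs)]
```

With `rename=True`, the caller would advance `next_number` by the number of contigs after each scaffold so that numbering continues across the assembly.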
Algorithm Details
Assembly Processing Pipeline
FungalRelease implements a multi-stage pipeline for preparing fungal genome assemblies for public release:
Sequence Validation and Normalization
- IUPAC Base Checking: Scans sequences for ambiguous nucleotide codes and optionally terminates processing if found, ensuring clean ACGTN-only sequences
- Case Normalization: Converts all bases to uppercase for standard formatting
- Gap Standardization: Uses the inflateGaps() method to ensure all gap regions (N stretches) meet minimum length requirements
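The validation and case-normalization steps above can be illustrated with a minimal Python sketch. This is not the tool's Java code; the helper `normalize_sequence` and its arguments are hypothetical names chosen to mirror the `tuc` and `baniupac` parameters.

```python
ALLOWED = set("ACGTN")

def normalize_sequence(seq, to_upper=True, ban_iupac=True):
    """Uppercase a sequence and optionally reject non-ACGTN symbols.

    Hypothetical helper: to_upper mirrors tuc, ban_iupac mirrors baniupac.
    """
    if to_upper:
        seq = seq.upper()
    if ban_iupac:
        for i, base in enumerate(seq):
            # Compare case-insensitively so lowercase acgtn passes when to_upper is off.
            if base.upper() not in ALLOWED:
                raise ValueError(f"non-ACGTN symbol {base!r} at position {i + 1}")
    return seq
```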
Gap Processing Strategy
The tool uses the Read.inflateGaps() method for gap processing:
- Selective Expansion: Only gaps that are at least mingapin bases long are expanded
- Minimum Gap Size: Expanded gaps are padded to at least mingap bases
- Contig Boundary Definition: Gaps serve as natural breaking points for scaffold-to-contig conversion
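The gap-expansion rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Read.inflateGaps() implementation: the function `inflate_gaps` is hypothetical, `min_gap_in` and `min_gap_out` stand in for `mingapin` and `mingap`, and the sequence is assumed to be uppercase already.

```python
import re

def inflate_gaps(seq, min_gap_in=1, min_gap_out=10):
    """Pad runs of N so each qualifying gap has at least min_gap_out bases.

    A run is padded only if min_gap_in <= length < min_gap_out;
    shorter and longer runs are returned unchanged.
    """
    def pad(match):
        run = match.group(0)
        if min_gap_in <= len(run) < min_gap_out:
            return "N" * min_gap_out
        return run
    return re.sub(r"N+", pad, seq)
```

Under these assumptions, `inflate_gaps(seq, min_gap_in=10, min_gap_out=50)` would leave a 5-base gap untouched and expand a 12-base gap to 50 bases, matching the Custom Gap Handling example.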
Sorting and Organization
The pipeline includes flexible sorting capabilities:
- Length-Based Sorting: Uses ReadLengthComparator for descending length order
- Independent Control: Scaffolds and contigs can be sorted independently
- In-Memory Sorting: Sorting is performed in memory using ArrayList structures
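Descending length order can be sketched in Python as below. The function `sort_by_length_desc` is a hypothetical stand-in for sorting with ReadLengthComparator, and records are assumed to be `(name, sequence)` tuples.

```python
def sort_by_length_desc(records):
    """Return (name, sequence) records ordered longest first.

    sorted() is stable, so records of equal length keep their input order;
    the tool's own tie-breaking rule may differ.
    """
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)
```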
Naming and Tracking
- Standardized Nomenclature: Generates scaffold_# and contig_# names following genome database conventions
- Name Mapping: Maintains bidirectional mapping between original and new names
- Incremental Numbering: Sequential numbering with user-defined starting values
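A minimal Python sketch of renaming with a legend is shown below. It assumes `(name, sequence)` records that are already in output order; `rename_scaffolds` and `first_number` are hypothetical names, with `first_number` playing the role of `scafnum`.

```python
def rename_scaffolds(records, first_number=1):
    """Rename (name, sequence) records to scaffold_# and build a legend.

    Returns (renamed_records, legend), where legend is a list of
    (original_name, new_name) pairs in output order.
    """
    renamed, legend = [], []
    for offset, (old_name, seq) in enumerate(records):
        new_name = f"scaffold_{first_number + offset}"
        renamed.append((new_name, seq))
        legend.append((old_name, new_name))
    return renamed, legend
```

Each legend pair could then be written as one tab-delimited line, for example `"\t".join(pair)`.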
AGP File Generation
The breakAtGaps() method generates detailed AGP records:
- Component Tracking: Records each contig's position within scaffolds
- Gap Documentation: Specifies gap types, lengths, and bridging evidence
- Coordinate Mapping: Maintains precise start/end coordinates for all elements
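To make the coordinate bookkeeping concrete, here is a hedged Python sketch of splitting one scaffold at gaps and emitting AGP-style rows. It is not the breakAtGaps() implementation. The column order follows the AGP 2.0 layout, but the gap type `scaffold`, linkage `yes`, evidence `paired-ends`, and the `<scaffold>_c#` contig names are placeholder assumptions, and the sequence is assumed to be uppercase.

```python
import re

def break_at_gaps(scaffold_name, seq):
    """Split a scaffold at runs of N; return (contigs, agp_lines)."""
    contigs, agp_lines = [], []
    part = 0
    # Walk alternating runs: N runs are gaps, everything else is a contig.
    for match in re.finditer(r"N+|[^N]+", seq):
        part += 1
        start, end = match.start() + 1, match.end()  # 1-based, inclusive
        text = match.group(0)
        if text[0] == "N":
            # Gap row: length, type, linkage, evidence (placeholder values).
            fields = [scaffold_name, start, end, part, "N",
                      len(text), "scaffold", "yes", "paired-ends"]
        else:
            contig_name = f"{scaffold_name}_c{len(contigs) + 1}"
            contigs.append((contig_name, text))
            # Component row: contig id, its own start/end, orientation.
            fields = [scaffold_name, start, end, part, "W",
                      contig_name, 1, len(text), "+"]
        agp_lines.append("\t".join(str(f) for f in fields))
    return contigs, agp_lines
```

A scaffold that starts or ends with N would produce a leading or trailing gap row in this sketch; a real release pipeline would need to decide how to handle that case.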
Memory and Performance
- Stream Processing: Uses ConcurrentReadInputStream for efficient file handling
- Buffered Output: Employs buffered streams (default buffer size 4) for optimal I/O performance
- Default Memory: Configured with a 4 GB heap (-Xmx4g), suitable for typical fungal genome sizes (10-50 Mbp)
- Scalable Design: Can handle assemblies from small fungi to larger genomes with appropriate memory allocation
File Formats
Input Requirements
- Scaffolds: FASTA format with scaffold sequences
- Quality Scores: Optional separate quality file (.qual style) with sequence identifiers matching the scaffolds
Output Formats
- Scaffolds/Contigs: FASTA format with 60-character line wrapping
- AGP: Tab-delimited format following AGP 2.0 specification
- Legend: Tab-delimited mapping: original_name → new_name
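The FASTA line wrapping can be sketched in Python as follows. `format_fasta` is a hypothetical helper, `wrap` mirrors the `fastawrap` parameter, and `wrap` is assumed to be a positive integer.

```python
def format_fasta(name, seq, wrap=60):
    """Return one FASTA record with the sequence wrapped to `wrap` columns."""
    lines = [f">{name}"]
    # Slice the sequence into consecutive chunks of at most `wrap` bases.
    lines.extend(seq[i:i + wrap] for i in range(0, len(seq), wrap))
    return "\n".join(lines) + "\n"
```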
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org