CutGFF

Script: cutgff.sh Package: gff Class: CutGff.java

Cuts out features defined by a GFF file and writes them to a new fasta. Features are output in their sense strand, with filtering by attributes, length constraints, and quality thresholds. Supports batch processing of multiple files and optional ribosomal sequence alignment validation.

Basic Usage

cutgff.sh in=<fna file> gff=<gff file> out=<fna file>

The input GFF file is optional and will be automatically inferred from the fasta filename if not specified. This allows for batch processing of multiple files:

cutgff.sh types=rRNA out=16S.fa minlen=1440 maxlen=1620 attributes=16S bacteria/*.fna.gz

Parameters

Parameters are organized by their functional roles in the feature extraction and processing workflow.

File Parameters

in=<file>: Input FNA (fasta) file. Can be compressed with gzip.
gff=<file>: Input GFF file (optional). If not specified, will be automatically inferred by replacing the fasta file extension with .gff or .gff.gz.
out=<file>: Output FNA file containing the extracted features.

Feature Selection Parameters

types=CDS: Types of features to cut from the GFF file. Can specify multiple types separated by commas (e.g., CDS,rRNA,tRNA).
invert=false: Invert selection: rather than outputting the features, mask them with Ns in the original sequences. This creates a version of the input with features removed rather than extracted.
attributes=: A comma-delimited list of strings. If present, one of these strings must be found in the GFF line attributes column for the feature to be included. Useful for filtering by gene names, products, or other annotations.
bannedattributes=: A comma-delimited list of banned strings. Features containing any of these strings in their attributes will be excluded from output.
banpartial=t: Ignore lines with 'partial=true' in attributes. This excludes partial gene predictions that may be incomplete at sequence ends.

Length Filtering Parameters

minlen=1: Ignore features shorter than this length in base pairs. Useful for filtering out very short features that may be annotation artifacts.
maxlen=2147483647: Ignore features longer than this length in base pairs. Can be used to exclude unusually long features or limit memory usage.

Quality Filtering Parameters

maxns=-1: If non-negative, ignore features with more than this many undefined bases (Ns or IUPAC ambiguity symbols). Use -1 to disable this filter.
maxnfraction=-1.0: If non-negative, ignore features with more than this fraction of undefined bases (Ns or IUPAC symbols). Should be between 0.0 and 1.0. Use -1.0 to disable this filter.

Taxonomic Renaming Parameters

renamebytaxid=f

Rename sequences with their taxonomic ID. Input sequences must be named appropriately according to the specified taxmode format.

taxmode=accession

Valid modes for taxonomic ID parsing:

accession: Sequence names must start with an accession number
gi: Sequence names must start with gi|number format
taxid: Sequence names must start with tid|number format
header: Best effort parsing for various header formats

requirepresent=t

Crash if a taxonomic ID cannot be found for a sequence when renamebytaxid=true. Set to false to continue processing with sequences that lack taxonomic information.

Output Control Parameters

oneperfile=f: Only output one sequence per input file. When processing multiple files, this limits output to the first qualifying feature from each file.

Ribosomal Sequence Parameters

align=f: Align ribosomal sequences to consensus sequences (if available). Discard sequences with low identity to consensus, and flip sequences annotated on the wrong strand. Requires consensus sequences to be loaded.
adjustendpoints=f: When align=true, adjust the start and stop coordinates of ribosomal features based on alignment results to improve accuracy.
slop16s=999: Maximum allowed coordinate adjustment for 16S/SSU rRNA features when adjustendpoints=true. Larger values allow more aggressive coordinate correction.
slop23s=999: Maximum allowed coordinate adjustment for 23S/LSU rRNA features when adjustendpoints=true. Larger values allow more aggressive coordinate correction.
pickbest=f: When multiple ribosomal sequences are found per file and align=true, only output the one with the highest alignment identity to consensus.

Examples

Basic Feature Extraction

cutgff.sh in=genome.fna gff=genome.gff out=cds.fna types=CDS

Extracts all coding sequences (CDS features) from the genome using the GFF annotation file.

Extract 16S rRNA Genes

cutgff.sh in=genome.fna out=16S.fna types=rRNA attributes=16S minlen=1400 maxlen=1600

Extracts 16S rRNA genes with length filtering to ensure high-quality sequences. The GFF file is automatically inferred as genome.gff.

Batch Processing with Quality Control

cutgff.sh types=rRNA out=all_rRNA.fa minlen=100 maxnfraction=0.05 *.fna.gz

Processes all compressed fasta files in the current directory, extracting rRNA sequences longer than 100bp with less than 5% ambiguous bases.

Mask Features Instead of Extracting

cutgff.sh in=genome.fna out=masked.fna types=CDS invert=true

Creates a masked version of the genome where all CDS regions are replaced with Ns, leaving only non-coding regions intact.

Ribosomal Alignment Validation

cutgff.sh in=genome.fna out=validated_16S.fna types=rRNA attributes=16S align=true pickbest=true

Extracts 16S rRNA sequences with alignment validation against consensus sequences. Only outputs the best-aligned sequence per genome file.

Taxonomic ID Integration

cutgff.sh in=ncbi_genome.fna out=genes_with_taxid.fna types=CDS renamebytaxid=true taxmode=accession

Extracts CDS features and renames them with taxonomic IDs parsed from NCBI-format accession headers.

Algorithm Details

Feature Extraction Pipeline

CutGff implements a multi-stage feature extraction pipeline that processes GFF annotations against reference sequences:

1. File Processing Strategy

The tool uses dual processing modes based on workload: single-threaded processing for small jobs and multi-threaded processing when multiple files are processed with sufficient CPU cores available (>2 threads). The multi-threaded implementation uses an AtomicInteger counter system to distribute files across ProcessThread worker threads.

2. GFF Parsing and Validation

GFF lines are parsed with full attribute parsing enabled (GffLine.parseAttributes=true), allowing complex filtering based on annotation metadata. The tool validates feature boundaries against sequence lengths and applies multiple filtering criteria simultaneously:

Length filtering: Features outside minlen/maxlen bounds are excluded
Attribute filtering: Required attributes must be present; banned attributes cause exclusion via hasAttributes() method
Partial feature filtering: Features marked as partial can be excluded
Quality filtering: Features with excessive ambiguous bases are filtered out using countUndefined()

3. Sequence Extraction Methods

The core extraction algorithm operates in two modes:

Normal mode: Creates new Read objects containing feature sequences extracted from the reference using Arrays.copyOfRange()
Invert mode: Modifies the original sequences in-place by replacing feature regions with 'N' bytes

4. Strand Handling

Features are automatically oriented according to their sense strand. For features on the minus strand (strand=1 in GFF format), the extracted sequence is reverse-complemented using r.reverseComplement() to ensure proper orientation.

5. Ribosomal Sequence Validation

When align=true, the tool implements ribosomal sequence validation using ProkObject consensus sequences:

Consensus alignment: Extracted sequences are aligned against universal consensus sequences loaded from ProkObject.consensusReads()
Bidirectional testing: Sequences are tested in both orientations using Alignment objects to detect strand annotation errors
Identity thresholding: Only sequences meeting minimum identity thresholds (ID_MULT=0.96) are retained
Strand correction: Sequences with better reverse-complement alignment are automatically flipped
Coordinate refinement: When adjustendpoints=true, feature boundaries are refined based on alignment results with slop parameters

6. Taxonomic Integration

The taxonomic renaming system supports multiple header formats and integrates with BBTools' taxonomic databases:

Header parsing: Extracts identifiers using format-specific parsers (parseAccession(), parseGi(), parseTaxIds())
Batch lookup: Uses TaxClient.accessionToTaxidArray() for batch taxonomic ID resolution
Header reformatting: Converts sequences to tid|taxid|description format for downstream compatibility

7. Memory Management

The tool uses memory management strategies:

Streaming processing: Processes files individually rather than loading all data simultaneously
HashMap indexing: Uses HashMap<String, Read> for O(1) sequence lookup during feature extraction
Controlled threading: Thread count is limited by Tools.min(Shared.threads(), fnaList.size()) to prevent resource contention
ByteStreamWriter: Uses buffered output writing with ByteStreamWriter to minimize I/O overhead

8. Performance Characteristics

CutGff processing characteristics:

Scalability: Linear scaling with file count up to available CPU cores
Memory usage: Memory usage scales with individual file size, not total dataset size
I/O handling: Supports compressed input/output and uses FileFormat.testOutput() for format detection
Batch processing: Automatic GFF file inference enables directory-scale processing

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org