CutGFF

Script: cutgff.sh Package: gff Class: CutGff.java

Cuts out features defined by a GFF file and writes them to a new fasta. Features are output on their sense strand, with filtering by attributes, length constraints, and quality thresholds. Supports batch processing of multiple files and optional validation of ribosomal sequences by alignment to consensus sequences.

Basic Usage

cutgff.sh in=<fna file> gff=<gff file> out=<fna file>

The input GFF file is optional and will be automatically inferred from the fasta filename if not specified. This allows for batch processing of multiple files:

cutgff.sh types=rRNA out=16S.fa minlen=1440 maxlen=1620 attributes=16S bacteria/*.fna.gz

Parameters

Parameters are organized by their functional roles in the feature extraction and processing workflow.

File Parameters

in=<file>
Input FNA (fasta) file. Can be compressed with gzip.
gff=<file>
Input GFF file (optional). If not specified, will be automatically inferred by replacing the fasta file extension with .gff or .gff.gz.
out=<file>
Output FNA file containing the extracted features.
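
The inference rule for gff= can be pictured with a short Python sketch. The function name candidate_gff_paths and the list of recognized fasta extensions are assumptions made for illustration; the actual extension handling in CutGff.java may differ.

```python
# Hypothetical helper for illustration; not part of BBTools.
FASTA_EXTENSIONS = (".fna", ".fa", ".fasta")  # assumed list of fasta extensions

def candidate_gff_paths(fasta_path):
    """Return the GFF paths that would be tried for a given fasta path."""
    base = fasta_path
    if base.endswith(".gz"):
        base = base[:-3]  # drop the compression suffix first
    for ext in FASTA_EXTENSIONS:
        if base.endswith(ext):
            base = base[: -len(ext)]  # drop the fasta extension
            break
    return [base + ".gff", base + ".gff.gz"]
```

For example, candidate_gff_paths("bacteria/ecoli.fna.gz") yields bacteria/ecoli.gff and bacteria/ecoli.gff.gz.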

Feature Selection Parameters

types=CDS
Types of features to cut from the GFF file. Can specify multiple types separated by commas (e.g., CDS,rRNA,tRNA).
invert=false
Invert selection: rather than outputting the features, mask them with Ns in the original sequences. This produces a copy of the input in which the selected features are masked out rather than extracted.
attributes=
A comma-delimited list of strings. If present, at least one of these strings must be found in the attributes column of the GFF line for the feature to be included. Useful for filtering by gene names, products, or other annotations.
bannedattributes=
A comma-delimited list of banned strings. Features containing any of these strings in their attributes will be excluded from output.
banpartial=t
Ignore lines with 'partial=true' in attributes. This excludes partial gene predictions that may be incomplete at sequence ends.
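
How attributes, bannedattributes, and banpartial combine can be sketched in Python as follows. The function name is hypothetical, and plain case-sensitive substring matching is an assumption; the real matching rules in CutGff.java may differ.

```python
# Hypothetical sketch; assumes case-sensitive substring matching.
def passes_attribute_filters(attributes_column, required=(), banned=(), ban_partial=True):
    """Decide whether a GFF attributes column passes the three attribute filters."""
    # banpartial: drop features flagged as partial.
    if ban_partial and "partial=true" in attributes_column:
        return False
    # bannedattributes: any banned string excludes the feature.
    if any(term in attributes_column for term in banned):
        return False
    # attributes: if a list is given, at least one string must be present.
    if required and not any(term in attributes_column for term in required):
        return False
    return True
```

With the attributes column "ID=rna1;product=16S ribosomal RNA", required=["16S"] keeps the feature and required=["23S"] drops it.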

Length Filtering Parameters

minlen=1
Ignore features shorter than this length in base pairs. Useful for filtering out very short features that may be annotation artifacts.
maxlen=2147483647
Ignore features longer than this length in base pairs. Can be used to exclude unusually long features or limit memory usage.
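
Because GFF coordinates are 1-based and inclusive, a feature's length is end - start + 1. The Python sketch below applies minlen and maxlen on that basis; the function names are hypothetical, and treating both limits as inclusive follows from the parameter descriptions above.

```python
INT_MAX = 2147483647  # default maxlen from the parameter list

def feature_length(start, end):
    """Length of a feature given 1-based inclusive GFF coordinates."""
    return end - start + 1

def passes_length_filter(start, end, minlen=1, maxlen=INT_MAX):
    """Keep features whose length lies within [minlen, maxlen]."""
    return minlen <= feature_length(start, end) <= maxlen
```

A feature spanning 101 to 1640 is 1540 bp long, so it passes minlen=1440 maxlen=1620.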

Quality Filtering Parameters

maxns=-1
If non-negative, ignore features with more than this many undefined bases (Ns or IUPAC ambiguity symbols). Use -1 to disable this filter.
maxnfraction=-1.0
If non-negative, ignore features with more than this fraction of undefined bases (Ns or IUPAC symbols). Should be between 0.0 and 1.0. Use -1.0 to disable this filter.
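
The two undefined-base filters can be sketched in Python like this. The function name is hypothetical, and counting every symbol other than A, C, G, or T as undefined is an assumption based on the descriptions above.

```python
DEFINED_BASES = set("ACGTacgt")  # assumption: everything else counts as undefined

def passes_undefined_filter(seq, maxns=-1, maxnfraction=-1.0):
    """Apply maxns and maxnfraction; negative values disable each filter."""
    undefined = sum(1 for base in seq if base not in DEFINED_BASES)
    if maxns >= 0 and undefined > maxns:
        return False
    if maxnfraction >= 0 and len(seq) > 0 and undefined / len(seq) > maxnfraction:
        return False
    return True
```

A 10-base feature with two Ns passes maxns=2 but fails maxns=1 and maxnfraction=0.05.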

Taxonomic Renaming Parameters

renamebytaxid=f
Rename sequences with their taxonomic ID. Input sequences must be named appropriately according to the specified taxmode format.
taxmode=accession
Valid modes for taxonomic ID parsing:
  • accession: Sequence names must start with an accession number
  • gi: Sequence names must start with gi|number format
  • taxid: Sequence names must start with tid|number format
  • header: Best effort parsing for various header formats
requirepresent=t
Crash if a taxonomic ID cannot be found for a sequence when renamebytaxid=true. Set to false to continue processing with sequences that lack taxonomic information.
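
For the gi and taxid modes, the leading number can be read directly from the sequence name. The Python sketch below is illustrative only: the function names are hypothetical, the assumption that the number is followed by another "|" is ours, and the accession and header modes (which need a lookup or more flexible parsing) are not covered. In gi mode the parsed value is a GI number that would still need to be mapped to a taxonomic ID.

```python
# Hypothetical sketch of prefix parsing for taxmode=gi and taxmode=taxid.
def parse_prefixed_number(header, prefix):
    """Return the integer after 'prefix|' at the start of a header, or None."""
    lead = prefix + "|"
    if not header.startswith(lead):
        return None
    digits = header[len(lead):].split("|", 1)[0]
    if digits.isascii() and digits.isdigit():
        return int(digits)
    return None

def taxid_or_fail(header, require_present=True):
    """Mimic requirepresent: raise if no tid|number prefix is found."""
    taxid = parse_prefixed_number(header, "tid")
    if taxid is None and require_present:
        raise ValueError("No taxonomic ID found in header: " + header)
    return taxid
```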

Output Control Parameters

oneperfile=f
Only output one sequence per input file. When processing multiple files, this limits output to the first qualifying feature from each file.

Ribosomal Sequence Parameters

align=f
Align ribosomal sequences to consensus sequences (if available). Discard sequences with low identity to consensus, and flip sequences annotated on the wrong strand. Requires consensus sequences to be loaded.
adjustendpoints=f
When align=true, adjust the start and stop coordinates of ribosomal features based on alignment results to improve accuracy.
slop16s=999
Maximum allowed coordinate adjustment for 16S/SSU rRNA features when adjustendpoints=true. Larger values allow more aggressive coordinate correction.
slop23s=999
Maximum allowed coordinate adjustment for 23S/LSU rRNA features when adjustendpoints=true. Larger values allow more aggressive coordinate correction.
pickbest=f
When multiple ribosomal sequences are found per file and align=true, only output the one with the highest alignment identity to consensus.

Examples

Basic Feature Extraction

cutgff.sh in=genome.fna gff=genome.gff out=cds.fna types=CDS

Extracts all coding sequences (CDS features) from the genome using the GFF annotation file.

Extract 16S rRNA Genes

cutgff.sh in=genome.fna out=16S.fna types=rRNA attributes=16S minlen=1400 maxlen=1600

Extracts 16S rRNA genes, using length limits to exclude truncated or overlong annotations. Because gff= is not given, the GFF file is inferred from the fasta filename (genome.gff or genome.gff.gz).

Batch Processing with Quality Control

cutgff.sh types=rRNA out=all_rRNA.fa minlen=100 maxnfraction=0.05 *.fna.gz

Processes all compressed fasta files in the current directory, extracting rRNA sequences at least 100 bp long with no more than 5% undefined bases.

Mask Features Instead of Extracting

cutgff.sh in=genome.fna out=masked.fna types=CDS invert=true

Creates a masked version of the genome where all CDS regions are replaced with Ns, leaving only non-coding regions intact.

Ribosomal Alignment Validation

cutgff.sh in=genome.fna out=validated_16S.fna types=rRNA attributes=16S align=true pickbest=true

Extracts 16S rRNA sequences with alignment validation against consensus sequences. Only outputs the best-aligned sequence per genome file.

Taxonomic ID Integration

cutgff.sh in=ncbi_genome.fna out=genes_with_taxid.fna types=CDS renamebytaxid=true taxmode=accession

Extracts CDS features and renames them with taxonomic IDs parsed from NCBI-format accession headers.

Algorithm Details

Feature Extraction Pipeline

CutGff implements a multi-stage feature extraction pipeline that processes GFF annotations against reference sequences:

1. File Processing Strategy

The tool uses dual processing modes based on workload: single-threaded processing for small jobs and multi-threaded processing when multiple files are processed with sufficient CPU cores available (>2 threads). The multi-threaded implementation uses an AtomicInteger counter system to distribute files across ProcessThread worker threads.
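
The counter-based distribution can be illustrated with a small Python sketch. Python has no AtomicInteger, so a lock-protected counter stands in for it; the class and function names are hypothetical, and the threshold logic simply mirrors the rule stated above.

```python
import threading

class SharedCounter:
    """Lock-protected counter standing in for Java's AtomicInteger."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def get_and_increment(self):
        with self._lock:
            current = self._value
            self._value += 1
            return current

def process_files(files, handle_file, threads=4):
    """Run handle_file on every file; workers claim indices from one counter."""
    # Single-threaded path for one file or too few threads.
    if len(files) < 2 or threads <= 2:
        return [handle_file(f) for f in files]
    results = [None] * len(files)
    counter = SharedCounter()

    def worker():
        while True:
            index = counter.get_and_increment()  # claim the next unprocessed file
            if index >= len(files):
                return
            results[index] = handle_file(files[index])

    pool = [threading.Thread(target=worker) for _ in range(min(threads, len(files)))]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return results
```

Each worker keeps claiming the next index until the list is exhausted, so fast workers naturally take on more files than slow ones.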

2. GFF Parsing and Validation

GFF lines are parsed with full attribute parsing enabled (GffLine.parseAttributes=true), allowing complex filtering based on annotation metadata. The tool validates feature boundaries against sequence lengths and applies the filters described under Parameters together: feature type, required and banned attribute strings, the partial flag, length limits, and undefined-base limits.
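
As a rough picture of this step, the Python sketch below parses one GFF3 line (nine tab-separated columns, with key=value attributes separated by semicolons) and checks its coordinates against a sequence length. It is not the GffLine implementation; the function names and the returned dictionary layout are hypothetical.

```python
# Hypothetical sketch of GFF3 line parsing and boundary validation.
def parse_gff_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments or blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("Expected 9 tab-separated columns, got %d" % len(cols))
    attributes = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return {
        "seqid": cols[0],
        "type": cols[2],
        "start": int(cols[3]),  # 1-based, inclusive
        "end": int(cols[4]),    # 1-based, inclusive
        "strand": cols[6],
        "attributes": attributes,
    }

def within_sequence(feature, sequence_length):
    """Check that the feature coordinates fit inside the reference sequence."""
    return 1 <= feature["start"] <= feature["end"] <= sequence_length
```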

3. Sequence Extraction Methods

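The core extraction algorithm operates in two modes: extraction (the default), which writes each selected feature as its own sequence, and masking (invert=true), which writes the original sequences with the selected features replaced by Ns.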
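
A minimal Python sketch of the two behaviors, extraction and invert=true masking, assuming 1-based inclusive GFF coordinates. The function names are hypothetical, and strand handling is left out here.

```python
# Hypothetical sketch; intervals are (start, end) pairs in GFF coordinates.
def extract_features(sequence, intervals):
    """Default mode: return each feature as its own forward-strand subsequence."""
    return [sequence[start - 1:end] for start, end in intervals]

def mask_features(sequence, intervals):
    """invert=true mode: return the full sequence with features replaced by N."""
    bases = list(sequence)
    for start, end in intervals:
        for i in range(start - 1, end):
            bases[i] = "N"
    return "".join(bases)
```

For the sequence AACCGGTTAA and a feature at 3..6, extraction returns CCGG and masking returns AANNNNTTAA.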

4. Strand Handling

Features are automatically oriented according to their sense strand. For features on the minus strand ('-' in the GFF strand column, stored internally as strand=1), the extracted sequence is reverse-complemented using r.reverseComplement() to ensure proper orientation.
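
The effect of strand handling can be shown with a short Python sketch. This is an illustration rather than the BBTools implementation: the function names are hypothetical, and the complement table covers the standard IUPAC symbols (S, W, and N are their own complements and are left unchanged).

```python
# IUPAC-aware complement table for upper- and lower-case symbols.
_COMPLEMENT = str.maketrans("ACGTRYKMBVDHacgtrykmbvdh",
                            "TGCAYRMKVBHDtgcayrmkvbhd")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def extract_sense(sequence, start, end, strand):
    """Cut a feature (1-based inclusive) and orient it on its sense strand."""
    sub = sequence[start - 1:end]
    return reverse_complement(sub) if strand == "-" else sub
```

For the sequence AACCGGTTAA, a plus-strand feature at 1..4 is written as AACC, while the same coordinates on the minus strand are written as GGTT.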

5. Ribosomal Sequence Validation

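When align=true, the tool validates ribosomal sequences against ProkObject consensus sequences: sequences with low identity to the consensus are discarded, and sequences annotated on the wrong strand are flipped. With adjustendpoints=true, feature coordinates are adjusted based on the alignment, within the slop16s and slop23s limits. With pickbest=true, only the sequence with the highest identity per file is output.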
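
The validation logic described for align and pickbest can be sketched in Python under heavy simplification. A position-wise match fraction stands in for a real alignment, the min_identity threshold is a made-up value, and all function names are hypothetical; none of this reflects the actual aligner or cutoffs used by BBTools.

```python
_COMP = str.maketrans("ACGT", "TGCA")

def toy_identity(a, b):
    """Fraction of matching positions; a stand-in for a real aligner."""
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def validate(seq, consensus, min_identity=0.5):
    """Return (oriented_sequence, identity), or None if identity is too low."""
    forward = toy_identity(seq, consensus)
    flipped = seq.translate(_COMP)[::-1]  # reverse complement
    reverse = toy_identity(flipped, consensus)
    # Keep whichever orientation matches the consensus better.
    best_seq, best_id = (seq, forward) if forward >= reverse else (flipped, reverse)
    return (best_seq, best_id) if best_id >= min_identity else None

def pick_best(seqs, consensus, min_identity=0.5):
    """pickbest: among validated sequences, return the highest-identity one."""
    kept = [v for v in (validate(s, consensus, min_identity) for s in seqs) if v]
    return max(kept, key=lambda pair: pair[1]) if kept else None
```

A sequence annotated on the wrong strand scores poorly in its given orientation and well after flipping, which is how the sketch decides to flip it.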

6. Taxonomic Integration

The taxonomic renaming system supports several header formats, selected with taxmode (accession, gi, taxid, or header), and integrates with BBTools' taxonomic databases to map sequence names to taxonomic IDs.

7. Memory Management

The tool uses memory management strategies:

8. Performance Characteristics

CutGff processing characteristics:
