CutGFF
Cuts out features defined by a GFF file and writes them to a new fasta. Features are output in their sense strand, with filtering by attributes, length constraints, and quality thresholds. Supports batch processing of multiple files and optional ribosomal sequence alignment validation.
Basic Usage
cutgff.sh in=<fna file> gff=<gff file> out=<fna file>
The input GFF file is optional and will be automatically inferred from the fasta filename if not specified. This allows for batch processing of multiple files:
cutgff.sh types=rRNA out=16S.fa minlen=1440 maxlen=1620 attributes=16S bacteria/*.fna.gz
Parameters
Parameters are organized by their functional roles in the feature extraction and processing workflow.
File Parameters
- in=<file>
- Input FNA (fasta) file. Can be compressed with gzip.
- gff=<file>
- Input GFF file (optional). If not specified, will be automatically inferred by replacing the fasta file extension with .gff or .gff.gz.
- out=<file>
- Output FNA file containing the extracted features.
Feature Selection Parameters
- types=CDS
- Types of features to cut from the GFF file. Can specify multiple types separated by commas (e.g., CDS,rRNA,tRNA).
- invert=false
- Invert selection: rather than outputting the features, mask them with Ns in the original sequences. This creates a version of the input with features removed rather than extracted.
- attributes=
- A comma-delimited list of strings. If present, one of these strings must be found in the GFF line attributes column for the feature to be included. Useful for filtering by gene names, products, or other annotations.
- bannedattributes=
- A comma-delimited list of banned strings. Features containing any of these strings in their attributes will be excluded from output.
- banpartial=t
- Ignore lines with 'partial=true' in attributes. This excludes partial gene predictions that may be incomplete at sequence ends.
Length Filtering Parameters
- minlen=1
- Ignore features shorter than this length in base pairs. Useful for filtering out very short features that may be annotation artifacts.
- maxlen=2147483647
- Ignore features longer than this length in base pairs. Can be used to exclude unusually long features or limit memory usage.
Quality Filtering Parameters
- maxns=-1
- If non-negative, ignore features with more than this many undefined bases (Ns or IUPAC ambiguity symbols). Use -1 to disable this filter.
- maxnfraction=-1.0
- If non-negative, ignore features with more than this fraction of undefined bases (Ns or IUPAC symbols). Should be between 0.0 and 1.0. Use -1.0 to disable this filter.
Taxonomic Renaming Parameters
- renamebytaxid=f
- Rename sequences with their taxonomic ID. Input sequences must be named appropriately according to the specified taxmode format.
- taxmode=accession
- Valid modes for taxonomic ID parsing:
- accession: Sequence names must start with an accession number
- gi: Sequence names must start with gi|number format
- taxid: Sequence names must start with tid|number format
- header: Best effort parsing for various header formats
- requirepresent=t
- Crash if a taxonomic ID cannot be found for a sequence when renamebytaxid=true. Set to false to continue processing with sequences that lack taxonomic information.
Output Control Parameters
- oneperfile=f
- Only output one sequence per input file. When processing multiple files, this limits output to the first qualifying feature from each file.
Ribosomal Sequence Parameters
- align=f
- Align ribosomal sequences to consensus sequences (if available). Discard sequences with low identity to consensus, and flip sequences annotated on the wrong strand. Requires consensus sequences to be loaded.
- adjustendpoints=f
- When align=true, adjust the start and stop coordinates of ribosomal features based on alignment results to improve accuracy.
- slop16s=999
- Maximum allowed coordinate adjustment for 16S/SSU rRNA features when adjustendpoints=true. Larger values allow more aggressive coordinate correction.
- slop23s=999
- Maximum allowed coordinate adjustment for 23S/LSU rRNA features when adjustendpoints=true. Larger values allow more aggressive coordinate correction.
- pickbest=f
- When multiple ribosomal sequences are found per file and align=true, only output the one with the highest alignment identity to consensus.
Examples
Basic Feature Extraction
cutgff.sh in=genome.fna gff=genome.gff out=cds.fna types=CDS
Extracts all coding sequences (CDS features) from the genome using the GFF annotation file.
Extract 16S rRNA Genes
cutgff.sh in=genome.fna out=16S.fna types=rRNA attributes=16S minlen=1400 maxlen=1600
Extracts 16S rRNA genes with length filtering to ensure high-quality sequences. The GFF file is automatically inferred as genome.gff.
Batch Processing with Quality Control
cutgff.sh types=rRNA out=all_rRNA.fa minlen=100 maxnfraction=0.05 *.fna.gz
Processes all compressed fasta files in the current directory, extracting rRNA sequences longer than 100bp with less than 5% ambiguous bases.
Mask Features Instead of Extracting
cutgff.sh in=genome.fna out=masked.fna types=CDS invert=true
Creates a masked version of the genome where all CDS regions are replaced with Ns, leaving only non-coding regions intact.
Ribosomal Alignment Validation
cutgff.sh in=genome.fna out=validated_16S.fna types=rRNA attributes=16S align=true pickbest=true
Extracts 16S rRNA sequences with alignment validation against consensus sequences. Only outputs the best-aligned sequence per genome file.
Taxonomic ID Integration
cutgff.sh in=ncbi_genome.fna out=genes_with_taxid.fna types=CDS renamebytaxid=true taxmode=accession
Extracts CDS features and renames them with taxonomic IDs parsed from NCBI-format accession headers.
Algorithm Details
Feature Extraction Pipeline
CutGff implements a multi-stage feature extraction pipeline that processes GFF annotations against reference sequences:
1. File Processing Strategy
The tool uses dual processing modes based on workload: single-threaded processing for small jobs and multi-threaded processing when multiple files are processed with sufficient CPU cores available (>2 threads). The multi-threaded implementation uses an AtomicInteger counter system to distribute files across ProcessThread worker threads.
2. GFF Parsing and Validation
GFF lines are parsed with full attribute parsing enabled (GffLine.parseAttributes=true), allowing complex filtering based on annotation metadata. The tool validates feature boundaries against sequence lengths and applies multiple filtering criteria simultaneously:
- Length filtering: Features outside minlen/maxlen bounds are excluded
- Attribute filtering: Required attributes must be present; banned attributes cause exclusion via hasAttributes() method
- Partial feature filtering: Features marked as partial can be excluded
- Quality filtering: Features with excessive ambiguous bases are filtered out using countUndefined()
3. Sequence Extraction Methods
The core extraction algorithm operates in two modes:
- Normal mode: Creates new Read objects containing feature sequences extracted from the reference using Arrays.copyOfRange()
- Invert mode: Modifies the original sequences in-place by replacing feature regions with 'N' bytes
4. Strand Handling
Features are automatically oriented according to their sense strand. For features on the minus strand (strand=1 in GFF format), the extracted sequence is reverse-complemented using r.reverseComplement() to ensure proper orientation.
5. Ribosomal Sequence Validation
When align=true, the tool implements ribosomal sequence validation using ProkObject consensus sequences:
- Consensus alignment: Extracted sequences are aligned against universal consensus sequences loaded from ProkObject.consensusReads()
- Bidirectional testing: Sequences are tested in both orientations using Alignment objects to detect strand annotation errors
- Identity thresholding: Only sequences meeting minimum identity thresholds (ID_MULT=0.96) are retained
- Strand correction: Sequences with better reverse-complement alignment are automatically flipped
- Coordinate refinement: When adjustendpoints=true, feature boundaries are refined based on alignment results with slop parameters
6. Taxonomic Integration
The taxonomic renaming system supports multiple header formats and integrates with BBTools' taxonomic databases:
- Header parsing: Extracts identifiers using format-specific parsers (parseAccession(), parseGi(), parseTaxIds())
- Batch lookup: Uses TaxClient.accessionToTaxidArray() for batch taxonomic ID resolution
- Header reformatting: Converts sequences to tid|taxid|description format for downstream compatibility
7. Memory Management
The tool uses memory management strategies:
- Streaming processing: Processes files individually rather than loading all data simultaneously
- HashMap indexing: Uses HashMap<String, Read> for O(1) sequence lookup during feature extraction
- Controlled threading: Thread count is limited by Tools.min(Shared.threads(), fnaList.size()) to prevent resource contention
- ByteStreamWriter: Uses buffered output writing with ByteStreamWriter to minimize I/O overhead
8. Performance Characteristics
CutGff processing characteristics:
- Scalability: Linear scaling with file count up to available CPU cores
- Memory usage: Memory usage scales with individual file size, not total dataset size
- I/O handling: Supports compressed input/output and uses FileFormat.testOutput() for format detection
- Batch processing: Automatic GFF file inference enables directory-scale processing
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org