RenameBySketch
Renames fasta files with a TaxID, based on SendSketch results. Designed for metagenome binning evaluation and synthetic read generation.
Basic Usage
renamebysketch.sh *.fa
Input may be fasta or fastq, compressed or uncompressed. Files will be renamed with the format: tid_[TAXID]_[original_filename]
Parameters
RenameBySketch uses standard Java parameters for memory management and execution control. The tool automatically processes all input files using SendSketch to determine taxonomic identity.
Java Parameters
- -Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 4g
- -eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da: Disable assertions.
Examples
Rename Multiple Assemblies
renamebysketch.sh bin1.fa bin2.fa bin3.fa
Processes each assembly file, identifies the top taxonomic hit using SendSketch, and renames files with their taxonomic ID. For example, bin1.fa might become tid_562_bin1.fa if identified as E. coli (TaxID 562).
Process All FASTA Files in Directory
renamebysketch.sh *.fa
Batch processes all .fa files in the current directory, renaming each based on its taxonomic identification.
With Custom Memory Settings
renamebysketch.sh -Xmx8g assembly1.fasta assembly2.fasta
Processes assemblies with 8GB of allocated memory, useful for large genome files that require more memory for sketch comparison.
Algorithm Details
FileRenamer implements a MinHash sketch-based taxonomic identification workflow that processes input files through SketchTool and SendSketch components for NCBI RefSeq database matching.
Processing Strategy
The implementation uses a four-stage processing pipeline with direct class method invocations:
- Sketch Generation: Creates a SketchTool instance with a sketch size of 10,000 k-mers and the DisplayParams.FORMAT_JSON output format
- Parallel Loading: Executes tool.loadSketches_MT() with multi-threaded sketch generation, applying SketchIdComparator.comparator for deterministic ordering when mode==Sketch.PER_FILE
- Taxonomic Query: Invokes SendSketch.sendSketches() with ArrayList<Sketch> input and "refseq" database parameter, returning ArrayList<JsonObject> taxonomic matches
- File Operations: Creates File objects for the original and target filenames, checks them with File.exists() assertions, and calls File.renameTo() to perform the rename
JSON Response Processing
The SketchRecord constructor extracts specific fields from JsonObject responses using typed accessor methods:
- ANI (Average Nucleotide Identity): Float value from hit.getDouble("ANI").floatValue() stored in SketchRecord.ani field
- Completeness (Complt): Float fraction from hit.getDouble("Complt").floatValue() representing reference genome coverage
- Contamination (Contam): Float contamination estimate from hit.getDouble("Contam").floatValue()
- Matches: Integer shared kmer count via hit.getLong("Matches").intValue() stored in SketchRecord.matches
- TaxID: Integer NCBI taxonomy identifier from hit.getLong("TaxID").intValue() used directly for filename prefix
- TaxName: String taxonomic name via hit.getString("taxName") with name truncation in shrink() method for names >40 characters
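The field extraction above can be sketched in Python as a rough analogue, not the tool's Java code. The sample hit and its values are invented for illustration, the SketchRecord dataclass here only mirrors the fields listed above, and the shrink() behavior shown (plain truncation to 40 characters) is an assumption: the text only says that names longer than 40 characters are shortened.

```python
from dataclasses import dataclass

MAX_NAME_LEN = 40  # threshold named in the text; the truncation rule is assumed


@dataclass
class SketchRecord:
    ani: float
    complt: float
    contam: float
    matches: int
    taxid: int
    tax_name: str


def shrink(name: str, limit: int = MAX_NAME_LEN) -> str:
    """Shorten names longer than `limit` (assumed: simple truncation)."""
    return name if len(name) <= limit else name[:limit]


def record_from_hit(hit: dict) -> SketchRecord:
    """Convert one JSON hit, already parsed into a dict, into typed fields."""
    return SketchRecord(
        ani=float(hit["ANI"]),
        complt=float(hit["Complt"]),
        contam=float(hit["Contam"]),
        matches=int(hit["Matches"]),
        taxid=int(hit["TaxID"]),
        tax_name=shrink(str(hit["taxName"])),
    )


# Hypothetical hit; every value below is invented for illustration.
example_hit = {
    "ANI": 99.1,
    "Complt": 0.98,
    "Contam": 0.01,
    "Matches": 9500,
    "TaxID": 562,
    "taxName": "Escherichia coli",
}
record = record_from_hit(example_hit)
```

Converting each field explicitly keeps the numeric types predictable even if the parsed JSON delivers integers where floats are expected, or the reverse.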
File Processing Implementation
The main loop processes args[i] filenames sequentially with validation and error handling:
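- Existence Verification: assert(f.exists()) stops processing of a non-existent input file. Like any Java assertion, this check only runs when assertions are enabled, so it is skipped under -da
- Collision Prevention: assert(!f2.exists()) guards against overwriting an existing target file, with the same dependence on assertions being enabled
- Renaming: File.renameTo() performs the rename. Its behavior is platform-dependent: Java does not guarantee that the operation is atomic, and the method returns false rather than throwing an exception when the rename fails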
- Null Handling: When JsonObject result is null or empty (result.jmapSize()<=0), assigns taxid=-1 as default
- Top Hit Selection: Iterates result.jmap.keySet() to extract first entry as highest-scoring match
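The per-file logic described above can be sketched in Python. This is an illustration of the steps, not the tool's Java implementation. The function names are hypothetical, the shape of `result` (a mapping from hit name to a dict with a "TaxID" key) is an assumption, and so is the choice to place the renamed file in the same directory as the original.

```python
import os


def top_taxid(result):
    """Return the TaxID of the first hit, or -1 when there are no hits."""
    if not result:  # covers both None and an empty mapping
        return -1
    first_key = next(iter(result))  # dicts preserve insertion order
    return int(result[first_key]["TaxID"])


def rename_with_taxid(path, result):
    """Rename `path` to tid_<taxid>_<basename> and return the new path."""
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    directory, name = os.path.split(path)
    target = os.path.join(directory, f"tid_{top_taxid(result)}_{name}")
    if os.path.exists(target):
        # Refuse to overwrite; another process could still create the
        # target between this check and the rename below.
        raise FileExistsError(target)
    os.rename(path, target)
    return target
```

The sketch raises exceptions instead of using assertions, so the existence and collision checks still run when assertions are switched off. For example, `rename_with_taxid("bin1.fa", {"hit1": {"TaxID": 562}})` would produce `tid_562_bin1.fa`, and passing `None` as the result would produce `tid_-1_bin1.fa`.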
Use Cases
FileRenamer addresses specific bioinformatics workflow requirements:
- Metagenome Binning Validation: Automated taxonomic labeling of assembled genome bins for purity assessment
- Synthetic Dataset Generation: Taxonomic organization of reference genomes for simulation and benchmarking
- Quality Control Pipelines: Batch taxonomic assignment for contamination detection workflows
- File Organization: Taxonomic prefix addition for systematic file management
Implementation Characteristics
The sketch-based approach exhibits specific computational and memory properties:
- Memory Usage: Linear with sketch count (10,000 kmers × 8 bytes per hash = 80KB per sketch) plus Java heap for JsonObject parsing
- Network Dependency: SendSketch.sendSketches() requires active internet connection for JGI sketch server communication
- Processing Time: Network latency dominates execution time; sketch generation is O(sequence_length), file renaming is O(1)
- Memory Allocation: Default 4GB heap (-Xmx4g) accommodates hundreds of concurrent sketches and JSON response objects
- Computational Complexity: Client-side work is O(n) for n input files; the sketch comparison itself runs on the remote service
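The memory estimate above is simple arithmetic, worked through here in Python. The constants come from the text (10,000 hashes per sketch, 8 bytes per hash); the function name and the 500-sketch example are illustrative assumptions, and the figure covers only the raw hash values, not Java object overhead or JSON parsing.

```python
HASHES_PER_SKETCH = 10_000  # sketch size stated in the text
BYTES_PER_HASH = 8          # one 64-bit hash value


def sketch_bytes(n_sketches, hashes=HASHES_PER_SKETCH, bytes_per_hash=BYTES_PER_HASH):
    """Raw hash storage for `n_sketches` sketches, in bytes."""
    return n_sketches * hashes * bytes_per_hash


per_sketch = sketch_bytes(1)        # 80,000 bytes, i.e. 80 KB
five_hundred = sketch_bytes(500)    # 40,000,000 bytes, i.e. 40 MB
heap_bytes = 4 * 1024**3            # the default -Xmx4g heap
fraction_of_heap = five_hundred / heap_bytes  # roughly 1% of the heap
```

By this estimate, even several hundred sketches occupy only a small fraction of the default heap, so the raw hash arrays are unlikely to be the limiting factor.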
Output Format
Files are renamed using the pattern: tid_[TAXID]_[original_filename]
Example Transformations
assembly.fa → tid_562_assembly.fa (E. coli)
bin_001.fasta → tid_1280_bin_001.fasta (Staphylococcus aureus)
unknown.fa → tid_-1_unknown.fa (no match found)
Taxonomic ID Assignment
TaxIDs are assigned based on the top SendSketch hit:
- Positive integers represent valid NCBI Taxonomy IDs
- -1 indicates no suitable taxonomic match was found
- Only the single best match is used for naming
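For downstream steps such as binning evaluation, the TaxID can be read back out of a renamed file. The following Python sketch assumes only the tid_[TAXID]_[original_filename] pattern described above; the function names are hypothetical and not part of the tool.

```python
import re

# Matches tid_<taxid>_<original name>; the TaxID may be -1 for "no match".
TID_PATTERN = re.compile(r"^tid_(-?\d+)_(.+)$")


def parse_renamed(filename):
    """Return (taxid, original_name), or None if the prefix is absent."""
    match = TID_PATTERN.match(filename)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def group_by_taxid(filenames):
    """Group original file names by the TaxID in their prefix."""
    groups = {}
    for name in filenames:
        parsed = parse_renamed(name)
        if parsed is None:
            continue  # skip files that were never renamed
        taxid, original = parsed
        groups.setdefault(taxid, []).append(original)
    return groups
```

Files grouped under -1 are the ones for which no suitable match was found, so they can be set aside or inspected separately. The functions expect bare file names, so any directory component should be stripped first.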
Dependencies
RenameBySketch requires network access to function properly:
- SendSketch Service: Must be able to connect to sketch comparison servers
- RefSeq Database: Queries the RefSeq database for taxonomic identification
- Internet Connection: Required for real-time sketch comparisons
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org