RenameBySketch

Basic Usage

renamebysketch.sh *.fa

Input may be fasta or fastq, compressed or uncompressed. Files will be renamed with the format: tid_[TAXID]_[original_filename]

Parameters

RenameBySketch uses standard Java parameters for memory management and execution control. The tool automatically processes all input files using SendSketch to determine taxonomic identity.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 4g
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Rename Multiple Assemblies

renamebysketch.sh bin1.fa bin2.fa bin3.fa

Processes each assembly file, identifies the top taxonomic hit using SendSketch, and renames files with their taxonomic ID. For example, bin1.fa might become tid_562_bin1.fa if identified as E. coli (TaxID 562).

Process All FASTA Files in Directory

renamebysketch.sh *.fa

Batch processes all .fa files in the current directory, renaming each based on its taxonomic identification.

With Custom Memory Settings

renamebysketch.sh -Xmx8g assembly1.fasta assembly2.fasta

Processes assemblies with 8GB of allocated memory, useful for large genome files that require more memory for sketch comparison.

Algorithm Details

FileRenamer implements a MinHash sketch-based taxonomic identification workflow that processes input files through SketchTool and SendSketch components for NCBI RefSeq database matching.

Processing Strategy

The implementation uses a four-stage processing pipeline with direct class method invocations:

Sketch Generation: Creates a SketchTool instance with a target sketch size of 10,000 kmers and the DisplayParams.FORMAT_JSON output format
Parallel Loading: Executes tool.loadSketches_MT() with multi-threaded sketch generation, applying SketchIdComparator.comparator for deterministic ordering when mode==Sketch.PER_FILE
Taxonomic Query: Invokes SendSketch.sendSketches() with ArrayList<Sketch> input and "refseq" database parameter, returning ArrayList<JsonObject> taxonomic matches
File Operations: Creates File objects for the original and target filenames, checks existence with File.exists() assertions, and calls File.renameTo() to perform the rename
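
The stages above can be outlined as follows. This is a minimal sketch, not the BBTools source: RenamePlanSketch and its taxIdLookup parameter are hypothetical names, and the lookup function stands in for sketch generation plus the SendSketch query so the example runs without a network connection. It assumes inputs are bare file names, and it only builds a rename plan without touching the filesystem.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative outline of the pipeline stages; not the BBTools implementation. */
final class RenamePlanSketch {

    /**
     * Builds an old-name -> new-name plan.
     * taxIdLookup is a hypothetical stand-in for sketching plus the remote
     * query; it must return one TaxID per input, in input order.
     */
    static Map<String, String> plan(List<String> files,
                                    Function<List<String>, List<Integer>> taxIdLookup) {
        // Stages 1-3 (sketch, load, query) are collapsed into the lookup call.
        List<Integer> taxIds = taxIdLookup.apply(new ArrayList<>(files));
        if (taxIds.size() != files.size()) {
            throw new IllegalStateException("expected one result per input file");
        }
        // Stage 4: derive the target name for each input, preserving input order.
        Map<String, String> plan = new LinkedHashMap<>();
        for (int i = 0; i < files.size(); i++) {
            plan.put(files.get(i), "tid_" + taxIds.get(i) + "_" + files.get(i));
        }
        return plan;
    }
}
```

Keeping results in input order mirrors the role the deterministic sketch ordering plays above: each result has to line up with the file it came from.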

JSON Response Processing

The SketchRecord constructor extracts specific fields from JsonObject responses using typed accessor methods:

ANI (Average Nucleotide Identity): Float value from hit.getDouble("ANI").floatValue() stored in SketchRecord.ani field
Completeness (Complt): Float fraction from hit.getDouble("Complt").floatValue() representing reference genome coverage
Contamination (Contam): Float contamination estimate from hit.getDouble("Contam").floatValue()
Matches: Integer shared kmer count via hit.getLong("Matches").intValue() stored in SketchRecord.matches
TaxID: Integer NCBI taxonomy identifier from hit.getLong("TaxID").intValue() used directly for filename prefix
TaxName: String taxonomic name via hit.getString("taxName") with name truncation in shrink() method for names >40 characters
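
The field extraction above could look roughly like this. It is a sketch under stated assumptions: a plain Map<String, Object> stands in for the JsonObject type, HitSummarySketch is a hypothetical class name, and shrink() here simply cuts names at 40 characters, which may differ from how the real method shortens them.

```java
import java.util.Map;

/** Illustrative holder for the fields listed above; not the BBTools SketchRecord. */
final class HitSummarySketch {
    static final int MAX_NAME_LENGTH = 40;

    final float ani;
    final float completeness;
    final float contamination;
    final int matches;
    final int taxId;
    final String taxName;

    /** The map is a stand-in for one parsed JSON hit. */
    HitSummarySketch(Map<String, Object> hit) {
        ani = number(hit, "ANI").floatValue();
        completeness = number(hit, "Complt").floatValue();
        contamination = number(hit, "Contam").floatValue();
        matches = number(hit, "Matches").intValue();
        taxId = number(hit, "TaxID").intValue();
        taxName = shrink(String.valueOf(hit.get("taxName")));
    }

    /** Fails loudly if a numeric field is absent or has the wrong type. */
    private static Number number(Map<String, Object> hit, String key) {
        Object value = hit.get(key);
        if (!(value instanceof Number)) {
            throw new IllegalArgumentException("missing or non-numeric field: " + key);
        }
        return (Number) value;
    }

    /** Assumed truncation rule: keep at most the first 40 characters. */
    static String shrink(String name) {
        return name.length() <= MAX_NAME_LENGTH ? name : name.substring(0, MAX_NAME_LENGTH);
    }
}
```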

File Processing Implementation

The main loop processes args[i] filenames sequentially with validation and error handling:

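Existence Verification: assert(f.exists()) guards against processing non-existent input files
Collision Prevention: assert(!f2.exists()) guards against overwriting existing target files; like the check above, it is a Java assertion and is skipped when assertions are disabled (for example with -da)
Renaming: File.renameTo() performs the rename; the Java documentation describes its behavior as platform-dependent and not guaranteed to be atomic, and it signals failure by returning false rather than throwing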
Null Handling: When JsonObject result is null or empty (result.jmapSize()<=0), assigns taxid=-1 as default
Top Hit Selection: Iterates result.jmap.keySet() to extract first entry as highest-scoring match
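
For readers who want the same checks to hold even when assertions are disabled, one possible alternative is shown below. Note that it deliberately swaps in java.nio.file.Files.move for the File.renameTo() call described above; CheckedRenameSketch and renameWithTaxId are hypothetical names, and a null TaxID stands in for a null or empty query result.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

/** Illustrative alternative to assert + File.renameTo(); not the BBTools code. */
final class CheckedRenameSketch {
    static final int NO_MATCH = -1;

    /**
     * Renames source to tid_[TAXID]_[name] in the same directory.
     * A null taxId models a missing or empty result and maps to -1.
     */
    static Path renameWithTaxId(Path source, Integer taxId) throws IOException {
        if (!Files.exists(source)) {
            throw new NoSuchFileException(source.toString());
        }
        int id = (taxId == null) ? NO_MATCH : taxId;
        Path target = source.resolveSibling("tid_" + id + "_" + source.getFileName());
        // Without REPLACE_EXISTING, Files.move throws FileAlreadyExistsException
        // if the target is already present, so collisions fail with an exception.
        return Files.move(source, target);
    }
}
```

The design choice here is to report both failure modes as exceptions, so a batch run cannot silently skip a file.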

Use Cases

FileRenamer addresses specific bioinformatics workflow requirements:

Metagenome Binning Validation: Automated taxonomic labeling of assembled genome bins for purity assessment
Synthetic Dataset Generation: Taxonomic organization of reference genomes for simulation and benchmarking
Quality Control Pipelines: Batch taxonomic assignment for contamination detection workflows
File Organization: Taxonomic prefix addition for systematic file management

Implementation Characteristics

The sketch-based approach exhibits specific computational and memory properties:

Memory Usage: Linear with sketch count (10,000 kmers × 8 bytes per hash = 80KB per sketch) plus Java heap for JsonObject parsing
Network Dependency: SendSketch.sendSketches() requires active internet connection for JGI sketch server communication
Processing Time: Network latency dominates execution time; sketch generation is O(sequence_length), file renaming is O(1)
Memory Allocation: Default 4GB heap (-Xmx4g) accommodates hundreds of concurrent sketches and JSON response objects
Computational Complexity: O(n) local work for n input files; sketch comparison runs on the remote service, so its cost appears locally only as request latency
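
The memory figure above is simple arithmetic, worked through here as a small sketch. SketchMemoryEstimate is a hypothetical helper; it counts only the raw hash values and ignores Java object overhead and JSON parsing, so real usage will be somewhat higher.

```java
/** Back-of-envelope estimate of raw hash storage; ignores JVM object overhead. */
final class SketchMemoryEstimate {
    static final int KMERS_PER_SKETCH = 10_000;   // sketch size stated above
    static final int BYTES_PER_HASH = Long.BYTES; // 8 bytes per 64-bit hash

    /** Raw bytes needed to hold the hashes of sketchCount sketches. */
    static long bytesForSketches(int sketchCount) {
        return (long) sketchCount * KMERS_PER_SKETCH * BYTES_PER_HASH;
    }
}
```

For example, bytesForSketches(1) is 80,000 bytes (80 KB) and bytesForSketches(500) is 40,000,000 bytes (40 MB), a small fraction of the default 4 GB heap.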

Output Format

Files are renamed using the pattern: tid_[TAXID]_[original_filename]

Example Transformations

assembly.fa → tid_562_assembly.fa (E. coli)
bin_001.fasta → tid_1280_bin_001.fasta (Staphylococcus aureus)
unknown.fa → tid_-1_unknown.fa (no match found)

Taxonomic ID Assignment

TaxIDs are assigned based on the top SendSketch hit:

Positive integers represent valid NCBI Taxonomy IDs
-1 indicates no suitable taxonomic match was found
Only the single best match is used for naming
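
Downstream scripts often need to read the TaxID back out of a renamed file. The following is a minimal sketch of that, assuming only the tid_[TAXID]_[original_filename] pattern described above; TaxIdPrefixParser and its methods are hypothetical names, not part of BBTools.

```java
import java.util.Optional;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses names of the form tid_[TAXID]_[original_filename]. */
final class TaxIdPrefixParser {
    // Group 1: TaxID (may be -1 for "no match"); group 2: original file name.
    private static final Pattern PREFIX = Pattern.compile("^tid_(-?\\d+)_(.+)$");

    /** Returns the TaxID, or empty if the name lacks the prefix. */
    static OptionalInt taxIdOf(String fileName) {
        Matcher m = PREFIX.matcher(fileName);
        return m.matches() ? OptionalInt.of(Integer.parseInt(m.group(1)))
                           : OptionalInt.empty();
    }

    /** Returns the original file name, or empty if the name lacks the prefix. */
    static Optional<String> originalNameOf(String fileName) {
        Matcher m = PREFIX.matcher(fileName);
        return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
    }
}
```

A caller can treat a parsed value of -1 as "no match found" and any positive value as an NCBI Taxonomy ID.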

Dependencies

RenameBySketch requires network access to function properly:

SendSketch Service: Must be able to connect to sketch comparison servers
RefSeq Database: Queries the RefSeq database for taxonomic identification
Internet Connection: Required for real-time sketch comparisons

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org