RenameIMG

Basic Usage

renameimg.sh in=auto out=renamed.fa.gz

This tool processes IMG (Integrated Microbial Genomes) records and renames sequences by prefixing them with taxonomic IDs and IMG IDs for internal JGI (Joint Genome Institute) workflows.

Parameters

Parameters are parsed by the Parser class; in the constructor, each argument is split on the '=' delimiter into a key and a value.
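
As a rough illustration of the key=value convention, the Python sketch below splits each argument into a key and a value. It is not the BBTools Parser: the helper name split_arg is hypothetical, and splitting on the first '=' and lowercasing the key are assumptions made for the example.

```python
def split_arg(arg):
    """Split 'key=value' on the first '='; value is None when no '=' is present.

    Hypothetical helper for illustration only, not part of BBTools.
    """
    key, sep, value = arg.partition("=")
    return key.lower(), (value if sep else None)


args = ["in=auto", "out=renamed.fa.gz", "verbose=t"]
parsed = dict(split_arg(a) for a in args)
print(parsed)
```

Flags without a value, such as -da, come back with a value of None in this sketch.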

Input/Output Parameters

in=: 3-column TSV file with imgID, taxID, and file path. These files will have their sequences renamed and concatenated. Use "auto" to load the default IMG file from TaxTree.
out=: Output file for renamed sequences. Default output is in FASTA format.
img=: Optional. Specifies a different (presumably larger) file to use for taxonomic assignment. For example, 'in' could be a subset of 'img', potentially with incorrect taxIDs. Use "auto" to load the default IMG file.
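
To make the 3-column input layout concrete, here is a small Python sketch that parses such a file. The rows, paths, and the helper name read_records are invented for illustration; the sketch also assumes there is no header row and silently skips rows that do not have exactly three columns, which may differ from what the tool does.

```python
import csv
import io

# Invented example rows: imgID, taxID, file path (tab-separated).
tsv_text = (
    "2500000001\t562\t/data/img/2500000001.fna\n"
    "2500000002\t1423\t/data/img/2500000002.fna\n"
)


def read_records(handle):
    """Yield (img_id, tax_id, path) tuples from a 3-column TSV."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue  # assumption: rows with the wrong column count are skipped
        yield int(row[0]), int(row[1]), row[2]


records = list(read_records(io.StringIO(tsv_text)))
print(records)
```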

Processing Parameters

lines=: Maximum number of lines to process from each input file. Default is unlimited (Long.MAX_VALUE); negative values are also treated as unlimited.
verbose=: Enable verbose output by setting static boolean flags: ByteFile1.verbose, ByteFile2.verbose, FastaReadInputStream.verbose, ConcurrentGenericReadInputStream.verbose, FastqReadInputStream.verbose, and ReadWrite.verbose. Default: false.
overwrite=: Allow overwriting of existing output files. Default: true.
append=: Append to existing output files instead of overwriting. Default: false.
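
The lines= behavior (a cap per input file, with negative values meaning unlimited) can be sketched in Python as follows. The function names are hypothetical, and sys.maxsize is used only as a stand-in for Java's Long.MAX_VALUE.

```python
import itertools
import sys


def effective_limit(lines):
    """Treat negative values as unlimited, using sys.maxsize as the stand-in maximum."""
    return sys.maxsize if lines < 0 else lines


def take_lines(handle, lines=-1):
    """Return at most `lines` items from an iterable of lines."""
    return list(itertools.islice(handle, effective_limit(lines)))


sample = [">scaffold_1", "ACGT", ">scaffold_2", "GGCC"]
print(take_lines(sample, 2))
print(len(take_lines(sample, -1)))
```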

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Usage

renameimg.sh in=auto out=renamed.fa.gz

Process the default IMG file and output renamed sequences to a gzipped FASTA file.

Custom Input File

renameimg.sh in=my_img_records.tsv out=renamed_sequences.fa

Process a custom 3-column TSV file containing imgID, taxID, and file paths.

With Separate Taxonomy File

renameimg.sh in=subset_records.tsv img=complete_img.tsv out=renamed.fa verbose=t

Process a subset of IMG records but use a complete IMG file for taxonomic assignment, with verbose output enabled.

Limited Processing

renameimg.sh in=large_dataset.tsv out=sample.fa lines=1000

Process only the first 1000 lines from each input file for testing purposes.

Algorithm Details

Sequence Renaming Strategy

The tool implements line-by-line processing within the process_inner() method:

Header Processing: For each FASTA header line (line[0]=='>'), the tool calls ByteBuilder.append() methods to construct new headers
ID Prefixing: Headers are rebuilt conditionally: if(tid>=0), "tid|[tid]|" is added first; "img|[img]" is then always added, followed by a space and the original header text, as shown under Output Format
Taxonomic Resolution: Calls TaxTree.imgToTaxid(img) method which returns int tid for taxonomic ID lookup, or -1 for unknown IDs
Error Handling: Increments unknownTaxid counter for negative taxonomy IDs and tracks file existence with File.exists() and File.canRead() checks
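
The renaming steps above can be sketched in a few lines of Python. This is an illustration, not the tool's Java code: the function name rename_header is hypothetical, and the space between the IMG ID and the original header text is taken from the examples in the Output Format section.

```python
def rename_header(line, img, tid):
    """Return a renamed FASTA header; non-header lines pass through unchanged."""
    if not line.startswith(">"):
        return line
    prefix = ">"
    if tid >= 0:
        # Only known taxonomic IDs get a tid| prefix.
        prefix += f"tid|{tid}|"
    # The img| prefix is always added; separator assumed to be a space.
    prefix += f"img|{img} "
    return prefix + line[1:]


print(rename_header(">scaffold_1 length=1000 GC=0.45", img=67890, tid=12345))
print(rename_header(">scaffold_1 length=1000 GC=0.45", img=67890, tid=-1))
```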

Memory Management

Stream Processing: Uses ByteFile1 and ByteFile2 classes with FORCE_MODE_BF1=true for single-threaded file reading to minimize memory overhead
Set-based Tracking: Employs IntHashSet with an initial capacity of 10,000 slots to track unique taxonomic IDs, so each ID is counted once regardless of how many files map to it
Buffer Management: ByteBuilder provides mutable string concatenation with automatic capacity expansion for header construction
Default Memory: Allocates 1GB of heap space by default (z="-Xmx1g" in shell script)
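
The set-based tracking idea can be pictured with Python's built-in set standing in for IntHashSet. The function name track_taxids is hypothetical, and keeping unknown (negative) IDs out of the set is an assumption made for this sketch.

```python
def track_taxids(tids):
    """Return (set of unique known taxIDs, count of unknown taxIDs)."""
    seen = set()
    unknown = 0
    for tid in tids:
        if tid < 0:
            unknown += 1  # negative IDs are counted, not stored (assumption)
        else:
            seen.add(tid)  # duplicates collapse automatically
    return seen, unknown


seen, unknown = track_taxids([562, 1423, 562, -1, -1])
print(f"TaxIDs: {len(seen)} ({unknown} unknown)")
```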

File Format Support

Input Format: 3-column TSV files with imgID, taxID, and file path columns parsed by ImgRecord.toArray()
Sequence Format: Processes FASTA files referenced in the TSV paths using FastaReadInputStream with verbose output control
Compression: Uses ReadWrite.USE_PIGZ=true and ReadWrite.USE_UNPIGZ=true for parallel gzip compression/decompression
Output Format: Generates standard FASTA format through ByteStreamWriter with renamed headers
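
A quick way to sanity-check an output file is to count its headers. The Python sketch below does this for plain or gzipped FASTA. It uses Python's gzip module rather than pigz, assumes compression is indicated by a .gz suffix as in the Basic Usage example, and writes a tiny invented file so that it is self-contained; count_headers is a hypothetical helper.

```python
import gzip
import os
import tempfile


def count_headers(path):
    """Count FASTA header lines in a plain or gzipped file, chosen by the .gz suffix."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Tiny invented file standing in for real output such as renamed.fa.gz.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fa.gz")
with gzip.open(demo_path, "wt") as out:
    out.write(">tid|12345|img|67890 scaffold_1\nACGT\n>img|67891 scaffold_2\nGGCC\n")

print(count_headers(demo_path))
```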

Statistical Reporting

The tool provides processing statistics tracked through instance variables:

Files processed vs. valid files (filesProcessed, filesValid counters)
Total contigs and bases processed (sequencesProcessed, basesProcessed counters)
Unique taxonomic IDs encountered (IntHashSet.size() and unknownTaxid counter)
Processing time and throughput metrics (Tools.linesBytesProcessed() with Timer.elapsed)
Lines processed vs. valid lines (linesProcessed, linesValid counters)
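
The processed-versus-valid file counters can be sketched as follows. This Python version uses os.path.exists() and os.access() as rough analogues of the File.exists() and File.canRead() checks mentioned earlier; count_files and the temporary file names are hypothetical.

```python
import os
import tempfile


def count_files(paths):
    """Return (files_processed, files_valid) for a list of paths."""
    processed = 0
    valid = 0
    for path in paths:
        processed += 1
        # A file counts as valid only if it exists and is readable.
        if os.path.exists(path) and os.access(path, os.R_OK):
            valid += 1
    return processed, valid


workdir = tempfile.mkdtemp()
present = os.path.join(workdir, "present.fna")
with open(present, "w") as handle:
    handle.write(">scaffold_1\nACGT\n")
missing = os.path.join(workdir, "missing.fna")  # never created

processed, valid = count_files([present, missing])
print(f"Valid Files: {valid}, Invalid Files: {processed - valid}")
```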

Integration with TaxTree

The tool integrates with BBTools' taxonomic framework through static method calls:

Default Files: Uses the TaxTree.defaultImgFile() method for automatic IMG file path resolution when "auto" is specified
Taxonomic Mapping: Calls TaxTree.imgToTaxid(img) method for converting IMG IDs to NCBI taxonomy IDs
IMG Loading: Invokes TaxTree.loadIMG() method to initialize taxonomic lookup tables from ImgRecord arrays
Quality Filtering: Uses TaxTree.IMG_HQ boolean flag in ImgRecord.toArray() calls for high-quality record selection
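
The lookup role TaxTree plays here, including the case where 'in' is a subset of 'img' with unreliable taxIDs, can be pictured with a plain dictionary. This Python sketch is only an analogy: the function names and all record values are invented, and the real lookup tables are built by TaxTree.loadIMG().

```python
def build_img_to_taxid(records):
    """Map imgID -> taxID from (img_id, tax_id, path) tuples."""
    return {img: tid for img, tid, _path in records}


def img_to_taxid(table, img):
    """Return the taxID for an IMG ID, or -1 when the ID is unknown."""
    return table.get(img, -1)


# Invented records: 'complete' plays the role of img=, 'subset' the role of in=.
complete = [(1001, 562, "a.fna"), (1002, 1423, "b.fna")]
subset = [(1001, 999999, "a.fna"), (1003, 5, "c.fna")]  # taxID column may be wrong

table = build_img_to_taxid(complete)
# The subset's own taxID column is ignored; only the complete table is consulted.
resolved = [(img, img_to_taxid(table, img)) for img, _tid, _path in subset]
print(resolved)
```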

Output Format

Renamed Sequence Headers

Original FASTA headers are transformed according to this pattern:

# Original header:
>scaffold_1 length=1000 GC=0.45

# Renamed header (with taxonomic ID):
>tid|12345|img|67890 scaffold_1 length=1000 GC=0.45

# Renamed header (no taxonomic ID):
>img|67890 scaffold_1 length=1000 GC=0.45
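
For downstream scripts that need to recover the IDs, a renamed header can be parsed back apart. The Python sketch below assumes headers follow exactly the two patterns shown above, with numeric IDs; the regular expression and the function name parse_renamed_header are illustrative, not part of BBTools.

```python
import re

# Optional tid|N| prefix, then img|N, a space, and the original header text.
HEADER_RE = re.compile(r"^>(?:tid\|(\d+)\|)?img\|(\d+) (.*)$")


def parse_renamed_header(line):
    """Return (tid, img, original_text), with tid=-1 when absent; None if no match."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    tid = int(match.group(1)) if match.group(1) is not None else -1
    return tid, int(match.group(2)), match.group(3)


print(parse_renamed_header(">tid|12345|img|67890 scaffold_1 length=1000 GC=0.45"))
print(parse_renamed_header(">img|67890 scaffold_1 length=1000 GC=0.45"))
```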

Processing Statistics Output

Time:                         	0.125 seconds
Files Processed:    15
Contigs Processed:  1,234,567
Bases Processed:    123,456,789
TaxIDs Processed:   892 	(45 unknown)
Lines Processed:    1,456,789 lines, 98.7 MB, 789.3 MB/s

Valid Files:       	14
Invalid Files:     	1
Valid Lines:       	1,456,234
Invalid Lines:     	555

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org