RenameIMG
Renames IMG records so that each is prefixed by its ID. This is for internal JGI use and has no external utility.
Basic Usage
renameimg.sh in=auto out=renamed.fa.gz
This tool processes IMG (Integrated Microbial Genomes) records and renames sequences by prefixing them with taxonomic IDs and IMG IDs for internal JGI (Joint Genome Institute) workflows.
Parameters
Parameters are parsed by the Parser class; in the constructor, each argument is split into a key and a value on the '=' delimiter.
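As a rough illustration of the key=value convention, the sketch below splits each argument at its first '='. The class ArgSplitter is hypothetical and is not part of BBTools; lower-casing the key and splitting only at the first '=' are assumptions of this sketch, and the real Parser class handles many more cases.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of BBTools: splits "key=value" arguments.
final class ArgSplitter {

    // Returns a map from lower-cased key to value; a bare flag maps to null.
    static Map<String, String> split(String[] args) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            String key = (eq < 0 ? arg : arg.substring(0, eq)).toLowerCase();
            String value = (eq < 0 ? null : arg.substring(eq + 1));
            out.put(key, value);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] demo = {"in=auto", "out=renamed.fa.gz", "verbose=t"};
        System.out.println(split(demo));
    }
}
```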
Input/Output Parameters
- in=
- 3-column TSV file with imgID, taxID, and file path. These files will have their sequences renamed and concatenated. Use "auto" to load the default IMG file from TaxTree.
- out=
- Output file for renamed sequences. Default output is in FASTA format.
- img=
- Optional, if a different (presumably bigger) file will be used for taxonomic assignment. For example, 'in' could be a subset of 'img', potentially with incorrect taxIDs. Use "auto" to load the default IMG file.
Processing Parameters
- lines=
- Maximum number of lines to process from each input file. Default is unlimited (Long.MAX_VALUE). Use negative values for unlimited processing.
- verbose=
- Enable verbose output. This sets the static verbose flags in ByteFile1, ByteFile2, FastaReadInputStream, ConcurrentGenericReadInputStream, FastqReadInputStream, and ReadWrite. Default: false.
- overwrite=
- Allow overwriting of existing output files. Default: true.
- append=
- Append to existing output files instead of overwriting. Default: false.
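The lines= convention described above (negative means unlimited) can be sketched as a small normalization step. LineLimit is a hypothetical helper, not BBTools code; how the tool treats lines=0 is not stated here, so this sketch simply reads zero lines in that case.

```java
// Hypothetical helper, not part of BBTools: applies the lines= convention.
final class LineLimit {

    // Maps any negative request to "unlimited" (Long.MAX_VALUE).
    static long normalize(long requested) {
        return requested < 0 ? Long.MAX_VALUE : requested;
    }

    // Number of lines of one input that would be read under the limit.
    static long linesToRead(long available, long requested) {
        return Math.min(available, normalize(requested));
    }

    public static void main(String[] args) {
        System.out.println(linesToRead(5000, 1000));
        System.out.println(linesToRead(5000, -1));
    }
}
```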
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Usage
renameimg.sh in=auto out=renamed.fa.gz
Process the default IMG file and output renamed sequences to a gzipped FASTA file.
Custom Input File
renameimg.sh in=my_img_records.tsv out=renamed_sequences.fa
Process a custom 3-column TSV file containing imgID, taxID, and file paths.
With Separate Taxonomy File
renameimg.sh in=subset_records.tsv img=complete_img.tsv out=renamed.fa verbose=t
Process a subset of IMG records but use a complete IMG file for taxonomic assignment, with verbose output enabled.
Limited Processing
renameimg.sh in=large_dataset.tsv out=sample.fa lines=1000
Process only the first 1000 lines from each input file for testing purposes.
Algorithm Details
Sequence Renaming Strategy
The tool implements line-by-line processing within the process_inner() method:
- Header Processing: For each FASTA header line (line[0]=='>'), the tool calls ByteBuilder.append() methods to construct new headers
- ID Prefixing: Headers are reconstructed using conditional logic: if(tid>=0) adds "tid|[id]|", then always adds "img|[img]|" prefix
- Taxonomic Resolution: Calls TaxTree.imgToTaxid(img) method which returns int tid for taxonomic ID lookup, or -1 for unknown IDs
- Error Handling: Increments unknownTaxid counter for negative taxonomy IDs and tracks file existence with File.exists() and File.canRead() checks
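The renaming steps above can be sketched as follows. This is a minimal illustration rather than the BBTools source: HeaderRenamer is a hypothetical class, StringBuilder stands in for ByteBuilder, and the character written after the IMG ID is passed in as a parameter because this sketch does not assume which separator the tool emits.

```java
// Hypothetical sketch, not the BBTools source: rebuilds a FASTA header
// with an optional "tid|<taxID>|" prefix followed by "img|<imgID>".
final class HeaderRenamer {

    // header must start with '>'; tid < 0 means the taxID is unknown.
    // sep is written between the IMG ID and the original sequence name.
    static String rename(String header, int tid, long img, char sep) {
        if (header.isEmpty() || header.charAt(0) != '>') {
            throw new IllegalArgumentException("Not a FASTA header: " + header);
        }
        StringBuilder sb = new StringBuilder(header.length() + 32);
        sb.append('>');
        if (tid >= 0) {
            sb.append("tid|").append(tid).append('|');
        }
        sb.append("img|").append(img).append(sep);
        sb.append(header, 1, header.length()); // original name without '>'
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rename(">scaffold_1 length=1000", 12345, 67890L, ' '));
        System.out.println(rename(">scaffold_1 length=1000", -1, 67890L, ' '));
    }
}
```

Non-header lines (the sequence itself) would be copied to the output unchanged.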
Memory Management
- Stream Processing: Uses ByteFile1 and ByteFile2 classes with FORCE_MODE_BF1=true for single-threaded file reading to minimize memory overhead
- Set-based Tracking: Employs an IntHashSet with an initial capacity of 10,000 to record each unique taxonomic ID once
- Buffer Management: ByteBuilder provides mutable string concatenation with automatic capacity expansion for header construction
- Default Memory: Allocates 1GB of heap space by default (z="-Xmx1g" in shell script)
File Format Support
- Input Format: 3-column TSV files with imgID, taxID, and file path columns parsed by ImgRecord.toArray()
- Sequence Format: Processes FASTA files referenced in the TSV paths using FastaReadInputStream with verbose output control
- Compression: Uses ReadWrite.USE_PIGZ=true and ReadWrite.USE_UNPIGZ=true for parallel gzip compression/decompression
- Output Format: Generates standard FASTA format through ByteStreamWriter with renamed headers
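To make the 3-column input layout concrete, here is a minimal sketch of parsing one TSV line. ImgLine is a hypothetical type and is not the BBTools ImgRecord class; returning null for malformed lines is an assumption of this sketch, and the path in main is purely illustrative.

```java
// Hypothetical record type, not the BBTools ImgRecord class.
final class ImgLine {
    final long imgId;
    final int taxId;
    final String path;

    ImgLine(long imgId, int taxId, String path) {
        this.imgId = imgId;
        this.taxId = taxId;
        this.path = path;
    }

    // Parses "imgID<TAB>taxID<TAB>path"; returns null for malformed lines.
    static ImgLine parse(String line) {
        String[] f = line.split("\t", -1);
        if (f.length != 3) {
            return null;
        }
        try {
            long img = Long.parseLong(f[0].trim());
            int tax = Integer.parseInt(f[1].trim());
            return new ImgLine(img, tax, f[2].trim());
        } catch (NumberFormatException e) {
            return null; // non-numeric ID column
        }
    }

    public static void main(String[] args) {
        ImgLine r = parse("2500000001\t562\t/path/to/2500000001.fna");
        System.out.println(r.imgId + " " + r.taxId + " " + r.path);
    }
}
```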
Statistical Reporting
The tool provides processing statistics tracked through instance variables:
- Files processed vs. valid files (filesProcessed, filesValid counters)
- Total contigs and bases processed (sequencesProcessed, basesProcessed counters)
- Unique taxonomic IDs encountered (IntHashSet.size() and unknownTaxid counter)
- Processing time and throughput metrics (Tools.linesBytesProcessed() with Timer.elapsed)
- Lines processed vs. valid lines (linesProcessed, linesValid counters)
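The counters above could be maintained along these lines. RenameStats is a hypothetical class whose field names echo the counters listed; HashSet stands in for IntHashSet, and counting unknown taxIDs once per input file is an assumption of this sketch rather than confirmed tool behavior.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical tally class; not the BBTools implementation.
final class RenameStats {
    long sequencesProcessed = 0;
    long basesProcessed = 0;
    long unknownTaxid = 0;
    final Set<Integer> taxIds = new HashSet<>(); // stand-in for IntHashSet

    // Records the taxID resolved for one input file; negative means unknown.
    void addFile(int tid) {
        if (tid >= 0) {
            taxIds.add(tid);
        } else {
            unknownTaxid++;
        }
    }

    // Records one FASTA line: headers count as sequences, others as bases.
    void addLine(String line) {
        if (line.isEmpty()) {
            return;
        }
        if (line.charAt(0) == '>') {
            sequencesProcessed++;
        } else {
            basesProcessed += line.length();
        }
    }

    String taxIdSummary() {
        return "TaxIDs Processed: " + taxIds.size() + " (" + unknownTaxid + " unknown)";
    }

    public static void main(String[] args) {
        RenameStats s = new RenameStats();
        s.addFile(562);
        s.addFile(-1);
        s.addLine(">contig_1");
        s.addLine("ACGTACGT");
        System.out.println(s.taxIdSummary());
    }
}
```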
Integration with TaxTree
The tool integrates with BBTools' taxonomic framework through static method calls:
- Default Files: Uses TaxTree.defaultImgFile() method for automatic IMG file path resolution when "auto" parameter specified
- Taxonomic Mapping: Calls TaxTree.imgToTaxid(img) method for converting IMG IDs to NCBI taxonomy IDs
- IMG Loading: Invokes TaxTree.loadIMG() method to initialize taxonomic lookup tables from ImgRecord arrays
- Quality Filtering: Uses TaxTree.IMG_HQ boolean flag in ImgRecord.toArray() calls for high-quality record selection
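The lookup contract described above (a taxonomy ID for a known IMG ID, -1 otherwise) can be mimicked with an ordinary map. ImgLookup is a hypothetical stand-in written for illustration; it does not reproduce the TaxTree API or its loading logic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the IMG-to-taxID table; not the TaxTree API.
final class ImgLookup {
    private final Map<Long, Integer> imgToTax = new HashMap<>();

    // Registers one IMG ID and its NCBI taxonomy ID.
    void put(long img, int taxId) {
        imgToTax.put(img, taxId);
    }

    // Returns the taxID for an IMG ID, or -1 when the ID is unknown.
    int imgToTaxid(long img) {
        Integer tid = imgToTax.get(img);
        return tid == null ? -1 : tid;
    }

    public static void main(String[] args) {
        ImgLookup table = new ImgLookup();
        table.put(2500000001L, 562);
        System.out.println(table.imgToTaxid(2500000001L));
        System.out.println(table.imgToTaxid(7L));
    }
}
```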
Output Format
Renamed Sequence Headers
Original FASTA headers are transformed according to this pattern:
# Original header:
>scaffold_1 length=1000 GC=0.45
# Renamed header (with taxonomic ID):
>tid|12345|img|67890 scaffold_1 length=1000 GC=0.45
# Renamed header (no taxonomic ID):
>img|67890 scaffold_1 length=1000 GC=0.45
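Downstream code may need to read the IDs back out of a renamed header. The sketch below shows one way to do that for the two header shapes shown above; RenamedHeader is a hypothetical helper, not part of BBTools, and it reads the IMG ID as a run of digits so that it does not depend on the separator that follows.

```java
// Hypothetical downstream helper, not part of BBTools: extracts the IDs
// from a renamed header such as ">tid|12345|img|67890 scaffold_1".
final class RenamedHeader {
    final int taxId;  // -1 when the header carries no tid prefix
    final long imgId;

    private RenamedHeader(int taxId, long imgId) {
        this.taxId = taxId;
        this.imgId = imgId;
    }

    static RenamedHeader parse(String header) {
        String s = header.startsWith(">") ? header.substring(1) : header;
        int tid = -1;
        if (s.startsWith("tid|")) {
            int bar = s.indexOf('|', 4);
            if (bar < 0) {
                throw new IllegalArgumentException("Unterminated tid field: " + header);
            }
            tid = Integer.parseInt(s.substring(4, bar));
            s = s.substring(bar + 1);
        }
        if (!s.startsWith("img|")) {
            throw new IllegalArgumentException("No img field: " + header);
        }
        int end = 4;
        while (end < s.length() && Character.isDigit(s.charAt(end))) {
            end++;
        }
        long img = Long.parseLong(s.substring(4, end));
        return new RenamedHeader(tid, img);
    }

    public static void main(String[] args) {
        RenamedHeader h = parse(">tid|12345|img|67890 scaffold_1 length=1000 GC=0.45");
        System.out.println(h.taxId + " " + h.imgId);
    }
}
```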
Processing Statistics Output
Time: 0.125 seconds
Files Processed: 15
Contigs Processed: 1,234,567
Bases Processed: 123,456,789
TaxIDs Processed: 892 (45 unknown)
Lines Processed: 1,456,789 lines, 98.7 MB, 789.3 MB/s
Valid Files: 14
Invalid Files: 1
Valid Lines: 1,456,234
Invalid Lines: 555
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org