RenameIMG

Script: renameimg.sh
Package: tax
Class: RenameIMG.java

Renames IMG records so that each sequence header is prefixed by its ID. This is for internal JGI use and has no external utility.

Basic Usage

renameimg.sh in=auto out=renamed.fa.gz

This tool processes IMG (Integrated Microbial Genomes) records and renames sequences by prefixing them with taxonomic IDs and IMG IDs for internal JGI (Joint Genome Institute) workflows.

Parameters

Parameters are parsed by the Parser class. In the constructor, each argument is split on the '=' delimiter into a key and a value.
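
The key=value convention described above can be illustrated with a small Python sketch. This is not the tool's Java code: the function name split_args is hypothetical, and lowercasing keys and splitting only on the first '=' are assumptions of the sketch rather than documented behavior.

```python
def split_args(args):
    """Split 'key=value' tokens into a dict (illustrative sketch only)."""
    parsed = {}
    for arg in args:
        # Assumption: split on the first '=' so values may contain '='.
        key, sep, value = arg.partition("=")
        # Assumption: keys are case-insensitive; bare flags map to None.
        parsed[key.lower()] = value if sep else None
    return parsed


# Example: the arguments from the basic usage line.
opts = split_args(["in=auto", "out=renamed.fa.gz"])
```

A dictionary keeps the sketch short; a real parser would also validate unknown keys.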

Input/Output Parameters

in=
3-column TSV file with imgID, taxID, and file path. These files will have their sequences renamed and concatenated. Use "auto" to load the default IMG file from TaxTree.
out=
Output file for renamed sequences. Default output is in FASTA format.
img=
Optional. Specifies a different (presumably larger) IMG file to use for taxonomic assignment. For example, 'in' could be a subset of 'img', potentially with incorrect taxIDs. Use "auto" to load the default IMG file.
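
As a rough illustration of the 3-column TSV expected by in= and img=, the following Python sketch parses one line into its imgID, taxID, and file path. The names ImgRecord and parse_img_line are hypothetical, and the sketch assumes both IDs are integers and that malformed lines are simply skipped; the actual Java parsing may differ.

```python
from typing import NamedTuple, Optional


class ImgRecord(NamedTuple):
    img_id: int
    tax_id: int
    path: str


def parse_img_line(line: str) -> Optional[ImgRecord]:
    """Parse 'imgID<TAB>taxID<TAB>path'; return None for malformed lines."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3:
        return None  # wrong column count
    try:
        # Assumption: both IDs are plain integers.
        return ImgRecord(int(fields[0]), int(fields[1]), fields[2])
    except ValueError:
        return None  # non-numeric ID, e.g. a header row
```

Returning None for bad lines mirrors the idea of counting valid and invalid lines separately, as in the statistics output shown later in this document.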

Processing Parameters

lines=
Maximum number of lines to process from each input file. Default is unlimited (Long.MAX_VALUE). Use negative values for unlimited processing.
verbose=
Enable verbose output by setting static boolean flags: ByteFile1.verbose, ByteFile2.verbose, FastaReadInputStream.verbose, ConcurrentGenericReadInputStream.verbose, FastqReadInputStream.verbose, and ReadWrite.verbose. Default: false.
overwrite=
Allow overwriting of existing output files. Default: true.
append=
Append to existing output files instead of overwriting. Default: false.
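
The behavior of lines= (unlimited by default, with negative values also meaning unlimited) can be sketched in Python as follows. The helper names are hypothetical, and sys.maxsize merely stands in for Java's Long.MAX_VALUE; this is an illustration of the rule, not the tool's code.

```python
import sys
from itertools import islice


def effective_limit(lines: int) -> int:
    """Map a negative 'lines' value to an effectively unlimited count."""
    # sys.maxsize is a stand-in for Long.MAX_VALUE in this sketch.
    return sys.maxsize if lines < 0 else lines


def take_lines(iterable, lines: int = -1):
    """Yield at most 'lines' items; a negative value means no limit."""
    return islice(iterable, effective_limit(lines))
```

For example, take_lines(handle, 1000) would stop after the first 1000 lines of an open file, which matches the lines=1000 test run in the Examples section.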

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Usage

renameimg.sh in=auto out=renamed.fa.gz

Process the default IMG file and output renamed sequences to a gzipped FASTA file.

Custom Input File

renameimg.sh in=my_img_records.tsv out=renamed_sequences.fa

Process a custom 3-column TSV file containing imgID, taxID, and file paths.

With Separate Taxonomy File

renameimg.sh in=subset_records.tsv img=complete_img.tsv out=renamed.fa verbose=t

Process a subset of IMG records but use a complete IMG file for taxonomic assignment, with verbose output enabled.

Limited Processing

renameimg.sh in=large_dataset.tsv out=sample.fa lines=1000

Process only the first 1000 lines from each input file for testing purposes.

Algorithm Details

Sequence Renaming Strategy

The tool implements line-by-line processing within the process_inner() method:

Memory Management

File Format Support

Statistical Reporting

The tool provides processing statistics tracked through instance variables:

Integration with TaxTree

The tool integrates with BBTools' taxonomic framework through static method calls:

Output Format

Renamed Sequence Headers

Original FASTA headers are transformed according to this pattern:

# Original header:
>scaffold_1 length=1000 GC=0.45

# Renamed header (with taxonomic ID):
>tid|12345|img|67890 scaffold_1 length=1000 GC=0.45

# Renamed header (no taxonomic ID):
>img|67890 scaffold_1 length=1000 GC=0.45
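
The header pattern above can be reproduced with a short Python sketch. The function name rename_header is hypothetical, and treating a non-positive taxID as "no taxonomic ID" is an assumption made for illustration; the Java implementation may use a different sentinel.

```python
def rename_header(header: str, img_id: int, tax_id: int = -1) -> str:
    """Prefix a FASTA header with 'tid|<taxID>|img|<imgID>' or 'img|<imgID>'."""
    # Drop the leading '>' so the prefix can be placed in front of the name.
    body = header[1:] if header.startswith(">") else header
    if tax_id > 0:
        prefix = f"tid|{tax_id}|img|{img_id}"
    else:
        # Assumption: a non-positive taxID means the taxonomy is unknown.
        prefix = f"img|{img_id}"
    return f">{prefix} {body}"
```

Applied to the original header shown above with imgID 67890 and taxID 12345, this yields the first renamed form; omitting the taxID yields the second.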

Processing Statistics Output

Time:                         	0.125 seconds
Files Processed:    15
Contigs Processed:  1,234,567
Bases Processed:    123,456,789
TaxIDs Processed:   892 	(45 unknown)
Lines Processed:    1,456,789 lines, 98.7 MB, 789.3 MB/s

Valid Files:       	14
Invalid Files:     	1
Valid Lines:       	1,456,234
Invalid Lines:     	555

Support

For questions and support: