RenameRef

Script: renameref.sh
Package: jgi
Class: RefRenamer.java

Converts reference sequence names in genomics files. Supported formats are SAM, BAM, FASTA, VCF, and GFF. Reference names are updated in both headers and data records according to a mapping file, which is useful for converting between reference naming conventions (e.g. HG19 <-> GRCh37). Sequence names not in the mapping file are kept as-is. Each lookup is attempted first with the full header and, if that fails, with the prefix of the original name up to the first whitespace.

Basic Usage

renameref.sh in=<input file> out=<output file> mapping=<ref_mapping.tsv>

RenameRef processes genomics files to convert reference sequence names according to a user-provided mapping file. The input format is detected from the file extension using the FileFormat methods hasFastaExtension(), hasVcfExtension(), hasGffExtension(), and hasSamOrBamExtension(), and only the fields that carry reference names in that format are rewritten.

Parameters

Parameters control input/output files, mapping behavior, and processing options. Only the input file and the mapping file are required; all other parameters are optional.

Input/Output Parameters

in=<file>: Input file to process. Supported formats: SAM, BAM, FASTA, VCF, GFF. Format detection uses FileFormat extension methods based on file name patterns.
out=<file>: Output file with converted reference names. Uses stdout if not specified. Output format matches input format.
map=<file> (mapping=<file>): Tab-delimited file with old_name<tab>new_name mappings. Each line contains the original reference name, a tab character, and the replacement name. Comment lines starting with # are ignored.

Processing Parameters

invert=<bool> (swap=<bool>): Reverse the order of names in the map file. If true, treats the second column as the source and first column as the target. Default: false
strict=<bool>: Crash on unknown references. When true, the program will terminate if it encounters a reference name not present in the mapping file. When false, unknown references are left unchanged. Default: false
verbose=<bool>: Print detailed progress information including individual mappings loaded, unknown references encountered, and processing statistics. Default: false
lines=<long>: Maximum number of lines to process from the input file. Use -1 or omit for unlimited processing. Useful for testing on large files. Default: unlimited

File Format Support

SAM/BAM Files

Processes @SQ header lines (SN: field) and alignment records (RNAME and RNEXT fields). All other fields are preserved unchanged.
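
As a rough illustration of the fields involved, the sketch below rewrites the SN: tag of @SQ lines and columns 3 (RNAME) and 7 (RNEXT) of alignment lines. The class and method names are hypothetical and plain java.util types stand in for the BBTools classes; it is a sketch of the idea, not the RefRenamer code.

```java
import java.util.Map;

// Hypothetical helper for illustration; not the RefRenamer implementation.
final class SamRenameSketch {

    /** Returns the line with reference names replaced where a mapping exists. */
    static String renameLine(String line, Map<String, String> refMap) {
        if (line.startsWith("@")) {
            if (!line.startsWith("@SQ\t")) {
                return line; // other header lines carry no reference names
            }
            String[] tags = line.split("\t", -1);
            for (int i = 1; i < tags.length; i++) {
                if (tags[i].startsWith("SN:")) {
                    String old = tags[i].substring(3);
                    tags[i] = "SN:" + refMap.getOrDefault(old, old);
                }
            }
            return String.join("\t", tags);
        }
        String[] fields = line.split("\t", -1);
        if (fields.length > 2) {
            fields[2] = refMap.getOrDefault(fields[2], fields[2]); // RNAME
        }
        if (fields.length > 6) {
            // RNEXT; "=" and "*" pass through as long as the map has no such keys
            fields[6] = refMap.getOrDefault(fields[6], fields[6]);
        }
        return String.join("\t", fields);
    }
}
```

Splitting with a limit of -1 keeps trailing empty fields, so joining the array back with tabs reproduces every untouched field exactly.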

FASTA Files

Converts sequence headers while preserving descriptions after whitespace. Attempts full header replacement first, then replacement of the prefix up to the first whitespace character.
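
The two-stage lookup can be sketched as follows. The names are hypothetical, and the sketch assumes that a full-header match replaces the whole header while a prefix match replaces only the prefix and keeps the description; the actual behavior of processFastaLine() may differ in detail.

```java
import java.util.Map;

// Hypothetical helper for illustration; not the RefRenamer implementation.
final class FastaRenameSketch {

    /** Renames a FASTA header line; sequence lines are returned unchanged. */
    static String renameHeader(String line, Map<String, String> refMap) {
        if (!line.startsWith(">")) {
            return line;
        }
        String header = line.substring(1);
        String full = refMap.get(header); // stage 1: full header
        if (full != null) {
            return ">" + full;
        }
        int ws = indexOfWhitespace(header);
        if (ws < 0) {
            return line; // no whitespace: the prefix is the full header, already tried
        }
        String mapped = refMap.get(header.substring(0, ws)); // stage 2: prefix
        // Assumption: the description after the prefix is kept verbatim.
        return mapped == null ? line : ">" + mapped + header.substring(ws);
    }

    /** Index of the first whitespace character, or -1 if there is none. */
    static int indexOfWhitespace(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}
```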

VCF Files

Updates ##contig header lines and CHROM field in variant records. All other VCF fields remain unchanged.
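
A minimal sketch of both cases follows. It uses hypothetical names and assumes that the first "ID=" in a ##contig line is the contig ID key; a stricter parser would tokenize the key=value pairs instead.

```java
import java.util.Map;

// Hypothetical helper for illustration; not the RefRenamer implementation.
final class VcfRenameSketch {

    /** Renames the contig ID in ##contig lines and the CHROM column in records. */
    static String renameLine(String line, Map<String, String> refMap) {
        if (line.startsWith("##contig=<")) {
            int start = line.indexOf("ID="); // assumption: first "ID=" is the contig ID
            if (start < 0) {
                return line;
            }
            start += 3;
            int end = start;
            while (end < line.length() && line.charAt(end) != ',' && line.charAt(end) != '>') {
                end++;
            }
            String old = line.substring(start, end);
            return line.substring(0, start) + refMap.getOrDefault(old, old) + line.substring(end);
        }
        if (line.startsWith("#")) {
            return line; // other header lines, including the #CHROM column header
        }
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return line;
        }
        String old = line.substring(0, tab); // CHROM is the first column
        return refMap.getOrDefault(old, old) + line.substring(tab);
    }
}
```

Because only the first column is cut out and the remainder of the line is appended untouched, all other VCF fields are preserved byte for byte. The same first-column approach applies to the seqname column of GFF feature records.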

GFF Files

Converts the seqname field (column 1) in feature records. Comment lines and other fields are preserved.

Mapping File Format

The mapping file must be tab-delimited with two columns:

# Example mapping file (comments start with #)
chr1	1
chr2	2
chr3	3
chrX	X
chrY	Y
chrM	MT
NC_000001.11	chr1
NC_000002.12	chr2

Blank lines and lines starting with # are ignored. Lookups are case-sensitive: an entry matches only if it equals the full reference name or, failing that, the prefix of the name up to the first whitespace.
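
To make the file format and the invert option concrete, here is a small loader sketch. MappingLoaderSketch and its methods are hypothetical names, and skipping lines that lack a tab is an assumption about error handling, not documented behavior.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the RefRenamer implementation.
final class MappingLoaderSketch {

    /** Reads old_name<tab>new_name lines; invert=true swaps source and target. */
    static Map<String, String> load(BufferedReader reader, boolean invert) throws IOException {
        Map<String, String> refMap = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue; // blank lines and comment lines are ignored
            }
            String[] cols = line.split("\t", -1);
            if (cols.length < 2) {
                continue; // assumption: lines without a tab are skipped
            }
            String from = invert ? cols[1] : cols[0];
            String to = invert ? cols[0] : cols[1];
            refMap.put(from, to);
        }
        return refMap;
    }

    /** Convenience wrapper for in-memory text, e.g. for quick experiments. */
    static Map<String, String> loadFromString(String text, boolean invert) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return load(reader, invert);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With invert=true, a line such as chrM<tab>MT yields a lookup from MT to chrM, so one mapping file can serve both conversion directions.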

Examples

Basic Reference Name Conversion

renameref.sh in=aligned.sam out=converted.sam mapping=hg19_to_grch37.tsv

Converts reference names in a SAM file from the HG19 naming convention to the GRCh37 naming convention.

Strict Mode Processing

renameref.sh in=data.sam out=renamed.sam mapping=refs.tsv strict=true

Processes a SAM file in strict mode: the program terminates if it encounters any reference name that is not present in the mapping file.

FASTA Header Conversion

renameref.sh in=assembly.fasta out=renamed.fasta mapping=contig_names.tsv verbose=true

Converts contig names in a FASTA assembly file with detailed progress reporting.

VCF Chromosome Name Standardization

renameref.sh in=variants.vcf out=standardized.vcf mapping=chr_mapping.tsv

Standardizes chromosome names in a VCF file to match a reference genome convention.

Inverted Mapping

renameref.sh in=data.gff out=converted.gff mapping=reverse_map.tsv invert=true

Applies the mapping in reverse: the second column is used as the source and the first column as the target.

Algorithm Details

RenameRef implements format-specific processing through dedicated handler methods in the RefRenamer.java class:

File Format Detection and Processing

Format detection uses FileFormat.hasFastaExtension(), FileFormat.hasVcfExtension(), FileFormat.hasGffExtension(), and FileFormat.hasSamOrBamExtension() methods. Each format is processed by specialized methods:

SAM/BAM: processSamHeaderLine() handles @SQ headers, processAlignmentLine() handles RNAME/RNEXT fields
FASTA: processFastaLine() processes headers with Tools.indexOfWhitespace() for prefix extraction
VCF: processVcfLine() handles ##contig headers and CHROM field conversion
GFF: processGffLine() converts the seqname field (column 1 of the file, index 0 when parsed) in feature records
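
Extension-based dispatch of this kind can be sketched as below. The enum, the method, and the list of recognized extensions are assumptions made for illustration; the FileFormat class may recognize a different set.

```java
import java.util.Locale;

// Hypothetical helper for illustration; the extension lists are assumptions.
final class FormatDetectSketch {

    enum Format { SAM_BAM, FASTA, VCF, GFF, UNKNOWN }

    /** Guesses the format from the file name, ignoring a compression suffix. */
    static Format detect(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        for (String zip : new String[] {".gz", ".bz2"}) {
            if (name.endsWith(zip)) {
                name = name.substring(0, name.length() - zip.length());
            }
        }
        if (endsWithAny(name, ".sam", ".bam")) return Format.SAM_BAM;
        if (endsWithAny(name, ".fa", ".fasta", ".fna")) return Format.FASTA;
        if (endsWithAny(name, ".vcf")) return Format.VCF;
        if (endsWithAny(name, ".gff", ".gff3")) return Format.GFF;
        return Format.UNKNOWN;
    }

    private static boolean endsWithAny(String s, String... suffixes) {
        for (String suffix : suffixes) {
            if (s.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }
}
```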

Reference Name Mapping Implementation

The mapping process uses HashMap<String,String> refMap for expected O(1) lookups, with two-stage matching followed by a fallback:

Full Header Matching: refMap.get(oldRef) attempts complete reference name lookup
Prefix Matching: For FASTA, extracts prefix using new String(line, 1, limit) up to Tools.indexOfWhitespace()
Fallback: refMap.getOrDefault(rname, rname) preserves unmapped references

Data Structure Implementation

Processing relies on the following core data structures:

HashMap<String,String> refMap: Primary mapping storage with O(1) reference lookup
HashSet<String> unknownRefs: Tracks unique unmapped references, preventing duplicate warnings
LineParser1 lp: Tab-delimited field parsing with lp.set(line) and lp.parseString(i) methods
ByteBuilder bb: Output construction with bb.append(), bb.tab(), bb.nl() for line assembly

I/O Processing Architecture

Input and output are streamed through BBTools fileIO classes, so the input file is processed line by line rather than loaded into memory as a whole:

ByteFile.makeByteFile(ffin1): Line-by-line input reading with automatic decompression
ByteStreamWriter: Threaded output writing with bsw.print() method calls
ByteFile.FORCE_MODE_BF2: Multi-threaded I/O optimization when Shared.threads() > 2
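
The streaming pattern, together with the lines= limit, can be illustrated with standard java.io classes. This sketch deliberately swaps in BufferedReader and Writer for ByteFile and ByteStreamWriter, is single-threaded, and uses hypothetical names throughout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.function.UnaryOperator;

// Hypothetical helper for illustration; java.io stands in for BBTools fileIO.
final class StreamingSketch {

    /** Applies perLine to each input line; maxLines < 0 means no limit. */
    static long process(BufferedReader in, Writer out,
                        UnaryOperator<String> perLine, long maxLines) throws IOException {
        long count = 0;
        String line;
        while ((maxLines < 0 || count < maxLines) && (line = in.readLine()) != null) {
            out.write(perLine.apply(line)); // only one line is held in memory at a time
            out.write('\n');
            count++;
        }
        out.flush();
        return count; // number of lines processed
    }

    /** Convenience wrapper that runs the loop over in-memory text. */
    static String processText(String input, UnaryOperator<String> perLine, long maxLines) {
        StringWriter out = new StringWriter();
        try (BufferedReader in = new BufferedReader(new StringReader(input))) {
            process(in, out, perLine, maxLines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

The per-line function is where a format-specific renaming step would plug in; memory use then depends on the mapping size and the longest line, not on the size of the input file.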

Error Handling and Statistics

The handleUnknownRef() method implements validation with HashSet-based duplicate prevention. Statistics counters track:

linesProcessed, bytesProcessed: Raw throughput metrics
headersProcessed, headersConverted: Header modification tracking
recordsConverted, unknownsProcessed: Data record conversion statistics
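
The interaction between strict mode, the set of unknown references, and the counters might look roughly like this. UnknownRefSketch is a hypothetical class; the exception type and the warning text are assumptions, and only the HashSet-based deduplication is taken from the description above.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not the RefRenamer implementation.
final class UnknownRefSketch {

    private final Map<String, String> refMap;
    private final boolean strict;
    private final Set<String> unknownRefs = new HashSet<>();
    long unknownsProcessed = 0;

    UnknownRefSketch(Map<String, String> refMap, boolean strict) {
        this.refMap = refMap;
        this.strict = strict;
    }

    /** Returns the mapped name, or the original name if no mapping exists. */
    String rename(String ref) {
        String mapped = refMap.get(ref);
        if (mapped != null) {
            return mapped;
        }
        unknownsProcessed++;
        if (strict) {
            // Assumption: strict mode aborts via an exception.
            throw new IllegalStateException("Unknown reference: " + ref);
        }
        if (unknownRefs.add(ref)) { // add() returns true only on first sight
            System.err.println("Warning: unknown reference " + ref);
        }
        return ref;
    }

    Set<String> unknownRefs() {
        return unknownRefs;
    }
}
```

Since Set.add() reports whether the element was new, each unknown name is warned about once even if it occurs in many records, while the counter still reflects every occurrence.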

Output Statistics

Upon completion, RenameRef reports processing statistics:

Lines Processed: Total number of lines read from input file
Headers Processed: Number of header lines encountered (SAM @SQ, FASTA headers, VCF contigs)
Headers Converted: Number of headers successfully renamed using the mapping file
Records Converted: Number of data records (alignment lines, variant lines) with updated reference names
Unknown References: List of unique reference names not found in the mapping file

Technical Implementation Notes

Compression and File Format Support

Tools.fixExtension() handles file extension processing for compressed formats. FileFormat.testInput() and FileFormat.testOutput() configure decompression/compression through ReadWrite.USE_PIGZ and ReadWrite.USE_UNPIGZ settings with ReadWrite.setZipThreads() thread management.

Multi-Threading Architecture

Threading is configured by the checkStatics() method, which enables ByteFile.FORCE_MODE_BF2 when Shared.threads() > 2. Output is written by a ByteStreamWriter started with ByteStreamWriter.start(), which allows reading and writing to proceed concurrently on multi-core systems.

Memory Usage and Optimization

Memory consumption scales with mapping file size through HashMap<String,String> refMap storage (approximately 50-100 bytes per mapping entry) plus HashSet<String> unknownRefs for unique reference tracking. ByteBuilder reuse via bb.clear() prevents excessive object allocation during line processing.

Validation and Error States

Parameters are validated in validateParams(), including strict mode enforcement. Errors are tracked in the errorState boolean, which is updated from the return values of bf.close() and bsw.poisonAndWait(). Input and output files are checked with Tools.testInputFiles() and Tools.testOutputFiles() before processing begins.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org