RenameRef
Converts reference sequence names in genomics files, supporting SAM, BAM, FASTA, VCF, and GFF. Updates reference names in headers and data records according to a mapping file. Useful for converting between reference naming conventions (e.g. HG19 <-> GRCh37). Sequence names not in the mapping file are kept as-is. Name mapping is attempted first with the full header, and then with the prefix of the original name up to the first whitespace.
Basic Usage
renameref.sh in=<input file> out=<output file> mapping=<ref_mapping.tsv>
RenameRef processes genomics files to convert reference sequence names according to a user-provided mapping file. Format detection uses FileFormat extension methods (hasFastaExtension(), hasVcfExtension(), hasGffExtension(), hasSamOrBamExtension()) and processes the appropriate fields for each format.
Parameters
Parameters control input/output files, mapping behavior, and processing options. All parameters are optional except for input and mapping files.
Input/Output Parameters
- in=<file>
- Input file to process. Supported formats: SAM, BAM, FASTA, VCF, GFF. Format detection uses FileFormat extension methods based on file name patterns.
- out=<file>
- Output file with converted reference names. Uses stdout if not specified. Output format matches input format.
- map=<file> (mapping=<file>)
- Tab-delimited file with old_name<tab>new_name mappings. Each line contains the original reference name, a tab character, and the replacement name. Comment lines starting with # are ignored.
Processing Parameters
- invert=<bool> (swap=<bool>)
- Reverse the order of names in the map file. If true, treats the second column as the source and first column as the target. Default: false
- strict=<bool>
- Crash on unknown references. When true, the program will terminate if it encounters a reference name not present in the mapping file. When false, unknown references are left unchanged. Default: false
- verbose=<bool>
- Print detailed progress information including individual mappings loaded, unknown references encountered, and processing statistics. Default: false
- lines=<long>
- Maximum number of lines to process from the input file. Use -1 or omit for unlimited processing. Useful for testing on large files. Default: unlimited
File Format Support
SAM/BAM Files
Processes @SQ header lines (SN: field) and alignment records (RNAME and RNEXT fields). All other fields are preserved unchanged.
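The SAM handling described above can be sketched in a few lines of Python. This is an illustration only, not the RefRenamer code: the function name rename_sam_line is hypothetical, and leaving "*" and "=" untouched in RNAME/RNEXT is an assumption of this sketch, based on their special meaning in the SAM format.

```python
def rename_sam_line(line, ref_map):
    """Rename reference names in one SAM text line (hypothetical helper)."""
    fields = line.split("\t")
    if line.startswith("@"):
        # Only @SQ header lines carry a reference name, in the SN: tag.
        if fields[0] == "@SQ":
            for i, tag in enumerate(fields):
                if tag.startswith("SN:"):
                    name = tag[3:]
                    fields[i] = "SN:" + ref_map.get(name, name)
        return "\t".join(fields)
    # Alignment record: RNAME is column 3 and RNEXT is column 7 (1-based).
    for col in (2, 6):
        # Assumption: "*" (unavailable) and "=" (same as RNAME) are not names.
        if col < len(fields) and fields[col] not in ("*", "="):
            fields[col] = ref_map.get(fields[col], fields[col])
    return "\t".join(fields)
```

Names missing from ref_map fall through unchanged, which matches the non-strict behavior described in the parameter list.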
FASTA Files
Converts sequence headers while preserving descriptions after whitespace. Attempts full header replacement first, then prefix replacement up to the first whitespace character.
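The two-stage FASTA lookup can be illustrated with a short Python sketch. It is written for this guide under stated assumptions and is not the tool's Java code: rename_fasta_header is a hypothetical name, and the description is assumed to be kept verbatim, including the whitespace that separates it from the name.

```python
def rename_fasta_header(line, ref_map):
    """Rename a FASTA header line; other lines pass through unchanged."""
    if not line.startswith(">"):
        return line
    header = line[1:]
    # Stage 1: try the full header text as the lookup key.
    if header in ref_map:
        return ">" + ref_map[header]
    # Stage 2: try the prefix up to the first whitespace character.
    cut = next((i for i, ch in enumerate(header) if ch.isspace()), len(header))
    prefix, rest = header[:cut], header[cut:]
    if prefix in ref_map:
        return ">" + ref_map[prefix] + rest
    # No match at either stage: keep the original header.
    return line
```

With a map containing chr1 -> 1, the header ">chr1 Homo sapiens" becomes ">1 Homo sapiens", while a header that matches a full-line key is replaced as a whole.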
VCF Files
Updates ##contig header lines and CHROM field in variant records. All other VCF fields remain unchanged.
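As a rough Python illustration of the VCF behavior described above (not the actual processVcfLine() code), the sketch below rewrites the ID value of a ##contig line and the first column of a variant record. The function name and the regular expression are assumptions made for this example.

```python
import re

# Matches the ID key of a structured header line such as ##contig=<ID=chr1,...>
_CONTIG_ID = re.compile(r"([<,]ID=)([^,>]+)")

def rename_vcf_line(line, ref_map):
    """Rename the contig ID or CHROM value in one VCF line (hypothetical helper)."""
    if line.startswith("##contig="):
        return _CONTIG_ID.sub(
            lambda m: m.group(1) + ref_map.get(m.group(2), m.group(2)),
            line,
            count=1,
        )
    if line.startswith("#"):
        # Other header lines, including the #CHROM column header, are untouched.
        return line
    # Variant record: CHROM is the first tab-delimited column.
    chrom, sep, rest = line.partition("\t")
    return ref_map.get(chrom, chrom) + sep + rest
```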
GFF Files
Converts the seqname field (column 1) in feature records. Comment lines and other fields are preserved.
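A minimal Python sketch of the GFF case follows, assuming that every line starting with # (including directives such as ##sequence-region) counts as a comment and is passed through. The function name is hypothetical and the code is illustrative rather than the tool's implementation.

```python
def rename_gff_line(line, ref_map):
    """Rename the seqname (first column) of one GFF feature line."""
    # Assumption: blank lines and any line starting with '#' are left as-is.
    if not line or line.startswith("#"):
        return line
    seqname, sep, rest = line.partition("\t")
    return ref_map.get(seqname, seqname) + sep + rest
```

Under that assumption, a ##sequence-region directive would keep the old name even when the feature records are renamed, which may be worth checking in the output.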
Mapping File Format
The mapping file must be tab-delimited with two columns:
# Example mapping file (comments start with #)
chr1	1
chr2	2
chr3	3
chrX	X
chrY	Y
chrM	MT
NC_000001.11	chr1
NC_000002.12	chr2
Each line contains the original reference name, a tab character, and the replacement name. Blank lines and lines starting with # are ignored. The mapping is case-sensitive and must match reference names exactly.
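The parsing rules above, together with the invert option, can be sketched in Python as follows. This is an illustrative reading of the documented format, not the tool's loader: load_mapping is a hypothetical name, and raising an error on a line with fewer than two columns is an assumption of this sketch.

```python
import io

def load_mapping(lines, invert=False):
    """Build an old-name -> new-name dict from tab-delimited lines."""
    ref_map = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank lines and comment lines are skipped.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            # Assumption: a malformed line is reported rather than ignored.
            raise ValueError(f"expected two tab-separated columns: {line!r}")
        old, new = fields[0], fields[1]
        if invert:
            # invert=true treats the second column as the source name.
            old, new = new, old
        ref_map[old] = new
    return ref_map

# Example input held in memory; a real run would pass an open file instead.
example = io.StringIO("# comment\nchr1\t1\n\nchrM\tMT\n")
ref_map = load_mapping(example)
```

Because keys are compared as exact strings, "Chr1" and "chr1" are different entries, consistent with the case-sensitive matching noted above.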
Examples
Basic Reference Name Conversion
renameref.sh in=aligned.sam out=converted.sam mapping=hg19_to_grch37.tsv
Converts reference names in a SAM file from HG19 to GRCh37 naming convention.
Strict Mode Processing
renameref.sh in=data.sam out=renamed.sam mapping=refs.tsv strict=true
Process a SAM file with strict validation: the program terminates if it encounters any reference name that is not in the mapping file.
FASTA Header Conversion
renameref.sh in=assembly.fasta out=renamed.fasta mapping=contig_names.tsv verbose=true
Convert contig names in a FASTA assembly file with detailed progress reporting.
VCF Chromosome Name Standardization
renameref.sh in=variants.vcf out=standardized.vcf mapping=chr_mapping.tsv
Standardize chromosome names in a VCF file to match a reference genome convention.
Inverted Mapping
renameref.sh in=data.gff out=converted.gff mapping=reverse_map.tsv invert=true
Apply the mapping in reverse: the second column is used as the source name and the first column as the target.
Algorithm Details
RenameRef implements format-specific processing through dedicated handler methods in the RefRenamer.java class.
File Format Detection and Processing
Format detection uses FileFormat.hasFastaExtension(), FileFormat.hasVcfExtension(), FileFormat.hasGffExtension(), and FileFormat.hasSamOrBamExtension() methods. Each format is processed by specialized methods:
- SAM/BAM: processSamHeaderLine() handles @SQ headers, processAlignmentLine() handles RNAME/RNEXT fields
- FASTA: processFastaLine() processes headers with Tools.indexOfWhitespace() for prefix extraction
- VCF: processVcfLine() handles ##contig headers and CHROM field conversion
- GFF: processGffLine() converts the seqname field (the first column, index 0 when parsed) in feature records
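Extension-based dispatch of this kind can be sketched in Python. The extension lists below are assumptions chosen for the example and may not match what the FileFormat methods accept; detect_format is a hypothetical name.

```python
import os

# Assumed extension sets for illustration only.
COMPRESSION_EXTS = {".gz", ".bz2", ".xz", ".zst"}
FORMAT_EXTS = {
    ".fa": "fasta", ".fasta": "fasta", ".fna": "fasta",
    ".vcf": "vcf",
    ".gff": "gff", ".gff3": "gff",
    ".sam": "sam", ".bam": "sam",
}

def detect_format(path):
    """Return a format label based on the file name, or None if unknown."""
    base = os.path.basename(path).lower()
    root, ext = os.path.splitext(base)
    if ext in COMPRESSION_EXTS:
        # Look past a compression suffix, e.g. variants.vcf.gz -> .vcf
        root, ext = os.path.splitext(root)
    return FORMAT_EXTS.get(ext)
```

A practical consequence of name-based detection is that a file with a missing or unusual extension may need to be renamed before processing.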
Reference Name Mapping Implementation
The mapping process uses HashMap<String,String> refMap for O(1) lookup performance with two-stage matching:
- Full Header Matching: refMap.get(oldRef) attempts complete reference name lookup
- Prefix Matching: For FASTA, extracts prefix using new String(line, 1, limit) up to Tools.indexOfWhitespace()
- Fallback: refMap.getOrDefault(rname, rname) preserves unmapped references
Data Structure Implementation
The core data structures are:
- HashMap<String,String> refMap: Primary mapping storage with O(1) reference lookup
- HashSet<String> unknownRefs: Tracks unique unmapped references, preventing duplicate warnings
- LineParser1 lp: Tab-delimited field parsing with lp.set(line) and lp.parseString(i) methods
- ByteBuilder bb: Output construction with bb.append(), bb.tab(), bb.nl() for line assembly
I/O Processing Architecture
Input and output are streamed line by line through BBTools fileIO classes, which keeps memory use low on large files:
- ByteFile.makeByteFile(ffin1): Line-by-line input reading with automatic decompression
- ByteStreamWriter: Threaded output writing with bsw.print() method calls
- ByteFile.FORCE_MODE_BF2: Multi-threaded I/O optimization when Shared.threads() > 2
Error Handling and Statistics
The handleUnknownRef() method implements validation with HashSet-based duplicate prevention. Statistics counters track:
- linesProcessed, bytesProcessed: Raw throughput metrics
- headersProcessed, headersConverted: Header modification tracking
- recordsConverted, unknownsProcessed: Data record conversion statistics
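The set-based duplicate prevention and strict-mode check can be illustrated with a small Python class. This is a sketch under stated assumptions rather than the handleUnknownRef() code: the class name is hypothetical, the counter is assumed to count every unknown occurrence (not only unique names), and strict mode is modeled as an exception.

```python
class UnknownRefTracker:
    """Track unmapped reference names, warning once per unique name."""

    def __init__(self, strict=False):
        self.strict = strict
        self.unknown_refs = set()      # unique names not found in the map
        self.unknowns_processed = 0    # assumption: counts every occurrence
        self.warnings = []             # one message per unique unknown name

    def lookup(self, name, ref_map):
        if name in ref_map:
            return ref_map[name]
        self.unknowns_processed += 1
        if self.strict:
            # Strict mode stops at the first unknown reference.
            raise KeyError(f"unknown reference: {name}")
        if name not in self.unknown_refs:
            self.unknown_refs.add(name)
            self.warnings.append(f"unknown reference kept as-is: {name}")
        return name
```

Keeping the set separate from the counter allows a final report to list each unknown name once while still showing how often unknown names were seen.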
Output Statistics
Upon completion, RenameRef reports processing statistics:
- Lines Processed: Total number of lines read from input file
- Headers Processed: Number of header lines encountered (SAM @SQ, FASTA headers, VCF contigs)
- Headers Converted: Number of headers successfully renamed using the mapping file
- Records Converted: Number of data records (alignment lines, variant lines) with updated reference names
- Unknown References: List of unique reference names not found in the mapping file
Technical Implementation Notes
Compression and File Format Support
Tools.fixExtension() handles file extension processing for compressed formats. FileFormat.testInput() and FileFormat.testOutput() configure decompression/compression through ReadWrite.USE_PIGZ and ReadWrite.USE_UNPIGZ settings with ReadWrite.setZipThreads() thread management.
Multi-Threading Architecture
The checkStatics() method enables ByteFile.FORCE_MODE_BF2 when Shared.threads() > 2. Together with ByteStreamWriter.start(), this allows reading and writing to run concurrently on multi-core systems.
Memory Usage and Optimization
Memory consumption scales with mapping file size through HashMap<String,String> refMap storage (approximately 50-100 bytes per mapping entry) plus HashSet<String> unknownRefs for unique reference tracking. ByteBuilder reuse via bb.clear() prevents excessive object allocation during line processing.
Validation and Error States
Parameter validation occurs in validateParams(), including strict mode enforcement. Errors are tracked in the errorState boolean, which is updated from the return values of bf.close() and bsw.poisonAndWait(). Tools.testInputFiles() and Tools.testOutputFiles() verify that input files exist and output files can be written before processing begins.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org