RenameByMapping
Renames contigs based on mapping information. Appends coverage and, optionally, a taxID parsed from the headers of reads in the SAM files. For taxID renaming, read headers should contain a term like 'tid_1234'. Output contigs are named like 'original tid_1234 cov_45.67'. With multiple SAM files there can be multiple coverage entries, but only one tid entry, taken from the highest-coverage SAM file. Designed for metagenome binning evaluation and synthetic read generation.
Basic Usage
renamebymapping.sh in=contigs.fa out=renamed.fa *.sam
This tool processes a reference assembly file and one or more SAM/BAM mapping files to rename contigs with coverage and taxonomic information extracted from read headers.
Parameters
Parameters control input/output files, the information to append to contig names, and how the renaming is performed.
Input/Output Parameters
- in=<file>
- Assembly to rename. Input FASTA file containing contigs/scaffolds that will be renamed based on mapping information.
- out=<file>
- Renamed assembly. Output FASTA file with contigs renamed to include coverage and optionally taxonomic information.
- sam=<file>
- Mapping input. This can be a file, directory, or comma-delimited list. Unrecognized arguments that are existing files are also treated as SAM files. BAM is accepted too. When multiple SAM/BAM files are given, coverage is calculated separately for each one.
Renaming Control Parameters
- delimiter=space
- Delimiter between appended fields. Character used to separate the original contig name from the appended coverage and taxonomic information. Default is space.
- wipe=f
- Replace the original header with contig_#. When true, completely replaces original contig names with sequential numbering (contig_1, contig_2, etc.) before appending coverage/taxonomic information. Default: false.
- depth=t
- Add a depth field. When true, appends coverage information in format 'cov_X.XX' where X.XX is the calculated coverage depth. Default: true.
- tid=t
- Add a tid field (if not already present). When true, appends taxonomic ID information in format 'tid_XXXX' where XXXX is the taxonomic ID extracted from read headers. Only added if the contig doesn't already have a taxonomic ID. Default: true.
Advanced Parameters
- cami=<file>
- CAMI format input file for taxonomic assignments. When set, assignments are read from this tab-delimited file (processCami() with LineParser1) instead of from SAM files. Automatically disables the depth field and enables the tid field (addDepth=false, addTid=true).
- ordered=f
- Process contigs in input order. Controls ConcurrentReadOutputStream buffer size calculation: ordered uses Tools.mid(16, 128, (Shared.threads()*2)/3), unordered uses 8. Default: false.
- verbose=f
- Print verbose status messages to outstream during processing. Outputs "Started cris", "Started Streamer", and "Finished; closing streams" messages. Default: false.
- clearfilters=f
- Clear all SAM filtering criteria using samFilter.clear(). When true, bypasses default SamFilter settings that exclude unmapped, supplementary, duplicate, non-primary, and quality-failed reads. Default: false.
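The read classes listed under clearfilters correspond to standard bits in the SAM FLAG field. The following Python sketch shows how such a filter could be expressed. It is an illustration based on the flag values in the SAM specification, not the SamFilter implementation, and passes_default_filter is a hypothetical name.

```python
# SAM FLAG bits, as defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100       # non-primary alignment
QC_FAIL = 0x200         # quality-failed read
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800

EXCLUDED = UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE | SUPPLEMENTARY


def passes_default_filter(flag: int, clear_filters: bool = False) -> bool:
    """Return True if an alignment with this FLAG value would be kept.

    With clear_filters=True every record passes, mirroring clearfilters=t.
    """
    if clear_filters:
        return True
    return (flag & EXCLUDED) == 0
```

For example, a forward or reverse-strand primary alignment (flag 0 or 16) passes, while an unmapped read (flag 4) is dropped unless the filters are cleared.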
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions. Can improve performance in production environments by skipping internal consistency checks.
Examples
Basic Contig Renaming
renamebymapping.sh in=assembly.fa out=renamed.fa mapping.sam
Renames contigs in assembly.fa based on coverage calculated from mapping.sam, appending coverage information to each contig name.
Multiple SAM Files
renamebymapping.sh in=contigs.fa out=renamed.fa sample1.sam sample2.sam sample3.sam
Processes multiple SAM files to calculate coverage from different samples. Each coverage value will be appended separately (e.g., 'contig1 cov_12.34 cov_8.76 cov_15.23').
With Taxonomic Information
renamebymapping.sh in=metagenome.fa out=renamed.fa reads_mapped.sam tid=t
Extracts taxonomic IDs from read headers (format 'tid_1234') and appends both coverage and taxonomic information. The taxonomic ID is selected from the highest-coverage mapping. Because tid=t is the default, the flag is shown here only for clarity.
Complete Header Replacement
renamebymapping.sh in=assembly.fa out=clean.fa mapping.sam wipe=t delimiter=_
Replaces original contig names with sequential numbering and uses underscore as delimiter (e.g., 'contig_1_cov_45.67').
Coverage Only (No Taxonomic Info)
renamebymapping.sh in=contigs.fa out=cov_only.fa mapping.sam tid=f
Appends only coverage information without extracting or adding taxonomic IDs.
Algorithm Details
Coverage Calculation Strategy
The tool implements a three-stage coverage calculation process:
- Mapping Stage: Parses SAM/BAM files using SamLineStreamer, applying the default SamFilter settings to exclude unmapped, supplementary, duplicate, non-primary, and quality-failed alignments (unless clearfilters=t)
- Accumulation Stage: Uses a thread-safe IntLongHashMap with synchronized increment operations (map.increment(taxid, length)) to accumulate mapped bases per contig, keyed by taxonomic ID
- Coverage Calculation: Divides totalMappedBases by contig length using float arithmetic (totalMappedBases/(float)length); the result is written with two decimal places
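The accumulation and division steps can be sketched in Python as follows. This is a simplified illustration, not the tool's code: it keys totals by contig name only (ignoring the per-taxID breakdown), and accumulate_mapped_bases, coverage, and format_cov are hypothetical names.

```python
from collections import defaultdict


def accumulate_mapped_bases(alignments):
    """Sum aligned read lengths per contig.

    `alignments` is an iterable of (contig_name, read_length) pairs that
    have already passed filtering.
    """
    totals = defaultdict(int)
    for contig, length in alignments:
        totals[contig] += length
    return dict(totals)


def coverage(total_mapped_bases: int, contig_length: int) -> float:
    """Average depth: mapped bases divided by contig length."""
    return total_mapped_bases / float(contig_length)


def format_cov(depth: float) -> str:
    """Render a depth as a 'cov_X.XX' field with two decimal places."""
    return f"cov_{depth:.2f}"
```

Two 150 bp reads on a 1000 bp contig give 300 mapped bases, so the appended field would read cov_0.30.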
Taxonomic ID Extraction
Taxonomic IDs are parsed from headers with TaxTree.parseHeaderStatic2():
- Header Parsing: Searches read headers for taxonomic information in the format 'tid_XXXX'
- Conflict Resolution: When multiple taxonomic IDs map to the same contig, selects the ID associated with the highest coverage mapping using maxMappedBases comparison
- Existing ID Preservation: Checks if contigs already contain taxonomic information using TaxTree.parseHeaderStatic2() and avoids duplication when oldTaxid >= 0
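A minimal Python sketch of these three steps is shown below. It assumes the simple 'tid_1234' header form only and uses a regular expression in place of TaxTree.parseHeaderStatic2(), whose full behavior is not described here. The names parse_tid, pick_tid, and tid_field are hypothetical.

```python
import re

TID_PATTERN = re.compile(r"tid_(\d+)")


def parse_tid(header: str) -> int:
    """Return the taxonomic ID from a 'tid_1234' term, or -1 if absent."""
    match = TID_PATTERN.search(header)
    return int(match.group(1)) if match else -1


def pick_tid(bases_by_tid: dict) -> int:
    """Choose the tid with the most mapped bases, or -1 if there is none."""
    if not bases_by_tid:
        return -1
    return max(bases_by_tid, key=bases_by_tid.get)


def tid_field(contig_header: str, bases_by_tid: dict):
    """Return 'tid_N' to append, or None if the header already has a tid."""
    if parse_tid(contig_header) >= 0:
        return None  # existing ID is preserved, nothing is appended
    tid = pick_tid(bases_by_tid)
    return f"tid_{tid}" if tid >= 0 else None
```

With 500 bases mapped from tid 10 and 900 from tid 20, tid_field("scaffold_1", {10: 500, 20: 900}) yields "tid_20", while a header that already contains a tid term yields None.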
Multi-File Processing
When processing multiple SAM/BAM files:
- Each file is processed independently to calculate per-file coverage
- Coverage values from all files are appended as separate fields
- Taxonomic ID is selected from the file contributing the highest coverage
- Each file is read by multiple threads that update thread-safe accumulators, so large mapping files can be processed in parallel
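The way per-file results combine for a single contig can be sketched in Python. This is an illustration under stated assumptions: files that yielded no taxonomic ID are represented by -1 and skipped when choosing the tid, ties go to the earlier file, and summarize_across_files is a hypothetical name.

```python
def summarize_across_files(per_file):
    """Combine per-file results for one contig.

    `per_file` holds one (coverage, tid) tuple per SAM/BAM file, in input
    order; tid is -1 when a file yielded no taxonomic ID.
    Returns (coverages in input order, tid from the highest-coverage file).
    """
    coverages = [cov for cov, _ in per_file]
    best_tid = -1
    best_cov = -1.0
    for cov, tid in per_file:
        if tid >= 0 and cov > best_cov:
            best_cov, best_tid = cov, tid
    return coverages, best_tid
```

Every coverage value is kept, in file order, while only one tid survives.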
Memory Management
The tool limits memory use in the following ways:
- Scaffold Mapping: Creates a HashMap<String, Scaf> from the SAM headers (@SQ lines) using makeScafMap(), which avoids loading the entire reference into memory
- Streaming Processing: Uses SamLineStreamer with configurable thread count for memory-efficient reading of large SAM/BAM files
- Concurrent Processing: Employs multiple ProcessThread instances for SAM processing while maintaining thread-safe IntLongHashMap data structures with synchronized access
- Incremental Processing: Processes contigs using ConcurrentReadInputStream with ListNum<Read> batching rather than loading everything into memory simultaneously
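The scaffold-mapping idea, reading contig names and lengths from @SQ header lines rather than from the reference sequences, can be sketched in Python. This is a hypothetical analogue of makeScafMap(), not a port of it. It relies only on the SN and LN tags that the SAM specification defines for @SQ lines.

```python
def make_scaf_map(sam_lines):
    """Build {contig_name: length} from @SQ header lines.

    Stops at the first non-header line, so alignment records and the
    reference sequences themselves are never loaded.
    """
    scafs = {}
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header section is over
        if not line.startswith("@SQ"):
            continue  # other header types (@HD, @PG, ...) are skipped
        tags = dict(
            field.split(":", 1)
            for field in line.rstrip("\n").split("\t")[1:]
            if ":" in field
        )
        scafs[tags["SN"]] = int(tags["LN"])
    return scafs
```

Because any iterable of lines works, an open text file handle can be passed directly and only the header is consumed.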
Performance Characteristics
- Time Complexity: O(n), where n is the total number of mapped reads across all SAM files; each read is checked with SamFilter.passesFilter()
- Memory Usage: Scales with the number of unique contigs, each stored in the HashMap<String, Scaf>, plus a per-contig IntLongHashMap for taxonomic ID tracking
- Scalability: Uses Shared.threads() for parallel ProcessThread instances, with SamLineStreamer supporting configurable streamerThreads for large datasets
- I/O Optimization: Uses ConcurrentReadInputStream and ConcurrentReadOutputStream with configurable buffer sizes based on ordered processing requirements
Output Format
Output contig names follow this pattern:
[original_name|contig_#][delimiter][tid_XXXX][delimiter][cov_XX.XX][delimiter][cov_YY.YY]...
Components:
- Base Name: Original contig name or 'contig_#' if wipe=t
- Taxonomic ID: 'tid_XXXX' where XXXX is the taxonomic ID (only if tid=t and not already present)
- Coverage Values: 'cov_XX.XX' for each input SAM file (only if depth=t)
- Delimiter: Configurable separator between fields (default: space)
Example Outputs:
# Original: >scaffold_1
# Output: >scaffold_1 tid_1234 cov_45.67
# With multiple SAM files:
# Output: >scaffold_1 tid_5678 cov_23.45 cov_67.89 cov_12.34
# With wipe=t:
# Output: >contig_1 tid_9999 cov_88.88
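The naming pattern can be reproduced with a short Python sketch. This is an illustration of the documented format, not the tool's code: build_header and its arguments are hypothetical, and the check for an existing tid is simplified to a substring test on the original name.

```python
def build_header(name, index, coverages, tid=-1, delimiter=" ",
                 wipe=False, add_depth=True, add_tid=True):
    """Assemble a renamed contig header from its parts.

    Field order follows the documented pattern: base name, optional tid,
    then one cov field per SAM file.
    """
    parts = [f"contig_{index}" if wipe else name]
    # Assumption: a wiped header never carries an existing tid term.
    already_has_tid = (not wipe) and "tid_" in name
    if add_tid and tid >= 0 and not already_has_tid:
        parts.append(f"tid_{tid}")
    if add_depth:
        parts.extend(f"cov_{c:.2f}" for c in coverages)
    return delimiter.join(parts)
```

Calling build_header("scaffold_1", 1, [45.67], tid=1234) returns "scaffold_1 tid_1234 cov_45.67", matching the first example output.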
Use Cases
- Metagenome Binning Evaluation: Annotate assembled contigs with coverage and taxonomic information for downstream binning analysis
- Synthetic Read Generation: Prepare reference sequences with realistic coverage annotations for simulation studies
- Coverage-Based Filtering: Identify contigs with specific coverage ranges for quality control or analysis
- Multi-Sample Analysis: Track how contig coverage varies across multiple sequencing experiments
- Taxonomic Validation: Verify taxonomic assignments by comparing read-based and assembly-based classifications
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org