RenameByMapping

Script: renamebymapping.sh Package: bin Class: ContigRenamer.java

Renames contigs based on mapping information. Appends coverage and, optionally, a taxID parsed from SAM line headers. For taxID renaming, read headers should contain a term like 'tid_1234'. Output contigs are named like 'original tid_1234 cov_45.67', with one coverage entry per SAM file (so possibly several) but only one tid entry, taken from the highest-coverage SAM file. Designed for metagenome binning evaluation and synthetic read generation.

Basic Usage

renamebymapping.sh in=contigs.fa out=renamed.fa *.sam

This tool processes a reference assembly file and one or more SAM/BAM mapping files. It renames each contig with coverage calculated from the mappings and, optionally, taxonomic information extracted from read headers.

Parameters

Parameters control input/output files, the information to append to contig names, and how the renaming is performed.

Input/Output Parameters

in=<file>: Assembly to rename. Input FASTA file containing contigs/scaffolds that will be renamed based on mapping information.
out=<file>: Renamed assembly. Output FASTA file with contigs renamed to include coverage and optionally taxonomic information.
sam=<file>: This can be a file, directory, or comma-delimited list. Unrecognized arguments that are existing files will also be treated as sam files. BAM is acceptable too. Multiple SAM/BAM files can be processed to calculate coverage from different mapping experiments.

Renaming Control Parameters

delimiter=space: Delimiter between appended fields. Character used to separate the original contig name from the appended coverage and taxonomic information. Default is space.
wipe=f: Replace the original header with contig_#. When true, completely replaces original contig names with sequential numbering (contig_1, contig_2, etc.) before appending coverage/taxonomic information. Default: false.
depth=t: Add a depth field. When true, appends coverage information in format 'cov_X.XX' where X.XX is the calculated coverage depth. Default: true.
tid=t: Add a tid field (if not already present). When true, appends taxonomic ID information in format 'tid_XXXX' where XXXX is the taxonomic ID extracted from read headers. Only added if the contig doesn't already have a taxonomic ID. Default: true.

Advanced Parameters

cami=<file>: CAMI format input file for taxonomic assignments. Uses processCami() method with LineParser1 tab-delimited parsing instead of SAM-based processing. Sets addDepth=false and addTid=true automatically.
ordered=f: Process contigs in input order. Controls ConcurrentReadOutputStream buffer size calculation: ordered uses Tools.mid(16, 128, (Shared.threads()*2)/3), unordered uses 8. Default: false.
verbose=f: Print verbose status messages to outstream during processing. Outputs "Started cris", "Started Streamer", and "Finished; closing streams" messages. Default: false.
clearfilters=f: Clear all SAM filtering criteria using samFilter.clear(). When true, bypasses default SamFilter settings that exclude unmapped, supplementary, duplicate, non-primary, and quality-failed reads. Default: false.
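
The effect of clearfilters can be pictured with the FLAG bits defined in the SAM specification. The following is a minimal Python sketch, not the SamFilter code itself: the function name passes_default_filter and its clear_filters argument are hypothetical, and the set of excluded read classes is taken from the description above.

```python
# FLAG bits as defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100      # "not primary alignment"
QC_FAIL = 0x200        # failed platform/vendor quality checks
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800

# Read classes the description above says are excluded by default.
EXCLUDED = UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE | SUPPLEMENTARY


def passes_default_filter(flag: int, clear_filters: bool = False) -> bool:
    """Hypothetical stand-in for the default filter; not SamFilter's real logic."""
    if clear_filters:
        # clearfilters=t: every record is accepted.
        return True
    # Accept only records with none of the excluded bits set.
    return (flag & EXCLUDED) == 0
```

Under these assumptions, a record with FLAG 2048 (supplementary) is dropped by default but kept when clear_filters is true.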

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions. Can improve performance in production environments by skipping internal consistency checks.

Examples

Basic Contig Renaming

renamebymapping.sh in=assembly.fa out=renamed.fa mapping.sam

Renames contigs in assembly.fa based on coverage calculated from mapping.sam, appending coverage information to each contig name.

Multiple SAM Files

renamebymapping.sh in=contigs.fa out=renamed.fa sample1.sam sample2.sam sample3.sam

Processes multiple SAM files to calculate coverage from different samples. Each coverage value will be appended separately (e.g., 'contig1 cov_12.34 cov_8.76 cov_15.23').

With Taxonomic Information

renamebymapping.sh in=metagenome.fa out=renamed.fa reads_mapped.sam tid=t

Extracts taxonomic IDs from read headers (format 'tid_1234') and appends both coverage and taxonomic information. The taxonomic ID is selected from the highest-coverage mapping.

Complete Header Replacement

renamebymapping.sh in=assembly.fa out=clean.fa mapping.sam wipe=t delimiter=_

Replaces original contig names with sequential numbering and uses underscore as delimiter (e.g., 'contig_1_cov_45.67').

Coverage Only (No Taxonomic Info)

renamebymapping.sh in=contigs.fa out=cov_only.fa mapping.sam tid=f

Appends only coverage information without extracting or adding taxonomic IDs.

Algorithm Details

Coverage Calculation Strategy

The tool implements a three-stage coverage calculation process:

Mapping Stage: Parses SAM/BAM files using SamLineStreamer, applying SamFilter default settings to exclude unmapped, supplementary, duplicate, and non-primary alignments
Accumulation Stage: Uses thread-safe IntLongHashMap with synchronized increment operations (map.increment(taxid, length)) to accumulate mapped bases per contig
Coverage Calculation: Divides totalMappedBases by contig length using float arithmetic (totalMappedBases/(float)length) and formats the result to two decimal places in the output
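
The accumulation and division stages can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ContigRenamer implementation: the function names are hypothetical, and each alignment is assumed to contribute a known number of mapped bases (how the tool counts bases per alignment is not specified here).

```python
from collections import defaultdict


def coverage_per_contig(contig_lengths, alignments):
    """Return {contig: coverage}.

    contig_lengths: dict of contig name -> length in bases.
    alignments: iterable of (contig name, mapped bases) pairs that
    already passed filtering (assumed input shape).
    """
    mapped = defaultdict(int)
    for contig, bases in alignments:
        if contig in contig_lengths:  # ignore references not in the assembly
            mapped[contig] += bases
    # Coverage = total mapped bases / contig length.
    return {name: mapped[name] / length
            for name, length in contig_lengths.items() if length > 0}


def format_cov(cov):
    """Format a coverage value as the 'cov_X.XX' field."""
    return f"cov_{cov:.2f}"
```

For example, 300 mapped bases on a 1000-base contig give a coverage of 0.3, which is written as cov_0.30.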

Taxonomic ID Extraction

The tool uses TaxTree.parseHeaderStatic2() to parse taxonomic IDs:

Header Parsing: Searches read headers for taxonomic information in format 'tid_XXXX' using TaxTree.parseHeaderStatic2() method
Conflict Resolution: When multiple taxonomic IDs map to the same contig, selects the ID associated with the highest coverage mapping using maxMappedBases comparison
Existing ID Preservation: Checks if contigs already contain taxonomic information using TaxTree.parseHeaderStatic2() and avoids duplication when oldTaxid >= 0
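
The three steps above can be approximated as follows. This is a sketch only: a regular expression stands in for TaxTree.parseHeaderStatic2(), whose exact parsing rules are not reproduced here, and the function names are hypothetical. Returning -1 for "no ID found" is an assumption chosen to match the oldTaxid >= 0 check.

```python
import re
from collections import defaultdict

TID_PATTERN = re.compile(r"tid_(\d+)")


def parse_tid(header):
    """Return the number after 'tid_' in a header, or -1 if absent."""
    m = TID_PATTERN.search(header)
    return int(m.group(1)) if m else -1


def pick_tid(read_records):
    """Pick the taxID backed by the most mapped bases on one contig.

    read_records: iterable of (read header, mapped bases) pairs.
    """
    bases_by_tid = defaultdict(int)
    for header, bases in read_records:
        tid = parse_tid(header)
        if tid >= 0:
            bases_by_tid[tid] += bases
    if not bases_by_tid:
        return -1
    return max(bases_by_tid, key=bases_by_tid.get)


def add_tid(contig_name, tid, delimiter=" "):
    """Append 'tid_N' unless the name already carries a taxID."""
    if tid < 0 or parse_tid(contig_name) >= 0:
        return contig_name
    return f"{contig_name}{delimiter}tid_{tid}"
```

With reads carrying tid_1 (200 bases in total) and tid_2 (150 bases), pick_tid returns 1, and add_tid leaves a contig already named 'c1 tid_9' unchanged.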

Multi-File Processing

When processing multiple SAM/BAM files:

Each file is processed independently to calculate per-file coverage
Coverage values from all files are appended as separate fields
Taxonomic ID is selected from the file contributing the highest coverage
Thread-safe processing allows efficient handling of large mapping files
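
As a rough illustration of the multi-file rules, the sketch below combines per-file results for a single contig. The function name and the (coverage, tid) input shape are hypothetical, and skipping files that yielded no taxID when choosing the winner is an assumption rather than documented behavior.

```python
def summarize_contig(per_file):
    """Combine per-file results for one contig.

    per_file: list of (coverage, tid) tuples, one per SAM file in input
    order; tid is -1 when that file yielded no taxonomic ID.
    Returns (list of coverages, chosen tid or -1).
    """
    coverages = [cov for cov, _ in per_file]  # one cov field per file
    best_cov, best_tid = -1.0, -1
    for cov, tid in per_file:
        # Assumption: only files that produced a taxID compete.
        if tid >= 0 and cov > best_cov:
            best_cov, best_tid = cov, tid
    return coverages, best_tid
```

Every file contributes a coverage value, but only the taxID from the highest-coverage file is kept.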

Memory Management

The tool limits memory use in the following ways:

Scaffold Mapping: Creates HashMap<String, Scaf> from SAM headers (@SQ lines) using makeScafMap() to avoid loading entire reference into memory
Streaming Processing: Uses SamLineStreamer with configurable thread count for memory-efficient reading of large SAM/BAM files
Concurrent Processing: Employs multiple ProcessThread instances for SAM processing while maintaining thread-safe IntLongHashMap data structures with synchronized access
Incremental Processing: Processes contigs using ConcurrentReadInputStream with ListNum<Read> batching rather than loading everything into memory simultaneously
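
The scaffold-mapping idea, reading contig names and lengths from @SQ header lines instead of loading the reference, can be sketched as below. This is not the makeScafMap() code: the function name is hypothetical and a plain dict replaces the Scaf objects. The SN and LN tags are the standard SAM header fields for sequence name and length.

```python
def parse_sq_lines(sam_lines):
    """Return {contig name: length} from the @SQ lines of a SAM header.

    sam_lines may be any iterable of lines, such as an open file, so
    only the header is read before stopping.
    """
    lengths = {}
    for line in sam_lines:
        if not line.startswith("@"):
            break  # first alignment line: the header is over
        if not line.startswith("@SQ"):
            continue  # skip @HD, @PG, @RG, @CO
        # Each field after the record type has the form TAG:VALUE.
        fields = dict(f.split(":", 1)
                      for f in line.rstrip("\n").split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            lengths[fields["SN"]] = int(fields["LN"])
    return lengths
```

Because the loop stops at the first non-header line, memory use depends on the number of contigs rather than on the size of the SAM file.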

Performance Characteristics

Time Complexity: O(n), where n is the total number of SAM records across all input files; each record is checked once with SamFilter.passesFilter()
Memory Usage: Scales with the number of unique contigs, which are stored in a HashMap<String, Scaf>, plus a per-contig IntLongHashMap used for taxonomic ID tracking
Scalability: Uses Shared.threads() for parallel ProcessThread instances, with SamLineStreamer supporting configurable streamerThreads for large datasets
I/O Optimization: Uses ConcurrentReadInputStream and ConcurrentReadOutputStream with configurable buffer sizes based on ordered processing requirements

Output Format

Output contig names follow this pattern:

[original_name|contig_#][delimiter][tid_XXXX][delimiter][cov_XX.XX][delimiter][cov_YY.YY]...

Components:

Base Name: Original contig name or 'contig_#' if wipe=t
Taxonomic ID: 'tid_XXXX' where XXXX is the taxonomic ID (only if tid=t and not already present)
Coverage Values: 'cov_XX.XX' for each input SAM file (only if depth=t)
Delimiter: Configurable separator between fields (default: space)

Example Outputs:

# Original: >scaffold_1
# Output: >scaffold_1 tid_1234 cov_45.67

# With multiple SAM files:
# Output: >scaffold_1 tid_5678 cov_23.45 cov_67.89 cov_12.34

# With wipe=t:
# Output: >contig_1 tid_9999 cov_88.88
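
The naming pattern can be reproduced with a small Python sketch. The function build_name and its parameters are hypothetical, chosen to mirror the wipe, depth, tid and delimiter options. Checking for an existing taxID after the wipe step is an assumption.

```python
import re


def build_name(original, index, coverages, tid=-1, delimiter=" ",
               wipe=False, depth=True, add_tid=True):
    """Assemble an output contig name from its components.

    index is the contig's 1-based position, used only when wipe is true.
    coverages holds one value per input SAM file.
    """
    base = f"contig_{index}" if wipe else original
    parts = [base]
    already_has_tid = re.search(r"tid_\d+", base) is not None
    if add_tid and tid >= 0 and not already_has_tid:
        parts.append(f"tid_{tid}")
    if depth:
        parts.extend(f"cov_{c:.2f}" for c in coverages)
    return delimiter.join(parts)
```

Called as build_name("scaffold_1", 1, [45.67], tid=1234), the sketch yields the first example name shown above.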

Use Cases

Metagenome Binning Evaluation: Annotate assembled contigs with coverage and taxonomic information for downstream binning analysis
Synthetic Read Generation: Prepare reference sequences with realistic coverage annotations for simulation studies
Coverage-Based Filtering: Identify contigs with specific coverage ranges for quality control or analysis
Multi-Sample Analysis: Track how contig coverage varies across multiple sequencing experiments
Taxonomic Validation: Verify taxonomic assignments by comparing read-based and assembly-based classifications

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org