RenameByMapping

Script: renamebymapping.sh | Package: bin | Class: ContigRenamer.java

Renames contigs based on mapping information, appending coverage and, optionally, a taxID parsed from the read headers in SAM lines. For taxID renaming, read headers should contain a term like 'tid_1234'. Output contigs are named like 'original tid_1234 cov_45.67'. With multiple SAM files there can be multiple coverage entries, but only one tid entry, taken from the highest-coverage SAM file. Designed for metagenome binning evaluation and synthetic read generation.
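As a rough illustration of the 'tid_1234' convention, the shell sketch below pulls the numeric taxID out of a read header. The function name extract_tid is hypothetical, and the pattern is only an assumption about what counts as a tid term; the tool's own parser (TaxTree.parseHeaderStatic2(), named under Algorithm Details) may accept other forms.

```shell
# Hypothetical helper: print the digits of the first tid_<digits> term
# found in a header string, or print nothing if there is no such term.
extract_tid() {
  # grep -o emits each match on its own line; head keeps the first one,
  # and cut drops the "tid_" prefix.
  printf '%s\n' "$1" | grep -oE 'tid_[0-9]+' | head -n 1 | cut -d_ -f2
}

extract_tid 'read_77 tid_1234 pos_5000'   # prints 1234
```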

Basic Usage

renamebymapping.sh in=contigs.fa out=renamed.fa *.sam

This tool processes an assembly file and one or more SAM/BAM mapping files, then renames each contig with coverage computed from the alignments and, optionally, a taxonomic ID extracted from read headers.

Parameters

Parameters control input/output files, the information to append to contig names, and how the renaming is performed.

Input/Output Parameters

in=<file>
Assembly to rename. Input FASTA file containing contigs/scaffolds that will be renamed based on mapping information.
out=<file>
Renamed assembly. Output FASTA file with contigs renamed to include coverage and optionally taxonomic information.
sam=<file>
SAM input. This can be a file, directory, or comma-delimited list. Unrecognized arguments that are existing files will also be treated as SAM files. BAM is acceptable too. When multiple SAM/BAM files are given, coverage is calculated separately for each one.

Renaming Control Parameters

delimiter=space
Delimiter between appended fields. Character used to separate the original contig name from the appended coverage and taxonomic information. Default is space.
wipe=f
Replace the original header with contig_#. When true, completely replaces original contig names with sequential numbering (contig_1, contig_2, etc.) before appending coverage/taxonomic information. Default: false.
depth=t
Add a depth field. When true, appends coverage information in format 'cov_X.XX' where X.XX is the calculated coverage depth. Default: true.
tid=t
Add a tid field (if not already present). When true, appends taxonomic ID information in format 'tid_XXXX' where XXXX is the taxonomic ID extracted from read headers. Only added if the contig doesn't already have a taxonomic ID. Default: true.
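The interaction of delimiter, wipe, depth, and tid can be pictured with a small shell sketch. The function build_name and its argument order are hypothetical, not part of the tool. It follows the field order given under Output Format (name, then tid, then one cov field per SAM file) and assumes a contig counts as already carrying a taxID when its name contains 'tid_'.

```shell
# Hypothetical sketch of header construction; not the tool's actual code.
# Usage: build_name NAME INDEX DELIM WIPE DEPTH TID_FLAG TID [COV...]
build_name() {
  name=$1; index=$2; delim=$3; wipe=$4; depth=$5; tid_flag=$6; tid=$7
  shift 7
  # wipe=t discards the original header in favor of contig_#.
  if [ "$wipe" = t ]; then
    name="contig_$index"
  fi
  out=$name
  # tid=t adds a tid field only if one is known and none is present yet.
  if [ "$tid_flag" = t ] && [ -n "$tid" ]; then
    case $name in
      *tid_*) ;;
      *) out="$out${delim}tid_$tid" ;;
    esac
  fi
  # depth=t adds one cov field per remaining argument (one per SAM file).
  if [ "$depth" = t ]; then
    for cov in "$@"; do
      out="$out${delim}cov_$cov"
    done
  fi
  printf '%s\n' "$out"
}

build_name scaffold_1 1 ' ' f t t 1234 45.67   # prints: scaffold_1 tid_1234 cov_45.67
build_name scaffold_1 1 '_' t t f '' 45.67     # prints: contig_1_cov_45.67
```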

Advanced Parameters

cami=<file>
CAMI format input file for taxonomic assignments. Uses processCami() method with LineParser1 tab-delimited parsing instead of SAM-based processing. Sets addDepth=false and addTid=true automatically.
ordered=f
Process contigs in input order. Controls ConcurrentReadOutputStream buffer size calculation: ordered uses Tools.mid(16, 128, (Shared.threads()*2)/3), unordered uses 8. Default: false.
verbose=f
Print verbose status messages to outstream during processing. Outputs "Started cris", "Started Streamer", and "Finished; closing streams" messages. Default: false.
clearfilters=f
Clear all SAM filtering criteria using samFilter.clear(). When true, bypasses default SamFilter settings that exclude unmapped, supplementary, duplicate, non-primary, and quality-failed reads. Default: false.
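For orientation, the read classes listed under clearfilters correspond to standard bits in the SAM FLAG field. The sketch below tests those bits with shell arithmetic. The function name is hypothetical, and the check only mirrors the categories named above; it is not a reimplementation of SamFilter, which may apply further criteria.

```shell
# Hypothetical check: succeed (exit status 0) if a SAM FLAG value has none
# of the excluded bits set. Bit values are from the SAM specification:
#   0x4 unmapped, 0x100 secondary (non-primary), 0x200 failed QC,
#   0x400 duplicate, 0x800 supplementary.
passes_default_filters() {
  excluded=$(( 0x4 | 0x100 | 0x200 | 0x400 | 0x800 ))
  [ $(( $1 & excluded )) -eq 0 ]
}

passes_default_filters 99 && echo "flag 99 kept"     # mapped, primary, paired read
passes_default_filters 4  || echo "flag 4 dropped"   # unmapped read
```

With clearfilters=t, every one of these reads would be counted instead.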

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions. Can improve performance in production environments by skipping internal consistency checks.

Examples

Basic Contig Renaming

renamebymapping.sh in=assembly.fa out=renamed.fa mapping.sam

Renames contigs in assembly.fa based on coverage calculated from mapping.sam, appending coverage information to each contig name.

Multiple SAM Files

renamebymapping.sh in=contigs.fa out=renamed.fa sample1.sam sample2.sam sample3.sam

Processes multiple SAM files to calculate coverage from different samples. Each coverage value will be appended separately (e.g., 'contig1 cov_12.34 cov_8.76 cov_15.23').

With Taxonomic Information

renamebymapping.sh in=metagenome.fa out=renamed.fa reads_mapped.sam tid=t

Extracts taxonomic IDs from read headers (format 'tid_1234') and appends both coverage and taxonomic information. When several SAM files are given, the taxonomic ID is taken from the highest-coverage one. Because tid=t is the default, the flag is shown here only for clarity.
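The highest-coverage rule can be sketched as picking the taxID paired with the largest coverage value. The input layout (one 'coverage taxID' pair per SAM file) and the function name pick_tid are hypothetical, chosen only for illustration; how the tool tracks these values internally is not described here.

```shell
# Hypothetical helper: read "coverage tid" pairs (one line per SAM file)
# on stdin and print the tid that accompanies the largest coverage.
# On a tie, the first pair seen wins.
pick_tid() {
  awk 'NF >= 2 && (best == "" || $1 + 0 > max) { max = $1 + 0; best = $2 }
       END { if (best != "") print best }'
}

printf '%s\n' '23.45 1234' '67.89 5678' '12.34 1234' | pick_tid   # prints 5678
```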

Complete Header Replacement

renamebymapping.sh in=assembly.fa out=clean.fa mapping.sam wipe=t delimiter=_

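Replaces original contig names with sequential numbering and uses underscore as the delimiter (e.g., 'contig_1_cov_45.67'). Since tid=t by default, a tid field is also joined with underscores when a taxonomic ID is found.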

Coverage Only (No Taxonomic Info)

renamebymapping.sh in=contigs.fa out=cov_only.fa mapping.sam tid=f

Appends only coverage information without extracting or adding taxonomic IDs.

Algorithm Details

Coverage Calculation Strategy

The tool implements a three-stage coverage calculation process:

Taxonomic ID Extraction

The algorithm implements TaxTree.parseHeaderStatic2() for taxonomic ID parsing:

Multi-File Processing

When processing multiple SAM/BAM files:

Memory Management

The tool implements specific memory optimization strategies:

Performance Characteristics

Output Format

Output contig names follow this pattern:

[original_name|contig_#][delimiter][tid_XXXX][delimiter][cov_XX.XX][delimiter][cov_YY.YY]...

Components:

Example Outputs:

# Original: >scaffold_1
# Output: >scaffold_1 tid_1234 cov_45.67

# With multiple SAM files:
# Output: >scaffold_1 tid_5678 cov_23.45 cov_67.89 cov_12.34

# With wipe=t:
# Output: >contig_1 tid_9999 cov_88.88
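
Downstream scripts often need these fields back as columns. The sketch below splits renamed headers on the default space delimiter and prints the name, the taxID, and a comma-separated list of coverage values. parse_headers is a hypothetical name, and the approach assumes the original contig name contains no spaces and no extra terms starting with 'tid_' or 'cov_'.

```shell
# Hypothetical helper: turn renamed FASTA headers (default space delimiter)
# into tab-separated name, tid, and comma-joined coverage values.
# Reads the files given as arguments, or stdin when none are given.
parse_headers() {
  awk '/^>/ {
    name = substr($1, 2); tid = "NA"; covs = ""
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^tid_/)      tid = substr($i, 5)
      else if ($i ~ /^cov_/) covs = covs (covs == "" ? "" : ",") substr($i, 5)
    }
    printf "%s\t%s\t%s\n", name, tid, (covs == "" ? "NA" : covs)
  }' "$@"
}

printf '%s\n' '>scaffold_1 tid_5678 cov_23.45 cov_67.89 cov_12.34' 'ACGT' | parse_headers
# prints: scaffold_1<TAB>5678<TAB>23.45,67.89,12.34
```

Missing fields are reported as NA so that every row keeps three columns.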

Use Cases

Support

For questions and support: