DedupeByMapping

Script: dedupebymapping.sh
Package: jgi
Class: DedupeByMapping.java

Deduplicates mapped reads based on pair mapping coordinates.

Basic Usage

dedupebymapping.sh in=<file> out=<file>

DedupeByMapping removes duplicate reads from mapped SAM/BAM files by comparing the mapping coordinates of each read pair rather than the read sequences. Because duplicates are identified by where a pair maps, the tool is intended only for mapped reads, where it is better suited than sequence-based deduplication.

Parameters

Parameters control input/output handling, duplicate detection criteria, and memory management.

Input/Output Parameters

in=<file>
Input SAM/BAM file containing mapped reads. The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard input. File must contain mapping information (SAM/BAM format).
out=<file>
Output file for deduplicated reads. The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard output. Output format matches input format.
overwrite=t
(ow) Overwrite existing output files. Set to false to make the program exit with an error, rather than overwrite, if the output file already exists. Default: true.
ziplevel=2
(zl) Compression level for output files, from 1 (lowest/fastest) through 9 (maximum/slowest). Lower compression is faster but produces larger files. Default: 2.

Deduplication Parameters

keepunmapped=t
(ku) Keep unmapped reads in the output. This refers to unmapped single-ended reads or pairs where both reads are unmapped. When true, unmapped reads pass through without deduplication. When false, unmapped reads are discarded. Default: true.
keepsingletons=t
(ks) Keep all pairs in which only one read mapped (singletons). When true, singletons are retained even if they map to the same location. When false, duplicate singletons are discarded based on their mapping coordinates. Default: true.
ignorepairorder=f
(ipo) Controls whether read pair order matters for duplicate detection. When false, read pair order matters. When true, reverse-complementary pairs (pairs mapping to the same coordinates but in opposite orientations) are considered duplicates. Default: false.
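
The interaction of keepunmapped and keepsingletons can be summarized as a small decision function. This is only a sketch of the behavior described in the parameter text above, not the tool's code; the function name route_reads and its return labels are hypothetical.

```python
from typing import Sequence


def route_reads(mapped_flags: Sequence[bool],
                keepunmapped: bool = True,
                keepsingletons: bool = True) -> str:
    """Return 'dedupe', 'keep', or 'discard' for one read or read pair.

    mapped_flags holds one boolean per read: one entry for a
    single-ended read, two entries for a pair. (Hypothetical model.)
    """
    if not any(mapped_flags):
        # Unmapped single-ended read, or pair with both reads unmapped.
        return "keep" if keepunmapped else "discard"
    if all(mapped_flags):
        # Every read mapped: subject to coordinate-based deduplication.
        return "dedupe"
    # Singleton: only one read of the pair mapped.
    return "keep" if keepsingletons else "dedupe"


print(route_reads([False, False]))                      # keep
print(route_reads([False, False], keepunmapped=False))  # discard
print(route_reads([True, False], keepsingletons=False)) # dedupe
```

Note that keepsingletons=f does not discard every singleton; it only makes singletons eligible for deduplication by their mapping coordinates.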

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 3g for this tool.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to prevent hanging on memory exhaustion.
-da
Disable assertions. Can provide minor performance improvement in production environments by skipping internal consistency checks.
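
As a worked example of the 85% figure mentioned for -Xmx, the arithmetic can be sketched as follows. The helper name xmx_flag is hypothetical, and the calculation only illustrates the stated rule of thumb; it is not how the launcher script actually detects memory.

```python
def xmx_flag(physical_bytes: int, fraction: float = 0.85) -> str:
    """Format a -Xmx flag for a given fraction of physical memory.

    The value is expressed in megabytes (the 'm' suffix), rounded down.
    """
    usable_bytes = int(physical_bytes * fraction)
    megabytes = usable_bytes // (1024 * 1024)
    return f"-Xmx{megabytes}m"


# A machine with 32 GiB of physical memory.
print(xmx_flag(32 * 1024**3))  # -Xmx27852m
```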

Examples

Basic Deduplication

dedupebymapping.sh in=mapped_reads.sam out=deduped_reads.sam

Remove duplicate reads from a SAM file based on mapping coordinates. Keeps unmapped reads and singletons by default.

Strict Deduplication

dedupebymapping.sh in=mapped_reads.bam out=deduped_reads.bam keepunmapped=f keepsingletons=f

Discard unmapped reads and deduplicate singletons in addition to fully mapped pairs. With keepsingletons=f, singletons are not all removed; only duplicate singletons are discarded, based on their mapping coordinates.

Ignore Pair Orientation

dedupebymapping.sh in=mapped_reads.sam out=deduped_reads.sam ignorepairorder=t

Consider pairs mapping to the same coordinates as duplicates regardless of their relative orientation. Useful for protocols where read orientation may vary.

High Memory Processing

dedupebymapping.sh -Xmx32g in=large_dataset.bam out=deduped_large.bam

Process a large dataset with 32 GB of memory allocated to Java, which helps when the input contains many unique mapping positions.
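
When the tool is called from a pipeline, the command lines above can be assembled programmatically. The sketch below uses only the flags documented on this page; the helper name build_command is hypothetical, and it only builds the argument list without running anything.

```python
from typing import List, Optional


def _flag(value: bool) -> str:
    """Render a boolean the way the examples above do (t or f)."""
    return "t" if value else "f"


def build_command(in_file: str,
                  out_file: str,
                  keepunmapped: bool = True,
                  keepsingletons: bool = True,
                  ignorepairorder: bool = False,
                  xmx: Optional[str] = None) -> List[str]:
    """Build an argument list for dedupebymapping.sh (hypothetical helper)."""
    cmd = ["dedupebymapping.sh"]
    if xmx:
        # For example xmx="32g" becomes -Xmx32g.
        cmd.append(f"-Xmx{xmx}")
    cmd += [
        f"in={in_file}",
        f"out={out_file}",
        f"keepunmapped={_flag(keepunmapped)}",
        f"keepsingletons={_flag(keepsingletons)}",
        f"ignorepairorder={_flag(ignorepairorder)}",
    ]
    return cmd


cmd = build_command("mapped_reads.bam", "deduped_reads.bam",
                    keepunmapped=False, keepsingletons=False, xmx="32g")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a system where dedupebymapping.sh is on the PATH.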

Algorithm Details

Coordinate-Based Duplicate Detection

DedupeByMapping detects duplicates by coordinate: each read pair is converted to a Quad object containing the chromosome IDs and start positions of its reads, and these coordinates, rather than the sequences, are compared to find duplicates.
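
The idea of a coordinate key can be illustrated with a minimal sketch. The Quad field names below are assumptions based on the description above (chromosome ID and start position per read), and keeping the first pair seen for each key is also an assumption, since this text does not say which copy the tool retains.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Quad(NamedTuple):
    """Hypothetical coordinate key for one read pair."""
    chrom1: int
    start1: int
    chrom2: int
    start2: int


# A pair is (name, key); real input would be parsed from SAM/BAM records.
Pair = Tuple[str, Quad]


def dedupe_by_coordinates(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Yield one pair per distinct Quad.

    Assumption: the first pair seen for a given key is the one kept.
    """
    seen = set()
    for name, key in pairs:
        if key in seen:
            continue  # same coordinates already emitted: duplicate
        seen.add(key)
        yield name, key


pairs = [
    ("pairA", Quad(1, 100, 1, 350)),
    ("pairB", Quad(1, 100, 1, 350)),  # same coordinates as pairA
    ("pairC", Quad(2, 100, 2, 350)),  # same positions, different chromosome
]
kept = [name for name, _ in dedupe_by_coordinates(pairs)]
print(kept)  # ['pairA', 'pairC']
```

Because the key is a small fixed-size tuple, membership checks are hash lookups and memory grows with the number of distinct coordinate keys rather than with read length.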

Core Algorithm Strategy

Duplicate Resolution Strategy

When multiple read pairs map to identical coordinates:

Data Structure Implementation

Pair Order Handling

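The algorithm can operate in two modes controlled by the ignorepairorder parameter. With ignorepairorder=f (the default), read pair order matters for duplicate detection. With ignorepairorder=t, pairs mapping to the same coordinates but in opposite orientations are considered duplicates.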
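
One way to make a coordinate key independent of read order is to sort the two reads' coordinates before combining them. The sketch below shows that technique under the assumption that each read is represented as a (chromosome ID, start position) tuple; the function name pair_key is hypothetical, and the tool's own method may differ.

```python
from typing import Tuple

Coordinate = Tuple[int, int]  # (chromosome ID, start position)


def pair_key(read1: Coordinate,
             read2: Coordinate,
             ignore_pair_order: bool = False) -> Tuple[int, int, int, int]:
    """Combine two read coordinates into one comparison key.

    With ignore_pair_order=True the smaller coordinate is placed first,
    so swapping read1 and read2 produces the same key.
    """
    if ignore_pair_order and read2 < read1:
        read1, read2 = read2, read1
    return read1 + read2


forward = ((1, 100), (1, 350))
swapped = ((1, 350), (1, 100))

# Order-sensitive keys differ; order-insensitive keys match.
print(pair_key(*forward) == pair_key(*swapped))  # False
print(pair_key(*forward, ignore_pair_order=True)
      == pair_key(*swapped, ignore_pair_order=True))  # True
```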

Processing Modes

Performance Characteristics

Quality Assurance

The tool provides comprehensive statistics:

Advantages Over Sequence-Based Deduplication

Support

For questions and support: