DedupeByMapping

Basic Usage

dedupebymapping.sh in=<file> out=<file>

DedupeByMapping removes duplicate reads from mapped SAM/BAM files based on their mapping coordinates. Because it compares the mapping coordinates of both reads in a pair rather than their sequences, it is more accurate than sequence-based deduplication for mapped data.

Parameters

Parameters control input/output handling, duplicate detection criteria, and memory management.

Input/Output Parameters

in=<file>: Input SAM/BAM file containing mapped reads. The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard input. File must contain mapping information (SAM/BAM format).
out=<file>: Output file for deduplicated reads. The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard output. Output format matches input format.
overwrite=t: (ow) Set to false to force the program to abort with an error rather than overwrite an existing output file. Default: true.
ziplevel=2: (zl) Compression level for output files, from 1 (lowest/fastest) through 9 (maximum/slowest). Lower levels are faster but produce larger files. Default: 2.

Deduplication Parameters

keepunmapped=t: (ku) Keep unmapped reads in the output. This refers to unmapped single-ended reads or pairs where both reads are unmapped. When true, unmapped reads pass through without deduplication. When false, unmapped reads are discarded. Default: true.
keepsingletons=t: (ks) Keep all pairs in which only one read mapped (singletons). If false, duplicate singletons will be discarded based on their mapping coordinates. When true, singletons are retained even if they map to the same location. Default: true.
ignorepairorder=f: (ipo) If true, consider reverse-complementary pairs as duplicates. When false (default), read pair order matters for duplicate detection. When true, pairs mapping to the same coordinates but in opposite orientations are considered duplicates. Default: false.
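
As a rough illustration of how keepunmapped and keepsingletons interact, the following Python sketch routes a read pair to one of three outcomes. The function name route_pair and the labels 'keep', 'discard', and 'dedupe' are hypothetical and are not part of the tool; the sketch covers paired reads only and simply restates the parameter descriptions above.

```python
def route_pair(mapped1, mapped2, keepunmapped=True, keepsingletons=True):
    """Decide what happens to a read pair: 'keep', 'discard', or 'dedupe'.

    mapped1 and mapped2 are booleans saying whether each read mapped.
    The function name and return labels are hypothetical illustrations.
    """
    if not mapped1 and not mapped2:
        # Fully unmapped pair: passes through untouched, or is dropped.
        return "keep" if keepunmapped else "discard"
    if mapped1 != mapped2:
        # Singleton (only one read mapped): kept unconditionally,
        # or deduplicated by its mapping coordinates.
        return "keep" if keepsingletons else "dedupe"
    # Both reads mapped: always subject to coordinate-based deduplication.
    return "dedupe"
```

With the defaults, only pairs in which both reads mapped are ever candidates for removal.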

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 3g for this tool.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to prevent hanging on memory exhaustion.
-da: Disable assertions. Can provide minor performance improvement in production environments by skipping internal consistency checks.

Examples

Basic Deduplication

dedupebymapping.sh in=mapped_reads.sam out=deduped_reads.sam

Remove duplicate reads from a SAM file based on mapping coordinates. Keeps unmapped reads and singletons by default.

Strict Deduplication

dedupebymapping.sh in=mapped_reads.bam out=deduped_reads.bam keepunmapped=f keepsingletons=f

Discard unmapped reads and deduplicate singletons by their mapping coordinates, in addition to removing duplicate pairs. Unique singletons are still retained.

Ignore Pair Orientation

dedupebymapping.sh in=mapped_reads.sam out=deduped_reads.sam ignorepairorder=t

Consider pairs mapping to the same coordinates as duplicates regardless of their relative orientation. Useful for protocols where read orientation may vary.

High Memory Processing

dedupebymapping.sh -Xmx32g in=large_dataset.bam out=deduped_large.bam

Process a large dataset with 32 GB of memory. Extra memory helps when the input contains many unique mapping positions, because memory use scales with the number of unique positions.

Algorithm Details

Coordinate-Based Duplicate Detection

DedupeByMapping detects duplicates by coordinate. Each read pair is converted to a Quad object holding the chromosome ID and start position of both reads, and pairs that produce identical Quads are treated as duplicates.

Core Algorithm Strategy

Quad-Based Hashing: Each read pair is converted to a "Quad" object containing start positions and chromosome IDs for both reads in the pair
Coordinate Precision: Uses exact mapping coordinates (start/stop positions) rather than approximate sequence similarity
Strand-Aware Processing: Considers read strand orientation when calculating mapping positions for duplicate detection
Memory-Efficient Design: Uses LinkedHashMap structures whose initial capacity is sized from available memory (between 80K and 2M entries)
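
The Quad idea can be sketched in Python as a small hashable record. This is a minimal illustration, not the tool's Java code: the read representation (contig, start, stop, reverse) and the helpers contig_id, anchor, and make_quad are hypothetical, and the choice to use the stop position for reverse-strand reads is an assumption about what "strand-aware" means here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quad:
    """Hashable composite key for a read pair (illustrative only)."""
    chr1: int
    pos1: int
    chr2: int
    pos2: int

# Hypothetical contig-name-to-integer table, filled on first use.
_contig_ids = {}

def contig_id(name):
    # Assign the next unused integer the first time a contig is seen.
    return _contig_ids.setdefault(name, len(_contig_ids))

def anchor(start, stop, reverse):
    # ASSUMPTION: forward reads are keyed by start, reverse reads by stop.
    return stop if reverse else start

def make_quad(read1, read2):
    # Each read is a hypothetical (contig, start, stop, reverse) tuple.
    c1, s1, e1, rev1 = read1
    c2, s2, e2, rev2 = read2
    return Quad(contig_id(c1), anchor(s1, e1, rev1),
                contig_id(c2), anchor(s2, e2, rev2))
```

Because the record is frozen, two pairs with the same contigs and anchor positions compare equal and hash identically, which is what lets a hash map detect duplicates with one lookup.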

Duplicate Resolution Strategy

When multiple read pairs map to identical coordinates:

Quality-Based Selection: Retains the pair with the lowest expected error rate
Expected Error Calculation: Sums expected errors across both reads in a pair, normalized by total pair length
Deterministic Resolution: Ensures consistent results across runs by using a well-defined quality metric
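
The quality metric can be illustrated with a short Python sketch. It assumes Phred-scaled base qualities, where a quality Q corresponds to an error probability of 10^(-Q/10); the function names and the (name, quals1, quals2) tuple layout are hypothetical, and keeping the first pair seen on a tie is an assumption rather than documented behavior.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities for Phred-scaled qualities."""
    return sum(10 ** (-q / 10) for q in quals)

def pair_error_rate(quals1, quals2):
    """Expected errors over both reads, normalized by total pair length."""
    total_len = len(quals1) + len(quals2)
    if total_len == 0:
        return 0.0  # avoid division by zero for empty reads
    return (expected_errors(quals1) + expected_errors(quals2)) / total_len

def pick_best(pairs):
    """Return the (name, quals1, quals2) tuple with the lowest error rate.

    min() returns the first minimum, so ties keep the earliest pair
    (an assumption made for this sketch).
    """
    return min(pairs, key=lambda p: pair_error_rate(p[1], p[2]))
```

For example, a pair whose bases are all Q30 has an expected error rate of 0.001 and would be retained over a duplicate pair whose bases are all Q20 (rate 0.01).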

Data Structure Implementation

Contig Mapping: HashMap converts reference sequence names to integer IDs for efficient storage
Quad Storage: LinkedHashMap stores mapping coordinates as composite keys (chr1, start1, chr2, start2)
Name Resolution: Separate HashMap handles read name to Read object mapping for pair reconstruction
Memory Scaling: Initial capacity scales with available memory: min(2M, max(80K, available_memory/4000))
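
The scaling formula can be worked through in a few lines of Python. This sketch assumes that available_memory is measured in bytes, that 2M and 80K mean 2,000,000 and 80,000, and that the division is integer division; none of these details are confirmed by the text above, and initial_capacity is a hypothetical name.

```python
def initial_capacity(available_memory_bytes):
    """min(2M, max(80K, available_memory / 4000)), under the stated assumptions."""
    return min(2_000_000, max(80_000, available_memory_bytes // 4000))
```

Under these assumptions, 100 MB of available memory gives the 80,000-entry floor, 3 GB gives 750,000 entries, and anything from 8 GB upward hits the 2,000,000-entry cap.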

Pair Order Handling

The algorithm can operate in two modes controlled by the ignorepairorder parameter:

Order-Sensitive (default): Pairs (A,B) and (B,A) mapping to the same coordinates are considered different
Order-Insensitive: Pairs are normalized so (A,B) and (B,A) are treated as identical duplicates
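
One way to picture the normalization is to sort the two halves of the key before hashing. The Python sketch below is an illustration under that assumption; make_key is a hypothetical helper, and the tool's actual normalization may differ.

```python
def make_key(chr1, pos1, chr2, pos2, ignorepairorder=False):
    """Build a (chr1, pos1, chr2, pos2) key, optionally order-normalized."""
    a, b = (chr1, pos1), (chr2, pos2)
    if ignorepairorder and b < a:
        # Put the smaller (chromosome, position) half first so that
        # (A,B) and (B,A) collapse to the same key.
        a, b = b, a
    return a + b
```

In order-sensitive mode the keys for (A,B) and (B,A) differ, so both pairs survive; in order-insensitive mode they are equal, so one of the two is removed as a duplicate.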

Processing Modes

Unsorted Mode (default): Processes reads in any order, suitable for most input files
Sorted Mode (experimental): Intended for coordinate-sorted input; currently not fully implemented

Performance Characteristics

Memory Usage: Scales with number of unique mapping positions, not total read count
Time Complexity: Expected O(n) for n reads, since each read needs a constant number of hash table operations on average
Scalability: Can handle datasets with millions of reads given sufficient memory
I/O Efficiency: Single-pass algorithm minimizes disk access
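
These characteristics follow from the shape of a single-pass hash-based loop, sketched below in Python. The input layout (key, error_rate, name) and the function name dedupe are hypothetical; a Python dict stands in for the LinkedHashMap because it also preserves insertion order.

```python
def dedupe(pairs):
    """Single pass over (key, error_rate, name) records.

    Returns the retained names in first-seen key order and the number
    of duplicates removed. One table entry is held per unique key, so
    memory grows with unique mapping positions, not with total reads.
    """
    best = {}        # key -> (error_rate, name) of the best pair so far
    duplicates = 0
    for key, error_rate, name in pairs:
        current = best.get(key)
        if current is None:
            best[key] = (error_rate, name)      # first pair at this position
        else:
            duplicates += 1                     # one of the two is discarded
            if error_rate < current[0]:
                best[key] = (error_rate, name)  # keep the lower error rate
    return [name for _, name in best.values()], duplicates
```

Each record costs one lookup and at most one store, which gives the expected linear running time, and the input is read exactly once.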

Quality Assurance

The tool reports statistics for pipeline monitoring and quality control:

Count of duplicate reads and bases removed
Count of unmapped reads and bases processed
Count of retained reads and bases in final output

Advantages Over Sequence-Based Deduplication

Precision: Uses exact mapping coordinates rather than approximate sequence similarity
Speed: Comparing coordinates is faster than comparing full sequences
Memory Efficiency: Stores compact coordinate tuples rather than full sequences
Mapping-Aware: Leverages existing alignment information for more accurate duplicate detection

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org