DedupeByMapping
Deduplicates mapped reads based on pair mapping coordinates.
Basic Usage
dedupebymapping.sh in=<file> out=<file>
DedupeByMapping removes duplicate reads from mapped SAM/BAM files based on their mapping coordinates. This tool is specifically designed for mapped reads and uses pair mapping coordinates to identify duplicates, making it more accurate than sequence-based deduplication for mapped data.
Parameters
Parameters control input/output handling, duplicate-detection criteria, and memory management. Input must be a SAM/BAM file containing mapped reads; duplicates are identified by mapping coordinates, not by sequence similarity.
Input/Output Parameters
- in=<file>
- Input SAM/BAM file containing mapped reads. The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard input. File must contain mapping information (SAM/BAM format).
- out=<file>
- Output file for deduplicated reads. The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard output. Output format matches input format.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true. When false, program will exit with error if output file already exists.
- ziplevel=2
- (zl) Compression level for output files, from 1 (lowest/fastest) through 9 (maximum/slowest). Lower levels are faster but produce larger files. Default: 2.
Deduplication Parameters
- keepunmapped=t
- (ku) Keep unmapped reads in the output. This refers to unmapped single-ended reads or pairs where both reads are unmapped. When true, unmapped reads pass through without deduplication. When false, unmapped reads are discarded. Default: true.
- keepsingletons=t
- (ks) Keep all pairs in which only one read mapped (singletons). When true, singletons are retained even if they map to the same location. When false, singletons are deduplicated by their mapping coordinates and duplicate singletons are discarded. Default: true.
- ignorepairorder=f
- (ipo) If true, consider reverse-complementary pairs as duplicates: pairs mapping to the same coordinates but in opposite orientations are treated as duplicates of each other. If false, read pair order matters for duplicate detection. Default: false.
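The flag descriptions above can be read as a small routing decision for each read or pair. The following Python sketch restates that reading; the function name `route` and its return labels are hypothetical, and the tool's actual Java logic may differ in detail.

```python
def route(mapped1, mapped2=None, keep_unmapped=True, keep_singletons=True):
    """Decide what happens to a read or pair: 'keep', 'discard', or 'dedupe'.

    mapped1/mapped2 are booleans; mapped2=None means a single-ended read.
    This mirrors the documented flag descriptions, not the tool's source code.
    """
    paired = mapped2 is not None
    n_mapped = int(bool(mapped1)) + int(bool(mapped2))
    if n_mapped == 0:
        # Unmapped single-ended read, or a pair with both reads unmapped.
        return "keep" if keep_unmapped else "discard"
    if paired and n_mapped == 1:
        # Singleton: pass through untouched, or deduplicate by coordinates.
        return "keep" if keep_singletons else "dedupe"
    # Mapped single-ended read or fully mapped pair: subject to deduplication.
    return "dedupe"

print(route(False, False))                        # keep
print(route(False, False, keep_unmapped=False))   # discard
print(route(True, False))                         # keep
print(route(True, False, keep_singletons=False))  # dedupe
print(route(True, True))                          # dedupe
```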
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: 3g for this tool.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to prevent hanging on memory exhaustion.
- -da
- Disable assertions. Can provide minor performance improvement in production environments by skipping internal consistency checks.
Examples
Basic Deduplication
dedupebymapping.sh in=mapped_reads.sam out=deduped_reads.sam
Remove duplicate reads from a SAM file based on mapping coordinates. Keeps unmapped reads and singletons by default.
Strict Deduplication
dedupebymapping.sh in=mapped_reads.bam out=deduped_reads.bam keepunmapped=f keepsingletons=f
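Discard unmapped reads and deduplicate singletons in addition to fully mapped pairs. With keepunmapped=f, unmapped reads are dropped; with keepsingletons=f, singletons that share mapping coordinates are treated as duplicates and the extra copies are discarded.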
Ignore Pair Orientation
dedupebymapping.sh in=mapped_reads.sam out=deduped_reads.sam ignorepairorder=t
Consider pairs mapping to the same coordinates as duplicates regardless of their relative orientation. Useful for protocols where read orientation may vary.
High Memory Processing
dedupebymapping.sh -Xmx32g in=large_dataset.bam out=deduped_large.bam
Process a large dataset with a 32 GB Java heap. Memory use grows with the number of unique mapping positions, so datasets with many distinct positions may need more than the default allocation.
Algorithm Details
Coordinate-Based Duplicate Detection
DedupeByMapping detects duplicates by mapping coordinates. Each read pair is converted to a Quad object that holds the chromosome IDs and start positions of both reads, and pairs with identical Quads are treated as duplicates.
Core Algorithm Strategy
- Quad-Based Hashing: Each read pair is converted to a "Quad" object containing start positions and chromosome IDs for both reads in the pair
- Coordinate Precision: Uses exact mapping coordinates (start/stop positions) rather than approximate sequence similarity
- Strand-Aware Processing: Considers read strand orientation when calculating mapping positions for duplicate detection
- Memory-Efficient Design: Uses LinkedHashMap structures whose initial capacity is sized from available memory (capped at ~2M entries)
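The quad-based key can be sketched in a few lines of Python. This is an illustration only: the `Alignment` record, the `Quad` tuple, the use of the rightmost coordinate for reverse-strand reads, and the (-1, -1) encoding of an unmapped mate are all assumptions of this sketch, not the tool's actual Java classes or rules.

```python
from typing import NamedTuple, Optional

class Alignment(NamedTuple):
    """Hypothetical minimal view of one mapped read (not a BBTools class)."""
    contig_id: int   # integer ID of the reference sequence
    start: int       # leftmost aligned position
    stop: int        # rightmost aligned position
    reverse: bool    # True if the read maps to the minus strand

class Quad(NamedTuple):
    """Composite key: (chr1, start1, chr2, start2)."""
    chr1: int
    start1: int
    chr2: int
    start2: int

def strand_aware_start(a: Alignment) -> int:
    # Assumption: a minus-strand read is anchored at its rightmost coordinate,
    # so the key reflects where the read begins in sequencing order.
    return a.stop if a.reverse else a.start

def make_quad(r1: Alignment, r2: Optional[Alignment]) -> Quad:
    # A missing or unmapped mate is encoded as (-1, -1) in this sketch.
    c2, s2 = (r2.contig_id, strand_aware_start(r2)) if r2 else (-1, -1)
    return Quad(r1.contig_id, strand_aware_start(r1), c2, s2)

pair_a = make_quad(Alignment(0, 100, 249, False), Alignment(0, 300, 449, True))
pair_b = make_quad(Alignment(0, 100, 240, False), Alignment(0, 310, 449, True))
seen = {pair_a}
print(pair_b in seen)  # True: both pairs reduce to the same key
```

Because the key is a small tuple of integers, set or hash-map membership is enough to detect a duplicate candidate; no sequence comparison is involved.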
Duplicate Resolution Strategy
When multiple read pairs map to identical coordinates:
- Quality-Based Selection: Retains the pair with the lowest expected error rate
- Expected Error Calculation: Sums expected errors across both reads in a pair, normalized by total pair length
- Deterministic Resolution: Ensures consistent results across runs by using a well-defined quality metric
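As a rough illustration of the quality-based selection above, the sketch below computes a per-base expected error rate for a pair and keeps the better of two duplicates. It assumes Phred-scaled qualities, where the error probability of a base with quality Q is 10^(-Q/10); the function names, the tuple layout, and the tie rule (keep the pair seen first) are hypothetical.

```python
def expected_error_rate(quals1, quals2):
    """Expected errors summed over both reads, divided by total pair length."""
    quals = list(quals1) + list(quals2)
    if not quals:
        return 1.0  # assumption: treat an empty pair as worst case
    # Phred definition: P(error) = 10^(-Q/10)
    return sum(10 ** (-q / 10) for q in quals) / len(quals)

def better_pair(kept, candidate):
    """Each pair is (name, quals1, quals2). Ties keep the earlier pair."""
    if expected_error_rate(*candidate[1:]) < expected_error_rate(*kept[1:]):
        return candidate
    return kept

pair_x = ("x", [30, 30, 30], [30, 30, 30])
pair_y = ("y", [40, 40, 40], [20, 40, 40])
print(better_pair(pair_x, pair_y)[0])  # x
```

Here pair_y has mostly higher qualities, but its single Q20 base contributes 0.01 expected errors, which outweighs the rest and gives it the higher overall rate.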
Data Structure Implementation
- Contig Mapping: HashMap converts reference sequence names to integer IDs for efficient storage
- Quad Storage: LinkedHashMap stores mapping coordinates as composite keys (chr1, start1, chr2, start2)
- Name Resolution: Separate HashMap handles read name to Read object mapping for pair reconstruction
- Memory Scaling: Initial capacity scales with available memory: min(2M, max(80K, available_memory/4000))
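The capacity formula and the contig-name lookup can be worked through with a short Python sketch. The unit of `available_memory` is assumed to be bytes, integer division is assumed, and both function names are hypothetical.

```python
def initial_capacity(available_memory: int) -> int:
    """min(2M, max(80K, available_memory/4000)), with memory assumed in bytes."""
    return min(2_000_000, max(80_000, available_memory // 4000))

def contig_to_id(name: str, table: dict) -> int:
    """Return the integer ID for a contig name, assigning the next ID if new."""
    return table.setdefault(name, len(table))

MIB = 1 << 20
GIB = 1 << 30
for mem in (100 * MIB, 3 * GIB, 32 * GIB):
    print(mem, initial_capacity(mem))

ids = {}
print([contig_to_id(n, ids) for n in ("chr1", "chr2", "chr1")])  # [0, 1, 0]
```

Under these assumptions, 100 MiB hits the 80,000 floor, 3 GiB yields 805,306 entries, and 32 GiB is clamped to the 2,000,000 ceiling.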
Pair Order Handling
The algorithm can operate in two modes controlled by the ignorepairorder parameter:
- Order-Sensitive (default): Pairs (A,B) and (B,A) mapping to the same coordinates are considered different
- Order-Insensitive: Pairs are normalized so (A,B) and (B,A) are treated as identical duplicates
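One way to picture the two modes is as an optional normalization step when the key is built. In this Python sketch, `pair_key` and its arguments are hypothetical; sorting the two (chromosome, start) halves is one possible normalization, not necessarily the one the tool uses.

```python
def pair_key(chr1, start1, chr2, start2, ignore_pair_order=False):
    """Build the (chr1, start1, chr2, start2) key for a pair."""
    first, second = (chr1, start1), (chr2, start2)
    if ignore_pair_order and second < first:
        # Normalize so the smaller (chr, start) half always comes first.
        first, second = second, first
    return first + second

ab = (0, 100, 0, 400)   # read A first, read B second
ba = (0, 400, 0, 100)   # same coordinates, opposite order

print(pair_key(*ab) == pair_key(*ba))  # False: order-sensitive
print(pair_key(*ab, ignore_pair_order=True)
      == pair_key(*ba, ignore_pair_order=True))  # True: order-insensitive
```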
Processing Modes
- Unsorted Mode (default): Processes reads in any order, suitable for most input files
- Sorted Mode (experimental): Intended for coordinate-sorted input, but not yet fully implemented
Performance Characteristics
- Memory Usage: Scales with number of unique mapping positions, not total read count
- Time Complexity: Expected O(n) for n reads, since each read requires a constant number of hash table lookups
- Scalability: Can handle datasets with millions of reads given sufficient memory
- I/O Efficiency: Single-pass algorithm minimizes disk access
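These characteristics follow from a single-pass, hash-based loop, which the Python sketch below illustrates. It is not the tool's code: the `dedupe` function and the (key, error_rate, name) record layout are hypothetical, and a Python dict stands in for Java's LinkedHashMap because both preserve insertion order.

```python
def dedupe(records):
    """Single pass over (key, error_rate, name); keep the best record per key.

    The map holds one entry per unique key, so memory grows with the number
    of unique mapping positions rather than with the total read count.
    """
    best = {}        # key -> (error_rate, name), insertion-ordered
    duplicates = 0
    for key, error_rate, name in records:
        kept = best.get(key)
        if kept is None:
            best[key] = (error_rate, name)
            continue
        duplicates += 1
        if error_rate < kept[0]:
            best[key] = (error_rate, name)  # replace with the better pair
    return [name for _, name in best.values()], duplicates

records = [
    ((0, 100, 0, 400), 0.010, "pairA"),
    ((0, 100, 0, 400), 0.002, "pairB"),
    ((1, 50, 1, 300), 0.005, "pairC"),
    ((0, 100, 0, 400), 0.004, "pairD"),
]
kept, removed = dedupe(records)
print(kept, removed)  # ['pairB', 'pairC'] 2
```

Each record costs one lookup and at most one store, which is where the expected linear running time comes from.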
Quality Assurance
The tool reports run statistics that can be used for pipeline monitoring and quality control:
- Count of duplicate reads and bases removed
- Count of unmapped reads and bases processed
- Count of retained reads and bases in final output
Advantages Over Sequence-Based Deduplication
- Precision: Uses exact mapping coordinates rather than approximate sequence similarity
- Speed: Comparing coordinates is faster than comparing full sequences
- Memory Efficiency: Stores compact coordinate tuples rather than full sequences
- Mapping-Aware: Leverages existing alignment information for more accurate duplicate detection
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org