FixGaps
Uses paired read insert sizes to estimate the correct length of scaffold gaps, and resizes incorrectly-sized gaps.
Basic Usage
fixgaps.sh in=mapped.sam ref=scaffolds.fa out=fixed.fa
FixGaps requires mapped paired-end reads in SAM/BAM format and a reference scaffold assembly. The tool analyzes insert size distributions from properly mapped read pairs to estimate the correct size of gaps (consecutive N regions) in scaffolds, then adjusts gap sizes accordingly.
Parameters
Parameters are organized by their function in the gap correction process. The tool uses insert size statistics from mapped paired reads to determine optimal gap lengths.
Standard parameters
- in=<file>
- Reads mapped to the reference; should be sam or bam. Must be paired-end reads with proper mapping information including template length (TLEN) field.
- ref=<file>
- Reference scaffold assembly; may be fasta or fastq. Should contain gaps marked with consecutive N characters.
- out=<file>
- Modified reference output file; may be fasta or fastq. Will contain the same scaffolds with adjusted gap sizes.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
Processing parameters
- gap=10
- Consider any consecutive streak of Ns at least this long to be a scaffold break. Gaps will not be resized to less than this minimum value. Default: 10.
- border=0.4
- Ignore the outermost (border*readlen) of an insert (read pair) when incrementing coverage. A higher value is more accurate but requires more coverage and/or longer inserts. Range: 0-1. Default: 0.4 (ignores 40% of read length from each end of insert).
- mindepth=10
- Minimum spanning read pairs required to correct a gap. Only gaps with at least this many supporting read pairs will be resized. Default: 10.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Gap Correction
fixgaps.sh in=paired_reads.bam ref=scaffolds.fa out=fixed_scaffolds.fa
Corrects gaps in scaffold assembly using paired read insert sizes. Reads must be properly mapped with mate pair information.
Sensitive Gap Correction
fixgaps.sh in=reads.sam ref=assembly.fa out=corrected.fa border=0.2 mindepth=5
Uses less stringent trimming (border=0.2) and lower depth requirement (mindepth=5) for more sensitive gap detection and correction.
Conservative Gap Correction
fixgaps.sh in=high_cov.bam ref=scaffolds.fa out=fixed.fa gap=50 mindepth=20
Uses conservative settings with larger minimum gap size (50 Ns) and higher depth requirement (20 read pairs) for high-confidence corrections only.
Algorithm Details
Gap Size Estimation Strategy
FixGaps implements a multi-threaded insert size analysis algorithm using `AtomicIntegerArray` and `AtomicLongArray` data structures for thread-safe gap size estimation:
Insert Size Collection
The `processRead()` method processes mapped paired reads and collects insert size information from properly paired reads that meet these criteria:
- Both reads mapped to the same chromosome/scaffold (`sl.pairedOnSameChrom()`)
- Marked as proper pair in SAM flags (`sl.properPair()`)
- Primary alignment (`sl.primary()` and `!sl.supplementary()`)
- Leftmost read of the pair (`sl.leftmost()` to avoid double-counting)
Coverage-Aware Trimming
The `add()` method in the Scaffold class trims the outer portions of each insert when calculating coverage:
- Calculates trim amount: `int trim=(int)(sl.length()*trimFraction)`
- Default trimFraction=0.4 removes 40% of read length from each end
- Applies trimming: `start+=trim; stop-=trim` before incrementing arrays
- Updates atomic arrays: `depthArray.incrementAndGet(i)` and `insertArray.addAndGet(i, insertSize)`
Gap Detection and Analysis
The `fixScaffold()` method identifies gaps using a streak-counting approach:
- Scans reference sequences for consecutive N characters using byte-level comparison
- Only considers streaks ≥ scaffoldBreakNs parameter as actual gaps
- Enforces 300bp buffer:
i-streak>300 && i<bases.length-300
to avoid scaffold ends - Calculates coverage depth and insert size statistics around each gap using array lookups
Size Correction Algorithm
For each detected gap, the `fixScaffold()` method performs statistical analysis using specific calculations:
- Coverage Assessment: Calculates `avgDepth=((depthArray.get(i-200-streak)+depthArray.get(i+200))/2)` around the gap region
- Insert Size Distribution: Builds histogram using `Tools.makeHistogram(insertCounts, buckets)` with 1000 buckets and 20,000 maximum insert size
- Percentile-Based Correction: Uses coverage-based percentile ranking: `percentile=(int)(buckets*Tools.max(0.5, 1.0-depthSum/(double)(avgDepth+depthSum)))`
- Gap Adjustment: Compares actual gap size to `insertByPercentile[percentile]` proxy value and adjusts with `Tools.max(scaffoldBreakNs, streak+dif)`
Statistical Validation
The tool employs several quality controls implemented in the gap detection logic:
- Minimum Depth: Only corrects gaps with sufficient supporting read pairs (mindepth parameter enforced through depth calculations)
- Percentile Ranking: Uses coverage percentiles to select appropriate insert size from distribution histogram
- Conservative Bounds: Never reduces gaps below minimum: `Tools.max(scaffoldBreakNs, streak+dif)`
- Local Context: Considers local coverage patterns with 200bp flanking regions for depth calculation
Data Structures and Performance
FixGaps uses thread-safe atomic data structures for concurrent processing:
- AtomicIntegerArray depthArray: Thread-safe depth counting per reference position with `incrementAndGet()` operations
- AtomicLongArray insertArray: Thread-safe insert size accumulation per position with `addAndGet()` operations
- AtomicLongArray insertCounts: Global insert size histogram with 20,000 buckets for distribution analysis
- ProcessThread extends Thread: Multi-threaded read processing using concurrent read streams
- LinkedHashMap refMap/refMap2: Dual scaffold lookup tables for full and trimmed reference names
Output Statistics
The tool reports detailed statistics tracked through atomic counters:
- Average insert size: `totalAverageInsert=totalInsertSum/(double)totalInsertCount`
- Gap modification counts: `gapsUnchanged`, `gapsWidened`, `gapsNarrowed` incremented during processing
- N character tracking: `nsTotal`, `nsAdded`, `nsRemoved` accumulated during gap adjustments
- Processing throughput: `Tools.timeReadsBasesProcessed()` reporting reads and bases per second
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org