FixGaps

Script: fixgaps.sh Package: consensus Class: FixScaffoldGaps.java

Uses paired read insert sizes to estimate the correct length of scaffold gaps, and resizes incorrectly-sized gaps.

Basic Usage

fixgaps.sh in=mapped.sam ref=scaffolds.fa out=fixed.fa

FixGaps requires mapped paired-end reads in SAM/BAM format and a reference scaffold assembly. The tool analyzes insert size distributions from properly mapped read pairs to estimate the correct size of gaps (consecutive N regions) in scaffolds, then adjusts gap sizes accordingly.

Parameters

Parameters are organized by their function in the gap correction process. The tool uses insert size statistics from mapped paired reads to determine optimal gap lengths.

Standard parameters

in=<file>
Reads mapped to the reference; should be sam or bam. Must be paired-end reads with proper mapping information including template length (TLEN) field.
ref=<file>
Reference scaffold assembly; may be fasta or fastq. Should contain gaps marked with consecutive N characters.
out=<file>
Modified reference output file; may be fasta or fastq. Will contain the same scaffolds with adjusted gap sizes.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file.

Processing parameters

gap=10
Consider any consecutive streak of Ns at least this long to be a scaffold break. Gaps will not be resized to less than this minimum value. Default: 10.
border=0.4
Ignore the outermost (border*readlen) of an insert (read pair) when incrementing coverage. A higher value is more accurate but requires more coverage and/or longer inserts. Range: 0-1. Default: 0.4 (ignores 40% of read length from each end of insert).
mindepth=10
Minimum spanning read pairs required to correct a gap. Only gaps with at least this many supporting read pairs will be resized. Default: 10.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Gap Correction

fixgaps.sh in=paired_reads.bam ref=scaffolds.fa out=fixed_scaffolds.fa

Corrects gaps in scaffold assembly using paired read insert sizes. Reads must be properly mapped with mate pair information.

Sensitive Gap Correction

fixgaps.sh in=reads.sam ref=assembly.fa out=corrected.fa border=0.2 mindepth=5

Uses less stringent trimming (border=0.2) and lower depth requirement (mindepth=5) for more sensitive gap detection and correction.

Conservative Gap Correction

fixgaps.sh in=high_cov.bam ref=scaffolds.fa out=fixed.fa gap=50 mindepth=20

Uses conservative settings with larger minimum gap size (50 Ns) and higher depth requirement (20 read pairs) for high-confidence corrections only.

Algorithm Details

Gap Size Estimation Strategy

FixGaps implements a multi-threaded insert size analysis algorithm using `AtomicIntegerArray` and `AtomicLongArray` data structures for thread-safe gap size estimation:

Insert Size Collection

The `processRead()` method processes mapped paired reads and collects insert size information from properly paired reads that meet these criteria:

Coverage-Aware Trimming

The `add()` method in the Scaffold class trims the outer portions of each insert when calculating coverage:

Gap Detection and Analysis

The `fixScaffold()` method identifies gaps using a streak-counting approach:

Size Correction Algorithm

For each detected gap, the `fixScaffold()` method performs statistical analysis using specific calculations:

  1. Coverage Assessment: Calculates `avgDepth=((depthArray.get(i-200-streak)+depthArray.get(i+200))/2)` around the gap region
  2. Insert Size Distribution: Builds histogram using `Tools.makeHistogram(insertCounts, buckets)` with 1000 buckets and 20,000 maximum insert size
  3. Percentile-Based Correction: Uses coverage-based percentile ranking: `percentile=(int)(buckets*Tools.max(0.5, 1.0-depthSum/(double)(avgDepth+depthSum)))`
  4. Gap Adjustment: Compares actual gap size to `insertByPercentile[percentile]` proxy value and adjusts with `Tools.max(scaffoldBreakNs, streak+dif)`

Statistical Validation

The tool employs several quality controls implemented in the gap detection logic:

Data Structures and Performance

FixGaps uses thread-safe atomic data structures for concurrent processing:

Output Statistics

The tool reports detailed statistics tracked through atomic counters:

Support

For questions and support: