Insertion Calling Pipeline
Specialized variant calling pipeline designed to detect insertions longer than the raw read length by lengthening reads through error correction, extension, and merging prior to alignment. Optimized for high-coverage paired-end sequencing data.
Overview
This pipeline addresses a fundamental limitation in variant calling: detecting insertions longer than the sequencing read length. By combining error correction, read extension, and sophisticated merging techniques, it creates longer effective reads that can span large insertions, enabling accurate detection of structural variants that would otherwise be missed.
Prerequisites
System Requirements
- BBTools suite installed
- Reference genome file (reference.fa)
- High-coverage paired-end sequencing data (≥40x recommended)
- Sufficient memory for tadpole and bbmerge operations
Input Requirements
- Interleaved paired-end reads file named "reads.fq.gz" (a sketch for interleaving separate R1/R2 files appears after this list)
- Reference genome file named "reference.fa"
- High coverage (≥40x) for reliable insertion detection
- 2x100bp reads (or similar paired-end data)
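If your data is in separate R1/R2 files, BBTools' reformat.sh can interleave them; the input filenames below are placeholders for your own data:
reformat.sh in1=sample_R1.fq.gz in2=sample_R2.fq.gz out=reads.fq.gz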
Pipeline Stages
1. Initial Read Processing
1.1 Read Reordering and Deduplication
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical
Reorders reads for improved processing speed and removes optical duplicates that could interfere with accurate insertion calling.
1.2 Tile-Based Quality Filtering
filterbytile.sh in=clumped.fq.gz out=filtered_by_tile.fq.gz
Removes reads from low-quality regions of the flowcell based on positional quality patterns.
1.3 Adapter Trimming and Quality Control
bbduk.sh in=filtered_by_tile.fq.gz out=trimmed.fq.gz maxns=0 ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=85 ref=adapters ftm=5 ordered
Trims adapters from the right end (ktrim=r) and discards any read containing an N (maxns=0), ensuring high-quality input for downstream processing. The adapter k-mer length steps down from 23 to 11 at read tips (mink=11), one mismatch is tolerated (hdist=1), tbo and tpe trim by pair overlap and trim both reads of a pair to equal length, and ftm=5 right-trims read length to a multiple of 5.
1.4 Contaminant Removal
bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered
Removes synthetic artifacts and PhiX spike-ins that could create false positive variant calls.
2. Multi-Stage Error Correction
2.1 Error Correction Phase 1: Overlap-Based
bbmerge.sh in=filtered.fq.gz out=ecco.fq.gz ecco mix vstrict ordered ihist=ihist_merge1.txt
Performs overlap-based error correction (ecco) using very strict overlap criteria (vstrict) to minimize miscorrection. Reads are corrected but not merged at this stage; the 'mix' option keeps both reads of each pair together in the single output stream. The insert-size histogram is written to ihist_merge1.txt.
2.2 Error Correction Phase 2: Clumping-Based
clumpify.sh in=ecco.fq.gz out=eccc.fq.gz passes=4 ecc unpair repair
Applies clumping-based error correction over four passes. The 'unpair' flag temporarily treats reads as unpaired so both mates can be clumped and corrected, and 'repair' restores correct pairing afterward.
2.3 Error Correction Phase 3: K-mer Based
tadpole.sh in=eccc.fq.gz out=ecct.fq.gz ecc ordered
Final k-mer based error correction using Tadpole. For large datasets, may require prefilter flags to manage memory usage.
3. Read Extension
3.1 K-mer Based Read Extension
tadpole.sh in=ecct.fq.gz out=extended.fq.gz ordered mode=extend el=20 er=20 k=62
Extends reads by 20 bases on both ends (el=20, er=20) using k-mer overlap information. This crucial step enables detection of insertions longer than the original read length.
4. Advanced Read Merging
4.1 K-mer-Assisted Merging with Extension
bbmerge-auto.sh in=extended.fq.gz out=merged.fq.gz outu=unmerged.fq.gz rem k=81 extend2=120 zl=8 ordered
Merges pairs using k-mer-assisted extension: extend2=120 allows reads to be extended by up to 120 bases during merge attempts, k=81 sets the extension k-mer length, and rem (requireextensionmatch) rejects merges whose predicted insert size disagrees before and after extension. bbmerge-auto.sh sizes its k-mer tables to available memory automatically; zl=8 writes output at gzip compression level 8.
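A back-of-envelope calculation shows why these steps matter: with 2x100bp input, each read grows to 100 + 20 + 20 = 140bp after Tadpole extension, a merged pair with minimal overlap then approaches 2 x 140 = 280bp, and extend2=120 can add up to 120 further bases per read during merging. Successfully merged fragments can therefore span insertions of several hundred bases that no single 100bp read could cover (exact achievable lengths depend on insert size and merge success).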
5. Alignment and Variant Calling
5.1 High-Sensitivity Alignment
bbmap.sh in=merged.fq.gz out=merged.sam.gz slow bs=bs.sh pigz unpigz ref=reference.fa
Maps only the merged reads (which can span large insertions) using slow, high-sensitivity alignment mode. Generates a BAM creation script (bs.sh) for visualization.
5.2 Variant Detection
callvariants.sh in=merged.sam.gz out=vars.txt vcf=vars.vcf.gz ref=reference.fa ploidy=1
Calls variants including insertions from the alignment. Ploidy may need adjustment for different organisms. For insertion-only calling, add "calldel=f callsub=f" to save memory by not calling deletions or substitutions.
5.3 BAM File Generation
sh bs.sh
Executes the generated script to create a BAM file for visualization in genome browsers like IGV.
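If the generated bs.sh does not suit your environment, an equivalent conversion can be done directly; a minimal sketch, assuming samtools is installed (output names are illustrative):
samtools sort -o merged.bam merged.sam.gz
samtools index merged.bam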
Basic Usage
# 1. Prepare input files
ln -s your_paired_reads.fq.gz reads.fq.gz
ln -s your_reference_genome.fa reference.fa
# 2. Run the insertion calling pipeline (a sketch of this script appears below)
bash callInsertions.sh
# 3. Examine results
# - vars.txt: Text format variant calls
# - vars.vcf.gz: VCF format variant calls
# - merged.sam.gz: Alignment file
# - merged.bam: BAM file for visualization (after bs.sh)
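The callInsertions.sh invoked above presumably chains the stage commands in order; a minimal sketch assembled from the stages documented above (error handling beyond set -e omitted):
#!/bin/bash
set -e
# Stage 1: reorder, deduplicate, filter, trim
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical
filterbytile.sh in=clumped.fq.gz out=filtered_by_tile.fq.gz
bbduk.sh in=filtered_by_tile.fq.gz out=trimmed.fq.gz maxns=0 ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=85 ref=adapters ftm=5 ordered
bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered
# Stage 2: three-phase error correction
bbmerge.sh in=filtered.fq.gz out=ecco.fq.gz ecco mix vstrict ordered ihist=ihist_merge1.txt
clumpify.sh in=ecco.fq.gz out=eccc.fq.gz passes=4 ecc unpair repair
tadpole.sh in=eccc.fq.gz out=ecct.fq.gz ecc ordered
# Stage 3: read extension
tadpole.sh in=ecct.fq.gz out=extended.fq.gz ordered mode=extend el=20 er=20 k=62
# Stage 4: merging
bbmerge-auto.sh in=extended.fq.gz out=merged.fq.gz outu=unmerged.fq.gz rem k=81 extend2=120 zl=8 ordered
# Stage 5: alignment, variant calling, BAM generation
bbmap.sh in=merged.fq.gz out=merged.sam.gz slow bs=bs.sh pigz unpigz ref=reference.fa
callvariants.sh in=merged.sam.gz out=vars.txt vcf=vars.vcf.gz ref=reference.fa ploidy=1
sh bs.sh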
Algorithm Strategy
Read Lengthening Approach
The pipeline uses a multi-pronged strategy to effectively lengthen reads:
- Error Correction: Cleans reads to maximize merge success
- K-mer Extension: Extends reads using overlap information
- Advanced Merging: Combines paired reads with sophisticated algorithms
- Selective Alignment: Uses only successfully merged reads
Insertion Detection Strategy
- Coverage Requirement: ≥40x coverage ensures sufficient reads span insertions
- Quality Control: Strict filtering prevents false positives
- Read Length Maximization: Every step optimized to create longest possible reads
- Sensitive Alignment: Slow alignment mode maximizes insertion detection
Parameter Optimization
Memory Considerations
For large genomes, the following tools may need memory management flags:
- tadpole.sh: Add "prefilter=2" for large datasets (example after this list)
- bbmerge.sh: Add "prefilter=2" if memory limited
- callvariants.sh: Add "prefilter" for very large datasets
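For example, the phase-3 error-correction command with the prefilter flag added:
tadpole.sh in=eccc.fq.gz out=ecct.fq.gz ecc ordered prefilter=2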
Coverage Optimization
- ≥40x coverage: Optimal for reliable insertion detection (a quick coverage check is sketched after this list)
- Lower coverage: May miss insertions or require parameter adjustment
- Higher coverage: Improves sensitivity but increases computational requirements
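Coverage can be estimated before running the pipeline with BBMap's per-scaffold coverage statistics (nodisk avoids writing an index to disk):
bbmap.sh in=reads.fq.gz ref=reference.fa covstats=covstats.txt nodisk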
Insertion-Specific Parameters
- Extension parameters: el=20, er=20 optimized for 2x100bp reads (a scaling sketch for longer reads follows this list)
- Merging parameters: k=81, extend2=120 for maximum merging
- Alignment parameters: slow mode for maximum sensitivity
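For longer reads these values would plausibly scale up; the command below is an illustrative assumption for 2x150bp data, not a tested recommendation:
tadpole.sh in=ecct.fq.gz out=extended.fq.gz ordered mode=extend el=50 er=50 k=93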
Output Files
- vars.txt - Variant calls in text format
- vars.vcf.gz - Variant calls in VCF format
- merged.sam.gz - Alignment file of merged reads
- merged.bam - BAM file for genome browser visualization
- ihist_merge1.txt - Insert size histogram from the overlap-based error-correction stage
- bs.sh - Script for BAM file generation
- clumped.fq.gz - Deduplicated reads
- filtered_by_tile.fq.gz - Tile-filtered reads
- trimmed.fq.gz - Adapter-trimmed reads
- filtered.fq.gz - Contaminant-free reads
- ecco.fq.gz, eccc.fq.gz, ecct.fq.gz - Error-corrected reads
- extended.fq.gz - Length-extended reads
- unmerged.fq.gz - Reads that could not be merged
Variant Calling Options
Insertion-Only Mode
callvariants.sh in=merged.sam.gz out=vars.txt vcf=vars.vcf.gz ref=reference.fa ploidy=1 calldel=f callsub=f
For insertion-focused analysis, disable deletion and substitution calling to save memory and focus on insertions.
Ploidy Adjustment
- Haploid organisms: ploidy=1 (default)
- Diploid organisms: ploidy=2 (example after this list)
- Polyploid organisms: Adjust ploidy accordingly
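For example, insertion calling in a diploid organism:
callvariants.sh in=merged.sam.gz out=vars.txt vcf=vars.vcf.gz ref=reference.fa ploidy=2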
Quality Control and Validation
Pipeline Success Indicators
- High merge rate: Check ihist_merge1.txt for successful overlaps (see the snippet after this list)
- Alignment statistics: Monitor mapping rates in bbmap output
- Variant quality: Examine quality scores in VCF output
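The summary statistics at the top of the histogram can be viewed directly, assuming BBMerge's '#'-prefixed header lines (e.g. #Mean, #PercentOfPairs):
grep '^#' ihist_merge1.txt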
Common Issues
- Low merge rate: May indicate insufficient overlap or low quality
- Memory errors: Add prefilter flags to memory-intensive steps
- Poor insertion calls: Verify coverage depth and read quality
Visualization
The pipeline generates files suitable for genome browser visualization:
- BAM file: Load merged.bam in IGV or other browsers
- VCF file: Display variant calls alongside alignments
- Coverage tracks: Visualize read depth across insertions
Troubleshooting
- Low insertion detection: Verify coverage depth and read quality
- Memory issues: Add prefilter=2 to tadpole and bbmerge steps
- Poor merge rates: Check read quality and overlap parameters
- Alignment failures: Verify reference genome compatibility
- False positives: Examine adapter trimming and artifact removal steps
- Missing large insertions: Increase extension parameters or merge settings
Performance Notes
- Pipeline optimized for 2x100bp reads with 40x+ coverage
- Different read lengths may require parameter adjustment
- Higher coverage improves sensitivity but increases runtime
- Memory requirements scale with genome size and coverage depth
- Consider data subsampling for initial parameter optimization (a sketch follows)
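A sketch for subsampling with reformat.sh; samplerate=0.25 keeps roughly a quarter of the pairs (adjust to your target coverage):
reformat.sh in=reads.fq.gz out=subsampled.fq.gz samplerate=0.25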