Insertion Calling Pipeline
Specialized variant calling pipeline designed to detect insertions longer than the raw read length by lengthening reads through error correction, extension, and merging prior to alignment. Optimized for high-coverage paired-end sequencing data.
Overview
This pipeline addresses a fundamental limitation in variant calling: detecting insertions longer than the sequencing read length. By combining error correction, read extension, and sophisticated merging techniques, it creates longer effective reads that can span large insertions, enabling accurate detection of structural variants that would otherwise be missed.
Prerequisites
System Requirements
- BBTools suite installed
- Reference genome file (reference.fa)
- High-coverage paired-end sequencing data (≥40x recommended)
- Sufficient memory for tadpole and bbmerge operations
Input Requirements
- Interleaved paired-end reads file named "reads.fq.gz" (a sketch for interleaving separate R1/R2 files appears after this list)
- Reference genome file named "reference.fa"
- High coverage (≥40x) for reliable insertion detection
- 2x100bp reads (or similar paired-end data)
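If your data is in separate R1/R2 files, BBTools' reformat.sh can interleave them; the input filenames below are placeholders for your own data:
reformat.sh in1=sample_R1.fq.gz in2=sample_R2.fq.gz out=reads.fq.gz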
Pipeline Stages
1. Initial Read Processing
1.1 Read Reordering and Deduplication
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical
Reorders reads for improved processing speed and removes optical duplicates that could interfere with accurate insertion calling.
1.2 Tile-Based Quality Filtering
filterbytile.sh in=clumped.fq.gz out=filtered_by_tile.fq.gz
Removes reads from low-quality regions of the flowcell based on positional quality patterns.
1.3 Adapter Trimming and Quality Control
bbduk.sh in=filtered_by_tile.fq.gz out=trimmed.fq.gz maxns=0 ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=85 ref=adapters ftm=5 ordered
Trims adapters from the right end (ktrim=r) and discards any read containing an N (maxns=0), ensuring high-quality input for downstream processing. The adapter k-mer length steps down from 23 to 11 at read tips (mink=11), one mismatch is tolerated (hdist=1), tbo and tpe trim by pair overlap and trim both reads of a pair to equal length, and ftm=5 right-trims read length to a multiple of 5.
1.4 Contaminant Removal
bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered
Removes synthetic artifacts and PhiX spike-ins that could create false positive variant calls.
2. Multi-Stage Error Correction
2.1 Error Correction Phase 1: Overlap-Based
bbmerge.sh in=filtered.fq.gz out=ecco.fq.gz ecco mix vstrict ordered ihist=ihist_merge1.txt
Performs overlap-based error correction (ecco) using very strict overlap criteria (vstrict) to minimize miscorrection. Reads are corrected but not merged at this stage; the 'mix' option keeps both reads of each pair together in the single output stream. The insert-size histogram is written to ihist_merge1.txt.
2.2 Error Correction Phase 2: Clumping-Based
clumpify.sh in=ecco.fq.gz out=eccc.fq.gz passes=4 ecc unpair repair
Applies clumping-based error correction over four passes. The 'unpair' flag temporarily treats reads as unpaired so both mates can be clumped and corrected, and 'repair' restores correct pairing afterward.
2.3 Error Correction Phase 3: K-mer Based
tadpole.sh in=eccc.fq.gz out=ecct.fq.gz ecc ordered
Final k-mer based error correction using Tadpole. For large datasets, may require prefilter flags to manage memory usage.
3. Read Extension
3.1 K-mer Based Read Extension
tadpole.sh in=ecct.fq.gz out=extended.fq.gz ordered mode=extend el=20 er=20 k=62
Extends reads by 20 bases on both ends (el=20, er=20) using k-mer overlap information. This crucial step enables detection of insertions longer than the original read length.
4. Advanced Read Merging
4.1 K-mer-Assisted Merging with Extension
bbmerge-auto.sh in=extended.fq.gz out=merged.fq.gz outu=unmerged.fq.gz rem k=81 extend2=120 zl=8 ordered
Merges pairs using k-mer-assisted extension: extend2=120 allows reads to be extended by up to 120 bases during merge attempts, k=81 sets the extension k-mer length, and rem (requireextensionmatch) rejects merges whose predicted insert size disagrees before and after extension. bbmerge-auto.sh sizes its k-mer tables to available memory automatically; zl=8 writes output at gzip compression level 8.
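A back-of-envelope calculation shows why these steps matter: with 2x100bp input, each read grows to 100 + 20 + 20 = 140bp after Tadpole extension, a merged pair with minimal overlap then approaches 2 x 140 = 280bp, and extend2=120 can add up to 120 further bases per read during merging. Successfully merged fragments can therefore span insertions of several hundred bases that no single 100bp read could cover (exact achievable lengths depend on insert size and merge success).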
5. Alignment and Variant Calling
5.1 High-Sensitivity Alignment
bbmap.sh in=merged.fq.gz out=merged.sam.gz slow bs=bs.sh pigz unpigz ref=reference.fa
Maps only the merged reads (which can span large insertions) using slow, high-sensitivity alignment mode. Generates a BAM creation script (bs.sh) for visualization.
5.2 Variant Detection
callvariants.sh in=merged.sam.gz out=vars.txt vcf=vars.vcf.gz ref=reference.fa ploidy=1
Calls variants including insertions from the alignment. Ploidy may need adjustment for different organisms. For insertion-only calling, add "calldel=f callsub=f" to save memory by not calling deletions or substitutions.
5.3 BAM File Generation
sh bs.sh
Executes the generated script to create a BAM file for visualization in genome browsers like IGV.
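If the generated bs.sh does not suit your environment, an equivalent conversion can be done directly; a minimal sketch, assuming samtools is installed (output names are illustrative):
samtools sort -o merged.bam merged.sam.gz
samtools index merged.bam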
Basic Usage
# 1. Prepare input files
ln -s your_paired_reads.fq.gz reads.fq.gz
ln -s your_reference_genome.fa reference.fa
# 2. Run the insertion calling pipeline (a sketch of this script appears below)
bash callInsertions.sh
# 3. Examine results
# - vars.txt: Text format variant calls
# - vars.vcf.gz: VCF format variant calls
# - merged.sam.gz: Alignment file
# - merged.bam: BAM file for visualization (after bs.sh)
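The callInsertions.sh invoked above presumably chains the stage commands in order; a minimal sketch assembled from the stages documented above (error handling beyond set -e omitted):
#!/bin/bash
set -e
# Stage 1: reorder, deduplicate, filter, trim
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical
filterbytile.sh in=clumped.fq.gz out=filtered_by_tile.fq.gz
bbduk.sh in=filtered_by_tile.fq.gz out=trimmed.fq.gz maxns=0 ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=85 ref=adapters ftm=5 ordered
bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered
# Stage 2: three-phase error correction
bbmerge.sh in=filtered.fq.gz out=ecco.fq.gz ecco mix vstrict ordered ihist=ihist_merge1.txt
clumpify.sh in=ecco.fq.gz out=eccc.fq.gz passes=4 ecc unpair repair
tadpole.sh in=eccc.fq.gz out=ecct.fq.gz ecc ordered
# Stage 3: read extension
tadpole.sh in=ecct.fq.gz out=extended.fq.gz ordered mode=extend el=20 er=20 k=62
# Stage 4: merging
bbmerge-auto.sh in=extended.fq.gz out=merged.fq.gz outu=unmerged.fq.gz rem k=81 extend2=120 zl=8 ordered
# Stage 5: alignment, variant calling, BAM generation
bbmap.sh in=merged.fq.gz out=merged.sam.gz slow bs=bs.sh pigz unpigz ref=reference.fa
callvariants.sh in=merged.sam.gz out=vars.txt vcf=vars.vcf.gz ref=reference.fa ploidy=1
sh bs.sh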
Algorithm Strategy
Read Lengthening Approach
The pipeline uses a multi-pronged strategy to effectively lengthen reads:
- Error Correction: Cleans reads to maximize merge success
- K-mer Extension: Extends reads using overlap information
- Advanced Merging: Combines paired reads with sophisticated algorithms
- Selective Alignment: Uses only successfully merged reads
Insertion Detection Strategy
- Coverage Requirement: ≥40x coverage ensures sufficient reads span insertions
- Quality Control: Strict filtering prevents false positives
- Read Length Maximization: Every step optimized to create longest possible reads
- Sensitive Alignment: Slow alignment mode maximizes insertion detection
Parameter Optimization
Memory Considerations
For large genomes, the following tools may need memory management flags:
- tadpole.sh: Add "prefilter=2" for large datasets (example after this list)
- bbmerge.sh: Add "prefilter=2" if memory limited
- callvariants.sh: Add "prefilter" for very large datasets
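For example, the phase-3 error-correction command with the prefilter flag added:
tadpole.sh in=eccc.fq.gz out=ecct.fq.gz ecc ordered prefilter=2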
Coverage Optimization
- ≥40x coverage: Optimal for reliable insertion detection (a quick coverage check is sketched after this list)
- Lower coverage: May miss insertions or require parameter adjustment
- Higher coverage: Improves sensitivity but increases computational requirements
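Coverage can be estimated before running the pipeline with BBMap's per-scaffold coverage statistics (nodisk avoids writing an index to disk):
bbmap.sh in=reads.fq.gz ref=reference.fa covstats=covstats.txt nodisk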
Insertion-Specific Parameters
- Extension parameters: el=20, er=20 optimized for 2x100bp reads (a scaling sketch for longer reads follows this list)
- Merging parameters: k=81, extend2=120 for maximum merging
- Alignment parameters: slow mode for maximum sensitivity
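For longer reads these values would plausibly scale up; the command below is an illustrative assumption for 2x150bp data, not a tested recommendation:
tadpole.sh in=ecct.fq.gz out=extended.fq.gz ordered mode=extend el=50 er=50 k=93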
Output Files
- vars.txt - Variant calls in text format
- vars.vcf.gz - Variant calls in VCF format
- merged.sam.gz - Alignment file of merged reads
- merged.bam - BAM file for genome browser visualization
- ihist_merge1.txt - Insert size histogram from the overlap-based error-correction stage
- bs.sh - Script for BAM file generation
- clumped.fq.gz - Deduplicated reads
- filtered_by_tile.fq.gz - Tile-filtered reads
- trimmed.fq.gz - Adapter-trimmed reads
- filtered.fq.gz - Contaminant-free reads
- ecco.fq.gz, eccc.fq.gz, ecct.fq.gz - Error-corrected reads
- extended.fq.gz - Length-extended reads
- unmerged.fq.gz - Reads that could not be merged
Variant Calling Options
Insertion-Only Mode
callvariants.sh in=merged.sam.gz out=vars.txt vcf=vars.vcf.gz ref=reference.fa ploidy=1 calldel=f callsub=f
For insertion-focused analysis, disable deletion and substitution calling to save memory and focus on insertions.
Ploidy Adjustment
- Haploid organisms: ploidy=1 (default)
- Diploid organisms: ploidy=2 (example after this list)
- Polyploid organisms: Adjust ploidy accordingly
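For example, insertion calling in a diploid organism:
callvariants.sh in=merged.sam.gz out=vars.txt vcf=vars.vcf.gz ref=reference.fa ploidy=2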
Quality Control and Validation
Pipeline Success Indicators
- High merge rate: Check ihist_merge1.txt for successful overlaps (see the snippet after this list)
- Alignment statistics: Monitor mapping rates in bbmap output
- Variant quality: Examine quality scores in VCF output
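The summary statistics at the top of the histogram can be viewed directly, assuming BBMerge's '#'-prefixed header lines (e.g. #Mean, #PercentOfPairs):
grep '^#' ihist_merge1.txt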
Common Issues
- Low merge rate: May indicate insufficient overlap or low quality
- Memory errors: Add prefilter flags to memory-intensive steps
- Poor insertion calls: Verify coverage depth and read quality
Visualization
The pipeline generates files suitable for genome browser visualization:
- BAM file: Load merged.bam in IGV or other browsers
- VCF file: Display variant calls alongside alignments
- Coverage tracks: Visualize read depth across insertions
Troubleshooting
- Low insertion detection: Verify coverage depth and read quality
- Memory issues: Add prefilter=2 to tadpole and bbmerge steps
- Poor merge rates: Check read quality and overlap parameters
- Alignment failures: Verify reference genome compatibility
- False positives: Examine adapter trimming and artifact removal steps
- Missing large insertions: Increase extension parameters or merge settings
Performance Notes
- Pipeline optimized for 2x100bp reads with 40x+ coverage
- Different read lengths may require parameter adjustment
- Higher coverage improves sensitivity but increases runtime
- Memory requirements scale with genome size and coverage depth
- Consider data subsampling for initial parameter optimization (a sketch follows)
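A sketch for subsampling with reformat.sh; samplerate=0.25 keeps roughly a quarter of the pairs (adjust to your target coverage):
reformat.sh in=reads.fq.gz out=subsampled.fq.gz samplerate=0.25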