Insertion Calling Pipeline

Script: callInsertions.sh
Author: Brian Bushnell
Last Updated: February 21, 2018
Optimized For: 2x100bp reads with ≥40x coverage

Specialized variant calling pipeline designed to detect insertions longer than the raw read length by lengthening reads through error correction, extension, and merging prior to alignment. Optimized for high-coverage paired-end sequencing data.

Overview

This pipeline addresses a fundamental limitation in variant calling: detecting insertions longer than the sequencing read length. By combining error correction, read extension, and sophisticated merging techniques, it creates longer effective reads that can span large insertions, enabling accurate detection of structural variants that would otherwise be missed.

Optimization Target: This pipeline was specifically designed and optimized for 2x100bp reads with coverage of at least 40x. Parameters may need adjustment for different read lengths, coverage depths, or data types.

Prerequisites

System Requirements

Input Requirements

Pipeline Stages

1. Initial Read Processing

1.1 Read Reordering and Deduplication

clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical

Reorders reads for improved processing speed and removes optical duplicates that could interfere with accurate insertion calling.
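
On patterned flowcells (e.g., NovaSeq), optical duplicates can sit farther apart than Clumpify's default search distance. The dupedist flag is a standard Clumpify option; the value below is illustrative and not part of this pipeline:

# Widen the optical-duplicate search distance for patterned flowcells
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dupedist=12000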

1.2 Tile-Based Quality Filtering

filterbytile.sh in=clumped.fq.gz out=filtered_by_tile.fq.gz

Removes reads from low-quality regions of the flowcell based on positional quality patterns.

1.3 Adapter Trimming and Quality Control

bbduk.sh in=filtered_by_tile.fq.gz out=trimmed.fq.gz maxns=0 ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=85 ref=adapters ftm=5 ordered

Trims adapters and discards any read containing an N base (maxns=0), ensuring high-quality input for downstream processing. Adapter matching uses 23-mers, shortening to 11-mers at read tips (mink=11) with one mismatch allowed (hdist=1); a flag-by-flag breakdown follows.
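
For reference, here is the same command with each flag annotated (the comments are added here and are not part of the original script):

# maxns=0      : discard reads containing any N bases
# ktrim=r      : once an adapter k-mer is found, trim it and everything to its right
# k=23         : detect adapters using 23-mers
# mink=11      : allow k-mers as short as 11 at read ends
# hdist=1      : allow one mismatch during k-mer matching
# tbo          : also trim adapters detected by pair overlap
# tpe          : trim both reads of a pair to the same length
# minlen=85    : discard reads shorter than 85bp after trimming
# ref=adapters : use the adapter reference bundled with BBTools
# ftm=5        : force-trim read length to a multiple of 5 (drops a stray trailing base)
# ordered      : preserve input order in the output
bbduk.sh in=filtered_by_tile.fq.gz out=trimmed.fq.gz maxns=0 ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=85 ref=adapters ftm=5 ordered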

1.4 Contaminant Removal

bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered

Removes synthetic artifacts and PhiX spike-ins that could create false positive variant calls.

2. Multi-Stage Error Correction

2.1 Error Correction Phase 1: Overlap-Based

bbmerge.sh in=filtered.fq.gz out=ecco.fq.gz ecco mix vstrict ordered ihist=ihist_merge1.txt

Performs overlap-based error correction using very strict overlap criteria (vstrict) to minimize false corrections. In ecco mode, overlapping pairs are error-corrected but left unmerged, and the 'mix' option keeps all reads, overlapping or not, in a single output file.
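
The insert-size histogram written to ihist_merge1.txt begins with summary lines (assuming the standard BBMerge histogram format): mean, median, and mode insert size, plus the percentage of pairs that overlapped. A quick look:

# Print the summary header of the insert-size histogram
head ihist_merge1.txt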

2.2 Error Correction Phase 2: Clumping-Based

clumpify.sh in=ecco.fq.gz out=eccc.fq.gz passes=4 ecc unpair repair

Applies clumping-based error correction over four passes (passes=4 ecc). The unpair and repair flags temporarily treat the reads as unpaired so that read 1 and read 2 can clump and be corrected together, then restore proper pairing afterward.

2.3 Error Correction Phase 3: K-mer Based

tadpole.sh in=eccc.fq.gz out=ecct.fq.gz ecc ordered

Final k-mer based error correction using Tadpole. For large datasets, may require prefilter flags to manage memory usage.
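
A sketch of adding memory-management flags for a large dataset (prefilter and -Xmx are standard BBTools options; the values here are illustrative):

# prefilter=2 : use a count-min-sketch prepass to ignore k-mers seen 2 times or fewer,
#               shrinking the in-memory k-mer tables
# -Xmx31g     : cap the Java heap at 31 GB (adjust to the machine)
tadpole.sh -Xmx31g in=eccc.fq.gz out=ecct.fq.gz ecc ordered prefilter=2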

3. Read Extension

3.1 K-mer Based Read Extension

tadpole.sh in=ecct.fq.gz out=extended.fq.gz ordered mode=extend el=20 er=20 k=62

Extends reads by 20 bases on both ends (el=20, er=20) using k-mer overlap information. This crucial step enables detection of insertions longer than the original read length.

4. Advanced Read Merging

4.1 Merging with K-mer Extension

bbmerge-auto.sh in=extended.fq.gz out=merged.fq.gz outu=unmerged.fq.gz rem k=81 extend2=120 zl=8 ordered

Merges read pairs with k-mer-assisted extension: when a pair cannot be merged directly, the reads are extended by up to 120bp (extend2=120) using 81-mers and merging is retried. The rem flag (requireextensionmatch) rejects merges whose predicted insert size changes after extension, and zl=8 sets gzip compression level 8 for the output. bbmerge-auto.sh behaves like bbmerge.sh but automatically claims all available memory for the k-mer tables.

5. Alignment and Variant Calling

5.1 High-Sensitivity Alignment

bbmap.sh in=merged.fq.gz out=merged.sam.gz slow bs=bs.sh pigz unpigz ref=reference.fa

Maps only the merged reads (which can span large insertions) using slow, high-sensitivity alignment mode. Generates a BAM creation script (bs.sh) for visualization.
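
If the same reference is reused across runs, the index can be built once up front; BBMap writes it to a ref/ subdirectory and reuses it automatically on subsequent runs:

# Build the BBMap index only; later bbmap.sh runs with ref=reference.fa reuse it
bbmap.sh ref=reference.fa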

5.2 Variant Detection

callvariants.sh in=merged.sam.gz out=vars.txt vcf=vars.vcf.gz ref=reference.fa ploidy=1

Calls variants including insertions from the alignment. Ploidy may need adjustment for different organisms. For insertion-only calling, add "calldel=f callsub=f" to save memory by not calling deletions or substitutions.

5.3 BAM File Generation

sh bs.sh

Executes the generated script to create a BAM file for visualization in genome browsers like IGV.
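
The exact contents of bs.sh depend on the BBMap version, but it is essentially a small samtools wrapper. Roughly equivalent, assuming samtools is installed (file names illustrative):

# Coordinate-sort the alignments and index the result for genome browsers
samtools sort -o merged.bam merged.sam.gz
samtools index merged.bam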

Basic Usage

# 1. Prepare input files
ln -s your_paired_reads.fq.gz reads.fq.gz
ln -s your_reference_genome.fa reference.fa

# 2. Run the insertion calling pipeline
bash callInsertions.sh

# 3. Examine results
# - vars.txt: Text format variant calls
# - vars.vcf.gz: VCF format variant calls
# - merged.sam.gz: Alignment file
# - merged.bam: BAM file for visualization (after bs.sh)
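
As a quick sanity check, insertion records can be counted directly from the VCF by comparing REF and ALT allele lengths (generic VCF logic using standard columns 4 and 5, not a pipeline command):

# Count records where ALT is longer than REF, i.e., insertions
# (assumes a single ALT allele per record)
zcat vars.vcf.gz | awk '!/^#/ && length($5) > length($4)' | wc -l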

Algorithm Strategy

Read Lengthening Approach

The pipeline uses a multi-pronged strategy to effectively lengthen reads (a verification sketch follows the list):

  1. Error Correction: Cleans reads to maximize merge success
  2. K-mer Extension: Extends reads using overlap information
  3. Advanced Merging: Combines paired reads with sophisticated algorithms
  4. Selective Alignment: Uses only successfully merged reads
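
One way to confirm that the lengthening worked is to compare read-length histograms before and after merging using readlength.sh, a standard BBTools utility:

# Length distribution of the cleaned input reads
readlength.sh in=filtered.fq.gz
# Length distribution after extension and merging; the peak should sit well past 100bp
readlength.sh in=merged.fq.gz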

Insertion Detection Strategy

Parameter Optimization

Memory Considerations

For large genomes or datasets, the k-mer-heavy stages (tadpole.sh and bbmerge-auto.sh) may need memory-management flags such as prefilter or an explicit -Xmx heap cap, as sketched below.
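
For example (illustrative values; prefilter is a standard flag for the k-mer stages, and every BBTools script accepts -Xmx):

# Cap the Java heap explicitly and prefilter low-count k-mers during merging
bbmerge-auto.sh -Xmx64g in=extended.fq.gz out=merged.fq.gz outu=unmerged.fq.gz rem k=81 extend2=120 zl=8 ordered prefilter=2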

Coverage Optimization

Insertion-Specific Parameters

Output Files

Variant Calling Options

Insertion-Only Mode

callvariants.sh in=merged.sam.gz out=vars.txt vcf=vars.vcf.gz ref=reference.fa ploidy=1 calldel=f callsub=f

For insertion-focused analysis, disable deletion and substitution calling to save memory and focus on insertions.

Ploidy Adjustment
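
The commands above use ploidy=1 (haploid). For diploid organisms, set the ploidy flag accordingly; a minimal example:

# Diploid variant calling; all other parameters unchanged
callvariants.sh in=merged.sam.gz out=vars.txt vcf=vars.vcf.gz ref=reference.fa ploidy=2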

Quality Control and Validation

Pipeline Success Indicators

Common Issues

Visualization

The pipeline generates files suitable for genome browser visualization: the sorted, indexed BAM produced by bs.sh can be loaded into IGV alongside vars.vcf.gz to inspect the called insertions.

Troubleshooting

Performance Notes