CRISPR Detection Pipeline

Script: crisprPipeline.sh
Author: Brian Bushnell
Last Updated: September 20, 2023
Optimized For: Illumina 2x150bp libraries

A pipeline for detecting and characterizing CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) arrays in Illumina sequencing data. It covers preprocessing, error correction, read merging, and iterative CRISPR detection with optional reference-based refinement.

Overview

CRISPR arrays are key components of bacterial and archaeal adaptive immune systems. This pipeline provides a complete workflow for detecting and characterizing CRISPR arrays from Illumina sequencing data. It combines read preprocessing (quality filtering, trimming, merging, and error correction) with the BBCrisprFinder tool to identify repeats, spacers, and complete CRISPR structures.

Design Target: This pipeline is optimized for Illumina 2x150bp libraries and supports both reference-guided and de novo CRISPR discovery.

Prerequisites

System Requirements

Input Requirements

Pipeline Stages

1. Data Preprocessing (For Raw Data)

1.1 Flowcell Quality Filtering

filterbytile.sh in=temp.fq.gz out=filtered_by_tile.fq.gz

Removes reads from low-quality regions of the flowcell based on positional quality patterns. This step is crucial for maintaining high-quality input for CRISPR detection.

1.2 Adapter Trimming

bbduk.sh in=temp.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=90 ref=adapters ftm=5

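Trims adapters from the right end of reads (ktrim=r) using 23-mers (k=23), falling back to k-mers as short as 11 at read ends (mink=11) and allowing one mismatch (hdist=1). The tbo flag also trims adapters detected by read-pair overlap, tpe trims both reads of a pair to the same length, minlen=90 discards reads shorter than 90 bp after trimming, and ftm=5 right-trims each read so its length is a multiple of 5. Optional parameters include maxns=0 to discard reads with Ns and maq=8 to remove very low average quality reads.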

1.3 Contaminant Removal

bbduk.sh in=temp.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix cardinality

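Removes reads that match synthetic artifacts or PhiX spike-ins (31-mer matching against the artifacts and phix references), since these could interfere with CRISPR array detection. The cardinality flag additionally reports an estimate of the number of unique k-mers in the data.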

2. Read Merging and Error Correction

2.1 Initial Read Merging

bbmerge.sh in=temp.fq.gz out=merged.fq.gz mix strict

Merges paired-end reads using strict parameters. The 'mix' option combines merged and unmerged reads in the same file, which is essential for subsequent non-interleaved processing.
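Because 'mix' puts merged and unmerged reads in one file, it can be useful to check how many reads were actually extended by merging. The following is a minimal sketch, not part of crisprPipeline.sh: length_summary is a hypothetical helper that assumes standard 4-line FASTQ records and counts reads longer than the sequencing read length (150 bp for this library type). Reads longer than that can only have come from merging. Merged reads from short inserts may be shorter, so the count is a lower bound.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BBTools): summarize read lengths from a
# FASTQ stream on stdin. Assumes exactly 4 lines per record.
length_summary() {
    local read_len="${1:-150}"   # sequencing read length; 150 for 2x150bp
    awk -v max="$read_len" '
        NR % 4 == 2 {                 # sequence line of each record
            total++
            if (length($0) > max) longer++
        }
        END { printf "reads=%d longer_than_%d=%d\n", total, max, longer }
    '
}

# Example invocation on the merged output:
#   gzip -dc merged.fq.gz | length_summary 150
```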

2.2 Error Correction Phase 1

clumpify.sh in=temp.fq.gz int=f out=eccc.fq.gz ecc conservative passes=9

Performs clumping-based error correction with conservative settings and multiple passes (9) to ensure high accuracy for CRISPR repeat detection. Note the int=f flag since reads are no longer interleaved.

2.3 Error Correction Phase 2

tadpole.sh in=temp.fq.gz int=f out=ecct.fq.gz ecc k=72 conservative

K-mer based error correction using Tadpole with conservative parameters. This step is optional but recommended for improved CRISPR detection accuracy. Skip if memory is limited.

3. CRISPR Detection

3.1 Reference-Based CRISPR Finding

bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs.fq outr=repeats.fa outs=spacers.fa chist=chist.txt phist=phist.txt ref=knownRepeats.fa outref=uses.fa

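When known CRISPR repeats are available, this approach uses them as references to guide CRISPR detection. Outputs, as named on the command line, include the detected CRISPRs (crisprs.fq), repeats (repeats.fa), spacers (spacers.fa), two histograms (chist.txt and phist.txt), and the reference repeats that were used (uses.fa).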

3.2 De Novo CRISPR Finding

bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs.fq outr=repeats.fa outs=spacers.fa chist=chist.txt phist=phist.txt ow

Alternative approach when no reference repeats are available. Performs de novo discovery of CRISPR arrays without prior knowledge of repeat sequences.
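The choice between the two commands in 3.1 and 3.2 can be made by testing whether a reference file is present. The sketch below is an illustration only: first_pass_cmd is a hypothetical function, and how crisprPipeline.sh itself selects the mode is not shown in this document. It prints the command rather than running it, and uses only the flags that appear in the two commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first-pass bbcrisprfinder.sh command,
# reference-based if the reference file exists and is non-empty, else de novo.
first_pass_cmd() {
    local ref_file="${1:-knownRepeats.fa}"
    local cmd="bbcrisprfinder.sh in=temp.fq.gz int=f"
    cmd+=" outc=crisprs.fq outr=repeats.fa outs=spacers.fa"
    cmd+=" chist=chist.txt phist=phist.txt"
    if [ -s "$ref_file" ]; then
        cmd+=" ref=$ref_file outref=uses.fa"   # section 3.1
    else
        cmd+=" ow"                             # section 3.2
    fi
    printf '%s\n' "$cmd"
}

first_pass_cmd
```

Printing the command first makes it easy to confirm which mode will run before committing to a long job.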

4. Iterative Refinement

4.1 First Refinement Pass

bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs2.fq outr=repeats2.fa outs=spacers2.fa chist=chist2.txt phist=phist2.txt ref=repeats.fa mincount=3

Uses repeats discovered in the first pass as references for a second, more refined analysis. The mincount=3 parameter restricts analysis to repeats encountered at least 3 times, preventing pollution from spurious repeats.

4.2 Second Refinement Pass

bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs3.fq outr=repeats3.fa outs=spacers3.fa chist=chist3.txt phist=phist3.txt ref=repeats2.fa mincount=3

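Final refinement pass, using the repeat set produced by the first refinement pass (repeats2.fa) as the reference. Each iteration re-runs detection against the repeats that survived the previous pass, so the reference set becomes progressively more confident.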
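One simple way to see how the repeat set changes between passes is to count the sequences in each repeats file. This is a minimal sketch that assumes the outr files are ordinary FASTA with one header line per repeat; count_fasta_records is a hypothetical helper, not a BBTools command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count FASTA records by counting header lines.
count_fasta_records() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c '^>' "$1" || true
}

# Report the number of repeats written by each pass, skipping absent files.
for f in repeats.fa repeats2.fa repeats3.fa; do
    if [ -e "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_fasta_records "$f")"
    fi
done
```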

Basic Usage

With Known CRISPR Repeats

# 1. Prepare input files
ln -s your_illumina_reads.fq.gz reads.fq.gz
# Prepare knownRepeats.fa with your reference repeats

# 2. Run the pipeline (will use reference-based detection)
bash crisprPipeline.sh

# 3. Results will be in crisprs3.fq, repeats3.fa, spacers3.fa
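After the run, it is worth confirming that the final outputs exist and are non-empty before starting downstream work. A minimal sketch, assuming the output names listed in step 3; check_outputs is a hypothetical helper, not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every named file exists and is non-empty.
check_outputs() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_outputs crisprs3.fq repeats3.fa spacers3.fa && echo "outputs present"
```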

De Novo Discovery

# 1. Prepare input files  
ln -s your_illumina_reads.fq.gz reads.fq.gz
# No reference file needed

# 2. Run the pipeline (will perform de novo detection)
bash crisprPipeline.sh

# 3. Results progress through crisprs.fq → crisprs2.fq → crisprs3.fq

CRISPR Detection Strategy

Iterative Refinement Approach

The pipeline runs CRISPR detection three times, feeding the repeats found in each pass into the next:

  1. Initial Detection: Broad search for potential CRISPR structures
  2. First Refinement: Uses high-confidence repeats as references
  3. Second Refinement: Final polishing with most reliable repeat set
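The three passes differ only in their output suffix and in which repeats file they use as the reference. The sketch below makes that pattern explicit by printing the command for each pass, starting from de novo detection. pass_cmd is a hypothetical function written for illustration; the flags and file names are the ones used in the pipeline stages of this document.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the bbcrisprfinder.sh command for pass N,
# where pass 1 is the initial de novo detection.
pass_cmd() {
    local n="$1" suffix="" prev=""
    if [ "$n" -gt 1 ]; then suffix="$n"; fi           # crisprs.fq, crisprs2.fq, ...
    if [ "$n" -gt 2 ]; then prev="$((n - 1))"; fi     # reference from previous pass
    local cmd="bbcrisprfinder.sh in=temp.fq.gz int=f"
    cmd+=" outc=crisprs${suffix}.fq outr=repeats${suffix}.fa outs=spacers${suffix}.fa"
    cmd+=" chist=chist${suffix}.txt phist=phist${suffix}.txt"
    if [ "$n" -gt 1 ]; then
        cmd+=" ref=repeats${prev}.fa mincount=3"      # refinement passes
    else
        cmd+=" ow"                                    # initial de novo pass
    fi
    printf '%s\n' "$cmd"
}

for n in 1 2 3; do
    pass_cmd "$n"
done
```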

Quality Control Parameters

Reference vs. De Novo Trade-offs

| Approach | Advantages | Disadvantages | Best For |
|---|---|---|---|
| Reference-Based | Higher sensitivity, faster processing, validated repeats | Limited to known repeat families, may miss novel CRISPRs | Well-studied organisms, targeted analysis |
| De Novo | Discovers novel repeats, unbiased detection, comprehensive | Slower, may include false positives, requires validation | Novel organisms, exploratory analysis |

Pass Recommendations

With Reference Repeats

Without Reference (De Novo)

Output Files

CRISPR Detection Results

Analysis Files

Preprocessing Files

Parameter Optimization

Memory Management

Sensitivity Tuning

Speed Optimization

Quality Assessment

Success Indicators

Validation Steps

Pipeline Flexibility

Stage Disabling

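Every stage reads its input from temp.fq.gz, and after each stage temp.fq.gz is re-linked to that stage's output. A stage can therefore be disabled by commenting out its command together with the re-linking line that follows it: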

# To skip tile filtering:
# Comment out: filterbytile.sh in=temp.fq.gz out=filtered_by_tile.fq.gz
# And the linking: rm temp.fq.gz; ln -s filtered_by_tile.fq.gz temp.fq.gz
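The rm and ln -s pair can be wrapped in a small function so that each stage is followed by a single call. This is a minimal sketch under the assumption that the symlink is always named temp.fq.gz; relink is a hypothetical name and is not part of crisprPipeline.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: point temp.fq.gz at the output of the stage that just ran.
relink() {
    local stage_output="$1"
    # Refuse to link to a missing file, so a failed stage is not passed over silently.
    if [ ! -e "$stage_output" ]; then
        echo "relink: $stage_output does not exist" >&2
        return 1
    fi
    rm -f temp.fq.gz
    ln -s "$stage_output" temp.fq.gz
}

# Usage pattern (tool invocation shown as a comment):
#   filterbytile.sh in=temp.fq.gz out=filtered_by_tile.fq.gz
#   relink filtered_by_tile.fq.gz
# Commenting out both lines skips the stage; temp.fq.gz then still points at
# the previous stage's output.
```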

Customization Options

Troubleshooting

Common Issues

Optimization Strategies

Downstream Analysis

Repeat Analysis

Spacer Analysis

Array Architecture