CRISPR Detection Pipeline

Overview

CRISPR arrays are a core component of bacterial and archaeal adaptive immune systems. This pipeline detects and characterizes CRISPR arrays in Illumina sequencing data. It preprocesses the reads (tile filtering, adapter trimming, contaminant removal), merges and error-corrects them, and then runs BBCrisprFinder to identify repeats, spacers, and the reads that contain CRISPR structures.

Design Target: This pipeline is optimized for Illumina 2x150bp libraries and supports both reference-guided and de novo CRISPR discovery.

Prerequisites

System Requirements

BBTools suite installed
Sufficient memory for error correction and CRISPR detection
Adequate storage for intermediate files and results

Input Requirements

Interleaved paired-end Illumina reads file named "reads.fq.gz" (the stage commands read it through the "temp.fq.gz" symbolic link)
Optional: Known CRISPR repeat reference file "knownRepeats.fa"
Sufficient sequencing depth for CRISPR array coverage
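
The input convention can be checked before any stage runs. The sketch below is a minimal example, not part of the pipeline script as described here: link_input is a hypothetical helper name, and the initial link from reads.fq.gz to temp.fq.gz is an assumption inferred from the stage commands, which all read temp.fq.gz.

```shell
# Hypothetical helper: verify the input and create the working symlink.
# Assumes every stage reads temp.fq.gz, as the commands in this guide do.
link_input() {
  local reads="${1:-reads.fq.gz}"
  # -s follows symlinks, so reads.fq.gz may itself be a link to the real file.
  if [ ! -s "$reads" ]; then
    echo "missing or empty input: $reads" >&2
    return 1
  fi
  rm -f temp.fq.gz
  ln -s "$reads" temp.fq.gz
}
```

Calling link_input with no argument uses reads.fq.gz; a different file name can be passed as the first argument. The helper does not verify that the reads are interleaved.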

Pipeline Stages

1. Data Preprocessing (For Raw Data)

1.1 Flowcell Quality Filtering

filterbytile.sh in=temp.fq.gz out=filtered_by_tile.fq.gz

Removes reads from low-quality regions of the flowcell, identified from positional quality patterns, so that localized sequencing problems do not propagate into CRISPR detection.

1.2 Adapter Trimming

bbduk.sh in=temp.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=90 ref=adapters ftm=5

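Trims adapters from the right end of each read (ktrim=r) by matching 23-mers from the built-in adapter reference (k=23, ref=adapters), using k-mers as short as 11 at read ends (mink=11) and tolerating one mismatch (hdist=1). The tbo flag additionally trims adapters detected from pair overlap, tpe trims both reads of a pair to the same length, ftm=5 right-trims each read to a length that is a multiple of 5 (dropping, for example, the extra 151st base), and minlen=90 discards reads shorter than 90 bp after trimming. Optional parameters include maxns=0 to discard reads containing Ns and maq=8 to discard reads with very low average quality.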
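
As a minimal sketch of how the optional filters could be switched on, the wrapper below appends maxns=0 and maq=8 to the trimming command on request. The function name trim_adapters and its on/off argument are hypothetical; the BBDuk flags are the ones listed in this guide.

```shell
# Sketch: run the adapter-trimming stage, optionally with the stricter filters.
# The first argument is a hypothetical toggle: pass 1 to add maxns=0 and maq=8.
trim_adapters() {
  local strict="${1:-0}"
  local extra=""
  if [ "$strict" = "1" ]; then
    extra="maxns=0 maq=8"   # discard reads with Ns or very low average quality
  fi
  # $extra is intentionally unquoted so it splits into separate arguments.
  bbduk.sh in=temp.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 \
    tbo tpe minlen=90 ref=adapters ftm=5 $extra
}
```

Running trim_adapters with no argument reproduces the command above unchanged; trim_adapters 1 adds the two optional filters.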

1.3 Contaminant Removal

bbduk.sh in=temp.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix cardinality

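Removes reads that share a 31-mer with known synthetic artifacts or the PhiX spike-in, either of which could interfere with CRISPR array detection. The cardinality flag additionally reports an estimate of the number of unique k-mers in the data.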

2. Read Merging and Error Correction

2.1 Initial Read Merging

bbmerge.sh in=temp.fq.gz out=merged.fq.gz mix strict

Merges overlapping read pairs using the strict setting. The 'mix' option writes merged and unmerged reads to the same file, so the output can no longer be treated as interleaved and the subsequent stages are run with int=f.

2.2 Error Correction Phase 1

clumpify.sh in=temp.fq.gz int=f out=eccc.fq.gz ecc conservative passes=9

Performs clumping-based error correction with conservative settings and multiple passes (9) to ensure high accuracy for CRISPR repeat detection. Note the int=f flag since reads are no longer interleaved.

2.3 Error Correction Phase 2

tadpole.sh in=temp.fq.gz int=f out=ecct.fq.gz ecc k=72 conservative

K-mer based error correction using Tadpole with conservative parameters. This step is optional but recommended for improved CRISPR detection accuracy. Skip if memory is limited.

3. CRISPR Detection

3.1 Reference-Based CRISPR Finding

bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs.fq outr=repeats.fa outs=spacers.fa chist=chist.txt phist=phist.txt ref=knownRepeats.fa outref=uses.fa

When known CRISPR repeats are available, this approach uses them as references to guide CRISPR detection. Outputs include:

crisprs.fq: Reads containing CRISPR arrays
repeats.fa: Identified repeat sequences
spacers.fa: Identified spacer sequences
chist.txt: CRISPR histogram data
phist.txt: Positional histogram data
uses.fa: Reference repeats actually used

3.2 De Novo CRISPR Finding

bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs.fq outr=repeats.fa outs=spacers.fa chist=chist.txt phist=phist.txt ow

Use this alternative when no reference repeats are available. It discovers CRISPR arrays de novo, without prior knowledge of repeat sequences. The ow flag allows existing output files to be overwritten.
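
The choice between the two modes can be sketched as a small wrapper. This is an illustration only: find_crisprs_pass1 is a hypothetical name, and the rule that the presence of a non-empty knownRepeats.fa selects reference-based detection is an assumption, since this guide does not show how the pipeline script makes that decision.

```shell
# Sketch: pick reference-guided or de novo detection for the first pass.
# Assumes the mode depends only on whether the reference file is non-empty.
find_crisprs_pass1() {
  local ref="${1:-knownRepeats.fa}"
  if [ -s "$ref" ]; then
    # Reference-based: guide detection with known repeats, record those used.
    bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs.fq outr=repeats.fa \
      outs=spacers.fa chist=chist.txt phist=phist.txt ref="$ref" outref=uses.fa
  else
    # De novo: no reference; ow overwrites any existing outputs.
    bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs.fq outr=repeats.fa \
      outs=spacers.fa chist=chist.txt phist=phist.txt ow
  fi
}
```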

4. Iterative Refinement

4.1 First Refinement Pass

bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs2.fq outr=repeats2.fa outs=spacers2.fa chist=chist2.txt phist=phist2.txt ref=repeats.fa mincount=3

Uses repeats discovered in the first pass as references for a second, more refined analysis. The mincount=3 parameter restricts analysis to repeats encountered at least 3 times, preventing pollution from spurious repeats.

4.2 Second Refinement Pass

bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs3.fq outr=repeats3.fa outs=spacers3.fa chist=chist3.txt phist=phist3.txt ref=repeats2.fa mincount=3

This final refinement pass uses the repeat set produced by the second pass (repeats2.fa) as its reference. Repeating detection against an increasingly reliable repeat set progressively improves CRISPR detection accuracy.
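
The two refinement commands follow one pattern: pass N reads the repeats written by pass N-1. A minimal sketch of that pattern as a loop is shown below; refine_passes and its arguments are hypothetical, while the flags and file names mirror the commands in this section.

```shell
# Sketch: run refinement passes 2..last, each seeded with the previous
# pass's repeats (pass 2 reads repeats.fa, pass 3 reads repeats2.fa).
refine_passes() {
  local last="${1:-3}"
  local mincount="${2:-3}"
  local prev=""
  local n=2
  while [ "$n" -le "$last" ]; do
    bbcrisprfinder.sh in=temp.fq.gz int=f outc="crisprs$n.fq" \
      outr="repeats$n.fa" outs="spacers$n.fa" chist="chist$n.txt" \
      phist="phist$n.txt" ref="repeats$prev.fa" mincount="$mincount" || return 1
    prev="$n"
    n=$((n + 1))
  done
}
```

With the defaults, refine_passes issues the same two commands as sections 4.1 and 4.2; refine_passes 2 stops after the first refinement pass.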

Basic Usage

With Known CRISPR Repeats

# 1. Prepare input files
ln -s your_illumina_reads.fq.gz reads.fq.gz
# Prepare knownRepeats.fa with your reference repeats

# 2. Run the pipeline (will use reference-based detection)
bash crisprPipeline.sh

# 3. Results will be in crisprs3.fq, repeats3.fa, spacers3.fa

De Novo Discovery

# 1. Prepare input files
ln -s your_illumina_reads.fq.gz reads.fq.gz
# No reference file needed

# 2. Run the pipeline (will perform de novo detection)
bash crisprPipeline.sh

# 3. Results progress through crisprs.fq → crisprs2.fq → crisprs3.fq

CRISPR Detection Strategy

Iterative Refinement Approach

The pipeline refines its results over three passes:

Initial Detection: Broad search for potential CRISPR structures
First Refinement: Uses high-confidence repeats as references
Second Refinement: Final polishing with most reliable repeat set

Quality Control Parameters

mincount=3: Minimum repeat occurrences for reliability
Conservative error correction: Preserves genuine sequence variations
Strict merging: Ensures high-quality merged reads

Reference vs. De Novo Trade-offs

Approach	Advantages	Disadvantages	Best For
Reference-Based	Higher sensitivity, faster processing, validated repeats	Limited to known repeat families, may miss novel CRISPRs	Well-studied organisms, targeted analysis
De Novo	Discovers novel repeats, unbiased detection, comprehensive	Slower, may include false positives, requires validation	Novel organisms, exploratory analysis

Pass Recommendations

With Reference Repeats

1-2 total passes: Usually sufficient with good reference
Single pass: If reference is comprehensive and high-quality
Two passes: For refinement and validation

Without Reference (De Novo)

3 total passes: Recommended for comprehensive detection
Pass 1: Broad discovery of potential CRISPRs
Pass 2: Refinement using high-confidence repeats
Pass 3: Final validation and polishing

Output Files

CRISPR Detection Results

crisprs.fq, crisprs2.fq, crisprs3.fq - Reads containing CRISPR arrays from each pass
repeats.fa, repeats2.fa, repeats3.fa - Identified repeat sequences from each pass
spacers.fa, spacers2.fa, spacers3.fa - Identified spacer sequences from each pass
uses.fa - Reference repeats actually utilized (reference mode only)
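
To compare the passes at a glance, the FASTA outputs can be tallied per pass. The sketch below is a hedged example: summarize_passes is a hypothetical helper that counts header lines in the repeat and spacer files named above and reports files that are absent.

```shell
# Sketch: print "file<TAB>record count" for the repeat and spacer FASTA
# outputs of each pass, or "missing" when a file does not exist.
summarize_passes() {
  local suffix f
  for suffix in "" 2 3; do
    for f in "repeats$suffix.fa" "spacers$suffix.fa"; do
      if [ -f "$f" ]; then
        # Each FASTA record starts with a '>' header line.
        printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
      else
        printf '%s\tmissing\n' "$f"
      fi
    done
  done
}
```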

Analysis Files

chist.txt, chist2.txt, chist3.txt - CRISPR count histograms from each pass
phist.txt, phist2.txt, phist3.txt - Positional histogram data from each pass

Preprocessing Files

filtered_by_tile.fq.gz - Quality-filtered reads
trimmed.fq.gz - Adapter-trimmed reads
filtered.fq.gz - Contaminant-free reads
merged.fq.gz - Merged paired-end reads
eccc.fq.gz, ecct.fq.gz - Error-corrected reads

Parameter Optimization

Memory Management

Error correction: Skip tadpole error correction if memory is limited
BBCrisprFinder: May benefit from prefilter flags for large datasets
Conservative approach: Start with default parameters

Sensitivity Tuning

mincount parameter: Lower values (even 1) can be used but may slow analysis
Error correction passes: Raise the Clumpify passes value (9 in this pipeline) for higher accuracy
Merging strictness: Adjust based on data quality

Speed Optimization

Single pass: Use only with high-quality reference repeats
Skip optional steps: Remove error correction phases if speed is critical
Higher mincount: Increases speed but may reduce sensitivity

Quality Assessment

Success Indicators

Progressive improvement: Each pass should refine results
Consistent repeats: High-quality repeats appear across passes
Reasonable spacer diversity: Multiple unique spacers per repeat family
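
One way to check that repeats are consistent across passes is to list the sequences two repeat files have in common. The sketch below makes two assumptions that should be verified against real output: each sequence sits on a single line, and identical repeats are written as identical strings (reverse complements are not matched). The name shared_sequences is hypothetical.

```shell
# Sketch: print sequences present in both FASTA files.
# Assumes one sequence line per record and exact string identity.
shared_sequences() {
  local a b
  a=$(mktemp) || return 1
  b=$(mktemp) || return 1
  # Drop header lines, then de-duplicate and sort for comm.
  grep -v '^>' "$1" | sort -u > "$a"
  grep -v '^>' "$2" | sort -u > "$b"
  comm -12 "$a" "$b"
  rm -f "$a" "$b"
}
```

For example, shared_sequences repeats2.fa repeats3.fa lists the repeats that survived from the second pass into the third.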

Validation Steps

Manual inspection: Examine repeat and spacer sequences
Length distributions: Check histogram files for expected patterns
Cross-reference databases: Compare repeats to known CRISPR families
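
Length distributions can also be computed directly from the FASTA outputs, which avoids relying on the layout of the histogram files (not described in this guide). The sketch below is a minimal example; length_histogram is a hypothetical helper name.

```shell
# Sketch: print "length<TAB>count" for every sequence length in a FASTA file.
# Handles wrapped FASTA by summing line lengths within each record.
length_histogram() {
  awk '
    /^>/ { if (seen) counts[len]++; len = 0; seen = 1; next }
         { len += length($0) }
    END  { if (seen) counts[len]++
           for (l in counts) printf "%d\t%d\n", l, counts[l] }
  ' "$1" | sort -n
}
```

Running length_histogram spacers3.fa or length_histogram repeats3.fa gives a quick view of whether lengths cluster tightly, as expected for repeats from a single family.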

Pipeline Flexibility

Stage Disabling

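Each stage reads temp.fq.gz, a symbolic link that is re-pointed to the stage's output once the stage finishes. To disable a stage, comment out both its command and the re-linking line that follows it: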

# To skip tile filtering:
# Comment out: filterbytile.sh in=temp.fq.gz out=filtered_by_tile.fq.gz
# And the linking: rm temp.fq.gz; ln -s filtered_by_tile.fq.gz temp.fq.gz
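
A variation on this pattern needs only one line to be commented out per stage. The sketch below is an assumption-laden illustration, not the pipeline's own code: relink is a hypothetical helper that re-points temp.fq.gz only when the stage output exists, so a disabled stage leaves the previous link in place.

```shell
# Hypothetical helper: re-point temp.fq.gz at a stage's output, but only
# if that output exists; otherwise keep the current link.
relink() {
  if [ -e "$1" ]; then
    rm -f temp.fq.gz
    ln -s "$1" temp.fq.gz
  fi
}

# Example use after a stage (the stage command itself may be commented out):
#   filterbytile.sh in=temp.fq.gz out=filtered_by_tile.fq.gz
#   relink filtered_by_tile.fq.gz
```

Stale outputs from an earlier run would still be linked, so intermediate files should be removed before rerunning with stages disabled.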

Customization Options

Preprocessing: Adjust or skip quality filtering steps
Error correction: Modify parameters or skip phases
CRISPR detection: Tune mincount and other BBCrisprFinder parameters
Iteration count: Adjust number of refinement passes

Troubleshooting

Common Issues

No CRISPRs detected: Check input data quality and organism type
Too many false positives: Increase mincount or improve preprocessing
Memory errors: Skip optional error correction or add prefilter flags
Poor repeat quality: Examine preprocessing steps and data quality

Optimization Strategies

Data quality: Ensure sufficient coverage and read quality
Reference selection: Use high-quality, relevant reference repeats
Parameter tuning: Adjust mincount based on data characteristics
Validation: Cross-check results with external CRISPR databases

Downstream Analysis

Repeat Analysis

Compare identified repeats to CRISPRdb or other databases
Analyze repeat family relationships and evolution
Examine repeat secondary structures and motifs

Spacer Analysis

BLAST spacers against viral and plasmid databases
Analyze spacer acquisition patterns and timeline
Identify protospacer adjacent motifs (PAMs)
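
For the BLAST step, a minimal sketch using NCBI BLAST+ is shown below. It assumes blastn is installed and that a nucleotide database has already been built; the function name blast_spacers, the default output name, and the e-value cutoff of 1e-3 are illustrative choices, not recommendations from this pipeline.

```shell
# Sketch: search spacers against a nucleotide BLAST database.
# The database name is supplied by the caller; none is assumed here.
blast_spacers() {
  local spacers="${1:-spacers3.fa}"
  local db="${2:-}"
  local out="${3:-spacer_hits.tsv}"
  if [ -z "$db" ]; then
    echo "usage: blast_spacers <spacers.fa> <blast_db> [out.tsv]" >&2
    return 2
  fi
  # blastn-short is the BLAST+ task intended for short queries such as spacers;
  # -outfmt 6 writes tab-separated hits.
  blastn -task blastn-short -query "$spacers" -db "$db" \
    -evalue 1e-3 -outfmt 6 -out "$out"
}
```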

Array Architecture

Reconstruct complete CRISPR array structures
Analyze leader sequences and orientations
Compare arrays across related organisms