CRISPR Detection Pipeline
Pipeline for detecting and characterizing CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) arrays in Illumina sequencing data. It covers preprocessing, error correction, read merging, and iterative CRISPR detection with optional reference-based refinement.
Overview
CRISPR arrays are components of bacterial and archaeal adaptive immune systems. This pipeline detects and characterizes them in Illumina sequencing data: BBTools preprocessing steps clean and error-correct the reads, and BBCrisprFinder then identifies repeats, spacers, and complete CRISPR structures.
Prerequisites
System Requirements
- BBTools suite installed
- Sufficient memory for error correction and CRISPR detection
- Adequate storage for intermediate files and results
Input Requirements
- Interleaved paired-end Illumina reads file named "reads.fq.gz"
- Optional: Known CRISPR repeat reference file "knownRepeats.fa"
- Sufficient sequencing depth for CRISPR array coverage
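A preflight check along the following lines can catch a missing tool or input before any stage runs. This is a minimal sketch: the function names missing_tools and check_inputs are illustrative and not part of BBTools, and the file names are the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of BBTools).

# Print the name of every tool in "$@" that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Return 0 if the required input exists in directory $1 (default: current dir).
# Prints a note on stderr when the optional reference file is absent.
check_inputs() {
    local dir="${1:-.}"
    if [ ! -e "$dir/reads.fq.gz" ]; then
        echo "missing required input: $dir/reads.fq.gz" >&2
        return 1
    fi
    if [ ! -e "$dir/knownRepeats.fa" ]; then
        echo "no knownRepeats.fa found: only de novo detection is possible" >&2
    fi
    return 0
}

# Example usage:
#   missing_tools filterbytile.sh bbduk.sh bbmerge.sh clumpify.sh tadpole.sh bbcrisprfinder.sh
#   check_inputs . || exit 1
```

The tool list in the usage comment is taken from the commands used in the pipeline stages.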
Pipeline Stages
Every command below reads temp.fq.gz, a symbolic link that is repointed at each stage's output before the next stage runs (see Stage Disabling).
1. Data Preprocessing (For Raw Data)
1.1 Flowcell Quality Filtering
filterbytile.sh in=temp.fq.gz out=filtered_by_tile.fq.gz
Removes reads from low-quality regions of the flowcell based on positional quality patterns. This step is crucial for maintaining high-quality input for CRISPR detection.
1.2 Adapter Trimming
bbduk.sh in=temp.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=90 ref=adapters ftm=5
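Trims adapters from the right end of reads (ktrim=r) by k-mer matching against the bundled adapter reference (ref=adapters), using 23-mers that shorten to 11-mers at read ends (k=23 mink=11) and allowing one mismatch (hdist=1). tbo additionally trims adapters detected from pair overlap, tpe trims both reads of a pair to the same length, ftm=5 right-trims each read so its length is a multiple of 5, and minlen=90 discards reads shorter than 90 bases after trimming. Optional parameters include maxns=0 to discard reads with Ns and maq=8 to remove very low average quality reads.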
1.3 Contaminant Removal
bbduk.sh in=temp.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix cardinality
Removes synthetic artifacts and PhiX spike-ins that could interfere with CRISPR array detection.
2. Read Merging and Error Correction
2.1 Initial Read Merging
bbmerge.sh in=temp.fq.gz out=merged.fq.gz mix strict
Merges paired-end reads using strict parameters. The 'mix' option writes merged and unmerged reads to the same output file, so all subsequent stages process their input as non-interleaved.
2.2 Error Correction Phase 1
clumpify.sh in=temp.fq.gz int=f out=eccc.fq.gz ecc conservative passes=9
Performs clumping-based error correction with conservative settings and multiple passes (9) to ensure high accuracy for CRISPR repeat detection. Note the int=f flag since reads are no longer interleaved.
2.3 Error Correction Phase 2
tadpole.sh in=temp.fq.gz int=f out=ecct.fq.gz ecc k=72 conservative
K-mer based error correction using Tadpole with conservative parameters. This step is optional but recommended for improved CRISPR detection accuracy. Skip if memory is limited.
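The decision to skip Tadpole on a memory-limited machine could be automated roughly as follows. This is a sketch under stated assumptions: the 32 GB threshold is an arbitrary placeholder (this document gives no memory figure), the helper names are hypothetical, and available_gb only works on Linux because it reads /proc/meminfo.

```shell
#!/usr/bin/env bash
# Hypothetical threshold: adjust to your data; no figure is given in this document.
MIN_GB_FOR_TADPOLE="${MIN_GB_FOR_TADPOLE:-32}"

# Print available memory in whole GB (Linux only); prints 0 if it cannot be read.
available_gb() {
    if [ -r /proc/meminfo ]; then
        awk '/^MemAvailable:/ { print int($2 / 1048576) }' /proc/meminfo
    else
        echo 0
    fi
}

# Succeed when $1 (GB available, empty treated as 0) meets the $2 GB threshold.
enough_memory() {
    [ "${1:-0}" -ge "$2" ]
}

# Example usage:
#   if enough_memory "$(available_gb)" "$MIN_GB_FOR_TADPOLE"; then
#       tadpole.sh in=temp.fq.gz int=f out=ecct.fq.gz ecc k=72 conservative
#   else
#       echo "skipping Tadpole error correction (limited memory)" >&2
#   fi
```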
3. CRISPR Detection
3.1 Reference-Based CRISPR Finding
bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs.fq outr=repeats.fa outs=spacers.fa chist=chist.txt phist=phist.txt ref=knownRepeats.fa outref=uses.fa
When known CRISPR repeats are available, this approach uses them as references to guide CRISPR detection. Outputs include:
- crisprs.fq: Reads containing CRISPR arrays
- repeats.fa: Identified repeat sequences
- spacers.fa: Identified spacer sequences
- chist.txt: CRISPR histogram data
- phist.txt: Palindrome histogram data
- uses.fa: Reference repeats actually used
3.2 De Novo CRISPR Finding
bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs.fq outr=repeats.fa outs=spacers.fa chist=chist.txt phist=phist.txt ow
Use this approach when no reference repeats are available. It discovers CRISPR arrays de novo, without prior knowledge of repeat sequences. The ow flag allows existing output files to be overwritten.
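The choice between the two first-pass modes can be expressed as a simple file test. The sketch below only prints the command it would run (a dry run). How crisprPipeline.sh itself selects the mode is not shown in this document, so the existence check on knownRepeats.fa and the function name first_pass_cmd are assumptions.

```shell
#!/usr/bin/env bash
# Print the first-pass BBCrisprFinder command for directory $1 (default: .).
# Assumption: reference mode is chosen when a non-empty knownRepeats.fa exists.
first_pass_cmd() {
    local dir="${1:-.}"
    local base="bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs.fq outr=repeats.fa outs=spacers.fa chist=chist.txt phist=phist.txt"
    if [ -s "$dir/knownRepeats.fa" ]; then
        # Reference-based detection (section 3.1).
        echo "$base ref=knownRepeats.fa outref=uses.fa"
    else
        # De novo detection (section 3.2).
        echo "$base ow"
    fi
}

# Example usage:
#   first_pass_cmd .
```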
4. Iterative Refinement
4.1 First Refinement Pass
bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs2.fq outr=repeats2.fa outs=spacers2.fa chist=chist2.txt phist=phist2.txt ref=repeats.fa mincount=3
Uses repeats discovered in the first pass as references for a second, more refined analysis. The mincount=3 parameter restricts analysis to repeats encountered at least 3 times, preventing pollution from spurious repeats.
4.2 Second Refinement Pass
bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs3.fq outr=repeats3.fa outs=spacers3.fa chist=chist3.txt phist=phist3.txt ref=repeats2.fa mincount=3
The final refinement pass uses the repeat set from the second pass (repeats2.fa) as its reference. Because each pass searches with only the repeats that met mincount in the previous pass, spurious repeats are progressively filtered out.
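The two refinement commands differ only in their file suffixes and in which repeat file they use as a reference, so the pattern generalizes to pass N. The sketch below prints the command for a given pass rather than running it; refine_cmd and pass_suffix are hypothetical helper names, and the naming scheme (no suffix for pass 1, the pass number afterwards) is inferred from the commands above.

```shell
#!/usr/bin/env bash
# File suffix for a pass: empty for pass 1, the pass number otherwise.
pass_suffix() {
    if [ "$1" -le 1 ]; then echo ""; else echo "$1"; fi
}

# Print the BBCrisprFinder command for refinement pass $1 (2 or higher).
# $2 is the mincount value (default 3, as used in this pipeline).
refine_cmd() {
    local n="$1" mincount="${2:-3}" cur prev
    cur=$(pass_suffix "$n")
    prev=$(pass_suffix "$((n - 1))")
    printf 'bbcrisprfinder.sh in=temp.fq.gz int=f outc=crisprs%s.fq outr=repeats%s.fa outs=spacers%s.fa chist=chist%s.txt phist=phist%s.txt ref=repeats%s.fa mincount=%s\n' \
        "$cur" "$cur" "$cur" "$cur" "$cur" "$prev" "$mincount"
}

# Example usage: print the commands for passes 2 and 3.
#   for n in 2 3; do refine_cmd "$n"; done
```

Printing the commands first makes it easy to review them before executing, and the second argument allows experimenting with other mincount values.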
Basic Usage
With Known CRISPR Repeats
# 1. Prepare input files
ln -s your_illumina_reads.fq.gz reads.fq.gz
# Prepare knownRepeats.fa with your reference repeats
# 2. Run the pipeline (will use reference-based detection)
bash crisprPipeline.sh
# 3. Results will be in crisprs3.fq, repeats3.fa, spacers3.fa
De Novo Discovery
# 1. Prepare input files
ln -s your_illumina_reads.fq.gz reads.fq.gz
# No reference file needed
# 2. Run the pipeline (will perform de novo detection)
bash crisprPipeline.sh
# 3. Results progress through crisprs.fq → crisprs2.fq → crisprs3.fq
CRISPR Detection Strategy
Iterative Refinement Approach
The pipeline runs detection in up to three passes, each using the repeats from the previous pass as its reference:
- Initial Detection: Broad search for potential CRISPR structures
- First Refinement: Uses high-confidence repeats as references
- Second Refinement: Final polishing with most reliable repeat set
Quality Control Parameters
- mincount=3: Minimum repeat occurrences for reliability
- Conservative error correction: Preserves genuine sequence variations
- Strict merging: Ensures high-quality merged reads
Reference vs. De Novo Trade-offs
Approach | Advantages | Disadvantages | Best For |
---|---|---|---|
Reference-Based | Higher sensitivity, faster processing, validated repeats | Limited to known repeat families, may miss novel CRISPRs | Well-studied organisms, targeted analysis |
De Novo | Discovers novel repeats, unbiased detection, comprehensive | Slower, may include false positives, requires validation | Novel organisms, exploratory analysis |
Pass Recommendations
With Reference Repeats
- 1-2 total passes: Usually sufficient with good reference
- Single pass: If reference is comprehensive and high-quality
- Two passes: For refinement and validation
Without Reference (De Novo)
- 3 total passes: Recommended for comprehensive detection
- Pass 1: Broad discovery of potential CRISPRs
- Pass 2: Refinement using high-confidence repeats
- Pass 3: Final validation and polishing
Output Files
CRISPR Detection Results
- crisprs.fq, crisprs2.fq, crisprs3.fq - Reads containing CRISPR arrays from each pass
- repeats.fa, repeats2.fa, repeats3.fa - Identified repeat sequences from each pass
- spacers.fa, spacers2.fa, spacers3.fa - Identified spacer sequences from each pass
- uses.fa - Reference repeats actually utilized (reference mode only)
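To compare passes at a glance, the record counts of these files can be tabulated. The sketch below assumes the .fq files are standard 4-line FASTQ and the .fa files are FASTA; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Count FASTQ records, assuming the standard 4 lines per record.
fastq_count() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Count FASTA records by counting header lines (prints 0 when there are none).
fasta_count() {
    grep -c '^>' "$1" || true
}

# Print one tab-separated row per pass found in directory $1 (default: .):
# pass number, CRISPR reads, repeats, spacers.
summarize_passes() {
    local dir="${1:-.}" n sfx
    for n in 1 2 3; do
        if [ "$n" -eq 1 ]; then sfx=""; else sfx="$n"; fi
        [ -e "$dir/crisprs$sfx.fq" ] || continue
        printf '%s\t%s\t%s\t%s\n' "$n" \
            "$(fastq_count "$dir/crisprs$sfx.fq")" \
            "$(fasta_count "$dir/repeats$sfx.fa")" \
            "$(fasta_count "$dir/spacers$sfx.fa")"
    done
}

# Example usage:
#   summarize_passes .
```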
Analysis Files
- chist.txt, chist2.txt, chist3.txt - CRISPR count histograms from each pass
- phist.txt, phist2.txt, phist3.txt - Palindrome histogram data from each pass
Preprocessing Files
- filtered_by_tile.fq.gz - Quality-filtered reads
- trimmed.fq.gz - Adapter-trimmed reads
- filtered.fq.gz - Contaminant-free reads
- merged.fq.gz - Merged paired-end reads
- eccc.fq.gz, ecct.fq.gz - Error-corrected reads
Parameter Optimization
Memory Management
- Error correction: Skip tadpole error correction if memory is limited
- BBCrisprFinder: May benefit from prefilter flags for large datasets
- Conservative approach: Start with default parameters
Sensitivity Tuning
- mincount parameter: Lower values (even 1) can be used but may slow analysis
- Error correction passes: The pipeline runs Clumpify with passes=9; raise this value for higher accuracy
- Merging strictness: Adjust based on data quality
Speed Optimization
- Single pass: Use only with high-quality reference repeats
- Skip optional steps: Remove error correction phases if speed is critical
- Higher mincount: Increases speed but may reduce sensitivity
Quality Assessment
Success Indicators
- Progressive improvement: Each pass should refine results
- Consistent repeats: High-quality repeats appear across passes
- Reasonable spacer diversity: Multiple unique spacers per repeat family
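One way to check repeat consistency is to count how many repeat sequences two passes have in common. The following is a rough sketch with hypothetical function names: it compares sequences as exact, case-insensitive strings, so it does not recognize reverse complements or near-identical repeats as the same.

```shell
#!/usr/bin/env bash
# Print each sequence of FASTA file $1 on one line, uppercased.
fasta_seqs() {
    awk '/^>/ { if (seq != "") print seq; seq = ""; next }
         { seq = seq toupper($0) }
         END { if (seq != "") print seq }' "$1"
}

# Count distinct sequences of FASTA $1 that also occur verbatim in FASTA $2.
shared_repeat_count() {
    local tmp
    tmp=$(mktemp)
    fasta_seqs "$2" > "$tmp"
    # grep -c exits non-zero on a count of 0, hence the "|| true".
    fasta_seqs "$1" | sort -u | grep -Fxc -f "$tmp" || true
    rm -f "$tmp"
}

# Example usage:
#   shared_repeat_count repeats2.fa repeats3.fa
```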
Validation Steps
- Manual inspection: Examine repeat and spacer sequences
- Length distributions: Check histogram files for expected patterns
- Cross-reference databases: Compare repeats to known CRISPR families
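Independently of the histogram files, a length distribution can be computed directly from a repeat or spacer FASTA file as a quick sanity check. This is a minimal sketch using awk; fasta_length_hist is a hypothetical helper, and sequences wrapped over several lines are handled by summing the line lengths.

```shell
#!/usr/bin/env bash
# Print "length<TAB>count" for the sequences in FASTA file $1, sorted by length.
fasta_length_hist() {
    awk '
        /^>/ { if (seen) count[len]++; len = 0; seen = 1; next }
        { len += length($0) }
        END {
            if (seen) count[len]++
            for (l in count) print l "\t" count[l]
        }
    ' "$1" | sort -n
}

# Example usage:
#   fasta_length_hist repeats3.fa
#   fasta_length_hist spacers3.fa
```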
Pipeline Flexibility
Stage Disabling
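Each stage reads its input from temp.fq.gz, a symbolic link that is repointed at the stage's output once it finishes. To disable a stage, comment out both its command and the relinking line that follows it: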
# To skip tile filtering:
# Comment out: filterbytile.sh in=temp.fq.gz out=filtered_by_tile.fq.gz
# And the linking: rm temp.fq.gz; ln -s filtered_by_tile.fq.gz temp.fq.gz
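Instead of commenting lines in and out, the same relinking pattern can be wrapped in a helper that takes an on/off toggle. This is only a sketch of the idea: relink, run_stage, and the RUN_TILE_FILTER variable are hypothetical names, not part of crisprPipeline.sh.

```shell
#!/usr/bin/env bash
# Repoint temp.fq.gz at the output of the stage that just finished.
relink() {
    rm -f temp.fq.gz
    ln -s "$1" temp.fq.gz
}

# Run a stage only when its toggle is 1.
# $1 = toggle, $2 = stage output file, remaining arguments = the stage command.
# The link is moved only if the command succeeds.
run_stage() {
    local enabled="$1" out="$2"
    shift 2
    if [ "$enabled" = "1" ]; then
        "$@" && relink "$out"
    else
        echo "skipping: $*" >&2
    fi
}

# Example usage (RUN_TILE_FILTER is a hypothetical toggle variable):
#   RUN_TILE_FILTER=0
#   run_stage "$RUN_TILE_FILTER" filtered_by_tile.fq.gz \
#       filterbytile.sh in=temp.fq.gz out=filtered_by_tile.fq.gz
```

When a stage is skipped, temp.fq.gz keeps pointing at the previous stage's output, so the next stage picks up from there.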
Customization Options
- Preprocessing: Adjust or skip quality filtering steps
- Error correction: Modify parameters or skip phases
- CRISPR detection: Tune mincount and other BBCrisprFinder parameters
- Iteration count: Adjust number of refinement passes
Troubleshooting
Common Issues
- No CRISPRs detected: Check input data quality and organism type
- Too many false positives: Increase mincount or improve preprocessing
- Memory errors: Skip optional error correction or add prefilter flags
- Poor repeat quality: Examine preprocessing steps and data quality
Optimization Strategies
- Data quality: Ensure sufficient coverage and read quality
- Reference selection: Use high-quality, relevant reference repeats
- Parameter tuning: Adjust mincount based on data characteristics
- Validation: Cross-check results with external CRISPR databases
Downstream Analysis
Repeat Analysis
- Compare identified repeats to CRISPRdb or other databases
- Analyze repeat family relationships and evolution
- Examine repeat secondary structures and motifs
Spacer Analysis
- BLAST spacers against viral and plasmid databases
- Analyze spacer acquisition patterns and timeline
- Identify protospacer adjacent motifs (PAMs)
Array Architecture
- Reconstruct complete CRISPR array structures
- Analyze leader sequences and orientations
- Compare arrays across related organisms