BBCrisprFinder

Script: bbcrisprfinder.sh Package: jgi Class: CrisprFinder.java

Finds interspersed repeats contained within sequences; specifically, only information within a sequence is used. This is based on the repeat-spacer model of crisprs. Designed for short reads, but should work with full genomes.

Basic Usage

bbcrisprfinder.sh in=<input file>

See bbmap/pipelines/crisprPipeline.sh for more examples.

Parameters

Parameters are organized by function in the CRISPR detection process. The tool uses k-mer seeding followed by extension and consensus refinement to identify repeat-spacer patterns characteristic of CRISPR arrays.

Standard parameters

in=<file>
Primary input (a fasta or fastq file).
out=<file>
Output for annotated reads containing interspersed repeats. Reads not matching the criteria are discarded (though pairs are retained).
outc=<file>
Output for crispr sequences and their flanking repeats. Repeats are in lower-case; spacer is upper-case.
outr=<file>
Output of just repeats (one copy of each unique repeat). Only includes full-length repeats.
outs=<file>
Output of just spacers (one copy of each unique spacer). Only includes spacers adjacent to at least one full-length repeat (so the bounds are known).
chist=<file>
Histogram of crispr stats.
phist=<file>
Histogram of palindrome stats.

Processing parameters

merge=f
Try to merge paired reads before processing them.
masked=f
Indicates that the sequences are masked to lowercase; repeats will not be extended outside of lowercase regions. BBDuk or BBMask can produce masked sequences.
pad=0
Extend the boundaries of the repeats by this much when printing to outcrisper. The padding will be in uppercase. This helps to show why a repeat was not extended.
annotate=f
Rename reads to indicate the repeat boundaries. Annotations look like this: [56-77,117-138;150],P=7,M=5,L=6,S=62,T=6+0 ...meaning the repeat occurs at positions 56-77 and 117-138; the read length is 150; and there is a palindrome of length 7 with 5 matches, a loop length of 6, starting at position 62, with tails outside the palindromic region length 6 and 0.
reads=-1
If positive, quit after processing this many reads.

Repeat Parameters

krepeat=13
(kr) Use this seed kmer length to find repeats. Repeats shorter than k may not be found. Max is 31.
mm=1
(mmr) Mask this many bases in the middle of kmers to allow imperfect repeats. Also requires rmismatches to be set.
minrepeats=2
Ignore repeats with fewer than this many nearby copies. The more copies, the better consensus will refine the borders. For this purpose 2 partials count as 1 full-length repeat.
rmismatches=3
(rmm) Maximum allowed mismatches in a repeat pair.
minspacer=14
Ignore spacers shorter than this.
maxspacer=60
Ignore spacers longer than this.
minrepeat=20
Ignore repeats shorter than this.
maxrepeat=54
Ignore repeats longer than this.
minrgc=0.09
Ignore repeats with GC below this.
maxrgc=0.89
Ignore repeats with GC above this.
grow=t
Extend repeats through mismatches, if rmm>0. Increases the number of unique repeats, and decreases unique spacers.
lookahead=5
Minimum subsequent matches to extend through a mismatch.

Reference Parameters

ref=<file>
If a reference of known CRISPR repeats is supplied, all detected repeates will be aligned to it, using the best match to override the predicted repeat boundaries. Subsequent parameters in this section have no effect without a ref.
outref=<file>
Output the reference sequences used, annotated to indicate the number of uses and palindrome position.
kref=13
Kmer length for selecting ref sequences to attempt align. Lower is more sensitive but can take exponentially longer. 'k' will set both krepeat and kref to the same value.
mmref=1
(maskMiddleRef) Number of bases masked in the middle of the kmer. 0 is umasked, and more increases mismatch tolerance.
minrefm=18
(minRefMatches) Reject alignments with fewer matching bases.
refmm=5
(refMismatches) Reject alignments with more mismatches.
refmmf=0.2
(refmismatchfraction) Allowed mismatches will be the maximum of refmm and refmmf*length.
minrefc=0
(minRefCount) Only load reference sequences with a count of at least this. Intended for repeats generated via the 'outr' flag, whose headers have a 'count=x' term indicating how many times that exact repeat was encountered. Small numbers can be slow with large references.
discardaf=t
(discardAlignmentFailures) When the best alignment for a repeat fails repeat thresholds like minspacer, minrepeat, or rmismatches, discard that repeat.
shrinkaf=f
Trim mismatches from the ends of alignment failures.
revertaf=f
Revert alignment failures to their pre-alignment borders.
discardua=f
(discardUnaligned) Discard repeats that do not align to any ref repeats. This means they did not meet minrefm/maxrefmm.
minrepeat0=11
Allow repeats this short prior to alignment (discarded unless they lengthen). Has no impact if below kref or krepeat.
sortref=auto
Sort alignment candidates by count, then length. May be set to t/f/auto. When set to auto, sequences will be sorted only if counts are present.
doublefetch=t
(ff) Fetch ref sequences using kmers from both repeat copies. Increases sensitivity in the presence of mismatches.
doublealign=t
(aa) Align ref sequences to both ref repeat copies.

Consensus Parameters

consensus=t
When 3+ nearby repeat copies are found, adjust the boundaries so that all are identical.
minoverlapconsensus=18
Only try consensus on repeats that overlap at least this much.
maxtrimconsensus=5
Do not trim more than this many bases on each end.

Partial Repeat Parameters

bruteforce=t
Look for partial or inexact repeats adjacent to detected repeats. These can improve consensus.
mintailrepeat=9
(mtr) Minimum length of partial repeats on read tips.
rmmt=1
Maximum allowed mismatches in short repeats at read tips.
rmmtpad=3
Ignore this many extra leading mismatches when looking for tip repeats (they will be trimmed later).

Palindrome Parameters

minpal=5
If greater than 0, each repeat will be scanned for its longest palindrome of at least this length (just the palindrome sequence, excluding the loop or tail). Palindromes will be annotated alongside repeats.
pmatches=4
Minimum number of matches in a palindrome.
pmismatches=2
Maximum allowed mismatches in a palindrome.
minloop=3
Ignore palindromes with a loop shorter than this.
maxloop=26
Ignore palindromes with a loop longer than this.
mintail=0
Ignore palindromes with a tail shorter than this.
maxtail=24
Ignore palindromes with a tail longer than this.
maxtaildif=21
Ignore palindromes with a tail length difference greater than this.
reqpal=f
Discard repeats that lack a suitable palindrome.
symmetric=f
Trim repeats to make them symmetric around the palindrome. Not recommended.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic CRISPR Detection

bbcrisprfinder.sh in=reads.fq out=crisprs.fq outc=crispr_sequences.fa

Find CRISPR repeats in sequencing data and output both the annotated reads and the CRISPR sequences.

Extract Repeats and Spacers

bbcrisprfinder.sh in=genome.fa outr=repeats.fa outs=spacers.fa minrepeat=25 maxrepeat=50

Extract unique CRISPR repeats and spacers from genomic data with length constraints.

Reference-Guided Detection

bbcrisprfinder.sh in=reads.fq ref=known_crisprs.fa outref=used_refs.fa discardua=t

Use known CRISPR repeats as a reference to improve detection accuracy and discard unaligned sequences.

High Sensitivity Detection

bbcrisprfinder.sh in=reads.fq out=crisprs.fq krepeat=11 mm=2 rmismatches=4 grow=t

Use more sensitive parameters for detecting divergent CRISPR repeats with mismatches.

Palindrome Analysis

bbcrisprfinder.sh in=genome.fa outr=repeats.fa minpal=8 reqpal=t phist=palindrome_stats.txt

Focus on CRISPR repeats containing palindromes and generate palindrome statistics.

Algorithm Details

BBCrisprFinder implements a multi-phase k-mer mapping and bidirectional extension algorithm based on the repeat-spacer model of CRISPR arrays. The implementation uses specialized data structures and precise extension algorithms for accurate repeat detection.

K-mer Position Mapping Implementation

The core k-mer seeding uses fillMap method with LongLongListHashMap for position storage. The algorithm processes each sequence position using a 2-bit encoding scheme with configurable middle masking:

Bidirectional Extension Algorithm

The findRanges method implements precise extension through checkLeft and checkRight validation functions:

  • Period Calculation: period = bStop - aStop with bounds checking against minPeriod/maxPeriod
  • Left Extension: for(aStart--, bStart--; bases[aStart]==bases[bStart] && AminoAcid.isFullyDefined(bases[aStart]); aStart--, bStart--)
  • Right Extension: for(aStop++, bStop++; bases[aStop]==bases[bStop] && AminoAcid.isFullyDefined(bases[aStop]); aStop++, bStop++)
  • Match Counting: countMatches function validates identity using s[a1]==s[a2] && AminoAcid.isFullyDefined(s[a1])

Lookahead-Based Mismatch Extension

When grow=true, the algorithm extends through mismatches using growLeft and growRight functions with lookahead validation:

  • Lookahead Validation: checkLeft(bases, aStart-2, bStart-2, lookahead) requires subsequent matches ≥ lookahead parameter
  • Extension Decision: if(left>=right && left>=lookahead) determines extension direction
  • Mismatch Tracking: mismatches++; matches+=left updates alignment statistics
  • Boundary Control: Extensions terminate when mismatches > maxMismatches or lookahead fails

Reference Alignment Implementation

Reference-guided detection uses k-mer fetching and dual alignment strategies:

  • Double Fetching: doubleFetch=true enables k-mer collection from both repeat copies for comprehensive candidate selection
  • Double Alignment: doubleAlign=true aligns reference candidates against both detected repeat instances
  • Candidate Filtering: Reference sequences filtered by minRefCount and sorted by count/length when sortRefCandidates=true
  • Alignment Scoring: Uses maxRefMismatches and maxRefMismatchFraction*length for threshold determination

Consensus Boundary Refinement

The consensus algorithm processes repeat clusters to standardize boundaries:

  • Cluster Identification: Groups repeats with overlap ≥ minOverlapConsensus bases
  • Boundary Adjustment: Trims up to maxTrimConsensus bases from each end to maximize identity
  • Identity Verification: Ensures all repeat copies become identical after trimming
  • Statistics Tracking: trimmedByConsensus counter tracks consensus modifications

Palindrome Detection via PalindromeFinder

Each repeat undergoes palindrome analysis using dedicated PalindromeFinder class:

  • Palindrome Search: palFinder.longestPalindrome(bases, start, stop) identifies optimal palindrome structure
  • Dual Tracking: Separate tracking for edge-touching (tracker) and internal repeats (trackerFull)
  • Parameter Validation: Palindromes must satisfy minLoop, maxLoop, minTail, maxTail, and maxTailDif constraints
  • Symmetry Enforcement: Optional forceSymmetry trims repeats to center around palindrome

Memory and Performance Architecture

The implementation uses optimized data structures for scalability:

  • Hash Map Efficiency: LongLongListHashMap provides O(1) k-mer lookup with collision handling
  • Parallel Processing: Multi-threaded architecture with ProcessThread instances and shared accumulators
  • Memory Management: K-mer maps scale with sequence diversity, not length; approximately 12-20 bytes per unique k-mer
  • Linear Scalability: Processing time scales linearly with sequence length for typical genomic inputs

Output Stream Specialization

Multiple output streams serve distinct analytical purposes:

  • Annotated Reads (out): Original reads with CRISPR coordinate annotations in headers
  • CRISPR Sequences (outc): Extracted sequences with repeat regions in lowercase, spacers in uppercase
  • Unique Repeats (outr): Deduplicated repeat sequences with count=N annotations
  • Unique Spacers (outs): Spacer sequences bounded by full-length repeats for accurate extraction
  • Reference Usage (outref): Reference sequences annotated with usage counts and palindrome positions

Support

For questions and support: