BBCrisprFinder
Finds interspersed repeats contained within sequences; specifically, only information within a sequence is used. This is based on the repeat-spacer model of crisprs. Designed for short reads, but should work with full genomes.
Basic Usage
bbcrisprfinder.sh in=<input file>
See bbmap/pipelines/crisprPipeline.sh for more examples.
Parameters
Parameters are organized by function in the CRISPR detection process. The tool uses k-mer seeding followed by extension and consensus refinement to identify repeat-spacer patterns characteristic of CRISPR arrays.
Standard parameters
- in=<file>
- Primary input (a fasta or fastq file).
- out=<file>
- Output for annotated reads containing interspersed repeats. Reads not matching the criteria are discarded (though pairs are retained).
- outc=<file>
- Output for crispr sequences and their flanking repeats. Repeats are in lower-case; spacer is upper-case.
- outr=<file>
- Output of just repeats (one copy of each unique repeat). Only includes full-length repeats.
- outs=<file>
- Output of just spacers (one copy of each unique spacer). Only includes spacers adjacent to at least one full-length repeat (so the bounds are known).
- chist=<file>
- Histogram of crispr stats.
- phist=<file>
- Histogram of palindrome stats.
Processing parameters
- merge=f
- Try to merge paired reads before processing them.
- masked=f
- Indicates that the sequences are masked to lowercase; repeats will not be extended outside of lowercase regions. BBDuk or BBMask can produce masked sequences.
- pad=0
- Extend the boundaries of the repeats by this much when printing to outcrisper. The padding will be in uppercase. This helps to show why a repeat was not extended.
- annotate=f
- Rename reads to indicate the repeat boundaries. Annotations look like this: [56-77,117-138;150],P=7,M=5,L=6,S=62,T=6+0 ...meaning the repeat occurs at positions 56-77 and 117-138; the read length is 150; and there is a palindrome of length 7 with 5 matches, a loop length of 6, starting at position 62, with tails outside the palindromic region length 6 and 0.
- reads=-1
- If positive, quit after processing this many reads.
Repeat Parameters
- krepeat=13
- (kr) Use this seed kmer length to find repeats. Repeats shorter than k may not be found. Max is 31.
- mm=1
- (mmr) Mask this many bases in the middle of kmers to allow imperfect repeats. Also requires rmismatches to be set.
- minrepeats=2
- Ignore repeats with fewer than this many nearby copies. The more copies, the better consensus will refine the borders. For this purpose 2 partials count as 1 full-length repeat.
- rmismatches=3
- (rmm) Maximum allowed mismatches in a repeat pair.
- minspacer=14
- Ignore spacers shorter than this.
- maxspacer=60
- Ignore spacers longer than this.
- minrepeat=20
- Ignore repeats shorter than this.
- maxrepeat=54
- Ignore repeats longer than this.
- minrgc=0.09
- Ignore repeats with GC below this.
- maxrgc=0.89
- Ignore repeats with GC above this.
- grow=t
- Extend repeats through mismatches, if rmm>0. Increases the number of unique repeats, and decreases unique spacers.
- lookahead=5
- Minimum subsequent matches to extend through a mismatch.
Reference Parameters
- ref=<file>
- If a reference of known CRISPR repeats is supplied, all detected repeates will be aligned to it, using the best match to override the predicted repeat boundaries. Subsequent parameters in this section have no effect without a ref.
- outref=<file>
- Output the reference sequences used, annotated to indicate the number of uses and palindrome position.
- kref=13
- Kmer length for selecting ref sequences to attempt align. Lower is more sensitive but can take exponentially longer. 'k' will set both krepeat and kref to the same value.
- mmref=1
- (maskMiddleRef) Number of bases masked in the middle of the kmer. 0 is umasked, and more increases mismatch tolerance.
- minrefm=18
- (minRefMatches) Reject alignments with fewer matching bases.
- refmm=5
- (refMismatches) Reject alignments with more mismatches.
- refmmf=0.2
- (refmismatchfraction) Allowed mismatches will be the maximum of refmm and refmmf*length.
- minrefc=0
- (minRefCount) Only load reference sequences with a count of at least this. Intended for repeats generated via the 'outr' flag, whose headers have a 'count=x' term indicating how many times that exact repeat was encountered. Small numbers can be slow with large references.
- discardaf=t
- (discardAlignmentFailures) When the best alignment for a repeat fails repeat thresholds like minspacer, minrepeat, or rmismatches, discard that repeat.
- shrinkaf=f
- Trim mismatches from the ends of alignment failures.
- revertaf=f
- Revert alignment failures to their pre-alignment borders.
- discardua=f
- (discardUnaligned) Discard repeats that do not align to any ref repeats. This means they did not meet minrefm/maxrefmm.
- minrepeat0=11
- Allow repeats this short prior to alignment (discarded unless they lengthen). Has no impact if below kref or krepeat.
- sortref=auto
- Sort alignment candidates by count, then length. May be set to t/f/auto. When set to auto, sequences will be sorted only if counts are present.
- doublefetch=t
- (ff) Fetch ref sequences using kmers from both repeat copies. Increases sensitivity in the presence of mismatches.
- doublealign=t
- (aa) Align ref sequences to both ref repeat copies.
Consensus Parameters
- consensus=t
- When 3+ nearby repeat copies are found, adjust the boundaries so that all are identical.
- minoverlapconsensus=18
- Only try consensus on repeats that overlap at least this much.
- maxtrimconsensus=5
- Do not trim more than this many bases on each end.
Partial Repeat Parameters
- bruteforce=t
- Look for partial or inexact repeats adjacent to detected repeats. These can improve consensus.
- mintailrepeat=9
- (mtr) Minimum length of partial repeats on read tips.
- rmmt=1
- Maximum allowed mismatches in short repeats at read tips.
- rmmtpad=3
- Ignore this many extra leading mismatches when looking for tip repeats (they will be trimmed later).
Palindrome Parameters
- minpal=5
- If greater than 0, each repeat will be scanned for its longest palindrome of at least this length (just the palindrome sequence, excluding the loop or tail). Palindromes will be annotated alongside repeats.
- pmatches=4
- Minimum number of matches in a palindrome.
- pmismatches=2
- Maximum allowed mismatches in a palindrome.
- minloop=3
- Ignore palindromes with a loop shorter than this.
- maxloop=26
- Ignore palindromes with a loop longer than this.
- mintail=0
- Ignore palindromes with a tail shorter than this.
- maxtail=24
- Ignore palindromes with a tail longer than this.
- maxtaildif=21
- Ignore palindromes with a tail length difference greater than this.
- reqpal=f
- Discard repeats that lack a suitable palindrome.
- symmetric=f
- Trim repeats to make them symmetric around the palindrome. Not recommended.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic CRISPR Detection
bbcrisprfinder.sh in=reads.fq out=crisprs.fq outc=crispr_sequences.fa
Find CRISPR repeats in sequencing data and output both the annotated reads and the CRISPR sequences.
Extract Repeats and Spacers
bbcrisprfinder.sh in=genome.fa outr=repeats.fa outs=spacers.fa minrepeat=25 maxrepeat=50
Extract unique CRISPR repeats and spacers from genomic data with length constraints.
Reference-Guided Detection
bbcrisprfinder.sh in=reads.fq ref=known_crisprs.fa outref=used_refs.fa discardua=t
Use known CRISPR repeats as a reference to improve detection accuracy and discard unaligned sequences.
High Sensitivity Detection
bbcrisprfinder.sh in=reads.fq out=crisprs.fq krepeat=11 mm=2 rmismatches=4 grow=t
Use more sensitive parameters for detecting divergent CRISPR repeats with mismatches.
Palindrome Analysis
bbcrisprfinder.sh in=genome.fa outr=repeats.fa minpal=8 reqpal=t phist=palindrome_stats.txt
Focus on CRISPR repeats containing palindromes and generate palindrome statistics.
Algorithm Details
BBCrisprFinder implements a multi-phase k-mer mapping and bidirectional extension algorithm based on the repeat-spacer model of CRISPR arrays. The implementation uses specialized data structures and precise extension algorithms for accurate repeat detection.
K-mer Position Mapping Implementation
The core k-mer seeding uses fillMap
method with LongLongListHashMap
for position storage. The algorithm processes each sequence position using a 2-bit encoding scheme with configurable middle masking:
- Hash Table Structure:
LongLongListHashMap
maps masked k-mers to position lists using 64-bit keys - Middle Masking Logic: When
mm > 0
, creates mask using~((-1L)<
where bits = maskMiddle*2 - Position Storage: Each k-mer occurrence stored with
kmerPositionMap.put(mmKmer, position)
- Repeat Detection: K-mers with multiple positions trigger repeat analysis via
positions.size() >= 2
Bidirectional Extension Algorithm
The findRanges
method implements precise extension through checkLeft
and checkRight
validation functions:
- Period Calculation:
period = bStop - aStop
with bounds checking against minPeriod/maxPeriod - Left Extension:
for(aStart--, bStart--; bases[aStart]==bases[bStart] && AminoAcid.isFullyDefined(bases[aStart]); aStart--, bStart--)
- Right Extension:
for(aStop++, bStop++; bases[aStop]==bases[bStop] && AminoAcid.isFullyDefined(bases[aStop]); aStop++, bStop++)
- Match Counting:
countMatches
function validates identity usings[a1]==s[a2] && AminoAcid.isFullyDefined(s[a1])
Lookahead-Based Mismatch Extension
When grow=true
, the algorithm extends through mismatches using growLeft
and growRight
functions with lookahead validation:
- Lookahead Validation:
checkLeft(bases, aStart-2, bStart-2, lookahead)
requires subsequent matches ≥ lookahead parameter - Extension Decision:
if(left>=right && left>=lookahead)
determines extension direction - Mismatch Tracking:
mismatches++; matches+=left
updates alignment statistics - Boundary Control: Extensions terminate when
mismatches > maxMismatches
or lookahead fails
Reference Alignment Implementation
Reference-guided detection uses k-mer fetching and dual alignment strategies:
- Double Fetching:
doubleFetch=true
enables k-mer collection from both repeat copies for comprehensive candidate selection - Double Alignment:
doubleAlign=true
aligns reference candidates against both detected repeat instances - Candidate Filtering: Reference sequences filtered by
minRefCount
and sorted by count/length whensortRefCandidates=true
- Alignment Scoring: Uses
maxRefMismatches
andmaxRefMismatchFraction*length
for threshold determination
Consensus Boundary Refinement
The consensus algorithm processes repeat clusters to standardize boundaries:
- Cluster Identification: Groups repeats with overlap ≥
minOverlapConsensus
bases - Boundary Adjustment: Trims up to
maxTrimConsensus
bases from each end to maximize identity - Identity Verification: Ensures all repeat copies become identical after trimming
- Statistics Tracking:
trimmedByConsensus
counter tracks consensus modifications
Palindrome Detection via PalindromeFinder
Each repeat undergoes palindrome analysis using dedicated PalindromeFinder
class:
- Palindrome Search:
palFinder.longestPalindrome(bases, start, stop)
identifies optimal palindrome structure - Dual Tracking: Separate tracking for edge-touching (
tracker
) and internal repeats (trackerFull
) - Parameter Validation: Palindromes must satisfy minLoop, maxLoop, minTail, maxTail, and maxTailDif constraints
- Symmetry Enforcement: Optional
forceSymmetry
trims repeats to center around palindrome
Memory and Performance Architecture
The implementation uses optimized data structures for scalability:
- Hash Map Efficiency:
LongLongListHashMap
provides O(1) k-mer lookup with collision handling - Parallel Processing: Multi-threaded architecture with
ProcessThread
instances and shared accumulators - Memory Management: K-mer maps scale with sequence diversity, not length; approximately 12-20 bytes per unique k-mer
- Linear Scalability: Processing time scales linearly with sequence length for typical genomic inputs
Output Stream Specialization
Multiple output streams serve distinct analytical purposes:
- Annotated Reads (out): Original reads with CRISPR coordinate annotations in headers
- CRISPR Sequences (outc): Extracted sequences with repeat regions in lowercase, spacers in uppercase
- Unique Repeats (outr): Deduplicated repeat sequences with
count=N
annotations - Unique Spacers (outs): Spacer sequences bounded by full-length repeats for accurate extraction
- Reference Usage (outref): Reference sequences annotated with usage counts and palindrome positions
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org