FindRepeats
Finds repeats in a genome using k-mer based analysis without alignment. Supports exact and inexact repeats with configurable gap tolerance and depth filtering.
Basic Usage
findrepeats.sh in=<input file> out=<output file>
No alignment is performed; a sequence is considered to be a repeat of depth D if all kmers within it have a depth of at least D. Runs of up to G consecutive kmers with lower counts (gaps) may be allowed, which typically finds far more and substantially longer repeats even with a small G.
Parameters
Parameters are grouped by function: standard input/output options, processing options that control repeat detection, and Java options.
Standard parameters
- in=<file>
- Primary input (the genome fasta). Input sequences should be in FASTA format.
- out=<file>
- Primary output (list of repeats as TSV). If no file is given this will be printed to screen; to suppress printing, use 'out=null'. Output format includes coordinates, depth, length, and sequence preview.
- outs=<file>
- (outsequence) Optional sequence output, for printing or masking repeats. Used with mask or print parameters to output modified sequences.
- overwrite=f
- (ow) False ('f') forces the program to abort rather than overwrite an existing file. Set to 't' to allow overwriting output files.
- showspeed=t
- (ss) Set to 'f' to suppress display of processing speed. When enabled, shows processing rate in bases per second.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Only applies to compressed output files.
Processing parameters
- k=31
- Kmer length to use (range is 1-31; 31 is recommended). Longer k-mers provide more specificity but may miss shorter repeats. K=31 provides a good balance of sensitivity and specificity.
- gap=0
- Maximum allowed gap length within a repeat, in kmers. This allows inexact repeats. Note that a 1bp mismatch will spawn a length K gap. Higher values find longer but less exact repeats.
- qhdist=0
- Hamming distance within a kmer; allows inexact repeats. Values above 0 become exponentially slower. Use with caution as runtime increases dramatically with higher values.
- minrepeat=0
- Minimum repeat length to report, in bases. Nothing shorter than kmer length will be found regardless of this value. Set higher to filter out very short repetitive elements.
- mindepth=2
- Ignore copy counts below mindepth. Minimum value is 2. Higher values focus on higher-copy repeats and reduce false positives from unique sequences.
- maxdepth=-1
- If positive, copy counts greater than this will be reported as this number. This can greatly increase speed in rare cases of thousand+ copy repeats by capping the maximum reported depth.
- preview=27
- Print this many characters of the repeat per line. Set to 0 to suppress (may save memory). Provides sequence context in the output for manual inspection.
- mask=f
- Write sequence with masked repeats to 'outs'. Possible values:
f: Do not mask.
t: Mask (by default, 't' or 'true' are the same as 'soft').
soft: Convert masked bases to lower case.
hard: Convert masked bases to 'N'.
Other characters: Convert masked bases to that character.
- print=t
- (printrepeats) Print repeat sequence to outs. 'print' and 'mask' are mutually exclusive so enabling one will disable the other. Used to extract just the repetitive sequences.
- weak=f
- (weaksubsumes) Ignore repeats that are weakly subsumed by other repeats. A repeat is subsumed if there is another repeat with greater depth at the same coordinates: since a 3-copy repeat is also a 2-copy repeat, only the 3-copy repeat is reported. However, if the 3-copy repeat is inexact (has gaps) and the 2-copy repeat is perfect, both are reported when 'weak=f' (the default). With 'weak=t', only the highest-depth version is reported, even if it has more gaps. In either case all 3 copies are reported; with 'weak=f', some copies are reported twice for the same coordinates, once as a depth-2 perfect repeat and again as a depth-3 imperfect repeat.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Large genomes require substantial memory for k-mer tables.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to handle memory exhaustion gracefully.
- -da
- Disable assertions. May provide slight performance improvement in production use by disabling internal consistency checks.
Examples
Basic Repeat Detection
findrepeats.sh in=genome.fasta out=repeats.tsv
Finds all repeats with default parameters (k=31, mindepth=2) and outputs results in TSV format.
Allow Gaps in Repeats
findrepeats.sh in=genome.fasta out=repeats.tsv gap=2 minrepeat=100
Allows gaps of up to 2 consecutive low-depth k-mers within a repeat and only reports repeats ≥100bp long. Useful for finding longer imperfect repeats.
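When choosing a gap value, it helps to recall from the 'gap' parameter description that a 1bp mismatch spawns a length K gap. The short Python sketch below illustrates why: it lists the k-mers that overlap a single base. The function name kmers_covering is hypothetical and is not part of FindRepeats; the sketch also assumes qhdist=0 and that the affected k-mers do not occur elsewhere in the genome.

```python
def kmers_covering(pos, seq_len, k):
    """Return the start indices of all k-mers that include base `pos`."""
    first = max(0, pos - k + 1)      # earliest k-mer that still reaches pos
    last = min(pos, seq_len - k)     # latest k-mer that fits in the sequence
    return list(range(first, last + 1))

# A substitution in the middle of a 200 bp repeat copy, with k=31:
affected = kmers_covering(pos=100, seq_len=200, k=31)
print(len(affected))  # 31
```

Under these assumptions, gap=2 tolerates only very short low-depth runs, and bridging an isolated substitution with the default k=31 would need a gap of about 31.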
High-Copy Repeats Only
findrepeats.sh in=genome.fasta out=high_copy_repeats.tsv mindepth=10 maxdepth=100
Focuses on high-copy repeats (≥10 copies) and caps reporting at 100 copies for performance in highly repetitive genomes.
Mask Repeats in Output Sequence
findrepeats.sh in=genome.fasta out=repeats.tsv outs=masked_genome.fasta mask=soft
Detects repeats and outputs both the repeat list and a soft-masked genome sequence (repeats in lowercase).
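The masking modes described under 'mask' can be illustrated with a small Python sketch. This is not the tool's code: mask_repeats and its arguments are hypothetical names, and the intervals are assumed to be 0-based with an exclusive stop, which may differ from the coordinates FindRepeats reports.

```python
def mask_repeats(seq, intervals, mode="soft"):
    """Mask [start, stop) intervals of seq.

    mode="soft" lowercases the bases, mode="hard" writes 'N',
    and any other single character is used as the replacement.
    """
    out = list(seq)
    for start, stop in intervals:
        for i in range(start, min(stop, len(out))):
            if mode == "soft":
                out[i] = out[i].lower()
            elif mode == "hard":
                out[i] = "N"
            else:
                out[i] = mode
    return "".join(out)

print(mask_repeats("AAACGTCCC", [(3, 6)]))          # AAAcgtCCC
print(mask_repeats("AAACGTCCC", [(3, 6)], "hard"))  # AAANNNCCC
```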
Extract Repeat Sequences
findrepeats.sh in=genome.fasta out=repeats.tsv outs=repeat_sequences.fasta print=t
Extracts just the repetitive sequences to a separate FASTA file for further analysis.
Algorithm Details
K-mer Based Approach
FindRepeats uses a k-mer based strategy that avoids expensive sequence alignment while maintaining high sensitivity for repetitive elements. The algorithm works by:
- K-mer Counting: All k-mers in the genome are counted using hash tables (KmerTableSet)
- Depth Assessment: Each position is evaluated based on the depth of its constituent k-mers
- Repeat Extension: Consecutive positions with k-mer depths ≥ mindepth are extended into repeat regions
- Gap Tolerance: Gaps up to 'gap' k-mers with insufficient depth are bridged to form longer repeats
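The four steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the tool's implementation: it counts forward-strand k-mers only with a plain Counter instead of KmerTableSet, handles a single depth threshold, and uses a hypothetical function name (find_repeats) with 0-based, stop-exclusive coordinates.

```python
from collections import Counter

def find_repeats(seq, k=4, mindepth=2, gap=0):
    """Return (start, stop) base intervals whose k-mers all have
    depth >= mindepth, bridging runs of at most `gap` low-depth k-mers."""
    n = len(seq) - k + 1
    if n <= 0:
        return []
    # Step 1: count every k-mer.
    counts = Counter(seq[i:i + k] for i in range(n))
    # Step 2: depth of the k-mer starting at each position.
    depth = [counts[seq[i:i + k]] for i in range(n)]
    repeats = []
    start = None      # first k-mer index of the open repeat
    last_good = None  # most recent k-mer index with sufficient depth
    for i, d in enumerate(depth):
        if d >= mindepth:
            # Step 3: extend (or open) the current repeat.
            if start is None:
                start = i
            last_good = i
        elif start is not None and i - last_good > gap:
            # Step 4: the low-depth run exceeded the allowed gap; close it.
            repeats.append((start, last_good + k))
            start = None
    if start is not None:
        repeats.append((start, last_good + k))
    return repeats

# "ACGT" occurs twice; every other 4-mer is unique.
print(find_repeats("AAACGTCCCACGTGGG", k=4))  # [(2, 6), (9, 13)]
```

Raising gap lets the loop keep a repeat open across short low-depth runs, which is why even a small gap tends to merge neighboring fragments into longer repeats.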
Repeat Classification Strategy
The tool uses a dual strategy for managing repeats of different depths:
- Perfect vs. Imperfect: Perfect repeats have no gaps; imperfect repeats contain gaps but may be longer
- Subsumption Logic: Higher-depth repeats can subsume lower-depth repeats at the same coordinates
- Weak Subsumption Control: The 'weak' parameter controls whether perfect lower-depth repeats are reported alongside imperfect higher-depth repeats
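The subsumption rules can be made concrete with a hedged Python sketch. The Repeat record, its fields, and the functions below are hypothetical; in particular, "at the same coordinates" is interpreted here as "the higher-depth repeat covers the lower-depth one", and "more gaps" is modeled as a simple gap count. The tool's internal logic may differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Repeat:
    start: int
    stop: int
    depth: int
    gaps: int  # number of gap k-mers; 0 means a perfect repeat

def subsumes(a, b, weak):
    """True if repeat `a` makes repeat `b` redundant."""
    covers = a.start <= b.start and a.stop >= b.stop
    if not covers or a.depth <= b.depth:
        return False
    # With weak=False, a gappier repeat cannot hide a cleaner one.
    return weak or a.gaps <= b.gaps

def filter_repeats(repeats, weak=False):
    """Drop every repeat that is subsumed by some other repeat."""
    return [b for b in repeats
            if not any(subsumes(a, b, weak) for a in repeats if a is not b)]

imperfect3 = Repeat(100, 500, depth=3, gaps=5)
perfect2 = Repeat(100, 500, depth=2, gaps=0)
print(len(filter_repeats([imperfect3, perfect2], weak=False)))  # 2
print(len(filter_repeats([imperfect3, perfect2], weak=True)))   # 1
```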
Memory Management
Memory usage scales with genome size and k-mer diversity:
- K-mer Tables: Hash tables store k-mer counts; memory scales with unique k-mer count
- Repeat Storage: Active repeats are maintained in memory during processing
- Sequence Buffering: Large sequences are processed in chunks to manage memory usage
Performance Characteristics
Performance depends on several factors:
- K-mer Length: Longer k-mers reduce false positives but may miss shorter repeats
- Hamming Distance: qhdist > 0 causes exponential slowdown due to k-mer variations
- Gap Tolerance: Higher gap values increase sensitivity but reduce specificity
- Genome Repetitiveness: Highly repetitive genomes require more memory and processing time
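To get a feel for the qhdist slowdown, one can count how many distinct DNA k-mers lie within a given Hamming distance of a single k-mer. The Python sketch below does only that arithmetic; whether FindRepeats actually enumerates every such variant is an assumption, and hamming_neighbors is a hypothetical helper name.

```python
from math import comb

def hamming_neighbors(k, d):
    """Number of distinct DNA k-mers within Hamming distance d of one
    k-mer, including the k-mer itself: sum over i of C(k, i) * 3**i."""
    return sum(comb(k, i) * 3 ** i for i in range(d + 1))

for d in range(4):
    print(d, hamming_neighbors(31, d))
# 0 1
# 1 94
# 2 4279
# 3 125644
```

For k=31, each extra unit of Hamming distance multiplies the number of candidate k-mers by well over an order of magnitude, which is consistent with the advice to use qhdist with caution.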
Additional Detection Methods
Beyond basic k-mer depth analysis, the tool supports:
- Entropy-based Detection: Low-entropy regions can be flagged as repetitive
- Short Tandem Repeats: Specialized detection for STRs with configurable k-mer sizes
- Hamming Distance Tolerance: Inexact k-mer matching for divergent repeats
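As a rough illustration of the entropy idea, the Python sketch below computes the Shannon entropy of the short k-mers in a window; low values indicate low-complexity sequence. This is a generic textbook measure chosen for illustration. The entropy definition, window size, and k used by FindRepeats are not documented here, and kmer_entropy is a hypothetical name.

```python
from collections import Counter
from math import log2

def kmer_entropy(window, k=3):
    """Shannon entropy (in bits) of the k-mer distribution in `window`."""
    n = len(window) - k + 1
    if n <= 0:
        return 0.0
    counts = Counter(window[i:i + k] for i in range(n))
    return -sum(c / n * log2(c / n) for c in counts.values())

# A homopolymer has a single distinct k-mer, so its entropy is zero.
low = kmer_entropy("AAAAAAAAAA")
# Here four distinct 3-mers occur equally often.
higher = kmer_entropy("ACGTACGTAC")
```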
Output Format
TSV Output Columns
The main output file contains tab-separated values with the following information:
- Coordinates: Start and stop positions of the repeat
- Length: Length of the repeat region in base pairs
- Depth: Copy number (k-mer depth) of the repeat
- Sequence Preview: First and last bases of the repeat (if preview > 0)
- Gap Information: Number and length of gaps within the repeat
- Statistics: GC content and other characteristics
Sequence Output Options
When using the 'outs' parameter:
- Masked Output: Original sequence with repeats masked (soft or hard masking)
- Repeat Extraction: Only the repetitive sequences, useful for repeat libraries
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org