FindRepeats

Basic Usage

findrepeats.sh in=<input file> out=<output file>

FindRepeats finds repeats in a genome without performing alignment. A sequence is considered a repeat of depth D if every kmer within it has a depth of at least D. Runs of up to G consecutive kmers with lower counts (gaps) may be allowed; even a small G typically finds far more, and substantially longer, repeats.
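
The depth rule above can be illustrated with a minimal Python sketch. This is not FindRepeats' own code: the helper names kmer_counts and region_depth are hypothetical, only the forward strand is counted, and the toy genome uses k=4 so the counts are easy to follow by hand.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every k-mer in seq (forward strand only, an assumption of this sketch)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def region_depth(region, counts, k):
    """Depth of a region: the lowest count among all k-mers it contains."""
    return min(counts[region[i:i + k]] for i in range(len(region) - k + 1))

genome = "ACGTACGTTTACGTACGT"
counts = kmer_counts(genome, k=4)

# Every 4-mer of ACGTACGT occurs at least twice in the toy genome.
print(region_depth("ACGTACGT", counts, k=4))
# Extending by one base adds the 4-mer CGTT, which occurs only once.
print(region_depth("ACGTACGTT", counts, k=4))
```

Because the depth of a region is set by its rarest kmer, a single low-count kmer is enough to end a repeat unless gaps are allowed.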

Parameters

Parameters are grouped by their function in the repeat detection process.

Standard parameters

in=<file>: Primary input (the genome fasta). Input sequences should be in FASTA format.
out=<file>: Primary output (list of repeats as TSV). If no file is given this will be printed to screen; to suppress printing, use 'out=null'. Output format includes coordinates, depth, length, and sequence preview.
outs=<file>: (outsequence) Optional sequence output, for printing or masking repeats. Used with mask or print parameters to output modified sequences.
overwrite=f: (ow) False ('f') forces the program to abort rather than overwrite an existing file. Set to 't' to allow overwriting output files.
showspeed=t: (ss) Set to 'f' to suppress display of processing speed. When enabled, shows processing rate in bases per second.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Only applies to compressed output files.

Processing parameters

k=31: Kmer length to use (range is 1-31; 31 is recommended). Longer kmers are more specific, but repeats shorter than the kmer length cannot be detected. k=31 provides a good balance of sensitivity and specificity.
gap=0: Maximum allowed gap length within a repeat, in kmers. This allows inexact repeats. Note that a 1bp mismatch will spawn a length K gap. Higher values find longer but less exact repeats.
qhdist=0: Hamming distance within a kmer; allows inexact repeats. Values above 0 become exponentially slower. Use with caution as runtime increases dramatically with higher values.
minrepeat=0: Minimum repeat length to report, in bases. Nothing shorter than kmer length will be found regardless of this value. Set higher to filter out very short repetitive elements.
mindepth=2: Ignore copy counts below mindepth. Minimum value is 2. Higher values focus on higher-copy repeats and reduce false positives from unique sequences.
maxdepth=-1: If positive, copy counts greater than this will be reported as this number. This can greatly increase speed in rare cases of thousand+ copy repeats by capping the maximum reported depth.
preview=27: Print this many characters of the repeat per line. Set to 0 to suppress (may save memory). Provides sequence context in the output for manual inspection.
mask=f: Write sequence with masked repeats to 'outs'. Possible values:
    f: Do not mask.
    t: Mask (by default, 't' or 'true' are the same as 'soft').
    soft: Convert masked bases to lower case.
    hard: Convert masked bases to 'N'.
    Other characters: Convert masked bases to that character.
print=t: (printrepeats) Print repeat sequence to outs. 'print' and 'mask' are mutually exclusive so enabling one will disable the other. Used to extract just the repetitive sequences.
weak=f: (weaksubsumes) Ignore repeats that are weakly subsumed by other repeats. A repeat is subsumed if another repeat with greater depth exists at the same coordinates: since a 3-copy repeat is also a 2-copy repeat, only the 3-copy repeat is reported. If the 3-copy repeat is inexact (has gaps) and the 2-copy repeat is perfect, both are reported under the default 'weak=f'. With 'weak=t', only the highest-depth version is reported even if it has more gaps. In either case all 3 copies are reported, but with 'weak=f' some copies are reported twice at the same coordinates, once as a depth-2 perfect repeat and again as a depth-3 imperfect repeat.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Large genomes require substantial memory for k-mer tables.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to handle memory exhaustion gracefully.
-da: Disable assertions. May provide slight performance improvement in production use by disabling internal consistency checks.

Examples

Basic Repeat Detection

findrepeats.sh in=genome.fasta out=repeats.tsv

Finds all repeats with default parameters (k=31, mindepth=2) and outputs results in TSV format.

Allow Gaps in Repeats

findrepeats.sh in=genome.fasta out=repeats.tsv gap=2 minrepeat=100

Allows gaps of up to 2 consecutive k-mers within repeats and reports only repeats at least 100 bp long. Useful for finding longer imperfect repeats.
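
When choosing a gap value, it helps to recall that gap is measured in k-mers and that a 1bp mismatch spawns a gap of length K. The arithmetic behind that statement can be sketched as follows; kmers_covering is a hypothetical helper written for this illustration, not part of the tool.

```python
def kmers_covering(pos, seq_len, k):
    """Start indices of all k-mers that include the base at index `pos` (0-based)."""
    first = max(0, pos - k + 1)      # earliest k-mer that still reaches pos
    last = min(pos, seq_len - k)     # latest k-mer that starts at or before pos
    return list(range(first, last + 1))

# A mismatch in the interior of a 1000 bp sequence, with the default k=31:
affected = kmers_covering(pos=500, seq_len=1000, k=31)
print(len(affected))  # 31 consecutive k-mers contain the mismatched base
```

By this count, an isolated mismatch away from the sequence ends lowers the depth of k consecutive k-mers, so a small setting such as gap=2 would bridge only very short low-depth runs rather than a full single-base difference.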

High-Copy Repeats Only

findrepeats.sh in=genome.fasta out=high_copy_repeats.tsv mindepth=10 maxdepth=100

Reports only repeats with at least 10 copies and caps the reported depth at 100, which can speed up processing of highly repetitive genomes.

Mask Repeats in Output Sequence

findrepeats.sh in=genome.fasta out=repeats.tsv outs=masked_genome.fasta mask=soft

Detects repeats and outputs both the repeat list and a soft-masked genome sequence (repeats in lowercase).
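
The masking modes can be pictured with a short Python sketch. It assumes repeats are given as 0-based, half-open (start, stop) intervals, which may differ from the coordinate convention FindRepeats uses in its TSV output; mask_intervals is a hypothetical name.

```python
def mask_intervals(seq, intervals, mode="soft"):
    """Mask 0-based half-open [start, stop) intervals of seq.

    mode="soft" lowercases, mode="hard" writes 'N',
    any other single character is written as-is.
    """
    out = list(seq)
    for start, stop in intervals:
        for i in range(start, stop):
            if mode == "soft":
                out[i] = out[i].lower()
            elif mode == "hard":
                out[i] = "N"
            else:
                out[i] = mode
    return "".join(out)

print(mask_intervals("ACGTACGT", [(2, 5)], mode="soft"))  # ACgtaCGT
print(mask_intervals("ACGTACGT", [(2, 5)], mode="hard"))  # ACNNNCGT
print(mask_intervals("ACGTACGT", [(2, 5)], mode="X"))     # ACXXXCGT
```

Soft masking keeps the original bases recoverable, while hard masking and custom characters discard them.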

Extract Repeat Sequences

findrepeats.sh in=genome.fasta out=repeats.tsv outs=repeat_sequences.fasta print=t

Extracts just the repetitive sequences to a separate FASTA file for further analysis.

Algorithm Details

K-mer Based Approach

FindRepeats uses a k-mer based strategy that avoids expensive sequence alignment while maintaining high sensitivity for repetitive elements. The algorithm has four stages:

K-mer Counting: All k-mers in the genome are counted using hash tables (KmerTableSet)
Depth Assessment: Each position is evaluated based on the depth of its constituent k-mers
Repeat Extension: Consecutive positions with k-mer depths ≥ mindepth are extended into repeat regions
Gap Tolerance: Gaps up to 'gap' k-mers with insufficient depth are bridged to form longer repeats
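
The stages listed above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the tool's implementation: it uses a plain Counter instead of KmerTableSet, counts the forward strand only, applies a single mindepth threshold rather than tracking every depth level, and reports 0-based coordinates with an exclusive stop. The function name find_repeats is hypothetical.

```python
from collections import Counter

def find_repeats(seq, k, mindepth=2, gap=0):
    """Return (start, stop, depth) tuples in base coordinates, stop exclusive."""
    n = len(seq) - k + 1  # number of k-mers in seq
    if n <= 0:
        return []
    # Stage 1: count every k-mer.
    counts = Counter(seq[i:i + k] for i in range(n))
    # Stage 2: depth of the k-mer starting at each position.
    depth = [counts[seq[i:i + k]] for i in range(n)]

    repeats = []

    def close(first, last):
        # Report the lowest depth among the k-mers that met the threshold.
        d = min(x for x in depth[first:last + 1] if x >= mindepth)
        repeats.append((first, last + k, d))

    start = last_good = None
    for i in range(n):
        if depth[i] >= mindepth:
            # Stage 3: open a new repeat or extend the current one.
            if start is None:
                start = i
            last_good = i
        elif start is not None and i - last_good > gap:
            # Stage 4: the low-depth run now exceeds `gap` k-mers, so end the repeat.
            close(start, last_good)
            start = None
    if start is not None:
        close(start, last_good)
    return repeats

genome = "ACGTACGTTTACGTACGT"
print(find_repeats(genome, k=4))         # two separate repeats
print(find_repeats(genome, k=4, gap=4))  # the 4 low-depth k-mers are bridged
```

On this toy genome, gap=0 yields two repeats separated by four low-depth k-mers, and gap=4 merges them into one longer imperfect repeat.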

Repeat Classification Strategy

The tool handles repeats of different depths as follows:

Perfect vs. Imperfect: Perfect repeats have no gaps; imperfect repeats contain gaps but may be longer
Subsumption Logic: Higher-depth repeats can subsume lower-depth repeats at the same coordinates
Weak Subsumption Control: The 'weak' parameter controls whether perfect lower-depth repeats are reported alongside imperfect higher-depth repeats
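
One possible reading of the subsumption rules is sketched below. The Repeat record, its gaps field, and the exact comparison are assumptions made for illustration; in particular, "same coordinates" is taken literally as identical start and stop, and the way FindRepeats actually compares gap content is not described here.

```python
from typing import NamedTuple

class Repeat(NamedTuple):
    start: int
    stop: int
    depth: int
    gaps: int  # number of low-depth k-mers bridged inside the repeat

def subsumes(a, b, weak):
    """True if repeat a makes repeat b redundant under this sketch's rules."""
    same_span = (a.start, a.stop) == (b.start, b.stop)
    if not same_span or a.depth <= b.depth:
        return False
    # Strong subsumption requires a to be at least as clean as b;
    # weak subsumption ignores the gap comparison.
    return weak or a.gaps <= b.gaps

def filter_subsumed(repeats, weak=False):
    """Drop every repeat that is subsumed by another repeat in the list."""
    return [b for b in repeats
            if not any(subsumes(a, b, weak) for a in repeats)]

perfect2 = Repeat(100, 400, depth=2, gaps=0)
gapped3 = Repeat(100, 400, depth=3, gaps=5)
print(filter_subsumed([perfect2, gapped3], weak=False))  # both are kept
print(filter_subsumed([perfect2, gapped3], weak=True))   # only the depth-3 repeat
```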

Memory Management

Memory usage scales with genome size and k-mer diversity:

K-mer Tables: Hash tables store k-mer counts; memory scales with unique k-mer count
Repeat Storage: Active repeats are maintained in memory during processing
Sequence Buffering: Large sequences are processed in chunks to manage memory usage
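
To make the chunking idea concrete, the sketch below streams k-mers from a long sequence one chunk at a time and counts the distinct k-mers, the quantity that drives table size. It assumes consecutive chunks overlap by k-1 bases so that no k-mer spanning a boundary is lost; whether FindRepeats buffers sequences exactly this way is not stated in this document, and iter_kmers_chunked is a hypothetical name.

```python
def iter_kmers_chunked(seq, k, chunk_size):
    """Yield every k-mer of seq once, slicing at most chunk_size + k - 1 bases at a time."""
    for offset in range(0, len(seq), chunk_size):
        # Extend each window by k-1 bases so boundary-spanning k-mers are included.
        window = seq[offset:offset + chunk_size + k - 1]
        for i in range(len(window) - k + 1):
            yield window[i:i + k]

genome = "ACGTACGTTTACGTACGT"
distinct = set(iter_kmers_chunked(genome, k=4, chunk_size=5))
print(len(distinct))  # distinct 4-mers; a table needs one entry per distinct k-mer
```

A repetitive genome produces fewer distinct k-mers than a non-repetitive one of the same length, so the table grows with sequence diversity rather than with raw length alone.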

Performance Characteristics

Speed and accuracy depend on several factors:

K-mer Length: Longer k-mers reduce false positives but may miss shorter repeats
Hamming Distance: qhdist > 0 causes exponential slowdown because the number of k-mer variants to consider grows rapidly with each additional allowed mismatch
Gap Tolerance: Higher gap values increase sensitivity but reduce specificity
Genome Repetitiveness: Highly repetitive genomes require more memory and processing time
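
The cost of qhdist can be estimated with a short calculation. The sketch counts how many distinct k-mers lie within a given Hamming distance of one query k-mer over the 4-letter DNA alphabet. It assumes, for illustration only, that every such variant has to be considered; how FindRepeats performs inexact lookups internally is not described here.

```python
from math import comb

def neighbors_within(k, d):
    """Number of k-mers within Hamming distance d of a given k-mer, itself included.

    Choose i of the k positions to change, with 3 alternative bases at each.
    """
    return sum(comb(k, i) * 3 ** i for i in range(d + 1))

for d in range(4):
    print(d, neighbors_within(31, d))
```

For k=31 the count rises from 1 at distance 0 to 94 at distance 1 and 4279 at distance 2, which is consistent with the advice to use qhdist with caution.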

Additional Detection Methods

Beyond basic k-mer depth analysis, the tool supports:

Entropy-based Detection: Low-entropy regions can be flagged as repetitive
Short Tandem Repeats: Specialized detection for STRs with configurable k-mer sizes
Hamming Distance Tolerance: Inexact k-mer matching for divergent repeats
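
As background for the entropy-based option, here is one common way to score sequence complexity: the Shannon entropy of the k-mer distribution inside a window. The formula, the window handling, and the function name shannon_entropy are assumptions for illustration; the entropy definition and thresholds FindRepeats uses are not given in this document.

```python
from collections import Counter
from math import log2

def shannon_entropy(window, k=1):
    """Shannon entropy in bits of the k-mer distribution within window."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    total = len(kmers)
    counts = Counter(kmers)
    # 0.0 - sum(...) avoids returning negative zero for uniform windows.
    return 0.0 - sum((c / total) * log2(c / total) for c in counts.values())

low = shannon_entropy("AAAAAAAA")   # a homopolymer carries no information
high = shannon_entropy("ACGTACGT")  # all four bases equally frequent
print(low, high)
```

Windows whose score falls below a chosen cutoff would be the candidates for flagging; the cutoff itself is left open in this sketch.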

Output Format

TSV Output Columns

The main output file contains tab-separated values with the following information:

Coordinates: Start and stop positions of the repeat
Length: Length of the repeat region in base pairs
Depth: Copy number (k-mer depth) of the repeat
Sequence Preview: First and last bases of the repeat (if preview > 0)
Gap Information: Number and length of gaps within the repeat
Statistics: GC content and other characteristics
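
For downstream processing, the TSV can be read with Python's standard csv module. The sketch below deliberately makes no assumption about column names or order, since the exact layout is not specified here; it also assumes that any header or comment lines begin with '#', which should be checked against a real output file. read_repeat_rows is a hypothetical name.

```python
import csv

def read_repeat_rows(handle):
    """Yield each data row of a tab-separated file as a list of strings.

    Blank lines and lines whose first field starts with '#' are skipped
    (the '#' convention is an assumption of this sketch).
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        yield row

# Usage, assuming repeats.tsv was produced with out=repeats.tsv:
# with open("repeats.tsv", newline="") as fh:
#     for fields in read_repeat_rows(fh):
#         print(fields)
```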

Sequence Output Options

When using the 'outs' parameter:

Masked Output: Original sequence with repeats masked (soft or hard masking)
Repeat Extraction: Only the repetitive sequences, useful for repeat libraries

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org