FindRepeats

Script: findrepeats.sh
Package: repeat
Class: RepeatFinder.java

Finds repeats in a genome using k-mer based analysis without alignment. Supports exact and inexact repeats with configurable gap tolerance and depth filtering.

Basic Usage

findrepeats.sh in=<input file> out=<output file>

No alignment is performed; a sequence is considered to be a repeat of depth D if all kmers within it have a depth of at least D. Gaps of up to G consecutive kmers with lower counts may be allowed, which typically finds far more, and substantially longer, repeats even with a small G.
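The depth rule above can be sketched in a few lines of Python. This is an illustration only, not the RepeatFinder implementation: the function names, the list-of-depths input, and the 0-based coordinates are assumptions made for the example.

```python
def find_repeat_runs(depths, min_depth=2, gap=0):
    """Return (first_kmer, last_kmer) index pairs for runs of k-mers whose
    depth is >= min_depth, bridging at most `gap` consecutive low k-mers.

    `depths` is a hypothetical list holding one depth per k-mer start position.
    """
    runs = []
    start = None      # index of the first qualifying k-mer in the open run
    last_good = None  # index of the most recent qualifying k-mer
    for i, depth in enumerate(depths):
        if depth >= min_depth:
            if start is None:
                start = i
            last_good = i
        elif start is not None and i - last_good > gap:
            # The low-depth stretch is longer than the allowed gap: close the run.
            runs.append((start, last_good))
            start = None
    if start is not None:
        runs.append((start, last_good))
    return runs


def run_to_bases(run, k):
    """Convert a (first_kmer, last_kmer) run to a 0-based, inclusive base range."""
    first, last = run
    return first, last + k - 1


depths = [1, 2, 2, 1, 2, 2, 1, 1, 2]
print(find_repeat_runs(depths, gap=0))  # three short runs
print(find_repeat_runs(depths, gap=1))  # the single low k-mer at index 3 is bridged
print(find_repeat_runs(depths, gap=2))  # everything from index 1 to 8 merges
```

A run never ends on a low-depth k-mer in this sketch, so trailing gaps are not counted as part of a repeat.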

Parameters

Parameters are grouped by their function in the repeat detection process.

Standard parameters

in=<file>
Primary input (the genome fasta). Input sequences should be in FASTA format.
out=<file>
Primary output (list of repeats as TSV). If no file is given this will be printed to screen; to suppress printing, use 'out=null'. Output format includes coordinates, depth, length, and sequence preview.
outs=<file>
(outsequence) Optional sequence output, for printing or masking repeats. Used with mask or print parameters to output modified sequences.
overwrite=f
(ow) False ('f') forces the program to abort rather than overwrite an existing file. Set to 't' to allow overwriting output files.
showspeed=t
(ss) Set to 'f' to suppress display of processing speed. When enabled, shows processing rate in bases per second.
ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster. Only applies to compressed output files.

Processing parameters

k=31
Kmer length to use (range is 1-31; 31 is recommended). Longer k-mers provide more specificity but may miss shorter repeats. k=31 provides a good balance of sensitivity and specificity.
gap=0
Maximum allowed gap length within a repeat, in kmers. This allows inexact repeats. Note that a 1bp mismatch will spawn a length K gap. Higher values find longer but less exact repeats.
qhdist=0
Hamming distance within a kmer; allows inexact repeats. Values above 0 become exponentially slower. Use with caution as runtime increases dramatically with higher values.
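One way to see why runtime grows so steeply: the number of distinct k-mers within Hamming distance d of a given k-mer is the sum, for i from 0 to d, of C(k, i) * 3^i. The sketch below only evaluates that count. It makes no claim about how qhdist is implemented internally, and the function name is hypothetical.

```python
from math import comb


def hamming_neighborhood_size(k, d):
    """Number of DNA k-mers within Hamming distance d of one k-mer (itself included).

    Choose which i positions differ (C(k, i)), then one of the 3 other bases
    at each of those positions (3**i).
    """
    return sum(comb(k, i) * 3 ** i for i in range(d + 1))


for d in range(4):
    print(d, hamming_neighborhood_size(31, d))
# 0 1
# 1 94
# 2 4279
# 3 125644
```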
minrepeat=0
Minimum repeat length to report, in bases. Nothing shorter than kmer length will be found regardless of this value. Set higher to filter out very short repetitive elements.
mindepth=2
Ignore copy counts below mindepth. Minimum value is 2. Higher values focus on higher-copy repeats and reduce false positives from unique sequences.
maxdepth=-1
If positive, copy counts greater than this will be reported as this number. This can greatly increase speed in the rare case of repeats with thousands of copies.
preview=27
Print this many characters of the repeat per line. Set to 0 to suppress (may save memory). Provides sequence context in the output for manual inspection.
mask=f
Write sequence with masked repeats to 'outs'. Possible values:
f: Do not mask.
t: Mask (by default, 't' or 'true' are the same as 'soft').
soft: Convert masked bases to lower case.
hard: Convert masked bases to 'N'.
Other characters: Convert masked bases to that character.
print=t
(printrepeats) Print repeat sequence to outs. 'print' and 'mask' are mutually exclusive so enabling one will disable the other. Used to extract just the repetitive sequences.
weak=f
(weaksubsumes) Ignore repeats that are weakly subsumed by other repeats. A repeat is subsumed if there is another repeat with greater depth at the same coordinates. Since a 3-copy repeat is also a 2-copy repeat, only the 3-copy repeat will be reported. However, if the 3-copy repeat is inexact (has gaps) and the 2-copy repeat is perfect, both will be reported under the default 'weak=f'. With 'weak=t', only the highest-depth version will be reported, even if it has more gaps. In either case all 3 copies are reported, but with 'weak=f' some copies are reported twice at the same coordinates: once as a depth-2 perfect repeat and again as a depth-3 imperfect repeat.
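The difference between the two 'weak' settings can be modeled as follows. This is one reading of the description above, not the tool's code: the Repeat record, the gap count per repeat, and the exact-coordinate match are all assumptions made for the sketch.

```python
from typing import NamedTuple


class Repeat(NamedTuple):
    # Hypothetical record; the real tool's fields may differ.
    start: int
    end: int
    depth: int
    gaps: int  # number of gapped (low-depth) stretches inside the repeat


def subsumes(a, b, weak):
    """True if repeat `a` subsumes repeat `b` under the chosen rule."""
    if (a.start, a.end) != (b.start, b.end) or a.depth <= b.depth:
        return False
    # weak=t: greater depth is enough.
    # weak=f: the deeper repeat must also be at least as exact.
    return weak or a.gaps <= b.gaps


def filter_subsumed(repeats, weak=False):
    """Drop every repeat that is subsumed by some other repeat in the list."""
    return [r for r in repeats if not any(subsumes(o, r, weak) for o in repeats)]


perfect2 = Repeat(100, 400, depth=2, gaps=0)
gapped3 = Repeat(100, 400, depth=3, gaps=1)
print(filter_subsumed([perfect2, gapped3], weak=False))  # both are kept
print(filter_subsumed([perfect2, gapped3], weak=True))   # only the depth-3 repeat
```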

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Large genomes require substantial memory for k-mer tables.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines to handle memory exhaustion gracefully.
-da
Disable assertions. May provide slight performance improvement in production use by disabling internal consistency checks.

Examples

Basic Repeat Detection

findrepeats.sh in=genome.fasta out=repeats.tsv

Finds all repeats with default parameters (k=31, mindepth=2) and outputs results in TSV format.

Allow Gaps in Repeats

findrepeats.sh in=genome.fasta out=repeats.tsv gap=2 minrepeat=100

Allows gaps of up to 2 consecutive low-depth k-mers within a repeat and reports only repeats ≥100bp long. Useful for finding longer imperfect repeats.
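When choosing a gap value, keep in mind the note under 'gap' that a 1bp mismatch spawns a length-K gap. The sketch below shows where that comes from: every k-mer overlapping the changed base is altered. The helper names and the toy sequence are made up for the illustration.

```python
def kmers(seq, k):
    """All k-mers of seq, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def changed_kmer_positions(original, mutated, k):
    """Start positions whose k-mer differs between two equal-length sequences."""
    pairs = zip(kmers(original, k), kmers(mutated, k))
    return [i for i, (a, b) in enumerate(pairs) if a != b]


k = 31
original = "ACGT" * 25                          # 100 bases
mutated = original[:50] + "T" + original[51:]   # substitute G -> T at position 50
changed = changed_kmer_positions(original, mutated, k)
print(len(changed), changed[0], changed[-1])    # 31 k-mers, starting at 20 through 50
```

So a single substitution away from the sequence ends alters K consecutive k-mers. If those altered k-mers fall below mindepth, a gap setting of at least K would be needed to bridge them.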

High-Copy Repeats Only

findrepeats.sh in=genome.fasta out=high_copy_repeats.tsv mindepth=10 maxdepth=100

Focuses on high-copy repeats (≥10 copies) and caps reporting at 100 copies for performance in highly repetitive genomes.
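As a rough model of how mindepth and maxdepth shape the reported depth, assuming the semantics given in the parameter descriptions: the function below is hypothetical, and it does not model the internal shortcut that makes capping faster.

```python
def reported_depth(count, mindepth=2, maxdepth=-1):
    """Depth that would be reported for a copy count, or None if it is ignored.

    Counts below mindepth are ignored; if maxdepth is positive, larger
    counts are reported as maxdepth.
    """
    if count < mindepth:
        return None
    if maxdepth > 0 and count > maxdepth:
        return maxdepth
    return count


for count in (5, 50, 5000):
    print(count, reported_depth(count, mindepth=10, maxdepth=100))
# 5 None
# 50 50
# 5000 100
```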

Mask Repeats in Output Sequence

findrepeats.sh in=genome.fasta out=repeats.tsv outs=masked_genome.fasta mask=soft

Detects repeats and outputs both the repeat list and a soft-masked genome sequence (repeats in lowercase).
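For readers who want to see what the mask values do to a sequence, here is a small sketch. It assumes repeats are given as 0-based, half-open (start, end) intervals, which is a convention chosen for the example and not necessarily the one used in the TSV output; the function name is also hypothetical.

```python
def mask_sequence(seq, intervals, mode="soft"):
    """Apply a mask mode to the given intervals of seq.

    mode: 'f'/'false' (no masking), 't'/'true'/'soft' (lower case),
    'hard' (replace with 'N'), or any other single character.
    """
    key = mode.lower()
    if key in ("f", "false"):
        return seq
    if key in ("t", "true"):
        key = "soft"
    if key not in ("soft", "hard") and len(mode) != 1:
        raise ValueError("mask character must be a single character")
    out = list(seq)
    for start, end in intervals:
        for i in range(start, end):
            if key == "soft":
                out[i] = out[i].lower()
            elif key == "hard":
                out[i] = "N"
            else:
                out[i] = mode  # custom mask character, used as given
    return "".join(out)


print(mask_sequence("ACGTACGT", [(2, 5)], "soft"))  # ACgtaCGT
print(mask_sequence("ACGTACGT", [(2, 5)], "hard"))  # ACNNNCGT
print(mask_sequence("ACGTACGT", [(2, 5)], "X"))     # ACXXXCGT
```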

Extract Repeat Sequences

findrepeats.sh in=genome.fasta out=repeats.tsv outs=repeat_sequences.fasta print=t

Extracts just the repetitive sequences to a separate FASTA file for further analysis.

Algorithm Details

K-mer Based Approach

FindRepeats uses a k-mer based strategy that avoids expensive sequence alignment while maintaining high sensitivity for repetitive elements. The algorithm works by:

  1. K-mer Counting: All k-mers in the genome are counted using hash tables (KmerTableSet)
  2. Depth Assessment: Each position is evaluated based on the depth of its constituent k-mers
  3. Repeat Extension: Consecutive positions with k-mer depths ≥ mindepth are extended into repeat regions
  4. Gap Tolerance: Gaps up to 'gap' k-mers with insufficient depth are bridged to form longer repeats
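Steps 1 and 2 can be illustrated with a toy depth profile. This sketch uses a plain Python Counter in place of KmerTableSet and counts forward-strand k-mers only; whether the tool also counts reverse complements is not stated here, so treat that as an assumption of the example, along with the function name.

```python
from collections import Counter


def kmer_depths(seq, k):
    """Return one depth per k-mer start position in seq.

    Step 1: count every k-mer. Step 2: look up the count of the k-mer
    starting at each position.
    """
    seq = seq.upper()
    kmer_list = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmer_list)
    return [counts[kmer] for kmer in kmer_list]


print(kmer_depths("ACGTTTACGT", 4))  # [2, 1, 1, 1, 1, 1, 2]
```

In the toy sequence only ACGT occurs twice, so only the first and last positions reach the default mindepth of 2. Steps 3 and 4 then operate on this list of depths.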

Repeat Classification Strategy

The tool uses a dual strategy for managing repeats of different depths:

Memory Management

Memory usage scales with genome size and k-mer diversity:

Performance Characteristics

Performance depends on several factors:

Additional Detection Methods

Beyond basic k-mer depth analysis, the tool supports:

Output Format

TSV Output Columns

The main output file contains tab-separated values with the following information:

Sequence Output Options

When using the 'outs' parameter:

Support

For questions and support: