Clumpify

Script: clumpify.sh Package: clump Class: Clumpify.java

Rapidly groups overlapping reads into clumps based on shared k-mers. Designed for increased file compression, accelerated overlap-based assembly, or cache-sensitive applications. Works with accurate data (Illumina, Ion Torrent, error-corrected PacBio).

Purpose and Applications

Clumpify clusters reads that share k-mers, which makes them likely (but not guaranteed) to overlap. The primary applications are:

  1. Better compression of fastq/fasta files, since placing similar reads next to each other lets gzip compress them more efficiently
  2. Duplicate removal, including optical, tile-edge, and well duplicates
  3. Error correction based on the consensus of each clump
  4. Accelerated overlap-based assembly and other cache-sensitive applications

Data Requirements and Recommendations

Paired Read Handling

Clumpify supports paired reads but is much more effective when reads are treated as unpaired. For optimal results, merge reads with BBMerge first, then concatenate the merged reads with the unmerged pairs, and clump them all together as unpaired reads (see the Paired Read Optimization example below).

Data Quality Requirements

Clumpify is designed for accurate data and will not work well on high-error-rate reads such as raw (uncorrected) PacBio or Nanopore data. Even for Illumina data, quality-trimming or error-correction may be beneficial before clumping.
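
For Illumina data with noisy read ends, a light quality-trim before clumping is one option. This is only a sketch, assuming BBDuk's standard qtrim/trimq flags and placeholder file names:

bbduk.sh in=reads.fq.gz out=trimmed.fq.gz qtrim=rl trimq=10
clumpify.sh in=trimmed.fq.gz out=clumped.fq.gz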

Memory Management and Processing

Clumpify stores all sequences in memory while clumping, but can operate in two phases to handle large datasets:

  1. Split phase: reads are assigned to groups by pivot k-mer hash and written to temporary files
  2. Clump phase: each group is loaded into memory in turn, clumped, and written to the output

With multiple groups this reduces the effective complexity from O(N*log(N)) toward O(N), and memory use can be held within bounds by increasing the group count as needed. When groups=1, the split phase is skipped, which is faster for smaller datasets.

Basic Usage

clumpify.sh in=<file> out=<file> reorder

Input may be fasta or fastq, compressed or uncompressed. Cannot accept sam format.

Parameters

Input/Output Parameters

in=<file>
Input file.
in2=<file>
Optional input for read 2 of twin paired files.
out=<file>
Output file. May not be standard out.
out2=<file>
Optional output for read 2 of twin paired files.
groups=auto
Use this many intermediate files to save memory. 1 group is fastest. Auto estimates the number of groups needed based on file size to prevent out-of-memory conditions.
lowcomplexity=f
For compressed low-complexity libraries such as RNA-seq, use more conservative memory estimates to automatically decide the number of groups.
rcomp=f
Give read clumps the same orientation to increase compression. Should be disabled for paired reads.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file.
qin=auto
Auto-detect input quality encoding. May be set to: 33 (ASCII-33/Sanger), 64 (ASCII-64/old Illumina). All modern sequence is ASCII-33.
qout=auto
Use input quality encoding as output quality encoding.
changequality=f
(cq) Fix broken quality scores such as Ns with Q>0. Default false ensures lossless compression.
fastawrap=70
Set to a higher number like 4000 for longer lines in fasta format, which increases compression.
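
For example, twin paired files can be clumped while keeping read 1 and read 2 in separate output files (file names are placeholders):

clumpify.sh in=reads_1.fq.gz in2=reads_2.fq.gz out=clumped_1.fq.gz out2=clumped_2.fq.gz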

Compression Parameters

ziplevel=6
(zl) Gzip compression level (1-11). Higher is slower. Level 11 requires pigz and is extremely slow to compress, but faster to decompress. Use a *.bz2 output extension for ~9% additional compression.
blocksize=128
Size of blocks for pigz, in kb. Higher gives slightly better compression.
shortname=f
Make the names as short as possible. 'shortname=shrink' shortens names while retaining flowcell and barcode information.
reorder=f
Reorder clumps for additional compression. Only valid when groups=1, passes=1, and ecc=f. Modes: f (no reorder), c (consensus reads), p (pair information, highest compression), a (auto-choose between c and p).
quantize=f
Bin quality scores like NextSeq. Greatly increases compression but loses information.
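
As an illustration of combining these flags (reorder requires groups=1; file names are placeholders):

clumpify.sh in=reads.fq.gz out=clumped.fq.gz groups=1 reorder shortname=shrink zl=9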

Temp File Parameters

compresstemp=auto
(ct) Gzip temporary files. By default temp files are compressed if output file is compressed.
deletetemp=t
Delete temporary files.
deleteinput=f
Delete input upon successful completion.
usetmpdir=f
Use tmpdir for temp files.
tmpdir=
Temporary directory. Default is environment variable TMPDIR.
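
For example, to direct temporary group files to a local scratch disk (the path is a placeholder):

clumpify.sh in=reads.fq.gz out=clumped.fq.gz groups=16 usetmpdir tmpdir=/scratch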

Hashing Parameters

k=31
Use k-mers of this length (1-31). Shorter k-mers may increase compression, but 31 is recommended for error correction.
mincount=0
Don't use pivot k-mers with count less than this. Setting mincount=2 can increase compression by filtering singleton k-mers. Increases time and memory usage.
seed=1
Random number generator seed for hashing. Set to negative to use a random seed.
hashes=4
Use this many masks when hashing. 0 uses raw k-mers. Often hashes=0 increases compression, but should not be used with error-correction.
border=1
Do not use k-mers within this many bases of read ends.
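
For a compression-only run (no error correction), raw k-mers and a minimum pivot count can squeeze out slightly more compression at some cost in time and memory (file names are placeholders):

clumpify.sh in=reads.fq.gz out=clumped.fq.gz hashes=0 mincount=2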

Deduplication Parameters

dedupe=f
Remove duplicate reads. For pairs, both must match. If dedupe and markduplicates are both false, duplicate-related flags have no effect.
markduplicates=f
Don't remove duplicates; just append ' duplicate' to the name.
allduplicates=f
Mark or remove all copies of duplicates, instead of keeping the highest-quality copy.
addcount=f
Append the number of copies to the read name. Mutually exclusive with markduplicates or allduplicates.
entryfilter=f
Assists in removing exact duplicates, saving memory in libraries with huge numbers of duplicates. Enabled automatically as needed.
subs=2
(s) Maximum substitutions allowed between duplicates.
subrate=0.0
(dsr) If set, substitutions allowed = max(subs, subrate*min(length1, length2)) for 2 sequences.
allowns=t
No-called bases will not be considered substitutions.
scanlimit=5
(scan) Continue for this many reads after encountering a non-duplicate. Improves detection of inexact duplicates.
umi=f
If reads have UMIs in headers, require them to match to consider reads duplicates.
umisubs=0
Consider UMIs as matching if they have up to this many mismatches.
containment=f
Allow containments (where one sequence is shorter).
affix=f
For containments, require one sequence to be a prefix or suffix of the other.
optical=f
Mark or remove optical duplicates only. Requires Illumina-style read headers with flowcell coordinates; duplicates must be within dupedist of each other. Also handles tile-edge and well duplicates.
dupedist=40
(dist) Max distance for optical duplicates. Platform-specific recommendations: NextSeq 40 (with spany=t), HiSeq 1T/2500 40, HiSeq 3k/4k 2500, NovaSeq 6000 12000, NovaSeq X+ 50.
spany=f
Allow optical duplicates on different tiles if within dupedist in y-axis. Enable for tile-edge duplicates (NextSeq).
spanx=f
Like spany, but for x-axis. Not necessary for NextSeq.
spantiles=f
Set both spanx and spany.
adjacent=f
Limit tile-spanning to adjacent tiles (consecutive numbers).

NextSeq Recommendation: Use flags: dedupe optical spany adjacent
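
For example, to remove only exact duplicates from a paired library (file names are placeholders; subs=0 disallows any mismatches):

clumpify.sh in=reads_1.fq.gz in2=reads_2.fq.gz out=deduped_1.fq.gz out2=deduped_2.fq.gz dedupe subs=0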

Pairing/Ordering Parameters

unpair=f
For paired reads, clump all of them rather than just read 1. Destroys pairing. Without this flag, only read 1 is error-corrected for paired reads.
repair=f
After clumping and error-correction, restore pairing. If groups>1, sorts by name which destroys clump ordering; with single group, clumping is retained.
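
For example, to error-correct both reads of an interleaved paired library and restore pairing afterward (the file name is a placeholder):

clumpify.sh in=paired.fq.gz out=corrected.fq.gz ecc unpair repair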

Error-Correction Parameters

ecc=f
Error-correct reads. Requires multiple passes for complete correction.
ecco=f
Error-correct paired reads via overlap before clumping.
passes=1
Use this many error-correction passes. 6 passes are suggested, though more will be more thorough.
conservative=f
Only correct highest-confidence errors, minimizing chances of eliminating minor alleles or inexact repeats.
aggressive=f
Maximize the number of errors corrected.
consensus=f
Output consensus sequence instead of clumps.

Advanced Error-Correction Parameters

mincc=4
(mincountcorrect) Do not correct to alleles occurring less often than this.
minss=4
(minsizesplit) Do not split into new clumps smaller than this.
minsfs=0.17
(minsizefractionsplit) Do not split on pivot alleles in areas with local depth less than this fraction of clump size.
minsfc=0.20
(minsizefractioncorrect) Do not correct in areas with local depth less than this.
minr=30.0
(minratio) Correct to consensus if ratio of consensus allele to second-most-common allele is ≥minr. Actual ratio: min(minr, minro+minorCount*minrm+quality*minrqm).
minro=1.9
(minratiooffset) Base ratio.
minrm=1.8
(minratiomult) Ratio multiplier for secondary allele count.
minrqm=0.08
(minratioqmult) Ratio multiplier for base quality.
minqr=2.8
(minqratio) Do not correct bases when cq*minqr>rqsum.
minaqr=0.70
(minaqratio) Do not correct bases when cq*minaqr>5+rqavg.
minid=0.97
(minidentity) Do not correct reads with identity to consensus less than this.
maxqadjust=0
Adjust quality scores by at most maxqadjust per pass.
maxqi=-1
(maxqualityincorrect) Do not correct bases with quality above this (if positive).
maxci=-1
(maxcountincorrect) Do not correct alleles with count above this (if positive).
findcorrelations=t
Look for correlated SNPs in clumps to split into alleles.
maxcorrelations=12
Maximum number of eligible SNPs per clump to consider for correlations. Increasing reduces false-positive corrections but may decrease speed.
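
For variant-sensitive libraries, a conservative run keeps the thresholds above at their defaults and corrects only the highest-confidence errors (file names are placeholders):

clumpify.sh in=reads.fq.gz out=corrected.fq.gz ecc passes=6 conservative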

Java Parameters

-Xmx
Set Java memory usage, overriding autodetection. -Xmx20g specifies 20GB RAM, -Xmx200m specifies 200MB. Max is typically 85% of physical memory.
-eoom
Exit if out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.
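
For example, to cap the Java heap at 20 GB and exit cleanly on an out-of-memory condition (file names are placeholders):

clumpify.sh -Xmx20g -eoom in=reads.fq.gz out=clumped.fq.gz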

Usage Examples

Basic Clumping with Multi-Group Processing

clumpify.sh in=reads.fq.gz out=clumped.fq.gz groups=16

Groups similar reads using 16 temporary files during processing to manage memory usage for large datasets.

Maximum Compression Workflow

# Step 1: Shorten read names
rename.sh in=reads.fq out=renamed.fq prefix=x

# Step 2: Merge overlapping pairs when possible  
bbmerge.sh in=renamed.fq out=merged.fq mix

# Step 3: Clumpify with maximum compression settings
clumpify.sh in=merged.fq out=clumped.fa.gz zl=9 pigz fastawrap=100000

Optimal compression workflow: strip the read names, merge overlapping pairs, then clumpify to fasta with a very long line wrap and a high compression level. This maximizes compression by removing headers and quality values, which account for most of the space in compressed fastq files.

Paired Read Optimization

# Recommended: Merge pairs first, then clumpify as unpaired
bbmerge.sh in1=reads_1.fq in2=reads_2.fq out=merged.fq outu1=unmerged_1.fq outu2=unmerged_2.fq
cat merged.fq unmerged_1.fq unmerged_2.fq > combined.fq
clumpify.sh in=combined.fq out=clumped.fq

Most effective approach for paired reads: merge overlapping pairs first, concatenate with unmerged reads, then clumpify all as unpaired. This is much more effective than clumping paired reads directly.

Error Correction

clumpify.sh in=reads.fq out=corrected.fq ecc passes=6 aggressive

Error correction using 6 passes with aggressive mode. Multiple passes are needed for complete correction as each pass adapts parameters and improves clustering.

NextSeq Optical Duplicate Removal

clumpify.sh in=reads.fq out=deduped.fq dedupe optical spany adjacent dupedist=40

NextSeq-optimized duplicate removal with tile-edge detection (spany) and adjacency constraints, using the recommended 40-pixel distance threshold.

Memory-Constrained Processing

clumpify.sh in=large.fq out=clumped.fq groups=auto lowcomplexity

Automatic memory management for large files. The lowcomplexity flag provides conservative memory estimates for repetitive data like RNA-seq.

Compression Strategies

Why Clumping Improves Compression

Gzip compression works by identifying repeated patterns and replacing them with pointers to previous occurrences. When similar sequences are grouped together, the compression algorithm can more efficiently identify these patterns, resulting in smaller file sizes.
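
A rough way to see the effect is to compress the same library before and after clumping and compare sizes. This sketch assumes standard gzip and placeholder file names:

clumpify.sh in=reads.fq out=clumped.fq
gzip -k reads.fq clumped.fq
ls -l reads.fq.gz clumped.fq.gz   # the clumped file is typically substantially smaller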

Optimizing for Maximum Compression

The most space-efficient sequence storage combines several strategies:

  1. Shorten or remove read names (rename.sh or shortname=shrink)
  2. Merge overlapping pairs with BBMerge before clumping
  3. Clump with reorder enabled (requires groups=1)
  4. Write fasta output with a long fastawrap if quality values are not needed
  5. Use a high compression level (zl=9 or higher, or a *.bz2 extension)
  6. Optionally quantize quality scores if lossy compression is acceptable
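
For example, if lossy quality binning is acceptable, several of these strategies can be combined in one command (file names are placeholders; the *.bz2 output extension gives roughly 9% additional compression):

clumpify.sh in=reads.fq.gz out=clumped.fq.bz2 groups=1 reorder shortname=shrink quantize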

Algorithm Details

K-mer Based Clustering

Clumpify uses shared k-mers to identify potentially overlapping reads. Reads sharing k-mers are likely to overlap but this is not guaranteed. The clustering process:

  1. Extracts k-mers from each read using canonical representation (lexicographically larger of forward/reverse complement)
  2. Selects pivot k-mers based on frequency thresholds (mincount parameter)
  3. Groups reads sharing the same pivot k-mer into clumps
  4. Optionally reorders clumps for compression optimization

Two-Phase Processing

For memory management, Clumpify can split processing into two phases:

  1. Split phase: reads are hashed, assigned to one of the temporary group files by pivot k-mer, and written to disk
  2. Clump phase: each group is read back into memory, clumped, and streamed to the final output

This approach changes complexity from O(N*log(N)) to O(N*log(N/groups)), approaching O(N) as group count increases.

Memory Estimation

The auto groups feature estimates memory requirements based on file size and sequence complexity, automatically determining the optimal number of temporary files to prevent out-of-memory conditions while maintaining performance.

Performance Considerations

groups=1 is fastest and skips the split phase entirely, but requires enough memory to hold the full dataset. Reordering (reorder) is only available with groups=1, passes=1, and ecc=f. Setting mincount=2 can improve compression but increases time and memory usage, and zip level 11 requires pigz and is extremely slow to compress.

Support

For questions and support: