Clumpify
Rapidly groups overlapping reads into clumps based on shared k-mers. Designed to increase file compression, accelerate overlap-based assembly, and speed up cache-sensitive applications. Works with accurate data (Illumina, Ion Torrent, error-corrected PacBio).
Purpose and Applications
Clumpify clusters reads that share k-mers, which makes them likely (but not guaranteed) to overlap. The primary applications are:
- Enhanced Compression: Gzip is more efficient when similar sequences are nearby, as they can be replaced by pointers to prior copies
- Assembly Acceleration: Overlap-based assemblers benefit from pre-clustered reads
- Cache Optimization: Applications that are cache-sensitive perform better with grouped similar reads
- Error Correction: Consensus sequences can be generated from clumps (currently rudimentary)
Data Requirements and Recommendations
Paired Read Handling
Clumpify supports paired reads but is much more effective when reads are treated as unpaired. For optimal results: merge reads with BBMerge first, concatenate the merged reads with the unmerged pairs, and clump them all together as unpaired reads (see the Paired Read Optimization example below).
Data Quality Requirements
Clumpify is designed for accurate data and will not work well with high-error-rate sequences. Even for Illumina data, quality-trimming or error correction may be beneficial before clumping.
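As one possible pre-processing step (a sketch with illustrative thresholds, not a prescription), reads can be quality-trimmed with BBDuk before clumping:
# Quality-trim both read ends to Q10, then clump (thresholds are illustrative)
bbduk.sh in=reads.fq.gz out=trimmed.fq.gz qtrim=rl trimq=10
clumpify.sh in=trimmed.fq.gz out=clumped.fq.gz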
Memory Management and Processing
Clumpify stores all sequences in memory while clumping but can operate in two phases to handle large datasets:
- KmerSplit: Breaks data into an arbitrary number of temporary files
- KmerSort: Sorts each temporary file independently, then merges results
When multiple groups are used, this approach reduces complexity from O(N*log(N)) to O(N*log(N/groups)), approaching O(N) as the group count increases, and it imposes no hard memory requirement since the number of groups can be raised as needed. When groups=1, the split phase is skipped, which is faster for datasets that fit in memory.
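For example (illustrative invocations using the groups flag described below):
# Large dataset: let Clumpify estimate how many temporary files are needed
clumpify.sh in=big.fq.gz out=clumped.fq.gz groups=auto
# Small dataset: a single group skips the split phase entirely
clumpify.sh in=small.fq.gz out=clumped.fq.gz groups=1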
Basic Usage
clumpify.sh in=<file> out=<file> reorder
Input may be fasta or fastq, compressed or uncompressed; sam format is not accepted.
Parameters
Input/Output Parameters
- in=<file>
- Input file.
- in2=<file>
- Optional input for read 2 of twin paired files.
- out=<file>
- Output file. May not be standard out.
- out2=<file>
- Optional output for read 2 of twin paired files.
- groups=auto
- Use this many intermediate files to save memory. 1 group is fastest. Auto estimates the number of groups needed based on file size to prevent out-of-memory conditions.
- lowcomplexity=f
- For compressed low-complexity libraries such as RNA-seq, use more conservative memory estimates to automatically decide the number of groups.
- rcomp=f
- Give read clumps the same orientation to increase compression. Should be disabled for paired reads.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- qin=auto
- Auto-detect input quality encoding. May be set to 33 (ASCII-33/Sanger) or 64 (ASCII-64/old Illumina). All modern sequencing data uses ASCII-33.
- qout=auto
- Use input quality encoding as output quality encoding.
- changequality=f
- (cq) Fix broken quality scores such as Ns with Q>0. Default false ensures lossless compression.
- fastawrap=70
- Set to a higher number like 4000 for longer lines in fasta format, which increases compression.
Compression Parameters
- ziplevel=6
- (zl) Gzip compression level (1-11). Higher is slower. Level 11 requires pigz and is extremely slow to compress, though the output decompresses faster. Use a *.bz2 output extension for ~9% additional compression.
- blocksize=128
- Size of blocks for pigz, in kb. Higher gives slightly better compression.
- shortname=f
- Make the names as short as possible. 'shortname=shrink' shortens names while retaining flowcell and barcode information.
- reorder=f
- Reorder clumps for additional compression. Only valid when groups=1, passes=1, and ecc=f. Modes: f (no reorder), c (consensus reads), p (pair information, highest compression), a (auto-choose between c and p).
- quantize=f
- Bin quality scores like NextSeq. Greatly increases compression but loses information.
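A sketch combining these compression flags (settings are illustrative; per the reorder note above, reorder requires groups=1, passes=1, and ecc=f):
# Higher zip level, shrunken names, and clump reordering for better compression
clumpify.sh in=reads.fq.gz out=clumped.fq.gz zl=9 shortname=shrink reorder groups=1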
Temp File Parameters
- compresstemp=auto
- (ct) Gzip temporary files. By default temp files are compressed if output file is compressed.
- deletetemp=t
- Delete temporary files.
- deleteinput=f
- Delete input upon successful completion.
- usetmpdir=f
- Use tmpdir for temp files.
- tmpdir=
- Temporary directory. Default is environment variable TMPDIR.
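For example, temporary files can be directed to a scratch filesystem (the /scratch path is hypothetical):
# Write compressed temp files to a scratch directory
clumpify.sh in=reads.fq.gz out=clumped.fq.gz groups=auto usetmpdir tmpdir=/scratch compresstemp=t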
Hashing Parameters
- k=31
- Use k-mers of this length (1-31). Shorter k-mers may increase compression, but 31 is recommended for error correction.
- mincount=0
- Don't use pivot k-mers with count less than this. Setting mincount=2 can increase compression by filtering singleton k-mers. Increases time and memory usage.
- seed=1
- Random number generator seed for hashing. Set to negative to use a random seed.
- hashes=4
- Use this many masks when hashing. 0 uses raw k-mers. Often hashes=0 increases compression, but should not be used with error-correction.
- border=1
- Do not use k-mers within this many bases of read ends.
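For compression-oriented runs, these hashing parameters might be combined as follows (an illustrative sketch; per the notes above, hashes=0 should not be combined with error correction, and k=19 is just one choice of shorter k-mer):
# Shorter k-mers, raw k-mers as pivots, and singleton filtering
clumpify.sh in=reads.fq.gz out=clumped.fq.gz k=19 hashes=0 mincount=2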
Deduplication Parameters
- dedupe=f
- Remove duplicate reads. For pairs, both must match. If dedupe and markduplicates are both false, duplicate-related flags have no effect.
- markduplicates=f
- Don't remove duplicates; just append ' duplicate' to the name.
- allduplicates=f
- Mark or remove all copies of duplicates, instead of keeping the highest-quality copy.
- addcount=f
- Append the number of copies to the read name. Mutually exclusive with markduplicates or allduplicates.
- entryfilter=f
- Assists in removing exact duplicates, saving memory in libraries with huge numbers of duplicates. Enabled automatically as needed.
- subs=2
- (s) Maximum substitutions allowed between duplicates.
- subrate=0.0
- (dsr) If set, substitutions allowed = max(subs, subrate*min(length1, length2)) for 2 sequences.
- allowns=t
- No-called bases will not be considered substitutions.
- scanlimit=5
- (scan) Continue for this many reads after encountering a non-duplicate. Improves detection of inexact duplicates.
- umi=f
- If reads have UMIs in headers, require them to match to consider reads duplicates.
- umisubs=0
- Consider UMIs as matching if they have up to this many mismatches.
- containment=f
- Allow containments (where one sequence is shorter).
- affix=f
- For containments, require one sequence to be a prefix or suffix of the other.
- optical=f
- Mark or remove optical duplicates only. Requires Illumina reads; duplicates must be within dupedist of each other on the flowcell. Also handles tile-edge and well duplicates.
- dupedist=40
- (dist) Max distance for optical duplicates. Platform-specific recommendations: NextSeq 40 (with spany=t); HiSeq 1T/2500 40; HiSeq 3k/4k 2500; NovaSeq 6000 12000; NovaSeq X+ 50.
- spany=f
- Allow optical duplicates on different tiles if within dupedist in y-axis. Enable for tile-edge duplicates (NextSeq).
- spanx=f
- Like spany, but for x-axis. Not necessary for NextSeq.
- spantiles=f
- Set both spanx and spany.
- adjacent=f
- Limit tile-spanning to adjacent tiles (consecutive numbers).
NextSeq Recommendation: Use flags: dedupe optical spany adjacent
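By the same logic, a NovaSeq 6000 run would use the larger distance from the recommendations above (an illustrative command):
clumpify.sh in=reads.fq.gz out=deduped.fq.gz dedupe optical dupedist=12000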
Pairing/Ordering Parameters
- unpair=f
- For paired reads, clump all of them rather than just read 1. Destroys pairing. Without this flag, only read 1 is error-corrected for paired reads.
- repair=f
- After clumping and error-correction, restore pairing. If groups>1, this sorts reads by name, which destroys clump ordering; with a single group, clump order is retained.
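A sketch of error-correcting both reads of a pair while preserving pairing (file names are hypothetical; input is interleaved):
# Clump and correct both reads, then restore pairing; groups=1 retains clump order
clumpify.sh in=paired.fq.gz out=corrected.fq.gz ecc passes=6 unpair repair groups=1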
Error-Correction Parameters
- ecc=f
- Error-correct reads. Requires multiple passes for complete correction.
- ecco=f
- Error-correct paired reads via overlap before clumping.
- passes=1
- Use this many error-correction passes. Six passes are suggested; more passes are more thorough.
- conservative=f
- Only correct highest-confidence errors, minimizing chances of eliminating minor alleles or inexact repeats.
- aggressive=f
- Maximize the number of errors corrected.
- consensus=f
- Output consensus sequence instead of clumps.
Advanced Error-Correction Parameters
- mincc=4
- (mincountcorrect) Do not correct to alleles occurring less often than this.
- minss=4
- (minsizesplit) Do not split into new clumps smaller than this.
- minsfs=0.17
- (minsizefractionsplit) Do not split on pivot alleles in areas with local depth less than this fraction of clump size.
- minsfc=0.20
- (minsizefractioncorrect) Do not correct in areas with local depth less than this fraction of clump size.
- minr=30.0
- (minratio) Correct to the consensus if the ratio of the consensus allele count to the second-most-common allele count is ≥minr. The actual ratio used is min(minr, minro+minorCount*minrm+quality*minrqm); see the worked example after this list.
- minro=1.9
- (minratiooffset) Base ratio.
- minrm=1.8
- (minratiomult) Ratio multiplier for secondary allele count.
- minrqm=0.08
- (minratioqmult) Ratio multiplier for base quality.
- minqr=2.8
- (minqratio) Do not correct bases when cq*minqr>rqsum.
- minaqr=0.70
- (minaqratio) Do not correct bases when cq*minaqr>5+rqavg.
- minid=0.97
- (minidentity) Do not correct reads with identity to consensus less than this.
- maxqadjust=0
- Adjust quality scores by at most maxqadjust per pass.
- maxqi=-1
- (maxqualityincorrect) Do not correct bases with quality above this (if positive).
- maxci=-1
- (maxcountincorrect) Do not correct alleles with count above this (if positive).
- findcorrelations=t
- Look for correlated SNPs in clumps to split into alleles.
- maxcorrelations=12
- Maximum number of eligible SNPs per clump to consider for correlations. Increasing reduces false-positive corrections but may decrease speed.
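To illustrate the minr formula with the defaults above (the counts and quality are hypothetical): for a base of quality 30 where the second-most-common allele occurs twice, the required ratio is min(30.0, 1.9 + 2*1.8 + 30*0.08) = min(30.0, 7.9) = 7.9, so the consensus allele must be at least 7.9 times as common as the secondary allele for the base to be corrected.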
Java Parameters
- -Xmx
- Set Java memory usage, overriding autodetection. -Xmx20g specifies 20GB RAM, -Xmx200m specifies 200MB. Max is typically 85% of physical memory.
- -eoom
- Exit if out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
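For example (an illustrative invocation):
# Cap the Java heap at 20 GB and exit cleanly on out-of-memory
clumpify.sh -Xmx20g -eoom in=reads.fq.gz out=clumped.fq.gz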
Usage Examples
Basic Clumping with Multi-Group Processing
clumpify.sh in=reads.fq.gz out=clumped.fq.gz groups=16
Groups similar reads using 16 temporary files during processing to manage memory usage for large datasets.
Maximum Compression Workflow
# Step 1: Shorten read names
rename.sh in=reads.fq out=renamed.fq prefix=x
# Step 2: Merge overlapping pairs when possible
bbmerge.sh in=renamed.fq out=merged.fq mix
# Step 3: Clumpify with maximum compression settings
clumpify.sh in=merged.fq out=clumped.fa.gz zl=9 pigz fastawrap=100000
Optimal compression workflow: strip names, merge reads, then clumpify to fasta format with a very long line wrap and a high compression level. This maximizes compression by removing headers and quality values, which take up most of the space in compressed files.
Paired Read Optimization
# Recommended: Merge pairs first, then clumpify as unpaired
bbmerge.sh in1=reads_1.fq in2=reads_2.fq out=merged.fq outu1=unmerged_1.fq outu2=unmerged_2.fq
cat merged.fq unmerged_1.fq unmerged_2.fq > combined.fq
clumpify.sh in=combined.fq out=clumped.fq
Most effective approach for paired reads: merge overlapping pairs first, concatenate with unmerged reads, then clumpify all as unpaired. This is much more effective than clumping paired reads directly.
Error Correction
clumpify.sh in=reads.fq out=corrected.fq ecc passes=6 aggressive
Error correction using 6 passes in aggressive mode. Multiple passes are needed for complete correction, as each pass adapts parameters and improves clustering.
NextSeq Optical Duplicate Removal
clumpify.sh in=reads.fq out=deduped.fq dedupe optical spany adjacent dupedist=40
NextSeq-optimized duplicate removal with tile-edge detection (spany) and adjacency constraints, using the recommended 40-pixel distance threshold.
Memory-Constrained Processing
clumpify.sh in=large.fq out=clumped.fq groups=auto lowcomplexity
Automatic memory management for large files. The lowcomplexity flag provides conservative memory estimates for repetitive data like RNA-seq.
Compression Strategies
Why Clumping Improves Compression
Gzip compression works by identifying repeated patterns and replacing them with pointers to previous occurrences. When similar sequences are grouped together, the compression algorithm can more efficiently identify these patterns, resulting in smaller file sizes.
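The effect is easy to measure on real data (an illustrative check; file names are hypothetical):
# Compare compressed sizes before and after clumping
clumpify.sh in=reads.fq.gz out=clumped.fq.gz reorder
ls -l reads.fq.gz clumped.fq.gz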
Optimizing for Maximum Compression
The most space-efficient sequence storage combines several strategies:
- Fasta format: Removes quality values which consume significant space
- Unlimited line wrap: fastawrap=100000 eliminates line breaks within sequences
- Short headers: Use rename.sh with prefix=x to create minimal identifiers
- Maximum zip level: zl=9 trades CPU time for better compression
- Read merging: BBMerge reduces redundancy in overlapping paired reads
- Clump reordering: reorder=p uses pair information for optimal arrangement
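Putting these together (a sketch; per the parameter list above, reorder modes require groups=1, passes=1, and ecc=f, and reorder=a auto-chooses between the consensus and pair modes):
# Fasta output with long line wrap, high zip level, and auto-chosen reorder mode
clumpify.sh in=reads.fq.gz out=clumped.fa.gz zl=9 fastawrap=100000 reorder=a groups=1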
Algorithm Details
K-mer Based Clustering
Clumpify uses shared k-mers to identify potentially overlapping reads. Reads sharing k-mers are likely to overlap but this is not guaranteed. The clustering process:
- Extracts k-mers from each read using canonical representation (lexicographically larger of forward/reverse complement)
- Selects pivot k-mers based on frequency thresholds (mincount parameter)
- Groups reads sharing the same pivot k-mer into clumps
- Optionally reorders clumps for compression optimization
Two-Phase Processing
For memory management, Clumpify can split processing into two phases:
- KmerSplit: Distributes reads across multiple temporary files based on k-mer hash values
- KmerSort: Independently processes each temporary file, then merges results
This approach changes complexity from O(N*log(N)) to O(N*log(N/groups)), approaching O(N) as group count increases.
Memory Estimation
The auto groups feature estimates memory requirements based on file size and sequence complexity, automatically determining the optimal number of temporary files to prevent out-of-memory conditions while maintaining performance.
Performance Considerations
- Single group (groups=1): Fastest processing but requires all data in memory
- Multiple groups: Lower memory usage but increased I/O overhead
- Compression vs speed: Higher zip levels trade CPU time for better compression
- Reordering cost: Reorder modes add processing time but can significantly improve compression
- K-mer length: Shorter k-mers may improve compression but reduce specificity
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Guide: bbmap/docs/guides/ClumpifyGuide.txt