KeepBestCopy
Discards all but the best copy of a ribosomal gene per TaxID. Gene sequences should be named like this: >tid|123|whatever Sequences are selected based on the number of fully defined bases.
Basic Usage
keepbestcopy.sh in=<input file> out=<output file> rate=<float>
Input may be fasta or fastq, compressed or uncompressed. Sequences are selected based on the number of fully defined bases (non-N characters).
Parameters
Parameters are organized by their function in the keepbestcopy process. The tool expects sequences to be named with TaxID information in the format >tid|123|whatever.
Standard parameters
- in=<file>
- Input sequences. Can be fasta or fastq format, compressed or uncompressed.
- out=<file>
- Output sequences. Will contain only the best copy of each ribosomal gene per TaxID.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
Processing parameters
- maxlen=1600
- Prefer sequences shorter than this length. If both the current sequence and stored sequence are longer than maxlen, the shorter one is preferred. If one is under maxlen and one is over, the shorter one is always preferred.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Usage
keepbestcopy.sh in=ribosomal_genes.fasta out=best_copies.fasta
Process a fasta file containing ribosomal gene sequences with TaxID information, keeping only the best copy per organism.
With Length Preference
keepbestcopy.sh in=ssu_sequences.fq.gz out=filtered_ssu.fasta maxlen=1400 overwrite=t
Process compressed fastq input, preferring sequences under 1400bp, and allow overwriting existing output files.
High Compression Output
keepbestcopy.sh in=input.fasta out=output.fasta.gz ziplevel=9
Use maximum compression for the output file to save disk space.
Algorithm Details
Selection Strategy
KeepBestCopy uses a single-pass algorithm with a LinkedHashMap<Integer, Read> data structure to store the best sequence for each TaxID. The algorithm calls GiToTaxid.parseTaxidNumber(r.id, '|') to extract TaxID numbers from sequence headers using the format >tid|[number]|[anything], where the number represents the taxonomic identifier. The process() method performs O(1) HashMap lookup and replacement operations for each input sequence.
Quality Assessment
Sequence quality is determined by the isBetterThan() method which counts fully defined bases using two specific Read class methods: countNocalls() for the stored sequence and countUndefined() for the new sequence. The algorithm calculates defined bases as sequence length minus N count (def = r.length() - Ns; oldDef = old.length() - oldNs). This implementation prioritizes sequences with fewer ambiguous nucleotide calls.
Length Preference Logic
The tool implements a two-stage comparison algorithm in the isBetterThan() method:
- If the stored sequence is longer than maxlen and the new sequence is shorter, the new sequence is preferred
- If the new sequence is longer than maxlen and the stored sequence is shorter, the stored sequence is kept
- Otherwise, sequences are compared by number of defined bases (higher is better)
- If defined base counts are equal, the sequence with fewer N characters is preferred
Memory Efficiency
The algorithm uses a LinkedHashMap<Integer, Read> data structure that maintains insertion order while providing O(1) lookup time. This approach ensures memory usage scales linearly with the number of unique TaxIDs rather than the total number of input sequences, making it efficient for processing large datasets with many duplicate sequences per organism.
Output Generation
After processing all input sequences, the tool outputs the selected sequences in batches of 200 to optimize I/O performance. The output maintains the taxonomic organization while ensuring only the highest-quality representative sequence remains for each TaxID.
Technical Notes
Input Format Requirements
Sequence headers must contain TaxID information in the format >tid|[number]|[description]. Sequences without valid TaxID parsing are automatically discarded from the output.
Performance Characteristics
- Time Complexity: O(n) where n is the number of input sequences
- Space Complexity: O(t) where t is the number of unique TaxIDs
- Memory Usage: Automatically configured to 8GB default, scales with unique TaxID count
Error Handling
The tool includes comprehensive error handling for malformed headers, I/O issues, and memory constraints. Invalid TaxID formats result in sequence exclusion rather than program termination, ensuring robust processing of mixed-quality input data.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org