KeepBestCopy

Script: keepbestcopy.sh Package: jgi Class: KeepBestCopy.java

Discards all but the best copy of a ribosomal gene per TaxID. Gene sequences should be named like this: >tid|123|whatever Sequences are selected based on the number of fully defined bases.

Basic Usage

keepbestcopy.sh in=<input file> out=<output file> rate=<float>

Input may be fasta or fastq, compressed or uncompressed. Sequences are selected based on the number of fully defined bases (non-N characters).

Parameters

Parameters are organized by their function in the keepbestcopy process. The tool expects sequences to be named with TaxID information in the format >tid|123|whatever.

Standard parameters

in=<file>: Input sequences. Can be fasta or fastq format, compressed or uncompressed.
out=<file>: Output sequences. Will contain only the best copy of each ribosomal gene per TaxID.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

Processing parameters

maxlen=1600: Prefer sequences shorter than this length. If both the current sequence and stored sequence are longer than maxlen, the shorter one is preferred. If one is under maxlen and one is over, the shorter one is always preferred.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Usage

keepbestcopy.sh in=ribosomal_genes.fasta out=best_copies.fasta

Process a fasta file containing ribosomal gene sequences with TaxID information, keeping only the best copy per organism.

With Length Preference

keepbestcopy.sh in=ssu_sequences.fq.gz out=filtered_ssu.fasta maxlen=1400 overwrite=t

Process compressed fastq input, preferring sequences under 1400bp, and allow overwriting existing output files.

High Compression Output

keepbestcopy.sh in=input.fasta out=output.fasta.gz ziplevel=9

Use maximum compression for the output file to save disk space.

Algorithm Details

Selection Strategy

KeepBestCopy uses a single-pass algorithm with a LinkedHashMap<Integer, Read> data structure to store the best sequence for each TaxID. The algorithm calls GiToTaxid.parseTaxidNumber(r.id, '|') to extract TaxID numbers from sequence headers using the format >tid|[number]|[anything], where the number represents the taxonomic identifier. The process() method performs O(1) HashMap lookup and replacement operations for each input sequence.

Quality Assessment

Sequence quality is determined by the isBetterThan() method which counts fully defined bases using two specific Read class methods: countNocalls() for the stored sequence and countUndefined() for the new sequence. The algorithm calculates defined bases as sequence length minus N count (def = r.length() - Ns; oldDef = old.length() - oldNs). This implementation prioritizes sequences with fewer ambiguous nucleotide calls.

Length Preference Logic

The tool implements a two-stage comparison algorithm in the isBetterThan() method:

If the stored sequence is longer than maxlen and the new sequence is shorter, the new sequence is preferred
If the new sequence is longer than maxlen and the stored sequence is shorter, the stored sequence is kept
Otherwise, sequences are compared by number of defined bases (higher is better)
If defined base counts are equal, the sequence with fewer N characters is preferred

Memory Efficiency

The algorithm uses a LinkedHashMap<Integer, Read> data structure that maintains insertion order while providing O(1) lookup time. This approach ensures memory usage scales linearly with the number of unique TaxIDs rather than the total number of input sequences, making it efficient for processing large datasets with many duplicate sequences per organism.

Output Generation

After processing all input sequences, the tool outputs the selected sequences in batches of 200 to optimize I/O performance. The output maintains the taxonomic organization while ensuring only the highest-quality representative sequence remains for each TaxID.

Technical Notes

Input Format Requirements

Sequence headers must contain TaxID information in the format >tid|[number]|[description]. Sequences without valid TaxID parsing are automatically discarded from the output.

Performance Characteristics

Time Complexity: O(n) where n is the number of input sequences
Space Complexity: O(t) where t is the number of unique TaxIDs
Memory Usage: Automatically configured to 8GB default, scales with unique TaxID count

Error Handling

The tool includes comprehensive error handling for malformed headers, I/O issues, and memory constraints. Invalid TaxID formats result in sequence exclusion rather than program termination, ensuring robust processing of mixed-quality input data.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org