PickSubset
Selects a subset of files from an all-to-all identity comparison. The subset will contain exactly X files with maximal pairwise ANI, or all files with at most Y pairwise identity. This program is similar to representative.sh but does not use taxonomy.
Basic Usage
picksubset.sh in=<file> out=<file> invalid=<file> files=<number>
Input should be in 3+ column TSV format (first 3 are required):
(query, ref, ANI)
...as produced by CompareSketch when run like this:
comparesketch.sh ata format=3 includeself perfile records=99999 *.fasta
Note: Either files
or ani
, or both, must be set.
Parameters
Parameters control input/output files, selection criteria, and processing behavior.
Input/Output Parameters
- in=
- Input file comparing all-to-all comparisons. Should contain tab-separated values with columns: query, ref, ANI, and optionally additional data like QSize, RefSize, QBases, RBases, QTaxID, RTaxID, KID, WKID, SSU.
- out=
- Output file for the list of files to retain. Contains the names of selected files that form the diverse subset.
- invalid=
- Output file for the list of files to discard. Contains the names of files that were removed during the selection process.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false.
Selection Criteria
- files=0
- (nodes, genomes) Number of files to retain in the final subset. The algorithm will iteratively remove files until exactly this number remains. Default: 0 (must be specified).
- ani=0
- (maxani) Maximum pairwise ANI allowed, expressed as a percent. The algorithm will remove files until all remaining pairwise ANI values are at or below this threshold. Default: 0 (must be specified).
Processing Parameters
- lines=
- Maximum number of lines to process from the input file. Set to a negative value for unlimited processing. Default: unlimited.
- verbose=f
- Print verbose messages during processing. Enables detailed logging of file reading and processing steps. Default: false.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically around 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Select 50 Most Diverse Files
picksubset.sh in=all_vs_all.tsv out=selected.txt invalid=discarded.txt files=50
Selects exactly 50 files from the all-to-all comparison that maximize pairwise diversity.
Filter by ANI Threshold
picksubset.sh in=comparisons.tsv out=filtered.txt invalid=removed.txt ani=95
Removes files until all pairwise ANI values are 95% or below.
Combined Criteria
picksubset.sh in=sketches.tsv out=subset.txt files=100 ani=90
Selects up to 100 files while ensuring no pairwise ANI exceeds 90%.
Generate Input with CompareSketch
comparesketch.sh ata format=3 includeself perfile records=99999 *.fasta > all_vs_all.tsv
picksubset.sh in=all_vs_all.tsv out=diverse_set.txt files=20
First generates all-to-all comparisons, then selects 20 most diverse files.
Algorithm Details
PickSubset implements a greedy algorithm to select a maximally diverse subset from pairwise similarity data:
Core Algorithm Strategy
The algorithm uses a node removal approach where files (nodes) are iteratively eliminated based on their connectivity and similarity scores until the desired subset size or ANI threshold is reached.
Data Structures
- Node Structure: Each file is represented as a Node containing name, size, base count, sum score, validity status, and list of edges to other nodes
- Edge Structure: Pairwise relationships stored as Edge objects with query name, reference name, ANI value, KID score, SSU score, and validity status
- HashMap Storage: Uses HashMap<String, Node> for O(1) node lookup by filename
Selection Process
- Input Parsing: Reads tab-separated comparison data, expecting columns: query, ref, ANI, and optional additional metrics (QSize, RefSize, QBases, RBases, QTaxID, RTaxID, KID, WKID, SSU)
- Graph Construction: Builds internal graph where each file is a node and similarities are edges (self-edges are ignored)
- Score Calculation: For each node, calculates sum score as: sum = Σ(KID + ANI×0.0001) across all edges
- Edge Sorting: Sorts all edges by ANI (descending), then by KID (descending), then lexicographically
- Iterative Removal: For each edge (highest ANI first):
- If remaining nodes ≤ desired count OR edge ANI ≤ desired threshold: stop
- If both nodes are valid, compare their sum scores
- Remove the node with higher sum score (more connected/similar to others)
- Mark removed node and all its edges as invalid
- Output Generation: Write valid node names to output file, invalid node names to discard file
Node Comparison Logic
When deciding which of two highly similar nodes to remove, the algorithm uses a scoring system:
- Primary criterion: Sum score (higher score = more similar to other nodes = removed first)
- Tiebreaker: Lexicographic name comparison for deterministic results
Performance Characteristics
- Time Complexity: O(E log E + E×N) where E is edges, N is nodes
- Memory Usage: Stores all nodes and edges in memory; approximately 100-200 bytes per comparison
- Scalability: Efficient for datasets with millions of pairwise comparisons
Comparison with RepresentativeSet
Unlike representative.sh which uses taxonomic information and centroid selection, PickSubset:
- Operates purely on pairwise similarity data without taxonomy
- Uses greedy removal rather than centroid-based clustering
- Provides deterministic results based on filename and similarity scores
- Better suited for datasets without reliable taxonomic classification
Input Format
The input file should contain tab-separated values with the following columns:
Column | Description | Required |
---|---|---|
1 | Query filename | Yes |
2 | Reference filename | Yes |
3 | ANI percentage | Yes |
4 | Query size | Optional |
5 | Reference size | Optional |
6 | Query bases | Optional |
7 | Reference bases | Optional |
8 | Query TaxID | Optional |
9 | Reference TaxID | Optional |
10 | KID score | Optional |
11 | WKID score | Optional |
12 | SSU score | Optional |
Example format as generated by CompareSketch:
#Query Ref ANI QSize RefSize QBases RBases QTaxID RTaxID KID WKID SSU
a.fa b.fa 80.980 11178684 8473676 12055201 8514625 -1 -1 0.237 0.313 94.233
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org