PickSubset

Basic Usage

picksubset.sh in=<file> out=<file> invalid=<file> files=<number>

Input should be in 3+ column TSV format (first 3 are required):

(query, ref, ANI)
...as produced by CompareSketch when run like this:
comparesketch.sh ata format=3 includeself perfile records=99999 *.fasta

Note: Either files or ani, or both, must be set.

Parameters

Parameters control input/output files, selection criteria, and processing behavior.

Input/Output Parameters

in=: Input file comparing all-to-all comparisons. Should contain tab-separated values with columns: query, ref, ANI, and optionally additional data like QSize, RefSize, QBases, RBases, QTaxID, RTaxID, KID, WKID, SSU.
out=: Output file for the list of files to retain. Contains the names of selected files that form the diverse subset.
invalid=: Output file for the list of files to discard. Contains the names of files that were removed during the selection process.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false.

Selection Criteria

files=0: (nodes, genomes) Number of files to retain in the final subset. The algorithm will iteratively remove files until exactly this number remains. Default: 0 (must be specified).
ani=0: (maxani) Maximum pairwise ANI allowed, expressed as a percent. The algorithm will remove files until all remaining pairwise ANI values are at or below this threshold. Default: 0 (must be specified).

Processing Parameters

lines=: Maximum number of lines to process from the input file. Set to a negative value for unlimited processing. Default: unlimited.
verbose=f: Print verbose messages during processing. Enables detailed logging of file reading and processing steps. Default: false.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically around 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Select 50 Most Diverse Files

picksubset.sh in=all_vs_all.tsv out=selected.txt invalid=discarded.txt files=50

Selects exactly 50 files from the all-to-all comparison that maximize pairwise diversity.

Filter by ANI Threshold

picksubset.sh in=comparisons.tsv out=filtered.txt invalid=removed.txt ani=95

Removes files until all pairwise ANI values are 95% or below.

Combined Criteria

picksubset.sh in=sketches.tsv out=subset.txt files=100 ani=90

Selects up to 100 files while ensuring no pairwise ANI exceeds 90%.

Generate Input with CompareSketch

comparesketch.sh ata format=3 includeself perfile records=99999 *.fasta > all_vs_all.tsv
picksubset.sh in=all_vs_all.tsv out=diverse_set.txt files=20

First generates all-to-all comparisons, then selects 20 most diverse files.

Algorithm Details

PickSubset implements a greedy algorithm to select a maximally diverse subset from pairwise similarity data:

Core Algorithm Strategy

The algorithm uses a node removal approach where files (nodes) are iteratively eliminated based on their connectivity and similarity scores until the desired subset size or ANI threshold is reached.

Data Structures

Node Structure: Each file is represented as a Node containing name, size, base count, sum score, validity status, and list of edges to other nodes
Edge Structure: Pairwise relationships stored as Edge objects with query name, reference name, ANI value, KID score, SSU score, and validity status
HashMap Storage: Uses HashMap<String, Node> for O(1) node lookup by filename

Selection Process

Input Parsing: Reads tab-separated comparison data, expecting columns: query, ref, ANI, and optional additional metrics (QSize, RefSize, QBases, RBases, QTaxID, RTaxID, KID, WKID, SSU)
Graph Construction: Builds internal graph where each file is a node and similarities are edges (self-edges are ignored)
Score Calculation: For each node, calculates sum score as: sum = Σ(KID + ANI×0.0001) across all edges
Edge Sorting: Sorts all edges by ANI (descending), then by KID (descending), then lexicographically
Iterative Removal: For each edge (highest ANI first):
- If remaining nodes ≤ desired count OR edge ANI ≤ desired threshold: stop
- If both nodes are valid, compare their sum scores
- Remove the node with higher sum score (more connected/similar to others)
- Mark removed node and all its edges as invalid
Output Generation: Write valid node names to output file, invalid node names to discard file

Node Comparison Logic

When deciding which of two highly similar nodes to remove, the algorithm uses a scoring system:

Primary criterion: Sum score (higher score = more similar to other nodes = removed first)
Tiebreaker: Lexicographic name comparison for deterministic results

Performance Characteristics

Time Complexity: O(E log E + E×N) where E is edges, N is nodes
Memory Usage: Stores all nodes and edges in memory; approximately 100-200 bytes per comparison
Scalability: Efficient for datasets with millions of pairwise comparisons

Comparison with RepresentativeSet

Unlike representative.sh which uses taxonomic information and centroid selection, PickSubset:

Operates purely on pairwise similarity data without taxonomy
Uses greedy removal rather than centroid-based clustering
Provides deterministic results based on filename and similarity scores
Better suited for datasets without reliable taxonomic classification

Input Format

The input file should contain tab-separated values with the following columns:

Column	Description	Required
1	Query filename	Yes
2	Reference filename	Yes
3	ANI percentage	Yes
4	Query size	Optional
5	Reference size	Optional
6	Query bases	Optional
7	Reference bases	Optional
8	Query TaxID	Optional
9	Reference TaxID	Optional
10	KID score	Optional
11	WKID score	Optional
12	SSU score	Optional

Example format as generated by CompareSketch:

#Query	Ref	ANI	QSize	RefSize	QBases	RBases	QTaxID	RTaxID	KID	WKID	SSU
a.fa	b.fa	80.980	11178684	8473676	12055201	8514625	-1	-1	0.237	0.313	94.233

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org