PickSubset

Script: picksubset.sh Package: jgi Class: PickSubset.java

Selects a subset of files from an all-to-all identity comparison. The subset will contain exactly X files with maximal pairwise ANI, or all files with at most Y pairwise identity. This program is similar to representative.sh but does not use taxonomy.

Basic Usage

picksubset.sh in=<file> out=<file> invalid=<file> files=<number>

Input should be in 3+ column TSV format (first 3 are required):

(query, ref, ANI)
...as produced by CompareSketch when run like this:
comparesketch.sh ata format=3 includeself perfile records=99999 *.fasta

Note: Either files or ani, or both, must be set.

Parameters

Parameters control input/output files, selection criteria, and processing behavior.

Input/Output Parameters

in=
Input file comparing all-to-all comparisons. Should contain tab-separated values with columns: query, ref, ANI, and optionally additional data like QSize, RefSize, QBases, RBases, QTaxID, RTaxID, KID, WKID, SSU.
out=
Output file for the list of files to retain. Contains the names of selected files that form the diverse subset.
invalid=
Output file for the list of files to discard. Contains the names of files that were removed during the selection process.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false.

Selection Criteria

files=0
(nodes, genomes) Number of files to retain in the final subset. The algorithm will iteratively remove files until exactly this number remains. Default: 0 (must be specified).
ani=0
(maxani) Maximum pairwise ANI allowed, expressed as a percent. The algorithm will remove files until all remaining pairwise ANI values are at or below this threshold. Default: 0 (must be specified).

Processing Parameters

lines=
Maximum number of lines to process from the input file. Set to a negative value for unlimited processing. Default: unlimited.
verbose=f
Print verbose messages during processing. Enables detailed logging of file reading and processing steps. Default: false.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically around 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Select 50 Most Diverse Files

picksubset.sh in=all_vs_all.tsv out=selected.txt invalid=discarded.txt files=50

Selects exactly 50 files from the all-to-all comparison that maximize pairwise diversity.

Filter by ANI Threshold

picksubset.sh in=comparisons.tsv out=filtered.txt invalid=removed.txt ani=95

Removes files until all pairwise ANI values are 95% or below.

Combined Criteria

picksubset.sh in=sketches.tsv out=subset.txt files=100 ani=90

Selects up to 100 files while ensuring no pairwise ANI exceeds 90%.

Generate Input with CompareSketch

comparesketch.sh ata format=3 includeself perfile records=99999 *.fasta > all_vs_all.tsv
picksubset.sh in=all_vs_all.tsv out=diverse_set.txt files=20

First generates all-to-all comparisons, then selects 20 most diverse files.

Algorithm Details

PickSubset implements a greedy algorithm to select a maximally diverse subset from pairwise similarity data:

Core Algorithm Strategy

The algorithm uses a node removal approach where files (nodes) are iteratively eliminated based on their connectivity and similarity scores until the desired subset size or ANI threshold is reached.

Data Structures

Selection Process

  1. Input Parsing: Reads tab-separated comparison data, expecting columns: query, ref, ANI, and optional additional metrics (QSize, RefSize, QBases, RBases, QTaxID, RTaxID, KID, WKID, SSU)
  2. Graph Construction: Builds internal graph where each file is a node and similarities are edges (self-edges are ignored)
  3. Score Calculation: For each node, calculates sum score as: sum = Σ(KID + ANI×0.0001) across all edges
  4. Edge Sorting: Sorts all edges by ANI (descending), then by KID (descending), then lexicographically
  5. Iterative Removal: For each edge (highest ANI first):
    • If remaining nodes ≤ desired count OR edge ANI ≤ desired threshold: stop
    • If both nodes are valid, compare their sum scores
    • Remove the node with higher sum score (more connected/similar to others)
    • Mark removed node and all its edges as invalid
  6. Output Generation: Write valid node names to output file, invalid node names to discard file

Node Comparison Logic

When deciding which of two highly similar nodes to remove, the algorithm uses a scoring system:

Performance Characteristics

Comparison with RepresentativeSet

Unlike representative.sh which uses taxonomic information and centroid selection, PickSubset:

Input Format

The input file should contain tab-separated values with the following columns:

ColumnDescriptionRequired
1Query filenameYes
2Reference filenameYes
3ANI percentageYes
4Query sizeOptional
5Reference sizeOptional
6Query basesOptional
7Reference basesOptional
8Query TaxIDOptional
9Reference TaxIDOptional
10KID scoreOptional
11WKID scoreOptional
12SSU scoreOptional

Example format as generated by CompareSketch:

#Query	Ref	ANI	QSize	RefSize	QBases	RBases	QTaxID	RTaxID	KID	WKID	SSU
a.fa	b.fa	80.980	11178684	8473676	12055201	8514625	-1	-1	0.237	0.313	94.233

Support

For questions and support: