Representative

Script: representative.sh Package: jgi Class: RepresentativeSet.java

Builds a representative set of taxa from an all-to-all identity comparison. Input must be a TSV file with at least three columns, in the order (query, ref, ANI, qsize, rsize, qbases, rbases), as produced by CompareSketch with format=3 and usetaxidname. Only the first three columns are required; additional columns are allowed and will be ignored.
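
To make the column layout concrete, here is a minimal Java sketch of reading one input row. It is not the tool's actual parser: the class EdgeRecord, its field names, the use of -1 for absent optional columns, and the skipping of lines that start with '#' are all assumptions made for illustration. It also assumes that query and ref are numeric taxIDs.

```java
/** Hypothetical holder for one row of the all-to-all TSV; not the real Edge class. */
final class EdgeRecord {
    final long query;  // column 1: query taxID (assumed numeric)
    final long ref;    // column 2: reference taxID (assumed numeric)
    final double ani;  // column 3: identity value
    final long qsize;  // column 4, or -1 when the column is absent (assumption)
    final long rsize;  // column 5, or -1 when the column is absent (assumption)

    EdgeRecord(long query, long ref, double ani, long qsize, long rsize) {
        this.query = query;
        this.ref = ref;
        this.ani = ani;
        this.qsize = qsize;
        this.rsize = rsize;
    }

    /** Parses one tab-delimited line; returns null for blank lines and '#' lines. */
    static EdgeRecord parse(String line) {
        if (line == null || line.isEmpty() || line.charAt(0) == '#') {
            return null;
        }
        String[] f = line.split("\t");
        if (f.length < 3) {
            throw new IllegalArgumentException("Expected at least 3 columns: " + line);
        }
        long query = Long.parseLong(f[0].trim());
        long ref = Long.parseLong(f[1].trim());
        double ani = Double.parseDouble(f[2].trim());
        // Optional columns: read them when present, ignore anything beyond them.
        long qsize = f.length > 3 ? Long.parseLong(f[3].trim()) : -1L;
        long rsize = f.length > 4 ? Long.parseLong(f[4].trim()) : -1L;
        return new EdgeRecord(query, ref, ani, qsize, rsize);
    }

    public static void main(String[] args) {
        EdgeRecord e = parse("562\t511145\t0.97\t4000\t4200");
        System.out.println(e.query + " -> " + e.ref + " ANI=" + e.ani);
    }
}
```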

Basic Usage

representative.sh in=<input file> out=<output file>

Creates a minimal representative set by retaining nodes such that all original nodes are within a minimum distance of at least one representative node. Singleton nodes will only be included if they are represented by a self-edge.

Parameters

Parameters are grouped by their function in the representative set generation process.

overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file.
threshold=0
Ignore edges under threshold value. This also affects the choice of centroids; a high threshold gives more weight to higher-value edges. Used in the node scoring algorithm to filter low-quality connections.
minratio=0
Ignores edges with a ratio below this value. The ratio is calculated as max(sizeA, sizeB) / min(sizeA, sizeB) to ensure size compatibility between nodes.
invertratio=f
Invert the ratio when greater than 1. When enabled, ratios > 1 are converted to 1/ratio, ensuring all ratios are ≤ 1 for consistent filtering.
printheader=t
Print a header line in the output. The header includes "#Representative" and optionally "Size", "NodeCount", and "Nodes" depending on other print flags.
printsize=t
Print the size of retained nodes. Shows the estimated size in unique kmers for each representative node.
printclusters=t
Print the nodes subsumed by each retained node. Outputs the count and comma-separated list of all nodes represented by each cluster centroid.
minsize=0
Ignore nodes under this size (in unique kmers). Nodes below this threshold are excluded from processing entirely.
maxsize=0
If positive, ignore nodes over this size (unique kmers). Large nodes above this threshold are excluded from processing.
minbases=0
Ignore nodes under this size (in total bases). Alternative size filtering based on total sequence length rather than unique kmers.
maxbases=0
If positive, ignore nodes over this size (total bases). Upper limit for sequence length-based filtering.
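
The edge and node filters described above can be read as simple predicates. The following Java sketch shows one reading of threshold, minratio, invertratio, and the min/max size limits. The class EdgeFilters and its method names are hypothetical, and the ratio is passed in as a precomputed value rather than derived here; the real implementation may differ in detail.

```java
/** Hypothetical helpers mirroring the filter parameters; not code from RepresentativeSet.java. */
final class EdgeFilters {

    /** invertratio: when enabled, a ratio above 1 becomes 1/ratio. */
    static double normalizeRatio(double ratio, boolean invertRatio) {
        return (invertRatio && ratio > 1.0) ? 1.0 / ratio : ratio;
    }

    /** An edge is kept when its value is not under threshold and its ratio is not under minratio. */
    static boolean keepEdge(double value, double ratio, double threshold,
                            double minRatio, boolean invertRatio) {
        if (value < threshold) {
            return false;
        }
        return normalizeRatio(ratio, invertRatio) >= minRatio;
    }

    /** A node is kept when size >= min and, if max is positive, size <= max. */
    static boolean keepNode(long size, long min, long max) {
        if (size < min) {
            return false;
        }
        return max <= 0 || size <= max;
    }

    public static void main(String[] args) {
        System.out.println(keepEdge(0.96, 2.0, 0.95, 0.5, true));
        System.out.println(keepNode(999, 1000, 0));
    }
}
```

The same keepNode predicate covers both size measures: pass unique-kmer counts for minsize/maxsize, or total bases for minbases/maxbases.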

Taxonomy parameters

level=
Taxonomic level, such as phylum. Filtering will operate on sequences within the same taxonomic level as specified ids. If not set, only matches to a node or its descendants will be considered.
ids=
Comma-delimited list of NCBI numeric IDs. Can also be a file with one taxID per line. Used for taxonomic filtering of input sequences.
names=
Alternatively, a list of names (such as 'Homo sapiens'). Note that spaces need special handling. Provides name-based taxonomic filtering as an alternative to numeric IDs.
include=f
'f' will discard filtered sequences, 't' will keep them. Determines whether taxonomically filtered sequences are included or excluded from the representative set.
tree=<file>
Specify a TaxTree file like tree.taxtree.gz. On Genepool, use 'auto'. Required for taxonomic filtering functionality.
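
The phrase "matches to a node or its descendants" can be illustrated with a walk up a parent map. This Java sketch is only an illustration: TaxFilterSketch, the child-to-parent map, and the toy taxIDs in the usage note are hypothetical, and the real TaxTree class is not used. The keep method encodes one interpretation of include, namely that 't' retains entries matching the filter and 'f' discards them.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of taxonomic matching; does not use the real TaxTree. */
final class TaxFilterSketch {

    /**
     * True when taxId is one of the targets or descends from one.
     * parent maps each child taxID to its parent; the root maps to itself or is absent.
     */
    static boolean matches(int taxId, Set<Integer> targets, Map<Integer, Integer> parent) {
        Integer current = taxId;
        int steps = 0;
        // The step limit guards against cycles in a malformed map.
        while (current != null && steps++ < 1000) {
            if (targets.contains(current)) {
                return true;
            }
            Integer next = parent.get(current);
            if (next == null || next.equals(current)) {
                return false; // reached the root without a match
            }
            current = next;
        }
        return false;
    }

    /** include=t keeps matching entries; include=f keeps only non-matching ones. */
    static boolean keep(boolean matched, boolean include) {
        return matched == include;
    }
}
```

For example, with a toy tree in which 30 is a child of 20 and 20 is a child of 10, a filter on ID 20 matches both 20 and 30 but not 10.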

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically around 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Representative Set Generation

representative.sh in=comparisons.tsv out=representatives.txt

Creates a representative set from all-to-all comparison data, outputting the minimal set of representatives.

Filtered Representative Set with Thresholds

representative.sh in=comparisons.tsv out=representatives.txt threshold=0.95 minratio=0.8 minsize=1000

Applies strict filtering: keeps only edges with ≥95% identity and size ratios ≥0.8, and only nodes with ≥1000 unique kmers.

Taxonomically Filtered Representatives

representative.sh in=comparisons.tsv out=representatives.txt tree=tree.taxtree.gz ids=562,511145 level=species include=t

Restricts processing to the specified taxonomic IDs at the species level; include=t keeps the sequences that match the filter rather than discarding them.

Detailed Output with Clustering Information

representative.sh in=comparisons.tsv out=representatives.txt printsize=t printclusters=t printheader=t

Outputs detailed information including node sizes and complete cluster membership for each representative.
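
As a rough picture of what these flags control, the Java sketch below assembles a header and a data row from the column names listed under printheader. The tab delimiter, the column order, the inclusion of the representative in its own node list, and the class OutputFormatSketch are all assumptions; check real output before relying on the exact layout.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical formatter; the delimiter and column order are assumptions. */
final class OutputFormatSketch {

    /** Builds the header line from the print flags. */
    static String header(boolean printSize, boolean printClusters) {
        StringJoiner sj = new StringJoiner("\t");
        sj.add("#Representative");
        if (printSize) {
            sj.add("Size");
        }
        if (printClusters) {
            sj.add("NodeCount");
            sj.add("Nodes");
        }
        return sj.toString();
    }

    /** Builds one row: representative ID, optional size, optional count and comma-separated members. */
    static String row(long id, long size, List<Long> members,
                      boolean printSize, boolean printClusters) {
        StringJoiner sj = new StringJoiner("\t");
        sj.add(Long.toString(id));
        if (printSize) {
            sj.add(Long.toString(size));
        }
        if (printClusters) {
            sj.add(Integer.toString(members.size()));
            StringJoiner nodes = new StringJoiner(",");
            for (long m : members) {
                nodes.add(Long.toString(m));
            }
            sj.add(nodes.toString());
        }
        return sj.toString();
    }

    public static void main(String[] args) {
        System.out.println(header(true, true));
        System.out.println(row(562L, 4000L, Arrays.asList(562L, 511145L), true, true));
    }
}
```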

Algorithm Details

RepresentativeSet implements a greedy clustering algorithm using HashMap<Long, Node> data structures and Collections.sort() for centroid selection.
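
A minimal Java sketch of a greedy cover in this spirit follows. It is an illustration under stated assumptions, not the code in RepresentativeSet.java: the classes GreedyRepresentatives, Edge, and Node are hypothetical, a node's score is assumed to be the sum of its retained edge values, and ties are broken by ID. Nodes are created only from edges, so a singleton appears only when it has a self-edge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical greedy representative selection; scoring and tie-breaking are assumptions. */
final class GreedyRepresentatives {

    static final class Edge {
        final long a, b;
        final double value;
        Edge(long a, long b, double value) { this.a = a; this.b = b; this.value = value; }
    }

    static final class Node {
        final long id;
        final Set<Long> neighbors = new HashSet<>();
        double score; // assumed: sum of retained edge values
        Node(long id) { this.id = id; }
    }

    /** Returns representative IDs in the order they were selected. */
    static List<Long> select(List<Edge> edges, double threshold) {
        Map<Long, Node> nodes = new HashMap<>();
        for (Edge e : edges) {
            if (e.value < threshold) {
                continue; // ignore edges under the threshold
            }
            Node na = nodes.computeIfAbsent(e.a, k -> new Node(k));
            Node nb = nodes.computeIfAbsent(e.b, k -> new Node(k));
            na.neighbors.add(e.b);
            nb.neighbors.add(e.a);
            na.score += e.value;
            if (e.a != e.b) {
                nb.score += e.value; // count a self-edge once
            }
        }
        // Highest-scoring nodes are considered first as centroids.
        List<Node> order = new ArrayList<>(nodes.values());
        Collections.sort(order, (x, y) -> {
            int c = Double.compare(y.score, x.score);
            return c != 0 ? c : Long.compare(x.id, y.id);
        });
        Set<Long> covered = new HashSet<>();
        List<Long> reps = new ArrayList<>();
        for (Node n : order) {
            if (covered.contains(n.id)) {
                continue; // already represented by an earlier centroid
            }
            reps.add(n.id);
            covered.add(n.id);
            covered.addAll(n.neighbors);
        }
        return reps;
    }

    public static void main(String[] args) {
        List<Edge> edges = Arrays.asList(
            new Edge(1, 2, 0.99), new Edge(1, 3, 0.98), new Edge(4, 4, 1.0));
        System.out.println(select(edges, 0.95));
    }
}
```

Every node that survives filtering ends up either selected or adjacent to a selected node, which matches the coverage property described under Basic Usage.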

Core Algorithm Implementation

The clustering process uses specific data structures and methods from the Java implementation:

Input Processing (Edge Constructor)

TSV parsing is implemented in the Edge(byte[] line) constructor, with specific column handling:

Memory Management

Memory usage is determined by specific data structure sizes:

Processing Pipeline

Support

For questions and support: