Representative

Script: representative.sh Package: jgi Class: RepresentativeSet.java

Makes a representative set of taxa from an all-to-all identity comparison. Input should be a TSV file with at least 3 columns, of which only the first 3 are required: (query, ref, ANI, qsize, rsize, qbases, rbases), as produced by CompareSketch with format=3 and usetaxidname. Additional columns are allowed and will be ignored.

Basic Usage

representative.sh in=<input file> out=<output file>

Creates a minimal representative set by retaining nodes such that all original nodes are within a minimum distance of at least one representative node. Singleton nodes will only be included if they are represented by a self-edge.

Parameters

overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file.
threshold=0: Ignore edges under threshold value. This also affects the choice of centroids; a high threshold gives more weight to higher-value edges. Used in the node scoring algorithm to filter low-quality connections.
minratio=0: Ignores edges with a ratio below this value. The ratio is calculated as sizeA / sizeB (query size over reference size, with each size clamped to at least 1), so it can exceed 1 unless invertratio is enabled.
invertratio=f: Invert the ratio when greater than 1. When enabled, ratios > 1 are converted to 1/ratio, ensuring all ratios are ≤ 1 for consistent filtering.
printheader=t: Print a header line in the output. The header includes "#Representative" and optionally "Size", "NodeCount", and "Nodes" depending on other print flags.
printsize=t: Print the size of retained nodes. Shows the estimated size in unique kmers for each representative node.
printclusters=t: Print the nodes subsumed by each retained node. Outputs the count and comma-separated list of all nodes represented by each cluster centroid.
minsize=0: Ignore nodes under this size (in unique kmers). Nodes below this threshold are excluded from processing entirely.
maxsize=0: If positive, ignore nodes over this size (unique kmers). Large nodes above this threshold are excluded from processing.
minbases=0: Ignore nodes under this size (in total bases). Alternative size filtering based on total sequence length rather than unique kmers.
maxbases=0: If positive, ignore nodes over this size (total bases). Upper limit for sequence length-based filtering.
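
The size and ratio filters above can be pictured as simple predicates. The following is a minimal sketch, not the RepresentativeSet source: the class name FilterSketch and its method names are hypothetical, and the ratio follows the Edge.ratio() expression quoted under Input Processing, with the optional inversion described for invertratio.

```java
// Hypothetical sketch of the size and ratio filters; names are illustrative only.
final class FilterSketch {

    // A node passes when size >= minSize and, if maxSize is positive, size <= maxSize.
    static boolean sizeInRange(long size, long minSize, long maxSize) {
        if (size < minSize) return false;
        return maxSize <= 0 || size <= maxSize;
    }

    // Ratio of the two sizes, each clamped to at least 1 to avoid division by zero.
    static float ratio(long sizeA, long sizeB, boolean invertRatio) {
        float r = Math.max(1, sizeA) / (float) Math.max(1, sizeB);
        return (invertRatio && r > 1) ? 1 / r : r;
    }

    // An edge is kept when its value reaches the threshold and its ratio reaches minRatio.
    static boolean keepEdge(double value, long sizeA, long sizeB,
                            double threshold, float minRatio, boolean invertRatio) {
        return value >= threshold && ratio(sizeA, sizeB, invertRatio) >= minRatio;
    }

    public static void main(String[] args) {
        System.out.println(ratio(2000, 1000, false)); // 2.0
        System.out.println(ratio(2000, 1000, true));  // 0.5
        System.out.println(keepEdge(0.97, 900, 1000, 0.95, 0.8f, true)); // true
    }
}
```

With invertRatio off, a pair whose first node is twice the size of the second has a ratio of 2.0 and passes any minratio of 2 or less; with it on, the same pair has a ratio of 0.5 and is dropped by minratio=0.8.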

Taxonomy parameters

level=: Taxonomic level, such as phylum. Filtering will operate on sequences within the same taxonomic level as specified ids. If not set, only matches to a node or its descendants will be considered.
ids=: Comma-delimited list of NCBI numeric IDs. Can also be a file with one taxID per line. Used for taxonomic filtering of input sequences.
names=: Alternately, a list of names (such as 'Homo sapiens'). Note that spaces need special handling. Provides name-based taxonomic filtering as an alternative to numeric IDs.
include=f: 'f' will discard filtered sequences, 't' will keep them. Determines whether taxonomically filtered sequences are included or excluded from the representative set.
tree=<file>: Specify a TaxTree file like tree.taxtree.gz. On Genepool, use 'auto'. Required for taxonomic filtering functionality.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically around 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Representative Set Generation

representative.sh in=comparisons.tsv out=representatives.txt

Creates a representative set from all-to-all comparison data, outputting the minimal set of representatives.

Filtered Representative Set with Thresholds

representative.sh in=comparisons.tsv out=representatives.txt threshold=0.95 minratio=0.8 minsize=1000

Applies strict filtering: only edges with ≥95% identity, size ratios ≥0.8, and nodes with ≥1000 unique kmers.

Taxonomically Filtered Representatives

representative.sh in=comparisons.tsv out=representatives.txt tree=tree.taxtree.gz ids=562,511145 level=species include=t

Generates representatives only for specified taxonomic IDs at the species level, including filtered sequences.

Detailed Output with Clustering Information

representative.sh in=comparisons.tsv out=representatives.txt printsize=t printclusters=t printheader=t

Outputs detailed information including node sizes and complete cluster membership for each representative.

Algorithm Details

RepresentativeSet implements a greedy clustering algorithm: nodes are stored in a HashMap<Long, Node>, scored, and sorted with Collections.sort() to select centroids.

Core Algorithm Implementation

The clustering process uses specific data structures and methods from the Java implementation:

Node Scoring (Node.add method): sum += (edge.dist - threshold + 0.000001) accumulates edge weights in the Node class
Size-weighted Comparator: Node.compareTo() calculates score as sum + 0.25 × sum × Math.log(size) for size-biased selection
HashMap Storage: Uses HashMap<Long, Node> with Long.parseLong(split[0]) for node ID mapping during load() phase
LongHashSet Tracking: LongHashSet.contains(e.b) prevents selection of nodes already represented by chosen centroids
Collections.sort(): Sorts ArrayList<Node> using compareTo(), then Collections.reverse() for descending order
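
The items above can be sketched as one small greedy loop. This is an illustrative reconstruction under stated assumptions, not the actual class: GreedySketch, link(), select() and demo() are hypothetical names, a java.util.HashSet stands in for LongHashSet, the tie-break on id and the clamp of size to at least 1 before Math.log() are assumptions, and the demo values are invented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of greedy centroid selection; not the RepresentativeSet source.
final class GreedySketch {

    static final class Node implements Comparable<Node> {
        final long id;
        final long size;
        final List<Long> neighbors = new ArrayList<>();
        double sum = 0;

        Node(long id, long size) {
            this.id = id;
            this.size = size;
        }

        // Accumulate the edge weight above the threshold (Node Scoring item).
        void add(long other, double dist, double threshold) {
            neighbors.add(other);
            sum += (dist - threshold + 0.000001);
        }

        // Size-weighted score (Size-weighted Comparator item); size is clamped to 1 here.
        double score() {
            return sum + 0.25 * sum * Math.log(Math.max(1, size));
        }

        @Override
        public int compareTo(Node o) {
            int c = Double.compare(score(), o.score());
            return c != 0 ? c : Long.compare(id, o.id); // tie-break is an assumption
        }
    }

    // Adds an edge in both directions, as an all-to-all comparison would list it.
    static void link(Map<Long, Node> nodes, long a, long b, double dist, double threshold) {
        nodes.computeIfAbsent(a, k -> new Node(k, 1000)).add(b, dist, threshold);
        nodes.computeIfAbsent(b, k -> new Node(k, 1000)).add(a, dist, threshold);
    }

    // Returns representative ids in selection order.
    static List<Long> select(Map<Long, Node> nodes) {
        List<Node> sorted = new ArrayList<>(nodes.values());
        Collections.sort(sorted);
        Collections.reverse(sorted); // highest score first

        Set<Long> covered = new HashSet<>(); // stand-in for LongHashSet
        List<Long> reps = new ArrayList<>();
        for (Node n : sorted) {
            if (covered.contains(n.id)) continue; // already represented by a centroid
            reps.add(n.id);
            covered.add(n.id);
            covered.addAll(n.neighbors);
        }
        return reps;
    }

    // Invented four-node example: 1-2, 1-3 and 3-4 are linked.
    static List<Long> demo() {
        Map<Long, Node> nodes = new HashMap<>();
        double threshold = 0.95;
        link(nodes, 1, 2, 0.99, threshold);
        link(nodes, 1, 3, 0.98, threshold);
        link(nodes, 3, 4, 0.97, threshold);
        return select(nodes);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [1, 4]
    }
}
```

In the demo, node 1 has the highest score and is chosen first, covering nodes 2 and 3. Node 4 is linked only to node 3, which is covered but is not itself a representative, so node 4 becomes the second representative.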

Input Processing (Edge Constructor)

TSV parsing is implemented in the Edge(byte[] line) constructor, which handles the columns as follows:

Line Parsing: String.split("\t+") with Long.parseLong() for columns 0-1 (node IDs) and Float.parseFloat() for column 2 (the ANI value, stored in the dist field)
Size Extraction: Optional columns 3-4 parsed as sizeA/sizeB using Long.parseLong() with NumberFormatException handling
Base Count Extraction: Optional columns 5-6 parsed as basesA/basesB for sequence length filtering
Ratio Calculation: Edge.ratio() method implements Tools.max(1, sizeA)/(float)Tools.max(1, sizeB) with optional inversion
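
The column handling above can be sketched as follows. This is a simplified illustration, not the Edge constructor itself: it takes a String rather than a byte[], the class EdgeParseSketch and the helper optionalLong() are hypothetical, and falling back to 0 for missing or non-numeric optional columns is an assumption.

```java
// Hypothetical sketch of parsing one TSV line into an edge; field names follow the text above.
final class EdgeParseSketch {

    static final class Edge {
        long a, b;
        double dist;
        long sizeA, sizeB, basesA, basesB;
    }

    static Edge parse(String line) {
        String[] split = line.split("\t+");
        if (split.length < 3) {
            throw new IllegalArgumentException("Need at least 3 columns: " + line);
        }
        Edge e = new Edge();
        e.a = Long.parseLong(split[0]);       // query node ID
        e.b = Long.parseLong(split[1]);       // reference node ID
        e.dist = Float.parseFloat(split[2]);  // identity value from column 2
        e.sizeA = optionalLong(split, 3);
        e.sizeB = optionalLong(split, 4);
        e.basesA = optionalLong(split, 5);
        e.basesB = optionalLong(split, 6);
        return e;
    }

    // Missing or non-numeric optional columns fall back to 0 (an assumption).
    static long optionalLong(String[] split, int index) {
        if (index >= split.length) return 0;
        try {
            return Long.parseLong(split[index]);
        } catch (NumberFormatException ex) {
            return 0;
        }
    }

    public static void main(String[] args) {
        Edge e = parse("562\t511145\t98.5\t4000\t5000");
        System.out.println(e.a + " " + e.b + " " + e.dist + " " + e.sizeA + " " + e.sizeB);
    }
}
```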

Memory Management

Memory usage is determined by the sizes of the following data structures:

Node Storage: Each Node object stores long id, long size, long bases, ArrayList<Edge> edges, boolean used, double sum
Edge Storage: Each Edge object contains 6 long fields (a, b, sizeA, sizeB, basesA, basesB) plus 1 double (dist)
HashMap Overhead: HashMap<Long, Node> with default load factor 0.75, resize threshold based on node count
calcXmx() Method: When -Xmx is not given, memory is autodetected using freeRam(4000m, 84), which passes 4000m as the base value and 84 as the percentage of available RAM to use
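
A back-of-envelope estimate of edge memory can be derived from the field list above. The sketch below is an approximation under stated assumptions: MemorySketch and its methods are hypothetical, the 12-byte object header and 8-byte alignment are typical of a 64-bit HotSpot JVM with compressed class pointers rather than anything stated in the tool, and the ArrayList and HashMap overheads are not counted.

```java
// Hypothetical sketch estimating heap used by Edge objects; header size is an assumption.
final class MemorySketch {

    // Rounds header + fields up to the next multiple of 8 bytes.
    static long alignedSize(long headerBytes, long fieldBytes) {
        long raw = headerBytes + fieldBytes;
        return (raw + 7) / 8 * 8;
    }

    // Edge fields listed above: 6 longs + 1 double = 56 bytes of payload.
    static long edgeBytes(long headerBytes) {
        return alignedSize(headerBytes, 6 * Long.BYTES + Double.BYTES);
    }

    // Total bytes for the given number of edges, excluding list and map overhead.
    static long estimateEdgeHeap(long edgeCount, long headerBytes) {
        return edgeCount * edgeBytes(headerBytes);
    }

    public static void main(String[] args) {
        long perEdge = edgeBytes(12);
        System.out.println(perEdge + " bytes per edge");
        System.out.println(estimateEdgeHeap(10_000_000L, 12) / (1024 * 1024) + " MiB for 10M edges");
    }
}
```

Under these assumptions each edge occupies 72 bytes, so an estimate like this can help choose a -Xmx value for large all-to-all inputs.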

Processing Pipeline

ByteFile Input: ByteFile.makeByteFile() with line-by-line parsing via bf.nextLine()
Singleton Isolation: Nodes with null or empty edge lists are moved to a separate ArrayList for automatic inclusion
Filter Chain: Sequential filtering by size (minSize/maxSize), bases (minBases/maxBases), threshold, ratio, and TaxFilter.passesFilter()
ByteStreamWriter Output: Concurrent writing using ByteStreamWriter.start() with tab-delimited formatting
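
The tab-delimited output controlled by printheader, printsize and printclusters might be assembled roughly as below. This is a sketch only: OutputSketch and its methods are hypothetical, the column names come from the printheader description, and the exact column order and the use of plain strings instead of ByteStreamWriter are assumptions.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical sketch of the output layout; the real writer uses ByteStreamWriter.
final class OutputSketch {

    // Header line: "#Representative" plus optional columns depending on the print flags.
    static String header(boolean printSize, boolean printClusters) {
        StringJoiner sj = new StringJoiner("\t");
        sj.add("#Representative");
        if (printSize) sj.add("Size");
        if (printClusters) {
            sj.add("NodeCount");
            sj.add("Nodes");
        }
        return sj.toString();
    }

    // One line per representative: id, optional size, optional member count and member list.
    static String row(long id, long size, List<Long> members,
                      boolean printSize, boolean printClusters) {
        StringJoiner sj = new StringJoiner("\t");
        sj.add(Long.toString(id));
        if (printSize) sj.add(Long.toString(size));
        if (printClusters) {
            sj.add(Integer.toString(members.size()));
            StringJoiner nodes = new StringJoiner(",");
            for (long m : members) nodes.add(Long.toString(m));
            sj.add(nodes.toString());
        }
        return sj.toString();
    }

    public static void main(String[] args) {
        System.out.println(header(true, true));
        System.out.println(row(562, 4000, List.of(562L, 511145L), true, true));
    }
}
```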

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org