Representative
Makes a representative set of taxa from an all-to-all identity comparison. Input should be a TSV file with at least 3 columns, in this order: (query, ref, ANI, qsize, rsize, qbases, rbases), as produced by CompareSketch with format=3 and usetaxidname. Only the first 3 columns are required. Additional columns are allowed and will be ignored.
Basic Usage
representative.sh in=<input file> out=<output file>
Creates a minimal representative set by retaining nodes such that all original nodes are within a minimum distance of at least one representative node. Singleton nodes will only be included if they are represented by a self-edge.
Parameters
Parameters are grouped by their role in generating the representative set.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- threshold=0
- Ignore edges below this value. This also affects the choice of centroids: each retained edge adds its margin above the threshold to a node's score, so a high threshold gives more weight to higher-value edges.
- minratio=0
- Ignores edges with a size ratio below this value. The ratio is calculated as max(1, sizeA) / max(1, sizeB), so it can exceed 1 unless invertratio is enabled.
- invertratio=f
- Invert the ratio when greater than 1. When enabled, ratios > 1 are converted to 1/ratio, ensuring all ratios are ≤ 1 for consistent filtering.
- printheader=t
- Print a header line in the output. The header includes "#Representative" and optionally "Size", "NodeCount", and "Nodes" depending on other print flags.
- printsize=t
- Print the size of retained nodes. Shows the estimated size in unique kmers for each representative node.
- printclusters=t
- Print the nodes subsumed by each retained node. Outputs the count and comma-separated list of all nodes represented by each cluster centroid.
- minsize=0
- Ignore nodes under this size (in unique kmers). Nodes below this threshold are excluded from processing entirely.
- maxsize=0
- If positive, ignore nodes over this size (unique kmers). Large nodes above this threshold are excluded from processing.
- minbases=0
- Ignore nodes under this size (in total bases). Alternative size filtering based on total sequence length rather than unique kmers.
- maxbases=0
- If positive, ignore nodes over this size (total bases). Upper limit for sequence length-based filtering.
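The size and edge filters above can be pictured as a few small predicates. This is a minimal Python sketch, not the tool's Java code: the function names (edge_ratio, edge_passes, node_passes) are hypothetical, the ratio follows the sizeA/sizeB form described under Algorithm Details, and treating the cutoffs as inclusive is an assumption.

```python
def edge_ratio(size_a, size_b, invert_ratio=False):
    """Size ratio of an edge, with both sizes clamped to at least 1."""
    ratio = max(1, size_a) / max(1, size_b)
    # With invertratio=t, ratios above 1 are flipped so the result is <= 1.
    if invert_ratio and ratio > 1:
        ratio = 1 / ratio
    return ratio


def edge_passes(ani, size_a, size_b, threshold=0.0, min_ratio=0.0,
                invert_ratio=False):
    """Keep an edge only if identity and size ratio reach the cutoffs.

    Inclusive comparisons are an assumption of this sketch.
    """
    if ani < threshold:
        return False
    return edge_ratio(size_a, size_b, invert_ratio) >= min_ratio


def node_passes(size, bases, min_size=0, max_size=0, min_bases=0, max_bases=0):
    """Apply minsize/maxsize/minbases/maxbases; a max of 0 means no upper limit."""
    if size < min_size or bases < min_bases:
        return False
    if max_size > 0 and size > max_size:
        return False
    if max_bases > 0 and bases > max_bases:
        return False
    return True
```

In this sketch, an edge between nodes of 2000 and 1000 unique kmers has a ratio of 2.0 and passes minratio=0.8 when invertratio is off; with invertratio on, the ratio becomes 0.5 and the edge is dropped.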
Taxonomy parameters
- level=
- Taxonomic level, such as phylum. Filtering will operate on sequences within the same taxonomic level as specified ids. If not set, only matches to a node or its descendants will be considered.
- ids=
- Comma-delimited list of NCBI numeric IDs. Can also be a file with one taxID per line. Used for taxonomic filtering of input sequences.
- names=
- Alternately, a list of names (such as 'Homo sapiens'). Note that spaces need special handling. Provides name-based taxonomic filtering as an alternative to numeric IDs.
- include=f
- 'f' will discard filtered sequences, 't' will keep them. Determines whether taxonomically filtered sequences are included or excluded from the representative set.
- tree=<file>
- Specify a TaxTree file like tree.taxtree.gz. On Genepool, use 'auto'. Required for taxonomic filtering functionality.
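Because ids accepts either a comma-delimited list or a file with one taxID per line, a wrapper script may want to normalize both forms before calling the tool. The helper below is a hypothetical Python sketch (parse_tax_ids is not part of BBTools), and deciding "file or list" by checking whether the path exists is an assumption about how such a wrapper might behave, not a description of the program's own logic.

```python
import os


def parse_tax_ids(ids_arg):
    """Return a set of numeric taxIDs from a comma list or a one-ID-per-line file."""
    if os.path.isfile(ids_arg):
        with open(ids_arg) as handle:
            tokens = [line.strip() for line in handle]
    else:
        tokens = ids_arg.split(",")
    # Skip blank entries so trailing commas or empty lines are harmless.
    return {int(token) for token in tokens if token.strip()}
```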
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically around 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Representative Set Generation
representative.sh in=comparisons.tsv out=representatives.txt
Creates a representative set from all-to-all comparison data, outputting the minimal set of representatives.
Filtered Representative Set with Thresholds
representative.sh in=comparisons.tsv out=representatives.txt threshold=0.95 minratio=0.8 minsize=1000
Applies strict filtering: only edges with ≥95% identity, size ratios ≥0.8, and nodes with ≥1000 unique kmers.
Taxonomically Filtered Representatives
representative.sh in=comparisons.tsv out=representatives.txt tree=tree.taxtree.gz ids=562,511145 level=species include=t
Restricts processing to sequences matching the specified taxonomic IDs at the species level; include=t keeps the matching sequences rather than discarding them.
Detailed Output with Clustering Information
representative.sh in=comparisons.tsv out=representatives.txt printsize=t printclusters=t printheader=t
Outputs detailed information including node sizes and complete cluster membership for each representative.
Algorithm Details
RepresentativeSet implements a greedy clustering algorithm using HashMap<Long, Node> data structures and Collections.sort() for centroid selection:
Core Algorithm Implementation
The clustering process uses specific data structures and methods from the Java implementation:
- Node Scoring (Node.add method): sum += (edge.dist - threshold + 0.000001) accumulates edge weights in the Node class
- Size-weighted Comparator: Node.compareTo() calculates score as sum + 0.25 × sum × Math.log(size) for size-biased selection
- HashMap Storage: Uses HashMap<Long, Node> with Long.parseLong(split[0]) for node ID mapping during load() phase
- LongHashSet Tracking: LongHashSet.contains(e.b) prevents selection of nodes already represented by chosen centroids
- Collections.sort(): Sorts ArrayList<Node> using compareTo(), then Collections.reverse() for descending order
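The scoring and selection steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the RepresentativeSet source: pick_representatives is a hypothetical name, edges are attached only to their query node (an all-to-all input supplies both directions), ties are broken by node ID, and a node that is already covered is never chosen as a centroid.

```python
import math
from collections import defaultdict


def pick_representatives(edges, sizes, threshold=0.0):
    """Greedy sketch: edges are (a, b, ani) tuples, sizes maps node id -> unique kmers."""
    neighbors = defaultdict(list)
    score_sum = defaultdict(float)
    for a, b, ani in edges:
        if ani < threshold:
            continue  # edges under the threshold are ignored
        neighbors[a].append(b)
        # Each retained edge adds its margin over the threshold to the node's sum.
        score_sum[a] += ani - threshold + 0.000001

    def score(node):
        s = score_sum[node]
        # Size-weighted score: sum + 0.25 * sum * ln(size).
        return s + 0.25 * s * math.log(max(1, sizes.get(node, 1)))

    # Highest-scoring nodes are considered first; ties fall back to node id.
    ordered = sorted(neighbors, key=lambda n: (-score(n), n))
    covered = set()
    clusters = {}
    for node in ordered:
        if node in covered:
            continue  # already represented by an earlier centroid
        members = [b for b in neighbors[node] if b not in covered and b != node]
        covered.add(node)
        covered.update(members)
        clusters[node] = members
    return clusters
```

For edges 1-2 and 1-3 given in both directions plus a self-edge on node 4, with equal sizes and threshold 0, the sketch returns {1: [2, 3], 4: []}: node 1 has the largest sum and absorbs 2 and 3, while node 4 survives only because of its self-edge.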
Input Processing (Edge Constructor)
TSV parsing implemented in Edge(byte[] line) constructor with specific column handling:
- Line Parsing: String.split("\t+") with Long.parseLong() for columns 0-1 (node IDs) and Float.parseFloat() for column 2 (the ANI value, stored in the dist field)
- Size Extraction: Optional columns 3-4 parsed as sizeA/sizeB using Long.parseLong() with NumberFormatException handling
- Base Count Extraction: Optional columns 5-6 parsed as basesA/basesB for sequence length filtering
- Ratio Calculation: Edge.ratio() method implements Tools.max(1, sizeA)/(float)Tools.max(1, sizeB) with optional inversion
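The column handling described above can be mirrored in Python for pre-checking an input file. This is a minimal sketch rather than a port of the Edge constructor: the Edge dataclass, parse_edge, and _optional_int are hypothetical names, and falling back to 0 for a missing or non-numeric optional column is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class Edge:
    a: int
    b: int
    dist: float       # column 2: the identity (ANI) value
    size_a: int = 0   # columns 3-4: sizes in unique kmers (optional)
    size_b: int = 0
    bases_a: int = 0  # columns 5-6: total bases (optional)
    bases_b: int = 0


def _optional_int(fields, index):
    """Return fields[index] as an int, or 0 if the column is absent or not numeric."""
    try:
        return int(fields[index])
    except (IndexError, ValueError):
        return 0


def parse_edge(line):
    """Parse one tab-delimited line; columns past the seventh are ignored."""
    fields = re.split(r"\t+", line.rstrip("\r\n"))
    return Edge(
        a=int(fields[0]),
        b=int(fields[1]),
        dist=float(fields[2]),
        size_a=_optional_int(fields, 3),
        size_b=_optional_int(fields, 4),
        bases_a=_optional_int(fields, 5),
        bases_b=_optional_int(fields, 6),
    )
```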
Memory Management
Memory usage determined by specific data structure sizes:
- Node Storage: Each Node object stores long id, long size, long bases, ArrayList<Edge> edges, boolean used, double sum
- Edge Storage: Each Edge object contains 6 long fields (a, b, sizeA, sizeB, basesA, basesB) plus 1 double (dist)
- HashMap Overhead: HashMap<Long, Node> with default load factor 0.75, resize threshold based on node count
- calcXmx() Method: Automatic memory detection calls freeRam(4000m, 84), using a 4000 MB base value and 84% of available RAM
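The field lists above give a rough lower bound on memory needs. The Python arithmetic below counts only the primitive fields named in this section (8 bytes per long or double, 1 byte per boolean); it deliberately leaves out JVM object headers, alignment padding, and ArrayList and HashMap overhead, so real usage will be higher. The function name min_payload_bytes is hypothetical.

```python
LONG_BYTES = 8
DOUBLE_BYTES = 8
BOOLEAN_BYTES = 1

# Edge: a, b, sizeA, sizeB, basesA, basesB (longs) plus dist (double).
EDGE_FIELD_BYTES = 6 * LONG_BYTES + DOUBLE_BYTES

# Node: id, size, bases (longs), sum (double), used (boolean).
# The reference to the edge list is not counted here.
NODE_FIELD_BYTES = 3 * LONG_BYTES + DOUBLE_BYTES + BOOLEAN_BYTES


def min_payload_bytes(node_count, edge_count):
    """Lower bound on raw field storage, excluding all JVM overhead."""
    return node_count * NODE_FIELD_BYTES + edge_count * EDGE_FIELD_BYTES
```

Since an all-to-all comparison of n taxa can produce up to n × n edges, edge storage dominates: 10,000 taxa can yield up to 100 million edges, or about 5.6 GB of edge fields alone, which is a reason to set -Xmx explicitly for large inputs.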
Processing Pipeline
- ByteFile Input: ByteFile.makeByteFile() with line-by-line parsing via bf.nextLine()
- Singleton Isolation: ArrayList separation of nodes with null/empty edges for automatic inclusion
- Filter Chain: Sequential filtering by size (minSize/maxSize), bases (minBases/maxBases), threshold, ratio, and TaxFilter.passesFilter()
- ByteStreamWriter Output: Concurrent writing using ByteStreamWriter.start() with tab-delimited formatting
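The tab-delimited output stage can be sketched as follows, combining the printheader, printsize, and printclusters flags with the header fields named in the parameter descriptions. This is a hypothetical Python illustration (format_output is not a BBTools function); the column order and the choice to exclude the representative itself from NodeCount are assumptions.

```python
def format_output(clusters, sizes, print_header=True, print_size=True,
                  print_clusters=True):
    """Return tab-delimited lines; clusters maps representative id -> member ids."""
    lines = []
    if print_header:
        header = ["#Representative"]
        if print_size:
            header.append("Size")
        if print_clusters:
            header += ["NodeCount", "Nodes"]
        lines.append("\t".join(header))
    for rep, members in clusters.items():
        row = [str(rep)]
        if print_size:
            row.append(str(sizes.get(rep, 0)))
        if print_clusters:
            # Count and comma-separated list of the nodes this representative covers.
            row.append(str(len(members)))
            row.append(",".join(str(m) for m in members))
        lines.append("\t".join(row))
    return lines
```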
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org