Consensus

Basic Usage

consensus.sh in=mapped.sam ref=ref.fa out=consensus.fa

The consensus tool requires aligned reads (SAM/BAM format) or unaligned sequences (FASTA/FASTQ format) and a reference sequence. If unaligned sequences are provided, they will be aligned to the first reference sequence automatically.

Parameters

Parameters are organized by function: standard file I/O parameters, processing parameters for controlling consensus calling thresholds, and Java memory settings.

Standard parameters

in=<file>: Reads mapped to the reference; should be sam or bam. For unaligned sequences, use fasta or fastq format and they will be automatically aligned to the first reference sequence.
ref=<file>: Reference sequence file in fasta or fastq format. This serves as the template for consensus generation.
out=<file>: Output consensus sequence file in fasta or fastq format. The reference will be modified according to aligned read evidence.
outm=<file>: Optional output for binary model file containing statistical information about the consensus calling process. Preferred extension is .alm.
inm=<file>: Optional input model file for incorporating prior statistical information into consensus calling decisions.
hist=<file>: Optional output file for identity and score histograms showing the distribution of read alignment qualities.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing output file.

Processing parameters

mindepth=2: Minimum read depth required to make changes to the reference sequence. Positions with fewer supporting reads remain unchanged.
mafsub=0.25: Minimum allele frequency for incorporating substitutions. Variants below this threshold are not included in the consensus.
mafdel=0.50: Minimum allele frequency for incorporating deletions. Deletions must be supported by at least this fraction of reads.
mafins=0.50: Minimum allele frequency for incorporating insertions. Insertions must be supported by at least this fraction of reads.
mafn=0.40: Minimum allele frequency for changing ambiguous bases (Ns) to specific nucleotides. Lower thresholds allow more aggressive N resolution.
usemapq=f: Include mapping quality (MAPQ) scores as a positive factor in calculating edge weights for consensus decisions.
nonly=f: Only change ambiguous bases (Ns) to different bases, leaving all other positions unchanged regardless of read evidence.
noindels=f: Disable incorporation of insertions and deletions, allowing only substitutions in the consensus sequence.
ceiling=: If set, alignments will be weighted by their inverse identity to favor lower-quality reads. For example, at ceiling=105, a read with 96% identity gets bonus weight of 105-96=9 while a read with 70% identity gets 105-70=35. This parameter helps incorporate evidence from divergent sequences.
name=: Set the output sequence name for single-sequence consensus generation. Overrides the original reference sequence name.

Java Parameters

-Xmx: Set Java's memory usage, overriding automatic detection. Use -Xmx20g for 20 GB RAM or -Xmx200m for 200 MB. Maximum is typically 85% of physical memory.
-eoom: Exit if an out-of-memory exception occurs. Requires Java 8u92 or later.
-da: Disable Java assertions for slightly improved performance.

Examples

Assembly Polishing with Illumina Reads

consensus.sh in=illumina_mapped.sam ref=draft_assembly.fa out=polished_assembly.fa mafsub=0.5

Polish a draft assembly using mapped Illumina reads. The recommended mafsub=0.5 setting ensures only high-confidence substitutions are incorporated.

Creating Representative Ribosomal Sequences

consensus.sh in=16s_aligned.sam ref=16s_reference.fa out=16s_consensus.fa mindepth=5 mafn=0.3

Generate a consensus 16S rRNA sequence from multiple aligned sequences, requiring at least 5x depth and allowing N resolution at 30% frequency.

PacBio Read Correction

consensus.sh in=pacbio_self_aligned.sam ref=pacbio_reads.fa out=corrected_reads.fa noindels=f ceiling=110

Correct PacBio reads using self-alignment, allowing indels and using inverse identity weighting to incorporate evidence from divergent reads.

Conservative Consensus with Model Output

consensus.sh in=mapped_reads.bam ref=reference.fa out=consensus.fa outm=model.alm hist=quality_hist.txt mindepth=10 mafsub=0.8

Generate a conservative consensus requiring 10x depth and 80% allele frequency, while outputting statistical models and quality histograms for analysis.

Algorithm Details

Consensus Calling Strategy

The consensus algorithm employs a probabilistic graph-based approach implemented through BaseGraph objects that incorporate read alignment evidence into reference sequence modification:

BaseGraph Data Structure

Each reference sequence is represented as a BaseGraph object initialized with three core arrays: `original` (padded reference sequence), `ref` (BaseNode array for reference positions), and `del` (BaseNode array for deletion events). The constructor creates linked BaseNode objects where each position maintains `acgtCount` and `acgtWeight` arrays tracking A/C/G/T frequencies and quality-weighted evidence. Padding using 'N' symbols is applied via the `pad()` method for robust boundary handling.

Multi-threaded Processing

The ConsensusMaker implements parallel processing through the Accumulator interface and ProcessThread worker classes:

SamReadStreamer threads - Parse SAM/BAM files using configurable `streamerThreads` parameter (default SamStreamer.DEFAULT_THREADS)
ProcessThread workers - Execute `processRead()` method to accumulate alignment evidence via synchronized BaseNode.add() calls
Thread synchronization - Uses ReentrantReadWriteLock from java.util.concurrent.locks for concurrent BaseGraph access
ThreadWaiter coordination - Manages worker thread lifecycle through ThreadWaiter.waitForThreadsToFinish() method

Identity-based Weighting

When the `ceiling` parameter is set, the BaseGraph.add() method applies inverse identity weighting calculated as `weight + Tools.max(0, identityCeiling - identity)` where identity is computed as `(int)(r.identity() * 100)`. This mathematical inversion systematically increases evidence weight for lower-identity alignments:

96% identity read at ceiling=105 receives bonus weight of 9
70% identity read at ceiling=105 receives bonus weight of 35
Higher identity reads (>ceiling) receive no bonus weight
Implementation uses `invertIdentity` boolean flag and `identityCeiling` integer parameter

Variant Calling Thresholds

The algorithm applies different minimum allele frequency (MAF) thresholds defined in ConsensusObject as static constants:

Substitutions - Controlled by MAF_sub (default 0.25f)
Deletions - Controlled by MAF_del (default 0.5f)
Insertions - Controlled by MAF_ins (default 0.5f)
N resolution - Controlled by MAF_noref (default 0.4f)

Higher thresholds for indels reflect their typically higher error rates compared to substitutions.

Alignment Processing

The tool supports both pre-aligned (SAM/BAM) and unaligned (FASTA/FASTQ) input through distinct processing paths:

Pre-aligned input - Uses SamFilter for quality control and reads existing match strings from alignment records
Unaligned input - Automatically aligns sequences using SketchObject.alignAndMakeMatch() with quality-weighted scoring
MAPQ integration - When useMapq=true, incorporates MAPQ via Math.ceil(Math.sqrt(q*mapq)) weighting function
Match string processing - Parses CIGAR operations ('m', 'D', 'I', 'S', 'N') to accumulate base evidence

Consensus Generation

Final consensus sequences are generated through the BaseGraph.traverse() method which implements position-by-position consensus calling:

BaseNode consensus determination - Uses BaseNode.consensus() method to select optimal base at each position
Quality score preservation - Maintains quality information through consensus generation process
Statistical tracking - Accumulates refCount, subCount, delCount, and insCount for variant statistics
Batch output - Processes consensus sequences in batches of 200 for memory efficiency

Memory Management

The algorithm implements several memory optimization strategies for large-scale processing:

Streaming input - Uses ConcurrentReadInputStream for memory-limited file processing without full file loading
Configurable validation - Disables Read.VALIDATE_IN_CONSTRUCTOR when threads >= 4 for performance optimization
Dynamic buffer sizing - Calculates buffer sizes as Tools.mid(16, 128, (Shared.threads()*2)/3) for ordered output
Concurrent output - Uses ConcurrentReadOutputStream for parallel result writing

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org