Consensus
Generates the consensus sequence of a reference using aligned sequences. This can be used for polishing assemblies, making representative ribosomal sub units, correcting PacBio reads, etc.
Basic Usage
consensus.sh in=mapped.sam ref=ref.fa out=consensus.fa
The consensus tool requires aligned reads (SAM/BAM format) or unaligned sequences (FASTA/FASTQ format) and a reference sequence. If unaligned sequences are provided, they will be aligned to the first reference sequence automatically.
Parameters
Parameters are organized by function: standard file I/O parameters, processing parameters for controlling consensus calling thresholds, and Java memory settings.
Standard parameters
- in=<file>
- Reads mapped to the reference; should be sam or bam. For unaligned sequences, use fasta or fastq format and they will be automatically aligned to the first reference sequence.
- ref=<file>
- Reference sequence file in fasta or fastq format. This serves as the template for consensus generation.
- out=<file>
- Output consensus sequence file in fasta or fastq format. The reference will be modified according to aligned read evidence.
- outm=<file>
- Optional output for binary model file containing statistical information about the consensus calling process. Preferred extension is .alm.
- inm=<file>
- Optional input model file for incorporating prior statistical information into consensus calling decisions.
- hist=<file>
- Optional output file for identity and score histograms showing the distribution of read alignment qualities.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing output file.
Processing parameters
- mindepth=2
- Minimum read depth required to make changes to the reference sequence. Positions with fewer supporting reads remain unchanged.
- mafsub=0.25
- Minimum allele frequency for incorporating substitutions. Variants below this threshold are not included in the consensus.
- mafdel=0.50
- Minimum allele frequency for incorporating deletions. Deletions must be supported by at least this fraction of reads.
- mafins=0.50
- Minimum allele frequency for incorporating insertions. Insertions must be supported by at least this fraction of reads.
- mafn=0.40
- Minimum allele frequency for changing ambiguous bases (Ns) to specific nucleotides. Lower thresholds allow more aggressive N resolution.
- usemapq=f
- Include mapping quality (MAPQ) scores as a positive factor in calculating edge weights for consensus decisions.
- nonly=f
- Only change ambiguous bases (Ns) to different bases, leaving all other positions unchanged regardless of read evidence.
- noindels=f
- Disable incorporation of insertions and deletions, allowing only substitutions in the consensus sequence.
- ceiling=
- If set, alignments will be weighted by their inverse identity to favor lower-quality reads. For example, at ceiling=105, a read with 96% identity gets bonus weight of 105-96=9 while a read with 70% identity gets 105-70=35. This parameter helps incorporate evidence from divergent sequences.
- name=
- Set the output sequence name for single-sequence consensus generation. Overrides the original reference sequence name.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding automatic detection. Use -Xmx20g for 20 GB RAM or -Xmx200m for 200 MB. Maximum is typically 85% of physical memory.
- -eoom
- Exit if an out-of-memory exception occurs. Requires Java 8u92 or later.
- -da
- Disable Java assertions for slightly improved performance.
Examples
Assembly Polishing with Illumina Reads
consensus.sh in=illumina_mapped.sam ref=draft_assembly.fa out=polished_assembly.fa mafsub=0.5
Polish a draft assembly using mapped Illumina reads. The recommended mafsub=0.5 setting ensures only high-confidence substitutions are incorporated.
Creating Representative Ribosomal Sequences
consensus.sh in=16s_aligned.sam ref=16s_reference.fa out=16s_consensus.fa mindepth=5 mafn=0.3
Generate a consensus 16S rRNA sequence from multiple aligned sequences, requiring at least 5x depth and allowing N resolution at 30% frequency.
PacBio Read Correction
consensus.sh in=pacbio_self_aligned.sam ref=pacbio_reads.fa out=corrected_reads.fa noindels=f ceiling=110
Correct PacBio reads using self-alignment, allowing indels and using inverse identity weighting to incorporate evidence from divergent reads.
Conservative Consensus with Model Output
consensus.sh in=mapped_reads.bam ref=reference.fa out=consensus.fa outm=model.alm hist=quality_hist.txt mindepth=10 mafsub=0.8
Generate a conservative consensus requiring 10x depth and 80% allele frequency, while outputting statistical models and quality histograms for analysis.
Algorithm Details
Consensus Calling Strategy
The consensus algorithm employs a probabilistic graph-based approach implemented through BaseGraph objects that incorporate read alignment evidence into reference sequence modification:
BaseGraph Data Structure
Each reference sequence is represented as a BaseGraph object initialized with three core arrays: `original` (padded reference sequence), `ref` (BaseNode array for reference positions), and `del` (BaseNode array for deletion events). The constructor creates linked BaseNode objects where each position maintains `acgtCount` and `acgtWeight` arrays tracking A/C/G/T frequencies and quality-weighted evidence. Padding using 'N' symbols is applied via the `pad()` method for robust boundary handling.
Multi-threaded Processing
The ConsensusMaker implements parallel processing through the Accumulator interface and ProcessThread worker classes:
- SamReadStreamer threads - Parse SAM/BAM files using configurable `streamerThreads` parameter (default SamStreamer.DEFAULT_THREADS)
- ProcessThread workers - Execute `processRead()` method to accumulate alignment evidence via synchronized BaseNode.add() calls
- Thread synchronization - Uses ReentrantReadWriteLock from java.util.concurrent.locks for concurrent BaseGraph access
- ThreadWaiter coordination - Manages worker thread lifecycle through ThreadWaiter.waitForThreadsToFinish() method
Identity-based Weighting
When the `ceiling` parameter is set, the BaseGraph.add() method applies inverse identity weighting calculated as `weight + Tools.max(0, identityCeiling - identity)` where identity is computed as `(int)(r.identity() * 100)`. This mathematical inversion systematically increases evidence weight for lower-identity alignments:
- 96% identity read at ceiling=105 receives bonus weight of 9
- 70% identity read at ceiling=105 receives bonus weight of 35
- Higher identity reads (>ceiling) receive no bonus weight
- Implementation uses `invertIdentity` boolean flag and `identityCeiling` integer parameter
Variant Calling Thresholds
The algorithm applies different minimum allele frequency (MAF) thresholds defined in ConsensusObject as static constants:
- Substitutions - Controlled by MAF_sub (default 0.25f)
- Deletions - Controlled by MAF_del (default 0.5f)
- Insertions - Controlled by MAF_ins (default 0.5f)
- N resolution - Controlled by MAF_noref (default 0.4f)
Higher thresholds for indels reflect their typically higher error rates compared to substitutions.
Alignment Processing
The tool supports both pre-aligned (SAM/BAM) and unaligned (FASTA/FASTQ) input through distinct processing paths:
- Pre-aligned input - Uses SamFilter for quality control and reads existing match strings from alignment records
- Unaligned input - Automatically aligns sequences using SketchObject.alignAndMakeMatch() with quality-weighted scoring
- MAPQ integration - When useMapq=true, incorporates MAPQ via Math.ceil(Math.sqrt(q*mapq)) weighting function
- Match string processing - Parses CIGAR operations ('m', 'D', 'I', 'S', 'N') to accumulate base evidence
Consensus Generation
Final consensus sequences are generated through the BaseGraph.traverse() method which implements position-by-position consensus calling:
- BaseNode consensus determination - Uses BaseNode.consensus() method to select optimal base at each position
- Quality score preservation - Maintains quality information through consensus generation process
- Statistical tracking - Accumulates refCount, subCount, delCount, and insCount for variant statistics
- Batch output - Processes consensus sequences in batches of 200 for memory efficiency
Memory Management
The algorithm implements several memory optimization strategies for large-scale processing:
- Streaming input - Uses ConcurrentReadInputStream for memory-limited file processing without full file loading
- Configurable validation - Disables Read.VALIDATE_IN_CONSTRUCTOR when threads >= 4 for performance optimization
- Dynamic buffer sizing - Calculates buffer sizes as Tools.mid(16, 128, (Shared.threads()*2)/3) for ordered output
- Concurrent output - Uses ConcurrentReadOutputStream for parallel result writing
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org