Consensus

Script: consensus.sh Package: consensus Class: ConsensusMaker.java

Generates the consensus sequence of a reference using aligned sequences. This can be used for polishing assemblies, making representative ribosomal sub units, correcting PacBio reads, etc.

Basic Usage

consensus.sh in=mapped.sam ref=ref.fa out=consensus.fa

The consensus tool requires aligned reads (SAM/BAM format) or unaligned sequences (FASTA/FASTQ format) and a reference sequence. If unaligned sequences are provided, they will be aligned to the first reference sequence automatically.

Parameters

Parameters are organized by function: standard file I/O parameters, processing parameters for controlling consensus calling thresholds, and Java memory settings.

Standard parameters

in=<file>
Reads mapped to the reference; should be sam or bam. For unaligned sequences, use fasta or fastq format and they will be automatically aligned to the first reference sequence.
ref=<file>
Reference sequence file in fasta or fastq format. This serves as the template for consensus generation.
out=<file>
Output consensus sequence file in fasta or fastq format. The reference will be modified according to aligned read evidence.
outm=<file>
Optional output for binary model file containing statistical information about the consensus calling process. Preferred extension is .alm.
inm=<file>
Optional input model file for incorporating prior statistical information into consensus calling decisions.
hist=<file>
Optional output file for identity and score histograms showing the distribution of read alignment qualities.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing output file.

Processing parameters

mindepth=2
Minimum read depth required to make changes to the reference sequence. Positions with fewer supporting reads remain unchanged.
mafsub=0.25
Minimum allele frequency for incorporating substitutions. Variants below this threshold are not included in the consensus.
mafdel=0.50
Minimum allele frequency for incorporating deletions. Deletions must be supported by at least this fraction of reads.
mafins=0.50
Minimum allele frequency for incorporating insertions. Insertions must be supported by at least this fraction of reads.
mafn=0.40
Minimum allele frequency for changing ambiguous bases (Ns) to specific nucleotides. Lower thresholds allow more aggressive N resolution.
usemapq=f
Include mapping quality (MAPQ) scores as a positive factor in calculating edge weights for consensus decisions.
nonly=f
Only change ambiguous bases (Ns) to different bases, leaving all other positions unchanged regardless of read evidence.
noindels=f
Disable incorporation of insertions and deletions, allowing only substitutions in the consensus sequence.
ceiling=
If set, alignments will be weighted by their inverse identity to favor lower-quality reads. For example, at ceiling=105, a read with 96% identity gets bonus weight of 105-96=9 while a read with 70% identity gets 105-70=35. This parameter helps incorporate evidence from divergent sequences.
name=
Set the output sequence name for single-sequence consensus generation. Overrides the original reference sequence name.

Java Parameters

-Xmx
Set Java's memory usage, overriding automatic detection. Use -Xmx20g for 20 GB RAM or -Xmx200m for 200 MB. Maximum is typically 85% of physical memory.
-eoom
Exit if an out-of-memory exception occurs. Requires Java 8u92 or later.
-da
Disable Java assertions for slightly improved performance.

Examples

Assembly Polishing with Illumina Reads

consensus.sh in=illumina_mapped.sam ref=draft_assembly.fa out=polished_assembly.fa mafsub=0.5

Polish a draft assembly using mapped Illumina reads. The recommended mafsub=0.5 setting ensures only high-confidence substitutions are incorporated.

Creating Representative Ribosomal Sequences

consensus.sh in=16s_aligned.sam ref=16s_reference.fa out=16s_consensus.fa mindepth=5 mafn=0.3

Generate a consensus 16S rRNA sequence from multiple aligned sequences, requiring at least 5x depth and allowing N resolution at 30% frequency.

PacBio Read Correction

consensus.sh in=pacbio_self_aligned.sam ref=pacbio_reads.fa out=corrected_reads.fa noindels=f ceiling=110

Correct PacBio reads using self-alignment, allowing indels and using inverse identity weighting to incorporate evidence from divergent reads.

Conservative Consensus with Model Output

consensus.sh in=mapped_reads.bam ref=reference.fa out=consensus.fa outm=model.alm hist=quality_hist.txt mindepth=10 mafsub=0.8

Generate a conservative consensus requiring 10x depth and 80% allele frequency, while outputting statistical models and quality histograms for analysis.

Algorithm Details

Consensus Calling Strategy

The consensus algorithm employs a probabilistic graph-based approach implemented through BaseGraph objects that incorporate read alignment evidence into reference sequence modification:

BaseGraph Data Structure

Each reference sequence is represented as a BaseGraph object initialized with three core arrays: `original` (padded reference sequence), `ref` (BaseNode array for reference positions), and `del` (BaseNode array for deletion events). The constructor creates linked BaseNode objects where each position maintains `acgtCount` and `acgtWeight` arrays tracking A/C/G/T frequencies and quality-weighted evidence. Padding using 'N' symbols is applied via the `pad()` method for robust boundary handling.

Multi-threaded Processing

The ConsensusMaker implements parallel processing through the Accumulator interface and ProcessThread worker classes:

Identity-based Weighting

When the `ceiling` parameter is set, the BaseGraph.add() method applies inverse identity weighting calculated as `weight + Tools.max(0, identityCeiling - identity)` where identity is computed as `(int)(r.identity() * 100)`. This mathematical inversion systematically increases evidence weight for lower-identity alignments:

Variant Calling Thresholds

The algorithm applies different minimum allele frequency (MAF) thresholds defined in ConsensusObject as static constants:

Higher thresholds for indels reflect their typically higher error rates compared to substitutions.

Alignment Processing

The tool supports both pre-aligned (SAM/BAM) and unaligned (FASTA/FASTQ) input through distinct processing paths:

Consensus Generation

Final consensus sequences are generated through the BaseGraph.traverse() method which implements position-by-position consensus calling:

Memory Management

The algorithm implements several memory optimization strategies for large-scale processing:

Support

For questions and support: