MergePGM

Basic Usage

mergepgm.sh in=x.pgm,y.pgm out=z.pgm

Combines multiple .pgm (probabilistic gene model) files into a single merged model. This is useful for creating composite gene models from datasets of different sizes or combining specialized models.

Parameters

Parameters control input/output files and merging behavior. The tool supports weighted merging and normalization to handle models of different training set sizes.

File Parameters

in=<file,file>: A pgm file or comma-delimited list of pgm files. Supports @ multiplier syntax (e.g., model.pgm@0.5) to weight specific models during merging. Multiple files are separated by commas without spaces.
out=<file>: Output filename for the merged probabilistic gene model. The output will be in .pgm format compatible with other BBTools prokaryotic gene calling utilities.

Merging Parameters

normalize=f: Equalize the influence of each model regardless of training set size, so small models carry the same weight as large ones. Disabled (f) by default. When enabled, each model is scaled according to the number of bases processed during training, which prevents larger training sets from dominating the merged model. Normalization happens before any @ multiplier values are applied.

Examples

Basic Model Merging

mergepgm.sh in=genome1.pgm,genome2.pgm out=merged.pgm

Merges two gene models with equal weighting. The resulting model combines the statistical characteristics of both training datasets.

Weighted Model Merging

mergepgm.sh in=large_dataset.pgm@0.7,small_dataset.pgm@0.3 out=weighted_merged.pgm

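Merges two models with custom weights. The statistics of the large dataset model are scaled by 0.7 and those of the small dataset model by 0.3 before they are added together. Because normalize is off here, the larger model's higher raw counts still add to its influence, so the final contributions are 70% and 30% only when both models were trained on a similar number of bases. Add normalize=t to make the @ values act as true contribution ratios.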

Normalized Merging

mergepgm.sh in=big_model.pgm,small_model.pgm out=normalized.pgm normalize=t

Merges models with normalization enabled. This ensures that models trained on datasets of different sizes contribute equally to the merged result, preventing bias toward larger training sets.

Multiple Model Merging

mergepgm.sh in=archaea.pgm@0.3,bacteria.pgm@0.6,mixed.pgm@0.1 out=comprehensive.pgm normalize=t

Combines three specialized models with custom weights and normalization. Useful for creating representative models that work across diverse prokaryotic genomes.

Algorithm Details

Merging Strategy

MergePGM merges in two phases: it first deserializes each input model with GeneModelParser.loadModel(), then combines the models through iterative GeneModel.add() calls, which sum the statistical parameters across all gene feature types.

Processing Pipeline

1. Model Loading and Parsing

The loadModels() method processes each input file by calling GeneModelParser.loadModel(fname) to deserialize PGM files. The @ multiplier syntax is parsed using String.split("@") to extract custom weights, with multipliers stored in a parallel ArrayList<Double> structure.
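
The parsing step above can be sketched in a few lines of Python. This is an illustration of the described approach, not the tool's Java source: the function name parse_inputs is hypothetical, and the default multiplier of 1.0 for files without an @ suffix is an assumption.

```python
from typing import List, Tuple


def parse_inputs(in_value: str) -> Tuple[List[str], List[float]]:
    """Split an in= value into file names and a parallel list of multipliers."""
    names: List[str] = []
    multipliers: List[float] = []
    for token in in_value.split(","):
        parts = token.split("@")
        names.append(parts[0])
        # Assumption: a file given without an @ suffix gets a multiplier of 1.0.
        multipliers.append(float(parts[1]) if len(parts) > 1 else 1.0)
    return names, multipliers


names, mults = parse_inputs("archaea.pgm@0.3,bacteria.pgm@0.6,mixed.pgm")
print(names)
print(mults)
```

Keeping the multipliers in a list parallel to the file names mirrors the parallel ArrayList<Double> structure described above: index i in one list always refers to the same model as index i in the other.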

2. Normalization Processing

When normalization is enabled, the tool calculates the maximum number of bases processed across all input models. Smaller models are then scaled up proportionally using the formula: multiplier = max_bases / model_bases. This prevents models trained on larger datasets from dominating the merged result.
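
As a minimal sketch of the formula above, the following Python computes one normalization multiplier per model from its base count. The function name and the example base counts are hypothetical and chosen only for illustration.

```python
from typing import List


def normalization_multipliers(bases_per_model: List[int]) -> List[float]:
    """Return multiplier = max_bases / model_bases for every model."""
    max_bases = max(bases_per_model)
    return [max_bases / bases for bases in bases_per_model]


# Hypothetical base counts for three models of different sizes.
multipliers = normalization_multipliers([4_000_000, 1_000_000, 500_000])
print(multipliers)  # [1.0, 4.0, 8.0]
```

The largest model keeps a multiplier of 1.0, while the smaller models are scaled up until their totals match it.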

3. Weight Application

After normalization (if enabled), custom weights specified with the @ syntax are applied to each model. This allows fine-tuned control over the contribution of each model to the final merged result.

4. Statistical Merging

The tool creates a new composite GeneModel and iteratively adds the statistics from each weighted input model. Statistical parameters for different gene types (CDS statistics, rRNA statistics, tRNA statistics) are combined additively, preserving the probabilistic characteristics of the individual models.
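
The additive combination can be illustrated with a simplified stand-in for a gene model. In this sketch each model is a plain dictionary of counts, which is an assumption made for illustration; the real GeneModel holds richer per-feature statistics, and the function name merge_counts is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def merge_counts(models: List[Dict[str, float]],
                 weights: List[float]) -> Dict[str, float]:
    """Add weighted counts from every model into one composite table."""
    merged: Dict[str, float] = defaultdict(float)
    for model, weight in zip(models, weights):
        for key, count in model.items():
            merged[key] += weight * count
    return dict(merged)


# Hypothetical start-codon counts from a large and a small model.
large = {"ATG": 90, "GTG": 10}
small = {"ATG": 6, "TTG": 4}
composite = merge_counts([large, small], [1.0, 10.0])
print(composite)  # {'ATG': 150.0, 'GTG': 10.0, 'TTG': 40.0}
```

Because the counts are summed rather than averaged, any statistic that appears in only one model (TTG above) still carries over into the composite.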

Gene Model Components

The merging process handles multiple statistical components within each gene model:

CDS Statistics: Coding sequence start/stop codon probabilities and inner codon usage patterns
rRNA Statistics: Ribosomal RNA gene patterns for 16S, 23S, 5S, and 18S sequences
tRNA Statistics: Transfer RNA gene structural and sequence characteristics
Base Composition: Overall nucleotide composition and GC content patterns

File Handling and Validation

The tool validates that input models are compatible and that output files can be written. It resolves "auto" or "default" input paths to the system default gene model and rejects duplicate file specifications unless they are explicitly allowed.

Memory and Performance

MergePGM runs with a default heap size of 200 MB, which is sufficient for merging typical gene model files. Merging time grows linearly with the number of input models.

Technical Notes

File Format Compatibility

Input and output files must be in .pgm (probabilistic gene model) format as used by BBTools prokaryotic gene calling utilities. The merged model maintains full compatibility with downstream gene calling tools.

Normalization Algorithm

The normalization process uses the maximum bases processed across all models as the reference point. Smaller models are scaled up using multiplicative factors, while the largest model remains unchanged. This approach preserves the relative statistical relationships within each model while equalizing their influence on the merged result.

Weight Interaction

When both normalization and @ weights are used, normalization is applied first to equalize model sizes, then the @ weights are applied to achieve the desired contribution ratios. This two-stage process allows for both size-independent and user-controlled weighting.
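
To see how the two stages interact, the sketch below estimates each model's share of the merged result. It assumes that a model's influence is proportional to its base count times its combined multiplier, which is a simplification; the function name and the base counts are hypothetical.

```python
from typing import List


def effective_shares(bases: List[int], at_weights: List[float],
                     normalize: bool) -> List[float]:
    """Estimate each model's fraction of the merged statistics."""
    max_bases = max(bases)
    totals = []
    for model_bases, weight in zip(bases, at_weights):
        # Stage 1: optional normalization; stage 2: the @ weight.
        norm = max_bases / model_bases if normalize else 1.0
        totals.append(model_bases * norm * weight)
    grand_total = sum(totals)
    return [total / grand_total for total in totals]


bases = [9_000_000, 1_000_000]  # hypothetical large and small model
weights = [0.7, 0.3]
print([round(s, 3) for s in effective_shares(bases, weights, normalize=False)])
print([round(s, 3) for s in effective_shares(bases, weights, normalize=True)])
```

Under these assumptions, the @ weights alone leave the large model with roughly 95% of the influence, whereas with normalization enabled the shares match the requested 70% and 30%.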

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org