RunHMM

Script: runhmm.sh Package: hmm Class: HMMSearchReport.java

Parses HMMER search output files with manual byte-array processing, extracting 23 fields from each hit line into an HMMSearchLine object. Hits are grouped by query protein name in a HashMap<String, ProteinSummary>, and only the longest hit length is retained per protein-model combination, which keeps memory use low.

Basic Usage

runhmm.sh in=<file> out=<file>

Reads an HMM search output file line by line with ByteFile.nextLine() and parses each non-comment line into an HMMSearchLine object. Parsing works directly on the byte array: two pointers (the a and b variables) advance through the line, spaces mark the field boundaries, and Parse methods convert the 23 extracted fields to integers, floats, and doubles as needed.

Parameters

RunHMM takes only a few parameters, since it processes standard HMMER result files and applies fixed output formatting.

File I/O Parameters

in=<file>: Input HMM search results file. Should be in standard HMMER output format, with space-delimited fields containing protein names, model information, coordinates, and scores.
out=<file>: Output file for processed results. If not specified, results are written to stdout.

Processing Parameters

ow=f: (overwrite) Set to true to overwrite existing output files. Default: false.
verbose=f: Enable verbose output for debugging and detailed processing information. Default: false.

Examples

Basic HMM Results Processing

runhmm.sh in=hmmsearch_output.txt out=processed_results.txt

Processes HMM search results from a HMMER output file, creating organized protein summaries.

Processing with Overwrite

runhmm.sh in=domain_search.out out=summary.txt ow=t

Processes domain search results, overwriting any existing output file.

Standard Pipeline Usage

# First run HMM search (external tool)
hmmsearch protein_models.hmm query_sequences.faa > search_results.txt

# Then process results with runhmm
runhmm.sh in=search_results.txt out=organized_hits.txt

Typical workflow showing HMM search followed by results processing.

Algorithm Details

HMM Search Line Parsing

RunHMM parses each hit line byte by byte. Two pointers (the a and b variables) advance through the space-delimited fields, and Parse.parseInt(), Parse.parseDouble(), and Parse.parseFloat() convert the numeric ones. The 23 extracted fields fall into four groups:

Protein identification: Query sequence name and reference model information
Statistical scores: E-values, bit scores, and bias corrections
Coordinate mapping: Start/end positions in both query and model sequences
Model metadata: HMM model names, accession numbers, and lengths

Data Organization Strategy

The tool uses HashMap-based storage with length-based filtering:

Primary level: Groups all hits by query protein name using HashMap<String, ProteinSummary> for direct lookup
Secondary level: Each ProteinSummary contains a HashMap<String, Integer> that uses line.name (protein name) as the key and line.length as the value
Length filtering: ProteinSummary.add() compares line.length against the Integer already stored for that key and updates the map only when no value exists yet or the new length exceeds the stored one
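
The two-level storage described above can be sketched as follows. This is a minimal illustration, not the HMMSearchReport source: the class names LengthRetentionSketch and Summary and the addHit method are hypothetical stand-ins, and the inner key is passed in as a plain string rather than read from an HMMSearchLine.

```java
import java.util.HashMap;
import java.util.Map;

class LengthRetentionSketch {

    /** Hypothetical stand-in for ProteinSummary: keeps the largest length seen per key. */
    static class Summary {
        final Map<String, Integer> lengths = new HashMap<>();

        /** Returns true if a value was created or replaced. */
        boolean add(String key, int length) {
            Integer old = lengths.get(key);
            // Same rule as described above: update when old==null or old<length.
            if (old == null || old < length) {
                lengths.put(key, length);
                return true;
            }
            return false;
        }
    }

    /** Primary level: one Summary per query protein name. */
    final Map<String, Summary> byProtein = new HashMap<>();

    void addHit(String protein, String key, int length) {
        byProtein.computeIfAbsent(protein, k -> new Summary()).add(key, length);
    }

    public static void main(String[] args) {
        LengthRetentionSketch sketch = new LengthRetentionSketch();
        sketch.addHit("protA", "protA", 100);
        sketch.addHit("protA", "protA", 250); // longer, replaces 100
        sketch.addHit("protA", "protA", 80);  // shorter, ignored
        System.out.println(sketch.byProtein.get("protA").lengths);
    }
}
```

Because shorter hits are dropped as they arrive, the maps hold at most one Integer per key, regardless of how many hit lines the input contains.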

Field Processing

Each input line is parsed into structured components using specific Parse methods:

String fields: Direct byte array substring extraction for protein names (field 0), model identifiers (field 1), HMM names (field 3), accession numbers (field 4), and descriptions (field 22)
Integer fields: Parse.parseInt() for sequence length (field 2), model length (field 5), and coordinate positions (fields 15-20)
Float/Double fields: Parse.parseDouble() for E-values (fields 6, 11, 12) and Parse.parseFloat() for bit scores and bias values (fields 7-10, 13-14, 21)
Whitespace handling: Iterative space skipping with while(line[b]==' '){b++;} between each field extraction
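
The pointer-advancement technique in the list above can be illustrated with a small standalone scanner. This is a sketch under stated assumptions rather than the tool's code: FieldScanSketch and scan are hypothetical names, standard Java parsing stands in for the Parse methods, and letting the final field absorb the rest of the line (so a description may contain spaces) is an assumption of this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class FieldScanSketch {

    /** Splits a space-delimited line into at most maxFields fields; the last field keeps internal spaces. */
    static List<String> scan(byte[] line, int maxFields) {
        List<String> fields = new ArrayList<>();
        int a = 0, b = 0; // a = start of the current field, b = scan position
        while (fields.size() < maxFields && b < line.length) {
            // Skip the run of spaces before the next field.
            while (b < line.length && line[b] == ' ') { b++; }
            a = b;
            if (a >= line.length) { break; } // only trailing spaces were left
            if (fields.size() == maxFields - 1) {
                b = line.length; // assumption: the final field takes the rest of the line
            } else {
                while (b < line.length && line[b] != ' ') { b++; }
            }
            fields.add(new String(line, a, b - a, StandardCharsets.US_ASCII));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened, made-up hit line with 6 fields instead of 23.
        byte[] line = "protA  PF00001   320 1.5e-20 45.2 two word description"
                .getBytes(StandardCharsets.US_ASCII);
        List<String> f = scan(line, 6);
        int length = Integer.parseInt(f.get(2));        // integer field
        double eValue = Double.parseDouble(f.get(3));   // scientific notation is accepted
        float score = Float.parseFloat(f.get(4));       // float field
        System.out.println(f.get(0) + " " + length + " " + eValue + " " + score + " | " + f.get(5));
    }
}
```

Scanning the byte array in place avoids a general-purpose split over the whole line, which is the motivation the text gives for the manual approach.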

Output Generation

Results are output via System.err.println() calls during processing:

Line-by-line output: Each HMMSearchLine object prints its toString() representation during addToMap() processing
Protein tracking: ProteinSummary objects stored in HashMap but not directly written to output files
ByteBuilder formatting: HMMSearchLine.toText() method generates tab-delimited output using ByteBuilder with name, length, and hmmName fields
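
As a rough picture of the tab-delimited text form mentioned above, the sketch below joins the three named fields with tabs. It is an assumption-laden illustration: java.lang.StringBuilder stands in for ByteBuilder, HitTextSketch is a hypothetical class, and the column order (name, length, hmmName) simply follows the order in which the fields are listed here.

```java
class HitTextSketch {

    /** Builds one tab-delimited record from the three fields named in the text. */
    static String toText(String name, int length, String hmmName) {
        StringBuilder sb = new StringBuilder();
        sb.append(name).append('\t').append(length).append('\t').append(hmmName);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up values for illustration only.
        System.out.println(toText("protA", 320, "PF00001"));
    }
}
```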

Performance Characteristics

Memory usage: Linear in the number of unique protein names (HashMap<String, ProteinSummary>) plus the number of entries held in each ProteinSummary's HashMap<String, Integer>
Processing speed: Single-pass file reading with ByteFile.nextLine() and manual byte array parsing avoiding string splits
Hit reduction: ProteinSummary.add() method filters duplicate hits by retaining only the maximum length value for each protein-model combination

Input Format

RunHMM expects standard HMMER output format with the following characteristics:

Space-delimited fields: 23 fields per hit line, separated by one or more spaces
Comment lines: Lines starting with '#' are skipped during processing
Field order: Protein name, model info, coordinates, scores in standard HMMER arrangement
Numeric precision: Supports scientific notation for E-values and high-precision scores

Expected Field Structure

Each data line should contain fields in this order:

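Field 0: Query protein name
Field 1: Model identifier
Field 2: Sequence length
Field 3: HMM model name
Field 4: Accession number
Field 5: Model length
Field 6: Full sequence E-value
Field 7: Full sequence score
Field 8: Full sequence bias
Field 9: Best domain number
Field 10: Domain count
Field 11: Domain E-value
Field 12: Independent E-value
Field 13: Domain score
Field 14: Domain bias
Field 15: HMM start coordinate
Field 16: HMM end coordinate
Field 17: Query start coordinate
Field 18: Query end coordinate
Field 19: Envelope start
Field 20: Envelope end
Field 21: Accuracy score
Field 22: Description field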
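
To connect this field order with the parse methods listed under Field Processing, the lookup below maps a 0-based field index to the kind of value it is described as holding. It is only a restatement of that list in code: FieldLayoutSketch, Kind, and kindOf are hypothetical names and do not come from the tool.

```java
class FieldLayoutSketch {

    enum Kind { STRING, INT, FLOAT, DOUBLE }

    /** Returns the value kind for a 0-based field index (0 to 22), following the Field Processing list. */
    static Kind kindOf(int field) {
        if (field < 0 || field > 22) {
            throw new IllegalArgumentException("field index out of range: " + field);
        }
        switch (field) {
            case 0: case 1: case 3: case 4: case 22:
                return Kind.STRING; // names, identifiers, accession, description
            case 2: case 5: case 15: case 16: case 17: case 18: case 19: case 20:
                return Kind.INT;    // lengths and coordinates
            case 6: case 11: case 12:
                return Kind.DOUBLE; // E-values
            default:
                return Kind.FLOAT;  // fields 7-10, 13-14, and 21
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i <= 22; i++) {
            System.out.println(i + "\t" + kindOf(i));
        }
    }
}
```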

Output Format

The tool generates organized summaries showing:

Protein summaries: Each query protein with its significant HMM hits
Best hits only: For each protein-model combination, only the longest alignment is retained
Tabular output: Clean tab-delimited format suitable for further analysis

Technical Notes

Memory Management

RunHMM uses HashMap-based data structures for hit tracking:

HashMap storage: HashMap<String, ProteinSummary> for protein lookup with each ProteinSummary containing HashMap<String, Integer> using line.name as key
Length-based retention: ProteinSummary.add() compares line.length against existing Integer values, updating when old==null or old<line.length
Memory scaling: Storage grows with the number of unique protein names; each ProteinSummary stores one length value per key in its map

File Processing

Single-pass reading: ByteFile.nextLine() is called repeatedly until it returns null, which signals the end of input
Comment filtering: Lines starting with the '#' character are skipped during load() processing; only lines that pass the line[0]!='#' check are parsed
Field validation: Each field extraction includes assert statements checking b>a to ensure valid field boundaries were found
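
The read-until-null loop and comment filter described above follow a common pattern, sketched here with the standard library. java.io.BufferedReader stands in for ByteFile, ReadLoopSketch and loadHitLines are hypothetical names, and skipping empty lines is an extra guard added for this example rather than something the text states.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ReadLoopSketch {

    /** Reads all lines in a single pass and returns those that are not comments. */
    static List<String> loadHitLines(Reader in) {
        List<String> hits = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            // A null return from readLine() marks end of input.
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                // Assumption: empty lines are skipped so charAt(0) is always safe.
                if (line.isEmpty() || line.charAt(0) == '#') { continue; }
                hits.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up input: two comment lines and two hit lines.
        String text = "# target name  accession\nprotA PF00001 320\n# footer\nprotB PF00002 150\n";
        for (String hit : loadHitLines(new StringReader(text))) {
            System.out.println(hit);
        }
    }
}
```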

Integration with HMM Workflows

This tool is designed to complement standard HMMER workflows:

Post-processing: Processes raw HMMER output into organized summaries
Data reduction: Reduces redundant hits while preserving significant matches
Downstream compatibility: Outputs structured data for further analysis tools

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org