RunHMM
Parses HMMER search output files using manual byte array processing, extracting 23 fields per hit line into HMMSearchLine objects. Groups hits by query protein name in HashMap<String, ProteinSummary> structure, retaining only the longest hit length per protein-model combination for memory efficiency.
Basic Usage
runhmm.sh in=<file> out=<file>
The tool reads the input line by line with ByteFile.nextLine() and parses each non-comment line into an HMMSearchLine object. Parsing works directly on the byte array: two pointers (a and b) advance across the space-delimited fields, and the Parse methods convert the integer, float, and double fields. Each line yields 23 fields.
Parameters
RunHMM has minimal parameters as it focuses on standard HMM file processing with consistent output formatting.
File I/O Parameters
- in=<file>
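- Input HMM search results file. Should be a HMMER tabular output with 23 space-delimited fields per hit line containing protein names, model information, coordinates, and scores. This layout matches HMMER's per-domain table (the file written by the --domtblout option).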
- out=<file>
- Output file for processed results. If not specified, results are written to stdout.
Processing Parameters
- ow=f
- (overwrite) Set to true to overwrite existing output files. Default: false.
- verbose=f
- Enable verbose output for debugging and detailed processing information. Default: false.
Examples
Basic HMM Results Processing
runhmm.sh in=hmmsearch_output.txt out=processed_results.txt
Processes HMM search results from a HMMER output file, creating organized protein summaries.
Processing with Overwrite
runhmm.sh in=domain_search.out out=summary.txt ow=t
Processes domain search results, overwriting any existing output file.
Standard Pipeline Usage
# First run HMM search (external tool), writing the per-domain table with --domtblout
hmmsearch --domtblout search_results.txt protein_models.hmm query_sequences.faa > hmmsearch.log
# Then process results with runhmm
runhmm.sh in=search_results.txt out=organized_hits.txt
Typical workflow showing HMM search followed by results processing.
Algorithm Details
HMM Search Line Parsing
RunHMM implements manual byte-by-byte parsing of HMMER output format using pointer advancement (a,b variables) through space-delimited fields, extracting 23 distinct fields from each hit line with Parse.parseInt(), Parse.parseDouble(), and Parse.parseFloat() methods:
- Protein identification: Query sequence name and reference model information
- Statistical scores: E-values, bit scores, and bias corrections
- Coordinate mapping: Start/end positions in both query and model sequences
- Model metadata: HMM model names, accession numbers, and lengths
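The pointer-advancement idea described above can be sketched as follows. This is a minimal illustration, not the RunHMM source: the class and method names (PointerSplitSketch, split) are hypothetical, and it assumes the final field (the description) takes the rest of the line so that embedded spaces are preserved.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of two-pointer field splitting on a byte array. */
class PointerSplitSketch {

    /**
     * Splits a space-delimited line into at most maxFields fields.
     * a marks the start of the current field, b scans to its end.
     * The last field keeps any embedded spaces (assumption for descriptions).
     */
    static List<String> split(byte[] line, int maxFields) {
        List<String> fields = new ArrayList<>();
        int a = 0, b = 0;
        while (fields.size() < maxFields) {
            // Skip the run of spaces before the next field.
            while (b < line.length && line[b] == ' ') { b++; }
            if (b >= line.length) { break; }
            a = b;
            if (fields.size() == maxFields - 1) {
                b = line.length; // final field: take the remainder of the line
            } else {
                while (b < line.length && line[b] != ' ') { b++; }
            }
            fields.add(new String(line, a, b - a, StandardCharsets.US_ASCII));
        }
        return fields;
    }

    public static void main(String[] args) {
        byte[] line = "protA  -  350 Pkinase   Protein kinase domain"
                .getBytes(StandardCharsets.US_ASCII);
        for (String f : split(line, 5)) {
            System.out.println(f);
        }
    }
}
```

Working on the byte array avoids allocating an intermediate array of tokens for every line; strings are created only for the fields that are kept.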
Data Organization Strategy
The tool uses HashMap-based storage with length-based filtering:
- Primary level: Groups all hits by query protein name using HashMap<String, ProteinSummary> for direct lookup
- Secondary level: Each ProteinSummary contains HashMap<String, Integer> that uses line.name (protein name) as key and line.length as value
- Length filtering: ProteinSummary.add() method compares line.length against existing Integer values in the map, updating only when new length exceeds stored value
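The two-level map and the length comparison can be illustrated with a small sketch. It is an assumption-laden stand-in, not the actual ProteinSummary class: the names LengthRetentionSketch, Summary, and addHit are hypothetical, and the inner-map key is passed in as a parameter rather than read from an HMMSearchLine.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of nested-map storage with max-length retention. */
class LengthRetentionSketch {

    /** Stand-in for a per-protein summary: one stored length per key. */
    static final class Summary {
        final Map<String, Integer> lengths = new HashMap<>();

        /** Stores length if the key is new or length exceeds the stored value. */
        boolean add(String key, int length) {
            Integer old = lengths.get(key);
            if (old == null || old < length) {
                lengths.put(key, length);
                return true;
            }
            return false;
        }
    }

    /** Primary level: query protein name to its summary. */
    final Map<String, Summary> byProtein = new HashMap<>();

    /** Adds one hit; returns true if the stored length changed. */
    boolean addHit(String proteinName, String key, int length) {
        Summary s = byProtein.computeIfAbsent(proteinName, k -> new Summary());
        return s.add(key, length);
    }

    public static void main(String[] args) {
        LengthRetentionSketch m = new LengthRetentionSketch();
        System.out.println(m.addHit("protA", "protA", 100)); // first value is stored
        System.out.println(m.addHit("protA", "protA", 80));  // shorter, ignored
        System.out.println(m.addHit("protA", "protA", 150)); // longer, replaces 100
        System.out.println(m.byProtein.get("protA").lengths);
    }
}
```

Because only one Integer is kept per key, repeated hits do not grow the map; they either replace the stored value or are dropped.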
Field Processing
Each input line is parsed into structured components using specific Parse methods:
- String fields: Direct byte array substring extraction for protein names (field 0), model identifiers (field 1), HMM names (field 3), accession numbers (field 4), and descriptions (field 22)
- Integer fields: Parse.parseInt() for sequence length (field 2), model length (field 5), and coordinate positions (fields 15-20)
- Float/Double fields: Parse.parseDouble() for E-values (fields 6, 11, 12) and Parse.parseFloat() for bit scores and bias values (fields 7-10, 13-14, 21)
- Whitespace handling: Iterative space skipping with while(line[b]==' '){b++;} between each field extraction
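To make the field-to-type mapping concrete, here is a small sketch that converts a few of the fields listed above. It is not the HMMSearchLine implementation: it uses the standard library (String.split, Integer.parseInt, Double.parseDouble, Float.parseFloat) in place of the byte-array Parse methods, the class and field names are hypothetical, and the sample line is invented for illustration.

```java
/** Hypothetical sketch of typed field conversion for one hit line. */
class TypedFieldsSketch {
    final String name;        // field 0
    final int length;         // field 2
    final String hmmName;     // field 3
    final double fullEvalue;  // field 6
    final float fullScore;    // field 7
    final int hmmStart;       // field 15
    final String description; // field 22

    TypedFieldsSketch(String line) {
        // A limit of 23 leaves spaces inside the description field intact.
        String[] f = line.trim().split(" +", 23);
        if (f.length < 23) {
            throw new IllegalArgumentException("expected 23 fields, found " + f.length);
        }
        name = f[0];
        length = Integer.parseInt(f[2]);
        hmmName = f[3];
        fullEvalue = Double.parseDouble(f[6]); // accepts scientific notation
        fullScore = Float.parseFloat(f[7]);
        hmmStart = Integer.parseInt(f[15]);
        description = f[22];
    }

    public static void main(String[] args) {
        String line = "protA - 350 Pkinase PF00069.28 264 1.2e-30 105.3 0.1 1 2 "
                + "3.4e-20 5.6e-18 60.2 0.0 5 120 10 130 8 135 0.92 Protein kinase domain";
        TypedFieldsSketch t = new TypedFieldsSketch(line);
        System.out.println(t.name + " " + t.length + " " + t.hmmName + " " + t.fullEvalue);
        System.out.println(t.description);
    }
}
```

E-values are held as doubles because values far below the float range (for example 1e-200) are common in HMMER output, while scores and biases fit comfortably in a float.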
Output Generation
Results are output via System.err.println() calls during processing:
- Line-by-line output: Each HMMSearchLine object prints its toString() representation during addToMap() processing
- Protein tracking: ProteinSummary objects stored in HashMap but not directly written to output files
- ByteBuilder formatting: HMMSearchLine.toText() method generates tab-delimited output using ByteBuilder with name, length, and hmmName fields
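The tab-delimited text form mentioned above can be approximated as follows. This sketch uses java.lang.StringBuilder as a stand-in for ByteBuilder and a hypothetical static method; whether the real toText() appends a trailing newline is not stated here, so none is added.

```java
/** Hypothetical sketch of a three-column, tab-delimited record. */
class TabOutputSketch {

    /** Joins name, length, and hmmName with tab characters. */
    static String toText(String name, int length, String hmmName) {
        return new StringBuilder()
                .append(name).append('\t')
                .append(length).append('\t')
                .append(hmmName)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(toText("protA", 350, "Pkinase"));
    }
}
```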
Performance Characteristics
- Memory usage: Linear with unique protein names (HashMap<String, ProteinSummary>) and unique model names per protein (HashMap<String, Integer> in each ProteinSummary)
- Processing speed: Single-pass file reading with ByteFile.nextLine() and manual byte array parsing avoiding string splits
- Hit reduction: ProteinSummary.add() method filters duplicate hits by retaining only the maximum length value for each protein-model combination
Input Format
RunHMM expects standard HMMER output format with the following characteristics:
- Space-delimited fields: 23 fields per hit line, separated by one or more spaces
- Comment lines: Lines starting with '#' are skipped during processing
- Field order: Protein name, model info, coordinates, scores in standard HMMER arrangement
- Numeric precision: Supports scientific notation for E-values and high-precision scores
Expected Field Structure
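Each data line should contain fields in this order (the field numbers used elsewhere in this document are zero-based, so the first entry is field 0 and the last is field 22):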
- Query protein name
- Model identifier
- Sequence length
- HMM model name
- Accession number
- Model length
- Full sequence E-value
- Full sequence score
- Full sequence bias
- Best domain number
- Domain count
- Domain E-value
- Independent E-value
- Domain score
- Domain bias
- HMM start coordinate
- HMM end coordinate
- Query start coordinate
- Query end coordinate
- Envelope start
- Envelope end
- Accuracy score
- Description field
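As a quick reference for the zero-based field numbers, the order above can be written as an enum whose ordinal is the field index. The enum and its constant names are hypothetical and exist only for illustration; they are not part of RunHMM.

```java
/** Hypothetical index of the 23 fields, in input order. */
class FieldIndexSketch {

    /** ordinal() of each constant is the zero-based field number. */
    enum Field {
        QUERY_NAME, MODEL_ID, SEQ_LENGTH, HMM_NAME, ACCESSION, MODEL_LENGTH,
        FULL_EVALUE, FULL_SCORE, FULL_BIAS,
        DOMAIN_NUMBER, DOMAIN_COUNT,
        DOMAIN_EVALUE, INDEPENDENT_EVALUE, DOMAIN_SCORE, DOMAIN_BIAS,
        HMM_START, HMM_END, QUERY_START, QUERY_END, ENV_START, ENV_END,
        ACCURACY, DESCRIPTION
    }

    public static void main(String[] args) {
        for (Field f : Field.values()) {
            System.out.println(f.ordinal() + "\t" + f);
        }
    }
}
```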
Output Format
The tool generates organized summaries showing:
- Protein summaries: Each query protein with its significant HMM hits
- Best hits only: For each protein-model combination, only the largest length value (line.length) is retained
- Tabular output: Clean tab-delimited format suitable for further analysis
Technical Notes
Memory Management
RunHMM uses HashMap-based data structures for hit tracking:
- HashMap storage: HashMap<String, ProteinSummary> for protein lookup with each ProteinSummary containing HashMap<String, Integer> using line.name as key
- Length-based retention: ProteinSummary.add() compares line.length against existing Integer values, updating when old==null or old<line.length
- Memory scaling: Storage grows with the number of unique protein names; each ProteinSummary holds one Integer length per key in its inner map, so repeated hits for the same key do not add entries
File Processing
- Single-pass reading: ByteFile.nextLine() iteration until null return value indicates end of input
- Comment filtering: During load(), a line is parsed only when line[0]!='#'; lines starting with the '#' character are skipped
- Field validation: Each field extraction includes assert statements checking b>a to ensure valid field boundaries were found
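The read loop and comment filter described above follow a common pattern, sketched here with java.io.BufferedReader standing in for ByteFile. The class and method names are hypothetical, and the empty-line guard is an added assumption so that charAt(0) cannot fail; the text above does not say how RunHMM treats empty lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of single-pass reading with comment filtering. */
class CommentFilterSketch {

    /** Returns the lines of text that do not start with '#'. */
    static List<String> dataLines(String text) {
        List<String> kept = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            // readLine() returns null at end of input, which ends the loop.
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                if (line.isEmpty()) { continue; } // assumption: ignore empty lines
                if (line.charAt(0) != '#') { kept.add(line); }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "# header comment\nprotA - 350 Pkinase\n#\n\nprotB - 410 Pkinase\n";
        for (String line : dataLines(text)) {
            System.out.println(line);
        }
    }
}
```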
Integration with HMM Workflows
This tool is designed to complement standard HMMER workflows:
- Post-processing: Processes raw HMMER output into organized summaries
- Data reduction: Reduces redundant hits while preserving significant matches
- Downstream compatibility: Outputs structured data for further analysis tools
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org