Phylip2Fasta
Converts interleaved phylip files to FASTA format, so that alignments can be used with the many bioinformatics tools that expect FASTA input.
Basic Usage
phylip2fasta.sh in=<input> out=<output>
Input may be stdin or an interleaved phylip file, compressed or uncompressed. The input phylip file is the only required parameter.
Parameters
Parameters are grouped by function. The lists below cover every parameter in the shell script's usage message.
Input Parameters
- in=<file>
- The input phylip file; this is the only required parameter. Accepts stdin or a phylip file, compressed or uncompressed. The file must be in interleaved phylip format: sequence names followed by sequence data, arranged in blocks.
- unpigz=true
- Use pigz (parallel gzip) to decompress compressed input files, which is faster than single-threaded decompression. Default: true
Output Parameters
- out=<file>
- Fasta output destination. Specifies where to write the converted FASTA sequences. If not specified, output goes to stdout. The output will be in standard FASTA format with sequence headers starting with '>' followed by the sequence name from the phylip file.
Java Parameters
- -Xmx
- Sets Java's memory usage, overriding autodetection. For example, -Xmx20g specifies 20 GB of RAM and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory. The default allocation for this tool is 1 GB.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines where graceful failure is preferred over hanging processes.
- -da
- Disable assertions. Can provide minor performance improvements in production environments by skipping assertion checks in the Java code.
Examples
Basic Conversion
phylip2fasta.sh in=alignment.phy out=alignment.fasta
Converts an interleaved phylip file to FASTA format.
Processing Compressed Input
phylip2fasta.sh in=alignment.phy.gz out=alignment.fasta
Converts a gzip-compressed phylip file to FASTA format, decompressing in parallel with pigz when it is available.
Using Standard Input/Output
cat alignment.phy | phylip2fasta.sh in=stdin
Reads phylip data from standard input and writes FASTA output to standard output.
With Custom Memory Settings
phylip2fasta.sh -Xmx4g in=large_alignment.phy out=large_alignment.fasta
Processes a large phylip file with 4GB of memory allocated to Java.
Algorithm Details
The phylip2fasta conversion algorithm implements a two-phase parsing strategy optimized for interleaved phylip format:
Phase 1: Header and Initial Sequence Parsing
- Reads the first line containing sequence count and length information
- Parses sequence names and initial sequence data from the first block
- Creates StringBuilder objects for each sequence to efficiently accumulate sequence data
- Extracts sequence names by parsing characters until whitespace is encountered
- Collects initial sequence letters while skipping whitespace and non-letter characters
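The Phase 1 steps above can be sketched as follows. This is an illustrative Python sketch, not the tool's Java source: the function name `parse_first_block` is hypothetical, and plain Python lists stand in for the `StringBuilder` objects.

```python
def parse_first_block(lines):
    """Parse the header line and the first block of an interleaved phylip file.

    Returns (names, chunks, expected_length). chunks[i] is a list of sequence
    fragments for sequence i, standing in for a StringBuilder.
    """
    # Header line: "<number of sequences> <sequence length>"
    fields = lines[0].split()
    num_seqs, expected_length = int(fields[0]), int(fields[1])

    names, chunks = [], []
    for line in lines[1:1 + num_seqs]:
        # The name is every character up to the first whitespace.
        pos = 0
        while pos < len(line) and not line[pos].isspace():
            pos += 1
        names.append(line[:pos])
        # Keep only letters from the rest of the line; spaces, digits,
        # and gap characters are dropped.
        letters = "".join(c for c in line[pos:] if c.isalpha())
        chunks.append([letters])
    return names, chunks, expected_length
```

Collecting fragments in a list and joining them once at output time plays the same role as appending to a `StringBuilder`: it avoids rebuilding a long string on every append.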
Phase 2: Interleaved Block Processing
- Continues reading subsequent lines containing additional sequence data
- Uses modular arithmetic to assign each continuation line to its sequence: the line's position modulo the number of sequences gives the index of the sequence it extends
- Processes lines that start with whitespace as continuation of interleaved sequence blocks
- Filters out non-letter characters, preserving only valid sequence data
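A minimal sketch of the Phase 2 assignment, again in Python rather than the tool's Java. The name `append_continuation_lines` is hypothetical, and skipping indented lines that contain no letters is an assumption made here so that blank separators do not shift the position counter.

```python
def append_continuation_lines(chunks, lines):
    """Append interleaved continuation lines to the per-sequence chunk lists.

    chunks: list of fragment lists, one per sequence, in file order.
    lines:  the lines that follow the first (named) block.
    """
    num_seqs = len(chunks)
    data_line = 0  # counts only lines that carry sequence data
    for line in lines:
        # Only lines that start with whitespace are treated as continuations.
        if not line[:1].isspace():
            continue
        letters = "".join(c for c in line if c.isalpha())
        if not letters:
            continue  # assumption: whitespace-only lines carry no data
        # Data line k extends sequence k modulo the number of sequences.
        chunks[data_line % num_seqs].append(letters)
        data_line += 1
    return chunks
```

With two sequences, data lines 0 and 2 extend the first sequence and data lines 1 and 3 extend the second, which is how the blocks of an interleaved file are stitched back into whole sequences.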
Memory Management
- Uses an ArrayList of StringBuilder objects, one per sequence, to accumulate sequence data efficiently
- Nullifies StringBuilder references after writing to free memory during output
- Streams output using TextStreamWriter for large dataset handling
- Default memory allocation of 1GB, configurable via -Xmx parameter
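The write-then-release pattern described above can be sketched like this. It is a Python illustration under stated assumptions, not the tool's code: `write_and_release` is a hypothetical name, and any object with a `write` method stands in for `TextStreamWriter`.

```python
import io


def write_and_release(names, chunks, out):
    """Write FASTA records one at a time, dropping each buffer after use."""
    for i, name in enumerate(names):
        out.write(">" + name + "\n")
        out.write("".join(chunks[i]) + "\n")
        # Drop the reference so the buffer can be reclaimed while the
        # remaining records are still being written.
        chunks[i] = None


# Usage: write one record into an in-memory text buffer.
buffer = io.StringIO()
write_and_release(["seqA"], [["ACGT", "GG"]], buffer)
```

Releasing each buffer as soon as it is written means peak memory is reached at the end of parsing and falls steadily during output, instead of staying at the peak until the program exits.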
File Format Support
- Automatically detects and handles compressed input files, such as gzip
- Uses pigz for parallel decompression when available
- Supports both file input and standard input streams
- Outputs standard FASTA format with '>' headers followed by sequence names
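As an illustration of transparent handling of compressed input, the sketch below checks for the two-byte gzip magic number before decoding. How the tool itself detects compression is not stated in this document, so the magic-number check and the name `read_phylip_bytes` are assumptions made for the example.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_phylip_bytes(raw):
    """Return text from raw bytes, decompressing first if the data is gzip."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode("ascii")
```

The same function serves both input routes: pass it the bytes read from a file, or the bytes read from `sys.stdin.buffer` when the data arrives on standard input.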
Error Handling
- Tracks error states throughout the conversion process
- Validates output file permissions before processing
- Reports processing statistics including read count, base count, and elapsed time
- Terminates with informative error messages if corruption is detected
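The sketch below shows one way to combine an error flag with end-of-run statistics. The particular check (comparing each sequence's length with the length declared in the header) is an assumption chosen for illustration; this document states only that an error state is tracked. `convert_with_stats` and the keys of `stats` are hypothetical names.

```python
import time


def convert_with_stats(names, seqs, expected_length):
    """Build FASTA text and collect statistics, flagging length mismatches."""
    start = time.perf_counter()
    error_state = False
    records = []
    bases = 0
    for name, seq in zip(names, seqs):
        if len(seq) != expected_length:
            error_state = True  # remember the problem, keep processing
        records.append(">%s\n%s\n" % (name, seq))
        bases += len(seq)
    stats = {
        "reads": len(names),
        "bases": bases,
        "seconds": time.perf_counter() - start,
        "error": error_state,
    }
    return "".join(records), stats


def finish(stats):
    """Raise with an explanatory message if any error was recorded."""
    if stats["error"]:
        raise RuntimeError(
            "Conversion finished in an error state; the output may be corrupt.")
```

Recording the error and continuing lets the run report complete statistics before `finish` stops the process with a message.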
Technical Notes
Phylip Format Requirements
- Input must be in interleaved phylip format
- First line should contain the number of sequences and sequence length
- Sequence names should be followed by sequence data
- Continuation blocks should start with whitespace (typically 8 spaces)
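To make the layout concrete, here is a small made-up interleaved file together with a checker for the requirements listed above. Both `SAMPLE` and `check_layout` are illustrative; the tool does not necessarily perform these checks.

```python
# Three sequences of 12 letters each, split across two blocks.
SAMPLE = """\
 3 12
alpha     ACGTAC
beta      ACGTTC
gamma     ACCTAC

          GTACGT
          GTACGA
          GTTCGT
"""


def check_layout(text):
    """Return a list of problems found in an interleaved phylip layout."""
    lines = text.splitlines()
    header = lines[0].split() if lines else []
    if len(header) != 2 or not all(f.isdigit() for f in header):
        return ["first line must hold two integers: sequence count and length"]
    num_seqs = int(header[0])
    problems = []
    first_block = lines[1:1 + num_seqs]
    if len(first_block) < num_seqs:
        problems.append("fewer named lines than the declared sequence count")
    for line in first_block:
        # Lines in the first block must begin with the sequence name.
        if not line or line[0].isspace():
            problems.append("named line starts with whitespace: %r" % line)
    for line in lines[1 + num_seqs:]:
        # Later non-blank lines must be indented continuation lines.
        if line.strip() and not line[0].isspace():
            problems.append("continuation line is not indented: %r" % line)
    return problems
```

Calling `check_layout(SAMPLE)` returns an empty list, because the sample has a valid header, three named lines, and indented continuation lines.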
Performance Characteristics
- Linear time complexity with respect to total sequence data
- Memory usage is proportional to total sequence length plus overhead, because every sequence is accumulated in memory before output begins
- Efficient StringBuilder-based sequence accumulation
- Parallel decompression support for compressed inputs
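Because memory grows with total sequence length, a rough estimate can indicate whether the 1 GB default is likely to suffice. The calculation below is a back-of-the-envelope sketch: `estimate_sequence_bytes` is a hypothetical helper, and the defaults of 2 bytes per character and a 2x overhead factor are assumed placeholders, not measured values for this tool.

```python
def estimate_sequence_bytes(num_seqs, seq_length, bytes_per_char=2, overhead=2.0):
    """Rough estimate of the memory needed to hold all sequences at once."""
    return int(num_seqs * seq_length * bytes_per_char * overhead)


DEFAULT_HEAP_BYTES = 1024 ** 3  # the 1 GB default allocation

# 50,000 sequences of 20,000 columns each, under the assumptions above.
estimate = estimate_sequence_bytes(50_000, 20_000)
needs_larger_heap = estimate > DEFAULT_HEAP_BYTES
```

When an estimate like this exceeds the default, raising the limit with -Xmx is the natural next step.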
Output Format
- Standard FASTA format with '>' prefixed headers
- Sequence names extracted from phylip sequence identifiers
- All non-letter characters filtered from sequence data
- Newline-terminated sequences for compatibility
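A minimal sketch of the output rules above, assuming each sequence is written on a single line (this document does not say whether lines are wrapped). The name `to_fasta_record` is hypothetical.

```python
def to_fasta_record(name, raw_sequence):
    """Format one FASTA record, keeping only letters from the sequence."""
    letters = "".join(c for c in raw_sequence if c.isalpha())
    return ">" + name + "\n" + letters + "\n"
```

Because only letters are kept, alignment gap characters such as `-` are removed along with spaces and digits, so an output sequence can be shorter than the alignment length declared in the phylip header.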
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org