Phylip2Fasta

Script: phylip2fasta.sh
Package: jgi
Class: PhylipToFasta.java

Converts interleaved phylip files to FASTA format, so that the sequences can be used with bioinformatics tools that do not read phylip.

Basic Usage

phylip2fasta.sh in=<input> out=<output>

Input may be stdin or an interleaved phylip file, compressed or uncompressed. The input phylip file is the only required parameter.

Parameters

Parameters are grouped by function and cover everything listed in the shell script's usage message.

Input Parameters

in=<file>: The input phylip file, or stdin; this is the only required parameter. The input may be compressed or uncompressed, and must be in interleaved phylip format: a first block in which each sequence name is followed by sequence data, then further blocks of sequence data.
unpigz=true: Decompress gzip-compressed input with pigz. pigz does not parallelize decompression itself, but it runs reading, writing, and checksum calculation in separate threads, which can make it faster. Default: true

Output Parameters

out=<file>: FASTA output destination. If not specified, output goes to stdout. Each record is written as a header line, '>' followed by the sequence name from the phylip file, and then the sequence.

Java Parameters

-Xmx: Sets Java's memory usage, overriding autodetection. -Xmx20g specifies 20 GB of RAM, and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory. This tool defaults to 1 GB.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines where graceful failure is preferred over hanging processes.
-da: Disable assertions. Can provide minor performance improvements in production environments by skipping assertion checks in the Java code.

Examples

Basic Conversion

phylip2fasta.sh in=alignment.phy out=alignment.fasta

Converts an interleaved phylip file to FASTA format.

Processing Compressed Input

phylip2fasta.sh in=alignment.phy.gz out=alignment.fasta

Converts a gzip-compressed phylip file to FASTA format, decompressing with pigz when it is available.

Using Standard Input/Output

cat alignment.phy | phylip2fasta.sh in=stdin

Reads phylip data from standard input and, because out= is not given, writes FASTA output to standard output. The in= parameter is required, so stdin is named explicitly.

With Custom Memory Settings

phylip2fasta.sh -Xmx4g in=large_alignment.phy out=large_alignment.fasta

Processes a large phylip file with 4 GB of memory allocated to Java.

Algorithm Details

The conversion parses interleaved phylip input in two phases:

Phase 1: Header and Initial Sequence Parsing

Reads the first line, which contains the sequence count and sequence length
Reads the first block, in which each line holds a sequence name followed by initial sequence data
Creates one StringBuilder per sequence to accumulate its data
Takes the sequence name to be the characters up to the first whitespace
Collects the letters from the rest of the line, skipping whitespace and other non-letter characters
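
The Phase 1 steps can be sketched as follows. This is a minimal Python illustration written for this page, not the Java code in PhylipToFasta.java; the function names parse_header and split_first_block_line are hypothetical.

```python
def parse_header(line):
    """Return (sequence_count, sequence_length) from the first phylip line."""
    fields = line.split()
    return int(fields[0]), int(fields[1])


def split_first_block_line(line):
    """Split a first-block line into (name, letters).

    The name is every character up to the first whitespace; the letters
    are whatever alphabetic characters follow, with everything else dropped.
    """
    pos = 0
    while pos < len(line) and not line[pos].isspace():
        pos += 1
    name = line[:pos]
    letters = "".join(c for c in line[pos:] if c.isalpha())
    return name, letters


if __name__ == "__main__":
    print(parse_header(" 2 8"))
    print(split_first_block_line("seqA      ACGT AC-T"))
```

Because only letters are kept, the gap character in the sample line is dropped along with the spaces.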

Phase 2: Interleaved Block Processing

Continues reading the remaining lines, which hold additional sequence data
Assigns each line to its sequence by position: with N sequences, the k-th continuation line (counting from 0) belongs to sequence k mod N
Treats lines that start with whitespace as continuations of the interleaved blocks
Keeps only letters, discarding all other characters
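
The modular assignment in Phase 2 can be sketched as follows. This is a Python illustration under stated assumptions, not the tool's Java code: the helper name append_continuation is hypothetical, each buffer is a list of string chunks standing in for a StringBuilder, and lines without any letters (such as blank separators between blocks) are assumed not to count toward the line position.

```python
def append_continuation(buffers, lines):
    """Append each continuation line's letters to buffers[k % n].

    buffers: one list of string chunks per sequence, already holding
             the data from the first block.
    lines:   the lines that follow the first block.
    Returns the finished sequences as a list of strings.
    """
    n = len(buffers)
    if n == 0:
        return []
    k = 0  # index of the current data line within the continuation blocks
    for line in lines:
        data = "".join(c for c in line if c.isalpha())
        if not data:
            continue  # assumption: separator lines do not advance k
        buffers[k % n].append(data)
        k += 1
    return ["".join(chunks) for chunks in buffers]


if __name__ == "__main__":
    first_block = [["ACGT"], ["TTGG"]]
    rest = ["", "          AACC", "          GGTT"]
    print(append_continuation(first_block, rest))
```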

Memory Management

Holds one StringBuilder per sequence in an ArrayList
Nulls each StringBuilder reference after it is written, so its memory can be freed during output
Streams output through TextStreamWriter rather than building the whole output in memory
Defaults to 1 GB of memory, configurable via -Xmx
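
The write-then-release pattern can be sketched as follows. This Python version only illustrates the idea; write_and_release is a hypothetical name, and the tool itself does this in Java with StringBuilder and TextStreamWriter.

```python
import io


def write_and_release(names, buffers, out):
    """Write each sequence as a FASTA record, then drop its buffer.

    Setting buffers[i] to None removes the last reference to that
    sequence's data, so it can be reclaimed while later records are
    still being written.
    """
    for i in range(len(buffers)):
        out.write(">" + names[i] + "\n")
        out.write("".join(buffers[i]) + "\n")
        buffers[i] = None


if __name__ == "__main__":
    sink = io.StringIO()
    write_and_release(["seqA", "seqB"], [["ACGT", "AACC"], ["TTGG"]], sink)
    print(sink.getvalue(), end="")
```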

File Format Support

Automatically detects and handles compressed input files, such as gzip
Uses pigz to decompress gzip input when available
Supports both file input and standard input streams
Outputs standard FASTA format with '>' headers followed by sequence names
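
The input handling can be sketched as follows. This Python version is an illustration only: it chooses the reader by file extension, which is an assumption made for the sketch, and it uses Python's gzip module rather than pigz. The function name open_phylip is hypothetical.

```python
import gzip
import sys


def open_phylip(path=None):
    """Return a text-mode reader for a phylip source.

    None means standard input; a name ending in .gz is opened through
    gzip; anything else is opened as plain text.
    """
    if path is None:
        return sys.stdin
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")
```

Callers can then read lines the same way regardless of where the data comes from.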

Error Handling

Tracks error states throughout the conversion process
Validates output file permissions before processing
Reports processing statistics including read count, base count, and elapsed time
Terminates with informative error messages if corruption is detected

Technical Notes

Phylip Format Requirements

Input must be in interleaved phylip format
First line should contain the number of sequences and sequence length
Sequence names should be followed by sequence data
Continuation blocks should start with whitespace (typically 8 spaces)
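
A quick way to check a file against these requirements before converting it is sketched below. This helper is hypothetical and is not part of phylip2fasta.sh; it only tests the layout rules listed above, and the wording of its messages is invented for the sketch.

```python
def check_interleaved_phylip(lines):
    """Return a list of layout problems; an empty list means none were found."""
    rows = [ln.rstrip("\n") for ln in lines]
    if not rows:
        return ["empty input"]
    fields = rows[0].split()
    if len(fields) < 2 or not all(f.isdigit() for f in fields[:2]):
        return ["first line should hold sequence count and length"]
    count = int(fields[0])
    problems = []
    named = 0
    for row in rows[1:]:
        if not row or row[0].isspace():
            break  # end of the first block
        named += 1
        if len(row.split(None, 1)) < 2:
            problems.append("name without sequence data: " + row)
    if named != count:
        problems.append(
            "header declares %d sequences, first block has %d" % (count, named)
        )
    return problems


if __name__ == "__main__":
    sample = ["2 8", "seqA      ACGT", "seqB      TTGG",
              "", "          AACC", "          GGTT"]
    print(check_interleaved_phylip(sample))
```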

Performance Characteristics

Time is linear in the total amount of sequence data
Memory usage is proportional to total sequence length plus overhead, since every sequence is accumulated before output
StringBuilder-based sequence accumulation
pigz decompression for gzip-compressed inputs

Output Format

Standard FASTA format with '>' prefixed headers
Sequence names extracted from phylip sequence identifiers
All non-letter characters filtered from sequence data
Newline-terminated sequences for compatibility

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org