Lilypad
Uses mapped paired reads to generate scaffolds from contigs. Designed for use with ordinary paired-end Illumina libraries.
Basic Usage
lilypad.sh in=mapped.sam ref=contigs.fa out=scaffolds.fa
Lilypad takes mapped paired-end reads and reference contigs to generate scaffolded assemblies by analyzing insert size distributions and read pair orientations.
Parameters
Parameters are organized by their function in the scaffolding process.
Standard Parameters
- in=<file>
- Reads mapped to the reference; should be sam or bam format. Required input parameter.
- ref=<file>
- Reference contigs; may be fasta or fastq format. Required reference parameter.
- out=<file>
- Modified reference output; should be fasta format. Generated scaffolds will be written here.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: false.
Processing Parameters
- gap=10
- Minimum number of N bases inserted between joined contigs; inferred gaps shorter than this are padded up to it. Default: 10.
- mindepth=4
- Minimum spanning read pairs to join contigs. Higher values require more evidence for joining but reduce misassemblies. Default: 4.
- maxinsert=3000
- Maximum allowed insert size for proper pairs. Read pairs with insert sizes above this threshold are filtered out. Default: 3000.
- mincontig=200
- Ignore contigs under this length if there is a longer alternative. Helps prioritize longer, more reliable contigs during scaffolding. Default: 200.
- minwr=0.8
- (minWeightRatio) Minimum fraction of outgoing edges pointing to the same contig. Lower values will increase continuity at a risk of misassemblies. Range: 0.0-1.0. Default: 0.8.
- minsr=0.8
- (minStrandRatio) Minimum fraction of outgoing edges indicating the same orientation. Lower values will increase continuity at a possible risk of inversions. Range: 0.0-1.0. Default: 0.8.
- passes=8
- Number of scaffolding passes to perform. More passes may increase continuity by allowing iterative improvement of scaffold connections. Default: 8.
- samestrandpairs=f
- Read pairs map to the same strand. Set to true for libraries where both reads in a pair have the same orientation. Currently untested. Default: false.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions. May provide minor performance improvement in production runs.
Examples
Basic Scaffolding
lilypad.sh in=mapped_pairs.sam ref=contigs.fasta out=scaffolds.fasta
Scaffolds contigs using paired-end reads mapped to the reference with default parameters.
Conservative Scaffolding
lilypad.sh in=mapped_pairs.sam ref=contigs.fasta out=scaffolds.fasta mindepth=8 minwr=0.9 minsr=0.9
Uses more stringent parameters to reduce misassemblies: requires more read pair evidence (mindepth=8), stronger agreement on the destination contig (minwr=0.9), and stronger agreement on relative orientation (minsr=0.9).
Large Insert Libraries
lilypad.sh in=mate_pairs.sam ref=contigs.fasta out=scaffolds.fasta maxinsert=8000 gap=50
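Configured for libraries with larger insert sizes, such as mate pairs. Increases maximum insert size to 8kb and gap padding to 50 Ns. Lilypad is designed for ordinary paired-end libraries, so check that the read orientation of a mate pair library matches what Lilypad expects (see samestrandpairs).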
High Memory Usage
lilypad.sh -Xmx32g in=large_dataset.sam ref=assembly.fasta out=scaffolds.fasta
Allocates 32GB of memory for processing large datasets with many contigs and read pairs.
Algorithm Details
Lilypad analyzes paired-end read mappings to determine how contigs connect and in which orientation. Contigs are held in a LinkedHashMap and represented by Contig objects; the links between them are Edge objects, and a link is only followed if it passes weight-based validation.
Core Algorithm Components
Graph Construction
The algorithm builds a scaffold graph where:
- Nodes: Individual contigs with associated coverage depth arrays
- Edges: Connections between contigs based on paired-read evidence
- Edge weights: Accumulated mapping quality scores from supporting read pairs
- Edge orientation: Tracks relative strand orientations between connected contigs
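The graph described above can be pictured with a small sketch. This is an illustration in Python, not Lilypad's Java code: the field names (`count`, `weight`, `same_strand`, `opposite_strand`) and the `add_pair` helper are assumptions made for the example. A Python `dict` preserves insertion order, which loosely mirrors a LinkedHashMap.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """Evidence that one contig links to another (hypothetical layout)."""
    dest: str
    count: int = 0            # number of supporting read pairs
    weight: int = 0           # accumulated mapping quality
    same_strand: int = 0      # pairs implying the same relative orientation
    opposite_strand: int = 0  # pairs implying a flipped orientation


@dataclass
class Contig:
    """A graph node: one contig plus its per-position depth and outgoing edges."""
    name: str
    length: int
    depth: list = field(default_factory=list)
    edges: dict = field(default_factory=dict)

    def __post_init__(self):
        if not self.depth:
            self.depth = [0] * self.length

    def add_pair(self, dest, mapq, same_strand):
        """Record one read pair whose mate maps to contig `dest`."""
        edge = self.edges.setdefault(dest, Edge(dest))
        edge.count += 1
        edge.weight += mapq
        if same_strand:
            edge.same_strand += 1
        else:
            edge.opposite_strand += 1
        return edge


# Insertion-ordered map of contig name -> Contig.
graph = {name: Contig(name, length)
         for name, length in [("c1", 500), ("c2", 800), ("c3", 300)]}
graph["c1"].add_pair("c2", 40, True)
graph["c1"].add_pair("c2", 30, True)
graph["c1"].add_pair("c3", 10, False)
```

After these three calls, the edge from c1 to c2 carries two pairs and a weight of 70, while the edge to c3 carries a single low-quality pair.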
Insert Size Analysis
Lilypad performs insert size analysis using 1000-bucket histograms:
- Distribution calculation: Builds insert size histograms from proper pairs using AtomicLongArray for thread safety
- Percentile-based inference: Reads percentiles off the histogram to estimate the insert size distribution and infer gap lengths
- Coverage-weighted estimation: Combines observed gap lengths with local coverage depth to predict scaffold gaps
- Dynamic gap sizing: Gap length = max(minimum_gap, inferred_insert_size - observed_distance)
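As a rough worked example of the histogram and the gap formula, the sketch below bins insert sizes into 1000 buckets, reads a percentile back out, and applies max(minimum_gap, inferred_insert_size - observed_distance). The bucket mapping, the function names, and the interpretation of observed_distance (the part of the insert already covered on the two contigs) are assumptions for illustration, not Lilypad's exact implementation.

```python
def build_histogram(insert_sizes, buckets=1000, max_insert=3000):
    """Bin insert sizes in (0, max_insert] into a fixed number of buckets."""
    hist = [0] * buckets
    for size in insert_sizes:
        if 0 < size <= max_insert:
            hist[size * buckets // (max_insert + 1)] += 1
    return hist


def percentile(hist, fraction, max_insert=3000):
    """Return the upper edge of the bucket where the cumulative count reaches `fraction`."""
    total = sum(hist)
    if total == 0:
        return 0
    target = fraction * total
    running = 0
    for idx, count in enumerate(hist):
        running += count
        if running >= target:
            return (idx + 1) * (max_insert + 1) // len(hist)
    return max_insert


def gap_length(min_gap, inferred_insert, observed_distance):
    """Gap length = max(minimum_gap, inferred_insert_size - observed_distance)."""
    return max(min_gap, inferred_insert - observed_distance)


hist = build_histogram([300] * 10 + [9000])  # the 9000 bp pair exceeds maxinsert
median_insert = percentile(hist, 0.5)
```

With an inferred insert size of 400 and an observed distance of 250, `gap_length(10, 400, 250)` gives 150 Ns. If the observed distance is 395 or more, the result falls back to the minimum gap of 10.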
Edge Quality Assessment
Each potential scaffold connection is evaluated using multiple criteria:
- Depth filtering: Requires minimum number of supporting read pairs (mindepth parameter)
- Weight ratio test: The best edge must carry at least a minimum fraction of the total outgoing weight (minwr)
- Strand consistency: At least a minimum fraction of the evidence must agree on relative orientation (minsr)
- Distance validation: Insert sizes must be within reasonable bounds (maxinsert parameter)
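The depth, weight ratio, and strand ratio tests above could be combined roughly as follows. This is a minimal sketch that assumes each edge is a dict with `count`, `weight`, `same`, and `opposite` fields; those names and the order of the checks are hypothetical. The distance check is assumed to have been applied earlier, when read pairs were filtered by maxinsert.

```python
def best_edge(edges, min_depth=4, min_weight_ratio=0.8, min_strand_ratio=0.8):
    """Return the destination of the best outgoing edge, or None if it fails a test.

    edges maps destination name -> {"count", "weight", "same", "opposite"}.
    """
    if not edges:
        return None
    total_weight = sum(e["weight"] for e in edges.values())
    dest, best = max(edges.items(), key=lambda kv: kv[1]["weight"])

    # Depth filter: enough spanning read pairs (mindepth).
    if best["count"] < min_depth:
        return None
    # Weight ratio test: best edge dominates the outgoing weight (minwr).
    if total_weight == 0 or best["weight"] / total_weight < min_weight_ratio:
        return None
    # Strand consistency: most pairs agree on orientation (minsr).
    strand_total = best["same"] + best["opposite"]
    if strand_total == 0:
        return None
    if max(best["same"], best["opposite"]) / strand_total < min_strand_ratio:
        return None
    return dest


clean = {
    "c2": {"count": 9, "weight": 90, "same": 9, "opposite": 0},
    "c3": {"count": 1, "weight": 10, "same": 1, "opposite": 0},
}
ambiguous = {
    "c2": {"count": 9, "weight": 90, "same": 9, "opposite": 0},
    "c3": {"count": 4, "weight": 40, "same": 4, "opposite": 0},
}
```

Here `best_edge(clean)` accepts c2 because it holds 90% of the outgoing weight, while `best_edge(ambiguous)` rejects the join because c2 holds only about 69%. Lowering `min_weight_ratio` would accept it, which is the continuity versus misassembly trade-off described for minwr.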
Scaffold Path Finding
The scaffolding process uses findLeftmost() and expandRight() methods with bestEdge() selection:
- Leftmost identification: Each component starts from the leftmost unprocessed contig
- Bidirectional validation: Confirms reciprocal best edges between adjacent contigs
- Strand synchronization: Automatically flips contigs to maintain consistent orientations
- Iterative extension: Follows best edges until no more valid connections exist
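The walk described above can be sketched as follows. The functions borrow the names findLeftmost and expandRight from the text, but the bodies are an assumed simplification: each contig is given at most one best left and one best right neighbor, edges are followed only when they are reciprocal, and contig flipping is left out.

```python
def find_leftmost(start, best_left, best_right):
    """Walk left over reciprocal best edges until none remains (or a cycle closes)."""
    seen = {start}
    current = start
    while True:
        prev = best_left.get(current)
        if prev is None or prev in seen or best_right.get(prev) != current:
            return current
        seen.add(prev)
        current = prev


def expand_right(start, best_left, best_right, used):
    """Extend a scaffold rightward over reciprocal best edges."""
    path = [start]
    used.add(start)
    current = start
    while True:
        nxt = best_right.get(current)
        if nxt is None or nxt in used or best_left.get(nxt) != current:
            return path
        path.append(nxt)
        used.add(nxt)
        current = nxt


def build_scaffolds(contigs, best_left, best_right):
    """Group contigs into ordered scaffold paths."""
    used = set()
    scaffolds = []
    for name in contigs:
        if name in used:
            continue
        leftmost = find_leftmost(name, best_left, best_right)
        scaffolds.append(expand_right(leftmost, best_left, best_right, used))
    return scaffolds


# c3 points right at c4, but c4 does not point back, so that join is refused.
best_right = {"c1": "c2", "c2": "c3", "c3": "c4"}
best_left = {"c2": "c1", "c3": "c2"}
scaffolds = build_scaffolds(["c3", "c1", "c2", "c4"], best_left, best_right)
```

Starting from c3, the walk first backs up to c1, then extends right to produce the path c1, c2, c3. The unreciprocated edge to c4 leaves it as a singleton scaffold.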
Thread Safety and Performance
Lilypad uses AtomicIntegerArray, AtomicLongArray, and ReadWriteLock so that multiple threads can update shared state safely:
- AtomicIntegerArray: Thread-safe coverage depth tracking per contig position
- AtomicLongArray: Thread-safe insert size histogram accumulation
- ReadWriteLock: Coordinates access to shared scaffold graph structures
- Multi-threaded processing: Parallel read processing with accumulator pattern for statistics
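The accumulator pattern mentioned above can be illustrated in Python, with one substitution stated plainly: instead of Java atomic arrays, each worker fills a private histogram and merges it into the shared one under a lock. The function names and the batch layout are assumptions for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def count_batch(insert_sizes, buckets):
    """Build a thread-private histogram for one batch of insert sizes."""
    local = [0] * buckets
    for size in insert_sizes:
        if 0 <= size < buckets:
            local[size] += 1
    return local


def accumulate(batches, buckets=1000, threads=4):
    """Process batches in parallel and merge per-thread counts into one histogram."""
    shared = [0] * buckets
    lock = threading.Lock()

    def worker(batch):
        local = count_batch(batch, buckets)
        with lock:  # merge step: the only place shared state is written
            for idx, count in enumerate(local):
                if count:
                    shared[idx] += count

    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(worker, batches))  # list() forces completion and surfaces errors
    return shared


totals = accumulate([[300, 300, 310], [300, 5000], [310]])
```

Merging once per batch keeps lock contention low. Atomic arrays reach the same result by making each individual increment safe instead.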
Quality Control Features
Multiple validation layers ensure scaffold quality:
- SAM filtering: Excludes unmapped, supplementary, non-primary, and low-quality alignments
- Proper pair validation: Only uses properly paired reads with consistent orientations
- Coverage analysis: Tracks depth per position to identify problematic regions
- Edge validation: Prevents scaffolding based on insufficient or contradictory evidence
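The SAM filtering step relies on the FLAG and MAPQ columns. The sketch below uses the flag bits defined by the SAM specification; the `min_mapq` threshold and the function names are hypothetical, since the cutoff Lilypad applies for "low-quality" is not stated here.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100      # non-primary alignment
SUPPLEMENTARY = 0x800


def parse_flag_and_mapq(sam_line):
    """Extract FLAG (column 2) and MAPQ (column 5) from a SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    return int(fields[1]), int(fields[4])


def is_usable(flag, mapq, min_mapq=4):
    """Drop unpaired, unmapped, non-primary, supplementary, and low-quality records.

    min_mapq is an assumed cutoff for illustration only.
    """
    if not flag & PAIRED:
        return False
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    return mapq >= min_mapq


def is_proper_pair(flag):
    """True when the aligner marked both mates as properly paired."""
    return bool(flag & PAIRED) and bool(flag & PROPER_PAIR)


line = "r1\t99\tcontig_1\t100\t40\t50M\t=\t300\t250\t*\t*"
flag, mapq = parse_flag_and_mapq(line)
```

For this record, FLAG 99 decodes to paired, proper pair, mate on the reverse strand, and first in pair, so it passes both checks. The same alignment with the secondary bit set (FLAG 355) is rejected.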
Memory Management
Memory usage patterns:
- Streaming processing: Reads are processed in batches rather than loading all into memory
- Compact data structures: Uses byte arrays for sequences and atomic arrays for counters
- Reference-based storage: Contigs stored by reference to avoid duplication
- Automatic memory sizing: Default allocation scales with available system memory
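Streaming in batches, as described above, can be sketched with a generator that never holds more than one batch of alignment lines at a time. The batch size and function names are assumptions for illustration.

```python
from itertools import islice


def alignment_lines(lines):
    """Yield SAM alignment lines, skipping header lines that start with '@'."""
    for line in lines:
        if line and not line.startswith("@"):
            yield line


def batches(items, size=200):
    """Yield successive lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


sam = ["@HD\tVN:1.6", "@SQ\tSN:c1\tLN:500", "r1\t99", "r2\t147", "r3\t83"]
chunks = list(batches(alignment_lines(sam), size=2))
```

Because both functions are generators, passing an open file handle instead of a list would process an arbitrarily large SAM file with memory bounded by the batch size.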
Technical Notes
Input Requirements
- SAM/BAM file must contain paired-end reads mapped to the reference contigs
- Reads should be properly paired with reasonable insert size distribution
- Reference contigs should be in FASTA format (FASTQ is also accepted)
- Mapping quality scores are used for edge weighting
Performance Considerations
- Memory usage scales with number of contigs and read coverage
- Processing time increases with number of scaffolding passes
- Large insert size ranges may require more memory for histograms
- Thread count automatically scales with available processors
Common Issues
- Low continuity: Try reducing minwr and minsr parameters
- Misassemblies: Increase mindepth, and raise the minwr and minsr thresholds
- Memory errors: Increase -Xmx parameter or reduce contig count
- No scaffolds generated: Check that reads are properly paired and mapped
Output Format
- Scaffolds are output in FASTA format with original contig names
- Gaps between contigs are filled with N bases (minimum: gap parameter)
- Scaffold orientation follows the path-finding algorithm's decisions
- Statistics are printed to stderr including insert sizes and join counts
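To make the output layout concrete, the following sketch joins oriented contigs with N-padded gaps and formats the result as FASTA. The scaffold name, the line width, and the helper names are hypothetical. Only the N padding with a minimum of `gap` bases comes from the description above.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a nucleotide sequence (used for flipped contigs)."""
    return seq.translate(COMPLEMENT)[::-1]


def join_contigs(parts, gaps, min_gap=10):
    """Concatenate contigs, padding each junction with at least min_gap Ns.

    parts: list of (sequence, flipped) tuples in scaffold order.
    gaps:  inferred gap lengths, one per junction (len(parts) - 1 entries).
    """
    pieces = []
    for i, (seq, flipped) in enumerate(parts):
        if i > 0:
            pieces.append("N" * max(min_gap, gaps[i - 1]))
        pieces.append(reverse_complement(seq) if flipped else seq)
    return "".join(pieces)


def to_fasta(name, seq, width=70):
    """Format one record as FASTA with fixed-width sequence lines."""
    lines = [">" + name]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


scaffold = join_contigs([("ACGT", False), ("AAAC", True)], gaps=[3])
record = to_fasta("scaffold_1", scaffold)
```

The inferred gap of 3 is below the minimum, so the junction is padded to 10 Ns, and the second contig is written as its reverse complement, GTTT.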
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org