TrimContigs
Trims contigs to remove sequence unsupported by read alignment. The coverage range file can be generated by pileup.sh.
Basic Usage
trimcontigs.sh in=<assembly> ranges=<ranges> out=<trimmed assembly>
This tool uses coverage information to identify and remove poorly supported regions of assembly contigs. It requires a coverage ranges file, which pileup.sh produces when run with the 'ranges' flag.
Parameters
Parameters are organized by their primary function in the trimming and filtering process. The tool supports both trimming at contig ends and breaking contigs at internal low-coverage regions.
Input/Output Parameters
- in=<file>
- File containing input assembly. Required parameter specifying the FASTA/FASTQ file with contigs to be processed.
- ranges=<file>
- File generated by pileup with the 'ranges' flag. Required. Each contig is introduced by a #contigname line, followed by one line per covered range giving start, stop, and depth.
- out=<file>
- Destination of clean output assembly. Contains trimmed and filtered contigs that meet quality criteria.
- outdirty=<file>
- (outd) Optional dirty output containing removed contigs. Useful for debugging or manual inspection of discarded sequences.
- gffin=<file>
- Optional GFF file containing annotations for the input assembly.
- gffout=<file>
- Modified GFF file with coordinates adjusted for trimmed/broken contigs. Features are repositioned and filtered based on trimming operations.
Quality Filtering Parameters
- mincov=1
- Discard contigs with lower average coverage than this threshold. Applied after trimming operations to filter out poorly supported contigs.
- minlen=1
- Discard contigs shorter than this length, after trimming. Prevents retention of excessively short sequences that may lack biological significance.
Trimming Parameters
- trimmin=0
- Trim the first and last X bases of each sequence. This is the minimum amount removed from each contig end, regardless of coverage.
- trimmax=big
- Don't trim more than this many bases from a contig end. The default maximum is 2,000,000,000 bp, which can be written as "big", "2b", or "2g".
- trimextra=5
- Trim an additional amount when trimming. Extra bases to remove beyond the uncovered region boundary to ensure clean transitions.
- maxuncovered=3
- Don't trim where there are at most this many uncovered bases. Prevents trimming for very small uncovered gaps that may be artifacts.
Contig Breaking Parameters
- break=t
- Break contigs where uncovered areas are present. When true, contigs are split at internal uncovered regions rather than just trimmed at ends.
- breaklist=
- Optional file to report the list of broken contigs. Contains names of contigs that were broken into multiple pieces.
- skippolyn=t
- Don't break around uncovered poly-Ns (scaffold breaks). When true, ignores uncovered regions consisting primarily of N bases, which typically represent scaffold gaps.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Trimming
trimcontigs.sh in=assembly.fasta ranges=coverage.ranges out=trimmed.fasta
Trims assembly contigs based on coverage ranges, keeping only well-supported regions.
With Quality Filtering
trimcontigs.sh in=assembly.fasta ranges=coverage.ranges out=clean.fasta \
mincov=5 minlen=500 trimextra=10
Applies stricter quality filters: minimum 5x coverage, minimum 500bp length, with extra 10bp trimming beyond uncovered regions.
Breaking Contigs at Gaps
trimcontigs.sh in=scaffolds.fasta ranges=coverage.ranges out=broken.fasta \
break=t breaklist=broken_contigs.txt maxuncovered=5
Breaks contigs at internal uncovered regions larger than 5bp, with a report of broken contigs saved to file.
With GFF Annotation Processing
trimcontigs.sh in=assembly.fasta ranges=coverage.ranges out=trimmed.fasta \
gffin=annotations.gff gffout=adjusted_annotations.gff
Processes both sequence and annotations, adjusting feature coordinates for trimmed/broken contigs.
Conservative Trimming (Scaffold-Aware)
trimcontigs.sh in=scaffolds.fasta ranges=coverage.ranges out=conservative.fasta \
skippolyn=t maxuncovered=10 break=f
Conservative approach suitable for scaffolded assemblies: tolerates uncovered stretches of up to 10 bp, ignores poly-N regions, and never splits contigs (break=f).
Algorithm Details
Coverage-Based Trimming Strategy
TrimContigs uses a dual-mode approach for processing assemblies based on coverage information:
Single Range Processing
For contigs with a single continuous covered region, the algorithm trims only the contig ends:
- Identifies covered regions from the ranges file
- Applies trimming constraints (trimmin, trimmax, trimextra)
- Considers maxuncovered threshold for small gaps at contig ends
- Filters results based on minimum coverage and length requirements
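The steps above can be sketched as follows. This is an illustration, not the tool's implementation: the function name `trim_bounds`, the 0-based inclusive coordinates, and the exact way trimmin, trimmax, trimextra, and maxuncovered interact are all assumptions made for the example.

```python
def trim_bounds(contig_len, cov_start, cov_stop,
                trimmin=0, trimmax=2_000_000_000, trimextra=5, maxuncovered=3):
    """Return (left_trim, right_trim) in bases for a contig with one covered range.

    Hypothetical helper. cov_start and cov_stop are assumed to be
    0-based inclusive coordinates of the single covered region.
    """
    def one_end(uncovered):
        # Assumption: a short uncovered tail is tolerated, so only trimmin applies.
        if uncovered <= maxuncovered:
            return trimmin
        # Assumption: otherwise remove the uncovered tail plus the extra margin,
        # bounded below by trimmin and above by trimmax.
        return min(max(uncovered + trimextra, trimmin), trimmax)

    left = one_end(cov_start)
    right = one_end(contig_len - 1 - cov_stop)
    return left, right


# 50 uncovered bases on the left and 20 on the right of a 1000 bp contig.
left_trim, right_trim = trim_bounds(1000, 50, 979)
```

Under these assumptions, both ends exceed maxuncovered, so each trim is the uncovered length plus trimextra.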
Multi-Range Processing (Contig Breaking)
When break=t and a contig has multiple covered regions, the contig is split:
- Each covered region becomes a separate contig part
- Part naming follows the pattern: original_name_partN
- GFF features are repositioned and filtered for each part
- Broken contig names are logged if breaklist is specified
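A minimal sketch of the splitting and naming step, for illustration only. The function `split_contig` is hypothetical, the ranges are assumed to be 0-based and inclusive, and numbering parts from 1 is an assumption; the documentation only states the original_name_partN pattern.

```python
def split_contig(name, seq, ranges):
    """Yield (part_name, part_seq) for each covered range of a contig.

    Hypothetical helper. ranges is a list of (start, stop) pairs,
    assumed 0-based and inclusive. Part numbering from 1 is an assumption.
    """
    if len(ranges) < 2:
        # Nothing to break; this sketch returns the contig unchanged
        # (end-trimming is a separate step).
        yield name, seq
        return
    for n, (start, stop) in enumerate(sorted(ranges), start=1):
        yield f"{name}_part{n}", seq[start:stop + 1]


parts = list(split_contig("contig1", "AAAACCCCGGGGTTTT", [(0, 3), (8, 11)]))
```

Here `parts` holds one entry per covered range, each named after the original contig.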
Poly-N Handling
When skippolyn=t, the fixPolyN method distinguishes scaffold gaps from biological breaks:
- Uses AminoAcid.isFullyDefined(byte) to classify bases between covered regions as defined vs undefined
- Fuses adjacent ranges when the gap between them contains at most maxUncoveredLength defined bases, or when it contains at least one undefined base and at most maxUncoveredLength*2 defined bases
- Prevents breaking at legitimate scaffold gaps while preserving biological breaks
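The fusion rule above can be sketched in a few lines. This is a simplified illustration rather than the fixPolyN code itself: `fuse_ranges` is a hypothetical name, ranges are assumed to be sorted, 0-based, and inclusive, and "defined" is assumed to mean A, C, G, or T.

```python
def fuse_ranges(seq, ranges, max_uncovered=3):
    """Merge adjacent covered ranges separated by short or N-dominated gaps.

    Hypothetical helper. ranges is a sorted list of (start, stop) pairs,
    assumed 0-based and inclusive.
    """
    if not ranges:
        return []
    fused = [ranges[0]]
    for start, stop in ranges[1:]:
        prev_start, prev_stop = fused[-1]
        gap = seq[prev_stop + 1:start].upper()
        # Assumption: only A, C, G, and T count as fully defined bases.
        defined = sum(base in "ACGT" for base in gap)
        undefined = len(gap) - defined
        if defined <= max_uncovered or (undefined > 0 and defined <= 2 * max_uncovered):
            fused[-1] = (prev_start, stop)  # treat the gap as bridged
        else:
            fused.append((start, stop))
    return fused


# Gap 1: 20 Ns plus 4 defined bases (fused). Gap 2: 12 defined bases (kept as a break).
seq = "A" * 10 + "N" * 20 + "ACGT" + "A" * 10 + "C" * 12 + "G" * 10
fused = fuse_ranges(seq, [(0, 9), (34, 43), (56, 65)], max_uncovered=3)
```

The first gap is bridged because it contains undefined bases and no more than twice the threshold in defined bases; the second gap is fully defined and longer than the threshold, so it remains a break point.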
GFF Coordinate Transformation
When processing annotations, the processGff method handles coordinate transformation:
- Adjusts start/stop coordinates for trimmed regions
- Maintains reading frame information by updating phase values
- Drops features that lose too much of their span (those retaining less than 25% of their original length)
- Assigns features to appropriate contig parts when breaking occurs
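As a rough illustration of the coordinate transformation, the sketch below maps one feature onto one retained contig part. It is not the processGff method: `remap_feature` is a hypothetical name, the 25% rule is applied to the clipped length as an assumed reading of the filter, and phase updating is left out.

```python
def remap_feature(start, stop, part_start, part_stop, min_fraction=0.25):
    """Map a feature onto a retained contig part.

    Hypothetical helper. All coordinates are 1-based inclusive, as in GFF3.
    Returns the new (start, stop) relative to the part, or None if dropped.
    Phase adjustment is not handled in this sketch.
    """
    clipped_start = max(start, part_start)
    clipped_stop = min(stop, part_stop)
    if clipped_start > clipped_stop:
        return None  # no overlap with this part
    original_len = stop - start + 1
    kept_len = clipped_stop - clipped_start + 1
    # Assumption: a feature is dropped when under 25% of its length survives.
    if kept_len < min_fraction * original_len:
        return None
    offset = part_start - 1  # number of bases removed before the part
    return clipped_start - offset, clipped_stop - offset


# A contig whose first 50 bases were trimmed: the part spans 51..500.
shifted = remap_feature(100, 199, 51, 500)
```

When a contig is broken, the same function could be called once per part, and the feature assigned to whichever part returns a result.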
Performance Characteristics
- Memory usage scales with contig count and ranges complexity
- Default memory allocation: 800MB, adjustable via -Xmx
- Concurrent processing using ConcurrentReadInputStream and ConcurrentReadOutputStream classes
- Range lookup using HashMap<String, ArrayList<Range>> data structure
Quality Assessment Integration
The processSeq method applies the quality thresholds to each contig:
- Average coverage calculation per processed region
- Length-based filtering post-trimming
- Uncovered base count tolerance for gap handling
- Statistics reporting for processed, trimmed, and broken contigs
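The coverage and length filters can be illustrated as below. This is a sketch under stated assumptions: the function names are hypothetical, ranges are 0-based inclusive (start, stop, depth) tuples, and positions outside every range are assumed to count as depth 0 when averaging.

```python
def average_coverage(ranges, length):
    """Mean depth over `length` bases.

    Hypothetical helper. Assumption: bases outside all ranges have depth 0.
    """
    if length <= 0:
        return 0.0
    covered = sum((stop - start + 1) * depth for start, stop, depth in ranges)
    return covered / length


def keep_contig(ranges, length, mincov=1, minlen=1):
    """Apply the minlen and mincov thresholds to a trimmed contig."""
    return length >= minlen and average_coverage(ranges, length) >= mincov


# Half of a 100 bp contig is covered at depth 10, so the mean depth is 5.
avg = average_coverage([(0, 49, 10)], 100)
kept = keep_contig([(0, 49, 10)], 100, mincov=5, minlen=100)
```

A contig failing either test would go to the outdirty file, if one is specified, rather than to the clean output.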
Input File Formats
Assembly Files
Accepts FASTA and FASTQ formats (compressed or uncompressed). Each sequence represents a contig or scaffold to be processed.
Ranges File Format
Coverage ranges file generated by pileup.sh with the 'ranges' flag:
#contig1
start1 stop1 depth1
start2 stop2 depth2
#contig2
start3 stop3 depth3
Lines beginning with # indicate contig names, followed by tab-delimited start-stop-depth triplets for covered regions.
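For readers who want to inspect a ranges file themselves, a minimal parser for the layout described above might look like this. It is a sketch, not the tool's reader: `parse_ranges` is a hypothetical name, fields are split on any whitespace so both tabs and spaces work, and depth is read as a float in case it is fractional.

```python
import io
from collections import defaultdict


def parse_ranges(handle):
    """Parse a ranges file into {contig_name: [(start, stop, depth), ...]}.

    Hypothetical helper based on the layout described in this document.
    """
    ranges = defaultdict(list)
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            current = line[1:].strip()
            ranges[current]  # register the contig even if no ranges follow
            continue
        if current is None:
            raise ValueError("range line found before any #contig header")
        start, stop, depth = line.split()[:3]
        # Assumption: depth may be fractional, so it is stored as a float.
        ranges[current].append((int(start), int(stop), float(depth)))
    return dict(ranges)


example = io.StringIO("#contig1\n10\t200\t35\n260\t900\t41\n#contig2\n0\t499\t12\n")
parsed = parse_ranges(example)
```

The resulting dictionary mirrors the per-contig list of ranges that the tool keeps in memory for lookup.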
GFF File Format
Standard GFF3 format for genomic annotations. Features will be adjusted based on trimming operations and filtered for significant overlap.
Output Statistics
TrimContigs reports the following statistics at the end of a run:
- Scaffolds In/Out: Count of input and output sequences
- Bases In/Out: Total nucleotide count before and after processing
- Scaffolds/Bases Filtered: Sequences removed due to quality criteria
- Scaffolds/Bases Trimmed: Sequences modified by end-trimming
- Scaffolds Broken: Sequences split into multiple parts
- Scaffold Breaks: Total number of break events
- Processing Rate: Throughput in reads/second and bases/second
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org