TrimContigs

Script: trimcontigs.sh
Package: jgi
Class: TrimContigs.java

Trims contigs to remove sequence unsupported by read alignment. The coverage range file can be generated by pileup.sh.

Basic Usage

trimcontigs.sh in=<assembly> ranges=<ranges> out=<trimmed assembly>

This tool processes assembly contigs using coverage information to identify and remove poorly supported regions. It requires a coverage ranges file that can be generated using pileup.sh with the 'ranges' flag.

Parameters

Parameters are organized by their primary function in the trimming and filtering process. The tool supports both trimming at contig ends and breaking contigs at internal low-coverage regions.

Input/Output Parameters

in=<file>
File containing input assembly. Required parameter specifying the FASTA/FASTQ file with contigs to be processed.
ranges=<file>
File generated by pileup with the 'ranges' flag. Required parameter containing coverage range information in the format: #contigname followed by start-stop-depth lines.
out=<file>
Destination of clean output assembly. Contains trimmed and filtered contigs that meet quality criteria.
outdirty=<file>
(outd) Optional dirty output containing removed contigs. Useful for debugging or manual inspection of discarded sequences.
gffin=<file>
Optional GFF file containing annotations for the input assembly.
gffout=<file>
Modified GFF file with coordinates adjusted for trimmed/broken contigs. Features are repositioned and filtered based on trimming operations.

Quality Filtering Parameters

mincov=1
Discard contigs with lower average coverage than this threshold. Applied after trimming operations to filter out poorly supported contigs.
minlen=1
Discard contigs shorter than this length, after trimming. Prevents retention of excessively short sequences that may lack biological significance.
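The two filters above can be illustrated with a minimal Python sketch. It is not the tool's code: the function names are hypothetical, and it assumes that ranges use inclusive coordinates and that average coverage is the depth-weighted sum of covered bases divided by the trimmed contig length.

```python
from typing import List, Tuple

Range = Tuple[int, int, float]  # (start, stop, depth); inclusive coordinates are an assumption


def average_coverage(length: int, ranges: List[Range]) -> float:
    """Depth-weighted covered bases divided by contig length (assumed definition)."""
    if length <= 0:
        return 0.0
    covered = sum((stop - start + 1) * depth for start, stop, depth in ranges)
    return covered / length


def keep_contig(length: int, ranges: List[Range], mincov: float = 1, minlen: int = 1) -> bool:
    """Return True when a trimmed contig passes both the minlen and mincov filters."""
    return length >= minlen and average_coverage(length, ranges) >= mincov


# A 100 bp contig whose first 50 bases are covered at depth 10.
print(average_coverage(100, [(0, 49, 10.0)]))
print(keep_contig(100, [(0, 49, 10.0)], mincov=5, minlen=500))
```

In this sketch the short contig is rejected by minlen even though its average coverage meets mincov, which mirrors the idea that both criteria must hold for a contig to reach the clean output.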

Trimming Parameters

trimmin=0
Always trim at least this many bases from each end of every sequence, regardless of coverage.
trimmax=big
Don't trim more than this much on contig ends. Maximum trimming limit (default: 2,000,000,000 bp). Can be set to "big", "2b", or "2g" for maximum value.
trimextra=5
Trim an additional amount when trimming. Extra bases to remove beyond the uncovered region boundary to ensure clean transitions.
maxuncovered=3
Don't trim where there are at most this many uncovered bases. Prevents trimming for very small uncovered gaps that may be artifacts.
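One plausible way these four parameters could combine at a single contig end is sketched below in Python. The order of operations is an assumption made for illustration (the text above does not specify it), and `end_trim` is a hypothetical helper, not a function of the tool.

```python
def end_trim(uncovered: int, trimmin: int = 0, trimmax: int = 2_000_000_000,
             trimextra: int = 5, maxuncovered: int = 3) -> int:
    """Bases to remove from one contig end, given its count of uncovered bases.

    Assumed logic: small uncovered stretches are tolerated, larger ones are
    removed together with trimextra bases, the result is capped at trimmax,
    and trimmin is always enforced as a floor.
    """
    if uncovered <= maxuncovered:
        amount = 0                      # tolerated: too few uncovered bases to trim
    else:
        amount = uncovered + trimextra  # remove the uncovered end plus a safety margin
    amount = min(amount, trimmax)       # never exceed the trimming cap
    return max(amount, trimmin)         # always trim at least trimmin


seq = "A" * 100
left = end_trim(20)   # 20 uncovered bases at the left end
right = end_trim(2)   # 2 uncovered bases at the right end
trimmed = seq[left:len(seq) - right]
print(left, right, len(trimmed))
```

With the defaults, a 2-base uncovered end is left alone while a 20-base uncovered end loses 25 bases; raising trimmin or lowering trimmax shifts those results accordingly.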

Contig Breaking Parameters

break=t
Break contigs where uncovered areas are present. When true, contigs are split at internal uncovered regions rather than just trimmed at ends.
breaklist=
Optional file to report the list of broken contigs. Contains names of contigs that were broken into multiple pieces.
skippolyn=t
Don't break around uncovered poly-Ns (scaffold breaks). When true, ignores uncovered regions consisting primarily of N bases, which typically represent scaffold gaps.
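The interaction of break, maxuncovered, and skippolyn can be sketched as follows. This is an illustrative Python model, not the tool's implementation: `split_pieces` is a hypothetical name, coordinates are assumed to be 0-based and inclusive, and a gap counts as poly-N only when every base is N, which is stricter than the "primarily N" wording above.

```python
from typing import List, Tuple


def split_pieces(seq: str, covered: List[Tuple[int, int]],
                 maxuncovered: int = 3, skippolyn: bool = True) -> List[Tuple[int, int]]:
    """Merge covered ranges across tolerated gaps; break at all other gaps.

    covered must be sorted (start, stop) pairs, 0-based and inclusive (assumed).
    """
    if not covered:
        return []
    pieces = []
    cur_start, cur_stop = covered[0]
    for start, stop in covered[1:]:
        gap = seq[cur_stop + 1:start]            # uncovered bases between the two ranges
        small = len(gap) <= maxuncovered         # short gaps are tolerated
        poly_n = skippolyn and len(gap) > 0 and set(gap.upper()) == {"N"}
        if small or poly_n:
            cur_stop = max(cur_stop, stop)       # keep extending the current piece
        else:
            pieces.append((cur_start, cur_stop)) # break the contig here
            cur_start, cur_stop = start, stop
    pieces.append((cur_start, cur_stop))
    return pieces


# 70 bp scaffold: an uncovered 20 bp stretch of C, then an uncovered 20 bp run of N.
scaffold = "A" * 10 + "C" * 20 + "A" * 10 + "N" * 20 + "A" * 10
ranges = [(0, 9), (30, 39), (60, 69)]
print(split_pieces(scaffold, ranges))
print(split_pieces(scaffold, ranges, skippolyn=False))
```

In this model the uncovered C stretch causes a break, while the uncovered N run is treated as a scaffold gap and bridged unless skippolyn is disabled.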

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Trimming

trimcontigs.sh in=assembly.fasta ranges=coverage.ranges out=trimmed.fasta

Trims assembly contigs based on coverage ranges, keeping only well-supported regions.

With Quality Filtering

trimcontigs.sh in=assembly.fasta ranges=coverage.ranges out=clean.fasta \
    mincov=5 minlen=500 trimextra=10

Applies stricter quality filters: minimum 5x coverage, minimum 500bp length, with extra 10bp trimming beyond uncovered regions.

Breaking Contigs at Gaps

trimcontigs.sh in=scaffolds.fasta ranges=coverage.ranges out=broken.fasta \
    break=t breaklist=broken_contigs.txt maxuncovered=5

Breaks contigs at internal uncovered regions larger than 5bp, with a report of broken contigs saved to file.

With GFF Annotation Processing

trimcontigs.sh in=assembly.fasta ranges=coverage.ranges out=trimmed.fasta \
    gffin=annotations.gff gffout=adjusted_annotations.gff

Processes both sequence and annotations, adjusting feature coordinates for trimmed/broken contigs.

Conservative Trimming (Scaffold-Aware)

trimcontigs.sh in=scaffolds.fasta ranges=coverage.ranges out=conservative.fasta \
    skippolyn=t maxuncovered=10 break=f

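Conservative approach suitable for scaffolded assemblies: contigs are never broken (break=f), uncovered poly-N regions are ignored (skippolyn=t), and uncovered stretches of up to 10 bases are left untrimmed (maxuncovered=10).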

Algorithm Details

Coverage-Based Trimming Strategy

TrimContigs uses a dual-mode approach for processing assemblies based on coverage information.

Single Range Processing

For contigs with a single continuous covered region, the algorithm performs end trimming only.

Multi-Range Processing (Contig Breaking)

When break=t is enabled and multiple covered regions exist, contigs are split at the uncovered regions between them.

Poly-N Handling

The skipPolyN feature uses the fixPolyN method to distinguish scaffold gaps from biological breaks.

GFF Coordinate Transformation

When processing annotations, the processGff method handles coordinate transformation.

Performance Characteristics

Quality Assessment Integration

The processSeq method applies the quality thresholds (mincov and minlen) to each processed contig.

Input File Formats

Assembly Files

Accepts FASTA and FASTQ formats (compressed or uncompressed). Each sequence represents a contig or scaffold to be processed.

Ranges File Format

Coverage ranges file generated by pileup.sh with the 'ranges' flag:

#contig1
start1	stop1	depth1
start2	stop2	depth2
#contig2
start3	stop3	depth3

Lines beginning with # indicate contig names, followed by tab-delimited start-stop-depth triplets for covered regions.
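For inspecting a ranges file outside the tool, the layout described above can be read with a few lines of Python. This is a minimal sketch: `parse_ranges` is a hypothetical helper, and it assumes integer start/stop values, a numeric depth, and that every line starting with # is a contig name.

```python
from typing import Dict, Iterable, List, Tuple

Range = Tuple[int, int, float]  # (start, stop, depth)


def parse_ranges(lines: Iterable[str]) -> Dict[str, List[Range]]:
    """Group start/stop/depth rows under the most recent '#contig' header."""
    ranges: Dict[str, List[Range]] = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue                                  # skip blank lines
        if line.startswith("#"):
            current = line[1:].strip()                # contig name follows the '#'
            ranges.setdefault(current, [])
            continue
        if current is None:
            raise ValueError("range row found before any '#contig' header")
        start, stop, depth = line.split("\t")[:3]     # tab-delimited triplet
        ranges[current].append((int(start), int(stop), float(depth)))
    return ranges


# Made-up example data in the documented layout.
example = ["#contig1\n", "0\t99\t12\n", "150\t300\t8\n", "#contig2\n", "10\t500\t20\n"]
parsed = parse_ranges(example)
print(parsed["contig1"])
```

To read a real file, pass an open file handle instead of the list, for example `parse_ranges(open("coverage.ranges"))`.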

GFF File Format

Standard GFF3 format for genomic annotations. Features will be adjusted based on trimming operations and filtered for significant overlap.
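The kind of coordinate adjustment involved can be sketched in Python as below. GFF3 coordinates are 1-based and inclusive; everything else here is an assumption for illustration, including the helper name `adjust_feature`, the clipping of features to the retained piece, and the 0.5 overlap fraction standing in for "significant overlap".

```python
from typing import Optional, Tuple


def adjust_feature(start: int, end: int, piece_start: int, piece_end: int,
                   min_overlap_fraction: float = 0.5) -> Optional[Tuple[int, int]]:
    """Map a feature onto a retained piece of the original contig.

    All coordinates are 1-based inclusive. Returns the new (start, end) on the
    piece, or None when too little of the feature survives (hypothetical rule).
    """
    overlap_start = max(start, piece_start)
    overlap_end = min(end, piece_end)
    overlap = overlap_end - overlap_start + 1
    if overlap <= 0:
        return None                                   # feature lies outside the piece
    if overlap / (end - start + 1) < min_overlap_fraction:
        return None                                   # not enough of the feature remains
    offset = piece_start - 1                          # bases removed ahead of the piece
    return overlap_start - offset, overlap_end - offset


# The piece keeps original positions 101..400, so 100 leading bases were trimmed.
print(adjust_feature(120, 180, 101, 400))
print(adjust_feature(10, 20, 101, 400))
```

A feature fully inside the retained piece simply shifts left by the number of trimmed bases, while a feature that falls in the removed region is dropped.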

Output Statistics

TrimContigs tracks processing statistics using dedicated counter variables.

Support

For questions and support: