AddAdapters

Script: addadapters.sh
Package: jgi
Class: AddAdapters.java
Status: DEPRECATED for paired reads

Tool designed for benchmarking adapter-trimming software. Adds synthetic adapters to reads for testing trimmer performance, or evaluates adapter trimming accuracy on previously processed files.

⚠️ DEPRECATED for paired reads: AddAdapters does not model insert size, so the adapter positions it produces for paired-end data are unrealistic. For paired reads, use RandomReads instead: it adds adapters at the positions where read-through would occur for a given insert size, which makes overlap-based adapter detection testable.

Purpose and Scope

AddAdapters is specifically designed for grading the performance of adapter-trimming tools. It serves two primary functions:

Synthetic Data Generation: Creates test datasets by adding known adapter contamination to clean reads
Performance Evaluation: Analyzes trimmed reads to calculate trimming accuracy metrics

Recommended Workflow for Paired Reads:
Instead of AddAdapters, use RandomReads for realistic adapter contamination:

randomreads.sh ref=ref.fa out=reads.fq len=150 paired reads=100k \
    mininsert=50 maxinsert=350 fragadapter1=AGATCGGAAGAGC fragadapter2=CTGTCTCTTATAC
rename.sh in=reads.fq out=renamed.fq renamebytrim interleaved

The resulting reads can still be evaluated by AddAdapters in grade mode.

Basic Usage

addadapters.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> adapters=<file>

Operation Modes

Add Mode (default): Synthetically contaminates clean reads with adapter sequences at random positions. Encodes the correct trimming answer in read headers for later evaluation.
Grade Mode: Evaluates adapter trimming performance by comparing actual read lengths against the encoded correct answers.

Parameters

Input/Output Parameters

in=<file>: Primary input file (FASTQ/FASTA format). Can be stdin.
in2=<file>: Secondary input file for paired reads (optional).
out=<file>: Primary output file. Required in add mode, unused in grade mode.
out2=<file>: Secondary output file for paired reads (optional).
ow=f: (overwrite) Whether to overwrite files that already exist.
int=f: (interleaved) Determines whether INPUT file is considered interleaved.

Quality Parameters

qin=auto: ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.
qout=auto: ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).

Operation Mode Parameters

add: Add adapters to input files. Default mode.
grade: Evaluate trimmed input files.

Adapter Configuration

adapters=<file>: FASTA file of adapter sequences. Required in add mode.
literal=<sequence>: Comma-delimited list of adapter sequences as alternative to file.
left: Place adapters on the left (5') end of reads.
right: Place adapters on the right (3') end of reads. Default mode.
arc=f: Add reverse-complemented adapters as well as forward orientation.
rate=0.5: Fraction of reads that receive adapter contamination (0.0-1.0).
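
As an illustration of what literal= and arc= imply for the adapter pool, here is a minimal Python sketch. The function names are hypothetical and this is not the tool's own code; in particular, the source does not say how AddAdapters stores or chooses between orientations, so the sketch simply assumes both orientations end up in one list.

```python
# Hypothetical helpers; not part of AddAdapters.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def parse_literal(value):
    """Split a comma-delimited literal= value into adapter sequences."""
    return [s for s in value.split(",") if s]

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def expand_adapters(adapters, arc=False):
    """Return the adapter pool, adding reverse complements when arc is set."""
    pool = list(adapters)
    if arc:
        # Assumption: forward and reverse-complemented adapters share one pool.
        pool += [reverse_complement(a) for a in adapters]
    return pool

print(expand_adapters(parse_literal("AGATCGGAAGAGC"), arc=True))
# prints ['AGATCGGAAGAGC', 'GCTCTTCCGATCT']
```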

Adapter Addition Parameters

adderrors=t: Add sequencing errors to adapter bases using quality score error probabilities.
addpaired=t: Place adapters at the same position in both reads of a pair. Note: position is relative within each read, not based on insert size.
minlength=1: (minlen/ml) Minimum read length to consider valid after adapter addition.
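
To make "quality score error probabilities" concrete: a Phred score Q corresponds to an error probability of 10^(-Q/10). The Python sketch below applies that rule to a stretch of adapter bases. It is only an illustration under stated assumptions: the function names are hypothetical, the quality offset is taken to be 33, and substituting a uniformly chosen different base is an assumption, since the source does not describe how AddAdapters picks the replacement base.

```python
import random

def error_probability(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def add_errors(bases, quals, rng, offset=33):
    """Mutate each base with the probability implied by its quality character.

    Assumption: an erroneous base is replaced by a different base chosen
    uniformly at random.
    """
    out = []
    for base, qchar in zip(bases, quals):
        p = error_probability(ord(qchar) - offset)
        if rng.random() < p:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

rng = random.Random(0)
# '!' is Q0 at offset 33 (p = 1.0), so every base is changed here.
print(add_errors("AGATCGGAAGAGC", "!" * 13, rng))
```

At typical qualities the effect is small: Q20 gives one expected error per 100 adapter bases, Q30 one per 1000.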

Examples

Basic Synthetic Contamination (Single-End)

addadapters.sh in=clean_reads.fq out=contaminated.fq adapters=adapters.fa

Adds adapter sequences to 50% of reads (default rate) for benchmarking adapter trimming tools.

Controlled Contamination Rate

addadapters.sh in=reads.fq out=contaminated.fq literal=AGATCGGAAGAGC rate=0.3

Contaminates 30% of reads with the specified Illumina TruSeq adapter sequence.

Evaluate Trimming Performance

addadapters.sh in=trimmed_reads.fq grade

Analyzes trimmed reads to calculate adapter removal accuracy, over-trimming, and under-trimming rates.

Paired-End Contamination (Not Recommended)

addadapters.sh in1=r1.fq in2=r2.fq out1=cont_r1.fq out2=cont_r2.fq adapters=adapters.fa addpaired=t

Warning: This places adapters at the same relative position in both reads, which is not biologically realistic. Consider RandomReads instead.

Recommended Paired-End Workflow

# Generate realistic paired-end data with proper adapter placement
randomreads.sh ref=genome.fa out=reads.fq len=150 paired reads=100k \
    mininsert=50 maxinsert=350 fragadapter1=AGATCGGAAGAGC fragadapter2=CTGTCTCTTATAC

# Rename reads for compatibility with AddAdapters grading
rename.sh in=reads.fq out=test_data.fq renamebytrim interleaved

# Test your adapter trimmer
your_trimmer.sh in=test_data.fq out=trimmed.fq

# Grade the trimming performance
addadapters.sh in=trimmed.fq grade

This workflow creates biologically realistic test data where adapters appear due to read-through based on actual insert sizes.

Algorithm Details

Synthetic Contamination Strategy

In add mode, AddAdapters implements a controlled contamination algorithm:

Position Selection: Randomly selects adapter insertion points within reads using uniform distribution
Adapter Assignment: Randomly chooses from provided adapter sequences for each contaminated read
Sequence Replacement: Replaces bases from insertion point onward with adapter sequence
Tail Randomization: Fills remaining positions beyond adapter with random nucleotides
Quality-Based Errors: When adderrors=true, introduces realistic sequencing errors in adapter regions based on Phred quality scores
Answer Key Encoding: Rewrites each read ID in the form "original_length_correct_length", where correct_length is the number of bases that should remain after ideal trimming, for automated evaluation
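
The steps above can be sketched in Python as follows. This is a simplified illustration, not the AddAdapters implementation: the function names are hypothetical, reads are assumed to be non-empty, quality strings and error injection are left out, and clean reads are assumed to be labeled with their full length as the correct length.

```python
import random

def add_adapter(seq, adapter, rng):
    """Overwrite the tail of seq with an adapter at a uniformly chosen position."""
    n = len(seq)
    pos = rng.randrange(n)             # insertion point, 0 <= pos < n
    kept = seq[:pos]                   # original bases that survive
    tail = adapter[: n - pos]          # adapter, truncated at the read end
    filler = "".join(rng.choice("ACGT") for _ in range(n - pos - len(tail)))
    key = f"{n}_{pos}"                 # original length, correct length
    return kept + tail + filler, key

def contaminate(reads, adapters, rate=0.5, seed=0):
    """Add an adapter to roughly `rate` of the reads; return (key, sequence) pairs."""
    rng = random.Random(seed)
    out = []
    for seq in reads:
        if rng.random() < rate:
            new_seq, key = add_adapter(seq, rng.choice(adapters), rng)
        else:
            # Assumption: clean reads keep their sequence and full length.
            new_seq, key = seq, f"{len(seq)}_{len(seq)}"
        out.append((key, new_seq))
    return out

for key, seq in contaminate(["ACGT" * 10] * 4, ["AGATCGGAAGAGC"], rate=0.5, seed=1):
    print(key, seq)
```

Read length is preserved in this sketch: the adapter and the random tail replace existing bases rather than extending the read.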

Performance Evaluation Metrics

In grade mode, the tool reports the following trimming statistics:

True Positives: Reads with adapters that were correctly trimmed to the right length
False Positives: Clean reads that were incorrectly trimmed (over-trimming)
True Negatives: Clean reads that were correctly left untrimmed
False Negatives: Contaminated reads that were not trimmed (under-trimming)
Adapter Retention: Percentage of adapter sequences remaining after trimming
Base-Level Accuracy: Counts of correctly and incorrectly trimmed bases, reported alongside the read counts
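
One way to read these categories is as a function of three lengths: the original read length, the correct length from the answer key, and the length actually left by the trimmer. The Python sketch below is an interpretation of the definitions above, not the tool's grading code. The function names are hypothetical, and the "OVER" label for contaminated reads that were cut too short is an assumption, since the list above does not name that case.

```python
def classify(original_len, correct_len, trimmed_len):
    """Label one graded read with a category from the list above."""
    had_adapter = correct_len < original_len
    if not had_adapter:
        # Clean read: any shortening counts as a false positive.
        return "TN" if trimmed_len == original_len else "FP"
    if trimmed_len == correct_len:
        return "TP"
    if trimmed_len > correct_len:
        # Adapter bases remain (untrimmed or under-trimmed).
        return "FN"
    # Assumption: contaminated reads cut too short get their own label.
    return "OVER"

def base_errors(correct_len, trimmed_len):
    """Return (adapter bases remaining, non-adapter bases removed)."""
    adapter_remaining = max(0, trimmed_len - correct_len)
    valid_removed = max(0, correct_len - trimmed_len)
    return adapter_remaining, valid_removed

print(classify(150, 85, 85), base_errors(85, 85))    # prints TP (0, 0)
print(classify(150, 85, 100), base_errors(85, 100))  # prints FN (15, 0)
```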

Limitations and Considerations

Single-End Reads

Suitable for single-end read benchmarking where adapter contamination occurs through random fragmentation or read-through events.

Paired-End Limitations

For paired reads, AddAdapters has significant limitations:

No Insert Size Awareness: Cannot place adapters at biologically correct positions
Unrealistic Contamination Pattern: Places adapters at same relative positions rather than complementary insert-based positions
Overlap Detection Issues: The resulting pairs cannot be used to test overlap-based adapter trimming methods

RandomReads Advantages

RandomReads addresses these limitations by:

Generating reads from reference sequences with specified insert size distributions
Adding adapters only when reads extend beyond insert boundaries (read-through)
Creating realistic overlap patterns that enable overlap-based adapter detection
Supporting fragment size-dependent contamination patterns

Memory and Performance

Memory Footprint: Default allocation of 200MB, suitable for most datasets
Streaming Processing: Processes reads in batches to maintain constant memory usage
Thread Safety: Supports multi-threaded processing for improved throughput
I/O Efficiency: Compatible with compressed input/output formats

Output Interpretation

Add Mode Output

Contaminated reads with modified headers encoding the correct trimming position:

@150_85 original_read_header
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAATCTCGTATGCCGTCTTCTGCTTG...
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...

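Header format: "original_length_correct_remaining_length". In the example above, 150_85 means the read was originally 150 bases long and 85 bases should remain after correct trimming.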
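
For scripts that inspect these headers directly, the answer key can be pulled out with a few lines of Python. This sketch assumes the header looks exactly like the example shown here (two underscore-separated integers as the first whitespace-delimited token); real headers may carry additional fields, and the function names are hypothetical.

```python
def parse_answer_key(header):
    """Return (original_length, correct_length) from a header like '@150_85 name'."""
    token = header.lstrip("@").split()[0]   # e.g. "150_85"
    parts = token.split("_")
    return int(parts[0]), int(parts[1])

def is_perfectly_trimmed(header, sequence):
    """True if the trimmed sequence has exactly the encoded correct length."""
    _, correct_len = parse_answer_key(header)
    return len(sequence) == correct_len

print(parse_answer_key("@150_85 original_read_header"))  # prints (150, 85)
```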

Grade Mode Statistics

Total output:                        10000 reads                 1500000 bases
Perfectly Correct (% of output):     8750 reads (87.500%)        1312500 bases (87.500%)
Incorrect (% of output):             1250 reads (12.500%)        187500 bases (12.500%)

Adapters Remaining (% of adapters):  125 reads (2.500%)          18750 bases (1.250%)
Non-Adapter Removed (% of valid):    50 reads (0.500%)           7500 bases (0.571%)

Best Practices

Use for single-end benchmarking when testing adapter trimming algorithms
Choose adapter sequences that match your library prep kit (for example, Illumina TruSeq or Nextera)
Test multiple contamination rates (10%, 30%, 50%) to evaluate trimmer robustness
Enable error introduction (adderrors=t) for realistic contamination patterns
For paired reads, use RandomReads workflow for biologically accurate test data
Grade mode requires headers in the AddAdapters answer-key format, written either by add mode or by rename.sh with renamebytrim; it cannot evaluate arbitrary trimmed data