AddAdapters
Tool designed for benchmarking adapter-trimming software. Adds synthetic adapters to reads for testing trimmer performance, or evaluates adapter trimming accuracy on previously processed files.
Purpose and Scope
AddAdapters is specifically designed for grading the performance of adapter-trimming tools. It serves two primary functions:
- Synthetic Data Generation: Creates test datasets by adding known adapter contamination to clean reads
- Performance Evaluation: Analyzes trimmed reads to calculate trimming accuracy metrics
For paired reads, use RandomReads instead of AddAdapters to get realistic adapter contamination:
randomreads.sh ref=ref.fa out=reads.fq len=150 paired reads=100k \
mininsert=50 maxinsert=350 fragadapter1=AGATCGGAAGAGC fragadapter2=CTGTCTCTTATAC
rename.sh in=reads.fq out=renamed.fq renamebytrim interleaved
Once trimmed, the resulting reads can still be evaluated by AddAdapters in grade mode.
Basic Usage
addadapters.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> adapters=<file>
Operation Modes
- Add Mode (default)
- Synthetically contaminates clean reads with adapter sequences at random positions. Encodes the correct trimming answer in read headers for later evaluation.
- Grade Mode
- Evaluates adapter trimming performance by comparing actual read lengths against the encoded correct answers.
Parameters
Input/Output Parameters
- in=<file>
- Primary input file (FASTQ/FASTA format). Can be stdin.
- in2=<file>
- Secondary input file for paired reads (optional).
- out=<file>
- Primary output file. Required in add mode, unused in grade mode.
- out2=<file>
- Secondary output file for paired reads (optional).
- ow=f
- (overwrite) Overwrites files that already exist.
- int=f
- (interleaved) Determines whether INPUT file is considered interleaved.
Quality Parameters
- qin=auto
- ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.
- qout=auto
- ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).
Operation Mode Parameters
- add
- Add adapters to input files. Default mode.
- grade
- Evaluate trimmed input files.
Adapter Configuration
- adapters=<file>
- FASTA file of adapter sequences. Required in add mode.
- literal=<sequence>
- Comma-delimited list of adapter sequences as alternative to file.
- left
- Place adapters on the left (5') end of reads.
- right
- Place adapters on the right (3') end of reads. Default mode.
- arc=f
- Add reverse-complemented adapters as well as forward orientation.
- rate=0.5
- Fraction of reads that receive adapter contamination (0.0-1.0).
Adapter Addition Parameters
- adderrors=t
- Add sequencing errors to adapter bases using quality score error probabilities.
- addpaired=t
- Place adapters at the same position in both reads of a pair. Note: position is relative within each read, not based on insert size.
- minlength=1
- (minlen/ml) Minimum read length to consider valid after adapter addition.
Examples
Basic Synthetic Contamination (Single-End)
addadapters.sh in=clean_reads.fq out=contaminated.fq adapters=adapters.fa
Adds adapter sequences to 50% of reads (default rate) for benchmarking adapter trimming tools.
Controlled Contamination Rate
addadapters.sh in=reads.fq out=contaminated.fq literal=AGATCGGAAGAGC rate=0.3
Contaminates 30% of reads with the specified Illumina TruSeq adapter sequence.
Evaluate Trimming Performance
addadapters.sh in=trimmed_reads.fq grade
Analyzes trimmed reads to calculate adapter removal accuracy, over-trimming, and under-trimming rates.
Paired-End Contamination (Not Recommended)
addadapters.sh in=r1.fq in2=r2.fq out=cont_r1.fq out2=cont_r2.fq adapters=adapters.fa addpaired=t
Warning: This places adapters at the same relative position in both reads, which is not biologically realistic. Consider RandomReads instead.
Recommended Paired-End Workflow
# Generate realistic paired-end data with proper adapter placement
randomreads.sh ref=genome.fa out=reads.fq len=150 paired reads=100k \
mininsert=50 maxinsert=350 fragadapter1=AGATCGGAAGAGC fragadapter2=CTGTCTCTTATAC
# Rename reads for compatibility with AddAdapters grading
rename.sh in=reads.fq out=test_data.fq renamebytrim interleaved
# Test your adapter trimmer
your_trimmer.sh in=test_data.fq out=trimmed.fq
# Grade the trimming performance
addadapters.sh in=trimmed.fq grade
This workflow creates biologically realistic test data where adapters appear due to read-through based on actual insert sizes.
Algorithm Details
Synthetic Contamination Strategy
In add mode, AddAdapters implements a controlled contamination algorithm:
- Position Selection: Randomly selects an adapter insertion point within each contaminated read, drawn from a uniform distribution
- Adapter Assignment: Randomly chooses from provided adapter sequences for each contaminated read
- Sequence Replacement: Replaces bases from insertion point onward with adapter sequence
- Tail Randomization: Fills remaining positions beyond adapter with random nucleotides
- Quality-Based Errors: When adderrors=true, introduces realistic sequencing errors in adapter regions based on Phred quality scores
- Answer Key Encoding: Rewrites read IDs in the format "original_length_correct_remaining_length" for automated evaluation
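The steps above can be sketched for a single read in shell and awk. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function name contaminate_read is hypothetical, the insertion position is passed in rather than drawn at random, the optional seed only controls the random tail, and error injection (adderrors) is left out.

```shell
# Hypothetical helper, not part of BBTools.
# Usage: contaminate_read <bases> <adapter> <pos> [seed]
contaminate_read() {
  awk -v bases="$1" -v adapter="$2" -v pos="$3" -v seed="${4:-1}" 'BEGIN {
    srand(seed)
    len = length(bases)
    pos = pos + 0
    if (pos > len) pos = len
    # Keep the first pos genomic bases, then write the adapter over what follows.
    out = substr(bases, 1, pos) adapter
    # If the adapter ends before the read does, pad with random nucleotides.
    while (length(out) < len)
      out = out substr("ACGT", int(rand() * 4) + 1, 1)
    # If the adapter runs past the read end, cut back to the original length.
    out = substr(out, 1, len)
    # Answer key: original length, then the length a perfect trimmer would leave.
    printf "%d_%d\n%s\n", len, pos, out
  }'
}

# Prints the key 20_5, then 20 bases that begin ACGTAAGATCGGAAGAGC.
contaminate_read ACGTACGTACGTACGTACGT AGATCGGAAGAGC 5
```

The read keeps its original length, which is why the answer key has to be stored in the header: nothing in the sequence itself marks where the genomic bases end.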
Performance Evaluation Metrics
In grade mode, the tool calculates comprehensive trimming statistics:
- True Positives: Reads with adapters that were correctly trimmed to the right length
- False Positives: Clean reads that were incorrectly trimmed (over-trimming)
- True Negatives: Clean reads that were correctly left untrimmed
- False Negatives: Contaminated reads that were not trimmed (under-trimming)
- Adapter Retention: Percentage of adapter sequences remaining after trimming
- Base-Level Accuracy: Reports results in bases as well as reads, so the size of each trimming error is measured, not just its presence
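The read-level categories above reduce to comparing three lengths per read. The sketch below is one way to express that comparison; the function name classify_trim and the labels adapter_remaining and over_trimmed (for contaminated reads trimmed to the wrong length) are assumptions made for illustration, not names the tool reports.

```shell
# Hypothetical helper, not part of BBTools.
# Usage: classify_trim <original_len> <correct_len> <actual_len>
classify_trim() {
  awk -v orig="$1" -v correct="$2" -v actual="$3" 'BEGIN {
    orig += 0; correct += 0; actual += 0
    if (correct == orig) {
      # Clean read: any trimming at all counts against the trimmer.
      if (actual == orig) print "true_negative"
      else print "false_positive"
    } else if (actual == correct) {
      print "true_positive"
    } else if (actual == orig) {
      # Contaminated read left completely untrimmed.
      print "false_negative"
    } else if (actual > correct) {
      # Trimmed, but some adapter bases survive.
      print "adapter_remaining"
    } else {
      # Trimmed past the adapter into genomic sequence.
      print "over_trimmed"
    }
  }'
}

classify_trim 150 85 85    # prints true_positive
classify_trim 150 150 140  # prints false_positive
```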
Limitations and Considerations
Single-End Reads
Suitable for single-end read benchmarking where adapter contamination occurs through random fragmentation or read-through events.
Paired-End Limitations
For paired reads, AddAdapters has significant limitations:
- No Insert Size Awareness: Cannot place adapters at biologically correct positions
- Unrealistic Contamination Pattern: Places adapters at the same relative positions rather than at complementary, insert-based positions
- Overlap Detection Issues: Prevents testing of overlap-based adapter trimming methods
RandomReads Advantages
RandomReads addresses these limitations by:
- Generating reads from reference sequences with specified insert size distributions
- Adding adapters only when reads extend beyond insert boundaries (read-through)
- Creating realistic overlap patterns that enable overlap-based adapter detection
- Supporting fragment size-dependent contamination patterns
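The read-through rule behind these points is simple arithmetic: a read contains genomic bases only up to the insert size, and everything beyond that is adapter. The sketch below states the rule with a hypothetical function, readthrough_split; it illustrates the concept and is not RandomReads code.

```shell
# Hypothetical helper, not part of BBTools.
# Usage: readthrough_split <read_len> <insert_size>
# Prints "<genomic_bases> <adapter_bases>" for one read.
readthrough_split() {
  awk -v readlen="$1" -v insert="$2" 'BEGIN {
    readlen += 0; insert += 0
    # The read is genomic up to the insert size, capped at the read length.
    genomic = readlen
    if (insert < readlen) genomic = insert
    print genomic, readlen - genomic
  }'
}

readthrough_split 150 50   # prints 50 100
readthrough_split 150 350  # prints 150 0
```

With len=150 and inserts ranging from 50 to 350, as in the RandomReads commands shown earlier, only fragments shorter than 150 bases produce adapter sequence, and the genomic portions of the two reads in such a pair cover the same fragment.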
Memory and Performance
- Memory Footprint: Default allocation of 200MB, suitable for most datasets
- Streaming Processing: Processes reads in batches to maintain constant memory usage
- Thread Safety: Supports multi-threaded processing for improved throughput
- I/O Efficiency: Compatible with compressed input/output formats
Output Interpretation
Add Mode Output
Contaminated reads with modified headers encoding the correct trimming position:
@150_85 original_read_header
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAATCTCGTATGCCGTCTTCTGCTTG...
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...
Header format: "original_length_correct_remaining_length"
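To inspect the answer key outside of grade mode, the header can be split on underscores. The sketch below assumes four-line FASTQ records and headers shaped exactly like the example above; the function name answer_key_table is hypothetical.

```shell
# Hypothetical helper, not part of BBTools.
# Reads FASTQ on stdin and prints "original correct actual" lengths per read.
answer_key_table() {
  awk 'NR % 4 == 1 {
         id = substr($1, 2)     # first token of the header, minus "@"
         split(id, key, "_")    # key[1] = original length, key[2] = correct length
       }
       NR % 4 == 2 { print key[1], key[2], length($0) }'
}

# Usage (file name is illustrative):
#   answer_key_table < trimmed_reads.fq
```

A read is perfectly trimmed when the second and third columns match.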
Grade Mode Statistics
Total output: 10000 reads 1500000 bases
Perfectly Correct (% of output): 8750 reads (87.500%) 1312500 bases (87.500%)
Incorrect (% of output): 1250 reads (12.500%) 187500 bases (12.500%)
Adapters Remaining (% of adapters): 125 reads (2.500%) 18750 bases (1.250%)
Non-Adapter Removed (% of valid): 50 reads (0.500%) 7500 bases (0.571%)
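When comparing several trimmers, it can help to pull a single number out of each report. The sketch below extracts the read-level percentage from one labeled line; it assumes the report has been saved as text in the layout shown above, and grade_percent is a hypothetical name.

```shell
# Hypothetical helper, not part of BBTools.
# Usage: grade_percent <label> < report.txt
grade_percent() {
  # Splitting on parentheses makes field 4 the first percentage after the counts.
  awk -F'[()]' -v label="$1" 'index($0, label) == 1 {
    pct = $4
    sub(/%$/, "", pct)
    print pct
  }'
}

# Usage (file name is illustrative):
#   grade_percent 'Perfectly Correct' < trimmerA_report.txt
```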
Best Practices
- Use for single-end benchmarking when testing adapter trimming algorithms
- Choose realistic adapter sequences that match your library preparation (Illumina TruSeq, Nextera, etc.)
- Test multiple contamination rates (10%, 30%, 50%) to evaluate trimmer robustness
- Enable error introduction (adderrors=t) for realistic contamination patterns
- For paired reads, use RandomReads workflow for biologically accurate test data
- Grade mode requires headers that encode the correct answer, as written by AddAdapters in add mode or by rename.sh with renamebytrim; it cannot evaluate arbitrary trimmed data
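The advice on testing several contamination rates can be scripted. The sketch below only prints the add-mode commands so they can be reviewed before running; print_rate_sweep and the output file naming are assumptions, while the flags are the ones documented above.

```shell
# Hypothetical helper, not part of BBTools.
# Usage: print_rate_sweep <clean_reads.fq> <adapters.fa>
# Prints (does not run) one add-mode command per contamination rate.
print_rate_sweep() {
  for rate in 0.1 0.3 0.5; do
    printf 'addadapters.sh in=%s out=contaminated_%s.fq adapters=%s rate=%s adderrors=t\n' \
      "$1" "$rate" "$2" "$rate"
  done
}

print_rate_sweep clean_reads.fq adapters.fa
```

Piping the output to sh executes the commands; each resulting file can then be trimmed and graded separately.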