RemoveSmartBell
Removes SMRTbell (Smart Bell) adapters from PacBio reads using MultiStateAligner algorithms, with locality-aware adapter detection and an optional backup aligner for increased sensitivity.
Basic Usage
removesmartbell.sh in=<input> out=<output> split=t
Input may be FASTA or FASTQ, compressed or uncompressed. H5 files are not supported.
Parameters
Parameters control adapter detection, processing modes, and output formatting.
Core Parameters
- in=file
- Specify the input file, or stdin. Can use # notation for paired files (e.g., reads#.fq becomes reads1.fq and reads2.fq).
- in2=file
- Specify the second input file for paired data.
- out=file
- Specify the output file, or stdout. Can use # notation for paired output files.
- adapter=string
- Specify the adapter sequence. Default is the standard SMRTbell adapter sequence (ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT).
- split=t
- Processing mode: t=splits reads at adapters into separate contigs, f=masks adapters with X symbols but keeps reads intact. Default: true
Alignment Parameters
- minratio=0.31
- Minimum alignment score ratio for adapter detection. For 250 bp reads, the default gives approximately a 0.01% false-positive rate and a 94% true-positive rate.
- suspectratio=0.85
- Ratio threshold for suspect alignments that may be confirmed by nearby adapters or secondary alignment methods.
- usealtmsa=t
- Enable alternate multi-state alignment algorithm for improved sensitivity. Uses MultiStateAligner9PacBioAdapter2 as backup method.
- plusonly=f
- Only search for adapters in forward orientation. When true, disables reverse complement search.
- minusonly=f
- Only search for adapters in reverse complement orientation. When true, disables forward search.
Quality Control Parameters
- mincontig=50
- Minimum contig length to retain after splitting. Shorter sequences are discarded.
- reads=unlimited
- Maximum number of reads to process. Supports K/M/G suffixes (e.g., reads=1M).
- maxreads=unlimited
- Alias for reads parameter.
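The K/M/G suffix handling for reads/maxreads can be pictured with a short Python sketch. The function name parse_kmg, the decimal multipliers (K=1,000, M=1,000,000, G=1,000,000,000), and the -1 sentinel for "unlimited" are assumptions made for illustration; the tool's own parser is not reproduced here.

```python
def parse_kmg(value: str) -> int:
    """Convert a count such as '1M' or '500k' to an integer.

    Assumption: suffixes are decimal multipliers, and 'unlimited'
    maps to a hypothetical sentinel of -1.
    """
    multipliers = {"K": 10**3, "M": 10**6, "G": 10**9}
    text = value.strip().upper()
    if text == "UNLIMITED":
        return -1  # hypothetical sentinel meaning "no limit"
    if text and text[-1] in multipliers:
        # Allow fractional prefixes such as '1.5G'.
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(text)
```

Under these assumptions, reads=1M would limit processing to one million reads.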
Processing Parameters
- threads=auto
- Number of processing threads. Use 'auto' to detect available processors automatically.
- overwrite=t
- Allow overwriting of existing output files.
- append=f
- Append to existing output files instead of overwriting.
- verbose=f
- Print additional processing information and statistics.
Common Parser Parameters
- path=
- Set the path for temporary files and working directory.
- tempdir=
- Specify temporary directory for intermediate files.
- root=
- Root directory for relative file paths.
Examples
Basic Adapter Removal
removesmartbell.sh in=pacbio_reads.fq out=clean_reads.fq split=t
Removes Smart Bell adapters from PacBio reads, splitting reads at adapter locations.
Mask Adapters Instead of Splitting
removesmartbell.sh in=pacbio_reads.fq out=masked_reads.fq split=f
Masks adapter sequences with X symbols but keeps reads intact as single sequences.
Custom Adapter Sequence
removesmartbell.sh in=reads.fq out=clean.fq adapter=CUSTOMADAPTERSEQUENCE
Uses a custom adapter sequence instead of the default Smart Bell adapter.
High Sensitivity Processing
removesmartbell.sh in=reads.fq out=clean.fq minratio=0.25 usealtmsa=t
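Increases sensitivity by lowering the alignment threshold below the 0.31 default. The alternate alignment algorithm (usealtmsa=t) is already on by default and is shown here only to make the setting explicit.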
Paired-End Processing
removesmartbell.sh in=reads#.fq out=clean#.fq
Processes paired-end files reads1.fq and reads2.fq, outputting to clean1.fq and clean2.fq.
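The # notation used above can be illustrated with a small Python sketch. The helper name expand_pair is hypothetical, and the sketch assumes that only the first # in the name is substituted.

```python
from typing import Tuple


def expand_pair(path: str) -> Tuple[str, str]:
    """Expand 'reads#.fq' into ('reads1.fq', 'reads2.fq').

    Assumption: the '#' placeholder appears once and is replaced by 1 and 2.
    """
    if "#" not in path:
        raise ValueError("path does not contain a '#' placeholder")
    return path.replace("#", "1", 1), path.replace("#", "2", 1)
```

The same expansion applies to the output name, so clean#.fq maps to clean1.fq and clean2.fq.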
Algorithm Details
Alignment Strategy
RemoveSmartBell uses MultiStateAligner algorithms specifically designed for PacBio adapter detection:
- Primary Algorithm: MultiStateAligner9PacBioAdapter for local alignment scoring and adapter detection
- Backup Algorithm: MultiStateAligner9PacBioAdapter2 (when usealtmsa=true) for secondary scoring when primary alignments are ambiguous
- Dual-Direction Search: Searches for adapters in both forward and reverse complement orientations unless restricted by plusonly/minusonly parameters
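The dual-direction search can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the way plusonly and minusonly select query strands is inferred from the parameter descriptions rather than taken from the tool's source.

```python
from typing import List, Tuple

# Default adapter sequence as listed under Core Parameters.
ADAPTER = "ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT"

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def adapter_queries(adapter: str, plusonly: bool = False,
                    minusonly: bool = False) -> List[Tuple[str, str]]:
    """Return (strand, query) pairs to align against each read.

    Assumption: plusonly drops the reverse-complement query and
    minusonly drops the forward query.
    """
    queries = []
    if not minusonly:
        queries.append(("+", adapter))
    if not plusonly:
        queries.append(("-", reverse_complement(adapter)))
    return queries
```

With both flags left at their default of f, each read is searched with two queries, one per orientation.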
Scoring and Thresholds
The algorithm implements a multi-threshold scoring system based on Smith-Waterman alignment scores:
- Primary Threshold (minratio=0.31): minSwScore = maxSwScore * 0.31, where maxSwScore = msa.maxQuality(query_length). This threshold gives a false-positive rate of about 0.01% at 250 bp read length
- Suspect Threshold (suspectratio=0.85): minSwScoreSuspect = maxSwScore * min(0.31 * 0.85, 0.31 - ((1 - 0.85) * 0.2)), which evaluates to maxSwScore * 0.2635 with the defaults. Alignments scoring between the two thresholds are treated as intermediate-confidence (suspect) detections
- Locality-Based Confirmation: Suspect alignments within suspectDistance=100bp of confirmed adapters are accepted using lastConfirmed and lastSuspect position tracking
- Look-Ahead Validation: When array.length - stop > window, an additional msa.fillAndScoreLimited() call scores the downstream window-sized region
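The threshold arithmetic and the locality rule above can be worked through in a Python sketch. The function names, the returned labels, and the truncation of scores to integers are assumptions made for illustration; the look-ahead step and the backup aligner are left out.

```python
from typing import Optional, Tuple

MIN_RATIO = 0.31        # minratio default
SUSPECT_RATIO = 0.85    # suspectratio default
SUSPECT_DISTANCE = 100  # bp, as stated above


def thresholds(max_score: int, min_ratio: float = MIN_RATIO,
               suspect_ratio: float = SUSPECT_RATIO) -> Tuple[int, int]:
    """Return (min_score, suspect_score) for a given maximum score."""
    min_score = int(max_score * min_ratio)
    suspect_factor = min(min_ratio * suspect_ratio,
                         min_ratio - (1 - suspect_ratio) * 0.2)
    return min_score, int(max_score * suspect_factor)


def classify(score: int, position: int, last_confirmed: Optional[int],
             max_score: int) -> str:
    """Label one alignment hit (labels are hypothetical)."""
    min_score, suspect_score = thresholds(max_score)
    if score >= min_score:
        return "confirmed"
    if score >= suspect_score:
        near = (last_confirmed is not None
                and abs(position - last_confirmed) <= SUSPECT_DISTANCE)
        # A suspect hit close to a confirmed adapter is accepted.
        return "confirmed_by_locality" if near else "suspect"
    return "rejected"
```

With the default ratios, the first term of the min() is the smaller one (0.2635 versus 0.28), so the suspect threshold sits at roughly 26% of the maximum score.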
Processing Strategy
The algorithm processes reads using a sliding window approach:
- Window Size: window = (int)(query1.length*2.5f+10), i.e. 2.5 times the query length plus 10, truncated to an integer
- Stride: stride = (int)(query1.length*0.95f), i.e. 0.95 times the query length, so that consecutive windows overlap
- Padding: Reads are padded with npad=35 N bases on the sequence ends (via the npad() method) to handle boundary conditions
- Threading: A ProcessThread[] array with THREADS=Shared.LOGICAL_PROCESSORS provides parallel processing
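The window and stride formulas can be checked with a short Python sketch. The iteration scheme in window_spans (start at 0, advance by the stride, clip the window at the read end) is an assumption and not the tool's actual loop, and pad_read assumes the padding is applied to both ends. Python uses double precision where the Java expressions use float, so results could differ in rare borderline cases.

```python
from typing import List, Tuple

NPAD = 35  # number of N bases added at the sequence ends


def window_and_stride(query_len: int) -> Tuple[int, int]:
    """Mirror (int)(len*2.5f+10) and (int)(len*0.95f)."""
    window = int(query_len * 2.5 + 10)
    stride = int(query_len * 0.95)
    return window, stride


def pad_read(seq: str, npad: int = NPAD) -> str:
    """Pad a read with N bases on both ends (assumed behavior)."""
    return "N" * npad + seq + "N" * npad


def window_spans(read_len: int, query_len: int) -> List[Tuple[int, int]]:
    """List half-open (start, stop) windows over a read (assumed scheme)."""
    window, stride = window_and_stride(query_len)
    return [(start, min(start + window, read_len))
            for start in range(0, read_len, stride)]
```

For the 45 bp default adapter this gives a window of 122 and a stride of 42, so neighbouring windows share 80 bases and an adapter that straddles one window boundary falls entirely inside the next window.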
Output Modes
Two primary processing modes are available:
- Split Mode (split=t): Divides reads at adapter locations, creating multiple shorter contigs from each original read. Resulting sequences shorter than mincontig are discarded
- Mask Mode (split=f): Replaces adapter bases with X symbols, so each read stays intact and keeps its original length
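The difference between the two modes can be shown with a minimal Python sketch. It assumes adapter hits are already known as half-open (start, stop) intervals that do not overlap; the function names are hypothetical and quality strings and read naming are ignored.

```python
from typing import List, Tuple

Hit = Tuple[int, int]  # half-open (start, stop) adapter interval


def mask_adapters(seq: str, hits: List[Hit]) -> str:
    """Mask mode: overwrite adapter bases with X, keeping the length."""
    chars = list(seq)
    for start, stop in hits:
        chars[start:stop] = "X" * (stop - start)
    return "".join(chars)


def split_at_adapters(seq: str, hits: List[Hit],
                      mincontig: int = 50) -> List[str]:
    """Split mode: cut out adapters and drop pieces shorter than mincontig."""
    contigs, prev = [], 0
    for start, stop in sorted(hits):
        contigs.append(seq[prev:start])
        prev = stop
    contigs.append(seq[prev:])
    return [c for c in contigs if len(c) >= mincontig]
```

For a 90-base read with one adapter at positions 60 to 70, mask mode returns a 90-base read containing ten X symbols, while split mode keeps only the 60-base piece because the 20-base tail is below the default mincontig of 50.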
Performance Characteristics
Implementation characteristics based on source code analysis:
- Memory Usage: Base allocation z="-Xmx400m" with Shared.capBufferLen(20) buffer length capping
- Throughput: ConcurrentReadInputStream and ConcurrentReadOutputStream with configurable buffer sizes
- Accuracy: Tuned with MINIMUM_ALIGNMENT_SCORE_RATIO=0.31f and SUSPECT_RATIO=0.85F constants
- Scalability: ProcessThread.run() loop processes ListNum<Read> batches for concurrent execution
Statistics and Output
RemoveSmartBell reports the following statistics on adapter detection:
Processing Statistics
- Reads Processed: Total input reads and base count
- Good vs Bad Reads: Counts of reads without adapters (good) and with adapters (bad)
- Adapter Counts: Separate counts for forward and reverse complement adapters
- Adapter Density: Adapters per megabase for contamination assessment
- Output Summary: Final read and base counts after processing
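The adapter density figure is a simple ratio, sketched here in Python. The function name is hypothetical, and the sketch assumes that density is computed against the total number of input bases.

```python
def adapters_per_megabase(adapter_count: int, total_bases: int) -> float:
    """Adapters found per one million bases processed."""
    if total_bases == 0:
        return 0.0  # avoid division by zero on empty input
    return adapter_count * 1_000_000 / total_bases
```

For example, 25 adapters found in 5,000,000 bases corresponds to 5 adapters per megabase.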
Accuracy Metrics
For synthetic data with known adapter positions:
- True Positive: Correctly identified adapters
- False Positive: Incorrectly identified adapters
- True Negative: Correctly identified clean regions
- False Negative: Missed adapters
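The four counts above are enough to derive rates like those quoted for minratio. The Python sketch below uses the standard definitions of true-positive and false-positive rate; whether the tool prints these rates itself is not stated here, and the function name is hypothetical.

```python
from typing import Tuple


def detection_rates(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (true_positive_rate, false_positive_rate).

    TPR = TP / (TP + FN): fraction of real adapters that were found.
    FPR = FP / (FP + TN): fraction of clean regions wrongly flagged.
    """
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```

With 94 true positives and 6 false negatives the true-positive rate is 94%, and 1 false positive against 9,999 true negatives gives a false-positive rate of 0.01%.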
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org