FilterQC
FASTQ filtering pipeline that implements Rolling Quality Control (RQC) filtering for sequencing data. This wrapper script provides a simplified interface to the RQC filtering pipeline, which includes adapter trimming, contaminant removal, and quality control operations.
Basic Usage
filterqc.sh in=<file> out=<dir>
FilterQC processes a single FASTQ or FASTQ.gz file through the RQC quality control workflow and writes the filtered results to the specified directory.
Parameters
FilterQC accepts a minimal set of parameters to configure the filtering pipeline. The underlying Python pipeline implements the full RQC filtering workflow with automatic parameter selection.
Input/Output Parameters
- in=file
  The input FASTQ or FASTQ.gz file to process. This is the raw sequencing data that will undergo quality control filtering. Gzip-compressed input (.gz extension) is accepted.
- out=dir
  The output directory where filtered results and quality control reports are written. The directory is created if it does not exist. All output files from the filtering pipeline are placed in this directory.
- rqcfilterdata=dir
  Path to the RQCFilterData directory, which holds the contamination databases, adapter sequences, and quality control references used throughout the filtering process.
Optional Parameters
- qc
  Enable quality control reporting on the filtered output. When specified, the pipeline generates additional QC statistics and reports for the filtered data, which help evaluate its quality before downstream analysis.
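The parameters above combine into a single command line. The following is a minimal sketch, assuming a hypothetical helper named filterqc_cmd (it is not part of BBTools). It only prints a command for review and uses only the parameters documented above.

```shell
# filterqc_cmd is a hypothetical helper, not part of BBTools.
# It prints (does not run) a filterqc.sh command line built from:
#   $1  input FASTQ or FASTQ.gz file
#   $2  output directory
#   $3  RQCFilterData directory (optional, may be empty)
#   $4  the word "qc" to enable QC reporting (optional)
# The printed line is for display only; paths containing spaces would
# need quoting before the line could be pasted into a shell.
filterqc_cmd() {
    cmd="filterqc.sh in=$1 out=$2"
    if [ -n "${3:-}" ]; then
        cmd="$cmd rqcfilterdata=$3"
    fi
    if [ "${4:-}" = "qc" ]; then
        cmd="$cmd qc"
    fi
    printf '%s\n' "$cmd"
}
```

For example, filterqc_cmd sample.fastq results /path/to/RQCFilterData qc prints a command equivalent to the third example in the Examples section.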
Examples
Basic Filtering
filterqc.sh in=raw_reads.fastq.gz out=filtered_output
Processes a compressed FASTQ file through the RQC filtering pipeline, outputting filtered results to the 'filtered_output' directory.
Filtering with Custom RQC Data Path
filterqc.sh in=sample.fastq out=results rqcfilterdata=/path/to/RQCFilterData
Runs filtering with a custom path to the RQCFilterData directory containing the necessary reference databases and resources.
Filtering with Quality Control Reporting
filterqc.sh in=illumina_reads.fastq.gz out=qc_filtered rqcfilterdata=/data/RQCFilterData qc
Performs comprehensive filtering with additional quality control reporting enabled, generating detailed statistics and metrics for the filtered output data.
Pipeline Integration
# Process multiple samples
for sample in *.fastq.gz; do
    base=$(basename "$sample" .fastq.gz)
    filterqc.sh in="$sample" out="filtered_${base}" qc
done
Processes multiple FASTQ files in a batch through the FilterQC pipeline, with quality control reporting enabled for each sample.
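The loop above strips only the .fastq.gz suffix, so uncompressed inputs would keep their extension in the output directory name. The sketch below shows one way to derive a sample name from several common suffixes. The helper name sample_base and the set of suffixes it handles are assumptions made for illustration.

```shell
# sample_base is a hypothetical helper, not part of BBTools.
# It strips the directory and a .fastq, .fq, .fastq.gz, or .fq.gz
# suffix from the path in $1 and prints the remaining sample name.
sample_base() {
    name=$(basename "$1")
    name=${name%.gz}      # drop a trailing .gz, if present
    name=${name%.fastq}   # then drop .fastq ...
    name=${name%.fq}      # ... or .fq
    printf '%s\n' "$name"
}
```

Inside the batch loop, the output directory could then be written as out="filtered_$(sample_base "$sample")".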
Algorithm Details
RQC Filtering Pipeline
FilterQC provides a simplified interface to the Rolling Quality Control (RQC) filtering system, which coordinates several BBTools programs for sequencing data quality control. The underlying system performs the following filtering operations in sequence:
Core Filtering Operations
- Adapter Trimming: Removes sequencing adapters and primers from read ends using kmer-based detection
- Contaminant Removal: Filters out common contaminants including human DNA, microbial sequences, and synthetic constructs
- Quality Trimming: Removes low-quality bases from read ends based on quality scores
- Length Filtering: Removes reads that are too short after trimming operations
- Complexity Filtering: Removes low-complexity sequences and homopolymer runs
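To make one of these operations concrete, the sketch below implements length filtering alone with awk. It is not the pipeline's implementation, which delegates filtering to BBTools programs. The helper name length_filter is hypothetical, and the code assumes unwrapped FASTQ with exactly four lines per record.

```shell
# length_filter is a hypothetical, illustration-only helper.
# It reads FASTQ on stdin and writes to stdout only those records whose
# sequence has at least $1 bases. Assumes four lines per record.
length_filter() {
    awk -v min="$1" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            # $0 is the quality line; emit the record if it is long enough
            if (length(seq) >= min)
                print header "\n" seq "\n" plus "\n" $0
        }
    '
}
```

It can be applied to a stream, for instance gzip -dc reads.fastq.gz | length_filter 50, where the threshold of 50 is an arbitrary example value.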
Python Pipeline Architecture
The FilterQC script calls a Python-based pipeline (filter.py) located in the pytools directory. This pipeline coordinates multiple BBTools programs to achieve comprehensive filtering:
- SendSketch: Taxonomic classification and contamination detection
- Clumpify: Duplicate detection and read ordering optimization
- BBDuk: Adapter trimming and kmer-based filtering
- BBMap: Reference-based contamination removal
- BBMerge: Paired-end read merging for fragment library analysis
Quality Control Integration
When the 'qc' parameter is specified, the pipeline generates comprehensive quality control reports including:
- Read count statistics before and after each filtering step
- Quality score distributions and GC content analysis
- Adapter and contamination detection summaries
- Insert size distributions for paired-end libraries
- Taxonomic composition of removed sequences
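Read counts before and after filtering can also be checked independently of the generated reports. The following is a minimal sketch, assuming a hypothetical helper count_reads and unwrapped FASTQ with four lines per record. It is a sanity check, not a substitute for the pipeline's own statistics.

```shell
# count_reads is a hypothetical helper, not part of BBTools.
# It prints the number of records in the FASTQ file $1, decompressing
# on the fly when the name ends in .gz. Assumes four lines per record.
count_reads() {
    case "$1" in
        *.gz) lines=$(gzip -dc "$1" | wc -l) ;;
        *)    lines=$(wc -l < "$1") ;;
    esac
    echo $((lines / 4))
}
```

Running count_reads on the raw input and on the filtered FASTQ in the output directory gives the number of reads removed. The name of the filtered file depends on the pipeline and is not assumed here.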
Output Organization
The pipeline produces organized output in the specified directory:
- Filtered FASTQ: Clean reads ready for downstream analysis
- QC Reports: Statistical summaries and quality metrics
- Contamination Logs: Detailed information about removed sequences
- Processing Logs: Complete record of filtering operations performed
Performance Characteristics
FilterQC has the following performance characteristics:
- Memory Efficiency: Uses streaming algorithms to process large files with minimal memory footprint
- Multi-threading: Automatically utilizes available CPU cores for parallel processing
- Temporary File Management: Efficiently manages intermediate files during multi-step processing
- Compression Support: Native support for compressed input and output files
RQCFilterData Dependencies
The filtering pipeline requires access to the RQCFilterData directory containing:
- Reference Databases: Human, mouse, and microbial genome references for contamination detection
- Adapter Libraries: Comprehensive collections of sequencing adapters and primers
- Taxonomic Databases: Classification databases for contamination identification
- Quality Control Standards: Reference sequences and metrics for quality assessment
Integration with Larger Workflows
FilterQC fits into larger genomic analysis workflows in the following ways:
- Preprocessing Step: Typically used as the first step in genomic analysis pipelines
- Standardized Output: Produces consistently formatted output compatible with downstream tools
- Quality Metrics: Generates standardized quality reports for pipeline monitoring
- Resource Management: Compatible with cluster and cloud computing environments
Output Files
FilterQC generates multiple output files in the specified output directory:
Primary Output
- Filtered FASTQ: Clean sequencing reads after quality control filtering
- Processing Log: Detailed log of all filtering operations performed
Quality Control Reports (when qc enabled)
- Read Statistics: Counts and quality metrics before and after filtering
- Contamination Summary: Details of removed contaminant sequences
- Quality Distributions: Quality score and GC content histograms
- Adapter Detection: Summary of detected and removed adapters
Intermediate Files
- Temporary Processing Files: Cleaned up automatically unless errors occur
- Reference Mapping Results: Contamination detection mappings
- Taxonomic Classifications: Species identification results
Best Practices
Input Preparation
- File Format: Ensure input files are in valid FASTQ format with proper quality scores
- Compression: Use gzip compression for large files to reduce I/O overhead
- File Naming: Use descriptive filenames that will be preserved in output directories
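A basic structural check can catch truncated or malformed input before a long run. The sketch below is one possible check, assuming a hypothetical helper validate_fastq and unwrapped four-line records. It does not verify that quality characters fall in a valid range.

```shell
# validate_fastq is a hypothetical helper, not part of BBTools.
# It returns 0 if the file $1 passes these checks and 1 otherwise:
#   - the line count is a non-zero multiple of four
#   - every record starts with '@' and has a '+' separator line
#   - sequence and quality strings have equal length
validate_fastq() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1; exit }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1; exit }
        NR % 4 == 0 && length($0) != seqlen    { bad = 1; exit }
        END { if (bad || NR == 0 || NR % 4 != 0) exit 1 }
    '
}
```

A typical use is validate_fastq raw_reads.fastq.gz || echo "input looks malformed" before calling filterqc.sh.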
Resource Configuration
- RQCFilterData: Ensure the RQCFilterData directory is properly configured and accessible
- Disk Space: Allocate sufficient disk space for both input and output files
- Memory: Provide adequate memory for processing large sequencing files
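The disk space recommendation can be turned into a check that runs before filtering starts. The following is a sketch using the POSIX df and du utilities. The helper names has_free_kb and file_kb are hypothetical, and how much space a given run needs is not stated in this document, so any threshold is the caller's estimate.

```shell
# has_free_kb and file_kb are hypothetical helpers, not part of BBTools.

# has_free_kb succeeds if the filesystem holding directory $1 has at
# least $2 KiB available, according to df in POSIX output format.
has_free_kb() {
    avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# file_kb prints the disk usage of file $1 in KiB as reported by du.
file_kb() {
    du -k "$1" | awk '{ print $1 }'
}
```

For instance, has_free_kb . $(( $(file_kb raw_reads.fastq.gz) * 5 )) tests for five times the input size in the current directory. The factor of 5 is an arbitrary placeholder, not a measured requirement.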
Quality Control
- Enable QC Reporting: Always use the 'qc' parameter for comprehensive quality assessment
- Review Output: Examine quality control reports before proceeding to downstream analysis
- Retention of Raw Data: Preserve original files until filtered results are validated
Pipeline Integration
- Batch Processing: Use shell loops or workflow management systems for multiple samples
- Error Handling: Implement proper error checking and logging in automated workflows
- Resource Monitoring: Monitor system resources during large-scale processing
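For the error handling point, the sketch below wraps any command so that its exit status is recorded and passed back to the caller. The helper name run_logged and the log line format are assumptions made for illustration. The sketch also assumes that filterqc.sh reports failure through a non-zero exit status, which this document does not confirm.

```shell
# run_logged is a hypothetical helper, not part of BBTools.
# It runs the command given after the log file name, appends a
# timestamped OK or FAIL line to the log file $1, and returns the
# command's exit status so the caller can stop or continue.
run_logged() {
    log_file=$1
    shift
    cmd_status=0
    "$@" || cmd_status=$?
    stamp=$(date '+%Y-%m-%dT%H:%M:%S')
    if [ "$cmd_status" -eq 0 ]; then
        printf '%s OK %s\n' "$stamp" "$*" >> "$log_file"
    else
        printf '%s FAIL(%s) %s\n' "$stamp" "$cmd_status" "$*" >> "$log_file"
    fi
    return "$cmd_status"
}
```

In a batch loop this could be used as run_logged batch.log filterqc.sh in="$sample" out="filtered_${base}" qc || continue, so that one failed sample is logged without stopping the remaining samples.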
Troubleshooting
Common Issues
- File Not Found Error
  Ensure the input FASTQ file exists and is accessible. Check file permissions and path specifications.
- RQCFilterData Path Error
  Verify the RQCFilterData directory exists and contains the required reference databases and adapter libraries.
- Insufficient Disk Space
  Ensure adequate disk space is available for both temporary processing files and final output.
- Python Pipeline Errors
  Check that Python is properly installed and the pytools directory is accessible with the filter.py script.
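Several of the issues above can be checked in one pass before starting a run. The following is a minimal sketch, assuming a hypothetical helper check_prereqs. The location of filter.py depends on the installation, so it is passed in as an argument, and the interpreter name (python3 or python) that the pipeline expects is also an assumption.

```shell
# check_prereqs is a hypothetical helper, not part of BBTools.
#   $1  input FASTQ file
#   $2  RQCFilterData directory
#   $3  path to filter.py (installation-specific, supplied by the caller)
# It prints one line per problem and returns the number of problems.
check_prereqs() {
    problems=0
    if [ ! -r "$1" ]; then
        echo "input not readable: $1"
        problems=$((problems + 1))
    fi
    if [ ! -d "$2" ]; then
        echo "RQCFilterData directory not found: $2"
        problems=$((problems + 1))
    fi
    if [ ! -f "$3" ]; then
        echo "filter.py not found: $3"
        problems=$((problems + 1))
    fi
    # Assumption: either interpreter name may be acceptable
    if ! command -v python3 >/dev/null 2>&1 &&
       ! command -v python >/dev/null 2>&1; then
        echo "no python interpreter found on PATH"
        problems=$((problems + 1))
    fi
    return "$problems"
}
```

A return value of 0 means none of the checked problems were found. It does not guarantee that the RQCFilterData directory contains every required resource.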
Performance Optimization
- Memory Allocation: Increase available memory for large files using Java memory flags
- Temporary Directory: Use fast local storage for temporary files during processing
- Thread Configuration: Optimize thread usage based on available CPU cores
Related Tools
FilterQC works in conjunction with other BBTools programs:
- rqcfilter.sh/rqcfilter2.sh: Full-featured RQC filtering with extensive parameter control
- bbduk.sh: Direct adapter trimming and kmer-based filtering
- bbmap.sh: Reference-based contamination removal
- sendsketch.sh: Taxonomic classification and contamination detection
- clumpify.sh: Read deduplication and optimization
Contact and Support
For questions and support regarding FilterQC:
- Tool Author: Shijie Yao at syao@lbl.gov
- BBTools Support: bbushnell@lbl.gov
- Documentation: bbmap.org