ReadQC
Read QC pipeline for quality assessment of fastq files. Provides HTML-formatted reports with quality metrics and visualization.
Basic Usage
readqc.sh in=<file> out=<dir>
This tool generates quality control reports for sequencing data in fastq format. It processes both uncompressed and gzipped fastq files, producing HTML reports with quality metrics and visualizations.
Parameters
ReadQC has a simple parameter structure focused on input specification and output directory configuration. The tool automatically generates HTML reports and skips BLAST analysis for faster processing.
Input/Output Parameters
- in=file
  Required. The input fastq or fastq.gz file. Both compressed (.gz) and uncompressed fastq files are accepted.
- out=dir
  Required. The output directory where quality control reports are written. The directory is created if it doesn't exist, and the HTML reports and associated files are placed in it.
Examples
Basic Quality Control
readqc.sh in=sample.fastq out=qc_results
Performs quality control analysis on sample.fastq and generates HTML reports in the qc_results directory.
Compressed Input File
readqc.sh in=reads.fastq.gz out=quality_reports
Processes a gzipped fastq file and creates quality assessment reports in the quality_reports directory.
Multiple Sample Analysis
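# Process each gzipped fastq file into its own output directory
for sample in *.fastq.gz; do
    # Strip the extension; quoting keeps names with spaces intact
    base=$(basename "$sample" .fastq.gz)
    readqc.sh in="$sample" out="qc_${base}"
done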
Example shell loop to process multiple fastq files, creating separate output directories for each sample.
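The loop above only matches names ending in .fastq.gz. As a minimal sketch, assuming the same qc_<name> naming scheme, a small helper can derive the output directory for any of the extensions listed under File Format Requirements. The name qc_dir_for is hypothetical and not part of ReadQC.

```shell
# qc_dir_for is a hypothetical helper, not part of ReadQC.
# It strips a known fastq extension and prefixes the result with "qc_".
qc_dir_for() {
    qc_name=${1##*/}                      # drop any leading directory
    for qc_ext in .fastq.gz .fq.gz .fastq .fq; do
        case "$qc_name" in
            *"$qc_ext")
                qc_name=${qc_name%"$qc_ext"}
                break
                ;;
        esac
    done
    printf 'qc_%s\n' "$qc_name"
}

# Usage sketch (needs readqc.sh on the PATH):
# for sample in *.fastq *.fq *.fastq.gz *.fq.gz; do
#     [ -e "$sample" ] || continue        # skip patterns that matched nothing
#     readqc.sh in="$sample" out="$(qc_dir_for "$sample")"
# done
```

Longer extensions are tried first so that a name such as reads.fastq.gz loses the whole suffix rather than only ".gz".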
Algorithm Details
Quality Assessment Pipeline
ReadQC implements a quality control pipeline for fastq files. The tool performs the following analyses:
- Sequence Quality Analysis: Evaluates per-base and per-read quality using Phred scores
- Base Composition Analysis: Analyzes nucleotide composition and detects potential bias or contamination
- Length Distribution: Calculates read length statistics and distribution patterns
- Quality Score Distribution: Provides detailed quality score histograms and statistics
- HTML Report Generation: Creates interactive HTML reports with charts and visualizations
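To make the metrics above concrete, a few of them (read count, total bases, mean length, GC content) can be approximated independently with awk. This is an illustrative sketch, not the code ReadQC runs; fastq_summary is a hypothetical name, and the sketch assumes strict 4-line records on standard input.

```shell
# fastq_summary is a hypothetical helper, not part of ReadQC.
# Reads 4-line fastq records on stdin and prints, tab-separated:
#   reads, total bases, mean read length, GC percent
fastq_summary() {
    awk '
        NR % 4 == 2 {                      # sequence line of each record
            reads++
            bases += length($0)
            gc += gsub(/[GCgc]/, "")       # gsub returns the number of G/C characters
        }
        END {
            if (reads == 0 || bases == 0) { print "0\t0\t0.00\t0.00"; exit }
            printf "%d\t%d\t%.2f\t%.2f\n", reads, bases, bases / reads, 100 * gc / bases
        }
    '
}

# Usage sketch:
# fastq_summary < sample.fastq
# gzip -dc reads.fastq.gz | fastq_summary
```

Numbers from such a quick check are only a rough cross-reference for the summary statistics in the HTML report.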
Processing Strategy
The readqc.py backend processes fastq files as follows:
- Streaming Processing: Handles large files without loading entire datasets into memory
- Compression Support: Native support for gzipped files without requiring decompression
- Skip BLAST Analysis: By default skips time-consuming BLAST searches (--skip-blast flag)
- HTML Output: Generates HTML reports with embedded visualizations
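The streaming and compression points above can be illustrated outside ReadQC with a short pipeline that never writes a decompressed copy to disk. This is a sketch under the assumption of 4-line records; fastq_cat and count_reads are hypothetical names, not ReadQC functions.

```shell
# Hypothetical helpers, not part of ReadQC.

# Stream a fastq file to stdout, decompressing gzip input on the fly.
fastq_cat() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac
}

# Count reads by streaming: one record per four lines.
# awk keeps only a line counter, so memory use does not grow with file size.
count_reads() {
    fastq_cat "$1" | awk 'END { print int(NR / 4) }'
}

# Usage sketch:
# count_reads sample.fastq
# count_reads reads.fastq.gz
```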
Report Contents
The generated HTML reports typically include:
- Summary statistics (total reads, total bases, average length)
- Quality score distributions and per-base quality plots
- Base composition analysis and GC content
- Read length distribution histograms
- Quality filtering recommendations
Output Files
ReadQC generates several output files in the specified output directory:
- HTML Report: Main quality control report with interactive visualizations
- Supporting Files: CSS, JavaScript, and image files for the HTML report
- Data Files: Tab-delimited files containing raw statistics and metrics
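The exact file names are not specified here, so the following sketch only assumes that the reports end in .html. The helper name list_html_reports is hypothetical.

```shell
# list_html_reports is a hypothetical helper, not part of ReadQC.
# Lists every .html file below the given output directory, sorted by path.
list_html_reports() {
    find "$1" -type f -name '*.html' | sort
}

# Usage sketch:
# list_html_reports qc_results
```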
Performance Considerations
Memory Usage
ReadQC uses a streaming approach for memory management:
- Sequential record processing minimizes memory footprint
- Memory usage typically scales with read complexity, not file size
- Suitable for processing files ranging from small test datasets to large-scale sequencing runs
Processing Time
Processing time depends on several factors:
- File size and number of reads
- Compression (gzipped files require additional decompression time)
- I/O performance of storage system
- BLAST analysis, which is skipped by default to shorten processing time
Technical Notes
File Format Requirements
- Supports standard fastq format (both .fastq and .fq extensions)
- Handles gzip-compressed files (.fastq.gz, .fq.gz)
- Expects standard 4-line fastq format: header, sequence, separator, quality
- Compatible with various quality score encodings (Phred+33, Phred+64)
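When it is unclear which encoding a file uses, a common heuristic is to look at the range of ASCII codes on the quality lines. The sketch below assumes Illumina-style score ranges (codes below 59 occur only with Phred+33, codes above 74 are taken to indicate Phred+64) and is not how ReadQC itself decides; guess_phred_offset is a hypothetical name.

```shell
# guess_phred_offset is a hypothetical helper, not part of ReadQC.
# Reads 4-line fastq records on stdin and prints 33, 64, ambiguous, or unknown.
guess_phred_offset() {
    awk 'NR % 4 == 0' |                    # keep quality lines only
    LC_ALL=C od -An -v -tu1 |              # one decimal byte value per field
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i < 33) continue      # skip newlines and other control bytes
                if (min == "" || $i < min) min = $i
                if ($i > max) max = $i
            }
        }
        END {
            if (min == "")      print "unknown"     # no quality characters seen
            else if (min < 59)  print "33"
            else if (max > 74)  print "64"
            else                print "ambiguous"   # range fits both encodings
        }
    '
}

# Usage sketch:
# gzip -dc reads.fastq.gz | head -n 40000 | guess_phred_offset
```

Sampling only the first reads keeps the check fast, at the cost of a higher chance of an ambiguous answer.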
Error Handling
- Validates input file existence before processing
- Creates output directory if it doesn't exist
- Provides clear error messages for missing or invalid inputs
- Handles malformed fastq records gracefully
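For pipelines that should stop early on bad input, a structural check can be run before ReadQC. The following sketch is independent of ReadQC's own error handling and assumes the strict 4-line layout described under File Format Requirements; validate_fastq is a hypothetical name.

```shell
# validate_fastq is a hypothetical helper, not part of ReadQC.
# Reads fastq on stdin; returns 0 if every record is well formed, 1 otherwise.
# The first problem found is reported on stderr.
validate_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            printf "line %d: header does not start with @\n", NR
            bad = 1; exit
        }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            printf "line %d: separator does not start with +\n", NR
            bad = 1; exit
        }
        NR % 4 == 0 && length($0) != seqlen {
            printf "line %d: quality length differs from sequence length\n", NR
            bad = 1; exit
        }
        END {
            if (!bad && NR % 4 != 0) {
                print "input ends in the middle of a record"
                bad = 1
            }
            exit (bad ? 1 : 0)
        }
    ' >&2
}

# Usage sketch (needs readqc.sh on the PATH):
# validate_fastq < sample.fastq && readqc.sh in=sample.fastq out=qc_results
```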
Dependencies
ReadQC requires:
- Python environment with necessary libraries for fastq parsing and HTML generation
- Access to the pytools directory containing readqc.py
- Sufficient disk space for output files (typically 1-10% of input file size)
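The disk-space guideline above can be turned into a rough pre-run check. This sketch takes the upper end of the stated range (10% of the input size) as its estimate; both function names are hypothetical, and the output directory passed to has_room_for_reports must already exist (for a new directory, pass its parent).

```shell
# Hypothetical helpers, not part of ReadQC.

# Print the estimated report size in KiB (10% of $1 bytes, rounded up).
report_kb_estimate() {
    echo $(( ($1 + 10239) / 10240 ))
}

# Succeed if the filesystem holding existing directory $2 has enough free
# space for the estimated reports of input file $1.
has_room_for_reports() {
    hr_bytes=$(wc -c < "$1") || return 1
    hr_need=$(report_kb_estimate "$hr_bytes")
    hr_avail=$(df -Pk "$2" | awk 'NR == 2 { print $4 }')   # "Available" column, in KiB
    [ "$hr_avail" -ge "$hr_need" ]
}

# Usage sketch:
# has_room_for_reports reads.fastq.gz . || echo "warning: low disk space" >&2
```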
Author Information
Written by: Shijie Yao
Last Modified: March 22, 2018
Contact: For specific questions about ReadQC, contact Shijie Yao at syao@lbl.gov
Support
For questions and support:
- ReadQC-specific questions: syao@lbl.gov
- General BBTools support: bbushnell@lbl.gov
- Documentation: bbmap.org