ReadLength
Generates a length histogram of input reads together with summary statistics: read count, base count, median (via Tools.percentileHistogram()), mode (via Tools.calcModeHistogram()), and standard deviation (via Tools.standardDeviationHistogram()).
Basic Usage
readlength.sh in=<input file>
This tool reads sequence files through ConcurrentReadInputStream and reports length distribution statistics and a histogram. It handles both single-end and paired-end reads, tracking read counts and base counts in parallel arrays from which the cumulative totals and percentages are derived.
Parameters
Parameters control input/output files, histogram binning, and output formatting options.
Input/Output Parameters
- in=<file>
- Input sequence file (FASTA/FASTQ). The 'in=' flag is needed only if the input file is not the first parameter. 'in=stdin.fq' reads from standard input.
- in2=<file>
- Second input file for paired-end reads. Use this if the second reads of the pairs are in a separate file.
- out=<file>
- Output file for the histogram and statistics. Default is stdout (console output).
Histogram Parameters
- bin=10
- Histogram bin size. Sets the interval size for grouping read lengths. Default: 10 bases per bin.
- max=80000
- Maximum read length to track. Reads longer than this will be placed in the final bin. Default: 80,000 bases.
- round=f
- Rounding mode for bin assignment. When true, places each read in the closest bin. When false (default), places each read in the bin at or below its length, that is, integer division of the read length by the bin size.
Output Control Parameters
- nzo=t
- (nonzeroonly) Controls empty bin display. When true (default), does not print bins containing zero reads. When false, prints all bins including empty ones.
- reads=-1
- Read limit for processing. If nonnegative, stops processing after this many reads. Default: -1 (process all reads).
File Handling Parameters
- append=f
- Append to output file instead of overwriting. Default: false.
- overwrite=t
- Overwrite existing output files. Default: true.
Examples
Basic Length Histogram
readlength.sh in=reads.fastq out=length_histogram.txt
Generates a length histogram for single-end reads with default 10-base bins.
Paired-End Analysis
readlength.sh in=reads_R1.fastq in2=reads_R2.fastq out=paired_lengths.txt
Analyzes both forward and reverse reads from paired-end sequencing data.
Custom Binning
readlength.sh in=long_reads.fastq bin=50 max=10000 round=t out=custom_hist.txt
Uses 50-base bins, a maximum tracked length of 10,000 bases, and closest-bin rounding.
Include Empty Bins
readlength.sh in=reads.fastq nzo=f out=complete_histogram.txt
Shows all histogram bins including those with zero reads for complete distribution visualization.
Limited Read Processing
readlength.sh in=large_dataset.fastq reads=100000 out=sample_lengths.txt
Analyzes only the first 100,000 reads for quick assessment of read length distribution.
Output Format
The tool generates statistical summaries followed by a detailed histogram table:
Statistics Header
- Reads: Total number of reads processed
- Bases: Total number of bases across all reads
- Max: Length of the longest read
- Min: Length of the shortest read
- Avg: Average read length
- Median: Median read length (50th percentile)
- Mode: Most common read length
- Std_Dev: Standard deviation of read lengths
Histogram Table Columns
- Length: Bin label in bases (the read length itself when bin=1)
- reads: Number of reads in this length category
- pct_reads: Percentage of total reads
- cum_reads: Cumulative read count
- cum_pct_reads: Cumulative percentage of reads
- bases: Total bases in reads of this length
- pct_bases: Percentage of total bases
- cum_bases: Cumulative base count
- cum_pct_bases: Cumulative percentage of bases
Algorithm Details
The readlength tool implements histogram-based length analysis using MakeLengthHistogram.calc():
Data Structure Implementation
- Dual Array System: Uses separate long[] readHist and long[] baseHist arrays, each with max+1 entries, to hold read counts and base counts per bin
- Bin Assignment Algorithm: Implements y = Tools.min(max, ((ROUND_BINS ? x+MULT/2 : x))/MULT), where x is the read length, MULT is the bin size, max is the highest bin index, and y is the bin the read is counted in
- Memory Allocation: Pre-allocates the histogram arrays from the MAX_LENGTH parameter (default 80000), so no dynamic resizing is needed
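The two arrays and the bin expression above can be illustrated with a short Python sketch. It is not the tool's Java code: the function name build_histograms is hypothetical, and the sketch assumes that the highest bin index is MAX_LENGTH divided by the bin size and that each read contributes its full length to the base array.

```python
def build_histograms(lengths, max_length=80000, bin_size=10, round_bins=False):
    """Accumulate per-bin read counts and base counts.

    Mirrors y = min(max, (x + MULT/2 if ROUND_BINS else x) / MULT)
    using integer division.
    """
    max_bin = max_length // bin_size       # assumption: highest bin index
    read_hist = [0] * (max_bin + 1)        # reads per bin
    base_hist = [0] * (max_bin + 1)        # bases per bin
    for x in lengths:
        shifted = x + bin_size // 2 if round_bins else x
        y = min(max_bin, shifted // bin_size)  # over-long reads land in the last bin
        read_hist[y] += 1
        base_hist[y] += x                  # assumption: full read length is added
    return read_hist, base_hist


# Small made-up input: the 250-base read exceeds max_length=200.
read_hist, base_hist = build_histograms(
    [95, 100, 104, 150, 250], max_length=200, bin_size=10
)
```

In this example the 250-base read is clamped into the final bin (index 20), which matches the documented behavior of max: longer reads are counted in the last bin rather than dropped.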
Statistical Calculation Methods
- Median Calculation: Uses Tools.percentileHistogram(readHist, 0.5)*MULT for histogram-based median calculation
- Mode Detection: Applies Tools.calcModeHistogram(readHist)*MULT to identify peak frequency bins
- Standard Deviation: Calculates Tools.standardDeviationHistogram(readHist)*MULT using histogram-based variance formula
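As a rough illustration of how these three statistics can be read off a histogram without keeping individual read lengths, here is a minimal Python sketch. The function names are hypothetical stand-ins, not the Tools methods themselves, and details such as tie-breaking for the mode or the exact percentile rule are assumptions that may differ from the Java implementation.

```python
import math


def percentile_histogram(hist, fraction):
    """First bin index at which the running count reaches fraction of the total."""
    target = sum(hist) * fraction
    running = 0
    for i, count in enumerate(hist):
        running += count
        if running >= target:
            return i
    return len(hist) - 1


def mode_histogram(hist):
    """Bin index with the highest count (first one wins on ties)."""
    return max(range(len(hist)), key=lambda i: hist[i])


def stdev_histogram(hist):
    """Population standard deviation of bin indices, weighted by count."""
    total = sum(hist)
    if total == 0:
        return 0.0
    mean = sum(i * c for i, c in enumerate(hist)) / total
    variance = sum(c * (i - mean) ** 2 for i, c in enumerate(hist)) / total
    return math.sqrt(variance)


bin_size = 10                      # plays the role of MULT
hist = [0, 2, 5, 2, 1]             # made-up counts for bins 0..4
median = percentile_histogram(hist, 0.5) * bin_size
mode = mode_histogram(hist) * bin_size
std_dev = stdev_histogram(hist) * bin_size
```

Because every statistic is computed on bin indices and then multiplied by the bin size, the results are only as precise as the binning: with bin=10 the median and mode are always multiples of 10.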
Cumulative Array Processing
- Reverse Accumulation: Builds readHistC[i-1]=readHistC[i]+readHist[i-1] and baseHistC[i-1]=baseHistC[i]+baseHist[i-1], iterating from max down to 0, so each cumulative entry covers that bin and all longer bins
- Percentage Conversion: Computes readHistCF[i]=readHistC[i]*100d/readHistC[0] and baseHistCF[i]=baseHistC[i]*100d/baseHistC[0]
- Dual Perspective Statistics: Provides both read-based and base-based cumulative distributions
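The recurrences above can be sketched in Python as follows. This is an illustration under assumptions, not the tool's code: the function names are hypothetical, and seeding the last cumulative entry with the last histogram entry is assumed because the text does not state the starting value.

```python
def reverse_cumulative(hist):
    """cum[i] = sum of hist[i:], built from the last bin down to bin 0."""
    cum = [0] * len(hist)
    if not hist:
        return cum
    cum[-1] = hist[-1]                      # assumption: seed with the last bin
    for i in range(len(hist) - 1, 0, -1):
        cum[i - 1] = cum[i] + hist[i - 1]
    return cum


def to_percent(cum):
    """Express each cumulative entry as a percentage of cum[0] (the grand total)."""
    total = cum[0] if cum else 0
    if total == 0:
        return [0.0] * len(cum)
    return [c * 100.0 / total for c in cum]


# Made-up histograms for four bins.
read_hist = [1, 2, 0, 1]
base_hist = [5, 30, 0, 35]

read_cum = reverse_cumulative(read_hist)    # [4, 3, 1, 1]
base_cum = reverse_cumulative(base_hist)    # [70, 65, 35, 35]
read_pct = to_percent(read_cum)             # [100.0, 75.0, 25.0, 25.0]
base_pct = to_percent(base_cum)
```

Read this way, a cumulative entry answers "how many reads (or bases) fall in this bin or a longer one", which is why the read-based and base-based percentages can differ noticeably when a few long reads carry most of the bases.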
Performance Implementation
- Memory Usage: Histogram memory grows linearly with the MAX_LENGTH parameter; the launcher script requests a 400 MB JVM heap (z="-Xmx400m")
- Time Complexity: O(n) read processing with constant-time histogram updates via array indexing
- Stream Processing: Uses ConcurrentReadInputStream with ListNum<Read> for sequential read processing without individual read storage
- Paired-End Handling: Counts each read and its mate (via r1.mate) in a single pass through the read pairs
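To put the linear memory claim in numbers, the following Python sketch estimates the histogram footprint. It rests on several assumptions that the text does not confirm: six arrays in total (readHist, baseHist, readHistC, baseHistC, readHistCF, baseHistCF), 8 bytes per entry, and an array length of MAX_LENGTH divided by the bin size, plus one. JVM object headers and read buffers are ignored.

```python
def histogram_memory_bytes(max_length=80000, bin_size=10,
                           num_arrays=6, bytes_per_entry=8):
    """Rough footprint of the histogram arrays, ignoring JVM overhead."""
    entries = max_length // bin_size + 1    # assumption: one entry per bin
    return entries * num_arrays * bytes_per_entry


default_bytes = histogram_memory_bytes()             # bin=10, max=80000
unbinned_bytes = histogram_memory_bytes(bin_size=1)  # bin=1,  max=80000

print(f"default settings: {default_bytes / 1e6:.2f} MB")
print(f"bin=1:            {unbinned_bytes / 1e6:.2f} MB")
```

Under these assumptions the arrays occupy well under 1 MB at the defaults and only a few MB with bin=1, so the 400 MB heap leaves ample room unless max is raised by several orders of magnitude.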
Binning Algorithm Implementation
The tool supports two binning modes controlled by ROUND_BINS boolean:
- Floor Mode (ROUND_BINS=false): bin_index = length / MULT for floor-based assignment
- Round Mode (ROUND_BINS=true): bin_index = (length + MULT/2) / MULT for nearest-bin assignment
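Floor mode groups each read with the bin at or below its length, so a bin labeled L holds lengths L through L+MULT-1. Round mode centers each bin on its label, so the same bin holds lengths from L-MULT/2 through L+MULT/2-1 (for even MULT). Choose whichever convention suits the analysis.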
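The difference between the two modes is easiest to see on lengths near a bin boundary. The Python sketch below applies both formulas with a bin size of 10. The helper bin_label is hypothetical, and it assumes the reported label is the bin index multiplied by the bin size; the clamp to the maximum bin is omitted for brevity.

```python
def bin_label(length, bin_size=10, round_bins=False):
    """Label of the bin a read falls into: bin index times bin size."""
    shifted = length + bin_size // 2 if round_bins else length
    return (shifted // bin_size) * bin_size


# Compare the two modes for lengths around the 100-base boundary.
lengths = (94, 95, 99, 100, 104, 105)
floor_labels = [bin_label(x) for x in lengths]
round_labels = [bin_label(x, round_bins=True) for x in lengths]

for x, f, r in zip(lengths, floor_labels, round_labels):
    print(f"length={x:>3}  floor bin={f:>3}  round bin={r:>3}")
```

With these inputs, floor mode assigns 95 and 99 to bin 90, while round mode assigns them to bin 100; likewise 105 stays in bin 100 under floor mode but moves to bin 110 under round mode.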
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org