ReadLength

Script: readlength.sh
Package: jgi
Class: MakeLengthHistogram.java

Generates a length histogram of input reads and reports summary statistics: read count, base count, median (via Tools.percentileHistogram()), mode (via Tools.calcModeHistogram()), and standard deviation (via Tools.standardDeviationHistogram()).

Basic Usage

readlength.sh in=<input file>

This tool reads sequence files through ConcurrentReadInputStream and produces length distribution statistics and histograms. It handles both single-end and paired-end reads, and reports cumulative counts and percentages for reads and for bases, which it tracks in two parallel arrays.

Parameters

Parameters control input/output files, histogram binning, and output formatting options.

Input/Output Parameters

in=<file>: Input sequence file (FASTA/FASTQ). The 'in=' flag is needed only if the input file is not the first parameter. 'in=stdin.fq' reads from standard input.
in2=<file>: Second input file for paired-end reads. Use this if the second reads of the pairs are in a separate file.
out=<file>: Output file for the histogram and statistics. Default is stdout (console output).

Histogram Parameters

bin=10: Histogram bin size. Sets the interval size for grouping read lengths. Default: 10 bases per bin.
max=80000: Maximum read length to track. Reads longer than this will be placed in the final bin. Default: 80,000 bases.
round=f: Rounding mode for bin assignment. When true, places each read in the closest bin. When false (default), rounds down, placing each read in the bin given by integer division of its length by the bin size.

Output Control Parameters

nzo=t: (nonzeroonly) Controls empty bin display. When true (default), does not print bins containing zero reads. When false, prints all bins including empty ones.
reads=-1: Read limit for processing. If nonnegative, stops processing after this many reads. Default: -1 (process all reads).

File Handling Parameters

append=f: Append to output file instead of overwriting. Default: false.
overwrite=t: Overwrite existing output files. Default: true.

Examples

Basic Length Histogram

readlength.sh in=reads.fastq out=length_histogram.txt

Generates a length histogram for single-end reads with default 10-base bins.

Paired-End Analysis

readlength.sh in=reads_R1.fastq in2=reads_R2.fastq out=paired_lengths.txt

Analyzes both forward and reverse reads from paired-end sequencing data.

Custom Binning

readlength.sh in=long_reads.fastq bin=50 max=10000 round=t out=custom_hist.txt

Uses 50-base bins, tracks read lengths up to 10,000 bases, and assigns each read to the closest bin.

Include Empty Bins

readlength.sh in=reads.fastq nzo=f out=complete_histogram.txt

Shows all histogram bins including those with zero reads for complete distribution visualization.

Limited Read Processing

readlength.sh in=large_dataset.fastq reads=100000 out=sample_lengths.txt

Analyzes only the first 100,000 reads for quick assessment of read length distribution.

Output Format

The tool generates statistical summaries followed by a detailed histogram table:

Statistics Header

Reads: Total number of reads processed
Bases: Total number of bases across all reads
Max: Length of the longest read
Min: Length of the shortest read
Avg: Average read length
Median: Median read length (50th percentile)
Mode: Most common read length
Std_Dev: Standard deviation of read lengths

Histogram Table Columns

Length: Read length, or the length that labels the bin when reads are binned
reads: Number of reads in this length category
pct_reads: Percentage of total reads
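cum_reads: Cumulative read count, accumulated from the longest bin down (reads in this bin or any longer one)
cum_pct_reads: cum_reads as a percentage of total reads
bases: Total bases in reads of this length
pct_bases: Percentage of total bases
cum_bases: Cumulative base count, accumulated from the longest bin down (bases in this bin or any longer one)
cum_pct_bases: cum_bases as a percentage of total bases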

Algorithm Details

The readlength tool implements histogram-based length analysis using MakeLengthHistogram.calc():

Data Structure Implementation

Dual Array System: Uses separate long[] readHist and long[] baseHist arrays (max+1 size) for read counts and base counts at each length
Bin Assignment Algorithm: Implements y = Tools.min(max, ((ROUND_BINS ? x+MULT/2 : x))/MULT) for configurable binning
Memory Allocation: Pre-allocates histogram arrays based on MAX_LENGTH parameter (default 80000), preventing dynamic resizing
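The array layout and bin formula above can be sketched in Python. This is an illustrative sketch rather than the BBTools source: the names bin_index and build_histograms are hypothetical, Python lists stand in for the Java long[] arrays, and the cap is applied to the bin index exactly as the formula is written. Crediting each read's full length to baseHist is an assumption about how base counts are accumulated.

```python
def bin_index(length, mult=10, max_bin=80000, round_bins=False):
    """Mirror of y = min(max, (ROUND_BINS ? x + MULT/2 : x) / MULT)."""
    shifted = length + mult // 2 if round_bins else length
    return min(max_bin, shifted // mult)  # // matches Java int division for x >= 0


def build_histograms(lengths, mult=10, max_bin=80000, round_bins=False):
    """Accumulate read counts and base counts per bin in two parallel arrays."""
    read_hist = [0] * (max_bin + 1)  # stands in for long[] readHist
    base_hist = [0] * (max_bin + 1)  # stands in for long[] baseHist
    for length in lengths:
        y = bin_index(length, mult, max_bin, round_bins)
        read_hist[y] += 1       # one more read in this bin
        base_hist[y] += length  # assumption: the read's bases go to the same bin
    return read_hist, base_hist


read_hist, base_hist = build_histograms([100, 101, 109, 150], mult=10, max_bin=20)
print(read_hist[10], base_hist[10])  # bin 10 holds the 100-, 101-, and 109-base reads
print(read_hist[15], base_hist[15])  # bin 15 holds the 150-base read
```

Because both arrays are allocated once at max_bin+1 entries, every update is a plain index operation, and anything beyond the cap lands in the last bin.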

Statistical Calculation Methods

Median Calculation: Uses Tools.percentileHistogram(readHist, 0.5)*MULT for histogram-based median calculation
Mode Detection: Applies Tools.calcModeHistogram(readHist)*MULT to identify peak frequency bins
Standard Deviation: Calculates Tools.standardDeviationHistogram(readHist)*MULT using histogram-based variance formula
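To make the histogram-based statistics concrete, here is a minimal Python sketch of how a median, mode, and standard deviation can be derived from bin counts and then scaled by MULT. The three functions are stand-ins named after the Tools methods, not their actual implementations; tie-breaking, the percentile cutoff rule, and the use of population (rather than sample) variance are all assumptions.

```python
import math


def percentile_histogram(hist, fraction):
    """Index of the first nonempty bin at which the running count reaches fraction * total."""
    target = sum(hist) * fraction
    running = 0
    for i, count in enumerate(hist):
        running += count
        if count > 0 and running >= target:
            return i
    return 0  # empty histogram


def calc_mode_histogram(hist):
    """Index of the fullest bin (the lowest index wins ties)."""
    return max(range(len(hist)), key=lambda i: hist[i]) if hist else 0


def standard_deviation_histogram(hist):
    """Population standard deviation of bin indices, weighted by bin counts."""
    total = sum(hist)
    if total == 0:
        return 0.0
    mean = sum(i * c for i, c in enumerate(hist)) / total
    variance = sum(c * (i - mean) ** 2 for i, c in enumerate(hist)) / total
    return math.sqrt(variance)


MULT = 10
read_hist = [0, 2, 5, 2, 1]  # made-up counts for bins 0..4
median = percentile_histogram(read_hist, 0.5) * MULT
mode = calc_mode_histogram(read_hist) * MULT
std_dev = standard_deviation_histogram(read_hist) * MULT
print(median, mode, round(std_dev, 3))
```

Since the statistics are computed on bin indices and multiplied by MULT afterwards, their resolution is limited to the bin size whenever bin is greater than 1.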

Cumulative Array Processing

Reverse Accumulation: Builds readHistC[i-1]=readHistC[i]+readHist[i-1] and baseHistC[i-1]=baseHistC[i]+baseHist[i-1], iterating from max down to 0, so each entry holds the total for that bin and all longer bins
Percentage Conversion: Computes readHistCF[i]=readHistC[i]*100d/readHistC[0] and baseHistCF[i]=baseHistC[i]*100d/baseHistC[0]
Dual Perspective Statistics: Provides both read-based and base-based cumulative distributions
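The two recurrences above can be traced with a short Python sketch. The function names are hypothetical, and seeding the last cumulative entry with the last histogram entry is an assumption, since the text gives only the recurrence and not its starting value.

```python
def reverse_cumulative(hist):
    """cum[i] = hist[i] + hist[i+1] + ... + hist[-1], built from the top down."""
    cum = [0] * len(hist)
    if not hist:
        return cum
    cum[-1] = hist[-1]  # assumed starting value for the recurrence
    for i in range(len(hist) - 1, 0, -1):
        cum[i - 1] = cum[i] + hist[i - 1]
    return cum


def to_percent(cum):
    """Scale cumulative counts so that cum[0] (the grand total) maps to 100."""
    total = cum[0] if cum else 0
    if total == 0:
        return [0.0] * len(cum)
    return [c * 100.0 / total for c in cum]


read_hist = [0, 2, 5, 2, 1]  # made-up counts for bins 0..4
read_hist_c = reverse_cumulative(read_hist)
read_hist_cf = to_percent(read_hist_c)
print(read_hist_c)   # [10, 10, 8, 3, 1]
print(read_hist_cf)  # [100.0, 100.0, 80.0, 30.0, 10.0]
```

Read this way, a cumulative entry answers "how many reads are at least this long", and applying the same two steps to baseHist gives the base-weighted view.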

Performance Implementation

Memory Usage: Histogram memory grows linearly with MAX_LENGTH; the script sets a 400 MB JVM heap (z="-Xmx400m")
Time Complexity: O(n) in the number of reads, with constant-time histogram updates via array indexing
Stream Processing: Uses ConcurrentReadInputStream with ListNum<Read> to process reads sequentially, without retaining individual reads
Paired-End Handling: Processes each r1.mate reference in the same single pass over the read pairs
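The streaming, single-pass idea can be illustrated in Python. This sketch does not use or imitate ConcurrentReadInputStream: fastq_lengths and count_pairs are hypothetical helpers, the parser assumes strict four-line FASTQ records, and treating the read limit as a count of records from the first file is an assumption.

```python
import io
from itertools import islice


def fastq_lengths(handle):
    """Yield the sequence length of each 4-line FASTQ record, one at a time."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield len(record[1].rstrip("\r\n"))


def count_pairs(handle1, handle2=None, max_reads=-1):
    """Single pass over reads (and mates, if given), keeping only running totals."""
    reads = bases = 0
    mates = fastq_lengths(handle2) if handle2 is not None else None
    for n, length in enumerate(fastq_lengths(handle1)):
        if 0 <= max_reads <= n:  # a negative limit means no limit
            break
        reads += 1
        bases += length
        if mates is not None:
            mate_length = next(mates, None)
            if mate_length is not None:  # count the mate in the same iteration
                reads += 1
                bases += mate_length
    return reads, bases


r1 = "@a\nACGT\n+\nIIII\n@b\nACGTAC\n+\nIIIIII\n"
r2 = "@a\nAC\n+\nII\n@b\nACG\n+\nIII\n"
print(count_pairs(io.StringIO(r1), io.StringIO(r2)))  # (4, 15)
```

Only the running totals survive each iteration, so memory use does not depend on the number of reads.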

Binning Algorithm Implementation

The tool supports two binning modes, controlled by the ROUND_BINS boolean:

Floor Mode (ROUND_BINS=false): bin_index = length / MULT for floor-based assignment
Round Mode (ROUND_BINS=true): bin_index = (length + MULT/2) / MULT for nearest-bin assignment

Floor mode assigns each read to the bin that starts at or below its length, so every bin covers the MULT lengths beginning at its start. Round mode assigns each read to the nearest bin, so each bin is centered on its nominal length.
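A small Python comparison shows where the two modes diverge. The helper names are hypothetical, and reporting each bin as index times MULT is an assumption based on how the statistics are scaled.

```python
def floor_bin(length, mult):
    """ROUND_BINS=false: bin_index = length / MULT (integer division)."""
    return length // mult


def round_bin(length, mult):
    """ROUND_BINS=true: bin_index = (length + MULT/2) / MULT (integer division)."""
    return (length + mult // 2) // mult


def compare_modes(lengths, mult):
    """Return (length, floor label, round label) for each length."""
    return [(n, floor_bin(n, mult) * mult, round_bin(n, mult) * mult) for n in lengths]


for row in compare_modes([14, 15, 17, 20], mult=10):
    print(row)
# (14, 10, 10)
# (15, 10, 20)
# (17, 10, 20)
# (20, 20, 20)
```

With bin=10, the modes agree for lengths in the lower half of each bin and differ for the upper half, where round mode moves the read up to the next bin.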

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org