ReadLength

Script: readlength.sh Package: jgi Class: MakeLengthHistogram.java

Generates a length histogram of input reads with statistical analysis including read count, base count, median via Tools.percentileHistogram(), mode via Tools.calcModeHistogram(), and standard deviation via Tools.standardDeviationHistogram().

Basic Usage

readlength.sh in=<input file>

This tool analyzes sequence files using ConcurrentReadInputStream to produce length distribution statistics and histograms. It processes both single-end and paired-end reads, generating cumulative statistics and percentage distributions through parallel array processing.

Parameters

Parameters control input/output files, histogram binning, and output formatting options.

Input/Output Parameters

in=<file>
Input sequence file (FASTA/FASTQ). The 'in=' flag is needed only if the input file is not the first parameter. 'in=stdin.fq' reads from standard input.
in2=<file>
Second input file for paired-end reads. Use this if the second reads of the pairs are in a separate file.
out=<file>
Output file for the histogram and statistics. Default is stdout (console output).

Histogram Parameters

bin=10
Histogram bin size. Sets the interval size for grouping read lengths. Default: 10 bases per bin.
max=80000
Maximum read length to track. Reads longer than this will be placed in the final bin. Default: 80,000 bases.
round=f
Rounding mode for bin assignment. When true, places reads in the closest bin. When false (default), places reads in the highest bin of at least readlength.

Output Control Parameters

nzo=t
(nonzeroonly) Controls empty bin display. When true (default), does not print bins containing zero reads. When false, prints all bins including empty ones.
reads=-1
Read limit for processing. If nonnegative, stops processing after this many reads. Default: -1 (process all reads).

File Handling Parameters

append=f
Append to output file instead of overwriting. Default: false.
overwrite=t
Overwrite existing output files. Default: true.

Examples

Basic Length Histogram

readlength.sh in=reads.fastq out=length_histogram.txt

Generates a length histogram for single-end reads with default 10-base bins.

Paired-End Analysis

readlength.sh in=reads_R1.fastq in2=reads_R2.fastq out=paired_lengths.txt

Analyzes both forward and reverse reads from paired-end sequencing data.

Custom Binning

readlength.sh in=long_reads.fastq bin=50 max=10000 round=t out=custom_hist.txt

Uses 50-base bins, tracks read lengths up to 10,000 bases (10 kb), and assigns each read to the closest bin.

Include Empty Bins

readlength.sh in=reads.fastq nzo=f out=complete_histogram.txt

Shows all histogram bins including those with zero reads for complete distribution visualization.

Limited Read Processing

readlength.sh in=large_dataset.fastq reads=100000 out=sample_lengths.txt

Analyzes only the first 100,000 reads for quick assessment of read length distribution.

Output Format

The tool generates statistical summaries followed by a detailed histogram table:

Statistics Header

Histogram Table Columns

Algorithm Details

The readlength tool implements histogram-based length analysis in MakeLengthHistogram.calc().

Data Structure Implementation

Statistical Calculation Methods
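The description at the top of this page names three statistics derived from the histogram: median, mode, and standard deviation. The following is a minimal Python sketch of how such values can be computed from a table of per-length counts. It is not the code of Tools.percentileHistogram(), Tools.calcModeHistogram(), or Tools.standardDeviationHistogram(); the function name histogram_stats, the tie-breaking rule for the mode, and the use of the population (not sample) standard deviation are all assumptions made for illustration.

```python
import math

def histogram_stats(hist):
    """hist[i] is the number of reads with length i (hypothetical input layout).

    Returns (median, mode, stdev) computed from the counts alone.
    """
    total = sum(hist)
    if total == 0:
        return 0, 0, 0.0

    # Median: first length at which the running count reaches half of all reads.
    half = (total + 1) // 2
    running = 0
    median = 0
    for length, count in enumerate(hist):
        running += count
        if running >= half:
            median = length
            break

    # Mode: length with the largest count (the shortest length wins ties).
    mode = max(range(len(hist)), key=lambda i: hist[i])

    # Population standard deviation, weighting each length by its count.
    mean = sum(i * c for i, c in enumerate(hist)) / total
    variance = sum(c * (i - mean) ** 2 for i, c in enumerate(hist)) / total
    return median, mode, math.sqrt(variance)
```

Working from counts rather than from the individual reads keeps memory use proportional to the number of distinct lengths tracked, which is the usual reason to compute statistics from a histogram.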

Cumulative Array Processing
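The usage notes mention cumulative statistics and percentage distributions. As a rough illustration, the Python sketch below builds such rows from an array of per-bin read counts. The function name cumulative_table, the row layout, and the choice to accumulate from the shortest bin upward are assumptions; the tool itself may accumulate in the other direction or report different columns.

```python
def cumulative_table(read_hist, bin_size=10, nonzero_only=True):
    """Build (bin_start, reads, pct_reads, cum_reads, cum_pct) rows.

    read_hist[i] is the number of reads assigned to bin i (hypothetical layout).
    nonzero_only mirrors the documented nzo flag: empty bins are skipped.
    """
    total = sum(read_hist)
    rows = []
    running = 0
    for i, count in enumerate(read_hist):
        running += count  # cumulative count includes empty bins implicitly
        if nonzero_only and count == 0:
            continue
        pct = 100.0 * count / total if total else 0.0
        cum_pct = 100.0 * running / total if total else 0.0
        rows.append((i * bin_size, count, pct, running, cum_pct))
    return rows
```

Skipping empty bins only affects which rows are emitted; the running totals are the same either way.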

Performance Implementation

Binning Algorithm Implementation

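The tool supports two binning modes, controlled by the ROUND_BINS boolean (the round parameter). With round=f (the default), reads are placed in the highest bin of at least readlength; with round=t, reads are placed in the closest bin.

This lets users choose between conservative length grouping and closest-bin (centered) grouping, depending on analysis requirements.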
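The two binning modes, together with the bin and max parameters, can be sketched in Python as follows. This is an illustration, not the tool's code: the function name bin_index is hypothetical, and the sketch assumes that the non-rounded mode uses integer (floor) division, that the rounded mode adds half a bin before dividing, and that over-length reads are clamped to the final bin. The documented wording for the non-rounded mode is ambiguous, so treat the floor behavior as an assumption.

```python
def bin_index(length, bin_size=10, max_length=80000, round_bins=False):
    """Return the histogram bin index for a read of the given length."""
    last_bin = max_length // bin_size
    if round_bins:
        # Closest bin: add half a bin width before integer division.
        idx = (length + bin_size // 2) // bin_size
    else:
        # Assumed non-rounded behavior: plain integer (floor) division.
        idx = length // bin_size
    # Reads longer than max_length land in the final bin.
    return min(idx, last_bin)
```

Under these assumptions and the default bin size of 10, a 147-base read lands in bin 14 (lengths 140 to 149) without rounding and in bin 15 with rounding, while a 100,000-base read is clamped to the final bin, 8000.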

Support

For questions and support: