ReformatPB
Provides some of Reformat's functionality in a ZMW-aware tool for processing PacBio sequencing data. Supports filtering, sampling, trimming, and consensus calling while keeping subreads from the same ZMW grouped together.
Basic Usage
reformatpb.sh in=<input file> out=<output file> outb=<bad reads>
ReformatPB processes PacBio sequencing data with ZMW (Zero-Mode Waveguide) awareness. It uses ProcessThread-based multithreading with a ZMWStreamer that buffers and organizes reads, and tracks ZMW membership (via an IntHashSet) so that reads from the same ZMW stay associated throughout processing.
Parameters
Parameters are organized by their function in the PacBio data processing pipeline. All parameters from the shell script are preserved to maintain compatibility.
File I/O parameters
- in=<file>
- Primary input file containing PacBio sequencing data in FASTQ format (gzip compression is supported).
- out=<file>
- (outgood) Output file for reads that pass quality filters and processing criteria.
- outb=<file>
- (outbad) Output file for reads that are discarded due to quality issues or filtering criteria.
- stats=<file>
- Print screen output to this file instead of stderr. Useful for capturing processing statistics.
- json=f
- Print statistics output in JSON format instead of text format. Helpful for automated parsing.
- schist=<file>
- Output subread count per ZMW histogram to this file. Provides distribution analysis of subreads per ZMW.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite existing output files.
- ziplevel=2
- (zl) Compression level for gzipped output files. Range 1 (fastest) to 9 (maximum compression). Lower values are faster.
Processing parameters
- kzt=f
- (keepzmwstogether) Send all reads from a ZMW to the same output file. If any read fails quality filters, all reads from that ZMW go to the bad output.
- minlen=40
- Minimum read length after trimming. Reads shorter than this are discarded.
- ccsin=f
- Input reads are CCS (Circular Consensus Sequencing), meaning they are all full-pass reads. Currently not used for processing logic.
- trimpolya=f
- Trim terminal poly-A and poly-T sequences. Useful for Iso-Seq libraries that may contain poly-A tails. A sketch of this trimming logic appears after this parameter list.
- minpolymer=5
- Minimum length of poly-A/T sequence required for trimming. Shorter sequences are not trimmed.
- polyerror=0.2
- Maximum error rate allowed when identifying poly-A/T sequences for trimming.
- flaglongreads=f
- Flag and discard reads that are suspiciously long (longer than longreadmult times the median read length within the ZMW; 1.5x by default).
- longreadmult=1.5
- Multiplier for median length to determine suspiciously long reads. Reads longer than this multiple of the median are flagged.
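To make the interaction of trimpolya, minpolymer, and polyerror concrete, here is a minimal sketch of detecting a terminal poly-A run while tolerating a bounded mismatch fraction. The class and method names are hypothetical and the scoring is simplified; this is not ReformatPB's actual trimming code.

    // Illustrative sketch only: report how many bases at the 3' end look like a
    // poly-A tail, given minpolymer and polyerror. Not ReformatPB's actual code.
    public class PolyATrimSketch {
        static int polyATailLength(byte[] bases, int minPolymer, float polyError) {
            int matches = 0, mismatches = 0, best = 0;
            for (int i = bases.length - 1; i >= 0; i--) {
                if (bases[i] == 'A') matches++; else mismatches++;
                int len = matches + mismatches;
                // Accept the run so far if it is long enough and clean enough.
                if (matches >= minPolymer && mismatches <= polyError * len) best = len;
            }
            return best; // bases to remove from the 3' end (0 if no tail found)
        }

        public static void main(String[] args) {
            byte[] read = "ACGTACGTACGTACGTAAAAAAAAGAAA".getBytes();
            // Prints 13: the detected tail, including tolerated mismatches.
            System.out.println(polyATailLength(read, 5, 0.2f));
        }
    }

The same scan applied to the 5' end with 'T' instead of 'A' would cover poly-T trimming.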
Whitelist and Blacklist Parameters
- whitelist=
- ZMW identifiers to include, specified as a comma-delimited list of integers or files with one integer per line. All ZMWs not in this list will be discarded.
- blacklist=
- ZMW identifiers to exclude, specified as a comma-delimited list of integers or files with one integer per line. All ZMWs in this list will be discarded.
Sampling parameters (avoid using more than one of these at a time)
- reads=-1
- If positive, stop processing after this many input reads. Useful for creating subsets of large datasets.
- zmws=-1
- If positive, stop processing after this many ZMWs. Alternative to reads parameter for ZMW-based limiting.
- bestpass=f
- Keep only the best read per ZMW. The "best" read is the median-length read among non-outermost reads; if a ZMW has two or fewer passes, the longest read is chosen instead. A sketch of this rule appears after this parameter list.
- longestpass=f
- Keep only the longest read per ZMW. Uses ZMW.longestRead() method to select reads by length comparison.
- samplerate=1.0
- Fraction of input reads to retain (0.0 to 1.0). Random sampling applied uniformly across the dataset.
- samplereadstarget=-1
- Target number of reads to retain. If positive, performs exact sampling to achieve this count.
- samplebasestarget=-1
- Target number of bases to retain. If positive, performs exact sampling to achieve this total base count.
- samplezmwstarget=-1
- Target number of ZMWs to retain. If positive, performs exact sampling to achieve this ZMW count.
- subsamplefromends=f
- When subsampling, preferentially eliminate outermost reads first, then work inward. Preserves higher-quality internal reads.
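The bestpass rule described above can be sketched as follows. Strings stand in for reads and the class is purely illustrative; ReformatPB performs this selection inside its ZMW class (see Algorithm Details).

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Illustrative sketch of the bestpass selection rule; not the actual ZMW class.
    public class BestPassSketch {
        static String bestPass(List<String> subreads) {
            Comparator<String> byLength = Comparator.comparingInt(String::length);
            if (subreads.size() <= 2) {
                // Two or fewer passes: fall back to the longest read.
                return subreads.stream().max(byLength).orElse(null);
            }
            // Drop the outermost subreads (usually partial passes), then take the
            // median-length read of what remains.
            List<String> inner = new ArrayList<>(subreads.subList(1, subreads.size() - 1));
            inner.sort(byLength);
            return inner.get(inner.size() / 2);
        }

        public static void main(String[] args) {
            List<String> zmw = List.of("ACGT", "ACGTACGTAC", "ACGTACGT", "ACGTACGTACG", "ACG");
            System.out.println(bestPass(zmw)); // median-length inner read: ACGTACGTAC
        }
    }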
CCS Parameters (Note: CCS is still experimental)
- ccs=f
- Generate a single consensus read per ZMW using Circular Consensus Sequencing approach. Uses BaseGraph data structure for alignment and consensus calling through graph traversal.
- minpasses=0
- Minimum number of passes required per ZMW for consensus generation. Pass count is estimated; first and last subreads are typically partial passes.
- minsubreads=0
- Minimum number of subreads required per ZMW for processing. ZMWs with fewer subreads are discarded.
- reorient=f
- Attempt alignment of both forward and reverse strands in case ZMW read ordering is disrupted. Helps recover data from problematic ZMWs.
- minshredid=0.6
- Minimum identity threshold for including read fragments (shreds) in consensus generation. Lower values include more divergent reads.
Entropy Parameters (recommended setting is 'entropy=t')
- minentropy=-1
- Minimum entropy threshold for read complexity filtering. Range 0-1. Recommended value is 0.55; values above 0.7 are too stringent. Negative values disable entropy filtering. A sketch of the windowed entropy calculation appears after this parameter list.
- entropyk=3
- K-mer length used for entropy calculation. Shorter k-mers are more sensitive to low-complexity regions.
- entropylen=350
- Minimum length of consecutive low-entropy sequence required to flag a read for removal.
- entropyfraction=0.5
- Alternative minimum length calculation as fraction of read length. The smaller of entropylen and (entropyfraction × read_length) is used.
- entropywindow=50
- Window size for sliding-window entropy calculation. Larger windows smooth out local complexity variations.
- maxmonomerfraction=0.74
- (mmf) Maximum fraction of identical bases allowed in each entropy window. Higher values are more permissive of repetitive sequences.
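The entropy parameters above combine windowed k-mer entropy with a monomer-fraction check. Below is a minimal sketch of both calculations, assuming Shannon entropy normalized to the 0-1 range; the class name and exact normalization are illustrative, not ReformatPB's actual implementation.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative windowed k-mer entropy and monomer-fraction checks,
    // following the entropyk/entropywindow/maxmonomerfraction parameters above.
    public class WindowEntropySketch {

        // Shannon entropy of k-mer counts within one window, normalized to 0..1.
        static double entropy(byte[] bases, int from, int window, int k) {
            Map<String, Integer> counts = new HashMap<>();
            int kmers = 0;
            for (int i = from; i + k <= from + window && i + k <= bases.length; i++) {
                counts.merge(new String(bases, i, k), 1, Integer::sum);
                kmers++;
            }
            if (kmers < 2) return 1.0;
            double h = 0;
            for (int c : counts.values()) {
                double p = c / (double) kmers;
                h -= p * Math.log(p);
            }
            // Normalize by the maximum possible entropy for this many k-mers.
            return h / Math.log(Math.min(kmers, Math.pow(4, k)));
        }

        // Fraction of the window occupied by its most common single base.
        static double monomerFraction(byte[] bases, int from, int window) {
            int[] counts = new int[128];
            int n = 0, max = 0;
            for (int i = from; i < from + window && i < bases.length; i++) {
                max = Math.max(max, ++counts[bases[i]]);
                n++;
            }
            return n == 0 ? 0 : max / (double) n;
        }

        public static void main(String[] args) {
            byte[] low = "ATATATATATATATATATATATATATATATATATATATATATATATATAT".getBytes();
            System.out.println(entropy(low, 0, 50, 3));        // low (~0.18)
            System.out.println(monomerFraction(low, 0, 50));    // 0.5
        }
    }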
Java Parameters
- -Xmx
- Set Java heap memory size, overriding automatic detection. Use format like -Xmx20g for 20 gigabytes or -Xmx2000m for 2000 megabytes. Maximum is typically 85% of available system memory.
- -eoom
- Exit immediately if an out-of-memory exception occurs. Prevents hanging processes. Requires Java 8u92 or later.
- -da
- Disable Java assertions. May provide minor performance improvement in production use.
Examples
Basic Quality Filtering
reformatpb.sh in=pacbio_reads.fastq out=clean_reads.fastq outb=discarded.fastq minlen=500
Filter PacBio reads to keep only those 500bp or longer after processing.
ZMW-Aware Sampling
reformatpb.sh in=large_dataset.fastq out=sample.fastq samplezmwstarget=10000 kzt=t
Sample exactly 10,000 ZMWs, keeping all subreads from each selected ZMW together.
Best Pass Selection
reformatpb.sh in=subreads.fastq out=best_reads.fastq bestpass=t minpasses=2
Select the best (median-length non-outermost) read from each ZMW with at least 2 passes.
Entropy Filtering
reformatpb.sh in=reads.fastq out=complex.fastq outb=simple.fastq minentropy=0.55
Remove reads with low sequence complexity using entropy-based filtering.
CCS Generation (Experimental)
reformatpb.sh in=subreads.fastq out=ccs.fastq ccs=t minpasses=3 minshredid=0.7
Generate circular consensus sequences from subreads with at least 3 passes and high identity requirement.
Poly-A Trimming for Iso-Seq
reformatpb.sh in=isoseq.fastq out=trimmed.fastq trimpolya=t minpolymer=8
Trim poly-A/T tails from Iso-Seq reads, removing sequences of 8 or more As/Ts.
Algorithm Details
ZMW-Aware Processing
ReformatPB is designed for PacBio Single Molecule Real-Time (SMRT) sequencing data, which is generated in Zero Mode Waveguides (ZMWs). Each ZMW yields multiple subreads from the same DNA molecule, and this tool maintains awareness of that structure throughout processing.
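Subreads from the same molecule can be grouped by the hole (ZMW) number embedded in PacBio read names, which follow the movieName/holeNumber/qStart_qEnd convention for subreads. The sketch below shows the idea with made-up read names; ReformatPB's own parsing and IntHashSet-based tracking are more involved.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Conceptual sketch of ZMW grouping by hole number; not ReformatPB's parser.
    public class ZmwGroupingSketch {

        static int zmwId(String readName) {
            String[] parts = readName.split("/");
            return Integer.parseInt(parts[1]); // hole number identifies the ZMW
        }

        public static void main(String[] args) {
            List<String> names = List.of(
                "m64011_190830_220126/4194370/0_11133",
                "m64011_190830_220126/4194370/11180_22302",
                "m64011_190830_220126/4194372/0_9511");
            Map<Integer, List<String>> byZmw = new LinkedHashMap<>();
            for (String n : names) {
                byZmw.computeIfAbsent(zmwId(n), k -> new ArrayList<>()).add(n);
            }
            System.out.println(byZmw.keySet()); // [4194370, 4194372]
        }
    }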
Threading and Performance
The tool uses ProcessThread instances with a ZMWStreamer for concurrent processing. Threading is automatically disabled (threads=1) for exact sampling operations to ensure deterministic results. ByteFile.FORCE_MODE_BF2 is used for input file reading when more than two threads are available.
Filtering Strategy
Multiple filtering approaches can be applied; a sketch of how they compose per ZMW follows the list:
- Length filtering: Removes reads below minimum length thresholds
- Entropy filtering: Uses k-mer entropy calculation to identify and remove low-complexity sequences
- Long read flagging: Identifies suspiciously long reads based on ZMW median length
- Quality trimming: Removes undefined bases and poly-A/T sequences
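A rough sketch of how these filters might compose within one ZMW, assuming a hypothetical filterZmw helper; the ordering and details differ in the real tool.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative composition of the per-ZMW filters above; not the actual pipeline.
    public class ZmwFilterSketch {
        static List<byte[]> filterZmw(List<byte[]> reads, int minLen, double longReadMult) {
            // Median subread length within this ZMW, used for the long-read flag.
            int[] lens = reads.stream().mapToInt(r -> r.length).sorted().toArray();
            int median = lens[lens.length / 2];

            List<byte[]> kept = new ArrayList<>();
            for (byte[] r : reads) {
                if (r.length < minLen) continue;                 // length filter
                if (r.length > longReadMult * median) continue;  // suspiciously long read
                // Entropy filtering and poly-A trimming would be applied here.
                kept.add(r);
            }
            return kept;
        }

        public static void main(String[] args) {
            List<byte[]> zmw = List.of(new byte[800], new byte[900], new byte[2000], new byte[30]);
            System.out.println(filterZmw(zmw, 40, 1.5).size()); // 2: drops the 2000 bp and 30 bp reads
        }
    }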
Sampling Methods
The tool supports multiple sampling strategies; an exact-sampling sketch follows the list:
- Rate-based: Random sampling based on specified fraction
- Target-based: Exact sampling to achieve specific read, base, or ZMW counts
- Best-pass: Selects median-length read from non-outermost reads using ZMW.medianRead() method
- End-preferential: Removes lower-quality outermost reads first
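Exact (target-based) sampling requires knowing how many items there are, which is one reason exact sampling runs single-threaded and deterministically. The sketch below uses Knuth's selection-sampling technique to keep exactly the target number of ZMWs in one pass over a known total; ReformatPB's exact sampling may be implemented differently.

    import java.util.Random;

    // Illustrative exact sampling: choose exactly 'target' of 'total' ZMWs,
    // uniformly at random, in a single pass (Knuth's selection sampling).
    public class ExactSamplingSketch {
        public static void main(String[] args) {
            long total = 1_000_000;   // known number of ZMWs (e.g., from a first pass)
            long target = 10_000;     // samplezmwstarget
            Random rnd = new Random(12345);
            long seen = 0, kept = 0;
            for (long i = 0; i < total; i++) {
                long remaining = total - seen;
                long needed = target - kept;
                if (rnd.nextDouble() * remaining < needed) {
                    kept++;           // emit this ZMW
                }
                seen++;
            }
            System.out.println(kept); // exactly 10000, regardless of the seed
        }
    }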
Consensus Generation (Experimental)
The CCS (Circular Consensus Sequencing) feature generates consensus reads using the BaseGraph class through the following steps (a shredding sketch follows the list):
- Setting read strand orientation (alternating by subread index, i&1) and selecting the median-length read as the reference
- Breaking long reads (>500bp) into shreds with 10bp overlap using the shred() method
- Using BaseGraph constructor with reference bases, quality, and numericID
- Aligning shreds via bg.alignAndGenerateMatch() with SSA aligner and minShredIdentity threshold
- Traversing BaseGraph with bg.traverse() to generate consensus sequence
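The shredding step can be pictured as cutting a read into fixed-size pieces with a small overlap, as in the sketch below; the real shred() method may differ in boundary handling.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Illustrative shredding: fixed-size pieces with a small overlap
    // (~500 bp shreds, 10 bp overlap as described above).
    public class ShredSketch {
        static List<byte[]> shred(byte[] bases, int shredLen, int overlap) {
            List<byte[]> shreds = new ArrayList<>();
            int step = shredLen - overlap;
            for (int start = 0; start < bases.length; start += step) {
                int end = Math.min(start + shredLen, bases.length);
                shreds.add(Arrays.copyOfRange(bases, start, end));
                if (end == bases.length) break;
            }
            return shreds;
        }

        public static void main(String[] args) {
            System.out.println(shred(new byte[1200], 500, 10).size()); // 3 overlapping shreds
        }
    }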
Memory and I/O Management
The tool implements specific memory management strategies:
- ZMWStreamer processing: Uses ZMWStreamer class with configurable buffer sizes for read organization
- Thread control: Automatically sets threads=1 for exact sampling (sampleExact=true) to ensure deterministic results
- Compression support: Uses ReadWrite.USE_PIGZ and ReadWrite.USE_BGZIP with configurable ziplevel parameter (1-9)
- Buffer configuration: Sets Shared.setBufferData(1000000) for 1MB buffer size
Statistical Output
Statistics are generated via the toText() and toJson() methods and include:
- ZMW/read/base counts using Tools.timeZMWsReadsBasesProcessed() and Tools.ZMWsReadsBasesOut()
- Filtering counts by category (readsFiltered, lowEntropyZMWs, partiallyDiscardedZMWs, fullyDiscardedZMWs)
- Trimming statistics (readsTrimmed, basesTrimmed) when trimReads=true
- Timer statistics via t.timeInSeconds() method
- Subread count histogram via the writeHistogram() method, with mean/median/mode calculations (a sketch of these calculations follows)
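The mean, median, and mode reported for the subread-count histogram can be derived directly from the histogram counts, as in this sketch with made-up numbers; it mirrors the idea rather than the tool's actual writeHistogram() code.

    // Illustrative mean/median/mode from a subread-count histogram
    // (index = subreads per ZMW, value = number of ZMWs with that count).
    public class HistogramStatsSketch {
        public static void main(String[] args) {
            long[] hist = {0, 120, 340, 510, 280, 90, 20}; // hypothetical counts
            long zmws = 0, subreads = 0;
            for (int i = 0; i < hist.length; i++) { zmws += hist[i]; subreads += (long) i * hist[i]; }
            double mean = subreads / (double) zmws;
            long half = (zmws + 1) / 2, cum = 0;
            int median = -1, mode = 0;
            for (int i = 0; i < hist.length; i++) {
                cum += hist[i];
                if (median < 0 && cum >= half) median = i; // first bin reaching half the ZMWs
                if (hist[i] > hist[mode]) mode = i;        // most common subread count
            }
            System.out.printf("mean=%.2f median=%d mode=%d%n", mean, median, mode); // mean=2.96 median=3 mode=3
        }
    }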
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org