CheckStrand

Script: checkstrand.sh Package: jgi Class: CheckStrand2.java

Estimates the strandedness of a library without alignment; intended for RNA-seq data. Only the reads are required input to determine strandedness, so this works when run with just a fastq file. If sam/bam input is used, additional alignment-based metrics will be reported. If an assembly and gff file are provided, the affinity of reads to the plus or minus (sense or antisense) strand will also be calculated. If a genome is specified with no gff file, the genes will be called automatically with a prokaryotic gene caller. Strandedness and P/(P+M) ratios are similar but calculated in different ways, and thus will not exactly agree, but should be close. For most calculations, only read 1 is used, or the merge of read 1 and read 2 if the merge flag is enabled and they overlap.

Basic Usage

checkstrand.sh in=<input file>

Output Analysis Explanation

CheckStrand provides multiple analysis methods, each giving different perspectives on strandedness:

Depth Analysis

Depth_Analysis: Based on comparing the fraction of kmers in forward and reverse orientation to a binomial distribution.
Strandedness: Percent of reads that came from the majority strand, based on kmer depth.
StrandednessN: Depth-normalized strandedness, where each unique kmer contributes equally regardless of depth.
AvgKmerDepth: Average depth of kmers; typically higher than read depth.
Kmers>Depth1: Fraction of kmers with depth over 1. Singleton kmers cannot be used to calculate strandedness from depth.

Stop Codon Analysis

Stop_Codon_Analysis: Based on counting stop codons in different reading frames.
MajorStrandORF: Predicted major strand based on stop-codon analysis.
AvgReadLen: Average length of R1, or the merged read pair.
AvgORFLen+: Average ORF length in the best frame on the plus side of R1.
AvgORFLen-: Average ORF length in the best frame on the minus side of R1.
AvgStopCount+: Average number of stop codons in the best frame on plus side.
AvgStopCount-: Average number of stop codons in the best frame on minus side.
GC_Content: GC content of reads, which affects stop codon frequency.

PolyA Analysis

PolyA_Analysis: Based on counting poly-A or poly-T tails.
MajorStrandPA: Predicted major strand based on poly-A tail analysis. This is unreliable if the poly-A fraction is low.
PolyA/(PA+PT): Ratio of (reads with poly-A tails)/(poly-A + poly-T tails).
PolyAFraction: Fraction of reads ending in poly-A or poly-T.

Reference Analysis

Ref_Analysis: Compares read kmer frequencies to a reference (if present). The reference can be a transcriptome or genome, and a gff file will be used if provided.
MajorStrandREF: The strand containing a majority of forward read kmers.
P/(P+M)_Ratio: P is the sum of counts of plus kmers, and M is minus kmers.
GeneCoverage: Fraction of transcriptome kmers represented in reads.
GenePrecision: Fraction of read kmers found in the transcriptome.

Read Gene Calling Analysis

Read_Gene_Calling_Analysis: Uses gene-calling on reads to calculate which strand better fits a prokaryotic gene model.
MajorStrandRGC: Predicted major strand based on read gene calling.
P/(P+M)_Ratio: P is the read count best matching the plus strand; M is minus.
AvgScorePlus: Average score of called plus-strand genes.
AvgScoreMinus: Average score of called minus-strand genes.
UsedFraction: Fraction of reads with any called genes (or partial genes); this can be increased by merging the reads for longer frames.

Alignment Results

Alignment_Results: Requires sam/bam input. The reads must have been mapped to a transcriptome or RNA-seq assembly, or to a specified genome, or a gff file must be provided.
StrandednessAL: Percent of reads aligned to the dominant strand. More accurate for transcriptome-mapped than genome-mapped reads.
StrandednessAN: Depth-normalized strandedness, where each feature or contig contributes equally.
MajorStrandAL: Strand to which a majority of reads aligned.
P/(P+M)_Ratio: P is the number of plus-mapped reads, M is minus.
P/(P+M)_RatioN: Depth-normalized plus/total ratio.
PlusFeatures: Fraction of features with majority plus-mapped reads.
AlignmentRate: Fraction of reads that aligned.
Feature-Mapped: Fraction of reads that aligned to a feature in the gff.

Parameters

Parameters are organized by their function in the strandedness analysis process.

Standard parameters

in=<file>: Primary input (a fastq, fasta, sam or bam file).
in2=<file>: Secondary input for paired fastq in twin files. Read 2 is ignored unless merge=t.
out=<file>: Optional destination to redirect results instead of stdout.
outp=<file>: Optional output for plus-mapped reads.
outm=<file>: Optional output for minus-mapped reads.

Processing parameters

ref=<file>: Optional reference (assembly) input.
gff=<file>: Optional gene annotation file input.
scafreport=<file>: Optional per-scaffold strandedness output.
transcriptome=f: Set this to 't' if the reference is a sense-strand transcriptome (rather than a genome assembly). This applies to either a reference specified by 'ref' or the reference used for alignment, for sam/bam input.
rnacontigs=f: Set this to 't' if the reference is contigs assembled from RNA-seq data, but with unknown orientation. Only affects alignment results.
size=80000: Sketch size; larger may be more precise.
merge=f: Attempt to merge paired reads, and use the merged read when successful. If unsuccessful only R1 is used. This has a performance impact on CPUs with few cores.
orf=t: Analyze stop codons and open reading frames. Usually this will allow major strand determination without a reference, unless GC is very high.
callreads=t: Perform gene-calling on reads using a prokaryotic gene model. Not very relevant to eukaryotes.
passes=2: Two passes refines the gene model for better gene-calling on reads. Only used if there is a reference.
samplerate=1.0: Set to a lower number to subsample the input reads; increases speed on CPUs with few cores.
sampleseed=17: Positive numbers are deterministic; negative use random seeds.
reads=-1: If positive, quit after processing this many reads.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic fastq analysis

# Won't give alignment results, just kmer results
checkstrand.sh in=reads.fq

Analyzes strandedness using kmer depth, stop codon analysis, and poly-A tail detection on fastq reads.

Aligned reads with gene-calling

# This will do gene-calling and the alignment strandedness will be based 
# on gene sense strand, but only works for prokaryotes/viruses
checkstrand.sh in=mapped.sam ref=contigs.fa

Performs gene-calling on the reference and uses alignment strandedness based on gene sense strand.

Using annotation file

# This will use the annotation and the alignment strandedness will be based
# on gene sense strand, works for prokaryotes, and should work for eukaryotes
checkstrand.sh in=mapped.sam gff=genes.gff

Uses existing gene annotation for more accurate strandedness calculation.

Transcriptome mode

# This will assume that the reference was a sense-strand transcriptome,
# and the alignment strandedness will be based on contig plus strand
checkstrand.sh in=mapped.sam ref=transcriptome.fa transcriptome=t

Optimized for reads mapped to a sense-strand transcriptome.

RNA-seq contigs

# This will assume that the reference was unstranded contigs assembled
# from RNA-seq data, so the alignment strandedness will be based on
# contig majority strand
checkstrand.sh in=mapped.sam ref=rnacontigs.fa rnacontigs=t

For contigs assembled from RNA-seq data with unknown orientation.

Merged pairs with high precision

checkstrand.sh in=reads.fq merge=t size=200000 samplerate=0.5

Merges paired reads for longer sequences, uses larger sketch size for precision, and subsamples to 50% for speed.

Algorithm Details

Multi-Method Strandedness Detection

CheckStrand2 employs five independent methods to determine RNA-seq library strandedness, providing robust analysis even when individual methods may be unreliable:

1. K-mer Depth Analysis

Implements dual SketchTool strategy using SketchMakerMini.processReadNucleotide() with canonical and forward-only k-mer processing:

Creates MinHash sketches of configurable size (default 80,000) using 32-mer sequences (SketchObject.k=32)
Uses SketchHeap data structures accumulated via canonSmm.heap() and fwdSmm.heap() methods
Applies CheckStrand.calcStrandedness() and CheckStrand.calcStrandednessNormalized() binomial distribution testing
Normalizes results using depth-aware weighting where each unique k-mer contributes equally regardless of coverage
Filters k-mers using quality thresholds (minEntropy, minProb, minQual parameters in SketchMakerMini)

2. Stop Codon Frame Analysis

Implements countStopCodons() method using 6-bit k-mer encoding with AminoAcid.codeToByte[] lookup tables:

Uses sliding window k-mer approach (k=3) with bit-shift operations: kmer=((kmer<<2)|x)&0b111111
Processes both forward and reverse complement k-mers via AminoAcid.baseToComplementNumber[] mapping
Tracks longest ORF lengths using fMaxOrf[] and rMaxOrf[] arrays with dynamic ORF boundary detection
Computes per-frame statistics: fBestFrameT[] and rBestFrameT[] accumulate frame preferences
Calculates GC content via ACGTCountT[] nucleotide frequency arrays for stop codon frequency correction

3. Poly-A Tail Detection

Implements analyzePolyA() method using Read.countTrailing() and Read.countLeading() nucleotide counting algorithms:

Detects poly-A tails via r.countTrailing('A') method and poly-T prefixes via r.countLeading('T')
Applies configurable minimum tail length threshold (minPolyA parameter, default 6 nucleotides)
Uses comparative analysis: polyACountT++ when trailingPolyA > leadingPolyT && trailingPolyA > minPolyA
Accumulates strand-specific counters: polyACount vs polyTCount for final ratio calculation
Note: Analysis limited to read 1 due to insert size interference with read 2 terminal sequences

4. Reference-Based K-mer Comparison

Uses CheckStrand.sketchGenes() and CheckStrand.calcPMRatioWithGeneSketches() for reference-based strand determination:

Generates three gene sketches: canonGeneSketch, plusSketch, minusSketch via SketchTool processing
Loads reference sequences using genomeSequence() or CheckStrand.grabGenes() with GFF annotation parsing
Compares read k-mers against reference sketches using Sketch comparison algorithms
Computes coverage metric: aFraction = (reference k-mers found in reads) / (total reference k-mers)
Computes precision metric: bFraction = (read k-mers matching reference) / (total read k-mers)

5. Prokaryotic Gene Model Scoring

Implements scoreGenes2() method using CallGenes.makeGeneCaller() with GeneModel-based prokaryotic gene prediction:

Loads prokaryotic gene model from Data.findPath("?model.pgm") via GeneModelParser.loadModel()
Executes gCallerT.callGenes(r, true) on both forward and reverse strand orientations
Accumulates maximum ORF scores per strand: Tools.max(plusScore, orf.score()) and Tools.max(minusScore, orf.score())
Implements two-pass refinement via CallGenes.makeMultipassModel() when passes>1 parameter is set
Tracks scoring statistics: gCallerPlusScoreT, gCallerMinusScoreT, and called gene counts per strand

Memory and Performance Optimization

CheckStrand2 implements thread-safe concurrent processing with dynamic resource management:

Sketching: MinHash sketching via SketchHeap data structures reduces memory to configurable targetSketchSize (default 80,000 k-mers)
Threading: Dynamic thread limiting: maxThreads = mergePairs ? 64 : 16 with Tools.min(maxThreads, Shared.threads())
Streaming: ConcurrentReadInputStream with ListNum<Read> batch processing and cris.nextList() buffer management
Subsampling: cris.setSampleRate(samplerate, sampleseed) with deterministic or random sampling modes
Memory Management: Thread-local SketchMakerMini instances prevent heap contention, with final accumulation via canonHeap.add()

Statistical Methods

The tool implements quantitative statistical algorithms for strand bias detection:

Binomial Testing: CheckStrand.calcStrandedness() applies binomial distribution comparison against null hypothesis of 50/50 strand distribution
Depth Normalization: CheckStrand.calcStrandednessNormalized() weights each unique k-mer equally using normalized depth ratios
Expected Minor Allele Calculation: CheckStrand.expectedMinorAlleleCount() models technical strand assignment variation
Multi-metric Strandedness: CheckStrand.strandedness() function computes confidence using (max(a,b) / (a+b)) ratio with statistical bounds
Range-based Calculations: Handles edge cases with maxMinorSum bounds and dif/range normalization for overly unstranded samples

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org