PROCESSHI-C

Script: processhi-c.sh Package: jgi Class: FindHiCJunctions.java

Finds and trims junctions in mapped Hi-C reads. For the purpose of reporting junction motifs, this requires paired-end reads, because only improper pairs will be considered as possibly containing junctions. However, all reads that map with soft-clipping will be trimmed on the 3' (right) end, regardless of pairing status.

Basic Usage

processhi-c.sh in=<mapped reads> out=<trimmed reads>

Processes mapped Hi-C reads to identify and trim junction sites. The tool analyzes SAM/BAM files and outputs trimmed reads, optionally generating files with kmer counts at junction sites for motif analysis.

Parameters

Parameters are organized based on their function in Hi-C junction processing. All parameters from the shell script are documented below.

Input/Output Parameters

in=<file>: A SAM/BAM file containing mapped reads. Required.
out=<file>: Output file for the processed reads, with junctions trimmed.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true

Junction Analysis Parameters

printkmers=t: Generate files with kmer counts at junction sites. Enables motif analysis by outputting kmer frequency data. Default: true
junctions=junctions_k%.txt: File pattern for junction output. The '%' is replaced with kmer length (4, 6, 8, 10) and direction indicators (L/R). Default: junctions_k%.txt
minclip=8: Minimum clipping length required to consider a junction. Reads must have at least this many clipped bases to be processed. Default: 8

Processing Parameters

verbose=f: Enable verbose output for debugging. Provides detailed logging of processing steps. Default: false
trim=t: Enable trimming of junction sites. When true, reads are trimmed at identified junction positions. Default: true
mintrimlength=25: Minimum length to retain after trimming. Reads shorter than this after trimming are handled according to the algorithm. Default: 25
mincount=2: Minimum count threshold for kmer reporting. Kmers appearing fewer times are filtered from output. Default: 2
minfraction=0.0005: Minimum fraction threshold for kmer reporting relative to total kmer count. Default: 0.0005

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Junction Processing

processhi-c.sh in=mapped_hic_reads.bam out=trimmed_reads.fastq

Process mapped Hi-C reads from a BAM file, trim junction sites, and output trimmed reads in FASTQ format.

Generate Kmer Junction Reports

processhi-c.sh in=mapped_reads.sam out=trimmed.fq printkmers=t junctions=hic_motifs_k%.txt

Process Hi-C reads while generating detailed kmer count files for junction motif analysis. Output files will be named hic_motifs_k4.txt, hic_motifs_k6.txt, etc.

Custom Clipping Thresholds

processhi-c.sh in=data.bam out=processed.fq minclip=5 mintrimlength=30

Process reads with more sensitive clipping detection (minimum 5 bases) and retain longer sequences after trimming (minimum 30 bases).

High-Sensitivity Motif Analysis

processhi-c.sh in=hic.bam out=out.fq mincount=1 minfraction=0.0001 junctions=sensitive_k%.tsv

Run motif analysis with lower thresholds (mincount=1, minfraction=0.0001) to capture rare junction motifs. The junction files are written to sensitive_k%.tsv for easier downstream analysis.

Algorithm Details

Junction Detection Strategy

ProcessHi-C uses a dual approach for identifying Hi-C junctions:

Improper Pair Analysis: For paired-end reads, identifies junctions by analyzing improper pairs where mates map to different chromosomes or with unexpected orientations
Soft-Clipping Detection: Examines CIGAR strings to identify reads with significant soft-clipping, indicating potential junction sites
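
The two checks above can be illustrated on a single SAM record. This is a minimal Python sketch, not the FindHiCJunctions code: it assumes that "improper pair" can be approximated by the SAM flag bits (paired, 0x1, set and proper-pair, 0x2, unset), and the function names soft_clips and junction_candidate are hypothetical.

```python
import re

# CIGAR operations as defined by the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4


def soft_clips(cigar):
    """Return (left, right) soft-clipped lengths; hard clips are ignored."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return 0, 0
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


def junction_candidate(sam_line, minclip=8):
    """Return (improper_pair, clipped) for one SAM record.

    Assumption: a paired read without the proper-pair flag counts as improper.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    if flag & FLAG_UNMAPPED or cigar == "*":
        return False, False
    improper = bool(flag & FLAG_PAIRED) and not (flag & FLAG_PROPER_PAIR)
    clipped = max(soft_clips(cigar)) >= minclip
    return improper, clipped
```

A record with flag 65 and CIGAR 90M10S would be reported as both improper and clipped under these assumptions, while a properly paired read with CIGAR 100M would be neither.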

Soft-Clipping Algorithm

The softClipMatch() method uses a scoring system to identify optimal clipping positions:

Match Score: +100 for exact matches, +1 for N bases
Substitution Penalties: -200 for first substitution, -100 for consecutive substitutions
Indel Penalties: -200 for insertions, -200/-10 for deletions (first/consecutive)
Clipping Penalty: -1 per clipped base
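
The scoring values above can be illustrated with a minimal Python sketch. It is not the softClipMatch() implementation: the match-string symbols (m, S, I, D, N), the function names, and the way the per-base clipping penalty is folded into a maximum-scoring-span search are all assumptions made for illustration.

```python
CLIP_PENALTY = -1  # per clipped read base, as listed above


def op_scores(match):
    """Score each alignment op; some penalties depend on the previous op."""
    scores = []
    prev = None
    for op in match:
        if op == "m":          # exact match
            s = 100
        elif op == "N":        # N base
            s = 1
        elif op == "S":        # substitution: first vs consecutive
            s = -100 if prev == "S" else -200
        elif op == "I":        # insertion
            s = -200
        elif op == "D":        # deletion: first vs consecutive
            s = -10 if prev == "D" else -200
        else:
            raise ValueError("unexpected match symbol: " + op)
        scores.append(s)
        prev = op
    return scores


def best_unclipped_span(match):
    """Return (start, end, total) for the half-open op span worth keeping.

    total = sum of kept op scores + CLIP_PENALTY * clipped read bases.
    """
    scores = op_scores(match)
    # Keeping a read base avoids one clip penalty; deletions use no read base.
    adjusted = [s - (0 if op == "D" else CLIP_PENALTY)
                for s, op in zip(scores, match)]
    best_start, best_end, best = 0, 0, 0
    cur_start, cur = 0, 0
    for i, a in enumerate(adjusted):  # Kadane's maximum-subarray scan
        if cur <= 0:
            cur_start, cur = i, a
        else:
            cur += a
        if cur > best:
            best_start, best_end, best = cur_start, i + 1, cur
    kept = sum(scores[best_start:best_end])
    clipped = sum(1 for j, op in enumerate(match)
                  if op != "D" and not (best_start <= j < best_end))
    return best_start, best_end, kept + CLIP_PENALTY * clipped
```

For a read whose alignment is 20 matches followed by a noisy tail (SSmSSm), this sketch keeps the first 20 positions and clips the 6-base tail, since the tail's substitution penalties outweigh its two matches.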

Multi-Scale Kmer Analysis

Junction motifs are analyzed at multiple kmer lengths (4, 6, 8, 10) using pre-allocated count arrays:

Kmer Lengths: 4, 6, 8, and 10 nucleotides, for analysis at different resolutions
Directional Analysis: Separate counting for left (5') and right (3') sides of junctions
Statistical Filtering: Kmers are filtered based on both absolute count (mincount) and relative frequency (minfraction)
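
The counting and filtering steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: the input shape (a sequence plus a junction position), the function names, and the use of >= for both thresholds are hypothetical; only the 4^k array size, the left/right split, and the mincount/minfraction defaults come from this page.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Map a kmer to its index in a 4^k array, or None if it has a non-ACGT base."""
    idx = 0
    for base in kmer:
        code = BASE_CODE.get(base)
        if code is None:
            return None
        idx = idx * 4 + code
    return idx


def decode(idx, k):
    """Inverse of encode: turn an array index back into a kmer string."""
    bases = []
    for _ in range(k):
        bases.append("ACGT"[idx % 4])
        idx //= 4
    return "".join(reversed(bases))


def count_junction_kmers(reads, k):
    """Count the k bases left and right of each junction in separate arrays.

    reads: iterable of (sequence, junction_pos) pairs (hypothetical input shape).
    """
    left = [0] * (4 ** k)
    right = [0] * (4 ** k)
    for seq, pos in reads:
        if pos - k >= 0:
            idx = encode(seq[pos - k:pos])
            if idx is not None:
                left[idx] += 1
        if pos + k <= len(seq):
            idx = encode(seq[pos:pos + k])
            if idx is not None:
                right[idx] += 1
    return left, right


def filter_counts(counts, k, mincount=2, minfraction=0.0005):
    """Keep kmers passing both thresholds, sorted by descending count."""
    total = sum(counts)
    kept = [(decode(idx, k), c) for idx, c in enumerate(counts)
            if total > 0 and c >= mincount and c / total >= minfraction]
    kept.sort(key=lambda pair: -pair[1])
    return kept
```

With three reads carrying GATC on both sides of the junction and one read carrying a different kmer, only GATC survives the default mincount=2 filter on each side.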

Memory Management

Kmer counting uses fixed-size arrays allocated during initialization:

Hash Arrays: Pre-allocated arrays sized for each kmer length (4^k entries)
Dual Counting: Separate arrays for junction kmers and flanking region analysis
Memory Efficiency: Default memory allocation of 200MB, which can be raised with -Xmx if needed
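
As a rough check on the array sizes above, the following Python sketch computes the number of entries for each kmer length. The 8-byte counter width is an assumption (the source does not state the element type), so the byte total is only indicative.

```python
KMER_LENGTHS = (4, 6, 8, 10)
BYTES_PER_COUNT = 8  # assumption: one 64-bit counter per array entry


def array_entries(k):
    """Number of distinct kmers of length k over A, C, G, T."""
    return 4 ** k


entries = {k: array_entries(k) for k in KMER_LENGTHS}
total_entries = sum(entries.values())
total_bytes = total_entries * BYTES_PER_COUNT

for k, n in entries.items():
    print(f"k={k}: {n} entries")
print(f"one set of arrays: {total_entries} entries, "
      f"{total_bytes / 2**20:.1f} MiB")
```

Under that assumption one full set of arrays stays below 10 MB, which is consistent with a 200MB default being sufficient for the counting itself.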

Output Formats

Junction analysis results support multiple output formats:

FASTA Format: Junction kmers with count and frequency information in headers
TSV Format: Tab-separated values for easy import into analysis tools
Multi-File Output: Separate files for each kmer length and direction combination

Performance Considerations

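Memory Usage: Kmer count arrays are fixed-size (4^k entries per kmer length), so memory use is set by the kmer lengths rather than by dataset diversity; the default 200MB handles most datasets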
Processing Speed: Linear with input size; approximately 1-2 million reads per minute
Junction Sensitivity: Adjusting minclip affects both sensitivity and specificity
Output Size: Kmer files can be large for diverse datasets; raise mincount or minfraction to limit them

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org