PROCESSHI-C

Script: processhi-c.sh
Package: jgi
Class: FindHiCJunctions.java

Finds and trims junctions in mapped Hi-C reads. Reporting junction motifs requires paired-end reads, because only improper pairs are considered as possibly containing junctions. However, all reads that map with soft-clipping are trimmed on the 3' (right) end, regardless of pairing status.
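As a rough illustration of the pairing condition, the sketch below tests the SAM FLAG field for a mapped, paired read that is not flagged as a proper pair. The bit values are taken from the SAM specification; the helper name is_improper_pair is hypothetical, and whether FindHiCJunctions applies exactly this test is an assumption.

```python
# SAM FLAG bits as defined in the SAM specification.
FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4


def is_improper_pair(flag: int) -> bool:
    """Return True for a mapped, paired read not flagged as a proper pair.

    Hypothetical helper for illustration; the tool's own test may differ.
    """
    paired = bool(flag & FLAG_PAIRED)
    proper = bool(flag & FLAG_PROPER_PAIR)
    mapped = not (flag & FLAG_UNMAPPED)
    return paired and mapped and not proper


print(is_improper_pair(65))  # paired, first in pair, not proper: True
print(is_improper_pair(99))  # paired, proper pair: False
```

Under this reading, single-end reads never contribute to motif reports, although they may still be trimmed.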

Basic Usage

processhi-c.sh in=<mapped reads> out=<trimmed reads>

Processes mapped Hi-C reads to identify and trim junction sites. The tool analyzes SAM/BAM files and outputs trimmed reads, optionally generating files with kmer counts at junction sites for motif analysis.

Parameters

Parameters are organized based on their function in Hi-C junction processing. All parameters from the shell script are documented below.

Input/Output Parameters

in=<file>
A SAM/BAM file containing mapped reads. Required parameter for input.
out=<file>
Output file of trimmed reads. Writes processed reads with junctions trimmed.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true

Junction Analysis Parameters

printkmers=t
Generate files with kmer counts at junction sites. Enables motif analysis by outputting kmer frequency data. Default: true
junctions=junctions_k%.txt
File pattern for junction output. The '%' is replaced with kmer length (4, 6, 8, 10) and direction indicators (L/R). Default: junctions_k%.txt
minclip=8
Minimum clipping length required to consider a junction. Reads must have at least this many clipped bases to be processed. Default: 8
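To make the minclip threshold concrete, here is a minimal sketch that reads soft-clip lengths from a SAM CIGAR string and compares them with minclip. The function names are hypothetical, and checking either end of the read (rather than only the 3' end) is an assumption, not a statement about how the tool does it.

```python
import re
from typing import Tuple

# Matches the operations of a SAM CIGAR string, e.g. "10S80M10S".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar: str) -> Tuple[int, int]:
    """Return (left, right) soft-clip lengths from a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside soft clips, so drop them first.
    ops = [o for o in ops if o[1] != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


def passes_minclip(cigar: str, minclip: int = 8) -> bool:
    """True if either end has at least minclip soft-clipped bases (assumed rule)."""
    left, right = soft_clip_lengths(cigar)
    return max(left, right) >= minclip


print(passes_minclip("92M8S"))             # True
print(passes_minclip("95M5S"))             # False
print(passes_minclip("95M5S", minclip=5))  # True
```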

Processing Parameters

verbose=f
Enable verbose output for debugging. Provides detailed logging of processing steps. Default: false
trim=t
Enable trimming of junction sites. When true, reads are trimmed at identified junction positions. Default: true
mintrimlength=25
Minimum length to retain after trimming. Reads shorter than this after trimming are handled according to the algorithm. Default: 25
mincount=2
Minimum count threshold for kmer reporting. Kmers appearing fewer times are filtered from output. Default: 2
minfraction=0.0005
Minimum fraction threshold for kmer reporting relative to total kmer count. Default: 0.0005
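The interaction of mincount and minfraction can be sketched as a simple filter over a table of kmer counts. This example assumes a kmer must pass both thresholds to be reported and that the fraction is taken over the total of all counts in the same table; the function name and the sample counts are invented for illustration.

```python
def filter_kmers(counts, mincount=2, minfraction=0.0005):
    """Keep kmers meeting both the count and the fraction threshold.

    Assumes both thresholds must pass; the tool's exact rule may differ.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        kmer: count
        for kmer, count in counts.items()
        if count >= mincount and count / total >= minfraction
    }


# Made-up counts, for illustration only.
example = {"GATC": 9000, "AATT": 997, "CCGG": 2, "TTAA": 1}
print(filter_kmers(example))  # {'GATC': 9000, 'AATT': 997}
```

Here CCGG meets mincount=2 but falls below minfraction (2/10000 = 0.0002), so it is dropped along with TTAA.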

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Junction Processing

processhi-c.sh in=mapped_hic_reads.bam out=trimmed_reads.fastq

Process mapped Hi-C reads from a BAM file, trim junction sites, and output trimmed reads in FASTQ format.

Generate Kmer Junction Reports

processhi-c.sh in=mapped_reads.sam out=trimmed.fq printkmers=t junctions=hic_motifs_k%.txt

Process Hi-C reads while generating kmer count files for junction motif analysis. Output file names follow the hic_motifs_k%.txt pattern, with '%' replaced as described under junctions= (kmer length and direction indicator).

Custom Clipping Thresholds

processhi-c.sh in=data.bam out=processed.fq minclip=5 mintrimlength=30

Process reads with more sensitive clipping detection (minimum 5 bases) and retain longer sequences after trimming (minimum 30 bases).

High-Sensitivity Motif Analysis

processhi-c.sh in=hic.bam out=out.fq mincount=1 minfraction=0.0001 junctions=sensitive_k%.tsv

Generate motif reports with lower thresholds (mincount=1, minfraction=0.0001) to capture rare junction motifs. The junction files are written with a .tsv extension for easier downstream handling.

Algorithm Details

Junction Detection Strategy

ProcessHi-C uses a dual approach for identifying Hi-C junctions:

Soft-Clipping Algorithm

The softClipMatch() method uses a scoring system to identify optimal clipping positions:

Multi-Scale Kmer Analysis

Junction motifs are analyzed at multiple kmer lengths (4, 6, 8, 10) using pre-allocated count arrays:
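One plausible way to picture counting at several kmer lengths is a 2-bit base encoding indexing into arrays of 4^k entries (256, 4096, 65536 and 1048576 for k = 4, 6, 8 and 10). The sketch below is an assumption-laden illustration, not the FindHiCJunctions implementation: the encoding, the function names, and the choice to count only the kmer starting at the junction position are all hypothetical.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
KMER_LENGTHS = (4, 6, 8, 10)


def new_count_arrays():
    """Allocate one fixed-size count array of 4**k entries per kmer length."""
    return {k: [0] * (4 ** k) for k in KMER_LENGTHS}


def encode_kmer(kmer):
    """Pack a kmer into an integer, 2 bits per base; None if it has a non-ACGT base."""
    code = 0
    for base in kmer:
        bits = BASE_TO_BITS.get(base)
        if bits is None:
            return None
        code = (code << 2) | bits
    return code


def count_at_junction(seq, junction, counts):
    """Count the kmer starting at `junction` for every k that fits in seq."""
    for k, array in counts.items():
        kmer = seq[junction:junction + k]
        if len(kmer) < k:
            continue  # not enough bases left for this kmer length
        code = encode_kmer(kmer)
        if code is not None:
            array[code] += 1


counts = new_count_arrays()
count_at_junction("ACGTGATCGATC", 4, counts)
print(counts[4][encode_kmer("GATC")])  # 1
```

Because the arrays are sized up front, counting a kmer is a single index increment with no hashing or resizing.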

Memory Management

Kmer counting uses fixed-size arrays allocated during initialization:

Output Formats

Junction analysis results support multiple output formats:

Performance Considerations

Support

For questions and support: