PROCESSHI-C
Finds and trims junctions in mapped Hi-C reads. Reporting junction motifs requires paired-end reads, because only improper pairs are considered as possibly containing junctions. However, all reads that map with soft-clipping are trimmed on the 3' (right) end, regardless of pairing status.
Basic Usage
processhi-c.sh in=<mapped reads> out=<trimmed reads>
Processes mapped Hi-C reads to identify and trim junction sites. The tool analyzes SAM/BAM files and outputs trimmed reads, optionally generating files with kmer counts at junction sites for motif analysis.
Parameters
Parameters are organized based on their function in Hi-C junction processing. All parameters from the shell script are documented below.
Input/Output Parameters
- in=<file>
- A SAM/BAM file containing mapped reads. Required.
- out=<file>
- Output file of trimmed reads. Writes processed reads with junctions trimmed.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true
Junction Analysis Parameters
- printkmers=t
- Generate files with kmer counts at junction sites. Enables motif analysis by outputting kmer frequency data. Default: true
- junctions=junctions_k%.txt
- File pattern for junction output. The '%' is replaced with kmer length (4, 6, 8, 10) and direction indicators (L/R). Default: junctions_k%.txt
- minclip=8
- Minimum clipping length required to consider a junction. Reads must have at least this many clipped bases to be processed. Default: 8
Processing Parameters
- verbose=f
- Enable verbose output for debugging. Provides detailed logging of processing steps. Default: false
- trim=t
- Enable trimming of junction sites. When true, reads are trimmed at identified junction positions. Default: true
- mintrimlength=25
- Minimum read length to retain after trimming. Default: 25
- mincount=2
- Minimum count threshold for kmer reporting. Kmers appearing fewer times are filtered from output. Default: 2
- minfraction=0.0005
- Minimum fraction threshold for kmer reporting relative to total kmer count. Default: 0.0005
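The interaction of mincount and minfraction can be sketched in Python as follows. This is an illustration only, not the tool's code: the function name filter_kmers, the dictionary input, and the inclusive comparisons are assumptions.

```python
def filter_kmers(counts, mincount=2, minfraction=0.0005):
    """Keep kmers that pass both thresholds.

    counts maps kmer -> observed count. Treating both comparisons as
    inclusive (>=) is an assumption, not documented behavior.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        kmer: n
        for kmer, n in counts.items()
        if n >= mincount and n / total >= minfraction
    }


# Made-up counts for illustration (total = 10000).
example = {"GATC": 9000, "AATT": 997, "CCGG": 2, "TTAA": 1}
kept = filter_kmers(example)
# CCGG passes mincount but 2/10000 is below minfraction; TTAA fails mincount.
```

Under these assumptions, a kmer must clear both thresholds to be reported, so lowering only one of them may not be enough to surface rare motifs.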
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Junction Processing
processhi-c.sh in=mapped_hic_reads.bam out=trimmed_reads.fastq
Process mapped Hi-C reads from a BAM file, trim junction sites, and output trimmed reads in FASTQ format.
Generate Kmer Junction Reports
processhi-c.sh in=mapped_reads.sam out=trimmed.fq printkmers=t junctions=hic_motifs_k%.txt
Process Hi-C reads while generating detailed kmer count files for junction motif analysis. Output files will be named hic_motifs_k4.txt, hic_motifs_k6.txt, etc.
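As a rough illustration of how the '%' placeholder expands into the file names listed above, the following Python sketch substitutes each kmer length into the pattern. The helper junction_filenames is hypothetical, and it deliberately leaves out the direction indicators, whose exact placement in the file name is not specified here.

```python
def junction_filenames(pattern, kmer_lengths=(4, 6, 8, 10)):
    """Expand '%' in the pattern to each kmer length (direction suffix omitted)."""
    return [pattern.replace("%", str(k)) for k in kmer_lengths]


names = junction_filenames("hic_motifs_k%.txt")
```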
Custom Clipping Thresholds
processhi-c.sh in=data.bam out=processed.fq minclip=5 mintrimlength=30
Process reads with more sensitive clipping detection (minimum 5 bases) and retain longer sequences after trimming (minimum 30 bases).
High-Sensitivity Motif Analysis
processhi-c.sh in=hic.bam out=out.fq mincount=1 minfraction=0.0001 junctions=sensitive_k%.tsv
Run motif analysis with lower thresholds (mincount=1, minfraction=0.0001) to capture rare junction motifs. The junction files are written with a .tsv extension for easier downstream analysis.
Algorithm Details
Junction Detection Strategy
ProcessHi-C uses a dual approach for identifying Hi-C junctions:
- Improper Pair Analysis: For paired-end reads, identifies junctions by analyzing improper pairs where mates map to different chromosomes or with unexpected orientations
- Soft-Clipping Detection: Examines CIGAR strings to identify reads with significant soft-clipping, indicating potential junction sites
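The two checks above can be sketched in Python using only the SAM flag and CIGAR fields. The flag bits (0x1 paired, 0x2 properly paired) and the 'S' soft-clip operator come from the SAM format; the function names and the way the two checks are combined in may_contain_junction are assumptions for illustration, not the tool's actual logic.

```python
import re

FLAG_PAIRED = 0x1       # read is part of a pair
FLAG_PROPER_PAIR = 0x2  # aligner marked the pair as properly aligned

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_improper_pair(flag):
    """True for a paired read whose pair is not flagged as proper."""
    return bool(flag & FLAG_PAIRED) and not (flag & FLAG_PROPER_PAIR)


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clipped lengths, ignoring hard clips."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return (0, 0)
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (left, right)


def may_contain_junction(flag, cigar, minclip=8):
    """Assumed combination: improper pair with at least minclip clipped bases."""
    left, right = soft_clip_lengths(cigar)
    return is_improper_pair(flag) and max(left, right) >= minclip
```

For example, a read with flag 97 (paired, not proper) and CIGAR 60M15S would be a candidate under these assumptions, while the same CIGAR with flag 99 (proper pair) would not.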
Soft-Clipping Algorithm
The softClipMatch() method uses a scoring system to identify optimal clipping positions:
- Match Score: +100 for exact matches, +1 for N bases
- Substitution Penalties: -200 for first substitution, -100 for consecutive substitutions
- Indel Penalties: -200 for insertions, -200/-10 for deletions (first/consecutive)
- Clipping Penalty: -1 per clipped base
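To make the scoring concrete, here is a minimal Python sketch that applies the scores listed above to a per-base operation string and picks the right-end clip point with the highest total. How softClipMatch() actually combines these scores is not described here, so the search strategy, the operation alphabet ('=' match, 'N', 'X' substitution, 'I', 'D'), and the name best_right_clip are all assumptions.

```python
MATCH, N_BASE = 100, 1
SUB_FIRST, SUB_NEXT = -200, -100
INS = -200
DEL_FIRST, DEL_NEXT = -200, -10
CLIP = -1


def op_score(op, prev):
    """Score one operation; substitutions and deletions are cheaper when consecutive."""
    if op == "=":
        return MATCH
    if op == "N":
        return N_BASE
    if op == "X":
        return SUB_NEXT if prev == "X" else SUB_FIRST
    if op == "I":
        return INS
    if op == "D":
        return DEL_NEXT if prev == "D" else DEL_FIRST
    raise ValueError(f"unknown operation: {op!r}")


def best_right_clip(ops):
    """Return (kept_ops, score) for the best prefix; everything after it is clipped."""
    read_len = sum(1 for op in ops if op != "D")  # deletions consume no read bases
    best_keep, best_score = 0, CLIP * read_len    # baseline: clip the whole read
    running, kept_bases, prev = 0, 0, None
    for i, op in enumerate(ops, 1):
        running += op_score(op, prev)
        prev = op
        if op != "D":
            kept_bases += 1
        total = running + CLIP * (read_len - kept_bases)
        if total > best_score:
            best_keep, best_score = i, total
    return best_keep, best_score


# Ten matches followed by a noisy tail: the tail is cheaper to clip than to keep.
result = best_right_clip("=" * 10 + "X=XX=")
```

Because a first substitution costs twice what a match earns, a short run of mismatches near the read end quickly makes clipping (at -1 per base) the better choice in this sketch.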
Multi-Scale Kmer Analysis
Junction motifs are analyzed at multiple kmer lengths (4, 6, 8, 10) using pre-allocated count arrays:
- Kmer Lengths: 4, 6, 8, and 10 nucleotides for different resolution analysis
- Directional Analysis: Separate counting for left (5') and right (3') sides of junctions
- Statistical Filtering: Kmers are filtered based on both absolute count (mincount) and relative frequency (minfraction)
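A minimal Python sketch of the multi-scale, directional counting described above follows. It assumes the left kmer is the k bases ending at the junction and the right kmer is the k bases starting at it; that definition, the (sequence, junction index) input, and the name count_junction_kmers are illustrative assumptions rather than the tool's implementation.

```python
from collections import Counter

KMER_LENGTHS = (4, 6, 8, 10)


def count_junction_kmers(junctions, kmer_lengths=KMER_LENGTHS):
    """Count kmers on each side of each junction.

    junctions: iterable of (sequence, junction_index) pairs.
    Returns {(k, side): Counter}, with side 'L' or 'R'.
    """
    counts = {(k, side): Counter() for k in kmer_lengths for side in "LR"}
    for seq, pos in junctions:
        for k in kmer_lengths:
            left = seq[max(0, pos - k):pos]
            right = seq[pos:pos + k]
            # Skip sides that do not have a full k bases available.
            if len(left) == k:
                counts[(k, "L")][left] += 1
            if len(right) == k:
                counts[(k, "R")][right] += 1
    return counts


# Two copies of a made-up read with a junction after base 8.
counts = count_junction_kmers([("ACGTGATCGATCAAGT", 8)] * 2)
```

Short kmers pick up compact motifs on many reads, while the longer lengths only contribute when enough sequence flanks the junction.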
Memory Management
Kmer counting uses fixed-size arrays allocated during initialization:
- Hash Arrays: Pre-allocated arrays sized for each kmer length (4^k entries)
- Dual Counting: Separate arrays for junction kmers and flanking region analysis
- Memory Efficiency: The default memory allocation is 200MB and can be raised with -Xmx if needed
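The 4^k sizing can be illustrated with a short Python sketch. It assumes a 2-bit-per-base encoding (A=0, C=1, G=2, T=3) to turn a kmer into an array index; that encoding and the helper names are assumptions, since only the array sizes are stated above.

```python
KMER_LENGTHS = (4, 6, 8, 10)
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding


def array_size(k):
    """Number of distinct kmers of length k over A/C/G/T."""
    return 4 ** k


def kmer_index(kmer):
    """Pack a kmer into an integer index, 2 bits per base.

    Raises KeyError for bases outside A/C/G/T (for example N).
    """
    index = 0
    for base in kmer:
        index = (index << 2) | BASE_TO_BITS[base]
    return index


def new_count_arrays(kmer_lengths=KMER_LENGTHS):
    """Allocate one zeroed count array per kmer length."""
    return {k: [0] * array_size(k) for k in kmer_lengths}


arrays = new_count_arrays()
arrays[4][kmer_index("GATC")] += 1
total_entries = sum(array_size(k) for k in KMER_LENGTHS)
```

With lengths 4, 6, 8, and 10 the arrays hold 256, 4096, 65536, and 1048576 entries respectively, so the k=10 array dominates the footprint.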
Output Formats
Junction analysis results support multiple output formats:
- FASTA Format: Junction kmers with count and frequency information in headers
- TSV Format: Tab-separated values for easy import into analysis tools
- Multi-File Output: Separate files for each kmer length and direction combination
Performance Considerations
- Memory Usage: Kmer count arrays are fixed-size (4^k entries per kmer length); the default 200MB handles most datasets
- Processing Speed: Linear with input size; approximately 1-2 million reads per minute
- Junction Sensitivity: Adjusting minclip affects both sensitivity and specificity
- Output Size: Kmer files can be large for diverse datasets; use filtering parameters appropriately
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org