CountSharedLines
Counts the number of lines shared between sets of files. One output file will be printed for each input file. For example, an output file for a file in the 'in1' set will contain one line per file in the 'in2' set, indicating how many lines are shared.
Basic Usage
countsharedlines.sh in1=<file,file...> in2=<file,file...>
This tool compares two sets of text files and reports how many lines are shared between each file in the first set and each file in the second set. Output files are automatically generated with "out_" prefixes in the same directories as the input files.
Parameters
Parameters control matching behavior, file handling, and output formatting for line comparison operations.
- include=f
- Set to 'true' to include the filtered names rather than excluding them. This controls whether matching lines are included or excluded from the counting process.
- prefix=f
- Match on the line's prefix only (all characters up to the first whitespace). When enabled, everything after the first whitespace character is ignored during comparison, so lines that share an identifier but differ in trailing annotations count as shared.
- case=t
- (casesensitive) Match case also. When set to true (default), line comparisons are case-sensitive. Set to false for case-insensitive matching where "ABC" would match "abc".
- ow=t
- (overwrite) Overwrites files that already exist. When true (default), existing output files will be replaced. Set to false to preserve existing output files.
- app=f
- (append) Append to files that already exist. When true, output will be appended to existing files instead of overwriting them. Cannot be used simultaneously with overwrite mode.
- zl=4
- (ziplevel) Set compression level, 1 (low) to 9 (max). Controls gzip compression level for output files. Higher values provide better compression but slower processing.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Memory usage scales with the number of unique lines in input files.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing system instability when processing very large files.
- -da
- Disable assertions. Removes internal consistency checks for slightly improved performance in production environments.
Examples
Basic File Comparison
countsharedlines.sh in1=file1.txt,file2.txt in2=ref1.txt,ref2.txt
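Compares each file in the first set (file1.txt, file2.txt) against each file in the second set (ref1.txt, ref2.txt). Because one output file is written per input file, this produces out_file1.txt and out_file2.txt (each with one line per 'in2' file) as well as out_ref1.txt and out_ref2.txt (each with one line per 'in1' file).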
Case-Insensitive Prefix Matching
countsharedlines.sh in1=data.txt in2=reference.txt case=f prefix=t
Performs case-insensitive comparison using only the prefix of each line (up to first whitespace). Useful for matching identifiers that may have different suffixes or annotations.
Multiple File Sets with Explicit Overwrite
countsharedlines.sh in1=sample1.txt,sample2.txt,sample3.txt in2=control1.txt,control2.txt ow=t
Compares three sample files against two control files, creating one output file per input file (five in total). Since ow=t is the default, stating it here only makes explicit that existing output files will be overwritten.
Algorithm Details
Data Structure Implementation
CountSharedLines loads each file's lines into a LinkedHashSet and compares the resulting sets:
- LinkedHashSet<String> Storage: getContents() method loads each file into a LinkedHashSet, providing O(1) contains() operations while maintaining insertion order
- TextFile Reading: Files are processed line-by-line using TextFile.readLine(true) with automatic compression detection via FileFormat.testInput()
- Case Normalization: When ignoreCase is true (that is, when case=f is set; the documented default case=t is case-sensitive), lines are converted using String.toLowerCase() before storage
- Prefix Extraction: In prefixMode, Character.isWhitespace() detects the first whitespace to truncate lines at word boundaries
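The loading steps above can be illustrated with a minimal sketch. It is not the tool's source: the class and method names (LineSetSketch, prefixOf, loadLines) are hypothetical, and an in-memory list of lines stands in for reading through TextFile. It only shows how prefix truncation and lowercasing collapse lines before they enter an insertion-ordered set.

```java
import java.util.LinkedHashSet;
import java.util.List;

class LineSetSketch {

    // Hypothetical helper: cut a line at its first whitespace character.
    static String prefixOf(String line) {
        for (int i = 0; i < line.length(); i++) {
            if (Character.isWhitespace(line.charAt(i))) {
                return line.substring(0, i);
            }
        }
        return line; // no whitespace, so the whole line is the prefix
    }

    // Hypothetical stand-in for getContents(): builds an insertion-ordered set of lines.
    static LinkedHashSet<String> loadLines(List<String> lines, boolean ignoreCase, boolean prefixMode) {
        LinkedHashSet<String> set = new LinkedHashSet<>();
        for (String line : lines) {
            if (prefixMode) line = prefixOf(line);
            if (ignoreCase) line = line.toLowerCase();
            set.add(line); // duplicates are dropped by the set
        }
        return set;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("GeneA description one", "genea other note", "GeneB");
        // With both options on, the first two lines collapse to the same key.
        System.out.println(loadLines(lines, true, true)); // prints [genea, geneb]
    }
}
```

With both options off, the same three input lines would be stored unchanged as three distinct entries.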
Processing Algorithm
The comparison process follows these steps:
- File Loading: All lines from each file are loaded into separate LinkedHashSet collections
- Cross-Comparison: For each file in set 1, compare against every file in set 2 using set intersection
- Count Calculation: Shared lines are counted using LinkedHashSet.contains() for O(1) expected lookup per line
- Output Generation: Results are written as tab-delimited files with format: filename\tshared_count
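As a rough sketch of the cross-comparison and count steps, the following counts shared entries between one file's set and each file in the other set, then formats the rows as described above. The names SharedCountSketch, countShared, and report are hypothetical, and the sets are built in memory rather than loaded from files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SharedCountSketch {

    // Count how many entries of 'a' are also present in 'b' (one hash lookup per entry).
    static int countShared(Set<String> a, Set<String> b) {
        int shared = 0;
        for (String line : a) {
            if (b.contains(line)) shared++;
        }
        return shared;
    }

    // Build the report for one file: one "filename<TAB>count" row per file in the other set.
    static List<String> report(Set<String> own, Map<String, Set<String>> others) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : others.entrySet()) {
            rows.add(e.getKey() + "\t" + countShared(own, e.getValue()));
        }
        return rows;
    }

    public static void main(String[] args) {
        Set<String> file1 = Set.of("a", "b", "c");
        Map<String, Set<String>> refs = new LinkedHashMap<>(); // keeps file order stable
        refs.put("ref1.txt", Set.of("b", "c", "d"));
        refs.put("ref2.txt", Set.of("x"));
        // Prints "ref1.txt<TAB>2" then "ref2.txt<TAB>0".
        report(file1, refs).forEach(System.out::println);
    }
}
```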
Performance Characteristics
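- Time Complexity: Loading is O(n + m), where n is total lines in set 1 and m is total lines in set 2. Assuming O(1) expected hash lookups, each pairwise comparison is linear in the size of the file being scanned, so the comparisons cost roughly O(n*f2) for the set 1 reports, where f2 is the number of files in set 2 (and symmetrically for set 2)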
- Space Complexity: O(u) where u is the number of unique lines across all input files
- Memory Usage: Scales linearly with unique line count, typically ~50-100 bytes per unique line
- Optimal For: Comparing files with moderate numbers of unique lines (up to millions)
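To turn the per-line figure above into a heap size for -Xmx, a back-of-the-envelope calculation is enough. The sketch below simply multiplies a unique-line count by an assumed per-line cost; the class name, method name, and the 20 million line example are illustrative, not measurements.

```java
class MemoryEstimateSketch {

    // Rough heap estimate: unique lines times an assumed per-line cost in bytes.
    static long estimateBytes(long uniqueLines, long bytesPerLine) {
        return uniqueLines * bytesPerLine;
    }

    public static void main(String[] args) {
        long uniqueLines = 20_000_000L;                // illustrative input size
        long low = estimateBytes(uniqueLines, 50);     // lower end of the quoted range
        long high = estimateBytes(uniqueLines, 100);   // upper end of the quoted range
        // Prints "1000 MB to 2000 MB".
        System.out.println(low / 1_000_000 + " MB to " + high / 1_000_000 + " MB");
    }
}
```

Under these assumptions, an -Xmx value somewhat above the upper figure (for example -Xmx3g) would leave headroom for longer lines and JVM overhead.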
Output Format
Output files contain tab-delimited results with automatic naming:
- File Naming: Input file "path/data.txt" produces output file "path/out_data.txt"
- Content Format: Each line shows "compared_filename\tshared_line_count"
- Bidirectional Output: Both file sets generate output files showing their perspective of shared lines
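The naming rule above can be expressed as a small string transformation. This is a sketch of the documented behavior only; the helper name outputName is hypothetical, and it uses plain string operations so that it behaves the same with either path separator.

```java
class OutputNameSketch {

    // Insert "out_" in front of the file name, keeping the directory part unchanged.
    static String outputName(String input) {
        int slash = Math.max(input.lastIndexOf('/'), input.lastIndexOf('\\'));
        // slash is -1 when there is no directory part, so substring(0, 0) is empty.
        return input.substring(0, slash + 1) + "out_" + input.substring(slash + 1);
    }

    public static void main(String[] args) {
        System.out.println(outputName("path/data.txt")); // prints path/out_data.txt
        System.out.println(outputName("data.txt"));      // prints out_data.txt
    }
}
```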
Use Cases
Genomics Applications
- Gene List Comparison: Compare gene sets between different studies or conditions
- Sample Overlap Analysis: Identify shared sequences or identifiers across sample sets
- Quality Control: Verify expected overlaps between technical replicates
General Text Analysis
- Data Validation: Check consistency between different data sources
- Set Operations: Quantify intersections between text-based datasets
- Content Analysis: Compare vocabularies or term lists across documents
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org