WebCheck

Basic Usage

webcheck.sh <input files>

Webcheck processes log files containing web server monitoring data. Input files should contain pipe-delimited records with timestamp, URL, status code, and latency information.

Expected Input Format

Input is expected to look like this:

Tue Apr 26 16:40:09 2016|https://rqc.jgi-psf.org/control/|200 OK|0.61

Each line contains four pipe-separated fields: timestamp, URL, status code with message, and latency in seconds.
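
As an illustration of the record layout, the following minimal Python sketch splits one line into its four fields. It is not part of WebCheck; the names Record and parse_record are hypothetical, and taking the text before the first space as the status code is an assumption based on the sample line above.

```python
from typing import NamedTuple


class Record(NamedTuple):
    timestamp: str    # e.g. "Tue Apr 26 16:40:09 2016"
    url: str
    status: str       # full status field, e.g. "200 OK"
    code: int         # numeric part of the status field
    latency_s: float  # latency in seconds


def parse_record(line: str) -> Record:
    """Split one pipe-delimited line into its four fields (hypothetical helper)."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    timestamp, url, status, latency = fields
    # Assumption: the status code is the text before the first space.
    code = int(status.split(" ", 1)[0])
    return Record(timestamp, url, status, code, float(latency))


rec = parse_record(
    "Tue Apr 26 16:40:09 2016|https://rqc.jgi-psf.org/control/|200 OK|0.61"
)
print(rec.code, rec.latency_s)
```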

Parameters

Parameters are organized by their function in the webcheck analysis process.

Standard parameters

in=<file>: Primary input file containing webcheck log data. Can use a wildcard (*) if 'in=' is omitted. Multiple files can be processed sequentially.
out=<file>: Summary output file for statistics; optional. If not specified, results are written to stdout. Contains aggregated statistics for all processed log entries.
fail=<file>: Output file for failing lines (non-200 status codes); optional. Records all entries that indicate server errors, timeouts, or other failure conditions.
invalid=<file>: Output file for misformatted lines; optional. Captures log entries that don't match the expected four-field pipe-delimited format.
extendedstats=f: (es) Print more detailed statistics, including line counts, latency averages, and observed failure codes. Default: false.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file, which prevents accidental data loss. Default: false.

Processing Parameters

lines=unlimited: Maximum number of lines to process from input files. Set to a positive number to limit processing for testing or sampling. Default: unlimited (Long.MAX_VALUE).
ms=t: (millis) Control milliseconds display in latency output. Set to false to omit 'ms' suffix from timing statistics. Default: true (displays 'ms' suffix).
verbose=f: Enable verbose output for debugging and detailed processing information. Affects file I/O operations and stream processing. Default: false.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Log Analysis

webcheck.sh webserver.log

Process a single webcheck log file and display summary statistics to stdout.

Comprehensive Analysis with Output Files

webcheck.sh in=webserver.log out=summary.txt fail=failures.log invalid=bad_entries.log extendedstats=t

Process webcheck log with full output: summary statistics to file, failed requests captured separately, malformed entries logged, and extended statistics enabled.

Multiple File Processing

webcheck.sh *.log out=combined_stats.txt extendedstats=t

Process all log files in the current directory, generating combined statistics with extended reporting.

Limited Processing for Testing

webcheck.sh webserver.log lines=1000 extendedstats=t

Process only the first 1000 lines of a log file for quick analysis and testing.

Algorithm Details

ProcessWebcheck implements a ByteFile-based streaming parser in which the process2() method analyzes the log line by line. Pipe-delimited entries are read through ByteFile.nextLine() iteration, so the input file is never loaded into memory as a whole and very large web server monitoring files can be processed. Because per-entry latencies are accumulated in IntList structures, memory use still grows with the number of valid entries.

Data Processing Strategy

ProcessWebcheck.process2() implements a three-phase approach:

Validation Phase: Each log line is first checked for a leading character other than '#' (line[0]) and a numeric final character (Tools.isDigit(line[line.length-1])), then split with split("\\|") and required to yield exactly four pipe-separated fields.
Classification Phase: The status code of each valid entry is extracted with Integer.parseInt(split[2].substring(0, split[2].indexOf(' '))); entries with code==200 count as passing and all others as failing.
Aggregation Phase: Status message counts accumulate in a HashMap<String, long[]> named map, while the IntList instances passLatency and failLatency record latencies converted to milliseconds with (int)(latency*1000).
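
The three phases can be sketched in Python as follows. This is an illustrative approximation rather than the ProcessWebcheck code: the function names are hypothetical, and collecting every rejected line (including '#' lines) in a single invalid list is an assumption. Malformed numeric fields are not handled here.

```python
from collections import Counter


def is_candidate(line: str) -> bool:
    """Validation pre-check: non-empty, not a '#' line, ends in a digit."""
    return bool(line) and line[0] != "#" and line[-1] in "0123456789"


def process(lines):
    counts = Counter()  # status message -> occurrences
    pass_latency, fail_latency, fail_codes, invalid = [], [], [], []
    for raw in lines:
        line = raw.rstrip("\n")
        # Validation: pre-check, then require exactly four fields.
        fields = line.split("|") if is_candidate(line) else []
        if len(fields) != 4:
            invalid.append(line)  # assumption: all rejects land here
            continue
        # Classification: numeric code is the text before the first space.
        status = fields[2]
        code = int(status.split(" ", 1)[0])
        # Aggregation: seconds -> whole milliseconds (int() truncates).
        latency_ms = int(float(fields[3]) * 1000)
        counts[status] += 1
        if code == 200:
            pass_latency.append(latency_ms)
        else:
            fail_latency.append(latency_ms)
            fail_codes.append(code)
    return counts, pass_latency, fail_latency, fail_codes, invalid
```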

Performance Characteristics

ProcessWebcheck implements memory-efficient processing using ByteFile.makeByteFile() for streaming I/O:

Streaming Processing: ByteFile.nextLine() reads one line at a time without loading entire files, and the maxLines parameter caps the number of lines processed
Data Structure Optimization: passLatency.shrink() and failLatency.shrink() compact the latency lists, while failCode.sort() followed by failCode.shrinkToUnique() reduces the failure codes to a sorted list of distinct values
Lazy Statistics: Extended statistics such as Tools.averageInt(passLatency.array) and Tools.max(failLatency.array) are computed only when the extendedStats boolean is set
Concurrent I/O: The ByteStreamWriter instances (bsw, bswInvalid, bswFail) are launched with start() so that writing proceeds concurrently with parsing, and poisonAndWait() waits for them to finish
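
Two of these ideas, line-limited streaming and reducing failure codes to sorted distinct values, can be approximated in a few lines of Python. This is a sketch only; limited_lines and sorted_unique are hypothetical names, and the code stands in for, rather than reproduces, ByteFile and IntList.

```python
import io
from itertools import islice


def limited_lines(handle, max_lines=None):
    """Yield lines one at a time, stopping after max_lines if it is given."""
    # islice pulls lazily from the handle, so the file is never read whole.
    for line in islice(handle, max_lines):
        yield line.rstrip("\n")


def sorted_unique(codes):
    """Sort, then keep one copy of each value."""
    return sorted(set(codes))


log = io.StringIO("first\nsecond\nthird\n")
print(list(limited_lines(log, 2)))
print(sorted_unique([500, 404, 404, 500, 301]))
```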

Statistical Analysis

ProcessWebcheck tracks latency in whole milliseconds and classifies each entry by status code:

Latency Conversion: Float.parseFloat(split[3]) reads the latency in seconds, and (int)(latency*1000) converts it to integer milliseconds (the cast truncates any fractional millisecond) for storage in IntList arrays
Pass/Fail Segregation: Entries with code==200 go to passLatency.add(latency2); all others go to failLatency.add(latency2), and their codes are recorded with failCode.add(code)
Status Code Aggregation: The HashMap<String, long[]> map is keyed by the full status message (split[2]); a new counter is registered with map.put(split[2], cnt), and cnt[0]++ increments the occurrence count
Data Quality Metrics: The counters linesProcessed, linesValid, and bytesProcessed track processing completeness, and invalid entries are written with bswInvalid.println(line)
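
The counters and the routing of failing and invalid lines can be illustrated with the Python sketch below. The Tally class and route function are hypothetical. Counting bytes without the line terminator and validating only by field count are simplifying assumptions, and a non-numeric status field is not handled.

```python
import io


class Tally:
    """Hypothetical stand-in for the linesProcessed/linesValid/bytesProcessed counters."""

    def __init__(self):
        self.lines_processed = 0
        self.lines_valid = 0
        self.bytes_processed = 0


def route(lines, fail_out, invalid_out):
    tally = Tally()
    for line in lines:
        tally.lines_processed += 1
        # Assumption: bytes are counted without the line terminator.
        tally.bytes_processed += len(line.encode("utf-8"))
        fields = line.split("|")
        if len(fields) != 4:
            invalid_out.write(line + "\n")  # misformatted line
            continue
        tally.lines_valid += 1
        code = int(fields[2].split(" ", 1)[0])
        if code != 200:
            fail_out.write(line + "\n")  # failing (non-200) line
    return tally
```

Passing file-like objects for the two sinks keeps the sketch easy to try with io.StringIO before pointing it at real files.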

Output Format

ProcessWebcheck.process() generates tab-delimited output using StringBuilder with Shared.sort(list) for alphabetical status code ordering. Extended statistics are conditionally appended when extendedStats=true, including Lines_Processed, Invalid_Lines, Passing/Failing counts, Avg_Pass_Latency/Max_Pass_Latency calculations, and Observed_Fail_Codes enumeration from failCode IntList iteration.
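
A rough Python equivalent of the summary formatting is shown below. The function name format_summary is hypothetical, and rendering each extended statistic as a "name<TAB>value" line is an assumption; the actual layout of the extended section may differ.

```python
def format_summary(counts, extended=None):
    """Build tab-delimited summary text, one status message per line."""
    parts = []
    # Status messages are emitted in alphabetical order.
    for status in sorted(counts):
        parts.append(f"{status}\t{counts[status]}\n")
    # Extended statistics are appended only when supplied.
    if extended:
        for name, value in extended.items():
            parts.append(f"{name}\t{value}\n")  # assumed layout
    return "".join(parts)


print(format_summary({"404 Not Found": 23, "200 OK": 1247}), end="")
```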

Output Interpretation

Standard Output

Basic output shows status code counts in tab-separated format:

200 OK    1247
404 Not Found    23
500 Internal Server Error    5

Extended Statistics

When extendedstats=t, additional metrics are provided:

Lines_Processed: Total number of log entries processed
Invalid_Lines: Count of malformed entries that couldn't be parsed
Passing/Failing: Count of requests with 200 vs. non-200 status codes
Latency Statistics: Average and maximum response times for both successful and failed requests
Observed_Fail_Codes: List of all non-200 status codes encountered
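
To make the relationship between these metrics concrete, the following Python sketch derives them from per-entry data. It is illustrative only: extended_stats is a hypothetical function, the key names Avg_Fail_Latency and Max_Fail_Latency are assumed by analogy with the pass-latency names, and Invalid_Lines is assumed to equal processed lines minus valid lines.

```python
def extended_stats(lines_processed, lines_valid,
                   pass_latency_ms, fail_latency_ms, fail_codes):
    """Derive extended metrics from latency lists given in milliseconds."""

    def avg(values):
        return sum(values) / len(values) if values else 0.0

    def peak(values):
        return max(values) if values else 0

    return {
        "Lines_Processed": lines_processed,
        # Assumption: every processed line is either valid or invalid.
        "Invalid_Lines": lines_processed - lines_valid,
        "Passing": len(pass_latency_ms),
        "Failing": len(fail_latency_ms),
        "Avg_Pass_Latency": avg(pass_latency_ms),
        "Max_Pass_Latency": peak(pass_latency_ms),
        "Avg_Fail_Latency": avg(fail_latency_ms),  # assumed key name
        "Max_Fail_Latency": peak(fail_latency_ms),  # assumed key name
        "Observed_Fail_Codes": sorted(set(fail_codes)),
    }


stats = extended_stats(5, 4, [500, 700, 600], [250], [404])
print(stats["Avg_Pass_Latency"], stats["Observed_Fail_Codes"])
```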

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org