Unicode2ASCII
Replaces Unicode and control characters with printable ASCII characters. WARNING: this does not work in many cases and is not recommended. It is retained only because it is still needed in some situations.
Basic Usage
unicode2ascii.sh in=<file> out=<file>
WARNING: This tool has limited effectiveness and is not recommended for general use. It is retained only for specific situations where it may be needed.
Parameters
This tool accepts basic input/output parameters and file handling options.
Input/Output Parameters
- in=<file>: Input file containing text with Unicode or control characters to be converted. Required parameter.
- out=<file>: Output file for the converted ASCII text. If not specified, output goes to stdout.
- overwrite=t: Allow overwriting of existing output files. Default: true
- append=f: Append to existing output file instead of overwriting. Default: false
- verbose=f: Print verbose processing information. Default: false
Examples
Basic Conversion
# Convert a file with Unicode characters to ASCII
unicode2ascii.sh in=input_with_unicode.txt out=ascii_output.txt
Reads the input file and attempts to convert Unicode and control characters to printable ASCII equivalents.
Using Standard Input/Output
# Process data from stdin and output to stdout
cat unicode_file.txt | unicode2ascii.sh in=stdin out=stdout
Reads text from standard input and writes the converted text to standard output.
Append Mode
# Append converted text to existing file
unicode2ascii.sh in=more_unicode.txt out=existing_ascii.txt append=t
Appends the converted text to an existing output file instead of overwriting it.
Algorithm Details
Conversion Process
The unicode2ascii tool implements a multi-stage character encoding conversion process:
Character Encoding Detection and Conversion
- Primary UTF-8 Conversion: First attempts to interpret input as UTF-8 encoded text using Java's standard UTF-8 decoder
- Fallback UTF-16 Conversion: If UTF-8 conversion fails, attempts UTF-16 decoding as an alternative approach
- Header Cleaning: Applies BBTools' fixHeader() function with parameters (false, true) to clean Unicode and control characters
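The decode-then-clean sequence above can be sketched in Python. This is an illustration under stated assumptions, not the tool's Java source: the function names, the Latin-1 last resort, and the "_" placeholder are hypothetical, and the actual replacement rules of fixHeader() are not reproduced here.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Try UTF-8 first, then UTF-16, mirroring the fallback order described above."""
    for encoding in ("utf-8", "utf-16"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Assumption: as a last resort, map each byte to one character (never fails).
    return raw.decode("latin-1")


def to_printable_ascii(text: str, placeholder: str = "_") -> str:
    """Hypothetical cleaning rule: keep 0x20-0x7E, replace everything else."""
    return "".join(ch if " " <= ch <= "~" else placeholder for ch in text)


def convert_line(raw: bytes) -> str:
    """Decode one line of bytes, then reduce it to printable ASCII."""
    return to_printable_ascii(decode_with_fallback(raw))
```

For example, convert_line("caf\u00e9\t".encode("utf-8")) yields "caf__" under this placeholder rule: both the accented letter and the tab are replaced.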
Text Processing Strategy
The tool processes text line-by-line to minimize memory usage:
- Reads input file using TextFile class with line-buffered I/O
- Each line undergoes character encoding conversion and cleaning
- Cleaned lines are written immediately using TextStreamWriter
- Supports both compressed and uncompressed input/output formats
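The read-clean-write loop described above might look like the following Python sketch. It is not the TextFile or TextStreamWriter implementation; replace_non_ascii is a hypothetical stand-in for the cleaning step, and all names here are assumptions made for illustration.

```python
import io
from typing import Callable, Iterable, Iterator, TextIO


def replace_non_ascii(line: str) -> str:
    """Hypothetical stand-in for the cleaning step: non-ASCII becomes '?'."""
    return line.encode("ascii", "replace").decode("ascii")


def clean_lines(lines: Iterable[str], clean: Callable[[str], str]) -> Iterator[str]:
    """Yield cleaned lines one at a time so only one line is held in memory."""
    for line in lines:
        yield clean(line.rstrip("\r\n"))


def convert_stream(src: TextIO, dst: TextIO,
                   clean: Callable[[str], str] = replace_non_ascii) -> int:
    """Write each cleaned line as soon as it is produced; return the line count."""
    count = 0
    for cleaned in clean_lines(src, clean):
        dst.write(cleaned + "\n")
        count += 1
    return count


# Usage with in-memory streams; file objects returned by open() work the same way.
source = io.StringIO("na\u00efve\nplain\n")
sink = io.StringIO()
converted = convert_stream(source, sink)
```

Because the generator hands over one line at a time, peak memory in this sketch depends on the longest line rather than on the file size.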
Limitations and Warnings
Important: As the warning in the tool's usage text notes, this conversion approach has significant limitations:
- Character mapping is not comprehensive: many Unicode characters have no meaningful ASCII equivalent
- Loss of information is inevitable when converting from Unicode's extensive character set to ASCII's 128 code points
- Some control characters may not be handled correctly in all contexts
- The tool is retained primarily for legacy compatibility rather than robust Unicode handling
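The information loss listed above is easy to demonstrate. The Python sketch below uses a hypothetical "?" placeholder rule (an assumption, not the tool's actual mapping) to show two different inputs collapsing to the same output, plus a helper for checking in advance which characters would be lost.

```python
from typing import List


def lost_characters(text: str) -> List[str]:
    """Return the characters that fall outside printable ASCII (0x20-0x7E)."""
    return [ch for ch in text if not (" " <= ch <= "~")]


def to_ascii(text: str, placeholder: str = "?") -> str:
    """Hypothetical lossy conversion: replace each non-printable-ASCII character."""
    return "".join(ch if " " <= ch <= "~" else placeholder for ch in text)


first = to_ascii("\u00c5ngstr\u00f6m")    # input spelled with A-ring and o-umlaut
second = to_ascii("\u00d6ngstr\u00fcm")   # a different input: O-umlaut and u-umlaut
collision = first == second               # both collapse to the same ASCII string
```

Once two inputs collide like this, the original text cannot be recovered from the output, so running a check such as lost_characters() before converting may be worthwhile when the data matters.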
Memory Usage
The tool is designed for low memory usage:
- Default memory allocation: 200MB (-Xmx200m)
- Line-by-line processing prevents memory accumulation for large files
- TextStreamWriter uses efficient buffering for output operations
Technical Notes
File Format Support
- Supports any text-based file format
- Automatically handles compressed files (gzip, bzip2) through BBTools I/O system
- Output maintains original line structure and formatting where possible
Performance Characteristics
- Processing speed depends primarily on I/O rather than computation
- Memory usage stays roughly constant regardless of input file size, because only the current line is held in memory
- Suitable for processing large text files within memory constraints
Alternative Recommendations
For robust Unicode handling, consider these alternatives:
- System tools: iconv, uconv, or recode for comprehensive character set conversion
- Programming libraries: ICU libraries for advanced Unicode normalization
- Text editors: Modern editors with Unicode-aware find/replace capabilities
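As one illustration of the library route, Python's standard library offers two approaches that are more predictable than blanket replacement. This sketch is an example of the general technique, not a feature of unicode2ascii.sh, and the function names are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD), then drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def escape_non_ascii(text: str) -> str:
    """Keep the information by writing non-ASCII characters as escape codes."""
    return text.encode("ascii", "backslashreplace").decode("ascii")


dessert = strip_accents("Cr\u00e8me br\u00fbl\u00e9e")   # accented Latin letters
escaped = escape_non_ascii("na\u00efve \u65e5")           # Latin plus one CJK character
```

Normalization keeps accented Latin text readable but silently drops scripts with no ASCII decomposition, while escaping preserves every code point at the cost of readability.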
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org