Character set and unicode support
Support for character sets and Unicode in Common Lisp (CL) involves four areas of attention:
- Internal character storage
- External format support
- Impact on CLHS defined functions
- Additional functions
These items will be discussed below.
Internal character storage
ABCL inherits its character storage from Java: a single 16-bit value for each character in the Basic Multilingual Plane (BMP), and a pair of 16-bit values (a surrogate pair, 32 bits in total) for each supplementary character. See the article on Supplementary characters in the Java platform for in-depth coverage of the subject, including considerations for supporting programs.
ISSUE: Currently, ABCL assumes every 16-bit value to be a full character, thereby effectively ignoring supplementary characters.
ISSUE: The values of the BMP range from 0x0 to 0xFFFF; within that range, 0xD800 to 0xDFFF is reserved for surrogates and does not encode characters. There is no check on the validity of character values when creating them.
Even though ABCL can represent all Unicode characters, supporting them (that is, providing operations such as case conversion and case-insensitive string matching) is a different matter.
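The storage model described above can be observed directly on the Java side. The following is a minimal sketch using only standard `java.lang.Character` and `java.lang.String` methods. The class `CodePointSketch` and its method names are hypothetical, chosen for illustration; they are not part of ABCL.

```java
// Hypothetical helper class for illustration; not part of ABCL.
final class CodePointSketch {

    // Number of 16-bit units in the string. A model that treats every
    // 16-bit value as a full character would report this as the length.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if the value can denote a character: within the Unicode
    // range and outside the surrogate range 0xD800..0xDFFF.
    static boolean isScalarValue(int value) {
        return Character.isValidCodePoint(value)
            && !(value >= Character.MIN_SURROGATE
                 && value <= Character.MAX_SURROGATE);
    }

    public static void main(String[] args) {
        // U+1D50A lies outside the BMP, so Java stores it as a surrogate pair.
        String supplementary = new String(Character.toChars(0x1D50A));
        System.out.println(codeUnits(supplementary));   // 2
        System.out.println(codePoints(supplementary));  // 1
        System.out.println(isScalarValue(0x41));        // true
        System.out.println(isScalarValue(0xD800));      // false
    }
}
```

A validity check along the lines of `isScalarValue` is one possible way to reject surrogate values at character creation time; whether ABCL should do so is a design decision this sketch does not settle.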
Conflicts between the CLHS and Unicode
The CLHS specifies that characters may have 'case'. When characters have case, they are required to exist in pairs: an uppercase and a lowercase variant. Unicode does not satisfy this requirement. As an example, the characters LATIN SMALL LETTER I (U+0069) and LATIN SMALL LETTER DOTLESS I (U+0131) both uppercase to LATIN CAPITAL LETTER I (U+0049).
Other examples are LATIN SMALL LETTER SHARP S (the German eszett), which uppercases to "SS" (a two-character string), and GREEK CAPITAL LETTER SIGMA, which lowercases to a different character depending on whether it is the last character of the word being converted.
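The conflicts described above can be reproduced with the case-mapping methods of the Java standard library. This is a minimal sketch; the class `CaseMappingSketch` and its method names are hypothetical and not part of ABCL. `Locale.ROOT` is used so that the results do not depend on the default locale.

```java
import java.util.Locale;

// Hypothetical helper class for illustration; not part of ABCL.
final class CaseMappingSketch {

    // Per-character mapping: always returns exactly one char.
    static char upperChar(char c) {
        return Character.toUpperCase(c);
    }

    // String-level mappings: may change the length of the string and
    // may depend on the position of a character within a word.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Two distinct lowercase letters share one uppercase letter.
        System.out.println(upperChar('i') == upperChar('\u0131'));    // true

        // Sharp s has no single-character uppercase form...
        System.out.println(upperChar('\u00DF') == '\u00DF');          // true
        // ...but the string-level mapping expands it to two characters.
        System.out.println(upper("\u00DF"));                          // SS

        // Capital sigma lowercases to U+03C3 inside a word and to
        // U+03C2 (final sigma) at the end of a word.
        String lowered = lower("\u03A3\u0391\u03A3");
        System.out.println(lowered.equals("\u03C3\u03B1\u03C2"));     // true
    }
}
```

The per-character functions defined by the CLHS, such as `char-upcase`, can only return a single character, so they correspond to `upperChar` here; the length-changing and context-dependent mappings are only expressible at the string level.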