Character set and Unicode support
Support for character sets and Unicode in CL generally involves a few areas of attention:
- Internal character storage
- External format support
- Impact on CLHS defined functions
- Additional functions
These items will be discussed below.
Internal character storage
ABCL inherits its character storage from Java: 16-bit values for characters in the Basic Multilingual Plane (BMP), and a pair of 16-bit values (a surrogate pair, 32 bits in total) for supplementary characters. See the article on Supplementary characters in the Java platform for in-depth coverage of the subject, including considerations for supporting programs.
ISSUE: currently, ABCL assumes every 16-bit value to be a full character, thereby effectively ignoring supplementary characters (this relates to ticket #65)
ISSUE: The values of the BMP range from 0x0 to 0xFFFF, excluding the surrogate range 0xD800 to 0xDFFF; there's no check on the validity of character values when creating them (this is now ticket #92)
Even though ABCL can represent all Unicode characters, supporting them (that is, providing operations such as case conversion and case-insensitive string matching) is a different matter.
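To make the storage model concrete, here is a minimal Java sketch of how the platform stores code points and what a validity check might look like. The class and method names (CodePointSketch, isValidCharacterValue, codeUnits) are hypothetical and chosen for illustration only; this is not ABCL's implementation, merely standard java.lang.Character calls.

```java
final class CodePointSketch {
    /**
     * Hypothetical validity check: the value must lie in the Unicode
     * range 0x0..0x10FFFF and must not be a surrogate (0xD800..0xDFFF).
     */
    static boolean isValidCharacterValue(int value) {
        return Character.isValidCodePoint(value)
                && !(value >= 0xD800 && value <= 0xDFFF);
    }

    /** Number of 16-bit code units Java needs to store the code point. */
    static int codeUnits(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println("16-bit units: " + clef.length());
        System.out.println("code points:  " + clef.codePointCount(0, clef.length()));
        System.out.println("0xD800 valid? " + isValidCharacterValue(0xD800));
    }
}
```

Code that treats every 16-bit unit as a full character sees two "characters" in the string above, which is the situation described in the first ISSUE.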
Conflicts between the CLHS and the Unicode standard
The CLHS specifies that characters may have 'case'. When characters have case they are required to exist in pairs: an uppercase and a lowercase variant. E.g., the lowercase character #\a is uniquely associated with the uppercase character #\A. Converting #\a to uppercase always returns #\A and, conversely, converting #\A to lowercase always returns #\a. Unicode does not satisfy this requirement. As an example, the characters LATIN SMALL LETTER I and LATIN SMALL LETTER DOTLESS I both map to LATIN CAPITAL LETTER I.
Other examples are the ESZET character (LATIN SMALL LETTER SHARP S), which uppercases to "SS" (a two-character string), and GREEK CAPITAL LETTER SIGMA, which lowercases to different characters depending on whether it is the last character of the word being converted.
The ESZET violates the CLHS requirement that case conversion takes exactly one character as input and produces exactly one character on output: it produces 2 characters. The dotless i violates the CLHS requirement that characters are associated in pairs; after all, there are 3 characters in the conversion set of the dotless i.
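The conflicts described above can be observed directly on the Java platform that ABCL runs on. The following is a small sketch, not ABCL code; the class name CaseConflictSketch and its helper methods are hypothetical, and Locale.ROOT is an assumption made to keep the results independent of the default locale.

```java
import java.util.Locale;

final class CaseConflictSketch {
    static final char SMALL_I = '\u0069';      // LATIN SMALL LETTER I
    static final char DOTLESS_I = '\u0131';    // LATIN SMALL LETTER DOTLESS I
    static final char ESZET = '\u00DF';        // LATIN SMALL LETTER SHARP S
    static final char CAPITAL_SIGMA = '\u03A3'; // GREEK CAPITAL LETTER SIGMA

    /** One character in, one character out (the CLHS model). */
    static char upcaseChar(char c) {
        return Character.toUpperCase(c);
    }

    /** Full string mapping, which may change the length of the string. */
    static String upcaseString(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    /** Full string mapping, which may depend on a character's position. */
    static String downcaseString(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Two distinct lowercase characters share one uppercase character.
        System.out.println(upcaseChar(SMALL_I) == upcaseChar(DOTLESS_I));
        // The character-level mapping cannot express the two-character result.
        System.out.println(upcaseChar(ESZET) == ESZET);
        System.out.println(upcaseString(String.valueOf(ESZET)));
        // Word-final sigma is lowercased differently from non-final sigma.
        System.out.println(downcaseString("\u039F\u0394\u039F\u03A3"));
        System.out.println(Character.toLowerCase(CAPITAL_SIGMA));
    }
}
```

The character-level functions stay within the one-in, one-out model, while the string-level functions follow the fuller Unicode rules; the two can therefore disagree on the same input.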
Case related functions
CLHS defines the following case conversion functions:
- CHAR-DOWNCASE
- CHAR-UPCASE
- (N)STRING-DOWNCASE
- (N)STRING-UPCASE
- (N)STRING-CAPITALIZE
The up- and downcasing string functions are defined as the repeated application of the character case conversion functions. Given the definition of characters with case, these function definitions make sense.
Besides the case conversion functions, the spec defines case-insensitive character and string comparisons:
- CHAR(-NOT)-EQUAL
- CHAR(-NOT)-LESSP
- CHAR(-NOT)-GREATERP
- STRING(-NOT)-EQUAL
- STRING(-NOT)-LESSP
- STRING(-NOT)-GREATERP
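Case-insensitive comparison depends on the pairing of characters as well. As a hedged illustration (the class and method names below are hypothetical, and nothing here claims to be how ABCL implements CHAR-EQUAL), a comparison built on upcasing and one built on downcasing agree under the CLHS pairing model but can disagree once Unicode mappings are used:

```java
final class CharEqualSketch {
    /** Case-insensitive equality defined by upcasing both characters. */
    static boolean equalViaUpcase(char a, char b) {
        return Character.toUpperCase(a) == Character.toUpperCase(b);
    }

    /** Case-insensitive equality defined by downcasing both characters. */
    static boolean equalViaDowncase(char a, char b) {
        return Character.toLowerCase(a) == Character.toLowerCase(b);
    }

    public static void main(String[] args) {
        char smallI = '\u0069';   // LATIN SMALL LETTER I
        char dotlessI = '\u0131'; // LATIN SMALL LETTER DOTLESS I

        // A properly paired character: both definitions agree.
        System.out.println(equalViaUpcase('a', 'A') == equalViaDowncase('a', 'A'));

        // i and dotless i share an uppercase form but not a lowercase form,
        // so the two definitions give different answers.
        System.out.println(equalViaUpcase(smallI, dotlessI));
        System.out.println(equalViaDowncase(smallI, dotlessI));
    }
}
```

Which definition an implementation picks therefore becomes observable as soon as characters outside the strict pairing model are involved.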
Case conversion in the reader
The reader algorithm as defined in the spec amounts to the same behaviour as the respective string case conversion routine. This conversion rule applies to 'constituent characters' which are part of a symbol name.
Consequences of the CLHS definition
- Because characters with case are defined in pairs, case-changing operations never change the length of a string; destructive operations can hence be run on ordinary strings as well as simple-strings, the latter of which are defined to be non-adjustable
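The length-preservation point can be sketched in Java as follows. The class InPlaceCaseSketch and its method upcaseInPlace are hypothetical names used only for this illustration; the sketch assumes a destructive operation in the spirit of NSTRING-UPCASE that applies the character-level conversion to each element of a fixed-size buffer.

```java
import java.util.Locale;

final class InPlaceCaseSketch {
    /**
     * Destructively upcases a fixed-size buffer by applying the
     * character-level conversion to each element. The length cannot
     * change, so this works on non-adjustable storage.
     */
    static void upcaseInPlace(char[] chars) {
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toUpperCase(chars[i]);
        }
    }

    public static void main(String[] args) {
        String word = "stra\u00DFe"; // contains LATIN SMALL LETTER SHARP S

        char[] buffer = word.toCharArray();
        upcaseInPlace(buffer);
        System.out.println(new String(buffer) + " / length " + buffer.length);

        // The full Unicode string mapping produces a longer result,
        // which could not be stored back into the original buffer.
        String full = word.toUpperCase(Locale.ROOT);
        System.out.println(full + " / length " + full.length());
    }
}
```

The per-character loop keeps the buffer at its original length but leaves the sharp s unconverted, whereas the full string mapping converts it at the cost of growing the string.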