Opened 15 years ago
Last modified 23 months ago
#65 new defect
UTF-32 strings support
Reported by: | ehuelsmann | Owned by: | nobody |
---|---|---|---|
Priority: | minor | Milestone: | 1.9.2 |
Component: | libraries | Version: | 1.1.0 |
Keywords: | Cc: | ||
Parent Tickets: |
Description
ABCL uses Java char[]s to represent its strings. However, the char type can only represent values in the BMP (Basic Multilingual Plane), because only the BMP can be represented using 16 bits.
For supplementary characters (all Unicode characters outside the BMP), Java uses a pair of surrogate chars, i.e. the UTF-16 encoding.
Common Lisp programs don't expect this and need strings to be represented using complete characters.
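To make the mismatch concrete, here is a minimal Java sketch of how a single supplementary character occupies two char slots. The class name SurrogateDemo and its helper methods are made up for illustration and are not part of ABCL; only standard java.lang calls are used.

```java
final class SurrogateDemo {
    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points; a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(codeUnits(clef));   // prints 2
        System.out.println(codePoints(clef));  // prints 1
        // The two chars are a high and a low surrogate, not characters.
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // prints true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // prints true
    }
}
```

A Lisp string backed directly by such a char[] would presumably report a length of 2 for this one-character string, which is the behaviour this ticket is about.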
Change History (21)
comment:1 Changed 15 years ago by
Component: | other → libraries |
---|---|
Owner: | changed from somebody to nobody |
comment:2 Changed 15 years ago by
comment:4 Changed 13 years ago by
I think it is possible to use FLEXI-STREAMS to handle UTF-32 strings.
comment:5 Changed 12 years ago by
Milestone: | unscheduled → 1.2.0 |
---|---|
Version: | → 1.1.0 |
comment:6 Changed 12 years ago by
On #lisp, pjb writes on this subject:
... you must be careful that in most CL implementations, characters are unicode characters (not even code-points in a number of implementations!), and therefore we are talking of real strings of characters (32-bit each usually), not vector of utf-8 bytes. (For some things, you may need to deal with vectors of bytes instead of strings, and there, lisp macros and reader macros can come handy to ease manipulations of those vectors of bytes that usually represent ASCII or UTF-8 encoded characters).
Where I ask:
pjb: how's that possible? Some far-east "characters" will consist of multiple code points, with up to 6 or 7 "modifier" code points; how can all that fit into 32-bits, if each code point is 21-bit in itself?
and pjb answers:
ehu: that's what I mean, some implementation may choose to represent those characters as a pointer to a sequence of code points.
comment:7 Changed 12 years ago by
It would be nice if every Common Lisp implementation used the same representation of strings, but what format should that be? Currently UTF-32 is quite common, but there are exceptions: both Allegro CL and CMUCL use UTF-16.
There are many good reasons for using UTF-16:
- Compatibility (Java, Windows API, libicu)
- Saves memory (approx 50%, use of characters outside BMP is very rare)
- The added complexity is actually quite low
The last point is the important one. Even when using UTF-32, what the end user thinks of as a character might be represented as a sequence of code points in the string. In Unicode this is called a grapheme cluster. Because of this, UTF-16, with its surrogate pairs, doesn't add much complexity. Code that doesn't deal correctly with surrogate pairs, e.g. by splitting a string in the middle of a pair, would probably not deal correctly with grapheme clusters either.
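As a rough illustration of the splitting problem, the following Java sketch contrasts a prefix taken by char index with one taken by code point count. The class SafeSplit and its method names are hypothetical helpers written for this example; they rely only on String.offsetByCodePoints and related standard methods.

```java
final class SafeSplit {
    /** Naive prefix: counts UTF-16 code units, so it may cut a surrogate pair in half. */
    static String naivePrefix(String s, int n) {
        return s.substring(0, Math.max(0, Math.min(n, s.length())));
    }

    /** Prefix of the first n code points; never splits a surrogate pair. */
    static String codePointPrefix(String s, int n) {
        int total = s.codePointCount(0, s.length());
        int count = Math.max(0, Math.min(n, total));
        return s.substring(0, s.offsetByCodePoints(0, count));
    }

    /** True if the string ends with a high surrogate that lost its partner. */
    static boolean endsWithDanglingSurrogate(String s) {
        return !s.isEmpty() && Character.isHighSurrogate(s.charAt(s.length() - 1));
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1D11E)) + "b";
        System.out.println(endsWithDanglingSurrogate(naivePrefix(s, 2)));     // prints true
        System.out.println(endsWithDanglingSurrogate(codePointPrefix(s, 2))); // prints false
        // Code points are still not grapheme clusters: "e" + COMBINING ACUTE ACCENT
        // is one user-perceived character, but the prefix drops the accent.
        System.out.println(codePointPrefix("e\u0301", 1).length());           // prints 1
    }
}
```

The last line shows the point made above: working in code points, as a UTF-32 string would, avoids broken surrogate pairs but still separates a base letter from its combining mark.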
To sum up: No, Common Lisp programs can't expect strings to be UTF-32. There are many good reasons for using UTF-16. Since Java uses UTF-16 strings, it makes perfect sense that ABCL does so too.
comment:8 Changed 12 years ago by
Milestone: | 1.2.0 → 2.0 |
---|
comment:11 Changed 11 years ago by
Milestone: | 2.0.0 → 1.4.0 |
---|
comment:16 Changed 5 years ago by
Milestone: | 1.6.2 → 1.7.0 |
---|
comment:21 Changed 23 months ago by
Milestone: | 1.8.1 → 1.9.2 |
---|
Relevant in this discussion is the article about supplementary characters (code points > #xFFFF) in Java.
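For reference, a small sketch of how the standard Java API converts between a UTF-16 String and an array with one int per code point, which is in effect a UTF-32 view of the string. The class Utf32RoundTrip is a name invented for this example and says nothing about how ABCL would or should implement its strings.

```java
final class Utf32RoundTrip {
    /** Decode a UTF-16 Java String into one int per code point. */
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    /** Encode code points back into a String, re-creating surrogate pairs as needed. */
    static String fromCodePoints(int[] cps) {
        return new String(cps, 0, cps.length);
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1D11E));
        int[] cps = toCodePoints(s);
        System.out.println(s.length());                                 // prints 3
        System.out.println(cps.length);                                 // prints 2
        System.out.println(Character.isSupplementaryCodePoint(cps[1])); // prints true
        System.out.println(fromCodePoints(cps).equals(s));              // prints true
    }
}
```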