As I understand it, Java stores strings in UTF-16, which uses either 16 bits (for code points in the BMP) or 32 bits per code point. But I am not sure whether the Character class can hold a code point that needs 32 bits. Reading http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html didn't help. So can it?
No, char and Character can't represent a code point outside the BMP. There's no specific type for this, but all the Java APIs just use int to refer to code points specifically, as opposed to UTF-16 code units.
If you look at all the codePoint* methods in java.lang.Character, such as codePointAt(char[], int, int), you'll see they use int.
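As a rough illustration of the int-based API, here is a minimal sketch. The class name CodePointSketch and its two helper methods are made up for this example; the calls they wrap (String.codePointAt, Character.toChars, Character.charCount, Character.isSupplementaryCodePoint, Character.isHighSurrogate) are the standard ones. U+1F600 is used only as a convenient code point outside the BMP.

```java
// Sketch only: code points travel as int because a char is a single 16-bit code unit.
final class CodePointSketch {

    // Hypothetical helper: the code point that starts at the given char index.
    static int codePointAtIndex(String s, int index) {
        return s.codePointAt(index);
    }

    // Hypothetical helper: a String holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        // toChars returns one char for BMP code points, a surrogate pair otherwise.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // outside the BMP, so it cannot fit in a char
        String s = fromCodePoint(codePoint);

        System.out.println(Character.isSupplementaryCodePoint(codePoint)); // true
        System.out.println(Character.charCount(codePoint));                // 2
        System.out.println(s.length());                                    // 2
        System.out.println(Character.isHighSurrogate(s.charAt(0)));        // true
        System.out.println(Integer.toHexString(codePointAtIndex(s, 0)));   // 1f600
    }
}
```

Note that s.charAt(0) here yields only the high surrogate, which is not a meaningful character on its own; reading the whole code point requires one of the int-returning methods.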
In my experience, very little code (including my own) correctly takes account of this, instead assuming that it's reasonable to talk about the length of a string as being the number of UTF-16 code units in it. Having said that, "length" is a pretty hard-to-pin-down concept for strings, in that it doesn't mean the number of displayed glyphs, and different normalization forms of logically-equivalent text can consist of different numbers of code points...
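To make the different notions of "length" concrete, here is a small sketch. The class LengthSketch and its helpers are hypothetical names for this example; String.length, String.codePointCount and java.text.Normalizer are standard. It assumes é (U+00E9) as the sample text, whose NFD form is e followed by the combining acute accent U+0301.

```java
import java.text.Normalizer;

// Sketch only: three ways the "length" of the same logical text can differ.
final class LengthSketch {

    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Canonical decomposition (NFD) of the input.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String supplementary = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
        System.out.println(codeUnits(supplementary));  // 2
        System.out.println(codePoints(supplementary)); // 1

        String composed = "\u00E9";        // é as a single code point
        String decomposed = nfd(composed); // e + U+0301
        System.out.println(codePoints(composed));      // 1
        System.out.println(codePoints(decomposed));    // 2
    }
}
```

Both forms of é display as a single glyph, so neither the code unit count nor the code point count matches what a reader would call one character.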
See more on this question at Stack Overflow