Why does UTF-16 have a reserved range in the UCS database?
UTF-16 is just a way to represent a character's scalar value using one or two unsigned 16-bit code units. The layout of those code units shouldn't need to be tied to the scalar value itself, because we could apply some algorithm to recover the actual scalar value from such a representation.
Let's assume that the ranges D800-DBFF and DC00-DFFF were not reserved in the UCS database, and that there were an alternative form of UTF-16 that could represent every character in the range 0-7FFF in a single unsigned 16-bit unit; whenever the high-order bit is set, a second 16-bit unit follows carrying the remaining bits. For the byte order mark we would reserve the two possible values, and that's it.
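To make the idea concrete, here is a rough sketch of what I mean in Python (the exact bit split and the function name are just one possible choice):

    def encode_proposed(scalar):
        if scalar <= 0x7FFF:
            return [scalar]                  # single unit, high bit clear
        # high bit flags a continuation; the remaining bits spill into a
        # second 16-bit unit (15 + 16 bits of payload in this reading)
        return [0x8000 | (scalar >> 16), scalar & 0xFFFF]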
If I'm wrong, could you explain it to me?
Thanks
Your proposed scheme is less efficient than the current surrogate pair scheme, which is one problem.
Currently, only 0xD800-0xDFFF (2048 code units) are "out of bounds" as normal characters, leaving 63488 code units mapping to single characters. Under your proposal, 0x8000-0xFFFF (32768) code units are reserved for multi-code-unit code points, leaving only the other 32768 code units for single-code-unit code points.
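For reference, this is the standard surrogate-pair algorithm that comparison is based on; a quick sketch:

    def encode_utf16(cp):
        if cp <= 0xFFFF:
            assert not 0xD800 <= cp <= 0xDFFF   # the reserved range
            return [cp]                         # one code unit
        v = cp - 0x10000                        # 20 bits remain
        return [0xD800 | (v >> 10),             # high surrogate: top 10 bits
                0xDC00 | (v & 0x3FF)]           # low surrogate: bottom 10 bits

So only 0x10000 - 0x0800 = 63488 values lose their single-unit encoding slot under UTF-16, versus 32768 under the proposal.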
I don't know how many code points are currently specified in the basic multilingual plane, but I wouldn't be surprised if it were more than 32768, and of course it can grow. As soon as it's more than 32768, there would be more characters which require two code units to be represented under your proposal than in UTF-16 as it stands.
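You can get a rough count with Python's unicodedata module (this only counts code points that have a name in whatever Unicode version your Python ships with, so treat it as a ballpark figure):

    import unicodedata

    # Count named code points in the upper half of the BMP (0x8000-0xFFFF),
    # i.e. every value the proposed scheme could no longer encode in one unit.
    assigned = sum(1 for cp in range(0x8000, 0x10000)
                   if unicodedata.name(chr(cp), None) is not None)
    print(assigned)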
Now I agree that none of this requires UCS to include a reserved range (and it's an ugly mix of meanings, in some ways) - but doing so makes it simple (in code) to map UTF-16 to UCS, while still maintaining a pretty efficient solution.
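To show what "simple (in code)" means here: because the reserved range lets a decoder classify any code unit in isolation, the whole mapping fits in a few lines. A sketch, with error handling for unpaired surrogates omitted:

    def decode_utf16(units):
        out, i = [], 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:        # high surrogate: a pair follows
                low = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
            else:                            # everything else is a whole scalar
                out.append(u)
                i += 1
        return out

Without a reserved range, a unit like 0x8123 would be ambiguous on its own; with one, the very value of each unit tells you its role.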
There are very few downsides to this - there's plenty of space in the UCS, so reserving this small block doesn't mean we're going to have significantly less room for future expansion.
Supposition
This bit is an informed guess. You could do the research to find out which characters were used in which versions of Unicode, but I believe it's at least a plausible explanation.
The true reason for this particular block being used is probably historical - for a long time Unicode really was just 16-bit, for everything... and characters were already assigned in the upper ranges (the parts your scheme deems off-limits). By taking a block of 2048 values which weren't previously assigned, all previous valid UCS-2 sequences were preserved as valid UTF-16 sequences with the same meaning, while extending the UCS range beyond the BMP. It's possible that some aspects might be easier if the range had been 0xF800-0xFFFF, but it was too late by then.