My name
is
Jon Skeet

C# .NET Garbled character when writing 56623 with StreamWriter

I have an issue with writing the character 56623 to a stream using a StreamWriter in UTF-16 (the issue persists in other encodings as well). If I get the buffer from the stream, it contains the value 65533 instead of what I originally wrote. This issue snuck up on me when doing randomised unit tests, and it does not appear for the values 60000 or 95.

To illustrate, I have a minimal program to check the behaviour:

   char value = (char)56623;
   MemoryStream stream = new MemoryStream();
   StreamWriter writer = new StreamWriter(stream, Encoding.Unicode);
   writer.Write(value);
   writer.Close();

   var byteArray = BitConverter.GetBytes(value); // Reference bytes
   var buffer = stream.GetBuffer();

By reading byteArray and buffer I get:

   byteArray = [47,221] = 00101111 11011101 = 56623
   buffer = [255,254,253,255,...] = BOM 11111101 11111111 ... = BOM 65533

Thus, the written value 65533 is clearly not equal to the original 56623. However, when trying with the value 60000 the correct values are written:

   byteArray = [96,234] = 01100000 11101010 = 60000
   buffer = [255,254,96,234,...] = BOM 01100000 11101010 ... = BOM 60000

I fail to understand why this is the behaviour, but I am unwilling to believe that there is an issue with the implementation of StreamWriter, so there has to be something I am missing.

What is it that I am not seeing here?

Thank you!

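The problem is that 56623 is U+DD2F - which is a low surrogate UTF-16 code unit (the low surrogate range is U+DC00 to U+DFFF). It's invalid on its own - it's only valid as the second half of a surrogate pair, immediately after a high surrogate, which is how UTF-16 encodes code points which aren't in the Basic Multilingual Plane. When the encoder meets a lone surrogate it substitutes the replacement character U+FFFD, which is the 65533 you're seeing.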
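To see this without a StreamWriter in the way, here is a minimal sketch that classifies the character and encodes it directly with Encoding.Unicode. The class name SurrogateCheck and the helper EncodedCodeUnit are made up for illustration; only the char and Encoding calls come from the framework.

```csharp
using System;
using System.Text;

public static class SurrogateCheck
{
    // Hypothetical helper: encodes one UTF-16 code unit with Encoding.Unicode
    // (little-endian, GetBytes emits no BOM) and returns the code unit that
    // actually ends up in the encoded bytes.
    public static int EncodedCodeUnit(char value)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(new[] { value });
        return bytes[0] | (bytes[1] << 8);
    }

    public static void Main()
    {
        char value = (char)56623; // 0xDD2F

        Console.WriteLine(char.IsSurrogate(value));      // True
        Console.WriteLine(char.IsLowSurrogate(value));   // True
        Console.WriteLine(char.IsHighSurrogate(value));  // False

        // A lone surrogate is replaced with U+FFFD (65533) by the default fallback.
        Console.WriteLine(EncodedCodeUnit(value));       // 65533

        // 60000 (0xEA60) is not a surrogate, so it is encoded unchanged.
        Console.WriteLine(EncodedCodeUnit((char)60000)); // 60000
    }
}
```

The same substitution happens inside StreamWriter, because it uses the encoding you pass in to turn chars into bytes.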

It should be fine if you write it as part of a valid surrogate pair (i.e. a high surrogate in U+D800 to U+DBFF followed by a low surrogate in U+DC00 to U+DFFF) - but if you're trying to write it on its own, that suggests you've got invalid data to start with. You shouldn't be taking random UTF-16 code units and expecting them to be valid Unicode code points. You may be okay if you explicitly exclude U+D800 to U+DFFF inclusive, but even then you've got odd characters like a BOM which shouldn't occur within normal text.
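As a rough sketch of both suggestions: the code below writes a valid surrogate pair through a StreamWriter and checks that it round-trips, and it includes one possible way to generate random test characters that skip the surrogate range. The names SurrogatePairDemo, WriteUtf16, NextNonSurrogateChar and IsValidUtf16 are hypothetical helpers, not part of any library. IsValidUtf16 assumes you would rather detect bad input up front, using a UnicodeEncoding constructed to throw on invalid data instead of substituting U+FFFD.

```csharp
using System;
using System.IO;
using System.Text;

public static class SurrogatePairDemo
{
    // Writes text through a StreamWriter in UTF-16 LE and returns the bytes (BOM included).
    public static byte[] WriteUtf16(string text)
    {
        var stream = new MemoryStream();
        using (var writer = new StreamWriter(stream, Encoding.Unicode))
        {
            writer.Write(text);
        }
        return stream.ToArray(); // ToArray still works after the stream is closed
    }

    // Hypothetical test helper: a random UTF-16 code unit outside U+D800..U+DFFF.
    public static char NextNonSurrogateChar(Random random)
    {
        char c;
        do
        {
            c = (char)random.Next(0, 0x10000);
        } while (char.IsSurrogate(c));
        return c;
    }

    // Hypothetical helper: true if the text encodes without needing a replacement character.
    public static bool IsValidUtf16(string text)
    {
        var strict = new UnicodeEncoding(bigEndian: false, byteOrderMark: true, throwOnInvalidBytes: true);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // U+1F600 is outside the BMP, so it is a high surrogate followed by a low surrogate.
        string pair = char.ConvertFromUtf32(0x1F600);
        byte[] bytes = WriteUtf16(pair);

        // Skip the 2-byte BOM before decoding.
        string roundTripped = Encoding.Unicode.GetString(bytes, 2, bytes.Length - 2);
        Console.WriteLine(roundTripped == pair);    // True

        Console.WriteLine(IsValidUtf16(pair));      // True
        Console.WriteLine(IsValidUtf16("\uDD2F"));  // False: lone low surrogate

        Console.WriteLine(char.IsSurrogate(NextNonSurrogateChar(new Random()))); // False
    }
}
```

Filtering surrogates only makes each individual code unit encodable; it doesn't make the random text representative of real input.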

See more on this question at Stack Overflow