I ran into a strange issue when converting between byte arrays and Unicode strings in C#. Below is my code.
using System;
using System.Text;

var bytes = new byte[] {128, 216};                    // 0x80, 0xD8
var strstr = Encoding.Unicode.GetString(bytes);       // decode as little-endian UTF-16
var newBytes = Encoding.Unicode.GetBytes(strstr);     // encode the string back to bytes
Console.WriteLine(BitConverter.ToString(bytes));
Console.WriteLine(BitConverter.ToString(newBytes));
Pretty simple, right? However, newBytes is not the same as bytes, which surprised me. In my opinion, GetString and GetBytes should be opposite operations. Could anyone tell me why this happens?
The output is
80-D8
FD-FF
Thanks in advance.
In my opinion, GetString and GetBytes should be opposite operations.
They are, when the data represents a valid string. However, you've tried to decode 0x80 0xD8 as a little-endian UTF-16 string - but that's not a binary representation of any valid string. It's a high surrogate that isn't followed by a low surrogate, so it's invalid.
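Therefore Encoding.GetString decodes the binary data as U+FFFD, which is the "replacement character" used when invalid data is encountered during decoding. Encoding that character back to little-endian UTF-16 gives the bytes FD-FF, which is the second line of your output.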
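To make the behaviour concrete, here is a minimal sketch that decodes the same bytes leniently and strictly. The class and method names (SurrogateDemo, DecodeLenient, IsValidUtf16) are made up for illustration; the strict decoder relies on the UnicodeEncoding constructor whose third argument, throwOnInvalidBytes, makes invalid input raise an exception instead of being replaced with U+FFFD.

```csharp
using System;
using System.Text;

public static class SurrogateDemo
{
    // Lenient default: invalid UTF-16 input is silently replaced with U+FFFD.
    public static string DecodeLenient(byte[] bytes) => Encoding.Unicode.GetString(bytes);

    // Returns true only if the bytes are valid little-endian UTF-16.
    public static bool IsValidUtf16(byte[] bytes)
    {
        // Arguments: bigEndian: false, byteOrderMark: false, throwOnInvalidBytes: true
        var strict = new UnicodeEncoding(false, false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (ArgumentException) // DecoderFallbackException derives from ArgumentException
        {
            return false;
        }
    }

    public static void Main()
    {
        var bytes = new byte[] { 0x80, 0xD8 };   // the code unit 0xD880, a lone high surrogate
        string s = DecodeLenient(bytes);

        Console.WriteLine(((int)s[0]).ToString("X4"));          // FFFD
        Console.WriteLine(char.IsHighSurrogate((char)0xD880));  // True
        Console.WriteLine(IsValidUtf16(bytes));                 // False

        // A complete surrogate pair (U+1F600) is valid and round-trips unchanged.
        var pair = new byte[] { 0x3D, 0xD8, 0x00, 0xDE };
        Console.WriteLine(IsValidUtf16(pair));                  // True
        Console.WriteLine(BitConverter.ToString(
            Encoding.Unicode.GetBytes(DecodeLenient(pair))));   // 3D-D8-00-DE
    }
}
```

So GetString and GetBytes do round-trip, but only for byte sequences that are valid in the chosen encoding. If you want a hard failure rather than a silent replacement, a strict encoding instance like the one above is one way to get it.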
See more on this question at Stackoverflow