A character in UTF-8 encoding takes up to 4 bytes. Now imagine I read from a stream into one buffer and then into another. Unfortunately, it just so happens that the first 2 bytes of a 4-byte UTF-8 encoded character land at the end of the first buffer, and the remaining 2 bytes at the beginning of the second buffer.
Is there a way to partially decode that string (while leaving the 2 remaining bytes) without copying those two buffers into one big array?
string str = "Hello\u263AWorld";
Console.WriteLine(str);
Console.WriteLine("Length of 'HelloWorld': " + Encoding.UTF8.GetBytes("HelloWorld").Length);
var bytes = Encoding.UTF8.GetBytes(str);
Console.WriteLine("Length of 'Hello\u263AWorld': " + bytes.Length);
Console.WriteLine(Encoding.UTF8.GetString(bytes, 0, 6));
Console.WriteLine(Encoding.UTF8.GetString(bytes, 7, bytes.Length - 7));
This returns:
Hello☺World
Length of 'HelloWorld': 10
Length of 'Hello☺World': 13
Hello�
�World
The smiley face is 3 bytes long.
Is there a class that deals with split decoding of strings? I would like to get "Hello" first and then "☺World", reusing the remainder of the undecoded byte array, without copying both arrays into one big array. I really just want to use the remainder of the first buffer and somehow make the magic happen.
You should use a Decoder, which is able to maintain state between calls to GetChars: it remembers the bytes it hasn't decoded yet.
using System;
using System.Text;

class Test
{
    static void Main()
    {
        string str = "Hello\u263AWorld";
        var bytes = Encoding.UTF8.GetBytes(str);
        var decoder = Encoding.UTF8.GetDecoder();

        // Long enough for the whole string
        char[] buffer = new char[100];

        // Convert the first "packet"
        var length1 = decoder.GetChars(bytes, 0, 6, buffer, 0);

        // Convert the second "packet", writing into the buffer
        // from where we left off
        // Note: 6 not 7, because otherwise we're skipping a byte...
        var length2 = decoder.GetChars(bytes, 6, bytes.Length - 6,
                                       buffer, length1);

        var reconstituted = new string(buffer, 0, length1 + length2);
        Console.WriteLine(str == reconstituted); // true
    }
}
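The same idea extends to reading from a real stream in a loop. The following is only a sketch under some assumptions: the class name SplitDecoding, the helper DecodeInChunks, and the deliberately tiny chunk size of 3 bytes are all made up for illustration. It reuses one byte buffer and one char buffer for every read and relies on the Decoder to carry incomplete sequences over to the next call.

```csharp
using System;
using System.IO;
using System.Text;

static class SplitDecoding
{
    // Hypothetical helper (not part of the original answer): decodes a UTF-8
    // stream using one small byte buffer that is reused for every read.
    public static string DecodeInChunks(Stream input, int chunkSize)
    {
        var decoder = Encoding.UTF8.GetDecoder();
        var bytes = new byte[chunkSize];

        // Worst-case number of chars that one chunk can produce
        var chars = new char[Encoding.UTF8.GetMaxCharCount(chunkSize)];
        var result = new StringBuilder();

        int read;
        while ((read = input.Read(bytes, 0, bytes.Length)) > 0)
        {
            // Bytes of an incomplete trailing sequence stay inside the decoder
            // and are combined with the bytes passed in the next call.
            int charCount = decoder.GetChars(bytes, 0, read, chars, 0);
            result.Append(chars, 0, charCount);
        }

        return result.ToString();
    }

    static void Main()
    {
        // U+263A takes 3 bytes in UTF-8; U+1F600 takes 4 bytes
        string str = "Hello\u263AWorld\U0001F600";
        byte[] data = Encoding.UTF8.GetBytes(str);

        using (var stream = new MemoryStream(data))
        {
            // A chunk size of 3 splits both multi-byte characters across reads
            string decoded = DecodeInChunks(stream, 3);
            Console.WriteLine(str == decoded);
        }
    }
}
```

With a chunk size of 3, both the 3-byte smiley and the 4-byte character end up split across two reads, so the comparison only succeeds because the decoder keeps the leftover bytes between calls. In real code the chunk size would be much larger; it is kept small here purely to force the split.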