How to decode a UTF-8 encoded string split across two buffers in the middle of a 4-byte character?

A character in UTF-8 encoding takes up to 4 bytes. Now imagine I read from a stream into one buffer and then into another. Unfortunately, it just so happens that the first buffer ends with the first 2 bytes of a 4-byte UTF-8 encoded character, and the second buffer begins with the remaining 2 bytes.

Is there a way to partially decode that string (while leaving the 2 remaining bytes) without copying those two buffers into one big array?

string str = "Hello\u263AWorld";

Console.WriteLine(str);
Console.WriteLine("Length of 'HelloWorld': " + Encoding.UTF8.GetBytes("HelloWorld").Length);
var bytes = Encoding.UTF8.GetBytes(str);
Console.WriteLine("Length of 'Hello\u263AWorld': " + bytes.Length);
Console.WriteLine(Encoding.UTF8.GetString(bytes, 0, 6));
Console.WriteLine(Encoding.UTF8.GetString(bytes, 7, bytes.Length - 7));

This returns:

Hello☺World
Length of 'HelloWorld': 10
Length of 'Hello☺World': 13
Hello�
�World

The smiley face is 3 bytes long.
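To see where those 3 bytes sit, the encoded array can be dumped as hex. This is only an illustrative sketch; the class name ByteLayout and the helper ToHex are made up for this example and are not part of the original code.

```csharp
using System;
using System.Text;

class ByteLayout
{
    // Hypothetical helper: returns the UTF-8 bytes of a string as dash-separated hex.
    public static string ToHex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    static void Main()
    {
        // Bytes 0-4 are "Hello", bytes 5-7 (E2 98 BA) are the smiley,
        // bytes 8-12 are "World".
        Console.WriteLine(ToHex("Hello\u263AWorld"));
        // prints 48-65-6C-6C-6F-E2-98-BA-57-6F-72-6C-64
    }
}
```

Reading the first 6 bytes therefore stops after E2, the first byte of the smiley, and starting the second read at index 7 resumes at BA, so byte 98 at index 6 is never passed to either call.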

Is there a class that deals with split decoding of strings? I would like to get "Hello" first and then "☺World", reusing the remainder of the undecoded byte array, without copying both arrays into one big array. I really just want to use the remainder of the first buffer and somehow make the magic happen.

Jon Skeet

You should use a Decoder, which is able to maintain state between calls to GetChars - it remembers the bytes it hasn't decoded yet.

using System;
using System.Text;

class Test
{
    static void Main()
    {
        string str = "Hello\u263AWorld";

        var bytes = Encoding.UTF8.GetBytes(str);
        var decoder = Encoding.UTF8.GetDecoder();

        // Long enough for the whole string
        char[] buffer = new char[100];

        // Convert the first "packet"
        var length1 = decoder.GetChars(bytes, 0, 6, buffer, 0);
        // Convert the second "packet", writing into the buffer
        // from where we left off
        // Note: 6 not 7, because otherwise we're skipping a byte...
        var length2 = decoder.GetChars(bytes, 6, bytes.Length - 6,
                                       buffer, length1);
        var reconstituted = new string(buffer, 0, length1 + length2);
        Console.WriteLine(str == reconstituted); // true        
    }
}
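The same approach should carry over to the situation in the question's title, where two genuinely separate buffers split a 4-byte character 2/2. The following is a minimal sketch under that assumption. The names SplitBuffers and DecodeChunk are hypothetical, and the two arrays are filled by copying only to simulate two separate stream reads.

```csharp
using System;
using System.Text;

class SplitBuffers
{
    // Hypothetical helper: decodes one chunk and returns the characters that
    // are complete so far. Any incomplete trailing sequence stays inside the
    // decoder and is finished by the next call.
    public static string DecodeChunk(Decoder decoder, byte[] chunk)
    {
        // GetMaxCharCount gives a worst-case size for the output buffer.
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(chunk.Length)];
        int count = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
        return new string(chars, 0, count);
    }

    static void Main()
    {
        // U+1F600 encodes to four UTF-8 bytes: F0 9F 98 80.
        byte[] all = Encoding.UTF8.GetBytes("Hello\U0001F600World");

        // Simulate two reads: "Hello" + F0 9F, then 98 80 + "World".
        byte[] first = new byte[7];
        byte[] second = new byte[all.Length - 7];
        Array.Copy(all, 0, first, 0, first.Length);
        Array.Copy(all, 7, second, 0, second.Length);

        Decoder decoder = Encoding.UTF8.GetDecoder();
        string part1 = DecodeChunk(decoder, first);   // "Hello"
        string part2 = DecodeChunk(decoder, second);  // the emoji followed by "World"

        Console.WriteLine(part1);
        Console.WriteLine(part2);
        Console.WriteLine(part1 + part2 == "Hello\U0001F600World");
    }
}
```

The buffers are never merged; the decoder itself holds the two pending bytes between calls. Whether the emoji renders in the console depends on the console's output encoding and font, so the final comparison is the more reliable check.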

See more on this question at Stack Overflow