My name
is
Jon Skeet

Encoding.UTF7.GetBytes does not reverse Encoding.UTF7.GetString()

I guess I'm missing something fundamental but I'm really confused by this one and searching has failed to find me anything.

I have the following...

byte[] bytes1;
string string1;
byte[] bytes2;

Then I do the following

bytes1 = new byte[] { 64, 55, 121, 54, 36, 72, 101, 118, 38, 40, 100, 114, 33, 110, 85, 94, 112, 80, 163, 36, 84, 103, 58, 126 };
string1 = System.Text.Encoding.UTF7.GetString(bytes1);
bytes2 = System.Text.Encoding.UTF7.GetBytes(string1);

bytes2 ends up as 54 bytes instead of 24, and they are completely different bytes.

Now of course this is pointless code anyway, but I've put it in while diagnosing why the bytes I'm getting from Encoding.UTF7.GetString are not the bytes I'm expecting. I have got down to the fact that this is the reason my code is not giving expected results.

Now I'm confused. I know that if I don't specify an encoding then the result of GetBytes from a string can't be relied on to be a particular set of bytes, but I'm specifying an encoding and still getting this difference.

Can anyone enlighten me to what I'm missing?

EDIT: The conclusion is that it's not UTF-7. The original byte array is being written to a varbinary in a database by an application I'm programming in a high-level language. I have no control over how the original strings are being encoded to varbinaries in that language. I'm trying to read them and handle them in a small C# add-on to the main app, which is where I hit this problem. Other encodings I've tried also don't give the right results.

What you're seeing is two different ways of encoding the same text in UTF-7.

Your original text is:

@7y6$Hev&(dr!nU^pP£$Tg:~

The ASCII version of bytes2 is

+AEA-7y6+ACQ-Hev+ACY-(dr+ACE-nU+AF4-pP+AKMAJA-Tg:+AH4-

In other words, it's encoding everything other than A-Z, a-z, 0-9 and the direct symbols (here ( and :) as +A...-. For ASCII punctuation such as @, $, &, ! and ^ that's unnecessary, but I suspect it's valid.
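
A minimal sketch of that point, assuming the byte values from the question and a runtime where Encoding.UTF7 still works (it is marked obsolete from .NET 5 onwards, hence the pragma). The class and method names here are only illustrative. It checks that although bytes2 differs from bytes1, both decode to the same text:

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf7RoundTrip
{
    // Byte values copied from the question.
    public static readonly byte[] Bytes1 =
    {
        64, 55, 121, 54, 36, 72, 101, 118, 38, 40, 100, 114,
        33, 110, 85, 94, 112, 80, 163, 36, 84, 103, 58, 126
    };

#pragma warning disable SYSLIB0001 // Encoding.UTF7 is obsolete on .NET 5+
    public static string Decode(byte[] bytes) => Encoding.UTF7.GetString(bytes);
    public static byte[] Encode(string text) => Encoding.UTF7.GetBytes(text);
#pragma warning restore SYSLIB0001

    public static void Main()
    {
        string string1 = Decode(Bytes1);
        byte[] bytes2 = Encode(string1);
        string string2 = Decode(bytes2);

        // Compare the two byte sequences element by element.
        Console.WriteLine(Bytes1.SequenceEqual(bytes2));
        // Compare the text that each byte sequence decodes to.
        Console.WriteLine(string1 == string2);
        // bytes2 is pure ASCII, so it can be printed directly for inspection.
        Console.WriteLine(Encoding.ASCII.GetString(bytes2));
    }
}
```

So the comparison that is guaranteed to hold is between the decoded strings, not between the byte arrays.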

From the UTF-7 Wikipedia entry:

Some characters can be represented directly as single ASCII bytes. The first group is known as "direct characters" and contains 62 alphanumeric characters and 9 symbols: ' ( ) , - . / : ?. The direct characters are safe to include literally. The other main group, known as "optional direct characters", contains all other printable characters in the range U+0020–U+007E except ~ \ + and space. Using the optional direct characters reduces size and enhances human readability but also increases the chance of breakage by things like badly designed mail gateways and may require extra escaping when used in encoded words for header fields.
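
As a rough illustration of the direct versus optional direct distinction, the sketch below uses the UTF7Encoding(bool allowOptionals) constructor to encode the same text both ways. The class name and helper methods are hypothetical, and the pragma is only there because the UTF-7 APIs are marked obsolete on .NET 5 and later:

```csharp
using System;
using System.Text;

public static class Utf7Optionals
{
#pragma warning disable SYSLIB0001 // UTF-7 APIs are obsolete on .NET 5+
    // allowOptionals: true lets the encoder write optional direct characters literally.
    public static byte[] Encode(string text, bool allowOptionals) =>
        new UTF7Encoding(allowOptionals).GetBytes(text);

    public static string Decode(byte[] bytes) =>
        new UTF7Encoding().GetString(bytes);
#pragma warning restore SYSLIB0001

    public static void Main()
    {
        // The text from the answer; \u00A3 is the pound sign.
        string text = "@7y6$Hev&(dr!nU^pP\u00A3$Tg:~";

        byte[] strict = Encode(text, allowOptionals: false);
        byte[] relaxed = Encode(text, allowOptionals: true);

        // Both outputs are ASCII, so they can be printed for comparison.
        Console.WriteLine(Encoding.ASCII.GetString(strict));
        Console.WriteLine(Encoding.ASCII.GetString(relaxed));

        // Check whether both forms decode back to the same text.
        Console.WriteLine(Decode(strict) == Decode(relaxed));
    }
}
```

Either form is UTF-7 for the same string; the choice only affects how many characters are written literally rather than as +...- sequences.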

See more on this question at Stack Overflow