How to remove UnicodeCategory.OtherSymbol characters from a string

I'm trying to remove characters like ✅🔮⛱😂⛄ from a given string. These characters belong to UnicodeCategory.OtherSymbol, but char.GetUnicodeCategory returns UnicodeCategory.Surrogate.

If I just want to remove those emotion/picture characters from a string and leave other surrogate characters untouched, what should I do?

I've tried Regex.IsMatch("🔮", @"\p{So}"), but it didn't work.

Jon Skeet

.NET isn't terribly good when it comes to iterating over Unicode characters instead of UTF-16 code units. All the relevant code is there, but it's not terribly easy to use. It's possible that Regex can be made to understand surrogate pairs, but I haven't found it yet.
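
To make the code unit versus code point distinction concrete, here is a minimal sketch (the class and method names are just illustrative, not part of any library). It compares the char overload of char.GetUnicodeCategory, which only sees one UTF-16 code unit, with the (string, int) overload, which as far as I can tell combines a surrogate pair into a single code point before categorizing it:

```csharp
using System;
using System.Globalization;

public static class CategoryDemo
{
    // Category of the single UTF-16 code unit at the given index.
    public static UnicodeCategory OfCodeUnit(string text, int index) =>
        char.GetUnicodeCategory(text[index]);

    // Category of the code point starting at the given index;
    // a surrogate pair is treated as one character.
    public static UnicodeCategory OfCodePoint(string text, int index) =>
        char.GetUnicodeCategory(text, index);

    public static void Main()
    {
        string crystalBall = "\U0001F52E"; // stored as two UTF-16 code units
        Console.WriteLine(crystalBall.Length);
        Console.WriteLine(OfCodeUnit(crystalBall, 0));
        Console.WriteLine(OfCodePoint(crystalBall, 0));
    }
}
```

The first call should report Surrogate and the second OtherSymbol, which would explain what the question observed: anything that inspects the string one char at a time only ever sees the two halves of the pair.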

Here's an example of doing it somewhat manually:

using System;
using System.Globalization;
using System.Text;

public class Program
{
    public static void Main(string[] args)
    {
        string text = "a\u2705b\U0001f52ec\u26f1d\U0001F602e\U00010000";
        string cleansed = RemoveOtherSymbols(text);
        Console.WriteLine(cleansed);
    }

    static string RemoveOtherSymbols(string text)
    {
        // TODO: Handle malformed strings (e.g. those
        // with mismatched surrogate pairs)
        StringBuilder builder = new StringBuilder();
        int index = 0;
        while (index < text.Length)
        {
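            // A code point outside the BMP takes two UTF-16 code units
            // (a surrogate pair); everything else takes one.
            int units = char.IsSurrogate(text, index) ? 2 : 1;
            // This overload categorizes the full Unicode character,
            // not just the code unit at index.
            UnicodeCategory category = char.GetUnicodeCategory(text, index);
            // Only needed for the diagnostic output below.
            int ch = char.ConvertToUtf32(text, index);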
            if (category == UnicodeCategory.OtherSymbol)
            {
                Console.WriteLine($"Skipping U+{ch:x} {category}");
            }
            else
            {
                Console.WriteLine($"Keeping U+{ch:x} {category}");
                builder.Append(text, index, units);
            }
            index += units;
        }
        return builder.ToString();
    }
}
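
If you can target .NET Core 3.0 or later, the same idea can probably be written more compactly with System.Text.Rune, which represents a whole Unicode scalar value. This is only a sketch under that assumption, and the class name RuneSymbolStripper is made up for the example; it drops the diagnostic output and just builds the cleansed string:

```csharp
using System;
using System.Globalization;
using System.Text;

public static class RuneSymbolStripper
{
    public static string RemoveOtherSymbols(string text)
    {
        var builder = new StringBuilder(text.Length);
        // EnumerateRunes walks the string one scalar value at a time,
        // combining surrogate pairs for us.
        foreach (Rune rune in text.EnumerateRunes())
        {
            if (Rune.GetUnicodeCategory(rune) != UnicodeCategory.OtherSymbol)
            {
                builder.Append(rune.ToString());
            }
        }
        return builder.ToString();
    }

    public static void Main()
    {
        string text = "a\u2705b\U0001f52ec\u26f1d\U0001F602e\U00010000";
        Console.WriteLine(RemoveOtherSymbols(text));
    }
}
```

One behavioural difference to be aware of: EnumerateRunes reports a mismatched surrogate as U+FFFD (the replacement character), and U+FFFD is itself in the OtherSymbol category, so this version silently drops malformed surrogates instead of throwing. Whether that is acceptable depends on what you want to happen to broken input.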
