I'm trying to remove characters like ✅🔮⛱😂⛄ from a given string. These characters belong to UnicodeCategory.OtherSymbol, but char.GetUnicodeCategory returns UnicodeCategory.Surrogate.
If I just want to remove those emoji/picture characters from a string and leave other surrogate characters untouched, what should I do?
I've tried Regex.IsMatch("🔮", @"\p{So}"), but it didn't work.
.NET isn't terribly good when it comes to iterating over Unicode characters instead of UTF-16 code units. All the relevant code is there, but it's not terribly easy to use. It's possible that Regex can be made to understand surrogate pairs, but I haven't found a way to do that yet.
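To make the code-unit versus code-point distinction concrete, here is a minimal sketch (the class name CategoryDemo is just for illustration). It assumes the (string, int) overload of char.GetUnicodeCategory inspects the whole surrogate pair, while the single-char overload can only see one half of it:

```csharp
using System;
using System.Globalization;

public static class CategoryDemo
{
    public static void Main()
    {
        // U+1F52E (crystal ball): one code point, stored as two UTF-16 code units.
        string crystalBall = "\U0001F52E";

        // A single char is only half of the surrogate pair.
        UnicodeCategory perCodeUnit = char.GetUnicodeCategory(crystalBall[0]);

        // The (string, int) overload looks at the pair starting at that index.
        UnicodeCategory perCodePoint = char.GetUnicodeCategory(crystalBall, 0);

        Console.WriteLine($"Length={crystalBall.Length}, per unit={perCodeUnit}, per code point={perCodePoint}");
    }
}
```

This would explain why the per-char check reports Surrogate even though the character itself is an OtherSymbol.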
Here's an example of doing it somewhat manually:
    using System;
    using System.Globalization;
    using System.Text;

    public class Program
    {
        public static void Main(string[] args)
        {
            string text = "a\u2705b\U0001f52ec\u26f1d\U0001F602e\U00010000";
            string cleansed = RemoveOtherSymbols(text);
            Console.WriteLine(cleansed);
        }

        static string RemoveOtherSymbols(string text)
        {
            // TODO: Handle malformed strings (e.g. those
            // with mismatched surrogate pairs)
            StringBuilder builder = new StringBuilder();
            int index = 0;
            while (index < text.Length)
            {
                // A full Unicode character takes two UTF-16 code units
                // when it is stored as a surrogate pair, otherwise one.
                int units = char.IsSurrogate(text, index) ? 2 : 1;
                // These overloads take (string, index), so they see the
                // whole surrogate pair rather than a single code unit.
                UnicodeCategory category = char.GetUnicodeCategory(text, index);
                int ch = char.ConvertToUtf32(text, index);
                if (category == UnicodeCategory.OtherSymbol)
                {
                    Console.WriteLine($"Skipping U+{ch:x} {category}");
                }
                else
                {
                    Console.WriteLine($"Keeping U+{ch:x} {category}");
                    builder.Append(text, index, units);
                }
                index += units;
            }
            return builder.ToString();
        }
    }
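On newer runtimes the same loop can probably be written more compactly. The following is only a sketch, assuming .NET Core 3.0 or later, where System.Text.Rune and string.EnumerateRunes are available. The class and method names (RuneExample, RemoveOtherSymbolsWithRunes) are hypothetical. Note one difference in behaviour: EnumerateRunes reports a lone surrogate as U+FFFD instead of throwing, so malformed input is not preserved verbatim.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class RuneExample
{
    // Hypothetical helper: drops every code point whose category is OtherSymbol.
    public static string RemoveOtherSymbolsWithRunes(string text)
    {
        StringBuilder builder = new StringBuilder(text.Length);

        // EnumerateRunes yields one Rune per Unicode scalar value, so
        // surrogate pairs arrive already combined.
        foreach (Rune rune in text.EnumerateRunes())
        {
            if (Rune.GetUnicodeCategory(rune) != UnicodeCategory.OtherSymbol)
            {
                builder.Append(rune.ToString());
            }
        }
        return builder.ToString();
    }

    public static void Main()
    {
        string text = "a\u2705b\U0001f52ec\u26f1d\U0001F602e\U00010000";
        Console.WriteLine(RemoveOtherSymbolsWithRunes(text));
    }
}
```

With the sample string from the example above, the symbols are removed while U+10000, a letter stored as a surrogate pair, is kept.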