Step 1: Making a REST call using HttpClient to a Twitter endpoint and getting a tweet message containing an emoticon. The Twitter API returns the string UTF-8 encoded.
Example: Message = 😄;
Step 2: I am using Java to read the string with an InputStreamReader using the UTF-8 charset. Still, the string's length turns out to be 2 rather than 1.
How can this be possible when I am explicitly decoding it as UTF-8?
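For reference, here is a simplified sketch of how I am reading the response (the endpoint URL is a placeholder, authentication is omitted, and I have swapped the HttpClient call for a plain HttpURLConnection to keep the sketch self-contained; the part that matters is the InputStreamReader with UTF-8):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class TweetFetch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint - the real call goes to the Twitter API with auth headers
        URL url = new URL("https://api.twitter.com/...");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Decode the response bytes explicitly as UTF-8
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            int c;
            while ((c = reader.read()) != -1) {
                body.append((char) c);
            }
        }

        String message = body.toString();
        // The emoji extracted from this response has length() == 2 rather than 1
    }
}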
On the net I found several resources where it's mentioned that an emoticon is a high-code-point character, and thus Java considers it to be 2 characters (a surrogate pair), which doesn't make sense to me.
Can someone help me with this?
You've got a string with length 2 - because the length() method returns the number of UTF-16 code units, not the number of Unicode characters. Bear in mind that a String in Java is really a sequence of UTF-16 code units, not a sequence of characters.
As you say, that emoji is represented with a surrogate pair - it is U+1F604, represented in UTF-16 as U+D83D U+DE04.
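You can see that mapping directly with the standard Character methods; a minimal sketch:

public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x1F604; // U+1F604 SMILING FACE WITH OPEN MOUTH AND SMILING EYES
        char[] units = Character.toChars(codePoint); // the UTF-16 encoding of the code point
        System.out.printf("%x %x%n", (int) units[0], (int) units[1]); // d83d de04
        System.out.println(Character.charCount(codePoint)); // 2 - it needs a surrogate pair
    }
}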
If you call String.codePointCount instead of length(), you'll get 1:
public class Test {
    public static void main(String[] args) {
        String emoji = "\ud83d\ude04";
        System.out.println(emoji.length()); // 2
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1
    }
}
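If you want to work with the Unicode characters themselves rather than the UTF-16 code units, codePoints() (Java 8+) gives you a stream of code points; a small sketch of that approach:

public class CodePointsDemo {
    public static void main(String[] args) {
        String text = "I \ud83d\ude04 Java";
        // Iterates per code point, so the emoji is visited once, not twice
        text.codePoints().forEach(cp ->
                System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));
    }
}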
Note that the fact that you created the string by decoding UTF-8 is entirely irrelevant to its content. Assuming you've got a string equal to the one in my sample code above, the decoding worked fine.
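For instance, decoding the four UTF-8 bytes of U+1F604 by hand produces exactly that two-code-unit string (a small sketch; the byte values are the standard UTF-8 encoding of that code point):

import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    public static void main(String[] args) {
        // F0 9F 98 84 is the UTF-8 encoding of U+1F604
        byte[] utf8 = { (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x84 };
        String decoded = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(decoded.equals("\ud83d\ude04"));              // true - decoding worked fine
        System.out.println(decoded.length());                            // 2
        System.out.println(decoded.codePointCount(0, decoded.length())); // 1
    }
}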
See more on this question at Stack Overflow