One of my data processing modules crashed while reading ANSI input. Looking at the offending string in a hex viewer, I found a mysterious 0xA0 byte at the end of it.
It turns out this is a non-breaking space.
I tried removing it:
s = s.replace("\u00A0", "");
But it didn't work.
I then printed out the value of that character using charAt, and Java reported 65533, or 0xFFFD.
Plugging that into the replace code, I finally got rid of it!
But why do I see 0xA0 in the file, while Java reads it as 0xFFFD?
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8"));
String line = r.readLine();
while (line != null) {
    // do stuff
    line = r.readLine();
}
U+FFFD is the "Unicode replacement character", which is generally used to represent "some binary data which couldn't be decoded correctly in the encoding you were using". (Sometimes ? is used for this instead, but U+FFFD is generally a better idea, as it's unambiguous.)
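To see the replacement happen, here is a minimal sketch that decodes the same three bytes twice. The class name, the `decode` helper and the sample bytes are made up for illustration. ISO-8859-1 is used only because it is guaranteed to exist in every JVM and maps 0xA0 to a non-breaking space; your file may well be in a different "ANSI" code page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Hypothetical helper: decodes the same raw bytes with the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "ab" followed by a lone 0xA0 byte (assumed sample data).
        byte[] raw = {'a', 'b', (byte) 0xA0};

        // In UTF-8 a lone 0xA0 is a continuation byte with no lead byte,
        // so the decoder substitutes U+FFFD for it.
        String asUtf8 = decode(raw, StandardCharsets.UTF_8);

        // In ISO-8859-1 every byte maps directly to the code point of the
        // same value, so 0xA0 becomes U+00A0 (non-breaking space).
        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.printf("UTF-8:      U+%04X%n", (int) asUtf8.charAt(2));
        System.out.printf("ISO-8859-1: U+%04X%n", (int) asLatin1.charAt(2));
    }
}
```

Once the bytes have been decoded with the wrong charset, the original 0xA0 is gone from the string, which is why replacing "\u00A0" finds nothing while replacing "\uFFFD" does.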
Its presence is usually a sign that you've tried to use the wrong encoding. You haven't specified which encoding you were using - or indeed how you were using it - but that's probably the problem. Check the encoding you're using and the encoding of the file. Be aware that "ANSI" isn't an encoding - there are lots of encodings which are known as ANSI encodings, and you'll need to pick the right one for your file.
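If you would rather have a mismatch fail loudly than silently turn into U+FFFD, one option is a decoder configured to report errors. This is only a sketch: `StrictDecode` and `decodesCleanly` are hypothetical names, and the sample bytes are assumed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Hypothetical helper: true if the bytes are well-formed in the charset.
    static boolean decodesCleanly(byte[] raw, Charset charset) {
        try {
            charset.newDecoder()
                   // Throw instead of substituting U+FFFD.
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "ab" followed by a lone 0xA0 byte (assumed sample data).
        byte[] raw = {'a', 'b', (byte) 0xA0};
        System.out.println(decodesCleanly(raw, StandardCharsets.UTF_8));      // false
        System.out.println(decodesCleanly(raw, StandardCharsets.ISO_8859_1)); // true
    }
}
```

A clean decode doesn't prove the charset is the right one: ISO-8859-1 accepts every possible byte, so this check is mainly useful for ruling out UTF-8. The same kind of decoder can be passed to the InputStreamReader constructor that takes a CharsetDecoder, so that reading the file throws an exception on bad input.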