I would like to read a text file and print it to the console, so I wrote the code below:
File file = new File("G:\\text.txt");
FileReader fileReader = new FileReader(file);
String result = "";
int ascii = fileReader.read();
while (ascii != -1)
{
    result = result + (char) ascii;
    ascii = fileReader.read();
}
System.out.println(result);
I usually get the correct result, but in some cases the output is strange. Suppose my text file contains this text:
Hello to every one
I created the text file with Notepad, and when I change the encoding it is saved with, my code prints strange output:
Ansi : Hello to every one
Unicode : ÿþh e l l o t o e v e r y o n e
Unicode big endian: þÿ h e l l o t o e v e r y o n e
UTF-8 : hello to every one
Why do I get this strange output? Is there a problem with my code, or is there some other reason?
Your file starts with a byte-order mark (U+FEFF). It should only occur in the first character of the file - it's not terribly widely used, but various Windows tools do include it, including Notepad. You can just strip it from the start of the first line.
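To make the strange leading characters less mysterious, here is a small sketch of what U+FEFF looks like when its bytes are decoded with a single-byte encoding. It assumes ISO-8859-1 as a stand-in for the "ANSI" code page your system default probably is; the class and method names are just made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomDemo {
    // Encodes U+FEFF in the given file encoding, then decodes those bytes
    // as ISO-8859-1 (an assumed stand-in for the system default code page).
    static String bomSeenAsLatin1(Charset fileEncoding) {
        byte[] bomBytes = "\uFEFF".getBytes(fileEncoding);
        return new String(bomBytes, StandardCharsets.ISO_8859_1);
    }

    // Formats each char of the string as a U+XXXX code point label.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // UTF-16LE: bytes FF FE, which Latin-1 shows as y-diaeresis, thorn
        System.out.println(describe(bomSeenAsLatin1(StandardCharsets.UTF_16LE)));
        // UTF-16BE: bytes FE FF, the same two characters in the other order
        System.out.println(describe(bomSeenAsLatin1(StandardCharsets.UTF_16BE)));
        // UTF-8: bytes EF BB BF, three Latin-1 characters
        System.out.println(describe(bomSeenAsLatin1(StandardCharsets.UTF_8)));
    }
}
```

The first two results correspond to the two characters at the start of your "Unicode" and "Unicode big endian" output.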
As an aside, I'd strongly recommend not using FileReader - it doesn't allow you to specify the encoding (at least before Java 11). I'd use Files.newBufferedReader, and either specify the encoding or let it default to UTF-8 (rather than the system default encoding which FileReader uses). When you're using BufferedReader, you can then just read a line at a time with readLine() too:
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line.replace("\uFEFF", ""));
}
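For completeness, here is a self-contained sketch of that loop, including where reader might come from. It strips the byte-order mark only from the start of the first line rather than from every line; BomAwareLineReader and readLines are hypothetical names, and the temporary file in main is only there so the example runs without your G:\text.txt.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class BomAwareLineReader {
    // Reads all lines using the given charset, dropping a leading U+FEFF
    // from the first line if one is present.
    static List<String> readLines(Path path, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, charset)) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                if (first && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                first = false;
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a throwaway file written as UTF-8 with a BOM,
        // standing in for a file saved by Notepad as "UTF-8".
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        try {
            Files.write(tmp, "\uFEFFHello to every one\nsecond line\n"
                    .getBytes(StandardCharsets.UTF_8));
            for (String line : readLines(tmp, StandardCharsets.UTF_8)) {
                System.out.println(line);
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The try-with-resources block also closes the reader, which the original code never did.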
If you really want to read a character at a time, it's worth getting in the habit of using a StringBuilder instead of repeated string concatenation in a loop. Also note that your variable name of ascii is misleading: it's actually the UTF-16 code unit, which may or may not be an ASCII character.
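A minimal sketch of the character-at-a-time version with both suggestions applied might look like this. The names CharAtATime, readAll and codeUnit are my own choices, and a StringReader is used in main only to keep the example independent of any file on disk.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class CharAtATime {
    // Reads everything from the reader one UTF-16 code unit at a time,
    // accumulating into a StringBuilder instead of concatenating strings.
    static String readAll(Reader reader) throws IOException {
        StringBuilder result = new StringBuilder();
        int codeUnit;
        while ((codeUnit = reader.read()) != -1) {
            result.append((char) codeUnit);
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        try (Reader reader = new StringReader("Hello to every one")) {
            System.out.println(readAll(reader)); // prints: Hello to every one
        }
    }
}
```

Any Reader works here, so the same method could be handed the result of Files.newBufferedReader.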
The encoding you specify should match the encoding used to write the file - at that point you should see the correct output instead of an extra character between each "real" character when using Unicode and Unicode big endian.
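As a rough illustration of matching the encoding, the sketch below builds the bytes I'd expect Notepad's "Unicode" option to write (UTF-16 little endian with a leading byte-order mark, which is an assumption about your file) and decodes them twice. MatchingEncoding and its methods are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MatchingEncoding {
    // Bytes for the text as UTF-16LE preceded by a byte-order mark
    // (assumed to be what Notepad's "Unicode" option produces).
    static byte[] utf16LeWithBom(String text) {
        return ("\uFEFF" + text).getBytes(StandardCharsets.UTF_16LE);
    }

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] bytes = utf16LeWithBom("Hi");

        // Wrong charset: every byte becomes its own character, so the two
        // BOM bytes show up and a NUL follows each real letter.
        String wrong = decode(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // prints: 6

        // Matching charset: Java's UTF-16 decoder uses the BOM to pick the
        // byte order and does not include it in the decoded text.
        String right = decode(bytes, StandardCharsets.UTF_16);
        System.out.println(right); // prints: Hi
    }
}
```

When reading the file itself, passing StandardCharsets.UTF_16 to Files.newBufferedReader should have the same effect for both the "Unicode" and "Unicode big endian" variants, provided the file really does start with a byte-order mark.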